More Data with Every Article Revision and Probability of Revert

In this article, we detail recent enhancements to the Wikimedia Enterprise APIs, focusing on the metadata within the version object that accompanies every text payload response.

Existing fields such as the revision comment, minor edit status, and editor information already serve as trust and credibility indicators. This metadata helps you decide whether to keep your existing data or update it with the new version.

The version object now includes three new fields and an upgraded scores object, further enhancing the insights available to inform your decisions about data updates.

Noindex field – version.noindex

This field indicates whether the current version of an article is marked with a noindex directive, which is part of how Wikipedia manages the indexing of pages by search engines. In the Main article namespace, articles older than 90 days are automatically indexed. Articles younger than 90 days are not indexed unless they have been patrolled, meaning they have been reviewed and conform to Wikipedia’s core content policies. This field lets you decide whether to include an article’s contents in your datasets or models, based on how new the article is and how confident Wikipedia editors are in its information.
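As a minimal sketch of that filtering decision, the snippet below keeps only articles whose current version is not marked noindex. It assumes each article has already been parsed from JSON into a Python dict containing a version object, and that a missing noindex field means the version is not marked noindex; that default is an assumption, not documented behavior. The function name indexable_articles and the sample name values are hypothetical.

```python
from typing import Any, Iterable, Iterator


def indexable_articles(articles: Iterable[dict[str, Any]]) -> Iterator[dict[str, Any]]:
    """Yield only the articles whose current version is not marked noindex."""
    for article in articles:
        version = article.get("version", {})
        # Assumption: an absent noindex field is treated as "not noindexed".
        if not version.get("noindex", False):
            yield article


# Hypothetical sample payloads, reduced to the fields used here.
sample = [
    {"name": "Established article", "version": {"noindex": False}},
    {"name": "New unpatrolled article", "version": {"noindex": True}},
    {"name": "No flag present", "version": {}},
]

kept = [article["name"] for article in indexable_articles(sample)]
print(kept)
```

Whether to drop noindexed articles outright or hold them for later review depends on your own tolerance for new, unreviewed content.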

Read more on what triggers changes to Wikipedia’s search engine indexing field.

Data Dictionary schema: version.noindex

Maintenance Tags – version.maintenance_tags

Maintenance Tags is a new feature that counts the occurrences of certain templates in the article body at each article revision. One potential use is monitoring how an article evolves. For example, if an article has a maintenance_tags.citation_needed_count value of 2 at revision r10 and a value of 10 at revision r20, you could infer that the article is growing but that the added content has fewer citations than usual.
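The r10-to-r20 comparison above could be sketched as follows. This is an illustration only: it assumes you have stored the version objects of two revisions as Python dicts, and that a counter missing from either revision can be treated as zero. The function name maintenance_tag_deltas is hypothetical.

```python
from typing import Any


def maintenance_tag_deltas(old_version: dict[str, Any], new_version: dict[str, Any]) -> dict[str, int]:
    """Return the change in each maintenance tag counter between two versions."""
    old_tags = old_version.get("maintenance_tags", {})
    new_tags = new_version.get("maintenance_tags", {})
    # Assumption: a counter missing from one revision is treated as zero.
    fields = sorted(set(old_tags) | set(new_tags))
    return {field: new_tags.get(field, 0) - old_tags.get(field, 0) for field in fields}


# Values taken from the r10/r20 example in the text.
r10 = {"maintenance_tags": {"citation_needed_count": 2}}
r20 = {"maintenance_tags": {"citation_needed_count": 10}}

deltas = maintenance_tag_deltas(r10, r20)
print(deltas)
```

A growing citation_needed_count is only a signal; how much growth should trigger a review is a threshold you would choose yourself.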

At the moment, this feature is limited to English Wikipedia and namespace 0. The templates currently monitored are:

  • Citation Needed count – used to identify claims in articles, particularly if questionable, that need a citation to a reliable source. 
  • POV (point of view) count – used to identify an unbalanced or non-neutral article, that is, one that does not fairly represent the balance of perspectives of high-quality, reliable secondary sources.
  • Clarification Needed count – used to identify requests for other editors to clarify and source text that is difficult to understand.
  • Update count – used to identify articles or sections that have old or out-of-date information.

"maintenance_tags": {
  "citation_needed_count": 8,
  "pov_count": 1,
  "clarification_needed_count": 1,
  "update_count": 100
}

We plan to extend the feature to other language Wikipedias and add tracking of other templates as well based on your feedback, so let us know how you’re using this and what you’d like to see!

Data Dictionary schema: version.maintenance_tags

Breaking News – version.is_breaking_news

The “Breaking News” feature identifies articles related to major global news events by setting a boolean field to true. This indicator signals high edit volatility and the potential for frequent updates, often due to evolving situations, so you can incorporate timely and relevant data into your datasets or models and manage information from fast-moving news stories deliberately. For the technical details of is_breaking_news, see our wiki.
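One way to act on this flag is to route breaking-news articles into a separate queue that you refresh more often. The sketch below assumes each article is a parsed dict with a version object and that a missing is_breaking_news field means false; both the function name split_by_breaking_news and the two-queue design are illustrative assumptions, not part of the API.

```python
from typing import Any, Iterable


def split_by_breaking_news(
    articles: Iterable[dict[str, Any]],
) -> tuple[list[dict[str, Any]], list[dict[str, Any]]]:
    """Partition articles into (breaking, stable) lists using version.is_breaking_news."""
    breaking: list[dict[str, Any]] = []
    stable: list[dict[str, Any]] = []
    for article in articles:
        # Assumption: an absent is_breaking_news field is treated as false.
        is_breaking = article.get("version", {}).get("is_breaking_news", False)
        (breaking if is_breaking else stable).append(article)
    return breaking, stable


# Hypothetical sample payloads, reduced to the fields used here.
sample = [
    {"name": "Developing event", "version": {"is_breaking_news": True}},
    {"name": "Long-standing topic", "version": {}},
]

breaking, stable = split_by_breaking_news(sample)
print([a["name"] for a in breaking], [a["name"] for a in stable])
```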

Data Dictionary schema: version.is_breaking_news

Revert Risk – version.scores

The new Revert Risk feature, now supported across most Wikipedia project languages, adds a new vandalism detection score, called Revert Risk, to the existing scores object. The score is served by the more advanced Lift Wing machine learning platform and replaces the outdated damaging and good-faith scores from the ORES model, offering a more reliable assessment of whether an article revision is likely to be reverted.

With the improved model, you can decide whether to update your datasets or models with the latest edit or wait for the next revision, based on your risk tolerance for trust and credibility.

"scores": {
  "revertrisk": {
    "prediction": true,
    "probability": {
      "false": 0.2584042549133301,
      "true": 0.7415957450866699
    }
  }
}

If you’re interested in a deeper understanding of Lift Wing and Revert Risk, read more about them here:

Data Dictionary schema: version.scores
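The update-or-wait decision described in this section could be sketched as a simple threshold on the revert probability. The snippet assumes a version object parsed into a Python dict with the scores shape shown in the sample above. The function name should_ingest_revision, the 0.5 default threshold, and the accept_unscored policy for revisions without a score are all illustrative assumptions to tune for your own risk tolerance.

```python
from typing import Any


def should_ingest_revision(
    version: dict[str, Any],
    max_revert_probability: float = 0.5,
    accept_unscored: bool = True,
) -> bool:
    """Return True if the revision's revert probability is within the chosen tolerance."""
    revertrisk = version.get("scores", {}).get("revertrisk")
    if not revertrisk:
        # Assumption: revisions without a score fall back to a caller-chosen policy.
        return accept_unscored
    probability = revertrisk.get("probability", {}).get("true")
    if probability is None:
        return accept_unscored
    return probability <= max_revert_probability


# The sample scores object from this section.
sample_version = {
    "scores": {
        "revertrisk": {
            "prediction": True,
            "probability": {
                "false": 0.2584042549133301,
                "true": 0.7415957450866699,
            },
        }
    }
}

print(should_ingest_revision(sample_version))
print(should_ingest_revision(sample_version, max_revert_probability=0.8))
```

With the sample values, the revision is skipped at the default threshold and accepted only when the tolerance is raised above roughly 0.74.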

The Wikimedia Enterprise team is continually improving the APIs to make Wikimedia project data easier to use for AI, machine learning applications, and whatever else you’re building. These new features give you valuable signals for assessing data trustworthiness, so you can make better-informed decisions and keep greater control over data quality and risk.

We invite you to sign up for free access to our APIs and stay tuned for more updates as we keep improving our tools to help you use and understand data more clearly. In the meantime, reach out via the feedback form in your customer dashboard or via the contact form, and let us know what you’re working on!

Chuck Reynolds, Product Marketing Manager at Wikimedia Enterprise

Photo credits: IvanStudenov, CC BY-SA 4.0, via Wikimedia Commons
