New API Features in Wikimedia Enterprise for Spring 2023

20 Apr 2023

We’d like to introduce you to some new features and quality of life improvements now available in Wikimedia Enterprise APIs.

Wikipedia Article Summaries: a new data field summarizing the article.
More Credibility Signals: more metadata to inform your content decisions.
Realtime API Improvements: unified event timeline and latency improvements.
Dedicated Metadata Endpoints: grab project codes and language info faster.
Filter All The Things!: find project data faster and customize responses.
Parallel File Downloads: split up those large binary files.
NDJSON Responses Everywhere: JSON or NDJSON? Your choice.

If it’s your first time here, or you need a quick refresher: Wikimedia Enterprise provides reliable and robust modern APIs for high-volume reusers of Wikimedia content. Signup is free and provides instant access to Enterprise APIs.

Let’s get into the new features.

Wikipedia Article Summary

Summarizing a Wikipedia article has historically been a challenge for reusers of Wikimedia APIs. Wikimedia Enterprise APIs now include a new data field called abstract, which provides a concise summary of Wikipedia article content without requiring reusers to parse the entire article body themselves. This saves significant time and compute resources, and reduces barriers to understanding the content within articles. The abstract field is available now for articles in most Wikipedia language editions, and we are working to expand this feature to other languages and projects in the future.

Additional Credibility Signals

Credibility Signals data is part of an ongoing effort by Wikimedia Enterprise to enhance the quality of the entire API dataset. Credibility Signals provide additional qualitative metadata in each article revision, empowering reusers to make better-informed decisions in real-time about how they might handle returned data.

For example, Credibility Signals data for a specific article revision may exceed a reuser’s threshold for vandalism, potential mis/disinformation, or inaccuracies. This data can help the reuser decide whether to wait for a new revision that meets their acceptance criteria before updating internal knowledge graphs or otherwise reusing new content. This process also helps reusers combat the spread of misinformation to their end-users.

All fields that are part of the Credibility Signals data set are now tagged “credibility” in the Data Dictionary. Additional context explaining what each field represents is also available for existing fields, as well as the following newly added fields:

watchers_count – Number of editors watching the article page.
date_previously_modified – Timestamp of the last revision prior to this one.
New metadata in the Version object, specific to the latest revision of the article:
- version.has_tag_needs_citation – Has this article been flagged as needing citations? When an editor deems an assertion made in an article needs to be supported with a reference, it will carry this tag.
- version.editor.is_admin – Admins are editors who are trusted with the ability to perform highest order actions like “delete a page” or “add protections” – administrators are among the most trusted editors within a specific project.
- version.editor.is_patroller – Patrollers are editors that are allowed to mark incoming revisions as quality. They have earned a certain amount of editorial trust. Is this editor a patroller on this specific Wikimedia language project?
- version.editor.has_advanced_rights – There are many rights and permissions across the projects that editors can earn and attach to their reputation. Advanced rights give general permission to do specific functions across the sites. While not everyone is an administrator, typically if an editor has advanced rights, it is a sign that they are trusted.
- version.number_of_characters – Number of characters, calculated from the Wikitext.
- version.size – Size of the article_body.wikitext in bytes.
- version.size.unit_text – The unit of measure represented by the size value.
- version.size.value – The number representing the size of an object.
  
  (These three version.size subfields all correspond to the length of information in the whole article at its current revision.)

Update: More credibility signals added to version object including upgraded scores data using Revert Risk.
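
As a rough illustration of the acceptance decision described above, the following Python sketch checks a few of the listed fields on an already-parsed article object. The policy itself (reject a revision flagged as needing citations, otherwise accept trusted editors or well-watched pages), the min_watchers threshold, and the function names are assumptions made for illustration and are not part of the API. The sketch also assumes that watchers_count sits at the top level of the article object and that a missing field means the signal is absent.

```python
from typing import Any, Dict

# Editor trust flags taken from the field list above.
TRUST_FLAGS = ("is_admin", "is_patroller", "has_advanced_rights")


def editor_is_trusted(editor: Dict[str, Any]) -> bool:
    """Return True if any of the editor trust flags is set."""
    return any(bool(editor.get(flag)) for flag in TRUST_FLAGS)


def should_accept_revision(article: Dict[str, Any], min_watchers: int = 10) -> bool:
    """Hypothetical acceptance policy; tune it to your own criteria."""
    version = article.get("version") or {}
    # A revision flagged as needing citations is held back in this sketch.
    if version.get("has_tag_needs_citation"):
        return False
    # Revisions by trusted editors are accepted straight away.
    if editor_is_trusted(version.get("editor") or {}):
        return True
    # Otherwise require that enough editors are watching the page.
    return (article.get("watchers_count") or 0) >= min_watchers
```

A reuser could call should_accept_revision on each incoming revision and keep serving the previously accepted revision whenever it returns False.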

Realtime API Improvements

The Realtime streaming API provides a rich dataset of all changes within a Wikimedia Project, delivered to you in real-time. Previously, consuming multiple streams made it difficult to verify the order in which received events actually happened. The Realtime API now provides a single, reliable, linear event stream which eliminates any confusion about the order or type of events received in the stream. This has allowed us to reduce event latency and increase the overall speed of revision delivery. Additionally, the Realtime API can now return NDJSON if requested and benefits from new response filtering (read more below).

Dedicated Metadata Endpoints

We’ve added two new endpoints that make it easy to find the metadata needed to filter and query the information you’re looking for. Previously, you had to already know the specific project code or consult wiki pages to find it. We’ve done that hard work, and the codes are now readily available along with a new Project Metadata page in our docs.

New Codes endpoint, /codes/, contains project types, which we’ve covered in detail on our Projects page, and related project metadata.
New Languages endpoint, /languages/, contains language metadata in English alongside the project’s localized language name.
While the /projects/ and /namespaces/ endpoints haven’t technically moved anywhere, we’ve better organized them along with Metadata in our Docs. Namespaces now also include a description field, which we previously only displayed in our Data Dictionary.

Filter all the things!

In this release, we’ve introduced response filters for your requests, streamlining data retrieval across projects. These filters work with all APIs and are built by referencing the metadata found in the Metadata endpoints mentioned above.

Filters now allow you to request a specific project, namespace, and language when choosing which binary responses you want to get from Snapshot and Batch APIs.

These new filters are also useful for the On-demand and Realtime streaming APIs. You may filter requests by project code, language, article title, namespace, or almost any other field, and also specify which fields to return in the response body. For example, you can request only the article_body or abstract to save bandwidth and improve response times, quickly locate all articles about Albert Einstein across all projects, or restrict your responses to a specific language. There are lots of possibilities with filters!
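
To make the idea concrete, here is a minimal Python sketch that assembles a filtered request body. The request shape (a JSON object with a filters list of field/value pairs and a fields list), the helper name build_request_body, and the field names in_language.identifier and name are assumptions made for illustration; only abstract and article_body are named in this post, so check the API reference for the exact syntax before relying on it.

```python
import json
from typing import Any, Dict, List, Optional


def build_request_body(filters: Dict[str, Any],
                       fields: Optional[List[str]] = None) -> str:
    """Serialize filters and an optional field selection as JSON.

    The {"filters": [...], "fields": [...]} layout is an assumed shape,
    not one confirmed by this post.
    """
    body: Dict[str, Any] = {
        "filters": [
            {"field": name, "value": value} for name, value in filters.items()
        ]
    }
    if fields:
        # Ask only for the fields we need, to keep responses small.
        body["fields"] = list(fields)
    return json.dumps(body)


# Hypothetical example: English-language results, returning only the
# article name and its abstract.
example_body = build_request_body(
    {"in_language.identifier": "en"},
    fields=["name", "abstract"],
)
```

Keeping the body construction in one small function makes it easy to adjust if the real filter syntax differs from the shape assumed here.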

Parallel File Downloading

Snapshot and Realtime Batch APIs now support HTTP HEAD requests, which return essential file metadata without downloading the file itself. The response headers include content-length, last-modified, and accept-ranges (bytes), the last of which signals support for Range HTTP request headers. With this feature, you can choose a chunk size and use Range request headers when querying the Snapshot or Realtime Batch APIs to parallelize file downloads, a more efficient alternative to downloading an entire project file in one request.
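
The chunking step can be sketched in Python as follows. Given the content-length value from a HEAD response, byte_ranges plans inclusive byte ranges and range_header turns each one into a Range request header. The function names, the extra_headers parameter (where any authentication headers would go), and the choice to treat any status other than 206 as an error are assumptions made for illustration, and the download URL is left to the caller.

```python
import urllib.request
from typing import Dict, List, Optional, Tuple


def byte_ranges(content_length: int, chunk_size: int) -> List[Tuple[int, int]]:
    """Split [0, content_length) into inclusive (start, end) byte ranges."""
    if content_length < 0 or chunk_size <= 0:
        raise ValueError("content_length must be >= 0 and chunk_size > 0")
    return [
        (start, min(start + chunk_size, content_length) - 1)
        for start in range(0, content_length, chunk_size)
    ]


def range_header(start: int, end: int) -> Dict[str, str]:
    """Build the HTTP Range header for one inclusive byte range."""
    return {"Range": f"bytes={start}-{end}"}


def fetch_chunk(url: str, start: int, end: int,
                extra_headers: Optional[Dict[str, str]] = None) -> bytes:
    """Download one byte range; extra_headers can carry credentials."""
    headers = dict(extra_headers or {})
    headers.update(range_header(start, end))
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request) as response:
        # 206 Partial Content means the server honored the Range header.
        if response.status != 206:
            raise RuntimeError(f"expected 206, got {response.status}")
        return response.read()
```

For example, byte_ranges(10, 4) yields (0, 3), (4, 7), and (8, 9). Each range can then be passed to fetch_chunk from a worker pool such as concurrent.futures.ThreadPoolExecutor, and the chunks written back in order of their start offsets.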

NDJSON Response Available in all APIs

Since launch, our binary API endpoints (Snapshot and Realtime Batch) have returned NDJSON in packaged response files. In contrast, our On-demand and Realtime streaming HTTP APIs have returned plain JSON. A Snapshot user with an existing NDJSON parser would therefore need a slightly different parser to update articles from On-demand, even though the article data is the same.

This release introduces the option to specify Accept: application/x-ndjson in your request header for On-demand and Realtime APIs. The response is then formatted as NDJSON, allowing you to use the same parser for all responses. If the header is not sent, plain JSON remains the default format. This update makes response data consistent across all APIs and simplifies receiving and parsing data according to your preferences.
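
A shared parser along these lines might look like the following Python sketch. The Accept header value comes from this post, while the helper name iter_ndjson and the choice to skip blank lines are assumptions made for illustration.

```python
import json
from typing import Any, Iterable, Iterator

# Send this header with On-demand or Realtime requests to get NDJSON back.
NDJSON_HEADERS = {"Accept": "application/x-ndjson"}


def iter_ndjson(lines: Iterable[str]) -> Iterator[Any]:
    """Yield one parsed JSON value per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:  # tolerate blank lines between records
            yield json.loads(line)
```

Because the function accepts any iterable of text lines, the same code can read an unpacked Snapshot file opened in text mode or the decoded lines of an HTTP response body.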

All of these features are Available Now

We’re excited to release these features and optimizations to you and hear your feedback. If you already have an account, this new functionality is already enabled, and if you don’t, new user signup is easy and free and allows you instant access to Snapshot and On-demand APIs and the new filters. Our API reference documentation has also been updated accordingly. If you have any questions, our support team is here to assist you in your dashboard, and if you need more access, our sales team is available to help you out. To learn more about how these products and services fit within the ecosystem of the Wikimedia movement and existing APIs, read our FAQ at metawiki.

— Chuck Reynolds, Staff Product Manager, Growth

Photo Credits
Benjamin Gimmel, CC BY-SA 3.0, via Wikimedia Commons

