Structured Contents extends to Snapshot API

We’re excited to announce the early beta release of Structured Contents in the Snapshot API. This first release, available to testing partners, provides parsed Wikipedia articles as structured JSON files (NDJSON, packaged in tar.gz archives) with a consistent schema. It covers six Wikipedia languages: English, French, German, Italian, Spanish, and Portuguese.
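For readers who want a feel for working with this format, here is a minimal sketch of streaming records out of an NDJSON file packaged in a tar.gz archive, using only the Python standard library. The function name and the archive path are illustrative assumptions, not part of the Snapshot API, and the sketch assumes each archive member is a UTF-8 text file with one JSON object per line.

```python
import io
import json
import tarfile


def iter_snapshot_records(archive_path):
    """Yield one parsed JSON object per non-empty NDJSON line in a tar.gz archive.

    Assumption: every regular file inside the archive is UTF-8 NDJSON.
    """
    with tarfile.open(archive_path, mode="r:gz") as archive:
        for member in archive:
            # Skip directories and other non-file entries.
            if not member.isfile():
                continue
            raw = archive.extractfile(member)
            if raw is None:
                continue
            # Decode lazily so a large member is never loaded into memory at once.
            with io.TextIOWrapper(raw, encoding="utf-8") as lines:
                for line in lines:
                    line = line.strip()
                    if line:
                        yield json.loads(line)
```

Because the function is a generator, a caller can process one article at a time, for example `for record in iter_snapshot_records("snapshot.tar.gz"): ...`, where the file name is a placeholder for a downloaded snapshot.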

Overview

In September 2023 we launched the beta Structured Contents On-demand API endpoint and have since received invaluable feedback from developers. A common request was bulk access to the same structured information via the Snapshot API; this release fulfills that request.

Similar to the On-demand endpoint, the Structured Contents Snapshot endpoint includes parsed Wikipedia data such as abstracts, Wikidata QIDs, short descriptions, main images, infoboxes, and article sections with links. We are actively evaluating additional parsed elements—such as references, lists, and tables—to incorporate in future updates.
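To illustrate how the parsed elements listed above might be consumed, here is a small sketch that flattens one article record into a summary. Every key name in it ("name", "main_entity", "description", "abstract", "image", "infoboxes", "sections") is a hypothetical placeholder chosen for illustration; the actual field names are defined by the published schema and should be checked there.

```python
def summarize_article(record):
    """Flatten one parsed article record (a dict) into a small summary dict.

    All key names read from `record` are hypothetical placeholders,
    not the confirmed Structured Contents schema.
    """
    sections = record.get("sections") or []
    infoboxes = record.get("infoboxes") or []
    main_entity = record.get("main_entity") or {}
    return {
        "title": record.get("name"),
        "qid": main_entity.get("identifier"),  # Wikidata QID, if present
        "short_description": record.get("description"),
        "abstract": record.get("abstract"),
        "has_main_image": bool(record.get("image")),
        "infobox_count": len(infoboxes),
        "section_names": [s.get("name") for s in sections if isinstance(s, dict)],
    }
```

Using `.get` with fallbacks keeps the sketch tolerant of records where an element such as an infobox or main image is absent, which is a reasonable precaution for a beta dataset.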

General Access

This early beta release is intended for QA testing with our testing partners to help refine the endpoint before a broader beta release. If you are interested in becoming a testing partner, please express your interest to our sales team.

Releasing on Hugging Face

Alongside this release, we’re also publishing a Hugging Face dataset of the new beta Structured Contents snapshots and inviting the general public to use it freely and share feedback. Details about the Hugging Face dataset are posted on our blog.


Chuck Reynolds, Staff Product Manager, Growth