Structured Contents Initiative

More Structured Data, Fewer BLOBs

Wikipedia uses wikitext, a markup language designed for formatting page content. While it has proven useful for editors authoring wiki articles, it creates complexity for developers parsing articles at scale.

When Wikimedia Enterprise launched in 2021, we built it to serve high-volume, high-frequency users of Wikimedia data. As part of that effort we also improved parsing by providing HTML blobs, a format that developers are more familiar with and for which many parsing libraries already exist.

The Structured Contents Initiative is the next step in serving easy-to-parse Wikimedia data. Currently in beta, it extracts infoboxes, sections, tables, references, and more from raw wikitext and HTML and delivers them as structured, machine-readable JSON.

What’s Available Now

Structured Contents currently extracts the following article pieces into JSON:

  • abstract
  • description
  • infobox
  • sections
  • citations & references
  • tables

For a full breakdown of the Structured Contents data schema used in responses, see our Data Dictionary: Beta section.

Showcase: BLOBs vs Structured Contents

Below are examples using Josephine Baker’s English Wikipedia article. Each feature is shown in three forms, in this order: the raw HTML BLOB, the raw wikitext BLOB, and the clean JSON output from Structured Contents. The examples show how the data is transformed and why the result is easier for developers to use. Some of the payload output in these examples has been truncated (marked with [...]).

Article Description

[...]\u003e\u003cdiv class=\"shortdescription nomobile noexcerpt noprint searchaux\" style=\"display:none\" about=\"#mwt1\" typeof=\"mw:Transclusion\" data-mw='{\"parts\":[{\"template\":{\"target\":{\"wt\":\"short description\",\"href\":\"./Template:Short_description\"},\"params\":{\"1\":{\"wt\":\"American-born French entertainer (1906–1975)\"}},\"i\":0}}]}' id=\"mwAg\"\u003eAmerican-born French entertainer (1906–1975)\u003c/div[...]
{{short description|American-born French entertainer (1906–1975)}}\n
"description": "American-born French entertainer (1906–1975)"
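
To make the difference concrete, here is a minimal Python sketch that recovers the same description from the wikitext line and from the JSON line above. The regular expression is a deliberately naive assumption for illustration: it only handles the simple one-parameter form of the template shown here and is not how Structured Contents itself extracts the value.

```python
import json
import re

# Wikitext route: the description is buried inside a template call,
# so we need a pattern that understands the {{...|...}} syntax.
wikitext = "{{short description|American-born French entertainer (1906–1975)}}\n"
match = re.search(
    r"\{\{\s*short description\s*\|([^{}|]*)\}\}",
    wikitext,
    re.IGNORECASE,
)
from_wikitext = match.group(1).strip() if match else None

# Structured Contents route: the description is a plain key lookup.
payload = json.loads(
    '{"description": "American-born French entertainer (1906–1975)"}'
)
from_json = payload.get("description")

print(from_wikitext)
print(from_json)
```

Both routes yield the same string here, but only the JSON lookup keeps working unchanged if the template is written with extra parameters or different spacing.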

Article Sections

id=\"mwIQ\"\u003eDuring her early career, Baker was among the most celebrated performers to headline the revues of the \u003cspan title=\"French-language text\" about=\"#mwt45\" typeof=\"mw:Transclusion\" data-mw='{\"parts\":[{\"template\":{\"target\":{\"wt\":\"lang\",\"href\":\"./Template:Lang\"},\"params\":{\"1\":{\"wt\":\"fr\"},\"2\":{\"wt\":\"[[Folies Bergère]]\"},\"italic\":{\"wt\":\"no\"}},\"i\":0}}]}' id=\"mwIg\"\u003e\u003cspan lang=\"fr\" style=\"font-style: normal;\"\u003e\u003ca rel=\"mw:WikiLink\" href=\"./Folies_Bergère\" title=\"Folies Bergère\"\u003eFolies Bergère\u003c/a\u003e\u003c/span\u003e\u003c/span\u003e\u003clink rel=\"mw:PageProp/Category\" href=\"./Category:Articles_containing_French-language_text\" about=\"#mwt45\" id=\"mwIw\"/\u003e in \u003ca rel=\"mw:WikiLink\" href=\"./Paris\" title=\"Paris\" id=\"mwJA\"\u003eParis\u003c/a\u003e.[...]
\n\nDuring her early career, Baker was among the most celebrated performers to headline the revues of the {{lang|fr|[[Folies Bergère]]|italic=no}} in [[Paris]]. [...]
"sections": [{
  "type": "paragraph",
  "value": "During her early career, Baker was among the most celebrated performers to headline the revues of the Folies Bergère in Paris. [...]",
  "links": [
    {
      "url": "https://en.wikipedia.org/wiki/Folies_Bergère",
      "text": "Folies Bergère"
    },
    [...]
  ],
  "citations": [
    {
      "identifier": "cite_note-4",
      "text": "[4]"
    },
    [...]
  ]
}]
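
As a rough illustration of consuming this output, the Python sketch below collects the text, link labels, and citation identifiers of every paragraph. The sample mirrors the object above with the elided entries ([...]) left out. The helper names are hypothetical, and treating sections as nestable through a "has_parts" key is an assumption borrowed from the infobox example further down.

```python
import json

# Sample copied from the paragraph object above; elided parts are omitted.
sample = json.loads("""
{
  "sections": [{
    "type": "paragraph",
    "value": "During her early career, Baker was among the most celebrated performers to headline the revues of the Folies Bergère in Paris.",
    "links": [
      {"url": "https://en.wikipedia.org/wiki/Folies_Bergère", "text": "Folies Bergère"}
    ],
    "citations": [
      {"identifier": "cite_note-4", "text": "[4]"}
    ]
  }]
}
""")


def iter_parts(parts):
    """Yield every part depth-first.

    Assumption: nested parts, if any, live under "has_parts".
    """
    for part in parts:
        yield part
        yield from iter_parts(part.get("has_parts", []))


def paragraph_summaries(article):
    """Return text, link labels, and citation ids for each paragraph part."""
    summaries = []
    for part in iter_parts(article.get("sections", [])):
        if part.get("type") != "paragraph":
            continue
        summaries.append({
            "text": part.get("value", ""),
            "link_texts": [link["text"] for link in part.get("links", [])],
            "citation_ids": [c["identifier"] for c in part.get("citations", [])],
        })
    return summaries


summaries = paragraph_summaries(sample)
for summary in summaries:
    print(summary["link_texts"], summary["citation_ids"])
```

No HTML parsing or template expansion is involved: the plain text, the links, and the citations each arrive in their own field.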

Article Infoboxes

data-mw-deduplicate=\"TemplateStyles:r1295905060\" typeof=\"mw:Extension/templatestyles mw:Transclusion\" about=\"#mwt6\" data-mw='{\"name\":\"templatestyles\",\"attrs\":{\"src\":\"Module:Infobox/styles.css\"},\"body\":{\"extsrc\":\"\"},\"parts\":[{\"template\":{\"target\":{\"wt\":\"Infobox person\\n\",\"href\":\"./Template:Infobox_person\"},\"params\":{\"name\":{\"wt\":\"Josephine Baker\"},\"image\":{\"wt\":\"File:Baker Harcourt 1940 2.jpg\"},\"caption\":{\"wt\":\"Baker in 1940\"},\"birth_name\":{\"wt\":\"Freda Josephine McDonald\"},\"birth_date\":{\"wt\":\"{{birth date|mf=yes|1906|06|03}}\"},\"birth_place\":{\"wt\":\"[[St. Louis]], Missouri, U.S.\"}
[...]
{{Infobox person\n| name               = Josephine Baker\n| image              = File:Baker Harcourt 1940 2.jpg\n| caption            = Baker in 1940\n| birth_name         = Freda Josephine McDonald\n| birth_date         = {{birth date|mf=yes|1906|06|03}}\n| birth_place        = [[St. Louis]], Missouri, U.S.\n| [...]
"infoboxes": [{
  "name": "Infobox person",
  "type": "infobox",
  "has_parts": [
    {
      "name": "Josephine Baker",
      "type": "section",
      "has_parts": [
        {
          "type": "image",
          "value": "Baker in 1940",
          "images": [
            {
              "content_url": "https://upload.wikimedia.org/wikipedia/commons/thumb/0/0b/Baker_Harcourt_1940_2.jpg/250px-Baker_Harcourt_1940_2.jpg",
              "caption": "Baker in 1940",
              "height": 250,
              "width": 250
            }
          ]
        },
        {
          "name": "Born",
          "type": "field",
          "value": "Freda Josephine McDonald June 3, 1906 St. Louis, Missouri, U.S.",
          "links": [
            {
              "url": "https://en.wikipedia.org/wiki/St._Louis",
              "text": "St. Louis"
            }
          ]
        },
        [...]
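
The nested "has_parts" layout shown above can be flattened with a short traversal. The following Python sketch is one possible approach, not an official client: the function name is hypothetical, the sample is a trimmed copy of the output above, and later fields silently overwrite earlier ones that share a name.

```python
def infobox_fields(infobox):
    """Collect {name: value} for every part of type "field", at any depth."""
    fields = {}
    stack = list(infobox.get("has_parts", []))
    while stack:
        part = stack.pop()
        if part.get("type") == "field" and "name" in part:
            fields[part["name"]] = part.get("value", "")
        # Sections (and possibly other parts) carry their children here.
        stack.extend(part.get("has_parts", []))
    return fields


# Trimmed copy of the infobox object above (image details and links omitted).
infobox = {
    "name": "Infobox person",
    "type": "infobox",
    "has_parts": [{
        "name": "Josephine Baker",
        "type": "section",
        "has_parts": [
            {"type": "image", "value": "Baker in 1940"},
            {
                "name": "Born",
                "type": "field",
                "value": "Freda Josephine McDonald June 3, 1906 St. Louis, Missouri, U.S.",
            },
        ],
    }],
}

fields = infobox_fields(infobox)
print(fields["Born"])
```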

How to Access Structured Contents

Structured Contents is currently available in two of our APIs:

On-demand API: Request individual articles from any project with structured JSON. Best for testing, post-training, or lightweight use.

Snapshot API: Get a compressed file of all articles in a project as structured JSON snapshots. Best for pre-training, indexing, and high-scale applications.
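
For snapshot-scale processing, it usually helps to stream articles one at a time rather than load a whole file into memory. The Python sketch below assumes the snapshot is a gzip-compressed tar archive whose members are newline-delimited JSON files, one article per line; check the API documentation for the actual packaging. The function name, the member file name, and the demo records are hypothetical.

```python
import io
import json
import tarfile


def iter_snapshot_articles(fileobj):
    """Yield one article dict per NDJSON line, without extracting to disk.

    Assumption: the snapshot is a .tar.gz of newline-delimited JSON files.
    """
    with tarfile.open(fileobj=fileobj, mode="r:gz") as archive:
        for member in archive:
            if not member.isfile():
                continue
            handle = archive.extractfile(member)
            if handle is None:
                continue
            for raw_line in handle:
                line = raw_line.strip()
                if line:
                    yield json.loads(line)


def build_demo_archive():
    """Build a tiny in-memory archive so the sketch runs without a download."""
    records = [
        {"name": "Josephine Baker",
         "description": "American-born French entertainer (1906–1975)"},
        {"name": "Example article"},  # made-up record for illustration only
    ]
    data = "\n".join(json.dumps(r) for r in records).encode("utf-8")
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        info = tarfile.TarInfo(name="demo_0.ndjson")  # hypothetical file name
        info.size = len(data)
        archive.addfile(info, io.BytesIO(data))
    buffer.seek(0)
    return buffer


names = [article["name"] for article in iter_snapshot_articles(build_demo_archive())]
print(names)
```

With a real download, the same generator could be fed an opened file instead, for example `open(path, "rb")`.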

Shaping Structured Contents Together

We welcome and encourage feedback on Structured Contents to help us strengthen current features and shape new ones. Signing up for an API account gives you access to the latest features, but to make experimentation easy we have also shared early versions of Structured Contents snapshots on the open dataset platforms Hugging Face and Kaggle.

Wikimedians can also access beta Structured Contents through their Wikimedia Cloud Services accounts.