Parsed Article Sections and Short Descriptions

We have recently expanded the data available in our On-Demand Structured Contents endpoint by introducing two significant features: Article Body Sections and Short Descriptions.

The addition of parsed article content sections allows for more granular retrieval of Wikipedia article information, making it easier for high-volume data reusers to access specific segments of content in a machine-readable format. Alongside this, the new short Description field provides a one-line summary of the article, so you can identify a topic without reading the full text.

These latest features build on the success of our initial beta release, where we launched the structured contents endpoint featuring Infobox sidebar content in JSON format. As part of our ongoing efforts to refine and expand the capabilities of this beta endpoint, we continue to implement improvements based on the positive feedback received from our API users and customers.

Article Body Sections

Part of our ‘more structured data not blobs’ objective is parsing all of the contents of a Wikipedia article into JSON so that it is more machine-readable. Parsing an entire article from a single blob of Wikitext or HTML can be difficult and taxing, so we’ve done the hard work for you: each article is parsed into Sections arrays containing the plain text of every individual paragraph and its associated links.

The Article_Sections array includes the main sections of the page, using the header tag as each section name, and outputs the paragraph content and links into well-organized JSON as has_parts arrays. If a section has additional subheadings, then those too are included as nested sections that all follow the same format. Links are pulled into their own array with both the URL and the link text used in the related paragraph.

Data Dictionary Reference: Article Sections

Examples:

[Image: JSON response showing the Article Sections object, with the first section’s two paragraphs, links in separate arrays, and the start of a second section.]
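
As a rough illustration of how the nested sections described above might be consumed, here is a minimal Python sketch that walks the has_parts arrays recursively and yields each paragraph with its heading path and links. The sample data is hand-built, and the key names other than has_parts ("article_sections", "type", "name", "value", "links", "url", "text") are assumptions for illustration; check the Data Dictionary for the actual field names.

```python
from typing import Any, Iterator

# Hypothetical sample shaped like the structure described in this post.
# Only "has_parts" comes from the post; every other key name is an assumption.
SAMPLE: dict[str, Any] = {
    "article_sections": [
        {
            "type": "section",
            "name": "History",
            "has_parts": [
                {
                    "type": "paragraph",
                    "value": "First paragraph.",
                    "links": [{"url": "https://example.org/a", "text": "a"}],
                },
                {
                    "type": "section",
                    "name": "Early years",
                    "has_parts": [
                        {"type": "paragraph", "value": "Nested paragraph.", "links": []}
                    ],
                },
            ],
        }
    ]
}


def iter_paragraphs(
    parts: list[dict[str, Any]], path: tuple[str, ...] = ()
) -> Iterator[tuple[tuple[str, ...], str, list[dict[str, str]]]]:
    """Yield (heading path, paragraph text, links) for every paragraph."""
    for part in parts:
        if part.get("type") == "section":
            # Nested sections follow the same format, so recurse into them.
            yield from iter_paragraphs(
                part.get("has_parts", []), path + (part.get("name", ""),)
            )
        elif part.get("type") == "paragraph":
            yield path, part.get("value", ""), part.get("links", [])


rows = list(iter_paragraphs(SAMPLE["article_sections"]))
for heading_path, text, links in rows:
    print(" > ".join(heading_path), "|", text, "|", len(links), "link(s)")
```

Because subsections reuse the same shape as top-level sections, a single recursive function covers any nesting depth.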

Short Description field

To accompany the Abstract text released previously, we’ve added a short Description field that offers a one-line description of the article. Curated by Wikipedia and Wikidata editors, the short Description field is a brief summary of the article’s contents, providing a quick overview that helps with topic disambiguation.

Data Dictionary Reference: Description

Examples:

The NASA Wikipedia page:

"description": "American space and aeronautics agency",

The Josephine Baker Wikipedia page:

"description": "American-born French dancer, singer and actress (1906–1975)",
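
One possible use of the field is building a disambiguation label for search results or pick lists. The sketch below is a minimal Python example using the two descriptions quoted above; the "name" key and the disambiguation_label helper are hypothetical, and only "description" is taken from this post.

```python
from typing import Any

# Hand-built records; "name" is an assumed key, "description" is from the post.
ARTICLES: list[dict[str, Any]] = [
    {"name": "NASA", "description": "American space and aeronautics agency"},
    {
        "name": "Josephine Baker",
        "description": "American-born French dancer, singer and actress (1906–1975)",
    },
    {"name": "Untitled stub"},  # a record with no description
]


def disambiguation_label(article: dict[str, Any]) -> str:
    """Return 'name (description)', or just the name if no description exists."""
    name = article.get("name", "")
    description = article.get("description")
    return f"{name} ({description})" if description else name


labels = [disambiguation_label(a) for a in ARTICLES]
for label in labels:
    print(label)
```

Falling back to the bare name keeps the label usable for articles where no description is present.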

Wrap up

The structured contents beta endpoint now includes:

  • article short description
  • article abstract
  • article body sections
  • article infobox (sidebar) contents
  • article main image

Alongside these feature enhancements, we have also been improving the parsing quality of infoboxes and abstracts. Additionally, we are in the process of integrating tables within sections and accommodating sections that hold references and lists.

As we continue to refine these offerings, our efforts are also focused on expanding the availability of structured contents to more projects and languages, ensuring a broader reach and greater applicability.

Sign up for an account to use the APIs for free, and stay tuned for more updates. In the meantime, reach out via the feedback form in your customer dashboard or via the contact form, and let us know what you’re working on!

Chuck Reynolds, Product Marketing Manager at Wikimedia Enterprise

