Some of Wikipedia’s most contextually rich and factual data resides within thousands of tables across millions of articles. These tables contain music charts, sports statistics, election results, filmographies, and much more. Their layout makes it easy for humans to grasp the information at a glance, but extracting them at scale as structured data is difficult for machines because of their complexity, variability, and formatting. Accessing this wealth of information has remained one of the most persistent challenges for anyone working with Wikipedia data at scale.
That changes today. We are introducing Parsed Tables to our Structured Contents beta payloads. This phase-one release makes it possible to access Wikipedia’s tabular data as ready-to-use JSON, improving downstream applications in AI/ML, knowledge graphs, search, and large-scale research.
Why Wikipedia Tables Matter for Developers and Data Teams
Wikipedia contains millions of high-quality tables that provide valuable information on a wide range of topics. Over 60% of the information stored in Wikipedia tables is not accessible anywhere else. These tables are human-curated, timely, and often the only source for specific domain facts.
Key use cases include:
- Knowledge Graphs: Populate graphs with award winners, sports statistics, discographies, and more.
- AI and ML Datasets: Enrich your training data with deeply contextual, machine-readable tables.
- Search and Discovery: Index structured facts for more relevant, granular answers.
- Fact-checking: Validate data across pages, languages, or snapshots to improve reliability.
- Academic Research: Analyze and compare tables at scale across topics and languages.
Yet for most teams, these tables have remained out of reach due to technical hurdles. Scraping HTML, reverse-engineering templates, or writing brittle parsers is costly and error-prone, and every one of these methods demands ongoing maintenance, because table structures can, and often do, change at any time.
Parsed Tables Explained
Parsed Tables is a new feature in Wikimedia Enterprise Structured Contents beta endpoints. It converts Wikipedia data tables from HTML and wikitext blobs into structured, reliable JSON ready for direct integration at scale. Parsed Tables are available in all Wikipedia languages. That said, tables have been tested more extensively for English, German, French, Spanish, Italian, Portuguese, and Dutch Wikipedias.
Each article section with a table will have a new table_references array that includes:
- the unique identifier of any parsed table in that section.
- a confidence_score indicating how structurally consistent the table layout is.
Each table:
- is a clean, structured JSON object inside a tables array.
- includes a unique identifier that refers back to the article section the table resides in, for full context.
- includes a confidence_score indicating how structurally consistent the table layout is.
Confidence Score Explained
Each parsed table includes a confidence_score field. This value reflects how well the table’s HTML is structured and, in turn, how likely it is that the parser extracted the table accurately. It is not a measure of the quality or reliability of the data itself.
A high score (e.g. 0.9) generally means the parser is highly likely to have extracted the table accurately, while a low score (e.g. 0.35) suggests structural issues such as irregular column counts, nested rows, or merged cells that reduce parsing accuracy.
In this initial release, only tables with a confidence score of 0.5 or higher are included in the parsed output. This threshold provides coverage of more than 70% of content tables across the top tested languages. Lower-scoring tables are excluded, but are still listed in the table_references array for the relevant section with their identifier and score.
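To make the relationship between the two arrays concrete, here is a minimal Python sketch that walks an article’s sections, collects table_references entries, and joins them to the tables array. The top-level field names (sections, tables) and the has_parts nesting follow the payload examples later in this article, so treat the exact shape as an assumption rather than a guaranteed schema.

def iter_table_references(parts):
    # Walk a section's has_parts tree and yield every table_references entry.
    for part in parts or []:
        for ref in part.get("table_references", []):
            yield ref
        yield from iter_table_references(part.get("has_parts"))

def resolve_tables(article):
    # Index parsed tables by identifier, then match each reference to one.
    tables_by_id = {t["identifier"]: t for t in article.get("tables", [])}
    for ref in iter_table_references(article.get("sections")):
        table = tables_by_id.get(ref["identifier"])
        if table is None:
            # Referenced but below the 0.5 threshold: no parsed contents
            # are included in this release.
            print(f"skipped {ref['identifier']} (score {ref['confidence_score']})")
        else:
            print(f"parsed  {ref['identifier']} (score {table['confidence_score']})")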
Examples of Parsed Tables
Source table: Wikipedia: All-time Olympic Games Medals Combined Totals
This is the rendered HTML version of a table that you’re used to seeing on Wikipedia:

Here’s that same table in the HTML source code. A web scraper would have to extract the table data from this constantly changing, non-standardised structure.

And here is that same table as a Wikitext BLOB that’s from the foundational Wikipedia APIs (also included in Enterprise APIs):

Here’s the same table again, but as an HTML BLOB supplied exclusively by Wikimedia Enterprise APIs.

Finally, let’s see what this looks like in our Structured Contents beta payloads. The first mention of any parsed table will be included within the article sections array, referencing the identifier and confidence score.
{
  "name": "combined_total_18962024",
  "type": "section",
  "has_parts": [
    {
      "type": "table",
      "table_references": [
        {
          "identifier": "complete_ranked_medals_excluding_precursors.combined_total_18962024_table1",
          "confidence_score": 0.9
        }
      ]
    }
  ]
}
That table_references.identifier can then be used to locate the contents of that table in the new tables array.
{
  "identifier": "complete_ranked_medals_excluding_precursors.combined_total_18962024_table1",
  "headers": [
    [
      { "value": "Rank" },
      { "value": "NOC" },
      { "value": "Gold" },
      { "value": "Silver" },
      { "value": "Bronze" },
      { "value": "Total" }
    ]
  ],
  "rows": [
    [
      { "value": "1" },
      { "value": "United States" },
      { "value": "1,105" },
      { "value": "879" },
      { "value": "781" },
      { "value": "2,765" }
    ],
    [
      { "value": "2" },
      { "value": "Soviet Union*" },
      { "value": "395" },
      { "value": "319" },
      { "value": "296" },
      { "value": "1,010" }
    ],
    [
      { "value": "3" },
      { "value": "China" },
      { "value": "303" },
      { "value": "226" },
      { "value": "198" },
      { "value": "727" }
    ],
    [
      { "value": "4" },
      { "value": "Great Britain" },
      { "value": "298" },
      { "value": "339" },
      { "value": "343" },
      { "value": "980" }
    ],
    [
      { "value": "5" },
      { "value": "France" },
      { "value": "239" },
      { "value": "278" },
      { "value": "299" },
      { "value": "816" }
    ],
    [ ** truncated for brevity ** ]
  ],
  "confidence_score": 0.9
},
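Once you have a table object like the one above, loading it into an analysis tool takes only a few lines. Here is a minimal sketch using pandas, assuming the single header row shown in this example; multi-row headers produced by merged cells would need flattening first.

import pandas as pd

def table_to_dataframe(table):
    # The first header row becomes the column names; each row is a list of cells.
    columns = [cell["value"] for cell in table["headers"][0]]
    data = [[cell["value"] for cell in row] for row in table["rows"]]
    return pd.DataFrame(data, columns=columns)

# e.g. df = table_to_dataframe(
#     tables_by_id["complete_ranked_medals_excluding_precursors.combined_total_18962024_table1"])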
The result: No scraping required. No additional parser codebase(s) to maintain. Just direct, reliable, structured JSON.
Get Started with Parsed Tables
Parsed Tables are available in Structured Contents endpoints in both the Snapshot API (7 supported languages) and the On-demand API (all languages):
1. If you don’t have a Wikimedia Enterprise account, sign up for free.
Free accounts allow 5,000 On-demand API requests per month, including Structured Contents endpoints.
2. Query the Structured Contents endpoints
The On-demand API endpoint /v2/structured-contents/{name} provides data for an individual article. The example below shows how to request the All-time Olympic Games medal table article, which contains the Combined Totals table, from English Wikipedia.
curl --location 'https://api.enterprise.wikimedia.com/v2/structured-contents/All-time_Olympic_Games_medal_table' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer ACCESS_TOKEN' \
--data '{"filters":[{"field":"is_part_of.identifier","value":"enwiki"}]}'
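The same request in Python, for readers who prefer requests over curl. The URL, headers, and filter body mirror the curl example exactly; only the response handling is an assumption (the endpoint returns JSON containing the sections and tables arrays described above).

import requests

url = ("https://api.enterprise.wikimedia.com"
       "/v2/structured-contents/All-time_Olympic_Games_medal_table")
response = requests.post(
    url,
    headers={"Authorization": "Bearer ACCESS_TOKEN"},
    # requests sets the Content-Type: application/json header for json= bodies.
    json={"filters": [{"field": "is_part_of.identifier", "value": "enwiki"}]},
)
response.raise_for_status()
payload = response.json()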
The Snapshot API endpoint /v2/snapshots/structured-contents/{identifier}/download provides bulk Structured Contents payloads for entire projects, but requires a paid account (contact sales) or access to Wikimedia Cloud Services.
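For bulk use, a download sketch along the same lines. The snapshot identifier shown (enwiki_namespace_0) and the output filename are illustrative assumptions; check your account dashboard or the API documentation for the exact values available to you.

import requests

# Hypothetical snapshot identifier; actual identifiers are listed in your dashboard.
url = ("https://api.enterprise.wikimedia.com/v2/snapshots/"
       "structured-contents/enwiki_namespace_0/download")
with requests.get(url, headers={"Authorization": "Bearer ACCESS_TOKEN"},
                  stream=True) as resp:
    resp.raise_for_status()
    # Stream the archive to disk in 1 MiB chunks to avoid loading it in memory.
    with open("structured_contents_snapshot.tar.gz", "wb") as out:
        for chunk in resp.iter_content(chunk_size=1 << 20):
            out.write(chunk)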
What’s Next for Parsed Tables
Parsed Tables currently covers content tables in Wikipedia articles with 0.5 or higher confidence scores. Highly complex tables, tables inside infoboxes, cells with images and links, and certain advanced markup are not yet included. However, ongoing improvements will expand coverage and add more features in phased releases.
Your feedback as a developer or data scientist is important and will directly shape future enhancements to Structured Contents features. Please share your ideas, questions, and feedback with us; it helps us make more of Wikipedia’s knowledge easily accessible and machine-readable. You can provide input via the feedback link in your account dashboard (right column), or through your Wikimedia Enterprise representative. If you’d like to inquire about becoming a testing partner for Structured Contents in the Snapshot API, please contact sales.