How Databricks Parsed Wikipedia to Markdown with Python

Databricks recently needed to ingest all of English Wikipedia’s roughly 7 million articles to support their data processing and modeling workflows. At that scale, the standard approach of parsing raw wikitext or HTML breaks down quickly: edge cases, broken formatting, and deeply nested templates make a clean text corpus expensive to extract.

To bypass this, Databricks engineers Sean Owen, Andrew Drozdov, and Michael Bendersky used Wikimedia Enterprise’s Structured Contents endpoints, which provide a pre-parsed, cleanly deconstructed JSON representation of Wikipedia articles.

“High-quality data that is well-structured is key for us. Structured Contents deeply decompose Wikipedia data, so re-composing the Markdown we wanted only took a day and a little bit of experimenting.” – Sean Owen, Research Scientist at Databricks

Here’s how they downloaded the Structured Contents snapshot they needed and used Apache Spark to convert the JSON into Markdown. The full parser is also available to explore and reuse in this Colab notebook.

1. Retrieving the Snapshot

Wikimedia Enterprise distributes data as Newline Delimited JSON (ndjson), a format that works well with Spark and can be parsed directly as structured data at scale. While a Python SDK is available, Databricks skipped it and used the requests library to retrieve an access token directly:

import requests
response = requests.post(
    "https://auth.enterprise.wikimedia.com/v1/login",
    json={
        "username": "USERNAME",
        "password": "PASSWORD"
    },
    allow_redirects=True
)
response.raise_for_status()
access_token = response.json().get("access_token")

With the access token, they requested the enwiki_namespace_0 snapshot (English Wikipedia, namespace 0, which holds the articles) by sending a GET request to the Structured Contents Snapshot download endpoint:

import requests

download_url = "https://api.enterprise.wikimedia.com/v2/snapshots/structured-contents/enwiki_namespace_0/download"
headers = {"Authorization": f"Bearer {access_token}", "Accept": "application/json"}

# stream=True defers fetching the body, so the snapshot is not loaded into memory at once
with requests.get(download_url, headers=headers, stream=True) as r:
    r.raise_for_status()
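The request above opens the stream but does not show the body being saved. A minimal sketch of that step follows, assuming the response is a requests.Response (or anything exposing iter_content). The helper name stream_to_file, the chunk size, and the destination file name are hypothetical and not taken from the Databricks notebook.

```python
def stream_to_file(response, destination: str, chunk_size: int = 1 << 20) -> int:
    """Write a streamed HTTP response to disk chunk by chunk; return bytes written."""
    written = 0
    with open(destination, "wb") as fh:
        # iter_content yields the body in pieces of at most chunk_size bytes
        for chunk in response.iter_content(chunk_size=chunk_size):
            if chunk:  # skip empty keep-alive chunks
                fh.write(chunk)
                written += len(chunk)
    return written

# Usage with the request shown above (the file name is a placeholder):
# with requests.get(download_url, headers=headers, stream=True) as r:
#     r.raise_for_status()
#     stream_to_file(r, "enwiki_namespace_0_snapshot")
```

Writing in fixed-size chunks keeps memory use flat regardless of snapshot size. Check the archive format of the downloaded file against the Wikimedia Enterprise documentation before unpacking it.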

2. Re-composing Markdown from JSON

Because the Structured Contents output is already broken down into logical components (e.g. sections, paragraphs, lists, and tables), Databricks wrote a custom parser to map these JSON elements back into Markdown.

They used a recursive Python function to walk through the article’s sections. Here is a simplified version of their logic for handling headers, paragraphs, nested sections, and lists:

from io import StringIO

def output_section(section: dict, out: StringIO, tables_map: dict, depth: int = 1):
    # Format the section header
    if 'name' in section and section['name'] != "abstract":
        out.write(f"{'#' * (depth+1)} {section['name'].replace('_', ' ').title()}\n\n")
            
    # Iterate through the components of the section
    for section_part in section.get('has_parts', []):
        section_part_type = section_part['type']
        
        if section_part_type == "paragraph":
            if 'value' in section_part:
                out.write(section_part['value'])
            # Append each citation's text directly after the paragraph text
            for citation in section_part.get('citations', []):
                if 'text' in citation:
                    out.write(citation['text'])
            out.write("\n\n")
            
        elif section_part_type == "section":
            # Recursively process nested sections
            output_section(section_part, out, tables_map, depth=depth+1)
            
        elif section_part_type == "list":
            for list_item in section_part.get('has_parts', []):
                if 'value' in list_item:
                    # chr(10) is a newline; f-string expressions cannot contain a backslash before Python 3.12
                    out.write(f"- {list_item['value'].replace(chr(10), ' ').strip()}\n")
            out.write("\n")
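To make the mapping concrete, here is a small worked example. It is only a sketch: the function is a condensed copy of the logic above (tables omitted, missing keys tolerated) so that it runs on its own, and the sample section is hand-written and hypothetical, including the "list_item" type value. Real Structured Contents payloads carry more fields.

```python
from io import StringIO

def output_section(section: dict, out: StringIO, depth: int = 1) -> None:
    """Condensed copy of the section renderer, without table handling."""
    name = section.get("name")
    if name and name != "abstract":
        out.write(f"{'#' * (depth + 1)} {name.replace('_', ' ').title()}\n\n")
    for part in section.get("has_parts", []):
        kind = part.get("type")
        if kind == "paragraph":
            out.write(part.get("value", ""))
            for citation in part.get("citations", []):
                out.write(citation.get("text", ""))
            out.write("\n\n")
        elif kind == "section":
            # Nested sections get one more '#' in their heading
            output_section(part, out, depth + 1)
        elif kind == "list":
            for item in part.get("has_parts", []):
                if "value" in item:
                    out.write(f"- {item['value'].replace(chr(10), ' ').strip()}\n")
            out.write("\n")

# Hypothetical sample input, written by hand for illustration
sample = {
    "type": "section",
    "name": "early_life",
    "has_parts": [
        {"type": "paragraph", "value": "Ada was born in London."},
        {"type": "section", "name": "education", "has_parts": [
            {"type": "list", "has_parts": [
                {"type": "list_item", "value": "Mathematics\n"},
                {"type": "list_item", "value": "Music"},
            ]},
        ]},
    ],
}

out = StringIO()
output_section(sample, out)
markdown = out.getvalue()
print(markdown)
```

The top-level section becomes a level-2 heading ("## Early Life"), the nested one a level-3 heading ("### Education"), and each list item becomes a single "- " bullet with its trailing newline stripped.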

3. Scaling with Apache Spark

To run this parser across tens of gigabytes of data in parallel, Databricks wrapped the logic in a PySpark User Defined Function (UDF). The function takes the raw JSON string, processes the entire article tree, and returns a struct holding the article's metadata alongside the generated Markdown.

from pyspark.sql.functions import udf, col
from datetime import datetime
from io import StringIO
import json

@udf('struct<identifier:int,name:string,description:string,abstract:string,url:string,date_created:timestamp,date_modified:timestamp,language:string,markdown:string>')
def render_markdown(structured_json: str) -> tuple:
    article = json.loads(structured_json)
    out = StringIO()
    
    # Process the article tree (calling functions like output_section)
    output_article(article, out) 
    markdown = out.getvalue()
    
    return (
        article['identifier'],
        article['name'],
        article.get('description'),
        article.get('abstract'),
        article['url'],
        datetime.fromisoformat(article['date_created']) if 'date_created' in article else None,
        datetime.fromisoformat(article['date_modified']),
        article['in_language']['identifier'],
        markdown
    )
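One detail worth checking when reusing this UDF is timestamp parsing. Before Python 3.11, datetime.fromisoformat rejects a trailing "Z" (the UTC designator), so a cluster on an older runtime may need a small wrapper. The sketch below assumes the timestamps may arrive in that form. The helper name parse_timestamp is hypothetical and is not part of the original parser.

```python
from datetime import datetime, timezone
from typing import Optional

def parse_timestamp(value: Optional[str]) -> Optional[datetime]:
    """Parse an ISO 8601 timestamp, tolerating a missing value and a trailing 'Z'."""
    if not value:
        return None
    if value.endswith("Z"):
        # Rewrite the UTC designator into an explicit offset that older Pythons accept
        value = value[:-1] + "+00:00"
    return datetime.fromisoformat(value)

parsed = parse_timestamp("2024-05-01T12:30:00Z")
print(parsed == datetime(2024, 5, 1, 12, 30, tzinfo=timezone.utc))  # True
print(parse_timestamp(None))  # None
```

Inside the UDF, a call such as parse_timestamp(article.get('date_modified')) would also return None instead of raising when a field is absent.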

Finally, they applied this UDF to the table holding the raw ndjson records (one JSON string per row, in a column named json) and saved the output as a new table for downstream model training:

# Read the table of raw ndjson records and render each record to Markdown
markdown_df = spark.read.table("main.wikimedia_structured_content.parsed") \
    .select(col("json"), render_markdown("json").alias("result")) \
    .select("json", "result.*")

# Save the newly parsed Markdown table
markdown_df.write.saveAsTable('main.wikimedia_structured_content.parsed_markdown')
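Before launching the Spark job, it can help to spot-check a few records locally. Because ndjson stores one JSON object per line, plain Python is enough for that. The sketch below is illustrative only: iter_ndjson is a hypothetical helper, and the two records are hand-written and heavily trimmed, reusing only field names that appear in the UDF above.

```python
import io
import json
from typing import Iterator, TextIO

def iter_ndjson(stream: TextIO) -> Iterator[dict]:
    """Yield one parsed object per non-empty line of an ndjson stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)

# Two hypothetical records with a blank line between them
sample = io.StringIO(
    '{"identifier": 1, "name": "Alpha", "in_language": {"identifier": "en"}}\n'
    '\n'
    '{"identifier": 2, "name": "Beta", "in_language": {"identifier": "en"}}\n'
)

rows = [
    (article["identifier"], article["name"], article["in_language"]["identifier"])
    for article in iter_ndjson(sample)
]
print(rows)  # [(1, 'Alpha', 'en'), (2, 'Beta', 'en')]
```

The same generator works on an open file handle, so a few lines of an unpacked snapshot can be inspected without loading the whole file.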

Try It Yourself

Using the pre-parsed JSON structure eliminates the need for regex-heavy wikitext or HTML parsing, turning a complex ingestion problem into a straightforward data transformation task. The same approach scales to any Wikimedia project that is available as Structured Contents, in any of the available languages.

If you want to build a similar pipeline, train an LLM on Wikipedia data, or test what Structured Contents looks like before committing, you can sign up for a free Wikimedia Enterprise account and start querying the endpoints immediately. For higher-volume access or a conversation about fit, contact our sales team.

— The Wikimedia Enterprise Team

About Databricks

Databricks was founded in 2013 by the original creators of the lakehouse architecture and of landmark open-source projects such as Apache Spark™ and Delta Lake. Databricks has built its Data Intelligence Platform on top of open-source lakehouse architecture, aiming to provide seamless data management and governance to any organization. Their goal is to help companies take control of their data and put it to work with AI.

Photo Credits

Colorful collection of building blocks, by Shixart1985, CC BY 2.0, via Wikimedia Commons