Wikipedia Hugging Face Dataset using Structured Contents Snapshot

Wikimedia Enterprise has released an early beta dataset on Hugging Face for the general public to use freely and to provide feedback that will guide future improvements. The dataset is sourced from our Snapshot API, which delivers bulk database dumps (snapshots) of Wikimedia projects; in this case, the English and French editions of Wikipedia. It is built using our newly released Structured Contents beta, which provides more machine-readable response payloads so that users do not need to parse a massive blob of article body text.

What is Hugging Face?

Hugging Face is a leading platform in the AI and machine learning space, known for its tools and libraries that support the development and sharing of models and datasets. Specifically for datasets, it serves as an open-access hub where developers and researchers can upload, explore, and collaborate on datasets across various fields. Publishing our dataset on Hugging Face allows users to easily access and integrate it into their machine learning workflows, fostering innovation and enabling new applications for our data.

What’s in the dataset and usage

The dataset we’re publishing to Hugging Face contains all articles of the English and French language editions of Wikipedia, pre-parsed and output as structured JSON files with a consistent schema and compressed as zip archives. Each JSON object holds the content of one full Wikipedia article, stripped of extra markup and non-prose sections (references, etc.). The fields included in the current release are listed in the Dataset Fields section below.
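As a rough illustration of working with files in this shape, the sketch below reads every line of every file inside a zip archive as one JSON article. It uses only the Python standard library. The archive path and the names of the files inside it are hypothetical, and the one-object-per-line layout is taken from the description in this post rather than from a verified file listing.

```python
import io
import json
import zipfile
from typing import Iterator


def iter_articles(zip_path: str) -> Iterator[dict]:
    """Yield one article dict per non-empty JSON line in each file of the zip.

    Assumption: every member of the archive is a text file holding one
    JSON object per line, as described in this post.
    """
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.namelist():
            if member.endswith("/"):
                continue  # skip directory entries
            with archive.open(member) as raw:
                # Decode lazily so a large file is never loaded in one piece.
                for line in io.TextIOWrapper(raw, encoding="utf-8"):
                    line = line.strip()
                    if line:
                        yield json.loads(line)
```

Because the function is a generator, something like `for article in iter_articles("enwiki_sample.zip"): ...` (a hypothetical file name) processes one article at a time instead of holding the whole snapshot in memory.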

In its structured form, the dataset is useful across all phases of model development: pre-training, alignment, fine-tuning, updating/RAG, and testing/benchmarking.

Dataset Structure

Each line of a file is one JSON object. An abbreviated example looks as follows:

{
"name": "Josephine Baker",
"identifier": 255083,
"url": "https://en.wikipedia.org/wiki/Josephine_Baker",
"date_created": "...",
"date_modified": "...",
"is_part_of": {"..."},
"in_language": {"..."},
"main_entity": {"identifier": "Q151972",...},
"additional_entities": [...],
"version": {...},
"description": "American-born French dancer...",
"abstract": "Freda Josephine Baker, naturalized as ...",
"image": {"content_url": "https://upload.wikimedia.org/wikipedia/...",...},
"infobox": [{"name": "Infobox person",
    "type": "infobox",
    "has_parts": [
        {"name": "Josephine Baker",
        "type": "section",
        "has_parts": [
            {"name": "Born",
            "type": "field",
            "value": "Freda Josephine McDonald June 3, 1906 
                St. Louis, Missouri, US",
            "links": [{"url":
                "https://en.wikipedia.org/wiki/St._Louis",
                "text": "St. Louis"},...}],
"sections": [{"name": "Abstract",
    "type": "section",
    "has_parts": [
        {"type": "paragraph",
        "value": "Freda Josephine Baker (née McDonald; 
            June 3, 1906 - April 12, 1975), 
            naturalized as Joséphine Baker...",
        "links": [{"url": 
            "https://en.wikipedia.org/wiki/Siren_...",
            "text": "Siren of the Tropics"...}],
"license": [...],
}
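To show how the nested "has_parts" structure in the example above might be traversed, here is a minimal sketch that collects infobox entries of type "field" into a flat name-to-value mapping. The helper names (iter_fields, infobox_fields) are hypothetical, and the sample record is a hand-built fragment modeled on the abbreviated example, not a real line from the dataset.

```python
import json
from typing import Dict, Iterator, Tuple


def iter_fields(node: dict) -> Iterator[Tuple[str, str]]:
    """Recursively yield (name, value) for every node whose type is 'field'."""
    if node.get("type") == "field":
        yield node.get("name", ""), node.get("value", "")
    for child in node.get("has_parts") or []:
        yield from iter_fields(child)


def infobox_fields(article: dict) -> Dict[str, str]:
    """Flatten all infoboxes of an article; the first value seen for a name wins."""
    fields: Dict[str, str] = {}
    for box in article.get("infobox") or []:
        for name, value in iter_fields(box):
            fields.setdefault(name, value)
    return fields


# Hand-built fragment modeled on the abbreviated example (not real dataset output).
line = json.dumps({
    "name": "Josephine Baker",
    "infobox": [{
        "name": "Infobox person",
        "type": "infobox",
        "has_parts": [{
            "name": "Josephine Baker",
            "type": "section",
            "has_parts": [{
                "name": "Born",
                "type": "field",
                "value": "Freda Josephine McDonald June 3, 1906 St. Louis, Missouri, US",
            }],
        }],
    }],
})

article = json.loads(line)
print(infobox_fields(article))
```

Keeping only the first value per field name is a simplification chosen for the sketch; an article with several infoboxes may repeat names, in which case a list of values per name would lose less information.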

Dataset Fields

Data fields included in every line/article are the following:

  • name – title of the article
  • identifier – ID of the article
  • abstract – lead section, summarizing what the article is about
  • version – metadata related to the latest specific revision of the article
  • version.editor – editor-specific signals that can help contextualize the revision
  • version.scores – numerical assessments by ML models on the likelihood of a revision being reverted
  • url – URL of the article
  • date_created – timestamp of the article creation event, or the article’s first revision
  • date_modified – timestamp of the last revision of the article
  • main_entity – Wikidata QID the article is related to
  • is_part_of – Wikimedia project this article belongs to
  • additional_entities – array of Wikidata entities used in this article
  • in_language – human language in which the article is written
  • image – the main image representing the article’s subject
  • license – relevant licenses that affect this article and content reuse
  • description – one-sentence description of the article for quick reference
  • infobox – parsed information from the side panel (infobox) on the Wikipedia article
  • sections – parsed sections of the full article, including links

Note: the article object excludes other media and images, lists, tables, references, and similar non-prose sections. Our data dictionary provides fuller field context and a schema reference.
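As one possible use of the fields listed above, the following sketch turns the "sections" field into per-paragraph text chunks that carry the article title, section name, and URL, which is a common shape for retrieval (RAG) pipelines. The function names and the chunk layout are hypothetical choices for illustration. The sketch also assumes that sections can nest through "has_parts", and the second paragraph in the sample is placeholder text.

```python
from typing import Iterator, List, Optional, Tuple


def iter_paragraphs(
    node: dict, heading: Optional[str] = None
) -> Iterator[Tuple[Optional[str], str]]:
    """Yield (nearest enclosing section name, paragraph text) pairs."""
    if node.get("type") == "section":
        heading = node.get("name") or heading
    elif node.get("type") == "paragraph" and node.get("value"):
        yield heading, node["value"]
    for child in node.get("has_parts") or []:
        yield from iter_paragraphs(child, heading)


def to_chunks(article: dict) -> List[dict]:
    """Build one chunk per paragraph, keeping title and URL for attribution."""
    chunks = []
    for section in article.get("sections") or []:
        for heading, text in iter_paragraphs(section):
            chunks.append({
                "title": article.get("name"),
                "section": heading,
                "url": article.get("url"),
                "text": text,
            })
    return chunks


# Hand-built sample modeled on the abbreviated example; not real dataset output.
sample = {
    "name": "Josephine Baker",
    "url": "https://en.wikipedia.org/wiki/Josephine_Baker",
    "sections": [{
        "name": "Abstract",
        "type": "section",
        "has_parts": [
            {"type": "paragraph",
             "value": "Freda Josephine Baker (née McDonald; June 3, 1906 - "
                      "April 12, 1975), naturalized as Joséphine Baker..."},
            {"name": "Early life", "type": "section",  # assumed nesting
             "has_parts": [{"type": "paragraph",
                            "value": "placeholder paragraph text"}]},
        ],
    }],
}

for chunk in to_chunks(sample):
    print(chunk["section"], "->", chunk["text"][:40])
```

Carrying the title and URL on every chunk keeps the source article identifiable downstream, which helps when meeting the attribution expectations described in this post.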

Data Licensing and Attribution

All original textual content is licensed under the GNU Free Documentation License (GFDL) and the Creative Commons Attribution-ShareAlike 4.0 License. Some text may be available only under the Creative Commons license; see the Wikimedia Terms of Use for details. Text written by some authors may be released under additional licenses or into the public domain.

Attribution, as required by the Creative Commons license used for this dataset, is core to the sustainability of the Wikimedia projects. It is what drives new editors and donors to Wikipedia. With consistent attribution, this cycle of content creation and reuse ensures that encyclopedic content of high quality, reliability, and verifiability continues to be written on Wikipedia and ultimately remains available for reuse via datasets such as this one.

As such, we require all users of this dataset to conform to our expectations for proper attribution. Detailed attribution requirements for use of this dataset are outlined on Hugging Face.

Where to get the dataset

The official Wikimedia Enterprise beta dataset is available on Hugging Face at huggingface.co/datasets/wikimedia/structured-wikipedia. Users can explore the dataset, download it, or integrate it directly into their machine learning pipelines using Hugging Face’s dataset tools and APIs.
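For readers who prefer to pull the data programmatically, the sketch below uses the Hugging Face `datasets` library (installed separately, for example with `pip install datasets`) to list the available configurations and stream a few records without downloading the full snapshot. The repository ID comes from the URL above. The split name "train" and the idea that each language is published as its own configuration are assumptions, so check the dataset page for the actual names.

```python
from itertools import islice
from typing import List

# Repository ID taken from the dataset URL in this post.
DATASET_ID = "wikimedia/structured-wikipedia"


def list_configs() -> List[str]:
    """Return the configuration names published for the dataset (needs network)."""
    from datasets import get_dataset_config_names  # imported lazily

    return get_dataset_config_names(DATASET_ID)


def stream_articles(config_name: str, limit: int = 3) -> List[dict]:
    """Stream the first `limit` records of one configuration (needs network).

    Assumption: the data is exposed under a split named "train".
    """
    from datasets import load_dataset  # imported lazily

    dataset = load_dataset(
        DATASET_ID, config_name, split="train", streaming=True
    )
    return list(islice(dataset, limit))
```

Calling `list_configs()` first and passing one of the returned names to `stream_articles` avoids hard-coding a configuration name that may change between beta releases.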

Conclusion

The release of this beta Wikipedia dataset on Hugging Face marks an important step in making Wikimedia’s rich content more accessible and usable for AI and machine learning applications. By providing structured, machine-readable data from our Snapshot API, we’re opening up new possibilities for researchers, developers, and data scientists to leverage this vast knowledge base in innovative ways.

We encourage users to explore the dataset, provide feedback, and share their applications and insights. Your input will be invaluable in shaping future improvements and expansions of this dataset. As we continue to refine our Structured Contents beta and expand language coverage, we look forward to seeing the creative and impactful ways in which this data will be used to advance knowledge and technology.

— Wikimedia Enterprise Team