Wikipedia Hugging Face Dataset using Structured Contents Snapshot

19 Sep 2024

Wikimedia Enterprise has released an early beta dataset on Hugging Face that the general public can use freely and comment on to guide future improvements. The dataset is sourced from our Snapshot API, which delivers bulk database dumps (snapshots) of Wikimedia projects; in this case, English and French Wikipedia. It is built with our newly released Structured Contents beta, which returns more machine-readable response payloads so that users do not need to parse the article body as one massive blob.

What is Hugging Face?

Hugging Face is a leading platform in the AI and machine learning space, known for its tools and libraries that support the development and sharing of models and datasets. Specifically for datasets, it serves as an open-access hub where developers and researchers can upload, explore, and collaborate on datasets across various fields. Publishing our dataset on Hugging Face allows users to easily access and integrate it into their machine learning workflows, fostering innovation and enabling new applications for our data.

What’s in the dataset and usage

The dataset we’re publishing to Hugging Face contains all articles of the English and French language editions of Wikipedia, pre-parsed and output as structured JSON files that share a consistent schema and are compressed as zip archives. Each JSON object holds the content of one full Wikipedia article, stripped of extra markup and non-prose sections (references, etc.). The fields in the current release are listed in the Dataset Fields section below.
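As a rough illustration of reading this format, the sketch below iterates over a zip archive whose members hold one JSON object per line. It is a minimal sketch that uses only the Python standard library. The archive layout (one or more line-delimited JSON members), the member name demo.jsonl, and the second demo row are assumptions made for the example, not details confirmed by the release.

```python
import io
import json
import zipfile
from typing import Iterator


def iter_articles(zip_path_or_file) -> Iterator[dict]:
    """Yield one article dict per non-empty line of every file in the zip.

    Assumes each member is UTF-8 text with one JSON object per line.
    """
    with zipfile.ZipFile(zip_path_or_file) as archive:
        for member in archive.namelist():
            if member.endswith("/"):
                continue  # skip directory entries
            with archive.open(member) as raw:
                for line in io.TextIOWrapper(raw, encoding="utf-8"):
                    line = line.strip()
                    if line:
                        yield json.loads(line)


def build_demo_archive() -> io.BytesIO:
    """Create a tiny in-memory zip that mimics the assumed layout."""
    rows = [
        {"name": "Josephine Baker", "identifier": 255083},
        {"name": "Example article", "identifier": 1},  # hypothetical row
    ]
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        payload = "\n".join(json.dumps(row) for row in rows) + "\n"
        archive.writestr("demo.jsonl", payload)  # hypothetical member name
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    for article in iter_articles(build_demo_archive()):
        print(article["identifier"], article["name"])
```

Reading line by line keeps memory use flat, which matters for archives that contain every article of a language edition.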

The dataset in its structured form is generally helpful for a wide variety of tasks, including all phases of model development, from pre-training to alignment, fine-tuning, updating/RAG as well as testing/benchmarking.

Dataset Structure

An example of each line of JSON looks as follows (abbreviated data):

{
"name": "Josephine Baker",
"identifier": 255083,
"url": "https://en.wikipedia.org/wiki/Josephine_Baker",
"date_created": "...",
"date_modified": "...",
"is_part_of": {...},
"in_language": {...},
"main_entity": {"identifier": "Q151972", ...},
"additional_entities": [...],
"version": {...},
"description": "American-born French dancer...",
"abstract": "Freda Josephine Baker, naturalized as ...",
"image": {"content_url": "https://upload.wikimedia.org/wikipedia/...", ...},
"infoboxes": [{"name": "Infobox person",
    "type": "infobox",
    "has_parts": [
        {"name": "Josephine Baker",
        "type": "section",
        "has_parts": [
            {"name": "Born",
            "type": "field",
            "value": "Freda Josephine McDonald June 3, 1906 St. Louis, Missouri, US",
            "links": [{"url": "https://en.wikipedia.org/wiki/St._Louis",
                "text": "St. Louis"}, ...]},
            ...]},
        ...]},
    ...],
"sections": [{"name": "Abstract",
    "type": "section",
    "has_parts": [
        {"type": "paragraph",
        "value": "Freda Josephine Baker (née McDonald; June 3, 1906 - April 12, 1975), naturalized as Joséphine Baker...",
        "links": [{"url": "https://en.wikipedia.org/wiki/Siren_...",
            "text": "Siren of the Tropics"}, ...]},
        ...]},
    ...],
"license": [...]
}
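The nested has_parts arrays in the example above can be walked recursively. The following is a minimal sketch, assuming that every infobox node is a dict with optional name, type, value, and has_parts keys as in the abbreviated example. The helper name iter_infobox_fields is hypothetical, and the sample article is trimmed from the example rather than taken from the real dataset.

```python
from typing import Iterator, Tuple

Path = Tuple[str, ...]


def iter_infobox_fields(part: dict, path: Path = ()) -> Iterator[Tuple[Path, str]]:
    """Yield (path, value) for every node of type "field" under `part`.

    `path` records the names of the enclosing infobox and sections.
    """
    name = part.get("name")
    here = path + (name,) if name else path
    if part.get("type") == "field":
        yield here, part.get("value", "")
    # has_parts may be absent or null, so fall back to an empty list.
    for child in part.get("has_parts") or []:
        yield from iter_infobox_fields(child, here)


# Sample trimmed from the abbreviated example in this post.
article = {
    "name": "Josephine Baker",
    "infoboxes": [{
        "name": "Infobox person",
        "type": "infobox",
        "has_parts": [{
            "name": "Josephine Baker",
            "type": "section",
            "has_parts": [{
                "name": "Born",
                "type": "field",
                "value": "Freda Josephine McDonald June 3, 1906 St. Louis, Missouri, US",
                "links": [{"url": "https://en.wikipedia.org/wiki/St._Louis",
                           "text": "St. Louis"}],
            }],
        }],
    }],
}

fields = [
    item
    for infobox in article["infoboxes"]
    for item in iter_infobox_fields(infobox)
]

if __name__ == "__main__":
    for path, value in fields:
        print(" > ".join(path), "=", value)
```

Keeping the full path to each field avoids collisions when the same field name appears in more than one infobox section.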

Dataset Fields

Data fields included in every line/article are the following:

name – title of the article
identifier – ID of the article
abstract – lead section, summarizing what the article is about
version – metadata related to the latest specific revision of the article
version.editor – editor-specific signals that can help contextualize the revision
version.scores – numerical assessments by ML models on the likelihood of a revision being reverted
url – URL of the article
date_created – timestamp of the article creation event, or the article’s first revision
date_modified – timestamp of the last revision of the article
main_entity – Wikidata QID the article is related to
is_part_of – Wikimedia project this article belongs to
additional_entities – array of Wikidata entities used in this article
in_language – human language in which the article is written
image – the main image representing the article’s subject
license – relevant licenses that affect this article and content reuse
description – one-sentence description of the article for quick reference
infoboxes – parsed information from the side panel (aka infobox) on the Wikipedia article
sections – parsed sections of the full article, including links

Note: each object excludes other media and images, lists, tables, references, and similar non-prose sections. A fuller description of each field and a schema reference are available in our data dictionary.
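Because the sections field holds only prose, it can be flattened into plain text per section, for example to build chunks for retrieval. The sketch below assumes that paragraphs appear as has_parts entries with type "paragraph" and that subsections, where present, are nested entries with type "section"; the nesting is an assumption drawn from the infobox example. The function names and the "Example section" sample are hypothetical.

```python
from typing import List, Tuple


def section_paragraphs(section: dict) -> List[str]:
    """Collect paragraph text from a section, descending into subsections."""
    paragraphs: List[str] = []
    for part in section.get("has_parts") or []:
        kind = part.get("type")
        if kind == "paragraph" and part.get("value"):
            paragraphs.append(part["value"])
        elif kind == "section":  # assumed shape for nested subsections
            paragraphs.extend(section_paragraphs(part))
    return paragraphs


def article_text(article: dict) -> List[Tuple[str, str]]:
    """Return (section name, joined paragraph text) for each non-empty section."""
    chunks: List[Tuple[str, str]] = []
    for section in article.get("sections") or []:
        paragraphs = section_paragraphs(section)
        if paragraphs:
            chunks.append((section.get("name", ""), "\n\n".join(paragraphs)))
    return chunks


# First section is trimmed from the example in this post; the second is hypothetical.
sample = {
    "sections": [
        {"name": "Abstract", "type": "section", "has_parts": [
            {"type": "paragraph",
             "value": "Freda Josephine Baker (née McDonald; June 3, 1906 - "
                      "April 12, 1975), naturalized as Joséphine Baker..."},
        ]},
        {"name": "Example section", "type": "section", "has_parts": [
            {"type": "paragraph", "value": "First paragraph."},
            {"name": "Example subsection", "type": "section", "has_parts": [
                {"type": "paragraph", "value": "Nested paragraph."},
            ]},
        ]},
    ],
}

chunks = article_text(sample)

if __name__ == "__main__":
    for name, text in chunks:
        print(f"[{name}] {text[:60]}")
```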

Data Licensing and Attribution

All original textual content is licensed under the GNU Free Documentation License (GFDL) and the Creative Commons Attribution-Share-Alike 4.0 License. Some text may be available only under the Creative Commons license; see the Wikimedia Terms of Use for details. Text written by some authors may be released under additional licenses or into the public domain.

Attribution, as required by the Creative Commons license used for this dataset, is core to the sustainability of the Wikimedia projects. It is what drives new editors and donors to Wikipedia. With consistent attribution, this cycle of content creation and reuse ensures that high-quality, reliable, and verifiable encyclopedic content continues to be written on Wikipedia and ultimately remains available for reuse via datasets such as this one.

As such, we require all users of this dataset to conform to our expectations for proper attribution. Detailed attribution requirements for use of this dataset are outlined on Hugging Face.

Where to get the dataset

The official Wikimedia Enterprise beta dataset is available on Hugging Face at huggingface.co/datasets/wikimedia/structured-wikipedia. Users can explore the dataset, download it, or integrate it directly into their machine learning pipelines using Hugging Face’s dataset tools and APIs.
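One possible way to pull a few articles into a pipeline is the load_dataset function from the Hugging Face datasets library, in streaming mode so that nothing is downloaded in full. This is a sketch under stated assumptions: the datasets package must be installed, the split name "train" is assumed, and the configuration name has to be taken from the dataset card since it is not given in this post. The helper stream_articles is hypothetical.

```python
from typing import Any, Dict, Iterator

# Repository id taken from the URL given in this post.
REPO_ID = "wikimedia/structured-wikipedia"


def stream_articles(config_name: str, limit: int = 3) -> Iterator[Dict[str, Any]]:
    """Yield up to `limit` articles without downloading the whole dataset.

    `config_name` is whatever configuration the dataset card lists for the
    language edition you want; the "train" split is an assumption.
    """
    # Imported lazily so this module loads even if `datasets` is not installed.
    from datasets import load_dataset

    dataset = load_dataset(REPO_ID, config_name, split="train", streaming=True)
    for index, article in enumerate(dataset):
        if index >= limit:
            break
        yield article


# Example call (requires network access and a real configuration name):
#     for article in stream_articles("<config name from the dataset card>"):
#         print(article["name"], article["url"])
```

Streaming suits a first look at the data; for repeated training runs a full download is usually the better choice.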

Conclusion

The release of this beta Wikipedia dataset on Hugging Face marks an important step in making Wikimedia’s rich content more accessible and usable for AI and machine learning applications. By providing structured, machine-readable data from our Snapshot API, we’re opening up new possibilities for researchers, developers, and data scientists to leverage this vast knowledge base in innovative ways.

We encourage users to explore the dataset, provide feedback, and share their applications and insights. Your input will be invaluable in shaping future improvements and expansions of this dataset. As we continue to refine our Structured Contents beta and expand language coverage, we look forward to seeing the creative and impactful ways in which this data will be used to advance knowledge and technology.


— Wikimedia Enterprise Team
