Reflecting on Openness and Trust in AI: A NeurIPS 2025 Recap
At NeurIPS 2025 in San Diego, Wikimedia Enterprise hosted a social event aimed at a critical intersection in the technology world: the relationship between generative AI data and open, trusted datasets.
Representatives from Wikimedia Enterprise, MLCommons, and the AI Alliance gathered to share insights into how they are using technology to advance academic and social missions. The event featured enlightening discussions and presentations highlighting the goals, current projects, and ongoing challenges regarding trust and responsible data usage in AI.
Wikimedia Enterprise
Chia Hwu and Chris Petrillo from Wikimedia Enterprise highlighted the immense value and reach of Wikimedia project data. They noted that Enterprise has been cited over 26,000 times on Google Scholar in 2025 alone, demonstrating significant uptake of its data endpoints by the research community.
They emphasized that Wikipedia continues to be weighted heavily in most large language models (LLMs). This growing reliance was highlighted during EleutherAI’s announcement of Common Pile v0.1, an 8TB dataset of public domain and openly licensed text. This new release features a significantly higher proportion of Wikimedia data than previous iterations, underscoring the industry’s deepening dependency on verifiable, community-curated content.
Through the Structured Contents initiative, and through volunteers who continually update Wikimedia project data to build consensus and ensure verifiability, Wikimedia Enterprise remains one of the premier sources of human-curated structured data for model training and grounding.
MLCommons
Neil Majithia from MLCommons continued the conversation, focusing on the infrastructure required for intelligent systems.
“Without machine-readable metadata, agentic AI is running blind”
— Neil Majithia, MLCommons
The MLCommons Association is a community engaged in collaborative engineering to create better AI for everyone. Majithia explained that MLCommons is focused on building and maintaining benchmarks and public datasets to make it easier for anyone to quickly understand the scope, biases, and intricacies of datasets.
The “star of the show” was Croissant, a specification with accompanying tools and frameworks for publishing, discovering, and using datasets responsibly across machine learning platforms. Croissant provides detailed metadata for machine learning datasets and structures their resources for easy access and searchability.
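To make "machine-readable metadata" concrete, here is a minimal Python sketch of a Croissant-style dataset description serialized as JSON. The dataset name, URLs, and file entry are hypothetical placeholders, and the field set is a simplified assumption: it omits the `@context` block and most of the Croissant vocabulary, so treat the Croissant specification as the authoritative schema.

```python
import json


def build_dataset_metadata(name, description, license_url, url, files):
    """Return a simplified, Croissant-style description of a dataset.

    This is an illustrative subset only; a real Croissant file also needs
    an @context block and usually record-level structure.
    """
    return {
        "@type": "sc:Dataset",
        "name": name,
        "description": description,
        "license": license_url,
        "url": url,
        # Each downloadable file is listed so tools can locate and parse it.
        "distribution": [
            {
                "@type": "cr:FileObject",
                "name": f["name"],
                "contentUrl": f["content_url"],
                "encodingFormat": f["encoding_format"],
            }
            for f in files
        ],
    }


# All values below are hypothetical placeholders.
metadata = build_dataset_metadata(
    name="example-articles",
    description="A hypothetical collection of encyclopedia articles.",
    license_url="https://creativecommons.org/licenses/by-sa/4.0/",
    url="https://example.org/datasets/example-articles",
    files=[
        {
            "name": "articles.jsonl",
            "content_url": "https://example.org/datasets/articles.jsonl",
            "encoding_format": "application/jsonlines",
        }
    ],
)

print(json.dumps(metadata, indent=2))
```

Because the description is plain structured data, a crawler or agent can read the license and file locations without parsing any free-form documentation.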
The AI Alliance
Steven Pousty from Red Hat showcased the AI Alliance, a global partnership dedicated to building and supporting an open technology foundation for AI.
“To help push for open source AI, there first needs to be open data”
— Steven Pousty, Red Hat / AI Alliance
Pousty argued that open data needs to be governed with transparency and oversight to ensure datasets are safe, scalable, and reliable for training and deployment. He introduced the AI Alliance’s Open Trusted Datasets Initiative (OTDI) specification, which can be used to quickly evaluate dataset metadata, find gaps, and gauge trustworthiness.
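The idea of scanning dataset metadata for gaps can be sketched in a few lines of Python. The required-field list and the completeness score below are illustrative assumptions, not the actual criteria of the OTDI specification, and the example record is hypothetical.

```python
# Hypothetical required fields; the real OTDI criteria may differ.
REQUIRED_FIELDS = ("name", "description", "license", "url", "creator")


def find_metadata_gaps(metadata, required=REQUIRED_FIELDS):
    """Return the required fields that are missing or blank."""
    gaps = []
    for field in required:
        value = metadata.get(field)
        if value is None or (isinstance(value, str) and not value.strip()):
            gaps.append(field)
    return gaps


def completeness(metadata, required=REQUIRED_FIELDS):
    """Fraction of required fields present: a crude score, not a trust verdict."""
    return 1 - len(find_metadata_gaps(metadata, required)) / len(required)


# Hypothetical record with a blank description and no creator.
example = {
    "name": "example-dataset",
    "description": "   ",
    "license": "CC-BY-SA-4.0",
    "url": "https://example.org/data",
}

gaps = find_metadata_gaps(example)
print("Missing or blank fields:", gaps)
print("Completeness:", round(completeness(example), 2))
```

A check like this only measures whether metadata is present; judging whether a stated license or provenance claim is accurate still requires human or institutional review.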
As one of the AI Alliance’s key contributors, Wikimedia Enterprise has supported this mission by contributing several datasets to the Open Trusted Datasets Initiative. These datasets are designed to help developer communities by providing Wikimedia project data in new, developer-friendly, and machine-readable formats. Together, these additions reinforce the AI Alliance’s commitment to transparent, community-driven data sources for building responsible and equitable AI systems.
A Catalyst for Collaboration
This gathering at NeurIPS served as an important catalyst for deeper conversations about the critical role of transparency in the AI ecosystem.
By aligning the core values of Wikimedia Enterprise, MLCommons, and the AI Alliance, we are building a unified front for responsible, open data. We look forward to continuing this collaboration and invite you to join us at future events as we drive these vital initiatives forward.
— Wikimedia Enterprise Team
About NeurIPS
The NeurIPS conference was founded in 1987 and is now a multi-track interdisciplinary annual meeting that includes invited talks, demonstrations, symposia, and oral and poster presentations of refereed papers. Along with the conference is a professional exposition focusing on machine learning in practice, a series of tutorials, and topical workshops that provide a less formal setting for the exchange of ideas.

Previous years at NeurIPS:
2024 Highlights: We attended NeurIPS 2024 in Vancouver, BC, alongside the Common Crawl Foundation. Together, we explored the intersections between nonprofit organizations and the tech community within the evolving AI and machine learning ecosystem. The event featured two of the most widely used datasets for training large language models.
Photo Credits
San Diego Convention Center, CC BY-SA 3.0, via Wikimedia Commons