AI and Wikipedia: New Project Improves Data Accessibility

October 1, 2025

Wikimedia Launches Database to Enhance AI Access to Wikipedia Knowledge

On Wednesday, Wikimedia Deutschland announced a new database that makes Wikipedia’s extensive knowledge base more accessible to artificial intelligence models.

Introducing the Wikidata Embedding Project

The system, known as the Wikidata Embedding Project, uses vector-based semantic search. This technique lets computers capture the meaning of words and the relationships between them.

The technique is applied to the existing data on Wikipedia and its sister platforms, which comprises almost 120 million entries.
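
To make the idea of vector-based semantic search concrete, here is a minimal sketch in Python. The three-dimensional vectors and the names `TOY_VECTORS`, `cosine_similarity`, and `nearest` are invented for illustration; they are not taken from the project, which presumably relies on a trained embedding model with far larger vectors.

```python
import math

# Toy 3-dimensional vectors, invented for illustration only. Real embeddings
# come from a trained model and typically have hundreds of dimensions.
TOY_VECTORS = {
    "scientist": [0.9, 0.8, 0.1],
    "researcher": [0.85, 0.75, 0.2],
    "scholar": [0.7, 0.9, 0.3],
    "banana": [0.1, 0.05, 0.95],
}


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query, vectors, k=2):
    """Return the k entries whose vectors point closest to the query's vector."""
    q = vectors[query]
    scored = [
        (cosine_similarity(q, vec), name)
        for name, vec in vectors.items()
        if name != query
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:k]]


if __name__ == "__main__":
    print(nearest("scientist", TOY_VECTORS))
```

With these made-up vectors, "researcher" and "scholar" rank ahead of "banana" because their vectors point in nearly the same direction as the one for "scientist", even though the words share no characters a keyword match could use.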

Improved Accessibility for LLMs

The project is coupled with newly implemented support for the Model Context Protocol (MCP), a standard that helps AI systems communicate with data sources.

This enhancement allows for improved responses to natural language queries originating from Large Language Models (LLMs).

Collaboration Behind the Project

Wikimedia’s German division spearheaded this initiative. Key partners included Jina.AI, a neural search company, and DataStax, a real-time training-data firm owned by IBM.

Beyond Keyword Searches

Wikidata has long provided machine-readable data from Wikimedia properties. However, previous tools were limited to keyword searches and queries written in SPARQL, a specialized query language.
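
For readers unfamiliar with SPARQL, the following Python sketch builds the kind of query the older tooling requires. It assumes the public Wikidata Query Service endpoint, the occupation property `P106`, and that `Q901` is the Wikidata item for "scientist"; check these identifiers before relying on them. The helper names `build_occupation_query` and `build_request_url` are hypothetical, and the code only constructs the request without sending it.

```python
from urllib.parse import urlencode

# Public Wikidata Query Service endpoint (assumed here for illustration).
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_occupation_query(occupation_qid, limit=10):
    """Build a SPARQL query listing people whose occupation (P106) is the given item."""
    return (
        "SELECT ?person ?personLabel WHERE {\n"
        f"  ?person wdt:P106 wd:{occupation_qid} .\n"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }\n'
        "}\n"
        f"LIMIT {limit}"
    )


def build_request_url(query):
    """Encode the query as a GET URL that asks the endpoint for JSON results."""
    return WDQS_ENDPOINT + "?" + urlencode({"query": query, "format": "json"})


if __name__ == "__main__":
    # Q901 is assumed to be the item for "scientist".
    query = build_occupation_query("Q901", limit=5)
    print(query)
    print(build_request_url(query))
```

The user has to know the property and item identifiers in advance, which is the barrier a natural-language semantic search is meant to remove.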

The new system is optimized for use with retrieval-augmented generation (RAG) systems. This allows AI models to integrate external information, grounding them in knowledge vetted by Wikipedia’s editorial community.
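
A rough sketch of the retrieval-augmented pattern, in Python, may help here. The retriever is a stub that returns fixed passages paraphrased from this article; in a real system it would be replaced by a call to a semantic search service. The names `stub_retrieve` and `build_grounded_prompt` are hypothetical and do not come from the project.

```python
from typing import Callable, List


def stub_retrieve(question: str, k: int) -> List[str]:
    """Stand-in retriever: returns fixed passages regardless of the question."""
    passages = [
        "The Wikidata Embedding Project uses vector-based semantic search.",
        "The underlying data comprises almost 120 million entries.",
        "The database is available to the public via Toolforge.",
    ]
    return passages[:k]


def build_grounded_prompt(
    question: str,
    retrieve: Callable[[str, int], List[str]],
    k: int = 2,
) -> str:
    """Fetch k passages and place them ahead of the question as numbered context."""
    passages = retrieve(question, k)
    context = "\n".join(
        f"[{i}] {text}" for i, text in enumerate(passages, start=1)
    )
    return (
        "Answer using only the numbered context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


if __name__ == "__main__":
    print(build_grounded_prompt("How large is the dataset?", stub_retrieve))
```

The resulting prompt would then be passed to a language model, which answers from the retrieved passages instead of relying only on what it memorized during training.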

Semantic Context and Rich Data

The database is structured to deliver vital semantic context. For example, a query for “scientist” will yield lists of notable nuclear scientists and of scientists who worked at Bell Labs.

The same query also returns translations of “scientist” in multiple languages, a Wikimedia-approved image of scientists at work, and connections to related terms like “researcher” and “scholar.”

Public Access and Developer Webinar

The database is available to the public via Toolforge. Wikidata is also hosting a webinar on October 9 for developers interested in learning more.

The Demand for High-Quality Data

This project arrives as AI developers actively seek high-quality data sources for model refinement. Training systems are becoming increasingly complex.

Although these systems are often assembled as intricate training environments, they still depend on carefully curated data to perform well. Reliable data is especially crucial for applications demanding high accuracy.

Compared with broad datasets like the Common Crawl, a vast collection of web pages, Wikipedia’s data is considerably more fact-oriented.

Costly Consequences of Data Acquisition

The pursuit of high-quality data can be financially demanding for AI labs. In August, Anthropic reached a settlement in a lawsuit with authors.

The agreement involved a payment of $1.5 billion to resolve claims related to the use of their works as training material.

A Commitment to Open AI

Wikidata AI project manager Philippe Saadé underscored the project’s independence from major AI labs and large technology corporations in a press statement.

“This Embedding Project launch demonstrates that powerful AI doesn’t need to be controlled by a select few companies,” Saadé stated. “It can be open, collaborative, and designed to benefit everyone.”
