3 Sources
[1]
New project makes Wikipedia data more accessible to AI | TechCrunch
On Wednesday, Wikimedia Deutschland announced a new database that will make Wikipedia's wealth of knowledge more accessible to AI models. Called the Wikidata Embedding Project, the system applies a vector-based semantic search -- a technique that helps computers understand the meaning and relationships between words -- to the existing data on Wikipedia and its sister platforms, consisting of nearly 120 million entries. Combined with new support for the Model Context Protocol (MCP), a standard that helps AI systems communicate with data sources, the project makes the data more accessible to natural language queries from LLMs.

The project was undertaken by Wikimedia's German branch in collaboration with the neural search company Jina.AI and DataStax, a real-time training-data company owned by IBM. Wikidata has offered machine-readable data from Wikimedia properties for years, but the pre-existing tools only allowed for keyword searches and SPARQL queries, a specialized query language. The new system will work better with retrieval-augmented generation (RAG) systems that allow AI models to pull in external information, giving developers a chance to ground their models in knowledge verified by Wikipedia editors.

The data is also structured to provide crucial semantic context. Querying the database for the word "scientist," for instance, will produce lists of prominent nuclear scientists as well as scientists who worked at Bell Labs. There are also translations of the word "scientist" into different languages, a Wikimedia-cleared image of scientists at work, and extrapolations to related concepts like "researcher" and "scholar." The database is publicly accessible on Toolforge. Wikidata is also hosting a webinar for interested developers on October 9th.

The new project comes as AI developers are scrambling for high-quality data sources that can be used to fine-tune models. The training systems themselves have become more sophisticated -- often assembled as complex training environments rather than simple datasets -- but they still require closely curated data to function well. For deployments that require high accuracy, the need for reliable data is particularly urgent, and while some might look down on Wikipedia, its data is significantly more fact-oriented than catchall datasets like the Common Crawl, which is a massive collection of web pages scraped from across the internet. In some cases, the push for high-quality data can have expensive consequences for AI labs. In August, Anthropic offered to settle a lawsuit with a group of authors whose works had been used as training material, by agreeing to pay $1.5 billion to end any claims of wrongdoing.

In a statement to the press, Wikidata AI project manager Philippe Saadé emphasized his project's independence from major AI labs or large tech companies. "This Embedding Project launch shows that powerful AI doesn't have to be controlled by a handful of companies," Saadé told reporters. "It can be open, collaborative, and built to serve everyone."
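The article contrasts the older access paths (keyword search and SPARQL) with the new vector-based one. For reference, here is a minimal sketch of the pre-existing style of access: a SPARQL query against the public Wikidata Query Service, sent from Python with the `requests` library, asking for items whose occupation (property P106) is scientist (item Q901) to echo the article's "scientist" example. The User-Agent string is an arbitrary placeholder.

```python
# Sketch of the pre-existing access path: a SPARQL query against the
# public Wikidata Query Service for items with occupation (P106)
# "scientist" (Q901). Requires the `requests` package.
import requests

QUERY = """
SELECT ?scientist ?scientistLabel WHERE {
  ?scientist wdt:P106 wd:Q901 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 5
"""

resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "embedding-project-example/0.1"},  # placeholder; Wikimedia asks clients to identify themselves
    timeout=30,
)
resp.raise_for_status()

# Each result row carries the label resolved by the label service.
for row in resp.json()["results"]["bindings"]:
    print(row["scientistLabel"]["value"])
```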
[2]
Wikimedia wants to make it easier for you and AI developers to search through its data
The late English writer Douglas Adams is best known as the author of the 1979 book The Hitchhiker's Guide to the Galaxy. But there is much more to Adams than what is written in his Wikipedia entry. Whether or not you need to know that his birth sign is Pisces or that libraries worldwide store his books under the same string of numbers -- 13230702 -- you can if you head to an overlooked corner of the Wikimedia Foundation called Wikidata. There, images, text, keywords, and other information related to Adams are stored both in a webpage and, for the robots among us, in formats designed for machines like JSON.

Now, Wikidata is getting a new AI-friendly database that makes it easier for large language models to ingest the information. The database comes from the Wikidata Embedding Project out of the German chapter of the Wikimedia Foundation, Wikimedia Deutschland, which oversees Wikidata. The Berlin-based team spent the past year using a large language model to turn the 19 million entries within Wikidata from clunkily structured data into vectors that capture the context and meaning around the Wikidata entry. In this vectorized format, information is best imagined like a graph with dots and interconnected lines -- Adams would be connected to "human" as well as the titles of his books, Lydia Pintscher, Wikidata portfolio lead, told The Verge.

While the front-end user experience will remain the same -- no, Wikipedia is not becoming a chatbot, the project leaders say -- the back end will become easier for AI developers to access when building, for example, their own chatbots using the data. The goal of the project is to level the playing field for AI developers outside the monied core of Big Tech, Pintscher said. Companies like OpenAI and Anthropic have the resources to vectorize Wikidata, just like Pintscher and her team did. It's the smaller outfits that most benefit from the new access to curated data stored in the vaults of Wikidata. "Really, for me, it's about giving them that edge up and to at least give them a chance, right?" Pintscher said. She points to Govdirectory as an example project that harnessed Wikidata's vast data curated by volunteers for good. The platform allows users to find the social media handles and emails for public officials across the world.

Most AI chatbots prioritize popular words and topics across the internet. In addition to giving Little Tech a leg up, the team hopes that easier access to Wikidata will result in AI systems that better reflect niche topics not widely represented across the internet, Pintscher said. This could be a better way to get information into ChatGPT, for instance, than "generating a ton of content and then waiting for the next time for ChatGPT to retrain, and maybe, or maybe not, taking into account what you contributed," Pintscher said.

In practice, the vectors will allow AI systems to better access the context around information in addition to the information itself, Philippe Saadé, Wikidata AI project manager, told The Verge. The team used a model from AI company Jina AI to turn Wikidata's structured data, captured through September 18th, 2024, into vectors. DataStax, an IBM company, currently provides the project, free of charge, with the infrastructure to store the vector database. The team is waiting for feedback from developers who use the database before updating it with information added over the last year.
While the current database does not include entirely new information added in the last year, Saadé says small edits or tweaks to existing Wikidata will not diminish the database's usefulness. "At the end of the day, the vector that we're computing is like a general idea of an item, so if some small edit has been made on Wikidata, it's not going to be super relevant," he said.
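To make the machine-readable formats mentioned above concrete: Wikidata serves every item as JSON from a standard entity-data endpoint, and Douglas Adams is item Q42. A minimal fetch in Python with the `requests` library; the keys shown follow Wikidata's standard entity-data layout.

```python
# Fetch the machine-readable JSON for Douglas Adams (Wikidata item Q42)
# from Wikidata's standard entity-data endpoint. Requires `requests`.
import requests

resp = requests.get(
    "https://www.wikidata.org/wiki/Special:EntityData/Q42.json",
    headers={"User-Agent": "wikidata-json-example/0.1"},  # placeholder identifier
    timeout=30,
)
resp.raise_for_status()

entity = resp.json()["entities"]["Q42"]
print(entity["labels"]["en"]["value"])        # "Douglas Adams"
print(entity["descriptions"]["en"]["value"])  # short English description
```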
[3]
Wikimedia Is Making Its Data AI-Friendly
The non-profit behind Wikipedia released today a new database designed for AI models. Wikimedia, the nonprofit behind Wikipedia and sister sites like Wikimedia Commons and Wikidata, just made it easier for AI models to tap into its massive knowledge base. Wikimedia Deutschland, the organization’s German chapter, released a new resource called the Wikidata Embedding Project. It takes the roughly 120 million open data points stored in Wikidata and converts them into a format that's simpler for large language models to actually use. Even though Wikidata’s structured data is already machine-readable, it hasn’t been directly compatible with generative AI systems, which are built to work with natural language. The new project translates Wikidata entries into vectors, which are basically numerical coordinates that show how different statements relate to each other. Think of it like a map where closely linked terms like “dog†and “puppy†cluster together, while unrelated ones like “dog†and “bank account" are much farther apart. This helps AI systems understand terms in context and process them more effectively in natural language. The project is designed to give AI models higher-quality information that leads to more reliable answers, Wikimedia Deutschland said in a press release. It said most AI systems currently rely on opaque datasets. A secondary goal is to level the playing field. By making Wikidata freely available, Wikimedia says it hopes smaller AI companies can compete with tech giants that would otherwise have the resources to vectorize the data themselves. “The launch of the embedding project shows that powerful AI does not have to be controlled by a handful of companies â€" it can be developed openly and collaboratively,†said Wikidata AI project manager Philippe Saadé in a statement. Wikimedia Deutschland has been working on the project since September 2024 in collaboration with Jina AI, which built the embedding system that turns Wikidata entries into vectors, and IBM’s DataStax, which stores those vectors in its database. In contrast, the release landed just a day after Elon Musk took to X to announce he’s building a Wikipedia rival called Grokipedia. “We are building Grokipedia @xAI,†Musk wrote on Tuesday. “Will be a massive improvement over Wikipedia. Frankly, it is a necessary step towards the xAI goal of understanding the Universe.†Musk has repeatedly derided Wikipedia as “Wokipedia†and complained that there’s no alternative aligned with more right-wing views. He also reposted Larry Sanger, the cofounder of Wikipedia, who quit in 2002 and has since tried to launch several competing projects. Sanger, a longtime critic of Wikipedia from the right, recently posted on X that Wikipedia has become too globalist, academic, secular, and progressive. Musk’s bid to build a rival encyclopedia stocked with his preferred facts just underscores why Wikimedia launched its own AI project in the first place. As AI continues to go mainstream, the quality and bias of the data these systems rely on could potentially hold influence over what millions of people believe to be true.
Wikimedia Deutschland launches the Wikidata Embedding Project, transforming Wikipedia's vast knowledge into an AI-friendly format. This initiative aims to democratize access to high-quality data for AI developers and improve the accuracy of AI models.
Wikimedia Deutschland, the German branch of the Wikimedia Foundation, has unveiled a project that could reshape how artificial intelligence (AI) models interact with Wikipedia's vast knowledge base. The Wikidata Embedding Project, announced on Wednesday, transforms nearly 120 million entries from Wikipedia and its sister platforms into a format more accessible to AI models [1].
The project employs vector-based semantic search, a technique that enhances computers' ability to understand the meaning and relationships between words. This approach, combined with support for the Model Context Protocol (MCP), allows for more effective natural language queries from large language models (LLMs) [1].
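Neither source details the project's actual MCP interface, but as a rough illustration of how a semantic-search service can be exposed to LLMs over MCP, here is a minimal, hypothetical server built with the official `mcp` Python SDK. The server name, the toy vectors, and the `embed` stand-in are all invented for this sketch and do not reflect the project's real API.

```python
# Hypothetical MCP server exposing semantic search over precomputed entry
# vectors. Requires the `mcp` Python SDK (pip install mcp) and numpy;
# nothing here reflects the actual Wikidata Embedding Project interface.
import numpy as np
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("wikidata-semantic-search")  # hypothetical server name

# Toy in-memory "vector database": entry labels and stand-in vectors.
LABELS = ["scientist", "researcher", "bank account"]
VECTORS = np.array([[0.9, 0.1], [0.85, 0.2], [0.05, 0.95]])

def embed(text: str) -> np.ndarray:
    """Stand-in for a real embedding model (the project used one from Jina AI)."""
    return VECTORS[0] if "scien" in text else VECTORS[2]

@mcp.tool()
def semantic_search(query: str, top_k: int = 2) -> list[str]:
    """Return the entry labels closest to the query by cosine similarity."""
    q = embed(query)
    sims = VECTORS @ q / (np.linalg.norm(VECTORS, axis=1) * np.linalg.norm(q))
    return [LABELS[i] for i in np.argsort(-sims)[:top_k]]

if __name__ == "__main__":
    mcp.run()  # serves the tool to any MCP-compatible LLM client over stdio
```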
The new system converts Wikidata's structured information into vectors, which can be visualized as a graph with interconnected dots and lines. This vectorization captures the context and meaning surrounding each Wikidata entry, making it easier for AI systems to process and understand the relationships between different pieces of information [2].
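The project computed its vectors with a Jina AI model; as a generic illustration of the same idea, the sketch below swaps in the small open all-MiniLM-L6-v2 model from the sentence-transformers library and reproduces the "dog"/"puppy" example from source [3], showing related terms landing close together in vector space.

```python
# Illustration of how embeddings place related terms near one another.
# Uses the generic all-MiniLM-L6-v2 model from sentence-transformers
# (pip install sentence-transformers); the project itself used a Jina AI
# model, so these scores are purely illustrative.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
vecs = model.encode(["dog", "puppy", "bank account"])

print(util.cos_sim(vecs[0], vecs[1]))  # "dog" vs "puppy": relatively high
print(util.cos_sim(vecs[0], vecs[2]))  # "dog" vs "bank account": much lower
```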
Wikimedia Deutschland collaborated with neural search company Jina.AI and IBM-owned DataStax to bring this project to fruition. The database is publicly accessible on Toolforge, and Wikidata is hosting a webinar for interested developers on October 9th [1][2].
A key goal of the Wikidata Embedding Project is to level the playing field for AI developers outside the well-funded tech giants. By providing easy access to high-quality, curated data, the project aims to give smaller companies and independent developers a chance to compete in the AI space [2][3].
Philippe Saadé, Wikidata AI project manager, emphasized the project's independence from major AI labs and large tech companies, stating, "This Embedding Project launch shows that powerful AI doesn't have to be controlled by a handful of companies. It can be open, collaborative, and built to serve everyone" [1].
The project comes at a time when AI developers are seeking high-quality data sources for fine-tuning their models. The Wikidata Embedding Project offers a more reliable alternative to catchall datasets like Common Crawl, potentially improving the accuracy of AI systems, especially for deployments requiring high precision [1].
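TechCrunch notes the system is designed to work well with retrieval-augmented generation (RAG), in which a model pulls in verified external facts at query time. Here is a minimal sketch of that pattern, with `embed`, `vector_db`, and `llm` as hypothetical stand-ins rather than anything from the project's actual API.

```python
# Minimal sketch of retrieval-augmented generation (RAG): ground an LLM
# answer in facts retrieved from a vector database. `embed`, `vector_db`,
# and `llm` are hypothetical stand-ins, not the project's actual API.
def answer_with_rag(question: str, vector_db, llm, embed, top_k: int = 3) -> str:
    # 1. Embed the natural-language question into the same vector space
    #    as the stored Wikidata entries.
    query_vec = embed(question)

    # 2. Retrieve the entries whose vectors lie closest to the question.
    facts = vector_db.search(query_vec, top_k=top_k)

    # 3. Prepend the retrieved, editor-verified facts to the prompt so the
    #    model answers from them rather than from memory alone.
    context = "\n".join(f"- {fact}" for fact in facts)
    prompt = (
        "Answer using only the following Wikidata facts:\n"
        f"{context}\n\nQuestion: {question}"
    )
    return llm(prompt)
```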
Moreover, by making it easier for AI models to access niche topics not widely represented across the internet, the project could lead to more diverse and comprehensive AI systems [2].
The launch of the Wikidata Embedding Project coincides with Elon Musk's announcement of "Grokipedia," a proposed Wikipedia rival. While Musk's project seems to stem from ideological concerns, Wikimedia's initiative focuses on improving data accessibility and quality for AI development [3].
As AI continues to shape our information landscape, initiatives like the Wikidata Embedding Project underscore the importance of open, collaborative approaches to knowledge curation and dissemination in the age of artificial intelligence.