An essential component of building more trustworthy AI is a more solid data foundation built on graph databases, knowledge graphs, and vector databases. Each element plays a fundamental role in representing the relationships between things, searching through those relationships, and presenting an appropriate subset to AI tools.
Now, the Wikimedia Foundation is adding a vector database to improve search and discovery on Wikidata, a sister platform to Wikipedia. I recently spoke with Jonathan Fraine, Head of Engineering, and Philippe Saadé, AI/ML Project Manager, at Wikimedia Deutschland about the interplay between knowledge graphs, graph databases, and vector databases for the future of the semantic web.
Wikidata is widely used under the hood of search engines and other tools for creating knowledge cards and other use cases. Traditional keyword-based search methods couldn't adequately interpret the context of queries or the relationships between entities. Additionally, as the dataset expanded, Wikimedia needed to maintain low-latency search performance while managing the relationships between millions of entities in its knowledge graph.
It's also important to appreciate that the Wikimedia Foundation is committed to keeping everything open, from the data itself to the tools used to process and access it. This commitment has led to some challenging limitations. For example, the team behind its graph database was acqui-hired by Amazon six years ago, which stalled updates to the open source project. However, this openness also makes it easier to ensure free access worldwide.
DataStax approached the Wikimedia team in San Jose, CA, after a presentation on building the foundations for more trustworthy AI, offering to support Wikimedia with its open source vector database engine. Saadé says,
They showed great interest in the project and helped us figure out what to do next. They are not only providing us with the scalable vector database, but also with a lot of interesting and different solutions that we may want to test out and see if we're going to also integrate them into our solution.
Knowledge graphs are one of the best ways to represent the relationships between things explicitly. However, they also require considerable expertise to define the ontologies and taxonomies that describe the meaning of things. A taxonomy spells out a hierarchical classification for organizing concepts or data. An ontology formally represents knowledge, including concepts, relationships, and constraints. Saadé explains how it works:
So Wikidata is a knowledge graph where you have entities or items that define a general domain, items that you can find anywhere in the world, and then it is a knowledge graph that means that these items are connected by properties that define the connection. The entity could be Obama, the connection could be the president, and the resulting connection would be the United States. And then you have information about the years he was president and stuff like this.
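To make that structure concrete, here is a minimal sketch that pulls the raw statements for that example straight from Wikidata's public API. The identifiers are real Wikidata ones (Q76 is Barack Obama, P39 is the "position held" property), but the script itself is just an illustration, not part of Wikimedia's tooling.

```python
import requests

# Fetch the raw statements ("claims") linking Barack Obama (item Q76)
# to other items via the "position held" property (P39).
resp = requests.get(
    "https://www.wikidata.org/w/api.php",
    params={
        "action": "wbgetclaims",
        "entity": "Q76",     # Barack Obama
        "property": "P39",   # "position held"
        "format": "json",
    },
    headers={"User-Agent": "kg-example/0.1 (demo)"},
)
claims = resp.json()["claims"]["P39"]

# Each claim connects the item to another item through the property,
# e.g. Q76 -> P39 -> Q11696 (President of the United States), with
# qualifiers such as start and end dates attached alongside.
for claim in claims:
    value = claim["mainsnak"]["datavalue"]["value"]
    print(value["id"])  # e.g. Q11696
```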
Wikidata is a knowledge base, and the Wikidata query service is a knowledge graph extension. The actual data is stored in a MariaDB relational database. Every two weeks, the Wikimedia team creates an extraction of the current state of Wikidata, which is reformatted into a knowledge graph on top of a Blazegraph graph database. However, the Wikimedia Foundation is looking for an alternative, since Blazegraph updates essentially stopped in 2018 when Amazon acquired the development team. Fraine says:
The foundation, in particular, has been maintaining the operation, and we are now in a phase of researching our own solutions or partnering with other open source, free knowledge organizations to build a new graph back-end solution that can be sustainable for the next ten-ish years.
It is also important to appreciate that graph databases can represent relationships differently. For example, commercial graph databases such as Neo4j and TigerGraph tend to take a more statistical approach. In contrast, semantics-oriented graph databases like Blazegraph take a more concrete approach to defining meaningful relationships. Fraine explains:
So the difference is that the database is a stack of information, the knowledge graph is the searchable connections between all of the items and properties, and then the word graph database, in my experience, usually refers to the back-end technology, which in our case is Blazegraph.
One way to access a knowledge graph is with a query language like SPARQL, which can find entities and navigate their connections. Ideally, a vector search approach would improve on keyword search for finding the item you are looking for. For example, how do you find related information when different sources use varying terms to describe the same thing? SPARQL is more precise, but harder to use. Saadé notes:
SPARQL is a bit difficult because there's a learning curve, and it's hard to use. You need to learn it to explore the graph. And then keyword search has its disadvantages, because you might not be able to find what you're looking for because it lacks context. That's where the vector database comes in.
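For comparison, here is what the SPARQL side of that trade-off looks like. This is a minimal sketch run against the public Wikidata query service; it asks which positions Barack Obama (Q76) has held via the "position held" property (P39), with the label service resolving item IDs to English names.

```python
import requests

# Ask the Wikidata query service which positions Barack Obama has held.
query = """
SELECT ?position ?positionLabel WHERE {
  wd:Q76 wdt:P39 ?position .   # Q76 = Barack Obama, P39 = position held
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""
resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": query, "format": "json"},
    headers={"User-Agent": "kg-example/0.1 (demo)"},
)
for row in resp.json()["results"]["bindings"]:
    print(row["positionLabel"]["value"])  # e.g. "President of the United States"
```

The query is precise and reproducible, but as Saadé notes, it requires knowing both the SPARQL syntax and the relevant property identifiers up front.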
However, creating and managing taxonomies and ontologies has traditionally required specialized knowledge and a rigid structure to ensure accuracy. Fraine says:
The knowledge graph perspective is very rigid on purpose. It's for complete knowledge. It's to give you everything connected to what you're looking for, as opposed to giving something around this subject matter. And so, we deliberately chose the rigid perspective because we wanted to have the complete knowledge. We wanted to have explicit references. We wanted to have explicit connections that were reproducible, that people could access information over and over again and connect it between languages.
In theory, gen AI could be used to automate the process of distilling relationships between things. However, this could also erode trust in the data. It's bad enough when a model hallucinates connections; it's even worse when other gen AI tools are trained on that faulty interpretation.
So, Wikimedia is exploring how AI tools can make the graph easier to access via the vector database, and how they can support it in other ways. Saadé explains:
When it comes to creating the data, we would not use AI for that but rather encourage people to create the data available in Wikidata. One of the main reasons is model collapse, which we want to avoid because sometimes there's a minimal error that you might not see initially. If we use that same data to train other AI models, the errors will accumulate, and eventually, you will have illogical information, which is what we're trying to avoid.
In this case, the vector database is designed to make Wikidata easier to access, with natural language and semantic search layered on top. This could also empower the AI community with the flexibility to use both vectors and graphs to create more innovative and creative projects.
Another thing to appreciate is that every large language model and vector search engine formats natural language into different vector embeddings. The same model needs to be used both to populate and to query the vector database. Saadé explains how they are approaching the problem:
So, we have to be able to use the same model and hopefully create a vector database with an open source model so people can just download and use it from their side. But also, we would be providing an API, so if someone would want to do a semantic search and search for an item on the vector database, we would create the vector for them with the same model.
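Here is a minimal sketch of that constraint using the sentence-transformers library. The model name is a generic open placeholder, not necessarily what the Wikidata team ships; the point is that the same model must produce both the stored item vectors and the query vector, or the similarity scores are meaningless.

```python
from sentence_transformers import SentenceTransformer, util

# The same model must embed both the indexed items and the incoming query;
# vectors produced by two different models live in incompatible spaces.
model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder open model

items = [
    "Barack Obama: 44th president of the United States",
    "Angela Merkel: chancellor of Germany from 2005 to 2021",
]
item_vectors = model.encode(items)  # done once, at indexing time

# At query time, the API would embed the user's text with the same model.
query_vector = model.encode("former US head of state")
scores = util.cos_sim(query_vector, item_vectors)
print(items[int(scores.argmax())])  # -> the Obama entry
```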
One of the challenges was transforming Wikidata items, which are structured as a graph rather than as simple text. So, the team is evaluating ways to combine and aggregate all of an item's connections in Wikidata into one large text that defines what the item is, ensuring the resulting vectors accurately represent it. Even though Wikidata is a graph, it also includes a lot of textual information, such as labels and descriptions, and the claims carry labels and descriptions of their own. These must be combined to create more accurate and useful vector embeddings.
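Here is a minimal sketch of what that aggregation step might look like. The item structure and helper below are hypothetical, my own simplification rather than Wikimedia's code, and it assumes the Q/P identifiers have already been resolved to their labels.

```python
# Hypothetical pre-processed item: IDs already resolved to labels.
item = {
    "label": "Douglas Adams",
    "description": "English writer and humorist",
    "claims": [
        ("occupation", "science fiction writer"),
        ("notable work", "The Hitchhiker's Guide to the Galaxy"),
        ("country of citizenship", "United Kingdom"),
    ],
}

def textualize(item: dict) -> str:
    """Fold an item's label, description, and claims into one embeddable passage."""
    lines = [f"{item['label']}: {item['description']}."]
    for prop, value in item["claims"]:
        lines.append(f"{item['label']} {prop}: {value}.")
    return " ".join(lines)

print(textualize(item))
# Douglas Adams: English writer and humorist. Douglas Adams occupation: ...
```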
Saadé says they have explored different approaches for representing the data as text, including RDF, JSON, and natural language formatting. RDF has been the status quo in knowledge graphs, representing items and their connections using triplets that consist of a subject, predicate, and object. JSON also shows promise in helping an LLM distill how the structure and graph work, and it's compatible with modern programming techniques. Natural language formatting could make the data accessible to a wider audience.
Saadé has found that RDF tends to focus the model more on the triplets than on the items they describe, so LLMs processing this data did not fully grasp that a list of triplets all refers to one item whose information needs to be combined. The natural language approach is easier to use and more compatible with existing LLM training data, but it requires a bit more effort to ensure relationships are represented correctly.
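To make the comparison concrete, here is one fact rendered in the three formats. The serializations are my own illustration rather than the team's exact formats, though the identifiers in the RDF line are real Wikidata ones (Q42 is Douglas Adams, P31 is "instance of", Q5 is "human").

```python
import json

# One Wikidata fact, rendered three ways for an embedding model.

# 1. RDF-style triplet: subject, predicate, object as identifiers. This keeps
#    the model's attention on the triple itself rather than the item it describes.
rdf = "wd:Q42 wdt:P31 wd:Q5 ."  # Douglas Adams, instance of, human

# 2. JSON: preserves a structure an LLM can parse, and plays well with
#    modern programming techniques.
as_json = json.dumps({"item": "Douglas Adams", "instance of": "human"})

# 3. Natural language: closest to typical LLM training data, but relationships
#    must be verbalized carefully so they are represented correctly.
natural = "Douglas Adams is a human."
```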
Good results also depend on the LLM and its training data. Most LLMs have not been explicitly trained on knowledge graphs. Eventually, it may make sense to train new LLMs to embed relationship data more efficiently. In that case, it would not matter which formatting approach they use, since the same format would be used both to train the LLM and to query it. They are currently using an open source embedding model from Jina AI that supports search in over a hundred languages.
The ultimate goal is to improve the foundations for training LLMs and other generative AI models in the future. In the short term, they want to extend the prototype to more languages and see how well it performs across them.
In the long term, they hope to integrate it into their search functionality so users can easily find Wikidata items and use them in future projects. For example, it might help improve vandalism detection.
After reading, interviewing, and writing about knowledge graphs for a decade, I still find them a fuzzy concept. Is a knowledge graph a graph database, an ontology management tool, or some other symbolic thing? There are certainly many ways to create knowledge graphs, and ten years on, it seems like the industry is still trying to vet best practices.
It was refreshing to chat with the Wikimedia team because they are on the cutting edge of operationalizing these ideas in practice. And they are not trying to sell a buzzword; all their stuff is free. More significantly, they are also exploring the interplay between more symbolic approaches like RDF and the new LLM and vector database work.
The symbolic stuff is important for trust, yet it is very manual and requires some kind of ontology expertise. Outside of philosophy classes a few decades ago, I have not met a single ontologist or taxonomist. Who the heck even thinks about what data means these days? Meanwhile, gen AI could sort of automate this work. Yet it's a slippery slope, since it could also amplify the impact of hallucinations at scale. It's good that the Wikimedia team is thinking about how to address all of these concerns.