Technical Wikidata

Wikidata & AI, together again

19 February 2024
Daniel Erenrich
Wikidata and AI meet again via the open-source framework LangChain.
Leveraging the Wikidata knowledge graph with LangChain

With the growing popularity of large language models (LLMs), researchers and hobbyists have been quick to tap into the power of Wikimedia projects using a new tool: generative artificial intelligence. In 2023, the Wikimedia Foundation launched the Wikipedia ChatGPT plugin, which lets ChatGPT use Wikipedia as a reference library to help answer questions. But until recently, no similar project had made use of Wikidata, whose wealth of structured information should be, if anything, easier for machines to process.

Now it’s Wikidata’s turn to be served by a new generative-AI platform: the recently launched LangChain offers support for interfacing Wikidata with LLMs. LangChain is an open-source framework for building applications on top of LLM backends. It lets developers connect LLMs of all kinds, open-source models among them, to a range of tools, and Wikidata is now one of those tools. Once integrated, these models gain the ability to retrieve data or even take actions.
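As a rough illustration of what "hooking in a tool" looks like, here is a minimal sketch in Python. It assumes the `langchain-community` package and its Wikidata client dependencies are installed, and that the tool lives at the import path shown; both are assumptions worth checking against the current LangChain documentation. The helper names `make_wikidata_tool` and `lookup` are invented for this example.

```python
def make_wikidata_tool():
    """Create LangChain's Wikidata lookup tool.

    The import is done lazily so that this module loads even when LangChain
    is not installed. The import path is an assumption based on the
    langchain-community package layout and may change between releases.
    """
    from langchain_community.tools.wikidata.tool import (
        WikidataAPIWrapper,
        WikidataQueryRun,
    )

    return WikidataQueryRun(api_wrapper=WikidataAPIWrapper())


def lookup(query: str, tool=None) -> str:
    """Run a free-text lookup through a tool and return its text output.

    Any object with a .run(str) method works here, which makes it easy to
    swap in a stub for testing or a different tool later.
    """
    if tool is None:
        tool = make_wikidata_tool()  # needs LangChain and network access
    return tool.run(query)
```

Calling `lookup("I Am a Strange Loop")` would send the query to Wikidata and return a text summary that an LLM can read. In a full application the tool would normally be handed to an agent rather than called directly.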

With this new support, users can ask ChatGPT a question and have it reference data from the Wikidata knowledge graph before responding. For example, ChatGPT was given access to the new Wikidata tool, then given the following prompt:

Using just information found in Wikidata, answer this question: the author of the book I Am a Strange Loop attended two universities. Which is the older one?

Using Wikidata, ChatGPT performed a lookup of I Am a Strange Loop, its author and his alma maters and provided this answer:

The author of the book I Am a Strange Loop, Douglas Hofstadter, attended both Stanford University and the University of Oregon. Stanford University was founded in 1885, while the University of Oregon was founded in 1876. Therefore, the University of Oregon is the older of the two universities.
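The lookup behind that answer is a short walk through the knowledge graph: from the book to its author (property P50), from the author to the places he studied (P69), and from each university to its founding date (P571, "inception"). The sketch below replays that walk over a tiny in-memory graph. The property IDs are Wikidata's own, but the item keys are plain labels rather than real item IDs, and the years are simply those quoted in the answer above.

```python
# Wikidata property IDs used in the walk.
AUTHOR = "P50"        # author
EDUCATED_AT = "P69"   # educated at
INCEPTION = "P571"    # inception (founding date)

# Toy slice of the graph. Keys are labels standing in for item IDs;
# the years are taken from the answer quoted in the article.
GRAPH = {
    "I Am a Strange Loop": {AUTHOR: ["Douglas Hofstadter"]},
    "Douglas Hofstadter": {
        EDUCATED_AT: ["Stanford University", "University of Oregon"],
    },
    "Stanford University": {INCEPTION: [1885]},
    "University of Oregon": {INCEPTION: [1876]},
}


def values(item, prop):
    """Return every value of a property on an item, or an empty list."""
    return GRAPH.get(item, {}).get(prop, [])


def oldest_alma_mater(book):
    """Follow book -> author -> university -> inception, keep the earliest.

    Returns a (university, year) tuple, or None if the chain breaks.
    """
    candidates = []
    for author in values(book, AUTHOR):
        for school in values(author, EDUCATED_AT):
            for year in values(school, INCEPTION):
                candidates.append((year, school))
    if not candidates:
        return None
    year, school = min(candidates)  # smallest year wins
    return school, year


print(oldest_alma_mater("I Am a Strange Loop"))
# ('University of Oregon', 1876)
```

Because each hop is a plain property lookup, the model does not have to remember any of these facts; it only has to decide which properties to follow.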

Data-driven results like this one contrast sharply with much of the LLM output seen in popular culture and media in recent months, which often contains “hallucinations”: responses that read as plausible but don’t line up with real-world facts or the user’s own input, and in some cases not even with assertions made within the same response. Results also tend to vary over time, even when the facts in question haven’t changed a bit. 

If a user asks ChatGPT the same question without giving it access to Wikidata, it often provides an answer that is not only wrong but somewhat disorienting: “Stanford University… was founded in 1885, while the University of Oregon was founded later, in 1876.” If ChatGPT instead has access to the Wikipedia plugin mentioned above, it gives no answer at all, because the relevant information doesn’t happen to appear in the first few paragraphs of the Wikipedia articles.

LangChain’s integration with Wikidata’s stockpile of thousands of properties and millions of items relies on the recently released Wikidata REST API, which smooths the way for applications to obtain information from Wikidata.
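To give a feel for the kind of request involved, here is a minimal sketch that reads an item through the REST API using only the Python standard library. The base URL, the version segment in the path and the shape of the JSON response are assumptions based on the public API documentation and may differ between API versions. The function names are invented for this example, and this is not how LangChain itself is implemented.

```python
import json
import urllib.request

# Assumed base URL; the version segment ("v1") may differ.
BASE_URL = "https://www.wikidata.org/w/rest.php/wikibase/v1"


def item_url(item_id: str) -> str:
    """Build the URL for one item, e.g. Q42."""
    if not (item_id.startswith("Q") and item_id[1:].isdigit()):
        raise ValueError(f"not an item ID: {item_id!r}")
    return f"{BASE_URL}/entities/items/{item_id}"


def fetch_item(item_id: str, timeout: float = 10.0) -> dict:
    """Download an item as a dict. Needs network access."""
    request = urllib.request.Request(
        item_url(item_id),
        # Wikimedia asks clients to identify themselves.
        headers={"User-Agent": "wikidata-rest-sketch/0.1 (example script)"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def statement_values(item: dict, property_id: str) -> list:
    """Collect the plain values of one property from an item dict.

    Assumes statements are grouped by property ID and that each statement
    carries {"value": {"type": "value", "content": ...}}.
    """
    found = []
    for statement in item.get("statements", {}).get(property_id, []):
        value = statement.get("value", {})
        if value.get("type") == "value":  # skip "novalue" / "somevalue"
            found.append(value.get("content"))
    return found
```

For instance, `statement_values(fetch_item("Q42"), "P31")` would list what the item Q42 (Douglas Adams) is an instance of, under the assumptions above.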

Support for Wikidata in LangChain is in its infancy, and there’s plenty of room and opportunity to obtain even more value from the platform using artificial intelligence. Eventually we may live in a world where an LLM can answer incredibly complicated questions by running SPARQL queries against Wikidata and interpreting the results.
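As a hint of what that could look like, the question from earlier in this post can be written as a single SPARQL query. The sketch below builds the query, prepares the request URL for the public Wikidata Query Service and reads the first row of a result in the standard SPARQL JSON format. The query text is one possible formulation and has not been tuned, and the helper names are invented for this example.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://query.wikidata.org/sparql"


def build_query(book_title: str) -> str:
    """SPARQL for: which university attended by the book's author is oldest?"""
    # Escape backslashes and double quotes for a SPARQL string literal.
    escaped = book_title.replace("\\", "\\\\").replace('"', '\\"')
    return f'''
SELECT ?universityLabel ?inception WHERE {{
  ?book rdfs:label "{escaped}"@en ;
        wdt:P50 ?author .            # author
  ?author wdt:P69 ?university .      # educated at
  ?university wdt:P571 ?inception .  # inception
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
ORDER BY ?inception
LIMIT 1
'''.strip()


def request_url(query: str) -> str:
    """URL-encode the query for a GET request that returns JSON."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    return f"{ENDPOINT}?{params}"


def run_query(query: str, timeout: float = 30.0) -> dict:
    """Send the query to the endpoint. Needs network access."""
    request = urllib.request.Request(
        request_url(query),
        headers={"User-Agent": "wikidata-sparql-sketch/0.1 (example script)"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def first_row(results: dict):
    """Flatten the first binding of a SPARQL JSON result, or return None."""
    bindings = results.get("results", {}).get("bindings", [])
    if not bindings:
        return None
    return {name: cell["value"] for name, cell in bindings[0].items()}
```

Calling `first_row(run_query(build_query("I Am a Strange Loop")))` would return the label and founding date of the oldest matching university, if the data is present. The harder part, and the one left to the LLM, is writing such a query from a natural-language question in the first place.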
