LOD-ifying Lexicographical Data
Valerie Wollinger
Interview with David Lindemann, 17.03.2023
by Valerie Wollinger, Community Communication Manager for Wikibase, Wikimedia Deutschland
This article is the first in a series on how people use Wikibase, the Linked Open Data software made by Wikimedia Deutschland. In conceiving this series of interviews, my goal has been to shed light on the people who use Wikibase, why they use it, and how they involve Wikibase in their own projects. More than that, I want to offer inspiration to newbies and spark curiosity by showing them what their peers are up to.
That’s my job at Wikimedia Deutschland – building bridges. I’m here to connect the Wikibase community with software development, to make sure the people who use Wikibase have the space and the opportunity to make their wishes known to the people who develop it.
David Lindemann (he/him) is my first interview partner in this blog series, and he knows what it’s like to be an intermediary. As a passionate linguist and a self-made expert in digital humanities, he mediates between words and software: he was one of the first users to apply for WBStack and has seen how the product evolved over the past few years.
David Lindemann and I spoke in German, but that only hints at how cosmopolitan he is. For the last 20 years, he has been living in the Basque Country and currently works at the University of the Basque Country. He likes to decode languages and regards them as a public good.
So, how did he get started with Wikibase?
„I used to work at ELEXIS, an H2020 research project about lexicography and dictionary research. My task was to create a digital bibliography with disambiguated entities: they didn’t want me to enter the authors’ names as a literal value, like their first and last names as a string. They wanted me to do entity linking, so I proposed to use Linked Data technologies…. Meanwhile, the LexBib project has evolved into more than a bibliography – it can safely be called a knowledge graph about our domain.“
Today, David shares responsibility for six different Wikibases, one of which, Qichwabase, received the Best Project Prize at the 2022 Summer Datathon on Linguistic Linked Open Data. Quechua is a language family spoken primarily in the Andes region of South America. David focuses on two kinds of content in his Wikibase instances: lexical data and bibliographical data.
David is not a born techie. He loves languages and doesn’t want to spend his time as a system administrator, nor does he have the skills for it, he says. Instead, he prefers to focus on modeling and uploading content.
That’s why Wikibase Cloud, Wikimedia Deutschland’s software-as-a-service offering for Wikibase, turned out to be a good solution: it is easy to start using.
There were some key elements that convinced David to dive into the world of Wikibase.
„The most important thing for me is that my community can see those entity pages and edit each individual statement. For example, an author can enter more information about themselves or about their publications in LexBib, or a task group can add usage examples to lexemes in Qichwabase. They can get in there and add, edit or delete each individual triple, which just isn’t possible on other platforms. Then there’s the fact that the whole thing is compatible with Wikidata from the outset. It’s a piece of cake to take the additional step of shoveling my data over to Wikidata or federating with it. That’s another decisive advantage. So, I convinced the project leaders that we would use Wikibase.“
So, how do you use your Wikibase?
„Some of my current projects are about converting digitized dictionaries into Linguistic Linked Data. Currently, I’m sharing responsibility for two projects where I provide technical support for the people who have the data; they are speakers of Quechua and Kurdish. They explain to me what kind of data they have, and I model it for Wikibase. After that, of course, I’m not in a position to interlink Kurdish data or to split and merge word senses. Only they can do that. But I set up the platforms for them and place the data inside, and I help write the documentation for data curation working packages, so they can invite their communities and do that work.“
Besides that, David explains:
„For the digital bibliography, I collect not only publication metadata, but also the full texts, the actual articles, and I process them by finding keywords from a SKOS vocabulary which I have in the Wikibase. Then I take the results and use them for content-describing indexing. The entity that describes a publication gets index keywords that I’ve found in the data, and I write them into the entity data. That way you can include these in queries like ‚Where does this keyword appear?‘ and start doing combined queries like ‚When and at which conferences have we started to speak about Artificial Intelligence?‘. I also use the SPARQL endpoint, among other things, to get all my data out of there and import it somewhere else. That was a condition of this project I was working on.“
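To make the indexing step David describes a little more concrete, here is a minimal sketch in Python. It is not his actual pipeline: the vocabulary, the entity IDs and the function name `find_index_terms` are invented for illustration, and a real SKOS vocabulary would be loaded from the Wikibase rather than typed in by hand.

```python
import re

# Hypothetical vocabulary: entity ID -> SKOS-style labels (preferred label first,
# then alternative labels). IDs and labels are invented, not taken from LexBib.
VOCABULARY = {
    "Q100": ["artificial intelligence", "AI"],
    "Q101": ["lexicography"],
    "Q102": ["word sense disambiguation", "WSD"],
}


def find_index_terms(full_text, vocabulary):
    """Return the sorted IDs of all concepts with at least one label in the text."""
    matches = set()
    for concept_id, labels in vocabulary.items():
        for label in labels:
            # Whole-word, case-insensitive match, so "AI" does not hit "said".
            pattern = r"\b" + re.escape(label) + r"\b"
            if re.search(pattern, full_text, flags=re.IGNORECASE):
                matches.add(concept_id)
                break  # one matching label is enough for this concept
    return sorted(matches)


article = "We said that artificial intelligence will change lexicography."
print(find_index_terms(article, VOCABULARY))
```

The returned IDs are what would then be written onto the publication’s entity as index keywords, which is what makes a query like ‚Where does this keyword appear?‘ possible.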
As a data literacy enthusiast, I wanted to find out how David brings new people on board.
Wikibase is a complex piece of software that demands skills not everyone has, even though they might be interested in closely related topics such as Linked Open Data, for example.
People have different levels of technical knowledge, which means educational processes need to be individualized, David noted. Some don’t even know what Linked Open Data is, others have never used SPARQL, and still others know all of this, but don’t know how to transform their source data into a format that can be used to upload to Wikibase, or to handle a tool like OpenRefine for entity linking. When hosting educational workshops about Wikibase, he likes to refer to the famous five-star model by Tim Berners-Lee and the FAIR principles, and he proposes database queries that people would intuitively understand.
„To give an example: We had a Basque president who was once a famous footballer, so I can do queries like ‚Which heads of government were once famous footballers?‘ and other fun queries, and then they start to get what it’s about. You don’t expect footballers to be in the same database as heads of government, but they absolutely are in Wikidata. Then I can limit it further to people who have blue eyes, or two kids, or whatever. And at that moment, they get what’s going on with all these interlinked properties. And what is more: it also works with lexemes.“
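For readers who would like to try a query of this kind, here is a rough sketch in Python of how it might be written against the public Wikidata SPARQL endpoint. It is not David’s own query. The Wikidata identifiers are quoted from memory and should be checked on wikidata.org before use, the function names are made up for this example, and the results depend on how each political office happens to be modeled in Wikidata.

```python
import json
import urllib.parse
import urllib.request

WIKIDATA_SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

# Assumed identifiers (please verify): "head of government" and
# "association football player"; P39 = position held, P106 = occupation,
# P279 = subclass of.
HEAD_OF_GOVERNMENT = "Q2285706"
FOOTBALL_PLAYER = "Q937857"


def build_query(position_class, occupation, limit=20):
    """Build a SPARQL query for people holding a kind of position and an occupation."""
    return f"""
SELECT DISTINCT ?person ?personLabel WHERE {{
  ?person wdt:P39 ?position .
  ?position wdt:P279* wd:{position_class} .
  ?person wdt:P106 wd:{occupation} .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
LIMIT {limit}
""".strip()


def run_query(query, endpoint=WIKIDATA_SPARQL_ENDPOINT):
    """Send the query to the endpoint and return the person labels (needs network)."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query, "format": "json"})
    request = urllib.request.Request(
        url, headers={"User-Agent": "wikibase-blog-example/0.1"}
    )
    with urllib.request.urlopen(request) as response:
        data = json.load(response)
    return [row["personLabel"]["value"] for row in data["results"]["bindings"]]


query = build_query(HEAD_OF_GOVERNMENT, FOOTBALL_PLAYER)
print(query)
# names = run_query(query)  # uncomment to contact the live endpoint
```

Narrowing the result further, as David does in his workshops, would mean adding one more triple pattern per condition inside the `WHERE` block.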
He’s currently preparing two workshops, one of which, “LODifying lexical data using Wikibase”, will take place this year in Vienna on September 13, as a pre-conference event for the Fourth Conference on Language, Data and Knowledge.
The other workshop will deal with converting bibliographical data from various formats into Wikibase- and Wikidata-compatible datasets. „I’m preparing a similar workshop on bibliographical data. I’ve got various use cases which involve different formats. There’s one format that’s widely used in the library world: MARC XML. I’ve got one project where I’m taking that kind of data and modeling it for Wikibase, and it’s catching on well. I want to do something else with that too – I want to make it a little more generic, not just for this special use case I have going. But I think we’ll be able to do it.“
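As an illustration of what such a conversion can involve, the following Python sketch reads a single MARC XML record and turns a few fields into property and value pairs. It is only a sketch under assumptions: the property IDs `P1` to `P3`, the sample record and the function name `marc_to_statements` are invented, and David’s own mapping is certainly richer than this.

```python
import xml.etree.ElementTree as ET

# Namespace of the MARC 21 XML schema ("MARCXML").
MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Hypothetical mapping from MARC (tag, subfield code) to property IDs in a Wikibase.
FIELD_TO_PROPERTY = {
    ("245", "a"): "P1",  # title
    ("100", "a"): "P2",  # main author, still a plain string at this stage
    ("260", "c"): "P3",  # date of publication
}

# Invented sample record, for illustration only.
SAMPLE_RECORD = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="100" ind1="1" ind2=" ">
    <subfield code="a">Doe, Jane</subfield>
  </datafield>
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">An invented dictionary</subfield>
  </datafield>
  <datafield tag="260" ind1=" " ind2=" ">
    <subfield code="c">1999</subfield>
  </datafield>
</record>"""


def marc_to_statements(record_xml):
    """Return (property ID, value) pairs for the mapped subfields of one record."""
    record = ET.fromstring(record_xml)
    statements = []
    for datafield in record.findall("marc:datafield", MARC_NS):
        for subfield in datafield.findall("marc:subfield", MARC_NS):
            key = (datafield.get("tag"), subfield.get("code"))
            if key in FIELD_TO_PROPERTY and subfield.text:
                statements.append((FIELD_TO_PROPERTY[key], subfield.text.strip()))
    return statements


for property_id, value in marc_to_statements(SAMPLE_RECORD):
    print(property_id, value)
```

Note that the author comes out as a literal string here. In a workflow like the one David describes, that string would still have to be linked to an entity, for example with a tool like OpenRefine, before it is uploaded.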
David has several touchpoints with the Wikibase community. He’s a supporter of the Wikibase Community User Group and tries to keep up with the discussions in the group’s Telegram channel, as well as the monthly live sessions. These sessions are designed to bring users together to exchange ideas and discuss their current Wikibase projects.
As we wrapped up, I wanted to find out what David is wishing for in the world of Wikibase features and where he’d like to see Wikibase heading in the future.
He explained that the greatest challenge he’s currently facing is the lack of reliable long-term data storage. As a professional researcher, he is frequently asked to guarantee that data will remain safely stored for ten years or more. David notes with regret that, unlike in paid European research projects, Wikibase Cloud can’t yet formally guarantee that the data he puts into it every day will still be there ten years from now.
He also had this to say:
„One big advantage of Wikibase is that you have the wikitext pages. You can write in Wikipedia article format. … But I can’t include pictures on these pages, and I can’t embed SPARQL query results. It’d be enough if I could embed my own images. That would actually be enough for displaying query results: I could run a SPARQL query once a week and update the image. But right now, I can’t do this… it would be so easy to integrate!“
David’s kudos go to LeMyst for creating the WikibaseIntegrator tool (WBI) and for being available for support requests. „There’s WikibaseIntegrator for Python… it’s wonderful. I used to use my own clumsy bot, but now I use WBI for nearly all tasks. If people want to see it, I have all my messy scripts in this repository.“
Thank you, David, for sharing your journey with Wikibase. We’re grateful to have a community member like you who is inspired and motivated to support others and educate them on how to use Wikibase.
David’s favorite word is armiarma.