Using Wikidata to build an authority list of Holocaust-era ghettos


In the spring of 2017, the European Holocaust Research Infrastructure (EHRI, work package 11) began a pilot project to link the EHRI ghettos vocabulary set with Wikidata. One of the primary objectives of work package 11 is to build an authoritative vocabulary of ghettos that can be used to describe the diverse Holocaust-related archival materials that have been integrated in the EHRI Portal. By linking the EHRI ghettos vocabulary set with Wikidata, we sought to build a more robust vocabulary in EHRI and at the same time to enrich the information available in Wikidata about Holocaust-era ghettos.

Our pilot project involved the 1391 ghettos already included in EHRI, which were originally derived from The Yad Vashem Encyclopedia of the Ghettos During the Holocaust. Though it is not a complete list of ghettos present in Nazi-occupied Europe, it was a good test group because each entry has consistent data associated with it, including geographic coordinates, the name of the administrative territory the ghetto belonged to during the war, and additional contextual information.

Selecting Wikidata as a Tool

Early in our discussions of the EHRI vocabulary sets, we determined that we needed an integration platform that incorporates various sources and supports collaborative work. Wikidata emerged as a promising tool because it lets users cite information from many sources for each entry and express relationships between entries. A sister project of Wikipedia, Wikidata acts as a knowledge base that draws on encyclopedias, catalogs, name authority lists, published research, and knowledgeable contributors. In other words, Wikidata offers many facilities for linked open data that we can both add to and extract from. It also has a large and diverse user community, which brings the benefit of continually added crowdsourced data. Finally, it can be easily queried to find specific information or extract large data sets.

All these advantages made Wikidata a good tool for our experiment. As a resource for our project, Wikidata held the initial benefit of having valuable information to extract, such as geographic coordinates and alternate names. In the long term, we anticipate that Wikidata will serve as a knowledge repository for the authority lists we have created. These ghetto entries will then continue to be improved through crowdsourced contributions, which may in turn be vetted and assimilated into the EHRI Portal. By incorporating descriptive information from Wikidata into EHRI and the identifiers from EHRI into Wikidata, both benefit from a complementary relationship: EHRI improves its authoritative descriptions and Wikidata gains reliable references.

Comparing Data on Ghettos

To begin our project, we determined the primary descriptive elements we wanted to include in the EHRI ghettos vocabulary list: geographic coordinates, administrative territories, and alternative place names. We compared the list of 1391 ghettos already included in EHRI with Wikidata, which at the time contained only 80 of them. We then extracted from Wikidata the place names relating to the EHRI list of ghettos and associated each city or village name, together with its Wikidata entity identifier, with the ghetto that existed in that place during the Holocaust. To ensure that we matched each ghetto with the correct city or village, we compared the coordinates listed in Wikidata for each place with coordinate data from The Yad Vashem Encyclopedia of the Ghettos During the Holocaust and with location descriptions from the USHMM Encyclopedia of Camps and Ghettos.
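The coordinate comparison described above can be illustrated with a small sketch. It is not the procedure we actually ran; it simply assumes that each candidate place comes with a latitude and longitude, and it accepts the closest candidate within a distance threshold. The 10 km threshold, the function names, and the placeholder identifiers are all assumptions made for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def best_match(ghetto_coord, candidates, max_km=10.0):
    """Return the id of the closest candidate within max_km, or None.

    ghetto_coord: (lat, lon) taken from the encyclopedia entry.
    candidates:   list of (place_id, lat, lon) taken from Wikidata.
    max_km:       hypothetical threshold; tune it against known matches.
    """
    best_id, best_dist = None, max_km
    for place_id, lat, lon in candidates:
        dist = haversine_km(ghetto_coord[0], ghetto_coord[1], lat, lon)
        if dist <= best_dist:
            best_id, best_dist = place_id, dist
    return best_id


# Placeholder identifiers and approximate coordinates, for illustration only.
candidates = [("Q_PLACE_A", 52.23, 21.01), ("Q_PLACE_B", 50.06, 19.94)]
match = best_match((52.25, 21.00), candidates)
```

A candidate that falls outside the threshold returns None, which is the case that still needs a manual check against the written location descriptions.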

Most ghettos carry the name of the city in which they were located. We used this feature to extend the multilingual coverage of the labels. We extracted the names of the places in different languages and, using separate rules for each language, created labels for the ghettos in up to 15 languages (wherever a place name was available in that language). For example, we were able to generate alternative name labels in 10 languages for the Kamiensk Ghetto, which originally had only the English name listed in the EHRI Portal. By increasing the diversity of our cross-references for each vocabulary entry, we will be able to identify and label archival records in various languages more effectively within the EHRI Portal.
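As a rough illustration of rule-based label generation, the sketch below applies one naming pattern per language to the place names available for a ghetto. The three patterns and the sample place names are assumptions chosen for the example, not the rules we used; languages that inflect place names would need more than a simple template.

```python
# Hypothetical per-language patterns; real rules would cover more languages
# and handle inflection or elision where a plain template is not enough.
LABEL_TEMPLATES = {
    "en": "{place} Ghetto",
    "de": "Ghetto {place}",
    "es": "Gueto de {place}",
}


def ghetto_labels(place_names):
    """Build ghetto labels from a {language code: place name} mapping.

    Languages without a template, or with an empty place name, are skipped,
    so a ghetto only gets labels where both a name and a rule exist.
    """
    labels = {}
    for lang, place in place_names.items():
        template = LABEL_TEMPLATES.get(lang)
        if template and place:
            labels[lang] = template.format(place=place)
    return labels


# Sample input (assumed values): no rule exists here for "pl", so it is skipped.
labels = ghetto_labels({"en": "Kamiensk", "de": "Kamiensk", "pl": "Kamieńsk"})
```

Keeping the patterns in a lookup table makes it easy to review each language's rule separately before any labels are imported.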

Once all the associated place names were verified, we had an import that included a set of descriptive elements for each ghetto.

While the model for the EHRI Portal still needs to be expanded to include fields for some of these descriptive elements, we were able to contribute our collated data to Wikidata.

For each ghetto that we enhanced or contributed to Wikidata, we included:

  • the English name of the ghetto
  • a statement qualifying the entry as a “ghetto in Nazi-occupied Europe”
  • the place where the ghetto was located
  • the coordinates for the location
  • an EHRI-assigned unique identifier for the ghetto (aka its “catalog code”)
  • the associated unique identifiers from online resources
  • the multilingual labels generated from the name of the places

Since Wikidata is only as good as its sources, we also qualified each statement we contributed with a reference. This will indicate to the users that the description was derived from reliable sources. Additionally, the inclusion of the EHRI unique identifier serves to direct Wikidata users to the EHRI Portal and will pave the way for future EHRI participants to extract data from resources linked to our vocabularies.
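The idea of attaching a reference to every contributed statement can be sketched as a simple data structure. This is only an illustration under stated assumptions: P31 (instance of), P276 (location), P625 (coordinate location), and P248 (stated in) are existing Wikidata properties, but the choice of P276 for the place, the EHRI identifier property name, and every item identifier below are placeholders rather than the values we used.

```python
# Placeholder, NOT a real Wikidata property id; substitute the actual
# EHRI identifier property before using this for anything.
EHRI_ID_PROPERTY = "P_EHRI_GHETTO_ID"


def build_statements(ghetto, source_qid):
    """Return one statement per descriptive element, each with a reference.

    ghetto:     dict with keys class_qid, place_qid, lat, lon, ehri_id.
    source_qid: item for the source the data was taken from (placeholder here).
    """
    raw = [
        ("P31", ghetto["class_qid"]),              # instance of
        ("P276", ghetto["place_qid"]),             # location (assumed choice)
        ("P625", (ghetto["lat"], ghetto["lon"])),  # coordinate location
        (EHRI_ID_PROPERTY, ghetto["ehri_id"]),     # EHRI identifier
    ]
    # Every statement carries its own "stated in" (P248) reference.
    return [
        {"property": prop, "value": value, "references": [{"P248": source_qid}]}
        for prop, value in raw
    ]


# All identifiers below are placeholders for illustration.
example = {
    "class_qid": "Q_GHETTO_CLASS",
    "place_qid": "Q_PLACE_A",
    "lat": 51.2,
    "lon": 19.5,
    "ehri_id": "ehri-ghetto-0000",
}
statements = build_statements(example, "Q_SOURCE")
```

Building the reference into the same step as the statement makes it impossible to emit an unreferenced claim by accident.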

Visualizing the Ghettos on a Map

During the process of updating Wikidata to include the EHRI ghettos vocabulary set of 1391 ghettos, we ran a few queries in Wikidata showing a mapped visualization of the ghettos listed in Wikidata before and after our contribution.

Before we contributed our data to Wikidata (individual ghettos are represented by orange dots):

Image contributed by Vladimir Alexiev

After we contributed our data to Wikidata:

Image contributed by Nancy Cooey
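A map like the ones above can be produced with a query of roughly the following shape in the Wikidata Query Service. The sketch only builds the SPARQL text and a request URL; it does not contact the service. The query is a reconstruction rather than the one we ran, and the class identifier passed in the example is a placeholder that must be replaced with the item for ghettos in Nazi-occupied Europe.

```python
import re
from urllib.parse import urlencode

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def ghetto_map_query(class_qid):
    """Return SPARQL selecting every instance of class_qid that has coordinates."""
    if not re.fullmatch(r"Q[1-9]\d*", class_qid):
        raise ValueError(f"not a Wikidata item id: {class_qid!r}")
    # '#defaultView:Map' asks the query service web interface to draw a map.
    return f"""#defaultView:Map
SELECT ?ghetto ?ghettoLabel ?coord WHERE {{
  ?ghetto wdt:P31 wd:{class_qid} ;
          wdt:P625 ?coord .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}"""


def query_url(class_qid):
    """Return a GET URL that requests the query results as JSON."""
    params = {"query": ghetto_map_query(class_qid), "format": "json"}
    return WDQS_ENDPOINT + "?" + urlencode(params)


# "Q123" is a placeholder id, not the real class item.
sparql = ghetto_map_query("Q123")
url = query_url("Q123")
```

Pasting the generated query text into the query service interface shows the results as points on a map; running the same query before and after an import gives a before-and-after comparison.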

After updating Wikidata, we extracted all the items that were an instance of “ghetto in Nazi-occupied Europe” and synchronized them with the EHRI Portal. Because Wikidata is an open project whose data can be altered by any user, future synchronizations will require us to validate all changes made to the data since the last synchronization.
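One way to support that validation step is to keep the extract from the previous synchronization and compare it with a fresh one, so that only new, missing, or modified entries need human review. The sketch below assumes each extract is a mapping from item identifier to a flat record of fields; the function name, record layout, and sample values are hypothetical.

```python
def diff_snapshots(previous, current):
    """Compare two extracts keyed by item id.

    Returns (added, removed, changed), where changed maps an item id to
    {field: (old value, new value)} for every field that differs.
    """
    added = sorted(set(current) - set(previous))
    removed = sorted(set(previous) - set(current))
    changed = {}
    for item_id in set(previous) & set(current):
        old, new = previous[item_id], current[item_id]
        fields = {
            key: (old.get(key), new.get(key))
            for key in set(old) | set(new)
            if old.get(key) != new.get(key)
        }
        if fields:
            changed[item_id] = fields
    return added, removed, changed


# Placeholder ids and values, for illustration only.
previous = {
    "Q_A": {"label_en": "Kamiensk Ghetto", "lat": 51.2},
    "Q_B": {"label_en": "Example Ghetto", "lat": 50.0},
}
current = {
    "Q_A": {"label_en": "Kamiensk Ghetto", "lat": 51.21},
    "Q_C": {"label_en": "Another Ghetto", "lat": 49.0},
}
added, removed, changed = diff_snapshots(previous, current)
```

Only the entries reported in the three results would then go to a reviewer, while unchanged entries can be skipped.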

As a result of our initial synchronization with Wikidata, each entry in the EHRI Portal has been enhanced to include many more alternative labels, the geographic coordinates, a Google map of the location, and a link to the ghetto’s entry in The Yad Vashem Encyclopedia of the Ghettos During the Holocaust.

After the successful completion of the pilot project, we immediately began our next project: building an authoritative hierarchical list of concentration camps. As with the ghettos vocabulary set, we plan to collate data from several authoritative sources into a succinct list, import it into Wikidata, and synchronize the results with the EHRI Portal.

As EHRI continues to grow and expand, we are hopeful that our involvement with Wikidata will continue to feed back into the EHRI vocabulary sets, providing more detailed data about individual entries. Furthermore, we hope that improving the quality of data available in Wikidata about ghettos and concentration camps during the Holocaust will be of value to the millions of Wikidata users across the world and will also direct those users to the resources described in the EHRI Portal.

2 Comments

  1. Excellent work. The difference shown by the data visualization of the map is stunning. I had a couple of questions. (1) Can the identifiers be extrapolated on the various language Wikipedias as Authority control items, using the identifiers on Wikidata? (2) Would a template be helpful, like an external link template, for the various language Wikipedias, but maybe specifically En Wiki and He Wiki? Something similar to Template:Enciclopédia Itaú Cultural. Not sure exactly what it would be pointing to, maybe the EHRI page? Apologies, I am having a hard time finding the Riga Ghetto page on EHRI, so thought this might help to link to digitized asset / entry. Again, great work! – Erika aka BrillLyle
