Exploratory Topic Modelling in Python

This post, originally entitled “Exploratory Topic Modelling Using R”, was first published by Mike Bryant in June 2016 on a now deactivated blog. We have since updated it to include more data and to explore similar tools in Python. The original blog post (Bryant, 2016) is still accessible through the Internet Archive’s Wayback Machine.

Continuing the work that started with the EHRI-2 project, some of our tasks in EHRI-3 involve investigating tools and methods that historical researchers and scholars can use to better understand, visualise, and interpret the material held by our partner archives. This post explores a technique called topic modelling in the context of a Holocaust-related historical collection.

The post is accompanied by a GitHub repository that includes a Jupyter Notebook explaining each step of the process that was followed. The repository additionally hosts one of the trained models and links to the datasets needed to reproduce the results of this tutorial, which have been made available on Zenodo. Instructions on how to load the pre-trained model and how to use it to make new predictions can be found in the provided Jupyter Notebook.


What is Topic Modelling?

Topic modelling is a technique by which documents within a corpus are clustered based on how certain groups of terms are used together within the text. The commonalities between such term groupings tend to form what we would normally call “topics”, providing a way to automatically categorise documents by their structural content rather than any a priori knowledge system. Topic modelling is generally most effective when a corpus is large and diverse, so that the individual documents within it are not too similar in composition. In EHRI, of course, we focus on the Holocaust, so the documents available to us are naturally restricted in scope. It was an interesting experiment, however, to test to what extent a corpus of Holocaust-related documents could be topic modelled, and what “topics” emerged within it.

The specific type of topic modelling we are looking at is called latent Dirichlet allocation (LDA), the subject of an influential paper by Blei et al. (2003). It is important to keep in mind that topic modelling, and the LDA model in particular, was originally introduced by computer scientists as a method to process large collections of data and improve information retrieval tasks (Blei et al., 2003; Schmidt, 2012). However, when humanists use topic modelling in their research, they usually do so for exploratory purposes (Owens, 2012; Schmidt, 2012). Topic modelling is a good way to inspect a large collection of texts from a distance and discover what its contents might be. For example, a researcher with a vague idea of the underlying structure of their corpus might expect to see a certain number of easily distinguishable topics emerging within it. Training an LDA model on that corpus might verify these assumptions or indicate other patterns that the researcher was not anticipating. Such exploration could lead to interesting research questions, which the researcher might then address through close reading of a smaller part of the corpus.


The Data

We were on the lookout for datasets that would be easily accessible and, for convenience, predominantly in English. One such dataset was the United States Holocaust Memorial Museum’s (USHMM) extensive collection of oral history testimonies, for which there are a considerable number of textual transcripts. The museum’s collection comprises 80,703 testimonies in total, 41,695 of which are available in English, with 2,894 of them listing a transcript.

Since there is not yet a ready-to-download dataset that includes these transcripts, we had to construct our own. Using a web scraping tool, we created a list of the links pointing to the metadata (including transcripts) of the testimonies that were of interest to us. After obtaining the transcript and other metadata of each of these testimonies, we were able to create our dataset and curate it to remove any unwanted entries. For example, we made sure to remove entries with restrictions on access or use. We also removed entries with transcripts that consisted only of some automatically generated headers and entries which turned out to be in languages other than English. The remaining 1,873 transcripts form the corpus of this tutorial — a relatively small dataset but nevertheless one we consider to be capable of providing interesting study results.
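As an illustration of this step, the sketch below shows how transcript pages might be fetched and parsed with the requests and BeautifulSoup libraries. The URLs and CSS selectors are placeholders rather than the actual structure of the USHMM catalogue; the real scraping workflow is described in the accompanying notebook.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical list of catalogue-record URLs previously gathered with a web
# scraping tool; the real URLs point to the USHMM online catalogue.
record_urls = [
    "https://collections.example.org/record/1",
    "https://collections.example.org/record/2",
]

records = []
for url in record_urls:
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    # The CSS selectors below are placeholders: the actual element names depend
    # on the markup of the catalogue pages being scraped.
    transcript = soup.select_one(".transcript")
    rights = soup.select_one(".rights-statement")
    if transcript is not None:
        records.append({
            "url": url,
            "transcript": transcript.get_text(strip=True),
            "rights": rights.get_text(strip=True) if rights else "",
        })

# Curation (dropping restricted, header-only or non-English entries) is then
# applied to `records` before the dataset is saved.
```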

Most of the testimonies comprising our corpus come from survivors of the Holocaust. These testimonies are usually the result of an interview process, which typically follows a certain structure. For example, the Oral History Interview Guidelines published by the USHMM state that interviews with Holocaust survivors are usually structured in three parts: “prewar life, the Holocaust and wartime experiences, and postwar experiences” (United States Holocaust Memorial Museum, 2007, p. 26). There are limitations to what topic modelling can reveal about a collection of documents that relate to the same general subject (in this case, the Holocaust) and that follow a more-or-less similar structure, but the results are nonetheless interesting and potentially useful.

The process that we followed to put together this dataset is detailed in the Jupyter Notebook accompanying this post. The dataset itself is also published on Zenodo.


The Tools

Several different tools are available that support topic modelling. For this tutorial, we chose to use the Gensim package (Řehůřek & Sojka, 2010) in Python (Van Rossum & Drake, 2009). We will not detail the exact code here, but those interested can check the scripts in the Jupyter Notebook published on GitHub. This notebook takes inspiration from the excellent work and tutorials previously published by Mattingly (2021, 2022) and Řehůřek (n.d.-a, n.d.-b).

Another tool which is worth exploring but is not covered in this tutorial is MALLET. If you prefer to work in R (another popular programming language), then the tm (Text Mining) and topicmodels packages will be of interest to you.


Processing the Corpus

The accompanying Jupyter Notebook details the steps needed to process the corpus and train the LDA model, but a summary is provided here (a minimal code sketch follows the list):

  1. Using spaCy (Honnibal et al., 2020), a Natural Language Processing (NLP) Python library, we pre-process the corpus in the following ways:
     a. We tokenise each transcript (breaking its text into smaller units, such as words, punctuation marks, etc.).
     b. We assign part-of-speech tags to each token.
     c. We check whether tokens belong to the part-of-speech classes that we are interested in keeping.
     d. We check whether the tokens are included in our list of stopwords (words such as “a”, “the”, etc., which are very frequent and carry little meaning; this also includes terms that might not be very frequent in general but are very frequent in this specific corpus and which we consider uninformative, such as repeating headers).
     e. We check whether the token is a punctuation mark or a number.
     f. If the token is not a stopword, a number, or a punctuation mark and belongs to a part-of-speech class that we are interested in, we take its lemma and add it to the list of words that we keep from the transcript. The lemma is the base form of a word as it would appear in a dictionary; for example, the words “goes”, “going” and “went” all share the lemma “go” (Manning et al., 2009).
  2. We filter out words that appear in more than half of the transcripts (Řehůřek, n.d.-b), which is thought to improve the quality of the topics.
  3. We create a bag-of-words representation of the words in each document. Each document is transformed into a vector, where each feature represents the number of times a specific word appears in the document (Řehůřek, n.d.-a).
  4. We run the LDA algorithm on the bag-of-words representations to generate the desired number of “topics” (technically, distributions over terms) for the entire corpus.
  5. We visualise and interpret the results.
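A minimal sketch of this pipeline, using spaCy and Gensim, is shown below. The toy transcripts, the extra stopword list, the variable names and the choice of part-of-speech classes to keep are illustrative assumptions; the accompanying notebook documents the exact settings that were used.

```python
import spacy
from gensim import corpora
from gensim.models import LdaModel

# Toy inputs: in practice `transcripts` holds the 1,873 curated transcripts and
# `extra_stopwords` the corpus-specific terms (e.g. repeating headers) we drop.
transcripts = ["First example transcript text ...", "Second example transcript ..."]
extra_stopwords = {"transcript", "interview"}  # illustrative only

# Requires the small English model: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
keep_pos = {"NOUN", "PROPN", "VERB", "ADJ"}  # assumed part-of-speech classes to keep

docs = []
for doc in nlp.pipe(transcripts):
    tokens = [
        token.lemma_.lower()
        for token in doc
        if token.pos_ in keep_pos
        and not token.is_stop
        and not token.is_punct
        and not token.like_num
        and token.lemma_.lower() not in extra_stopwords
    ]
    docs.append(tokens)

# Map words to integer ids and drop words appearing in more than half of the
# transcripts (no_below=1 keeps rare words, matching the step described above).
dictionary = corpora.Dictionary(docs)
dictionary.filter_extremes(no_below=1, no_above=0.5)

# Bag-of-words representation: each document becomes a list of (word_id, count) pairs.
corpus = [dictionary.doc2bow(tokens) for tokens in docs]

# Train the LDA model for the desired number of topics.
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=3, passes=10, random_state=42)
```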

The result is a model which we can use to ask questions such as:

  • What are the top N terms in each of the abstract topics that were derived from the corpus?
  • What is the probability that a particular document in our corpus belongs to a particular topic?

If we obtain some new data, i.e., additional transcripts that were not used for training the model, we can also create a new bag-of-words representation of them and determine their probabilistic topical content according to our model.
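Continuing from the training sketch above (and reusing its lda, corpus, dictionary, nlp and keep_pos names, which are our own illustrative assumptions), these questions can be put to a Gensim LDA model roughly as follows:

```python
# Top 15 terms for each of the abstract topics.
for topic_id in range(lda.num_topics):
    print(topic_id, lda.show_topic(topic_id, topn=15))

# Probability that the first document in our corpus belongs to each topic.
print(lda.get_document_topics(corpus[0]))

# Inference on a new, unseen transcript (hypothetical text).
new_text = "We were moved to the ghetto in the winter ..."
new_tokens = [
    token.lemma_.lower()
    for token in nlp(new_text)
    if token.pos_ in keep_pos
    and not (token.is_stop or token.is_punct or token.like_num)
]
new_bow = dictionary.doc2bow(new_tokens)
print(lda.get_document_topics(new_bow))
```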


The Results

So how well does this work on our oral history transcripts? The results we obtained by running LDA for three topics on the 1,873-transcript corpus were as follows:

      TOPIC 1    TOPIC 2     TOPIC 3
  1   ghetto     course      army
  2   girl       kid         course
  3   morning    apartment   unit
  4   bread      husband     officer
  5   soldier    point       prisoner
  6   water      bit         fact
  7   guy        survivor    order
  8   street     today       soldier
  9   barrack    program     organization
 10   city       book        material
 11   factory    uncle       area
 12   truck      business    guy
 13   boy        daughter    case
 14   clothe     fact        government
 15   piece      cousin      situation
Table containing the 15 most relevant words associated with each topic inferred by our three-topic model

It is really important to remember here that the “topics” identified by the LDA algorithm are abstract, and the algorithm itself has no knowledge of the actual meaning of any of the above terms. Notably though, generating three topics reliably produced one topic that tended strongly towards a theme of life in the ghettos and camps (Topic 1), a topic that seems to deal with the theme of family and life before or after the war (Topic 2) and a topic pertaining to the army and its apparatus (Topic 3). These topics are consistent with what we expected based on our prior knowledge concerning the typical structure of these interviews.

We can gain a bit more insight into these topics if we use the LDAvis visualisation package (and its associated pyLDAvis Python port) (Sievert & Shirley, 2014), which lets us generate a nice interactive viewer from an LDA model. Click here for the interactive visualisation we obtained from the model with three topics.

Interactive visualisation of our three-topic LDA model created with pyLDAvis (Sievert & Shirley, 2014).
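As a rough sketch, and assuming the lda, corpus and dictionary objects from the training example above, such a pyLDAvis visualisation can be produced along these lines (note that the Gensim helper module has been renamed across pyLDAvis releases):

```python
import pyLDAvis
import pyLDAvis.gensim_models  # called pyLDAvis.gensim in older releases

# `lda`, `corpus` and `dictionary` come from the training sketch above.
vis = pyLDAvis.gensim_models.prepare(lda, corpus, dictionary)
pyLDAvis.save_html(vis, "lda_3_topics.html")  # standalone interactive page
# pyLDAvis.display(vis)                       # or render inline in a Jupyter Notebook
```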

When exploring a corpus with the help of LDA, it is common to experiment with different parameters to investigate how the results might change. In our case, when we ask the LDA algorithm to identify six topics, the topics tend to become narrower and more specific. You can observe the topics in detail through the interactive visualisation in this link, but a table of the top words per topic is also provided below for convenience. We observe that Topic 2 from the three-topic model has now been split into two topics (Topics 1 and 2 of the six-topic model). The same seems to be true for Topic 3 from the previous model, which has apparently been divided into two more specific topics (Topics 3 and 5). We also observe the emergence of a new topic (Topic 4), which pertains to concentration and death camps. Topic 6 of the six-topic model, however, is almost the same as Topic 1 from the three-topic model, still purportedly addressing life in the ghettos and camps.

      TOPIC 1      TOPIC 2     TOPIC 3     TOPIC 4         TOPIC 5        TOPIC 6
  1   program      husband     army        prisoner        course         ghetto
  2   museum       kid         hm          guard           community      girl
  3   today        business    officer     number          organization   bread
  4   mom          book        guy         concentration   government     morning
  5   document     girl        soldier     material        idea           water
  6   point        apartment   gun         body            point          soldier
  7   survivor     daughter    commander   fact            situation      city
  8   question     cousin      unit        death           problem        guy
  9   course       uncle       date        area            member         street
 10   hiding       bit         tank        point           city           factory
 11   right        store       order       officer         fact           piece
 12   uncle        boy         training    picture         student        wood
 13   photograph   picture     division    barrack         office         clothe
 14   account      age         front       order           bit            boy
 15   page         letter      company     soldier         case           barrack
Table containing the 15 most relevant words associated with each topic inferred by our six-topic model [1]

Interactive visualisation of our six-topic LDA model created with pyLDAvis (Sievert & Shirley, 2014).

The results of this tutorial show that applying the LDA model on a very subject-specific corpus still produces some interesting results. One might find, however, that subject matter expertise and a good understanding of the contents of a corpus are very important in successfully interpreting these results. While topic modelling is a good way to get a glimpse of the prevailing topics within a corpus and might lead to serendipitous discoveries, poor understanding of the LDA algorithm could also lead to unfounded conclusions (Owens, 2012; Schmidt, 2012).


Future directions

One thing we are particularly interested in trying to look at in future is the way people thought about and described their experience of the Holocaust, and how it changed over time. This would involve analysing the distribution of topics relative to the year in which a particular testimony was given. For any reader daring to try, the display date of the testimony is also present in the datasets provided. The EHRI team would be happy to receive future Document Blog contributions with the results of such experimentations.
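A possible starting point for such an analysis is sketched below. It assumes the lda and corpus objects from the earlier examples and a hypothetical metadata table with a display_date column (the column name is our assumption), from which a year is extracted and used to average topic weights per year.

```python
import pandas as pd

# Hypothetical metadata table, one row per transcript and aligned with `corpus`
# from the earlier sketches; the column name "display_date" is an assumption.
metadata = pd.DataFrame({"display_date": ["1989", "12 May 1995"]})
metadata["year"] = pd.to_numeric(metadata["display_date"].str.extract(r"(\d{4})")[0])

# Topic probabilities per document (minimum_probability=0 reports every topic).
topic_rows = [
    dict(lda.get_document_topics(bow, minimum_probability=0.0)) for bow in corpus
]
topics = pd.DataFrame(topic_rows).fillna(0.0)

# Average topic weight per year of testimony.
print(topics.groupby(metadata["year"]).mean())
```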


References

Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan), 993–1022.

Bryant, M. (2016, June). Exploratory Topic Modelling using R. EHRI DH Blog. https://dhblog.ehri-project.eu/exploratory-topic-modelling-using-r/

Honnibal, M., Montani, I., Van Landeghem, S., & Boyd, A. (2020). spaCy: Industrial-strength Natural Language Processing in Python. https://doi.org/10.5281/zenodo.1212303

Manning, C., Raghavan, P., & Schuetze, H. (2009). Introduction to Information Retrieval. Cambridge University Press.

Mattingly, W. J. B. (2021, February 23). What is Latent Dirichlet Allocation LDA (Topic Modeling for Digital Humanities 03.01). https://www.youtube.com/watch?v=o7OqhzMcDfs

Mattingly, W. J. B. (2022). Implementing LDA in Python—Introduction to Python for Humanists. In Introduction to Python for Digital Humanities. https://python-textbook.pythonhumanities.com/04_topic_modeling/03_03_lda_model_demo.html

Owens, T. (2012, November 19). Discovery and Justification are Different: Notes on Science-ing the Humanities. Trevor Owens. http://www.trevorowens.org/2012/11/discovery-and-justification-are-different-notes-on-sciencing-the-humanities/

Řehůřek, R. (n.d.-a). Corpora and Vector Spaces. Gensim: Topic Modelling for Humans. Retrieved 16 June 2022, from https://radimrehurek.com/gensim/auto_examples/core/run_corpora_and_vector_spaces.html

Řehůřek, R. (n.d.-b). LDA Model. Gensim: Topic Modelling for Humans. Retrieved 16 June 2022, from https://radimrehurek.com/gensim/auto_examples/tutorials/run_lda.html#sphx-glr-auto-examples-tutorials-run-lda-py

Řehůřek, R., & Sojka, P. (2010). Software Framework for Topic Modelling with Large Corpora. Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, 45–50.

Schmidt, B. M. (2012). Words alone: Dismantling topic models in the humanities. Journal of Digital Humanities, 2(1), 49–65.

Schofield, A., Magnusson, M., & Mimno, D. (2017). Pulling Out the Stops: Rethinking Stopword Removal for Topic Models. Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, 432–436. https://aclanthology.org/E17-2069

Sievert, C., & Shirley, K. (2014). LDAvis: A method for visualizing and interpreting topics. Proceedings of the Workshop on Interactive Language Learning, Visualization, and Interfaces, 63–70. https://doi.org/10.3115/v1/W14-3110

United States Holocaust Memorial Museum. (2007). Oral History Interview Guidelines. https://www.ushmm.org/m/pdfs/20121003-oral-history-interview-guide.pdf

Van Rossum, G., & Drake, F. L. (2009). Python 3 Reference Manual. CreateSpace.


Further reading

Schmidt’s (2012) Words Alone: Dismantling Topic Models in the Humanities on the ways in which topic modelling can be misleading.

Schofield’s (2019) PhD Dissertation on Text Processing for the Effective Application of Latent Dirichlet Allocation.

  1. It is worth noting here that Topic 1 from the six-topic model seems to be picking up certain words (“program”, “museum”, “today”, “interview”, “question”) which one would expect to find in transcripts of interviews, but which do not necessarily describe a meaningful topic. In this case, we can either choose to ignore this topic as incoherent, or add these words to our stopword list and rerun the model to see whether a more coherent topic emerges (a third option would be to post-process our results and remove stopwords after inference (Schofield et al., 2017), but this method is not described in the accompanying notebook).
