Integrating Text and Data Mining into a History Course

This article was authored by Mathew Toll and Marco Duranti.

Being introduced to conventional statistical programs can be daunting for students put off by talk of logarithmic transformations and least squares regression. Some might have even chosen their degree, in part, to avoid having to do any more calculations – a least maths criterion for degree selection.

Text and data mining (TDM) may be even more frightening: it is a suite of methods practised by people with data and computer science backgrounds to extract information and patterns from large archives of unstructured data.

Traditionally, to use these techniques for humanities research, you needed to work with a data scientist or develop your programming skills. That’s a high bar for entry. Yet these techniques are now increasingly used to comb through large archives of texts and provide insights for historians and humanists. Digitisation has made available swathes of text so vast that reading all of it closely would be impractical. Even if the digital humanities fail to revolutionise the entire field, text and data mining is part of the modern historian’s and humanist’s toolkit, and critical familiarity with its methods is important for a well-rounded education.

This raises the question of how to introduce humanities students to these new possibilities. Could you bring text and data mining into a tutorial on the history of human rights? And could you do so without requiring extensive training in the use of new digital tools, instead devoting only a couple of weeks of the semester to teaching these techniques? The answer is yes.

That is exactly what happened in HSTY2616: The Human Rights Revolution last semester, where students were introduced to topic modelling and geographical analysis to illuminate the history of controversies around Aboriginal rights, LGBTI+ rights, and refugee rights. How did we do this given the technical barrier?

A University of Sydney team, led by Dr Marco Duranti (History), with support from the DVC Education (via a strategic education grant), FASS and the Library, worked with ProQuest’s new text and data mining platform, still in development, to put together corpora for students to analyse. This team, comprising Duranti, Brian Bailey, Bec Plumbe, Chao Sun, and Mathew Toll, is one of the first groups to trial the platform in a semester-long course. The ProQuest platform offers a range of corpora and the ability to put together collections of documents from its newspaper archives and analyse them with methods like sentiment analysis and network analysis.

This doesn’t involve having to teach students programming languages, so one of the biggest barriers to entry has been removed. With the platform, it’s now perhaps even easier than analysing World Bank data in IBM SPSS with students. Keeping it simple is an important rule for teaching computational techniques: students aren’t necessarily invested in learning about methods just for the sake of it. They are interested in learning about their subject area, so these techniques, like close reading, have value only if they help students understand history and develop a historian’s skills.

Over two lectures, students in HSTY2616 were taught about the logic of text and data mining, its usefulness for historians, and its limitations. Research strategies around distant reading, along with examples from work on the construction of, and struggle over, the definition of human rights, were provided to keep things grounded. To give students insight into the iterative process by which a topic modelling interface is refined, we demonstrated an ‘in progress’ topic modelling project on British parliamentary records that we created independently of ProQuest.

Outside of the lecture hall, students worked independently and in teams to apply text and data mining tools to newspaper articles ranging from 1980 to the present. They were first asked to use ProQuest’s topic modelling tool to analyse changes over time and identify the most frequently reported topics in coverage of LGBTI+ rights in the New York Times. In the following week, they made use of ProQuest’s geographic analysis tool to track changes over space in news coverage of Aboriginal rights and refugee rights in Australian newspapers, as well as to identify localities associated with news topics on these subjects. Student analytics and responses were archived, including observations on the utility of text and data mining for their own research projects and their suggestions for improvements to our Canvas exercises and the ProQuest TDM interface.
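
For readers curious about what sits underneath a "change over time" analysis, the following is a minimal sketch in Python. It is not topic modelling and it is not how the ProQuest platform works; it is a much simpler stand-in that counts the most frequent terms per decade. The documents, the stopword list, and the function name `top_terms_by_decade` are all invented here for illustration.

```python
from collections import Counter, defaultdict
import re

# Invented placeholder documents as (year, text) pairs.
# These are NOT drawn from any real newspaper archive.
DOCS = [
    (1985, "court ruling on rights and privacy"),
    (1988, "privacy debate reaches the court"),
    (2012, "marriage equality vote and rights campaign"),
    (2015, "marriage equality ruling by the court"),
]

# A tiny, hypothetical stopword list; real projects use far longer ones.
STOPWORDS = {"on", "and", "the", "by", "of", "a"}


def top_terms_by_decade(docs, n=3):
    """Return {decade: [(term, count), ...]} with the n most frequent terms."""
    counts = defaultdict(Counter)
    for year, text in docs:
        decade = year // 10 * 10  # e.g. 1985 -> 1980
        tokens = re.findall(r"[a-z]+", text.lower())
        counts[decade].update(t for t in tokens if t not in STOPWORDS)
    # Sort by decade so the result reads chronologically.
    return {decade: c.most_common(n) for decade, c in sorted(counts.items())}


result = top_terms_by_decade(DOCS)
for decade, terms in result.items():
    print(decade, terms)
```

Even a count this crude shows why interpretation matters: the numbers indicate which words became more or less common, but only contextual historical knowledge can say why.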

Prior to the weeks in which topic modelling and geographical analysis were used in the tutorials, students were given a walkthrough of the text and data mining tools and some multiple-choice quiz questions in Canvas based on outputs from ProQuest’s platform. To answer these, students had to interpret the results and craft accounts that addressed traditional historical questions about change over time and across space. Providing these quizzes before the class allowed students time to absorb the material and think through the issues that can arise when employing topic modelling and geographical analysis. By the time they entered the tutorial room, they had already been exposed to the platform and dealt with problems of interpretation. This allowed them space to critically evaluate the techniques and discuss the interpretations of history that they derived from them.

None of this would have been possible without interdisciplinary collaboration. Working closely with the Library and ProQuest, the team pooled its members’ expertise in the fields of data science, eLearning, library science, and history. This demonstrated to students how they might one day engage in similar teamwork, pairing their knowledge of qualitative contextual analysis with the skill set of individuals versed in text mining and computational techniques so as to take full advantage of new digital tools. The team now hopes to disseminate its text and data mining modules to units of study across the university that cover human rights topics from 1980 to the present. Unit coordinators interested in devoting one or more weeks of the semester to providing their students with digital literacy in the field of human rights, including genocide, are encouraged to email Marco Duranti at [email protected].
