Lausanne, 2-3 September 2016: Big Data and Bad Data: Challenges of quantitative and qualitative research methods in linguistics – a report by Tobias Leonhardt, University of Bern


This workshop focused on the methodological obstacles that arise when working with the increasing amounts of data available in diachronic as well as synchronic linguistics, and brought together a small contingent of historical linguists and dialectologists of English varieties. It was highlighted that, besides many crucial differences, there are also many overlaps between the two strands of linguistics, and that the methods used in corpus linguistic research are therefore, at least to some degree, not restricted to either strand. Over two days, ten presentations and three discussion sessions furthered our understanding not just of the work of researchers in the other strand but also of our own.


In the first block, Susan Fitzmaurice (University of Sheffield), Daisy Smith (University of Edinburgh), Moragh Gordon (Utrecht University), and Yasushi Miyazaki (Kwansei Gakuin University) gave their presentations. Among other things, they raised questions about the definition of what is and what is not a token, and about whether corpus research requires a key term to be explicitly mentioned or whether it suffices for it to be described, circumscribed or implied. They all reminded us, in different ways, that our linguistic sources, our geographical sites and our more or less known authors and contributors all constitute complex linguistic realities, and they made it apparent that the human component can never be removed entirely but remains necessary in some way, regardless of how many corpora or how sophisticated the tools for our analysis become. These and other issues were taken up in the open discussion, where it was reinforced that our objects of interest (contemporary spoken varieties, written varieties in letters and manuscripts, etc.) must not be confused with the corpora that contain them. Furthermore, using tools for the analysis of corpora that contain our objects of interest requires great sensitivity to all kinds of factors and issues, which renders corpus analysis slower than one might assume at first.


The second day started with presentations by Alexander Bergs (University of Osnabrück), Sarah Grossenbacher (University of Bern), and Tino Oudesluijs (University of Lausanne). From these presentations there emerged a common theme, namely the necessity of human intervention when working with corpus analysis tools, which made for a perfect continuation of the previous day and fed nicely into the subsequent discussions: Search outputs must not be interpreted as representative of linguistic realities, and they must not be confused with results but need manual correction – which can only be undertaken adequately when there is a profound understanding of the structure of the corpora, their compilation methods, and a myriad of other historical and sociolinguistic aspects that contribute to the complex linguistic realities we ultimately (probably) aim to capture and describe.


The last block featured presentations by Daniel Schreier (University of Zurich), Nadine Chariatte (University of Bern), and a joint paper by Tobias Leonhardt, Sara Lynch and Dominique Bürki (University of Bern). It was, once again, demonstrated how a profound understanding of the corpora enables the right questions to be asked and the right methodologies to be chosen in an attempt to answer them. The best results are gained by not forgetting the speakers, communities and histories behind the data, by being sensitive to individual factors, and by identifying the possibilities as well as the limitations of our corpora accordingly. This holds true for contemporary speech that can be recorded today as well as for the historical data in manuscripts and letters from various archives that survive from decades or even centuries past. The last discussion session then concluded with a more or less open question: Is there bad data at all, or is there only bad scholarship?


This workshop was doubtless fruitful and engaging. A big ‘Thank You’ goes out to the organisers, Tino Oudesluijs (University of Lausanne) and Moragh Gordon (Utrecht University), for creating this opportunity and for being wonderful hosts!