Data Science and the Digital Humanities
The digital humanities has for some time been engaged in a paradoxical relationship with data science. Digital humanists themselves do not consider their field a species of data science; they celebrate the diversity, and indeed the indefinability, of a field whose topics range from digital critical editions to maker labs to post-colonial criticism. From an outsider's perspective, however, the field is almost exclusively identified, accurately or not, with such schools as distant reading, macroanalysis, culturomics, and cultural analytics—each of which follows the path of data science and embraces the methods and tropes associated with big data and machine learning. All of the recent press the digital humanities has received in the United States, both good and bad, has focused on this identification.
From within the digital humanities, this paradox appears as an ambivalence toward data science, although one that has been surprisingly muted given its consequences. This ambivalence derives from two main sources. The first is that we have wedded ourselves to a set of methods that most of us, quite frankly, do not understand, even if we are able to master the use of their associated libraries in R and Python. To the black box of the computer we have added the opacity of advanced mathematical methods, methods which go far beyond basic probability and statistics. The second is that these methods on the whole have relegated the concept of the text itself to the back seat. For the first thing that is lost with the datafication of text is the syntagmatic structure of the text as text—its bounded, internal, sequential structure as discourse and narrative, all of that which contributes to the text's "sense of an ending," to use Kermode's phrase. Lost also is the regard for the individual text as unique work—the aura of the text, to extend Benjamin. In practical terms, this backgrounding of the text has resulted in the de facto rejection of the work of legions of text encoders, whose rich markup is seen as noise to the text miner.
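The loss described above can be made concrete in a few lines. In a bag-of-words representation, the standard first step of datafication, two sentences with opposite meanings become indistinguishable once word order (the syntagmatic axis) is discarded. The toy sentences here are my own illustration, not an example from any particular corpus or library:

```python
from collections import Counter

# Two sentences that say opposite things about who killed whom.
a = "the king killed the rebel".split()
b = "the rebel killed the king".split()

# A bag-of-words model reduces each to a multiset of word counts;
# sequence, and with it the sentence's meaning as discourse, is gone.
print(Counter(a) == Counter(b))  # True: datafication erases word order
```

Everything downstream of this step (frequencies, vectors, topics) inherits this erasure.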
I note these facts not out of a nostalgia for the long-embattled hermeneutic notion of the text, nor to return pride of place to TEI markup, nor to bemoan the emergence of the division of labor within an imagined humanities commune, nor to dismiss the explanatory power of data science, in which I place great hopes (indeed, I am a member of Virginia’s new Data Science Institute). My intent, in the spirit of Teil and Latour’s brilliant essay, “The Hume Machine” (1995), is to train our attention on what ought to be the recurring starting point for the digital humanities’ relationship with data science: the critical and practical—and I dare say naïve—engagement with methodological origins, so that we may know more clearly what goods we are exchanging when we embrace the methods of the machine—for, within data science, the machine is the horizon of interpretation. We ought to embrace this origin as a kind of axis mundi, a fertile source from which we may develop new methods, and from whose point of view we may renew critical perspectives.
If one were to provide a single keyword to characterize the methodological origin of the digital humanities' intersection with data science, it would have to be co-occurrence. Not networks, for although the concept and image of the network has become a commonplace in the field—and, indeed, a core symbol in a wider intellectual cosmology that attributes magical powers to networks—the network is always a derivative notion, a generous metaphor overlaid upon a set of visualizations (both concrete and imaginary), all of which are generated by the same logic: that of displaying a collection of relations among items selected in the first place by their co-occurrence. All of the popular methods of text mining and machine learning—latent semantic indexing, topic modeling, disambiguation of named entities, nearest neighbor analysis—build on this foundational representation, in which spatial contiguity is considered paramount, the "really real" substrate upon which all interpretation rests. Even the use of ngrams follows this rule, for although ngrams introduce sequence at the lexical level, each ngram is interpreted as a unitary token that co-occurs with other such tokens within a space defined by national language and year.
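The foundational representation named above can be sketched minimally: count which terms occur near one another within a fixed window. The window size and toy sentence are illustrative assumptions, not any system's defaults, but every method listed above begins from a table of this general shape:

```python
from collections import defaultdict

def cooccurrence(tokens, window=2):
    """Count unordered term pairs appearing within `window` tokens
    of each other. This table of contiguities is the raw material
    from which vectors, topics, and 'networks' are later derived."""
    counts = defaultdict(int)
    for i, w in enumerate(tokens):
        for v in tokens[i + 1 : i + 1 + window]:
            counts[tuple(sorted((w, v)))] += 1
    return counts

tokens = "texts become vectors and vectors become points".split()
counts = cooccurrence(tokens)
# Drawing an edge between every counted pair yields the familiar
# "network" image; the graph is derivative of this co-occurrence count.
```

Note that nothing in the table records where in the text a pair occurred, or in what order: the syntagmatic structure is already gone by the time the "network" is drawn.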
[To be continued …]