Data Science and the Digital Humanities

The digital humanities has for some time been engaged in a paradoxical relationship with data science. Although digital humanists themselves do not consider their field a species of data science, and celebrate the diversity and indeed the indefinability of a field that includes topics ranging from digital critical editions to maker labs to post-colonial criticism, from an outsider's perspective the field is almost exclusively identified, accurately or not, with such schools as distant reading, macroanalysis, culturomics, and cultural analytics, each of which follows the path of data science and embraces the methods and tropes associated with big data and machine learning. All of the recent press the digital humanities has received in the United States, both good and bad, has focused on this identification.

From within the digital humanities, this paradox appears as an ambivalence toward data science, although one that has been surprisingly muted given its consequences. This ambivalence derives from two main sources. The first is that we have wedded ourselves to a set of methods that most of us, quite frankly, do not understand, even if we are able to master the use of their associated libraries in R and Python. To the black box of the computer we have added the opacity of advanced mathematical methods, methods which go far beyond basic probability and statistics. The second is that these methods on the whole have relegated the concept of the text itself to the back seat. For the first thing that is lost with the datafication of text is the syntagmatic structure of the text as text: its bounded, internal, sequential structure as discourse and narrative, all of that which contributes to the text's "sense of an ending," to use Kermode's phrase. Lost also is the regard for the individual text as unique work: the aura of the text, to extend Benjamin. In practical terms, this backgrounding of the text has resulted in the de facto rejection of the work of legions of text encoders, whose rich markup is seen as noise to the text miner.
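The loss of syntagmatic structure can be made concrete with a toy example (the sentences and variable names here are my own invention, using only standard-library Python): once a text is datafied into a bag of words, two passages with opposite senses become indistinguishable.

```python
from collections import Counter

# Two invented sentences that say opposite things:
a = "the critic praised the novel not the poem"
b = "the critic praised the poem not the novel"

# Datafication as a bag of words keeps only frequencies,
# discarding the sequential structure that carried the meaning.
bag_a = Counter(a.split())
bag_b = Counter(b.split())

print(bag_a == bag_b)  # True: the two texts are now identical data
```

The same erasure occurs, at larger scale, in any term-document matrix: rows record how often, never in what order.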

I note these facts not out of a nostalgia for the long-embattled hermeneutic notion of the text, nor to return pride of place to TEI markup, nor to bemoan the emergence of the division of labor within an imagined humanities commune, nor to dismiss the explanatory power of data science, in which I place great hopes (indeed, I am a member of Virginia's new Data Science Institute). My intent, in the spirit of Teil and Latour's brilliant essay, "The Hume Machine" (1995), is to train our attention on what ought to be the recurring starting point for the digital humanities' relationship with data science: the critical and practical, and I dare say naïve, engagement with methodological origins, so that we may know more clearly what goods we are exchanging when we embrace the methods of the machine, for, within data science, the machine is the horizon of interpretation. We ought to embrace this origin as a kind of axis mundi, a fertile source from which we may develop new methods, and from whose point of view we may renew critical perspectives.

If one were to provide a single keyword to characterize the methodological origin of the digital humanities' intersection with data science it would have to be co-occurrence. Not networks, for although the concept and image of the network has become a commonplace in the field, and, indeed, a core symbol in a wider intellectual cosmology that attributes magical powers to networks, the network is always a derivative notion, a generous metaphor overlaid upon a set of visualizations (both concrete and imaginary) all of which are generated by the same logic, that of displaying a collection of relations among items selected in the first place by their co-occurrence. All of the popular methods of text mining and machine learning, latent semantic indexing, topic modeling, disambiguation of named entities, nearest neighbor analysis, build on this foundational representation in which spatial contiguity is considered paramount, the "really real" substrate upon which all interpretation rests. Even the use of ngrams follows this rule, because although they introduce sequence at the lexical level, each ngram is interpreted as a unitary token that co-occurs with other such tokens within a space defined by national language and year.
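As a minimal sketch of that foundational representation (the toy corpus and names are invented for illustration, not drawn from any of the systems named above), the co-occurrence logic amounts to nothing more than counting which items share a context window, here taken, in the simplest case, to be the whole document:

```python
from collections import defaultdict
from itertools import combinations

# An invented toy corpus; each document is one co-occurrence window.
docs = [
    "whale ship sea captain",
    "ship sea harbor",
    "whale sea captain",
]

# Count how many documents each pair of distinct words shares.
cooc = defaultdict(int)
for doc in docs:
    for w1, w2 in combinations(sorted(set(doc.split())), 2):
        cooc[(w1, w2)] += 1

print(cooc[("sea", "ship")])      # 2: together in two documents
print(cooc[("captain", "whale")])  # 2
```

A table of such pairs is the raw material from which the familiar network diagrams are drawn: the nodes and edges are derivative; the co-occurrence counts are the substrate.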

[To be continued …]

