Purity and Data

I had the pleasure last week to attend a lecture by Kavita Philip on “Databases and Politics: Some Lessons from Doing South Asian STS,” part of a series sponsored by UVa’s STS department (as well as, in this case, both Women’s Studies and Middle Eastern Studies). Philip is a professor of history at UC Irvine who specializes in, among other things, transnational histories of science and technology, and who has postgraduate training in physics and social science (STS). Her background and topic of research are especially interesting to me since they exemplify a new form of data scholarship, something we are trying to develop here at UVa at the Data Science Institute and the Center for the Study of Data and Knowledge. With the technical knowledge to understand the gory details of how socio-technical agents such as databases are built and function, as well as mastery of a social science discourse with which to contextualize this knowledge, historically and socially, one may pursue some interesting lines of research.

Philip’s argument, as best as I can retell it from my notes, is as follows. In 2011, after eschewing the equivalent of what in the US is called postracialism, the people and government of India decided to reintroduce the category of caste into the national census for the first time since 1931. In creating the database to capture this information – in the field, through form-driven interviews, and in the schema of a relational database – the developers drew, either directly or indirectly through imitation of the 1931 census, from the works of the British ethnographer and colonial administrator Herbert Hope Risley, including The People of India. Apparently, the thinking among the software developers was that by the 1930s English anthropologists had reached a sufficiently advanced understanding of culture, caste, and race that Risley’s ideas would provide a sound foundation for the data model. After all, by this time many anthropologists, such as American cultural anthropologists, had moved beyond the more egregious theories of race that characterized the discipline in the late 19th and early 20th centuries. However, Risley was no Boas. Instead of viewing human anatomical variation, in features such as head shape, as the result of environmental conditioning, Risley was strongly committed to notions of genetic determinism (even of IQ) and to the efficacy of anthropometry, and he saw caste as a reflection of these dimensions. Moreover, he believed endogamy to be more consistently practiced than we now know to be the case. Because of this, the database encodes not only a set of received categories about caste, but a particular understanding of the nature of categorization itself. Here Philip makes a telling point: “endogamy is a data modeler’s dream.” More on this below.

In effect, then, the 2011 census operationalized and thereby naturalized an antiquated and dangerous understanding of the caste system, encoding in its data model a theory of caste to which no current stakeholder would subscribe, at least openly. As Philip puts it, the “infrastructures of the [older] census passed into common sense cultural belief even as its scientific basis was eroded.” But since no one ever really questions the data model of a database – because there is no practice or discourse with which to have such a discussion in the public sphere – the silence of the model in effect establishes its transcendence.
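Philip’s point about the data model can be made concrete. Below is a minimal sketch, in Python with SQLite, of how a census-style schema encodes a theory of its own categories. All table and column names here are hypothetical illustrations of mine, not drawn from the actual 2011 census software: a closed lookup table plus a single-valued, required foreign key asserts that the categories are discrete, exhaustive, and held exactly one per person – which is why endogamy is a data modeler’s dream.

```python
import sqlite3

# Hypothetical census-style schema: a closed lookup table of categories
# and a NOT NULL, single-valued foreign key. The schema itself asserts
# that every person has exactly one caste, drawn from a fixed list.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE caste (
        caste_id INTEGER PRIMARY KEY,
        name     TEXT NOT NULL UNIQUE      -- the received category list
    );
    CREATE TABLE person (
        person_id INTEGER PRIMARY KEY,
        name      TEXT NOT NULL,
        caste_id  INTEGER NOT NULL         -- exactly one caste, no ambiguity
                  REFERENCES caste(caste_id)
    );
""")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("INSERT INTO caste (caste_id, name) VALUES (1, 'CategoryA')")
conn.execute("INSERT INTO person (name, caste_id) VALUES ('Asha', 1)")

# A person whose identity falls outside the lookup table simply cannot be
# recorded: the schema, not the enumerator, decides what is sayable.
try:
    conn.execute("INSERT INTO person (name, caste_id) VALUES ('Ravi', 99)")
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```

The exclusion is not a policy anyone argued for; it is simply what the constraints do, silently, at data-entry time.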

The operationalization of theory is common among developers of software and databases in domains that have been theorized by social scientists. For example, the works of Lakoff and Johnson have become the foundation for at least two projects I am aware of that attempt to represent metaphors in databases. This is not a bad thing – in fact, it is wonderful that the work of scholars and social scientists does not remain lost in the ossuary of academic publishing but instead becomes remediated in new forms. But the level of authority effectively accorded to a set of ideas that are, from the point of view of a discipline, never definitive or even particularly special is shocking to humanists and social scientists. The concern is the undue amplification of theory as it becomes encoded and backgrounded in this media form.

Philip concludes with a plea to revisit the data model and the methods of data entry employed by the census, and to embrace an alternate model to represent caste, one more in line with the indeterminacy and messiness of the facts as lived. Although she suggests following the model of a crowdsourced folksonomy, she leaves the technical question to one side and asks another, more open one: “What would it mean to democratize the process [in the first place]?” What would it mean to have a participatory software development process applied at scale? At the scale of more than a billion? And that’s a very interesting question.
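What a folksonomy-style model might look like can be sketched in a few lines. This is a toy of my own construction under stated assumptions (the function names and the terms tagged are illustrative), not Philip’s proposal: identity terms are contributed freely by respondents, one person may hold several at once, and “categories” emerge as aggregates of use rather than from a fixed schema.

```python
from collections import Counter

# Hypothetical folksonomy sketch: no lookup table, no schema-level
# categories. Each person accumulates self-described, free-text tags.
tags = {}  # person id -> set of self-described terms

def tag(person, term):
    """Record a self-description. Any term is admissible, and one person
    may hold several overlapping, even 'inconsistent', identities."""
    tags.setdefault(person, set()).add(term.strip().lower())

tag("p1", "Yadav"); tag("p1", "OBC")
tag("p2", "yadav"); tag("p2", "farmer")
tag("p3", "no caste")

# Categories emerge from use, with frequencies attached, rather than
# being fixed in advance by the data model.
vocabulary = Counter(t for terms in tags.values() for t in terms)
print(vocabulary.most_common(3))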

Although the kernel of Philip’s account is not new – that databases encode, reproduce, and often amplify bias is a commonplace – what is new is the context, and the various dimensions of inquiry it opens up. One dimension is historical, since the database in question has a genealogy that goes back to the late 19th century. Indeed, the connection between databases and census taking is one that goes far beyond India and touches on the origins of data science itself. Recall that IBM’s* first “business machine,” Hollerith’s punched card tabulating machine, was designed precisely to count census data – the 1890 census, whose unexpectedly large numbers resulted from the second great wave of immigration in US history – and that, further, the entire field of statistics, of which census taking is a central activity, was developed as part of the project to manage national populations. Many of the conclusions drawn from Philip’s case may profitably be “abducted” to the general level, at least to generate hypotheses and interesting avenues of research.

Another dimension of Philip’s case is what might be called the mechanics of bias.

There is an implicit bias in databases toward purity. That “dirty data” needs to be cleaned before it can inhabit the cells of a table, and that tables need to be normalized, or carefully renormalized, for their contents to have integrity, are well-known elements of proper database hygiene. This “foolish consistency,” as John Unsworth, quoting Emerson, describes it, can have its benefits in the context of research, since it forces one to think carefully about one’s categories of interpretation. But in the vastly larger Indian political context, this design bias appears to collude with an existing ideology of purity, fulfilling a Durkheimian dream in which social structure and cognitive logic mirror each other. However, because database technology, considered as an agent in a socio-technical network, lacks the improvisational capacity of human actors, the effect of this collusion is to amplify an inflected notion of purity, one without its built-in, although always unspoken, ambiguities. The ideology of purity in human hands is, as Lucy Suchman, following the ethnomethodologists, would say, a resource for situated action. There is always a slippage between the overt, formal claims of purity and the actual practice in which these claims participate in the negotiation of identity. This slippage – this inevitable emergence of liminality as a semi-unintended consequence of categories in social use – is part of the reason the ideal exists in the first place. But a database – the 600-pound gorilla in the office – doesn’t know that, at least not the way databases are designed and used today.
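The purity bias in “cleaning” can be illustrated with a toy example (the data and field names below are invented for illustration): first normal form forbids multi-valued cells, so a respondent who reports two affiliations must be reduced to one before the row is admissible – an editorial decision disguised as hygiene.

```python
# Invented raw survey responses: one respondent reports two affiliations,
# a perfectly ordinary piece of lived ambiguity.
raw_responses = [
    {"name": "p1", "community": "A"},
    {"name": "p2", "community": "A; B"},  # lived ambiguity
]

def clean(row, sep=";"):
    """Enforce one value per cell by keeping only the first term.
    A silent editorial decision disguised as data hygiene: the second
    affiliation is discarded with no trace in the cleaned record."""
    row = dict(row)
    row["community"] = row["community"].split(sep)[0].strip()
    return row

cleaned = [clean(r) for r in raw_responses]
print(cleaned)  # p2's second affiliation is gone
```

Nothing here is malicious; the loss happens because the target structure simply has nowhere to put the ambiguity.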

More generally, we may observe that the amplification and overvaluation of purity in this context is an effect of transductive mediation, a property that attends the use of media forms in the public sphere. In transductive mediation, a small difference at the level of structure at input time, as it were, is selected for and amplified by a media form, such as a newspaper or a film or a database, as that difference substitutes for what it represents and is propagated throughout the system in which it participates. In this case the transduced structure is purity, and the result is an inflexible version of that idea. The effect is similar to what Jon Anderson describes in his account of the history of the Internet in the Middle East, whereby the early adopters of the medium – diasporic engineers and scientists who used newsgroups and websites to exchange ideas about religion and politics back home – favored sharia (law) over ulema (interpretation). Because these representations of Islam – which were exchanged as part of an informal and only quasi-public discourse – were stored in databases, this particular form of Islam became more established in transnational cyberspace than in the traditional public sphere. When the Internet finally became widely accessible in the Middle East, around 2005, this pre-encoded form of Islam was already part of the digital public sphere, and it reinforced an attachment to sharia developed by other groups for other reasons.

In other words, media ecologies provide selective environments in which the survivability – and therefore the political efficacy – of an idea is a function of its representational fitness. As database technologies and practices become increasingly prevalent in the public sphere – or, more accurately, in the “dark” public sphere, since they are rarely discussed as such – this particular syndrome of selection, amplification, distortion, and propagation will only become more common. We should not only figure out ways, as Philip says, to democratize the process of database design, but also develop a practice in which the transductive effects of the medium are modified and conditioned to work in favor of the public good.

* Actually, the machine was the creation of the Tabulating Machine Company, one of three companies that would later become IBM.

