Purity and Data
September 24th, 2014, Rafael Alvarado
I had the pleasure last week to attend a lecture by Kavita Philip on “Databases and Politics: Some Lessons from Doing South Asian STS,” part of a series sponsored by UVa’s STS department (as well as, in this case, both Women’s Studies and Middle Eastern Studies). Philip is a professor of history at UC Irvine who specializes in, among other things, transnational histories of science and technology, and who has postgraduate training in physics and social science (STS). Her background and topic of research are especially interesting to me since they exemplify a new form of data scholarship, something we are trying to develop here at UVa at the Data Science Institute and the Center for the Study of Data and Knowledge. With the technical knowledge to understand the gory details of how socio-technical agents such as databases are built and function, as well as mastery of a social science discourse with which to contextualize this knowledge, historically and socially, one may pursue some interesting lines of research.
Philip’s argument, as best I can retell it from my notes, is as follows. In 2011, after eschewing the equivalent of what in the US is called postracialism, the people and government of India decided to reintroduce the category of caste into the national census for the first time since 1931. In creating the database to capture this information (in the field, through form-driven interviews, and in the schema of a relational database), the developers drew, either directly or indirectly through imitation of the 1931 census, from the works of the British ethnographer and colonial administrator Herbert Hope Risley, including The People of India. Apparently, the thinking among the software developers was that by the 1930s English anthropologists had reached a sufficiently advanced understanding of culture, caste, and race that Risley’s ideas would provide a sound foundation for the data model. After all, by this time, many anthropologists, such as American cultural anthropologists, had moved beyond the more egregious theories of race that characterized the discipline in the late 19th and early 20th centuries. However, Risley was no Boas. Instead of viewing human anatomical variation, in features such as head shape, as the result of environmental conditioning, Risley was strongly committed to notions of genetic determinism (of even IQ) and the efficacy of anthropometry, and he saw caste as a reflection of these dimensions. Moreover, he believed endogamy to be more consistently practiced than we know, at least today, to be the case. Because of this, the database encodes not only a set of received categories about caste, but a particular understanding about the nature of categorization itself. Here Philip makes a telling point: “endogamy is a data modeler’s dream.” More on this below.
In effect, then, the 2011 census operationalized and thereby naturalized an antiquated and dangerous understanding of the caste system, encoding in its data model a theory of caste to which no current stakeholder would subscribe, at least openly. As Philip puts it, the “infrastructures of the [older] census passed into common sense cultural belief even as its scientific basis was eroded.” But, since no one really ever questions the data model of a database, because there is no practice or discourse with which to have such a discussion in the public sphere, the silence of the model in effect establishes its transcendence.
Philip concludes with a plea to revisit the data model and the methods of data entry employed by the census, and to embrace an alternate model to represent caste, one more in line with the indeterminacy and messiness of the facts as lived. Although she suggests following the model of a crowdsourced folksonomy, she leaves the technical question to one side and asks another, more open one: “What would it mean to democratize the process [in the first place]?” What would it mean to have a participatory software development process applied at scale? At the scale of more than a billion? And that’s a very interesting question.
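To make the folksonomy idea a little more concrete, here is a minimal sketch of my own, not anything Philip presented or that the census uses. It assumes a model in which respondents attach as many self-chosen labels to themselves as they like, and the shared vocabulary emerges from counting those labels rather than from a fixed list. The class name `Folksonomy` and its methods are hypothetical, chosen only for illustration.

```python
from collections import Counter, defaultdict


class Folksonomy:
    """Hypothetical sketch: free-form, many-valued self-description."""

    def __init__(self):
        # respondent id -> set of labels that respondent chose
        self._tags = defaultdict(set)

    def tag(self, respondent, label):
        # Only trivial normalization (case, surrounding whitespace) is applied;
        # no label is rejected for being "off-list", because there is no list.
        self._tags[respondent].add(label.strip().lower())

    def labels_for(self, respondent):
        # A respondent may hold zero, one, or many labels at once.
        return sorted(self._tags.get(respondent, set()))

    def vocabulary(self):
        # The category system is derived from use, not declared in advance.
        counts = Counter()
        for labels in self._tags.values():
            counts.update(labels)
        return counts
```

Even this toy version makes choices (lowercasing, exact string matching) that shape what counts as the same label, which is a small reminder that the technical question cannot be fully separated from the political one.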
Although the kernel of Philip’s account is not new (that databases encode, reproduce, and often amplify bias is a commonplace), what is new is the context, and the various dimensions of inquiry it opens up. One dimension is historical, since the database in question has a genealogy that goes back to the late 19th century. Indeed, the connection between databases and census taking is one that goes far beyond India and touches on the origins of data science itself. Recall that IBM’s* first “business machine,” Hollerith’s punched card tabulating machine, was designed precisely to count census data (the 1890 census, whose unexpectedly large numbers resulted from the second great wave of immigration in US history) and that, further, the entire field of statistics, of which census taking is a central activity, was developed as part of the project to manage national populations. Many of the conclusions drawn from Philip’s case may profitably be “abducted” to the general level, at least to generate hypotheses and interesting avenues of research.
Another dimension of Philip’s case is what might be called the mechanics of bias.
There is an implicit bias in databases towards purity. That “dirty data” needs to be cleaned before it can inhabit the cells of a table, and that tables need to be normalized, or carefully renormalized, for their contents to have integrity, are well-known elements of proper database hygiene. This “foolish consistency,” as John Unsworth, quoting Emerson, describes it, can have its benefits in the context of research, since it forces one to think carefully about one’s categories of interpretation. But in the vastly larger Indian political context, this design bias appears to collude with an existing ideology of purity, fulfilling a Durkheimian dream in which social structure and cognitive logic mirror each other. However, because database technology, considered as an agent in a socio-technical network, lacks the improvisational capacity of human actors, the effect of this collusion is to amplify an inflected notion of purity, one without its built-in, although always unspoken, ambiguities. The ideology of purity in human hands is, as Lucy Suchman, following the ethnomethodologists, would say, a resource for situated action. There is always a slippage between the overt, formal claims of purity and the actual practice in which these claims participate in the negotiation of identity. This slippage, this inevitable emergence of liminality as a semi-unintended consequence of categories in social use, is part of the reason the ideal exists in the first place. But a database, the 600-pound gorilla in the office, doesn’t know that, at least not the way databases are designed and used today.
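The mechanics can be illustrated with a small sketch. It is my own toy example, assuming a conventional relational design, and makes no claim about the census’s actual schema: every table, column, and function name below is hypothetical. Each person row must point to exactly one entry in a closed list of categories, so an answer that is missing, or that falls outside the list, is simply refused.

```python
import sqlite3


def make_strict_db():
    """Build an in-memory database with a closed, single-valued category model."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.execute(
        "CREATE TABLE category ("
        " id INTEGER PRIMARY KEY,"
        " name TEXT NOT NULL UNIQUE)"
    )
    # One NOT NULL foreign key: exactly one category per person, no more, no fewer.
    conn.execute(
        "CREATE TABLE person ("
        " id INTEGER PRIMARY KEY,"
        " name TEXT NOT NULL,"
        " category_id INTEGER NOT NULL REFERENCES category(id))"
    )
    # The closed list of admissible categories (placeholder names).
    conn.executemany(
        "INSERT INTO category (id, name) VALUES (?, ?)",
        [(1, "A"), (2, "B")],
    )
    return conn


def try_insert(conn, name, category_id):
    """Return True if the row is accepted, False if the schema rejects it."""
    try:
        conn.execute(
            "INSERT INTO person (name, category_id) VALUES (?, ?)",
            (name, category_id),
        )
        return True
    except sqlite3.IntegrityError:
        return False
```

With this schema, `try_insert(conn, "p1", 1)` succeeds, while an unlisted category or a blank answer (`None`) is rejected outright. Nothing in the table can say “both,” “neither,” or “it depends,” which is the sense in which the design has no room for the slippage described above.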
More generally, we may observe that the amplification and overvaluation of purity in this context is an effect of transductive mediation, a property that attends the use of media forms in the public sphere. In transductive mediation, a small difference at the level of structure at input time, as it were, is selected for and amplified by a media form, such as a newspaper or a film or a database, as that difference substitutes for what it represents and is propagated throughout the system in which it participates. In this case the transduced structure is purity and the result is an inflexible version of that idea. The effect is similar to what Jon Anderson describes in his account of the history of the Internet in the Middle East, whereby the early adopters of the medium (diasporic engineers and scientists who used newsgroups and websites to exchange ideas about religion and politics back home) favored sharia (law) over ulema (interpretation). Because these representations of Islam, which were exchanged as part of an informal and only quasi-public discourse, were stored in databases, this particular form of Islam became more established in transnational cyberspace than in the traditional public sphere. When the Internet finally became widely accessible in the Middle East, around 2005, this pre-encoded form of Islam was already part of the digital public sphere, and reinforced an attachment to sharia developed by other groups for other reasons.
In other words, media ecologies provide selective environments in which the survivability, and therefore the political efficacy, of an idea is a function of its representational fitness. As database technologies and practices become increasingly prevalent in the public sphere (or, more accurately, in the “dark” public sphere, since they are rarely discussed as such), this particular syndrome of selection, amplification, distortion, and propagation will only become more common. We should not only figure out ways, as Philip says, to democratize the process of database design, but also develop a practice in which the transductive effects of the medium are modified and conditioned to work in favor of the public good.
* Actually, the machine was the creation of the Tabulating Machine Company, one of three companies that would later become IBM.