DRAFT VERSION. boyd, danah and Kate Crawford. (2012). "Critical Questions for Big Data: Provocations for a Cultural, Technological, and Scholarly Phenomenon." Information, Communication, & Society 15:5, pp. 662-679.

Critical Questions for Big Data: Provocations for a Cultural, Technological, and Scholarly Phenomenon

danah boyd
Microsoft Research and New York University

Kate Crawford
University of New South Wales

Technology is neither good nor bad; nor is it neutral ... technology's interaction with the social ecology is such that technical developments frequently have environmental, social, and human consequences that go far beyond the immediate purposes of the technical devices and practices themselves.
Melvin Kranzberg (1986, p. 545)

We need to open a discourse – where there is no effective discourse now – about the varying temporalities, spatialities and materialities that we might represent in our databases, with a view to designing for maximum flexibility and allowing as possible for an emergent polyphony and polychrony. Raw data is both an oxymoron and a bad idea; to the contrary, data should be cooked with care.
Geoffrey Bowker (2005, pp. 183-184)

The era of Big Data is underway. Computer scientists, physicists, economists, mathematicians, political scientists, bio-informaticists, sociologists, and other scholars are clamoring for access to the massive quantities of information produced by and about people, things, and their interactions. Diverse groups argue about the potential benefits and costs of analyzing genetic sequences, social media interactions, health records, phone logs, government records, and other digital traces left by people. Significant questions emerge. Will large-scale search data help us create better tools, services, and public goods? Or will it usher in a new wave of privacy incursions and invasive marketing? Will data analytics help us understand online communities and political movements? Or will analytics be used to track protesters and suppress speech? Will large quantities of data transform how we study human communication and culture, or narrow the palette of research options and alter what 'research' means?

Big Data is, in many ways, a poor term. As Lev Manovich (2011) observes, it has been used in the sciences to refer to data sets large enough to require supercomputers, but what once required such machines can now be analyzed on desktop computers with standard software. There is little doubt that the quantities of data now available are often quite large, but that is not the defining characteristic of this new data ecosystem. In fact, some of the data encompassed by Big Data (e.g., all Twitter messages about a particular topic) are not nearly as large as earlier data sets that were not considered Big Data (e.g., census data). Big Data is less about data that is big than it is about a capacity to search, aggregate, and cross-reference large data sets.

We define Big Data¹ as a cultural, technological, and scholarly phenomenon that rests on the interplay of:

1) Technology: maximizing computation power and algorithmic accuracy to gather, analyze, link, and compare large data sets.
2) Analysis: drawing on large data sets to identify patterns in order to make economic, social, technical, and legal claims.
3) Mythology: the widespread belief that large data sets offer a higher form of intelligence and knowledge that can generate insights that were previously impossible, with the aura of truth, objectivity, and accuracy.

Like other socio-technical phenomena, Big Data triggers both utopian and dystopian rhetoric. On one hand, Big Data is seen as a powerful tool to address various societal ills, offering the potential of new insights into areas as diverse as cancer research, terrorism, and climate change. On the other, Big Data is seen as a troubling manifestation of Big Brother, enabling invasions of privacy, decreased civil freedoms, and increased state and corporate control. As with all socio-technical phenomena, the currents of hope and fear often obscure the more nuanced and subtle shifts that are underway.

Computerized databases are not new. The U.S. Bureau of the Census deployed the world's first automated processing equipment in 1890 – the punch-card machine (Anderson 1988). Relational databases emerged in the 1960s (Fry and Sibley 1974). Personal computing and the internet have made it possible for a wider range of people –

¹ We have chosen to capitalize the term 'Big Data' throughout this article to make it clear that it is the phenomenon we are discussing.

including scholars, marketers, governmental agencies, educational institutions, and motivated individuals – to produce, share, interact with, and organize data. This has resulted in what Mike Savage and Roger Burrows (2007) describe as a crisis in empirical sociology. Data sets that were once obscure and difficult to manage – and, thus, only of interest to social scientists – are now being aggregated and made easily accessible to anyone who is curious, regardless of their training.

How we handle the emergence of an era of Big Data is critical. While the phenomenon is taking place in an environment of uncertainty and rapid change, current decisions will shape the future. With the increased automation of data collection and analysis – as well as algorithms that can extract and illustrate large-scale patterns in human behavior – it is necessary to ask which systems are driving these practices, and which are regulating them. Lawrence Lessig (1999) argues that social systems are regulated by four forces: market, law, social norms, and architecture – or, in the case of technology, code. When it comes to Big Data, these four forces are frequently at odds. The market sees Big Data as pure opportunity: marketers use it to target advertising, insurance providers use it to optimize their offerings, and Wall Street bankers use it to read the market. Legislation has already been proposed to curb the collection and retention of data, usually over concerns about privacy (e.g., the U.S. Do Not Track Online Act of 2011). Features like personalization allow rapid access to more relevant information, but they present difficult ethical questions and fragment the public in troubling ways (Pariser 2011).

There are some significant and insightful studies currently being done that involve Big Data, but it is still necessary to ask critical questions about what all this data means, who gets access to what data, how data analysis is deployed, and to what ends. In this article, we offer six provocations to spark conversations about the issues of Big Data. We are social scientists and media studies scholars who are in regular conversation with computer scientists and informatics experts. The questions that we ask are hard ones without easy answers, although we also describe different pitfalls that may seem obvious to social scientists but are often surprising to those from different disciplines. Due to our interest in and experience with social media, our focus here is mainly on Big Data in the social media context. That said, we believe that the questions we are asking are also important to those in other fields. We also recognize that the questions we are asking are just the beginning and we hope that this article will spark others to question the assumptions embedded in Big Data. Researchers in all areas – including computer science, business, and medicine – have a stake in the computational culture of Big Data precisely because of its extended reach of influence and potential within multiple disciplines. We believe that it is time to start critically interrogating this phenomenon, its assumptions, and its biases.

1. Big Data Changes the Definition of Knowledge

In the early decades of the 20th century, Henry Ford devised a manufacturing system of mass production, using specialized machinery and standardized products. It quickly

became the dominant vision of technological progress. 'Fordism' meant automation and assembly lines; for decades onward, this became the orthodoxy of manufacturing: out with skilled craftspeople and slow work, in with a new machine-made era (Baca 2004). But it was more than just a new set of tools. The 20th century was marked by Fordism at a cellular level: it produced a new understanding of labor, the human relationship to work, and society at large.

Big Data not only refers to very large data sets and the tools and procedures used to manipulate and analyze them, but also to a computational turn in thought and research (Burkholder 1992). Just as Ford changed the way we made cars – and then transformed work itself – Big Data has emerged as a system of knowledge that is already changing the objects of knowledge, while also having the power to inform how we understand human networks and community. 'Change the instruments, and you will change the entire social theory that goes with them,' Latour reminds us (2009, p. 9).

Big Data creates a radical shift in how we think about research. Commenting on computational social science, Lazer et al. argue that it offers 'the capacity to collect and analyze data with an unprecedented breadth and depth and scale' (2009, p. 722). It is not just a matter of scale, nor is it enough to consider it in terms of proximity, or what Moretti (2007) refers to as distant or close analysis of texts. Rather, it is a profound change at the levels of epistemology and ethics. Big Data reframes key questions about the constitution of knowledge, the processes of research, how we should engage with information, and the nature and the categorization of reality. Just as du Gay and Pryke note that 'accounting

creating a new ontological "epoch" as a new historical constellation of intelligibility' (Berry 2011, p. 12). We must ask difficult questions of Big Data's models of intelligibility before they crystallize into new orthodoxies.

If we return to Ford, his innovation was using the assembly line to break down interconnected, holistic tasks into simple, atomized, mechanistic ones. He did this by designing specialized tools that strongly predetermined and limited the action of the worker. Similarly, the specialized tools of Big Data also have their own inbuilt limitations and restrictions. For example, Twitter and Facebook are examples of Big Data sources that offer very poor archiving and search functions. Consequently, researchers are much more likely to focus on something in the present or immediate past – tracking reactions to an election, TV finale or natural disaster – because of the sheer difficulty or impossibility of accessing older data.

If we are observing the automation of particular kinds of research functions, then we must consider the inbuilt flaws of the machine tools. It is not enough to simply ask, as Anderson has suggested, 'what can science learn from Google?', but to ask how the harvesters of Big Data might change the meaning of learning, and what new possibilities and new limitations may come with these systems of knowing.

2. Claims to Objectivity and Accuracy are Misleading

'Numbers, numbers, numbers,' writes Latour (2010). 'Sociology has been obsessed by the goal of becoming a quantitative science.' Sociology has never reached this goal, in

Latour's view, because of where it draws the line between what is and is not quantifiable knowledge in the social domain.

Big Data offers the humanistic disciplines a new way to claim the status of quantitative science and objective method. It makes many more social spaces quantifiable. In reality, working with Big Data is still subjective, and what it quantifies does not necessarily have a closer claim on objective truth – particularly when considering messages from social media sites. But there remains a mistaken belief that qualitative researchers are in the business of interpreting stories and quantitative researchers are in the business of producing facts. In this way, Big Data risks reinscribing established divisions in the long-running debates about scientific method and the legitimacy of social science and humanistic inquiry.

The notion of objectivity has been a central question for the philosophy of science and early debates about the scientific method (Durkheim 1895). Claims to objectivity suggest an adherence to the sphere of objects, to things as they exist in and for themselves. Subjectivity, on the other hand, is viewed with suspicion, colored as it is with various forms of individual and social conditioning. The scientific method attempts to remove itself from the subjective domain through the application of a dispassionate process whereby hypotheses are proposed and tested, eventually resulting in improvements in knowledge. Nonetheless, claims to objectivity are necessarily made by subjects and are based on subjective observations and choices.

All researchers are interpreters of data. As Lisa Gitelman (2011) observes, data needs to be imagined as data in the first instance, and this process of the imagination of data entails an interpretative base: 'every discipline and disciplinary institution has its own norms and standards for the imagination of data.' As computational scientists have started engaging in acts of social science, there is a tendency to claim their work as the business of facts and not interpretation. A model may be mathematically sound, an experiment may seem valid, but as soon as a researcher seeks to understand what it means, the process of interpretation has begun. This is not to say that all interpretations are created equal, but rather that not all numbers are neutral.

The design decisions that determine what will be measured also stem from interpretation. For example, in the case of social media data, there is a 'data cleaning' process: making decisions about what attributes and variables will be counted, and which will be ignored. This process is inherently subjective. As Bollier explains,

    As a large mass of raw information, Big Data is not self-explanatory. And yet the specific methodologies for interpreting the data are open to all sorts of philosophical debate. Can the data represent an 'objective truth' or is any interpretation necessarily biased by some subjective filter or the way that data is 'cleaned?' (2010, p. 13)

In addition to this question, there is the issue of data errors. Large data sets from Internet sources are often unreliable, prone to outages and losses, and these errors and gaps are magnified when multiple data sets are used together. Social scientists have a long history

of asking critical questions about the collection of data and trying to account for any biases in their data (Cain & Finch 1981; Clifford & Marcus 1986). This requires understanding the properties and limits of a dataset, regardless of its size. A dataset may have many millions of pieces of data, but this does not mean it is random or representative. To make statistical claims about a dataset, we need to know where data is coming from; it is similarly important to know and account for the weaknesses in that data. Furthermore, researchers must be able to account for the biases in their interpretation of the data. To do so requires recognizing that one's identity and perspective informs one's analysis (Behar & Gordon 1996).

Too often, Big Data enables the practice of apophenia: seeing patterns where none actually exist, simply because enormous quantities of data can offer connections that radiate in all directions. In one notable example, David Leinweber demonstrated that data mining techniques could show a strong but spurious correlation between the changes in the S&P 500 stock index and butter production in Bangladesh (2007).

Interpretation is at the center of data analysis. Regardless of the size of a dataset, it is subject to limitation and bias. Without those biases and limitations being understood and outlined, misinterpretation is the result. Data analysis is most effective when researchers take account of the complex methodological processes that underlie the analysis of that data.
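The mechanics behind Leinweber's demonstration are easy to reproduce in miniature. The sketch below is a hypothetical illustration, not his actual analysis: all series are randomly generated (no real S&P 500 or butter-production figures), yet because thousands of unrelated candidates are screened against one target, a strong-looking correlation almost always surfaces – the statistical face of apophenia.

```python
import numpy as np

# Hypothetical illustration of apophenia in data mining: none of these
# randomly generated series has any real relationship to the "index",
# yet screening enough candidates reliably produces a spurious match.
rng = np.random.default_rng(0)
n_years, n_candidates = 20, 5000

index = rng.normal(size=n_years)                       # stand-in for yearly index changes
candidates = rng.normal(size=(n_candidates, n_years))  # stand-ins for unrelated series

# Correlate every candidate series with the index and keep the strongest match.
corrs = np.array([np.corrcoef(index, c)[0, 1] for c in candidates])
best = np.abs(corrs).max()

print(f"strongest |r| found among {n_candidates} random series: {best:.2f}")
```

With only twenty observations and thousands of candidate series, the best match is typically a correlation a researcher might report as 'strong' – despite being noise by construction. This is the multiple-comparisons problem: the more connections a dataset can offer, the more patterns will appear whether or not any exist.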
