International Journal of Communication 8 (2014), 1673–1689 1932-8036/20140005

Copyright © 2014 (Mark Andrejevic). Licensed under the Creative Commons Attribution Non-commercial No Derivatives (by-nc-nd). Available at .

The Big Data Divide

MARK ANDREJEVIC¹
Pomona College, USA

relationship between those who collect, store, and mine large quantities of data, and those whom data collection targets. It argues that this key distinction highlights differential access to ways of thinking about and using data that potentially exacerbate power imbalances in the digital era. Drawing on original survey and interview findings about public attitudes toward collection and use of personal information, it maintains that the inability to anticipate the potential uses of such data is a defining attribute of data-mining processes, and thus of the forms of sorting and targeting that result from them.

Keywords: big data, data mining, privacy, digital divide, predictive analytics

Between Me and My Data

media guru Tim Berners-Lee recently issued a plea for Internet users to be able to access their personal data. All people should have the resources for data- My phon , para. 3). Echoing a well-worn set of claims about the power of machines to know ourselves better than we do (e.g., Gates, 1995, on software agents or Negroponte, 1996, on digital butlers), Berners-Lee portrayed the database as a personal- t (Katz, 2012, para. 1, embedded recording).

¹ This research was supported by the Australian Research Council’s Discovery Project’s funding scheme (DP1092606).

Mark Andrejevic:
Date submitted: 2013–04–06

Of course, Google News and any number of aggregators and services are already hard at work providing these kinds of services, without users needing to get involved or reclaim access to their data trails. Berners- social- ices, since these form a personal informational nexus where all types of different data rub shoulders (a personal NSA, as it were):

There are no programmes that I can run on my computer which allow me to use all the data in each of the social networking systems that I use plus all the data in my calendar plus in my running map site, plus the data in my little fitness gadget and so on to really provide an excellent support to me. (Katz, 2012, para. 4)

Berners-Lee is bemoaning a growing separation of people from their data that characterizes the lives of active users of interactive devices and services: a form of data divide not simply between those who generate the data and those who collect, store, and sort it, but also between the capabilities available to those two groups. Berners-Lee challenges one aspect of that divide: If we generate data that is th is separation between users and their data, and with it the separation between the different data silos we generate on various devices and platforms? Surely he has a point, but it raises a further one: Even if users had such access, what individuals can do with their data in isolation differs strikingly from what various data collectors can do with this same data. To take a familiar example, Berners-Lee mentions customized news delivery as one possible benefit of self-data-mining: if a computer knows what its users have read in the past, it might be able to predict which news stories will interest them in the future (this news aggregators take into a but also those of everyone else about whom they collect data.
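The contrast between mining one reader's history and mining everyone's can be made concrete with a toy sketch of user-based collaborative filtering. Everything in it is an assumption chosen for illustration: the reader names, the reading histories, and the cosine-overlap scoring are hypothetical and do not describe how any actual news aggregator works.

```python
from math import sqrt

# Hypothetical reading histories (reader -> set of story topics), invented for illustration.
HISTORIES = {
    "ana": {"budget", "election", "transit"},
    "ben": {"budget", "election", "housing"},
    "chloe": {"election", "transit", "housing"},
    "dev": {"football", "cricket"},
}


def similarity(a, b):
    """Cosine similarity between two sets of stories read."""
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))


def recommend(reader, histories):
    """Rank unread stories by the summed similarity of the other readers who read them."""
    mine = histories[reader]
    scores = {}
    for other, theirs in histories.items():
        if other == reader:
            continue
        weight = similarity(mine, theirs)
        if weight == 0.0:
            continue  # readers with no overlap contribute nothing
        for story in theirs - mine:
            scores[story] = scores.get(story, 0.0) + weight
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda s: (-scores[s], s))


# With the pooled histories, the sketch can suggest a story Ana has never read.
pooled = recommend("ana", HISTORIES)
# With access to her own history alone, there is nothing to compare against.
alone = recommend("ana", {"ana": HISTORIES["ana"]})
```

In this sketch `pooled` contains a suggestion drawn from the two readers whose histories overlap with Ana's, while `alone` is empty: the same function, given only one person's data, has no basis for a recommendation. The point is structural rather than about any particular scoring rule.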
This data trove enables news aggregators to engage in various forms of collaborative filtering: that is, to consider what the othe interests are also interested in. Generalizing this principle from the perspective of data mining, it is potentially much more powerful to situate individual behavior patterns within the context of broader social patterns than to rely solely on the historical data for a particular individual. Put somewhat differently, allowing users access to their own data does not fully address the discrepancies associated with the data divide: that is, differential capacities for putting data to use. Even if users had access to their own data, they would not have the pattern recognition or predictive capabilities of those who can mine aggregated databases. Moreover, even al conditional), they would lack the storage capacity and processing power to make sense of the data and put it to use. It follows that the structural divide associated with the advent of new forms of data-driven sense making will be increasingly apparent

To characterize the differential ability to access and use huge amounts of data, this article proposes the notion of a big data divide by first defining the term, then considering why such a divide merits attention, and then exploring how this divide might relate to public concern about the collection and use of personal information. The sense of powerlessness that individuals express about emerging forms of data collection and data mining reflects both the relations of ownership and control that shape access to communication and information resources, and growing awareness of just how little people know about the ways in which their data might be turned back upon them. Although the following research will focus exclusively on personal data (the type of data at the heart of current debates about regulation of data collection online), the notion of a big data divide is meant to invoke the broader issue of access to sense-making resources in the digital era, and the distinct ways of thinking about and using data available to those with access to tremendous databases and the technology and processing power to put them to use.

From a research perspective, boyd and Crawford (2011) have noted the divide between “the Big Data rich” (companies and universities that can generate or purchase and store large datasets) and the the fact that a relatively small group with defined interests threatens to control the big data research agenda. This article extends the notion of a big data divide to incorporate a distinction between ways of prediction over explanation and comprehension in ways that undermine the democratizing/empowering promise of digital media. Despite the rhetoric of personalization associated with data mining, it yields predictions that are probabilistic in character, privileging decision-making at the aggregate level (over - anticipatable but persistent patterns that can be used to make decisions that influence the life chances of individuals and groups.
In online tracking and other types of digital-era data surveillance, the logic of data mining, which proposes to reveal unanticipated, unpredictable patterns in the data, renders notions such as informed claims, discussed in more detail in the following sections, reveal that big data holds promise for much more than targeted advertising: It is about finding new ways to use data to make predictions, and thus decisions, about everything from health care to policing, urban planning, financial planning, job screening, and educational admissions. At a deeper level, the big data paradigm challenges the empowering promise of the Internet by proposing the superiority of a post-explanatory pragmatics (available only to the few) to the forms of comprehension that digital media were supposed to make more accessible to the many. None of these concerns fits comfortably within the standard privacy-oriented framing of issues related to the collection and use of personal information.

A Big Data Divide

In the sense of standing for more information than any individual human or group of humans can comprehend, the notion of big data has existed since the dawn of consciousness. The world and its universe are, to anything or anyone with senses, incomprehensibly big data. The contemporary usage is distinct, however, in that it marks the emergence of the prospect of making sense of an incomprehensibly large trove of recorded data: the promise of being able to put it to meaningful use even though no individual or group of individuals can comprehend it. More prosaically, big data denotes the moment when automated forms of pattern recognition known as data analytics can catch up with automated forms of data collection and storage. Such data analytics are distinct from simple searching and querying of large data sources, a practice with a much longer legacy. Thus, for the purposes of this article, the big data moment and the advent of data-mining techniques go hand in hand. The magnitude of what counts as big data, then, will likely continue to increase to keep pace with both data storage and data processing capacities. IBM, which is investing heavily in data mining and predictive analytics, notes that big data is not just about size but also about the speed of data generation and processing and the heterogeneity of data that can be dumped into combined databases. It describes these dimensions in terms of the three , para. 2).

Big-data mining is omnivorous, in part because it has embarked on the project of discerning structured and unstructured data such as text, sensor data, audio, video, click streams, log files and more. New insights , para. 9). Data can be collected, sorted, and correlated on a hitherto unprecedented scale that promises to generate useful patterns far beyond the ility to detect or even explain. As data-mining consultant Colleen McCue (2007) puts it, searching well beyond the capacity of human analysts or even a te mining promises to generate patterns of actionable information that outstrip the reach of the unaided human brain. In his book Too Big to Know just giant computers but a network to connect them, to feed them, and to make their work

Such observations trace the emerging contours of as putting the data to use requires access to and control over costly technological infrastructures, expensive data sets, and the software, processing power, and expertise for analyzing them. If, as Weinberger puts it, in the era of to the machines, the databases, and the algorithms.
Assuming for the sake of argument that the big data prognosticators (e.g., Mayer-Schönberger & Cukier, 2012) are correct, the era of big data (characterized by the ability to make use of databases too large for any individual or group of individuals to comprehend) ushers in powerful new capabilities for decision making and prediction unavailable to those without access to the databases, storage, and processing power. In manifold spheres of social practice, then, those with access to databases, processing power, and data-mining expertise will find themselves advantageously positioned compared to those without such access. But the divide at issue is not simply -asymmetric sorting processes and different ways of thinking about how data relate to knowledge and its application. The following sections consider each of these issues in turn.

The Big Data Sort

For those with database access, the ability to capture and mine tremendous amounts of data considerably enhances and alters possibilities for engaging in what David Lyon (2002), building on the but also of assessing risks

processing power are positioned to engage in increasingly powerful, sophisticated, and opaque forms of -term [or newly generated] social work, businesses as both employers and marketers) and those subjected to the sorting process. understand that these decisions are not really based on an assessment of who or what people are, but on what they will do in the future. The panoptic sort is not only a discriminatory technology, but it is one that (Gandy, 2005, p. 2). This observation remains as salient as ever in the era of data mining and predictive analytics, which, while deploying the rhetoric of personalization, which predictions seem so accurate that people can be arrested for crim misleading (Kakutani, 2013, para. 14). Predictive analytics is not, despite the hype, a crystal ball. As one commentator put it,

hundreds of thousands to millions of people, and you are converging against the mean. I accuracy what one shopper is going to do if he or she looks exactly like one million other shoppers. (Nolan, 2012, p. 15)

But the confusion between fortune telling and forecasting is consequential, for decisions made at a probabilistic, aggregate level produce effects felt at an individual level: the profile and the person intersect. To someone who has been denied health care, employment, or credit, the difference between a probabilistic prediction and a certainty is, for all practical purposes, immaterial.

Social sorting has a long history but comes into its own as a form of automated calculus, as Gandy (1993) suggests, in the era of modern bureaucratic rationality. Thus, it is tempting to note the historical continuity between big data-driven forms of social sorting and earlier forms of data-based decision making, from Taylorist forms o -20th-century forms of redlining in the banking, housing, and insurance industries.
Raley (2013), for example, has noted that in an early account of computer- nce made by widespread, and simultaneously less visible many processes that already occur However, a qualitative shift in monitoring- new data-mining processes, which now can generate un-anticipatable and un-intuitable predictive patterns (e.g., Chakrabarti, 2009). That is, their systemic, structural opacity creates a divide between the kinds of

In the following sections, I argue that emerging awareness of forms of asymmetrical power associated with both the tremendous accumulation of data and new techniques for putting it to work provides a possible explanation for public concern about the collection and use of personal data. Survey after survey, including my own (discussed below), has revealed a high level of concern about the commercial collection and use of personal information online. For example, a 2012 Pew study in the United States revealed that a majority (65%) of people who use search engines did not approve of the use of behavioral data to customize search results, and that more than two-thirds of all Internet users (68%) did not approve of targeted advertising based on behavioral tracking (Purcell, Brenner, & Rainie, 2012). Another nationwide U.S. survey found that 66% of respondents opposed ad targeting based on tracking (Turow, King, Hoofnagle, Bleakley, & Hennessy, 2009). In a U.S. study of public reaction given the choice. My own nationwide survey in Australia revealed strong support for do-not-track legislation (95% in favor). Well over half of the respondents (56%) opposed customized advertising based despite their stated concerns, of services that collect and use their personal information is framed Horne, & Horne, 2007), and sometimes as evidence that people do not really care as much as the research indicates (e.g., Oppmann, 2010).

Based on early results of qualitative research on privacy concerns, this article offers an alternative explanation: that people operate within structured power relations that they dislike but feel powerless to contest. On a somewhat more speculative level, I suggest that there is an emerging understanding on the part of users that the asymmetry and opacity of a big data divide augurs an era of powerful but undetectable and un-anticipatable forms of data mining, contributing to their concern about potential downsides of the digital surveillance economy.
This asymmetry runs deep, insofar as it privileges a form of knowledge available only to those with access to costly resources and technologies over the types of knowledge and information access tha ²

In a much-discussed Wired magazine article, Chris Anderson (2008) claimed that the era of big that is, the coming irrelevance of model-based understandings of the world. As he put it,

This is a world where massive amounts of data and applied mathematics replace every other tool that might be brought to bear. Out with every theory of human behavior, from linguistics to sociology. Forget taxonomy, ontology, and psychology . . . With enough data, the numbers speak for themselves. (Anderson, 2008, para. 8)

This sweeping, manifesto-like claim invites qualification: Surely, statistical models remain necessary for developing algorithms, and other sorts of models are needed to shape the use of the information generated by increasingly loquacious data. Data scientists have emphasized the importance of domain-specific expertise in assessing the data that gets fed into mining algorithms and shaping the questions that might be put to the data. As McCue (2007) stated in her primer on data mining and predictive domain expert emerges

² For a good overview of the celebratory, democratizing rhetoric surrounding the reception of the Internet, see Mosco (2004).

of data mining are often although not exclusively mining is defined as the process of discovering patterns in data. The process must be automatic or (more usually) semiautomatic. The patterns discovered must be meaningful in that they lead to some advantage, But numerous other types of advantages are terms of national security If knowledge is power, then foreknowledge [via predictive analytics] can be seen describe the breadth and depth of new forms of data capture, anticipates that insights gleaned from the database will help create a more healthy, secure, and efficient world for all:

For society, the hope is that we can use this new in-depth understanding of individual behaviour to increase the efficiency and responsiveness of industries and governments. For individuals, the attraction is the possibility of a world where everything is arranged for your convenience: your health checkup is magically scheduled just as you begin to get sick, the bus comes just as you get to the bus stop, and there is never a line of waiting people at city hall. (Pentland, 2009, p. 79)

Other benefits could involve new forms of transparency that make various kinds of public records available so as to hold public officials and private entities more accountable. The era of big data mining concentrates a particular technique for generating actionable information (to be used for good or ill) in only a few hands, for the specific purpose of gaining some kind of advantage.³ Tellingly, it posits a form of knowing that allegedly renders obsolete or outdated the very model of Internet empowerment that was supposed to help hold entrenched forms of power accountable by increasing access to forms of knowledge that allowed people to understand the world around them.⁴ This ng the world through the careful, judicious, and informed study of available information is, for a growing range of applications, obsolete in the petabyte era, which promises to unearth powerfully useful patterns from bodies of information that are too large for a single person or group of people to make sense of. At the very moment that the new technology enhances access to traditional forms of understanding and evidence, they are treated as ostensibly outdated.

Even if Anderson is overstating the case and understanding remains an important aspect of knowledge acquisition in the digital era, the point remains: The few will have access to useful forms of y but incomprehensible, in the sense described by Weinberger (2011). This knowledge is unpredictable and inexplicable in the conventional sense (as in the Mercury example: a correlation without an underlying explanation) and therefore opaque to those without access to the database. Thus, individual users have no way to anticipate fully how information about them might prove salient for particular forms of decision making, including, for example, whether they might be considered a security risk, a good or bad job prospect, a credit risk, or more or less likely to drop out of school. Consider, for instance, the finding that had to be deliberately installe , para. 2). The finding is unexplained and unlikely to be anticipated by the applicants themselves, but it can significantly affect their lives nevertheless.

As this example suggests, the forms of social sorting associated with big data mining will range far beyond the marketing realm, feeding into the decision-making processes of those with access to the information it provides and thereby allowing them to affect the life chances of others in increasingly opaque but significant ways. Whereas it may still be possible to intuitively grasp the link between, for example, a particular brand of car and a political preference, the promise of data mining is to unearth correlations beyond the realm of such imagining. Reverse engineering an algorithmic determination can require as much expertise as generating it in the first place, and the results may have no direct explanatory power. When correlation displaces causality or explanation, the goal is to accumulate as comprehensive and varied a database as possible to generate truly surprising, non-intuitive results. Perhaps a particular combination of eating habits, weather patterns, and geographic location correlates with a tendency to perform poorly in a particular job or susceptibility to a chronic illness that threatens employability. There may not be any underlying explanation beyond the pattern itself.

³ Big data should not be understood as a static concept, for as more people gain access to data-mining support the latest technology and the largest databases.

⁴ For a discussion of the promise of Internet empowerment, see Andrejevic (2007, pp. 15–21).
The basis for the kind of sorting envisioned via big data mining is likely to become increasingly obscure in direct proportion to the size and scope of the available data and the sophistication of the techniques used to mine it. At a recent meeting of the Organisation for Economic Co-operation and Development, one ukier, 2013, para. 6). According to the participant, who is CEO of a data-mining company:

There are machines that learn, that are able to make connections that are much, much finer than you can see and they can calibrate connections between tons and tons of different facets of information, so that there is no way you as a human can understand fully what is going on there. (J. Haesler, personal communication, February 26, 2013)

To note these characteristics of data mining is not to discount the potential benefits of its anticipated benevolent uses. Yet the shadow of rationalization betokens asymmetrical control: a world in which people are sorted at important life moments according to genetic, demographic, geo-locational, and previously unanticipated types of data in ways that remain opaque and inaccessible to those who are affected. In some instances, this is surely desirable: when, for example, a medical intervention is triggered just in time to avoid more severe complications. At the same time, it is easy to imagine ways in which this type of pre-emptive modelling, what William Bogard (1996, p. 1 can be abused. Imagine, for example, a world in which private health insurers mine client data in an attempt to cancel coverage just in time to avoid having to cover major medical expenses.

What People Talk About When They Talk About Privacy

The apparent contradictions in public attitudes toward personal-data collection resolve somewhat when viewed against this account of the big data divide and its defining attributes. Those who judge people solely by their actions may conclude, for example, that , acceptable balance between privacy and convenience, they give up some privacy and get a lot of , para. 11). This framing of the exchange assumes people are aware of the terms of the trade-off, and it construes acquiescence to pre-structured terms of access as tantamount to a ready embrace of those terms. On closer examination, such assumptions fall short.

The notion of informed consent is a vexed one in the online context, partly because few people read the terms of use they agree to upon joining or signing in. Research indicates that the vast majority of users only skim privacy policies Mulligan, & Hoofnagle, 2007), a fact that might be taken as evidence that people do not care about privacy, despite high levels of stated concern and the proliferation of technologies for data capture. A more plausible explanation, based on my research on collection and use of personal information in Australia, is a perceived lack of options combined with lack of knowledge about possible uses of personal information and the absence of any discernible negative impact of these uses (e.g., job applicants are likely unaware that their choice of browser might decide whether they are hired).

-à-vis the arrangements that structure the collection and use of personal information. Despite the persistent focus on privacy issues in both academic research and popular press coverage, privacy arguably takes a backseat to an underlying sense of powerlessness. As one focus group respondent said (eliciting ( female, scanning p (Byers, 2013, para. 6 powerful uses that are not fully understood.
The focus group was one of three devoted to discussing the results of a nationwide telephone information.⁵ The survey results paralleled research in other countries indicating a high level of concern about the collection and use of personal information: 59% of respondents said websites collect too much information about people.⁶ They also revealed a very high level of support for stricter controls on information collection, including a do-not-track option (92% support), a requirement to delete personal data upon request (96% support), and real-time notification of tracking (95% support).⁷ Well over half of the respondents (56%) said they opposed customized advertising based on tracking. The survey results also indicated that people are palpably aware that they know little about how their information is used: 73% of respondents said they needed to know more about the ways websites collect and use their information.⁸

T that is, not between those who comprehend the correlations and those who do not, but between those who are able to extract and use un-anticipatable and inexplicable (as described above) findings and those who find their lives affected by the resulting decisions. This formulation can aid consideration of the ways that the post-survey findings from the follow-up focus groups challenge the dominant framing of issues in contemporary discussions of If you have something that you e to know, it Bradley, 2012, para. 3; , para. 1

⁵ These survey findings are based on a national telephone survey conducted with N = 1,106 adults across Australia between November 17 and December 14, 2011. Managed by the Social Research Centre in Melbourne, the project sourced respondents through random-digit phone number generation for landlines and mobile phones. The final sample consisted of 642 surveys taken via landline numbers and 464 taken via mobile numbers. Reported data were proportionally weighted to adjust for design (chance of selection), contact opportunities (mobile only, landline, or both), and demographics (gender, age, education, and state). A complete summary of the findings and methodology is available online at -information-project. The survey was followed up by an ongoing series of interviews and focus group discussions. As of this writing, 27 structured interviews were conducted at three sites across Australia (Melbourne, Sydney, and Brisbane). Recruited randomly in public spaces for 30- to 45-minute discussions, respondents were screened to include only experienced Internet users. The preliminary interview sample skews young and female, consisting of 19 female respondents and 8 male respondents, all between the ages of 19 and 37. As the project develops, respondents will be selected to counter this skew. Focus group participants were similarly recruited in public spaces at the three research sites and received a $20 iTunes gift card to participate in a 50-minute group discussion. A similar skew applies to the focus group participants: 16 women and 6 men, ages 20–31. The focus group structure was tested on students in an undergraduate seminar, and some of their comments were included.

⁶ Thinking now about the personal information gathered by ONLINE companies about their consumers, would you say they gather too much, about the right amount or not enough

⁷ Survey questions: Do you think: 1. There should be a law that requires websites and advertising companies to delete all stored information about an individual, if requested to do so? 2. There sh -not- option that would prevent them from gathering information about people? 3. There should be a law requiring companies to notify people at the time when they collect data about them online?

⁸ How would you describe your understanding of the ways in which companies collect and use the information they gather about people online? Do you feel that you already know as much as you need to know about what companies do in this regard or need to know more about what companies
