In this excerpt from Six Provocations for Big Data, a conference presentation delivered at Oxford University in September 2011 as part of “A Decade in Internet Time: Symposium on the Dynamics of the Internet and Society,” danah boyd and Kate Crawford provoke debate by making six potentially controversial claims: (1) automating research changes the definition of knowledge, (2) claims to objectivity and accuracy are misleading, (3) bigger data are not always better data, (4) not all data are equivalent, (5) just because it’s accessible doesn’t make it ethical, and (6) limited access to Big Data creates new digital divides. Given the focus of this chapter, this excerpt includes the paper’s introduction, part of the section devoted to the first claim, and all of the section devoted to the fifth claim.

danah boyd wears many hats. A principal researcher at Microsoft Research and founder of the Data & Society Research Institute, she holds academic affiliations with the Department of Media, Culture, and Communication at New York University (NYU), the University of New South Wales (UNSW) in Australia, and the Berkman Center for Internet & Society at Harvard University. Her work has focused on the intersection of technology and society, particularly how young people use social media in their daily lives. More recently, her research has shifted toward the topic of this selection, Big Data and privacy.

Kate Crawford, boyd’s coauthor, is likewise a principal researcher at Microsoft Research, as well as a visiting professor at the MIT Center for Civic Media, a senior fellow at NYU’s Information Law Institute, and an associate professor in the Journalism and Media Research Center at UNSW. She is also a composer who has released three albums of electronic music. Her research focuses on power and ethics with respect to social media.

As you read, be alert to how boyd and Crawford define the term Big Data and why they think all of us should be concerned about it and its uses.
Six Provocations for Big Data
danah boyd AND KATE CRAWFORD
Technology is neither good nor bad; nor is it neutral . . . technology’s interaction with the social ecology is such that technical developments frequently have environmental, social, and human consequences that go far beyond the immediate purposes of the technical devices and practices themselves.
— Melvin Kranzberg (1986, p. 545)
We need to open a discourse — where there is no effective discourse now — about the varying temporalities, spatialities, and materialities that we might represent in our databases, with a view to designing for maximum flexibility and allowing as possible for an emergent polyphony and polychrony. Raw data is both an oxymoron and a bad idea; to the contrary, data should be cooked with care.
— Geoffrey Bowker (2005, pp. 183–184)
The era of Big Data has begun. Computer scientists, physicists, economists, mathematicians, political scientists, bio-informaticists, sociologists, and many others are clamoring for access to the massive quantities of information produced by and about people, things, and their interactions. Diverse groups argue about the potential benefits and costs of analyzing information from Twitter, Google, Verizon, 23andMe, Facebook, Wikipedia, and every space where large groups of people leave digital traces and deposit data. Significant questions emerge. Will large-scale analysis of DNA help cure diseases? Or will it usher in a new wave of medical inequality? Will data analytics help make people’s access to information more efficient and effective? Or will it be used to track protesters in the streets of major cities? Will it transform how we study human communication and culture, or narrow the palette of research options and alter what “research” means? Some or all of the above?
Big Data is, in many ways, a poor term. As Lev Manovich (2011) observes, it has been used in the sciences to refer to data sets large enough to require supercomputers, although now vast sets of data can be analyzed on desktop computers with standard software. There is little doubt that the quantities of data now available are indeed large, but that’s not the most relevant characteristic of this new data ecosystem. Big Data is notable not because of its size, but because of its relationality to other data. Due to efforts to mine and aggregate data, Big Data is fundamentally networked. Its value comes from the patterns that can be derived by making connections between pieces of data, about an individual, about individuals in relation to others, about groups of people, or simply about the structure of information itself.
This piece begins by making a definitional argument about Big Data. That definition informs all the other arguments that boyd and Crawford make. For more on arguments of definition, see Chapter 9.
Furthermore, Big Data is important because it refers to an analytic phenomenon playing out in academia and industry. Rather than suggesting a new term, we are using Big Data here because of its popular salience and because it is the phenomenon around Big Data that we want to address. Big Data tempts some researchers to believe that they can see everything at a 30,000-foot view. It is the kind of data that encourages the practice of apophenia: seeing patterns where none actually exist, simply because massive quantities of data can offer connections that radiate in all directions. Due to this, it is crucial to begin asking questions about the analytic assumptions, methodological frameworks, and underlying biases embedded in the Big Data phenomenon.
While databases have been aggregating data for over a century, Big Data is no longer just the domain of actuaries and scientists. New technologies have made it possible for a wide range of people — including humanities and social science academics, marketers, governmental organizations, educational institutions, and motivated individuals — to produce, share, interact with, and organize data. Massive data sets that were once obscure and distinct are being aggregated and made easily accessible. Data is increasingly digital air: the oxygen we breathe and the carbon dioxide that we exhale. It can be a source of both sustenance and pollution.
5 How we handle the emergence of an era of Big Data is critical: while it is taking place in an environment of uncertainty and rapid change, current decisions will have considerable impact in the future. With the increased automation of data collection and analysis — as well as algorithms that can extract and inform us of massive patterns in human behavior — it is necessary to ask which systems are driving these practices, and which are regulating them. In Code, Lawrence Lessig (1999) argued that systems are regulated by four forces: the market, the law, social norms, and architecture — or, in the case of technology, code. When it comes to Big Data, these four forces are at work and, frequently, at odds. The market sees Big Data as pure opportunity: marketers use it to target advertising, insurance providers want to optimize their offerings, and Wall Street bankers use it to take better readings of market temperament. Legislation has already been proposed to curb the collection and retention of data, usually over concerns about privacy (for example, the Do Not Track Online Act of 2011 in the United States). Features like personalization allow rapid access to more relevant information, but they present difficult ethical questions and fragment the public in problematic ways (Pariser, 2011).
There are some significant and insightful studies currently being done that draw on Big Data methodologies, particularly studies of practices in social network sites like Facebook and Twitter. Yet, it is imperative that we begin asking critical questions about what all this data means, who gets access to it, how it is deployed, and to what ends. With Big Data come big responsibilities. In this essay, we are offering six provocations that we hope can spark conversations about the issues of Big Data. Social and cultural researchers have a stake in the computational culture of Big Data precisely because many of its central questions are fundamental to our disciplines. Thus, we believe that it is time to start critically interrogating this phenomenon, its assumptions, and its biases.
1. AUTOMATING RESEARCH CHANGES THE DEFINITION OF KNOWLEDGE
In the early decades of the twentieth century, Henry Ford devised a manufacturing system of mass production, using specialized machinery and standardized products. It quickly became the dominant vision of technological progress. Fordism meant automation and assembly lines, and for decades onward, this became the orthodoxy of manufacturing: out with skilled craftspeople and slow work, in with a new machine-made era (Baca, 2004). But it was more than just a new set of tools. The twentieth century was marked by Fordism at a cellular level: it produced a new understanding of labor, the human relationship to work, and society at large.
Big Data refers not only to very large data sets and the tools and procedures used to manipulate and analyze them, but also to a computational turn in thought and research (Burkholder, 1992). Just as Ford changed the way we made cars — and then transformed work itself — Big Data has emerged as a system of knowledge that is already changing the objects of knowledge, while also having the power to inform how we understand human networks and community. “Change the instruments, and you will change the entire social theory that goes with them,” Latour reminds us (2009, p. 9).
[. . .]
5. JUST BECAUSE IT IS ACCESSIBLE DOESN’T MAKE IT ETHICAL
In 2006, a Harvard-based research project started gathering the profiles of 1,700 college-based Facebook users to study how their interests and friendships changed over time (Lewis et al., 2008). This supposedly anonymous data was released to the world, allowing other researchers to explore and analyze it. What other researchers quickly discovered was that it was possible to de-anonymize parts of the data set: compromising the privacy of students, none of whom were aware their data was being collected (Zimmer, 2008).
10 The case made headlines, and raised a difficult issue for scholars: what is the status of so-called “public” data on social media sites? Can it simply be used, without requesting permission? What constitutes best ethical practice for researchers? Privacy campaigners already see this as a key battleground where better privacy protections are needed. The difficulty is that privacy breaches are hard to make specific — is there damage done at the time? What about twenty years hence? “Any data on human subjects inevitably raise privacy issues, and the real risks of abuse of such data are difficult to quantify” (Nature, cited in Berry, 2011).
Even when researchers try to be cautious about their procedures, they are not always aware of the harm they might be causing in their research. For example, a group of researchers noticed that there was a correlation between self-injury (“cutting”) and suicide. They prepared an educational intervention seeking to discourage people from engaging in acts of self-injury, only to learn that their intervention prompted an increase in suicide attempts. For some, self-injury was a safety valve that kept the desire to attempt suicide at bay. They immediately ceased their intervention (Emmens & Phippen, 2010).
Institutional Review Boards (IRBs) — and other research ethics committees — emerged in the 1970s to oversee research on human subjects. While unquestionably problematic in implementation (Schrag, 2010), the goal of IRBs is to provide a framework for evaluating the ethics of a particular line of research inquiry and to make certain that checks and balances are put into place to protect subjects. Practices like “informed consent” and protecting the privacy of informants are intended to empower participants in light of earlier abuses in the medical and social sciences (Blass, 2004; Reverby, 2009). Although IRBs cannot always predict the harm of a particular study — and, all too often, prevent researchers from doing research on grounds other than ethics — their value is in prompting scholars to think critically about the ethics of their research.
With Big Data emerging as a research field, little is understood about the ethical implications of the research being done. Should someone be included as a part of a large aggregate of data? What if someone’s “public” blog post is taken out of context and analyzed in a way that the author never imagined? What does it mean for someone to be spotlighted or to be analyzed without knowing it? Who is responsible for making certain that individuals and communities are not hurt by the research process? What does consent look like?
It may be unreasonable to ask researchers to obtain consent from every person who posts a tweet, but it is unethical for researchers to justify their actions as ethical simply because the data is accessible. Just because content is publicly accessible doesn’t mean that it was meant to be consumed by just anyone (boyd & Marwick, 2011). There are serious issues involved in the ethics of online data collection and analysis (Ess, 2002). The process of evaluating the research ethics cannot be ignored simply because the data is seemingly accessible. Researchers must keep asking themselves — and their colleagues — about the ethics of their data collection, analysis, and publication.
15 In order to act in an ethical manner, it is important that scholars reflect on the importance of accountability. In the case of Big Data, this means both accountability to the field of research and accountability to the research subjects. Academic researchers are held to specific professional standards when working with human participants in order to protect their rights and well-being. However, many ethics boards do not understand the processes of mining and anonymizing Big Data, let alone the errors that can cause data to become personally identifiable. Accountability to the field and to human subjects requires rigorous thinking about the ramifications of Big Data, rather than assuming that ethics boards will necessarily do the work of ensuring people are protected. Accountability here is used as a broader concept than privacy, as Troshynski et al. (2008) have outlined, where the concept of accountability can apply even when conventional expectations of privacy aren’t in question. Instead, accountability is a multi-directional relationship: there may be accountability to superiors, to colleagues, to participants, and to the public (Dourish & Bell, 2011).
There are significant questions of truth, control, and power in Big Data studies: researchers have the tools and the access, while social media users as a whole do not. Their data was created in highly context-sensitive spaces, and it is entirely possible that some social media users would not give permission for their data to be used elsewhere. Many are not aware of the multiplicity of agents and algorithms currently gathering and storing their data for future use. Researchers are rarely in a user’s imagined audience; neither are users necessarily aware of all the multiple uses, profits, and other gains that come from information they have posted. Data may be public (or semi-public), but this does not simplistically equate with full permission being given for all uses. There is a considerable difference between being in public and being public, which is rarely acknowledged by Big Data researchers.
ACKNOWLEDGEMENTS
We wish to thank Heather Casteel for her help in preparing this article. We are also deeply grateful to Eytan Adar, Tarleton Gillespie, and Christian Sandvig for inspiring conversations, suggestions, and feedback.
REFERENCES
Baca, G. (2004). Legends of Fordism: Between myth, history, and foregone conclusions. Social Analysis 48(3), 169–178.
Berry, D. (2011). The computational turn: Thinking about the digital humanities. Culture Machine 12. Retrieved from http://www.culturemachine.net/index.php/cm/article/view/440/470
Blass, T. (2004). The man who shocked the world: The life and legacy of Stanley Milgram. New York, NY: Basic Books.
Bowker, G. C. (2005). Memory practices in the sciences. Cambridge, MA: MIT Press.
boyd, d., & Marwick, A. (2011). Social privacy in networked publics: Teens’ attitudes, practices, and strategies. Paper presented at Oxford Internet Institute Decade in Time Conference, Oxford, England.
Burkholder, L. (Ed.). (1992). Philosophy and the computer. Boulder, CO: Westview Press.
Dourish, P., & Bell, G. (2011). Divining a digital future: Mess and mythology in ubiquitous computing. Cambridge, MA: MIT Press.
Emmens, T., & Phippen, A. (2010). Evaluating online safety programs. Harvard Berkman Center for Internet and Society. Retrieved from http://cyber.law.harvard.edu/sites/cyber.law.harvard.edu/files/Emmens_Phippen_Evaluating-Online-Safety-Programs_2010.pdf
Ess, C. (2002). Ethical decision-making and Internet research: Recommendations from the AOIR ethics working committee. Association of Internet Researchers. Retrieved from http://aoir.org/reports/ethics.pdf
Kranzberg, M. (1986). Technology and history: Kranzberg’s laws. Technology and Culture 27(3), 544–560.
Latour, B. (2009). Tarde’s idea of quantification. In M. Candea (Ed.), The social after Gabriel Tarde: Debates and assessments (pp. 145–162). London, England: Routledge.
Lessig, L. (1999). Code: And other laws of cyberspace. New York, NY: Basic Books.
Lewis, K., Kaufman, J., Gonzalez, M., Wimmer, A., & Christakis, N. (2008). Tastes, ties, and time: A new social network dataset using Facebook.com. Social Networks 30, 330–342.
Manovich, L. (2011). Trending: The promises and the challenges of big social data. In M. K. Gold (Ed.), Debates in the digital humanities. Minneapolis, MN: University of Minnesota Press.
Pariser, E. (2011). The filter bubble: What the Internet is hiding from you. New York, NY: Penguin.
Reverby, S. M. (2009). Examining Tuskegee: The infamous syphilis study and its legacy. Chapel Hill, NC: University of North Carolina Press.
Schrag, Z. M. (2010). Ethical imperialism: Institutional review boards and the social sciences, 1965–2009. Baltimore, MD: Johns Hopkins University Press.
Troshynski, E., Lee, C., & Dourish, P. (2008). Accountabilities of presence: Reframing location-based systems. CHI ’08 Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 487–496. doi:10.1145/1357054.1357133
Zimmer, M. (2008). More on the “anonymity” of the Facebook dataset — It’s Harvard College [Blog post]. Retrieved from http://www.michaelzimmer.org/2008/09/30/on-the-anonymity-of-the-facebook-dataset/
RESPOND
1. How, for boyd and Crawford, do automating research and using Big Data change the definition of knowledge? And what specific ethical issues does Big Data raise for these authors? Why?
2. boyd and Crawford quote Bruno Latour, who notes, “Change the instruments, and you will change the entire social theory that goes with them” (paragraph 8). What does Latour mean? How does this claim relate to boyd and Crawford’s stance at this point in the selection and throughout their argument? What sort of causal argument is this? (Chapter 11 discusses the forms of causal argument that we usually encounter.)
3. As in much academic writing, definitional arguments play an important role in boyd and Crawford’s discussion. Examine how the authors define the following terms and how they use them to further their own argument: Big Data (paragraphs 1–3), apophenia (paragraph 3), Fordism (paragraphs 7–8), self-injury (paragraph 11), and accountability (paragraph 15). How do boyd and Crawford both offer a definition or characterization of each concept and then employ that notion to support or explain a point they wish to make? By the way, note that in none of these situations do the authors quote Webster’s or some other dictionary; instead, they construct their definitions in other ways. (Chapter 9 on arguments of definition may prove useful in helping you answer this question.)
4. Another interesting (and common) feature of boyd and Crawford’s academic writing style is the use of figurative language. What sort of figurative language is being used in each of these examples?
Will it transform how we study human communication and culture, or narrow the palette of research options and alter what “research” means? (paragraph 1)
Data is increasingly digital air: the oxygen we breathe and the carbon dioxide that we exhale. (paragraph 4)
The twentieth century was marked by Fordism at a cellular level: it produced a new understanding of labor, the human relationship to work, and society at large. (paragraph 7)
How does the use of this figurative language help advance the arguments being made? (Chapter 13 on style in arguments discusses figurative language.)
5. Begin where boyd and Crawford end their essay. What is the difference between being in public and being public with respect to the sorts of Big Data that these authors discuss? Investigate two forms of social media that you are familiar with (or, better yet, that you use) to see what sorts of access users grant each site’s developers to data about their behavior on that site or on other sites that are tracked. (In other words, when someone clicks “Agree” on one of these sites, what information becomes “public” to the developers of the site?) Write an argument of fact in which you present your findings, linking them to the contrast between “being in public” online and “being public” online. (Chapter 8 will be useful here.) You may also wish to use this information to evaluate the situation (Chapter 10) or make a proposal about it (Chapter 12).