BIG DATA WORKING GROUP
Big Data Taxonomy, September 2014

© 2014 Cloud Security Alliance – All Rights Reserved

All rights reserved. You may download, store, display on your computer, view, print, and link to the Cloud Security Alliance "Big Data Taxonomy" paper at https://cloudsecurityalliance.org/research/big-data/, subject to the following: (a) the Document may be used solely for your personal, informational, non-commercial use; (b) the Document may not be modified or altered in any way; (c) the Document may not be redistributed; and (d) the trademark, copyright or other notices may not be removed. You may quote portions of the Document as permitted by the Fair Use provisions of the United States Copyright Act, provided that you attribute the portions to the Cloud Security Alliance "Big Data Taxonomy" (2014).
Acknowledgements

Contributors
Praveen Murthy, Anurag Bharadwaj, P. A. Subrahmanyam, Arnab Roy, Sree Rajan

Commentators
Nrupak Shah, Kapil Assudani, Grant Leonard, Gaurav Godhwani, Joshua Goldfarb, Zulfikar, Aaron Alva

Design/Editing
Tabitha Alterman, Copyeditor
Frank Guanco, Project Manager, CSA
Luciano J.R. Santos, Global Research Director, CSA
Kendall Cline Scoboria, Graphic Designer, Shea Media
Evan Scoboria, Co-Founder, Shea Media; Webmaster, CSA
John Yeoh, Senior Research Director, CSA

Abstract

In this document, we propose a six-dimensional taxonomy for big data. The main objective of this taxonomy is to help decision makers navigate the myriad choices in compute and storage infrastructures as well as data analytics techniques, and security and privacy frameworks. The taxonomy is pivoted around the nature of the data to be analyzed.
Table of Contents

Acknowledgements
Abstract
Introduction
Data
Compute Infrastructure
Storage Infrastructure
Analytics
Visualization
Security and Privacy
Conclusion
References
Introduction

The term big data refers to the massive amount of digital information companies and governments collect about us and our surroundings. This data is not only generated by traditional information exchange and software use via desktop computers, mobile phones and so on, but also by the myriad sensors of various types embedded in various environments, whether in city streets (cameras, microphones) or jet engines (temperature sensors), and the soon-to-proliferate Internet of Things, where virtually every electrical device will connect to the Internet and produce data.

Every day, we create 2.5 quintillion bytes of data, so much that 90% of the data in the world today has been created in the last two years alone (as of 2011 [1]). The issues of storing, computing, security and privacy, and analytics are all magnified by the velocity, volume, and variety of big data: large-scale cloud infrastructures, diversity of data sources and formats, the streaming nature of data acquisition, and high-volume inter-cloud migration.

The six-dimensional taxonomy is shown in Figure 1. These six dimensions arise from the key aspects that are needed to establish a big data infrastructure. We will describe each of the dimensions in the rest of the document.

Figure 1: Big data 6-D taxonomy
Data

The first question: What are the various domains in which big data arises? The reason for categorizing the domains in which data arises is to understand the infrastructural choices and requirements that need to be made for particular types of data. All "data" is not equivalent. The particular domain in which data arises will determine the types of architecture required to store it, process it, and perform analytics on it. There are several ways in which we can think about this question of data domains.

Latency Requirements

The first way to characterize data is according to the time span in which it needs to be analyzed:

• Real-time (financial streams, complex event processing (CEP), intrusion detection, fraud detection)
• Near real-time (ad placement)
• Batch (retail, forensics, bioinformatics, geodata, historical data of various types)

Examples of "Real-Time" Applications

Some of the many applications that involve data arriving "in real time" include the following:

• Online ad optimization (including real-time bidding)
• High-frequency online trading platforms
• Security event monitoring
• Financial transaction monitoring and fraud detection
• Web analytics and other kinds of dashboards
• Churn prediction for online games or e-commerce
• Optimizing devices, industrial plants or logistics systems based on behavior and usage
• Control-system-related tasks, e.g., the SmartGrid, nuclear plants
• Sentiment analysis of tweets pertaining to a topic

In most of these applications, data is constantly changing. To react to certain events, it is necessary and/or practical to consider only relevant data over a certain time frame ("page views in the last hour" or "transactions in the last hour/day/week/month"), instead of taking the entirety of past data into account (a minimal sketch of such a time-windowed count appears after the list of key attributes below).

Key Attributes of Real-Time Applications Impacting Big Data Technology Solutions

In order to select the approach and big data technology solution best suited to a problem at hand, it is important to understand some of the key attributes that impact this decision. In addition to latency requirements (the time available to compute the results), these could include the following:

• Event Characteristics
  o Including the input/output data rate required by the application.
• Event Response Complexity
  o Processing complexity: What is the computational complexity of the processing task for each event?
  o Data Domain Complexity: What is the size of the data that has to be accessed to support such processing?
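To make the time-window idea concrete, the following is a minimal sketch, not from the original paper, of a sliding-window counter in Python; the one-hour window and the explicit timestamps are assumptions chosen for illustration.

```python
import time
from collections import deque

class SlidingWindowCounter:
    """Count events seen within the last `window_seconds` seconds.

    A minimal local sketch: real streaming systems shard this state
    across machines and deal with out-of-order arrivals.
    """

    def __init__(self, window_seconds=3600.0):
        self.window = window_seconds
        self.timestamps = deque()  # arrival times of events still in the window

    def record(self, now=None):
        self.timestamps.append(time.time() if now is None else now)

    def count(self, now=None):
        now = time.time() if now is None else now
        # Evict events that have fallen out of the time window.
        while self.timestamps and self.timestamps[0] < now - self.window:
            self.timestamps.popleft()
        return len(self.timestamps)

# "Page views in the last hour", with explicit timestamps for determinism.
views = SlidingWindowCounter(window_seconds=3600)
views.record(now=0.0)
views.record(now=1800.0)
print(views.count(now=3599.0))  # 2: both views fall inside the window
print(views.count(now=5400.0))  # 1: the view at t=0 has expired
```

The deque keeps eviction cheap: only events older than the window are dropped, and each event is appended and removed at most once.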
Big Data Technology Solutions for Real-Time Applications

When considering an appropriate big data technology platform, one of the main considerations is the latency requirement. If low latency is not required, more traditional approaches that first collect data on disk or in memory and then perform computations on this data later will suffice. In contrast, low latency requirements generally imply that the data must be processed as it comes in.

Structure

Another way to map the various domains of big data is by the degree of structure or organization they come with:

• Structured (retail, financial, bioinformatics, geodata)
• Semi-structured (web logs, email, documents)
• Unstructured (images, video, sensor data, web pages)

It is useful to think of data as being structured, unstructured or semi-structured. We provide some examples below, with the caveat that a formal definition that precisely delineates these categories may be elusive.

Structured Data

Structured data is exemplified by data contained in relational databases and spreadsheets. Structured data conforms to a database model, which is largely characterized by the various fields that data belongs to (name, address, age and so forth) and the data type for each field (numeric, currency, alphabetic, name, date, address). The model also has a notion of restrictions or constraints on each field (for example, integers in a certain range), and constraints between elements in the various fields that are used to enforce a notion of consistency (no duplicates, cannot be scheduled in two different places at the same time, etc.).
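As a concrete illustration, not taken from the original paper, the sketch below defines a small relational schema in SQLite; the table, columns, and constraint values are hypothetical examples of the typed fields, range restrictions, and uniqueness rules just described.

```python
import sqlite3

# A minimal sketch of structured data: typed fields, a range
# constraint, and a uniqueness ("no duplicates") constraint.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE person (
        name    TEXT NOT NULL,
        address TEXT,
        age     INTEGER CHECK (age BETWEEN 0 AND 150),  -- range constraint
        email   TEXT UNIQUE                             -- no duplicates
    )
""")
conn.execute("INSERT INTO person VALUES (?, ?, ?, ?)",
             ("Ada", "12 Example St", 36, "ada@example.com"))

try:
    # Violates the CHECK constraint: the model rejects inconsistent data.
    conn.execute("INSERT INTO person VALUES (?, ?, ?, ?)",
                 ("Bob", None, -5, "bob@example.com"))
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```

The point of the sketch is that with structured data the model itself, not downstream code, enforces consistency.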
Unstructured Data

Unstructured data (or unstructured information) refers to information that either does not have a pre-defined data model or is not organized in a predefined manner. Unstructured information is typically text-heavy, but may also contain data such as dates, numbers, and facts. Other examples include the "raw" (untagged) data representing photos and graphic images, videos, streaming sensor data, web pages, PDF files, PowerPoint presentations, emails, blog entries, wikis, and word processing documents.

Semi-Structured Data

Semi-structured data lies in between structured and unstructured data. It is a type of structured data, but lacks the strict structure imposed by an underlying data model. With semi-structured data, tags or other types of markers are used to identify certain elements within the data, but the data doesn't have a rigid structure from which complete semantic meaning can be easily extracted without much further processing. For example, word processing software now can include metadata showing the author's name and the date created, while the bulk of the document contains unstructured text. (Sophisticated learning algorithms would have to mine the text to understand what the text was about, because no model exists that classifies the text into neat categories.) As an additional nuance, the text in the document may be further tagged as including a table of contents, chapters, and sections. Emails have the sender, recipient, date, time and other fixed fields added to the unstructured data of the email message content and any attachments. Photos or other graphics can be tagged with keywords such as the creator, date, location and other content-specific keywords (such as names of people in the photos), making it possible to organize and locate graphics. XML and other markup languages are often used to manage semi-structured data (a minimal parsing sketch appears at the end of this section).

Yet another way to characterize the domains is to look at the types of industries that generate, and need to extract information from, the data:

• Financial services
• Retail
• Network security
• Large-scale science
• Social networking
• Internet of Things/sensor networks
• Visual media

Figure 3 illustrates the various domains and specific subdomains in which big data processing issues arise.

Figure 3: Data domains (verticals such as network security, social networking, finance, retail, large-scale science, visual media, and sensor data, with subdomains including intrusion detection, APTs, sentiment analysis, social graphs, high-frequency trading, bioinformatics, high-energy physics, weather, anomaly detection, behavioral analysis, scene analysis, and image/audio understanding)

Figure 4 illustrates how the big data verticals map to the time and organization axes. A case can be made that in fact all of the industries included here have use cases that encounter data at all levels of organization, and have processing needs that span all response times. In that case, the industry domain is another orthogonal axis for characterizing the domain space of big data. We would then visualize these domains by particular common use cases, and map them to industry, time, and structure.

Figure 4: Mapping the big data verticals
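To make the role of tags concrete, here is a minimal sketch, not from the original paper, that pulls the fixed, tagged fields out of a semi-structured record while leaving the free-text body for further processing; the <email> element names are hypothetical.

```python
import xml.etree.ElementTree as ET

# A semi-structured record: tagged metadata around unstructured text.
doc = """
<email>
  <sender>alice@example.com</sender>
  <recipient>bob@example.com</recipient>
  <date>2014-09-01</date>
  <body>Lunch tomorrow? Also, did you see the game last night?</body>
</email>
"""

root = ET.fromstring(doc)
# The tagged fields can be queried directly, much like structured data...
print(root.findtext("sender"), root.findtext("date"))
# ...while the body remains unstructured text that would need text mining.
print(root.findtext("body"))
```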
Compute Infrastructure

While the Hadoop ecosystem is a popular choice for processing large datasets in parallel using commodity computing resources, there are several other compute infrastructures to use in various domains. Figure 5 shows the taxonomy for the various styles of processing architectures. Computing paradigms on big data currently differ at the first level of abstraction on whether the processing will be done in batch mode, or in real time/near real time on streaming data (data that is constantly coming in and needs to be processed right away). In this section, we highlight two specific infrastructures: Hadoop for batch processing, and Spark for real-time processing.

MapReduce is a programming model and an associated implementation for processing and generating large datasets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real-world tasks are expressible in this model, as shown in the paper referenced in [2]. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to utilize the resources of a large distributed system easily.
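To illustrate the model, here is a minimal, single-machine sketch, not from the paper, of the classic word-count example: the map function emits (word, 1) pairs and the reduce function merges the values sharing each intermediate key. A real MapReduce run would shard the input and execute these functions across a cluster; the local dictionary below stands in for the distributed shuffle.

```python
from collections import defaultdict

def map_fn(document):
    """Map: emit an intermediate (key, value) pair per word."""
    for word in document.split():
        yield word.lower(), 1

def reduce_fn(key, values):
    """Reduce: merge all values that share the same intermediate key."""
    return key, sum(values)

def run_mapreduce(documents):
    # Shuffle phase: group intermediate values by key. A real framework
    # performs this grouping across machines.
    groups = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            groups[key].append(value)
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

print(run_mapreduce(["the quick brown fox", "the lazy dog", "the fox"]))
# {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```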
Bulk synchronous parallel processing [3] is a model proposed originally by Leslie Valiant. In this model, processors execute independently on local data for a number of steps. They can also communicate with other processors while computing. But they all stop to synchronize at known points in the execution; these points are called barrier synchronization points. This method ensures that deadlock or livelock problems can be detected easily.
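The following is a minimal, thread-based sketch, not from the paper, of the BSP pattern: each worker computes independently on its local slice, publishes a partial result, and stops at a barrier before reading the others' results. BSP frameworks such as Hama, Giraph, and Pregel apply the same pattern across machines rather than threads.

```python
import threading

NUM_WORKERS = 4
DATA = list(range(100))

barrier = threading.Barrier(NUM_WORKERS)   # the barrier synchronization point
partials = [0] * NUM_WORKERS               # stand-in for inter-worker messages

def worker(rank):
    local = DATA[rank::NUM_WORKERS]        # each worker owns a local slice
    # Superstep 1: compute independently on local data, then publish.
    partials[rank] = sum(local)
    barrier.wait()                         # everyone stops to synchronize
    # Superstep 2: all partial results are now visible; combine them.
    total = sum(partials)
    if rank == 0:
        print("global sum:", total)        # 4950 = sum(range(100))

threads = [threading.Thread(target=worker, args=(r,)) for r in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```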
Figure 5: Compute infrastructure (batch: Hadoop/MapReduce and bulk synchronous parallel, e.g., Hama, Giraph, Pregel; streaming: S4, InfoSphere, Storm, Spark)

Low Latency: Stream Processing

If an application demands an "immediate" response to each event as it occurs, some form of stream processing is needed, which essentially processes the data as it comes in. The general approach is to have a little bit of code that processes each of the events separately. In order to speed up the processing, the stream may be subdivided, and the computation distributed across clusters. Apache Storm is a popular framework for event processing that was developed at Twitter and promulgated by Twitter and other companies that required this paradigm of real-time processing. Other examples are Amazon's Kinesis, or the streaming capabilities of MapR. These frameworks take care of the scaling onto multiple cluster nodes and come with varying degrees of support for resilience and fault tolerance, for example, through checkpointing, to make sure the system can recover from failure.
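As a minimal illustration of the per-event approach, not taken from the paper, the sketch below hash-partitions an incoming stream across worker threads, each running a small piece of per-event code; the event shape is hypothetical, and real frameworks such as Storm run these workers on cluster nodes and add checkpointing for fault tolerance.

```python
import queue
import threading

NUM_WORKERS = 4
SENTINEL = None  # signals a worker to shut down

def handle_event(event):
    """The 'little bit of code' run on each event as it arrives."""
    user, action = event
    print("worker saw %s from %s" % (action, user))

def worker(q):
    while True:
        event = q.get()
        if event is SENTINEL:
            break
        handle_event(event)

# Subdivide the stream: one queue (partition) per worker.
queues = [queue.Queue() for _ in range(NUM_WORKERS)]
threads = [threading.Thread(target=worker, args=(q,)) for q in queues]
for t in threads:
    t.start()

# Route each event to a partition by key, so one user's events
# always land on the same worker, as stream frameworks typically do.
events = [("alice", "click"), ("bob", "view"), ("alice", "purchase")]
for event in events:
    queues[hash(event[0]) % NUM_WORKERS].put(event)

for q in queues:
    q.put(SENTINEL)
for t in threads:
    t.join()
```

Partitioning by key keeps per-key state local to one worker, which is what lets such systems scale out without coordinating on every event.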