Big Data Application Architecture Pattern Recipes provides an insight into heterogeneous infrastructures, databases, and visualization and analytics tools used for realizing the architectures of big data solutions. Its problem-solution approach helps in selecting the right architecture.

Contents at a Glance

About the Authors
About the Technical Reviewer
Acknowledgments
Introduction
Chapter 1: Big Data Introduction
Chapter 2: Big Data Application Architecture
Chapter 3: Big Data Ingestion and Streaming Patterns
Chapter 4: Big Data Storage Patterns
Chapter 5: Big Data Access Patterns
Chapter 6: Data Discovery and Analysis Patterns
Chapter 7: Big Data Visualization Patterns
Chapter 8: Big Data Deployment Patterns
Chapter 9: Big Data NFRs
Chapter 10: Big Data Case Studies
Chapter 11: Resources, References, and Tools
Appendix A: References and Bibliography
Index

Introduction

Big data is opening up new opportunities for enterprises to extract insight from huge volumes of data in real time and across multiple relational and nonrelational data types. The architectures for realizing these opportunities are based on relatively less expensive and heterogeneous infrastructures than the traditional monolithic and hugely expensive options that exist currently.

The architectures for realizing big data solutions are composed of heterogeneous infrastructures, databases, and visualization and analytics tools.
Selecting the right architecture is the key to harnessing the power of big data. However, heterogeneity brings with it multiple options for solving the same problem, as well as the need to evaluate trade-offs and validate the "fitness-for-purpose" of the solution.

There are myriad open source frameworks, databases, Hadoop distributions, and visualization and analytics tools available on the market, each one of them promising to be the best solution. How do you select the best end-to-end architecture to solve your big data problem?

• Most other big data books on the market focus on providing design patterns in the MapReduce or Hadoop area only.
• This book covers the end-to-end application architecture required to realize a big data solution, covering not only Hadoop but also analytics and visualization issues.
• Everybody knows the use cases for big data and the stories of Walmart and eBay, but nobody describes the architecture required to realize those use cases.
• If you have a problem statement, you can use the book as a reference catalog to search for the closest corresponding big data pattern and quickly use it to start building the application.
• CxOs are being approached by multiple vendors with promises of implementing the perfect big data solution. This book provides a catalog of application architectures used by peers in their industry.
• The current published content about big data architectures is meant for the scientist or the geek. This book attempts to provide a more industry-aligned view for architects.

This book will provide software architects and solution designers with a ready catalog of big data application architecture patterns that have been distilled from real-life big data applications in different industries like retail, telecommunication, banking, and insurance. The patterns in this book will provide the architecture foundation required to launch your next big data application.

CHAPTER 1

Big Data Introduction

Why Big Data

As you will see, this entire book is in problem-solution format.
This chapter discusses topics in big data in a general sense, so it is not as technical as other chapters. The idea is to make sure you have a basic foundation for learning about big data. Other chapters will provide depth of coverage that we hope you will find useful no matter what your background. So let's get started.

Problem

What is the need for big data technology when we have robust, high-performing relational database management systems (RDBMS)?

Solution

Since Dr. E. F. Codd postulated the theory of relational databases in 1970 (later codified as "Codd's 12 rules"), most data has been stored in a structured format, with primary keys, rows, columns, tuples, and foreign keys. Initially, it was just transactional data, but as more and more data accumulated, organizations started analyzing the data in an offline mode using data warehouses and data marts. Data analytics and business intelligence (BI) became the primary drivers for CxOs to make forecasts, define budgets, and determine new market drivers of growth.

This analysis was initially conducted on data within the enterprise. However, as the Internet connected the entire world, data existing outside an organization became a substantial part of daily transactions. Things were heating up and the data was getting voluminous, but organizations were still in control with normal querying of transactional data. That data was more or less structured or relational.

Things really started getting complex in terms of the variety and velocity of data with the advent of social networking sites and search engines like Google.
Online commerce via sites like Amazon.com also added to this explosion of data. Traditional analysis methods as well as storage of data in central servers were proving inefficient and expensive. Organizations like Google, Facebook, and Amazon built their own custom methods to store, process, and analyze this data by leveraging concepts like MapReduce, Hadoop distributed file systems, and NoSQL databases.

The advent of mobile devices and cloud computing has added to the amount and pace of data creation in the world, so much so that 90 percent of the world's total data has been created in the last two years, and 70 percent of it by individuals, not enterprises or organizations. By the end of 2013, IDC predicts that just under 4 trillion gigabytes of data will exist on earth. Organizations need to collect this data from social media feeds, images, streaming video, text files, documents, meter data, and so on to innovate, respond immediately to customer needs, and make quick decisions to avoid being annihilated by competition.

However, as I mentioned, the problem of big data is not just about volume. The unstructured nature of the data (variety) and the speed at which it is created by you and me (velocity) are the real challenge of big data.

Aspects of Big Data

Problem

What are the key aspects of a big data system?

Solution

A big data solution must address the three Vs of big data (volume, velocity, and variety) along with the complexity of the data.

Velocity of the data is used to define the speed with which different types of data enter the enterprise and are then analyzed.

Variety addresses the unstructured nature of the data found in weblogs, radio-frequency ID (RFID), meter data, stock-ticker data, tweets, images, and video files on the Internet, in contrast to structured data.

For a data solution to be considered as big data, the volume has to be at least in the range of 30-50 terabytes (TBs). However, large volume alone is not an indicator of a big data problem.
A small amount of data could have multiple sources of different types, both structured and unstructured, and that would also be classified as a big data problem.

How Big Data Differs from Traditional BI

Problem

Can we use traditional business intelligence (BI) solutions to process big data?

Solution

Traditional BI methodology works on the principle of assembling all the enterprise data in a central server. The data is generally analyzed in an offline mode. The online transaction processing (OLTP) transactional data is transferred to a denormalized environment called a data warehouse. The data is usually structured in an RDBMS, with very little unstructured data.

A big data solution, however, is different in all aspects from a traditional BI solution:

• Data is retained in a distributed file system instead of on a central server.
• The processing functions are taken to the data rather than data being taken to the functions.
• Data is of different formats, both structured and unstructured.
• Data is both real-time data and offline data.
• Technology relies on massively parallel processing (MPP) concepts.

How Big Is the Opportunity?

Problem

What is the potential big data opportunity?

Solution

The amount of data is growing all around us every day, coming from various channels (see Figure 1-1). As 70 percent of all data is created by individuals who are customers of one enterprise or another, organizations cannot ignore this important source of feedback from the customer as well as insight into customer behavior.

Content Type             Quantity             Comment
Internet                 Exabytes (10^18)     1 exabyte = 1,000,000 terabytes
Web Pages                15 Trillion (10^12)  Plus dark Web
Tweets                   20 Billion (10^9)    50 million user accounts
Live Posts               2.1 Billion (10^9)   Forums, discussion boards
Social Members           2.1 Billion (10^9)   Memberships, top 115 social sites
Social Content Creators  600 Million (10^6)   People (35% of Internet users)
Facebook Members         500 Million (10^6)   40% of online hours, top 10 properties
YouTube Visitors         375 Million (10^6)   As of December 2009
Blogs                    70 Million (10^6)    36. 718 listed on Technorati
Formal Periodicals       Thousands (10^3)     Newspapers, other publications

Figure 1-1. Information explosion

Big data drove an estimated $28 billion in IT spending last year, according to market researcher Gartner, Inc. That figure will rise to $34 billion in 2013 and $232 billion in IT spending through 2016, Gartner estimates.

The main reason for this growth is the potential that chief information officers (CIOs) see in the greater insights and intelligence contained in the huge volumes of unstructured data they have been receiving from outside the enterprise. Unstructured data analysis requires new systems of record (for example, NoSQL databases) so that organizations can forecast better and align their strategic plans and initiatives.

Deriving Insight from Data

Problem

What are the different insights and inferences that big data analysis provides in different industries?

Solution

Companies are deriving significant insights by analyzing big data that gives a combined view of both structured and unstructured customer data. They are seeing increased customer satisfaction, loyalty, and revenue.
For example:

• Energy companies monitor and combine usage data recorded from smart meters in real time to provide better service to their consumers and improved uptime.
• Web sites and television channels are able to customize their advertisement strategies based on viewer household demographics and program viewing patterns.
• Fraud-detection systems are analyzing behaviors and correlating activities across multiple data sets from social media analysis.
• High-tech companies are using big data infrastructure to analyze application logs to improve troubleshooting, decrease security violations, and perform predictive application maintenance.
• Social media content analysis is being used to assess customer sentiment and improve products, services, and customer interaction.

These are just some of the insights that different enterprises are gaining from their big data applications.

Cloud Enabled Big Data

Problem

How is big data affected by cloud-based virtualized environments?

Solution

The inexpensive option of storage that big data and Hadoop deliver is very well aligned to the "everything as a service" option that cloud computing offers.

Infrastructure as a Service (IaaS) allows the CIO a "pay as you go" option to handle big data analysis. This virtualized option provides the efficiency needed to process and manage large volumes of structured and unstructured data in a cluster of inexpensive virtual machines. This distributed environment gives enterprises access to very flexible and elastic resources to analyze structured and unstructured data.

MapReduce works well in a virtualized environment with respect to storage and computing. Also, an enterprise might not have the finances to procure the array of inexpensive machines for its first pilot. Virtualization enables companies to tackle larger problems that have not yet been scoped without a huge upfront investment.
It allows companies to scale up as well as scale down to support the variety of big data configurations that may be required.

Amazon Elastic MapReduce (EMR) is a public cloud option that provides better scaling functionality and performance for MapReduce. Each one of the map and reduce tasks needs to be executed discretely, where the tasks are parallelized and configured to run in a virtual environment. EMR encapsulates the MapReduce engine in a virtual container so that you can split your tasks across a host of virtual machine (VM) instances.

As you can see, cloud computing and virtualization have brought the power of big data to both small and large enterprises.

Structured vs. Unstructured Data

Problem

What are the various data types, both within and outside the enterprise, that can be analyzed in a big data solution?

Solution

Structured data will continue to be analyzed in an enterprise using structured access methods like Structured Query Language (SQL). However, big data systems provide tools and structures for analyzing unstructured data.

New sources of data that contribute to the unstructured data are sensors, web logs, and human-generated interaction data like click streams, tweets, Facebook chats, mobile text messages, e-mails, and so forth.

RDBMS systems will continue to exist with a predefined schema and table structure. Unstructured data is data stored in different structures and formats, unlike in a relational database, where the data is stored in a fixed row-column structure.
The presence of this hybrid mix of data makes big data analysis complex, as decisions need to be made regarding whether all this data should be first merged and then analyzed or whether only an aggregated view from different sources has to be compared.

We will see different methods in this book for making these decisions based on various functional and nonfunctional priorities.

Analytics in the Big Data World

Problem

How do I analyze unstructured data, now that I do not have SQL-based tools?

Solution

Analyzing unstructured data involves identifying patterns in text, video, images, and other such content. This is different from a conventional search, which brings up the relevant document based on the search string. Text analytics is about searching for repetitive patterns within documents, e-mails, conversations, and other data to draw inferences and insights.

Unstructured data is analyzed using methods like natural language processing (NLP), data mining, master data management (MDM), and statistics. Text analytics use NoSQL databases to standardize the structure of the data so that it can be analyzed using query languages like Pig, Hive, and others. The analysis and extraction processes take advantage of techniques that originated in linguistics, statistics, and numerical analysis.

Big Data Challenges

Problem

What are the key big data challenges?

Solution

There are multiple challenges that this great opportunity has thrown at us.

One of the very basic challenges is to separate and prioritize the useful data from the garbage that is coming into the enterprise. Ninety percent of all the data is noise, and it is a daunting task to classify and filter the knowledge from the noise.

In the search for inexpensive methods of analysis, organizations have to compromise and balance against the confidentiality requirements of the data. The use of cloud computing and virtualization further complicates the decision to host big data solutions outside the enterprise.
But using those technologies is a trade-off against the cost of ownership that every organization has to deal with.

Data is piling up so rapidly that it is becoming costlier to archive it. Organizations struggle to determine how long this data has to be retained. This is a tricky question, as some data is useful for making long-term decisions, while other data is not relevant even a few hours after it has been generated and analyzed and insight has been obtained.

With the advent of new technologies and tools required to build big data solutions, availability of skills is a big challenge for CIOs. A higher level of proficiency in the data sciences is required to implement big data solutions today because the tools are not user-friendly yet. They still require computer science graduates to configure and operationalize a big data system.

Defining a Reference Architecture

Problem

Is there a high-level conceptual reference architecture for a big data landscape that's similar to cloud-computing architectures?

Solution

Analogous to the cloud architectures, the big data landscape can be divided into four layers, shown vertically in Figure 1-2:

• Infrastructure as a Service (IaaS): This includes the storage, servers, and network as the base, inexpensive commodities of the big data stack. This stack can be bare metal or virtual (cloud). The distributed file systems are part of this layer.
• Platform as a Service (PaaS): The NoSQL data stores and distributed caches that can be logically queried using query languages form the platform layer of big data.
  This layer provides the logical model for the raw, unstructured data stored in the files.
• Data as a Service (DaaS): The entire array of tools available for integrating with the PaaS layer using search engines, integration adapters, batch programs, and so on is housed in this layer. The APIs available at this layer can be consumed by all endpoint systems in an elastic-computing mode.
• Big Data Business Functions as a Service (BFaaS): Specific industries (like health, retail, ecommerce, energy, and banking) can build packaged applications that serve a specific business need and leverage the DaaS layer for cross-cutting data functions.

Figure 1-2. Big data architecture layers (from top to bottom: industry business functions; big data analysis and visualization tools; NoSQL and relational databases; big data storage and infrastructure layer)

You will see a detailed big data application architecture in the next chapter that is essentially based on this four-layer reference architecture.
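
To make the layering concrete, the four layers can be sketched as a tiny in-memory Python program. This is only an illustration under stated assumptions, not an implementation from any real product: the class names, the pipe-delimited record format, and the file name feedback.log are all hypothetical, and a plain dictionary stands in for a distributed file system. What it tries to show is that each layer talks only to the layer directly beneath it.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class InfrastructureLayer:
    """IaaS: raw storage. A dict stands in for a distributed file system (assumption)."""
    files: dict = field(default_factory=dict)

    def write(self, path: str, lines: list) -> None:
        self.files[path] = list(lines)

    def read(self, path: str) -> list:
        return self.files[path]


@dataclass
class PlatformLayer:
    """PaaS: gives the raw lines a logical model (one dict per record)."""
    storage: InfrastructureLayer

    def records(self, path: str) -> list:
        parsed = []
        for line in self.storage.read(path):
            # Hypothetical record format: customer|channel|free text
            customer, channel, text = line.split("|", 2)
            parsed.append({"customer": customer, "channel": channel, "text": text})
        return parsed


@dataclass
class DataServiceLayer:
    """DaaS: query-style functions that any endpoint system could call."""
    platform: PlatformLayer

    def count_by(self, path: str, key: str) -> Counter:
        return Counter(record[key] for record in self.platform.records(path))


@dataclass
class RetailFeedbackFunction:
    """BFaaS: an industry-specific function built on top of the DaaS layer."""
    data: DataServiceLayer

    def busiest_channel(self, path: str) -> str:
        return self.data.count_by(path, "channel").most_common(1)[0][0]


# Wire the layers together from the bottom up.
storage = InfrastructureLayer()
storage.write("feedback.log", [
    "c1|tweet|late delivery",
    "c2|email|great service",
    "c3|tweet|wrong item",
])
app = RetailFeedbackFunction(DataServiceLayer(PlatformLayer(storage)))
print(app.busiest_channel("feedback.log"))  # prints "tweet"
```

In this sketch the business function never touches storage directly, so the storage stand-in could in principle be swapped for a real distributed file system without changing the upper layers.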