Paperback: 277 pages
Publisher: Apress; 1st ed. 2015 edition (December 25, 2015)
Language: English
ISBN-10: 1484209656
ISBN-13: 978-1484209653

Big Data Analytics with Spark is a step-by-step guide for learning Spark, which is an open-source, fast, and general-purpose cluster computing framework for l…

Scalable Big Data Architecture

Copyright 2016 by Bahaaldine Azarmi

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection with reviews or scholarly analysis or material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of this publication or parts thereof is permitted only under the provisions of the Copyright Law of the Publisher's location, in its current version, and permission for use must always be obtained from Springer. Permissions for use may be obtained through RightsLink at the Copyright Clearance Center. Violations are liable to prosecution under the respective Copyright Law.

ISBN-13 (pbk): 978-1-4842-1327-8
ISBN-13 (electronic): 978-1-4842-1326-1

Trademarked names, logos, and images may appear in this book.
Rather than use a trademark symbol with every occurrence of a trademarked name, logo, or image, we use the names, logos, and images only in an editorial fashion and to the benefit of the trademark owner, with no intention of infringement of the trademark.

The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.

Managing Director: Welmoed Spahr
Lead Editor: Celestin Suresh John
Development Editor: Douglas Pundick
Technical Reviewers: Sundar Rajan Raman and Manoj Patil
Editorial Board: Steve Anglin, Pramila Balen, Louise Corrigan, Jim DeWolf, Jonathan Gennick, Robert Hutchinson, Celestin Suresh John, Michelle Lowman, James Markham, Susan McDermott, Matthew Moodie, Jeffrey Pepper, Douglas Pundick, Ben Renow-Clarke, Gwenan Spearing
Coordinating Editor: Jill Balzano
Copy Editors: Rebecca Rider, Laura Lawrie, and Kim Wimpsett
Compositor: SPi Global
Indexer: SPi Global
Artist: SPi Global
Cover Designer: Anna Ishchenko

Distributed to the book trade worldwide by Springer Science+Business Media New York, 233 Spring Street, 6th Floor, New York, NY 10013.
Phone 1-800-SPRINGER, fax (201) 348-4505, e-mail orders-ny@springer-sbm.com, or visit www.springeronline.com. Apress Media, LLC is a California LLC and the sole member (owner) is Springer Science + Business Media Finance Inc (SSBM Finance Inc). SSBM Finance Inc is a Delaware corporation.

For information on translations, please e-mail rights@apress.com, or visit www.apress.com.

Apress and friends of ED books may be purchased in bulk for academic, corporate, or promotional use. eBook versions and licenses are also available for most titles. For more information, reference our source code or other supplementary material referenced by the author in this text is available to readers at www.apress.com. For detailed information about how to locate your book's source code, go to www.apress.com/source-code/.

For Aurelia and June.

Contents at a Glance

About the Author
About the Technical Reviewers
Chapter 1: The Big (Data) Problem
Chapter 2: Early Big Data with NoSQL
Chapter 3: Defining the Processing Topology
Chapter 4: Streaming Data
Chapter 5: Querying and Analyzing Patterns
Chapter 6: Learning From Your Data?
Chapter 7: Governance Considerations
Index

Contents

About the Author
About the Technical Reviewers

Chapter 1: The Big (Data) Problem
  Identifying Big Data Symptoms
  Size Matters
  Typical Business Use Cases
  Understanding the Big Data Project's Ecosystem
  Hadoop Distribution
  Data Acquisition
  Processing Language
  Machine Learning
  NoSQL Stores
  Creating the Foundation of a Long-Term Big Data Architecture
  Architecture Overview
  Log Ingestion Application
  Learning Application
  Processing Engine
  Search Engine
  Summary

Chapter 2: Early Big Data with NoSQL
  NoSQL Landscape
  Key/Value
  Column
  Document
  Graph
  NoSQL in Our Use Case
  Introducing Couchbase
  Architecture
  Cluster Manager and Administration Console
  Managing Documents
  Introducing ElasticSearch
  Architecture
  Monitoring ElasticSearch
  Search with ElasticSearch
  Using NoSQL as a Cache in a SQL-based Architecture
  Caching Document
  ElasticSearch Plug-in for Couchbase with Couchbase XDCR
  ElasticSearch Only
  Summary

Chapter 3: Defining the Processing Topology
  First Approach to Data Architecture
  A Little Bit of Background
  Dealing with the Data Sources
  Processing the Data
  Splitting the Architecture
  Batch Processing
  Stream Processing
  The Concept of a Lambda Architecture
  Summary

Chapter 4: Streaming Data
  Streaming Architecture
  Architecture Diagram
  Technologies
  The Anatomy of the Ingested Data
  Clickstream Data
  The Raw Data
  The Log Generator
  Setting Up the Streaming Architecture
  Shipping the Logs in Apache Kafka
  Draining the Logs from Apache Kafka
  Summary

Chapter 5: Querying and Analyzing Patterns
  Defining an Analytics Strategy
  Continuous Processing
  Real-Time Querying
  Process and Index Data Using Spark
  Preparing the Spark Project
  Understanding a Basic Spark Application
  Implementing the Spark Streamer
  Implementing a Spark Indexer
  Implementing a Spark Data Processing
  Data Analytics with Elasticsearch
  Introduction to the Aggregation Framework
  Visualize Data in Kibana
  Summary

Chapter 6: Learning From Your Data?
  Introduction to Machine Learning
  Supervised Learning
  Unsupervised Learning
  Machine Learning with Spark
  Adding Machine Learning to Our Architecture
  Adding Machine Learning to Our Architecture
  Enriching the Clickstream Data
  Labelizing the Data
  Training and Making Prediction
  Summary

Chapter 7: Governance Considerations
  Dockerizing the Architecture
  Introducing Docker
  Installing Docker
  Creating Your Docker Images
  Composing the Architecture
  Architecture Scalability
  Sizing and Scaling the Architecture
  Monitoring the Infrastructure Using the Elastic Stack
  Considering Security
  Summary

Index

About the Author

Bahaaldine Azarmi, Baha for short, is a Solutions Architect at Elastic. Prior to this position, Baha co-founded ReachFive, a marketing data platform focused on user behavior and social
analytics. Baha has also worked for different software vendors such as Talend and Oracle, where he has held positions such as Solutions Architect and Architect. Baha is based in Paris and has a master's degree in computer science from Polytech'Paris. You can find him at linkedin.com/in/bahaaldine.

