High-speed distributed computing made easy with Spark

Overview
  Implement Spark's interactive shell to prototype distributed applications
  Deploy Spark jobs to various clusters such as Mesos, EC2, Chef, YARN, EMR, and so on
  Use Shark's SQL query-like syntax with Spark

In Detail
Spark is a framework

Fast Data Processing with Spark

Copyright © 2013 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author nor Packt Publishing and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: October 2013
Production Reference: 1151013

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK

ISBN 978-1-78216-706-8
www.packtpub.com

Cover Image by Suresh Mogre (suresh.mogre.99@gmail.com)

Credits

Author: Holden Karau
Reviewers: Wayne Allan, Andrea Mostosi, Reynold Xin
Acquisition Editor: Kunal Parikh
Commissioning Editor: Shaon Basu
Technical Editors: Krutika Parab, Nadeem N Bagan
Project Coordinator: Amey Sawant
Copy Editors: Brandt D'Mello, Lavina Pereira, Tanvi Gaitonde, Dipti Kapadia
Proofreader: Jonathan Todd
Indexer: Rekha Nair
Production Coordinator: Manu Joseph
Cover Work:

About the Author

Holden Karau is a transgendered software developer from Canada. Holden graduated from the University of Waterloo in 2009 with a Bachelors of Mathematics in Computer
Science. She currently works as a Software Development Engineer at Google. She has worked at Foursquare, where she was introduced to Scala. She worked on search and classification problems at Amazon. Open Source development has been a passion of Holden's from a very young age, and a number of her projects have been covered on Slashdot. Outside of programming, she enjoys playing with fire, welding, and dancing. You can learn more at her website (http://www.holdenkarau.com), blog (http://blog.holdenkarau.com), and GitHub (https://github.com/holdenk).

I'd like to thank everyone who helped review early versions of this book, especially Syed Albiz, Marc Burns, Peter J. MacDonald, Norbert Hu, and Noah Fiedel.

About the Reviewers

Andrea Mostosi is a passionate software developer. He started software development in 2003 at high school with a single-node LAMP stack and grew with it by adding more languages, components, and nodes. He graduated in Milan and worked on several web-related projects. He is currently working with data, trying to discover information hidden behind huge datasets.

I would like to thank my girlfriend, Khadija, who lovingly supports me in everything I do, and the people I collaborated with, for fun or for work, for everything they taught me. I'd also like to thank Packt Publishing and its staff for the opportunity to contribute to this book.

Reynold Xin is an Apache Spark committer and the lead developer for Shark and GraphX, two computation frameworks built on top of Spark. He is also a co-founder of Databricks, which works on transforming large-scale data analysis through the Apache Spark platform. Before Databricks, he was pursuing a PhD in the UC Berkeley AMPLab, the birthplace of Spark.

Aside from engineering open source projects, he frequently speaks at Big Data academic and industrial conferences on topics related to databases, distributed systems, and data analytics.
He also taught Palestinian and Israeli high-school students Android programming in his spare time.

www.PacktPub.com

Support files, eBooks, discount offers and more

You might want to visit www.PacktPub.com for support files and downloads related to your book.

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at service@packtpub.com for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

http://PacktLib.PacktPub.com

Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can access, read and search across Packt's entire library of books.

Why Subscribe?
  Fully searchable across every book published by Packt
  Copy and paste, print and bookmark content
  On demand and accessible via web browser

Free Access for Packt account holders

If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view nine entirely free books.
Simply use your login credentials for immediate access.

Table of Contents

Preface

Chapter 1: Installing Spark and Setting Up Your Cluster
  Running Spark on a single machine
  Running Spark on EC2
  Running Spark on EC2 with the scripts
  Deploying Spark on Elastic MapReduce
  Deploying Spark with Chef (opscode)
  Deploying Spark on Mesos
  Deploying Spark on YARN
  Deploying set of machines over SSH
  Links and references
  Summary

Chapter 2: Using the Spark Shell
  Loading a simple text file
  Using the Spark shell to run logistic regression
  Interactively loading data from S3
  Summary

Chapter 3: Building and Running a Spark Application
  Building your Spark project with sbt
  Building your Spark job with Maven
  Building your Spark job with something else
  Summary

Chapter 4: Creating a SparkContext
  Scala
  Java
  Shared Java and Scala APIs
  Python
  Links and references
  Summary

Chapter 5: Loading and Saving Data in Spark
  RDDs
  Loading data into an RDD
  Saving your data
  Links and references
  Summary

Chapter 6: Manipulating Your RDD
  Manipulating your RDD in Scala and Java
  Scala RDD functions
  Functions for joining PairRDD functions
  Other PairRDD functions
  DoubleRDD functions
  General RDD functions
  Java RDD functions
  Spark Java function classes
  Common Java RDD functions
  Methods for combining JavaPairRDD functions
  JavaPairRDD functions
  Manipulating your RDD in Python
  Standard RDD functions
  PairRDD functions
  Links and references
  Summary

Chapter 7: Shark - Using Spark with Hive
  Why Hive/Shark?
  Installing Shark
  Running Shark
  Loading data
  Using Hive queries in a Spark program
  Links and references
  Summary

Chapter 8: Testing
  Testing in Java and Scala
  Refactoring your code for testability
  Testing interactions with SparkContext
  Testing in Python
  Links and references
  Summary

Chapter 9: Tips and Tricks
  Where to find logs?
  Concurrency limitations
  Memory usage and garbage collection
  Serialization
  IDE integration
  Using Spark with other languages
  A quick note on security
  Mailing lists
  Links and references
  Summary

Index