Advanced Analytics with Spark: Patterns for Learning from Data at Scale
Title: Advanced Analytics with Spark: Patterns for Learning from Data at Scale
Author: Josh Wills, Sandy Ryza, Sean Owen, Uri Laserson
Length: 200 pages
Edition: 1
Language: English
Publisher: O'Reilly Media
Publication Date: 2015-04-25
ISBN-10: 1491912766
ISBN-13: 9781491912768

Advanced Analytics with Spark
Sandy Ryza, Uri Laserson, Sean Owen, and Josh Wills
Beijing · Cambridge · Farnham · Köln · Sebastopol · Tokyo
O'REILLY

Advanced Analytics with Spark
by Sandy Ryza, Uri Laserson, Sean Owen, and Josh Wills

Copyright © 2015 Sandy Ryza, Uri Laserson, Sean Owen, and Josh Wills. All rights reserved.
Printed in the United States of America.
Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://safaribooksonline.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Editor: Marie Beaugureau
Production Editor: Kara Ebrahim
Copyeditor: Kim Cofer
Proofreader: Rachel Monaghan
Indexer: Judy McConville
Interior Designer: David Futato
Cover Designer: Ellie Volckhausen
Illustrator: Rebecca Demarest

April 2015: First Edition

Revision History for the First Edition
2015-03-27: First Release
See http://oreilly.com/catalog/errata.csp?isbn=9781491912768 for release details.

The O'Reilly logo is a registered trademark of O'Reilly Media, Inc. Advanced Analytics with Spark, the cover image of a peregrine falcon, and related trade dress are trademarks of O'Reilly Media, Inc.

While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk.
If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-491-91276-8
[LSI]

Table of Contents

Foreword
Preface

1. Analyzing Big Data
  The Challenges of Data Science
  Introducing Apache Spark
  About This Book

2. Introduction to Data Analysis with Scala and Spark
  Scala for Data Scientists
  The Spark Programming Model
  Record Linkage
  Getting Started: The Spark Shell and SparkContext
  Bringing Data from the Cluster to the Client
  Shipping Code from the Client to the Cluster
  Structuring Data with Tuples and Case Classes
  Aggregations
  Creating Histograms
  Summary Statistics for Continuous Variables
  Creating Reusable Code for Computing Summary Statistics
  Simple Variable Selection and Scoring
  Where to Go from Here

3. Recommending Music and the Audioscrobbler Data Set
  Data Set
  The Alternating Least Squares Recommender Algorithm
  Preparing the Data
  Building a First Model
  Spot Checking Recommendations
  Evaluating Recommendation Quality
  Computing AUC
  Hyperparameter Selection
  Making Recommendations
  Where to Go from Here

4. Predicting Forest Cover with Decision Trees
  Fast Forward to Regression
  Vectors and Features
  Training Examples
  Decision Trees and Forests
  Covtype Data Set
  Preparing the Data
  A First Decision Tree
  Decision Tree Hyperparameters
  Tuning Decision Trees
  Categorical Features Revisited
  Random Decision Forests
  Making Predictions
  Where to Go from Here

5. Anomaly Detection in Network Traffic with K-means Clustering
  Anomaly Detection
  K-means Clustering
  Network Intrusion
  KDD Cup 1999 Data Set
  A First Take on Clustering
  Choosing k
  Visualization in R
  Feature Normalization
  Categorical Variables
  Using Labels with Entropy
  Clustering in Action
  Where to Go from Here

6. Understanding Wikipedia with Latent Semantic Analysis
  The Term-Document Matrix
  Getting the Data
  Parsing and Preparing the Data
  Lemmatization
  Computing the TF-IDFs
  Singular Value Decomposition
  Finding Important Concepts
  Querying and Scoring with the Low-Dimensional Representation
  Term-Term Relevance
  Document-Document Relevance
  Term-Document Relevance
  Multiple-Term Queries
  Where to Go from Here

7. Analyzing Co-occurrence Networks with GraphX
  The MEDLINE Citation Index: A Network Analysis
  Getting the Data
  Parsing XML Documents with Scala's XML Library
  Analyzing the MeSH Major Topics and Their Co-occurrences
  Constructing a Co-occurrence Network with GraphX
  Understanding the Structure of Networks
  Connected Components
  Degree Distribution
  Filtering Out Noisy Edges
  Processing EdgeTriplets
  Analyzing the Filtered Graph
  Small-World Networks
  Cliques and Clustering Coefficients
  Computing Average Path Length with Pregel
  Where to Go from Here

8. Geospatial and Temporal Data Analysis on the New York City Taxi Trip Data
  Getting the Data
  Working with Temporal and Geospatial Data in Spark
  Temporal Data with JodaTime and NScalaTime
  Geospatial Data with the Esri Geometry API and Spray
  Exploring the Esri Geometry API
  Intro to GeoJSON
  Preparing the New York City Taxi Trip Data
  Handling Invalid Records at Scale
  Geospatial Analysis
  Sessionization in Spark
  Building Sessions: Secondary Sorts in Spark
  Where to Go from Here

9. Estimating Financial Risk through Monte Carlo Simulation
  Terminology
  Methods for Calculating VaR
  Variance-Covariance
  Historical Simulation
  Monte Carlo Simulation
  Our Model
  Getting the Data
  Preprocessing
  Determining the Factor Weights
  Sampling
  The Multivariate Normal Distribution
  Running the Trials
  Visualizing the Distribution of Returns
  Evaluating Our Results
  Where to Go from Here

10. Analyzing Genomics Data and the BDG Project
  Decoupling Storage from Modeling
  Ingesting Genomics Data with the ADAM CLI
  Parquet Format and Columnar Storage
  Predicting Transcription Factor Binding Sites from ENCODE Data
  Querying Genotypes from the 1000 Genomes Project
  Where to Go from Here

11. Analyzing Neuroimaging Data with PySpark and Thunder
  Overview of PySpark
  PySpark Internals
  Overview and Installation of the Thunder Library
  Loading Data with Thunder
  Thunder Core Data Types
  Categorizing Neuron Types with Thunder
  Where to Go from Here

A. Deeper into Spark
B. Upcoming MLlib Pipelines API
Index

Foreword

Ever since we started the Spark project at Berkeley, I've been excited about not just building fast parallel systems, but helping more and more people make use of large-scale computing. This is why I'm very happy to see this book, written by four experts in data science, on advanced analytics with Spark. Sandy, Uri, Sean, and Josh have been working with Spark for a while, and have put together a great collection of content with equal parts explanations and examples.

The thing I like most about this book is its focus on examples, which are all drawn from real applications on real-world data sets. It's hard to find one, let alone ten examples that cover big data and that you can run on your laptop, but the authors have managed to create such a collection and set everything up so you can run them in Spark. Moreover, the authors cover not just the core algorithms, but the intricacies of data preparation and model tuning that are needed to really get good results. You should be able to take the concepts in these examples and directly apply them to your own problems.

Big data processing is undoubtedly one of the most exciting areas in computing today, and remains an area of fast evolution and introduction of new ideas.
I hope that this book helps you get started in this exciting new field.

Matei Zaharia, CTO at Databricks and Vice President, Apache Spark