Advanced.Analytics.with.Spark.Patterns.for.Learning.from.Data.at.Scale

zhanglin73346197 15 0 PDF 2020-07-29 12:07:57

Title: Advanced Analytics with Spark: Patterns for Learning from Data at Scale
Author: Josh Wills, Sandy Ryza, Sean Owen, Uri Laserson
Length: 200 pages
Edition: 1
Language: English
Publisher: O'Reilly Media
Publication Date: 2015-04-25
ISBN-10: 1491912766
ISBN-13: 9781491912768

Apache Spark is eme

Advanced Analytics with Spark
Sandy Ryza, Uri Laserson, Sean Owen, and Josh Wills
Beijing · Cambridge · Farnham · Köln · Sebastopol · Tokyo
O'REILLY

Advanced Analytics with Spark
by Sandy Ryza, Uri Laserson, Sean Owen, and Josh Wills

Copyright © 2015 Sandy Ryza, Uri Laserson, Sean Owen, and Josh Wills. All rights reserved.
Printed in the United States of America.
Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://safaribooksonline.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Editor: Marie Beaugureau
Production Editor: Kara Ebrahim
Copyeditor: Kim Cofer
Proofreader: Rachel Monaghan
Indexer: Judy McConville
Interior Designer: David Futato
Cover Designer: Ellie Volckhausen
Illustrator: Rebecca Demarest

April 2015: First Edition

Revision History for the First Edition
2015-03-27: First Release
See http://oreilly.com/catalog/errata.csp?isbn=9781491912768 for release details.

The O'Reilly logo is a registered trademark of O'Reilly Media, Inc. Advanced Analytics with Spark, the cover image of a peregrine falcon, and related trade dress are trademarks of O'Reilly Media, Inc.

While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-491-91276-8
[LSI]

Table of Contents

Foreword
Preface

1. Analyzing Big Data
    The Challenges of Data Science
    Introducing Apache Spark
    About This Book

2. Introduction to Data Analysis with Scala and Spark
    Scala for Data Scientists
    The Spark Programming Model
    Record Linkage
    Getting Started: The Spark Shell and SparkContext
    Bringing Data from the Cluster to the Client
    Shipping Code from the Client to the Cluster
    Structuring Data with Tuples and Case Classes
    Aggregations
    Creating Histograms
    Summary Statistics for Continuous Variables
    Creating Reusable Code for Computing Summary Statistics
    Simple Variable Selection and Scoring
    Where to Go from Here

3. Recommending Music and the Audioscrobbler Data Set
    Data Set
    The Alternating Least Squares Recommender Algorithm
    Preparing the Data
    Building a First Model
    Spot Checking Recommendations
    Evaluating Recommendation Quality
    Computing AUC
    Hyperparameter Selection
    Making Recommendations
    Where to Go from Here

4. Predicting Forest Cover with Decision Trees
    Fast Forward to Regression
    Vectors and Features
    Training Examples
    Decision Trees and Forests
    Covtype Data Set
    Preparing the Data
    A First Decision Tree
    Decision Tree Hyperparameters
    Tuning Decision Trees
    Categorical Features Revisited
    Random Decision Forests
    Making Predictions
    Where to Go from Here

5. Anomaly Detection in Network Traffic with K-means Clustering
    Anomaly Detection
    K-means Clustering
    Network Intrusion
    KDD Cup 1999 Data Set
    A First Take on Clustering
    Choosing k
    Visualization in R
    Feature Normalization
    Categorical Variables
    Using Labels with Entropy
    Clustering in Action
    Where to Go from Here

6. Understanding Wikipedia with Latent Semantic Analysis
    The Term-Document Matrix
    Getting the Data
    Parsing and Preparing the Data
    Lemmatization
    Computing the TF-IDFs
    Singular Value Decomposition
    Finding Important Concepts
    Querying and Scoring with the Low-Dimensional Representation
    Term-Term Relevance
    Document-Document Relevance
    Term-Document Relevance
    Multiple-Term Queries
    Where to Go from Here

7. Analyzing Co-occurrence Networks with GraphX
    The MEDLINE Citation Index: A Network Analysis
    Getting the Data
    Parsing XML Documents with Scala's XML Library
    Analyzing the MeSH Major Topics and Their Co-occurrences
    Constructing a Co-occurrence Network with GraphX
    Understanding the Structure of Networks
    Connected Components
    Degree Distribution
    Filtering Out Noisy Edges
    Processing EdgeTriplets
    Analyzing the Filtered Graph
    Small-World Networks
    Cliques and Clustering Coefficients
    Computing Average Path Length with Pregel
    Where to Go from Here

8. Geospatial and Temporal Data Analysis on the New York City Taxi Trip Data
    Getting the Data
    Working with Temporal and Geospatial Data in Spark
    Temporal Data with JodaTime and NScalaTime
    Geospatial Data with the Esri Geometry API and Spray
    Exploring the Esri Geometry API
    Intro to GeoJSON
    Preparing the New York City Taxi Trip Data
    Handling Invalid Records at Scale
    Geospatial Analysis
    Sessionization in Spark
    Building Sessions: Secondary Sorts in Spark
    Where to Go from Here

9. Estimating Financial Risk through Monte Carlo Simulation
    Terminology
    Methods for Calculating VaR
    Variance-Covariance
    Historical Simulation
    Monte Carlo Simulation
    Our Model
    Getting the Data
    Preprocessing
    Determining the Factor Weights
    Sampling
    The Multivariate Normal Distribution
    Running the Trials
    Visualizing the Distribution of Returns
    Evaluating Our Results
    Where to Go from Here

10. Analyzing Genomics Data and the BDG Project
    Decoupling Storage from Modeling
    Ingesting Genomics Data with the ADAM CLI
    Parquet Format and Columnar Storage
    Predicting Transcription Factor Binding Sites from ENCODE Data
    Querying Genotypes from the 1000 Genomes Project
    Where to Go from Here

11. Analyzing Neuroimaging Data with PySpark and Thunder
    Overview of PySpark
    PySpark Internals
    Overview and Installation of the Thunder Library
    Loading Data with Thunder
    Thunder Core Data Types
    Categorizing Neuron Types with Thunder
    Where to Go from Here

A. Deeper into Spark
B. Upcoming MLlib Pipelines API
Index

Foreword

Ever since we started the Spark project at Berkeley, I've been excited about not just building fast parallel systems, but helping more and more people make use of large-scale computing. This is why I'm very happy to see this book, written by four experts in data science, on advanced analytics with Spark. Sandy, Uri, Sean, and Josh have been working with Spark for a while, and have put together a great collection of content with equal parts explanations and examples.

The thing I like most about this book is its focus on examples, which are all drawn from real applications on real-world data sets. It's hard to find one, let alone ten, examples that cover big data and that you can run on your laptop, but the authors have managed to create such a collection and set everything up so you can run them in Spark. Moreover, the authors cover not just the core algorithms, but the intricacies of data preparation and model tuning that are needed to really get good results. You should be able to take the concepts in these examples and directly apply them to your own problems.

Big data processing is undoubtedly one of the most exciting areas in computing today, and remains an area of fast evolution and introduction of new ideas. I hope that this book helps you get started in this exciting new field.

Matei Zaharia, CTO at Databricks and Vice President, Apache Spark

User reviews

lengthy20347 2020-07-29 12:07:10

I happen to be learning Spark right now. The content is clear, and it makes a good introductory read.

relax_42744 2020-07-29 12:07:09

Very good book on Spark analytics.

data86663 2020-07-29 12:07:09

A very useful book for Spark and machine learning. Thank you so much.

夏落晚风 2020-07-29 12:07:08

Good, thank you very much.

yz_wp 2020-07-29 12:07:07

Jumping on the big data bandwagon: I downloaded it to get up to speed, and it's pretty good.

Learning Spark

Learning Spark, English edition, PDF

20 2019-04-12
Communicationefficient learning of deep networks from decentralized data.pdf

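Modern mobile devices have access to a wealth of data suitable for learning models, which in turn can greatly improve the user experience on the device. For example, language models can improve speech recognition and text entry, and image models can automatically select good photos. However, this rich data is often privacy sensitive, and its quantity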

33 2020-05-17
Learning From Data2nd Ed.pdf

Learning From Data 2nd Ed.pdf

10 2021-04-29
Large Scale Learning to Rank

Large Scale Learning to Rank, a paper from Google

64 2018-12-09
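HR Analytics from Kaggle source code

HR analyst. If you would also like to see other notebooks and solutions, you can find this challenge on Kaggle: Context and content. A company active in big data and data science wants to hire data scientists from among the people who have successfully passed some of the company's courses. Many people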

6 2021-04-25
Learning Spark_Lighting Fast Data Analysis.pdf

LearningSpark-LightingFastDataAnalysis.pdf, by Holden Karau, Andy Konwinski, Patrick Wendell & Matei Zaharia

29 2019-07-06
Learning.Spark.Lightning_Fast.Big.Data.Analysis.pdf

Learning Spark, PDF format. One of the few books on Spark, and worth a close read.

19 2019-07-08
Python Advanced Predictive Analytics epub

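Python Advanced Predictive Analytics, English EPUB. This resource was reposted from the web; in case of infringement, please contact the uploader or CSDN to have it removed. For details about this book, search for it on Amazon's US site.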

35 2019-07-09
Learning IBM Watson Analytics

Today, only a small portion of businesses actually use a real analytical tool as part of routine decision making. IBM Wa

38 2019-08-12
Apache Spark the Analytics Operating System

VIA Anjul Bhambhri, VP of Big Data Engineering, IBM

36 2019-04-17
