About This Book

Learn why and how you can efficiently use Python to process data and build machine learning models in Apache Spark 2.0.
Develop and deploy efficient, scalable real-time Spark solutions.
Take your understanding of using Spark with Python to the next level with this jump start guide.

Table of Contents

Learning PySpark
Credits
Foreword
About the Authors
About the Reviewer
www.packtpub.com
Customer Feedback
Preface
    What this book covers
    What you need for this book
    Who this book is for
    Conventions
    Reader feedback
    Customer support
    Downloading the example code
    Downloading the color images of this book
    Errata
    Piracy
    Questions

1. Understanding Spark
    What is Apache Spark?
    Spark Jobs and APIs
    Execution process
    Resilient Distributed Dataset
    DataFrames
    Datasets
    Catalyst Optimizer
    Project Tungsten
    Spark 2.0 architecture
    Unifying Datasets and DataFrames
    Introducing SparkSession
    Tungsten phase 2
    Structured Streaming
    Continuous applications
    Summary

2. Resilient Distributed Datasets
    Internal workings of an RDD
    Creating RDDs
    Schema
    Reading from files
    Lambda expressions
    Global versus local scope
    Transformations
    The .map(...) transformation
    The .filter(...) transformation
    The .flatMap(...) transformation
    The .distinct(...) transformation
    The .sample(...) transformation
    The .leftOuterJoin(...) transformation
    The .repartition(...) transformation
    Actions
    The .take(...) method
    The .collect(...) method
    The .reduce(...) method
    The .count(...) method
    The .saveAsTextFile(...) method
    The .foreach(...) method
    Summary

3. DataFrames
    Python to RDD communications
    Catalyst Optimizer refresh
    Speeding up PySpark with DataFrames
    Creating DataFrames
    Generating our own JSON data
    Creating a DataFrame
    Creating a temporary table
    Simple DataFrame queries
    DataFrame API query
    SQL query
    Interoperating with RDDs
    Inferring the schema using reflection
    Programmatically specifying the schema
    Querying with the DataFrame API
    Number of rows
    Running filter statements
    Querying with SQL
    Number of rows
    Running filter statements using the where clauses
    DataFrame scenario - on-time flight performance
    Preparing the source datasets
    Joining flight performance and airports
    Visualizing our flight-performance data
    Spark Dataset API
    Summary

4. Prepare Data for Modeling
    Checking for duplicates, missing observations, and outliers
    Duplicates
    Missing observations
    Outliers
    Getting familiar with your data
    Descriptive statistics
    Correlations
    Visualization
    Histograms
    Interactions between features
    Summary

5. Introducing MLlib
    Overview of the package
    Loading and transforming the data
    Getting to know your data
    Descriptive statistics
    Correlations
    Statistical testing
    Creating the final dataset
    Creating an RDD of LabeledPoints
    Splitting into training and testing
    Predicting infant survival
    Logistic regression in MLlib
    Selecting only the most predictable features
    Random forest in MLlib
    Summary

6. Introducing the ML Package
    Overview of the package
    Transformer
    Estimators
    Classification
    Regression
    Clustering
    Pipeline
    Predicting the chances of infant survival with ML
    Loading the data
    Creating transformers
    Creating an estimator
    Creating a pipeline
    Fitting the model
    Evaluating the performance of the model
    Saving the model
    Parameter hyper-tuning
    Grid search
    Train-validation splitting
    Other features of PySpark ML in action
    Feature extraction
    NLP-related feature extractors
    Discretizing continuous variables
    Standardizing continuous variables
    Classification
    Clustering
    Finding clusters in the births dataset
    Topic mining
    Regression
    Summary

7. GraphFrames
    Introducing GraphFrames
    Installing GraphFrames
    Creating a library
    Preparing your flights dataset
    Building the graph
    Executing simple queries
    Determining the number of airports and trips
    Determining the longest delay in this dataset
    Determining the number of delayed versus on-time/early flights
    What flights departing Seattle are most likely to have significant delays?
    What states tend to have significant delays departing from Seattle?
    Understanding vertex degrees
    Determining the top transfer airports
    Understanding motifs
    Determining airport ranking using PageRank
    Determining the most popular non-stop flights
    Using Breadth-First Search
    Visualizing flights using D3
    Summary

8. TensorFrames
    What is Deep Learning?
    The need for neural networks and Deep Learning
    What is feature engineering?
    Bridging the data and algorithm
    What is TensorFlow?
    Installing Pip
    Installing TensorFlow
    Matrix multiplication using constants
    Matrix multiplication using placeholders
    Running the model
    Running another model
    Discussion
    Introducing TensorFrames
    TensorFrames - quick start
    Configuration and setup
    Launching a Spark cluster
    Creating a TensorFrames library
    Installing TensorFlow on your cluster
    Using TensorFlow to add a constant to an existing column
    Executing the Tensor graph
    Blockwise reducing operations example
    Building a DataFrame of vectors
    Analysing the DataFrame
    Computing elementwise sum and min of all vectors
    Summary

9. Polyglot Persistence with Blaze
    Installing Blaze
    Polyglot persistence
    Abstracting data
    Working with NumPy arrays
    Working with pandas DataFrame
    Working with files
    Working with databases
    Interacting with relational databases
    Interacting with the MongoDB database
    Data operations
    Accessing columns
    Symbolic transformations
    Operations on columns
    Reducing data
    Joins
    Summary

10. Structured Streaming
    What is Spark Streaming?
    Why do we need Spark Streaming?
    What is the Spark Streaming application data flow?
    Simple streaming application using DStreams
    A quick primer on global aggregations
    Introducing Structured Streaming
    Summary

11. Packaging Spark Applications
    The spark-submit command
    Command line parameters
    Deploying the app programmatically
    Configuring your SparkSession
    Creating SparkSession
    Modularizing code
    Structure of the module
    Calculating the distance between two points
    Converting distance units
    Building an egg
    User defined functions in Spark
    Submitting a job
    Monitoring execution
    Databricks Jobs
    Summary

Index