Programming Pig explains in detail the principles by which Pig implements MapReduce functionality, and is an ideal resource for Pig beginners. (PDF)

Programming Pig
Alan Gates

O'REILLY
Beijing · Cambridge · Farnham · Köln · Sebastopol · Tokyo

Programming Pig
by Alan Gates

Copyright © 2011 Yahoo!, Inc. All rights reserved.
Printed in the United States of America.

Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://my.safaribooksonline.com). For more information, contact our corporate/institutional sales department: (800) 998-9938 or corporate@oreilly.com.

Editors: Mike Loukides and Meghan Blanchette
Production Editor: Adam Zaremba
Copyeditor: Genevieve d'Entremont
Proofreader: Marlowe Shaeffer
Indexer: Jay Marchand
Cover Designer: Karen Montgomery
Interior Designer: David Futato
Illustrator: Robert Romano

October 2011: First Edition.

Revision History for the First Edition:
    2011-09-27 First release

See http://oreilly.com/catalog/errata.csp?isbn=9781449302641 for release details.

Nutshell Handbook, the Nutshell Handbook logo, and the O'Reilly logo are registered trademarks of O'Reilly Media, Inc. Programming Pig, the image of a domestic pig, and related trade dress are trademarks of O'Reilly Media, Inc.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and O'Reilly Media, Inc. was aware of a trademark claim, the designations have been printed in caps or initial caps.

While every precaution has been taken in the preparation of this book, the publisher and author assume no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.

ISBN: 978-1-449-30264-1
[LSI]
1317137246

To my wife, Barbara, and our boys, Adam and Joel. Their support, encouragement, and sacrificed Saturdays have made this book possible.

Table of Contents

Preface

1. Introduction
    Pig on Hadoop
    Pig Latin, a Parallel Dataflow Language
    What Is Pig Useful For?
    Pig Philosophy

2. Installing and Running Pig
    Downloading and Installing Pig
        Downloading the Pig Package from Apache
        Downloading Pig from Cloudera
        Downloading Pig Artifacts from Maven
        Downloading the Source
    Running Pig
        Running Pig Locally on Your Machine
        Running Pig on Your Hadoop Cluster
        Running Pig in the Cloud
        Command-Line and Configuration Options
        Return Codes

3. Grunt
    Entering Pig Latin Scripts in Grunt
    HDFS Commands in Grunt
    Controlling Pig from Grunt

4. Pig's Data Model
    Types
        Complex Types
    Schemas
        Casts

5. Introduction to Pig Latin
    Preliminary Matters
        Case Sensitivity
        Comments
    Input and Output
        Store
        Dump
    Relational Operations
        Filter
        Group
        Distinct
        Join
        Limit
        Sample
        Parallel
    User Defined Functions
        Registering UDFs
        define and UDFs
        Calling Static Java Functions

6. Advanced Pig Latin
    Advanced Relational Operations
        Advanced Features of foreach
        Using Different Join Implementations
        cogroup
        union
    Integrating Pig with Legacy Code and MapReduce
        stream
        mapreduce
    Nonlinear Data Flows
    Controlling Execution
        Setting the Partitioner
    Pig Latin Preprocessor
        Parameter Substitution
        Macros
        Including Other Pig Latin Scripts

7. Developing and Testing Pig Latin Scripts
    Development Tools
        Syntax Highlighting and Checking
        describe
        explain
        illustrate
        Pig Statistics
        MapReduce Job Status
        Debugging Tips
    Testing Your Scripts with PigUnit

8. Making Pig Fly
    Writing Your Scripts to Perform Well
        Filter Early and Often
        Project Early and Often
        Set Up Your Joins Properly
        Use Multiquery When Possible
        Choose the Right Data Type
        Select the Right Level of Parallelism
    Writing Your UDF to Perform
    Tune Pig and Hadoop for Your Job
    Using Compression in Intermediate Results
    Data Layout Optimization
    Bad Record Handling

9. Embedding Pig Latin in Python
    Compile
    Bind
        Binding Multiple Sets of Variables
    Run
        Running Multiple Bindings
    Utility Methods

10. Writing Evaluation and Filter Functions
    Writing an Evaluation Function in Java
        Where Your UDF Will Run
        Evaluation Function Basics
        Input and Output Schemas
        Error Handling and Progress Reporting
        Constructors and Passing Data from Frontend to Backend
        Overloading UDFs
        Memory Issues in Eval Funcs
    Algebraic Interface
    Accumulator Interface
    Python UDFs
    Writing Filter Functions

11. Writing Load and Store Functions
    Load Functions
        Frontend Planning Functions
        Passing Information from the Frontend to the Backend
        Backend Data Reading
        Additional Load Function Interfaces
    Store Functions
        Store Function Frontend Planning
        Store Functions and UDFContext
        Writing Data
        Failure Cleanup
        Storing Metadata

12. Pig and Other Members of the Hadoop Community
    Pig and Hive
    Cascading
    NoSQL Databases
        HBase
        Cassandra
    Metadata in Hadoop

A. Built-in User Defined Functions and Piggybank

B. Overview of Hadoop

Index