Hadoop: The Definitive Guide, published by O'Reilly, original English edition, not a scan

FOURTH EDITION
Hadoop: The Definitive Guide
Tom White

Beijing · Cambridge · Farnham · Köln · Sebastopol · Tokyo
O'REILLY®

Hadoop: The Definitive Guide, Fourth Edition
by Tom White

Copyright © 2015 Tom White. All rights reserved.
Printed in the United States of America.
Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://safaribooksonline.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Editors: Mike Loukides and Meghan Blanchette
Production Editor: Matthew Hacker
Copyeditor: Jasmine Kwityn
Proofreader: Rachel Head
Indexer: Lucie Haskins
Cover Designer: Ellie Volckhausen
Interior Designer: David Futato
Illustrator: Rebecca Demarest

June 2009: First Edition
October 2010: Second Edition
May 2012: Third Edition
April 2015: Fourth Edition

Revision History for the Fourth Edition:
2015-03-19: First release
2015-04-17: Second release

See http://oreilly.com/catalog/errata.csp?isbn=9781491901632 for release details.

The O'Reilly logo is a registered trademark of O'Reilly Media, Inc. Hadoop: The Definitive Guide, the cover image of an African elephant, and related trade dress are trademarks of O'Reilly Media, Inc.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and O'Reilly Media, Inc. was aware of a trademark claim, the designations have been printed in caps or initial caps.

While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work.
Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

ISBN: 978-1-491-90163-2

For Eliane, Emilia, and Lottie

Table of Contents

Foreword
Preface

Part I. Hadoop Fundamentals

1. Meet Hadoop
    Data
    Data Storage and Analysis
    Querying All Your Data
    Beyond Batch
    Comparison with Other Systems
    Relational Database Management Systems
    Grid Computing
    Volunteer Computing
    A Brief History of Apache Hadoop
    What's in This Book

2. MapReduce
    A Weather Dataset
    Data Format
    Analyzing the Data with Unix Tools
    Analyzing the Data with Hadoop
    Map and Reduce
    Java MapReduce
    Scaling Out
    Data Flow
    Combiner Functions
    Running a Distributed MapReduce Job
    Hadoop Streaming
    Ruby
    Python

3. The Hadoop Distributed Filesystem
    The Design of HDFS
    HDFS Concepts
    Blocks
    Namenodes and Datanodes
    Block Caching
    HDFS Federation
    HDFS High Availability
    The Command-Line Interface
    Basic Filesystem Operations
    Hadoop Filesystems
    Interfaces
    The Java Interface
    Reading Data from a Hadoop URL
    Reading Data Using the FileSystem API
    Writing Data
    Directories
    Querying the Filesystem
    Deleting Data
    Data Flow
    Anatomy of a File Read
    Anatomy of a File Write
    Coherency Model
    Parallel Copying with distcp
    Keeping an HDFS Cluster Balanced

4. YARN
    Anatomy of a YARN Application Run
    Resource Requests
    Application Lifespan
    Building YARN Applications
    YARN Compared to MapReduce 1
    Scheduling in YARN
    Scheduler Options
    Capacity Scheduler Configuration
    Fair Scheduler Configuration
    Delay Scheduling
    Dominant Resource Fairness
    Further Reading

5. Hadoop I/O
    Data Integrity
    Data Integrity in HDFS
    LocalFileSystem
    ChecksumFileSystem
    Compression
    Codecs
    Compression and Input Splits
    Using Compression in MapReduce
    Serialization
    The Writable Interface
    Writable Classes
    Implementing a Custom Writable
    Serialization Frameworks
    File-Based Data Structures
    MapFile
    Other File Formats and Column-Oriented Formats

Part II. MapReduce

6. Developing a MapReduce Application
    The Configuration API
    Combining Resources
    Variable Expansion
    Setting Up the Development Environment
    Managing Configuration
    GenericOptionsParser, Tool, and ToolRunner
    Writing a Unit Test with MRUnit
    Reducer
    Running Locally on Test Data
    Running a Job in a Local Job Runner
    Testing the Driver
    Running on a Cluster
    Packaging a Job
    Launching a Job
    The MapReduce Web UI
    Retrieving the Results
    Debugging a Job
    Hadoop Logs
    Remote Debugging
    Tuning a Job
    Profiling Tasks
    MapReduce Workflows
    Decomposing a Problem into MapReduce Jobs
    JobControl
    Apache Oozie

7. How MapReduce Works
    Anatomy of a MapReduce Job Run
    Job Submission
    Job Initialization
    Task Assignment
    Task Execution
    Progress and Status Updates
    Job Completion
    Failures
    Task Failure
    Application Master Failure
    Node Manager Failure
    Resource Manager Failure
    Shuffle and Sort
    The Map Side
    The Reduce Side
    Configuration Tuning
    Task Execution
    The Task Execution Environment
    Speculative Execution
    Output Committers

8. MapReduce Types and Formats
    MapReduce Types
    The Default MapReduce Job
    Input Formats
    Input Splits and Records
    Text Input
    Binary Input
    Multiple Inputs
    Database Input (and Output)
    Output Formats
    Text Output
    Binary Output