Kafka: The Definitive Guide
Real-Time Data and Stream Processing at Scale

Neha Narkhede, Gwen Shapira, and Todd Palino

Copyright © 2017 Neha Narkhede, Gwen Shapira, Todd Palino. All rights reserved.

Printed in the United States of America.

Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com/safari). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Editor: Shannon Cutt
Production Editor: Shiny Kalapurakkel
Copyeditor: Christina Edwards
Proofreader: Amanda Kersey
Indexer: WordCo Indexing Services, Inc.
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest

July 2017: First Edition

Revision History for the First Edition
2017-07-07: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781491936160 for release details.

The O'Reilly logo is a registered trademark of O'Reilly Media, Inc. Kafka: The Definitive Guide, the cover image, and related trade dress are trademarks of O'Reilly Media, Inc.

While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk.
If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-491-93616-0

Foreword

It's an exciting time for Apache Kafka. Kafka is being used by tens of thousands of organizations, including over a third of the Fortune 500 companies. It's among the fastest-growing open source projects and has spawned an immense ecosystem around it. It's at the heart of a movement towards managing and processing streams of data.

So where did Kafka come from? Why did we build it? And what exactly is it?

Kafka got its start as an internal infrastructure system we built at LinkedIn. Our observation was really simple: there were lots of databases and other systems built to store data, but what was missing in our architecture was something that would help us to handle the continuous flow of data. Prior to building Kafka, we experimented with all kinds of off-the-shelf options, from messaging systems to log aggregation and ETL tools, but none of them gave us what we wanted.

We eventually decided to build something from scratch. Our idea was that instead of focusing on holding piles of data like our relational databases, key-value stores, search indexes, or caches, we would focus on treating data as a continually evolving and ever-growing stream, and build a data system, and indeed a data architecture, oriented around that idea.

This idea turned out to be even more broadly applicable than we expected. Though Kafka got its start powering real-time applications and data flow behind the scenes of a social network, you can now see it at the heart of next-generation architectures in every industry imaginable.
Big retailers are reworking their fundamental business processes around continuous data streams; car companies are collecting and processing real-time data streams from internet-connected cars; and banks are rethinking their fundamental processes and systems around Kafka as well.

So what is this Kafka thing all about? How does it compare to the systems you already know and use?

We've come to think of Kafka as a streaming platform: a system that lets you publish and subscribe to streams of data, store them, and process them, and that is exactly what Apache Kafka is built to be. Getting used to this way of thinking about data might be a little different than what you're used to, but it turns out to be an incredibly powerful abstraction for building applications and architectures. Kafka is often compared to a couple of existing technology categories: enterprise messaging systems, big data systems like Hadoop, and data integration or ETL tools. Each of these comparisons has some validity but also falls a little short.

Kafka is like a messaging system in that it lets you publish and subscribe to streams of messages. In this way, it is similar to products like ActiveMQ, RabbitMQ, IBM's MQSeries, and other products. But even with these similarities, Kafka has a number of core differences from traditional messaging systems that make it another kind of animal entirely. Here are the big three differences: first, it works as a modern distributed system that runs as a cluster and can scale to handle all the applications in even the most massive of companies. Rather than running dozens of individual messaging brokers, hand wired to different apps, this lets you have a central platform that can scale elastically to handle all the streams of data in a company. Secondly, Kafka is a true storage system built to store data for as long as you might like.
This has huge advantages in using it as a connecting layer as it provides real delivery guarantees: its data is replicated, persistent, and can be kept around as long as you like. Finally, the world of stream processing raises the level of abstraction quite significantly. Messaging systems mostly just hand out messages. The stream processing capabilities in Kafka let you compute derived streams and datasets dynamically off of your streams with far less code. These differences make Kafka enough of its own thing that it doesn't really make sense to think of it as "yet another queue."

Another view on Kafka, and one of our motivating lenses in designing and building it, was to think of it as a kind of real-time version of Hadoop. Hadoop lets you store and periodically process file data at a very large scale. Kafka lets you store and continuously process streams of data, also at a large scale. At a technical level, there are definitely similarities, and many people see the emerging area of stream processing as a superset of the kind of batch processing people have done with Hadoop and its various processing layers. What this comparison misses is that the use cases that continuous, low-latency processing opens up are quite different from those that naturally fall on a batch processing system. Whereas Hadoop and big data targeted analytics applications, often in the data warehousing space, the low-latency nature of Kafka makes it applicable for the kind of core applications that directly power a business. This makes sense: events in a business are happening all the time and the ability to react to them as they occur makes it much easier to build services that directly power the operation of the business, feed back into customer experiences, and so on.

The final area Kafka gets compared to is ETL or data integration tools. After all, these tools move data around, and Kafka moves data around.
There is some validity to this as well, but I think the core difference is that Kafka has inverted the problem. Rather than a tool for scraping data out of one system and inserting it into another, Kafka is a platform oriented around real-time streams of events. This means that not only can it connect off-the-shelf applications and data systems, it can power custom applications built to trigger off of these same data streams. We think this architecture centered around streams of events is a really important thing. In some ways these flows of data are the most central aspect of a modern digital company, as important as the cash flows you'd see in a financial statement.

The ability to combine these three areas, to bring all the streams of data together across all the use cases, is what makes the idea of a streaming platform so appealing to people.

Still, all of this is a bit different, and learning how to think and build applications oriented around continuous streams of data is quite a mindshift if you are coming from the world of request/response style applications and relational databases. This book is absolutely the best way to learn about Kafka from internals to APIs, written by some of the people who know it best. I hope you enjoy reading it as much as I have!

Jay Kreps
Cofounder and CEO at Confluent

Preface

The greatest compliment you can give an author of a technical book is "This is the book I wish I had when I got started with this subject." This is the goal we set for ourselves when we started writing this book. We looked back at our experience writing Kafka, running Kafka in production, and helping many companies use Kafka to build software architectures and manage their data pipelines and we asked ourselves, "What are the most useful things we can share with new users to take them from beginner to experts?"
This book is a reflection of the work we do every day: run Apache Kafka and help others use it in the best ways.

We included what we believe you need to know in order to successfully run Apache Kafka in production and build robust and performant applications on top of it. We highlighted the popular use cases: message bus for event-driven microservices, stream processing applications, and large-scale data pipelines. We also focused on making the book general and comprehensive enough so it will be useful to anyone using Kafka, no matter the use case or architecture. We cover practical matters such as how to install and configure Kafka and how to use the Kafka APIs, and we also dedicated space to Kafka's design principles and reliability guarantees, and explore several of Kafka's delightful architecture details: the replication protocol, controller, and storage layer. We believe that knowledge of Kafka's design and internals is not only a fun read for those interested in distributed systems, but it is also incredibly useful for those who are seeking to make informed decisions when they deploy Kafka in production and design applications that use Kafka. The better you understand how Kafka works, the more informed decisions you can make regarding the many trade-offs that are involved in engineering.

One of the problems in software engineering is that there is always more than one way to do anything. Platforms such as Apache Kafka provide plenty of flexibility, which is great for experts but makes for a steep learning curve for beginners. Very often, Apache Kafka tells you how to use a feature but not why you should or shouldn't use it.
Whenever possible, we try to clarify the existing choices, the tradeoffs involved, and when you should and shouldn't use the different options presented by Apache Kafka.

Who Should Read This Book

Kafka: The Definitive Guide was written for software engineers who develop applications that use Kafka's APIs and for production engineers (also called SREs, DevOps, or sysadmins) who install, configure, tune, and monitor Kafka in production. We also wrote the book with data architects and data engineers in mind: those responsible for designing and building an organization's entire data infrastructure. Some of the chapters, especially chapters 3, 4, and 11, are geared toward Java developers. Those chapters assume that the reader is familiar with the basics of the Java programming language, including topics such as exception handling and concurrency. Other chapters, especially chapters 2, 8, 9, and 10, assume the reader has some experience running Linux and some familiarity with storage and network configuration in Linux. The rest of the book discusses Kafka and software architectures in more general terms and does not assume special knowledge.

Another category of people who may find this book interesting are the managers and architects who don't work directly with Kafka but work with the people who do. It is just as important that they understand the guarantees that Kafka provides and the trade-offs that their employees and coworkers will need to make while building Kafka-based systems. The book can provide ammunition to managers who would like to get their staff trained in Apache Kafka or ensure that their teams know what they need to know.

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic
Indicates new terms, URLs, email addresses, filenames, and file extensions.

Constant width
Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.

Constant width bold
Shows commands or other text that should be typed literally by the user.

Constant width italic
Shows text that should be replaced with user-supplied values or by values determined by context.

Tip
This element signifies a tip or suggestion.