Why doesn't your home page appear on the first page of search results, even when you query your own name? How do other web pages always appear at the top? What creates these powerful rankings? And how? The first book ever about the science of web page rankings, 'Google's PageRank and Beyond' supplieCopyright C 2006 by Princeton University PressPublished by Princeton University Press, 41 William Street,PriIn the United Kingdom: Princeton University PreSs, 3 Market Place,Woodstock, Oxfordshire OX20 1SYAll Rights ReservedLibrary of Congress 2005938841ISBN-13:978-0691-12202-1ISBN-10:0-691-122024The publisher would like to acknowledge the authors of this volume for providingthe camera-ready from which this book was printedGoogle and Page Rank are trademarks of Google InBritish Library Cataloging-in-Publication Data is availablePrinted on acid-free paper.aceton eduPrinted in the United States of America109876543ContentsPrefaceChapter 1. Introduction to Web Search Engines1.1 A Short History of Information Retrieval1.2 An Overview of traditional information retrieval1.3 Web Information Retrieval9Chapter 2. Crawling, Indexing, and Query Processing152.1 Crawling152.2 The Content Index19Query Processing21Chapter 3. Ranking Webpages by Popularity3.1 The Scene in 19983.2 Two ThesesQuery-Independence52803Chapter 4. The Mathematics of Google's PageRank4.1 The Original summation Formula for pageRank4.2 Matrix Representation of the Summation Equations4.3 Problems with the iterative Process4. 4 A Little Markov Chain Theory4.5 Early Adjustments to the Basic Model4.6 Computation of the Page Rank vector4.7 Theorem and Proof for Spectrum of the Google Matrix5Chapter 5. Parameters in the PageRank Model5. 1 The a facto475.2 The Hyperlink Matrix H5.3 The Teleportation Matrix EChapter 6. The Sensitivity of PageRank6. 1 Sensitivity with respect to a57CONTENTS6.2 Sensitivity with respect to H6.3 Sensitivity with respect to v?6. 4 Other Analyses of sensitivity6.5 Sensitivity Theorems and Proofs8的86Chapter 7. The PageRank Problem as a Linear System7.1 Properties of(i-aS)7.2 Properties of (I-aH)7.3 Proof of the Page Rank Sparse Linear SystemChapter 8 Issues in Large-Scale Implementation of PageRank758.1 Storage lssues758.2 Convergence Criterion8.3Accuracy798.4 Dangling Nodes8.5 Back Button ModelingChapter 9. Accelerating the Computation of PageRank9. 1 An Adaptive Power Method9.2 Extrapolation9.3 Aggregation9.4 Other Numerical MethodsChapter 10. Updating the PageRank Vector9910.1 The Two Updating Problems and their History10010.2 Restarting the Power Method1010.3 Approximate Updating Using Approximate Aggregation10.4 Exact Aggregation104410.5 Exact VS. Approximate Aggregation10.6 Updating with Iterative Aggregation10.7 Determining the Partition10.8 Conclusions111Chapter 11. The HITS Method for Ranking Webpages11511.1 The HITS Algorithm11511.2 HITS Implementation11711. 3 HITS Convergence11911.4 HITS Example11.5 Strengths and Weaknesses of HITS12211.6 HITS's Relationship to Bibliometrics12311.7 Query-Independent HITs12411.8 Accelerating HITS11. 9 HITS SensitivityCONTENTSChapter 12. Other Link Methods for Ranking Webpages13112.1 SALSA13112.2 Hybrid Ranking Methods13512.3 Rankings based on traffic Flow136Chapter 13. The Future of Web Information Retrieval13913.1 Spam13913.2 Personalization14213.3 Clustering13.4 Intelligent Agents13.5 Trends and Time-Sensitive Search14413.6 Privacy and Censorship14613.7 Library Classification Schemes14713.8 Data Fusion148Chapter 14. Resources for Web Information Retrieval14914.1 Resources for Getting Started14914.2 Resources for Serious Study150Chapter 15. The Mathematics Guide15315. 1 Linear Algebra15315.2 Perron-Frobenius Theory16715.3 Markov chains17515.4 Perron Complementation15.5 Stochastic Complementation19215.6 Censoring19415.7 Aggregation15.8 DisaggregationChapter 16. Glossary201彐 ibliography207Index219PrefacePurposeAs teachers of linear algebra, we wanted to write a book to help students and the generalpublic appreciate and understand one of the most exciting applications of linear algebratoday-the use of link analysis by web search engines. This topic is inherently interesting,timely, and familiar. For instance, the book answers such curious questions as: How dosearch engines work? Why is Google so good? What's a Google bomb? How can Iimprove the ranking of my homepage in Teoma?We also wanted this book to be a single source for material on web search engine rankngs. a great deal has been written on this topic, but it's currently spread across numeroustechnical reports, preprints, conference proceedings, articles, and talks. Here we havesummarized, clarified, condensed, and categorized the state of the art in web rankingOur AudienceWe wrote this book with two diverse audiences in mind the general science readerand the technical science reader The title echoes the technical content of the book butin addition to being informative on a technical level, we have also tried to provide someentertaining features and lighter material concerning search engines and how they workThe mathematicsOur goal in writing this book was to reach a challenging audience consisting of thegeneral scientific public as well as the technical scientific public. Of course, a completunderstanding of link analysis requires an acquaintance with many mathematical ideasNevertheless, we have tried to make the majority of the book accessible to the general scientific public. For instance, each chapter builds progressively in mathematical knowledge,technicality, and prerequisites. As a result, Chapters 1-4, which introduce web search andlink analysis, are aimed at the general science reader. Chapters 6, 9, and 10 are particularlymathematical. The last chapter, Chapter 15, The Mathematics Guide, is a condensed butcomplete reference for every mathematical concept used in the earlier chapters. Throughout the book, key mathematical concepts are highlighted in shaded boxes. By postponingthe mathematical definitions and formulas until Chapter 15 (rather than interspersing themthroughout the text), we were able to create a book that our mathematically sophisticatedreaders will also enjoy. We feel this approach is a compromise that allows us to serve bothaudiences: the general and technical scientific publicPREFACEAsidesAn enjoyable feature of this book is the use of Asides. Asides contain entertaining newsstories, practical search tips, amusing quotes, and racy lawsuits. Every chapter, even theparticularly technical ones, contains several asides. Often times a light aside provides theperfect break after a stretch of serious mathematical thinking. Brief asides appear in shadedboxes while longer asides that stretch across multiple pages are offset by horizontal barsand italicized font. We hope you enjoy these breaks we found ourselves looking forwardto writing them.Computing and codeTruly mastering a subject requires experimenting with the ideas. Consequently, we haveincorporated matlab code to encourage and jump-start the experimentation process whileany programming language is appropriate, we chose Matlab for three reasons: (1)its matrixstorage architecture and built-in commands are particularly suited to the large sparse linkanalysis matrices of this text, (2)among colleges and universities, Matlab is a market leaderin mathematical software, and (3)it's very user-friendly. The Matlab programs in this bookare intended to be instruction, not production, code. We hope that, by playing with theseprograms, readers will be inspired to create new models and algorithmsAcknowledgmentsWe thank Princeton University Press for supporting this book. We especially enjoyedworking with Vickie Kearn, the Senior Editor at PUP. Vickie, thank you for displaying justthe right combination of patience and gentle pressure. For a book with such timely mate-rial, you showed amazing faith in us. We thank all those who reviewed our manuscriptsand made this a better book. Of course. we also thank our families and friends for theirencouragement. Your pride in us is a powerful driving forceDedicationWe dedicate this book to mentors and mentees worldwide. The energy, inspiration, andsupport that is sparked through such relationships can inspire great products. For us, itproduced this book, but more importantly, a wonderful synergistic friendshipChapter OneIntroduction to Web Search Engines1.1 A SHORT HISTORY OF INFORMATION RETRIEVALToday we have museums for everything-the museum of baseball, of baseball players, ofcrazed fans of baseball players, museums for world wars, national battles, legal fights, andfamily feuds. While there's no shortage of museums, we have yet to find a museum dedicated to this books field, a museum of information retrieval and its history Of coursethere are related museums, such as the Library Museum in Boras, Sweden, but none concentrating on infornation retrieval. Information retrieval is the process of searchingwithin a document collection for a particular information need(called a query ) Althoughdominated by recent events following the invention of the computer, information retrievalactually has a long and glorious tradition. To honor that tradition, we propose the creation of a museum dedicated to its history. Like all museums, our museum of informationretrieval contains some very interesting artifacts. Join us for a brief tour.The earliest document collections were recorded on the painted walls of caves. Acave dweller interested in searching a collection of cave paintings to answer a particularinformation query had to travel by foot, and stand, staring in front of each painting Un-fortunately, it's hard to collect an artifact without being gruesome, so let's fast forward aBefore the invention of paper, ancient romans and greeks recorded information onpapyrus rolls. Some papyrus artifacts from ancient rome had tags attached to the rollsThese tags were an ancient form of today s post-it note, and make an excellent addition toour museum. A tag contained a short summary of the rolled document, and was attachedin order to save readers from unnecessarily unraveling a long irrelevant document. Theseabstracts also appeared in oral form. At the start of Greek plays in the fifth century B Cthe chorus recited an abstract of the ensuing action. While no actual classification schemehas survived from the artifacts of greek and roman libraries we do know that anotherelementary information retrieval tool, the table of contents, first appeared in greek scrollsfrom the second century B C. Books were not invented until centuries later, when necessitrequired an alternative writing material. As the story goes, the library of Pergamum(inwhat is now Turkey) threatened to overtake the celebrated Library of Alexandria as thebest library in the world, claiming the largest collection of papyrus rolls. As a result, theEgyptians ceased the supply of papyrus to Pergamum, so the Pergamenians invented analternative writing material, parchment, which is made from thin layers of animal skin. (Infact, the root of the word parchment comes from the word Pergamum. ) Unlike papyrus,The boldface terms that appear throughout the book are also listed and defined in the Glossary, which begins