mining of massive datasets-带目录-v2.1.pdf
斯坦福 mining of massive datasets课程的 [英文] 教材,第二版,2015年7月出版, 该教材是斯坦福CS246课程的教材,主要面向本科生及以上学历。由剑桥大学出版社出版。PrefaceThis book evolved from material developed over several years by Anand Rajaraman and Jeff Ullman for a one-quarter course at Stanford. The courseCS345A, titled"Web Mining, was designed as an advanced graduate coursealthough it has bccomc accessible and interesting to advanccd undergraduatesWhen Jure Leskovec joined the Stanford faculty, we reorganized the Immaterialconsiderably. He introduced a new course C$224W on network analysis andadded material to CS345A. which was renumbered CS246. The three authorsalso introduced a large-scale data-mining project course, CS341. The book nowcontains material taught in all three coursesWhat the book is aboutAt the highest levcl of description, this book is about data mining. Howcverit focuses on data mining of very large amounts of data, that is, data so largeit does not fit in main memory. Because of the emphasis on size, many of ourexamples are about the Web or data derived from the Web. Further, the booktakes an algorithmic point of view: data mining is about applying algorithmsto data, rather than using data to train"a machine-learning engine of somesort. The principal topics covered are1. Distributed file systems and map-reduce as a tool for creating parallelalgorithms that succeed on very large amounts of data2. Similarity search, including the key techniques of minhashing and localitysensitive hashing3. Data-stream processing and specialized a gorithms for dea ling with datathat arrives so fast it must be processed immediately or lost4. The technology of search engines, including Google's PageRank, link-spamdetection, and the hubs-and-authorities approach5. Frequent-itemset mining, including association rules, market-baskets, theA-Priori Algorithm and its improvements6. Algorithms for clustering very large, high-dimensional datasetiPREFACE7. Two key problems for Web applications: managing advertising and recommendation svstems8. Algorithms for analyzing and mining the structurc of vcry largc graphsespecially social-network9. Techniques for obtaining the important properties of a large dataset bydimensionality reduction, including singular-value decomposition and la-tent semantic indexing10. Machine-learning a.Igorithms that can be applied to very large data, suchas perceptrons, support-vector machines, and gradient descentPrerequisitesTo appreciate fully the Inaterial in this book, we recolllnend the followingprerequisites1. An introduction to database systems, covering SQL and related program-ming systems.2. A sophomore-level course in data structures, algorithms, and discretemath3. A sophomore-level course in software systems, software engineering, andprograMining languagesExercisesThe book contains extensive exercises, with some for almost every section. Weindicate harder exercises or parts of exercises with an exclamation point. Thehardest exercises have a double exclamation pointSupport on the WebGotohttp://www.mmds.orgforslideshomeworkassignmentsprojectrequirements and exams from courses related to this bookGradiance automated homeworkThere are automated exercises based on this book, using the gradiance rootqucstiontcchnology,availablcatwww.gradiance.com/services.Studentsmayenter a public class by creating all account at that site and entering the classwith code 1EDD8A1D. Instructors may use the site by making an account therePREFACEand then emailing support at gradiance dot com with their login name, thename of their school, and a request to use the MMDS materialsAcknowledgementsCover art is by Scott Ullmante We would like to thank Foto Afrati, Arun Marathe, and Rok Sosic for critica.dings of a draft of this manuscriptErrors were also reported by Rajiv Abraham, Apoorv Agarwal, Aris Anag-nostopoulos, Atilla soner Balkir, Arnaud Belletoile, Robin bennett, Susan biancani, Amitabh Chaudhary, Leland Chen, Ilua Feng, Marcus Gemeinde, Anastasios Gounaris, Clark Grubb, Shrey gupta, Waleed hameid, Saman Haratizadeh, Przemyslaw Horban, Jeff Hwang, Rafi Kamal, Lachlan Kang, Ed Knorr,Haewoon Kwak, Ellis Lau, Greg Lee, David Z Liu, Ethan Lozano, Yunan LuoMichael Mahoney, Justin Meyer, Bryant Moscon, Brad Penoff, John PhillipsPhilips Kokoh Prasetyo, Qi Ge, Harizo Rajaona, Rich Seiter, Hitesh Shetty. An-gad Singh, Sandeep Sripada, Dennis Sidharta, Krzysztof Stencel. Mark StorusRoshan Sumbaly, Zack Taylor, Tim Triche Jr, Wang Bin, Weng Zhen-BinRobert west, Oscar Wul, Xie Ke, Nicolas Zhao, and Zhou Jingbo, The remain-Ing errors are ours, of courseLA RJ. D. UPalo alto. CaMarch 2014PREFACEContents1 Data Mining1.1 What is Data Mining1.1. 1 Statistical Modeling1.1.2 Machine learning1.1.3 Computational Approaches to Modeling1.1.1 Summarization1.1.5 Feature Extraction1.2 Statistical limits on data mining1.2.1 Total Information Awareness111223445561.2. An Example of Bonferroni's Principle1.2.2 Bonferronis Principle1.2.4 Exercises for Section 1.21. 3 Things Useful to Know1.3.1 Importance of Words in Documents1.3.2 Hash Functions1.3.3 Inde101.3.4 Secondary storage1.3.5 The Basc of Natural Logarithms121.3.6 Power Laws131. 3. 7 Exercises for Section 1.31.4 Outline of the book151.5 Summary of Chapter 11.6 References for Chapter 12 Map Reduce and the New Software Stack212.1 Distributed File Systems22.1.1 Physical Organization of Compute Nodes222.1.2 Largc-Scalc Filc-Systcm Organization.232.2 MapReduce2.2.1 The Map Tasks52.2.2 Grouping by Key262.2.3 The Reduce Tasks272.2.4 Combiners27CONTENTS2.2.5 Details of MapReduce Execution282.2.6 Coping With Node Failures292.2.7 Exercises for Section 2.2302.3 Algorithms Using MapReduce302.3.1 Matrix-Vector Multiplication by MapReduce2.3.2 If the Vector v Cannot Fit in Main Memory2.3.3 Relational-Algebra Operations322.3.4 Computing Selections by Mapreduce352.3.5 Computing Projections by Mapreduce362.3.6 Union, Intersection, and Difference by MapReduce ... 362.3.7 Computing Natural Join by Mapreduce372.3.8 Grouping and aggregation by Mapreduce372.3.10 Matrix Multiplication with One MapReduce Step2.3.9 Matrix Multiplication38392.3.11 Exercises for Section 2.3402.4 Extensions to Mapreduce2.4.1 Workflow Systems2.4.2 Recursive Extensions to Mapreduce422.4.3 Pregel452.4.4 Excrciscs for Scction 2.4462.5 The Communication Cost Model2.5.1 Communication-Cost for Task Networks2.5.2 Wall-Clock Timc2.5. 3 Multiway joins492.5.4 Exercises for Section 2.5522.6 Complexity Theory for Mapreduce542.6.1 Reducer Size and Replication Rate.,,,542.6.2 An Example: Similarity Joins52.6.3 A Graph Model for Mapreduce Problems2.6.4 Mapping schemas2.6.5 When Not All Inputs Are Present2.6.6 Lower Bounds on Replication Rate2.6.7 Case Study: Matrix Multiplication622. 6. 8 Exercises for Section 2.6662.7 Summary of Chapter 2672. 8 References for Chapter 2693 Finding Similar Items733.1 Applications of Near-Neighbor Search733.1.1 Jaccard Similarity of Sets743.1.2 Similarity of Documents743.1.3 Collaborative Filtering as a similar-Sets Problem3.1.4 Excrciscs for Scction 3.13.2 Shingling of documents3.2.1 k-Shingles77CONTENTSIX3.2.2 Choosing the Shingle Size783.2.3 Lashing Shingles793.2. 4 Shingles built from words793.2.5 Exercises for Section 3.2803.3 Similarity-Preserving Summaries of Sets803.3.1 Matrix Representation of Sets3.3.2 Minhashing3.3.3 Minhashing and Jaccard Similarity823.3.4 Minhash Signatures833.3.5 Computing Minhash Signatures833.3.6 Excrciscs for Scction 3.3863.4 Locality-Sensitive Hashing for Documents3.4.1 LSH for Minhash Signatures3.4.2 Analysis of the Banding Tcchniquc893.4.3 Combining the Techniques3. 4. 4 Exercises for Section 3.43.5 Distance measures923.5.1 Definition of a Distance Measure3.5.2 Euclidean distances933.5.3 Jaccard Distance943.5.4 Cosine Distance953.5.5 Edit Distance953.5.6 Hamming Distance963.5.7 Exercises for Section 3.5973.6 The Theory of Locality-Sensitive Functions93.6.1 Locality-Sensitive Functions993.6.2 Locality-Sensitive Families for Jaccard Distance1003.6.3 AInplilying a Locality-Sensitive Family1013. 6.4 Exercises for section 3. 61033.7 LSH Familics for Othcr Distance Mcasurcs1043.7.1 LSH Families for Hamming distance1043.7.2 Random Hyperplanes and the Cosine Distance1053.7.3 Sketches1063.7.4 LSII Families for Euclidean Distance1073.7.5 More LSH Families for Euclidean Spaces1083.7.6 Exercises for Section 3.7..1093.8 Applications of Locality-Sensitive Hashing1103.8.1 Entity resolution1103.8.2 An Entity-Resolution Example1113.8.3 Validating Record Matches1123.8.4 Matching Fingerprints3.8.5 A LSH Family for Fingerprint Matchill.1131143.8.6 Similar Nows Articles1153.8. 7 Exercises for Section 3. 81173.9 Methods for High Degrees of Similarity11SCONTENTS3.9.1 Finding Identical Items..1183.9.2 Representing Sets as Strings1183.9.3 Length-Based Filtering1193.9.1 Prefix Indexing1193.9.5 Using position Information1213.9.6 Using Position and Length in Indexes1223.9.7 Exercises for Section 3.91253.10 Summary of Chapter 31263. 11 References for Chapter 31284 Mining Data Streams1314.1 The Strcam Data Modcl1314.1.1 A Data-Strealll-Mallageiment SysteIll1324.1.2 Examples of Stream Sources1334.1.3 Strcam Qucrics1344.1. 4 Issues in streal processinlg1354.2 Sampling Data in a Stream1354.2.1ANMotivatingEample1364.2.2 Obtaining a Representative Sample1364.2.3 The General Sampling Problem1374.2.4 Varying the Sample Size1384.2.5 Exercises for Section 4.21384.3 Filtering Streams1384.3.1 A Motivating Example1394.3.2 The bloom filter1404.3.3 Analysis of Bloom Filtering1404.3.4 Exercises for Section 4.3,1414.4 Counting Distinct Elements in a Stream1424.4.1 The Count-Distinct Problem,,,,,1424.4.2 The Flajolet-Martin Algorithm1434.4.3 Combining estimates1444.4.4 Space Requirements1444.4.5 Exercises for Section 4.41454.5 Estimating Moments1454.5. 1 Definition of moments1454.5.2 The Alon-Matias-Szegedy Algorithm for SecondMoments 1464.5.3 Why the Alon-Matias-Szegedy Algorithm Works1474.5.4 Higher-Order Moments1484.5.5 Dealing With Infinite Streamsl484.5.6 Exercises for Section 4.51494.6 Counting Ones in a Window1504.6.1 The Cost of exact counts1514.6.2 The Datar-Gionis-Indyk-Motwani Algorithm..1514.6.3 Storage Requirements for the DGIM Algorithm ...... 1534.6.4 Query Answering in the DGIM Algorithm153
暂无评论