Foundations of Statistical Natural Language Processing

© 1999 Massachusetts Institute of Technology
Second printing with corrections, 2000

All rights reserved. No part of this book may be reproduced in any form by any electronic or mechanical means (including photocopying, recording, or information storage and retrieval) without permission in writing from the publisher.

Typeset in 10/13 Lucida Bright by the authors using LaTeX2e.
Printed and bound in the United States of America.

Library of Congress Cataloging-in-Publication Information

Manning, Christopher D.
Foundations of statistical natural language processing / Christopher D. Manning, Hinrich Schütze.
p. cm.
Includes bibliographical references (p. ) and index.
ISBN 0-262-13360-1
1. Computational linguistics--Statistical methods. I. Schütze, Hinrich. II. Title.
P98.5.S83M36 1999
410'.285--dc21 99-21137
CIP

Brief Contents

I Preliminaries 1
1 Introduction 3
2 Mathematical Foundations 39
3 Linguistic Essentials 81
4 Corpus-Based Work 117

II Words 149
5 Collocations 151
6 Statistical Inference: n-gram Models over Sparse Data 191
7 Word Sense Disambiguation 229
8 Lexical Acquisition 265

III Grammar 315
9 Markov Models 317
10 Part-of-Speech Tagging 341
11 Probabilistic Context Free Grammars 381
12 Probabilistic Parsing 407

IV Applications and Techniques 461
13 Statistical Alignment and Machine Translation 463
14 Clustering 495
15 Topics in Information Retrieval 529
16 Text Categorization 575

Contents

List of Tables xv
List of Figures xxi
Table of Notations xxv
Preface xxix
Road Map xxxv

I Preliminaries 1

1 Introduction 3
1.1 Rationalist and Empiricist Approaches to Language 4
1.2 Scientific Content 7
1.2.1 Questions that linguistics should answer 8
1.2.2 Non-categorical phenomena in language 11
1.2.3 Language and cognition as probabilistic phenomena 15
1.3 The Ambiguity of Language: Why NLP Is Difficult 17
1.4 Dirty Hands 19
1.4.1 Lexical resources 19
1.4.2 Word counts 20
1.4.3 Zipf's laws 23
1.4.4 Collocations 29
1.4.5 Concordances 31
1.5 Further Reading 34
1.6 Exercises 35

2 Mathematical Foundations 39
2.1 Elementary Probability Theory 40
2.1.1 Probability spaces 40
2.1.2 Conditional probability and independence 42
2.1.3 Bayes' theorem 43
2.1.4 Random variables 45
2.1.5 Expectation and variance 46
2.1.6 Notation 47
2.1.7 Joint and conditional distributions 48
2.1.8 Determining P 48
2.1.9 Standard distributions 50
2.1.10 Bayesian statistics 54
2.1.11 Exercises 59
2.2 Essential Information Theory 60
2.2.1 Entropy 61
2.2.2 Joint entropy and conditional entropy 63
2.2.3 Mutual information 66
2.2.4 The noisy channel model 68
2.2.5 Relative entropy or Kullback-Leibler divergence 72
2.2.6 The relation to language: Cross entropy 73
2.2.7 The entropy of English 76
2.2.8 Perplexity 78
2.2.9 Exercises 78
2.3 Further Reading 79

3 Linguistic Essentials 81
3.1 Parts of Speech and Morphology 81
3.1.1 Nouns and pronouns 83
3.1.2 Words that accompany nouns: Determiners and adjectives 87
3.1.3 Verbs 88
3.1.4 Other parts of speech 91
3.2 Phrase Structure 93
3.2.1 Phrase structure grammars 96
3.2.2 Dependency: Arguments and adjuncts 101
3.2.3 X' theory 106
3.2.4 Phrase structure ambiguity 107
3.3 Semantics and Pragmatics 109
3.4 Other Areas 112
3.5 Further Reading 113
3.6 Exercises 114

4 Corpus-Based Work 117
4.1 Getting Set Up 118
4.1.1 Computers 118
4.1.2 Corpora 118
4.1.3 Software 120
4.2 Looking at Text 123
4.2.1 Low-level formatting issues 123
4.2.2 Tokenization: What is a word? 124
4.2.3 Morphology 131
4.2.4 Sentences 134
4.3 Marked-up Data 136
4.3.1 Markup schemes 137
4.3.2 Grammatical tagging 139
4.4 Further Reading 145
4.5 Exercises 147

II Words 149

5 Collocations 151
5.1 Frequency 153
5.2 Mean and Variance 157
5.3 Hypothesis Testing 162
5.3.1 The t test 163
5.3.2 Hypothesis testing of differences 166
5.3.3 Pearson's chi-square test 169
5.3.4 Likelihood ratios 172
5.4 Mutual Information 178
5.5 The Notion of Collocation 183
5.6 Further Reading 187

6 Statistical Inference: n-gram Models over Sparse Data 191
6.1 Bins: Forming Equivalence Classes 192
6.1.1 Reliability vs. discrimination 192
6.1.2 n-gram models 192
6.1.3 Building n-gram models 195
6.2 Statistical Estimators 196
6.2.1 Maximum Likelihood Estimation (MLE) 197
6.2.2 Laplace's law, Lidstone's law and the Jeffreys-Perks law 202
6.2.3 Held out estimation 205
6.2.4 Cross-validation (deleted estimation) 210
6.2.5 Good-Turing estimation 212
6.2.6 Briefly noted 216
6.3 Combining Estimators 217
6.3.1 Simple linear interpolation 218
6.3.2 Katz's backing-off 219
6.3.3 General linear interpolation 220
6.3.4 Briefly noted 222
6.3.5 Language models for Austen 223
6.4 Conclusions 224
6.5 Further Reading 225
6.6 Exercises 225

7 Word Sense Disambiguation 229
7.1 Methodological Preliminaries 232
7.1.1 Supervised and unsupervised learning 232
7.1.2 Pseudowords 233
7.1.3 Upper and lower bounds on performance 233
7.2 Supervised Disambiguation 235
7.2.1 Bayesian classification 235
7.2.2 An information-theoretic approach 239
7.3 Dictionary-Based Disambiguation 241
7.3.1 Disambiguation based on sense definitions 242
7.3.2 Thesaurus-based disambiguation 244
7.3.3 Disambiguation based on translations in a second-language corpus 247
7.3.4 One sense per discourse, one sense per collocation 249
7.4 Unsupervised Disambiguation 252
7.5 What Is a Word Sense? 256
7.6 Further Reading 260
7.7 Exercises 262

8 Lexical Acquisition 265
8.1 Evaluation Measures 267
8.2 Verb Subcategorization 271
8.3 Attachment Ambiguity 278
8.3.1 Hindle and Rooth (1993) 280
8.3.2 General remarks on PP attachment 284
8.4 Selectional Preferences 288
8.5 Semantic Similarity 294
8.5.1 Vector space measures 296
8.5.2 Probabilistic measures 303
8.6 The Role of Lexical Acquisition in Statistical NLP 308
8.7 Further Reading 312

III Grammar 315

9 Markov Models 317
9.1 Markov Models 318
9.2 Hidden Markov Models 320
9.2.1 Why use HMMs? 322
9.2.2 General form of an HMM 324
9.3 The Three Fundamental Questions for HMMs 325
9.3.1 Finding the probability of an observation 326
9.3.2 Finding the best state sequence 331
9.3.3 The third problem: Parameter estimation 333
9.4 HMMs: Implementation, Properties, and Variants 336
9.4.1 Implementation 336
9.4.2 Variants 337
9.4.3 Multiple input observations 338
9.4.4 Initialization of parameter values 339
9.5 Further Reading 339

10 Part-of-Speech Tagging 341
10.1 The Information Sources in Tagging 343
10.2 Markov Model Taggers 345
10.2.1 The probabilistic model 345
10.2.2 The Viterbi algorithm 349
10.2.3 Variations 351
10.3 Hidden Markov Model Taggers 356