Practical Natural Language Processing

Notes and tests for the book "Practical Natural Language Processing" (O'Reilly)

Part I. Foundations

NLP Pipeline

Data Acquisition

Text Cleaning

- Unicode normalization
- Spell correction
  - Keyboard errors (fat finger)
  - OCR errors
  - Which character to replace first?
    - keyboard
      - inner keys first
      - ??? (statistically)
    - OCR
      - ?? (statistically, depending on the source docs)

Pre-processing

- Text -> [Sentence Tokenization] -> Sentences
- Sentence
  - Lowercasing
  - Removal