Tokenization of raw text is a standard pre-processing step for many NLP tasks. For English, tokenization usually involves punctuation splitting and separation of some affixes like possessives. Other languages require more extensive token pre-processing, which is usually called segmentation. The Stanford Word Segmenter currently supports Arabic and Chinese.
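
As a concrete illustration of the English case (punctuation splitting and separation of possessives), here is a minimal sketch using the PTBTokenizer class from the Stanford NLP tools; the sample sentence is our own, and we assume the Stanford NLP jars are on the classpath:

```java
import java.io.StringReader;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.process.CoreLabelTokenFactory;
import edu.stanford.nlp.process.PTBTokenizer;

public class TokenizeDemo {
  public static void main(String[] args) {
    // Sample sentence chosen to exercise possessive and punctuation splitting.
    String text = "Stanford's segmenter isn't limited to English.";
    PTBTokenizer<CoreLabel> tokenizer = new PTBTokenizer<>(
        new StringReader(text), new CoreLabelTokenFactory(), "");
    while (tokenizer.hasNext()) {
      // Prints one token per line: "Stanford", "'s", "segmenter",
      // "is", "n't", "limited", "to", "English", "."
      System.out.println(tokenizer.next().word());
    }
  }
}
```

Note how the possessive "'s" and the contraction "n't" come out as separate tokens, while the final period is split from the preceding word.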