NLTK Tokenizers and Dictionaries

One of NLTK’s better features is the standard tokenizer. It integrates particularly well with the default dictionaries and corpora NLTK makes available. It takes only a single line to generate a list of stop words, a list of all words in the English language, a list of stop words in Spanish or Italian, and so on. Loading the data to and from disk was a bit finicky, but from there you can go from one great big stream of text to a list of paragraphs, where each paragraph is a list of sentences, each sentence is a list of words, and each word is a text token that doesn’t appear in your stop list. This is incredibly useful, and it would be easy to go further and tag each word with its part of speech as well.
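A minimal sketch of what that pipeline looks like, assuming paragraphs are separated by blank lines (NLTK doesn’t ship a paragraph tokenizer, so that split is my own convention) and that the punkt and stopwords data packages have been downloaded:

```python
import nltk
from nltk.corpus import stopwords

# One-time downloads of the tokenizer model and stop word lists:
# nltk.download('punkt')
# nltk.download('stopwords')

STOP_WORDS = set(stopwords.words('english'))

def tokenize_document(text):
    """Turn a raw stream of text into a list of paragraphs, where each
    paragraph is a list of sentences and each sentence is a list of
    lowercased tokens that are not in the stop list."""
    # Assumption: paragraphs are separated by blank lines.
    paragraphs = [p for p in text.split('\n\n') if p.strip()]
    document = []
    for para in paragraphs:
        sentences = []
        for sent in nltk.sent_tokenize(para):
            words = [w.lower() for w in nltk.word_tokenize(sent)
                     if w.isalpha() and w.lower() not in STOP_WORDS]
            sentences.append(words)
        document.append(sentences)
    return document
```

Running each sentence’s tokens through nltk.pos_tag before filtering would add the part-of-speech layer mentioned above.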

One variation on the testing I am working on uses only a subset of each document for the testing portion, usually a handful of randomly selected paragraphs drawn from the structure I just described (see the sketch below). This could be a more practical model to look at, since in a real-world application you may not have the time, computing power, or bandwidth to analyze a whole document. A small decrease in accuracy could be exchanged for a large increase in processing speed, which would be useful for larger datasets. This is itself a great example of the benefit NLTK provides: it lets you spend more time thinking about the structure of your problem and less about getting the computer to do what you want it to do (not always an easy thing).
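The subsetting step itself is tiny once the document is already tokenized. A sketch, reusing the tokenize_document output from above; the function name, paragraph count, and seed parameter are just placeholders for whatever the experiment calls for:

```python
import random

def sample_paragraphs(tokenized_doc, n_paragraphs=5, seed=None):
    """Pick a random subset of paragraphs from a document that has
    already been structured as a list of paragraphs."""
    rng = random.Random(seed)  # fixed seed makes a test run repeatable
    count = min(n_paragraphs, len(tokenized_doc))
    return rng.sample(tokenized_doc, count)
```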