NLTK datatypes and documentation

NLTK offers a number of different models or object types to represent the data you’re working with. One of the most common types in NLTK is a corpus, which represents a body of texts to analyze. Although a corpus is a convenient data type to work with once you have one, it is needlessly difficult to get data into it or to convert between different corpus types. Just finding the proper data types to instantiate a corpus object required reading the API for ClassifiedCorpusReader, plus everything inherited from TaggedCorpusReader, plus everything inherited from PlaintextCorpusReader, plus everything inherited from CorpusReader. This kind of over-specialization of objects is fairly common in programming, but it can be quite confusing.

Here NLTK is held back by a distinct lack of proper documentation (a relatively unexciting task that is often the weakest point of open source projects). I would have liked clear examples of the most important functions in the various classes, along with explanations of the data types each function expects. I was eventually able to figure out the specifics that I needed, but it required a very close reading of the source code. This is not only an incredibly inefficient use of time, it is a needless technical barrier for a tool that could offer a lot to non-technical users. Anyone planning on using NLTK should be prepared for some reasonably technical digging. NLTK does offer a number of good tutorials and walkthroughs, but they are not a substitute for proper documentation of the tool and all its APIs.
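For anyone facing the same kind of inheritance chain, here is a minimal sketch of one way to cut down the digging: ask Python which class in the chain actually defines each method, using only the standard library's `inspect` module. The helper `where_defined` and the three reader classes below are hypothetical stand-ins written for illustration; they are not NLTK's classes and do not reproduce NLTK's real methods.

```python
import inspect


def where_defined(cls):
    """Map each public method name to the class in cls's hierarchy that defines it."""
    owners = {}
    # Walk from the most generic base to the most specific class, so that an
    # override in a subclass replaces the entry recorded for its parent.
    for klass in reversed(inspect.getmro(cls)):
        if klass is object:
            continue
        for name, member in vars(klass).items():
            # Only plain functions defined directly on this class; skip private names.
            if name.startswith("_") or not inspect.isfunction(member):
                continue
            owners[name] = klass.__name__
    return owners


# Hypothetical stand-ins for a deep reader hierarchy (NOT NLTK's real classes).
class BaseReader:
    def fileids(self):
        """List the files in the corpus."""

    def raw(self):
        """Return the corpus contents as one string."""


class PlainReader(BaseReader):
    def words(self):
        """Return the corpus as a list of word strings."""


class TaggedReader(PlainReader):
    def words(self):
        """Return word strings with the tags stripped (overrides PlainReader)."""

    def tagged_words(self):
        """Return the corpus as (word, tag) pairs."""


if __name__ == "__main__":
    for name, owner in sorted(where_defined(TaggedReader).items()):
        print(f"{name:15} defined in {owner}")
```

With NLTK installed, the same helper should work on a real reader class, for example `where_defined(nltk.corpus.reader.PlaintextCorpusReader)`. It only tells you where to look; it does not explain what data types each method expects, which is still the job of documentation.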

