Probabilistic models in NLTK

While I was reading through the code of NLTK's naive Bayes classifier, I noticed something a little unusual about it. Naive Bayes can be built on two different underlying models of the features. The first is a Bernoulli model, which assigns a binary value to each word: true if the word is in the document, false if it is not. This captures both the presence and the absence of a word, but ignores relative frequencies between words. The second is a multinomial model, where each word contributes according to how often it occurs, so word probabilities (e.g., .004 or .80) are estimated from relative frequencies. This does not mathematically account for the absence of words, but does account for the relative frequency between words.

The multinomial model is usually more accurate in practice and is faster to compute, which makes it the more common choice. It is the model used in the very popular SpamBayes project, and it is what many spam filters are based on. This was the model I had originally assumed I would be working with.

However, given all the discussion we've had about feature selection, and the fairly well-known limits of the "bag of words" model, instead of modifying NLTK with a multinomial model I would like to add features beyond the appearance of words. The Bernoulli model is better at capturing features that are dissimilar, so I could add other textual features (such as mood, tense, or sentence complexity) alongside word occurrence. For categorizing genre especially, I think I will see more benefit from adding textual features to the existing Bernoulli model than from switching to a multinomial model over the same features.
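To make the contrast between the two models concrete, here is a minimal hand-rolled sketch. The vocabulary, the per-class word probabilities, and the example document are all made up purely for illustration; the point is just how each model scores the same document.

```python
# Hypothetical per-class word probabilities for a tiny vocabulary
# (numbers invented purely for illustration).
vocab = ["free", "money", "meeting", "report"]
p_word_given_spam = {"free": 0.8, "money": 0.7, "meeting": 0.1, "report": 0.05}

doc = ["free", "free", "money"]  # token list; note "free" appears twice

# Bernoulli model: one binary factor per *vocabulary word*, so the
# absence of "meeting" and "report" lowers the likelihood, but the
# repeated "free" only counts once.
present = set(doc)
bernoulli = 1.0
for w in vocab:
    p = p_word_given_spam[w]
    bernoulli *= p if w in present else (1.0 - p)

# Multinomial model: one factor per *token*, so the second "free"
# contributes again, but absent words contribute nothing. (The
# multinomial coefficient is constant across classes, so it is
# omitted here.)
multinomial = 1.0
for token in doc:
    multinomial *= p_word_given_spam[token]

print("Bernoulli likelihood:   %g" % bernoulli)
print("Multinomial likelihood: %g" % multinomial)
```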
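And here is a sketch of the kind of feature extractor I have in mind for NLTK's existing classifier. The vocabulary, the labels, the training texts, and the sentence-length threshold are all hypothetical stand-ins; avg_sentence_length stands in for the extra textual features (mood, tense, complexity) discussed above. NaiveBayesClassifier.train, classify, word_tokenize, and sent_tokenize are real NLTK APIs (the tokenizers need the "punkt" data downloaded first).

```python
import nltk  # may require nltk.download('punkt') for the tokenizers

# Hypothetical vocabulary; in practice this would be selected from the corpus.
VOCAB = ["whale", "ship", "love", "murder"]

def document_features(text):
    tokens = [t.lower() for t in nltk.word_tokenize(text)]
    sentences = nltk.sent_tokenize(text)
    avg_len = sum(len(nltk.word_tokenize(s)) for s in sentences) / float(len(sentences))
    # Word-presence booleans keep the existing Bernoulli-style treatment.
    features = {"contains(%s)" % w: (w in tokens) for w in VOCAB}
    # Bucket the continuous value (threshold invented for the sketch),
    # since the classifier treats each distinct feature value as a
    # nominal outcome.
    features["long_sentences"] = avg_len > 20
    return features

train_set = [
    (document_features("The whale struck the ship."), "adventure"),
    (document_features("She spoke of love and loss."), "romance"),
]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(classifier.classify(document_features("A ship sailed north.")))
```

Because NLTK's classifier models each feature's value distribution separately, the word-presence booleans and the bucketed sentence-length feature mix freely, which is exactly what makes adding dissimilar features to the existing model attractive.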