Feature selection

The first model I am working with is based on Naive Bayes, with word frequency as the feature being measured. Although this is a standard baseline, I am looking for more sophisticated methods that will let me determine a specific literary genre, which has been harder to classify. Beyond the word choice that identifies a genre, what other features might be useful to measure? I was thinking that adding a part-of-speech measurement might help distinguish genres. Perhaps romance novels include more conditional statements and mystery novels include more definitive ones.

One approach would be to add a part-of-speech tagger or some sort of semantic analysis to tag each word with its associated context. Once I had labeled all the words with a contextual category, I could run Naive Bayes using each pair of word and context as an individual feature (so "to love" in the present tense would be a different feature from "to love" in the past tense). We might find patterns where a genre is associated with a word only when context is added ("love" may be common in both Mystery and Romance, but in Mystery it may appear in the past tense and in Romance in the future tense). Adding this feature to the feature selector, testing the accuracy of the feature selector, and rewriting the Naive Bayes algorithm to use words and contexts together is a pretty large task, but it is definitely a direction I want to take as I refine the classifier.
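To make the word-plus-context idea concrete, here is a minimal sketch of a multinomial Naive Bayes that treats each (word, context) pair as a single feature, using add-one smoothing. It is an illustration under assumptions, not my actual classifier: the function names `train` and `classify`, the tag labels (`PAST`, `FUTURE`, `PRESENT`), and the tiny hand-tagged training set are all hypothetical. In a real pipeline the tags would come from a part-of-speech tagger rather than being written by hand.

```python
import math
from collections import Counter, defaultdict


def train(docs):
    """Count (word, context) pairs per genre.

    docs is a list of (genre, tokens), where tokens is a list of
    (word, context) tuples. The context labels are hypothetical here;
    a tagger would normally produce them.
    """
    genre_counts = Counter()               # number of documents per genre
    feature_counts = defaultdict(Counter)  # genre -> counts of (word, context)
    vocab = set()                          # every (word, context) pair seen
    for genre, tokens in docs:
        genre_counts[genre] += 1
        for pair in tokens:
            feature_counts[genre][pair] += 1
            vocab.add(pair)
    return genre_counts, feature_counts, vocab


def classify(model, tokens):
    """Return the genre with the highest log posterior for the tokens."""
    genre_counts, feature_counts, vocab = model
    total_docs = sum(genre_counts.values())
    best_genre, best_score = None, -math.inf
    for genre in genre_counts:
        # Log prior: fraction of training documents in this genre.
        score = math.log(genre_counts[genre] / total_docs)
        total = sum(feature_counts[genre].values())
        for pair in tokens:
            if pair not in vocab:
                continue  # skip pairs never seen in training
            # Add-one (Laplace) smoothed log likelihood of the pair.
            count = feature_counts[genre][pair]
            score += math.log((count + 1) / (total + len(vocab)))
        if score > best_score:
            best_genre, best_score = genre, score
    return best_genre


# Toy, hand-tagged training data (made up purely for illustration).
docs = [
    ("romance", [("love", "FUTURE"), ("kiss", "FUTURE"), ("love", "FUTURE")]),
    ("mystery", [("love", "PAST"), ("murder", "PAST"), ("clue", "PRESENT")]),
]
model = train(docs)

past_love = classify(model, [("love", "PAST")])
future_love = classify(model, [("love", "FUTURE")])
print(past_love, future_love)
```

Because ("love", "PAST") and ("love", "FUTURE") are counted as separate features, the same word can pull a text toward different genres depending on its context, which is the pattern I am hoping to find in real data.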

