The preliminary results using the standard and modified NLTK naive Bayes classifiers were almost suspiciously high. As a way to double-check them, I used an NLTK feature (the classifier's show_most_informative_features method) that lists the features that most strongly differentiate the classes. If the words that most clearly differentiated the classes matched my knowledge of the books, I could be fairly confident that the results were not spurious. Some of the differentiating markers were unexpected, but enough matched my expectations to establish the validity of the results. For example, one of the strongest markers for Doyle was often “Watson” or “Holmes”. Sometimes the results were slightly less intuitive, such as “Street” being a stronger indicator for Doyle than “Baker” (although there may have been bakers as a profession in other works). There was some slight overlap between the two romance authors, although proper names were again the strongest signal of an author. Given how useful proper names are, and how much they are tied to “story” (and therefore “author”, not necessarily “genre”), it would be interesting in a future round to strip out as many proper names as possible. Another interesting trend was that question words and emotionally charged words showed up more often as indicators of Austen and Bronte. Although any one particular word is likely a sign of overfitting at this point, this is the kind of pattern I am hoping to detect with genre.
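Below is a minimal sketch of what that sanity check looks like. The corpus here is a stand-in (two authors from NLTK's bundled Gutenberg sample, since the books used in this project are loaded separately), and the passage chunking and bag-of-words feature extractor are illustrative assumptions rather than the exact pipeline behind the results above; only the NaiveBayesClassifier and show_most_informative_features calls are the relevant part.

```python
import random
import nltk
from nltk.corpus import gutenberg

# Hypothetical stand-in corpus: two authors from NLTK's bundled Gutenberg
# sample, since the exact books used in this project are loaded separately.
nltk.download("gutenberg", quiet=True)

def chunks(words, size=200):
    """Split a token stream into fixed-size passages to classify."""
    for i in range(0, len(words) - size, size):
        yield words[i:i + size]

def bag_of_words(passage):
    # Simple word-presence features; the project's real extractor may differ.
    return {w.lower(): True for w in passage if w.isalpha()}

labelled = (
    [(bag_of_words(c), "austen") for c in chunks(gutenberg.words("austen-emma.txt"))]
    + [(bag_of_words(c), "chesterton") for c in chunks(gutenberg.words("chesterton-brown.txt"))]
)
random.shuffle(labelled)

split = int(len(labelled) * 0.8)
train_set, test_set = labelled[:split], labelled[split:]

classifier = nltk.NaiveBayesClassifier.train(train_set)
print("accuracy:", nltk.classify.accuracy(classifier, test_set))

# The sanity check described above: print the features that most sharply
# separate the classes, along with their likelihood ratios.
classifier.show_most_informative_features(25)
```

The printed likelihood ratios are the kind of output where markers like character names or question words surface, which is what made it possible to eyeball whether the classifier was keying on sensible words.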
Overall, the results seem to match those of other naive Bayes classifiers in similarly limited settings. With a larger dataset we should see a decrease in accuracy, which would leave room for improvements to become noticeable. In addition, the experiment has supported the idea that there is a detectable pattern in genre beyond individual word frequencies.