Preliminary results

I presented an overview of my results in class, but wanted to take a minute to discuss my preliminary results here. My most meaningful experiment so far has been an attempt to classify authorship using the complete works of Austen, Doyle, and Bronte. My hope was that Doyle would correspond closely to a “mystery” genre, while Austen and Bronte would be “romance.” In addition, I was hoping to confuse the algorithm by putting two similar authors in one genre. Rather than use full novels as samples (which would limit the number of discrete samples), I divided each author’s work into 10 kB chunks. This yielded roughly 400 fragments per author, which were then shuffled and split into testing and training sets.

Trained on only 10% of the samples, the basic naive Bayes classifier reached accuracy in the high 90s on the full data set. I tried a number of optimizations, including stopword filtering, punctuation removal, better randomization of samples, larger sample sizes, and more aggressive normalization of words (lowercasing, stripping tense endings, etc.). All of these improved measured accuracy, but given the unrealistically high accuracy of the baseline, those gains are not especially meaningful.
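For concreteness, here is a minimal sketch of that pipeline. The directory layout, the scikit-learn bag-of-words classifier, and the exact chunking code are my illustrative assumptions, not a record of the actual experiment; only the 10 kB chunk size and the 10% training split come from the description above.

```python
import glob
import random

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

CHUNK_SIZE = 10 * 1024  # 10 kB fragments, as described above


def load_chunks(pattern, label):
    """Read every file matching `pattern` and cut it into 10 kB pieces."""
    samples = []
    for path in glob.glob(pattern):
        with open(path, encoding="utf-8") as f:
            text = f.read()
        samples += [(text[i:i + CHUNK_SIZE], label)
                    for i in range(0, len(text), CHUNK_SIZE)]
    return samples


# Hypothetical corpus layout: one directory of plain-text files per author.
data = (load_chunks("austen/*.txt", "austen")
        + load_chunks("doyle/*.txt", "doyle")
        + load_chunks("bronte/*.txt", "bronte"))

random.shuffle(data)
split = len(data) // 10                     # only 10% of samples for training
train, test = data[:split], data[split:]

# Bag-of-words features plus a multinomial naive Bayes baseline.
vec = CountVectorizer(lowercase=True)
X_train = vec.fit_transform(text for text, _ in train)
X_test = vec.transform(text for text, _ in test)

clf = MultinomialNB().fit(X_train, [label for _, label in train])
pred = clf.predict(X_test)
print("accuracy:", accuracy_score([label for _, label in test], pred))
```

Most of the optimizations mentioned map onto vectorizer options in this framing, e.g. `stop_words="english"` for stopword filtering and `lowercase=True` for capitalization; tense normalization would need a separate stemming pass.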

So far the initial results are actually quite encouraging. Naive Bayes did quite well on a small training set, and will certainly serve as a good baseline to compare other classification techniques against. At this point naive Bayes is TOO good: I need a larger dataset (or tighter limits on the training data) to get an error rate high enough to make improvements worthwhile, or even measurable.
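One cheap way to probe this, sketched below as a continuation of the code above rather than anything I have actually run, is to sweep the training fraction downward and watch where accuracy starts to fall off; that is roughly where differences between techniques would become visible.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# `data` is the shuffled (text, label) list from the sketch above.
for frac in (0.10, 0.05, 0.02, 0.01):
    n = max(1, int(len(data) * frac))
    train, test = data[:n], data[n:]

    vec = CountVectorizer(lowercase=True)
    X_train = vec.fit_transform(text for text, _ in train)
    X_test = vec.transform(text for text, _ in test)

    clf = MultinomialNB().fit(X_train, [label for _, label in train])
    acc = clf.score(X_test, [label for _, label in test])
    print(f"train fraction {frac:.0%}: accuracy {acc:.3f}")
```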