Validity of results

The preliminary results using the standard and modified NLTK Bayesian classifier were almost suspiciously high. As a way to double-check them, I used an NLTK feature that shows which features most strongly differentiated the various classes. If the words that most clearly differentiated the classes matched my knowledge of the books, I could be fairly confident that the results were not spurious. Some of the differentiating markers were unexpected, but enough matched my expectations to establish the validity of the results. For example, one of the strongest markers identifying Doyle was often “Watson” or “Holmes.” Sometimes the results were slightly less intuitive, such as “Street” being a stronger indicator for Doyle than “Baker” (although there may have been bakers as a profession in other works). There was some slight overlap between the two romance authors, although proper names were again the strongest signal of an author. Given how useful proper names are, and how much they are tied to “story” (and therefore “author,” not necessarily “genre”), it would be interesting in a future round to strip out as many proper names as possible. Another interesting trend was that question words and emotionally loaded words showed up more often as indicators of Austen and Bronte. Although any one particular word is likely to be overtraining at this point, this is the pattern I am hoping to detect with genre.
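As a concrete reference, the validity check above is only a couple of lines of NLTK once a classifier exists. The tiny training set here is an illustrative stand-in, not my actual author fragments:

    import nltk

    # Stand-in for the real (feature_dict, label) pairs built from the 10kb chunks.
    train_set = [
        ({'holmes': True, 'street': True}, 'doyle'),
        ({'holmes': True, 'lestrade': True}, 'doyle'),
        ({'marriage': True, 'ball': True}, 'austen'),
        ({'marriage': True, 'letter': True}, 'austen'),
    ]

    classifier = nltk.NaiveBayesClassifier.train(train_set)

    # Lists the features that most strongly separate the classes, along with
    # the likelihood ratio for each one; this is the check described above.
    classifier.show_most_informative_features(10)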

Overall, the results seem to match those of other Bayesian classifiers in similarly limited situations. With a larger dataset we should see a decrease in accuracy and a chance for improvements to become noticeable. In addition, the experiment has supported the idea that there is a detectable pattern in genre beyond individual word frequencies.

Preliminary results

I presented an overview of my results in class, but wanted to take a minute to discuss my preliminary results here. My most meaningful set so far has been an attempt to classify author using the complete works of Austen, Doyle, and Bronte. My hope was that Doyle would correspond closely to a “mystery” genre, while Austen and Bronte would be “romance.” In addition, I was hoping to confuse the algorithm with two similar authors in one genre. Rather than use full novels as the sample size (which would limit the number of discrete samples), I divided each author’s work into chunks of 10kb. This resulted in roughly 400 fragments per author, which were then randomized and split into testing and training sets. The basic naive Bayes classifier was accurate in the high 90-percent range on the full dataset, using only 10% of the samples for training. A number of optimizations were tried, including stopword filtering, removal of punctuation, better randomization of samples, increasing sample size, and increased normalization of words (by removing capitals and tenses, etc.). All of these did improve accuracy as measured, but given the unrealistically high accuracy of the baseline testing, those results are not overly meaningful.
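For anyone curious, the chunking and train/test split is roughly the following sketch; the file names are placeholders for wherever the cleaned author texts live, and the feature extractor is just the baseline bag of words:

    import random
    import nltk

    CHUNK_SIZE = 10 * 1024  # 10kb fragments, as described above

    def chunk_author(path, label):
        # Split one author's plain-text file into fixed-size labeled chunks.
        with open(path, encoding='utf-8') as f:
            text = f.read()
        return [(text[i:i + CHUNK_SIZE], label)
                for i in range(0, len(text), CHUNK_SIZE)]

    def word_features(text):
        # Bag-of-words presence features, the baseline feature set.
        return {w.lower(): True for w in nltk.word_tokenize(text)}

    samples = (chunk_author('austen.txt', 'austen') +
               chunk_author('doyle.txt', 'doyle') +
               chunk_author('bronte.txt', 'bronte'))
    random.shuffle(samples)

    featuresets = [(word_features(text), label) for text, label in samples]
    cut = len(featuresets) // 10          # ~10% for training, the rest for testing
    train_set, test_set = featuresets[:cut], featuresets[cut:]

    classifier = nltk.NaiveBayesClassifier.train(train_set)
    print(nltk.classify.accuracy(classifier, test_set))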

So far the initial results are actually quite encouraging. Naive Bayes did quite well on a small sample size, and will certainly serve as a good baseline to compare other classification techniques against. At this point naive Bayes is TOO good. I need a larger dataset (or some more limits on the training) in order to get enough of an error rate to make improvements worthwhile (or even meaningful).

Python performance

Python has been very easy to work with, but larger datasets are starting to cause problems. Some of the issues may be related to the IDLE shell that I am working with, although some research has shown many other people having similar problems. One performance change I have seen is using sets for comparisons rather than lists. In NLP we very often iterate through a large collection of words (such as in a book) and check each one against a list (such as stopwords). Checking whether an item is present in an unsorted list requires scanning through the entire thing. Python offers a similar data structure called a set. A set is unordered and cannot contain duplicates, but it is much faster computationally to determine whether a specific item is a member. There may be portions of NLTK that use list comparisons which could be improved by set comparisons. Certainly in my own code I have already found a number, although none have been a magic bullet in terms of performance.
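A quick way to see the difference is to time the same stopword filter against a list and a set; the numbers will vary by machine, but the set version should win comfortably:

    import timeit
    from nltk.corpus import stopwords

    stop_list = stopwords.words('english')   # a plain list
    stop_set = set(stop_list)                # the same words as a set

    words = ['the', 'holmes', 'observed', 'quietly'] * 5000

    # The list test scans element by element; the set test is a hash lookup.
    print(timeit.timeit(lambda: [w for w in words if w not in stop_list], number=10))
    print(timeit.timeit(lambda: [w for w in words if w not in stop_set], number=10))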

NLTK Tokenizers and Dictionaries

One of NLTK’s better features is the standard tokenizer. It integrates particularly well with the default dictionaries and corpora NLTK makes available. It only takes a single line to generate a list of stop words, or a list of all words in the English language, or a list of stopwords in Spanish, Italian, and so on. Loading the data to and from disk was a bit finicky, but from there you can go from one great big stream of text to a list of paragraphs, where each paragraph is a list of sentences, each sentence is a list of words, and each word is a text token that doesn’t appear in your stop list. This is incredibly useful, and it would be easy to go further and have each word tagged as a part of speech as well.
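A rough sketch of that pipeline looks like this; the file name is a placeholder, and I am treating blank-line-separated blocks as paragraphs:

    import nltk
    from nltk.corpus import stopwords

    stop_words = set(stopwords.words('english'))    # the one-line stop list

    with open('doyle.txt', encoding='utf-8') as f:
        raw = f.read()

    # Paragraphs -> sentences -> stopword-filtered word tokens.
    paragraphs = [
        [[w for w in nltk.word_tokenize(sent) if w.lower() not in stop_words]
         for sent in nltk.sent_tokenize(para)]
        for para in raw.split('\n\n') if para.strip()
    ]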

One variation on the testing I am working on is using only a subset of each document for the testing portion, usually a number of randomly selected paragraphs produced by the method I just described. This could be a more practical model to look at, since in a real-world application you may not have the time, computing power, or bandwidth to analyze a whole document. A small decrease in accuracy could be exchanged for a large increase in processing speed, which would be useful for larger datasets. This is itself a great example of the benefit NLTK can provide. It allows you to spend more time thinking about the structure of your problem and less about getting the computer to do what you want it to do (not always an easy thing).
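The paragraph-sampling variation is only a few lines on top of the structure above; this is a sketch, and the number of paragraphs to sample is an arbitrary placeholder:

    import random

    def sampled_features(paragraphs, k=20):
        # Build features from k randomly chosen paragraphs rather than the
        # whole document; paragraphs is the nested list built above.
        chosen = random.sample(paragraphs, min(k, len(paragraphs)))
        words = [w for para in chosen for sent in para for w in sent]
        return {w.lower(): True for w in words}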

Python IDLE Stability

One of the most irritating issues I’ve had with Python has been the stability of the default IDLE interface. I understand that most languages can be used to crash a machine, but it is far too easy in Python to do it with seemingly trivial commands. Mostly this seems to be related to the size of the terminal buffer and displaying things on screen. As an example, if you try to print a reference to an object, as a beginner might (e.g. print myNewObject), you get a nice easy hex number for the object’s memory location, something you can use to identify two variables that refer to the same object. This is good. However, if you have a large array and try something like print myFirstArray, IDLE will send things to the terminal until it crashes (which happens after a thousand or two lines, not a particularly large number). Once you’ve crashed IDLE in that fashion there’s no way to save the output of your session or access any of your data, so you end up losing a lot of work. There’s just no reason for a print command to cause data loss in what is supposed to be a user-friendly language.

NLTK datatypes and documentation

NLTK offers a number of different models or object types to represent the data you’re working with. One of the most common types in NLTK is a corpus, which represents a body of texts to analyze. Although it provides a convenient data type to access, it’s needlessly difficult to get data in or convert between different corpus types. Just attempting to find the proper data types to instantiate a corpus object required reading the API for ClassifiedCorpusReader, plus everything inherited from TaggedCorpusReader, plus everything inherited from PlaintextCorpusReader, plus everything inherited from CorpusReader. Although this kind of over-specialization of objects is fairly common in programming, it can be quite confusing. Here NLTK is handicapped by a distinct lack of proper documentation (a relatively unexciting task that is often the weakest point of open source projects). I would have liked clear examples of the most important functions in the various classes, along with explanations of the data types each one expects. I was eventually able to figure out the specifics that I needed, but it required a very close reading of the source code. This is not only incredibly time inefficient, it is a needless technical barrier for a tool that could offer a lot to non-technical users. Anyone planning on using NLTK should be prepared for some reasonably technical digging. NLTK does offer a number of good tutorials and walkthroughs, but that is not a substitute for proper documentation of the tool and all its APIs.
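For what it is worth, the simplest case I eventually got working was a plain-text reader pointed at a directory of cleaned files; the directory path here is a placeholder:

    from nltk.corpus.reader import PlaintextCorpusReader

    # The second argument is a regular expression (or an explicit list)
    # describing which file ids inside the root directory to include.
    corpus = PlaintextCorpusReader('corpora/doyle', r'.*\.txt')

    print(corpus.fileids())
    print(len(corpus.words()))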

Probabilistic models in NLTK

While I was reading through the code of NLTK’s Bayesian classifier I noticed something a little unusual about it. Naive Bayes can be based on two different underlying models of the features. The first is a Bernoulli model, where the model assigns a binary value to each word: true if the word is in the document, false if it is not. This captures both the presence and absence of a word, but ignores relative frequencies between words. The second is a multinomial model, where each feature represents the relative frequency (e.g., .004 or .80) of a word. This does not mathematically account for the absence of words, but does account for the relative frequency between words. The multinomial model is usually more accurate in practice and is faster to calculate, making it more common to see. It is the model used in the very popular SpamBayes project, and is what many spam filters are based on. This was the model I had originally assumed I would be working with. However, given all the discussion we’ve had about feature selection, and the fairly well-known limits of the “bag of words” model, instead of modifying NLTK with a multinomial model, I would like to add in features beyond the appearance of words. The Bernoulli model is better at capturing features that are dissimilar, so I could add other text features (such as mood, tense, or sentence complexity) in addition to word occurrence. I think for categorizing genre, especially, I will see more benefit from adding additional textual features and using the existing Bernoulli model than from switching to a multinomial model of the same features.
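As a sketch of what I mean, extra document-level features can simply be added to the same feature dictionary that already holds word presence; the thresholds below are arbitrary guesses, not tuned values:

    import nltk

    def text_features(text):
        # Word-presence features plus a few coarse document-level features.
        sents = nltk.sent_tokenize(text)
        words = nltk.word_tokenize(text)
        features = {'has(%s)' % w.lower(): True for w in words}
        # Bucket the extra measurements so they stay discrete feature values.
        avg_len = len(words) / max(len(sents), 1)
        features['avg_sentence_length'] = 'long' if avg_len > 20 else 'short'
        features['question_heavy'] = text.count('?') > 0.1 * len(sents)
        return features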

IDLE Development Tool

I was originally working with WingIDE, but have almost completely switched over to IDLE. It is not as full-featured as something like Wing, but there are a few major pluses. First, it is part of Python, with nothing extra to load or find, and it is fully open source. Second, it is extremely lightweight. Many IDEs can suck up gigabytes of RAM, and are very poorly suited as a small tool for someone to experiment with. It is also very intelligent about auto-complete, listing the functions of a Python object as well as things like directory contents. Finally, it offers an interesting Python command-prompt interface that really makes it easy to experiment with pieces of code and data. You can run Python programs from the prompt, but you can also type in lines of code which it will execute on the spot. You can declare variables and load modules, which are all stored for your entire session. It is easiest to work out your syntax and algorithms when you are stepping through each line with the ability to view all your data between steps. This, coupled with the general readability of Python code, makes it easy to use Python as a data exploration tool (in most languages you would need to code your algorithm before applying it to data). It also makes the concept of a function almost self-explanatory.

Finding and accessing data

NLTK has a built-in corpus datatype, as well as access to a number of common corpora (like the Brown Corpus), many of which are already tagged. This makes it easy to jump in and practice using things like the tokenizers, taggers, and classifiers without having to clean up and load a bunch of data. There were no preexisting Austen or Doyle corpora, so I built my own. It was much more time consuming than expected to gather and import the data.

The best version of Doyle’s work that I found was in HTML, so I needed to write a script that read in each HTML file, used an HTML-to-text library to strip out the tags, then wrote it back out to disk as a new file. This was, in a way, a good real-life lesson in Python file handling. Files can be opened read-only, read-write, and all sorts of variations, writes can overwrite or append to files, and file handles do need to be properly closed. It is a bit confusing, and someone who was learning to program for the first time would have to learn a number of additional concepts about file systems just to read and write files. I definitely think there should be some sane default behavior (such as appending data and releasing the write lock) when you simply try to write data to a file name. Many people recommend using UNIX shell redirection to write the output of a Python program to a file (and I’ve actually seen a lot of system scripting being done that way, with a small bash script wrapping a number of internal Python modules).
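My script relied on a third-party HTML-to-text library, but the same basic stripping can be sketched with only the standard library; the directory name is a placeholder:

    import glob
    from html.parser import HTMLParser

    class TextExtractor(HTMLParser):
        # Collects the text content of an HTML document, discarding the tags.
        def __init__(self):
            super().__init__()
            self.chunks = []

        def handle_data(self, data):
            self.chunks.append(data)

    for path in glob.glob('doyle_html/*.html'):
        with open(path, encoding='utf-8') as f:     # closed automatically
            parser = TextExtractor()
            parser.feed(f.read())
        with open(path.rsplit('.', 1)[0] + '.txt', 'w', encoding='utf-8') as out:
            out.write(' '.join(parser.chunks))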

The version of Austen’s work I found was in UTF-8 plain text, so I did not have to deal with cleaning up HTML. However, the files all came with both a header and a footer from Project Gutenberg. I tried using regular expressions so I could read them in, remove anything that matched my pattern, and write them back out to disk. However, my expressions kept getting tripped up by minor variations, and accounting for all of those was time consuming. I ended up just opening each novel by hand and manually deleting the extra text. This obviously wouldn’t scale to very large datasets, so I may need to refine this approach when I have a larger dataset.
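If I revisit the regular-expression approach, something along these lines is the idea; the marker patterns are an assumption about how the headers and footers are worded, which is exactly the part that varies between files:

    import re

    START = re.compile(r'\*\*\*\s*START OF (THE|THIS) PROJECT GUTENBERG.*', re.IGNORECASE)
    END = re.compile(r'\*\*\*\s*END OF (THE|THIS) PROJECT GUTENBERG.*', re.IGNORECASE)

    def strip_gutenberg(text):
        # Keep only the text between the START and END marker lines, if both match.
        start, end = START.search(text), END.search(text)
        if start and end:
            return text[start.end():end.start()]
        return text   # fall back to the untouched text rather than guessing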

Feature selection

The first model that I am working with is based on naive Bayes with the frequency of words as the feature being measured. Although this is a standard baseline to work with, I am looking for some more sophisticated methods that will let me determine specific literary genre, which has been harder to classify. Beyond the word choice that identifies the genre, what other features might be useful to measure? I was thinking that adding in a parts-of-speech measurement might help distinguish them. Perhaps romance novels include more conditional statements and mystery novels include more definitive ones. I was thinking of adding a part-of-speech tagger or some sort of semantic analysis to tag each word with its associated context. Once I had labeled all the words with a contextual category, I could run naive Bayes using each pair of word and context as an individual feature (so “to love” in the present tense would be different from “to love” in the past). We might find patterns where a genre is associated with a word only when context is added in (“love” may be common in both mystery and romance, but in mystery it may be in the past tense and in romance the future tense). Adding this feature to the feature selector, testing its accuracy, and rewriting the naive Bayes algorithm to use words and contexts together is a pretty large task, but it is definitely a direction I want to take as I refine the classifier.
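As a first pass at the word-plus-context idea, NLTK’s default part-of-speech tagger can be used to key each feature on a (word, tag) pair; this is a sketch of the feature extractor only, not the rewritten classifier:

    import nltk

    def word_context_features(text):
        # 'love' as a noun and 'love' as a past-tense verb become separate features.
        tagged = nltk.pos_tag(nltk.word_tokenize(text))
        return {'%s/%s' % (w.lower(), tag): True for w, tag in tagged}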