Finding and accessing data

NLTK has a built-in corpus datatype, as well as access to a number of common corpora (like the Brown Corpus), many of which are already tagged. This makes it easy to jump in and practice using things like the tokenizers, taggers, and classifiers without having to clean up and load a bunch of data. There were no preexisting Austen or Doyle corpora, though, so I had to build my own, and gathering and importing the data was much more time-consuming than I expected.
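For example, getting at an already-tagged corpus takes only a couple of lines. This is a minimal sketch; it assumes the Brown Corpus has already been fetched with nltk.download('brown').

    # Poke at the bundled Brown Corpus, which ships pre-tagged.
    from nltk.corpus import brown

    print(brown.words()[:10])        # plain tokens
    print(brown.tagged_words()[:5])  # (word, tag) pairs, no tagging work required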

The best version of Doyle’s work that I found was in HTML, so I needed to write a script that read in each HTML file, used an HTML-to-text library to strip out the tags, then wrote the result back out to disk as a new file. This was, in a way, a good real-life lesson in Python file handling. Files can be opened read-only, read-write, and in all sorts of other modes; writes can overwrite or append; and file handles need to be properly closed. It is a bit confusing, and someone learning to program for the first time would have to pick up a number of additional concepts about file systems just to read and write files. I definitely think there should be some sane default behavior (such as appending data and releasing the file handle) for writing data to a file name. Many people recommend using UNIX shell redirection to write the output of a Python program to a file instead (and I’ve actually seen a lot of system scripting done that way, with a small bash script wrapping a number of internal Python modules).
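Here is roughly what that conversion pass looked like. This is a sketch, not my exact script: the directory name is hypothetical, and I’m using BeautifulSoup’s get_text() as a stand-in for whichever HTML-to-text library you prefer.

    from pathlib import Path
    from bs4 import BeautifulSoup

    # "doyle_html" is a hypothetical directory holding the downloaded HTML files.
    for html_path in Path("doyle_html").glob("*.html"):
        with open(html_path, encoding="utf-8") as f:  # read-only is the default mode
            soup = BeautifulSoup(f.read(), "html.parser")
        with open(html_path.with_suffix(".txt"), "w", encoding="utf-8") as f:  # "w" overwrites, "a" appends
            f.write(soup.get_text())
        # each "with" block closes its file handle automatically on exit

The with blocks take care of closing the handles, which sidesteps most of the confusion I ran into; and if you would rather go the shell-redirection route, printing to stdout and running something like python strip_html.py > doyle.txt works just as well.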

The version of Austen’s work I found was UTF-8 plain text, so I did not have to deal with cleaning up HTML. However, each file came with both a header and a footer from Project Gutenberg. I tried using a regular expression so I could read each file in, remove whatever matched my pattern, and write it back out to disk, but my expressions kept getting tripped up by minor variations between files, and accounting for all of them was time-consuming. I ended up just opening each novel by hand and deleting the headers and footers manually. This obviously wouldn’t scale, so I may need to refine the approach when I have a larger dataset.
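For the record, here is roughly the regex approach I was attempting, as a sketch under a couple of assumptions: the directory name is hypothetical, and the START/END markers shown are the common modern Gutenberg ones. Older files use slight variants, which is exactly what kept breaking my pattern.

    import re
    from pathlib import Path

    # Capture everything between the Gutenberg START and END markers.
    pattern = re.compile(
        r"\*\*\* ?START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*?\*\*\*"
        r"(.*?)"
        r"\*\*\* ?END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK",
        re.DOTALL,
    )

    # "austen_utf8" is a hypothetical directory of the plain-text novels.
    for path in Path("austen_utf8").glob("*.txt"):
        text = path.read_text(encoding="utf-8")
        match = pattern.search(text)
        if match:
            path.write_text(match.group(1).strip(), encoding="utf-8")
        else:
            print(f"no match in {path.name}; this one gets stripped by hand")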