Choosing a corpus

For my real project I will need the underlying data from Moretti’s project. Without it, not only do I have no classifications, I don’t even know which books are being analyzed. Until then, I chose to work with a smaller corpus and see if I could at least identify larger (and easier) classes. I decided to start with all of Jane Austen’s work as well as all of the Sherlock Holmes novels. I figure these are good examples of “Romance” and “Mystery” respectively, and any of my tools should be able to group them pretty accurately. Both are also old enough that the work is all public domain and easily obtainable. It would also be interesting to see what sorts of patterns distinguish those two groups from each other. It may even hint at other features or models that might be more accurate than the Naive Bayes classifier I am starting with.
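To make the starting point concrete, here is a minimal from-scratch sketch of a Naive Bayes text classifier with add-one smoothing. It is only an illustration of the idea, not the NLTK implementation, and the four training sentences are invented placeholders rather than text from Austen or the Holmes stories. The function names (tokenize, train, classify) are my own.

```python
from __future__ import division  # true division on Python 2 as well as 3

import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def train(labeled_texts):
    """Count words per label and the number of documents per label."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for text, label in labeled_texts:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocabulary = set(w for counts in word_counts.values() for w in counts)
    return word_counts, doc_counts, vocabulary


def classify(text, model):
    """Return the label with the highest log-probability for the text."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, float("-inf")
    for label in sorted(doc_counts):
        # Start from the log prior of the label.
        score = math.log(doc_counts[label] / total_docs)
        label_total = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocabulary:
                continue  # ignore words never seen during training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            count = word_counts[label][word] + 1
            score += math.log(count / (label_total + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented placeholder sentences, NOT quotations from the real corpus.
training = [
    ("the detective examined the footprints and the pistol", "mystery"),
    ("a murder at the station puzzled the inspector", "mystery"),
    ("she hoped he would propose marriage at the ball", "romance"),
    ("her sister admired the gentleman and his estate", "romance"),
]
model = train(training)
prediction = classify("the inspector found a pistol", model)
print(prediction)
```

With whole novels instead of single sentences the same counting logic applies; only the amount of text per label changes.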

Python IDEs

Python has a number of specialized IDEs, which I was interested in trying. Since Python emphasizes ease of development, I was hoping someone had created an easy-to-use IDE. There are many good languages out there that are limited by the quality of the tools available for them (Java is one good example). Here are some of the ones I worked with and my thoughts about them. I was surprised there was no real “winner”: there are a number of excellent options out there, even at the student-friendly price of 0.

PyDev- This is a Python extension for the Eclipse IDE. Although Eclipse is powerful and relatively popular, it is slow, confusing, and prone to annoying issues. The fact that Eclipse is the best free Java IDE is one of the reasons I find coding in Java unpleasant. Aside from being extremely slow and memory hungry, it makes it difficult to get to “coding”: it takes a lot of steps to reach the point where you’re actually working. Not recommended.

PyCharm- This is often considered the “smartest” Python IDE, with the best code completion. Some of its more powerful features can’t be found in other IDEs. Unfortunately, ease of use is not one of its strongest points, and its best features are only relevant to the most experienced coders. After a lengthy install and startup, even the “quickstart” option ends up forcing you to “select a python interpreter” while building a project. After a few minutes on Google looking for the answer, I decided this had probably already failed the ease-of-use test.

Komodo Edit- This is a reduced-feature version of a commercial product. It is easy to install and to use, and definitely has the fully finished feel of a commercial product. Project and file management are a little better than expected, but the free version lacks a full debugging feature. Many people would never miss it, and that makes Komodo Edit the easy place to start, especially for the less demanding programmer (or non-programmer).

WingIDE – Another reduced-feature version of a commercial product. This is now the IDE I am using. It has a very intuitive and complete debugger, which can be difficult to find in a free product. The interface looks like it came out of Windows 3.1, but it is logical and doesn’t get in the way of working.

Ninja IDE – A newer, completely open-source project (for those who do not like feature-limited closed-source products). It still doesn’t seem to offer the seamless experience some of the commercial products do. The UI isn’t always consistent, and the red and black “ninja” theme it sets by default is a little bit much. Although I did not find it useful for now, I would bet that it will continue to improve much faster than the other programs here.

IDLE- Although it’s not a full-featured IDE, I wanted to specifically mention IDLE, the built-in interpreter/shell that comes with Python. It gives you an interactive shell where you can execute Python commands and receive immediate feedback on their operation. No file names, no import statements, no class definitions: just put in the commands or functions. It combines the Python interpreter with some basic command prompt functions (command history, auto completion) and provides a very low barrier to actual programming. It does a great job of bringing the immediate access and sequential feel of a procedural language to Python. This is definitely the place to start for anyone learning Python, and I think it would be accessible even to people with limited programming experience.

Python 3 and NLTK

Python currently exists in two main variations: version 2.x and version 3.x. Python 3 is now over 4 years old, but it is not backwards compatible with Python 2. This means programs and libraries written in Python 2 need to be ported to work with Python 3. I initially planned on using 64-bit Python 3, since it offers the latest features and best performance, but I was totally unable to get it to work. NLTK is an open source project primarily supported by professors and students, and it relies on a number of mathematical libraries provided by other groups. Because of this, it has not been fully ported to Python 3 (the port is only in alpha stage), and I was not able to get it working properly using Python 3. In addition, I wasn’t even able to get it working with 64-bit Python 2.x, as some of the required third-party libraries are not available compiled for 64-bit platforms. There are various unofficial 64-bit builds, along with the option of compiling from source, but both of those can be problematic.
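Since so much of the trouble comes down to which interpreter is actually running, a quick check like the following can help before installing any libraries. This is a small sketch using only the standard library, and it should run unchanged on both Python 2 and Python 3.

```python
import struct
import sys

# Major and minor version of the running interpreter, e.g. (2, 7).
version = tuple(sys.version_info[:2])

# Size of a C pointer in bits: 32 on a 32-bit build, 64 on a 64-bit build.
bits = struct.calcsize("P") * 8

print("Python %d.%d, %d-bit" % (version[0], version[1], bits))
```

Note that this reports the bitness of the Python build itself, which can be 32-bit even on a 64-bit operating system.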

So for now, anyone wishing to use the NLTK is stuck with 32-bit Python 2. This definitely reduces the value of the tool, especially for learning purposes, as it requires learning an already outdated version of the language. I am continuing with Python 2 and the NLTK since I did not find anything else that offered similar functionality.

Python and NLTK tools

I’d like to use this space to record my experiences not only with the Natural Language Toolkit, but also with Python itself. Although I have extensive programming experience, I have never used Python, and it is not similar to any other language I know.

Python is a high-level, object-oriented language that has become very popular in recent years. It is generally considered easy to learn, it has extensive built-in libraries for common tasks, and existing programs are generally easy to maintain or modify. It is a “get it done” language, and is especially popular with people who are not full-time programmers. As such, it may be useful to other DH practitioners who are not interested in becoming computer programmers, but do need to write their own programs on occasion.

 

The Natural Language Toolkit is a collection of Python libraries for processing and working with natural language data (a library is a small program that provides a collection of basic functions that are not an official part of the language). The NLTK offers a large number of tools in the following areas: accessing corpora, string processing, collocation discovery, part-of-speech tagging, classification, chunking, parsing, semantic interpretation, evaluation metrics, probability and estimation, and linguistic fieldwork. I am working mostly with “accessing corpora”, “string processing”, and “classification”.
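To show how the “string processing” and “classification” pieces fit together, here is a small sketch that turns raw text into a bag-of-words feature dictionary using only the standard library. The helper names (tokenize, bag_of_words) and the sample sentences are my own placeholders. The output shape, a list of (feature dictionary, label) pairs, is the kind of input that nltk.NaiveBayesClassifier.train() accepts, but the NLTK call itself is left out so the example runs without NLTK installed.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def bag_of_words(text):
    """Map each distinct word in the text to a True 'contains' feature."""
    return {"contains(%s)" % word: True for word in tokenize(text)}


# Invented placeholder sentences, NOT quotations from the real corpus.
labeled_texts = [
    ("The detective read the letter.", "mystery"),
    ("She accepted his proposal.", "romance"),
]

# One (feature dictionary, label) pair per text.
labeled_featuresets = [(bag_of_words(text), label)
                       for text, label in labeled_texts]

features = labeled_featuresets[0][0]
print(sorted(features))
```

Presence/absence features like these throw away word frequency; counts or other features could be swapped in without changing the overall shape.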

Project idea

The project that I am working on is a text classification project. I am currently working with Python and the Natural Language Toolkit to build some basic text classifiers. I am trying to repeat Moretti’s genre classification from page 19 of Graphs, Maps, Trees. He manually classified British novels into various genres (such as Murder Mystery, Romantic, etc.). Although the results are interesting, manual classification makes his analysis prone to errors or hidden biases. If I can repeat it with an automated and repeatable method, it will reinforce his claims. In addition, I am working with several different kinds of text classification methods, and I am hoping to improve some of them for classifying genre, which is not only a broad classification but often an overlapping one (books can be both mysteries and women’s novels, for example).
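Because genres can overlap, comparing automated labels against manual ones is easier if each book carries a set of genres rather than a single label. The sketch below is one possible way to measure agreement, using Jaccard overlap between the two sets. The book identifiers and genre assignments are made-up placeholders, not Moretti’s data or real classifier output.

```python
def jaccard(a, b):
    """Overlap between two label sets: 1.0 if identical, 0.0 if disjoint."""
    if not a and not b:
        return 1.0
    return float(len(a & b)) / len(a | b)


# Hypothetical manual classifications (placeholder data).
manual = {
    "book_a": {"mystery"},
    "book_b": {"mystery", "romance"},
    "book_c": {"romance"},
}

# Hypothetical classifier output for the same books (placeholder data).
predicted = {
    "book_a": {"mystery"},
    "book_b": {"mystery"},
    "book_c": {"romance", "gothic"},
}

# Per-book agreement, then two summary figures.
scores = {book: jaccard(manual[book], predicted[book]) for book in manual}
mean_overlap = sum(scores.values()) / len(scores)
exact_matches = sum(1 for book in manual if manual[book] == predicted[book])

print(scores)
print(mean_overlap, exact_matches)
```

Exact-match counting treats a partially correct book as a miss, while the Jaccard score gives partial credit, so reporting both gives a fuller picture.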