Choosing a corpus

For my real project I will need the underlying data from Moretti’s project. Without it, not only do I have no classifications, I don’t even know which books were analyzed. Until I have it, I chose to work with a smaller corpus and see whether I could at least identify larger (and easier) classes. I decided to start with all of Jane Austen’s work and all of the Sherlock Holmes novels. I figure these are good examples of “Romance” and “Mystery” respectively, and any of my tools should be able to group them pretty accurately. Both are also old enough that the work is in the public domain and easy to obtain. It will also be interesting to see what sorts of patterns distinguish the two groups from each other. That may even hint at other features or models that might be more accurate than the Naive Bayes classifier I am starting with.
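To make the starting point concrete, here is a minimal sketch of a two-class Naive Bayes text classifier, assuming a multinomial bag-of-words model with add-one smoothing. It is not the code from my project: the function names (`tokenize`, `train`, `predict`) are hypothetical, and the four "documents" are made-up word lists standing in for real texts, not quotations from Austen or Conan Doyle.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Naive whitespace tokenizer; a real pipeline would also strip punctuation.
    return text.lower().split()


def train(labeled_docs):
    """Count documents and words per class from (label, text) pairs."""
    doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, text in labeled_docs:
        doc_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocab.add(word)
    return {"doc_counts": doc_counts, "word_counts": word_counts, "vocab": vocab}


def predict(model, text):
    """Return the class with the highest log posterior for the given text."""
    total_docs = sum(model["doc_counts"].values())
    vocab_size = len(model["vocab"])
    scores = {}
    for label, n_docs in model["doc_counts"].items():
        counts = model["word_counts"][label]
        total_words = sum(counts.values())
        score = math.log(n_docs / total_docs)  # log prior
        for word in tokenize(text):
            if word in model["vocab"]:  # words never seen in training are skipped
                # Add-one (Laplace) smoothing avoids log(0) for words
                # that appear in one class but not the other.
                score += math.log((counts[word] + 1) / (total_words + vocab_size))
        scores[label] = score
    return max(scores, key=scores.get)


# Placeholder training data: invented word lists, NOT text from the actual books.
toy_corpus = [
    ("Romance", "marriage ball letter sister proposal"),
    ("Romance", "estate marriage dance sister fortune"),
    ("Mystery", "detective clue murder footprint inspector"),
    ("Mystery", "murder disguise clue telegram detective"),
]

model = train(toy_corpus)
print(predict(model, "the detective found a clue"))
print(predict(model, "her sister hoped for a marriage"))
```

With real books the per-class word counts would come from whole texts rather than five-word lists, but the scoring logic stays the same. The per-class counts are also a quick way to see which words separate the two groups.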