Python performance

Python has been very easy to work with, but larger datasets are starting to cause problems. Some of the issues may be related to the IDLE shell I am working in, although a little research shows plenty of other people hitting similar limits.

One performance change I have found is using sets rather than lists for membership tests. In NLP we very often iterate through a large collection of words (such as a book) and check each one against a list (such as stopwords). Checking whether an item is present in an unsorted list means scanning the entire thing, which is O(n) per lookup. Python offers a similar data structure called a set. A set is unordered, so you give up indexing and any notion of reordering its items, but because it is backed by a hash table, membership tests are O(1) on average. There may be portions of the NLTK that use list membership tests and could be improved by switching to sets. Certainly in my own code I have already found a number of such spots, although none has been a magic bullet in terms of performance.
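To make the difference concrete, here is a minimal sketch using the standard library's timeit module and synthetic data (the stopword and token names are made up for illustration; a real run would use something like the NLTK stopword corpus):

```python
import random
import timeit

# 1,000 made-up "stopwords", held both as a list and as a set.
stopwords_list = ["word%d" % i for i in range(1000)]
stopwords_set = set(stopwords_list)

# Simulate a book: 10,000 tokens, roughly a third of which are stopwords.
random.seed(0)
tokens = [
    random.choice(stopwords_list) if random.random() < 0.3
    else "token%d" % random.randrange(5000)
    for _ in range(10_000)
]

def count_stops(words, stopwords):
    # `in` scans the whole list (O(n) per lookup) but hashes into a set
    # (O(1) on average) -- the code is identical either way.
    return sum(1 for w in words if w in stopwords)

list_time = timeit.timeit(lambda: count_stops(tokens, stopwords_list), number=5)
set_time = timeit.timeit(lambda: count_stops(tokens, stopwords_set), number=5)
print("list: %.4fs  set: %.4fs" % (list_time, set_time))
```

On my understanding of CPython's implementation, the set version should win by a wide margin here, and the gap only grows with the size of the stopword list, since the list scan gets slower while the hash lookup stays flat.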