I have been getting some great success from the scikit-learn CountVectorizer transformations. Here are some notes on how I like to use it:
```python
import sklearn.feature_extraction.text

ngram_range = (1, 2)
clf = sklearn.feature_extraction.text.CountVectorizer(
    ngram_range=ngram_range,
    min_df=10,  # minimum number of docs that must contain an n-gram to include it as a column
    # tokenizer=lambda x: [x_i.strip() for x_i in x.split()],  # uncomment to keep '*' characters as tokens
)
```
There is also a `stop_words` parameter that is sometimes useful.
One fun thing about keeping my lab notebook in digital form with IPython Notebooks is that I can flip through my old work so easily. Did I say fun? I meant scary, and sometimes depressing. But yes, also fun.
For example, two years ago, I was working on some projects that are still not wrapped up today, and I was doing a lot of prep for the first edition of my now re-titled “machine learning for health metricians” class.
Hey, that includes the answer to [a question someone just asked on stats.stackexchange](http://stats.stackexchange.com/q/149801/18291).
Doctors love decision trees, computer scientists love recursion, so maybe that’s why decision trees have been coming up so much in the Artificial Intelligence for Health Metricians class I’m teaching this quarter. We’ve been very sklearn-focused in our labs so far, but I thought my students might like to see how to build their own decision tree learner from scratch. So I put together this little notebook for them. Unfortunately, it is a little too complicated to make them do it themselves in a quarter-long class with no programming prerequisites.
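To give a flavor of what a from-scratch learner involves (this is my own bare-bones sketch, not the notebook's actual code): recursively find the split that most reduces Gini impurity, and stop when no split helps or the depth budget runs out.

```python
import numpy as np

def gini(y):
    """Gini impurity of a label array."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Return the (feature, threshold) pair that most reduces Gini impurity, or None."""
    best, best_score = None, gini(y)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left = X[:, j] <= t
            if left.all() or (~left).all():
                continue  # split puts everything on one side; skip it
            score = left.mean() * gini(y[left]) + (~left).mean() * gini(y[~left])
            if score < best_score:
                best, best_score = (j, t), score
    return best

def fit_tree(X, y, max_depth=3):
    """Recursively grow a classification tree, represented as nested dicts."""
    split = best_split(X, y) if max_depth > 0 else None
    if split is None:
        # leaf node: predict the majority class
        vals, counts = np.unique(y, return_counts=True)
        return vals[np.argmax(counts)]
    j, t = split
    left = X[:, j] <= t
    return {"feature": j, "threshold": t,
            "left": fit_tree(X[left], y[left], max_depth - 1),
            "right": fit_tree(X[~left], y[~left], max_depth - 1)}

def predict_one(tree, x):
    """Walk the nested-dict tree down to a leaf label."""
    while isinstance(tree, dict):
        branch = "left" if x[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree

# tiny synthetic example: class is determined by the first feature
X = np.array([[1, 5], [2, 6], [8, 1], [9, 2]])
y = np.array([0, 0, 1, 1])
tree = fit_tree(X, y)
preds = [predict_one(tree, x) for x in X]
```

Even this stripped-down version has the recursion, the impurity criterion, and the stopping rules that make the real thing hard for students who are new to programming.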
To complement that ASA address about what statistics is that I read last week, here is the analogous KDD address about what data mining is: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=754923
Does the talk exist somewhere?