Using the sklearn text.CountVectorizer

I have been having great success with the scikit-learn CountVectorizer transformation. Here are some notes on how I like to use it:

import sklearn.feature_extraction.text  # import the text submodule explicitly

ngram_range = (1,2)

clf = sklearn.feature_extraction.text.CountVectorizer(
        ngram_range=ngram_range,
        min_df=10,  # minimum number of docs that must contain n-gram to include as a column
        #tokenizer=lambda x: [x_i.strip() for x_i in x.split()]  # keep '*' characters as tokens
    )
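To see what ngram_range and min_df actually do, here is a minimal sketch on a toy corpus. The documents and the min_df=2 threshold are made up for illustration (min_df=10 would drop everything on a corpus this small), and the name vectorizer is just my choice here.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, for illustration only
docs = [
    "the cat sat",
    "the cat ran",
    "a dog ran",
]

# Unigrams and bigrams; keep only n-grams that appear in at least 2 documents
vectorizer = CountVectorizer(ngram_range=(1, 2), min_df=2)
X = vectorizer.fit_transform(docs)  # sparse matrix, one row per document

# vocabulary_ maps each kept n-gram to its column index
vocab = sorted(vectorizer.vocabulary_)
print(vocab)        # ['cat', 'ran', 'the', 'the cat']
print(X.toarray())  # per-document counts for those four columns
```

Note that the default token pattern only keeps tokens of two or more word characters, so "a" never becomes a column. That is also why the commented-out tokenizer above is needed if you want to keep things like '*' as tokens.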

There is also a stop_words parameter that is sometimes useful: pass 'english' to use the built-in English list, or pass your own list of words to drop.
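A quick sketch of stop_words in both forms, again on made-up documents. The custom list below is an arbitrary example, not a recommendation.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical documents, for illustration only
docs = ["the cat sat on the mat", "the dog sat on the log"]

# Option 1: a custom list of words to drop
custom = CountVectorizer(stop_words=["the", "on"])
custom.fit(docs)
kept_custom = sorted(custom.vocabulary_)
print(kept_custom)  # ['cat', 'dog', 'log', 'mat', 'sat']

# Option 2: scikit-learn's built-in English stop word list
builtin = CountVectorizer(stop_words="english")
builtin.fit(docs)
kept_builtin = sorted(builtin.vocabulary_)
print(kept_builtin)
```

The built-in list is fairly generic, so it is worth printing the resulting vocabulary to check that nothing important for your problem got removed.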
