Tag Archives: sklearn

Why do I call that variable `clf`?

From the sklearn docs: “We call our estimator instance `clf`, as it is a classifier.” http://scikit-learn.org/stable/tutorial/basic/tutorial.html#learning-and-predicting

Comments Off on Why do I call that variable `clf`?

Filed under machine learning, software engineering

Using the sklearn text.CountVectorizer

I have been getting some great success from the scikits-learn CountVectorizer transformations. Here are some notes on how I like to use it:

import sklearn.feature_extraction

ngram_range = (1,2)

clf = sklearn.feature_extraction.text.CountVectorizer(
        min_df=10,  # minimum number of docs that must contain n-gram to include as a column
        #tokenizer=lambda x: [x_i.strip() for x_i in x.split()]  # keep '*' characters as tokens

There is a stop_words parameter that is also sometimes useful.

Comments Off on Using the sklearn text.CountVectorizer

Filed under machine learning

Using the sklearn grid_search tools

Scikit-learn has a really nice grid search module. It will soon be called model_selection, because it has grown beyond simple grid search. But here is the spirit of it:

import sklearn.svm, sklearn.grid_search, sklearn.datasets.samples_generator
parameters = {'kernel':('poly', 'rbf'), 'C':[.01, .1, 1, 10, 100]}
clf = sklearn.grid_search.GridSearchCV(
X, y = sklearn.datasets.samples_generator.make_classification(n_samples=200, n_features=5, random_state=12345)
clf.fit(X, y)

And say you want to take a careful look at the results? They are all in there, too. http://nbviewer.ipython.org/gist/aflaxman/cb0660e602d361d06599

Comments Off on Using the sklearn grid_search tools

Filed under machine learning, software engineering

I like the term OneHotEncoder

Dummy variable just sounds demeaning to me. http://stats.stackexchange.com/questions/149122/treating-missing-data-in-voting-pattern-analysis/149572#149572

Comments Off on I like the term OneHotEncoder

Filed under machine learning