Category Archives: machine learning

Using the sklearn text.CountVectorizer

I have been having great success with the scikit-learn CountVectorizer transformation. Here are some notes on how I like to use it:

import sklearn.feature_extraction.text

ngram_range = (1, 2)  # use unigrams and bigrams

clf = sklearn.feature_extraction.text.CountVectorizer(
    ngram_range=ngram_range,
    min_df=10,  # minimum number of docs that must contain n-gram to include as a column
    #tokenizer=lambda x: [x_i.strip() for x_i in x.split()]  # keep '*' characters as tokens
)

There is also a stop_words parameter that is sometimes useful, for example stop_words='english' to drop common English words.
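To see what the vectorizer actually produces, here is a minimal sketch on a toy corpus. The three documents and the small min_df=2 threshold are assumptions made up for illustration, not settings from my real projects:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, only for illustration.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]

vectorizer = CountVectorizer(
    ngram_range=(1, 2),    # unigrams and bigrams
    min_df=2,              # keep n-grams that appear in at least 2 docs
    stop_words="english",  # drop common English words such as "the"
)

# Sparse count matrix: one row per document, one column per kept n-gram.
X = vectorizer.fit_transform(docs)

# vocabulary_ maps each kept n-gram to its column index.
print(sorted(vectorizer.vocabulary_))
print(X.toarray())
```

With a real corpus, the resulting matrix can be passed straight to most scikit-learn estimators.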

Filed under machine learning

To read: EnsembleMatrix paper

EnsembleMatrix: Interactive Visualization to Support Machine Learning with Multiple Classifiers http://research.microsoft.com/en-us/um/redmond/groups/cue/publications/CHI2009-EnsembleMatrix.pdf

I want one

Filed under dataviz, machine learning

Brief survey on sequence classification

Hi Abie,

It was great speaking with you. This is the paper I was talking about.

http://dl.acm.org/citation.cfm?id=1882478

Looking forward to learning more about each other’s work.

Thanks,

Filed under disease modeling, machine learning

Using the sklearn grid_search tools

Scikit-learn has a really nice grid search module. It was originally called grid_search, and since version 0.18 it lives in model_selection, because it has grown beyond simple grid search. Here is the spirit of it:

import sklearn.svm, sklearn.model_selection, sklearn.datasets
parameters = {'kernel': ('poly', 'rbf'), 'C': [.01, .1, 1, 10, 100]}
clf = sklearn.model_selection.GridSearchCV(
    sklearn.svm.SVC(probability=True),
    parameters,
    n_jobs=64)  # number of parallel jobs; adjust for your machine
X, y = sklearn.datasets.make_classification(n_samples=200, n_features=5, random_state=12345)
clf.fit(X, y)
clf.best_params_

And say you want to take a careful look at the results? They are all in there, too, in the cv_results_ attribute of the fitted object (grid_scores_ in older versions): http://nbviewer.ipython.org/gist/aflaxman/cb0660e602d361d06599
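For a quick look without opening the notebook, something like the following sketch should work. It assumes a recent scikit-learn where GridSearchCV lives in sklearn.model_selection and stores its results in cv_results_; the smaller parameter grid and cv=3 are my choices here to keep it fast:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=12345)

# Smaller grid than in the post, purely to keep the example quick.
search = GridSearchCV(
    SVC(),
    {"kernel": ["poly", "rbf"], "C": [0.1, 1, 10]},
    cv=3,
)
search.fit(X, y)

# cv_results_ is a dict of arrays with one entry per parameter combination.
r = search.cv_results_
rows = zip(r["rank_test_score"], r["params"],
           r["mean_test_score"], r["std_test_score"])

# Print the combinations from best to worst mean cross-validated score.
for rank, params, mean, std in sorted(rows, key=lambda row: row[0]):
    print(f"rank {rank}: {params} -> {mean:.3f} +/- {std:.3f}")
```

The same dict also loads directly into a pandas DataFrame if you prefer to sort and filter it there.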

Filed under machine learning, software engineering

What was I up to two years ago?

One fun thing about keeping my lab notebook in digital form with IPython Notebooks is that I can flip through my old work so easily. Did I say fun? I meant scary, and sometimes depressing. But yes, also fun.

For example, two years ago, I was working on some projects that are still not wrapped up today, and I was doing a lot of prep for the first edition of my now re-titled “machine learning for health metricians” class.

Hey, that includes the answer to [a question someone just asked on stats.stackexchange](http://stats.stackexchange.com/q/149801/18291).

Filed under machine learning

Why do we call it “ridge” regression?

Asked and answered: http://stats.stackexchange.com/a/151351/18291

With a link to more detail: http://www.itl.nist.gov/div898/handbook/pri/section3/pri336.htm

Filed under machine learning

I like the term OneHotEncoder

Dummy variable just sounds demeaning to me. http://stats.stackexchange.com/questions/149122/treating-missing-data-in-voting-pattern-analysis/149572#149572
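For anyone who has not met the term, here is a minimal sketch of what one-hot encoding does. The tiny "votes" column is hypothetical, and treating "missing" as its own category is just one possible way to handle missing values, not necessarily what the linked answer recommends:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical voting data: one categorical column, with "missing" as a category.
votes = np.array([["yes"], ["no"], ["missing"], ["yes"]])

encoder = OneHotEncoder()
encoded = encoder.fit_transform(votes).toarray()  # densify the sparse result

# Categories are sorted, so the columns are: missing, no, yes.
print(encoder.categories_[0])
print(encoded)
# [[0. 0. 1.]
#  [0. 1. 0.]
#  [1. 0. 0.]
#  [0. 0. 1.]]
```

Each row has exactly one "hot" column, which is the same thing a set of dummy variables encodes (without dropping a reference level).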

Filed under machine learning