I have been getting some great success from the scikit-learn CountVectorizer transformations. Here are some notes on how I like to use it:
```python
import sklearn.feature_extraction.text

ngram_range = (1, 2)
clf = sklearn.feature_extraction.text.CountVectorizer(
    ngram_range=ngram_range,
    min_df=10,  # minimum number of docs that must contain an n-gram to include it as a column
    # tokenizer=lambda x: [x_i.strip() for x_i in x.split()],  # uncomment to keep '*' characters as tokens
)
```
There is also a `stop_words` parameter that is sometimes useful.
One fun thing about keeping my lab notebook in digital form with IPython Notebooks is that I can flip through my old work so easily. Did I say fun? I meant scary, and sometimes depressing. But yes, also fun.
For example, two years ago, I was working on some projects that are still not wrapped up today, and I was doing a lot of prep for the first edition of my now re-titled “machine learning for health metricians” class.
Hey, that includes the answer to [a question someone just asked on stats.stackexchange](http://stats.stackexchange.com/q/149801/18291).
Doctors love decision trees, computer scientists love recursion, so maybe that’s why decision trees have been coming up so much in the Artificial Intelligence for Health Metricians class I’m teaching this quarter. We’ve been very sklearn-focused in our labs so far, but I thought my students might like to see how to build their own decision tree learner from scratch. So I put together this little notebook for them. Unfortunately, it is a little too complicated to make them do it themselves in a quarter-long class with no programming prerequisites.
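To give a flavor of what a from-scratch learner involves (this is my own bare-bones sketch, not the notebook's actual code): recursively find the split that most reduces Gini impurity, and stop when no split helps or the depth budget runs out.

```python
import numpy as np

def gini(y):
    """Gini impurity of a label array."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Return the (feature, threshold) pair that most reduces Gini impurity, or None."""
    best, best_score = None, gini(y)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left = X[:, j] <= t
            if left.all() or (~left).all():
                continue  # split puts everything on one side; skip it
            score = left.mean() * gini(y[left]) + (~left).mean() * gini(y[~left])
            if score < best_score:
                best, best_score = (j, t), score
    return best

def fit_tree(X, y, max_depth=3):
    """Recursively grow a classification tree, represented as nested dicts."""
    split = best_split(X, y) if max_depth > 0 else None
    if split is None:
        # leaf node: predict the majority class
        vals, counts = np.unique(y, return_counts=True)
        return vals[np.argmax(counts)]
    j, t = split
    left = X[:, j] <= t
    return {"feature": j, "threshold": t,
            "left": fit_tree(X[left], y[left], max_depth - 1),
            "right": fit_tree(X[~left], y[~left], max_depth - 1)}

def predict_one(tree, x):
    """Walk the nested-dict tree down to a leaf label."""
    while isinstance(tree, dict):
        branch = "left" if x[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree

# tiny synthetic example: class is determined by the first feature
X = np.array([[1, 5], [2, 6], [8, 1], [9, 2]])
y = np.array([0, 0, 1, 1])
tree = fit_tree(X, y)
preds = [predict_one(tree, x) for x in X]
```

Even this stripped-down version has the recursion, the impurity criterion, and the stopping rules that make the real thing hard for students who are new to programming.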
To complement that ASA address about what statistics is that I read last week, here is the analogous KDD address about what data mining is: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=754923
Does the talk exist somewhere?