January 29, 2016 · 8:00 am

Using the sklearn text.CountVectorizer

I have been having great success with the scikit-learn CountVectorizer transformation. Here are some notes on how I like to use it:

import sklearn.feature_extraction.text

ngram_range = (1, 2)  # use both single words and two-word phrases

clf = sklearn.feature_extraction.text.CountVectorizer(
    ngram_range=ngram_range,
    min_df=10,  # minimum number of docs that must contain n-gram to include as a column
    #tokenizer=lambda x: x.split()  # split on whitespace only, to keep '*' characters as tokens
)
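To make the effect of these settings concrete, here is a minimal sketch on a toy corpus. The documents are made up for illustration, and min_df is lowered to 2 because three documents could never meet a threshold of 10. The second half shows what the commented-out tokenizer option is for: the default token pattern drops punctuation and single-character tokens, while splitting on whitespace keeps '*' as a token.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for illustration only
docs = [
    "fever and cough",
    "fever and rash",
    "cough and chest pain",
]

# Same settings as above, except min_df=2 to suit a three-document corpus
vectorizer = CountVectorizer(ngram_range=(1, 2), min_df=2)
X = vectorizer.fit_transform(docs)  # sparse matrix: one row per doc, one column per n-gram

# Only n-grams found in at least two documents become columns
vocab = sorted(vectorizer.vocabulary_)
print(vocab)
print(X.toarray())

# Whitespace tokenizer: keeps '*' instead of discarding it.
# token_pattern=None just signals that the default pattern is not used.
star_docs = ["fever * cough", "rash * fever"]
default_vocab = sorted(CountVectorizer().fit(star_docs).vocabulary_)
split_vocab = sorted(
    CountVectorizer(tokenizer=str.split, token_pattern=None).fit(star_docs).vocabulary_
)
print(default_vocab)
print(split_vocab)
```

The result of fit_transform is a scipy sparse matrix, so it can be passed straight to most scikit-learn estimators without converting it to a dense array.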

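There is also a stop_words parameter that is sometimes useful: stop_words='english' drops common English words using a built-in list, and a custom list of words can be passed instead.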
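A small sketch of what stop_words changes, again on made-up documents. It compares the vocabulary learned with and without the built-in English list; the variable names are only for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents, invented for illustration only
docs = ["the fever and the cough", "a rash and a fever"]

# Without stop words: every token of two or more characters is kept
plain_vocab = sorted(CountVectorizer().fit(docs).vocabulary_)

# With the built-in English list: words like "the" and "and" are dropped
filtered_vocab = sorted(CountVectorizer(stop_words="english").fit(docs).vocabulary_)

# With a custom list: only the listed words are dropped
custom_vocab = sorted(CountVectorizer(stop_words=["fever"]).fit(docs).vocabulary_)

print(plain_vocab)
print(filtered_vocab)
print(custom_vocab)
```

Note that "a" is missing even from the plain vocabulary, because the default token pattern ignores single-character tokens.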

Filed under machine learning

Tagged as ai4hm, sklearn

This material is released under the Creative Commons Noncommercial Attribution Share-Alike 3.0 License