6.14.5. scikits.learn.feature_extraction.text.CountVectorizer
CountVectorizer(analyzer=WordNGramAnalyzer(max_n=1, min_n=1, charset='utf-8', stop_words=<set of English stop words: 'all', 'six', 'less', ..., 'once'>, preprocessor=RomanPreprocessor()), vocabulary={}, max_df=1.0, max_features=None, dtype=<type 'long'>)
Convert a collection of raw documents to a matrix of token counts.

This implementation produces a dense representation of the counts using a numpy array.

If you do not provide an a priori dictionary and you do not use an analyzer that performs some kind of feature selection, then the number of features (the vocabulary size found by analyzing the data) might be very large and the count vectors might not fit in memory.

In that case it is recommended to use either the sparse.CountVectorizer variant of this class or a HashingVectorizer, which reduces the dimensionality to an arbitrary number of features using random projection.
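For instance, a minimal end-to-end sketch (the two-document corpus and variable names are illustrative, assuming the 0.x-era scikits.learn import path from the heading):

>>> from scikits.learn.feature_extraction.text import CountVectorizer
>>> documents = ['the quick brown fox', 'jumps over the lazy dog']
>>> vectorizer = CountVectorizer()
>>> counts = vectorizer.fit_transform(documents)   # dense numpy array of shape (2, n_features)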
Parameters:
analyzer : WordNGramAnalyzer or CharNGramAnalyzer, optional
vocabulary : dict, optional
    A dictionary where keys are tokens and values are indices in the matrix. This is useful in order to fix the vocabulary in advance.
dtype : type, optional
    Type of the matrix returned by fit_transform() or transform().
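A sketch of the vocabulary and dtype parameters (continuing from the import above; the token-to-column mapping is made up, and it is assumed here, per the parameter description, that a fixed vocabulary allows transform to be used without a prior fit):

>>> import numpy as np
>>> vectorizer = CountVectorizer(vocabulary={'fox': 0, 'dog': 1}, dtype=np.int32)
>>> counts = vectorizer.transform(['the quick brown fox', 'the lazy dog'])   # shape (2, 2): only the two fixed tokens are counted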
Methods

fit, fit_transform, transform

CountVectorizer.fit(raw_documents, y=None)
Learn a vocabulary dictionary of all tokens in the raw documents.

Parameters:
raw_documents : iterable
    An iterable which yields either str, unicode or file objects.

Returns:
self
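Because the iterable may also yield file objects, a corpus can be fitted directly from disk without loading it into strings first; a hedged sketch with hypothetical file names:

>>> files = [open('doc1.txt'), open('doc2.txt')]   # hypothetical files; each file object is read as one document
>>> vectorizer = CountVectorizer().fit(files)      # fit returns self, so construction and fitting can be chained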
CountVectorizer.fit_transform(raw_documents, y=None)

Learn the vocabulary dictionary and return the count vectors.

This is more efficient than calling fit followed by transform.

Parameters:
raw_documents : iterable
    An iterable which yields either str, unicode or file objects.

Returns:
vectors : array, [n_samples, n_features]
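A sketch of the equivalence (continuing from the import above): the single call produces the same count matrix, while the two-step form analyzes the documents twice, which is the inefficiency noted here.

>>> documents = ['the quick brown fox', 'jumps over the lazy dog']
>>> counts = CountVectorizer().fit_transform(documents)
>>> # same result in two passes; the corpus is analyzed twice:
>>> counts_again = CountVectorizer().fit(documents).transform(documents)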
CountVectorizer.transform(raw_documents)

Extract token counts out of raw text documents.

Parameters:
raw_documents : iterable
    An iterable which yields either str, unicode or file objects.

Returns:
vectors : array, [n_samples, n_features]
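transform counts tokens against a vocabulary learned earlier with fit (or fixed in the constructor); tokens never seen at fit time receive no column. A minimal sketch with an illustrative corpus, continuing from the import above:

>>> vectorizer = CountVectorizer().fit(['apples oranges', 'oranges lemons'])
>>> new_counts = vectorizer.transform(['lemons limes'])   # 'limes' is absent from the fitted vocabulary and is ignored
>>> # new_counts has one row and as many columns as the fitted vocabulary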