scikits.learn.feature_extraction.text.TermCountVectorizer

class scikits.learn.feature_extraction.text.TermCountVectorizer(vocabulary={})

Convert a document collection to a document-term matrix.

Parameters:

vocabulary: dict, optional

A dictionary whose keys are tokens and whose values are the corresponding column indices in the matrix. Pass it to fix the vocabulary in advance.
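
To make the token-to-column mapping concrete, here is a minimal pure-Python sketch of counting with a fixed vocabulary. It is not the library's implementation: the helper name count_terms is hypothetical, and skipping tokens that are absent from the vocabulary is an assumption made for the illustration.

```python
def count_terms(tokenized_documents, vocabulary):
    """Build a dense document-term count matrix as a list of rows.

    Hypothetical helper for illustration only; not part of scikits.learn.
    """
    # One column per vocabulary index.
    n_features = max(vocabulary.values()) + 1 if vocabulary else 0
    rows = []
    for tokens in tokenized_documents:
        row = [0] * n_features
        for token in tokens:
            index = vocabulary.get(token)
            # Assumption: tokens outside the fixed vocabulary are skipped.
            if index is not None:
                row[index] += 1
        rows.append(row)
    return rows


# Keys are tokens, values are column indices in the matrix.
vocabulary = {"apple": 0, "banana": 1, "cherry": 2}
docs = [["apple", "banana", "apple"], ["cherry", "durian"]]
matrix = count_terms(docs, vocabulary)
```

With this vocabulary, the first row counts "apple" twice in column 0 and "banana" once in column 1, while "durian" contributes nothing because it has no column.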

Methods

fit
transform
__init__(vocabulary={})
fit(tokenized_documents, y=None)

Learning is deferred until the first call to transform(), so this method does nothing.
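
The deferred-learning behaviour can be sketched as follows. This is an illustrative stand-in under stated assumptions, not the library's code: the class name LazyCountVectorizer is hypothetical, indices are assigned in order of first appearance, fit() is assumed to return the object itself, and the result is a plain list of rows rather than an array.

```python
class LazyCountVectorizer:
    """Hypothetical sketch of a vectorizer that learns lazily in transform()."""

    def __init__(self, vocabulary=None):
        # Copy the mapping so a caller's dict is not modified
        # (design choice of this sketch).
        self.vocabulary = dict(vocabulary) if vocabulary else {}

    def fit(self, tokenized_documents, y=None):
        # Nothing is learned here; the work happens in transform().
        return self

    def transform(self, tokenized_documents):
        # Learn the vocabulary only if none was given or learned before.
        if not self.vocabulary:
            for tokens in tokenized_documents:
                for token in tokens:
                    # Assumption: indices follow order of first appearance.
                    self.vocabulary.setdefault(token, len(self.vocabulary))
        n_features = max(self.vocabulary.values()) + 1 if self.vocabulary else 0
        vectors = []
        for tokens in tokenized_documents:
            row = [0] * n_features
            for token in tokens:
                index = self.vocabulary.get(token)
                if index is not None:
                    row[index] += 1
            vectors.append(row)
        return vectors


docs = [["a", "b", "a"], ["b", "c"]]
vectorizer = LazyCountVectorizer().fit(docs)
vocabulary_after_fit = dict(vectorizer.vocabulary)  # still empty
vectors = vectorizer.transform(docs)                # vocabulary learned here
```

In this sketch the vocabulary is still empty after fit() and is filled in by the first transform() call; later calls reuse the learned mapping.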

transform(tokenized_documents)

Learn the vocabulary dictionary if it has not already been set, then return the vectors.

Parameters:

tokenized_documents: list

A list of tokenized documents.

Returns:

vectors: array, [n_samples, n_features]