scikits.learn.feature_extraction.text.TermCountVectorizer

class scikits.learn.feature_extraction.text.TermCountVectorizer(vocabulary={})

Convert a document collection to a document-term matrix.

Parameters :

vocabulary : dict, optional

A dictionary whose keys are tokens and whose values are the corresponding column indices in the matrix. Pass it to fix the vocabulary in advance.
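
As a rough illustration of what a fixed vocabulary means, the sketch below uses plain Python rather than the library itself. The `count_with_vocabulary` helper is hypothetical, and skipping tokens that are missing from the vocabulary is an assumption made for this example, not documented behaviour of `TermCountVectorizer`.

```python
# Hypothetical illustration; not the scikits.learn implementation.
# Keys are tokens, values are column indices in the output row.
vocabulary = {"apple": 0, "banana": 1, "cherry": 2}

def count_with_vocabulary(tokens, vocabulary):
    """Return one row of term counts, one column per vocabulary index."""
    row = [0] * len(vocabulary)
    for token in tokens:
        index = vocabulary.get(token)
        if index is not None:  # assumption: unknown tokens are skipped
            row[index] += 1
    return row

row = count_with_vocabulary(["banana", "apple", "banana", "kiwi"], vocabulary)
print(row)  # [1, 2, 0]
```

Because the indices are fixed up front, every document maps to a row of the same width, regardless of which tokens it actually contains.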

Methods

`fit`
`transform`

fit(tokenized_documents, y=None)

Does nothing: learning is postponed until the first call to transform().

transform(tokenized_documents)

Learn the vocabulary dictionary if necessary and return the vectors.

Parameters :

tokenized_documents : list

A list of tokenized documents.

Returns :

vectors : array, [n_samples, n_features]
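
The behaviour described above (a `fit` that does nothing, with the vocabulary learned on the first `transform` call) can be sketched in plain Python. `LazyCountVectorizer` is a hypothetical stand-in written for this page. Returning `self` from `fit`, assigning indices in order of first appearance, and returning a list of lists instead of an array are all assumptions; the real class may differ on each point.

```python
class LazyCountVectorizer:
    """Hypothetical stand-in; not the scikits.learn class."""

    def __init__(self, vocabulary=None):
        # Copy so the caller's dict is never mutated.
        self.vocabulary = dict(vocabulary) if vocabulary else {}

    def fit(self, tokenized_documents, y=None):
        # Learning is deferred to transform(), so nothing happens here.
        return self  # assumption: returning self allows call chaining

    def transform(self, tokenized_documents):
        if not self.vocabulary:
            # Assumption: indices follow the order of first appearance.
            for tokens in tokenized_documents:
                for token in tokens:
                    self.vocabulary.setdefault(token, len(self.vocabulary))
        n_features = len(self.vocabulary)
        vectors = []
        for tokens in tokenized_documents:
            row = [0] * n_features
            for token in tokens:
                index = self.vocabulary.get(token)
                if index is not None:
                    row[index] += 1
            vectors.append(row)
        return vectors  # one row per document, one column per token

docs = [["the", "cat", "sat"], ["the", "dog", "sat", "sat"]]
vectorizer = LazyCountVectorizer().fit(docs)   # vocabulary is still empty
vectors = vectorizer.transform(docs)           # vocabulary learned here
print(vectors)  # [[1, 1, 1, 0], [1, 0, 2, 1]]
```

In this sketch the learned vocabulary is `{"the": 0, "cat": 1, "sat": 2, "dog": 3}`, so the result has 2 rows (n_samples) and 4 columns (n_features).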