scikits.learn.feature_extraction.text.SparseHashingVectorizer

class scikits.learn.feature_extraction.text.SparseHashingVectorizer(dim=100000, probes=1, use_idf=True, analyzer=<scikits.learn.feature_extraction.text.WordNGramAnalyzer object at 0x38a8430>)

Compute term frequency vectors over a hashed term space, stored in a sparse matrix

The logic is the same as in HashingVectorizer, but much larger vector dimensions can be used without memory issues because the tf vectors are stored in a scipy.sparse data structure.

This class requires scipy 0.7 or higher.
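The hashing idea described above can be illustrated with a minimal standard-library sketch. It is not the library's implementation: the regex tokenizer, the use of zlib.crc32 as the hash function, and the helper name hashed_term_frequencies are assumptions made for illustration, and the probes and use_idf parameters are ignored. A plain dict mapping column index to value stands in for one row of a scipy.sparse matrix, so only the non-zero entries are stored regardless of how large dim is.

```python
import re
import zlib
from collections import defaultdict


def hashed_term_frequencies(document, dim=100000):
    """Return {column_index: term frequency} for a single document.

    Hypothetical helper: each token is hashed into one of `dim` columns,
    and only the non-zero columns are kept, mimicking a sparse row.
    """
    tokens = re.findall(r"\w+", document.lower())
    counts = defaultdict(float)
    for token in tokens:
        # crc32 is stable across runs, unlike the built-in hash() for str.
        index = zlib.crc32(token.encode("utf-8")) % dim
        counts[index] += 1.0
    # Normalize counts to frequencies; guard against empty documents.
    total = float(len(tokens)) or 1.0
    return {index: count / total for index, count in counts.items()}


docs = ["the cat sat on the mat", "the dog barked"]
rows = [hashed_term_frequencies(doc) for doc in docs]
```

Each row holds at most one entry per distinct token in the document, so memory use grows with document length rather than with dim. Two different tokens may hash to the same column, in which case their frequencies are added together.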

Methods

get_idf
get_tfidf
get_vectors
hash_sign
vectorize
vectorize_files

__init__(dim=100000, probes=1, use_idf=True, analyzer=<scikits.learn.feature_extraction.text.WordNGramAnalyzer object at 0x38a8430>)

get_tfidf()

Compute the TF-log(IDF) vectors of the sampled documents
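As a rough illustration of what a TF-log(IDF) weighting can look like on sparse rows, the sketch below multiplies each term frequency by the logarithm of the inverse document frequency of its column. The exact formula used by get_tfidf is not stated here, so the definition idf = log(n_docs / document_frequency) and the helper name tfidf_rows are assumptions, and the rows are plain dicts rather than scipy.sparse matrices.

```python
import math


def tfidf_rows(tf_rows):
    """Weight sparse tf rows ({column: tf}) by log(n_docs / df).

    Hypothetical helper; the IDF definition is an assumption.
    """
    n_docs = len(tf_rows)
    # Document frequency: number of rows in which each column is non-zero.
    df = {}
    for row in tf_rows:
        for index in row:
            df[index] = df.get(index, 0) + 1
    return [
        {index: tf * math.log(n_docs / df[index]) for index, tf in row.items()}
        for row in tf_rows
    ]


tf_rows = [{0: 0.5, 7: 0.5}, {0: 0.25, 3: 0.75}]
weighted = tfidf_rows(tf_rows)
```

With this definition, a column that is non-zero in every document (column 0 above) receives a weight of zero, while columns that occur in fewer documents are weighted more heavily.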

vectorize(text_documents)

Vectorize a batch of documents given as Python UTF-8 encoded strings or unicode objects

vectorize_files(document_filepaths)

Vectorize a batch of utf-8 text files
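A file-based entry point of this kind presumably reads and decodes each file before vectorizing its text. The sketch below shows only that reading step in modern Python; the helper name read_utf8_documents and the temporary demo files are hypothetical and are not part of scikits.learn.

```python
import os
import tempfile


def read_utf8_documents(document_filepaths):
    """Yield the decoded text of each UTF-8 file, one document per file."""
    for path in document_filepaths:
        with open(path, encoding="utf-8") as handle:
            yield handle.read()


# Demo: write two small UTF-8 files, then read them back as documents.
tmp_dir = tempfile.mkdtemp()
paths = []
for name, text in [("a.txt", "the cat sat"), ("b.txt", "caf\u00e9 au lait")]:
    path = os.path.join(tmp_dir, name)
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)
    paths.append(path)

documents = list(read_utf8_documents(paths))
```

Because the helper is a generator, files are read one at a time, so a large batch does not need to be held in memory before vectorization starts.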