scikits.learn.feature_extraction.text.SparseHashingVectorizer

class scikits.learn.feature_extraction.text.SparseHashingVectorizer(dim=100000, probes=1, use_idf=True, analyzer=<scikits.learn.feature_extraction.text.WordNGramAnalyzer object at 0x38a8430>)

Compute term frequency vectors in a hashed term space, stored in a sparse matrix.

The logic is the same as HashingVectorizer, but the term frequency (tf) vectors are stored in a scipy.sparse data structure, so much larger vector dimensions can be used without running into memory issues.

This class requires scipy 0.7 or higher.
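The idea of hashing terms into a fixed-size sparse vector can be sketched as follows. This is a minimal illustration and not the class's actual implementation: the function name `hashed_tf_matrix`, the whitespace tokenizer (standing in for the analyzer) and the use of `zlib.crc32` as the hash function are all assumptions made for the example, and the `probes` parameter is not modelled.

```python
import zlib

import scipy.sparse as sp


def hashed_tf_matrix(documents, dim=100000):
    """Build a sparse (n_documents x dim) matrix of hashed term frequencies.

    Hypothetical helper for illustration only; it does not reproduce
    SparseHashingVectorizer.
    """
    rows, cols, vals = [], [], []
    for i, doc in enumerate(documents):
        # Assumption: a plain whitespace tokenizer stands in for the analyzer.
        tokens = doc.lower().split()
        for token in tokens:
            # Assumption: crc32 is used only because it is a stable hash.
            j = zlib.crc32(token.encode("utf-8")) % dim
            rows.append(i)
            cols.append(j)
            # Each occurrence contributes 1 / document length.
            vals.append(1.0 / len(tokens))
    coo = sp.coo_matrix(
        (vals, (rows, cols)), shape=(len(documents), dim), dtype=float
    )
    # Converting COO to CSR sums repeated (row, column) entries.
    return coo.tocsr()


X = hashed_tf_matrix(["the cat sat on the mat", "the dog"], dim=100000)
print(X.shape, X.nnz)
```

Only the nonzero entries are stored, so memory use grows with the number of distinct hashed terms per document rather than with `dim`.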

Methods

`get_idf`
`get_tfidf`
`get_vectors`
`hash_sign`
`vectorize`
`vectorize_files`

__init__(dim=100000, probes=1, use_idf=True, analyzer=<scikits.learn.feature_extraction.text.WordNGramAnalyzer object at 0x38a8430>)

get_tfidf(): Compute the TF-log(IDF) vectors of the sampled documents.

vectorize(text_documents): Vectorize a batch of documents given as Python UTF-8 encoded strings or unicode strings.

vectorize_files(document_filepaths): Vectorize a batch of UTF-8 text files.
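The TF-log(IDF) weighting mentioned under get_tfidf can be illustrated with a short sketch. The exact formula used by the class is not stated here, so the example assumes the common form tf * log(n_documents / document_frequency); the function name `tf_log_idf` is hypothetical.

```python
import numpy as np
import scipy.sparse as sp


def tf_log_idf(tf):
    """Weight a (n_documents x n_features) tf matrix by log(IDF).

    Assumption: idf = log(n_documents / document_frequency). The formula
    actually used by get_tfidf may differ.
    """
    tf = sp.csr_matrix(tf, dtype=float, copy=True)
    tf.eliminate_zeros()  # make sure stored entries are truly nonzero
    n_docs, n_features = tf.shape
    # Document frequency: number of documents with a nonzero value per column.
    df = np.bincount(tf.indices, minlength=n_features)
    idf = np.zeros(n_features)
    seen = df > 0
    idf[seen] = np.log(n_docs / df[seen])
    # Scale each column by its idf weight; the result stays sparse.
    return tf.dot(sp.diags(idf)).tocsr()


tf = np.array([[1.0, 1.0],
               [1.0, 0.0]])
print(tf_log_idf(tf).toarray())
```

Under this assumed formula, a term that occurs in every document gets a weight of zero, because log(1) = 0.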
