scikits.learn.feature_extraction.text.WordNGramAnalyzer
- class scikits.learn.feature_extraction.text.WordNGramAnalyzer(default_charset='utf-8', min_n=1, max_n=1, filters=[<method 'lower' of 'unicode' objects>, <function strip_tags at 0x3898210>, <function strip_accents at 0x3896830>], stop_words=None)
Simple analyzer: transforms a text document into a sequence of word tokens.
- This simple implementation performs:
- lowercase conversion
- removal of unicode accents
- token extraction using a unicode regexp with word boundaries, keeping tokens of at least 2 characters
- output of token n-grams (unigrams only by default)
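The steps listed above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the function names `strip_accents_sketch` and `analyze_sketch`, the token pattern, the tag-stripping pattern, and the point at which stop words are removed are all assumptions made for the example. Only the parameter names `min_n`, `max_n` and `stop_words` come from the signature above.

```python
import re
import unicodedata

# Assumed pattern: runs of two or more word characters between word boundaries.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b", re.UNICODE)
# Assumed, very rough stand-in for tag stripping: drop anything between < and >.
TAG_PATTERN = re.compile(r"<[^>]*>")


def strip_accents_sketch(text):
    """Decompose characters (NFKD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def analyze_sketch(text, min_n=1, max_n=1, stop_words=None):
    """Return word n-grams of sizes min_n..max_n as space-joined strings."""
    # Filters: lowercase, strip tags, strip accents.
    text = TAG_PATTERN.sub(" ", text.lower())
    text = strip_accents_sketch(text)
    # Tokenize, keeping only tokens of at least 2 characters.
    tokens = TOKEN_PATTERN.findall(text)
    # Assumption: stop words are removed before n-grams are built.
    if stop_words is not None:
        tokens = [t for t in tokens if t not in stop_words]
    ngrams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[i:i + n]))
    return ngrams


unigrams = analyze_sketch("Le <b>Café</b> est à Paris")
bigrams_too = analyze_sketch("Le <b>Café</b> est à Paris", max_n=2)
```

With the defaults, the single-character token "à" is dropped by the 2-character minimum, and raising `max_n` to 2 appends the bigrams after the unigrams.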
Methods
analyze
- __init__(default_charset='utf-8', min_n=1, max_n=1, filters=[<method 'lower' of 'unicode' objects>, <function strip_tags at 0x3898210>, <function strip_accents at 0x3896830>], stop_words=None)