scikit-learn: machine learning in Python — scikit-learn 0.15.2 documentation

scikit-learn Tutorials
- An introduction to machine learning with scikit-learn
- A tutorial on statistical-learning for scientific data processing
- Working With Text Data
1. Supervised learning
- 1.1. Generalized Linear Models
- 1.2. Support Vector Machines
- 1.3. Stochastic Gradient Descent
- 1.4. Nearest Neighbors
- 1.5. Gaussian Processes
- 1.6. Cross decomposition
- 1.7. Naive Bayes
- 1.8. Decision Trees
- 1.9. Ensemble methods
- 1.10. Multiclass and multilabel algorithms
- 1.11. Feature selection
- 1.12. Semi-Supervised
  - 1.12.1. Label Propagation
- 1.13. Linear and quadratic discriminant analysis
  - 1.13.1. Dimensionality reduction using LDA
  - 1.13.2. Mathematical Idea
- 1.14. Isotonic regression
2. Unsupervised learning
- 2.1. Gaussian mixture models
- 2.2. Manifold learning
- 2.3. Clustering
- 2.4. Biclustering
- 2.5. Decomposing signals in components (matrix factorization problems)
- 2.6. Covariance estimation
- 2.7. Novelty and Outlier Detection
  - 2.7.1. Novelty Detection
  - 2.7.2. Outlier Detection
    - 2.7.2.1. Fitting an elliptic envelope
    - 2.7.2.2. One-class SVM versus elliptic envelope
- 2.8. Density Estimation
  - 2.8.1. Density Estimation: Histograms
  - 2.8.2. Kernel Density Estimation
- 2.9. Neural network models (unsupervised)
  - 2.9.1. Restricted Boltzmann machines
3. Model selection and evaluation
- 3.1. Cross-validation: evaluating estimator performance
- 3.2. Grid Search: Searching for estimator parameters
- 3.3. Pipeline: chaining estimators
  - 3.3.1. Usage
  - 3.3.2. Notes
- 3.4. FeatureUnion: Combining feature extractors
  - 3.4.1. Usage
- 3.5. Model evaluation: quantifying the quality of predictions
- 3.6. Model persistence
  - 3.6.1. Persistence example
  - 3.6.2. Security & maintainability limitations
- 3.7. Validation curves: plotting scores to evaluate models
  - 3.7.1. Validation curve
  - 3.7.2. Learning curve
4. Dataset transformations
- 4.1. Feature extraction
- 4.2. Preprocessing data
- 4.3. Kernel Approximation
- 4.4. Random Projection
- 4.5. Pairwise metrics, Affinities and Kernels
  - 4.5.1. Cosine similarity
  - 4.5.2. Chi Squared Kernel
5. Dataset loading utilities
- 5.1. General dataset API
- 5.2. Toy datasets
- 5.3. Sample images
- 5.4. Sample generators
- 5.5. Datasets in svmlight / libsvm format
- 5.6. The Olivetti faces dataset
- 5.7. The 20 newsgroups text dataset
- 5.8. Downloading datasets from the mldata.org repository
- 5.9. The Labeled Faces in the Wild face recognition dataset
  - 5.9.1. Usage
  - 5.9.2. Examples
- 5.10. Forest covertypes
6. Strategies to scale computationally: bigger data
- 6.1. Scaling with instances using out-of-core learning
7. Computational Performance
- 7.1. Prediction Latency
- 7.2. Prediction Throughput
- 7.3. Tips and Tricks
Examples
- General examples
- Examples based on real world datasets
- Biclustering
- Clustering
- Covariance estimation
- Cross decomposition
- Dataset examples
- Decomposition
- Ensemble methods
- Tutorial exercises
- Gaussian Process for Machine Learning
- Generalized Linear Models
- Manifold learning
- Gaussian Mixture Models
- Nearest Neighbors
- Semi Supervised Classification
- Support Vector Machines
- Decision Trees
Frequently Asked Questions
- What is the project name (a lot of people get it wrong)?
- How do you pronounce the project name?
- Why scikit?
- How can I contribute to scikit-learn?
- Can I add this new algorithm that I (or someone else) just published?
- Can I add this classical algorithm from the 80s?
- Why did you remove HMMs from scikit-learn?
- Will you add graphical models or sequence prediction to scikit-learn?
- Will you add GPU support?
- Do you support PyPy?
Support
- Mailing List
- User questions
- Bug tracker
- IRC
- Documentation resources
0.15.2
- Bug fixes
0.15.1
- Bug fixes
- API changes summary
0.15
- Highlights
- Changelog
- API changes summary
- People
0.14
- Changelog
- API changes summary
- People
0.13.1
- Changelog
- People
0.13
- New Estimator Classes
- Changelog
- API changes summary
- People
0.12.1
- Changelog
- People
0.12
- Changelog
- API changes summary
- People
0.11
- Changelog
  - Highlights
  - Other changes
- API changes summary
- People
0.10
- Changelog
- API changes summary
- People
0.9
- Changelog
- API changes summary
- People
0.8
- Changelog
- People
0.7
- Changelog
- People
0.6
- Changelog
- People
0.5
- Changelog
- New classes
- Documentation
- Fixes
- Examples
- External dependencies
- Removed modules
- Misc
- Authors
0.4
- Changelog
- Authors
Earlier versions
External Resources, Videos and Talks
- New to Scientific Python?
- External Tutorials
- Videos
About us
- History
- People
- Citing scikit-learn
- Artwork
- Funding
  - Donating to the project
  - The 2013 Paris international sprint
- Infrastructure support
Documentation of scikit-learn 0.15
Forest covertypes
The Labeled Faces in the Wild face recognition dataset
- Usage
- Examples
Downloading datasets from the mldata.org repository
The Olivetti faces dataset
The 20 newsgroups text dataset
- Usage
- Converting text to vectors
- Filtering text for more realistic training
Reference
- sklearn.base: Base classes and utility functions
  - Base classes
  - Functions
    - sklearn.base.clone
- sklearn.cluster: Clustering
  - Classes
  - Functions
- sklearn.cluster.bicluster: Biclustering
  - Classes
    - sklearn.cluster.bicluster.SpectralBiclustering
    - sklearn.cluster.bicluster.SpectralCoclustering
- sklearn.covariance: Covariance Estimators
  - sklearn.covariance.EmpiricalCovariance
  - sklearn.covariance.EllipticEnvelope
  - sklearn.covariance.GraphLasso
  - sklearn.covariance.GraphLassoCV
  - sklearn.covariance.LedoitWolf
  - sklearn.covariance.MinCovDet
  - sklearn.covariance.OAS
  - sklearn.covariance.ShrunkCovariance
  - sklearn.covariance.empirical_covariance
  - sklearn.covariance.ledoit_wolf
  - sklearn.covariance.shrunk_covariance
  - sklearn.covariance.oas
  - sklearn.covariance.graph_lasso
- sklearn.cross_validation: Cross Validation
  - sklearn.cross_validation.KFold
  - sklearn.cross_validation.LeaveOneLabelOut
  - sklearn.cross_validation.LeaveOneOut
  - sklearn.cross_validation.LeavePLabelOut
  - sklearn.cross_validation.LeavePOut
  - sklearn.cross_validation.StratifiedKFold
  - sklearn.cross_validation.ShuffleSplit
  - sklearn.cross_validation.StratifiedShuffleSplit
  - sklearn.cross_validation.train_test_split
  - sklearn.cross_validation.cross_val_score
  - sklearn.cross_validation.permutation_test_score
  - sklearn.cross_validation.check_cv
- sklearn.datasets: Datasets
  - Loaders
  - Samples generator
- sklearn.decomposition: Matrix Decomposition
  - sklearn.decomposition.PCA
  - sklearn.decomposition.ProjectedGradientNMF
  - sklearn.decomposition.RandomizedPCA
  - sklearn.decomposition.KernelPCA
  - sklearn.decomposition.FactorAnalysis
  - sklearn.decomposition.FastICA
  - sklearn.decomposition.TruncatedSVD
  - sklearn.decomposition.NMF
  - sklearn.decomposition.SparsePCA
  - sklearn.decomposition.MiniBatchSparsePCA
  - sklearn.decomposition.SparseCoder
  - sklearn.decomposition.DictionaryLearning
  - sklearn.decomposition.MiniBatchDictionaryLearning
  - sklearn.decomposition.fastica
  - sklearn.decomposition.dict_learning
  - sklearn.decomposition.dict_learning_online
  - sklearn.decomposition.sparse_encode
- sklearn.dummy: Dummy estimators
  - sklearn.dummy.DummyClassifier
  - sklearn.dummy.DummyRegressor
- sklearn.ensemble: Ensemble Methods
  - sklearn.ensemble.AdaBoostClassifier
  - sklearn.ensemble.AdaBoostRegressor
  - sklearn.ensemble.BaggingClassifier
  - sklearn.ensemble.BaggingRegressor
  - sklearn.ensemble.ExtraTreesClassifier
  - sklearn.ensemble.ExtraTreesRegressor
  - sklearn.ensemble.GradientBoostingClassifier
  - sklearn.ensemble.GradientBoostingRegressor
  - sklearn.ensemble.RandomForestClassifier
  - sklearn.ensemble.RandomTreesEmbedding
  - sklearn.ensemble.RandomForestRegressor
  - partial dependence
    - sklearn.ensemble.partial_dependence.partial_dependence
    - sklearn.ensemble.partial_dependence.plot_partial_dependence
- sklearn.feature_extraction: Feature Extraction
  - sklearn.feature_extraction.DictVectorizer
  - sklearn.feature_extraction.FeatureHasher
  - From images
  - From text
- sklearn.feature_selection: Feature Selection
  - sklearn.feature_selection.GenericUnivariateSelect
  - sklearn.feature_selection.SelectPercentile
  - sklearn.feature_selection.SelectKBest
  - sklearn.feature_selection.SelectFpr
  - sklearn.feature_selection.SelectFdr
  - sklearn.feature_selection.SelectFwe
  - sklearn.feature_selection.RFE
  - sklearn.feature_selection.RFECV
  - sklearn.feature_selection.VarianceThreshold
  - sklearn.feature_selection.chi2
  - sklearn.feature_selection.f_classif
  - sklearn.feature_selection.f_regression
- sklearn.gaussian_process: Gaussian Processes
  - sklearn.gaussian_process.GaussianProcess
  - sklearn.gaussian_process.correlation_models.absolute_exponential
  - sklearn.gaussian_process.correlation_models.squared_exponential
  - sklearn.gaussian_process.correlation_models.generalized_exponential
  - sklearn.gaussian_process.correlation_models.pure_nugget
  - sklearn.gaussian_process.correlation_models.cubic
  - sklearn.gaussian_process.correlation_models.linear
  - sklearn.gaussian_process.regression_models.constant
  - sklearn.gaussian_process.regression_models.linear
  - sklearn.gaussian_process.regression_models.quadratic
- sklearn.grid_search: Grid Search
  - sklearn.grid_search.GridSearchCV
  - sklearn.grid_search.ParameterGrid
  - sklearn.grid_search.ParameterSampler
  - sklearn.grid_search.RandomizedSearchCV
- sklearn.isotonic: Isotonic regression
  - sklearn.isotonic.IsotonicRegression
  - sklearn.isotonic.isotonic_regression
  - sklearn.isotonic.check_increasing
- sklearn.kernel_approximation: Kernel Approximation
  - sklearn.kernel_approximation.AdditiveChi2Sampler
  - sklearn.kernel_approximation.Nystroem
  - sklearn.kernel_approximation.RBFSampler
  - sklearn.kernel_approximation.SkewedChi2Sampler
- sklearn.lda: Linear Discriminant Analysis
  - sklearn.lda.LDA
- sklearn.learning_curve: Learning curve evaluation
  - sklearn.learning_curve.learning_curve
  - sklearn.learning_curve.validation_curve
- sklearn.linear_model: Generalized Linear Models
  - sklearn.linear_model.ARDRegression
  - sklearn.linear_model.BayesianRidge
  - sklearn.linear_model.ElasticNet
  - sklearn.linear_model.ElasticNetCV
  - sklearn.linear_model.Lars
  - sklearn.linear_model.LarsCV
  - sklearn.linear_model.Lasso
  - sklearn.linear_model.LassoCV
  - sklearn.linear_model.LassoLars
  - sklearn.linear_model.LassoLarsCV
  - sklearn.linear_model.LassoLarsIC
  - sklearn.linear_model.LinearRegression
  - sklearn.linear_model.LogisticRegression
  - sklearn.linear_model.MultiTaskLasso
  - sklearn.linear_model.MultiTaskElasticNet
  - sklearn.linear_model.MultiTaskLassoCV
  - sklearn.linear_model.MultiTaskElasticNetCV
  - sklearn.linear_model.OrthogonalMatchingPursuit
  - sklearn.linear_model.OrthogonalMatchingPursuitCV
  - sklearn.linear_model.PassiveAggressiveClassifier
  - sklearn.linear_model.PassiveAggressiveRegressor
  - sklearn.linear_model.Perceptron
  - sklearn.linear_model.RandomizedLasso
  - sklearn.linear_model.RandomizedLogisticRegression
  - sklearn.linear_model.RANSACRegressor
  - sklearn.linear_model.Ridge
  - sklearn.linear_model.RidgeClassifier
  - sklearn.linear_model.RidgeClassifierCV
  - sklearn.linear_model.RidgeCV
  - sklearn.linear_model.SGDClassifier
  - sklearn.linear_model.SGDRegressor
  - sklearn.linear_model.lars_path
  - sklearn.linear_model.lasso_path
  - sklearn.linear_model.lasso_stability_path
  - sklearn.linear_model.orthogonal_mp
  - sklearn.linear_model.orthogonal_mp_gram
- sklearn.manifold: Manifold Learning
  - sklearn.manifold.LocallyLinearEmbedding
  - sklearn.manifold.Isomap
  - sklearn.manifold.MDS
  - sklearn.manifold.SpectralEmbedding
  - sklearn.manifold.TSNE
  - sklearn.manifold.locally_linear_embedding
  - sklearn.manifold.spectral_embedding
- sklearn.metrics: Metrics
  - Model Selection Interface
    - sklearn.metrics.make_scorer
  - Classification metrics
  - Regression metrics
  - Clustering metrics
  - Biclustering metrics
    - sklearn.metrics.consensus_score
  - Pairwise metrics
- sklearn.mixture: Gaussian Mixture Models
  - sklearn.mixture.GMM
  - sklearn.mixture.DPGMM
  - sklearn.mixture.VBGMM
- sklearn.multiclass: Multiclass and multilabel classification
  - Multiclass and multilabel classification strategies
  - sklearn.multiclass.OneVsRestClassifier
  - sklearn.multiclass.OneVsOneClassifier
  - sklearn.multiclass.OutputCodeClassifier
  - sklearn.multiclass.fit_ovr
  - sklearn.multiclass.predict_ovr
  - sklearn.multiclass.fit_ovo
  - sklearn.multiclass.predict_ovo
  - sklearn.multiclass.fit_ecoc
  - sklearn.multiclass.predict_ecoc
- sklearn.naive_bayes: Naive Bayes
  - sklearn.naive_bayes.GaussianNB
  - sklearn.naive_bayes.MultinomialNB
  - sklearn.naive_bayes.BernoulliNB
- sklearn.neighbors: Nearest Neighbors
  - sklearn.neighbors.NearestNeighbors
  - sklearn.neighbors.KNeighborsClassifier
  - sklearn.neighbors.RadiusNeighborsClassifier
  - sklearn.neighbors.KNeighborsRegressor
  - sklearn.neighbors.RadiusNeighborsRegressor
  - sklearn.neighbors.NearestCentroid
  - sklearn.neighbors.BallTree
  - sklearn.neighbors.KDTree
  - sklearn.neighbors.DistanceMetric
  - sklearn.neighbors.KernelDensity
  - sklearn.neighbors.kneighbors_graph
  - sklearn.neighbors.radius_neighbors_graph
- sklearn.neural_network: Neural network models
  - sklearn.neural_network.BernoulliRBM
- sklearn.cross_decomposition: Cross decomposition
  - sklearn.cross_decomposition.PLSRegression
  - sklearn.cross_decomposition.PLSCanonical
  - sklearn.cross_decomposition.CCA
  - sklearn.cross_decomposition.PLSSVD
- sklearn.pipeline: Pipeline
  - sklearn.pipeline.Pipeline
  - sklearn.pipeline.FeatureUnion
  - sklearn.pipeline.make_pipeline
  - sklearn.pipeline.make_union
- sklearn.preprocessing: Preprocessing and Normalization
  - sklearn.preprocessing.Binarizer
  - sklearn.preprocessing.Imputer
  - sklearn.preprocessing.KernelCenterer
  - sklearn.preprocessing.LabelBinarizer
  - sklearn.preprocessing.LabelEncoder
  - sklearn.preprocessing.MultiLabelBinarizer
  - sklearn.preprocessing.MinMaxScaler
  - sklearn.preprocessing.Normalizer
  - sklearn.preprocessing.OneHotEncoder
  - sklearn.preprocessing.StandardScaler
  - sklearn.preprocessing.PolynomialFeatures
  - sklearn.preprocessing.add_dummy_feature
  - sklearn.preprocessing.binarize
  - sklearn.preprocessing.label_binarize
  - sklearn.preprocessing.normalize
  - sklearn.preprocessing.scale
- sklearn.qda: Quadratic Discriminant Analysis
  - sklearn.qda.QDA
- sklearn.random_projection: Random projection
  - sklearn.random_projection.GaussianRandomProjection
  - sklearn.random_projection.SparseRandomProjection
  - sklearn.random_projection.johnson_lindenstrauss_min_dim
- sklearn.semi_supervised: Semi-Supervised Learning
  - sklearn.semi_supervised.LabelPropagation
  - sklearn.semi_supervised.LabelSpreading
- sklearn.svm: Support Vector Machines
  - Estimators
  - Low-level methods
- sklearn.tree: Decision Trees
  - sklearn.tree.DecisionTreeClassifier
  - sklearn.tree.DecisionTreeRegressor
  - sklearn.tree.ExtraTreeClassifier
  - sklearn.tree.ExtraTreeRegressor
  - sklearn.tree.export_graphviz
- sklearn.utils: Utilities
  - sklearn.utils.check_random_state
  - sklearn.utils.resample
  - sklearn.utils.shuffle
Who is using scikit-learn?
- Spotify
- Inria
- Evernote
- Télécom ParisTech
- AWeber
- Yhat
- Rangespan
- Birchbox
- Bestofmedia Group
- Change.org
- PHIMECA Engineering
- HowAboutWe
- PeerIndex
- DataRobot
- OkCupid
- Lovely
- Data Publica
Contributing
- Submitting a bug report
- Retrieving the latest code
- Contributing code
- Other ways to contribute
- Coding guidelines
- APIs of scikit-learn objects
  - Different objects
  - Estimators
- Rolling your own estimator
Developers’ Tips for Debugging
- Memory errors: debugging Cython with valgrind
Maintainer / core-developer information
- Making a release
How to optimize for speed
- Python, Cython or C/C++?
- Fast matrix multiplications
- Profiling Python code
- Memory usage profiling
- Performance tips for the Cython developer
- Profiling compiled extensions
- Multi-core parallelism using joblib.Parallel
- A sample algorithmic trick: warm restarts for cross validation
Utilities for Developers
- Validation Tools
- Efficient Linear Algebra & Array Operations
- Efficient Random Sampling
- Efficient Routines for Sparse Matrices
- Graph Routines
- Backports
  - ARPACK
  - Benchmarking
- Testing Functions
- Multiclass and multilabel utility function
- Helper Functions
- Hash Functions
- Warnings and Exceptions
Installing scikit-learn
- Installing an official release
- Third party distributions of scikit-learn
- Building on Windows
- Bleeding Edge
- Testing
  - Testing scikit-learn once installed
  - Testing scikit-learn from within the source folder
An introduction to machine learning with scikit-learn
- Machine learning: the problem setting
- Loading an example dataset
- Learning and predicting
- Model persistence
Choosing the right estimator

Classification

Identifying which category a new observation belongs to.

Applications: Spam detection, Image recognition.
Algorithms: SVM, nearest neighbors, random forest, ...

Examples
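A minimal classification sketch, assuming the bundled iris toy dataset and a linear SVM. The every-third-sample hold-out split and the parameter values are illustrative choices for this sketch, not recommendations from the documentation.

```python
import numpy as np
from sklearn import datasets
from sklearn.svm import SVC

# Load the iris toy dataset: 150 samples, 4 features, 3 classes.
iris = datasets.load_iris()
X, y = iris.data, iris.target

# Hold out every third sample for testing (a simple deterministic split
# used only for this sketch).
test_mask = np.arange(len(y)) % 3 == 0
X_train, y_train = X[~test_mask], y[~test_mask]
X_test, y_test = X[test_mask], y[test_mask]

# Fit a linear support vector classifier on the training part.
clf = SVC(kernel="linear", C=1.0)
clf.fit(X_train, y_train)

# Predict the category of each held-out observation and score the result.
predicted = clf.predict(X_test)
accuracy = clf.score(X_test, y_test)
print("held-out accuracy: %.2f" % accuracy)
```

Every classifier exposes the same `fit` / `predict` pair, so swapping `SVC` for another estimator should only require changing the constructor line.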

Regression

Predicting a continuous value for a new example.

Applications: Drug response, Stock prices.
Algorithms: SVR, ridge regression, Lasso, ...

Examples
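A minimal regression sketch, assuming synthetic data generated from a noisy line (slope 3.0, intercept 0.5). The data, the seed and `alpha=1.0` are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data: y depends linearly on a single feature, plus noise.
rng = np.random.RandomState(0)
X = rng.uniform(-1, 1, size=(100, 1))
y = 3.0 * X[:, 0] + 0.5 + 0.1 * rng.randn(100)

# Ridge regression: least squares with an L2 penalty on the coefficients.
reg = Ridge(alpha=1.0)
reg.fit(X, y)

# Predict a continuous value for a new example.
y_new = reg.predict(np.array([[0.2]]))
print("slope %.2f, intercept %.2f" % (reg.coef_[0], reg.intercept_))
```

The penalty shrinks the fitted slope slightly below the true value of 3.0; a smaller `alpha` reduces that shrinkage.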

Clustering

Automatic grouping of similar objects into sets.

Applications: Customer segmentation, Grouping experiment outcomes
Algorithms: k-Means, spectral clustering, mean-shift, ...

Examples
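A minimal clustering sketch, assuming two well-separated synthetic blobs. The blob centres, spread and seed are made up for this example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two synthetic groups of 2-D points centred at (0, 0) and (5, 5).
rng = np.random.RandomState(0)
blob_a = rng.randn(50, 2) * 0.3 + [0.0, 0.0]
blob_b = rng.randn(50, 2) * 0.3 + [5.0, 5.0]
X = np.vstack([blob_a, blob_b])

# k-Means with k=2; no labels are given, only the points themselves.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X)

# Each point is assigned to one of the two clusters.
print(labels[:5], labels[-5:])
print(km.cluster_centers_)
```

Cluster indices are arbitrary: which blob receives label 0 depends on the initialisation, so only the grouping itself is meaningful.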

Dimensionality reduction

Reducing the number of random variables to consider.

Applications: Visualization, Increased efficiency
Algorithms: PCA, feature selection, non-negative matrix factorization.

Examples
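A minimal dimensionality-reduction sketch, assuming the iris toy dataset is projected from four features down to two with PCA, for instance before plotting. The choice of two components is an assumption of this example.

```python
from sklearn import datasets
from sklearn.decomposition import PCA

# Four measurements per flower.
iris = datasets.load_iris()

# Keep the two directions of largest variance.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(iris.data)

print(X_2d.shape)
# Fraction of the total variance retained by each kept component.
print(pca.explained_variance_ratio_)
```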

Model selection

Comparing, validating and choosing parameters and models.

Goal: Improved accuracy via parameter tuning
Modules: grid search, cross validation, metrics.

Examples
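A minimal parameter-tuning sketch, assuming a small hand-picked grid over an SVM on the iris dataset. In the 0.15 series `GridSearchCV` lives in `sklearn.grid_search`; later releases moved it to `sklearn.model_selection`, so the import below tries both. The grid values and `cv=5` are illustrative assumptions.

```python
from sklearn import datasets
from sklearn.svm import SVC

try:
    # Location in later scikit-learn releases.
    from sklearn.model_selection import GridSearchCV
except ImportError:
    # Location in the 0.15 series documented here.
    from sklearn.grid_search import GridSearchCV

iris = datasets.load_iris()

# Candidate parameter values; every combination is cross-validated.
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}

search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(iris.data, iris.target)

# Best combination found and its mean cross-validated score.
print(search.best_params_, search.best_score_)
```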

Preprocessing

Feature extraction and normalization.

Application: Transforming input data such as text for use with machine learning algorithms.
Modules: preprocessing, feature extraction.

Examples
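A minimal feature-extraction sketch, assuming three made-up one-line documents turned into word-count vectors with `CountVectorizer`. The documents are hypothetical and chosen only to keep the vocabulary small.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, one string per document.
docs = ["the cat sat", "the dog sat", "the cat ran"]

# Learn the vocabulary and count word occurrences per document.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)  # sparse matrix, one row per document

# vocabulary_ maps each word to its column index.
vocab = sorted(vectorizer.vocabulary_)
dense = counts.toarray()

print(vocab)
print(dense)
```

The resulting numeric matrix can be passed to any estimator's `fit` method in place of the raw text.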

News

Ongoing development: What's new (changelog)
August 2014. scikit-learn 0.15.1 is available for download (Changelog).
July 2014. scikit-learn 0.15.0 is available for download (Changelog).
July 14-20th, 2014: international sprint. During this week-long sprint, we gathered 18 of the core contributors in Paris. We want to thank our sponsors: Paris-Saclay Center for Data Science & Digicosme and our hosts La Paillasse, Criteo, Inria, and tinyclues.
August 2013. scikit-learn 0.14 is available for download (Changelog).