8.4.1.13. sklearn.datasets.fetch_mldata
sklearn.datasets.fetch_mldata(dataname, target_name='label', data_name='data', transpose_data=True, data_home=None)
Fetch an mldata.org data set.
If the file is not already in the local cache, it is downloaded from mldata.org.
mldata.org does not have an enforced convention for storing data or naming the columns in a data set. The default behavior of this function works well with the most common cases:
- data values are stored in the column ‘data’, and target values in the column ‘label’
- alternatively, the first column stores target values, and the second data values
- the data array is stored as n_features x n_samples, and thus needs to be transposed to match the sklearn convention of n_samples x n_features
Keyword arguments let you adapt these defaults to specific data sets (see the parameters target_name, data_name, and transpose_data, and the examples below).
mldata.org data sets may have multiple columns, which are stored in the Bunch object under their original names.
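The default column handling described above can be illustrated with a small pure-Python sketch. The helper names `resolve` and `split_columns` are hypothetical and are not part of scikit-learn; the sketch only assumes that a column may be selected by name or by position, that the first and second columns serve as a fallback, and that the data array arrives as n_features x n_samples.

```python
def resolve(key, col_names, fallback_index):
    """Turn a column name or index into a position (hypothetical helper)."""
    if isinstance(key, int):
        return key                       # an index is used as given
    if key in col_names:
        return col_names.index(key)      # a known name maps to its position
    return fallback_index                # otherwise fall back to the default position


def split_columns(columns, col_names, target_name="label",
                  data_name="data", transpose_data=True):
    """Pick the target and data columns, optionally transposing the data."""
    target = columns[resolve(target_name, col_names, 0)]
    data = columns[resolve(data_name, col_names, 1)]
    if transpose_data:
        # n_features x n_samples  ->  n_samples x n_features
        data = [list(row) for row in zip(*data)]
    return data, target


# Toy input: 2 features x 3 samples, plus one label per sample.
columns = [[0, 1, 1], [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]]
data, target = split_columns(columns, ["label", "data"])
print(len(data), len(data[0]))  # number of samples, number of features
```

With real arrays the transposition would normally be done with the array's own transpose rather than `zip`; the list version is used here only to keep the sketch dependency-free.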
Parameters: dataname : str
Name of the data set on mldata.org, e.g. “leukemia” or “Whistler Daily Snowfall”. The raw name is automatically converted to an mldata.org URL.
target_name : optional, default: 'label'
Name or index of the column containing the target values.
data_name : optional, default: 'data'
Name or index of the column containing the data.
transpose_data : optional, default: True
If True, transpose the downloaded data array.
data_home : optional, default: None
Specify another download and cache folder for the data sets. By default, all scikit-learn data is stored in subfolders of ‘~/scikit_learn_data’.
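To make the roles of dataname and data_home more concrete, the following sketch shows one way a raw name could be mapped to a download URL and a local cache path. Everything specific here is an assumption for illustration: the URL pattern, the slug rules (lowercase, spaces to hyphens, dropping parentheses and dots), the `mldata` subfolder, and the `.mat` extension are not stated on this page, and `sketch_slug` and `sketch_locations` are hypothetical names rather than scikit-learn functions.

```python
import os
import re

# Assumed URL pattern; this page only says the name is converted to a mldata.org URL.
ASSUMED_BASE_URL = "http://mldata.org/repository/data/download/matlab/%s"


def sketch_slug(dataname):
    """Normalise a raw data set name (assumed rules, for illustration only)."""
    slug = dataname.lower().replace(" ", "-")
    return re.sub(r"[().]", "", slug)


def sketch_locations(dataname, data_home=None):
    """Return (url, cache_file, already_cached) for a data set name."""
    if data_home is None:
        # Default cache root named in the parameter description.
        data_home = os.path.join("~", "scikit_learn_data")
    data_home = os.path.expanduser(data_home)
    slug = sketch_slug(dataname)
    url = ASSUMED_BASE_URL % slug
    # The 'mldata' subfolder and '.mat' extension are assumptions.
    cache_file = os.path.join(data_home, "mldata", slug + ".mat")
    return url, cache_file, os.path.exists(cache_file)


url, cache_file, cached = sketch_locations("Whistler Daily Snowfall")
print(url)
print(cache_file, "(cached)" if cached else "(would be downloaded)")
```

A download would only be needed when the cache file is missing, which matches the behaviour described at the top of this page.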
Returns: data : Bunch
Dictionary-like object. The most useful attributes are: ‘data’, the data to learn; ‘target’, the classification labels; ‘DESCR’, the full description of the dataset; and ‘COL_NAMES’, the original names of the dataset columns.
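As a rough illustration of what "dictionary-like" means here, the sketch below defines a minimal stand-in class that exposes its keys as attributes. `SketchBunch` is a hypothetical name and is not scikit-learn's Bunch implementation; the values are toy data, not an actual mldata.org download.

```python
class SketchBunch(dict):
    """Dict subclass whose keys can also be read and set as attributes."""

    def __getattr__(self, name):
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name)

    def __setattr__(self, name, value):
        self[name] = value


# Toy values standing in for a fetched data set.
result = SketchBunch(
    data=[[5.1, 3.5], [4.9, 3.0]],
    target=[1, 1],
    DESCR="toy example",
    COL_NAMES=["label", "data"],
)

print(result.data[0])        # attribute access
print(result["target"][0])   # key access to the same container
print(sorted(result.keys()))
```

Both access styles reach the same underlying entries, so code written against a plain dictionary also works with such an object.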
Examples
Load the ‘iris’ dataset from mldata.org:

>>> from sklearn.datasets.mldata import fetch_mldata
>>> iris = fetch_mldata('iris')
>>> iris.target[0]
1
>>> print(iris.data[0])
[-0.555556 0.25 -0.864407 -0.916667]
Load the ‘leukemia’ dataset from mldata.org, which needs to be transposed to respect the sklearn axes convention:

>>> leuk = fetch_mldata('leukemia', transpose_data=True)
>>> print(leuk.data.shape[0])
72
Load an alternative ‘iris’ dataset, which has different names for the columns:

>>> iris2 = fetch_mldata('datasets-UCI iris', target_name=1,
...                      data_name=0)
>>> iris3 = fetch_mldata('datasets-UCI iris',
...                      target_name='class', data_name='double0')