1. Generating Data
1.1. Generating data for regression
[1]:
from sklearn.datasets import make_regression
X, y = make_regression(**{
'n_samples': 1000,
'n_features': 50,
'n_informative': 10,
'n_targets': 1,
'bias': 5.3,
'random_state': 37
})
print(f'X shape = {X.shape}, y shape {y.shape}')
X shape = (1000, 50), y shape (1000,)
[2]:
from sklearn.linear_model import Lasso
from sklearn.metrics import explained_variance_score
lasso = Lasso()
lasso.fit(X, y)
explained_variance_score(y, lasso.predict(X))
[2]:
0.9997115301602587
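Because `make_regression` was told that only 10 of the 50 features are informative, Lasso's L1 penalty should drive most of the remaining coefficients to exactly zero. A small sketch (re-generating the same data for self-containedness) counts the surviving coefficients:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=1000, n_features=50, n_informative=10,
                       bias=5.3, random_state=37)
lasso = Lasso().fit(X, y)

# The L1 penalty zeroes out coefficients of uninformative features, so the
# number of non-trivial coefficients should be close to n_informative=10.
n_nonzero = int(np.sum(np.abs(lasso.coef_) > 1e-6))
print(n_nonzero)
```

The exact count can vary slightly with the penalty strength `alpha` (here the default 1.0), but it should be far below the 50 total features.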
1.2. Generating data for classification
[3]:
from sklearn.datasets import make_classification
X, y = make_classification(**{
'n_samples': 2000,
'n_features': 20,
'n_informative': 2,
'n_redundant': 2,
'n_repeated': 0,
'n_classes': 2,
'n_clusters_per_class': 2,
'random_state': 37
})
print(f'X shape = {X.shape}, y shape {y.shape}')
X shape = (2000, 20), y shape (2000,)
[4]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
rf = RandomForestClassifier(n_estimators=2)
rf.fit(X, y)
roc_auc_score(y, rf.predict(X))
[4]:
0.9635284635284634
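Note that ROC AUC is a ranking metric, so it is usually scored on predicted probabilities rather than hard 0/1 labels as above. A sketch of the probability-based variant, re-generating the same data (and pinning `random_state` on the forest, which the cell above leaves unset):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=2000, n_features=20, n_informative=2,
                           n_redundant=2, n_classes=2, random_state=37)
rf = RandomForestClassifier(n_estimators=2, random_state=37).fit(X, y)

# predict_proba(X)[:, 1] is the probability of the positive class; scoring on
# it lets AUC credit the model's ranking, not just its thresholded labels.
score = roc_auc_score(y, rf.predict_proba(X)[:, 1])
print(f'{score:.5f}')
```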
1.3. Generating data for multilabel classification
[5]:
from sklearn.datasets import make_multilabel_classification
X, Y = make_multilabel_classification(**{
'n_samples': 2000,
'n_features': 20,
'n_classes': 5,
'n_labels': 2,
'length': 50,
'allow_unlabeled': False,
'sparse': False,
'return_indicator': 'dense',
'return_distributions': False,
'random_state': 37
})
print(f'X shape = {X.shape}, Y shape {Y.shape}')
X shape = (2000, 20), Y shape (2000, 5)
[6]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
rf = RandomForestClassifier(n_estimators=2)
rf.fit(X, Y)
roc_auc_score(Y, rf.predict(X))
[6]:
0.8589404714735016
[7]:
import numpy as np
from sklearn.metrics import coverage_error, \
label_ranking_average_precision_score, \
label_ranking_loss
probs = rf.predict_proba(X)
n_cols = len(probs)
n_rows = probs[0].shape[0]
Y_preds = np.array([[probs[c][r][1] for c in range(n_cols)] for r in range(n_rows)])
for metric in [coverage_error, label_ranking_average_precision_score, label_ranking_loss]:
print(f'{metric(Y, Y_preds):.5f} : {metric.__name__}')
2.58750 : coverage_error
0.93136 : label_ranking_average_precision_score
0.09871 : label_ranking_loss
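The nested comprehension that builds `Y_preds` above can also be written as a single NumPy operation: for a multilabel forest, `predict_proba` returns a list of `(n_samples, 2)` arrays, one per label, so stacking and slicing column 1 yields the positive-class probability matrix directly. A sketch with a tiny hand-made stand-in for `rf.predict_proba(X)`:

```python
import numpy as np

# Toy stand-in: 2 labels, 2 samples, columns = [P(class 0), P(class 1)].
probs = [np.array([[0.9, 0.1], [0.2, 0.8]]),
         np.array([[0.4, 0.6], [0.7, 0.3]])]

# np.array(probs) has shape (n_labels, n_samples, 2); keep the positive-class
# column and transpose to (n_samples, n_labels).
Y_preds = np.array(probs)[:, :, 1].T
print(Y_preds)
```

Applied to the real `probs` above, this produces the same matrix as the double loop.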
1.4. Generating data for clustering
[8]:
from sklearn.datasets import make_blobs
X, y = make_blobs(**{
'n_samples': 3000,
'n_features': 15,
'centers': 3,
'cluster_std': 1.0,
'center_box': (-10.0, 10.0),
'random_state': 37
})
print(f'X shape = {X.shape}, y shape {y.shape}')
X shape = (3000, 15), y shape (3000,)
[9]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
km = KMeans(n_clusters=3, random_state=37)
km.fit(X)
silhouette_score(X, km.predict(X))
[9]:
0.8286949308825501
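Here the true number of clusters is known because we generated the blobs ourselves. When it is not, a common heuristic is to sweep `n_clusters` and keep the value with the highest silhouette score; a sketch on the same blobs:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=3000, n_features=15, centers=3,
                  cluster_std=1.0, center_box=(-10.0, 10.0), random_state=37)

# Score each candidate k; well-separated blobs should peak at the true k=3.
scores = {k: silhouette_score(X, KMeans(n_clusters=k, n_init=10,
                                        random_state=37).fit_predict(X))
          for k in range(2, 6)}
best_k = max(scores, key=scores.get)
print(best_k)
```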
1.5. Available datasets
1.5.1. Datasets for regression
1.5.1.1. California housing
[10]:
from sklearn.datasets import fetch_california_housing
X, y = fetch_california_housing(return_X_y=True)
Downloading Cal. housing from https://ndownloader.figshare.com/files/5976036 to /root/scikit_learn_data
1.5.1.2. Diabetes
[11]:
from sklearn.datasets import load_diabetes
X, y = load_diabetes(return_X_y=True)
1.5.1.3. Linnerud
[12]:
from sklearn.datasets import load_linnerud
X, y = load_linnerud(return_X_y=True)
1.5.2. Datasets for classification
1.5.2.1. 20 Newsgroups
[13]:
from sklearn.datasets import fetch_20newsgroups_vectorized
T_X, T_y = fetch_20newsgroups_vectorized(subset='train', return_X_y=True)
V_X, V_y = fetch_20newsgroups_vectorized(subset='test', return_X_y=True)
Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)
1.5.2.2. Covertype
[14]:
from sklearn.datasets import fetch_covtype
X, y = fetch_covtype(return_X_y=True)
Downloading https://ndownloader.figshare.com/files/5976039
1.5.2.3. KDD Cup 1999
[15]:
from sklearn.datasets import fetch_kddcup99
X, y = fetch_kddcup99(subset='http', return_X_y=True)
Downloading https://ndownloader.figshare.com/files/5976042
1.5.2.4. RCV1
[16]:
from sklearn.datasets import fetch_rcv1
T_X, T_y = fetch_rcv1(subset='train', return_X_y=True)
V_X, V_y = fetch_rcv1(subset='test', return_X_y=True)
Downloading https://ndownloader.figshare.com/files/5976069
Downloading https://ndownloader.figshare.com/files/5976066
Downloading https://ndownloader.figshare.com/files/5976063
Downloading https://ndownloader.figshare.com/files/5976060
Downloading https://ndownloader.figshare.com/files/5976057
Downloading https://ndownloader.figshare.com/files/5976048
1.5.2.5. Digits
[17]:
from sklearn.datasets import load_digits
X, y = load_digits(return_X_y=True)
1.5.2.6. Iris
[18]:
from sklearn.datasets import load_iris
X, y = load_iris(return_X_y=True)
1.5.2.7. Wine
[19]:
from sklearn.datasets import load_wine
X, y = load_wine(return_X_y=True)
1.5.3. OpenML
[20]:
from sklearn.datasets import fetch_openml
mice = fetch_openml(name='miceprotein', version=4)
X, y = mice.data, mice.target
print(mice.DESCR)
**Author**: Clara Higuera, Katheleen J. Gardiner, Krzysztof J. Cios
**Source**: [UCI](https://archive.ics.uci.edu/ml/datasets/Mice+Protein+Expression) - 2015
**Please cite**: Higuera C, Gardiner KJ, Cios KJ (2015) Self-Organizing Feature Maps Identify Proteins Critical to Learning in a Mouse Model of Down Syndrome. PLoS ONE 10(6): e0129126.
Expression levels of 77 proteins measured in the cerebral cortex of 8 classes of control and Down syndrome mice exposed to context fear conditioning, a task used to assess associative learning.
The data set consists of the expression levels of 77 proteins/protein modifications that produced detectable signals in the nuclear fraction of cortex. There are 38 control mice and 34 trisomic mice (Down syndrome), for a total of 72 mice. In the experiments, 15 measurements were registered of each protein per sample/mouse. Therefore, for control mice, there are 38x15, or 570 measurements, and for trisomic mice, there are 34x15, or 510 measurements. The dataset contains a total of 1080 measurements per protein. Each measurement can be considered as an independent sample/mouse.
The eight classes of mice are described based on features such as genotype, behavior and treatment. According to genotype, mice can be control or trisomic. According to behavior, some mice have been stimulated to learn (context-shock) and others have not (shock-context) and in order to assess the effect of the drug memantine in recovering the ability to learn in trisomic mice, some mice have been injected with the drug and others have not.
Classes:
```
* c-CS-s: control mice, stimulated to learn, injected with saline (9 mice)
* c-CS-m: control mice, stimulated to learn, injected with memantine (10 mice)
* c-SC-s: control mice, not stimulated to learn, injected with saline (9 mice)
* c-SC-m: control mice, not stimulated to learn, injected with memantine (10 mice)
* t-CS-s: trisomy mice, stimulated to learn, injected with saline (7 mice)
* t-CS-m: trisomy mice, stimulated to learn, injected with memantine (9 mice)
* t-SC-s: trisomy mice, not stimulated to learn, injected with saline (9 mice)
* t-SC-m: trisomy mice, not stimulated to learn, injected with memantine (9 mice)
```
The aim is to identify subsets of proteins that are discriminant between the classes.
### Attribute Information:
```
1 Mouse ID
2..78 Values of expression levels of 77 proteins; the names of proteins are followed by "_n" indicating that they were measured in the nuclear fraction. For example: DYRK1A_n
79 Genotype: control (c) or trisomy (t)
80 Treatment type: memantine (m) or saline (s)
81 Behavior: context-shock (CS) or shock-context (SC)
82 Class: c-CS-s, c-CS-m, c-SC-s, c-SC-m, t-CS-s, t-CS-m, t-SC-s, t-SC-m
```
### Relevant Papers:
Higuera C, Gardiner KJ, Cios KJ (2015) Self-Organizing Feature Maps Identify Proteins Critical to Learning in a Mouse Model of Down Syndrome. PLoS ONE 10(6): e0129126. [Web Link] journal.pone.0129126
Ahmed MM, Dhanasekaran AR, Block A, Tong S, Costa ACS, Stasko M, et al. (2015) Protein Dynamics Associated with Failed and Rescued Learning in the Ts65Dn Mouse Model of Down Syndrome. PLoS ONE 10(3): e0119491.
Downloaded from openml.org.
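As the attribute list above suggests, the mice protein measurements contain missing values, so a typical next step before fitting a model is imputation. A minimal sketch using a tiny hand-made array as a stand-in for `mice.data` (to avoid re-downloading):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy stand-in for mice.data: two features with one missing value each.
X = np.array([[1.0, np.nan],
              [3.0, 4.0],
              [np.nan, 6.0]])

# SimpleImputer replaces each NaN with that column's mean.
X_filled = SimpleImputer(strategy='mean').fit_transform(X)
print(X_filled)
```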