iota2 and scikit-learn machine learning algorithms

iota2 can use some of the machine learning algorithms provided by scikit-learn (more specifically, the ensemble methods and SVC).

This documentation explains how to configure iota2 to use the scikit-learn library.

All scikit-learn parameters are available in the scikit_models_parameters section. Some of them refer directly to the parameters of the scikit-learn classifier itself (keywordsArguments).

Scikit-learn parameters table

Parameter Key	Parameter Type	Default value	Parameter purpose
standardization	Boolean	False	Apply feature standardization before the learning and classification steps
cross_validation_parameters	Dictionary	{}	Range of estimator’s parameters to be tested during cross-validation.
cross_validation_grouped	Boolean	False	If False, cross validation folds can contain mixed samples from different polygons
cross_validation_folds	Integer	5	Number of cross validation folds
model_type	String	None	scikit-learn classifier’s name
keywordsArguments

About standardization

Standardize features by removing the mean and scaling to unit variance.

Note

The standardization implemented in iota2 comes from the scikit-learn StandardScaler class and is used with its default values: StandardScaler(copy=True, with_mean=True, with_std=True)
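As an illustration of what this standardization computes, here is a minimal sketch using only the Python standard library. It is not iota2 code: the `standardize` helper and the `ndvi` values are hypothetical. It mirrors the behaviour of StandardScaler with default values, which uses the population standard deviation and leaves constant features unscaled.

```python
from statistics import fmean, pstdev


def standardize(column):
    """Centre a feature column on 0 and scale it to unit variance."""
    mean = fmean(column)
    std = pstdev(column)  # population standard deviation (ddof=0)
    # A constant feature has zero variance: keep a scale of 1 to avoid
    # dividing by zero, as StandardScaler does.
    scale = std if std > 0 else 1.0
    return [(value - mean) / scale for value in column]


# Hypothetical feature column, for illustration only
ndvi = [0.2, 0.4, 0.6, 0.8]
scaled = standardize(ndvi)
print(scaled)
```

After the transformation the column has zero mean and unit variance, so features with very different ranges contribute on a comparable scale.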

Cross validation parameters

Cross validation is used to select the best estimator parameters according to a scoring function (overall accuracy). The user has to provide the candidate values of each estimator parameter to optimize. These values must be given as a Python dictionary. For instance, considering a RandomForestClassifier classifier, the configuration file could contain:

scikit_models_parameters:
{
    model_type: "RandomForestClassifier"
    cross_validation_parameters: {'n_estimators': [50, 100, 150],
                                  'max_depth': [5, 10, 20]}
}

Here n_estimators and max_depth are two parameters of RandomForestClassifier. In this case, every pair of values taken from [50, 100, 150] and [5, 10, 20] (nine combinations in total) will be tested, and the best one with respect to the estimated score will be used to build the RandomForestClassifier model.
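The exhaustive pairing described above can be sketched with the standard library. This only enumerates the candidates. It does not train anything, and the variable names are illustrative.

```python
from itertools import product

# Same dictionary as in the configuration file above
cross_validation_parameters = {
    "n_estimators": [50, 100, 150],
    "max_depth": [5, 10, 20],
}

names = list(cross_validation_parameters)
# Cartesian product of the value lists: one dict per candidate setting
candidates = [
    dict(zip(names, values))
    for values in product(*cross_validation_parameters.values())
]

for candidate in candidates:
    print(candidate)
```

Adding a third parameter with three values would raise the number of candidates from 9 to 27, so the grid size grows quickly with each parameter added.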

Note

The cross validation workflow implemented in iota2 comes from the scikit-learn GridSearchCV method.
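For readers unfamiliar with GridSearchCV, the following sketch shows the kind of search this corresponds to in plain scikit-learn. The synthetic data set, the `random_state` values and the exact arguments are assumptions made for the example. The way iota2 calls GridSearchCV internally may differ.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic samples standing in for the learning samples (assumption)
features, labels = make_classification(
    n_samples=120, n_features=8, n_informative=4, n_classes=3, random_state=0
)

param_grid = {"n_estimators": [50, 100, 150], "max_depth": [5, 10, 20]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="accuracy",  # overall accuracy
    cv=5,                # number of cross validation folds
)
search.fit(features, labels)

print(search.best_params_)  # parameters of the best candidate
print(search.best_score_)   # its mean cross-validated accuracy
```

By default GridSearchCV refits the best candidate on all the samples, so `search.best_estimator_` is ready for prediction once `fit` returns.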

Note

Once the cross validation is complete, a text file named *_cross_val_param.cv is created next to the models. This file contains the cross validation score of every tested parameter combination and the chosen parameters.

Model’s keywords arguments

Every classifier from the ensemble methods, as well as SVC, is accessible in iota2, each one with its own set of input parameters. For instance, with RandomForestClassifier, the user can configure n_estimators, criterion, max_leaf_nodes, etc. The configuration file could then contain:

scikit_models_parameters:
{
    model_type: "RandomForestClassifier"
    criterion: "entropy"
    min_samples_split: 4

    cross_validation_parameters: {'n_estimators': [50, 100, 150],
                                  'max_depth': [5, 10, 20]}
}
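In scikit-learn terms, a configuration like the one above amounts to fixing some constructor arguments and letting the cross validation vary the others. The sketch below illustrates this with the public scikit-learn API. Whether iota2 builds the estimator in exactly this way is an assumption.

```python
from sklearn.base import clone
from sklearn.ensemble import RandomForestClassifier

# Keyword arguments taken from the configuration file are fixed once
base_model = RandomForestClassifier(criterion="entropy", min_samples_split=4)

# One cross validation candidate: the fixed arguments are kept, and the
# parameters being optimized are set on a copy of the base model
candidate = clone(base_model).set_params(n_estimators=50, max_depth=5)

params = candidate.get_params()
print(params["criterion"], params["min_samples_split"])
print(params["n_estimators"], params["max_depth"])
```

Any parameter that appears neither as a keyword argument nor in cross_validation_parameters keeps its scikit-learn default value.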

Configuration file example

Here is an example of a fully operational configuration file that uses scikit-learn machine learning algorithms with the downloadable data-set.