iota2 and scikit-learn machine learning algorithms

Iota2 is able to use some of machine learning algorithms coming from scikit-learn (more specially ensemble methods and SVC).

This documentation exposes how configure iota2 in order to use scikit-learn library.

All scikit-learn parameters are available in the scikit_models_parameters section. Some of them refer directly to scikit-learn models classifier parameters (keywordsArguments).

Scikit-learn parameters table

Parameter Key

Parameter Type

Default value

Parameter purpose

standardization

Boolean

False

Apply features standardization before learning and classification process

cross_validation_parameters

Dictionary

{}

Range of estimator’s parameters to be tested during cross-validation.

cross_validation_grouped

Boolean

False

If false, cross validation folds can contains mixed samples from different polygons

cross_validation_folds

Integer

5

Number of cross validation folds

model_type

String

None

scikit-learn classifier’s name

keywordsArguments

About standardization

Standardize features by removing the mean and scaling to unit variance.

Note

The standardization implemented in iota2 comes from scikit-learn StandardScaler method and used will default values : StandardScaler(copy=True, with_mean=True, with_std=True)

Cross validation parameters

Cross validation is a method used to find the best optimized estimator’s parameters according to a scorer function (overall-accuracy). The user has to provide a list of estimator’s parameters to optimize. This list of parameters must be provided through a python dictionary. For instance , considering a RandomForestClassifier machine learning classifier, the configuration file could contains :

scikit_models_parameters:
{
    model_type: "RandomForestClassifier"
    cross_validation_parameters: {'n_estimators': [50, 100, 150],
                                  'max_depth': [5, 10, 20]}
}

Because n_estimators and min_samples_split are two parameters of RandomForestClassifier. In this case, every couple in [50, 100, 150] and [5, 10, 20] will be tested and the best one, w.r.t the estimated scorer value, will be used to build the RandomForestClassifier model.

Note

The cross validation workflow implemented in iota2 comes from the scikit-learn GridSearchCV method.

Note

Once the cross validation is achieve, a text file call *_cross_val_param.cv is created next to models. This file contains every cross validation score for each parameters to optimize and the choosen parameters.

Model’s keywords arguments

Every classifier from ensemble methods and SVC and are accessible in iota2, each one with its own set of input parameters. For instance with the RandomForestClassifier, user can configure n_estimators, criterion, max_leaf_nodes etc. Then the configuration file could contains :

scikit_models_parameters:
{
    model_type: "RandomForestClassifier"
    criterion: "entropy"
    min_samples_split: 4

    cross_validation_parameters: {'n_estimators': [50, 100, 150],
                                  'max_depth': [5, 10, 20]}
}

Configuration file example

Here is an example of a configuration file configuration fully operational with the downloadable data-set implementing scikit-learn machine learning algorithms.