Iota2 regression
Although Iota2 was first used for land use classification (prediction of a discrete variable), it can also perform regression (prediction of a continuous variable). This chapter introduces the Iota2 regression builder.
Documentation
The regression pipeline is divided into the following groups:
Init: pre-processing steps
Sampling: input data is split into training and validation sets, training pixels are selected, and features are extracted
Training: target values are standardized (mean to 0 and std to 1) and passed to a model for training
Prediction: the best model is used to predict pixel values on a map per tile
Mosaic: tiles are merged together
Validation: predicted values are compared to validation ones
In regression mode, two learn/predict workflows are available:
Pytorch: neural networks built inside Iota2
Scikit-Learn: multiple models available from the library
Configuration file
Here are template configuration files for the two workflows:
Pytorch configuration template
Scikit configuration template
Configuration template explanation
To use regression mode, you must provide a builders section containing only i2_regression. This builder is not compatible with other builders.
builders:
{
builders_class_name: ["i2_regression"]
}
The data_field represents the target value used to train the model. It must be lowercase, without underscores, and point to an integer or float field of your ground_truth file.
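The constraints on data_field can be checked before launching a run. The sketch below is only an illustration: is_valid_data_field and is_numeric_column are hypothetical helpers (not part of Iota2), and they implement the rule exactly as worded above, nothing more.

```python
def is_valid_data_field(name: str) -> bool:
    """Return True if `name` is non-empty, lowercase, and has no underscore."""
    return bool(name) and name == name.lower() and "_" not in name


def is_numeric_column(values) -> bool:
    """Return True if every value is an int or a float (booleans excluded)."""
    return all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )


print(is_valid_data_field("ndvi"))        # accepted
print(is_valid_data_field("NDVI_value"))  # rejected: uppercase and underscore
```

Running such a check on the field name and on a sample of the field values catches the most common configuration mistakes before any processing starts.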
If you want to use the Pytorch workflow, fill in the deep_learning_parameters field of the arg_train section. If you want to use the Scikit workflow, fill in the scikit_models_parameters section. The two are not compatible: one and only one of them must be set. In both cases, you must provide the python_data_managing / number_of_chunks field to split the prediction into chunks. With too low a value, a chunk may not fit in memory, while too high a value increases the scheduler overhead.
python_data_managing:
{
number_of_chunks: 20
}
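To picture the trade-off behind number_of_chunks, the sketch below splits the rows of a tile into contiguous ranges. It is an illustration only: chunk_row_ranges is a hypothetical helper, the tile height is an assumed value, and the way Iota2 actually cuts a tile into chunks is not described in this chapter.

```python
def chunk_row_ranges(n_rows: int, number_of_chunks: int) -> list:
    """Split range(n_rows) into at most `number_of_chunks` contiguous (start, stop) ranges."""
    if n_rows < 0 or number_of_chunks < 1:
        raise ValueError("n_rows must be >= 0 and number_of_chunks >= 1")
    base, extra = divmod(n_rows, number_of_chunks)
    ranges = []
    start = 0
    for i in range(number_of_chunks):
        # The first `extra` chunks receive one additional row.
        size = base + (1 if i < extra else 0)
        if size == 0:
            break  # more chunks requested than rows available
        ranges.append((start, start + size))
        start += size
    return ranges


# Hypothetical tile height, split with the value from the configuration above.
ranges = chunk_row_ranges(10980, 20)
print(len(ranges), ranges[0], ranges[-1])
```

Fewer chunks mean larger ranges that must be held in memory at once; more chunks mean more, smaller tasks for the scheduler to manage.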
Pytorch workflow
For full configuration, see Deep Learning in iota2.
To perform regression with Pytorch, you must set dl_name to one of the available models for regression:
MLPRegressor: multi-layer perceptron defined in torch_nn_bank.py
a user-defined model in dl_module inheriting from Iota2NeuralNetwork
arg_train:
{
deep_learning_parameters:
{
dl_name: "MLPRegressor" # <-- here
model_optimization_criterion: "MSE"
model_selection_criterion: "MSE"
# ...
}
}
While classification mode has cross_entropy as its only optimization criterion, regression mode offers several.
Available model_optimization_criterion:
After learning models for all combinations of hyperparameters in hyperparameters_solver, the best model is selected according to the model_selection_criterion, which accepts the same values.
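The selection step can be pictured as a search over every hyperparameter combination that keeps the model with the best validation score. The sketch below is not Iota2's implementation: the hyperparameter names, the grid values, and the train_and_score callable are hypothetical placeholders, and a lower criterion value is assumed to be better (as with MSE).

```python
from itertools import product


def select_best(grid: dict, train_and_score) -> tuple:
    """Evaluate every combination in `grid`; return the (params, score) with the lowest score."""
    names = sorted(grid)
    best_params, best_score = None, float("inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = train_and_score(params)  # e.g. validation MSE of the trained model
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical grid, with a toy scoring function standing in for real training.
grid = {"batch_size": [1000, 2000], "learning_rate": [0.01, 0.001]}


def toy_score(params: dict) -> float:
    return abs(params["learning_rate"] - 0.001) + params["batch_size"] / 1e6


best_params, best_score = select_best(grid, toy_score)
print(best_params)
```

With a grid of two values per hyperparameter, four models are trained and only the best one is kept for prediction.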
The prediction is then made in chunks, as configured in python_data_managing.
Scikit workflow
For full configuration, see iota2 and scikit-learn machine learning algorithms.
To perform regression with Scikit-Learn, you must set model_type to one of the available models for regression:
scikit_models_parameters:
{
model_type: "RandomForestRegressor"
standardization: True
keyword_arguments:
{
criterion: "squared_error"
}
}
Additional parameters can be given to the model using the keyword_arguments field.
The prediction is then made in chunks, as configured in python_data_managing.
Run Iota2
For complete documentation about Iota2 command line arguments, see Going further with iota2 launching parameters.
You can then run Iota2 using the following command line to check your configuration file.
Iota2.py -config config_regression.cfg -only_summary
In pytorch mode, it should produce something like:
Group init:
[x] Step 1: Sensors pre-processing
[x] Step 2: Generate a common masks for each sensors
[x] Step 3: Compute validity raster by tile
Group sampling:
[x] Step 4: Generate tile's envelope
[x] Step 5: Generate a region vector
[x] Step 6: Prepare samples
[x] Step 7: merge samples by models
[x] Step 8: Generate samples statistics by models
[x] Step 9: Select pixels in learning polygons by models
[x] Step 10: Split pixels selected to learn models by tiles
[x] Step 11: Extract pixels values by tiles
[x] Step 12: Merge samples dedicated to the same model
Group training:
[x] Step 13: Train pytorch regression model
Group prediction:
[x] Step 14: from pytorch models generated for each hyperparameters couples, choose the best one
[x] Step 15: Predict with pytorch regression model
[x] Step 16: Merge tile's classification's part
Group mosaic:
[x] Step 17: Mosaic
Group validation:
[x] Step 18: Generate regression metrics
Inspect output
After Iota2 has finished, go to the output folder and look for your results in the final folder (see Final folder). If the results are not there, the logs will tell you why. Look at the graphs tasks_status_i2_regression_<i>.svg to find out which step failed. You can then get more details about the error in the logs or by re-launching the given step in debug mode (-starting_step <j> -ending_step <j> -scheduler_type debug).
Output tree
This output tree is based on a pytorch run. Yours may differ slightly if you choose different parameters.
- /Output_Regression_Pytorch
- output folder
output folder defined in config file output_path
- ! classif
- per tile prediction maps
- Contains regression maps, for each tile and each region. They will be merged in the final directory.
Regression_T31TCJ_model_1_seed_0.tif
- ! MASK
MASK_region_1_T31TCJ.tif
- ! dataAppVal
- split train / validation samples
- Shapefiles obtained after splitting reference data between learning and validation sets according to a ratio.
- ! bymodels
(empty)
- T31TCJ_seed_0_learn.sqlite
learning polygons
- T31TCJ_seed_0_val.sqlite
validation polygons
- T31TCJ_seed_0_val.xml
validation statistics for sample extraction
- T31TCJ_seed_0_val_point.sqlite
points sampled in validation polygons
- T31TCJ_seed_0_val_predicted.sqlite
- predicted values
- these values are pixels extracted from the final mosaic map at the location of validation data
- ! dataRegion
- vector data split by region
- When using eco-climatic regions, contains the vector data split by region.
(empty)
- ! envelope
- shapefiles
- Contains shapefiles, one for each tile. Used to ensure tile priority, with no overlap.
T31TCJ.dbf
T31TCJ.prj
T31TCJ.shp
T31TCJ.shx
- ! features
- useful information
- for each tile, contains useful information
- T31TCJ
- ! tmp
- temporary folder
- folder created temporarily during the chain execution
MaskCommunSL.dbf
MaskCommunSL.prj
- MaskCommunSL.shp
- common scene
- the common scene of all sensors for this tile.
MaskCommunSL.shx
MaskCommunSL.tif
- Sentinel2L3A_T31TCJ_reference.tif
- reference image
- the image, generated by iota2, used for reprojecting data
- Sentinel2L3A_T31TCJ_input_dates.txt
- list of dates
- the list of dates detected in s2_path for the current tile.
Sentinel2_T31TCJ_interpolation_dates.txt
CloudThreshold_0.dbf
CloudThreshold_0.prj
- CloudThreshold_0.shp
- database used as mask
- This database is used to mask training polygons according to a number of clear dates. See the cloud_threshold parameter.
CloudThreshold_0.shx
- nbView.tif
- number of views
- number of times a pixel is seen in the whole time series (i.e., excluding clouds, shadows, saturation and no-data)
- final
- final products
- This folder contains the final products of iota2. All final products will be generated in the final directory; see Final folder for details.
- ! TMP
regressionResults_seed_0.txt
Regression_Seed_0.csv
T31TCJ_Cloud.tif
T31TCJ_GlobalConfidence_seed_0.tif
T31TCJ_seed_0_CompRef.tif
T31TCJ_seed_0.tif
- Regression_Seed_0.tif
- prediction map
- destandardized float32 values predicted by the regression model
- T31TCJ_metrics.csv
- summary of metrics
- metrics per tile summarized over multiple seeds
- T31TCJ_seed0_metrics.csv
- metrics per seed
- metrics (mae, mse…) between predicted and actual values
- ! formattingVectors
- learning samples
- The learning samples contained in each tile. Shapefiles in which pixel values from time series have been extracted.
- ! T31TCJ
- temporary directory
- This is a temporary working directory, intermediate files are (re)moved after step completion.
(empty)
T31TCJ.cpg
T31TCJ.dbf
T31TCJ.prj
T31TCJ.shp
T31TCJ.shx
- ! learningSamples
- learning samples
- SQLite file containing learning samples by region. Also contains a CSV file containing statistics about sample balance for each seed. See tracing back samples to generate this file manually.
class_statistics_seed0_learn.csv
Samples_region_1_seed0_learn.sqlite
T31TCJ_region_1_seed0_Samples_learn.sqlite
- logs
- logs
- output logs of iota2, here sorted by creation order (not alphabetical)
- ! SensorsPreprocess
preprocessing_T31TCJ.err
preprocessing_T31TCJ.out
- ! CommonMasks
common_mask_T31TCJ.err
common_mask_T31TCJ.out
- ! PixelValidity
validity_raster_T31TCJ.err
validity_raster_T31TCJ.out
- ! Envelope
tiles_envelopes.err
tiles_envelopes.out
- ! GenRegionVector
region_generation.err
region_generation.out
- ! VectorFormatting
vector_form_T31TCJ.err
vector_form_T31TCJ.out
- ! SamplesMerge
merge_model_1_seed_0.err
merge_model_1_seed_0.out
- ! StatsSamplesModel
stats_1_S_0_T_T31TCJ.err
stats_1_S_0_T_T31TCJ.out
- ! SamplingLearningPolygons
s_sel_model_1_seed_0.err
s_sel_model_1_seed_0.out
- ! SamplesByTiles
merge_samples_T31TCJ.err
merge_samples_T31TCJ.out
- ! SamplesExtraction
extraction_T31TCJ.err
extraction_T31TCJ.out
- ! SamplesByModels
merge_model_1_seed_0_usually.err
merge_model_1_seed_0_usually.out
- ! TrainRegressionPytorch
learning_model_1_seed_0_hyp_0.err
learning_model_1_seed_0_hyp_0.out
- ! ModelChoice
choose_model_1_seed_0_usually.err
choose_model_1_seed_0_usually.out
- ! PredictRegressionPytorch
regression_T31TCJ_model_1_seed_0_0.err
regression_T31TCJ_model_1_seed_0_0.out
regression_T31TCJ_model_1_seed_0_1.err
regression_T31TCJ_model_1_seed_0_1.out
regression_T31TCJ_model_1_seed_0_2.err
regression_T31TCJ_model_1_seed_0_2.out
- ! ScikitClassificationsMerge
classif_T31TCJ_model_1_seed_0_mosaic.err
classif_T31TCJ_model_1_seed_0_mosaic.out
- ! Mosaic
mosaic.err
mosaic.out
- ! GenerateRegressionMetrics
regression_metrics_T31TCJ_seed_0.err
regression_metrics_T31TCJ_seed_0.out
merge_metrics.err
merge_metrics.out
tasks_status_i2_regression_1.svg
tasks_status_i2_regression_2.svg
tasks_status_i2_regression_3.svg
- run_information.txt
- summary
- summary as displayed by “only_summary” option
- ! model
- desc
- The learned models
model_1_seed_0.txt
- ! samplesSelection
- shapefiles
- Shapefiles containing points (or pixel coordinates) selected for the training stage. Also contains a CSV summary of the actual number of samples per class.
samples_region_1_seed_0.dbf
samples_region_1_seed_0_outrates.csv
samples_region_1_seed_0.prj
samples_region_1_seed_0_selection.sqlite
samples_region_1_seed_0.shp
samples_region_1_seed_0.shx
samples_region_1_seed_0.xml
T31TCJ_region_1_seed_0_stats.xml
T31TCJ_samples_region_1_seed_0_selection.sqlite
T31TCJ_selection_merge.sqlite
- ! shapeRegion
- desc
- Shapefiles indicating intersection between tiles and region.
MyRegion_region_1_T31TCJ.dbf
MyRegion_region_1_T31TCJ.prj
MyRegion_region_1_T31TCJ.shp
MyRegion_region_1_T31TCJ.shx
MyRegion_region_1_T31TCJ.tif
- ! stats
- statistics
- Optional xml statistics to standardize the data before learning (svm…).
(empty)
- IOTA2_tasks_status.txt
- internal execution status
- Iota2 keeps track of its execution using this pickle file (not text) so that it can restart from the state where it stopped.
- logs.zip
logs archive
MyRegion.dbf
MyRegion.prj
- MyRegion.shp
- fake region
- When no ecoclimatic region is defined for learning step, Iota2 creates this fake file with a single region.
MyRegion.shx
reference_data.dbf
reference_data.prj
- reference_data.shp
- reference data
- ground_truth data where a column “split” has been added
reference_data.shx
Final folder
The final folder contains your prediction maps and CSV metrics files.
The following validation metrics are computed between the actual values in the validation samples and the corresponding predicted values in the final map.
max_error
mean_absolute_error
mean_squared_error
median_absolute_error
r2_score
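These metric names match functions of the sklearn.metrics module. Assuming the usual definitions apply, the sketch below computes them with the standard library only, which makes explicit what each value measures. regression_metrics is a hypothetical helper, not an Iota2 function.

```python
from statistics import mean, median


def regression_metrics(y_true, y_pred) -> dict:
    """Compute the five validation metrics from actual and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    errors = [t - p for t, p in zip(y_true, y_pred)]
    abs_errors = [abs(e) for e in errors]
    y_mean = mean(y_true)
    ss_res = sum(e * e for e in errors)              # residual sum of squares
    ss_tot = sum((t - y_mean) ** 2 for t in y_true)  # total sum of squares
    return {
        "max_error": max(abs_errors),
        "mean_absolute_error": mean(abs_errors),
        "mean_squared_error": ss_res / len(errors),
        "median_absolute_error": median(abs_errors),
        "r2_score": 1.0 - ss_res / ss_tot,  # undefined when y_true is constant
    }


# Toy values: three exact predictions and one prediction off by 1.
metrics = regression_metrics([0.0, 1.0, 2.0, 3.0], [0.0, 1.0, 2.0, 2.0])
print(metrics)
```

A single large error dominates max_error and weighs heavily on mean_squared_error, while median_absolute_error stays insensitive to it.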
A summary over the different seed values (when runs > 1) is written, where the summary value is:
max(vals) for max error
mean(vals) for mae
mean(vals) for mse
median(vals) for median absolute error
mean(vals) for r2 score
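The aggregation rules above can be sketched as a mapping from metric name to summary function. summarize_seeds and the per-seed values below are hypothetical and only illustrate the rule; they do not reproduce Iota2's code or real results.

```python
from statistics import mean, median

# One summary function per metric, following the rules listed above.
SUMMARY_FUNCS = {
    "max_error": max,
    "mean_absolute_error": mean,
    "mean_squared_error": mean,
    "median_absolute_error": median,
    "r2_score": mean,
}


def summarize_seeds(per_seed: list) -> dict:
    """Aggregate a list of per-seed metric dictionaries into one summary dictionary."""
    return {
        name: func([metrics[name] for metrics in per_seed])
        for name, func in SUMMARY_FUNCS.items()
    }


# Made-up metrics for two seeds, for illustration only.
per_seed = [
    {"max_error": 0.2, "mean_absolute_error": 0.01, "mean_squared_error": 0.004,
     "median_absolute_error": 0.005, "r2_score": 0.9},
    {"max_error": 0.1, "mean_absolute_error": 0.03, "mean_squared_error": 0.002,
     "median_absolute_error": 0.007, "r2_score": 0.8},
]
summary = summarize_seeds(per_seed)
print(summary)
```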
Implementation notes
In order to perform the sampling, a fake column named “split” full of zeros is added to ground truth data. The sampling strategy will be applied on this column (see sample_selection).
The data is standardized before being fed to the model and destandardized after prediction. The scaler is serialized along with the model.
While prediction files are named Regression_TILE_*.tif, the folder name classif and some step names classif_* have been kept for compatibility reasons. Do not be surprised to find things named “classif” even in regression mode.
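The standardization round trip described in these notes can be sketched as follows. This is a minimal illustration under assumptions: the helper names are hypothetical, the population standard deviation is used, pickle is chosen as the serialization format for the example, and a string stands in for the trained model.

```python
import pickle
from statistics import fmean, pstdev


def fit_scaler(values) -> dict:
    """Compute the mean and standard deviation used to standardize targets."""
    std = pstdev(values)
    return {"mean": fmean(values), "std": std if std > 0 else 1.0}


def standardize(values, scaler: dict) -> list:
    """Rescale values to mean 0 and standard deviation 1."""
    return [(v - scaler["mean"]) / scaler["std"] for v in values]


def destandardize(values, scaler: dict) -> list:
    """Map standardized values back to the original unit."""
    return [v * scaler["std"] + scaler["mean"] for v in values]


targets = [0.2, 0.4, 0.6, 0.8]
scaler = fit_scaler(targets)
scaled = standardize(targets, scaler)

# Store the scaler next to a placeholder model, then restore it for prediction.
payload = pickle.dumps({"model": "placeholder", "scaler": scaler})
restored = pickle.loads(payload)["scaler"]
recovered = destandardize(scaled, restored)
print(recovered)
```

Keeping the scaler with the model guarantees that predictions are converted back with the same mean and standard deviation that were used at training time.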
Available features
Some Iota2 features are not available, depending on the mode:
external features are not available in the scikit workflow
in the pytorch workflow, the following features are not supported in regression mode:
adaptive learning rate
learning weight
early stop
confidence maps
Known limitations
Regression has some known limitations that remain to be fixed in future contributions.
Prediction is done on a whole map; it is not possible to restrict it to given polygons. That is why validation metrics can only be computed after full map prediction.
Keyword arguments cannot be passed to the metric (for example, the HuberLoss delta parameter).
Standardization of the target cannot be turned off. It is done in one batch and can take a lot of memory if the dataset is huge.
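To make the role of the HuberLoss delta parameter concrete, here is a sketch of the standard per-sample Huber loss. It follows the textbook definition, which is also the one documented for torch.nn.HuberLoss (default delta of 1.0); whether Iota2 relies on that exact class is an assumption.

```python
def huber_loss(error: float, delta: float = 1.0) -> float:
    """Quadratic for small errors, linear beyond the `delta` threshold."""
    abs_error = abs(error)
    if abs_error <= delta:
        return 0.5 * error * error
    return delta * (abs_error - 0.5 * delta)


# The same error is penalized differently depending on delta.
print(huber_loss(3.0, delta=1.0))
print(huber_loss(3.0, delta=2.0))
```

Since delta cannot be set from the configuration file, the threshold between the quadratic and linear regimes cannot be tuned to the scale of your target.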
Tutorial
First install Iota2 and download the test dataset IOTA2_TEST_S2.tar.bz2 (8.8 GB).
You can then use Iota2 in regression mode to predict continuous variables. As an example, we provide the ndvi.shp file containing values of NDVI computed by Iota2 for pixels in the T31TCJ tile at date 2018-05-11.
Include this in your config file (choose the pytorch or scikit template).
chain:
{
list_tile: "T31TCJ"
ground_truth: "/XXXX/IOTA2_TEST_S2/tuto_regression/ndvi.shp"
data_field: "ndvi"
}
Then run Iota2.
After Iota2 has finished, you should find a final/T31TCJ_seed0_metrics.csv file in your output folder, with content similar to this:
max_error | mean_absolute_error | mean_squared_error | median_absolute_error | r2_score
0.14909570239257 | 0.0009215711540472297 | 0.004862749426385645 | 0.0006626362304687063 | 0.9999464040252317
This means the regressor you chose was able to predict the NDVI values of validation pixels from the spectral bands of these pixels, after observing the training pixels.