Iota2 regression
################

While Iota2 was first used for land use classification (prediction of a
*discrete* variable), it can also perform regression (prediction of a
*continuous* variable). This chapter introduces the use of the Iota2
regression builder.

Documentation
*************

The regression pipeline is divided into the following groups:

- **Init**: pre-processing steps
- **Sampling**: the input data is split into a training set and a validation set, then training pixels are selected and their features extracted
- **Training**: target values are standardized (mean 0, standard deviation 1) and passed to a model for training
- **Prediction**: the best model is used to predict pixel values, producing one map per tile
- **Mosaic**: the tile maps are merged together
- **Validation**: predicted values are compared to the validation values

In regression mode, two learn/predict workflows are available:

- **Pytorch**: neural networks built inside Iota2
- **Scikit-Learn**: multiple models available from the library

Configuration file
==================

Here are template configuration files for the two workflows:

Pytorch configuration template
------------------------------

.. include:: examples/i2_config_regression_pytorch.cfg
   :literal:

Scikit configuration template
-----------------------------

.. include:: examples/i2_config_regression_scikit.cfg
   :literal:

Configuration template explanation
----------------------------------

To use regression mode, you must provide a ``builders`` section containing only
``i2_regression``. This builder is not compatible with any other builder.

.. code-block:: python

    builders:
    {
        builders_class_name: ["i2_regression"]
    }

The ``data_field`` is the target value used to train the model. It must be
lowercase, without underscores, and point to an integer or float field of your
``ground_truth`` file.

If you want to use the Pytorch workflow, fill in the
``deep_learning_parameters`` field of the ``arg_train`` section. If you want to
use the Scikit-Learn workflow, fill in the ``scikit_models_parameters``
section. The two are not compatible: one and only one of them must be set.

In both cases, you must provide the ``python_data_managing`` /
``number_of_chunks`` field to split the prediction into chunks. Too few chunks
may not fit in memory, while too many will increase the scheduler overhead.

.. code-block:: python

    python_data_managing:
    {
        number_of_chunks: 20
    }

Pytorch workflow
================

For full configuration, see :doc:`deep_learning`.

To perform regression with Pytorch, you must set ``dl_name`` to one of the
available models for regression:

- ``MLPRegressor``, a multi-layer perceptron defined in `torch_nn_bank.py `_
- a user-defined model in ``dl_module`` inheriting from ``Iota2NeuralNetwork``

.. code-block:: python

    arg_train:
    {
        deep_learning_parameters:
        {
            dl_name: "MLPRegressor"  # <-- here
            model_optimization_criterion: "MSE"
            model_selection_criterion: "MSE"
            # ...
        }
    }

While in classification mode the only optimization criterion is
``cross_entropy``, the regression mode offers several.

Available values for ``model_optimization_criterion``:

- `MSE `_
- `MAE `_
- `HuberLoss `_

After training models for all combinations of hyperparameters in
``hyperparameters_solver``, the best model is selected according to
``model_selection_criterion``, which accepts the same values.

The prediction will then be made in chunks as configured in
``python_data_managing``.
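
These optimization criteria correspond to standard PyTorch loss functions. The
snippet below is a minimal, illustrative sketch (it is not Iota2's training
code): it simply evaluates the three losses on a placeholder prediction/target
pair so you can see what each criterion measures.

.. code-block:: python

    # Illustrative only: standard PyTorch losses corresponding to the
    # ``model_optimization_criterion`` values (not Iota2 internal code).
    import torch

    predictions = torch.tensor([0.2, 0.7, 1.1])  # placeholder model outputs
    targets = torch.tensor([0.0, 1.0, 1.0])      # placeholder target values

    criteria = {
        "MSE": torch.nn.MSELoss(),         # mean squared error
        "MAE": torch.nn.L1Loss(),          # mean absolute error
        "HuberLoss": torch.nn.HuberLoss(), # quadratic near zero, linear further away
    }

    for name, criterion in criteria.items():
        print(name, criterion(predictions, targets).item())

``HuberLoss`` behaves quadratically for small errors and linearly for large
ones, which makes it less sensitive to outliers than ``MSE``.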

Scikit workflow
===============

For full configuration, see :doc:`use_scikit_learn`.

To perform regression with Scikit-Learn, you must set ``model_type`` to one of
the available models for regression:

- `RandomForestRegressor `_
- `RidgeCV `_
- `LassoCV `_
- `HuberRegressor `_

.. code-block:: python

    scikit_models_parameters:
    {
        model_type: "RandomForestRegressor"
        standardization: True
        keyword_arguments:
        {
            criterion: "squared_error"
        }
    }

Additional parameters can be given to the model using the ``keyword_arguments``
field.

The prediction will then be made in chunks as configured in
``python_data_managing``.
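
For reference, and assuming that ``standardization: True`` amounts to scaling
the input features and that ``keyword_arguments`` are forwarded to the
estimator constructor, the configuration above corresponds roughly to the
scikit-learn pipeline sketched below. This is an illustration only, not the
code Iota2 actually runs; the data arrays are placeholders.

.. code-block:: python

    # Rough scikit-learn equivalent of the configuration above (illustrative only).
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # Placeholder data standing in for extracted pixel features and target values.
    X = np.random.rand(100, 10)
    y = np.random.rand(100)

    # standardization: True -> features are scaled before fitting
    # keyword_arguments     -> passed to the model constructor
    model = make_pipeline(
        StandardScaler(),
        RandomForestRegressor(criterion="squared_error"),
    )
    model.fit(X, y)
    print(model.predict(X[:5]))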

Run Iota2
=========

For complete documentation about Iota2 command line arguments, see
:doc:`going_further_i2_launching_params`.

You can then run Iota2 using the following command line to check your
configuration file:

.. code-block:: bash

    Iota2.py -config config_regression.cfg -only_summary

In Pytorch mode, it should produce something like:

.. code-block:: txt

    Group init:
            [x] Step 1: Sensors pre-processing
            [x] Step 2: Generate a common masks for each sensors
            [x] Step 3: Compute validity raster by tile
    Group sampling:
            [x] Step 4: Generate tile's envelope
            [x] Step 5: Generate a region vector
            [x] Step 6: Prepare samples
            [x] Step 7: merge samples by models
            [x] Step 8: Generate samples statistics by models
            [x] Step 9: Select pixels in learning polygons by models
            [x] Step 10: Split pixels selected to learn models by tiles
            [x] Step 11: Extract pixels values by tiles
            [x] Step 12: Merge samples dedicated to the same model
    Group training:
            [x] Step 13: Train pytorch regression model
    Group prediction:
            [x] Step 14: from pytorch models generated for each hyperparameters couples, choose the best one
            [x] Step 15: Predict with pytorch regression model
            [x] Step 16: Merge tile's classification's part
    Group mosaic:
            [x] Step 17: Mosaic
    Group validation:
            [x] Step 18: Generate regression metrics

Inspect output
==============

After Iota2 has finished, you can go to the output folder and see your results
in the ``final`` folder (see :ref:`final`). If the results are not there, the
logs will tell you why. You can look at the graphs
``tasks_status_i2_regression_.svg`` to find out which step has failed. Then you
can get more details about the error in the logs or by re-launching the given
step in debug mode (``-starting_step -ending_step -scheduler_type debug``).

.. _output-tree:

Output tree
-----------

This output tree is based on a Pytorch run. Yours can differ slightly if you
choose different parameters.

.. raw:: html
   :file: interactive-tree-root.html

.. container:: interactive-tree-source

    * /Output_Regression_Pytorch output folder

      output folder defined in config file `output_path`

      * ! classif per tile prediction maps

        | Contains regression maps, for each tile and each region. They will be merged in the ``final`` directory.

        * Regression_T31TCJ_model_1_seed_0.tif
        * ! MASK

          * MASK_region_1_T31TCJ.tif

      * ! dataAppVal split train / validation samples

        | Shapefiles obtained after splitting the reference data between learning and validation sets according to a ratio.

        * ! bymodels

          * (empty)

        * T31TCJ_seed_0_learn.sqlite
        * T31TCJ_seed_0_val.sqlite

      * ! dataRegion vector data split by region

        | When using eco-climatic regions, contains the vector data split by region.

        * (empty)

      * ! envelope shapefiles

        | Contains shapefiles, one for each tile.
        | Used to ensure tile priority, with no overlap.

        * T31TCJ.dbf
        * T31TCJ.prj
        * T31TCJ.shp
        * T31TCJ.shx

      * ! features useful information

        | for each tile, contains useful information

        * T31TCJ

          * ! tmp temporary folder

            | folder created temporarily during the chain execution

            * MaskCommunSL.dbf
            * MaskCommunSL.prj
            * MaskCommunSL.shp common scene

              | the common scene of all sensors for this tile

            * MaskCommunSL.shx
            * MaskCommunSL.tif
            * Sentinel2L3A_T31TCJ_reference.tif reference image

              | the image, generated by Iota2, used for reprojecting data

            * Sentinel2L3A_T31TCJ_input_dates.txt list of dates

              | the list of dates detected in ``s2_path`` for the current tile

            * Sentinel2_T31TCJ_interpolation_dates.txt
            * CloudThreshold_0.dbf
            * CloudThreshold_0.prj
            * CloudThreshold_0.shp database used as mask

              | This database is used to mask training polygons according to the number of clear dates. See the :ref:`cloud_threshold` parameter.

            * CloudThreshold_0.shx
            * nbView.tif number of visits

              | number of times a pixel is seen in the whole time series (i.e., excluding clouds, shadows, saturation and no-data)

      * final final products

        | This folder contains the final products of Iota2.
        | All final products will be generated in the ``final`` directory,
        | see :ref:`final` for details.

        * ! TMP

          * regressionResults_seed_0.txt
          * Regression_Seed_0.csv
          * T31TCJ_Cloud.tif
          * T31TCJ_GlobalConfidence_seed_0.tif
          * T31TCJ_seed_0_CompRef.tif
          * T31TCJ_seed_0.tif

        * Regression_Seed_0.tif prediction map

          | destandardized float32 values predicted by the regression model

        * T31TCJ_metrics.csv summary of metrics

          | metrics per tile summarized over multiple seeds

        * T31TCJ_seed0_metrics.csv metrics per seed

          | metrics (mae, mse...) between predicted and actual values

      * ! formattingVectors learning samples

        | The learning samples contained in each tile.
        | Shapefiles in which pixel values from time series have been extracted.

        * ! T31TCJ temporary directory

          | This is a temporary working directory, intermediate files are (re)moved after step completion.

          * (empty)

        * T31TCJ.cpg
        * T31TCJ.dbf
        * T31TCJ.prj
        * T31TCJ.shp
        * T31TCJ.shx

      * ! learningSamples learning samples

        | SQLite files containing learning samples by region.
        | Also contains a CSV file with statistics about sample balance for each seed. See :ref:`tracing back samples ` to generate this file manually.

        * class_statistics_seed0_learn.csv
        * Samples_region_1_seed0_learn.sqlite
        * T31TCJ_region_1_seed0_Samples_learn.sqlite

      * logs logs

        | output logs of Iota2
        | here sorted by creation order (not alphabetical)

        * ! SensorsPreprocess

          * preprocessing_T31TCJ.err
          * preprocessing_T31TCJ.out

        * ! CommonMasks

          * common_mask_T31TCJ.err
          * common_mask_T31TCJ.out

        * ! PixelValidity

          * validity_raster_T31TCJ.err
          * validity_raster_T31TCJ.out

        * ! Envelope

          * tiles_envelopes.err
          * tiles_envelopes.out

        * ! GenRegionVector

          * region_generation.err
          * region_generation.out

        * ! VectorFormatting

          * vector_form_T31TCJ.err
          * vector_form_T31TCJ.out

        * ! SamplesMerge

          * merge_model_1_seed_0.err
          * merge_model_1_seed_0.out

        * ! StatsSamplesModel

          * stats_1_S_0_T_T31TCJ.err
          * stats_1_S_0_T_T31TCJ.out

        * ! SamplingLearningPolygons

          * s_sel_model_1_seed_0.err
          * s_sel_model_1_seed_0.out

        * ! SamplesByTiles

          * merge_samples_T31TCJ.err
          * merge_samples_T31TCJ.out

        * ! SamplesExtraction

          * extraction_T31TCJ.err
          * extraction_T31TCJ.out

        * ! SamplesByModels

          * merge_model_1_seed_0_usually.err
          * merge_model_1_seed_0_usually.out

        * ! TrainRegressionPytorch

          * learning_model_1_seed_0_hyp_0.err
          * learning_model_1_seed_0_hyp_0.out

        * ! ModelChoice

          * choose_model_1_seed_0_usually.err
          * choose_model_1_seed_0_usually.out

        * ! PredictRegressionPytorch

          * regression_T31TCJ_model_1_seed_0_0.err
          * regression_T31TCJ_model_1_seed_0_0.out
          * regression_T31TCJ_model_1_seed_0_1.err
          * regression_T31TCJ_model_1_seed_0_1.out
          * regression_T31TCJ_model_1_seed_0_2.err
          * regression_T31TCJ_model_1_seed_0_2.out

        * ! ScikitClassificationsMerge

          * classif_T31TCJ_model_1_seed_0_mosaic.err
          * classif_T31TCJ_model_1_seed_0_mosaic.out

        * ! Mosaic

          * mosaic.err
          * mosaic.out

        * ! GenerateRegressionMetrics

          * regression_metrics_T31TCJ_seed_0.err
          * regression_metrics_T31TCJ_seed_0.out
          * merge_metrics.err
          * merge_metrics.out

        * tasks_status_i2_regression_1.svg
        * tasks_status_i2_regression_2.svg
        * tasks_status_i2_regression_3.svg
        * run_information.txt summary

          | summary as displayed by the "only_summary" option

      * ! model desc

        | The learned models

        * model_1_seed_0.txt

      * ! samplesSelection shapefiles

        | Shapefiles containing points (or pixel coordinates) selected for the training stage.
        | Also contains a CSV summary of the actual number of samples per class.

        * samples_region_1_seed_0.dbf
        * samples_region_1_seed_0_outrates.csv
        * samples_region_1_seed_0.prj
        * samples_region_1_seed_0_selection.sqlite
        * samples_region_1_seed_0.shp
        * samples_region_1_seed_0.shx
        * samples_region_1_seed_0.xml
        * T31TCJ_region_1_seed_0_stats.xml
        * T31TCJ_samples_region_1_seed_0_selection.sqlite
        * T31TCJ_selection_merge.sqlite

      * ! shapeRegion desc

        | Shapefiles indicating the intersection between tiles and regions.

        * MyRegion_region_1_T31TCJ.dbf
        * MyRegion_region_1_T31TCJ.prj
        * MyRegion_region_1_T31TCJ.shp
        * MyRegion_region_1_T31TCJ.shx
        * MyRegion_region_1_T31TCJ.tif

      * ! stats statistics

        | Optional xml statistics to standardize the data before learning (svm...).

        * (empty)

      * IOTA2_tasks_status.txt internal execution status

        | Iota2 keeps track of its execution using this *pickle* file (not text) so that it can restart from the state where it stopped.

      * logs.zip logs archive
      * MyRegion.dbf
      * MyRegion.prj
      * MyRegion.shp fake region

        | When no eco-climatic region is defined for the learning step, Iota2 creates this fake file with a single region.

      * MyRegion.shx
      * reference_data.dbf
      * reference_data.prj
      * reference_data.shp reference data

        | ground_truth data where a column "split" has been added

      * reference_data.shx

.. _final:

Final folder
------------

The ``final`` folder contains your prediction maps and CSV metrics files.

The following validation metrics are computed between the actual values in the
validation samples and the corresponding predicted values in the final map:

- ``max_error``
- ``mean_absolute_error``
- ``mean_squared_error``
- ``median_absolute_error``
- ``r2_score``

A summary over the different seeds (when ``runs > 1``) is written, where the
summary value is:

- ``max(vals)`` for max error
- ``mean(vals)`` for mae
- ``mean(vals)`` for mse
- ``median(vals)`` for median absolute error
- ``mean(vals)`` for r2 score

Implementation notes
====================

In order to perform the sampling, a fake column named "split", filled with
zeros, is added to the ground truth data. The sampling strategy is applied to
this column.

The data is standardized before being fed to the model and destandardized
after prediction. The scaler is serialized along with the model (see the
sketch at the end of this section).

While prediction files are named ``Regression_TILE_*.tif``, the folder name
``classif`` and some step names ``classif_*`` have been kept for compatibility
reasons. Do not be surprised to find things named "classif" even in regression
mode.
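
The sketch below illustrates the standardize/destandardize round trip described
above, using scikit-learn's ``StandardScaler``. It is an assumption-based
illustration with placeholder values, not Iota2's actual implementation.

.. code-block:: python

    # Illustrative sketch of the standardize / destandardize round trip
    # (not Iota2's actual code).
    import numpy as np
    from sklearn.preprocessing import StandardScaler

    y_train = np.array([[2.0], [4.0], [6.0]])  # raw target values (placeholder)

    scaler = StandardScaler()
    y_scaled = scaler.fit_transform(y_train)   # mean 0, std 1 for training

    # ... a model would be trained on y_scaled, then used to predict ...
    y_pred_scaled = np.array([[0.0], [1.0]])   # placeholder model output

    # Predictions are brought back to the original target scale.
    y_pred = scaler.inverse_transform(y_pred_scaled)
    print(y_pred)

    # The fitted scaler can be serialized alongside the model, e.g. with joblib:
    # import joblib; joblib.dump(scaler, "scaler.joblib")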

Available features
==================

Some Iota2 features are not available, depending on the workflow:

- external features are not available in the Scikit-Learn workflow
- in the Pytorch workflow, the following features are not supported in regression mode:

  - adaptive learning rate
  - learning weight
  - early stop
  - confidence maps

Known limitations
=================

Regression has some known limitations that remain to be fixed in further
contributions.

Prediction is done on the whole map; it is not possible to restrict it to
given polygons. That is why validation metrics can only be computed after the
full map prediction.

Keyword arguments cannot be passed to the metric (for example, the HuberLoss
``delta`` parameter).

Standardization of the target cannot be turned off. It is done in one batch
and can take a lot of memory if the dataset is huge.

Tutorial
********

First :doc:`install Iota2 ` and download the test dataset
`IOTA2_TEST_S2.tar.bz2 (8.8 GB) `_.

You can then use Iota2 in regression mode to predict continuous variables. As
an example, we provide the ``ndvi.shp`` file containing NDVI values computed
by Iota2 for pixels of the ``T31TCJ`` tile on 2018-05-11.

Include this in your config file (choose the Pytorch or Scikit-Learn template):

.. code-block:: python

    chain:
    {
        list_tile: "T31TCJ"
        ground_truth: "/XXXX/IOTA2_TEST_S2/tuto_regression/ndvi.shp"
        data_field: "ndvi"
    }

Then run Iota2. After Iota2 has finished, you should find a
``final/T31TCJ_seed0_metrics.csv`` file in your output folder with content
similar to this:

.. csv-table::
   :header: max_error, mean_absolute_error, mean_squared_error, median_absolute_error, r2_score

   0.14909570239257, 0.0009215711540472297, 0.004862749426385645, 0.0006626362304687063, 0.9999464040252317

This means the regressor you chose was able to predict the NDVI values of the
validation pixels, based on the spectral bands of these pixels and the
observations of the training pixels.

.. raw:: html
   :file: interactive-tree.html
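
If you want to recompute such metrics yourself from arrays of predicted and
reference values, the scikit-learn functions below match the metric names used
in the CSV files. This is an independent sketch with placeholder values, not
the code Iota2 runs.

.. code-block:: python

    # Recompute the validation metrics from predicted vs. reference values
    # (illustrative; the arrays below are placeholders).
    import numpy as np
    from sklearn.metrics import (
        max_error,
        mean_absolute_error,
        mean_squared_error,
        median_absolute_error,
        r2_score,
    )

    y_true = np.array([0.31, 0.55, 0.72, 0.64])  # reference values (e.g. NDVI)
    y_pred = np.array([0.30, 0.56, 0.71, 0.65])  # values read from the final map

    for name, metric in [
        ("max_error", max_error),
        ("mean_absolute_error", mean_absolute_error),
        ("mean_squared_error", mean_squared_error),
        ("median_absolute_error", median_absolute_error),
        ("r2_score", r2_score),
    ]:
        print(name, metric(y_true, y_pred))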