*********************
How iota2 is designed
*********************

Introduction
============

Originally, iota2 was designed as a classification workflow for large scale land cover mapping.
As such, it includes many tools, from data preprocessing up to the computation of metrics from a confusion matrix.
Many of these tools are not specific to supervised classification and can be useful for other data generation purposes (regression, feature extraction, etc.) as long as they need to be applied to several tiles in a scalable way.

How is iota2 built
======================

iota2 is built using three concepts:

1. :ref:`Tasks <tasks>`
2. :ref:`Steps <steps>`
3. :ref:`Groups <groups>`

These three elements make it possible to manage large amounts of data and several combinations of inputs.

.. _tasks:

Tasks
-----

A `task` corresponds to the processing applied to some inputs. It produces output data which can be fed to the next task or can provide a final product like a land cover map.
For instance, in a classification workflow, a task can correspond to computing the cloud cover for a given tile, computing the ratio of samples, training a model or merging several classifications.
A task may be run many times for a range of parameters, e.g. a tile, a set of dates, a climatic region, etc.

.. _steps:

Steps
-----

A `step` is a container which manages several tasks which can be run together in sequence.
The main goal of a step is to perform all linked tasks.
A set of tasks is put together into a step for computational purposes.
For instance, some tasks, when connected together, can pass data to one another in memory without needing to write intermediate outputs to disk.

Using steps makes it possible to divide the whole pipeline into several independent parts.
This way, if errors occur during the execution, the completed steps don't need to be run again.
Once the error is corrected, the execution can be resumed starting with the first incomplete step.
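
The resume behaviour described above can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: the ``Step`` class and the ``run_pipeline`` function below are hypothetical names, not part of the iota2 API, and the completion state is kept in memory rather than persisted.

```python
"""Hypothetical sketch of steps grouping tasks, with resume support (not the iota2 API)."""
from __future__ import annotations

from typing import Callable

Task = Callable[[], None]


class Step:
    """A named container running its tasks in sequence."""

    def __init__(self, name: str, tasks: list[Task]) -> None:
        self.name = name
        self.tasks = tasks
        self.done = False

    def run(self) -> None:
        for task in self.tasks:
            task()
        # Only reached when every task of the step succeeded.
        self.done = True


def run_pipeline(steps: list[Step]) -> None:
    """Run the steps in order, skipping those already completed."""
    for step in steps:
        if not step.done:
            step.run()
```

Calling ``run_pipeline`` a second time after a failure skips the steps already marked as done and restarts at the first incomplete one.
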
The step manages all the parameters and distributes them to the different tasks.
It is also in charge of defining how many times each task is called.

More information about steps is available :ref:`here `

.. _groups:

Groups
------

A `group` contains several steps. Unlike a step, a group is more abstract because it is only used for scheduling purposes.
A group is used to label a set of steps.
The iota2 parameters then make it possible to select groups in order to perform only a part of the workflow.

Simplified classification workflow
----------------------------------

The following example illustrates the link between tasks, steps and groups.
To this end, a very simplified workflow is considered:

1. Compute the common mask between all images for each tile
2. Compute a binary mask for clouds, saturations and borders
3. Compute the class ratio between learning and validation
4. Extract the samples for learning
5. Train the classifier
6. Classify the images
7. Merge all the classifications to produce a unique map

.. graphviz::

    digraph {
        rankdir=TB;
        subgraph cluster_0 {
            color=green;
            label="Group: Init";
            a [label= "Common Mask", shape=box, color=blue]
            t11 [label="tile 1"]
            t12 [label="tile 2"]
            a -> t11 [style=dotted];
            a-> t12 [style=dotted];
            b [label="Cloud Cover", shape=box, color=blue]
            a -> b;
            t1n1 [label="tile N-1"]
            t1n [label="tile N"]
            a -> t1n1 [style=dotted];
            a -> t1n [style=dotted];
            t112 [label="tile 1"]
            t122 [label="tile 2"]
            t1n12 [label="tile N-1"]
            t1n2 [label="tile N"]
            b -> t112 [style=dotted];
            b -> t122 [style=dotted];
            b -> t1n12 [style=dotted];
            b -> t1n2 [style=dotted];
        }
        Tasks
        Steps [color=blue, shape=box]
        Groups [color=green, shape=box]
        subgraph cluster_1 {
            color=green;
            label="Group: Sampling";
            c [label="Samples selection", shape=box, color=blue]
            b -> c;
            t21 [label="tile 1"]
            t22 [label="tile 2"]
            c -> t21 [style=dotted];
            c-> t22 [style=dotted];
            d [label="Samples Extraction", shape=box, color=blue]
            c -> d;
            t2n1 [label="tile N-1"]
            t2n [label="tile N"]
            c -> t2n1 [style=dotted];
            c -> t2n [style=dotted];
            t212 [label="tile 1"]
            t222 [label="tile 2"]
            t2n12 [label="tile N-1"]
            t2n2 [label="tile N"]
            d -> t212 [style=dotted];
            d -> t222 [style=dotted];
            d -> t2n12 [style=dotted];
            d -> t2n2 [style=dotted];
        }
        subgraph cluster_2 {
            color=green;
            label="Learning and Classification";
            e [label="Learning", shape=box, color=blue]
            d -> e;
            t31 [label="Region 1"]
            e -> t31 [style=dotted];
            f [label="Classification", shape=box, color=blue]
            e -> f;
            t312 [label="tile 1"]
            t322 [label="tile 2"]
            t3n12 [label="tile N-1"]
            t3n2 [label="tile N"]
            f -> t312 [style=dotted];
            f -> t322 [style=dotted];
            g [label="Merge tiles", color=blue, shape=box]
            f -> t3n12 [style=dotted];
            f -> t3n2 [style=dotted];
            f->g;
            g -> "All tiles";
        }
    }

What is a builder ?
===================

The sequence of steps as illustrated above represents a builder. It represents the order in which the different steps are performed.
The aim of generic builders, introduced here, is to encourage users to create their own chains, using the iota2 API.
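
As an illustration of how groups label an ordered sequence of steps, the simplified workflow above can be written as an ordered mapping from which a sub-range is selected. This is a minimal sketch: ``STEPS_GROUPS``, ``select_steps`` and the lowercase group and step names are hypothetical, chosen to mirror the diagram, and do not reflect the actual iota2 parameters.

```python
"""Hypothetical sketch: groups label steps so that a sub-range can be run."""
from collections import OrderedDict

# Group name -> ordered step names, mirroring the simplified workflow.
STEPS_GROUPS = OrderedDict(
    [
        ("init", ["common_mask", "cloud_cover"]),
        ("sampling", ["samples_selection", "samples_extraction"]),
        ("learning_classification", ["learning", "classification", "merge_tiles"]),
    ]
)


def select_steps(groups, first_group, last_group):
    """Return the steps from ``first_group`` to ``last_group`` inclusive."""
    names = list(groups)
    start, stop = names.index(first_group), names.index(last_group)
    if start > stop:
        raise ValueError(f"{first_group!r} comes after {last_group!r}")
    selected = []
    for name in names[start : stop + 1]:
        selected.extend(groups[name])
    return selected
```

For instance, ``select_steps(STEPS_GROUPS, "sampling", "sampling")`` keeps only the two sampling steps, which is the kind of partial run that groups are meant to enable.
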
What is contained in a builder ?
================================

A builder builds the task graph; in other words, it creates a chain of several ``Steps``.
To allow custom builders, inheritance is used.
The superclass, called ``i2_builder``, contains all the generic methods used by iota2 for logging or printing workflows.
It also defines the most important function, ``build_steps``, which must be implemented in each derived builder.

The ``__init__`` function
-------------------------

Each builder has two mandatory attributes:

* ``steps_groups``: an ordered dict. Each key is a unique string corresponding to a group name. Each value is an ordered dict dedicated to that group, which contains an ordered set of steps.
* ``steps``: an ordered list of all steps to be done. This list is filled using the class method ``build_steps``.

The ``build_steps`` function:
-----------------------------

This is the central point of a user contribution to create a new chain.
This function is simply a concatenation of all steps, in the order in which they need to be processed.
Each step is initialized and added to a ``StepContainer``, which is returned at the end of the ``build_steps`` function.

It is strongly recommended to add checks on the configuration file variables and also to verify that the steps are allowed to be launched in the same run.
This is especially important if several combinations of steps are allowed.

For a concrete example, refer to the i2_classification builder.

Use a new chain
===============

With this information, it is possible to create a new chain by simply creating a new builder.
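
The builder structure described in the previous section can be sketched as follows. This is not the iota2 implementation: ``SketchBuilder``, ``TwoStepBuilder`` and ``register`` are hypothetical names, steps are reduced to strings, and a plain list stands in for the ``StepContainer``.

```python
"""Hypothetical sketch of the builder pattern; class names are illustrative only."""
from abc import ABC, abstractmethod
from collections import OrderedDict


class SketchBuilder(ABC):
    """Stand-in for a builder superclass: holds the groups and the step list."""

    def __init__(self, config: dict) -> None:
        self.config = config
        # Group name -> ordered dict of the steps belonging to that group.
        self.steps_groups = OrderedDict()
        self.steps = self.build_steps()

    def register(self, group: str, step_name: str) -> str:
        """Record ``step_name`` under ``group`` and return it."""
        self.steps_groups.setdefault(group, OrderedDict())[step_name] = None
        return step_name

    @abstractmethod
    def build_steps(self) -> list:
        """Return the steps in the order in which they must be processed."""


class TwoStepBuilder(SketchBuilder):
    """Derived builder: only ``build_steps`` has to be written."""

    def build_steps(self) -> list:
        # Check the configuration before building anything.
        if not self.config.get("tiles"):
            raise ValueError("at least one tile is required")
        return [
            self.register("init", "create_directories"),
            self.register("processing", "write_segments"),
        ]
```

The configuration check placed at the top of ``build_steps`` follows the recommendation above: an inconsistent configuration is rejected before any step is created.
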
To indicate which builder must be used when iota2 is launched, it is necessary to fill the dedicated block of the configuration file: ``builders``

+--------------------+--------+----------+
| Field name         | Type   | Required |
+--------------------+--------+----------+
|builders_paths      | string | Optional |
+--------------------+--------+----------+
|builders_class_name | list   | Optional |
+--------------------+--------+----------+

All these parameters are optional because, by default, the builder used produces the land cover map over the specified tiles.

* ``builders_paths``: the location where builders are stored. If not indicated, the chain will look into the relative path `IOTA2DIR/iota2/sequence_builder`
* ``builders_class_name``: the class names of the builders to use, ordered in a list. If not indicated, the chain will launch the classification builder. A .py file can contain several classes or builders, so the one to be used must be chosen explicitly. Available builders are: `i2_classification`, `i2_features_map` and `i2_vectorization`.

If one of these parameters is not consistent with the others, the chain will return an error.

Example
-------

In this section, a new chain will be designed and created entirely.
The goal here is to provide a working example, using both new and existing functions from the `iota2` API.

Design the new chain
====================

This chain simulates the classification of images based on segmentation.
To do this, the chain must perform the initial segmentation, then calculate zonal statistics that will be used to train a classifier.
Then, still using the zonal statistics together with the learned model, a classification is carried out, providing the final product expected by the chain.

To simplify the construction of the chain and provide a very fast execution, the data are replaced by text files, and the different algorithms modify the content of these files.
However, the programming paradigms of iota2, such as tile processing, are respected.
Creating the chain
==================

The first thing to do is to code the functions that will be used to perform the processing.
Most of the functions presented in this section perform trivial placeholder operations; what matters most is the construction of the steps and of the dependency graph.

To write these functions, we have to keep in mind the granularity of the processing we want to apply.

Functions definition
--------------------

The first function usually consists in creating the tree structure of the output folders.
It requires only one input parameter: a path to a directory.

.. literalinclude:: examples/builder_example.py
    :pyobject: create_directories

The second step is to perform the segmentation.
In this example, we simulate a tile segmentation by writing a random number of segments.
This trick is realistic because when segmenting a satellite image, the number of detected objects cannot be known in advance.
Whereas the previous function only needed a directory, this one processes a single tile.
When using Sentinel-2 data over a whole year, it is not possible to hold several tiles in memory at the same time.

.. literalinclude:: examples/builder_example.py
    :pyobject: write_segments_by_tiles

Once the segmentation is completed, we can compute the zonal statistics that will be used for training and classification.
This processing again depends on a tile, and provides a result for each segment present in the tile.

.. literalinclude:: examples/builder_example.py
    :pyobject: compute_zonal_stats

Now, we can move on to training the classifier.
To do this, we search for all the files containing the statistics.
The training here is fake: we simply count the number of files found.

.. literalinclude:: examples/builder_example.py
    :pyobject: training_model

Once the model has been trained, we can move on to classification.
For this example, a constraint of the classifier is that it can only process one segment at a time.
So we need to call this classifier for each segment.

.. literalinclude:: examples/builder_example.py
    :pyobject: classify

Once the classification is finished, we gather the different segments, first by tile and then into a single file containing all the tiles.

.. literalinclude:: examples/builder_example.py
    :pyobject: merge_classification_by_tile

.. literalinclude:: examples/builder_example.py
    :pyobject: mosaic

Now we have all the functions required to produce a classification from a segmentation.
We need to link them to produce the chain.
This is done in two phases: the creation of the steps and the creation of the builder.

Step declaration
----------------

To start writing the steps, you have to keep in mind their sequence, and especially the dependencies between steps.
The general workflow is given by the previous section since the functions are listed in logical sequence.
In this section the emphasis is on the links between the steps.

The first step is to create the tree structure.
So we write a step that calls the corresponding function.

.. literalinclude:: examples/builder_example.py
    :pyobject: step_create_directories

Several important points:

- `resources_block_name` is a mandatory attribute; it lets you allocate resources to a particular processing. Indeed, creating directories consumes fewer resources than computing a segmentation.
- the call to `i2_task` provides the parameters of the function called in the step. It is important to make sure that all the mandatory parameters are filled in at this stage.
- `add_task_to_i2_processing_graph` is the function that adds the different steps to the Dask scheduler.
- finally, the class method `step_description` improves the readability of the different graphical representations of the chain that we will see later.

As this is the first step, there are no dependencies to be expressed.
The second step is writing the segmentation.
We can note that we define as many tasks as tiles, and each task is added to the graph.
An important point here concerns the `task_name`: it must depend on the tile name because each `task_name` must be unique in the execution graph.

.. literalinclude:: examples/builder_example.py
    :pyobject: step_write_segment

This step can start once the previous one is done.
This is expressed through `task_dep_dico={"first_task":[]}`.
It means that we wait until all items in the list corresponding to `first_task` are completed.
Here we indicate that each task has its own label, with `task_sub_group`, which depends on the tile name.

Once the segmentation is done, the zonal statistics can be computed for each tile which has already been processed.
The step then looks like the following.

.. literalinclude:: examples/builder_example.py
    :pyobject: step_compute_zonal_stats

Again, the dependencies are expressed as a dictionary containing a list.
In this case one simply waits until the corresponding tile has been processed.

To train the model, we wait for all the tiles to be processed.
The corresponding step expresses the dependencies by waiting for all zonal statistics tasks.

.. literalinclude:: examples/builder_example.py
    :pyobject: step_training

Until now, the dependencies between the different steps were set at the level of the tile.
For the next step, we need to know the number of segments per tile.
Because of the constraint we impose on the classifier, one task is needed per segment, but the number of segments per tile cannot be known before the segmentation has been computed.
To overcome this problem we need to divide the execution graph in two.
This separation could have occurred at any time between the calculation of the segmentation and the classification.
We have made this choice in order to illustrate several cases of dependencies.

The following figure represents the execution graph, from the creation of the directories to the learning of the model.

.. figure:: ./Images/exec_graph_builder.png
    :scale: 50 %
    :align: center
    :alt: execution graph

To perform the classification and merge steps, we need to create a second execution graph.
This way the classification step does not depend on any previous step.
We will see later how to express the dependency between the two graphs.

The classification step has an empty dependency list, but generates dependencies that are named using both the tile name and the segment identifier.

.. literalinclude:: examples/builder_example.py
    :pyobject: step_classification

Once the classification of a tile is finished, we can merge its different segments.

.. literalinclude:: examples/builder_example.py
    :pyobject: step_merge_classification_by_tile

Finally, when all the segments of all the tiles have been merged by tile, we can build the mosaic of the different tiles.

.. literalinclude:: examples/builder_example.py
    :pyobject: step_mosaic

.. figure:: ./Images/exec_graph_builder_2.png
    :align: center
    :alt: Second part of execution graph

At this stage, we have created all the steps needed by the processing chain.
We have expressed the dependencies between the different steps.
However, we still lack the central part that makes `iota2` understand that this is a new processing chain.
It is thus necessary to write the builder.

Writing the builder
-------------------

.. currentmodule:: iota2.sequence_builder.i2_sequence_builder

A builder is a class which inherits from :class:`i2_builder`.

First, the `__init__` method instantiates attributes and defines groups.
Groups are only labels that make it possible to restart the chain at a specific block.
For this example, we have only one group named `global`.

.. literalinclude:: examples/builder_example.py
    :pyobject: i2_example_builder.__init__

The function `build_steps` is the most important.
As we need two graphs, we define two `StepContainer`.
We add steps to the container.
As we need to know how many segments there are in each tile, we use the function `get_number_of_segment_by_tiles`, which fills a dictionary containing the number of segments for each tile.
This dictionary is filled after the execution of all the steps of the first `StepContainer` and before the chain instantiates the second container.

The complete builder then looks like this:

.. literalinclude:: examples/builder_example.py
    :pyobject: i2_example_builder

Congratulations, the new chain is created. Before using it, we must fill the corresponding configuration file.

Filling a configuration file
----------------------------

In iota2, all the parameters defined and used are known by all builders.
When creating a new builder, it can be useful to use the existing parameters instead of creating new ones.

For this example, we use existing blocks like `chain` or `builder`, which are mandatory, and create a new one, `MyParams`.
In `MyParams` we define a parameter `tiles` which is identical to the existing `listTile`, in order to show that there is no conflict between different blocks.

.. include:: examples/config_example.cfg
    :literal:

.. currentmodule:: iota2.sequence_builders.i2_sequence_builder_merger

.. warning::
    iota2 is able to merge builders into one by the use of :class:`workflow_merger`. Currently, builders sharing the same step cannot be merged.
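
To recap the dependency mechanism used throughout this example, the two execution graphs can be mimicked with the standard library alone. The sketch below is not how iota2 schedules tasks (iota2 relies on Dask); the function names, the task naming scheme and the tile names are hypothetical, and ``graphlib`` requires Python 3.9 or later.

```python
"""Hypothetical sketch of the two execution graphs, using only the standard library."""
from graphlib import TopologicalSorter


def first_graph(tiles):
    """Directories -> segmentation per tile -> zonal stats per tile -> training."""
    graph = {"create_directories": set()}
    for tile in tiles:
        graph[f"segment_{tile}"] = {"create_directories"}
        graph[f"zonal_stats_{tile}"] = {f"segment_{tile}"}
    # Training waits for the zonal statistics of every tile.
    graph["training"] = {f"zonal_stats_{tile}" for tile in tiles}
    return graph


def second_graph(segments_by_tile):
    """Classification per segment -> merge per tile -> mosaic."""
    graph = {}
    for tile, n_segments in segments_by_tile.items():
        classif = {f"classify_{tile}_{seg}" for seg in range(n_segments)}
        for task in classif:
            graph[task] = set()  # no dependency inside this second graph
        graph[f"merge_{tile}"] = classif
    graph["mosaic"] = {f"merge_{tile}" for tile in segments_by_tile}
    return graph


def execution_order(graph):
    """Return one valid execution order of the tasks."""
    return list(TopologicalSorter(graph).static_order())
```

The second graph can only be built once the number of segments per tile is known, which is why ``second_graph`` takes that dictionary as input while ``first_graph`` only needs the tile names.
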