Dimensionality Reduction in iota2
#################################

Iota2 provides a functionality called external features, which allows manipulating time series through Python functions. We will show how to use this functionality for dimensionality reduction. A dedicated documentation for configuring this workflow is available in :doc:`External features `.

In this example, we will use the `PCA <https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html>`_ class from scikit-learn, but any other dimensionality reduction method can be used.

.. code-block:: python

    import numpy as np
    from sklearn.decomposition import PCA

    # DataContainer and I2Label are provided by iota2's external
    # features framework.

    def pca_reduction(self: DataContainer) -> tuple[np.ndarray, list]:
        """
        Perform Principal Component Analysis (PCA) on the interpolated
        data to reduce its dimensionality.
        """
        original_shape = self.interpolated_data.shape
        # Flatten the spatial dimensions: one row per pixel, one column
        # per feature.
        reshaped_data = self.interpolated_data.reshape(-1, original_shape[2])

        # Apply PCA
        n_components = 3
        pca = PCA(n_components=n_components)
        pca_result = pca.fit_transform(reshaped_data)
        reduced_data = pca_result.reshape(
            original_shape[0], original_shape[1], n_components
        )
        labels = [
            I2Label(sensor_name="pca", feat_name=i + 1)
            for i in range(reduced_data.shape[2])
        ]
        return reduced_data, labels

In this example, the `pca_reduction` function applies dimensionality reduction directly to the interpolated data `self.interpolated_data`. However, due to the size of the data being manipulated, this data is often divided into chunks, so `self.interpolated_data` often represents only a portion of the image on a given tile. Moreover, in a typical iota2 run, the data is distributed over several tiles: a PCA fitted independently on each chunk would produce components that are not comparable across chunks or tiles.

To achieve a dimensionality reduction that is coherent over the whole area of interest, we can instead rely on the data present in the databases created for model training. These files are located in the `learningSamples` folder of the iota2 output directory, for example `learningSamples/Samples_region_1_seed0_learn.sqlite`.
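Before building the model, it can be helpful to check which tables and feature columns such a database actually contains. Here is a minimal sketch, assuming a default iota2 run where the learning samples are stored in an `output` table (the table name used by the script below); adapt `db_path` to your own output directory.

.. code-block:: python

    import sqlite3

    db_path = "learningSamples/Samples_region_1_seed0_learn.sqlite"

    conn = sqlite3.connect(db_path)
    cursor = conn.cursor()

    # List the tables present in the database.
    cursor.execute("SELECT name FROM sqlite_master WHERE type='table'")
    print("tables:", [row[0] for row in cursor.fetchall()])

    # Inspect the columns of the 'output' table to check the feature
    # naming convention (columns whose name contains 'sentinel').
    cursor.execute("PRAGMA table_info(output)")
    columns = [row[1] for row in cursor.fetchall()]
    print("sentinel columns:", [c for c in columns if "sentinel" in c])

    conn.close()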
Here is an example of Python functions that can be used to build a PCA model and save it to disk. For this example, all the columns containing the string 'sentinel' are used to build the PCA model.

.. code-block:: python

    import sqlite3

    import joblib
    import pandas as pd
    from sklearn.decomposition import PCA


    def get_sentinel_columns(db_path: str, table_name: str) -> list:
        """
        Retrieve columns containing 'sentinel' from the specified SQLite table.

        Parameters
        ----------
        db_path : str
            Path to the SQLite database file.
        table_name : str
            Name of the table to query.

        Returns
        -------
        list
            List of columns containing 'sentinel'.
        """
        conn = sqlite3.connect(db_path)
        query = f"PRAGMA table_info({table_name})"
        table_info = pd.read_sql_query(query, conn)
        conn.close()
        sentinel_columns = table_info[table_info["name"].str.contains("sentinel")][
            "name"
        ].tolist()
        return sentinel_columns


    def load_sentinel_data(
        db_path: str, table_name: str, sentinel_columns: list
    ) -> pd.DataFrame:
        """
        Load data from the specified columns of the SQLite table.

        Parameters
        ----------
        db_path : str
            Path to the SQLite database file.
        table_name : str
            Name of the table to query.
        sentinel_columns : list
            List of columns to load data from.

        Returns
        -------
        pd.DataFrame
            DataFrame with the loaded data.
        """
        if sentinel_columns:
            columns_query = ", ".join(sentinel_columns)
            query = f"SELECT {columns_query} FROM {table_name}"
            conn = sqlite3.connect(db_path)
            df = pd.read_sql_query(query, conn)
            conn.close()
        else:
            # If no columns contain "sentinel", return an empty DataFrame.
            df = pd.DataFrame()
        return df


    def apply_pca(data: pd.DataFrame, n_components: int = 3) -> PCA:
        """
        Apply PCA to the given data.

        Parameters
        ----------
        data : pd.DataFrame
            Input data for PCA.
        n_components : int, optional
            Number of PCA components, by default 3.

        Returns
        -------
        PCA
            PCA object after fitting the data.
        """
        values = data.values
        pca = PCA(n_components=n_components)
        pca.fit(values)
        return pca


    def build_pca(db_path: str, output_pca_model: str) -> None:
        """
        Build and save a PCA model based on sentinel columns in the
        specified SQLite table.

        Parameters
        ----------
        db_path : str
            Path to the SQLite database file.
        output_pca_model : str
            Path to store the PCA model.
        """
        table_name = "output"
        sentinel_columns = get_sentinel_columns(db_path, table_name)
        df = load_sentinel_data(db_path, table_name, sentinel_columns)
        if not df.empty:
            pca = apply_pca(df)
            joblib.dump(pca, output_pca_model)
            print("PCA model saved successfully.")
        else:
            print("No sentinel columns found in the table.")


    if __name__ == "__main__":
        build_pca(
            "learningSamples/Samples_region_1_seed0_learn.sqlite",
            "pca_model.joblib",
        )

Once the model is saved on disk, we can reuse it in our external feature function and apply it to the Sentinel-2 data. Our initial function then becomes:

.. code-block:: python

    import joblib
    import numpy as np

    # DataContainer and I2Label are provided by iota2's external
    # features framework.

    def pca_reduction(self: DataContainer, pca_file: str) -> tuple[np.ndarray, list]:
        """
        Perform Principal Component Analysis (PCA) on the interpolated
        data to reduce its dimensionality, using a PCA model fitted on
        the training samples.
        """
        pca_loaded = joblib.load(pca_file)
        original_shape = self.interpolated_data.shape
        reshaped_data = self.interpolated_data.reshape(-1, original_shape[2])

        # Apply the pre-fitted PCA: the number of output components is
        # given by the loaded model.
        pca_result = pca_loaded.transform(reshaped_data)
        reduced_data = pca_result.reshape(
            original_shape[0], original_shape[1], pca_loaded.n_components_
        )
        labels = [
            I2Label(sensor_name="pca", feat_name=i + 1)
            for i in range(reduced_data.shape[2])
        ]
        return reduced_data, labels

The configuration file may look like:

.. code-block:: python

    "external_features": {
        "functions": [["pca_reduction", {"pca_file": "/path/to/pca_model.joblib"}]],
        "module": "/absolute/path/to/pca_reduction.py",
        "concat_mode": False,
    },

.. warning:: It is important not to concatenate the result of the dimensionality reduction with the rest of the primitives: the :ref:`concat_mode ` parameter must be set to ``False``.

Summary of the steps to perform dimensionality reduction:

- Run iota2 up to the generation of training samples by models (step `Merge samples dedicated to the same model`) without dimensionality reduction.
- Run the script that will train the dimensionality reduction model, as described above (a quick check of the saved model is sketched below).
- Restart iota2 from the beginning, adding the user-defined dimensionality reduction function.
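Before restarting iota2, the saved model can be checked in isolation by applying it to a randomly generated array shaped like a chunk, mimicking the reshape/transform/reshape sequence of `pca_reduction`. This is only a sketch: the chunk dimensions are arbitrary placeholders and `np.random.rand` stands in for real interpolated data.

.. code-block:: python

    import joblib
    import numpy as np

    pca_loaded = joblib.load("pca_model.joblib")

    # Arbitrary chunk dimensions; the band count must match the number
    # of features the PCA model was fitted on.
    rows, cols = 64, 64
    n_bands = pca_loaded.n_features_in_
    chunk = np.random.rand(rows, cols, n_bands)

    # Same reshape / transform / reshape sequence as in pca_reduction.
    reduced = pca_loaded.transform(chunk.reshape(-1, n_bands))
    reduced = reduced.reshape(rows, cols, pca_loaded.n_components_)

    print(reduced.shape)  # expected: (64, 64, 3) with the default n_components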