xhydro_lstm package
Hydrological modelling using LSTMs.
Submodules
xhydro_lstm.create_datasets module
Tools to create the datasets to be used in LSTM model training and simulation.
- xhydro_lstm.create_datasets.create_dataset_flexible(filename: str | PathLike, dynamic_var_tags: list, qsim_pos: list, static_var_tags: list)[source]
Prepare the arrays of dynamic, static and observed flow variables.
- A few things are absolutely required:
a “watershed” coordinate that contains the ID of watersheds, such that we can preallocate the size of the matrices.
a “qobs” variable that contains observed flows for the catchments.
The size of the catchments in the “drainage_area” variable. This is used to compute scaled streamflow values for regionalization applications.
- Parameters:
filename (str or os.PathLike) – Path to the netcdf file containing the required input and target data for the LSTM. The ncfile must contain datasets named “qobs” and “drainage_area” for the code to work, as these are required as target and scaling for training, respectively.
dynamic_var_tags (list of str) – List of dataset variables to use in the LSTM model training. Must be part of the ncfile given by filename.
qsim_pos (list of bool) – List of same length as dynamic_var_tags. Should be set to all False EXCEPT where the dynamic_var_tags refer to flow simulations (ex: simulations from a hydrological model such as HYDROTEL). Those should be set to True.
static_var_tags (list of str) – List of the catchment descriptor names in the ncfile given by filename. They need to be present in the ncfile and will be used as inputs to the regional model, to help the flow regionalization process.
- Returns:
arr_dynamic (np.ndarray) – Tensor of size [watersheds x timestep x (n_dynamic_variables+1)] that contains the dynamic (i.e. time-series) variables that will be used during training. The first element in axis=2 is the observed flow.
arr_static (np.ndarray) – Tensor of size [watersheds x n_static_variables] that contains the static (i.e. catchment descriptors) variables that will be used during training.
q_stds (np.ndarray) – Tensor of size [watersheds] that contains the standard deviation of scaled streamflow values for the catchment associated to the data in arr_dynamic and arr_static.
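The role of the “drainage_area” variable can be illustrated with a minimal NumPy sketch. It assumes that "scaled" streamflow means flows in m3/s converted to area-normalized depths in mm/day, which is a common convention but is not confirmed by this documentation. The helper name scaled_flow_std is hypothetical and is not part of xhydro_lstm.

```python
import numpy as np


def scaled_flow_std(qobs, drainage_area):
    """Per-watershed standard deviation of area-scaled streamflow.

    Assumption (not taken from xhydro_lstm): qobs is in m3/s and
    drainage_area is in km2, so q * 86.4 / area gives mm/day.
    """
    qobs = np.asarray(qobs, dtype=float)           # [watersheds x timestep]
    area = np.asarray(drainage_area, dtype=float)  # [watersheds]
    q_mm = qobs * 86.4 / area[:, np.newaxis]       # broadcast area over time
    # nanstd ignores missing observations when computing the spread
    return np.nanstd(q_mm, axis=1)


qobs = np.array([[10.0, 20.0, np.nan, 30.0],
                 [1.0, 1.0, 1.0, 1.0]])
areas = np.array([864.0, 86.4])
q_stds = scaled_flow_std(qobs, areas)
print(q_stds.shape)  # (2,)
```

The result has one value per watershed, matching the [watersheds] size described for q_stds.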
- xhydro_lstm.create_datasets.create_dataset_flexible_local(filename: str | PathLike, dynamic_var_tags: list, qsim_pos: list)[source]
Prepare the arrays of dynamic and observed flow variables.
- A few things are absolutely required:
a “watershed” variable that contains the ID of watersheds, such that we can preallocate the size of the matrices.
a “qobs” variable that contains observed flows for the catchments.
- Parameters:
filename (str or os.PathLike) – Path to the netcdf file containing the required input and target data for the LSTM. The ncfile must contain a dataset named “qobs” and “drainage_area” for the code to work, as these are required as target and scaling for training, respectively.
dynamic_var_tags (list of str) – List of dataset variables to use in the LSTM model training. Must be part of the ncfile given by filename.
qsim_pos (list of bool) – List of same length as dynamic_var_tags. Should be set to all False EXCEPT where the dynamic_var_tags refer to flow simulations (ex: simulations from a hydrological model such as HYDROTEL). Those should be set to True.
- Returns:
arr_dynamic (np.ndarray) – Tensor of size [watersheds x timestep x (n_dynamic_variables+1)] that contains the dynamic (i.e. time-series) variables that will be used during training. The first element in axis=2 is the observed flow.
arr_qobs (np.ndarray) – Array containing the observed flow vector.
- xhydro_lstm.create_datasets.create_lstm_dataset(arr_dynamic: ndarray, arr_static: ndarray, q_stds: ndarray, window_size: int, watershed_list: list, idx: ndarray, remove_nans: bool = True)[source]
Create the LSTM dataset and shape the data using look-back windows and preparing all data for training.
- Parameters:
arr_dynamic (np.ndarray) – Tensor of size [watersheds x timestep x (n_dynamic_variables+1)] that contains the dynamic (i.e. time-series) variables that will be used during training. The first element in axis=2 is the observed flow.
arr_static (np.ndarray) – Tensor of size [watersheds x n_static_variables] that contains the static (i.e. catchment descriptors) variables that will be used during training.
q_stds (np.ndarray) – Tensor of size [watersheds] that contains the standard deviation of scaled streamflow values for the catchment associated to the data in arr_dynamic and arr_static.
window_size (int) – Number of days of look-back for training and model simulation. LSTM requires a large backwards-looking window to allow the model to learn from long-term weather patterns and history to predict the next day’s streamflow. Usually set to 365 days to get one year of previous data. This makes the model heavier and longer to train but can improve results.
watershed_list (list) – List of the watersheds that will be used for training and simulation. Corresponds to the watersheds in the input file, i.e. to axis 0 of the arr_dynamic array.
idx (np.ndarray) – 2-element array of indices of the beginning and end of the desired period for which the LSTM model should be simulated.
remove_nans (bool) – Flag indicating that the NaN values associated to the observed streamflow should be removed. Required for training but can be kept to False for simulation to ensure simulation on the entire period.
- Returns:
x (np.ndarray) – Tensor of size [(timesteps * watersheds) x window_size x n_dynamic_variables] that contains the dynamic (i.e. timeseries) variables that will be used during training.
x_static (np.ndarray) – Tensor of size [(timesteps * watersheds) x n_static_variables] that contains the static (i.e. catchment descriptors) variables that will be used during training.
x_q_stds (np.ndarray) – Tensor of size [(timesteps * watersheds)] that contains the standard deviation of scaled streamflow values for the catchment associated to the data in x and x_static. Each data point could come from any catchment and this q_std variable helps scale the objective function.
y (np.ndarray) – Tensor of size [(timesteps * watersheds)] containing the target variable for the same time point as in x, x_static and x_q_stds. Usually the observed streamflow for the day associated to each of the training points.
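The reshaping described above can be sketched with plain NumPy. This is an illustration under stated assumptions, not the package's implementation: the function name make_windows is hypothetical, and the alignment of the target (observed flow on the day following each window) is an assumption.

```python
import numpy as np


def make_windows(arr_dynamic, arr_static, q_stds, window_size):
    """Build look-back samples for every watershed (illustrative only)."""
    x, x_static, x_q_stds, y = [], [], [], []
    n_watersheds, n_steps, _ = arr_dynamic.shape
    for w in range(n_watersheds):
        for t in range(window_size, n_steps):
            # Predictors: the window_size days before day t, without column 0 (qobs)
            x.append(arr_dynamic[w, t - window_size:t, 1:])
            # Static descriptors and flow std are repeated for every sample
            x_static.append(arr_static[w])
            x_q_stds.append(q_stds[w])
            # Target: observed flow on day t (assumed alignment)
            y.append(arr_dynamic[w, t, 0])
    return np.stack(x), np.stack(x_static), np.array(x_q_stds), np.array(y)


rng = np.random.default_rng(0)
arr_dynamic = rng.random((2, 10, 3))   # 2 watersheds, 10 days, qobs + 2 inputs
arr_static = rng.random((2, 2))        # 2 descriptors per watershed
q_stds = np.array([0.5, 1.5])
x, x_static, x_q_stds, y = make_windows(arr_dynamic, arr_static, q_stds, 4)
print(x.shape, x_static.shape, x_q_stds.shape, y.shape)
```

With 10 days and a 4-day window, each watershed yields 6 samples, so the leading dimension of every output is 12.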
- xhydro_lstm.create_datasets.create_lstm_dataset_local(arr_dynamic: ndarray, window_size: int, idx: ndarray, remove_nans: bool = True)[source]
Create the local LSTM dataset and shape the data using look-back windows and preparing all data for training.
- Parameters:
arr_dynamic (np.ndarray) – Tensor of size [watersheds x timestep x (n_dynamic_variables+1)] that contains the dynamic (i.e. time-series) variables that will be used during training. The first element in axis=2 is the observed flow.
window_size (int) – Number of days of look-back for training and model simulation. LSTM requires a large backwards-looking window to allow the model to learn from long-term weather patterns and history to predict the next day’s streamflow. Usually set to 365 days to get one year of previous data. This makes the model heavier and longer to train but can improve results.
idx (np.ndarray) – 2-element array of indices of the beginning and end of the desired period for which the LSTM model should be simulated.
remove_nans (bool) – Flag indicating that the NaN values associated to the observed streamflow should be removed. Required for training but can be kept to False for simulation to ensure simulation on the entire period.
- Returns:
x (np.ndarray) – Tensor of size [timesteps x window_size x n_dynamic_variables] that contains the dynamic (i.e. timeseries) variables that will be used during training.
y (np.ndarray) – Tensor of size [timesteps] containing the target variable for the same time point as in x. Usually the observed streamflow for the day associated to each of the training points.
- xhydro_lstm.create_datasets.remove_nans_func(y: ndarray, x: ndarray, x_q_std: ndarray, x_static: ndarray)[source]
Check for NaNs in the variable “y” and remove the corresponding entries from all datasets.
- Parameters:
y (np.ndarray) – Array of target variables for training, that might contain NaNs.
x (np.ndarray) – Array of dynamic variables for LSTM model training and simulation.
x_q_std (np.ndarray) – Array of observed streamflow standard deviations for catchments in regional LSTM models.
x_static (np.ndarray) – Array of static variables for LSTM model training and simulation, specifically for regional LSTM models.
- Returns:
y (np.ndarray) – Array of target variables for training, with all NaNs removed.
x (np.ndarray) – Array of dynamic variables for LSTM model training and simulation, with values associated to NaN “y” values removed.
x_q_std (np.ndarray) – Array of observed streamflow standard deviations for catchments in regional LSTM models, with values associated to NaN “y” values removed.
x_static (np.ndarray) – Array of static variables for LSTM model training and simulation, specifically for regional LSTM models, with values associated to NaN “y” values removed.
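The filtering described above amounts to a boolean mask applied along the sample axis of every array. A minimal sketch follows; the function name drop_nan_targets is hypothetical and the package's own implementation may differ.

```python
import numpy as np


def drop_nan_targets(y, x, x_q_std, x_static):
    """Remove every sample whose target is NaN from all four arrays."""
    keep = ~np.isnan(y)  # boolean mask over the sample axis
    return y[keep], x[keep], x_q_std[keep], x_static[keep]


y = np.array([1.0, np.nan, 3.0])
x = np.zeros((3, 5, 2))
x_q_std = np.array([0.5, 0.5, 0.7])
x_static = np.ones((3, 4))
y2, x2, q2, s2 = drop_nan_targets(y, x, x_q_std, x_static)
print(y2)  # [1. 3.]
```

Because the same mask indexes every array, the samples stay aligned after filtering.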
- xhydro_lstm.create_datasets.remove_nans_func_local(y: ndarray, x: ndarray)[source]
Check for NaNs in the variable “y” and remove the corresponding entries from all datasets.
- Parameters:
y (np.ndarray) – Array of target variables for training, that might contain NaNs.
x (np.ndarray) – Array of dynamic variables for LSTM model training and simulation.
- Returns:
y (np.ndarray) – Array of target variables for training, with all NaNs removed.
x (np.ndarray) – Array of dynamic variables for LSTM model training and simulation, with values associated to NaN “y” values removed.
xhydro_lstm.lstm_controller module
Control the LSTM training and simulation tools to make clean workflows.
- xhydro_lstm.lstm_controller.control_local_lstm_training(input_data_filename: str, dynamic_var_tags: list, qsim_pos: list, batch_size: int = 32, epochs: int = 200, window_size: int = 365, train_pct: int = 60, valid_pct: int = 20, use_cpu: bool = True, use_parallel: bool = False, do_train: bool = True, model_structure: str = 'dummy_local_lstm', do_simulation: bool = True, training_func: str = 'kge', filename_base: str = 'LSTM_results', simulation_phases: list | None = None, name_of_saved_model: str | None = None)[source]
Control the local LSTM model training and simulation.
- Parameters:
input_data_filename (str) – Path to the netcdf file containing the required input and target data for the LSTM. The ncfile must contain a dataset named “qobs” and “drainage_area” for the code to work, as these are required as target and scaling for training, respectively.
dynamic_var_tags (list of str) – List of dataset variables to use in the LSTM model training. Must be part of the input_data_filename ncfile.
qsim_pos (list of bool) – List of same length as dynamic_var_tags. Should be set to all False EXCEPT where the dynamic_var_tags refer to flow simulations (ex: simulations from a hydrological model such as HYDROTEL). Those should be set to True.
batch_size (int) – Number of data points to use in training. Datasets are often way too big to train in a single batch on a single GPU or CPU, meaning that the dataset must be divided into smaller batches. This has an impact on the training performance and final model skill, and should be handled accordingly.
epochs (int) – Number of training evaluations. A larger number of epochs means more model iterations and deeper training. Training stops early once the validation skill stops improving.
window_size (int) – Number of days of look-back for training and model simulation. LSTM requires a large backwards-looking window to allow the model to learn from long-term weather patterns and history to predict the next day’s streamflow. Usually set to 365 days to get one year of previous data. This makes the model heavier and longer to train but can improve results.
train_pct (int) – Percentage of days from the dataset to use as training. The higher, the better for model training skill, but it is important to keep a decent amount for validation and testing.
valid_pct (int) – Percentage of days from the dataset to use as validation. The sum of train_pct and valid_pct needs to be less than 100, such that the remainder can be used for testing. A good starting value is 20%. Validation is used as the stopping criterion during training. When validation stops improving, the model is overfitting and training is stopped.
use_cpu (bool) – Flag to force the training and simulations to be performed on the CPU rather than on the GPU(s). Must be performed on a CPU that has AVX and AVX2 instruction sets, or tensorflow will fail. CPU training is very slow and should only be used as a last resort (such as for CI testing and debugging).
use_parallel (bool) – Flag to make use of multiple GPUs to accelerate training further. Models trained on multiple GPUs can have larger batch_size values as different batches can be run on different GPUs in parallel. Speedup is not linear as there is overhead related to the management of datasets, batches, the gradient merging and other steps. Still very useful and should be used when possible.
do_train (bool) – Indicate that the code should perform the training step. This is not required as a pre-trained model could be used to perform a simulation by passing an existing model in “name_of_saved_model”.
model_structure (str) – The version of the LSTM model that we want to use to apply to our data. Must be the name of a function that exists in lstm_static.py.
do_simulation (bool) – Indicate that simulations should be performed to obtain simulated streamflow and KGE metrics on the watersheds of interest, using the “name_of_saved_model” pre-trained model. If set to True and ‘do_train’ is True, then the new trained model will be used instead.
training_func (str) – For a regional model, it is highly recommended to use the scaled nse_loss variable that uses the standard deviation of streamflow as inputs. For a local model, the “kge” function is preferred. Defaults to “kge” if unspecified by the user. Can be one of [“kge”, “nse_scaled”].
filename_base (str) – Base name of the file in which the trained model is saved if it does not already exist. Do not add the “.keras” extension; it is added automatically.
simulation_phases (list of str, optional) – List of periods to generate the simulations. Can contain ['train', 'valid', 'test', 'full'], corresponding to the training, validation, testing and complete periods, respectively.
name_of_saved_model (str, optional) – Path to the model that has been pre-trained if required for simulations.
- Returns:
kge_results (array-like) – Kling-Gupta Efficiency metric values for each of the watersheds in the input_data_filename ncfile after running in simulation mode (thus after training). Contains n_watersheds items, each containing 4 values representing the KGE values in training, validation, testing and full period, respectively. If one or more simulation phases are not requested, the items will be set to None.
flow_results (array-like) – Streamflow simulation values for each of the watersheds in the input_data_filename ncfile after running in simulation mode (thus after training). Contains n_watersheds items, each containing 4x 2D-arrays representing the observed and simulation series in training, validation, testing and full period, respectively. If one or more simulation phases are not requested, the items will be set to None.
name_of_saved_model (str) – Path to the model that has been trained, or to the pre-trained model if it already existed.
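How train_pct and valid_pct might translate into index pairs for each simulation phase can be sketched as follows. The chronological ordering of the three periods, the half-open [start, end) convention and the function name split_indices are all assumptions made for illustration; the package may, for example, offset periods to account for the look-back window.

```python
import numpy as np


def split_indices(n_days, train_pct=60, valid_pct=20):
    """Return [start, end) index pairs for each simulation phase."""
    if train_pct + valid_pct >= 100:
        raise ValueError("train_pct + valid_pct must be below 100")
    train_end = n_days * train_pct // 100
    valid_end = n_days * (train_pct + valid_pct) // 100
    return {
        "train": np.array([0, train_end]),
        "valid": np.array([train_end, valid_end]),
        "test": np.array([valid_end, n_days]),   # remainder is kept for testing
        "full": np.array([0, n_days]),
    }


idx = split_indices(1000)
print(idx["valid"])  # [600 800]
```

With the default 60/20 split, the remaining 20% of the days is left for testing.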
- xhydro_lstm.lstm_controller.control_regional_lstm_training(input_data_filename: str, dynamic_var_tags: list, qsim_pos: list, static_var_tags: list, batch_size: int = 32, epochs: int = 200, window_size: int = 365, train_pct: int = 60, valid_pct: int = 20, use_cpu: bool = True, use_parallel: bool = False, do_train: bool = True, model_structure: str = 'dummy_regional_lstm', do_simulation: bool = True, training_func: str = 'nse_scaled', filename_base: str = 'LSTM_results', simulation_phases: list | None = None, name_of_saved_model: str | None = None)[source]
Control the regional LSTM model training and simulation.
- Parameters:
input_data_filename (str) – Path to the netcdf file containing the required input and target data for the LSTM. The ncfile must contain a dataset named “qobs” and “drainage_area” for the code to work, as these are required as target and scaling for training, respectively.
dynamic_var_tags (list of str) – List of dataset variables to use in the LSTM model training. Must be part of the input_data_filename ncfile.
qsim_pos (list of bool) – List of same length as dynamic_var_tags. Should be set to all False EXCEPT where the dynamic_var_tags refer to flow simulations (ex: simulations from a hydrological model such as HYDROTEL). Those should be set to True.
static_var_tags (list of str) – List of the catchment descriptor names in the input_data_filename ncfile. They need to be present in the ncfile and will be used as inputs to the regional model, to help the flow regionalization process.
batch_size (int) – Number of data points to use in training. Datasets are often way too big to train in a single batch on a single GPU or CPU, meaning that the dataset must be divided into smaller batches. This has an impact on the training performance and final model skill, and should be handled accordingly.
epochs (int) – Number of training evaluations. A larger number of epochs means more model iterations and deeper training. Training stops early once the validation skill stops improving.
window_size (int) – Number of days of look-back for training and model simulation. LSTM requires a large backwards-looking window to allow the model to learn from long-term weather patterns and history to predict the next day’s streamflow. Usually set to 365 days to get one year of previous data. This makes the model heavier and longer to train but can improve results.
train_pct (int) – Percentage of days from the dataset to use as training. The higher, the better for model training skill, but it is important to keep a decent amount for validation and testing.
valid_pct (int) – Percentage of days from the dataset to use as validation. The sum of train_pct and valid_pct needs to be less than 100, such that the remainder can be used for testing. A good starting value is 20%. Validation is used as the stopping criterion during training. When validation stops improving, the model is overfitting and training is stopped.
use_cpu (bool) – Flag to force the training and simulations to be performed on the CPU rather than on the GPU(s). Must be performed on a CPU that has AVX and AVX2 instruction sets, or tensorflow will fail. CPU training is very slow and should only be used as a last resort (such as for CI testing and debugging).
use_parallel (bool) – Flag to make use of multiple GPUs to accelerate training further. Models trained on multiple GPUs can have larger batch_size values as different batches can be run on different GPUs in parallel. Speedup is not linear as there is overhead related to the management of datasets, batches, the gradient merging and other steps. Still very useful and should be used when possible.
do_train (bool) – Indicate that the code should perform the training step. This is not required as a pre-trained model could be used to perform a simulation by passing an existing model in “name_of_saved_model”.
model_structure (str) – The version of the LSTM model that we want to use to apply to our data. Must be the name of a function that exists in lstm_static.py.
do_simulation (bool) – Indicate that simulations should be performed to obtain simulated streamflow and KGE metrics on the watersheds of interest, using the “name_of_saved_model” pre-trained model. If set to True and ‘do_train’ is True, then the new trained model will be used instead.
training_func (str) – For a regional model, it is highly recommended to use the scaled nse_loss variable that uses the standard deviation of streamflow as inputs. For a local model, the “kge” function is preferred. Defaults to “nse_scaled” if unspecified by the user. Can be one of [“kge”, “nse_scaled”].
filename_base (str) – Base name of the file in which the trained model is saved if it does not already exist. Do not add the “.keras” extension; it is added automatically.
simulation_phases (list of str, optional) – List of periods to generate the simulations. Can contain ['train', 'valid', 'test', 'full'], corresponding to the training, validation, testing and complete periods, respectively.
name_of_saved_model (str, optional) – Path to the model that has been pre-trained if required for simulations.
- Returns:
kge_results (array-like) – Kling-Gupta Efficiency metric values for each of the watersheds in the input_data_filename ncfile after running in simulation mode (thus after training). Contains n_watersheds items, each containing 4 values representing the KGE values in training, validation, testing and full period, respectively. If one or more simulation phases are not requested, the items will be set to None.
flow_results (array-like) – Streamflow simulation values for each of the watersheds in the input_data_filename ncfile after running in simulation mode (thus after training). Contains n_watersheds items, each containing 4x 2D-arrays representing the observed and simulation series in training, validation, testing and full period, respectively. If one or more simulation phases are not requested, the items will be set to None.
name_of_saved_model (str) – Path to the model that has been trained, or to the pre-trained model if it already existed.
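The epochs and valid_pct descriptions both state that training stops once validation skill stops improving. The idea can be sketched without any deep-learning framework. The patience value and the function name early_stopping_epoch are hypothetical; the actual stopping rule used by the package is not documented here.

```python
def early_stopping_epoch(valid_losses, patience=3):
    """Return the 0-based index of the epoch at which training would stop.

    Training halts once the validation loss has failed to improve for
    `patience` consecutive epochs; otherwise it runs through every epoch.
    """
    best = float("inf")
    stale = 0  # number of consecutive epochs without improvement
    for epoch, loss in enumerate(valid_losses):
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return len(valid_losses) - 1


losses = [0.9, 0.7, 0.6, 0.61, 0.62, 0.605, 0.65, 0.5]
print(early_stopping_epoch(losses))  # 5
```

In this example the best loss (0.6) is reached at epoch 2, and training stops at epoch 5 after three epochs without improvement, so the later value of 0.5 is never reached. This is the trade-off that a patience setting controls.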
xhydro_lstm.lstm_functions module
Collection of functions required to process LSTM models and their required data.
- xhydro_lstm.lstm_functions.perform_initial_train(model_structure: str, use_parallel: bool, window_size: int, batch_size: int, epochs: int, x_train: ndarray, x_train_static: ndarray, x_train_q_stds: ndarray, y_train: ndarray, x_valid: ndarray, x_valid_static: ndarray, x_valid_q_stds: ndarray, y_valid: ndarray, name_of_saved_model: str, training_func: str, use_cpu: bool = False)[source]
Train the LSTM model using preprocessed data.
- Parameters:
model_structure (str) – The version of the LSTM model that we want to use to apply to our data. Must be the name of a function that exists in lstm_static.py.
use_parallel (bool) – Flag to make use of multiple GPUs to accelerate training further. Models trained on multiple GPUs can have larger batch_size values as different batches can be run on different GPUs in parallel. Speedup is not linear as there is overhead related to the management of datasets, batches, the gradient merging and other steps. Still very useful and should be used when possible.
window_size (int) – Number of days of look-back for training and model simulation. LSTM requires a large backwards-looking window to allow the model to learn from long-term weather patterns and history to predict the next day’s streamflow. Usually set to 365 days to get one year of previous data. This makes the model heavier and longer to train but can improve results.
batch_size (int) – Number of data points to use in training. Datasets are often way too big to train in a single batch on a single GPU or CPU, meaning that the dataset must be divided into smaller batches. This has an impact on the training performance and final model skill, and should be handled accordingly.
epochs (int) – Number of training evaluations. A larger number of epochs means more model iterations and deeper training. Training stops early once the validation skill stops improving.
x_train (np.ndarray) – Tensor of size [(timesteps * watersheds) x window_size x n_dynamic_variables] that contains the dynamic (i.e. timeseries) variables that will be used during training.
x_train_static (np.ndarray) – Tensor of size [(timesteps * watersheds) x n_static_variables] that contains the static (i.e. catchment descriptors) variables that will be used during training.
x_train_q_stds (np.ndarray) – Tensor of size [(timesteps * watersheds)] that contains the standard deviation of scaled streamflow values for the catchment associated to the data in x_train and x_train_static. Each data point could come from any catchment and this x_train_q_std variable helps scale the objective function.
y_train (np.ndarray) – Tensor of size [(timesteps * watersheds)] containing the target variable for the same time point as in x_train, x_train_static and x_train_q_stds. Usually the observed streamflow for the day associated to each of the training points.
x_valid (np.ndarray) – Tensor of size [(timesteps * watersheds) x window_size x n_dynamic_variables] that contains the dynamic (i.e. timeseries) variables that will be used during validation.
x_valid_static (np.ndarray) – Tensor of size [(timesteps * watersheds) x n_static_variables] that contains the static (i.e. catchment descriptors) variables that will be used during validation.
x_valid_q_stds (np.ndarray) – Tensor of size [(timesteps * watersheds)] that contains the standard deviation of scaled streamflow values for the catchment associated to the data in x_valid and x_valid_static. Each data point could come from any catchment and this x_valid_q_std variable helps scale the objective function for the validation points.
y_valid (np.ndarray) – Tensor of size [(timesteps * watersheds)] containing the target variable for the same time point as in x_valid, x_valid_static and x_valid_q_stds. Usually the observed streamflow for the day associated to each of the validation points.
name_of_saved_model (str) – Path to the model that has been pre-trained if required for simulations.
training_func (str) – Name of the objective function used for training. For a regional model, it is highly recommended to use the scaled nse_loss variable that uses the standard deviation of streamflow as inputs.
use_cpu (bool) – Flag to force the training and simulations to be performed on the CPU rather than on the GPU(s). Must be performed on a CPU that has AVX and AVX2 instruction sets, or tensorflow will fail. CPU training is very slow and should only be used as a last resort (such as for CI testing and debugging).
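The way the q_stds inputs can scale the objective function is sketched below in NumPy. It assumes a commonly used form of the scaled NSE loss, in which each squared error is divided by the squared flow standard deviation of its catchment plus a small constant. The function name, the eps value and the exact expression are assumptions and may not match the “nse_scaled” function in xhydro_lstm.

```python
import numpy as np


def nse_scaled_loss(y_true, y_pred, q_stds, eps=0.1):
    """Mean squared error weighted by each sample's catchment flow variance.

    Assumption: this follows a commonly used basin-averaged NSE loss,
    not necessarily the exact expression implemented in xhydro_lstm.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    q_stds = np.asarray(q_stds, dtype=float)
    weights = 1.0 / (q_stds + eps) ** 2  # eps avoids division by zero
    return float(np.mean(weights * (y_pred - y_true) ** 2))


# Same absolute error, but the second sample comes from a more variable catchment
loss = nse_scaled_loss([1.0, 10.0], [2.0, 11.0], q_stds=[0.9, 9.9])
print(loss)
```

The second sample contributes far less to the loss than the first, which keeps catchments with large flow variability from dominating a regional training set.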
- xhydro_lstm.lstm_functions.perform_initial_train_local(model_structure: str, use_parallel: bool, window_size: int, batch_size: int, epochs: int, x_train: ndarray, y_train: ndarray, x_valid: ndarray, y_valid: ndarray, name_of_saved_model: str, training_func: str, use_cpu: bool = False)[source]
Train the LSTM model using preprocessed data on a local catchment.
- Parameters:
model_structure (str) – The version of the LSTM model that we want to use to apply to our data. Must be the name of a function that exists in lstm_static.py.
use_parallel (bool) – Flag to make use of multiple GPUs to accelerate training further. Models trained on multiple GPUs can have larger batch_size values as different batches can be run on different GPUs in parallel. Speedup is not linear as there is overhead related to the management of datasets, batches, the gradient merging and other steps. Still very useful and should be used when possible.
window_size (int) – Number of days of look-back for training and model simulation. LSTM requires a large backwards-looking window to allow the model to learn from long-term weather patterns and history to predict the next day’s streamflow. Usually set to 365 days to get one year of previous data. This makes the model heavier and longer to train but can improve results.
batch_size (int) – Number of data points to use in training. Datasets are often way too big to train in a single batch on a single GPU or CPU, meaning that the dataset must be divided into smaller batches. This has an impact on the training performance and final model skill, and should be handled accordingly.
epochs (int) – Number of training evaluations. A larger number of epochs means more model iterations and deeper training. Training stops early once the validation skill stops improving.
x_train (np.ndarray) – Tensor of size [(timesteps * watersheds) x window_size x n_dynamic_variables] that contains the dynamic (i.e. timeseries) variables that will be used during training.
y_train (np.ndarray) – Tensor of size [(timesteps * watersheds)] containing the target variable for the same time point as in x_train. Usually the observed streamflow for the day associated to each of the training points.
x_valid (np.ndarray) – Tensor of size [(timesteps * watersheds) x window_size x n_dynamic_variables] that contains the dynamic (i.e. timeseries) variables that will be used during validation.
y_valid (np.ndarray) – Tensor of size [(timesteps * watersheds)] containing the target variable for the same time point as in x_valid. Usually the observed streamflow for the day associated to each of the validation points.
name_of_saved_model (str) – Path to the model that has been pre-trained if required for simulations.
training_func (str) – Name of the objective function used for training. For a regional model, it is highly recommended to use the scaled nse_loss variable that uses the standard deviation of streamflow as inputs.
use_cpu (bool) – Flag to force the training and simulations to be performed on the CPU rather than on the GPU(s). Must be performed on a CPU that has AVX and AVX2 instruction sets, or tensorflow will fail. CPU training is very slow and should only be used as a last resort (such as for CI testing and debugging).
- Returns:
Exit code, returned only to satisfy the linter. The value is 0 if training completed normally.
- Return type:
code
- xhydro_lstm.lstm_functions.run_model_after_training(w: int, arr_dynamic: ndarray, arr_static: ndarray, q_stds: ndarray, window_size: int, train_idx: ndarray, batch_size: int, watershed_areas: ndarray, name_of_saved_model: str, valid_idx: ndarray, test_idx: ndarray, all_idx: ndarray, simulation_phases: list)[source]
Simulate streamflow on given input data for a user-defined number of periods.
- Parameters:
w (int) – Number of the watershed from the list of catchments that will be simulated.
arr_dynamic (np.ndarray) – Tensor of size [time_steps x window_size x (n_dynamic_variables+1)] that contains the dynamic (i.e. time-series) variables that will be used during training. The first element in axis=2 is the observed flow.
arr_static (np.ndarray) – Tensor of size [time_steps x n_static_variables] that contains the static (i.e. catchment descriptors) variables that will be used during training.
q_stds (np.ndarray) – Tensor of size [time_steps] that contains the standard deviation of scaled streamflow values for the catchment associated to the data in arr_dynamic and arr_static.
window_size (int) – Number of days of look-back for training and model simulation. LSTM requires a large backwards-looking window to allow the model to learn from long-term weather patterns and history to predict the next day’s streamflow. Usually set to 365 days to get one year of previous data. This makes the model heavier and longer to train but can improve results.
train_idx (np.ndarray) – Indices of the training period from the complete period. Contains 2 values per watershed: start and end indices.
batch_size (int) – Number of data points to use in training. Datasets are often way too big to train in a single batch on a single GPU or CPU, meaning that the dataset must be divided into smaller batches. This has an impact on the training performance and final model skill, and should be handled accordingly.
watershed_areas (np.ndarray) – Area of the watershed, in square kilometers, as taken from the training dataset initial input ncfile.
name_of_saved_model (str) – Path to the model that has been pre-trained if required for simulations.
valid_idx (np.ndarray) – Indices of the validation period from the complete period. Contains 2 values per watershed: start and end indices.
test_idx (np.ndarray) – Indices of the testing period from the complete period. Contains 2 values per watershed: start and end indices.
all_idx (np.ndarray) – Indices of the full period. Contains 2 values per watershed: start and end indices.
simulation_phases (list of str) – List of periods for which to generate simulations. Can contain ['train', 'valid', 'test', 'full'], corresponding to the training, validation, testing and complete periods, respectively.
- Returns:
kge (list) – A list of size 4, with one float per period in ['train', 'valid', 'test', 'all']. Each KGE value is computed between observed and simulated flows for the watershed of interest, for each requested period. Unrequested periods return None.
flows (list) – A list of size 4, with one 2D np.ndarray per period in ['train', 'valid', 'test', 'all']. Each array holds the observed (column 1) and simulated (column 2) streamflows for the watershed of interest, for each requested period. Unrequested periods return None.
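The fixed four-slot return layout described above can be illustrated with a minimal sketch. The helper `collect_by_phase` and the `fake_scores` lookup are hypothetical names used only for illustration and are not part of xhydro_lstm; the slot order is assumed from the Returns description.

```python
from typing import Callable, Optional

# Slot order assumed from the Returns description above.
PHASES = ["train", "valid", "test", "all"]


def collect_by_phase(
    simulation_phases: list[str],
    simulate: Callable[[str], float],
) -> list[Optional[float]]:
    """Return one value per phase in PHASES order, None where not requested."""
    return [simulate(p) if p in simulation_phases else None for p in PHASES]


# Hypothetical stand-in for a per-period simulation that returns a KGE score.
fake_scores = {"train": 0.9, "valid": 0.8, "test": 0.7, "all": 0.85}
kge = collect_by_phase(["train", "test"], fake_scores.__getitem__)
print(kge)  # [0.9, None, 0.7, None]
```

Keeping the list length fixed means callers can always index a given period by position, whether or not it was requested.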
- xhydro_lstm.lstm_functions.run_model_after_training_local(arr_dynamic: ndarray, window_size: int, train_idx: ndarray, batch_size: int, name_of_saved_model: str, valid_idx: ndarray, test_idx: ndarray, all_idx: ndarray, simulation_phases: list)[source]
Simulate streamflow on given input data for a user-defined number of periods.
- Parameters:
arr_dynamic (np.ndarray) – Tensor of size [time_steps x window_size x (n_dynamic_variables+1)] that contains the dynamic (i.e. time-series) variables that will be used during training. The first element in axis=2 is the observed flow.
window_size (int) – Number of days of look-back for training and model simulation. LSTM requires a large backwards-looking window to allow the model to learn from long-term weather patterns and history to predict the next day’s streamflow. Usually set to 365 days to get one year of previous data. This makes the model heavier and longer to train but can improve results.
train_idx (np.ndarray) – Indices of the training period from the complete period. Contains 2 values per watershed: start and end indices.
batch_size (int) – Number of data points to use in training. Datasets are often way too big to train in a single batch on a single GPU or CPU, meaning that the dataset must be divided into smaller batches. This has an impact on the training performance and final model skill, and should be handled accordingly.
name_of_saved_model (str) – Path to the model that has been pre-trained if required for simulations.
valid_idx (np.ndarray) – Indices of the validation period from the complete period. Contains 2 values per watershed: start and end indices.
test_idx (np.ndarray) – Indices of the testing period from the complete period. Contains 2 values per watershed: start and end indices.
all_idx (np.ndarray) – Indices of the full period. Contains 2 values per watershed: start and end indices.
simulation_phases (list of str) – List of periods for which to generate simulations. Can contain ['train', 'valid', 'test', 'full'], corresponding to the training, validation, testing and complete periods, respectively.
- Returns:
kge (list) – A list of size 4, with one float per period in ['train', 'valid', 'test', 'all']. Each KGE value is computed between observed and simulated flows for the watershed of interest, for each requested period. Unrequested periods return None.
flows (list) – A list of size 4, with one 2D np.ndarray per period in ['train', 'valid', 'test', 'all']. Each array holds the observed (column 1) and simulated (column 2) streamflows for the watershed of interest, for each requested period. Unrequested periods return None.
- xhydro_lstm.lstm_functions.scale_dataset(input_data_filename: str, dynamic_var_tags: list, qsim_pos: list, static_var_tags: list, train_pct: int, valid_pct: int)[source]
Scale the datasets using training data to normalize all inputs, ensuring weighting is unbiased.
- Parameters:
input_data_filename (str) – Path to the netcdf file containing the required input and target data for the LSTM. The ncfile must contain a dataset named “qobs” and “drainage_area” for the code to work, as these are required as target and scaling for training, respectively.
dynamic_var_tags (list of str) – List of dataset variables to use in the LSTM model training. Must be part of the input_data_filename ncfile.
qsim_pos (list of bool) – List of same length as dynamic_var_tags. Should be set to all False EXCEPT where the dynamic_var_tags refer to flow simulations (ex: simulations from a hydrological model such as HYDROTEL). Those should be set to True.
static_var_tags (list of str) – List of the catchment descriptor names in the input_data_filename ncfile. They need to be present in the ncfile and will be used as inputs to the regional model, to help the flow regionalization process.
train_pct (int) – Percentage of days from the dataset to use as training. The higher, the better for model training skill, but it is important to keep a decent amount for validation and testing.
valid_pct (int) – Percentage of days from the dataset to use as validation. The sum of train_pct and valid_pct needs to be less than 100, such that the remainder can be used for testing. A good starting value is 20%. Validation is used as the stopping criterion during training: when validation performance stops improving, the model is starting to overfit and training is stopped.
- Returns:
watershed_areas (np.ndarray) – Area of the watershed, in square kilometers, as taken from the training dataset initial input ncfile.
watersheds_ind (np.ndarray) – List of watershed indices to use during training.
arr_dynamic (np.ndarray) – Tensor of size [time_steps x window_size x (n_dynamic_variables+1)] that contains the dynamic (i.e. time-series) variables that will be used during training. The first element in axis=2 is the observed flow.
arr_static (np.ndarray) – Tensor of size [time_steps x n_static_variables] that contains the static (i.e. catchment descriptors) variables that will be used during training.
q_stds (np.ndarray) – Tensor of size [time_steps] that contains the standard deviation of scaled streamflow values for the catchment associated to the data in arr_dynamic and arr_static.
train_idx (np.ndarray) – Indices of the training period from the complete period. Contains 2 values per watershed: start and end indices.
valid_idx (np.ndarray) – Indices of the validation period from the complete period. Contains 2 values per watershed: start and end indices.
test_idx (np.ndarray) – Indices of the testing period from the complete period. Contains 2 values per watershed: start and end indices.
all_idx (np.ndarray) – Indices of the full period. Contains 2 values per watershed: start and end indices.
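A minimal sketch of the two ideas in this function, for a single watershed: splitting the record by train_pct and valid_pct, then normalizing with statistics taken from the training period only. The helper names `split_indices` and `scale_with_training_stats` are hypothetical, and the use of mean/standard-deviation scaling and half-open [start, end) indices are assumptions; the package's actual scaling method is not described here.

```python
import numpy as np


def split_indices(n_days: int, train_pct: int, valid_pct: int):
    """Return (train, valid, test, all) [start, end) index pairs for one watershed."""
    if train_pct + valid_pct >= 100:
        raise ValueError("train_pct + valid_pct must be less than 100")
    train_end = n_days * train_pct // 100
    valid_end = n_days * (train_pct + valid_pct) // 100
    return (
        np.array([0, train_end]),
        np.array([train_end, valid_end]),
        np.array([valid_end, n_days]),
        np.array([0, n_days]),
    )


def scale_with_training_stats(series: np.ndarray, train_idx: np.ndarray) -> np.ndarray:
    """Standardize a [time x variables] array using training-period statistics only."""
    train = series[train_idx[0]:train_idx[1]]
    mean = np.nanmean(train, axis=0)
    std = np.nanstd(train, axis=0)
    return (series - mean) / std


# Synthetic data: 1000 days, 3 variables (illustrative values only).
rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=(1000, 3))
train_idx, valid_idx, test_idx, all_idx = split_indices(len(data), 60, 20)
scaled = scale_with_training_stats(data, train_idx)
print(train_idx, valid_idx, test_idx)  # [  0 600] [600 800] [ 800 1000]
```

Computing the statistics on the training period alone keeps information from the validation and testing periods out of the inputs seen during training.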
- xhydro_lstm.lstm_functions.scale_dataset_local(input_data_filename: str, dynamic_var_tags: list, qsim_pos: list, train_pct: int, valid_pct: int)[source]
Scale the datasets using training data to normalize all inputs, ensuring weighting is unbiased.
- Parameters:
input_data_filename (str) – Path to the netcdf file containing the required input and target data for the LSTM. The ncfile must contain a dataset named “qobs” and “drainage_area” for the code to work, as these are required as target and scaling for training, respectively.
dynamic_var_tags (list of str) – List of dataset variables to use in the LSTM model training. Must be part of the input_data_filename ncfile.
qsim_pos (list of bool) – List of same length as dynamic_var_tags. Should be set to all False EXCEPT where the dynamic_var_tags refer to flow simulations (ex: simulations from a hydrological model such as HYDROTEL). Those should be set to True.
train_pct (int) – Percentage of days from the dataset to use as training. The higher, the better for model training skill, but it is important to keep a decent amount for validation and testing.
valid_pct (int) – Percentage of days from the dataset to use as validation. The sum of train_pct and valid_pct needs to be less than 100, such that the remainder can be used for testing. A good starting value is 20%. Validation is used as the stopping criterion during training: when validation performance stops improving, the model is starting to overfit and training is stopped.
- Returns:
watershed_areas (np.ndarray) – Area of the watershed, in square kilometers, as taken from the training dataset initial input ncfile.
watersheds_ind (np.ndarray) – List of watershed indices to use during training.
arr_dynamic (np.ndarray) – Tensor of size [time_steps x window_size x (n_dynamic_variables+1)] that contains the dynamic (i.e. time-series) variables that will be used during training. The first element in axis=2 is the observed flow.
train_idx (np.ndarray) – Indices of the training period from the complete period. Contains 2 values per watershed: start and end indices.
valid_idx (np.ndarray) – Indices of the validation period from the complete period. Contains 2 values per watershed: start and end indices.
test_idx (np.ndarray) – Indices of the testing period from the complete period. Contains 2 values per watershed: start and end indices.
all_idx (np.ndarray) – Indices of the full period. Contains 2 values per watershed: start and end indices.
- xhydro_lstm.lstm_functions.split_dataset(arr_dynamic: array, arr_static: array, q_stds: array, watersheds_ind: array, train_idx: array, window_size: int, valid_idx: array)[source]
Extract only the required data from the entire dataset according to the desired period.
- Parameters:
arr_dynamic (np.ndarray) – Tensor of size [time_steps x window_size x (n_dynamic_variables+1)] that contains the dynamic (i.e. time-series) variables that will be used during training. The first element in axis=2 is the observed flow.
arr_static (np.ndarray) – Tensor of size [time_steps x n_static_variables] that contains the static (i.e. catchment descriptors) variables that will be used during training.
q_stds (np.ndarray) – Tensor of size [time_steps] that contains the standard deviation of scaled streamflow values for the catchment associated to the data in arr_dynamic and arr_static.
watersheds_ind (np.ndarray) – List of watershed indices to use during training.
train_idx (np.ndarray) – Indices of the training period from the complete period. Contains 2 values per watershed: start and end indices.
window_size (int) – Number of days of look-back for training and model simulation. LSTM requires a large backwards-looking window to allow the model to learn from long-term weather patterns and history to predict the next day’s streamflow. Usually set to 365 days to get one year of previous data. This makes the model heavier and longer to train but can improve results.
valid_idx (np.ndarray) – Indices of the validation period from the complete period. Contains 2 values per watershed: start and end indices.
- Returns:
x_train (np.ndarray) – Tensor of size [(timesteps * watersheds) x window_size x n_dynamic_variables] that contains the dynamic (i.e. timeseries) variables that will be used during training.
x_train_static (np.ndarray) – Tensor of size [(timesteps * watersheds) x n_static_variables] that contains the static (i.e. catchment descriptors) variables that will be used during training.
x_train_q_stds (np.ndarray) – Tensor of size [(timesteps * watersheds)] that contains the standard deviation of scaled streamflow values for the catchment associated to the data in x_train and x_train_static. Each data point could come from any catchment and this x_train_q_std variable helps scale the objective function.
y_train (np.ndarray) – Tensor of size [(timesteps * watersheds)] containing the target variable for the same time point as in x_train, x_train_static and x_train_q_stds. Usually the observed streamflow for the day associated to each of the training points.
x_valid (np.ndarray) – Tensor of size [(timesteps * watersheds) x window_size x n_dynamic_variables] that contains the dynamic (i.e. timeseries) variables that will be used during validation.
x_valid_static (np.ndarray) – Tensor of size [(timesteps * watersheds) x n_static_variables] that contains the static (i.e. catchment descriptors) variables that will be used during validation.
x_valid_q_stds (np.ndarray) – Tensor of size [(timesteps * watersheds)] that contains the standard deviation of scaled streamflow values for the catchment associated to the data in x_valid and x_valid_static. Each data point could come from any catchment and this x_valid_q_std variable helps scale the objective function for the validation points.
y_valid (np.ndarray) – Tensor of size [(timesteps * watersheds)] containing the target variable for the same time point as in x_valid, x_valid_static and x_valid_q_stds. Usually the observed streamflow for the day associated to each of the validation points.
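The look-back windowing implied by the output shapes above can be sketched for a single watershed as follows. `make_windows` is a hypothetical helper, not the package's implementation; it assumes column 0 holds the observed flow and that each target is the flow on the day immediately following its window.

```python
import numpy as np


def make_windows(series: np.ndarray, window_size: int):
    """Build look-back windows from a [time x (n_dynamic_variables+1)] array.

    Column 0 is assumed to hold the observed flow (the target); the
    remaining columns are the dynamic predictors.
    """
    n_samples = series.shape[0] - window_size
    if n_samples <= 0:
        raise ValueError("series must be longer than window_size")
    predictors = series[:, 1:]
    # x[i] holds the window_size days of predictors preceding target day i + window_size.
    x = np.stack([predictors[i:i + window_size] for i in range(n_samples)])
    y = series[window_size:, 0]
    return x, y


series = np.arange(40, dtype=float).reshape(10, 4)  # 10 days: qobs + 3 predictors
x, y = make_windows(series, window_size=3)
print(x.shape, y.shape)  # (7, 3, 3) (7,)
```

With a 365-day window, every sample repeats almost a full year of inputs, which is why larger windows make the training arrays much heavier.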
- xhydro_lstm.lstm_functions.split_dataset_local(arr_dynamic: ndarray, train_idx: ndarray, window_size: int, valid_idx: ndarray)[source]
Extract only the required data from the entire dataset according to the desired period.
- Parameters:
arr_dynamic (np.ndarray) – Tensor of size [time_steps x window_size x (n_dynamic_variables+1)] that contains the dynamic (i.e. time-series) variables that will be used during training. The first element in axis=2 is the observed flow.
train_idx (np.ndarray) – Indices of the training period from the complete period. Contains 2 values per watershed: start and end indices.
window_size (int) – Number of days of look-back for training and model simulation. LSTM requires a large backwards-looking window to allow the model to learn from long-term weather patterns and history to predict the next day’s streamflow. Usually set to 365 days to get one year of previous data. This makes the model heavier and longer to train but can improve results.
valid_idx (np.ndarray) – Indices of the validation period from the complete period. Contains 2 values per watershed: start and end indices.
- Returns:
x_train (np.ndarray) – Tensor of size [(timesteps * watersheds) x window_size x n_dynamic_variables] that contains the dynamic (i.e. timeseries) variables that will be used during training.
y_train (np.ndarray) – Tensor of size [(timesteps * watersheds)] containing the target variable for the same time point as in x_train. Usually the observed streamflow for the day associated to each of the training points.
x_valid (np.ndarray) – Tensor of size [(timesteps * watersheds) x window_size x n_dynamic_variables] that contains the dynamic (i.e. timeseries) variables that will be used during validation.
y_valid (np.ndarray) – Tensor of size [(timesteps * watersheds)] containing the target variable for the same time point as in x_valid. Usually the observed streamflow for the day associated to each of the validation points.
xhydro_lstm.lstm_static module
LSTM model definition and tools for LSTM model training.
- class xhydro_lstm.lstm_static.TestingGenerator(x_set, x_set_static, batch_size)[source]
Bases: PyDataset
Create a testing generator to manage the GPU memory during training.
- Parameters:
x_set (numpy.ndarray) – Tensor of size [batch_size x window_size x n_dynamic_variables] that contains the batch of dynamic (i.e. timeseries) variables that will be used during training.
x_set_static (numpy.ndarray) – Tensor of size [batch_size x n_static_variables] that contains the batch of static (i.e. catchment descriptors) variables that will be used during training.
batch_size (int) – Number of data points to use in training. Datasets are often way too big to train in a single batch on a single GPU or CPU, meaning that the dataset must be divided into smaller batches. This has an impact on the training performance and final model skill, and should be handled accordingly.
- Returns:
self – An object containing the subset of the total data that is selected for this batch.
- class xhydro_lstm.lstm_static.TestingGeneratorLocal(x_set, batch_size)[source]
Bases: PyDataset
Create a testing generator to manage the GPU memory during training.
- Parameters:
x_set (numpy.ndarray) – Tensor of size [batch_size x window_size x n_dynamic_variables] that contains the batch of dynamic (i.e. timeseries) variables that will be used during training.
batch_size (int) – Number of data points to use in training. Datasets are often way too big to train in a single batch on a single GPU or CPU, meaning that the dataset must be divided into smaller batches. This has an impact on the training performance and final model skill, and should be handled accordingly.
- Returns:
self – An object containing the subset of the total data that is selected for this batch.
- class xhydro_lstm.lstm_static.TrainingGenerator(x_set, x_set_static, x_set_q_stds, y_set, batch_size)[source]
Bases: PyDataset
Create a training generator to manage the GPU memory during training.
- Parameters:
x_set (numpy.ndarray) – Tensor of size [batch_size x window_size x n_dynamic_variables] that contains the batch of dynamic (i.e. timeseries) variables that will be used during training.
x_set_static (numpy.ndarray) – Tensor of size [batch_size x n_static_variables] that contains the batch of static (i.e. catchment descriptors) variables that will be used during training.
x_set_q_stds (numpy.ndarray) – Tensor of size [batch_size] that contains the standard deviation of scaled streamflow values for the catchment associated to the data in x_set and x_set_static. Each data point could come from any catchment and this q_std variable helps scale the objective function.
y_set (numpy.ndarray) – Tensor of size [batch_size] containing the target variable for the same time point as in x_set, x_set_static and x_set_q_stds. Usually the streamflow for the day associated to each of the training points.
batch_size (int) – Number of data points to use in training. Datasets are often way too big to train in a single batch on a single GPU or CPU, meaning that the dataset must be divided into smaller batches. This has an impact on the training performance and final model skill, and should be handled accordingly.
- Returns:
self – An object containing the subset of the total data that is selected for this batch.
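The batching idea behind these generators can be sketched without any deep-learning dependency. `SimpleTrainingBatches` is a hypothetical plain-Python stand-in that only mimics the `__len__`/`__getitem__` protocol; the documented classes derive from PyDataset, and how they package inputs and targets internally is not shown here, so the tuple layout below is an assumption.

```python
import math

import numpy as np


class SimpleTrainingBatches:
    """Plain-Python stand-in for a batch generator (not the package's class)."""

    def __init__(self, x_set, x_set_static, x_set_q_stds, y_set, batch_size):
        self.x = x_set
        self.x_static = x_set_static
        self.x_q_stds = x_set_q_stds
        self.y = y_set
        self.batch_size = batch_size

    def __len__(self) -> int:
        # Number of batches per epoch; the last batch may be smaller.
        return math.ceil(len(self.x) / self.batch_size)

    def __getitem__(self, idx: int):
        sl = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
        # Assumed layout: a tuple of inputs, followed by the targets.
        return (self.x[sl], self.x_static[sl], self.x_q_stds[sl]), self.y[sl]


gen = SimpleTrainingBatches(
    np.zeros((10, 5, 2)),  # 10 samples, window of 5 days, 2 dynamic variables
    np.zeros((10, 4)),     # 4 static descriptors per sample
    np.ones(10),
    np.arange(10.0),
    batch_size=4,
)
print(len(gen))  # 3
inputs, targets = gen[2]
print(targets)  # [8. 9.]
```

Only one slice of the full arrays is handed over per step, which is what keeps memory use bounded when the complete dataset does not fit on the device.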
- class xhydro_lstm.lstm_static.TrainingGeneratorLocal(x_set, y_set, batch_size)[source]
Bases: PyDataset
Create a training generator to manage the GPU memory during training.
- Parameters:
x_set (numpy.ndarray) – Tensor of size [batch_size x window_size x n_dynamic_variables] that contains the batch of dynamic (i.e. timeseries) variables that will be used during training.
y_set (numpy.ndarray) – Tensor of size [batch_size] containing the target variable for the same time point as in x_set. Usually the streamflow for the day associated to each of the training points.
batch_size (int) – Number of data points to use in training. Datasets are often way too big to train in a single batch on a single GPU or CPU, meaning that the dataset must be divided into smaller batches. This has an impact on the training performance and final model skill, and should be handled accordingly.
- Returns:
self – An object containing the subset of the total data that is selected for this batch.
- xhydro_lstm.lstm_static.get_list_of_LSTM_models(model_structure) → Callable[source]
Return the handle to the LSTM model function that corresponds to the requested model structure.
- Parameters:
model_structure (str) – The name of the LSTM model to use for training. Must correspond to one of the models present in lstm_static.py. The “model_structure_dict” must be updated when new models are added.
- Returns:
Handle to the LSTM model function.
- Return type:
Callable
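A name-to-function lookup of this kind is commonly written as a dictionary dispatch. The sketch below is an assumption about the general pattern, not the package's code: the builder functions `build_model_a` and `build_model_b`, their return values, and the helper `get_model_builder` are hypothetical, and only the dictionary name echoes the "model_structure_dict" mentioned above.

```python
from typing import Callable


def build_model_a(window_size: int) -> str:
    # Hypothetical placeholder: a real builder would return a model object.
    return f"model_a(window_size={window_size})"


def build_model_b(window_size: int) -> str:
    # Hypothetical placeholder for a second model structure.
    return f"model_b(window_size={window_size})"


# Hypothetical registry; new builders would be added here.
model_structure_dict: dict[str, Callable] = {
    "model_a": build_model_a,
    "model_b": build_model_b,
}


def get_model_builder(model_structure: str) -> Callable:
    """Look up the builder function registered under the given name."""
    try:
        return model_structure_dict[model_structure]
    except KeyError:
        raise ValueError(f"Unknown model structure: {model_structure!r}") from None


builder = get_model_builder("model_a")
print(builder(365))  # model_a(window_size=365)
```

Returning the function itself, rather than a built model, lets the caller decide when and with which arguments the model is constructed.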
- xhydro_lstm.lstm_static.run_trained_model(arr_dynamic: ndarray, arr_static: ndarray, q_stds: ndarray, window_size: int, w: int, idx_scenario: ndarray, batch_size: int, watershed_areas: ndarray, name_of_saved_model: str, remove_nans: bool)[source]
Run the trained regional LSTM model on a single catchment from a larger set.
- Parameters:
arr_dynamic (numpy.ndarray) – Tensor of size [time_steps x window_size x (n_dynamic_variables+1)] that contains the dynamic (i.e. time-series) variables that will be used during training. The first element in axis=2 is the observed flow.
arr_static (numpy.ndarray) – Tensor of size [time_steps x n_static_variables] that contains the static (i.e. catchment descriptors) variables that will be used during training.
q_stds (numpy.ndarray) – Tensor of size [time_steps] that contains the standard deviation of scaled streamflow values for the catchment associated to the data in arr_dynamic and arr_static.
window_size (int) – Number of days of look-back for training and model simulation. LSTM requires a large backwards-looking window to allow the model to learn from long-term weather patterns and history to predict the next day’s streamflow. Usually set to 365 days to get one year of previous data. This makes the model heavier and longer to train but can improve results.
w (int) – Number of the watershed from the list of catchments that will be simulated.
idx_scenario (numpy.ndarray) – 2-element array of indices of the beginning and end of the desired period for which the LSTM model should be simulated.
batch_size (int) – Number of data points to use in training. Datasets are often way too big to train in a single batch on a single GPU or CPU, meaning that the dataset must be divided into smaller batches. This has an impact on the training performance and final model skill, and should be handled accordingly.
watershed_areas (np.ndarray) – Area of the watershed, in square kilometers, as taken from the training dataset initial input ncfile.
name_of_saved_model (str) – Path to the model that has been pre-trained if required for simulations.
remove_nans (bool) – Remove the periods for which observed streamflow is NaN for both observed and simulated flows.
- Returns:
kge (float) – KGE value between observed and simulated flows computed for the watershed of interest and for a specified period.
flows (np.ndarray) – Observed and simulated streamflows computed for the watershed of interest and for a specified period.
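The two post-processing steps named above, dropping periods with missing observations and scoring with KGE, can be sketched as follows. Both helpers are hypothetical; `kge_2009` implements the commonly used Kling-Gupta efficiency form, KGE = 1 - sqrt((r - 1)^2 + (alpha - 1)^2 + (beta - 1)^2), and the package may rely on a different variant or on an external implementation.

```python
import numpy as np


def drop_nan_obs(flows: np.ndarray) -> np.ndarray:
    """Keep only rows where the observed flow (column 0) is not NaN."""
    return flows[~np.isnan(flows[:, 0])]


def kge_2009(q_obs: np.ndarray, q_sim: np.ndarray) -> float:
    """Kling-Gupta efficiency from correlation, variability ratio and bias ratio."""
    r = np.corrcoef(q_obs, q_sim)[0, 1]        # linear correlation
    alpha = np.std(q_sim) / np.std(q_obs)      # variability ratio
    beta = np.mean(q_sim) / np.mean(q_obs)     # bias ratio
    return float(1.0 - np.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2))


# Column 0: observed flow, column 1: simulated flow (illustrative values only).
flows = np.array([[1.0, 1.0], [np.nan, 2.5], [3.0, 3.0], [5.0, 5.0]])
clean = drop_nan_obs(flows)
score = kge_2009(clean[:, 0], clean[:, 1])
print(clean.shape)  # (3, 2)
```

Removing the NaN rows from both columns at once keeps observed and simulated values aligned day by day, which the correlation term requires.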
- xhydro_lstm.lstm_static.run_trained_model_local(arr_dynamic: ndarray, window_size: int, idx_scenario: ndarray, batch_size: int, name_of_saved_model: str, remove_nans: bool)[source]
Run the trained local LSTM model on a single catchment.
- Parameters:
arr_dynamic (numpy.ndarray) – Tensor of size [time_steps x window_size x (n_dynamic_variables+1)] that contains the dynamic (i.e. time-series) variables that will be used during training. The first element in axis=2 is the observed flow.
window_size (int) – Number of days of look-back for training and model simulation. LSTM requires a large backwards-looking window to allow the model to learn from long-term weather patterns and history to predict the next day’s streamflow. Usually set to 365 days to get one year of previous data. This makes the model heavier and longer to train but can improve results.
idx_scenario (np.ndarray) – 2-element array of indices of the beginning and end of the desired period for which the LSTM model should be simulated.
batch_size (int) – Number of data points to use in training. Datasets are often way too big to train in a single batch on a single GPU or CPU, meaning that the dataset must be divided into smaller batches. This has an impact on the training performance and final model skill, and should be handled accordingly.
name_of_saved_model (str) – Path to the model that has been pre-trained if required for simulations.
remove_nans (bool) – Remove the periods for which observed streamflow is NaN for both observed and simulated flows.
- Returns:
kge (float) – KGE value between observed and simulated flows computed for the watershed of interest and for a specified period.
flows (np.ndarray) – Observed and simulated streamflows computed for the watershed of interest and for a specified period.