Jupyter

The IPS Framework can create Jupyter-based workflows automatically. It uploads Jupyter Notebooks and their associated data files to the IPS Portal, which in turn copies them to the appropriate JupyterHub directory.

This guide covers two aspects of how to use Jupyter-based workflows:

What you will need to set up on the IPS Framework side. This mostly involves understanding the APIs the framework provides, and configuration you will need to include.
How the IPS Portal creates files on JupyterHub for you, and how you can utilize the IPS Analysis API in your Jupyter Notebook.

The IPS Analysis API is an _indexing_ tool for allowing runs to quickly find any child or ensemble runs associated with them, and to quickly find the locations of any data files included with a run. Loading the data from the file locations, and constructing visualizations from the data, is left up to the end user.

IPS Framework

Environment Variables

The following variables are mandatory for any IPS simulation that uses the Jupyter workflow. They have no default values.

PORTAL_URL - This should be the hostname of the IPS web portal you are interacting with (do not include any subpath). The IPS Portal will associate your run with a specific ID, which is used on JupyterHub/JupyterLab.

PORTAL_API_KEY - To use the JupyterHub capabilities of the IPS Portal, an API key is required. This API key should not be committed directly to a public version control repository. It is recommended that you set this as an environment variable in the run.
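
As a pre-flight check, the two settings above can be validated before a run starts. The following is a minimal sketch, assuming the values can be read from a mapping such as os.environ; the helper name check_portal_settings and the idea of running such a check are illustrative assumptions, not part of the IPS Framework.

```python
import os
from urllib.parse import urlsplit

REQUIRED_VARIABLES = ("PORTAL_URL", "PORTAL_API_KEY")


def check_portal_settings(env):
    """Return a list of human-readable problems; an empty list means no problem was found.

    Hypothetical helper: `env` is any mapping of variable names to string values.
    """
    problems = [f"{name} is not set" for name in REQUIRED_VARIABLES if not env.get(name)]
    url = env.get("PORTAL_URL", "")
    if url:
        # urlsplit only recognizes the host part when a scheme or a leading '//' is present
        parts = urlsplit(url if "//" in url else "//" + url)
        if parts.path not in ("", "/"):
            problems.append(f"PORTAL_URL should not include a subpath: {parts.path!r}")
    return problems


if __name__ == "__main__":
    # The API key itself is never printed, only whether it is missing.
    for problem in check_portal_settings(os.environ):
        print("configuration problem:", problem)
```

Reading the key from the environment rather than from a committed file follows the recommendation above.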

Notebook Input File Information

You can load template notebooks in your input directory which can automatically generate analyses visible on a remote JupyterHub instance. The IPS Framework instance will copy your template notebook and add some initialization code in a new cell at the beginning.

In your template code, you can reference the variable DATA_FILES to load the current data mapping. This mapping is a dictionary from floating-point timestamps to the file paths of the corresponding data files.
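
A template notebook cell that consumes such a mapping might look like the sketch below. It assumes the data files are JSON (the actual format is up to you) and that each timestamp maps to a single file path, as described above; load_json_by_timestamp is a hypothetical helper name.

```python
import json


def load_json_by_timestamp(data_files):
    """Load every file in a {timestamp: filepath} mapping, in ascending timestamp order.

    Hypothetical helper; assumes each file contains JSON.
    """
    loaded = {}
    for timestamp in sorted(data_files):
        with open(data_files[timestamp]) as handle:
            loaded[timestamp] = json.load(handle)
    return loaded


# Inside the generated notebook, where DATA_FILES is defined by the initialization cell:
# states = load_json_by_timestamp(DATA_FILES)
```

Sorting the keys numerically matters because file names such as 10.5_state.json sort before 2.0_state.json when compared as strings.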

IPS Framework Usage

To set up Jupyter integration, call “self.services.initialize_jupyter_notebook” inside an IPS Component. This call should be executed only once per notebook, for example in an “init” function:

from ipsframework import Component

SOURCE_NOTEBOOK_NAME='base_notebook.ipynb'

class Driver(Component):
    def init(self, timestamp=0.0):
        # ...
        # assumes your notebooks are configured in the input directory
        # if you have an absolute path on the filesystem to your notebook, staging the input notebook is not required
        self.services.stage_input_files([SOURCE_NOTEBOOK_NAME])
        self.services.initialize_jupyter_notebook(
            dest_notebook_name='jupyterhub_visible_notebook.ipynb',
            source_notebook_path=SOURCE_NOTEBOOK_NAME,
        )
        # call self.services.initialize_jupyter_notebook for EACH notebook you want to initialize
        # ...

This code initializes JupyterHub to work with this run and contacts the web portal to associate a runid with this specific run.
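
When a run uses several notebooks, the per-notebook calls can be grouped in one helper. This is a minimal sketch that uses only the two service calls shown above; the helper name initialize_notebooks and the notebook filenames in the usage comment are illustrative assumptions.

```python
def initialize_notebooks(services, notebooks):
    """Stage and initialize several notebooks in one go.

    `services` is the component's services object (self.services).
    `notebooks` maps each source notebook filename in the input directory
    to the name the notebook should have on JupyterHub.
    """
    # Staging is only needed because the sources are assumed to live in the input directory.
    services.stage_input_files(list(notebooks))
    for source_name, dest_name in notebooks.items():
        services.initialize_jupyter_notebook(
            dest_notebook_name=dest_name,
            source_notebook_path=source_name,
        )


# Inside Component.init (filenames are hypothetical):
# initialize_notebooks(self.services, {
#     'base_notebook.ipynb': 'basic.ipynb',
#     'plots_notebook.ipynb': 'bokeh-plots.ipynb',
# })
```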

For updating data files, two approaches are supported: one where you keep separate data files for each timestamp, and one where you keep the data files under a single timestamp entry and replace them on each call. Both workflows use self.services.add_analysis_data_files (see the API reference in this guide).

For the approach where data files are kept for multiple timestamps, the code below adds a snapshot of the IPS state, written to a per-timestamp file, on each step:

from ipsframework import Component

class Monitor(Component):
    def step(self, timestamp=0.0):
        # assume that we have already written IPS state earlier into this file,
        # and that a new file is written per timestamp call
        # In this example, we just want to snapshot our IPS state and save it in our JupyterHub workflow
        data_file = f'{timestamp}_state.json' # get current data file
        # the API takes a list of paths (see the API reference in this guide)
        self.services.add_analysis_data_files(
            current_data_file_paths=[data_file],
            timestamp=timestamp,
        )

If you do not set timestamp yourself, it will default to “0.0”.

Or, if you only want to maintain a single timestamp, set the “replace” flag to True:

from ipsframework import Component

class Monitor(Component):
    def step(self, timestamp=0.0):
        # assume that we continually update our state in the same file
        data_file = 'state.json' # get current data file
        # no timestamp is passed, so the entry for the default timestamp (0.0) is replaced
        self.services.add_analysis_data_files(
            current_data_file_paths=[data_file],
            replace=True,
        )

“Replace” will allow you to completely overwrite an existing timestamp entry with new data. If you don’t set the flag but try to overwrite a specific timestamp, a ValueError is raised.

Note that if you attempt to overwrite an existing data file without setting replace=True, the file will not be overwritten remotely. You can check your IPS log file for “Portal Error” statements.
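
If a component cannot know in advance whether a timestamp already has an entry, the ValueError described above can be used as the signal to retry with replace=True. This is a sketch under the assumption that the error is raised directly by the call; it uses the add_analysis_data_files signature from the API reference, and the helper name add_files_allowing_overwrite is hypothetical.

```python
def add_files_allowing_overwrite(services, paths, timestamp):
    """Add data files for `timestamp`, replacing the entry if one already exists.

    Returns True if an existing entry had to be replaced, False otherwise.
    """
    try:
        services.add_analysis_data_files(
            current_data_file_paths=paths,
            timestamp=timestamp,
        )
        return False
    except ValueError:
        # Assumption: this ValueError means the timestamp already has an entry.
        services.add_analysis_data_files(
            current_data_file_paths=paths,
            timestamp=timestamp,
            replace=True,
        )
        return True
```

Setting replace=True explicitly remains the clearer choice when the overwrite is intentional; a fallback like this mainly helps when restarting a run that may re-emit earlier timestamps.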

IPS Framework Jupyter API reference

ServicesProxy.initialize_jupyter_notebook(source_notebook_path: str, dest_notebook_name: str | None = None) → None

If the IPS Portal is available, this function loads a notebook from source_notebook_path, adds a cell to load the data, and then saves the concatenated notebook to the Portal.

If a connection to the IPS Portal cannot be verified for this run, this function does nothing.

Does not modify the source notebook.

Parameters:

source_notebook_path – location to load the source notebook from. This can be either an absolute path or an IPS-appropriate relative path.
dest_notebook_name – (optional, default: None) filename to use when saving the notebook to the IPS Portal. If not provided, this defaults to the filename of the source notebook.

ServicesProxy.add_analysis_data_files(current_data_file_paths: list[str], timestamp: float = 0.0, replace: bool = False) → None

If the IPS Portal is available, saves data files to IPS Portal. Files are indexed via specific timestamps.

If a connection to the IPS Portal cannot be verified for this run, this function does nothing.

Parameters:

current_data_file_paths – list of paths to the current data files we want to copy to the Jupyter directory. These paths may be either absolute paths or IPS-appropriate relative paths. If a path is a directory, all files in that directory are added and the directory structure is preserved on the IPS Portal.
timestamp – label to assign to the data (currently must be a floating point value)
replace – If True, replace the last data file added with the new data file. If False, simply append the new data file. (default: False) Note that if replace is not True but you attempt to overwrite an existing timestamp entry, a ValueError will be raised.

IPS Portal

IPS Notebook Analysis API Guide

NOTE: while you can update the notebook on the Portal side, it’s best to have the completed notebook ready on the Framework side.

The IPS Portal will generate a cell prior to your own notebook which initializes a variable called ips_analysis_api, which contains a number of helper functions for finding specific data locations.

ips_analysis_api.get_data() - this generates a generic IPS mapping - a mapping of floating-point timesteps to a list of data file paths (absolute). Note that your notebook will need to handle the actual loading of the data.
ips_analysis_api.get_child_data() - this generates a mapping of child runids to the “generic IPS mapping” described above.
ips_analysis_api.get_child_data_not_ensembles() - get the child runid mapping as described above, but only use child runids NOT associated with ensembles.
ips_analysis_api.get_child_data_by_ensemble_names() - gets the child runid mapping as described above, but will only retrieve child runids associated with ensembles. You can further filter the results by providing an optional list of component names and an optional list of ensemble names; for example, ips_analysis_api.get_child_data_by_ensemble_names(ensemble_names=['ensemble_name_1', 'ensemble_name_2']) will ONLY fetch the child runids associated with ‘ensemble_name_1’ and ‘ensemble_name_2’, but will search all components for them.
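
Because the child mappings are nested (runid, then timestep, then a list of paths), it is often convenient to flatten them before loading anything. The following sketch works on any value with the shape returned by the child-data functions above; iter_child_files is a hypothetical helper, not part of the IPS Analysis API.

```python
def iter_child_files(child_data):
    """Yield (runid, timestep, filepath) tuples from a child runid mapping.

    `child_data` has the shape {runid: {timestep: [filepath, ...]}}.
    Runids and timesteps are visited in ascending order.
    """
    for runid in sorted(child_data):
        for timestep in sorted(child_data[runid]):
            for filepath in child_data[runid][timestep]:
                yield runid, timestep, filepath


# In a notebook, where ips_analysis_api is defined by the generated cell:
# for runid, timestep, filepath in iter_child_files(ips_analysis_api.get_child_data()):
#     ...  # load the file with whatever reader suits your data format
```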

IPS Notebook Analysis API Reference

class doc.reference.portal_jupyter_api.ips_analysis_api_v1.IPSAnalysisApi(run_directory: Path)

This class should be used by notebooks that are directly associated with a run.

The IPS Portal will insert a new code cell prior to your actual cells; this cell will initialize the ips_analysis_api object, an instance of this class. This object can be used in your own code to look up information about the run by calling various methods. None of the methods called should mutate any state.

get_child_data() → dict[int, dict[float, list[str]]]

This allows you to obtain all data information about all child runs.

Note that it is your responsibility to handle loading the actual data itself.

Returns: a dictionary of child runids to an additional mapping of timesteps to a list of data filepaths associated with the timestep.

get_child_data_by_ensemble_names(component_names: list[str] | None = None, ensemble_names: list[str] | None = None) → dict[int, dict[float, list[str]]]

This allows you to obtain all data information about child runs which are also ensembles. You can use ensemble names as an additional filter.

Note that it is your responsibility to handle loading the actual data itself.

Parameters:

component_names – (default: None) list of components you want to search in the return data. If None, search all of them.
ensemble_names – (default: None) list of ensembles you want to include in the return data. If None, return all children which are ensembles.

Returns:

a mapping of this run’s child runids (associated with ensembles, and potentially filtered by the specific ensembles in the ensemble_names param) to an additional mapping of timesteps to a list of data filepaths associated with the timestep

get_child_data_not_ensembles() → dict[int, dict[float, list[str]]]

This allows you to obtain all data information about child runs which are NOT ensembles.

Returns: a mapping of this run’s child runids (excluding those associated with ensembles) to an additional mapping of timesteps to a list of data filepaths associated with the timestep

get_data() → dict[float, list[str]]

This allows you to obtain all data information associated with this run.

Note that it is your responsibility to handle loading the actual data itself.

Returns: a dictionary of timestep keys (as floating point values) to a list of data filepaths associated with the timestep
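
A common first step with the get_data() result is to pick out the most recent timestep. This sketch operates on any dictionary with the documented shape; latest_timestep_files is a hypothetical helper name.

```python
def latest_timestep_files(data):
    """Return (timestep, filepaths) for the largest timestep in a get_data()-style mapping.

    Returns (None, []) when the mapping is empty.
    """
    if not data:
        return None, []
    latest = max(data)
    return latest, list(data[latest])


# In a notebook, where ips_analysis_api is defined by the generated cell:
# timestep, filepaths = latest_timestep_files(ips_analysis_api.get_data())
```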

JupyterHub Filesystem Notes

The IPS Portal will always be reading and writing files to a specific directory on a JupyterHub filesystem. From there, the filesystem organization will look somewhat like this:

.
├── username1
└── username2
    ├── 1   # this is the runid as tracked by the IPS Portal
    │   ├── basic.ipynb
    │   ├── bokeh-plots.ipynb
    │   ├── data
    │   │   ├── 10.666666666666666_state.json
    │   │   ├── 1.0_state.json
    │   │   ├── 11.633333333333333_state.json
    │   │   ├── 12.6_state.json
    │   │   ├── 13.566666666666666_state.json
    │   │   ├── 14.533333333333333_state.json
    │   │   ├── 15.5_state.json
    │   │   ├── 16.46666666666667_state.json
    │   │   ├── 17.433333333333334_state.json
    │   │   ├── 18.4_state.json
    │   │   ├── 19.366666666666667_state.json
    │   │   ├── 1.9666666666666668_state.json
    │   │   ├── 20.333333333333332_state.json
    │   │   ├── 21.3_state.json
    │   │   ├── 22.266666666666666_state.json
    │   │   ├── 23.233333333333334_state.json
    │   │   ├── 24.2_state.json
    │   │   ├── 25.166666666666668_state.json
    │   │   ├── 26.133333333333333_state.json
    │   │   ├── 27.1_state.json
    │   │   ├── 28.066666666666666_state.json
    │   │   ├── 29.033333333333335_state.json
    │   │   ├── 2.9333333333333336_state.json
    │   │   ├── 30.0_state.json
    │   │   ├── 3.9_state.json
    │   │   ├── 4.866666666666667_state.json
    │   │   ├── 5.833333333333333_state.json
    │   │   ├── 6.8_state.json
    │   │   ├── 7.766666666666667_state.json
    │   │   ├── 8.733333333333334_state.json
    │   │   └── 9.7_state.json
    │   ├── ensembles
    │   │   ├── DriverComponent
    │   │   │   ├── my_first_ensemble.csv
    │   │   │   └── my_second_ensemble.csv
    │   ├── ips_analysis_api_child_runs.txt
    │   └── ips_analysis_api_data_listing.json
    ├── 2
    │   ├── basic.ipynb
    │   ├── data
    │   │   └── 0.0_state.json
    │   ├── ensembles
    │   ├── ips_analysis_api_child_runs.txt
    │   └── ips_analysis_api_data_listing.json
    ├── api_v1_notebook.ipynb
    └── api_v1.py

From the base directory, runs are organized by username.
From the username directory, the directory tree continues based on runids as managed by the IPS Portal. Note that files titled api_v*.py and api_v*_notebook.ipynb will be added to this directory as well. The framework may overwrite these files, but should always do so in a backwards-compatible manner.
From the runid directory, a few additional files will be added:
- Notebooks generated from your input notebooks. You should not change a notebook’s name, but may freely edit its content.
- IPS analysis files used for the IPS Analysis API to help organize run information (ips_analysis_api_child_runs.txt, ips_analysis_api_data_listing.json). These files should not be modified.
- A data directory which will contain all data files you added during the run. (Note that the data files are determined on the domain science side, and can be of any content-type, not just JSON.) You should not change the names of these files.
- An ensembles directory which will contain the CSV files summarizing any ensembles this run initiated. Each CSV file is named after the name of the ensemble, and all CSV files are organized into additional directories named after the component which launched them. Do not modify any of these files.
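
The IPS Analysis API is the intended way to locate data files, but when browsing a run directory by hand it helps to order the files numerically rather than alphabetically (note that 10.666666666666666_state.json is listed before 1.0_state.json in the tree above). The sketch below assumes file names of the form <timestamp>_<anything>, which is only the convention used in this example; data file names are chosen by your own components, and data_files_sorted_by_timestamp is a hypothetical helper.

```python
from pathlib import Path


def data_files_sorted_by_timestamp(run_dir):
    """Return [(timestamp, path), ...] for files in <run_dir>/data, in ascending timestamp order.

    Only files whose name starts with a number followed by '_' are included.
    The files are only listed, never modified or renamed.
    """
    results = []
    for path in (Path(run_dir) / "data").iterdir():
        prefix, separator, _rest = path.name.partition("_")
        if not separator:
            continue  # no '_' in the name, so there is no timestamp prefix
        try:
            results.append((float(prefix), path))
        except ValueError:
            continue  # the prefix is not a number
    return sorted(results, key=lambda item: item[0])
```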