Starting with MOPPeR

A typical workflow to post-process ACCESS or UM model output requires two steps. The first step creates the mapping for a specific simulation and is done only once per experiment. The second step is to set up and run the actual post-processing.
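In short, the workflow consists of two commands, shown here with illustrative arguments that are covered in detail below:

mopdb template -f /scratch/.../exp1/atmos -v CM2 -a exp1   # step 1: once per experiment
mop setup -c exp_conf.yaml                                 # step 2: set up and launch the post-processing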

Step 1: create a template for a mapping file

mopdb template -f <path-to-model-output> -v <access-version> -a <alias>

$ mopdb template -f /scratch/.../exp1/atmos -v CM2 -a exp1
Opened database /home/581/pxp581/.local/lib/python3.10/site-packages/data/access.db successfully
Found more than 1 definition for fld_s16i222:
[('psl', 'AUS2200', 'AUS2200_A10min', '10minPt'), ('psl', 'AUS2200', 'AUS2200_A1hr', '1hr')]
Using psl from AUS2200_A10min
Variable list for cw323a.pm successfully written
Opened database /home/581/pxp581/.local/lib/python3.10/site-packages/data/access.db successfully
Derived variables: {'treeFracBdlEvg', 'grassFracC4', 'shrubFrac', 'prc', 'mrsfl', 'landCoverFrac', 'mmrbc', 'mmrso4', 'theta24', 'sftgif', 'treeFracNdlEvg', 'snw', 'rtmt', 'nwdFracLut', 'sifllatstop', 'prw', 'mrfso', 'rlus', 'mrsll', 'baresoilFrac', 'c4PftFrac', 'wetlandFrac', 'mrro', 'c3PftFrac', 'treeFracBdlDcd', 'od550lt1aer', 'treeFracNdlDcd', 'residualFrac', 'wetss', 'sbl', 'vegFrac', 'rsus', 'cropFrac', 'mmrdust', 'grassFrac', 'mmrss', 'od550aer', 'hus24', 'dryss', 'fracLut', 'mrlso', 'mc', 'od440aer', 'grassFracC3', 'nep', 'mmroa', 'cropFracC3', 'snm', 'agesno'}
Changing cl-CMIP6_Amon units from 1 to %
Changing cli-CMIP6_Amon units from 1 to kg kg-1
Changing clt-CMIP6_Amon units from 1 to %
Changing clw-CMIP6_Amon units from 1 to kg kg-1
Variable husuvgrid-CM2_mon not found in cmor table
...
mopdb template takes as input:
  • -f/--fpath : the path to the model output

  • -v/--version : the ACCESS version to use as preferred mapping; ESM1.5, CM2, OM2 and AUS2200 are currently available.

  • -a/--alias : an optional alias; if omitted, default names will be used for the output files.

Alternatively, a list of variables can be created separately using the varlist command; the resulting file can then be passed directly to template via the same -f/--fpath option.

mopdb template -f <varlist.csv> -v <access-version> -a <alias>
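For example, assuming varlist accepts the same -f/--fpath and -a/--alias options as template (check mopdb varlist --help for the exact flags), the intermediate variable list can be produced and reused as follows:

mopdb varlist -f /scratch/.../exp1/atmos -a exp1   # writes varlist_exp1.csv
mopdb template -f varlist_exp1.csv -v CM2 -a exp1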

The template command produces a csv file listing all the variables found in the raw output, mapped to CMIP-style variables. These mappings take frequency into account and also include variables that can potentially be calculated from the listed fields; the console output lists these derived variables, as shown above.

This file should be considered only a template (hence the name), as the possible matches depend on the mappings available in the access.db database. This database is distributed with the repository; alternatively, a custom database can be passed with the --dbname option. Mappings can differ between versions and/or configurations of the model, and the database does not necessarily contain all the possible combinations.
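For instance, to point template at a custom database (the paths here are purely illustrative):

mopdb template -f /scratch/.../exp1/atmos -v CM2 -a exp1 --dbname /path/to/custom_access.db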

Starting with version 0.6, the list also includes matches based on the standard_name. As these rows often list more than one option per field, it is important to either edit or remove them before using the mapping file. The Customising section covers what to do for an experiment using a new configuration which is substantially different from the available ones. The template command also produces an intermediate varlist_<alias>.csv file that shows the information derived directly from the files; this can be useful for debugging mapping issues. The varlist file is checked before the mapping step to make sure the tool has detected a sensible frequency and realm: if the check fails the mapping won't proceed, but the varlist file can be edited appropriately.

Warning

Always check that the resulting template maps the variables correctly. This is particularly true for derived variables. Comment lines are inserted to give some information on what assumptions were made for each group of mappings.

Step 2: set up the working environment

mop setup -c <conf_exp.yaml>

$ mop setup -c exp_conf.yaml
Simulation to process: cy286
Setting environment and creating working directory
Output directory '/scratch/v45/pxp581/MOPPER_output/cy286' exists.
Delete and continue? [Y,n]
Y
Preparing job_files directory...
Creating variable maps in directory '/scratch/v45/pxp581/MOPPER_output/cy286/maps'

CMIP6_Omon:
could not find match for CMIP6_Omon-msftbarot-mon check variables defined in mappings
    Found 22 variables

CMIP6_Emon:
    Found 3 variables

CM2_mon:
    Found 2 variables
creating & using database: /scratch/v45/pxp581/MOPPER_output/cy286/mopper.db
Opened database /scratch/v45/pxp581/MOPPER_output/cy286/mopper.db successfully
Found experiment: cy286
Number of rows in filelist: 27
Estimated total files size before compression is: 7.9506173729896545 GB
number of files to create: 27
number of cpus to be used: 24
total amount of memory to be used: 768GB
app job script: /scratch/v45/pxp581/MOPPER_output/cy286/mopper_job.sh
Exporting config data to yaml file

The mop setup command takes as input a yaml configuration file which contains all the information necessary to post-process the data. The repository provides two templates which can be modified by the user: ACDD_conf.yaml, and CMIP6_conf.yaml for CMIP6-compliant output. The configuration file is divided into two sections:

cmor

This section contains all the path information: input files, mapping file, custom CMOR tables (if they exist), and where the output should be saved. It also controls the queue job settings and which variables will be processed.

A user can choose to process one variable at a time, a specific CMOR table or all of them, or a specific list of variables passed as a yaml file (a sketch of this file follows the example below). Whichever way, only tables and variables included in the mapping file are considered: if a variable is not available, mop will skip it; if it is available at a higher frequency, mop will set up a resampling step to calculate it.

Example

################################################################
# USER OPTIONS
# Settings to manage cmorisation and set tables/variables to process
cmor:
    # If test true it will just run the setup but not launch the job automatically
    test: false
    appdir:  /g/data/ua8/Working/packages/ACCESS-MOPPeR 
    # output directory for all generated data (CMORISED files & logs)
    # if default it is set to /scratch/$project/$user/MOPPER_output/<exp>
    outpath: default
    # if true override files already existing in outpath 
    override: !!bool true
    # location of input data must point to dir above experiment;
    #  and experiment subdir must contain atmos/[,ocean/, ice/]
    datadir: /g/data/...
    # from exp_to_process: local name of experiment
    exp: expname 
    # Interval to cmorise inclusive of end_date
    # NB this will be used to select input files to include.
    # Use also hhmm if you want more control on subdaily data
    # start_date = "20220222T0000"
    # sometimes this can be defined at end of timestep so to get all data for your last day
    # you should use 0000 time of next day
    start_date: "19800101"                        
    end_date: "20201231"                        
    # select one of: [CM2, ESM1.5, OM2[-025], AUS2200]
    # if adding a new version other defaults might need to be set
    # see documentation
    access_version: CM2 
    # reference date for time units (set as 'default' to use start_date)
    reference_date: 1970-01-01             
    path_template: "{product_version}/{frequency}"
    # date_range is automatically added at the end of filename
    file_template: "{variable_id}_{source_id}_{experiment_id}_{frequency}"
    # maximum file size in MB: this is meant as uncompressed, compression might reduce it by 50%
    max_size: 8192 
    # deflate_level sets the internal compression level, 
    # level 4-6 good compromise between reducing size and write/read speed
    # shuffle 0: off 1:on Shuffle reduces size without impacting speed
    deflate_level: 4
    shuffle: 1
    # Variables to CMORise:
    # CMOR table/variable to process; default is 'all'.
    # 'all' will use all the tables listed in the mapping file
    # Or create a yaml file listing variables to process (VAR_SUBSET[_LIST]).
    # each line: <table: [var1, var2, var3 ..]>
    tables: CMIP6_Amon
    variable_to_process: tas 
    var_subset: !!bool False
    var_subset_list: ''
    # if subhr data is included specify actual frequency as ##min
    subhr: 10min
    # model vertical levels number
    levnum: 85 
    # Mappings, vocab and tables settings
    # default=data/dreq/cmvme_all_piControl_3_3.csv
    # Leave as set unless publishing for CMIP6
    dreq: default
    force_dreq: !!bool False
    dreq_years: !!bool False
    # mapping file created with cli_db.py based on the actual model output
    master_map: "localdata/map_expname.csv"
    # CMOR tables path, these define what variables can be extracted
    # see documentation to add new tables/variables
    # use this to indicate the path used for new or modified tables
    # these will be used in preference to the package tables
    tables_path: ""
    # ancillary files path
    # when running model with payu ancil files are copied to work/<realm>/INPUT
    # you can leave these empty if processing only atmos
    ancils_path: "localdata/ancils"
    grid_ocean: ""
    grid_ice: ""
    mask_ocean: ""
    land_frac: ""
    tile_frac: ""
    # defines Controlled Vocabularies and required attributes
    # leave ACDD to follow NCI publishing requirements 
    _control_vocabulary_file: "ACDD_CV.json"
    # leave this empty unless is CMIP6
    _cmip6_option:  
    _AXIS_ENTRY_FILE: "ACDD_coordinate.json"
    _FORMULA_VAR_FILE: "ACDD_formula_terms.json"
    grids: "ACDD_grids.json"
  # Additional NCI information:
    # NCI project to charge compute; $PROJECT = your default project
    # NCI queue to use; hugemem is recommended
    project: v45 
    # additional NCI projects to be included in the storage flags, comma separated list
    addprojs: []
    # queue and memory (GB) per CPU (depends on queue),
    # hugemem is recommended for high resolution data and/or derived variables
    # hugemem requires a minimum of 6 cpus this is handled by the code
    queue: hugemem
    mem_per_cpu: 32
    max_cpus: 24
    # Mopper uses multiprocessing to produce files in parallel, usually 1 cpu per worker
    # is a good compromise, occasionally you might want to pass a higher number
    # if running out of memory
    cpuxworker: 1
    # walltime in "hh:mm:ss"
    walltime: '8:00:00'
    mode: custom
    # conda_env to use; by default hh5 analysis3-unstable,
    # as this has the code and all dependencies installed
    # you can override that by supplying the env to pass to "source"
    # Ex
    # conda_env: <custom-env-path>/bin/activate
    # to allow other settings use "test: true" and modify mopper_job.sh manually
    conda_env: default
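As the comments above only describe the VAR_SUBSET[_LIST] file format, here is a minimal sketch of such a yaml file; the table and variable names are purely illustrative:

CMIP6_Amon: [tas, pr, psl]
CMIP6_Omon: [tos]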

Note

From version 1.1 we introduced more keys to control the PBS directives and how CPUs are handled by the multiprocessing Pool used in the code. By default, the number of CPUs requested by the job is derived from the number of files to process, the queue, and the max_cpus value, which can now be set explicitly. mop run uses the Pool to work on each file separately; by default it allocates 1 CPU per worker and launches at most as many workers as there are CPUs. This can now be controlled by setting cpuxworker, which can be useful to allocate more memory to each worker.
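For example, to give each worker two CPUs (and hence twice the memory), the relevant keys in the cmor section could be set as follows; the values are illustrative:

    queue: hugemem
    mem_per_cpu: 32
    max_cpus: 24
    # 2 CPUs per worker, so at most 12 workers run concurrently
    cpuxworker: 2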

attributes

The second part defines the global attributes to add to every file. CMOR uses a controlled vocabulary file to list the required attributes. We provide both the official CMIP6 controlled vocabulary and a custom-made one as part of the repository data; hence the two templates, one for CMIP6-compliant files and the other for ACDD-compliant files. The ACDD conventions help produce reasonably well-documented files when a specific standard is not required; they are also the conventions requested by NCI to publish data as part of their collection. While the CMIP6 template should be followed exactly, the ACDD template only includes a minimum number of required attributes, and any other attribute deemed necessary can be added.

Example
# Global attributes: these will be added to each file; comment out unwanted ones
# Using ACDD CV vocab to check validity of global attributes
# see data/custom-cmor-tables/ACDD_CV.json
# For CMIP6 global attributes explanation:
# https://docs.google.com/document/d/1h0r8RZr_f3-8egBMMh7aqLwy3snpD6_MrDz1q8n5XUk/edit
attrs:
    Conventions: "CF-1.7, ACDD-1.3"
    title: "ACCESS CM2  historical simulation ...."
    experiment_id: exp-id
    # Use to provide a short description of the experiment. 
    # It will be written to file as "summary" 
    exp_description: "A global simulation of ...."
    product_version: v1.0
    date_created: "2023-05-12"
    # NB source and source_id need to be defined in ACDD_CV.json 
    # if using new different model configuration
    # currently available: AUS2200, ACCESS-ESM1-5, ACCESS-CM2,
    #                      ACCESS-OM2, ACCESS-OM2-025 
    source_id: 'ACCESS-CM2'
    # AUS2200 description
    source: "ACCESS - CM2 ..."
    # ACCESS-CM2 description
    #source: "ACCESS-CM2 (2019): aerosol: UKCA-GLOMAP-mode, atmos: MetUM-HadGEM3-GA7.1 (N96; 192 x 144 longitude/latitude; 85 levels; top level 85 km), atmosChem: none, land: CABLE2.5, landIce: none, ocean: ACCESS-OM2 (GFDL-MOM5, tripolar primarily 1deg; 360 x 300 longitude/latitude; 50 levels; top grid cell 0-10 m), ocnBgchem: none, seaIce: CICE5.1.2 (same grid as ocean)"
    # ACCESS-ESM1.5 description
    #source: "ACCESS-ESM1.5 (2019): aerosol: CLASSIC (v1.0), atmos: HadGAM2 (r1.1, N96; 192 x 145 longitude/latitude; 38 levels; top level 39255 m), atmosChem: none, land: CABLE2.4, landIce: none, ocean: ACCESS-OM2 (MOM5, tripolar primarily 1deg; 360 x 300 longitude/latitude; 50 levels; top grid cell 0-10 m), ocnBgchem: WOMBAT (same grid as ocean), seaIce: CICE4.1 (same grid as ocean)"
    license: "https://creativecommons.org/licenses/by/4.0/"
    institution: University of ... 
    # not required
    organisation: Centre of Excellence for Climate Extremes
    # see here: https://acdguide.github.io/Governance/tech/keywords.html
    # use of FOR codes is recommended
    keywords: "Climate change processes, Adverse weather events, Cloud physics"
    references: "" 
    # contact email of person running post-processing or author
    contact: <contact-email>                
    creator_name: <main-author-name>
    creator_email: <main-author-email>
    creator_url: <main-author-researcher-id>
    # not required: details of any contributor, including who ran the post-processing
    # if different from creator. If more than one, separate with commas
    # see here for datacite contributor role definitions:
    # https://datacite-metadata-schema.readthedocs.io/en/4.5_draft/properties/recommended_optional/property_contributor.html#a-contributortype
    contributor_name: <contributor1>, <contributor2>
    contributor_role: data_curator, data_curator 
    contributor_email:  <contributor1-email>, <contributor2-email> 
    contributor_url:  <contributor1-researcher-id>, <contributor2-researcher-id>
    # Not required use if publishing, otherwise comment out
    #publisher_name:
    #publisher_email:
    # The following refer to the entire dataset rather than the specific file
    time_coverage_start: 1980-01-01
    time_coverage_end: 2020-12-31
    geospatial_lat_min: -90.0 
    geospatial_lat_max: 90.0
    geospatial_lon_min: -180.0
    geospatial_lon_max: 180.0
    # The following attributes will be added automatically:
    # experiment, frequency, realm, variable
    # Add below whatever other global attributes you want to add
    forcing: "GHG, Oz, SA, Sl, Vl, BC, OC, (GHG = CO2, N2O, CH4, CFC11, CFC12, CFC113, HCFC22, HFC125, HFC134a)"
    calendar: "proleptic_gregorian"
    grid: "native atmosphere N96 grid (192 x 144 latxlon)"
    # use the nearest value from the CMIP6 nominal_resolution vocabulary
    nominal_resolution: "250 km"
    #
    # Parent experiment details if any
    # if parent=false, all parent fields are automatically set to "no parent".
    # If true, defined values are used.
    parent: !!bool false 
    # CMOR will add a tracking_id if you want to define a prefix add here
    tracking_id_prefix: 
    comment: "post-processed using ACCESS-MOPPeR v0.6.0 https://doi.org/10.5281/zenodo.10346216"

Note

These two configurations are based on the CMOR Controlled Vocabularies currently available with the repository. A user can define and set their own CV and then modify the configuration yaml file correspondingly. However, CMOR still has some hardcoded attributes that cannot be bypassed; see the CMOR3 section for more information.

Running the post-processing

mop setup sets up the working environment by default in

/scratch/<project>/<userid>/MOPPER_output/<exp>/

This includes the mopper_job.sh job to submit to the queue. If test is set to False in the configuration file, the job is automatically submitted.
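If test is set to true, the job can be inspected first and then submitted manually with the standard PBS command:

qsub /scratch/<project>/<userid>/MOPPER_output/<exp>/mopper_job.sh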

Note

mop run executes the post-processing and is called in mopper_job.sh. It takes as input a final experiment configuration yaml file generated by the setup step to finalise the run settings. This file contains all the necessary information (including further details added by the tool itself) and can be kept for provenance and reproducibility.

MOPPeR workflow

A more detailed overview of the workflow triggered when calling mop.

setup

  • Reads from configuration file: output file attributes, paths (input data, working dir, ancillary files), queue job settings, variables to process

  • Defines and creates output paths

  • Updates CV json file if necessary

  • Selects variables and corresponding mappings based on table and constraints passed in config file

  • Produces mop_var_selection.yaml file with variables matched for each table

  • Creates/updates database filelist table to list files to create

  • Finalises the configuration and saves it in a new yaml file

  • Writes the job executable file and (optionally) submits it to the queue

run

  • Reads the list of files to create from mopper.db

  • Sets up the concurrent futures pool executor and submits each row of the filelist database table as a separate process

  • Each process: (1) sets up the variable log file; (2) sets up the CMOR dataset, tables and axes; (3) extracts or calculates the variable; (4) writes it to file using CMOR3

  • When all processes have completed, results are returned to the log files and the status is updated in the filelist database table

Working directory and output

The mop setup command generates the working and output directory based on the yaml configuration file passed as argument.

The directory path is determined by the outpath field. This can be an explicit path or, if set to default:

/scratch/<project-id>/<user-id>/MOPPER_output/<exp>/

where exp is also defined in the configuration file.

Note

mop setup also produces the mop_var_selection.yaml file, which includes the lists of matched variables for each table. This can be used to select variables in a later run by passing it in the configuration file via the var_subset_list field (with var_subset set to true). It can be useful to first run mop setup with tables: all to see which variables can be matched across all available tables, and then rerun it with mop_var_selection.yaml as the variable list after refining the selection.
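For example, after refining the selection, the relevant keys in the cmor section of the configuration would look like this (assuming the file path is given relative to the working directory):

    var_subset: !!bool True
    var_subset_list: 'mop_var_selection.yaml'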

This folder will contain the following files:

  • mopper.db

    A database with a filelist table where each row represents a file to produce

    columns
    • infile - path + filename pattern for input files

    • filepath - expected output filepath

    • filename - expected output filename

    • vin - one or more input variables

    • variable_id - cmor name for variable

    • ctable - cmor table containing variable definition

    • frequency - output variable frequency

    • realm - output variable realm

    • timeshot - cell_methods value for time: point, mean, sum, max, min

    • axes - The cmor names of the axes used in variable definition

    • tstart - datetime stamp for time range start

    • tend - datetime stamp for time range end

    • sel_start - datetime stamp to use for input file selection (start)

    • sel_end - datetime stamp to use for input file selection (end)

    • status - file status: unprocessed, processed, processing_failed, … Files are post-processed only if their status is “unprocessed”

    • file_size - estimated uncompressed file size in MB

    • exp_id - experiment id

    • calculation - string representing the calculation to perform, as it will be evaluated by python “eval” (optional)

    • resample - if the input data has to be resampled, the timestep to use for resampling (optional)

    • in_units - units for main input variable

    • positive - “up” or “down” if attribute present in variable definition (optional)

    • cfname - CF conventions standard_name if available

    • source_id - model id

    • access_version - model version

    • json_file_path - filepath for CMOR json experiment file

    • reference_date - reference date to use for time axis

    • version - version label for output

  • mopper_job.sh

    The PBS job to submit to the queue to run the post-processing.

    Example
    #!/bin/bash
    #PBS -P v45
    #PBS -q hugemem
    #PBS -l storage=gdata/hh5+gdata/ua8+scratch/ly62+scratch/v45+gdata/v45
    #PBS -l ncpus=24,walltime=12:00:00,mem=768GB,wd
    #PBS -j oe
    #PBS -o /scratch/v45/pxp581/MOPPER_output/ashwed1980/job_output.OU
    #PBS -N mopper_ashwed1980

    # the code assumes you are running this on gadi and have access to the hh5 project modules
    # if this is not the case, make sure you have loaded alternative python modules
    # providing the required packages

    module use /g/data/hh5/public/modules
    module load conda/analysis3
    source mopper_env/bin/activate # if using conda option

    cd /g/data/ua8/Working/packages/ACCESS-MOPPeR
    mop run -c ashwed1980_config.yaml # --debug (uncomment to run in debug mode)
    echo 'APP completed for exp ashwed1980.'
  • experiment-id.json

    The json experiment file needed by CMOR to create the files

  • maps/

    A folder containing one json file for each CMOR table used; each file contains the mappings for all selected variables.

  • tables/

    A folder containing one json file for each CMOR table used; each file contains the CMOR definitions for all selected variables.

  • mopper_log.txt

    A log file capturing messages from the main run process

  • cmor_logs/

    A folder containing, for each file created, a log with the CMOR logging messages.

  • variable_logs/

    A folder containing a log for each file created, detailing the processing steps and, if run in debug mode, debug messages.

  • update_db.py

    A basic python script to update the file status in the mopper.db database after a run
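Since the filelist table described above records a status for every file, a quick way to check the outcome of a run is to query the database directly. This is just an illustration based on the schema listed above; the path is illustrative:

sqlite3 /scratch/<project>/<userid>/MOPPER_output/<exp>/mopper.db \
  "SELECT filename, status FROM filelist WHERE status != 'processed';"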