Starting with MOPPeR
A typical workflow to post-process ACCESS or UM model output requires two steps. The first step creates the mapping for a specific simulation; it is done only once per experiment. The second step sets up and runs the actual post-processing.
Step 1: create a template for a mapping file
mopdb template -f <path-to-model-output> -v <access-version> -a <alias>
$ mopdb template -f /scratch/.../exp1/atmos -v CM2 -a exp1
Opened database /home/581/pxp581/.local/lib/python3.10/site-packages/data/access.db successfully
Found more than 1 definition for fld_s16i222:
[('psl', 'AUS2200', 'AUS2200_A10min', '10minPt'), ('psl', 'AUS2200', 'AUS2200_A1hr', '1hr')]
Using psl from AUS2200_A10min
Variable list for cw323a.pm successfully written
Opened database /home/581/pxp581/.local/lib/python3.10/site-packages/data/access.db successfully
Derived variables: {'treeFracBdlEvg', 'grassFracC4', 'shrubFrac', 'prc', 'mrsfl', 'landCoverFrac', 'mmrbc', 'mmrso4', 'theta24', 'sftgif', 'treeFracNdlEvg', 'snw', 'rtmt', 'nwdFracLut', 'sifllatstop', 'prw', 'mrfso', 'rlus', 'mrsll', 'baresoilFrac', 'c4PftFrac', 'wetlandFrac', 'mrro', 'c3PftFrac', 'treeFracBdlDcd', 'od550lt1aer', 'treeFracNdlDcd', 'residualFrac', 'wetss', 'sbl', 'vegFrac', 'rsus', 'cropFrac', 'mmrdust', 'grassFrac', 'mmrss', 'od550aer', 'hus24', 'dryss', 'fracLut', 'mrlso', 'mc', 'od440aer', 'grassFracC3', 'nep', 'mmroa', 'cropFracC3', 'snm', 'agesno'}
Changing cl-CMIP6_Amon units from 1 to %
Changing cli-CMIP6_Amon units from 1 to kg kg-1
Changing clt-CMIP6_Amon units from 1 to %
Changing clw-CMIP6_Amon units from 1 to kg kg-1
Variable husuvgrid-CM2_mon not found in cmor table
...
- mopdb template takes as input:
-f/--fpath : the path to the model output
-v/--version : the ACCESS version to use as preferred mapping; ESM1.5, CM2, OM2 and AUS2200 are currently available.
-a/--alias : an optional alias; if omitted, default names will be used for the output files.
Alternatively, a list of variables can be created separately with the varlist command and then passed directly to template via the fpath option.
mopdb template -f <varlist.csv> -v <access-version> -a <alias>
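For example, a sketch of this two-step approach, assuming varlist accepts the same -f/--fpath and -v/--version options as template and writes a csv file of detected variables (here varlist_exp1.csv; check mopdb varlist --help for the exact options and output name):

$ mopdb varlist -f /scratch/.../exp1/atmos -v CM2
$ mopdb template -f varlist_exp1.csv -v CM2 -a exp1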
The template command produces a csv file listing all the variables from the raw output mapped to CMIP-style variables. These mappings take the frequency into account and include variables that can potentially be calculated from the listed fields; the console output lists these, as shown above.
This file should be considered only a template (hence the name), as the possible matches depend on the mappings available in the access.db database. This database is distributed with the repository; an alternative custom database can be passed with the --dbname option. The mappings can differ between versions and/or configurations of the model, and the database doesn't necessarily contain all the possible combinations.
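For example, to use a custom mapping database instead of the packaged access.db (the database path here is hypothetical):

$ mopdb template -f /scratch/.../exp1/atmos -v CM2 -a exp1 --dbname /g/data/.../custom_access.db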
Starting with version 0.6 the list also includes matches based on the standard_name. As these rows often list more than one option per field, it's important to either edit or remove them before using the mapping file. The Customising section covers what to do for an experiment using a new configuration which is substantially different from the ones available. The template command also produces an intermediate varlist_<alias>.csv file with the information derived directly from the files; this can be useful for debugging mapping issues. This file is checked before the mapping step to make sure the tool has detected a sensible frequency and realm: if the check fails the mapping won't proceed, but the varlist file can be edited appropriately.
Warning
Always check that the resulting template maps the variables correctly. This is particularly true for derived variables. Comment lines are inserted to document the assumptions made for each group of mappings.
Step 2: set up the working environment
mop setup -c <conf_exp.yaml>
$ mop setup -c exp_conf.yaml
Simulation to process: cy286
Setting environment and creating working directory
Output directory '/scratch/v45/pxp581/MOPPER_output/cy286' exists.
Delete and continue? [Y,n]
Y
Preparing job_files directory...
Creating variable maps in directory '/scratch/v45/pxp581/MOPPER_output/cy286/maps'
CMIP6_Omon:
could not find match for CMIP6_Omon-msftbarot-mon check variables defined in mappings
Found 22 variables
CMIP6_Emon:
Found 3 variables
CM2_mon:
Found 2 variables
creating & using database: /scratch/v45/pxp581/MOPPER_output/cy286/mopper.db
Opened database /scratch/v45/pxp581/MOPPER_output/cy286/mopper.db successfully
Found experiment: cy286
Number of rows in filelist: 27
Estimated total files size before compression is: 7.9506173729896545 GB
number of files to create: 27
number of cpus to be used: 24
total amount of memory to be used: 768GB
app job script: /scratch/v45/pxp581/MOPPER_output/cy286/mopper_job.sh
Exporting config data to yaml file
The mop setup command takes as input a yaml configuration file which contains all the information necessary to post-process the data. The repository provides two templates which can be modified by the user: ACDD_conf.yaml and CMIP6_conf.yaml; the latter produces CMIP6-compliant output. The configuration file is divided into two sections:
cmor
This part contains all the file path information: input files, mapping file, custom CMOR tables (if they exist) and where the output should be saved. It is also where the user controls the queue job settings and which variables will be processed.
A user can choose to process one variable at a time, a specific CMOR table or all of them, or a specific list of variables passed as a yaml file. Either way, only tables and variables included in the mapping file are considered; if a variable is not available, mop will skip it, and if it is available at a higher frequency, mop will set up resampling to calculate it.
Example
################################################################
# USER OPTIONS
# Settings to manage cmorisation and set tables/variables to process
cmor:
    # If test true it will just run the setup but not launch the job automatically
    test: false
    appdir: /g/data/ua8/Working/packages/ACCESS-MOPPeR
    # output directory for all generated data (CMORISED files & logs)
    # if default it is set to /scratch/$project/$user/MOPPER_OUTPUT<exp>
    outpath: default
    # if true override files already existing in outpath
    override: !!bool true
    # location of input data must point to dir above experiment;
    # and experiment subdir must contain atmos/[,ocean/, ice/]
    datadir: /g/data/...
    # from exp_to_process: local name of experiment
    exp: expname
    # Interval to cmorise inclusive of end_date
    # NB this will be used to select input files to include.
    # Use also hhmm if you want more control on subdaily data
    # start_date = "20220222T0000"
    # sometimes this can be defined at end of timestep so to get all data for your last day
    # you should use 0000 time of next day
    start_date: "19800101"
    end_date: "20201231"
    # select one of: [CM2, ESM1.5, OM2[-025], AUS2200]
    # if adding a new version other defaults might need to be set
    # see documentation
    access_version: CM2
    # reference date for time units (set as 'default' to use start_date)
    reference_date: 1970-01-01
    path_template: "{product_version}/{frequency}"
    # date_range is automatically added at the end of filename
    file_template: "{variable_id}_{source_id}_{experiment_id}_{frequency}"
    # maximum file size in MB: this is meant as uncompressed, compression might reduce it by 50%
    max_size: 8192
    # deflate_level sets the internal compression level,
    # level 4-6 good compromise between reducing size and write/read speed
    # shuffle 0: off 1:on Shuffle reduces size without impacting speed
    deflate_level: 4
    shuffle: 1
    # Variables to CMORise:
    # CMOR table/variable to process; default is 'all'.
    # 'all' will use all the tables listed in the mapping file
    # Or create a yaml file listing variables to process (VAR_SUBSET[_LIST]).
    # each line: <table: [var1, var2, var3 ..]>
    # (an example subset file is shown after this config)
    tables: CMIP6_Amon
    variable_to_process: tas
    var_subset: !!bool False
    var_subset_list: ''
    # if subhr data is included specify actual frequency as ##min
    subhr: 10min
    # model vertical levels number
    levnum: 85
    # Mappings, vocab and tables settings
    # default=data/dreq/cmvme_all_piControl_3_3.csv
    # Leave as set unless publishing for CMIP6
    dreq: default
    force_dreq: !!bool False
    dreq_years: !!bool False
    # mapping file created with cli_db.py based on the actual model output
    master_map: "localdata/map_expname.csv"
    # CMOR tables path, these define what variables can be extracted
    # see documentation to add new tables/variables
    # use this to indicate the path used for new or modified tables
    # these will be used in preference to the package tables
    tables_path: ""
    # ancillary files path
    # when running model with payu ancil files are copied to work/<realm>/INPUT
    # you can leave these empty if processing only atmos
    ancils_path: "localdata/ancils"
    grid_ocean: ""
    grid_ice: ""
    mask_ocean: ""
    land_frac: ""
    tile_frac: ""
    # defines Controlled Vocabularies and required attributes
    # leave ACDD to follow NCI publishing requirements
    _control_vocabulary_file: "ACDD_CV.json"
    # leave this empty unless it is CMIP6
    _cmip6_option:
    _AXIS_ENTRY_FILE: "ACDD_coordinate.json"
    _FORMULA_VAR_FILE: "ACDD_formula_terms.json"
    grids: "ACDD_grids.json"
    # Additional NCI information:
    # NCI project to charge compute; $PROJECT = your default project
    # NCI queue to use; hugemem is recommended
    project: v45
    # additional NCI projects to be included in the storage flags, comma separated list
    addprojs: []
    # queue and memory (GB) per CPU (depends on queue),
    # hugemem is recommended for high resolution data and/or derived variables
    # hugemem requires a minimum of 6 cpus this is handled by the code
    queue: hugemem
    mem_per_cpu: 32
    max_cpus: 24
    # Mopper uses multiprocessing to produce files in parallel, usually 1 cpu per worker
    # is a good compromise, occasionally you might want to pass a higher number
    # if running out of memory
    cpuxworker: 1
    # walltime in "hh:mm:ss"
    walltime: '8:00:00'
    mode: custom
    # conda_env to use by default hh5 analysis3-unstable
    # as this has the code and all dependencies installed
    # you can override that by supplying the env to pass to "source"
    # Ex
    # conda_env: <custom-env-path>/bin/activate
    # to allow other settings use "test: true" and modify mopper_job.sh manually
    conda_env: default
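A variable subset file following the <table: [var1, var2, ..]> format described in the comments above could look like this sketch (table and variable names are illustrative):

CMIP6_Amon: [tas, pr, psl]
CMIP6_Omon: [tos, sos]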
Note
From version 1.1 we introduced more keys to control the PBS directives and also how CPUs are handled by the multiprocessing Pool used in the code. The number of CPUs requested by the job is derived by default from the number of files to process, the queue, and a max_cpus value that can now be controlled. mop run uses a Pool to work on each file separately; by default it allocates 1 CPU per worker and launches a maximum number of workers equal to the number of CPUs. This can now be controlled by setting the cpuxworker key, which can be useful to allocate more memory to each worker.
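For instance, the relevant keys in the cmor section might be set as follows (values are illustrative):

queue: hugemem
mem_per_cpu: 32
max_cpus: 24
# 2 CPUs per worker, e.g. to give each worker more memory
cpuxworker: 2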
attributes
The second part defines the global attributes to add to every file. CMOR uses a controlled vocabulary file to list required attributes. We provide the official CMIP6 controlled vocabulary and a custom-made one as part of the repository data, and correspondingly two templates: one for CMIP6-compliant files, the other for ACDD-compliant files. The ACDD conventions help produce reasonably well-documented files when a specific standard is not required; they are also the conventions requested by NCI to publish data as part of their collection. While the CMIP6 template should be followed exactly, the ACDD template includes only a minimum number of required attributes; any other attribute deemed necessary can be added.
Example
# Global attributes: these will be added to each file; comment out unwanted ones
# Using ACDD CV vocab to check validity of global attributes
# see data/custom-cmor-tables/ACDD_CV.json
# For CMIP6 global attributes explanation:
# https://docs.google.com/document/d/1h0r8RZr_f3-8egBMMh7aqLwy3snpD6_MrDz1q8n5XUk/edit
attrs:
Conventions: "CF-1.7, ACDD-1.3"
title: "ACCESS CM2 historical simulation ...."
experiment_id: exp-id
# Use to provide a short description of the experiment.
# It will be written to file as "summary"
exp_description: "A global simulation of ...."
product_version: v1.0
date_created: "2023-05-12"
# NB source and source_id need to be defined in ACDD_CV.json
# if using new different model configuration
# currently available: AUS2200, ACCESS-ESM1-5, ACCESS-CM2,
# ACCESS-OM2, ACCESS-OM2-025
source_id: 'ACCESS-CM2'
# AUS2200 description
source: "ACCESS - CM2 ..."
# ACCESS-CM2 description
#source: "ACCESS-CM2 (2019): aerosol: UKCA-GLOMAP-mode, atmos: MetUM-HadGEM3-GA7.1 (N96; 192 x 144 longitude/latitude; 85 levels; top level 85 km), atmosChem: none, land: CABLE2.5, landIce: none, ocean: ACCESS-OM2 (GFDL-MOM5, tripolar primarily 1deg; 360 x 300 longitude/latitude; 50 levels; top grid cell 0-10 m), ocnBgchem: none, seaIce: CICE5.1.2 (same grid as ocean)"
# ACCESS-ESM1.5 description
#source: "ACCESS-ESM1.5 (2019): aerosol: CLASSIC (v1.0), atmos: HadGAM2 (r1.1, N96; 192 x 145 longitude/latitude; 38 levels; top level 39255 m), atmosChem: none, land: CABLE2.4, landIce: none, ocean: ACCESS-OM2 (MOM5, tripolar primarily 1deg; 360 x 300 longitude/latitude; 50 levels; top grid cell 0-10 m), ocnBgchem: WOMBAT (same grid as ocean), seaIce: CICE4.1 (same grid as ocean)"
license: "https://creativecommons.org/licenses/by/4.0/"
institution: University of ...
# not required
organisation: Centre of Excellence for Climate Extremes
# see here: https://acdguide.github.io/Governance/tech/keywords.html
# use of FOR codes is recommended
keywords: "Climate change processes, Adverse weather events, Cloud physics"
references: ""
# contact email of person running post-processing or author
contact: <contact-email>
creator_name: <main-author-name>
creator_email: <main-author-email>
creator_url: <main-author-researcher-id>
# not required: details of any contributor, including whoever ran the post-processing
# if different from creator. If more than one, separate with commas
# see here for datacite contributor role definitions:
# https://datacite-metadata-schema.readthedocs.io/en/4.5_draft/properties/recommended_optional/property_contributor.html#a-contributortype
contributor_name: <contributor1>, <contributor2>
contributor_role: data_curator, data_curator
contributor_email: <contributor1-email>, <contributor2-email>
contributor_url: <contributor1-researcher-id>, <contributor2-researcher-id>
# Not required use if publishing, otherwise comment out
#publisher_name:
#publisher_email:
# The following refer to the entire dataset rather than the specific file
time_coverage_start: 1980-01-01
time_coverage_end: 2020-12-31
geospatial_lat_min: -90.0
geospatial_lat_max: 90.0
geospatial_lon_min: -180.0
geospatial_lon_max: 180.0
# The following attributes will be added automatically:
# experiment, frequency, realm, variable
# Add below whatever other global attributes you want to add
forcing: "GHG, Oz, SA, Sl, Vl, BC, OC, (GHG = CO2, N2O, CH4, CFC11, CFC12, CFC113, HCFC22, HFC125, HFC134a)"
calendar: "proleptic_gregorian"
grid: "native atmosphere N96 grid (192 x 144 latxlon)"
# nearest value from cmip6 is 2.5 km
nominal_resolution: "250 km"
#
# Parent experiment details if any
# if parent=false, all parent fields are automatically set to "no parent".
# If true, defined values are used.
parent: !!bool false
# CMOR will add a tracking_id if you want to define a prefix add here
tracking_id_prefix:
comment: "post-processed using ACCESS-MOPPeR v0.6.0 https://doi.org/10.5281/zenodo.10346216"
Note
These two configurations are based on the CMOR Controlled Vocabularies currently available with the repository. A user can define their own CV and modify the configuration yaml file correspondingly. However, CMOR still has some hardcoded attributes that cannot be bypassed; see the CMOR3 section for more information.
Running the post-processing
mop setup sets up the working environment by default in
/scratch/<project>/<userid>/MOPPER_output/
This includes the mopper_job.sh job to submit to the queue. If test is set to false in the configuration file, the job is submitted automatically.
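If test is set to true, the job is not launched automatically and can be submitted manually with the standard PBS command, e.g.:

$ qsub /scratch/<project>/<userid>/MOPPER_output/<exp>/mopper_job.sh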
Note
mop run executes the post-processing; it is called in mopper_job.sh. It takes as input the final experiment configuration yaml file generated by the setup step, which finalises the run settings. This file contains all the necessary information (including further details added by the tool itself) and can be kept for provenance and reproducibility.
MOPPeR workflow
A more detailed overview of what happens when mop is called.
setup
Reads from configuration file: output file attributes, paths (input data, working dir, ancillary files), queue job settings, variables to process
Defines and creates output paths
Updates CV json file if necessary
Selects variables and corresponding mappings based on table and constraints passed in config file
Produces mop_var_selection.yaml file with variables matched for each table
Creates/updates database filelist table to list files to create
Finalises the configuration and saves it in a new yaml file
Writes the job executable file and (optionally) submits it to the queue
run
Reads from mopper.db the list of files to create
Sets up the concurrent futures pool executor and submits each file listed in the filelist db table as a separate process (see the sketch after this list)
Each process:
1. Sets up the variable log file
2. Sets up the CMOR dataset, tables and axes
3. Extracts or calculates the variable
4. Writes it to file using CMOR3
When all processes are completed, results are returned to the log files and the status is updated in the filelist database table
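A minimal sketch of this pattern (not MOPPeR's actual code), assuming each row of the filelist table is available as a dictionary:

from concurrent.futures import ProcessPoolExecutor, as_completed

def process_row(row):
    # steps 1-4 above: set up the log, the CMOR dataset/tables/axes,
    # extract or calculate the variable and write it out with CMOR3
    return row["filename"], "processed"

def run_all(rows, ncpus):
    # one worker per file, results collected as the processes complete
    with ProcessPoolExecutor(max_workers=ncpus) as pool:
        futures = [pool.submit(process_row, r) for r in rows]
        for fut in as_completed(futures):
            fname, status = fut.result()
            print(fname, status)  # MOPPeR records the status in the filelist table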
Working directory and output
The mop setup command generates the working and output directory based on the yaml configuration file passed as argument.
The directory path is determined by the outpath field. This can be an explicit path, or, if set to default:
/scratch/<project-id>/<user-id>/MOPPER_output/<exp>/
where exp is also defined in the configuration file.
Note
mop setup also produces the mop_var_selection.yaml file, which lists the matched variables for each table. This can be used as a list of variables to select, by passing it in the configuration file as the varlist field. It can be useful to first run mop setup with tables: all to see which variables can be matched across all available tables, and then rerun it using mop_var_selection.yaml as a varlist after refining the selection.
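For example (a sketch, assuming the field is named varlist as in the note above):

tables: all
# after refining mop_var_selection.yaml, rerun setup with:
# varlist: mop_var_selection.yaml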
This folder will contain the following files:
mopper.db
A database with a filelist table in which each row represents a file to produce; see the query example after the column list below
columns
infile - path + filename pattern for input files
filepath - expected output filepath
filename - expected output filename
vin - one or more input variables
variable_id - cmor name for variable
ctable - cmor table containing variable definition
frequency - output variable frequency
realm - output variable realm
timeshot - cell_methods value for time: point, mean, sum, max, min
axes - The cmor names of the axes used in variable definition
tstart - datetime stamp for time range start
tend - datetime stamp for time range end
sel_start - datetime stamp to use for input file selection (start)
sel_end - datetime stamp to use for input file selection (end)
status - file status: unprocessed, processed, processing_failed, … Files are post-processed only if their status is “unprocessed”
file_size - estimated uncompressed file size in MB
exp_id - experiment id
calculation - string representing the calculation to perform, as it will be evaluated by python “eval” (optional)
resample - if the input data has to be resampled, the timestep to be used by resample (optional)
in_units - units for main input variable
positive - “up” or “down” if attribute present in variable definition (optional)
cfname - CF conventions standard_name if available
source_id - model id
access_version - model version
json_file_path - filepath for CMOR json experiment file
reference_date - reference date to use for time axis
version - version label for output
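Since the .db files used by MOPPeR are standard SQLite databases, the filelist table can also be inspected directly, for example to check which files still need processing (the path is illustrative):

$ sqlite3 /scratch/v45/pxp581/MOPPER_output/cy286/mopper.db \
    "SELECT filename, status FROM filelist WHERE status != 'processed';"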
mopper_job.sh
The PBS job to submit to the queue to run the post-processing.
Example
#!/bin/bash
#PBS -P v45
#PBS -q hugemem
#PBS -l storage=gdata/hh5+gdata/ua8+scratch/ly62+scratch/v45+gdata/v45
#PBS -l ncpus=24,walltime=12:00:00,mem=768GB,wd
#PBS -j oe
#PBS -o /scratch/v45/pxp581/MOPPER_output/ashwed1980/job_output.OU
#PBS -N mopper_ashwed1980
# the code assumes you are running this on gadi and have access to the hh5 project modules
# if this is not the case make sure you have loaded alternative python modules
# for a list of packages
module use /g/data/hh5/public/modules
module load conda/analysis3
source mopper_env/bin/activate # if using conda option
cd /g/data/ua8/Working/packages/ACCESS-MOPPeR
mop run -c ashwed1980_config.yaml # --debug (uncomment to run in debug mode)
echo 'APP completed for exp ashwed1980.'

experiment-id.json
The json experiment file needed by CMOR to create the files
maps/
A folder containing one json file for each CMOR table used, each file contains the mappings for all selected variables.
tables/
A folder containing one json file for each CMOR table used, each file contains the CMOR definition for all selected variables.
mopper_log.txt
A log file capturing messages from the main run process
cmor_logs/
A folder containing a log for each file created with cmor logging messages.
variable_logs/
A folder containing a log for each file created, detailing the processing steps and, if run in debug mode, debug messages.
update_db.py
A basic python script to update the file status in the mopper.db database after a run
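A sketch of the kind of update such a script performs (not necessarily the actual code), here resetting failed files so a new run will retry them:

import sqlite3

# assumes mopper.db is an SQLite database with the filelist table described above
conn = sqlite3.connect("mopper.db")
conn.execute("UPDATE filelist SET status = 'unprocessed' WHERE status = 'processing_failed'")
conn.commit()
conn.close()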