Additional documentation on parallel developments that are conducted under this framework can be found here:
Tools to perform machine learning studies for tau lepton reconstruction and identification at CMS.
- Clone the package from GitHub without loading any additional environment (like CMSSW):

  ```sh
  git clone --recursive -o cms-tau-pog -b master [email protected]:cms-tau-pog/TauMLTools.git
  ```

- Go to the `TauMLTools` directory and load the appropriate environment by running `env.sh`:

  ```sh
  source env.sh [ ENV_NAME ]
  ```

  where the supported `ENV_NAME` values are:
  - if no `ENV_NAME` is specified, the LCG environment is used by default;
  - if `ENV_NAME` contains `cmssw`, the CMSSW-dependent environment is enabled;
  - if `ENV_NAME` contains `conda`, the tau-ml conda environment is used -- this is the recommended environment to perform an actual NN training.

  N.B. If you want to use an existing `conda` installation, make sure that it is activated and that the path to the `conda` executable is included in the `PATH` variable. If no `conda` installation is found, `env.sh` will install it from the official web site and configure it to work with the current TauMLTools installation.

  N.B. If the `cmssw` environment was specified when sourcing `env.sh`, a command must be prefixed with `cmsEnv` to run it in the CMSSW environment.

The second step (`source env.sh ENV_NAME`) should be repeated each time you open a new shell session.
Law is part of the conda installation. A standalone installation without the conda environment is available on lxplus. This installation step is only required if you are running with the LCG environment on sites that are not yet supported by TauMLTools. Some parts of the ntuple production and network training can be split into different jobs and run on the HTCondor cluster. To do so, we use the law package. Law can be installed via conda's pip (recommended for jobs that run in a conda environment) using the environment yaml file, or via standard pip (recommended for jobs that run in the CMSSW environment) with the command

```sh
python -m pip install law
```

If installing with standard pip, make sure that law is added to the `PATH` variable and that the law libraries (e.g. `$HOME/.local/lib/pythonX.Y/site-packages/`) are added to the `PYTHONPATH` variable. It may also be necessary to change the shebang of the law executable (e.g. `$HOME/.local/bin/law`) from `#!/some/path/to/python3` to `#!/usr/bin/env python3` in order to access CMSSW python modules. As an alternative to this last step, one can install law with the command below (not fully tested)

```sh
LAW_INSTALL_EXECUTABLE="/usr/bin/env python3" python3 -m pip install law --user --no-binary :all:
```

which automatically sets the correct shebang.
The steps below describe how to process the input datasets, starting from MiniAOD up to the representation that can be directly used as input for the training.
The big root-tuples, `TauTuple`, contain extensive information about the reconstructed tau, the seeding jet, and all reco objects within the tau signal and isolation cones: PF candidates, pat::Electrons, pat::Muons, and tracks. `TauTuple` also contains the MC truth that can be used to identify the tau decay. Each `TauTuple` entry corresponds to a single tau candidate. The branches available within `TauTuple` are defined in Analysis/interface/TauTuple.h. These branches are filled in the CMSSW module Production/plugins/TauTupleProducer.cc, which converts information stored in MiniAOD events into the `TauTuple` format.

Production/python/Production.py contains the configuration that allows to run `TauTupleProducer` with `cmsRun`. Here is an example of how to run `TauTupleProducer` on 1000 DY events using one MiniAOD file as input:

```sh
cmsEnv cmsRun TauMLTools/Production/python/Production.py sampleType=MC_18 inputFiles=/store/mc/RunIIAutumn18MiniAOD/DYJetsToLL_M-50_TuneCP5_13TeV-madgraphMLM-pythia8/MINIAODSIM/102X_upgrade2018_realistic_v15-v1/00000/A788C40A-03F7-4547-B5BA-C1E01CEBB8D8.root maxEvents=1000
```
In order to run a large-scale production over entire datasets, the CMS computing grid should be used via the CRAB interface. Submission and status control can be performed using the crab_submit.py and crab_cmd.py commands.
- Go to `$CMSSW_BASE/src` and load the CMSSW and CRAB environments:

  ```sh
  cd $CMSSW_BASE/src
  cmsenv
  source /cvmfs/cms.cern.ch/common/crab-setup.sh
  ```

- Enable the VOMS proxy:

  ```sh
  voms-proxy-init -rfc -voms cms -valid 192:00
  export X509_USER_PROXY=`voms-proxy-info -path`
  ```

- Submit the task defined in a config file (or a set of config files) using `crab_submit.py`:

  ```sh
  crab_submit.py --workArea work-area --cfg TauMLTools/Production/python/Production.py --site T2_CH_CERN --output /store/group/phys_tau/TauML/prod_2018_v2/crab_output TauMLTools/Production/crab/configs/2018/CONFIG1.txt TauMLTools/Production/crab/configs/2018/CONFIG2.txt ...
  ```

  - For more command line options use `crab_submit.py --help`.
  - For big datasets, file-based splitting should be used.
  - For Embedded samples, a custom DBS should be specified.
- Regularly check the task status using `crab_cmd.py`:

  ```sh
  crab_cmd.py --workArea work-area --cmd status
  ```

- Once all jobs within a given task are finished, you can move it from the `work-area` to the `finished` folder (to avoid rerunning the status check each time) and set `done` for the dataset in the production google-doc.
- If some jobs failed: try to understand the reason and use standard CRAB tools to solve the problem (e.g. `crab resubmit` with additional arguments). In very problematic cases a recovery task can be created.
- Once the production is over, all produced `TauTuples` will be moved to `/eos/cms/store/group/phys_tau/TauML/prod_2018_v2/full_tuples`.
Production should be run on a server that has the CRAB stageout area mounted to the file system.
- Load the environment on a CentOS8 machine:

  ```sh
  source $PWD/env.sh cmssw
  voms-proxy-init -voms cms -rfc -valid 192:00
  ```

- Check that all datasets are present and valid (adjust the path to the `yaml` files accordingly):

  ```sh
  cat Production/crab/Run3Winter23_HLT/*.yaml | grep -v -E '^( +| *#)' | grep -E ' /' | sed -E 's/.*: (.*)/\1/' | xargs python RunKit/checkDatasetExistance.py
  ```

  If everything is fine, there should be no output.
- Modify the output and other site-specific settings in `NanoProd/crab/overseer_cfg_HLT.yaml`. In particular:
  - site
  - crabOutput
  - localCrabOutput
  - finalOutput
  - renewKerberosTicket
- Test that the code works locally (take one of the RAW files as an input). E.g.

  ```sh
  mkdir -p tmp && cd tmp
  cmsEnv python3 $ANALYSIS_PATH/RunKit/cmsRunWrapper.py cmsRunCfg=$ANALYSIS_PATH/Production/python/hlt_configs/hltMC.py maxEvents=10 inputFiles=/store/mc/Run3Winter23Digi/TT_TuneCP5_13p6TeV_powheg-pythia8/GEN-SIM-RAW/126X_mcRun3_2023_forPU65_v1_ext1-v2/40002/cbcb2b23-174a-4e7f-a385-152d9c5c5b87.root output=nano_mc_tmp.root writePSet=True copyInputsToLocal=False skimCfg=$ANALYSIS_PATH/Production/config/skim_HLT.yaml skimSetup=skim_setup
  cmsEnv $ANALYSIS_PATH/RunKit/crabJob.sh
  ```

  Check that the output file `nano_0.root` is created correctly. After that, you can remove the `tmp` directory:

  ```sh
  cd $ANALYSIS_PATH
  rm -r tmp
  ```
- Test a dry-run crab submission:

  ```sh
  python RunKit/crabOverseer.py --work-area crab_test --cfg Production/crab/overseer_cfg_HLT.yaml --no-loop Production/crab/crab_test_HLT.yaml
  ```

  - If successful, the last line output to the terminal should be `Tasks: 1 Total, 1 Submitted`.
  - N.B. Crab estimates of the processing time will not be accurate, ignore them.
  - After the test, remove the `crab_test` directory:

    ```sh
    rm -r crab_test
    ```
- Test that the post-processing task is known to law:

  ```sh
  law index
  law run ProdTask --help
  ```
- Submit the tasks using `RunKit/crabOverseer.py` and monitor the process. It is recommended to run `crabOverseer` in `screen`.

  ```sh
  python RunKit/crabOverseer.py --cfg Production/crab/overseer_cfg_HLT.yaml Production/crab/Run3Winter23_HLT/FILE1.yaml Production/crab/Run3Winter23_HLT/FILE2.yaml ...
  ```

  - Use `Production/crab/Run3Winter23_HLT/*.yaml` to submit all the tasks.
  - For more information about the available command line arguments run `python3 RunKit/crabOverseer.py --help`.
  - For consecutive runs, if there are no modifications in the configs, it is enough to run `crabOverseer` without any arguments:

    ```sh
    python RunKit/crabOverseer.py
    ```
This step represents a Shuffle and Merge procedure that aims to minimize the usage of physical memory as well as to provide direct control over the shape of the final (pt, eta) spectrum in a stochastic manner. Before running the following step on HTCondor for all the samples, it is essential to analyze and tune the parameters of ShuffleMergeSpectral.

The spectrum of the initial data is needed to calculate the probability to take a tau candidate of a given genuine tau type from a given pt-eta bin. To generate the spectrum histograms for a dataset and the lists with the number of entries per data file, run:
```sh
run_cxx $ANALYSIS_PATH/Preprocessing/CreateSpectralHists \
    --output "spectrum_file.root" \
    --output_entries "entries_file.txt" \
    --input-dir "path/to/dataset/dir" \
    --pt-hist "n_bins_pt, pt_min, pt_max" \
    --eta-hist "n_bins_eta, |eta|_min, |eta|_max" \
    --n-threads 1
```

At this step it is important to use a high-granularity binning, so that it can later be re-binned into custom, non-uniform pt-eta bins at the next step.
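For orientation, the sketch below shows how such a high-granularity pt-|eta| spectrum could be filled with `uproot` and `numpy`. This is an illustration of the idea, not the TauMLTools implementation; the tree and branch names (`taus`, `tau_pt`, `tau_eta`) are assumptions.

```python
# Illustrative sketch (not the CreateSpectralHists implementation): fill a
# high-granularity pt-|eta| histogram from a tuple file.
import numpy as np
import uproot

def fill_spectrum(path, pt_bins=(980, 20.0, 1000.0), eta_bins=(250, 0.0, 2.5)):
    with uproot.open(path) as f:
        arrays = f["taus"].arrays(["tau_pt", "tau_eta"], library="np")  # assumed names
    hist, pt_edges, eta_edges = np.histogram2d(
        arrays["tau_pt"],
        np.abs(arrays["tau_eta"]),
        bins=[np.linspace(pt_bins[1], pt_bins[2], pt_bins[0] + 1),
              np.linspace(eta_bins[1], eta_bins[2], eta_bins[0] + 1)],
    )
    return hist, pt_edges, eta_edges, len(arrays["tau_pt"])

# hist, *_, n_entries = fill_spectrum("eventTuple_1.root")
```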
Alternatively, if one wants to process several datasets, the following python script can be used (the pt and eta binning parameters are hardcoded in the script):

```sh
python $ANALYSIS_PATH/Preprocessing/CreateSpectralHists.py \
    --input /path/to/input/dir/ \
    --output /path/to/output/dir/ \
    --filter ".*(DY).*" \
    --rewrite
```
After this step, the spectra and the .txt files with the number of entries will be created in the output folder. To merge all the .txt files into one and shuffle the lines:

```sh
cat <path_to_spectrums>/*.txt | shuf - > filelist_mix.txt
```

Mixing is needed to have shuffled files within one data group. After this step, filelist_mix.txt should contain the filenames and the number of entries in the corresponding files:

```
./DYJetsToLL_M-50/eventTuple_1-70.root 249243
./TTJets_ext1/eventTuple_1-308.root 59414
./TTToHadronic/eventTuple_467.root 142781
./WJetsToLNu/eventTuple_25.root 400874
...
```
After the spectra are created for all datasets, the final Shuffle and Merge procedure can be performed with:

```sh
run_cxx Preprocessing/ShuffleMergeSpectral.cxx --cfg Analysis/config/2018/training_inputs_MC.cfg \
    --input filelist_mix.txt \
    --prefix prefix_string \
    --output <path_to_output_file.root> \
    --mode MergeAll \
    --n-threads 1 \
    --disabled-branches "trainingWeight, tauType" \
    --input-spec /afs/cern.ch/work/m/myshched/public/root-tuple-v2-hists \
    --pt-bins "20, 30, 40, 50, 60, 70, 80, 90, 100, 1000" \
    --eta-bins "0., 0.6, 1.2, 1.8, 2.4" \
    --tau-ratio "jet:1, e:1, mu:1, tau:1" \
    --lastbin-disbalance 100.0 \
    --lastbin-takeall true \
    --refill-spectrum true \
    --enable-emptybin true \
    --job-idx 0 --n-jobs 500
```
- `--cfg` is a configuration file where the data groups are constructed. Each line represents a data group in the format (name, dataset_dir_regexp, file_regexp, tau_types to consider in this group), e.g. `Analysis/config/2018/training_inputs_MC.cfg`.
- `--input` is the file containing the list of input files and entries (read line-by-line). The absolute path is not taken from this file, only `../dataset/datafile.root n_events` is read. The rest of the path (the path to the dataset folders) should be specified in the `--prefix` argument.
- `--prefix` is the prefix which will be placed before the path of each file read from `--input`. Please note that this prefix will not be placed before the `--input-spec` value. This value can include a remote path compatible with xrootd.
- The last pt bin is treated as the high-pt region: all entries from it are taken without rejection.
- `--tau-ratio "jet:1, e:1, mu:1, tau:1"` defines the proportion of tau types in the final root-tuple.
- `--refill-spectrum` recalculates the spectra of the input data on the fly; only events and files that correspond to the current job `--job-idx` are considered. It was found that, in case of heterogeneity within a data group, the common spectrum of the whole data group poorly represents the spectrum of a sub-chunk of this data group when `--n-jobs 500` is used, so the spectrum histograms need to be filled at the beginning of every job to obtain the required uniformity of the output.
- `--lastbin-takeall`: due to the poor statistics in the high-pt region, the option to take all candidates from the last pt bin is included (contrary to `--lastbin-takeall false`, where the requirement is that n_entries in the last bin equals n_entries in the other bins).
- `--lastbin-disbalance` is relevant in case of `--lastbin-takeall true`: it requires the ratio of (all entries up to the last bin) to (the number of entries in the last bin) not to be greater than the given number, otherwise fewer events are taken from all bins up to the last one.
- `--enable-emptybin`: in case of an empty pt-eta bin, the probability in this bin is set to 0 (needed when datasets are statistically poor or the number of jobs `--n-jobs` is big in `--refill-spectrum true` mode).
- `--n-jobs 500 --job-idx 0` define into how many parts to split the input files and the index of the corresponding job.
In order to find an appropriate binning and `--tau-ratio` given the available statistics, it might be useful to execute one job in `--refill-spectrum false --lastbin-takeall false` mode and study the output of the `./out/*.root` files. The `<DataGroupName>_n.root` files contain the number of entries in the required `--pt-bins` and `--eta-bins`. The `<DataGroupName>.root` files show the probability of accepting a candidate from the corresponding pt-eta bin.
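To make the role of these per-bin probabilities concrete, here is a minimal sketch of how acceptance probabilities could be derived from a spectrum histogram so that the accepted events come out approximately flat in pt-eta. This illustrates the idea only; it is not the ShuffleMergeSpectral implementation.

```python
# Illustrative sketch: turn a pt-eta spectrum (counts per bin) into acceptance
# probabilities that flatten the output spectrum.
import numpy as np

def acceptance_probabilities(counts, enable_emptybin=True):
    """counts: 2D array of entries per (pt, eta) bin."""
    counts = np.asarray(counts, dtype=float)
    filled = counts > 0
    if not np.any(filled):
        raise ValueError("spectrum is empty")
    # Target the same number of accepted events in every non-empty bin,
    # limited by the least populated (non-empty) bin.
    target = counts[filled].min()
    prob = np.zeros_like(counts)
    prob[filled] = target / counts[filled]
    if not enable_emptybin and not np.all(filled):
        raise RuntimeError("empty pt-eta bins encountered")
    return prob  # probability to accept a candidate falling into each bin

# Example: a 2x2 toy spectrum
print(acceptance_probabilities([[1000, 200], [400, 100]]))
```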
ShuffleMergeSpectral can be executed on condor through the law package. To run it, first install law following the instructions in the first section. Then, set up the environment:

```sh
law index
```

Jobs can be submitted by running

```sh
law run ShuffleMergeSpectral --version vx --params --n-jobs N
```

where `--params` are the ShuffleMergeSpectral parameters. In particular:

- `--input` has been renamed to `--input-path`
- `--output` has been renamed to `--output-path`

NOTA BENE: `--output-path` should be a full path, otherwise the output will be lost.

The full list of parameters accepted by law and ShuffleMergeSpectral can be printed with the commands:

```sh
ShuffleMergeSpectral --help
law run ShuffleMergeSpectral --help
```
Jobs are created by the script using the `--start-entry` and `--end-entry` parameters. Additional arguments can be used to control the condor submission:

- `--workflow local` will run locally. If omitted, condor will be used
- `--max-runtime`: condor runtime in hours
- `--max-memory`: condor RAM request in MB
- `--batch-name`: batch name to be used on condor. Default is "TauML_law"

At this point, the worker will start reporting the job status. As long as the worker is alive, it will automatically resubmit failed jobs. The worker can be killed with ctrl+C once all jobs have been submitted. Failed jobs can be resubmitted by running the same command used for the first submission (from the same working directory).

A data directory is created. This directory contains information about the jobs as well as the log, output and error files created by condor.
A validation can be run on the shuffled samples to ensure that different parts of the training set have compatible distributions. To run the validation tool, a ROOT version greater or equal to 6.16 is needed:

```sh
source /cvmfs/sft.cern.ch/lcg/views/LCG_97apython3/x86_64-centos7-clang10-opt/setup.sh
```

Then, run:

```sh
python TauMLTools/Production/scripts/validation_tool.py --input input_directory \
    --id_json /path/to/dataset_id_json_file \
    --group_id_json /path/to/dataset_group_id_json_file \
    --output output_directory \
    --n_threads n_threads \
    --legend > results.txt
```

The id_json (group_id_json) argument points to a json file containing the list of dataset names (dataset group names) and their hash values, used to identify them inside the shuffled ROOT tuples. These files are needed in order to create a unique identifier which can be handled by ROOT. They are produced at the shuffle and merge step. The script will create the directory "output_directory" containing the results of the test. Validation is run on the following distributions with a Kolmogorov-Smirnov test:

- dataset_id, dataset_group_id, lepton_gen_match, sampleType
- tau_pt and tau_eta for each bin of the above
- dataset_id for each bin of dataset_group_id

If a KS test is not successful, a warning message is printed on screen.
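For illustration, a KS-based compatibility check of the kind the validation tool performs could look like the sketch below. It assumes `scipy` and a pandas DataFrame with a `tau_pt` column; it is not the validation_tool.py implementation.

```python
# Illustrative sketch of a Kolmogorov-Smirnov compatibility check between two
# parts of a shuffled dataset; column names are assumptions for illustration.
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

def check_compatibility(df: pd.DataFrame, column="tau_pt", threshold=0.05):
    # Split the sample in two halves (e.g. first vs second part of the files)
    half = len(df) // 2
    first, second = df[column].iloc[:half], df[column].iloc[half:]
    stat, p_value = ks_2samp(first, second)
    if p_value < threshold:
        print(f"WARNING: KS test failed for '{column}' (p = {p_value:.3g})")
    return stat, p_value

# Toy usage
rng = np.random.default_rng(0)
toy = pd.DataFrame({"tau_pt": rng.exponential(50.0, size=10000) + 20.0})
check_compatibility(toy)
```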
The optional arguments are listed by running:

```sh
python TauMLTools/Production/scripts/validation_tool.py --help
```

A time benchmark is available here.
For the following training step, the pT-eta spectrum of these input events is needed and can be computed with the command:

```sh
CreateSpectralHists \
    --outputfile output_spectrum_file.root \
    --output_entries entries_per_file.txt \
    --input-dir /path/to/shuffle/and/merge/files \
    --pt-hist "980, 20, 1000" \
    --eta-hist "250, 0.0, 2.5" \
    --file-name-pattern ".*2_rerun\/Shuffle.*.root" \
    --n-threads 1
```

where:

- `--outputfile` will store the pT-eta spectrum needed for the training
- `--output_entries` will store the list of the shuffle-and-merge files and the number of entries per file
- `--input-dir` is the location of the shuffle-and-merge files
- `--pt-hist` and `--eta-hist` are the binnings used for the pT and eta variables
- `--file-name-pattern` is the regex pattern used to match the input file names
The path to the output file has to be specified in the yaml configuration file described in the following sections.
To speed up data loading, tensorflow datasets should be created from the S&M root files before running the training. This step is done by the create_dataset.py script.

The production of TF datasets can be parallelized (over input files) with law on HTCondor. The corresponding law task is RootToTF.py. This task imports the needed function from create_dataset.py. To start the conversion, run

```sh
law run RootToTF --cfg CFG --dataset-type DTYPE --version VERSION --environment conda --output-path OUTPUT
```

where CFG is the path to a yaml file such as create_dataset.yaml, DTYPE is the dataset type (validation, training) as specified in the CFG file, VERSION is the job batch identifier and OUTPUT is the destination folder.

NOTE: OUTPUT overrides the yaml configuration. This is needed to bypass the different filesystem on the HTCondor machine.
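As a rough illustration of what such a conversion involves (not the create_dataset.py implementation), the sketch below reads a few branches with `uproot` and saves them as a TensorFlow dataset. The tree/branch names and the use of `tf.data.Dataset.save` (available in TensorFlow >= 2.10) are assumptions.

```python
# Illustrative sketch: convert (part of) a ROOT tuple into a saved tf.data
# dataset. Tree and branch names are assumptions for illustration only.
import tensorflow as tf
import uproot

def root_to_tf(input_file, output_dir, branches=("tau_pt", "tau_eta")):
    with uproot.open(input_file) as f:
        arrays = f["taus"].arrays(list(branches), library="np")
    features = tf.stack([tf.convert_to_tensor(arrays[b], dtype=tf.float32)
                         for b in branches], axis=-1)
    dataset = tf.data.Dataset.from_tensor_slices(features).batch(256)
    dataset.save(output_dir)  # reload later with tf.data.Dataset.load(output_dir)

# root_to_tf("ShuffleMerge_0.root", "tf_datasets/ShuffleMerge_0")
```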
During the loading of data into the model, the values of each feature in the input grid tensor undergo a sequential transformation in the following order: subtract `mean`, divide by `std`, clamp to the range `[lim_min, lim_max]` (see the `Scale()` function in `DataLoader_main.h`). Here, `mean`, `std`, `lim_min`, `lim_max` are parameters unique to each input feature and they have to be precomputed for a given input data set prior to running the training. This can be done with the script `Training/python/feature_scaling.py`, for example, as follows:

```sh
python feature_scaling.py --cfg ../configs/training_v1.yaml --var_types TauFlat
```
- `--cfg` (str, required) is a relative path to the main yaml configuration file used for the training
- `--var_types` (list of str, optional, default: -1) is a list of variable types to run the computation on. They should be among those in the field `Features_all` of the main training cfg file.
The scaling procedure is further configured in the dedicated `Scaling_setup` field of the training yaml cfg file (`../configs/training_v1.yaml` in the example above). There, one needs to specify the following parameters:

- `file_path`, path to the input ROOT files (e.g. after the Shuffle & Merge step) which are used for the training
- `output_json_folder`, path to an output directory for storing the json files with the scaling coefficients
- `file_range`, list indicating the range of input files to be used in the computation, -1 to run on all the files in `file_path`
- `tree_name`, name of the TTree with data branches inside the input ROOT files
- `log_step`, number of files after which a log of the currently computed scaling parameters is written
- `version`, string added as a postfix to the output file names
Then, there are the `cone_definition` and `cone_selection` fields which define the configuration of the cone splitting. Scaling parameters are computed separately for constituents in the inner cone of the tau candidate (`constituent_dR <= dR_signal_cone`) and in the outer cone (`(constituent_dR > dR_tau_signal_cone) & (constituent_dR < dR_tau_outer_cone)`). Therefore, in `cone_definition` one should define the inner/outer cone dimensions and in `cone_selection` the variable names (per variable type) in the input `TTree` to be used to compute dR. The `cone_types` field also allows to specify, per variable type, the cones for which the script should compute the scaling parameters.

When it comes to variables, the scaling module shares the list of variables to be used with the main training module via the `Features_all` field of the `configs/training_v1.yaml` cfg file. Under this field, each variable type (TauFlat, etc.) stores a list of corresponding variables. Each entry in the list is a dictionary, where the key is the variable name, and the value is a list in the format `(selection_cut, aliases, scaling_type, *lim_params)`:
- `selection_cut`, string, cuts to be applied on the fly by `uproot` when loading the data (done on a feature-by-feature basis) into an `awkward` array. Since the `yaml` format allows for the usage of aliases, a few of those are defined in the `selection` field of the cfg file.
- `aliases`, dictionary, definitions of variables not present in the original file but needed e.g. for applying the selection. Added on the fly by `uproot` when loading the data.
- `scaling_type`, string, one of `["no scaling", "categorical", "linear", "normal"]`, see below for their definitions.
- `lim_params`, if passed, should be either a list of two numbers or a dictionary. Depending on the scaling type (see below), it specifies the range to which each variable should be clamped.
NB: Please note that, as of now, the selection criteria `selection_cut` are used only by the scaling module and are a duplicate of those specified internally inside the `DataLoader` class. One needs to ensure that those two are in sync with each other.
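For reference, the per-feature transformation described above (subtract `mean`, divide by `std`, clamp to `[lim_min, lim_max]`) corresponds to the following minimal sketch, written here in Python for illustration rather than the C++ `Scale()` implementation:

```python
# Illustrative Python equivalent of the per-feature scaling applied by the
# DataLoader: y = clamp((x - mean) / std, lim_min, lim_max).
import numpy as np

def scale_feature(x, mean, std, lim_min=-np.inf, lim_max=np.inf):
    x = np.asarray(x, dtype=float)
    return np.clip((x - mean) / std, lim_min, lim_max)

# Example: a "normal" feature clamped to +-5 sigma
print(scale_feature([10.0, 1000.0], mean=50.0, std=30.0, lim_min=-5, lim_max=5))
```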
The following scaling types are supported:

- For the `no scaling` and `categorical` cases, the output json file is automatically filled with `mean=0`, `std=1`, `lim_min=-inf`, `lim_max=inf` (no linear transformation and no clamping). Both cases are treated similarly and the "categorical" label is introduced only to distinguish those variables in the cfg file for a possible dedicated preprocessing in the future.
- The `linear` case assumes the variable needs to be first clamped to a specified range, and then linearly transformed to the range `[-1, 1]`. Here one should pass as `lim_params` in the cfg exactly the range to which they want to clamp the data first (recall that in `DataLoader` clamping is done as the last step). The corresponding `mean`, `std`, `lim_min`/`lim_max` following the DataLoader notation are derived and filled in the output json automatically in the function `init_dictionaries()` of `scaling_utils.py` (a worked sketch of this conversion is given below).
- The `normal` case assumes the variable is to be standardised by `mean` (subtract) and `std` (divide) and clamped to a specified range (in numbers of sigmas). Here one should pass `lim_params` as the number of sigmas to which they want to clamp the data after being scaled by mean and std (e.g. [-5, 5]). It is also possible to skip the `lim_params` argument; in that case `lim_min=-inf`, `lim_max=inf` are filled automatically.
Also note that `lim_params` (for both the linear and normal cases) can be a dictionary with keys "inner" and/or "outer" and values being lists of two elements as before. In that case `lim_min`/`lim_max` will be derived separately for each specified cone.
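As an illustration of the linear case (a sketch of the idea, not necessarily the exact `init_dictionaries()` code): to clamp a variable to a user-specified range `[a, b]` and map it linearly onto `[-1, 1]` with the convention `y = clamp((x - mean)/std, lim_min, lim_max)`, one consistent choice is `mean = (a + b)/2`, `std = (b - a)/2`, `lim_min = -1`, `lim_max = 1`:

```python
# Sketch: derive DataLoader-style scaling parameters for a "linear" feature
# that should be clamped to [a, b] and mapped onto [-1, 1].
def linear_scaling_params(a, b):
    assert b > a
    mean = 0.5 * (a + b)
    std = 0.5 * (b - a)
    return {"mean": mean, "std": std, "lim_min": -1.0, "lim_max": 1.0}

print(linear_scaling_params(0.0, 10.0))  # {'mean': 5.0, 'std': 5.0, ...}
```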
As one may notice, it is the "normal" features which require actual computation, since for the other types the scaling parameters can be derived easily from the specified `lim_params`. The computation of means and stds in that case is performed in an iterative manner: the input files are opened one after another and, for each variable, the sum of its values, the sum of its squared values and the counts of entries are aggregated as the files are being read. Then, every `log_step` number of files, means/stds are computed based on the sums and counts aggregated so far and, together with the other scaling parameters, are logged into a json file. This cumulative nature of the approach also allows for a more flexible scan of the data for its validation (e.g. by comparison of aggregated statistics, not necessarily mean/std, across file ranges). Moreover, for every file each variable's quantiles are stored, allowing for a validation of the scaling procedure (see the section below).
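The cumulative bookkeeping can be pictured with the following sketch (illustration only, not the feature_scaling.py code): sums, squared sums and counts are accumulated per file, and the running mean/std can be derived from them at any point.

```python
# Sketch of the cumulative mean/std computation over a list of per-file arrays.
import math

class RunningStats:
    def __init__(self):
        self.s, self.s2, self.n = 0.0, 0.0, 0

    def add_file(self, values):
        self.s += sum(values)
        self.s2 += sum(v * v for v in values)
        self.n += len(values)

    def mean_std(self):
        mean = self.s / self.n
        var = max(self.s2 / self.n - mean * mean, 0.0)
        return mean, math.sqrt(var)

stats = RunningStats()
for chunk in ([1.0, 2.0, 3.0], [4.0, 5.0]):  # stand-ins for per-file arrays
    stats.add_file(chunk)
    print(stats.mean_std())  # running estimate after each "file"
```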
The result of running the scaling script is a set of log json files, further referred to as snapshots (e.g. `scaling_params_v*_log_i.json`), where each file corresponds to the computation of mean/std/lim_min/lim_max after having accumulated sums/sums2/counts for `i*log_step` files; a json file (`scaling_params_v*.json`) which corresponds to the processing of all given files; and a json file storing the variables' quantiles per file (`quantile_params_v*.json`). `scaling_params_v*.json` should then be provided to the `DataLoader` in the training step to perform the scaling of the inputs.
Since the feature scaling computation follows a cumulative approach, one may be interested to see how the estimates of mean/std converge to stable values from one snapshot to another as more data is added. The convergence of the approach can be validated with `Training/python/plot_scaling_convergence.py`, e.g. for the TauFlat variable type:

```sh
python plot_scaling_convergence.py --train-cfg ../configs/training_v1.yaml -p /afs/cern.ch/work/o/ofilatov/public/scaling_v2/json/all_types_10files_log1 -pr scaling_params_v2 -n 1 -t -1 -c global -c inner -c outer -o scaling_plots
```

- `--train-cfg`, path to the yaml configuration file used for training
- `-p/--path-to-snapshots`, path to the directory with the scaling json snapshot files
- `-pr/--snapshot-prefix`, prefix used in the name of the json files to tag a scaling version
- `-n/--nfiles-step`, number of processed files per log step (`log_step` parameter), as it was set while running the scaling computation
- `-t/--var-types`, variable types for which the scaling parameters are to be plotted, -1 to run on all of them
- `-c/--cone-types`, cone types for which the scaling parameters are to be plotted
- `-o/--output-folder`, output directory to save the plots
This will produce a series of plots which combine all the variables of the given type into a boxplot representation and show its evolution as an ensemble throughout the scaling computation. Please have a look at this presentation for more details on the method.
Furthermore, once the scaling parameters are computed, one might be interested to see how the computed clamping range for a given feature relates to the actual quantiles of the feature distribution. Indeed, clamping might significantly distort the distribution if the derived (in the "normal" case) or specified (in the "linear" case) clamping range is not wide enough, i.e. doesn't cover the bulk of the distribution. The comparison of clamping and quantile ranges can be performed with `Training/python/plot_quantile_ranges.py`:

```sh
python plot_quantile_ranges.py --train-cfg ../configs/training_v1.yaml --scaling-file /afs/cern.ch/work/o/ofilatov/public/scaling_v2/scaling_params_v2.json --quantile-file /afs/cern.ch/work/o/ofilatov/public/scaling_v2/quantile_params_v2.json --file-id 0 --output-folder quantile_plots/fid_0 --only-suspicious False
```

- `--train-cfg`, path to the yaml configuration file used for training
- `--scaling-file`, path to the json file with the scaling parameters
- `--quantile-file`, path to the json file with the variables' quantiles
- `--file-id`, id (integer from `range(len(input_files))`) of the input ROOT file. Since the quantile ranges of variables are derived per input file, those of the `file-id`-th file will be used for plotting.
- `--output-folder`, output directory to save the plots
- `--only-suspicious`, whether to save only suspicious plots. Please have a look at the suspiciousness criteria in `plot_quantile_ranges.py`.

This will produce plots of the clamping range side-by-side with the quantile ranges of the corresponding feature's distribution, so that it is possible to check whether the former overshoots/undershoots the actual distribution. However, please note that this representation depends on the input file (`--file-id` argument), so it is advisable to check whether the quantiles are robust against a file change. For more details please have a look at this presentation of the method.
As an alternative to the interactive mode described above, the scaling parameter computation can be distributed over multiple jobs and run on HTCondor. The code run on HTCondor is the same that would be run interactively. To run with law, first activate the right conda environment (note: it's recommended to have law installed via conda's pip), then:

```sh
law index
```

then run the job submission

```sh
law run FeatureScaling --version version_tag --environment conda --cfg /path/to/yaml/cfg.yaml --output-path /path/to/dir/ --files-per-job M --n-jobs N
```

where `--version` specifies the job submission version (used by law to monitor the status of the jobs), `--environment` specifies the shell environment to be used (conda in this case, see TauMLTools/bootstrap.sh), `--cfg` is the yaml configuration file (the same used in interactive mode), `--files-per-job` specifies the number of files processed by each job, `--n-jobs` specifies the number of jobs to run, and `--output-path` specifies the path to the output directory, used to save the results.

The law task run by the code above (stored in TauMLTools/LawWorkflows/FeatureScaling.py) implements the same script used locally, with the following caveats:

- the number of files per job is specified by the `--files-per-job` argument
- the total number of jobs is defined by the `--n-jobs` argument, which, together with the `--files-per-job` argument, determines the total number of files read from the input path. Leave this at the default value (0) to run on all the input files with `--files-per-job` files per job
- the yaml configuration parameters `Scaling_setup/file_range` and `Scaling_setup/log_step` are ignored when running on law
- the output directory is specified using an argument, so the corresponding configuration in the .yaml file is ignored.
The results from each single job can be merged together to obtain the final set of scaling parameters, using the interactive python script TauMLTools/Training/python/merge_feature_scaling_jobs.py, as follows:

```sh
python merge_feature_scaling_jobs.py --output /path/to/dir --input "job*/scaling_params_v5.json"
```

where `--output` determines the output directory which stores the merged results and `--input` is a string pointing to the results of the single jobs (using a glob pattern, as in the example above; NOTE: use the quotation marks!).

In order to create convergence plots using the `Training/python/plot_scaling_convergence.py` script described in the main section above, merge_feature_scaling_jobs.py accepts the `--step` int argument. If this value is specified, the script will create intermediate output files merging an increasing number of jobs with a step equal to `--step` (e.g., if `--step` is equal to 10, the first intermediate output file will merge 10 jobs, the second 20, the third 30, and so on).
In order to have an organized storage of models and their associated files, mlflow is used from the training step onwards. At the training step, it takes care of logging the necessary configuration parameters used to run a given training, plus additional artifacts, i.e. associated files like model/cfg files or output logs. Conceptually, mlflow augments the training code with additional logging of requested parameters/files whenever it is requested. Please note that hereafter the mlflow notions of a run (a single training) and an experiment (a group of runs) will be used.

The script to perform the training is TauMLTools/Training/python/Training_CNN.py. It is supplied with a corresponding TauMLTools/Training/python/train.yaml configuration file which specifies the input configuration parameters. Here, hydra is used to compose the final configuration from the specification inside `train.yaml` and, optionally, from the command line (see below); the result is then passed as an OmegaConf dictionary to the `main()` function.
The specification consists of:

- the main training configuration file `TauMLTools/Training/configs/training_v1.yaml`, describing the `DataLoader` and model configurations, which is "imported" by hydra as a defaults list
- mlflow parameters, specifically `path_to_mlflow` (the folder storing mlflow runs, usually with the default name `mlruns`) and `experiment_name`. These two parameters are needed either to create an mlflow experiment with the given `experiment_name` (if it doesn't exist) or to find the experiment to which the current run should be attributed
- `scaling_cfg`, which specifies the path to the json file with the feature scaling parameters
- `gpu_cfg`, which sets the gpu-related parameters
- `log_suffix`, used as a postfix in the output directory names

Also note that one can conveniently add/remove/override any configuration items from the command line. Otherwise, running `python Training_CNN.py` will use the default values as they are composed by hydra based on `train.yaml`.
An example of running the training would be:

```sh
python Training_CNN.py experiment_name=run3_cnn_ho2 'training_cfg.SetupNN.tau_net={ activation: "PReLU", dropout_rate: 0.2, reduction_rate: 1.4, first_layer_width: "2*n*(1+drop)", last_layer_width: "n*(1+drop)" }' 'training_cfg.SetupNN.comp_net={ activation: "PReLU", dropout_rate: 0.2, reduction_rate: 1.6, first_layer_width: "2*n*(1+drop)", last_layer_width: "n*(1+drop)" }' 'training_cfg.SetupNN.comp_merge_net={ activation: "PReLU", dropout_rate: 0.2, reduction_rate: 1.6, first_layer_width: "n", last_layer_width: 64 }' 'training_cfg.SetupNN.conv_2d_net={ activation: "PReLU", dropout_rate: 0.2, reduction_rate: null, window_size: 3 }' 'training_cfg.SetupNN.dense_net={ activation: "PReLU", dropout_rate: 0.2, reduction_rate: 1, first_layer_width: 200, last_layer_width: 200, min_n_layers: 4 }'
```

Here, a new mlflow run will be created under the "run3_cnn_ho2" experiment (if the experiment doesn't exist, it will be created). Then, the model is composed and compiled and the training proceeds via the usual `fit()` method with a `DataLoader` class instance used as a batch yielder. Several callbacks are also implemented to monitor the training process, specifically the `CSVLogger`, `TimeCheckpoint` and `TensorBoard` callbacks. Lastly, note that the parameters related to the NN setup and initially specified in `training_v1.yaml` are overridden via the command line (`training_cfg.SetupNN.{param}=...`).
Furthermore, for the sake of convenience, submission of multiple trainings in parallel to the batch system is implemented as a dedicated law task. As an example, running the following commands will set up law and submit the trainings specified in `TauMLTools/Training/configs/input_run3_cnn_ho1.txt` to `htcondor`:

```sh
source setup.sh
law index
law run Training --version input_run3_cnn_ho1 --workflow htcondor --working-dir /nfs/dust/cms/user/mykytaua/softDeepTau/DeepTau_master_hyperparam-law/TauMLTools/Training/python/2018v1 --input-cmds /nfs/dust/cms/user/mykytaua/softDeepTau/DeepTau_master_hyperparam-law/TauMLTools/Training/configs/input_run3_cnn_ho1.txt --conda-path /nfs/dust/cms/user/mykytaua/softML/miniconda3/bin/conda --max-memory 40000 --max-runtime 20
```
It might be the case that trainings from a common mlflow experiment are distributed among several people (e.g. for the purpose of hyperparameter optimisation), so that each person runs their fraction on their personal/institute hardware, and then all the mlflow runs are to be combined together under the same folder. Naively copying and merging the `mlruns` folders under a common folder wouldn't work, because several parameters in the `meta.yaml` files (mlflow internal cfg files stored per run and per experiment in the corresponding folders) are not synchronised across the team's mlflow experiments. Specifically, in each of those files the following parameters require changes (see the sketch after this list for the kind of edit involved):

- `artifact_location` (`meta.yaml` for each experiment) needs to be set to a common merged directory, and the same change needs to be propagated to `artifact_uri` (`meta.yaml` for each run).
- `experiment_id` needs to be set to a new common experiment ID (`meta.yaml` for each experiment and for each run).
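A minimal sketch of such an edit, assuming PyYAML; this only illustrates what set_mlflow_paths.py automates, and the `meta.yaml` key names used here are assumptions about the mlflow file layout:

```python
# Illustrative sketch: rewrite artifact_uri and experiment_id in a run-level
# meta.yaml so that it points to the new shared mlruns location.
import yaml

def rewrite_run_meta(meta_path, new_path_to_exp, new_exp_id=None):
    with open(meta_path) as f:
        meta = yaml.safe_load(f)
    run_id = meta["run_id"]  # assumed key in the mlflow run meta.yaml
    meta["artifact_uri"] = f"{new_path_to_exp}/{run_id}/artifacts"
    if new_exp_id is not None:
        meta["experiment_id"] = new_exp_id
    with open(meta_path, "w") as f:
        yaml.safe_dump(meta, f)

# rewrite_run_meta("mlruns/1/<run_id>/meta.yaml",
#                  new_path_to_exp="/shared/mlruns/5", new_exp_id="5")
```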
These points are not a problem if the end goal is simple file storage. However, mlflow is being used in the current training procedure mostly to improve the UX with the dedicated mlflow UI, for better navigation through the trainings and their convenient comparison. In that case, mlflow UI requires a properly structured `mlruns` folder to correctly read and visualise the run-related information.
For the purpose of preparing an individual's `mlruns` folder for merging, the `set_mlflow_paths.py` script was written. What it does is:

- recursively go through the folders in the user-specified `path_to_mlflow` and the corresponding experiment `exp_id` therein, open the experiment-related `meta.yaml` and reset there `artifact_location` to `new_path_to_exp`, which is defined as the concatenation of the user-specified `new_path_to_mlflow` and `exp_id` or `new_exp_id` (if passed)
- open the run-related `meta.yaml` files and reset the `artifact_uri` key to `new_path_to_exp`, plus reset `experiment_id` to `new_exp_id` (if passed)
- in case the `new_exp_id` argument is provided, rename the whole experiment folder to this new ID
It has the following input arguments:

- `-p`, `--path-to-mlflow`: (required) path to the local folder with mlflow experiments
- `-id`, `--exp-id`: (required) experiment id in the specified mlflow folder to be modified
- `-np`, `--new-path-to-mlflow`: (required) new path to be set throughout the `meta.yaml` configs for the specified mlflow experiment
- `-nid`, `--new-exp-id`: (optional, default=None) if passed, also reset the current experiment id
- `-nn`, `--new-exp-name`: (optional, default=None) if passed, also reset the current experiment name
Therefore, the following workflow is suggested:

- One runs their fraction of a common mlflow experiment at the machine(s) of their choice. At this point it is not really important to have a common experiment name/experiment ID across the whole team, since it will be unified afterwards anyway. The outcome of this step is a local folder `${WORKDIR}/mlruns/${EXP_ID}` which contains the trainings as mlflow run folders, plus the experiment-related `meta.yaml`.
- The team agrees that the combined experiment name is `${NEW_EXP_NAME}` with a corresponding ID `${NEW_EXP_ID}`, and that the directory where the runs distributed across the team are to be stored is `${SHARED_DIR}/mlruns`. Below we assume that this directory is located on `lxplus`.
- Everyone in the team runs the `set_mlflow_paths.py` script at the machine where the training was happening:

  ```sh
  python set_mlflow_paths.py -p ${WORKDIR}/mlruns -id ${exp_id} -np ${SHARED_DIR}/mlruns -nid ${NEW_EXP_ID} -nn ${NEW_EXP_NAME}
  ```

- Everyone in the team copies the runs of the shared experiment from their local machines to `${SHARED_DIR}`, e.g. using `rsync` (note that below the `--dry-run` option is on, remove it to do the actual syncing):

  ```sh
  rsync --archive --compress --verbose --human-readable --progress --inplace --dry-run ${WORKDIR}/mlruns ${USERNAME}@lxplus.cern.ch:${SHARED_DIR}
  ```

  NB: this command will synchronise all experiment folders inside `${WORKDIR}/mlruns` with those of `${SHARED_DIR}/mlruns`. Make sure that this won't cause troubles and clashes for experiments other than `${EXP_ID}`.
- Now, it should be possible to bring up the mlflow UI and inspect the merged runs altogether:

  ```sh
  cd ${SHARED_DIR}
  mlflow ui -p 5000 # 5000 is the default port
  # forward this port to a machine with graphics
  ```
In order to check the stability and convergence of the learning process, tensorboard monitoring is used in the training script. To compare all existing runs in the `mlruns` folder, the following command can be used:

```sh
./Training/script/run_tensorboard.sh <PATH_TO_MLRUNS_FOLDER>
```

To check the final training/validation loss function and the corresponding model info:

```sh
./Training/script/print_results.sh <PATH_TO_MLRUNS_FOLDER>
```
The first step in the performance evaluation pipeline is producing predictions for a given model and a given data sample. This can be achieved with the Evaluation/apply_training.py script, whose input arguments are configured in Evaluation/configs/apply_training.yaml. It is important to mention that at this step the `DataLoader` class is also used to yield batches for the model, therefore there are a few parameters which configure its behaviour:

- `training_cfg_upd` describes parameters which will be overridden in the `DataLoader` class initialisation. For example, for evaluation all taus should be included in the batches, hence `include_mismatched=True`. Also, the train/val split is not needed compared to the training step, so `validation_split=0.`. Note that the base configuration is by default taken from the mlflow logs and is therefore the one which was used for the training (`path_to_training_cfg` key).
- `scaling_cfg` describes the path to the file with json scaling parameters and by default points to the mlflow logs, i.e. the file which was used for the training.
- `checkout_train_repo` describes whether a checkout of the git repo state as it was during the training (fetched from the mlflow logs) should be made. This might be needed because the current repo state can include developments in `DataLoader` not present at the training step, so the `import DataLoader` statement may not work. Moreover, this ensures that the same `DataLoader` is used for training and evaluation.
The next group of parameters in `apply_training.yaml` is the mlflow group (`path_to_mlflow/experiment_id/run_id`), which describes which mlflow run ID should be used to retrieve the associated model and to store the resulting predictions.

The remaining two arguments, `path_to_file` and `sample_alias`, describe the I/O naming. `apply_training.py` works in a single-file mode, so it expects one input ROOT file located at `path_to_file` and will output one h5 prediction file, stored under the specified mlflow run ID in `artifacts/predictions/{sample_alias}/{basename(input_file_name)}_pred.h5`. So `sample_alias` here describes the sample to which the file belongs (e.g. DY, or ggH). This `sample_alias` will be needed for reference in the following steps of the eval pipeline.

As a last remark, `apply_training.py` automatically stores the mapping between the input file and the prediction file. This is kept in `artifacts/predictions/{sample_alias}/pred_input_filemap.json`, and this mapping will be used downstream to automatically retrieve the corresponding input files (not logged to mlflow) for the mlflow-logged predictions.

An example of usage of `apply_training.py` would be:

```sh
python apply_training.py path_to_mlflow=../Training/python/2018v1/mlruns experiment_id=2 run_id=1e6b4fa83d874cf8bc68857049d7371d path_to_file=eval_data/GluGluHToTauTau_M125/GluGluHToTauTau_M125_1.root sample_alias=GluGluHToTauTau_M125
```
The second step in the evaluation pipeline is Evaluation/evaluate_performance.py, with the corresponding main hydra cfg file being Evaluation/configs/eval/run3.yaml. This step involves a complex configuration setup in order to cover the variety of use cases in a generalised way. For the history of its development and more details please refer to these presentations [1], [2], [3].

The first definition in `run3.yaml` is that of the default lists `discriminator` and `selection`. These are essentially other yaml files from the `configs/eval` folder whose content becomes part of the final hydra configuration. `discriminator` is in fact a subfolder of `configs/eval` which contains the yaml configurations to be selected from. These configurations are used to initialise the `Discriminator` class, which is used internally to uniformly manage the ROC curve construction and the metrics/param storage for various (heterogeneous) discriminator types. `selection.yaml` stores aliases for different selections to be applied to the input files, so that they can be easily referred to in `run3.yaml` (see the `cuts` parameter).
As usual, the next section in `run3.yaml` describes the mlflow parameters (`path_to_mlflow/experiment_id/run_id`) to communicate the reading/writing of the necessary files/parameters for a specified mlflow run ID.

Then there is a section defining a so-called "phase space region". Conceptually, what the `evaluate_performance.py` step does is take a given model and compute the metrics (currently, ROC curves only) for a specified region of parameters. These are at the moment `vs_type`, `dataset_alias` and `pt_bins`/`eta_bins`/`dm_bins`:

- `vs_type` is conventionally used to define the tau types against which the ROC curve should be constructed, e.g. tau vs jet (`vs_type=jet`). Based on this label, different selection/plotting criteria will be triggered across the module, please check them throughout the discriminator/plotting/ROC curve composition steps. Moreover, as described below, in principle a mixture of particle types can also be served as the "vs leg".
- `dataset_alias` defines how to refer to a mixture of `input_samples` at the downstream plotting steps.
- `pt_bins`/`eta_bins`/`dm_bins` specify in which bins (of pt, eta, and HPS decay mode) the ROC curve should be computed, which will internally apply a corresponding cut on the combined mixed dataframe. This bin definition will also be used as an alias to retrieve the corresponding ROC curve for plotting.
As was mentioned before, one can also apply a selection to the events before producing the ROC curves. This is specified in `cuts` as a string, which is then passed to the events DataFrame `query()` method. There is also an option to require events to pass specified working points (WPs) (`WPs_to_require`, a dictionary `tau_type -> wp_name`), where the working point configuration (either thresholds or a WP column) is taken as specified in the discriminator yaml file.
The last important section is dedicated to I/O. There are three ingredients to it: inputs, predictions and targets, which can differ depending on the use case.

- In order to compute ROC curves, a dataset containing genuine taus and e/mu/jet fakes is needed. The idea behind the construction of such a dataset is to combine various sources together, selecting specific tau types from each of them. The `input_samples` dictionary describes exactly this kind of mapping, in the form `sample_alias` -> `['tau_type_1', 'tau_type_2', ...]`. By default it assumes that `sample_alias` is an alias defined at the `apply_training.py` step (to refer to the mlflow-logged predictions automatically), but this doesn't necessarily have to be the case. The `tau_type_*` specified in the mapping are used to apply matching to a feature with the name `gen_{tau_type}`. If `path_to_target` is provided, this will be taken from there (see the `add_group()` function in `eval_tools.py`); otherwise it is assumed that this branch is already present in the input files (and added via the `input_branches` argument in `run3.yaml`).
- Having defined a sample mixture, `evaluate_performance.py` loops over those samples and then, for each `sample_alias`, constructs the file lists with input/prediction/target files. The common template for the file paths per `sample_alias` is specified via `path_to_input/path_to_pred/path_to_target`, which supports placeholders in its name (`{sample_alias}`) and the "*" asterisk.
- These file lists are constructed in the function `prepare_filelists()` of `eval_tools.py`. Conceptually, if present, it will expand `path_to_*` into a file list via `glob` and sort it according to a number id present in the file name (see the `path_splitter()` function). The only non-trivial exception is the case when `path_to_pred` points to mlflow artifacts. Here, the input <-> prediction mapping from `artifacts/predictions/{sample_alias}/pred_input_filemap.json` will be used to fetch the input samples corresponding to the specified `path_to_pred` (and hence `path_to_input` is ignored).
Given all the information above, the metrics (e.g. ROC curves) are computed within the `Discriminator` and `RocCurve` classes (see `Evaluation/utils/evaluation.py`) for a specified phase space region (`vs_type/dataset_alias/pt_bins/eta_bins/dm_bins`). They are then represented as a dictionary and dumped into the `.../artifacts/performance.json` file, which collects all the "snapshots" of the model evaluation across various input configurations, e.g. ROC curve TPR/FPR and discriminator info. It is this file where the downstream plotting modules will search for entries to be retrieved and visualised.
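As a reminder of what such a snapshot contains, a minimal ROC computation over a mixed genuine-tau/fake dataset could look like the sketch below (using `sklearn` on toy scores for illustration; the actual `RocCurve` class has its own implementation and json layout):

```python
# Illustrative sketch: compute TPR/FPR for a tau-vs-jet score and dump them in
# a performance.json-like structure. Not the RocCurve implementation.
import json
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(1)
# Labels: 1 = genuine tau, 0 = jet fake; scores: discriminator output vs jet.
labels = np.concatenate([np.ones(5000), np.zeros(5000)])
scores = np.concatenate([rng.beta(5, 2, 5000), rng.beta(2, 5, 5000)])

fpr, tpr, thresholds = roc_curve(labels, scores)
snapshot = {
    "vs_type": "jet",
    "dataset_alias": "toy",
    "roc_curve": {"false_positive_rate": fpr.tolist(),
                  "true_positive_rate": tpr.tolist()},
}
with open("performance_toy.json", "w") as f:
    json.dump(snapshot, f)
```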
Let's consider a comparison of a new Run3 training with the already deployed DeepTau_v2 / MVA models. We will represent the DeepTau Run3 training as a ROC curve, DeepTau_v2 as a ROC curve with manually defined WPs, and for the MVA we will take the already centrally defined WPs. It should be noted that for the two latter cases the scores/WP bin codes are already present in the original input ROOT files used for the DeepTau Run3 training. That means there is nothing to run `apply_training.py` on. However, we still need to introduce these models to the mlflow ecosystem under the common `experiment_name=run3_cnn_ho2`, which can be done with `Training/python/log_to_mlflow.py`:

```sh
python log_to_mlflow.py path_to_mlflow=2018v1/mlruns experiment_name=run3_cnn_ho2 files_to_log=null params_to_log=null
```

This will create a new (empty) run ID under the `run3_cnn_ho2` mlflow experiment (this is where the DeepTau_run3 run ID is located), and we need to execute the command two times in total: once for the run to be identified with DeepTau_v2 and once for the MVA model. After this, let's consider the case of plotting a ROC curve for `vs_type=jet`, where genuine taus are to be taken from `sample_alias=GluGluHToTauTau_M125` and jets from `sample_alias=TTToSemiLeptonic`. These are the data samples with the corresponding process excluded from the training and set aside to evaluate the training performance. Suppose the corresponding input ROOT files are located in `path_to_input=Evaluation/eval_data/{sample_alias}/*.root` and the predictions for the DeepTau_run3 model were already produced for them with `apply_training.py` (hence stored in mlflow artifacts).
Then, since there are three models in total, which are moreover fundamentally different, we will identify them with three separate `discriminator` cfg files. The corresponding templates can be found in the `configs/eval/discriminator` folder, with the parameters steering the behaviour of the `Discriminator` class (defined in `Evaluation/utils/evaluation.py`). Please have a look there to understand the behaviour under the hood for each of the parameter values. Also note that at this point it might be useful to create a new yaml cfg file for a discriminator if its behaviour isn't described by these template config files.
Finally, given that `run3_cnn_ho2` -> `experiment_id=2` and the input samples cfg in `run3.yaml` is set as:

```yaml
input_samples:
  GluGluHToTauTau_M125: ['tau']
  TTToSemiLeptonic: ["jet"]
```
evaluating the performance for DeepTau_run3 can be done with (run IDs/paths below are specific to this example only):

```sh
python evaluate_performance.py path_to_mlflow=../Training/python/2018v1/mlruns experiment_id=2 run_id=06f9305d6e0b478a88af8ea234bcec20 discriminator=DeepTau_run3 path_to_input=null 'path_to_pred="${path_to_mlflow}/${experiment_id}/${run_id}/artifacts/predictions/{sample_alias}/*_pred.h5"' 'path_to_target="${path_to_mlflow}/${experiment_id}/${run_id}/artifacts/predictions/{sample_alias}/*_pred.h5"' vs_type=jet dataset_alias=ggH_TT
```

for DeepTau_v2 (note the changed paths, where the inputs are specified manually and the targets are taken from the DeepTau_run3 logged predictions, and also the change to `wp_from=pred_column` to use the manually defined WPs from `working_points_thrs`):

```sh
python evaluate_performance.py path_to_mlflow=../Training/python/2018v1/mlruns experiment_id=2 run_id=90c83841fe224d48b1581061eba46e86 discriminator=DeepTau_v2p1 'path_to_input="eval_data/{sample_alias}/*.root"' path_to_pred=null 'path_to_target="${path_to_mlflow}/${experiment_id}/06f9305d6e0b478a88af8ea234bcec20/artifacts/predictions/{sample_alias}/*_pred.h5"' vs_type=jet dataset_alias=ggH_TT discriminator.wp_from=pred_column
```

for MVA:

```sh
python evaluate_performance.py path_to_mlflow=../Training/python/2018v1/mlruns experiment_id=2 run_id=d2ec6115624d44c9bf60f88460b09b54 discriminator=MVA_jinst_vs_jet 'path_to_input="eval_data/{sample_alias}/*.root"' path_to_pred=null 'path_to_target="${path_to_mlflow}/${experiment_id}/06f9305d6e0b478a88af8ea234bcec20/artifacts/predictions/{sample_alias}/*_pred.h5"' vs_type=jet dataset_alias=ggH_TT
```

Now one can inspect the `performance.json` files in the corresponding mlflow run artifacts to get an intuition of what the skimmed performance info looks like. For example, since internally a WP set and a ROC curve are defined and treated as instances of the same `RocCurve` class, the output in `performance.json` for the MVA model looks structurally the same as for DeepTau_run3, although for the former we just plot a set of working points and for the latter the whole ROC curve.
The third step in the evaluation pipeline is Evaluation/plot_roc.py with the corresponding Evaluation/configs/plot_roc.yaml cfg file. In the latter one needs to specify:

- the mlflow `experiment_id`, which assumes that all run IDs below belong to this experiment ID
- `discriminators`: a dictionary mapping `run_id` -> `disc_cfg` (see the example below), with `curve_types` being a list of either `roc_curve` or `roc_wp` (or both), describing which types of ROC curves should be plotted
- `reference`: a pair `run_id`: `curve_type` which will be used as the reference curve to plot the ratio for the other discriminants
- `vs_type/dataset_alias/pt_bin/eta_bin/dm_bin`: parameters identifying the region of interest. These are used to retrieve the corresponding entries in the `performance.json` file for each of the runs to be plotted, see the `Evaluation/utils/evaluation.select_curve()` function which does that
- `output_dir`, `output_name`: the folder name (within mlflow artifacts) and the file name of the output pdf, respectively

Note the usage of the default list `plot_setup`, which means that the parameters from `Evaluation/configs/plot_setup.yaml` will be used to directly instantiate the class `PlotSetup`, defined in `Evaluation/utils/evaluation.py`. It is this class that defines the style of the ROC curve plots, so the parameters in the yaml file should be set according to the desired style. Additionally, some plotting parameters can be specified on a per-discriminator basis (as the `plot_cfg` dictionary per discriminator entry).
Continuing the example of the previous section, setting the following in `plot_roc.yaml`:

```yaml
path_to_mlflow: ../Training/python/2018v1/mlruns
experiment_id: 2

discriminators:
  06f9305d6e0b478a88af8ea234bcec20:
    name: "DeepTau (v2.5)"
    curve_types: ['roc_curve', 'roc_wp']
    plot_cfg:
      color: black
      dashed: false
      alpha: 1.
      dots_only: false
      with_errors: true

  90c83841fe224d48b1581061eba46e86:
    name: "DeepTau (v2.1)"
    curve_types: ['roc_curve', 'roc_wp']
    plot_cfg:
      color: grey
      dashed: false
      alpha: 1.
      dots_only: false
      with_errors: true

  d2ec6115624d44c9bf60f88460b09b54:
    name: "MVA vs. jet (JINST)"
    curve_types: ['roc_wp']
    plot_cfg:
      color: silver
      dashed: false
      alpha: 1.
      dots_only: true
      with_errors: true

reference:
  06f9305d6e0b478a88af8ea234bcec20: 'roc_curve'
```
and running (note some parameters which are being overridden):

```sh
python plot_roc.py vs_type=jet dataset_alias=ggH_TT 'pt_bin=[20,100]' 'eta_bin=[0,2.3]' 'dm_bin=[0, 1, 2, 10, 11]' output_dir="plots/test" 'output_name="vs_${vs_type}_${dataset_alias}_pt_${pt_bin[0]}_${pt_bin[1]}_eta_${eta_bin[0]}_${eta_bin[1]}_dm_"' 'plot_setup_map.jet.ylim=[2e-4,1]' 'plot_setup_map.jet.ratio_ylim=[0.5,1.1]'
```

will produce a plot with the specified models drawn side-by-side and will store it in `mlruns/2/06f9305d6e0b478a88af8ea234bcec20/artifacts/plots/`, which can then be viewed directly or through the mlflow UI.
For a given model, one usually has to define working points (WPs): that is, to derive thresholds on the model output scores yielding a classification with predefined genuine tau efficiencies in a specified region of the phase space. This can be done with the script Evaluation/derive_wp.py and the corresponding cfg file Evaluation/configs/derive_wp.yaml. The script contains the definition of a `WPMaker` class, which takes care of the computation procedure, plus the code to run it, while the yaml file specifies the arguments to the class and to the `create_df()` function defined in Evaluation/utils/data_processing.py. The computation follows this logic (a numerical sketch of the score and threshold computation is given after the list):
- Take genuine taus (`create_df.tau_type_to_select=tau`) from all the samples in the dictionary `create_df.pred_samples` (files are read based on `create_df.pred_samples.{sample_name}.filename_pattern`) within `create_df.path_to_preds` (by default assumed to be within mlflow artifacts, but it can be specified differently) which pass `create_df.selection`. Feature values used for the selection (e.g. `tau_pt`) will be taken either from the input file corresponding to every prediction file (`create_df.add_columns_from: inputs`), using the mapping from `pred_input_filemap.json`, or from the prediction file itself (`create_df.add_columns_from: predictions`). The name of the object to select from the input file (either a TTree or an hdf group) is specified via `create_df.group_or_tree_name`. The group names and column prefixes, needed to extract predictions and labels from the input files, can also be configured via the corresponding keys in the `create_df` entry.
- `score_vs_{type}` is computed for each tau as usual as `score_tau / (score_tau + score_vs_type)`. If `reweight_to_lumi` isn't set to `null`, a weight is also assigned to the taus from each sample according to the fraction of luminosity `sample_lumi` assigned to it (i.e. `lumi_weight = sample_lumi / (reweight_to_lumi * N_taus_in_sample)`).
- The function `WPMaker.update_thrs()` computes, for each `score_vs_{type}`, the thresholds corresponding to its weighted quantiles given by the `tpr` values. `tpr` is defined as a grid of evenly-spaced values with `step=wp_maker.tpr_step`. The thresholds which yield efficiencies closest to the values specified in `wp_maker.wp_definitions.{type}.wp_eff` are selected.
- The procedure is repeated until the threshold values converge within `wp_maker.epsilon` for all `vs_types`. If `wp_maker.require_wp_vs_others=False`, it should take only 2 iterations (no self-dependency). Otherwise, the dependency on passing the loosest WPs vs the remaining tau types is introduced in the definition of the WP vs the current type, so convergence will require more than 2 iterations (usually it takes 5-8 iterations).
- The resulting json file with the thresholds is logged to the mlflow artifacts of the specified `run_id`.
An example of running the script, assuming `pred_samples` is already specified as needed in `derive_wp.yaml`, would be:

```sh
python derive_wp.py 'create_df.path_to_mlflow=../Training/python/2018v1/mlruns/' create_df.experiment_id=4 create_df.run_id=e1f3ddb3c4c94128a34a7635c56322eb
```
Once the working points are derived, it is important to understand what the tau/fake efficiency looks like as a function of tau-related variables for each of the working points. Such plots can be produced with the script Evaluation/plot_wp_eff.py plus the config Evaluation/configs/plot_wp_eff.yaml.

The script first composes two datasets, one with genuine taus and the other with `vs_type` fakes. This uses the same `create_df()` function from `Evaluation/utils/data_processing.py`, which in this case however is called in a partial manner. For that, the core arguments `create_df.*` in the config, specifying the general paths/selection, are shared for each `tau_type`, while the `pred_samples` argument (a dictionary of dictionaries, `tau_type -> samples -> sample_cfg`) is factorised out of it, allowing for flexible separate calls per `tau_type`. If `from_skims=False`, those datasets will be logged to the mlflow artifacts of the specified run, in order to remove the need to produce the datasets each time (hence, for `from_skims=True` it will instead search for such datasets in `output_skim_folder` within the mlflow artifacts).
Then, the WP thresholds are retrieved from the json file in the mlflow artifacts and the `differential_efficiency()` function from `Evaluation/utils/wp_eff.py` is called. It computes, separately for genuine taus and fakes, the efficiency of passing the WPs in `var_bins` of `wp_eff_var`. Note that this parameter in the main `plot_wp_eff.yaml` is itself the name of another yaml config to be imported into the final hydra configuration from the `Evaluation/configs/wp_eff_var` folder. This folder contains a template set of yaml files, each describing the binning & plotting parameters for the corresponding variable.
It is the `efficiency()` function which counts the tau/fake objects passing the WPs and returns the corresponding efficiency with a confidence interval (`eff, eff_down, eff_up`). It should be mentioned that this computation can be done both with and without the assumption of passing some user-defined WPs: the arguments `require_WPs_in_numerator`/`require_WPs_in_denominator` set whether the objects entering the numerator/denominator of the efficiency formula should be required to pass `WPs_to_require`.
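For illustration, a per-bin efficiency with a binomial confidence interval could be computed as in the sketch below. A Clopper-Pearson interval is used here as one common choice; the actual `efficiency()` implementation may differ.

```python
# Sketch: per-bin efficiency of passing a WP with a Clopper-Pearson interval.
from scipy.stats import beta

def efficiency_with_interval(n_pass, n_total, cl=0.68):
    eff = n_pass / n_total
    alpha = 1.0 - cl
    lo = beta.ppf(alpha / 2, n_pass, n_total - n_pass + 1) if n_pass > 0 else 0.0
    hi = beta.ppf(1 - alpha / 2, n_pass + 1, n_total - n_pass) if n_pass < n_total else 1.0
    return eff, eff - lo, hi - eff  # (eff, down uncertainty, up uncertainty)

# Example: 80 of 100 taus in a pt bin pass the Medium WP
print(efficiency_with_interval(80, 100))
```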
Once the differential efficiencies are calculated, `plot_efficiency()` is called partially, with the core arguments taken from the `wp_eff_var` config and the others (the efficiencies) passed as partial arguments. This creates the corresponding plot and stores it within the mlflow artifacts as `output_filename`.

An example of running the script (assuming `pred_samples` is already specified as needed in `plot_wp_eff.yaml`) would be:

```sh
python plot_wp_eff.py vs_type=jet from_skims=True output_skim_folder=wp_eff/data/Run3 wp_eff_var=tau_eta 'create_df.path_to_mlflow=../Training/python/2018v1/mlruns/' create_df.experiment_id=4 create_df.run_id=e1f3ddb3c4c94128a34a7635c56322eb
```