This directory contains scripts to run the Prize-Collecting Steiner Forest
(PCSF) and Temporal Pathway Synthesizer (TPS) algorithms back-to-back. All
scripts must be run from the main tps
directory not this pcsf
subdirectory.
For example, call pcsf/submit_wrapper.sh
not submit_wrapper.sh
.
The PCSF-TPS pipeline initially supports only Linux and Mac OS X.
- Omics Integrator
- msgsteiner (see Omics Integrator for installation instructions)
- pandas Python package (optional)
- R package
BiRewire
(network randomization only)
The pandas package is only required to generate PCSF prizes from the TPS
input files with generate_prizes.sh
.
To run PCSF and generate an input network for TPS using the EGF response example dataset:
- Install Omics Integrator, msgsteiner, and optionally pandas.
- Optionally run
generate_prizes.sh
to generate prizes for PCSF from the TPS score files and peptide-protein map file. This will overwrite the existing prize filedata/pcsf/egfr-prizes.txt
and is only used to illustrate the script usage. - Edit the
oipath
andmsgsteinerpath
variables insubmit_wrapper.sh
with the paths where Omics Integrator and msgsteiner were installed. Optionally edit thecondor_submit submit_PCSF.sub
line if you are not using the HTCondor queueing system. - Run
submit_wrapper.sh
to generate a family of Steiner forests and a script that summarizes them by taking the union of all forests (default script namesummarize_forests.sh
). - Run
summarize_forests.sh
to generate the tab-separated union graph for TPS. - Run TPS (not yet part of the pipeline).
To run PCSF with new data follow the same steps for running PCSF on the example data with these additional steps:
- If starting from TPS input files instead of a PCSF prize file, edit
generate_prizes.sh
to generate PCSF prizes from the TPS files. Skip this step if the PCSF prizes are already available. - Edit the
prizetype
,prizepath
,edgefile
, orsources
variables insubmit_wrapper.sh
to use a different protein prize file, protein-protein interaction network, or source node file. - Edit the
b
,m
, andw
variables insubmit_wrapper.sh
to set beta, mu, and omega to the optimal PCSF parameters. The PCSF parameter sweep is not directly supported by this pipeline. - Optional edit the
outpath
variable insubmit_wrapper.sh
to store the results in a different directory or thesumscript
variable to generate a different summarization script.
Other elements in the submit_wrapper.sh
can be modified if needed but
can generally be left at their default values.
- You can change the number of forests in the family of Steiner forests (the default is 100, though 1000 or more can be preferable when running in parallel on a cluster).
- You can edit the msgsteiner parameters
D
andg
(see Omics Integrator for details). - You can edit the
r
parameter to increase the edge noise, which will lead to more diverse forests in the family of Steiner forest solutions.
The full PCSF-TPS pipeline can be run on randomized protein prizes by
permuting the peptide-protein map and regenerating the prizes. The script
permuted_wrapper.sh
initiates the entire run, running PCSF and TPS
on multiple times on different shuffled peptide-protein maps. To run the
pipeline:
- Edit the variables in the script
permuted_wrapper.sh
to set the PCSF input data, TPS input data, and parameters such as the number of random runs to execute and the number of Steiner forests to generate in each run. - Run
permuted_wrapper.sh
, which will generate the random peptide-protein map files and submitsubmit_permuted.sub
to the HTCondor queueing system. The rest of the pipeline will run automatically. submit_permuted.sub
launches the desired number of jobs, each of which will runrun_permuted_pipeline.sh
with a different shuffled peptide-protein map file. Each job's HTCondorProcess
variable is used to track which version of the shuffled data to use as input.run_permuted_pipeline.sh
will generate prizes for PCSF, run PCSF multiple times to generate a family of forests, summarize the family of forests to produce a single input network for TPS, and run TPS.
The full PCSF-TPS pipeline can be run on bootstrapped peptide-level data by
subsampling the three types of peptide scores and regenerating protein prizes.
The script bootstrap_wrapper.sh
initiates the entire run, running PCSF and TPS
on multiple times on different subsampled datasets. To run the pipeline:
- Edit the variables in the script
bootstrap_wrapper.sh
to set the PCSF input data, TPS input data, and parameters such as the number of random runs to execute, the fraction of peptide scores to retain, and the number of Steiner forests to generate in each run. - Run
bootstrap_wrapper.sh
, which will generate the subsampled peptide-level files and submitsubmit_bootstrap.sub
to the HTCondor queueing system. The rest of the pipeline will run automatically. submit_bootstrap.sub
launches the desired number of jobs, each of which will runrun_bootstrap_pipeline.sh
with a different set of subsampled files. Each job's HTCondorProcess
variable is used to track which version of the subsampled files to use as input.run_bootstrap_pipeline.sh
will generate prizes for PCSF, run PCSF multiple times to generate a family of forests, summarize the family of forests to produce a single input network for TPS, and run TPS.
Note that the subsampling script subsample_peptides.py
expects the
firstfile
, prevfile
, and tsfile
inputs to all begin with a single header
row, as in the example input files in the data/timeseries
subdirectory.
The full PCSF-TPS pipeline can be run on a randomized background network. Unlike the other permutation and bootstrap pipelines, two scripts are required. One generates the randomized networks, and the other runs the PCSF and TPS pipeline on those networks. To run the pipeline:
- Edit the arguments in the script
run_net_rand.sh
to specify the original background network and the number of randomized copies to generate. The script callsgenerate_randomized_networks.py
, a wrapper around BiRewire, and randomizes the directed and undirected edges in the background network separately. Each randomized network copy produces two output files: a background network with directed and undirected edges for PCSF and a partial model file with the directed edges for TPS. - Edit the variables in the script
net_rand_wrapper.sh
to set the PCSF prizes, TPS input data, randomized networks, and parameters such as the number of random runs to execute and the number of Steiner forests to generate in each run. The number of random runsnetrandcopies
should match the number of randomized network copies from step 1. - Run
net_rand_wrapper.sh
, which will submitsubmit_net_rand.sub
to the HTCondor queueing system. The rest of the pipeline will run automatically. submit_net_rand.sub
launches the desired number of jobs, each of which will runrun_net_rand_pipeline.sh
with a different randomized network. Each job's HTCondorProcess
variable is used to track which randomized network to use as input.run_net_rand_pipeline.sh
will run PCSF multiple times to generate a family of forests, summarize the family of forests to produce a single input network for TPS, and run TPS.
Note that the randomized networks from BiRewire can contain self-edges.
usage: generate_prizes.py [-h] --firstfile FIRSTFILE --prevfile PREVFILE
--mapfile MAPFILE --outfile OUTFILE
Compute peptide prizes from the TPS first and previous scores files and map
them to protein prizes. See the TPS readme for the expected file formats.
optional arguments:
-h, --help show this help message and exit
--firstfile FIRSTFILE
The path and filename of the TPS firstscores file
--prevfile PREVFILE The path and filename of the TPS prevscores file
--mapfile MAPFILE The path and filename of the TPS peptidemap file
--outfile OUTFILE The path and filename of the output prize file.
usage: generate_randomized_networks.py [-h] --network NETWORK
[--outdir OUTDIR] [--copies COPIES]
Randomize background network.
optional arguments:
-h, --help show this help message and exit
--network NETWORK Background network file in tsv format.
--outdir OUTDIR Output directory for randomized files
--copies COPIES The number of subsampled copies to generate (default 10).
usage: permute_proteins.py [-h] --mapfile MAPFILE [--outdir OUTDIR]
[--copies COPIES] [--seed SEED]
Shuffle the protein(s) that map to each peptide. Creates the specified number
of peptide-protein map files. See the TPS readme for the expected file format.
optional arguments:
-h, --help show this help message and exit
--mapfile MAPFILE The path and filename of the original TPS peptidemap
file, which must contain a file extension.
--outdir OUTDIR The path of the output directory for the permuted map
files (default is the directory of the mapfile).
--copies COPIES The number of shuffled copies to generate (default 10).
--seed SEED A seed for the pseudo-random number generator for
reproducibility.
usage: subsample_peptides.py [-h] --firstfile FIRSTFILE --prevfile PREVFILE
--tsfile TSFILE [--outdir OUTDIR]
[--fraction FRACTION] [--copies COPIES]
[--seed SEED]
Subsample the peptides in the time series and score files for bootstrapping.
See the TPS readme for the expected file formats.
optional arguments:
-h, --help show this help message and exit
--firstfile FIRSTFILE
The path and filename of the TPS firstscores file
--prevfile PREVFILE The path and filename of the TPS prevscores file
--tsfile TSFILE The path and filename of the TPS timeseries file
--outdir OUTDIR The path of the output directory for the subsamples
files (default is the directory of the input files).
--fraction FRACTION The fraction of peptides to keep in the subsampled
datasets (default 0.9).
--copies COPIES The number of subsampled copies to generate (default
10).
--seed SEED A seed for the pseudo-random number generator for
reproducibility.
usage: summarize_sif.py [-h] --indir INDIR [--pattern PATTERN]
[--siflist SIFLIST] [--prizefile PRIZEFILE] --outfile
OUTFILE [--hubnode HUBNODE] [--cyto28]
Summarize a collection of Steiner forests
optional arguments:
-h, --help show this help message and exit
--indir INDIR The path to the directory that contains sif files.
--pattern PATTERN The filename pattern of the sif files in indir. Not
needed if a siflist is provided instead
--siflist SIFLIST A list of sif files in indir delimited by '|'. Not
used if a pattern is provided.
--prizefile PRIZEFILE
The path and filename prefix of the prize file
(optional). Assumes the same prize file was used for
all forests.
--outfile OUTFILE The path and filename prefix of the output. Does not
include an extension.
--hubnode HUBNODE The name of a hub node in the network (optional). The
degree of this node will be reported.
--cyto28 This flag will generate node and edge frequency
annotation files in the Cytoscape 2.8 format instead
of the default Cytoscape 3 style.