Aurélie Gabriel 2023-10-11
This readme describes how the code (available in the scripts/ folder) was used to perform the analyses described in the following paper: A AG Gabriel, J Racle, M Falquet, C Jandus, D Gfeller. 2024. Robust estimation of cancer and immune cell-type proportions from bulk tumor ATAC-Seq data. DOI: https://doi.org/10.7554/eLife.94833.3
In the following sections we refer to a data/ folder containing files that were too large to be hosted in the GitHub repository. The complete data/ folder can be retrieved from Zenodo (https://zenodo.org/record/13132868, additional_data.zip).
The following command line was used to process the ATAC-Seq samples
listed in
data/markers_identification_input_files/samples/Ref/SRA_samples_metadata.txt.
Note that the git_repo_path variable corresponds to the path to the
location where this GitHub repository has been cloned. The
genome_folder_path variable corresponds to the folder containing the
refgenie genome assets, which are PEPATAC prerequisites for ATAC-Seq data
processing; please refer to the PEPATAC documentation to download these
data (http://pepatac.databio.org/en/latest/assets/). The
sra_output_folder variable corresponds to the folder where SRA data will
be downloaded.
git_repo_path=user_defined_path # path to the folder where this GitHub repository was cloned
path_to_data_folder=user_defined_path # path to the data folder downloaded from Zenodo
genome_folder_path=user_defined_path
sra_output_folder=user_defined_path
pepatac_output_folder=user_defined_path # folder where the processed data are saved
nextflow run ${git_repo_path}scripts/bulk_preprocessing_scripts/ATAC_processing.nf \
-c ${git_repo_path}scripts/bulk_preprocessing_scripts/nextflow.config -profile singularity \
--samples_metadata ${path_to_data_folder}markers_identification_input_files/samples/Ref/SRA_samples_metadata.txt \
--sra_folder ${sra_output_folder} \
--output_folder ${pepatac_output_folder} \
--genome_version hg38 \
--genome_folder ${genome_folder_path} \
--multiqc_config ${git_repo_path}scripts/bulk_preprocessing_scripts/multiqc_config.yaml \
--blacklist_file ${git_repo_path}scripts/bulk_preprocessing_scripts/hg38-blacklist.v2.bed \
--original_config ${git_repo_path}scripts/bulk_preprocessing_scripts/pepatac_original.yaml \
--samples_origin SRA --adapters ${git_repo_path}scripts/bulk_preprocessing_scripts/all_adapters_PE.fa \
--preprocess true --getcounts false -bg -resume
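The commands in this readme build paths by concatenating a folder variable directly with a relative path (for example ${git_repo_path}scripts/...), so each folder variable is expected to end with a "/". A minimal sketch of a helper that enforces this is given below; the function name with_trailing_slash and the example paths are hypothetical and are not part of the repository.

```shell
# Hypothetical helper (not part of the repository): append a trailing slash
# to a path unless it already ends with one.
with_trailing_slash() {
  case "$1" in
    */) printf '%s\n' "$1" ;;
    *)  printf '%s/\n' "$1" ;;
  esac
}

# Placeholder paths; replace them with your own locations.
git_repo_path=$(with_trailing_slash "/path/to/cloned_repo")
path_to_data_folder=$(with_trailing_slash "/path/to/data/")

# The concatenation used throughout this readme now yields a valid path.
echo "${git_repo_path}scripts/bulk_preprocessing_scripts/ATAC_processing.nf"
```

The same helper can be applied to the other folder variables (genome_folder_path, sra_output_folder, pepatac_output_folder) before launching the pipeline.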
The same pipeline was used to process the ENCODE samples listed in data/markers_identification_input_files/samples/ENCODE/SRA_samples_metadata.txt
The same pipeline can then be used to identify a set of consensus peaks
across studies and cell types and to extract the raw counts for each peak
and sample, using the options --preprocess false and --getcounts true.
raw_counts_output_folder=user_defined_path # folder where the counts are saved
nextflow run ${git_repo_path}scripts/bulk_preprocessing_scripts/ATAC_processing.nf \
-c ${git_repo_path}scripts/bulk_preprocessing_scripts/nextflow.config -profile singularity \
--samples_metadata ${path_to_data_folder}markers_identification_input_files/samples/Ref/SRA_samples_metadata.txt \
--output_folder ${pepatac_output_folder} \
--genome_version hg38 --peak_score_thr 2 \
--genome_folder ${genome_folder_path} \
--output_folder_counts ${raw_counts_output_folder} \
--pepatac_config ${git_repo_path}scripts/bulk_preprocessing_scripts/pepatac_config.yaml \
--preprocess false --getcounts true -bg -resume
The raw counts for each peak and sample and the metadata associated with each sample are written to raw_counts_output_folder (raw_counts.txt and metadata.txt). These two files are used later in “Computing reference profiles and identifying marker peaks” and are also provided in the data/markers_identification_input_files/ folder.
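Since the pipeline runs in the background (-bg), it can be useful to confirm that both output files exist before moving on to the next step. The following is a minimal sketch of such a check; the function name check_files is hypothetical, and only the two file names come from this readme.

```shell
# Hypothetical helper (not part of the repository): report every expected
# file that is missing or empty in a folder.
# Returns 0 when all files are present and non-empty, 1 otherwise.
check_files() {
  dir=${1%/}   # drop a trailing slash, if any
  shift
  status=0
  for f in "$@"; do
    if [ ! -s "$dir/$f" ]; then
      echo "missing or empty: $dir/$f" >&2
      status=1
    fi
  done
  return "$status"
}

# Usage, with raw_counts_output_folder defined as above:
# check_files "${raw_counts_output_folder}" raw_counts.txt metadata.txt
```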
We extracted the raw counts from the ENCODE data for the peaks (listed in the .saf file) identified in the reference samples using the following command line. The resulting counts and the ENCODE metadata are located in data/markers_identification_input_files/encode_counts.txt and data/markers_identification_input_files/encode_metadata.txt.
nextflow run ${git_repo_path}scripts/bulk_preprocessing_scripts/extract_peaks_counts.nf \
-c ${git_repo_path}scripts/bulk_preprocessing_scripts/nextflow.config \
-profile singularity \
--bam_folder bam_files_ENCODE/ \
--output_folder_counts ${raw_counts_output_folder}ENCODE/ \
--saf_file ${raw_counts_output_folder}all_consensusPeaks.saf -bg -resume
# bam_files_ENCODE/ contains all bam files obtained from the preprocessing of the ENCODE ATAC-Seq data
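Before launching the extraction, the peak file can be sanity-checked. The sketch below assumes that all_consensusPeaks.saf follows the SAF layout used by featureCounts, i.e. five tab-separated columns (GeneID, Chr, Start, End, Strand); this layout is an assumption that is not confirmed by this readme, and the function name check_saf is hypothetical.

```shell
# Hypothetical check (not part of the repository): verify that every line of
# a SAF file has exactly five tab-separated fields.
# Prints the offending lines and returns 1 if any line is malformed.
check_saf() {
  awk -F '\t' '
    BEGIN { bad = 0 }
    NF != 5 { printf "line %d has %d fields\n", NR, NF; bad = 1 }
    END { exit bad }
  ' "$1"
}

# Usage, with raw_counts_output_folder defined as above:
# check_saf "${raw_counts_output_folder}all_consensusPeaks.saf"
```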
Markers are identified in 10 subsets of the reference samples (the list
of samples in each subset is located in
data/markers_identification_input_files/samples/Ref/) and a consensus
of these markers is retrieved. The following command line will generate
reference profiles and identify cell-type specific marker peaks for
EPIC-ATAC. It will also use CIBERSORTx and DeconPeaker to build
reference profiles. For CIBERSORTx, a singularity image is required; it
can be obtained by requesting an access token on the CIBERSORTx website.
The singularity image, the token and the associated username must be
provided in the workflow parameters (CIBERSORTx_singularity,
cibersortx_token, CIBERSORTx_username).
git_repo_path=user_defined_path # path to the folder where this GitHub repository was cloned
path_to_data_folder=user_defined_path # path to the data folder downloaded from Zenodo
my_cibersortx_token=user_defined_path # path to CIBERSORTx token (obtained from CIBERSORTx website)
cibersortx_container=user_defined_path # path to CIBERSORTx container image (obtained from CIBERSORTx website)
cibersortx_username=user_defined # CIBERSORTx username associated to the token file
# Without considering T cell subtypes
ref_output_path=user_defined_path # path where the output files are saved
nextflow run ${git_repo_path}scripts/reference_profiles_scripts/build_references.nf \
-c ${git_repo_path}scripts/reference_profiles_scripts/nextflow.config -profile singularity \
--counts_path ${path_to_data_folder}markers_identification_input_files/raw_counts.txt \
--metadata_path ${path_to_data_folder}markers_identification_input_files/metadata.txt \
--output_folder ${ref_output_path} \
--cibersortx_token ${my_cibersortx_token} \
--CIBERSORTx_singularity ${cibersortx_container} \
--CIBERSORTx_username ${cibersortx_username} \
--cross_validation_files ${path_to_data_folder}markers_identification_input_files/samples/Ref/subsample \
--encode_count_file ${path_to_data_folder}markers_identification_input_files/encode_counts.txt \
--encode_metadata_file ${path_to_data_folder}markers_identification_input_files/encode_metadata.txt \
--TCGA_path ${path_to_data_folder}markers_identification_input_files/TCGA_data.rds \
--HA_path ${path_to_data_folder}markers_identification_input_files/Human_Atlas_peaks.txt \
-bg -resume
# Considering T cell subtypes
ref_output_path=user_defined_path # path where the output files are saved
nextflow run ${git_repo_path}scripts/reference_profiles_scripts/build_references.nf \
-c ${git_repo_path}scripts/reference_profiles_scripts/nextflow.config -profile singularity \
--counts_path ${path_to_data_folder}markers_identification_input_files/raw_counts.txt \
--metadata_path ${path_to_data_folder}markers_identification_input_files/metadata.txt \
--output_folder ${ref_output_path} \
--cibersortx_token ${my_cibersortx_token} \
--CIBERSORTx_singularity ${cibersortx_container} \
--CIBERSORTx_username ${cibersortx_username} \
--cross_validation_files ${path_to_data_folder}markers_identification_input_files/samples/Ref/subsample \
--encode_count_file ${path_to_data_folder}markers_identification_input_files/encode_counts.txt \
--encode_metadata_file ${path_to_data_folder}markers_identification_input_files/encode_metadata.txt \
--TCGA_path ${path_to_data_folder}markers_identification_input_files/TCGA_data.rds \
--HA_path ${path_to_data_folder}markers_identification_input_files/Human_Atlas_peaks.txt \
--major_groups FALSE \
-bg -resume
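The --cross_validation_files option above is given a path prefix rather than a single file. A minimal sketch for checking how many subset files match that prefix is shown below, assuming that the subset files in samples/Ref/ all start with the "subsample" prefix passed on the command line; the function name count_with_prefix is hypothetical. Given the 10 subsets described above, the count would be expected to be 10 under this assumption.

```shell
# Hypothetical helper (not part of the repository): count the existing files
# whose path starts with the given prefix.
count_with_prefix() {
  n=0
  for f in "$1"*; do
    # When nothing matches, the pattern is left unexpanded and -e is false.
    if [ -e "$f" ]; then
      n=$((n + 1))
    fi
  done
  echo "$n"
}

# Usage, with path_to_data_folder defined as above:
# count_with_prefix "${path_to_data_folder}markers_identification_input_files/samples/Ref/subsample"
```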
path_to_data_folder=user_defined_path
my_cibersortx_token=user_defined_path
cibersortx_container=user_defined_path
cibersortx_username=user_defined # CIBERSORTx username associated to the token file
output_path=user_defined_path
# Without considering the T cell subtypes
nextflow run ${git_repo_path}scripts/main_analyses/main_analyses.nf \
-c ${git_repo_path}scripts/main_analyses/nextflow.config \
-profile singularity \
--cibersortx_token ${my_cibersortx_token} \
--CIBERSORTx_singularity ${cibersortx_container} \
--CIBERSORTx_username ${cibersortx_username} \
--data_path ${path_to_data_folder} --withSubtypes FALSE \
--output_path ${output_path} -bg -resume
# Considering the T cell subtypes
nextflow run ${git_repo_path}scripts/main_analyses/main_analyses.nf \
-c ${git_repo_path}scripts/main_analyses/nextflow.config \
-profile singularity \
--cibersortx_token ${my_cibersortx_token} \
--CIBERSORTx_singularity ${cibersortx_container} \
--CIBERSORTx_username ${cibersortx_username} \
--data_path ${path_to_data_folder} --withSubtypes TRUE \
--output_path ${output_path} -bg -resume