Last updated: Jan 7, 2019
FunGAP is freely available for academic use. For commercial use or licensing of FunGAP, please contact In-Geol Choi (igchoi (at) korea.ac.kr). Please cite the following reference.
Reference: Byoungnam Min, Igor V Grigoriev, and In-Geol Choi, FunGAP: Fungal Genome Annotation Pipeline using evidence-based gene model evaluation (2017), Bioinformatics, Volume 33, Issue 18, Pages 2936–2937, https://doi.org/10.1093/bioinformatics/btx353
- 0. Prerequisites
- 1. Preparing protein database
- 2. Augustus species model
- 3. Running FunGAP
- 4. FunGAP output
- 5. Test dataset
- 6. After FunGAP
To run FunGAP, users need to prepare three main inputs:
- Genome assembly (FASTA)
- Transcriptomic reads (FASTQ)
- Protein database (FASTA)
Currently, FunGAP accepts only Illumina-sequenced reads (paired-end or single-read). Paired-end FASTQ files must be named XXXX_1.fastq and XXXX_2.fastq, for example hyphae_1.fastq and hyphae_2.fastq. A single-read file must be named XXXX_s.fastq. A BAM file is also accepted through the --trans_bam option.
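The naming convention above can be checked before a run. The following is a minimal sketch, not part of FunGAP: the function name group_fastq_names and the example file names other than hyphae_1.fastq and hyphae_2.fastq are hypothetical, and the patterns only encode the _1, _2 and _s suffixes described above.

```python
import re
from collections import defaultdict

# Suffix patterns taken from the naming convention described above.
PAIRED = re.compile(r"^(?P<prefix>.+)_(?P<mate>[12])\.fastq$")
SINGLE = re.compile(r"^(?P<prefix>.+)_s\.fastq$")


def group_fastq_names(file_names):
    """Split file names into paired-end libraries, single-read files, and the rest."""
    mates = defaultdict(dict)
    singles = []
    unrecognized = []
    for name in file_names:
        paired = PAIRED.match(name)
        if paired:
            mates[paired.group("prefix")][paired.group("mate")] = name
        elif SINGLE.match(name):
            singles.append(name)
        else:
            unrecognized.append(name)
    # A library counts as paired only if both the _1 and _2 files are present.
    pairs = {}
    for prefix, found in mates.items():
        if set(found) == {"1", "2"}:
            pairs[prefix] = (found["1"], found["2"])
        else:
            unrecognized.extend(found.values())
    return pairs, singles, unrecognized


if __name__ == "__main__":
    # Hypothetical file names used only for illustration.
    names = ["hyphae_1.fastq", "hyphae_2.fastq", "spore_s.fastq", "reads.fq"]
    pairs, singles, unrecognized = group_fastq_names(names)
    print("paired:", pairs)
    print("single:", singles)
    print("unrecognized:", unrecognized)
```

Files reported as unrecognized would need to be renamed before they are passed to FunGAP.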
FunGAP requires a protein database in FASTA format. We recommend using the proteomes of three or four related species to reduce computing time. For convenience, we provide a script, download_sister_orgs.py, that builds a protein database for your genome using the NCBI API.
Example command ($FUNGAP_DIR is your FunGAP installation directory):
python $FUNGAP_DIR/download_sister_orgs.py \
--download_dir sister_orgs \
--taxon "Schizophyllum" \
--num_sisters 3 \
--email_address <your_email_address>
An e-mail address is required by NCBI Entrez. Any taxonomic level is accepted by the --taxon argument, but the genus level is appropriate. Next, build the protein database:
cd sister_orgs/
zcat ./*faa.gz > prot_db.faa
You can now pass prot_db.faa to the --sister_proteome argument.
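On systems without zcat, the concatenation step above can be reproduced in Python. This is a minimal sketch under the assumption that the downloaded proteomes are gzip-compressed FASTA files ending in faa.gz, as in the zcat command; the function name build_protein_db is hypothetical. It also counts the FASTA records written, which gives a quick sanity check on the database.

```python
import glob
import gzip
import os


def build_protein_db(download_dir, out_name="prot_db.faa"):
    """Concatenate every *faa.gz file in download_dir into one FASTA file.

    Returns the number of FASTA records (header lines) written.
    """
    out_path = os.path.join(download_dir, out_name)
    num_records = 0
    with open(out_path, "wt") as out_handle:
        for gz_path in sorted(glob.glob(os.path.join(download_dir, "*faa.gz"))):
            with gzip.open(gz_path, "rt") as in_handle:
                for line in in_handle:
                    if line.startswith(">"):
                        num_records += 1
                    # Guard against a missing final newline so that the
                    # next file's header does not fuse with this sequence.
                    if not line.endswith("\n"):
                        line += "\n"
                    out_handle.write(line)
    return num_records


if __name__ == "__main__":
    # sister_orgs is the download directory used in the example command.
    if os.path.isdir("sister_orgs"):
        print(build_protein_db("sister_orgs"), "protein sequences written")
```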
The Augustus gene predictor requires a pre-defined species model. Run augustus --species=help to print the list of available models. FunGAP provides a script, get_augustus_species.py, to help choose a suitable species model.
Example command ($FUNGAP_DIR is your FunGAP installation directory):
python $FUNGAP_DIR/get_augustus_species.py \
--genus_name "Schizophyllum" \
--email_address <your_email_address>
This will suggest coprinus_cinereus and laccaria_bicolor.
Usage ($FUNGAP_DIR is your FunGAP installation directory):
python $FUNGAP_DIR/fungap.py \
--output_dir <output_directory> \
--trans_read_1 <transcriptome_reads_fastq_1> \
--trans_read_2 <transcriptome_reads_fastq_2> \
--genome_assembly <genome_assembly_fasta> \
--augustus_species <augustus_species> \
--sister_proteome <sister_proteome> \
--num_cores <number_of_cpus_to_be_used>
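When FunGAP is launched from a script, the usage above can be assembled as an argument list. The following is a minimal sketch that uses only the options shown above; the function name build_fungap_command and the example values (/opt/FunGAP, genome.fna, 8 cores) are hypothetical. The command is printed rather than executed.

```python
import os
import shlex


def build_fungap_command(fungap_dir, output_dir, read_1, read_2,
                         genome_assembly, augustus_species,
                         sister_proteome, num_cores):
    """Return the fungap.py command as a list suitable for subprocess.run."""
    return [
        "python", os.path.join(fungap_dir, "fungap.py"),
        "--output_dir", output_dir,
        "--trans_read_1", read_1,
        "--trans_read_2", read_2,
        "--genome_assembly", genome_assembly,
        "--augustus_species", augustus_species,
        "--sister_proteome", sister_proteome,
        "--num_cores", str(num_cores),
    ]


if __name__ == "__main__":
    # All values below are placeholders chosen for illustration.
    cmd = build_fungap_command(
        fungap_dir=os.environ.get("FUNGAP_DIR", "/opt/FunGAP"),
        output_dir="fungap_out",
        read_1="hyphae_1.fastq",
        read_2="hyphae_2.fastq",
        genome_assembly="genome.fna",
        augustus_species="coprinus_cinereus",
        sister_proteome="prot_db.faa",
        num_cores=8,
    )
    print(" ".join(shlex.quote(part) for part in cmd))
    # To start the run: import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.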
Final output will be located in the fungap_out directory:
- fungap_out_prot.faa
- fungap_out.gff3
- fungap_out_stats.html
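A quick way to inspect the GFF3 output is to count features by type. The sketch below relies only on the standard GFF3 layout (nine tab-separated columns, feature type in the third); the function name count_gff3_features and the example records are hypothetical, and which feature types appear in fungap_out.gff3 is not assumed here.

```python
from collections import Counter


def count_gff3_features(gff3_lines):
    """Count GFF3 records by feature type (column 3)."""
    counts = Counter()
    for line in gff3_lines:
        line = line.rstrip("\n")
        # Skip blank lines, directives and comments.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        # Anything without nine columns (e.g. an embedded FASTA section)
        # is not a feature record.
        if len(fields) != 9:
            continue
        counts[fields[2]] += 1
    return counts


if __name__ == "__main__":
    # Made-up records for illustration only.
    example = [
        "##gff-version 3",
        "chr1\tsrc\tgene\t1\t900\t.\t+\t.\tID=g1",
        "chr1\tsrc\tmRNA\t1\t900\t.\t+\t.\tID=g1.t1;Parent=g1",
    ]
    for feature_type, number in sorted(count_gff3_features(example).items()):
        print(feature_type, number)
```

For a real run, pass an open file handle instead of the example list, for instance open("fungap_out/fungap_out.gff3").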
You can download yeast (S. cerevisiae) genome assembly (FASTA) and RNA-seq reads (two FASTQs) from NCBI for testing FunGAP.
# Download RNA-seq reads using SRA toolkit (https://www.ncbi.nlm.nih.gov/sra/docs/toolkitsoft/)
fastq-dump -I --split-files SRR1198667
head -n 12000000 SRR1198667_1.fastq > SRR1198667_sampled_1.fastq
head -n 12000000 SRR1198667_2.fastq > SRR1198667_sampled_2.fastq
# Download assembly
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/146/045/GCF_000146045.2_R64/GCF_000146045.2_R64_genomic.fna.gz
The test run took about 9 hours on a dual Intel(R) Xeon(R) CPU E5-2670 v3 machine with 40 CPU cores.
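The head -n 12000000 commands above keep the first 3,000,000 reads of each file, because a FASTQ record spans four lines. The same subsampling can be sketched in Python, assuming four-line records (no wrapped sequence lines) and that both files list mates in the same order; the function name head_fastq is hypothetical.

```python
from itertools import islice

LINES_PER_RECORD = 4  # header, sequence, '+' separator, qualities


def head_fastq(in_path, out_path, num_reads):
    """Copy the first num_reads FASTQ records; return the number of lines written."""
    num_lines = 0
    with open(in_path) as in_handle, open(out_path, "w") as out_handle:
        for line in islice(in_handle, num_reads * LINES_PER_RECORD):
            out_handle.write(line)
            num_lines += 1
    return num_lines


if __name__ == "__main__":
    # 12,000,000 lines correspond to this many reads per file.
    print(12000000 // LINES_PER_RECORD)
```

Calling head_fastq with the same num_reads on SRR1198667_1.fastq and SRR1198667_2.fastq keeps the two sampled files in step, as the head commands do.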
InterProScan can infer the functions of predicted genes. The gff3_add_pfam.py script adds the annotation to the GFF3 file.
Example commands:
$ interproscan.sh -i <protein.fasta> -f tsv -appl Pfam --goterms -pa --iprlookup -b <base_name> --tempdir <TEMP-DIR>
$ gff3_add_pfam.py --input_gff3 <fungap_out.gff3> --pfam_file <interproscan_output.tsv>
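To look at the Pfam hits directly, the InterProScan TSV output can be summarized per protein. The sketch below is independent of gff3_add_pfam.py and does not describe how that script works; it assumes the standard InterProScan TSV column order (protein accession first, analysis name fourth, signature accession fifth, signature description sixth). The function name pfam_by_protein and the example rows are hypothetical.

```python
from collections import defaultdict


def pfam_by_protein(tsv_lines):
    """Map each protein ID to its distinct Pfam (accession, description) hits."""
    domains = defaultdict(list)
    for line in tsv_lines:
        fields = line.rstrip("\n").split("\t")
        # Column 4 names the analysis; keep Pfam rows only.
        if len(fields) < 6 or fields[3] != "Pfam":
            continue
        protein_id, hit = fields[0], (fields[4], fields[5])
        # A domain can match several regions; report it once per protein.
        if hit not in domains[protein_id]:
            domains[protein_id].append(hit)
    return dict(domains)


if __name__ == "__main__":
    # Made-up rows for illustration only.
    example = [
        "gene_00001\tmd5\t300\tPfam\tPF00001\tExample domain A\t10\t120\t1e-20\tT\t01-01-2019",
        "gene_00001\tmd5\t300\tPfam\tPF00001\tExample domain A\t150\t260\t1e-18\tT\t01-01-2019",
        "gene_00002\tmd5\t210\tPfam\tPF00002\tExample domain B\t5\t90\t1e-12\tT\t01-01-2019",
    ]
    for protein_id, hits in pfam_by_protein(example).items():
        print(protein_id, hits)
```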