Description
Hey @milot-mirdita,
Thanks for a great resource to find remote homologs!
I am interested in finding an E. coli gene in other Proteobacteria. The literature shows that this gene is conserved only in closely related strains, so I am using HH-suite to look for remote homologs of this gene in other Proteobacteria samples.
I would like some advice on whether the way I am using HH-suite makes sense, and on how to tell whether the output is a false positive or a false negative.
- I run hhblits to get all sequences similar to the E. coli gene of interest in the UniRef30 database:
hhblits -cpu 4 -i ${IN_DIR}/ytfI_ecoli.fasta -d ${DB2}/UniRef30_2023_02 -oa3m ${OUT_DIR}/ytfI_ECOLI_uniclust.a3m -all
# The idea behind step 1 is to collect remote homologs of the E. coli gene of interest, because hmmsearch against a single E. coli gene as the database does not give any results.
- The resulting .a3m file was converted back to a FASTA alignment using the reformat.pl script.
- The hmmbuild command was used to convert the MSA into a profile HMM.
- I use hmmsearch to search the Proteobacteria protein sequences with the profile HMM from step 3.
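Steps 2 to 4 can be sketched as a small shell function. This is only a sketch under assumptions: the function name profile_search, its arguments, and the output file names (query.fas, query.hmm, hits.tbl, hits.out) are hypothetical, and it assumes reformat.pl (from the HH-suite scripts directory), hmmbuild, and hmmsearch are on PATH. No E-value cutoff is passed to hmmsearch here, so borderline hits stay visible in the table.

```shell
#!/usr/bin/env bash
# Hypothetical helper: a3m -> aligned FASTA -> profile HMM -> hmmsearch.
# $1: a3m alignment written by hhblits (-oa3m)
# $2: FASTA file with the target protein sequences (placeholder name)
# $3: output directory (created if missing)
profile_search() {
    local a3m="$1" targets="$2" out_dir="$3"
    mkdir -p "$out_dir" || return 1

    # Step 2: convert the a3m alignment to aligned FASTA.
    reformat.pl a3m fas "$a3m" "$out_dir/query.fas" || return 1

    # Step 3: build a profile HMM from the alignment.
    hmmbuild "$out_dir/query.hmm" "$out_dir/query.fas" || return 1

    # Step 4: search the target sequences with the profile.
    # --tblout writes one line per hit; -o keeps the full report in a file.
    hmmsearch --tblout "$out_dir/hits.tbl" -o "$out_dir/hits.out" \
        "$out_dir/query.hmm" "$targets"
}

# Example call (proteobacteria.faa is a placeholder file name):
# profile_search "${OUT_DIR}/ytfI_ECOLI_uniclust.a3m" proteobacteria.faa "${OUT_DIR}/hmm"
```

Keeping the three tools in one function makes it easy to rerun the same search against a different target set by changing only the second argument.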
Unfortunately, this does not give any "significant" hit, i.e. no hit has an E-value below 1e-3.
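To make the 1e-3 criterion concrete, the check can be sketched as a filter over a table written by hmmsearch with --tblout. The function name filter_hits and its arguments are hypothetical. The sketch assumes the standard tblout layout, where lines starting with # are comments, column 1 is the target name, and column 5 is the full-sequence E-value.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "target E-value" for hits below a cutoff.
# $1: file written by hmmsearch --tblout
# $2: E-value cutoff (defaults to 1e-3)
filter_hits() {
    awk -v cutoff="${2:-1e-3}" '
        # Skip comment lines; compare column 5 numerically against the cutoff.
        !/^#/ && ($5 + 0) < (cutoff + 0) { print $1, $5 }
    ' "$1"
}

# Example call (hits.tbl is a placeholder file name):
# filter_hits hits.tbl 1e-3
```

An empty result from this filter corresponds to the situation described above: hits may exist in the table, but none passes the cutoff.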
I am also comparing the Proteobacteria sequences with the E. coli gene of interest using Foldseek's easy-search command. There, too, I find no "significant" hit, i.e. no hit has an E-value below 1e-3.
So I would like to understand what could be considered a reasonable remote homolog of the gene, and whether the two methods I am using make sense.
Regards,
Jigyasa