custom heterodimer multiple alignment format #259

fglaser · 2024-09-19T08:50:14Z

Dear all,

I know this topic has been addressed before, but I cannot get it right. I need to merge two custom-made a3m files of a heterodimer to run colabfold_batch in multimer mode.

What I tried is an alignment that starts with

#174,157 1,1

00000|sp|P38879|NACA_YEAST_1
MSAIPENANVTVLNKNEKKARELIGKLGLKQIPGIIRVTFRKKDNQIYAIEKPEVFRSAGGNYVVFGEAKVDNFTQKLAAAQQQAQASGIMPSNEDVATKSPEDIQADMQAAAEGSVNAAAEEDDEEGEVDAGDLNKDDIELVVQQTNVSKNQAIKALKAHNGDLVNAIMSLSK
00001|UniRef90_A0A540LHD7/58-200_1
--------EASKQSRSEKKSRKAMLKLGMKPVTGVSRVTIKRTKNILFFISKPDVFKSPnSDTYVIFGEAKIEDLSSQLQ---TQAAQQFRMPDMSSVMGK------------PEISAAAAGAQDEEEEEVDETGVEPRDIDLVMTQAGVSRSKAVKALKTHSGDI---------
00002|UniRef90_A0A540LHD7/250-405_1

and, after 15000 sequences of the first protein, continues with

0000|sp|Q02642|NACB1_YEAST_2
MPIDQEKLAKLQKLSANNKVGGTRRKLNKKAGSSAGANKDDTKLQSQLAKLHAVTIDNVAEANFFKDDGKVMHFNKVGVQVAAQHNTSVFYGLPQEKNLQDLFPGIISQLGPEAIQALSQLAAQMEKHEAKAPADAEKKDEAIPELVEGQTFDADVE
0001|UniRef90_A0A103Y6K5/46-195_2
-KMNVEKLMKMA---GAVRTGGKGSMRRKKKAIHKTTTTDDKRLQSTLKRIGVTAITQIEEVNIFKDE-TVIQFLNPKVQAAIGANTWVVSGSPQTKQLQDILPGILNQLGPDNLDNLRKLAEQFQKQapgagEGIAAaAAAQEDDDEVPELVAG--------
0002|UniRef90_A0A103Y6K5/240-395_2
-KMNVEKLMKMA---GAVRTGGKGSVRRKKKAVHKTTTTDDKRLQSTLKRIGVNAIPAIEEVNIFKDE-TVIQFLNPKVQASIAANTWVVSGSPQTKKLQDILPGILNQLGPDNLDNLRKLAEQFQKQapgagEGTAATTAQEDDYEVPELVAGETFEAAA-
0003|UniRef90_A0A4Y7KDX3/1-145_2
MKMNRDKLMKMA---GAVRTGGKGSVRRKKKAVHKTATTDDKRLQSTLKRVGVNAIPAIEEVNIFKDDS-VIQFLNPKVQASIAANTWVVSGSPQTKKLQDILPGIINQLGPDNLDNLRKLAEQFKKQgAGAAAaAAQEDDDDDVPELM----------
0004|UniRef90_A0A4Y7KDX3/145-245_2
--MNIEKLQKMA---GAVRTGGKGSVRRKKKAVHKTTTTDDKRLQSTLKRIGVNAIPAIEEVNIFKDDV-VIQFQNPKVQASIAANTWVVSGSPQTKIFVQFVDHIL--------------------------------------------------
0005|UniRef90_A0A4Y7KDX3/279-341_2

But I get the following error when running it:

::::::::::::::
NACA_NACB1.uniref90.mgnify.bfd_small.merged.log
::::::::::::::
2024-09-19 11:27:10,318 Running colabfold 1.5.5 (4e198f5cecc6a808daa6baf7441899e5e76f7b9e)
2024-09-19 11:27:13,934 Running on GPU
2024-09-19 11:27:14,413 Found 5 citations for tools or databases
2024-09-19 11:27:14,413 Query 1/1: NACA_NACB1.uniref90.mgnify.bfd_small.merged (length 174)
2024-09-19 11:27:14,426 Could not get MSA/templates for NACA_NACB1.uniref90.mgnify.bfd_small.merged: list index out of range
Traceback (most recent call last):
File "/home/fabian/localcolabfold/colabfold-conda/lib/python3.10/site-packages/colabfold/batch.py", line 1472, in run
= unserialize_msa(a3m_lines, query_sequence)
File "/home/fabian/localcolabfold/colabfold-conda/lib/python3.10/site-packages/colabfold/batch.py", line 1138, in unserialize_msa
paired_msa[j] += ">" + header_no_faster_split[j] + "\n"
IndexError: list index out of range
2024-09-19 11:27:14,428 Done

Any help will be greatly appreciated,
Fabian

YoshitakaMo · 2024-09-21T16:05:50Z

Are your complex predictions performed correctly when you provide a FASTA file as input? If so, the issue may lie in your input a3m file.
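One way to narrow this down is to sanity-check the hand-merged file before passing it to colabfold_batch. The sketch below is only an illustration and is not ColabFold's own parser. It assumes the first line has the form `#<len1>,<len2> <copies1>,<copies2>`, as in the excerpt in the first post (whether ColabFold requires a tab between the two fields is not confirmed here), and that each alignment row spans either one chain or the concatenated complex. It relies on the general a3m convention that lowercase letters are insertions and do not count as alignment columns. All function names are hypothetical.

```python
def parse_header(line):
    """Split a '#174,157 1,1'-style line into chain lengths and copy numbers."""
    if not line.startswith("#"):
        raise ValueError("first line does not start with '#'")
    fields = line[1:].split()  # assumption: accept a tab or spaces as separator
    if len(fields) != 2:
        raise ValueError("expected two fields after '#'")
    lengths = [int(x) for x in fields[0].split(",")]
    copies = [int(x) for x in fields[1].split(",")]
    if len(lengths) != len(copies):
        raise ValueError("number of lengths and copy numbers differ")
    return lengths, copies


def match_columns(seq):
    """Count a3m alignment columns: uppercase residues and '-' gaps only."""
    return sum(1 for c in seq if c == "-" or c.isupper())


def check_a3m(text):
    """Return (line_number, columns) for rows whose width fits no expected length."""
    lines = text.splitlines()
    lengths, _ = parse_header(lines[0])
    # Assumption: a row covers one chain or the whole concatenated complex.
    allowed = set(lengths) | {sum(lengths)}
    problems = []
    for number, line in enumerate(lines[1:], start=2):
        if not line or line.startswith(">"):
            continue  # skip blank lines and header lines
        cols = match_columns(line)
        if cols not in allowed:
            problems.append((number, cols))
    return problems
```

Calling `check_a3m(open("merged.a3m").read())` on a file (the file name is a placeholder) returns an empty list when every row has a plausible width; any reported line numbers are a good place to start looking.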

Preparing an a3m file for complex prediction by hand is very painful. However, if LocalColabFold runs successfully with a FASTA file as input, MSA files (in a3m format) for the pair and monomers will be generated in subdirectories within the output directory. If these files are present, colabfold_batch will skip obtaining the MSA through the MSA server.

If you remove all files in the output directory except the subdirectories that contain the MSA files, colabfold_batch with the same FASTA input file will restart the structure prediction without retrieving the MSAs again.
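The cleanup step can be sketched as follows. This is a minimal illustration, not part of ColabFold: the function name `clear_top_level_files` and its `dry_run` flag are hypothetical, and the code simply assumes that everything to keep lives in subdirectories of the output directory. By default it only lists what would be deleted.

```python
from pathlib import Path


def clear_top_level_files(output_dir, dry_run=True):
    """Delete regular files directly inside output_dir; leave subdirectories alone.

    Returns the names of the files that were (or would be) removed.
    """
    removed = []
    for entry in sorted(Path(output_dir).iterdir()):
        if entry.is_file():
            removed.append(entry.name)
            if not dry_run:
                entry.unlink()
    return removed
```

Running it once with `dry_run=True` and checking the returned names before calling it again with `dry_run=False` avoids deleting results by accident.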

If you manually modify the monomer and paired MSA files left in the subdirectories of the output directory, correctly formatted input a3m files will be generated automatically in the output directory itself (not in a subdirectory!), and complex structure prediction will start using them.

Unfortunately, my server is currently under maintenance, so I am unable to provide detailed instructions now.

fglaser · 2024-09-22T11:14:52Z

Dear Yoshitaka,

Thanks a lot for your kind answer.

As you suggested, I restarted the run with the same input FASTA after deleting the main output files but keeping the subdirectories, and it indeed reran without recomputing the MSA. What puzzles me is that the total number of homologues across all a3m files in the subdirectories is different from, and lower than, the number in the a3m file in the main directory (which is correct in both runs).
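A comparison like this can be reproduced with a short script. The sketch below is an assumption-laden helper, not part of ColabFold: it counts header lines (lines starting with `>`) in every `*.a3m` file under a directory, and it tolerates a leading NUL byte on a line in case a file uses one as a block separator. The function names are hypothetical.

```python
from pathlib import Path


def count_a3m_entries(path):
    """Count header lines (those starting with '>') in one a3m file."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        # lstrip("\x00") is defensive: a NUL byte may precede '>' in some files.
        return sum(1 for line in handle if line.lstrip("\x00").startswith(">"))


def count_all(output_dir):
    """Map each *.a3m file under output_dir (searched recursively) to its count."""
    root = Path(output_dir)
    return {
        p.relative_to(root).as_posix(): count_a3m_entries(p)
        for p in sorted(root.rglob("*.a3m"))
    }
```

Printing `count_all("output_dir")` (the directory name is a placeholder) gives the per-file entry counts side by side, which makes a difference between the subdirectory files and the top-level a3m easy to see.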

So the process you suggested works, but I honestly don't understand exactly how to manipulate the a3m files in the subdirectories so that my custom alignments are used instead of the ones created by default. There are two a3m files in env/ (uniref.a3m and bfd_mg...a3m) and a pair.a3m in pairgreedy/, and I don't understand exactly how to edit them.

I would be very grateful for more details, when possible, on how to use my custom heterodimer multiple alignment in the subdirectories.

Thanks a lot again,

Fabian

YoshitakaMo · 2024-09-29T15:53:42Z

@fglaser I've published a procedure for running colabfold_batch with your own paired MSA file: https://qiita.com/Ag_smith/items/b2f3e08df17ffd2c42ea (written in Japanese for Japanese researchers, sorry). I hope it will be helpful.
