Custom heterodimer multiple alignment format #259

Open
fglaser opened this issue Sep 19, 2024 · 3 comments

@fglaser

fglaser commented Sep 19, 2024

Dear all,

I know this topic has been addressed before, but I cannot get it right. I need to merge two custom-made a3m files of a heterodimer to run colabfold_batch in multimer mode.

What I tried is an alignment that starts with:

#174,157 1,1

00000|sp|P38879|NACA_YEAST_1
MSAIPENANVTVLNKNEKKARELIGKLGLKQIPGIIRVTFRKKDNQIYAIEKPEVFRSAGGNYVVFGEAKVDNFTQKLAAAQQQAQASGIMPSNEDVATKSPEDIQADMQAAAEGSVNAAAEEDDEEGEVDAGDLNKDDIELVVQQTNVSKNQAIKALKAHNGDLVNAIMSLSK
00001|UniRef90_A0A540LHD7/58-200_1
--------EASKQSRSEKKSRKAMLKLGMKPVTGVSRVTIKRTKNILFFISKPDVFKSPnSDTYVIFGEAKIEDLSSQLQ---TQAAQQFRMPDMSSVMGK------------PEISAAAAGAQDEEEEEVDETGVEPRDIDLVMTQAGVSRSKAVKALKTHSGDI---------
00002|UniRef90_A0A540LHD7/250-405_1

and, after 15000 sequences of the first protein, continues with:

0000|sp|Q02642|NACB1_YEAST_2
MPIDQEKLAKLQKLSANNKVGGTRRKLNKKAGSSAGANKDDTKLQSQLAKLHAVTIDNVAEANFFKDDGKVMHFNKVGVQVAAQHNTSVFYGLPQEKNLQDLFPGIISQLGPEAIQALSQLAAQMEKHEAKAPADAEKKDEAIPELVEGQTFDADVE
0001|UniRef90_A0A103Y6K5/46-195_2
-KMNVEKLMKMA---GAVRTGGKGSMRRKKKAIHKTTTTDDKRLQSTLKRIGVTAITQIEEVNIFKDE-TVIQFLNPKVQAAIGANTWVVSGSPQTKQLQDILPGILNQLGPDNLDNLRKLAEQFQKQapgagEGIAAaAAAQEDDDEVPELVAG--------
0002|UniRef90_A0A103Y6K5/240-395_2
-KMNVEKLMKMA---GAVRTGGKGSVRRKKKAVHKTTTTDDKRLQSTLKRIGVNAIPAIEEVNIFKDE-TVIQFLNPKVQASIAANTWVVSGSPQTKKLQDILPGILNQLGPDNLDNLRKLAEQFQKQapgagEGTAATTAQEDDYEVPELVAGETFEAAA-
0003|UniRef90_A0A4Y7KDX3/1-145_2
MKMNRDKLMKMA---GAVRTGGKGSVRRKKKAVHKTATTDDKRLQSTLKRVGVNAIPAIEEVNIFKDDS-VIQFLNPKVQASIAANTWVVSGSPQTKKLQDILPGIINQLGPDNLDNLRKLAEQFKKQgAGAAAaAAQEDDDDDVPELM----------
0004|UniRef90_A0A4Y7KDX3/145-245_2
--MNIEKLQKMA---GAVRTGGKGSVRRKKKAVHKTTTTDDKRLQSTLKRIGVNAIPAIEEVNIFKDDV-VIQFQNPKVQASIAANTWVVSGSPQTKIFVQFVDHIL--------------------------------------------------
0005|UniRef90_A0A4Y7KDX3/279-341_2

But I get the following error when running colabfold_batch:

::::::::::::::
NACA_NACB1.uniref90.mgnify.bfd_small.merged.log
::::::::::::::
2024-09-19 11:27:10,318 Running colabfold 1.5.5 (4e198f5cecc6a808daa6baf7441899e5e76f7b9e)
2024-09-19 11:27:13,934 Running on GPU
2024-09-19 11:27:14,413 Found 5 citations for tools or databases
2024-09-19 11:27:14,413 Query 1/1: NACA_NACB1.uniref90.mgnify.bfd_small.merged (length 174)
2024-09-19 11:27:14,426 Could not get MSA/templates for NACA_NACB1.uniref90.mgnify.bfd_small.merged: list index out of range
Traceback (most recent call last):
  File "/home/fabian/localcolabfold/colabfold-conda/lib/python3.10/site-packages/colabfold/batch.py", line 1472, in run
    = unserialize_msa(a3m_lines, query_sequence)
  File "/home/fabian/localcolabfold/colabfold-conda/lib/python3.10/site-packages/colabfold/batch.py", line 1138, in unserialize_msa
    paired_msa[j] += ">" + header_no_faster_split[j] + "\n"
IndexError: list index out of range
2024-09-19 11:27:14,428 Done

Any help will be greatly appreciated,
Fabian
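
One possible reading of the traceback above, not confirmed by the maintainers in this thread: the failing line indexes a split header once per chain, so the IndexError would be consistent with a row that the parser treats as paired (residues in more than one chain) while its header carries a single name. The sketch below checks a merged a3m for that situation. It assumes the layout ColabFold writes for complexes: a first line `#len1,len2<TAB>card1,card2`, headers starting with `>`, and every row covering the concatenated length of all chains (gap-padded where a chain has no homologue). The names `split_by_chain` and `check_complex_a3m` are hypothetical helpers, not part of ColabFold.

```python
def split_by_chain(seq, chain_lengths):
    """Split one a3m row into per-chain pieces by counting match columns.

    Lowercase letters are insertions and do not count toward a chain's length.
    """
    pieces, pos = [], 0
    for length in chain_lengths:
        start, count = pos, 0
        while pos < len(seq) and count < length:
            if not seq[pos].islower():
                count += 1
            pos += 1
        pieces.append(seq[start:pos])
    return pieces


def check_complex_a3m(text):
    """Return (line_number, message) tuples for rows that look malformed."""
    # Keep original line numbers while skipping blank lines.
    rows = [(n, l) for n, l in enumerate(text.splitlines(), 1) if l.strip()]
    if not rows or not rows[0][1].startswith("#") or "\t" not in rows[0][1]:
        return [(1, "first line should be '#len1,len2<TAB>card1,card2'")]
    lengths = [int(x) for x in rows[0][1][1:].split("\t")[0].split(",")]
    total = sum(lengths)
    body = rows[1:]
    problems = []
    for k in range(0, len(body) - 1, 2):
        (head_no, header), (seq_no, seq) = body[k], body[k + 1]
        if not header.startswith(">"):
            problems.append((head_no, "header does not start with '>'"))
            continue
        # Match columns are everything except lowercase insertions.
        n_match = sum(1 for c in seq if not c.islower())
        if n_match != total:
            problems.append((seq_no, f"{n_match} match columns, expected {total}"))
            continue
        pieces = split_by_chain(seq, lengths)
        chains_with_residues = sum(any(c.isupper() for c in p) for p in pieces)
        n_names = len(header[1:].split("\t"))
        if chains_with_residues > 1 and n_names != len(lengths):
            problems.append((head_no, "paired row needs one tab-separated name per chain"))
    return problems
```

With a first line of `#174,157`, rows of the first protein that stop after 174 match columns, instead of covering 174 + 157 = 331, would be reported by this check.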

@YoshitakaMo
Owner

YoshitakaMo commented Sep 21, 2024

Are your complex predictions performed correctly when you provide a FASTA file as input? If so, the issue may lie in your input a3m file.

Preparing an a3m file for complex prediction by hand is very painful. However, if LocalColabFold runs successfully with a FASTA file as input, MSA files (in a3m format) for the pair and monomers will be generated in subdirectories within the output directory. If these files are present, colabfold_batch will skip obtaining the MSA through the MSA server.

If you remove all files in the output directory except the MSA subdirectories, running colabfold_batch with the same FASTA input file will restart the structure prediction without retrieving the MSAs again.
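
The clean-up step described above can be sketched as follows. This is a minimal illustration, assuming the MSA files live in subdirectories of the output directory as described; the function name `remove_top_level_files` and the `dry_run` flag are hypothetical, and nothing here is part of colabfold_batch itself.

```python
from pathlib import Path


def remove_top_level_files(output_dir, dry_run=True):
    """Delete regular files directly inside output_dir, leaving subdirectories intact.

    Returns the names of the files that were (or, in a dry run, would be) removed.
    """
    removed = []
    for entry in sorted(Path(output_dir).iterdir()):
        if entry.is_file():
            removed.append(entry.name)
            if not dry_run:
                entry.unlink()
    return removed
```

Running it first with `dry_run=True` lists what would be deleted without touching anything.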

If you manually modify the monomer and paired MSA files left in those subdirectories, a correctly formatted input a3m file will be generated automatically in the output directory (not its subdirectory!), and the complex structure prediction will start using it.

Unfortunately, my server is currently under maintenance, so I am unable to provide detailed instructions now.

@fglaser
Author

fglaser commented Sep 22, 2024

Dear Yoshitaka,

Thanks a lot for your kind answer.

As you suggested, I deleted the files in the main output directory but kept the subdirectories, then restarted the run with the same input FASTA, and it indeed reran without recomputing the MSA. One puzzling thing is that the total number of homologues across all a3m files in the subdirectories is different from, and lower than, the number in the a3m in the main directory (which is correct in both runs).

So the process you suggested works, but I honestly don't understand exactly how to manipulate the a3m files in the subdirectories so that my custom alignments are used instead of the ones created by default. There are two a3m files in env/ (uniref.a3m and bfd_mg...a3m) and a pair.a3m in pairgreedy/, and I don't understand exactly how to edit them.

I would appreciate more details, when possible, on how to use my custom heterodimer multiple alignment through the subdirectories.

Thanks a lot again,

Fabian
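
As a possible alternative to editing the files in the subdirectories, two custom monomer alignments can be combined into a single unpaired complex a3m by gap-padding. The sketch below assumes the layout ColabFold writes for complexes (a `#len1,len2<TAB>1,1` first line, a tab-separated query header, and each homologue row padded with `-` over the other chain). The query names `101` and `102` mirror what ColabFold-generated files appear to use but are an assumption here, as are the helper names `read_a3m` and `merge_unpaired`. It also assumes the first entry of each input is the ungapped query, and it does no pairing, so every homologue row stays chain-specific.

```python
def read_a3m(text):
    """Parse a3m text into (header, sequence) pairs; headers lose the leading '>'."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and any '#...' comment line
        if line.startswith(">"):
            entries.append([line[1:], ""])
        elif entries:
            entries[-1][1] += line  # tolerate sequences wrapped over several lines
    return [(h, s) for h, s in entries]


def merge_unpaired(a3m_a, a3m_b):
    """Concatenate two monomer a3m alignments into one gap-padded complex a3m."""
    rows_a, rows_b = read_a3m(a3m_a), read_a3m(a3m_b)
    query_a, query_b = rows_a[0][1], rows_b[0][1]
    len_a, len_b = len(query_a), len(query_b)
    # Assumed complex header: chain lengths, a tab, then one copy of each chain.
    out = [f"#{len_a},{len_b}\t1,1", ">101\t102", query_a + query_b]
    for header, seq in rows_a:
        out += [">" + header, seq + "-" * len_b]  # chain A rows: gaps over chain B
    for header, seq in rows_b:
        out += [">" + header, "-" * len_a + seq]  # chain B rows: gaps over chain A
    return "\n".join(out) + "\n"
```

For two toy inputs with queries `MKLV` and `AGH`, the first three lines of the result are `#4,3<TAB>1,1`, `>101<TAB>102`, and `MKLVAGH`, followed by the padded rows of each chain.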

@YoshitakaMo
Owner

@fglaser I've published a procedure for running colabfold_batch with your own paired MSA file: https://qiita.com/Ag_smith/items/b2f3e08df17ffd2c42ea (it is written in Japanese, for Japanese researchers, sorry). I hope it will be helpful.
