Using precomputed MSA and PDB files for running massive 3d structure prediction #274

Open
berkeucar opened this issue Nov 8, 2024 · 4 comments

Comments

@berkeucar

Hello,

I have a FASTA file containing thousands of peptide sequences. I want to predict their 3D structures using LocalColabFold 1.5.5 installed on an HPC cluster, and I have access to GPU clusters as well. I was able to generate the PDB & MSA files by following the post/issue: sokrypton/ColabFold#563.

However, as I mentioned, I have multiple peptides in my FASTA file, and I would like to use my GPU access to run the 3D structure predictions with the colabfold_batch command, using the PDB & MSA files I precomputed on the HPC cluster. This was asked in the linked issue but seems to have gone unanswered.

Currently, does LocalColabFold support large-scale prediction of peptides with the --pdb-hit-file flag?

@YoshitakaMo
Owner

Did this not work?: sokrypton/ColabFold#563 (comment)

I use colabfold_batch --pdb-hit-file foobar_pdb100_230517.m8 --local-pdb-path /home/database/pdb_mmcif/mmcif_files foobar.a3m <outputdir> for the prediction. /home/database/pdb_mmcif/mmcif_files contains more than 220,000 flattened 4-letter mmCIF files.
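For many queries, the command above could be generated once per MSA file. The following is only a minimal sketch: the function name `build_commands`, the directory layout, and the assumption that each `<name>.a3m` has a matching `<name>_pdb100_230517.m8` next to it are illustrative and not confirmed by this thread. It uses only the flags quoted above.

```python
from pathlib import Path

# Suffix copied from the hit file name quoted above; other database
# versions may produce a different suffix (assumption).
HIT_SUFFIX = "_pdb100_230517.m8"


def build_commands(msa_dir, mmcif_dir, out_dir):
    """Return one colabfold_batch argument list per .a3m file with a matching hit file.

    Hypothetical helper: it only builds the commands, it does not run them.
    """
    commands = []
    for a3m in sorted(Path(msa_dir).glob("*.a3m")):
        hits = a3m.with_name(a3m.stem + HIT_SUFFIX)
        if not hits.exists():
            continue  # skip queries without precomputed template hits
        commands.append([
            "colabfold_batch",
            "--pdb-hit-file", str(hits),
            "--local-pdb-path", str(mmcif_dir),
            str(a3m),
            str(Path(out_dir) / a3m.stem),  # one output directory per query
        ])
    return commands
```

Each returned list could then be passed to `subprocess.run(cmd, check=True)`, or written into a job script for the GPU nodes.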

@berkeucar
Author

berkeucar commented Nov 10, 2024

So, basically, I appended all my peptide sequences together, using ":" as the separator between them. Let's say that file's name is tmp.fasta.
I obtained the files tmp.a3m and tmp_pdb100_230517.m8 from the colabfold_search command. Then I ran the following command:
colabfold_batch \
  --amber \
  --templates \
  --num-recycle 3 \
  --use-gpu-relax \
  --pdb-hit-file tmp_pdb100_230517.m8 \
  --local-pdb-path my_local_pdb/pdb_mmcif/mmcif_files \
  --random-seed 0 \
  --zip \
  tmp_pdb100_230517.m8 \
  output_folder

and I received the following error:

Could not generate input features tmp: string index out of range
= generate_input_feature(query_seqs_unique, query_seqs_cardinality, unpaired_msa, paired_msa,
   File "localacolabfold_env/bin/lib/python3.10/site-packages/colabfold/batch.py", line 1035, in generate_input_feature
     features_for_chain[protein.PDB_CHAIN_IDS[chain_cnt]] = feature_dict
 IndexError: string index out of range
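
Note: one possible reading of this traceback, not confirmed in this thread, is that the failing line indexes a fixed string of chain IDs by chain count. Sequences joined with ":" are treated as chains of a single complex, so thousands of peptides in one entry would run past the end of that string. The sketch below assumes a 62-character chain-ID alphabet (A-Z, a-z, 0-9) and uses hypothetical helper names to count the chains and to split such an entry back into one FASTA record per peptide.

```python
# Assumption: mirrors a 62-character chain-ID alphabet (A-Z, a-z, 0-9);
# the exact value used by the installed package is not shown in this thread.
PDB_CHAIN_IDS = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789"


def count_chains(sequence):
    """Number of chains in a ':'-separated complex sequence."""
    return len([chain for chain in sequence.split(":") if chain])


def split_complex_to_fasta(sequence, prefix="peptide"):
    """Turn 'AAA:CCC' into separate FASTA records, one per peptide.

    Hypothetical helper; record names are '<prefix>_<index>'.
    """
    chains = [chain for chain in sequence.split(":") if chain]
    return "".join(
        f">{prefix}_{i}\n{chain}\n" for i, chain in enumerate(chains, start=1)
    )


def fits_chain_limit(sequence):
    """True if every chain can still be assigned a chain ID."""
    return count_chains(sequence) <= len(PDB_CHAIN_IDS)
```

If this reading is right, writing each peptide as its own FASTA record (rather than one ":"-joined entry) would make each peptide a separate single-chain query.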

@YoshitakaMo
Owner

YoshitakaMo commented Nov 11, 2024

Please show me your commit hash number. For example, ColabFold on my machine has 1ccca5a53d20c909f3ccf8a4b81df804e6717cb1. This is the commit on Jul. 23, 2024.

2024-11-11 00:18:05,900 Running colabfold 1.5.5 (1ccca5a53d20c909f3ccf8a4b81df804e6717cb1)
2024-11-11 00:18:06,190 Running on GPU
2024-11-11 00:18:06,859 Found 5 citations for tools or databases
...
...
...

If your commit hash number is old, updating LocalColabFold will fix this issue.
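
To check the hash across many job logs, the "Running colabfold" line can be parsed. This is a small sketch that assumes the log line has exactly the shape shown above; the function name `parse_colabfold_version` is illustrative.

```python
import re

# Matches lines like:
# "... Running colabfold 1.5.5 (1ccca5a53d20c909f3ccf8a4b81df804e6717cb1)"
VERSION_LINE = re.compile(r"Running colabfold (\S+) \(([0-9a-f]{40})\)")


def parse_colabfold_version(line):
    """Return (version, commit_hash) from a log line, or None if it does not match."""
    match = VERSION_LINE.search(line)
    if match is None:
        return None
    return match.group(1), match.group(2)
```

Applied to the first log line above, this returns the version string and the 40-character hash as a tuple; lines without that pattern return None.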

@berkeucar
Author

berkeucar commented Nov 13, 2024

Just in case, I freshly installed localcolabfold with the script install_colabfold_batch_linux.sh. Now I cannot even obtain the MSA files: it gets stuck on the MSA of the first peptide in the batch:

k-mer similarity threshold: 110
Starting prefiltering scores calculation (step 1 of 1)
Query db start 1 to 238
Target db start 1 to 209335862
[>                                                                ] 1.27% 4 eta 0s  

I am running this on CPUs and my gcc version is 9.4.0.
