Releases: steineggerlab/foldseek
Foldseek Release 10-941cd33
Foldseek Release 10
Foldseek introduces GPU support for monomer and multimer search, improved profile search and ProstT5 integration, new databases, several performance improvements and bug fixes.
Major Features
- GPU Support for Search and Multimers
Both `easy-search` and `easy-multimersearch` now support accelerated searches on the GPU. Use `--gpu 1` to enable GPU mode and `CUDA_VISIBLE_DEVICES` to control the number of GPUs. (#391, #411) GPU-enabled binaries require glibc >= 2.17, NVIDIA driver >= 525.60.13, and a Turing or newer GPU. On a single 4090 GPU, searches are 4x faster, and on eight GPUs, they are up to 37x faster than a 128-core CPU using the k-mer prefilter. For more details, see our preprint.
- Improved Structural Profile Search
Alignment results can now be converted into position-specific scoring profiles with `result2profile`, enabling the creation of structural protein family representations. (#411)
- Enhanced ProstT5 Integration
Added multi-GPU and Apple Metal support, improved handling of large input sequences through splitting, and switched the backend to llama.cpp for better compatibility and performance. (#391)
- New Databases
Introduced BFVD as a new virus-specific Foldseek database. (#344)
For more details about the database, check out the BFVD paper.
- Improved Multimer Search Workflows
Optimized multimer workflows for improved speed and reliability, with contributions by @Woosub-Kim. For more details, see our Foldseek-Multimer preprint.
- Clustering Multimers (First Version)
Introduced experimental multimer clustering (`easy-multimercluster`) by @sooyoung-cha and @rachelse, supporting clustering by interface LDDT, chain TM-score, and complex coverage. See `filtercomplex` for more details.
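As a rough illustration of the GPU mode described above, the sketch below wraps a GPU search in a small shell function. The function name `gpu_search`, the `_pad` suffix, and the file names are placeholders chosen for this example. It also assumes that the target database must first be converted with `makepaddedseqdb` before it can be searched with `--gpu 1`; that step is not stated in these notes, so check the README for the exact requirements.

```shell
# gpu_search QUERY TARGET_DB RESULT [GPU_IDS]
# Hypothetical helper; all names here are placeholders for illustration.
gpu_search() {
    query=$1
    target_db=$2
    result=$3
    gpu_ids=${4:-0}                  # comma-separated device ids, e.g. "0,1"
    padded_db="${target_db}_pad"     # assumed name for the padded copy

    # Assumption: GPU search expects a padded database built beforehand.
    foldseek makepaddedseqdb "$target_db" "$padded_db" || return 1

    # CUDA_VISIBLE_DEVICES limits which GPUs the search may use.
    CUDA_VISIBLE_DEVICES="$gpu_ids" \
        foldseek easy-search "$query" "$padded_db" "$result" tmp --gpu 1
}
```

For example, `gpu_search example/d1asha_ pdb result.m8 0,1` would run the search with the first two GPUs visible.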
Breaking Changes
- Results may differ because masking of letters repeated six or more times is now enabled by default (`--mask-n-repeat`). Disable this option to reproduce previous results.
Other Features
- Improved Compatibility with MMseqs2 Modules: `createsubdb`, `makepaddedseqdb`, and `result2profile` now work seamlessly with Foldseek databases.
- Taxonomy Reports in `easy-search`: Added options to generate taxonomy reports directly within `easy-search`. (#389)
- Residue Mapping Rework: Residue mapping has been reworked to combine most gemmi amino acids with previous Foldseek amino acids. (#387)
Bug Fixes
- Fix order-dependent `--format-output` issue of `qtmaln`, `ttmaln`, `lddt`, `u`, `t` (b40729c)
- Fix clustering of structures without Cα information (0d8d966)
Full Changelog
View the full changelog: 9-427df8a...10-941cd33.
9-427df8a
At a glance: Foldseek release 9 features the fully benchmarked Foldseek-multimer search and structure-based sequence search using ProstT5. Both Foldseek-multimer and structure-based sequence search are also available in the Foldseek webserver.
Major Features
- Foldseek-multimer: Fully benchmarked and integrated into this release with the `easy-multimersearch` and `multimer` workflows (Thanks @Woosub-Kim). Check out our preprint explaining the algorithm. Read more on how to get started in our README.
- Search requires less memory: We optimized the memory consumption of Foldseek. It now requires significantly less memory (f629bbe)
- Structure-based sequence search: Predict protein 3Di directly from amino acid sequences without the need for existing protein structures. This is roughly 400-4000x faster than predicting full protein structures with ColabFold. This feature uses the ProstT5 protein language model and runs by default on CPU:
foldseek databases ProstT5 weights tmp
foldseek databases PDB pdb tmp
foldseek easy-search QUERY.fasta pdb result.m8 tmp --prostt5-model weights
Fast inference using GPU/CUDA is also supported. Compile from source with cmake -DCMAKE_BUILD_TYPE=Release -DENABLE_CUDA=1 -DCUDAToolkit_ROOT=Path-To-Cuda-Toolkit and call with createdb/easy-search --prostt5-model weights --gpu 1.
(Thanks @Victor-Mihaila).
Breaking changes
- Remove `.cif`/`.pdb` from filenames and remove `_MODEL_` from identifiers in `.lookup` #261 (Thanks @chasooyoung)
- Removed `--tar-include` and `--tar-exclude` from `createdb` as they were unused (15c0516)
- Not-breaking: workflows using `easy-complexsearch` and `complexsearch` will continue to work. These are hidden modules mapping to `easy-multimersearch` and `multimersearch` internally. However, the internals have had major changes since the last release.
Other features
- `convert2pdb` can output separate PDB files (346c1dd)
- `createdb` learned to read a large number of input files from a `.tsv` file (e1394aa)
- Force input format with `createdb --input-format` (852434a)
- Compute exact TM-score with `--exact-tmscore` (493cefe)
- Added CATH50 database (6893dcc)
- Update HTML output (not fully supported for multimer yet; c7e4a37, 361c22a, 1bc8d2e; Thanks @gamcil)
- `compressca` learned new input and output modes (8e68e86, 5d2724d, 284bc81)
Bug Fixes
- Fix broken symlinks with `databases PDB` download (9ef6d18, fa6c530).
- Fix AFDB Proteome and SwissProt download check (fa6c530, Thanks @TigerWindWood)
- Fix AF3 mmCIF files crashing `createdb`
- Fix `convert2pdb` creating broken PDB files for large structures (b6dac8a)
- Remove ligand and alt res within chain #198 (Thanks @NatureGeorge)
- Skip residues without C-alpha #214 (75a50f7)
- `structurerescorediagonal` did not properly respect `--tmscore-threshold` (#205; 886021d)
- Fall back to Smith-Waterman alignment when block-aligner produces invalid alignments (54c271c)
Developers
8-ef4e960
At a glance: Added support for cluster and protein-complex searches (alpha version, feedback welcome) as well as improved HTML output.
Features
- Implement `easy-complex-search` to find similar complex structures in a database
- Implement a cluster search `--cluster-search 1`, which speeds up searches through redundant databases. It first searches only the representatives and then expands the final alignment to all cluster members. Two downloadable DBs support this search: `PDB` and the `Alphafold/UniProt50`.
- `createclusearchdb` allows building a searchable cluster database (b4d7ec5)
- `convertalis` HTML output updated to match search.foldseek.com output (96be67c)
- Introduced `Alphafold/UniProt50-minimal` and updated cluster file downloads for regular `Alphafold/UniProt50` to support cluster searches (93ad1d4, 2e9da41, daad5ab)
- We added two modules, `scorecomplex` and `createcomplexreport`, to compute a TMscore between protein complexes and to summarize the findings (938b591, a6c75cb)
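The cluster search described above can be sketched as a two-step shell helper: download one of the cluster-enabled databases, then search it with `--cluster-search 1`. The function name `cluster_search` and the local database name `clusterDB` are placeholders for this example, and the `databases` call assumes the usual `foldseek databases NAME DB tmp` argument order.

```shell
# cluster_search QUERY DB_NAME RESULT
# Hypothetical helper; DB_NAME is a downloadable cluster database such as PDB.
cluster_search() {
    query=$1
    db_name=$2
    result=$3

    # Download the cluster-enabled database into a local database "clusterDB".
    foldseek databases "$db_name" clusterDB tmp || return 1

    # Search representatives first, then expand hits to all cluster members.
    foldseek easy-search "$query" clusterDB "$result" tmp --cluster-search 1
}
```

A call such as `cluster_search example/d1asha_ PDB result.m8` would download the PDB database and run the search against it.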
Bug fixes
- Foldseek correctly computes coverage again (c63725d). Coverage computation was broken since release 6 (29979fb).
- `--alignment-type 0` (3Di-only) now correctly ignores amino-acid information (f0de872)
- `createdb` could miss some files when recursively looking within directories on some file systems (d1d1b86)
- `convertalis --format-output` can output `qca` or `tca` if only one of the two databases has C-alpha information (311845d)
- `--lddt-thr` and `--tmscore-thr` are ignored when `--sort-by-structure-bits 0` is set (b1b4710)
Developers
- Much smaller precomputed index for `--prefilter-mode 0` (exhaustive ungapped prefiltering) with `--index-exclude 1`, or `--sort-by-structure-bits 0` (no C-alpha) with `--index-exclude 2`, or both with `--index-exclude 3` (8f586c0)
- Enabled WebAssembly (WASM) compilation for Foldseek (408cfae; pending on Daniel-Liu-c0deb0t/block-aligner#26)
Others
- Thanks @amorehead and @KevinDuringWork for their pull request (#159, #170)
7-04e0ec8
At a glance: Downloadable pdb database can be searched with --cluster-search 1. Many createdb improvements and other bug fixes.
Features
- `createdb` properly warns and exits if no protein chain can be extracted (a146142, #134)
- `createdb` separates PDB/mmCIF MODEL records into different source/lookup entries (d488f4a)
- `createdb` filters out structures that are not proteins (d48d389)
- `databases` downloader supports cluster databases (ef768f4)
- `pdb` database creation script has been updated to produce a cluster database that can be searched with `--cluster-search 1` (8eb36a2)
Bug fixes
- Fixed a bug with block-aligner where long protein sequences would error out (0627447) Thanks @Daniel-Liu-c0deb0t!
- Foldseek can be compiled without zlib, fixed an issue with zlib linking to gemmi (0832bef, 1a038db)
- Fixed Dockerfile to drop backports, as they are not needed with Debian bookworm (04e0ec8)
Others
- Made `compressca` an expert tool, hiding it from the default view to avoid confusion. (e4fe5be)
Developers
Foldseek 6-29e2557
At a glance: Introduced block-aligner for faster alignments, added ungapped prefilter mode, added cluster search support
Major Features
- Introduced block-aligner, a new banded-alignment algorithm that speeds up alignments by ~2x. Check out the block-aligner preprint. Thanks @Daniel-Liu-c0deb0t!
- Added ungapped prefilter mode (`--prefilter-mode 1`). This is similar to the HHblits prefilter, which exhaustively aligns all queries and targets without gaps. This mode has much lower memory requirements and should scale better for single or few query searches. However, it scales worse with many queries.
- Added cluster search support, similar to the search introduced in ColabFold
Features
- Improved `README`
- Added support for `qtmscore` and `ttmscore` in `convertalis --format-output`
- LDDT computation is now faster
Bug fixes
- `--greedy-best-hit` search mode is now correctly exposed. Thanks @Pooryamb!
- Removed ANISOU parsing of PDB
- Added missing Foldseek-specific `convertalis --format-output` options to help text
Developers and Maintainers
Foldseek now requires Rust to compile. Please make sure Rust 1.68 or newer is installed, as we have observed issues with 1.64. You can pass -DIGNORE_RUST_VERSION=1 to CMake to ignore the check. Please ensure the Foldseek regression test in ./regression/run_regression.sh passes before shipping Foldseek packages. We also require at least CMake 3.15 now.
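A packaging script could check these minimum versions before configuring the build. The sketch below is one way to do that in POSIX shell; the helper name `version_ge` is made up for this example, and it assumes that `rustc --version` and `cmake --version` print the version number as the second and third word of their first line, respectively.

```shell
# version_ge A B: succeeds when dotted version A is >= B,
# comparing numerically field by field (missing fields count as 0).
version_ge() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        na = split(a, x, "[.]"); nb = split(b, y, "[.]")
        n = (na > nb) ? na : nb
        for (i = 1; i <= n; i++) {
            xi = (i <= na) ? x[i] + 0 : 0
            yi = (i <= nb) ? y[i] + 0 : 0
            if (xi > yi) exit 0
            if (xi < yi) exit 1
        }
        exit 0
    }'
}

# Empty when the tool is not installed.
rust_version=$(rustc --version 2>/dev/null | awk '{ print $2 }')
cmake_version=$(cmake --version 2>/dev/null | awk 'NR == 1 { print $3 }')

if version_ge "${rust_version:-0}" 1.68; then
    echo "Rust ${rust_version} is new enough"
else
    echo "Rust 1.68 or newer is required (found: ${rust_version:-none})"
fi

if version_ge "${cmake_version:-0}" 3.15; then
    echo "CMake ${cmake_version} is new enough"
else
    echo "CMake 3.15 or newer is required (found: ${cmake_version:-none})"
fi
```

The check only reports; whether to abort or to pass `-DIGNORE_RUST_VERSION=1` to CMake is left to the packager.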
Foldseek 5-53465f0
At a glance: Compressed C-alpha coordinates, now enabled by default, greatly decrease the resource consumption of large databases. Otherwise, this release is mostly housekeeping.
Features
- Compressed C-alpha coordinates are now enabled by default
- Foldseek now deals correctly with modified amino acids and HETATMS
- Exhaustive search mode that skips prefiltering with `--exhaustive-search 1`
- TM-align speed up by replacing `score_fun8_standard`
Bug fixes
- Disable gap-specific profiles for structure alignments
- C-alpha coordinates were not correctly preloaded in the alignment stage
- Reciprocal best hit search now disables new scoring and compositional bias correction for consistent scores in both directions
- Fixed various bugs around compressed C-alpha coordinates
- Computed RMSD was wrong
- Load the DB in memory before aligning (`structurealign` performance issue)
- Alignment now uses the correct `--comp-bias-corr-scale`
- Fix crash with highly compositionally biased sequences
Foldseek 4-645b789
Release at a glance: better hit ranking, critical bug fix, structure clustering, smaller database size and updated AlphaFold Databases.
Features
- `foldseek databases` now offers the AlphaFoldDB v4 databases.
- We have improved hit ranking in Foldseek by multiplying the 3Di/AA bit-score by the geometric mean of alignment LDDT and TMscore, resulting in more accurate rankings.
- The `--format-output prob` parameter now returns the probability of homology.
- The `--format-mode 5` flag generates PDB files with all Cα atoms superimposed onto the query structure based on the aligned coordinates.
- We have added a faster computation for LDDT, available with the `--format-output lddt,lddtfull` flag. The `lddt` flag outputs the average LDDT score for all Cα, while the `lddtfull` flag outputs a string of LDDT scores for each Cα.
- The `--coord-store-mode 2` parameter allows lossless storage of C-alpha coordinates in compressed format.
- TMalign mode (`--alignment-type 1`) now uses the 3Di/AA as a prefilter to improve the precision and recall of TMalign; this also makes the TMalign mode much faster.
- We have added support for reading in Foldcomp databases (see foldcomp.foldseek.com).
- The database module now includes an option to download ESMAtlas30.
- We have added support for `easy-cluster`, a tool to cluster structural datasets using 3Di/AA alignment, LDDT, and TMscore.
- We have added support for profile searches as well as iterative searches using the `--num-iterations` flag.
- TMalign results can now be sorted by qTM, tTM, min(qTM, tTM), max(qTM, tTM), and avg(qTM, tTM) using the `--sort` flag.
- New module `compressca`: converts an uncompressed Cα database to compressed format.
- New module `convert2pdb`: converts a Foldseek structure database to a multi-model PDB file.
- We added our PDB100 update pipeline to `util/update_webserver_pdb`
Breaking Change
- 3Di/AA score reported by Foldseek is now `bit-score * sqrt(alignment LDDT * alignment TMscore)`
- Default sort of TMalign is now average avg(qTM,tTM).
- We do not provide the "Alphafold/UniProt-NO-CA" database anymore, Cα databases are now always required.
- AlphaFoldDB Swiss-Prot and Proteome file names have changed. Downloads for these will stop working on Foldseek versions before this one. More generally, since the Cα database format has changed and is incompatible with older Foldseek versions, none of the v4 databases will work with previous versions.
- The default E-value is now 10.
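To make the new score concrete, the snippet below applies the formula from the first bullet to example numbers. It is only an arithmetic sketch: the helper name `rescore` and the input values are made up, and Foldseek's own rounding may differ.

```shell
# rescore BITSCORE LDDT TMSCORE
# Prints bit-score * sqrt(alignment LDDT * alignment TMscore) with one decimal.
rescore() {
    awk -v bits="$1" -v lddt="$2" -v tm="$3" \
        'BEGIN { printf "%.1f\n", bits * sqrt(lddt * tm) }'
}

rescore 100 0.81 0.64   # made-up inputs
```

With these inputs the helper prints `72.0`, since sqrt(0.81 * 0.64) = 0.72. A hit with the same bit-score but a poorer structural fit, for example LDDT 0.49 and TMscore 0.36, would drop to `42.0`.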
Bug fixes
- We have fixed an issue that resulted in the loss of high-scoring diagonals during the `prefilter` step.
- The visualization has been fixed for cases where the alignment length is exactly 80.
- We have fixed issues with tar inputs.
Foldseek 3-915ef7d
Features
- Added `databases` downloads for the AlphaFold Uniprot Protein Structure Database.
You can choose between Alphafold/UniProt, Alphafold/UniProt-NO-CA and Alphafold/UniProt50:
Alphafold/UniProt: Contains all 214 million entries from the AlphaFold UniProt database, including C-alpha. This database is ~700GB large to download and ~950GB after extraction.
Alphafold/UniProt-NO-CA: Excludes C-alphas and is much smaller (~70GB download, ~170GB extracted). However, TM-align based alignments do not work (search --alignment-type 1, tmalign, and convertalis --format-output alntmscore,u,t).
Alphafold/UniProt50: Alphafold/UniProt clustered with MMseqs2 to 50% sequence identity and 80% bidirectional coverage (~190GB download). We offer this database in the web server at https://search.foldseek.com.
- Added `databases` TSV output
- `createdb` supports downloading structures from Google Cloud Storage. Not enabled by default, see user guide on how to compile Foldseek with GCS support
- PDB offered through `databases` will be updated regularly. Thanks to @jaylee2000
Known issues
- `prefilter` against large databases such as the AlphaFold Uniprot Protein Structure Database is executed with 6-mers (`-k 6`). This is less efficient than 7-mers. We will optimize 7-mer parameters in a future release and re-enable automatic k-mer size choice
Bug fixes
- Fixed PDB download
Foldseek 2-8bd520
Features
- implemented reciprocal-best-structure-hit search (`rbh` and `easy-rbh`) similar to Monzon et al. preprint
- C-alpha only structures are supported as input (backbone is completed using pulchra)
- `convertalis` can output a HTML based result viz (`--format-mode 3`)
Example: `foldseek easy-search example/d1asha_ example/ aln.html tmp --format-mode 3`
- add support to read structures from `tar` and `tar.gz` in `createdb`, `easy-search` and `easy-rbh`.
Example: `foldseek easy-rbh UP000005640_9606_HUMAN_v2.tar UP000001940_6239_CAEEL_v2.tar rbh tmp --tar-include '.*pdb'`
- `convertalis` can output C-alpha, TMscore, TM rotation matrices (`--format-output qca,tca,alntmscore,u,t` respectively)
foldseek easy-search example/ example/ aln tmp --format-output query,target,alntmscore,u,t
cat aln
d2gdma_ d2gdma_ 1.000E+00 1.000,-0.000,0.000,0.000,1.000,0.000,-0.000,-0.000,1.000 -0.000,-0.000,0.000
d2gdma_ d1q1fa_ 7.971E-01 0.299,-0.746,-0.595,0.952,0.192,0.237,-0.062,-0.638,0.768 94.039,-63.738,34.804
d2gdma_ d1cqxa1 6.794E-01 0.694,-0.662,0.283,0.570,0.746,0.345,-0.439,-0.078,0.895 7.534,-93.168,-12.301
- introduce `--alt-ali` to compute additional sub-optimal alignments for a query-target pair #12
- added Foldseek docker image (supports `linux/amd64` and `linux/arm64`)
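The `u` and `t` columns shown in the example output above can be used to transform coordinates outside of Foldseek. The sketch below is a minimal illustration under two assumptions that these notes do not spell out: that the nine values of `u` are the rotation matrix in row-major order, and that a point is mapped as x' = U·x + t. The helper name `apply_transform` is hypothetical.

```shell
# apply_transform U T X Y Z
# U: nine comma-separated values (assumed row-major 3x3 rotation matrix)
# T: three comma-separated values (translation vector)
# Prints the transformed point U * (X, Y, Z) + T with three decimals.
apply_transform() {
    awk -v "u=$1" -v "t=$2" -v "x=$3" -v "y=$4" -v "z=$5" 'BEGIN {
        split(u, U, ","); split(t, T, ",")
        printf "%.3f %.3f %.3f\n",
            U[1] * x + U[2] * y + U[3] * z + T[1],
            U[4] * x + U[5] * y + U[6] * z + T[2],
            U[7] * x + U[8] * y + U[9] * z + T[3]
    }'
}
```

With the identity rotation and zero translation from the self-hit row, `apply_transform '1.000,-0.000,0.000,0.000,1.000,0.000,-0.000,-0.000,1.000' '-0.000,-0.000,0.000' 1 2 3` leaves the point unchanged. Which of the two structures the transform applies to should be confirmed against the Foldseek documentation before relying on it.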
Bug fixes
Foldseek Release 1-3c64211
First release of Foldseek
Foldseek enables fast and sensitive comparisons of large structure sets. It reaches sensitivities similar to state-of-the-art structural aligners while being at least 20,000 times faster.
Publications
Webserver
Search your protein structures against the AlphaFoldDB and PDB in seconds using our Foldseek webserver:
