bowang-lab/BioReason-Pro

🧬 BioReason-Pro
Advancing Protein Function Prediction with
Multimodal Biological Reasoning


Abstract

Protein function annotation is fundamental to understanding biological mechanisms, designing therapeutics, and advancing biomedical research. Current computational methods either rely on shallow sequence similarity or treat function prediction as isolated classification tasks, failing to capture the integrative reasoning across sequence, structure, domains, and interactions that expert biologists perform to infer function. We introduce BioReason-Pro, the first multimodal reasoning large language model (LLM) for protein function prediction that integrates protein embeddings with biological context to generate structured reasoning traces. A key input into BioReason-Pro is the set of GO term predictions made by GO-GPT, our autoregressive transformer that captures hierarchical and cross-aspect dependencies of GO terms. BioReason-Pro is trained via supervised fine-tuning on synthetic reasoning traces generated by GPT-5 for over 130K proteins and further optimized through reinforcement learning. It achieves 73.6% F_max on GO term prediction and an LLM judge score of 8/10 on functional summaries, substantially outperforming previous methods. Evaluations with human protein experts show that BioReason-Pro annotations are preferred over ground truth UniProt annotations in 79% of cases. Remarkably, BioReason-Pro de novo predicted experimentally confirmed binding partners with per-residue attention localizing to the exact contact residues resolved in cryo-EM structures of those complexes. Together, GO-GPT and BioReason-Pro establish a framework for protein function prediction that combines precise ontology modeling with interpretable biological reasoning.


Key Contributions

First multimodal reasoning LLM for protein function: BioReason-Pro deeply integrates ESM3 protein embeddings, a GO graph encoder, and biological context within a unified LLM to generate structured reasoning traces and functional annotations.

Autoregressive GO term prediction (GO-GPT): A novel autoregressive transformer that treats Gene Ontology prediction as a sequence generation task, capturing hierarchical and cross-aspect dependencies that discriminative methods miss, achieving state-of-the-art weighted F_max of 0.65–0.70.

Expert-level functional reasoning: Human protein experts preferred BioReason-Pro annotations over curated UniProt entries in 79% of evaluated cases, with an LLM judge score of 8.03/10 across five evaluation axes.

De novo structural predictions: BioReason-Pro predicted experimentally validated binding partners (e.g., SBP2 for eEFSec) with per-residue attention localizing to the exact contact interfaces resolved in cryo-EM structures.

Structural reasoning beyond domain transfer: The model performs contextual architectural reasoning that overrides misleading superfamily-level annotations, as demonstrated for CFAP61's non-enzymatic scaffolding role.

Broad-scale release: All model weights, training code, curated datasets, a web interface, and precomputed predictions for 240,000+ proteins including the Human Protein Atlas are publicly available.
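
The F_max values quoted above are, by the usual CAFA convention, the maximum over score thresholds of the harmonic mean of protein-averaged precision and recall. A minimal sketch of that plain, unweighted definition follows. The function name `f_max`, the input layout, and the toy data are assumptions made for illustration; the weighted variant reported for GO-GPT and the evaluation code in this repository may differ.

```python
def f_max(predictions, truths, thresholds=None):
    """Protein-centric F_max (unweighted sketch, not the repository's evaluator).

    predictions: {protein_id: {go_term: score in [0, 1]}}
    truths:      {protein_id: set of true go_terms}
    """
    if thresholds is None:
        thresholds = [t / 100 for t in range(1, 101)]

    best = 0.0
    for tau in thresholds:
        precisions, recalls = [], []
        for protein, true_terms in truths.items():
            if not true_terms:
                continue  # proteins without ground truth cannot be scored
            scores = predictions.get(protein, {})
            predicted = {term for term, s in scores.items() if s >= tau}
            hits = len(predicted & true_terms)
            # Precision is averaged only over proteins with a prediction at tau.
            if predicted:
                precisions.append(hits / len(predicted))
            # Recall is averaged over every protein that has ground truth.
            recalls.append(hits / len(true_terms))
        if not precisions or not recalls:
            continue
        pr = sum(precisions) / len(precisions)
        rc = sum(recalls) / len(recalls)
        if pr + rc > 0:
            best = max(best, 2 * pr * rc / (pr + rc))
    return best


# Toy data (hypothetical GO identifiers and scores).
toy_truths = {"P1": {"GO:A", "GO:B"}, "P2": {"GO:C"}}
toy_predictions = {
    "P1": {"GO:A": 0.9, "GO:B": 0.4, "GO:X": 0.2},
    "P2": {"GO:C": 0.8, "GO:Y": 0.6},
}
print(round(f_max(toy_predictions, toy_truths), 3))
```

Averaging precision only over proteins that receive at least one prediction keeps high thresholds from being penalized twice: a protein with no predictions already lowers recall.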


Web Interface

Try BioReason-Pro directly through our web-based inference server:

🔗 bioreason.net

Precomputed predictions for 240,000+ proteins (including the Human Protein Atlas) are available at bioreason.net/atlas.


Datasets

The training and evaluation datasets comprise 133,492 proteins across 3,135 organisms, curated from UniProt together with their experimental GO annotations, InterPro domains, STRING protein-protein interactions, and PDB structures. The temporal holdout split follows the CAFA framework. Detailed download and usage instructions are available on our HuggingFace collection.
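
As a rough illustration of a CAFA-style temporal holdout, the sketch below places a protein in the training set if its earliest experimental annotation falls on or before a cutoff date, and in the held-out set otherwise. The function name `temporal_split`, the record layout, the cutoff date, and the accessions are all hypothetical. The actual cutoffs and filtering rules used for BioReason-Pro are not stated here and may differ.

```python
from datetime import date


def temporal_split(annotations, cutoff):
    """Split proteins by the date of their earliest experimental annotation.

    annotations: iterable of (protein_id, go_term, annotation_date) tuples
    cutoff:      datetime.date; annotations on or before it count as "known"
    Returns (train_ids, test_ids) as two disjoint sets.
    """
    first_seen = {}
    for protein, _term, when in annotations:
        if protein not in first_seen or when < first_seen[protein]:
            first_seen[protein] = when

    train = {p for p, d in first_seen.items() if d <= cutoff}
    test = {p for p, d in first_seen.items() if d > cutoff}
    return train, test


# Hypothetical records for illustration only.
records = [
    ("P1", "GO:A", date(2021, 5, 1)),
    ("P1", "GO:B", date(2024, 2, 1)),  # later annotation, P1 stays in train
    ("P2", "GO:C", date(2024, 3, 15)),
]
train_ids, test_ids = temporal_split(records, cutoff=date(2023, 1, 1))
print(sorted(train_ids), sorted(test_ids))
```

Splitting by date rather than at random mimics the real annotation setting, where a model must predict functions that were not yet recorded when it was trained.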


Checkpoints

Model weights are available on our HuggingFace collection:

Model                 Link
GO-GPT                HuggingFace
BioReason-Pro SFT     HuggingFace
BioReason-Pro RL      HuggingFace

Installation

Prerequisites

  • Python 3.11+
  • A CUDA-capable GPU (recommended for best performance)
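
A quick way to check these prerequisites before installing is sketched below. It assumes PyTorch is the GPU backend, which this README does not state; if `torch` is not installed, the GPU check simply reports `None`. The helper names `python_ok` and `cuda_available` are illustrative and are not part of the package.

```python
import sys

MIN_PYTHON = (3, 11)  # minimum version listed in the prerequisites


def python_ok(version=None, minimum=MIN_PYTHON):
    """Return True if the (major, minor) version meets the minimum."""
    version = sys.version_info if version is None else version
    return tuple(version[:2]) >= minimum


def cuda_available():
    """Return True or False if PyTorch can be imported, otherwise None.

    Assumption: PyTorch is the GPU backend used by the package.
    """
    try:
        import torch
    except ImportError:
        return None
    return torch.cuda.is_available()


if __name__ == "__main__":
    print("Python >= 3.11:", python_ok())
    print("CUDA available:", cuda_available())
```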

Installation Steps

# Clone the repository
git clone https://github.com/bowang-lab/BioReason-Pro.git
cd BioReason-Pro

# Install package
pip install -e .

Citation

If you find this work useful, please cite our papers:

@article{Fallahpour2026.03.19.712954,
	author = {Fallahpour, Adibvafa and Seyed-Ahmadi, Arman and Idehpour, Parsa and Ibrahim, Omar and Gupta, Purav and Naimer, Jack and Zhu, Kevin and Shah, Arnav and Ma, Shihao and Adduri, Abhinav and G{\"u}loglu, Talu and Liu, Nuo and Cui, Haotian and Jain, Arihant and de Castro, Max and Fallahpour, Amirfaham and Cembellin-Prieto, Antonio and Stiles, John S. and Nem{\v c}ko, Filip and Nevue, Alexander A. and Moon, Hyungseok C. and Sosnick, Lucas and Markham, Olivia and Duan, Haonan and Lee, Michelle Y. Y. and Salvador, Andrea F. M. and Maddison, Chris J. and Thaiss, Christoph A. and Ricci-Tam, Chiara and Plosky, Brian S. and Burke, Dave P. and Hsu, Patrick D. and Goodarzi, Hani and Wang, Bo},
	title = {BioReason-Pro: Advancing Protein Function Prediction with Multimodal Biological Reasoning},
	elocation-id = {2026.03.19.712954},
	year = {2026},
	doi = {10.64898/2026.03.19.712954},
	publisher = {Cold Spring Harbor Laboratory},
	URL = {https://www.biorxiv.org/content/early/2026/03/20/2026.03.19.712954},
	eprint = {https://www.biorxiv.org/content/early/2026/03/20/2026.03.19.712954.full.pdf},
	journal = {bioRxiv}
}

@misc{fallahpour2025bioreasonincentivizingmultimodalbiological,
      title={BioReason: Incentivizing Multimodal Biological Reasoning within a DNA-LLM Model}, 
      author={Adibvafa Fallahpour and Andrew Magnuson and Purav Gupta and Shihao Ma and Jack Naimer and Arnav Shah and Haonan Duan and Omar Ibrahim and Hani Goodarzi and Chris J. Maddison and Bo Wang},
      year={2025},
      eprint={2505.23579},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2505.23579}, 
}

Authors

  • Adibvafa Fallahpour¹²³⁴⁵ * ([email protected])
  • Arman Seyed-Ahmadi²⁵ *
  • Parsa Idehpour¹⁵⁹ *
  • Omar Ibrahim²⁵ *
  • Purav Gupta³⁴⁵ *
  • Jack Naimer⁵ ¹⁰
  • Kevin Zhu⁵⁸
  • Arnav Shah³⁴⁵
  • Shihao Ma²³⁴⁵
  • Abhinav Adduri¹⁵
  • Talu Güloglu⁴⁵ ¹¹
  • Nuo Liu¹
  • Haotian Cui¹³
  • Arihant Jain¹⁹
  • Max de Castro
  • Amirfaham Fallahpour
  • Antonio Cembellin-Prieto¹
  • John S. Stiles¹
  • Filip Nemčko¹
  • Alexander A. Nevue¹
  • Hyungseok C. Moon¹
  • Lucas Sosnick¹⁶
  • Olivia Markham¹²
  • Haonan Duan³⁴
  • Michelle Y. Y. Lee¹⁶
  • Andrea F. M. Salvador¹⁶
  • Chris J. Maddison³⁴
  • Christoph A. Thaiss¹⁶
  • Chiara Ricci-Tam¹
  • Brian S. Plosky¹
  • Dave P. Burke¹
  • Patrick D. Hsu¹⁸
  • Hani Goodarzi†‡¹⁷ ([email protected])
  • Bo Wang†‡²³⁴ ¹³ ([email protected])

¹ Arc Institute ² University Health Network ³ Vector Institute ⁴ University of Toronto ⁵ Core Contributor
⁶ Stanford University ⁷ University of California, San Francisco ⁸ University of California, Berkeley
⁹ University of Pennsylvania ¹⁰ EPFL ¹¹ ETH Zürich ¹² Cohere ¹³ Xaira Therapeutics


* Equal contribution. The order of authors is not a reflection of their relative contributions.
† These authors, listed alphabetically, jointly supervised this work.
‡ Corresponding authors

Made with ❤️ at Arc Institute, University of Toronto, Vector Institute, and University Health Network
