RAmBLA (Reliability Assessment for Biomedical LLM Assistants) is a framework for evaluating LLMs on a set of tasks designed to test for reliability. Specifically, the tasks can be divided into the following three aspects of reliability:
- Robustness to non-semantic variations: LLMs should be robust to prompt variations that do not alter prompt meaning, and they should not display biases during few-shot prompting.
- High recall: When operating on documents, LLMs should recall all relevant information, relying on either parametric knowledge or context exclusively, as instructed.
- Hallucinations: If they have insufficient knowledge or context information to answer a question, LLMs should refuse to answer.
Further details can be found in our [paper](LINK PLACEHOLDER).
## Contents

- Installation
- Running Evaluations
- Tasks
- Semantic/Textual similarity component evaluation
- Unit-Tests
- Integration Tests
- More Information
- Contributing
- License
- Contact Info
- Citing
## Installation

RAmBLA uses Python version 3.10.10. To install, follow these steps:

- Clone the repository:

  ```
  git clone (URL placeholder)
  ```

- Create a conda environment and install the package using the Makefile with the following command:

  ```
  make init
  ```

- Set environment variables by creating a `.env` file according to `.env_example` (an illustrative sketch follows these steps). This includes the following environment variables:
  | Variable | Description |
  |---|---|
  | `OPENAI_<var-name>` | Set of variables required to access the OpenAI API |
  | `DATASET_STORAGE_PATH` | Path where datasets should be stored |
  | `MLFLOW_PROJECT_NAME` | Name of the project under which evaluations are logged |
  | `BACKOFF_MAX_TRIES` / `BACKOFF_INTERVAL` | Retry parameters when using API-based models |
- Download the `bioasq` dataset under `DATASET_STORAGE_PATH`. See the documentation for instructions.
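For illustration, a minimal `.env` might look like the following; all values are placeholders, and the exact `OPENAI_*` variable names should be taken from `.env_example`:

```
# Placeholder values only; see .env_example for the authoritative variable names
OPENAI_API_KEY=...
DATASET_STORAGE_PATH=/path/to/datasets
MLFLOW_PROJECT_NAME=rambla-evaluations
BACKOFF_MAX_TRIES=5
BACKOFF_INTERVAL=30
```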
## Running Evaluations

The main entry point for evaluating LLMs against an individual task is `rambla/run/run_task.py`. An example command is:

```
python rambla/run/run_task.py task=mcqabaseline model=openai_chat
```

NOTE: We have a few model configs under `rambla/conf/model/`. For `rambla/conf/model/llama2_7b_chat_local.yaml`, the `params.model_name` parameter needs to be updated to point to the path where the model is stored.
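For example, the relevant entry would look along these lines (the surrounding structure of the YAML file is an assumption; only `params.model_name` is prescribed above):

```yaml
params:
  model_name: /path/to/llama-2-7b-chat  # placeholder: local path to the model weights
```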
All tasks in this repo are configured using Hydra.
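Hydra also makes it possible to override individual config fields from the command line; for example (the `model.params.temperature` field here is purely illustrative):

```
python rambla/run/run_task.py task=mcqabaseline model=openai_chat model.params.temperature=0.0
```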
To run the full evaluation suite on a model, use the script `bin/run_all_tasks.py`. For example:

```
python bin/run_all_tasks.py --models=openai_chat,mistral_7b_instruct_hf
```

This will run the full evaluation suite on ChatGPT and the Mistral 7B Instruct model.

NOTE: Running the full evaluation suite can be very slow. We recommend running individual tasks over the full suite.
## Tasks

For detailed information on each task, including how to configure them and example run commands, please refer to the docs. The following model types are supported:

- OpenAI Chat model
- OpenAI Completion model
- HuggingFace Text Generation model
- HuggingFace Natural Language Inference (NLI) model
## Semantic/Textual similarity component evaluation

This task was designed to evaluate different components (models) on their ability to measure semantic similarity. These components take as input two pieces of text and output a score (binary or continuous) that reflects the similarity between the two input texts. The best-performing component (GPT-4) was then chosen as the default for the evaluation tasks where a semantic similarity metric was required.

We currently support one task, which consists of passing two long-form texts to a component and receiving a metric for how similar the two texts are. It can run different components against different datasets and capture a range of different metrics.
**LLM prompting:** We prompt GPT with the two sentences and ask whether they are semantically equivalent. Returns Yes or No.
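A minimal sketch of this component, assuming the `openai` Python client; the exact prompt wording and model used in RAmBLA may differ:

```python
# Illustrative only: the prompt wording and model choice are assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def semantically_equivalent(text_a: str, text_b: str) -> bool:
    prompt = (
        "Are the following two sentences semantically equivalent? "
        "Answer 'Yes' or 'No' only.\n\n"
        f"Sentence A: {text_a}\nSentence B: {text_b}"
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower().startswith("yes")
```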
**Embedding similarity:** We first embed the two sentences using an embeddings model and then compute the inner product between the two embeddings. Returns a score between 0 and 1 (if the embeddings are normalised).
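A minimal sketch, using the `sentence-transformers` library as a stand-in for whichever embeddings model is configured:

```python
# Illustrative only: the embedding model is an assumption.
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def embedding_similarity(text_a: str, text_b: str) -> float:
    # With normalised embeddings, the inner product equals cosine similarity.
    emb = encoder.encode([text_a, text_b], normalize_embeddings=True)
    return float(emb[0] @ emb[1])
```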
**NLI (Natural Language Inference) models** (see `NLIModel` in `rambla/models/huggingface.py`): We provide the two texts as input to the NLI model, and the output is a score for each of the classes {entailment, neutral, contradiction}. These scores can be used in the following modes:
- Unidirectional model: "Does sentence A follow from sentence B?"
  - Classification: argmax of the scores (returns the predicted class)
  - Regression: exponential softmax of the entailment score (returns a score between 0 and 1)
- Bidirectional model: "Does sentence A follow from sentence B AND does sentence B follow from sentence A?"
  - Classification:
    - Strict: bidirectional entailment is required for a similarity classification (this was our initial preferred method given results on the SICK dataset; please see below)
    - Relaxed: bidirectional entailment, or entailment in one direction and neutral in the other, is required for a similarity classification
  - Regression:
    - Average: mean over both directions of the exponential softmax of the entailment score (returns a score between 0 and 1)
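A minimal sketch of the bidirectional variant, assuming an MNLI-style checkpoint from the Hugging Face Hub; the repo's `NLIModel` may differ:

```python
# Illustrative only: the checkpoint name and label order are assumptions.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "roberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

def class_probs(premise: str, hypothesis: str) -> torch.Tensor:
    """Softmax over {contradiction, neutral, entailment} for one direction."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits[0]
    return torch.softmax(logits, dim=-1)

def bidirectional_similarity(a: str, b: str) -> dict:
    p_ab, p_ba = class_probs(a, b), class_probs(b, a)
    # Label order varies by checkpoint; roberta-large-mnli uses ENTAILMENT = 2.
    ent = model.config.label2id.get("ENTAILMENT", 2)
    strict = p_ab.argmax().item() == ent and p_ba.argmax().item() == ent
    return {
        "strict": strict,                               # bidirectional entailment
        "average": (p_ab[ent] + p_ba[ent]).item() / 2,  # score in [0, 1]
    }
```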
We evaluate these components against the following datasets:

- MRPC (Microsoft Research Paraphrase Corpus): pairs of sentences that either are or are not paraphrases of each other; this can be extrapolated to imply similarity.
- SICK (Sentences Involving Compositional Knowledge): pairs of sentences annotated for two crucial semantic tasks: relatedness in meaning (on a 5-point rating scale) and the entailment relation between the two elements (with three possible gold labels: entailment, contradiction, and neutral).
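Both datasets are available on the Hugging Face Hub; one possible way to load them (the dataset identifiers are assumptions, and RAmBLA may obtain the data differently):

```python
from datasets import load_dataset

mrpc = load_dataset("glue", "mrpc")  # binary paraphrase labels
sick = load_dataset("sick")          # 5-point relatedness scores + entailment labels
```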
## Unit-Tests

For testing, RAmBLA uses pytest. All unit tests are located under `tests`. To run the full test suite, run:

```
pytest tests/
```

NOTE: These need to be run manually!
## Integration Tests

We have two sets of integration tests:

Integration tests for `rambla/run/run_task.py`. Example usage:

- This will run a minimal version of the mcqabaseline task against openai_chat:

  ```
  python integration_tests/run_task.py -m openai_chat -t mcqabaseline
  ```

- This will run a minimal version of the mcqabaseline task against all available models:

  ```
  python integration_tests/run_task.py -t mcqabaseline
  ```

- This will run a minimal version of all available tasks against openai_chat:

  ```
  python integration_tests/run_task.py -m openai_chat
  ```

- This will run a minimal version of all available tasks against all available models:

  ```
  python integration_tests/run_task.py
  ```

Integration tests for `rambla/run/run_text_to_text.py`. Example usage:

- This will run a minimal version of all available tasks against all available components:

  ```
  python integration_tests/run_text_to_text.py
  ```
## More Information

For further details about working with RAmBLA, see the extended documentation located under `docs`.
## Contributing

We welcome contributions, feedback and suggestions to RAmBLA. If you would like to make a contribution, please follow our guidelines.

Please check for existing GitHub issues related to the change, and create a new issue if one does not exist, so we can first open a discussion on the proposed change.
- Clone and install the repo according to the installation instructions.

- Create a new branch:

  ```
  git checkout -b <my-branch-name>
  ```

  Ideally use the prefix `feat/` for feature-based branches, and `hotfix/` for bug fixes.
When you make changes to the code, please ensure your changes adhere to our code style. We use the following:

- NumPy docstring style
- black and flake8 to ensure a consistent code style
- isort to ensure imports are organised consistently
We use pre-commit hooks to ensure all code adheres to these standards. If you install the package according to our installation instructions, they will run automatically on every commit. To run them manually, use:

```
pre-commit run --all-files
```
All code submissions should include unit tests written using the pytest framework, located in the relevant directory under `tests`. Please ensure all tests pass before submitting a change by following the unit testing and integration testing instructions above.
After following the above guidelines, please create a pull request into the `master` branch. Please ensure your pull request contains:

- a title
- a brief description of the changes made
## License

Copyright 2023 GlaxoSmithKline Research & Development Limited. All rights reserved.

This project is licensed under the Apache 2.0 License; see the LICENSE file for details.
## Contact Info

RAmBLA was originally created by the Responsible AI team at GSK.ai. To get in touch, please use the contact details below:
- Rafael Poyiadzi: [email protected]
- Ed Morrell: [email protected]
- Gabriela van Bergen Gonzalez-Bueno: [email protected]
- Lea Goetz: [email protected]
## Citing

If you find this code useful in your research, please cite the associated paper:

```bibtex
@inproceedings{bolton2024rambla,
  title={{RAmBLA}: A Framework for Evaluating the Reliability of {LLM}s as Assistants in the Biomedical Domain},
  author={William James Bolton and Rafael Poyiadzi and Edward Morrell and Gabriela van Bergen Gonzalez Bueno and Lea Goetz},
  booktitle={ICLR 2024 Workshop on Reliable and Responsible Foundation Models},
  year={2024},
  url={https://openreview.net/forum?id=lPXMUJlFfP}
}
```