RAmBLA (Reliability Assessment for Biomedical LLM Assistants) is a framework for evaluating LLMs on a set of tasks designed to test for reliability. Specifically, the tasks can be divided into the following three aspects of reliability:
- Robustness to non-semantic variations: LLMs should be robust to prompt variations that do not alter prompt meaning, and they should not display biases during few-shot prompting.
- High recall: When operating on documents, LLMs should recall all relevant information, relying on either parametric knowledge or context exclusively, as instructed.
- Hallucinations: If they have insufficient knowledge or context information to answer a question, LLMs should refuse to answer.
Further details can be found in our [paper](LINK PLACEHOLDER).
## Contents

- Installation
- Running Evaluations
- Tasks
- Semantic/Textual similarity component evaluation
- Unit-Tests
- Integration Tests
- More Information
- Contributing
- License
- Contact Info
- Citing
## Installation

RAmBLA uses Python version 3.10.10. To install, follow these steps:

- Clone the repository:

  ```
  git clone (URL placeholder)
  ```

- Create a conda environment and install the package using the Makefile with the following command:

  ```
  make init
  ```

- Set environment variables by creating a `.env` file according to `.env_example` (an illustrative sketch follows these steps). This includes the following environment variables:
  | Variable | Description |
  |---|---|
  | `OPENAI_<var-name>` | Set of variables required to access the OpenAI API |
  | `DATASET_STORAGE_PATH` | Path where datasets should be stored |
  | `MLFLOW_PROJECT_NAME` | Name of the project under which evaluations are logged |
  | `BACKOFF_MAX_TRIES` / `BACKOFF_INTERVAL` | Retry parameters when using API-based models |
- Download the `bioasq` dataset under `DATASET_STORAGE_PATH`. See the documentation for instructions.
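For illustration, a minimal `.env` might look like the following; all values are placeholders, and the exact `OPENAI_*` variable names should be taken from `.env_example`:

```
# Placeholder values only; see .env_example for the authoritative variable names
OPENAI_API_KEY=...
DATASET_STORAGE_PATH=/path/to/datasets
MLFLOW_PROJECT_NAME=rambla-evaluations
BACKOFF_MAX_TRIES=5
BACKOFF_INTERVAL=30
```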
## Running Evaluations

The main entry point for evaluating LLMs against an individual task is `rambla/run/run_task.py`. An example command is:

```
python rambla/run/run_task.py task=mcqabaseline model=openai_chat
```

NOTE: We have a few model configs under `rambla/conf/model/`. For `rambla/conf/model/llama2_7b_chat_local.yaml`, the `params.model_name` parameter needs to be updated to point to the path where the model is stored.
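For example, the relevant entry would look along these lines (the surrounding structure of the YAML file is an assumption; only `params.model_name` is prescribed above):

```yaml
params:
  model_name: /path/to/llama-2-7b-chat  # placeholder: local path to the model weights
```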
All tasks in this repo are configured using Hydra.
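Hydra also makes it possible to override individual config fields from the command line; for example (the `model.params.temperature` field here is purely illustrative):

```
python rambla/run/run_task.py task=mcqabaseline model=openai_chat model.params.temperature=0.0
```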
To run the full evaluation suite on a model, use the script `bin/run_all_tasks.py`. For example:

```
python bin/run_all_tasks.py --models=openai_chat,mistral_7b_instruct_hf
```

This will run the full evaluation suite on ChatGPT and the Mistral 7B Instruct model.

NOTE: Running the full evaluation suite can be very slow. We recommend running individual tasks over the full suite.
## Tasks

For detailed information on each task, including how to configure them and example run commands, please refer to the docs. The following model types are supported:

- OpenAI Chat model
- OpenAI Completion model
- HuggingFace Text Generation model
- HuggingFace Natural Language Inference (NLI) model
## Semantic/Textual similarity component evaluation

This task was designed to evaluate different components (models) on their ability to measure semantic similarity. These components take as input two pieces of text and output a score (binary or continuous) that reflects the similarity between the two input texts. The best-performing component (GPT-4) was then chosen as the default for the evaluation tasks where a semantic similarity metric was required.

We currently support one task, which consists of passing two long-form texts to a component and receiving a metric for how similar the two texts are. It can run different components against different datasets and capture a range of different metrics.
**LLM prompting:** We prompt GPT with the two sentences and ask whether they are semantically equivalent. Returns Yes or No.
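A minimal sketch of this component, assuming the `openai` Python client; the exact prompt wording and model used in RAmBLA may differ:

```python
# Illustrative only: the prompt wording and model choice are assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def semantically_equivalent(text_a: str, text_b: str) -> bool:
    prompt = (
        "Are the following two sentences semantically equivalent? "
        "Answer 'Yes' or 'No' only.\n\n"
        f"Sentence A: {text_a}\nSentence B: {text_b}"
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower().startswith("yes")
```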
**Embedding similarity:** We first embed the two sentences using an embeddings model and then compute the inner product between the two embeddings. Returns a score between 0 and 1 (if the embeddings are normalised).
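A minimal sketch, using the `sentence-transformers` library as a stand-in for whichever embeddings model is configured:

```python
# Illustrative only: the embedding model is an assumption.
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def embedding_similarity(text_a: str, text_b: str) -> float:
    # With normalised embeddings, the inner product equals cosine similarity.
    emb = encoder.encode([text_a, text_b], normalize_embeddings=True)
    return float(emb[0] @ emb[1])
```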
**NLI (Natural Language Inference) models** (see `NLIModel` in `rambla/models/huggingface.py`): We provide the two texts as input to the NLI model, and the output is a score for each of the classes {entailment, neutral, contradiction}. These scores can be used in the following modes:
- Unidirectional model: "Does sentence A follow from sentence B?"
  - Classification: argmax of the scores (returns the predicted class)
  - Regression: exponential softmax of the entailment score (returns a score between 0 and 1)
- Bidirectional model: "Does sentence A follow from sentence B AND does sentence B follow from sentence A?"
  - Classification:
    - Strict: bidirectional entailment is required for a similarity classification (this was our initial preferred method given results on the SICK dataset; please see below)
    - Relaxed: bidirectional entailment, or entailment in one direction and neutral in the other, is required for a similarity classification
  - Regression:
    - Average: mean over both directions of the exponential softmax of the entailment score (returns a score between 0 and 1)
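A minimal sketch of the bidirectional variant, assuming an MNLI-style checkpoint from the Hugging Face Hub; the repo's `NLIModel` may differ:

```python
# Illustrative only: the checkpoint name and label order are assumptions.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "roberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

def class_probs(premise: str, hypothesis: str) -> torch.Tensor:
    """Softmax over {contradiction, neutral, entailment} for one direction."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits[0]
    return torch.softmax(logits, dim=-1)

def bidirectional_similarity(a: str, b: str) -> dict:
    p_ab, p_ba = class_probs(a, b), class_probs(b, a)
    # Label order varies by checkpoint; roberta-large-mnli uses ENTAILMENT = 2.
    ent = model.config.label2id.get("ENTAILMENT", 2)
    strict = p_ab.argmax().item() == ent and p_ba.argmax().item() == ent
    return {
        "strict": strict,                               # bidirectional entailment
        "average": (p_ab[ent] + p_ba[ent]).item() / 2,  # score in [0, 1]
    }
```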
We evaluate these components against the following datasets:

- MRPC (Microsoft Research Paraphrase Corpus): pairs of sentences that either are or are not paraphrases of each other; this can be extrapolated to imply similarity.
- SICK (Sentences Involving Compositional Knowledge): pairs of sentences annotated for two crucial semantic tasks: relatedness in meaning (on a 5-point rating scale) and the entailment relation between the two elements (with three possible gold labels: entailment, contradiction, and neutral).
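Both datasets are available on the Hugging Face Hub; one possible way to load them (the dataset identifiers are assumptions, and RAmBLA may obtain the data differently):

```python
from datasets import load_dataset

mrpc = load_dataset("glue", "mrpc")  # binary paraphrase labels
sick = load_dataset("sick")          # 5-point relatedness scores + entailment labels
```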
## Unit-Tests

For testing, RAmBLA uses pytest. All unit tests are located under `tests`. To run the full test suite, run:

```
pytest tests/
```

NOTE: These need to be run manually!
## Integration Tests

We have two sets of integration tests:

Integration tests for `rambla/run/run_task.py`. Example usage:

- This will run a minimal version of the mcqabaseline task against openai_chat:

  ```
  python integration_tests/run_task.py -m openai_chat -t mcqabaseline
  ```

- This will run a minimal version of the mcqabaseline task against all available models:

  ```
  python integration_tests/run_task.py -t mcqabaseline
  ```

- This will run a minimal version of all available tasks against openai_chat:

  ```
  python integration_tests/run_task.py -m openai_chat
  ```

- This will run a minimal version of all available tasks against all available models:

  ```
  python integration_tests/run_task.py
  ```

Integration tests for `rambla/run/run_text_to_text.py`. Example usage:

- This will run a minimal version of all available tasks against all available components:

  ```
  python integration_tests/run_text_to_text.py
  ```
## More Information

For further details about working with RAmBLA, see the extended documentation located under `docs`.
## Contributing

We welcome contributions, feedback and suggestions to RAmBLA. If you would like to make a contribution, please follow our guidelines.

Please check for existing GitHub issues related to the change, and create a new issue if one does not exist, so we can first open a discussion on the proposed change.
- Clone and install the repo according to the installation instructions.

- Create a new branch:

  ```
  git checkout -b <my-branch-name>
  ```

  Ideally use the prefix `feat/` for feature-based branches, and `hotfix/` for bug fixes.
When you make changes to the code, please ensure your changes adhere to our code style. We use the following:

- NumPy docstring style
- black and flake8 to ensure a consistent code style
- isort to ensure imports are organised consistently
We use pre-commit hooks to ensure all code adheres to these standards. If you install the package according to our installation instructions, they will run automatically on every commit. To run them manually, use:

```
pre-commit run --all-files
```
All code submissions should include unit tests written using the pytest framework, located in the relevant directory under `tests`. Please ensure all tests pass before submitting a change by following the unit testing and integration testing instructions above.
After following the above guidelines, please create a pull request into the `master` branch. Please ensure your pull request contains:

- a title
- a brief description of the changes made
## License

Copyright 2023 GlaxoSmithKline Research & Development Limited. All rights reserved.

This project is licensed under the Apache 2.0 License; see the LICENSE file for details.
## Contact Info

RAmBLA was originally created by the Responsible AI team at GSK.ai. To get in touch, please use the contact details below:
- Rafael Poyiadzi: [email protected]
- Ed Morrell: [email protected]
- Gabriela van Bergen Gonzalez-Bueno: [email protected]
- Lea Goetz: [email protected]
## Citing

If you find this code useful in your research, please cite the associated paper:

```bibtex
@inproceedings{bolton2024rambla,
  title={{RAmBLA}: A Framework for Evaluating the Reliability of {LLM}s as Assistants in the Biomedical Domain},
  author={William James Bolton and Rafael Poyiadzi and Edward Morrell and Gabriela van Bergen Gonzalez Bueno and Lea Goetz},
  booktitle={ICLR 2024 Workshop on Reliable and Responsible Foundation Models},
  year={2024},
  url={https://openreview.net/forum?id=lPXMUJlFfP}
}
```