


Financial transaction models benchmark

This repository provides a benchmark of the Ntropy API and different Large Language Models (OpenAI ChatGPT and LLAMA finetuned models) on the task of transaction enrichment. It also contains an easy-to-use wrapper that enables using the LLMs to perform transaction enrichment. The LLaMA adapters are open source and available on the Hugging Face Hub.

Table of Contents
  1. Benchmark
  2. Installation
  3. Usage
  4. Contributing
  5. License
  6. Contact
  7. Resources

Benchmark

We benchmarked Ntropy's API and a set of LLMs on the task of extracting the following fields: label, merchant, and website.

Ntropy's API is compared against:

  • OpenAI's LLMs (GPT-4) using a straightforward prompt (a minimal sketch follows this list).
  • Llama models (7B & 13B parameters) fine-tuned on consumer transaction data with LoRA adapters.
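
A minimal sketch of such a straightforward prompt, using the OpenAI Python client, is shown below. It is illustrative only: the exact prompt, model parameters, and output parsing used in this repository may differ.

# Illustrative sketch only; not necessarily the prompt used in this repository.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

transaction = "AMZN MKTP US*1A2B3C4D5 AMZN.COM/BILL WA"

response = client.chat.completions.create(
    model="gpt-4",
    temperature=0,
    messages=[
        {
            "role": "user",
            "content": (
                "Enrich this bank transaction and return the label, merchant "
                f"and website as JSON.\nTransaction: {transaction}"
            ),
        }
    ],
)
print(response.choices[0].message.content)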

The dataset used can be found at /datasets/100_labeled_consumer_transactions.csv; it consists of a random subset of 100 anonymized consumer transactions. All predictions can be found at /datasets/benchmark_predictions.csv. The full label list can be found here.

                              GPT-4    LLAMA finetuned 7B    LLAMA finetuned 13B    Ntropy API
Labeler Accuracy              0.71     0.72                  0.78                   0.86
Labeler F1 Score              0.64     0.56                  0.65                   0.73
Labeler Label similarity *    0.85     0.82                  0.87                   0.91
Labeler latency (s/tx)        1.47     0.27                  0.34                   0.01
Merchant Accuracy             0.66     /                     /                      0.87
Website Accuracy              0.69     /                     /                      0.87
Normalizer latency (s/tx)     4.45     /                     /                      0.01

*: Label similarity is an approximate metric that uses embedding distance to give a smoother score than accuracy (e.g. two similar labels score close to 1, while two semantically very different labels score close to 0). See tests/integration/test_openai::test_label_similarity_score for more details.
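
For reference, a rough sketch of such a metric is given below, assuming a generic sentence-embedding model; the repository's own implementation lives in the test referenced above and may use a different embedding backend.

# Rough sketch of an embedding-based label similarity; not the repository's exact implementation.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model, for illustration

def label_similarity(predicted: str, expected: str) -> float:
    # Cosine similarity between the two label embeddings: close to 1 for
    # semantically similar labels, lower for unrelated ones.
    embeddings = model.encode([predicted, expected], convert_to_tensor=True)
    return float(util.cos_sim(embeddings[0], embeddings[1]))

print(label_similarity("coffee shops", "restaurants and cafes"))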

Among the models evaluated, Ntropy shows the best accuracy and latency. This can be attributed to several factors, including its access to web search engines and internal merchant databases. Moreover, Ntropy's internal models have been fine-tuned specifically for financial tasks, which contributes to their ability to produce accurate labels.

We noticed that when a Llama model is fine-tuned on consumer transactions, even without access to external information about merchants, it achieves higher labeler accuracy than GPT-4 (by 7 points for the 13B model). This suggests that LLMs already hold a considerable amount of knowledge about companies, even though measuring that knowledge directly is challenging. Retrieving cleaned company names and websites, however, appears to be more difficult for these models.

On this dataset, it is also interesting to note that GPT-4 can generate websites that look correct at first glance but do not actually exist.

Note: LLAMA models were benchmarked on a single A100 GPU.

(back to top)

Installation

This project requires python>=3.10

The Python package can be installed either with poetry or with pip:

  • Poetry:
poetry install
poetry shell
  • Pip:
pip install .

Depending on which model you want to run, you need at least one of the following (or all of them to run the full benchmark):

Ntropy API Key

To use the Ntropy API, you need an API key:

Note: A free account is limited to 10 000 transactions. If you need more, please contact us.

OpenAI API Key

To use the OpenAI models, you need an API key:

LLAMA requirements

The LLAMA adapters are open source and available on the Hugging Face Hub. The models come in two variants (7B and 13B parameters, 16-bit).

Note: A minimum of 32 GB of RAM is needed to run the LLAMA models (better if you have access to GPUs with enough VRAM).
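
As a rough sketch, loading one of the adapters on top of a base Llama checkpoint with transformers and peft might look like the following. The model and adapter identifiers below are placeholders, not the actual Hub repositories, and this repository's own wrapper may load the models differently.

# Hypothetical sketch: applying a LoRA adapter to a base Llama model with peft.
# The identifiers are placeholders; use the actual Hub repositories.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model_id = "huggyllama/llama-7b"                # placeholder base checkpoint
adapter_id = "ntropy-network/llama-7b-lora-adapter"  # placeholder adapter repository

tokenizer = AutoTokenizer.from_pretrained(base_model_id)
model = AutoModelForCausalLM.from_pretrained(base_model_id, torch_dtype=torch.float16)
model = PeftModel.from_pretrained(model, adapter_id)
model.eval()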

(back to top)

Usage

Benchmark

If you want to run the full benchmark, after setting up the API keys in enrichment_models/__init__.py (an illustrative sketch of that file is shown below), you can run:

make benchmark

Or

python scripts/full_benchmark.py

This will print the results in the terminal and dump metrics and predictions into the datasets/ folder.
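
What the key setup in enrichment_models/__init__.py might look like (the variable names below are purely illustrative; check the file itself for the names it actually expects):

# Illustrative only; the actual variable names in enrichment_models/__init__.py may differ.
NTROPY_API_KEY = "your-ntropy-api-key"
OPENAI_API_KEY = "your-openai-api-key"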

Model Integration

If you want to integrate one of these models, you can start from the example notebooks in the notebooks/ folder.

Also, if you want to integrate Ntropy's API, you might want to have a look at the documentation.

There is one notebook per model (ntropy, openai and llama).
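
If you prefer calling Ntropy directly rather than going through this repository's wrapper, a hedged sketch with the public ntropy-sdk package could look like the following. The Transaction fields shown are assumptions; consult the Ntropy documentation for the exact schema.

# Sketch using the public ntropy-sdk package directly (not this repository's wrapper).
# Field names are assumptions; see the Ntropy documentation for the exact schema.
from ntropy_sdk import SDK, Transaction

sdk = SDK("YOUR_NTROPY_API_KEY")

tx = Transaction(
    description="AMZN MKTP US*1A2B3C4D5",
    amount=21.17,
    entry_type="outgoing",
    iso_currency_code="USD",
    date="2023-05-10",
    account_holder_id="consumer-1",
    account_holder_type="consumer",
)

enriched = sdk.add_transactions([tx])
print(enriched[0])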

Contributing

We welcome and appreciate any Pull Request that suggests enhancements or introduces new models, APIs, and so on to add to the benchmark table.

(back to top)

License

Distributed under the MIT License. See LICENSE for more information.

(back to top)

Contact

[email protected]

Resources

Main project dependencies:

(back to top)
