Empower Functions is a family of LLMs (large language models) that offer GPT-4-level capabilities for real-world "tool using" use cases, with full compatibility support to serve as a drop-in replacement.
Live Demo • Huggingface Repo • Website • Discord
New: Empower Functions v1.1 We have just launched v1.1 of the Empower Functions family. The updated v1.1 family has been fine-tuned from Llama 3.1 using an enhanced curated dataset, and it has achieved state-of-the-art performance on the Berkeley Function Calling Leaderboard:
"tool using" refers to the ability of LLMs to interact with external APIs by recognizing when a function needs to be called and then generating JSON containing the necessary arguments based on user inputs. This capability is essential for building conversational agents and applications that convert natural language into API calls, facilitating tasks such as weather inquiries, data extraction, and interactions with knowledge bases.
Real-world use cases, particularly those involving conversational agents, often introduce complex requirements for LLMs. Models must be capable of retrieving context from multiple rounds of conversation (multi-turn), choosing between using tools or engaging in standard dialogue ('auto' mode), and asking for clarification when required parameters are missing (clarification). Furthermore, they should integrate responses with tool outputs in a streaming fashion. Additionally, when multiple tools are required to complete a task, models should efficiently execute them either in parallel (parallel calling) or sequentially with dependencies (sequential calling).
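To make these requirements concrete, below is an illustrative, made-up OpenAI-format exchange showing clarification followed by parallel calling; the function name, call IDs, and message contents are invented for the example and are not drawn from our dataset.

```python
# Illustrative only: a hypothetical conversation in OpenAI message format.
messages = [
    {"role": "user", "content": "What's the weather like?"},
    # Clarification: the required "location" argument is missing, so the model
    # asks instead of guessing.
    {"role": "assistant", "content": "Sure! Which city would you like the weather for?"},
    {"role": "user", "content": "San Francisco and Los Angeles, please."},
    # Parallel calling: a single assistant turn emits two tool calls at once.
    {
        "role": "assistant",
        "content": None,
        "tool_calls": [
            {
                "id": "call_1",
                "type": "function",
                "function": {
                    "name": "get_current_weather",
                    "arguments": "{\"location\": \"San Francisco, CA\"}",
                },
            },
            {
                "id": "call_2",
                "type": "function",
                "function": {
                    "name": "get_current_weather",
                    "arguments": "{\"location\": \"Los Angeles, CA\"}",
                },
            },
        ],
    },
]
```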
For example, below is a screenshot demonstrating how the model is used in a medical center coordinator bot. You can explore this further in our live demo.
| Model | Specs | Links | Notes |
|---|---|---|---|
| llama3-empower-functions-small | 128k context, based on Llama 3.1 8B | model, gguf | Most cost-effective, locally runnable |
| llama3-empower-functions-large | 128k context, based on Llama 3.1 70B | model | Best accuracy |
We have tested the family of models in the following setups:
- empower-functions-small: fp16 on 1x A100 40G; GGUF and 4-bit GGUF on a MacBook M2 Pro with 32G of RAM (the 4-bit GGUF version requires a minimum of 7.56G of RAM)
- empower-functions-large: fp16 on 4xA100 80G
Running locally is only supported by the `llama3-empower-functions-small` model. To use other models, please use our API.
Local running is supported through the `empower_functions` pip package; make sure you install it first by running `pip install empower-functions`.
If you encounter errors like `RuntimeError: Failed to load shared library, (mach-o file, but is an incompatible architecture (have 'x86_64', need 'arm64'))`, please reinstall the llama-cpp-python package by running:
CMAKE_ARGS="-DCMAKE_OSX_ARCHITECTURES=arm64 -DCMAKE_APPLE_SILICON_PROCESSOR=arm64 -DLLAMA_METAL=on" pip install --upgrade --verbose --force-reinstall --no-cache-dir llama-cpp-python
Running a Local OpenAI Compatible Server
We leverage the llama-cpp-python project to run the model locally. To start a local OpenAI-compatible server, follow the steps below:
- Download the GGUF model from our huggingface repo
- Run the command `python -m empower_functions.server --model <path to GGUF model> --chat_format empower-functions`
You should see the following output when the server is ready:
INFO: Uvicorn running on http://localhost:8000 (Press CTRL+C to quit)
Then you can use the OpenAI SDK to connect to the server. See below for a basic example:
```python
import openai
import json

client = openai.OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="YOUR_API_KEY"
)

messages = [
    {"role": "user", "content": "What's the weather in San Francisco?"}
]

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city and state, e.g., San Francisco, CA"
                    }
                },
                "required": ["location"]
            }
        }
    }
]

chat_completion = client.chat.completions.create(
    model="does_not_matter",  # the local server serves its loaded model regardless of the name
    messages=messages,
    tools=tools,
    temperature=0,
    tool_choice="auto"
)
print(chat_completion)
```
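If the model decides to call a tool, the returned message carries tool_calls rather than plain content. Below is a minimal sketch of the standard OpenAI-style tool-calling loop, continuing from the snippet above and assuming the local server accepts tool result messages; the get_current_weather implementation is a made-up stand-in.

```python
# Hypothetical stand-in for a real weather backend.
def get_current_weather(location):
    return json.dumps({"location": location, "forecast": "sunny", "temperature": "72F"})

assistant_message = chat_completion.choices[0].message
if assistant_message.tool_calls:
    # Keep the assistant's tool-call turn in the history.
    messages.append(assistant_message)
    for tool_call in assistant_message.tool_calls:
        args = json.loads(tool_call.function.arguments)
        result = get_current_weather(**args)
        # Feed the tool output back so the model can compose the final answer.
        messages.append({
            "role": "tool",
            "tool_call_id": tool_call.id,
            "content": result,
        })
    final_response = client.chat.completions.create(
        model="does_not_matter",
        messages=messages,
        tools=tools,
        temperature=0,
    )
    print(final_response.choices[0].message.content)
```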
Running in a Python Environment
You can directly call the model in your Python environment through the llama-cpp-python package, using the chat handler provided in the empower_functions package. See below for a basic example; for a more detailed example, please refer to the python script.
```python
import json
from empower_functions import EmpowerFunctionsCompletionHandler
from llama_cpp.llama_tokenizer import LlamaHFTokenizer
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="empower-dev/llama3-empower-functions-small-gguf",
    filename="ggml-model-Q4_K_M.gguf",
    chat_format="llama-3",
    chat_handler=EmpowerFunctionsCompletionHandler(),
    tokenizer=LlamaHFTokenizer.from_pretrained("empower-dev/llama3-empower-functions-small-gguf"),
    n_gpu_layers=0
)

# You can then use the llm object to chat with the model
messages = [
    {"role": "user", "content": "What's the weather in San Francisco?"}
]
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city and state, e.g., San Francisco, CA"
                    }
                },
                "required": ["location"]
            }
        }
    }
]

result = llm.create_chat_completion(
    messages=messages,
    tools=tools,
    tool_choice="auto",
    max_tokens=128
)
print(json.dumps(result["choices"][0], indent=2))
```
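The dict returned by llama-cpp-python follows the OpenAI response format, so reading any tool calls back out is plain dictionary access. A minimal sketch, continuing from the snippet above:

```python
# Continuing from the example above: pull the (possibly absent) tool calls out
# of the OpenAI-format message dict and decode their JSON arguments.
message = result["choices"][0]["message"]
for tool_call in message.get("tool_calls") or []:
    name = tool_call["function"]["name"]
    arguments = json.loads(tool_call["function"]["arguments"])
    print(f"Model wants to call {name} with {arguments}")
```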
Running on Windows with CUDA
- Install the NVIDIA CUDA toolkit (we used CUDA 12.1): https://developer.nvidia.com/cuda-12-1-1-download-archive?target_os=Windows&target_arch=x86_64&target_version=11&target_type=exe_local
- Install Visual Studio with the "C++ CMake tools for Windows" and "C++ core features" components.
- Run this command with the empower_functions virtual environment active in the Windows command prompt (Command Prompt, not PowerShell):
set FORCE_CMAKE=1 && set CMAKE_ARGS=-DGGML_CUDA=on -DLLAMA_AVX=off -DLLAMA_AVX2=off -DLLAMA_FMA=off && pip install llama-cpp-python --no-cache-dir --force-reinstall --verbose
This will take a while, but it overwrites the normal llama-cpp-python module with the CUDA-enabled one.
- Then run the server, with the virtual environment active, using a command like this:
python -m empower_functions.server --model C:\Github\empower-functions-gpu\models\ggml-model-Q4_K_M.gguf --chat_format empower-functions --port 8001 --n_ctx 8196 --n_gpu_layers 20
Replace the path with the path where the model is saved on your computer, and adjust n_ctx to the desired context size and n_gpu_layers to the number of layers to offload to the GPU.
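As a quick sanity check, you can point the OpenAI SDK at the server started above exactly as in the earlier local-server example, just with the adjusted port (8001 here); this is a minimal sketch and assumes default host settings.

```python
import openai

# Same usage as the earlier local-server example; only the port differs.
client = openai.OpenAI(base_url="http://localhost:8001/v1", api_key="YOUR_API_KEY")
print(client.chat.completions.create(
    model="does_not_matter",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=32,
))
```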
The Empower platform offers an API that is fully compatible with the OpenAI API, allowing you to directly use the OpenAI SDK. See below for a basic example; more details can be found here.
Currently, streaming and JSON mode are only available in the Empower API.
```python
from openai import OpenAI

client = OpenAI(
    base_url="https://app.empower.dev/api/v1",
    api_key="YOUR_API_KEY"
)

response = client.chat.completions.create(
    model="empower-functions",
    messages=[{"role": "user",
               "content": "What's the weather in San Francisco and Los Angeles in Celsius?"}],
    temperature=0,
    tools=[{
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather in a given location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city and state, e.g. San Francisco, CA",
                    },
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["location"],
            },
        },
    }],
)

response_message = response.choices[0].message.tool_calls
print(response_message)
```
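Since streaming is currently only available through the Empower API (see the note above), here is a minimal streaming sketch continuing from the client above; it assumes the endpoint follows the standard OpenAI streaming format, and the tool list simply mirrors the example.

```python
# A minimal streaming sketch against the Empower API. Deltas may carry either
# regular assistant content or incremental tool-call fragments.
tools = [{
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather in a given location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "The city and state, e.g. San Francisco, CA",
                },
            },
            "required": ["location"],
        },
    },
}]

stream = client.chat.completions.create(
    model="empower-functions",
    messages=[{"role": "user", "content": "What's the weather in San Francisco?"}],
    tools=tools,
    temperature=0,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.content:
        print(delta.content, end="", flush=True)
    elif delta.tool_calls:
        # Tool-call names and arguments arrive incrementally; accumulate as needed.
        print(delta.tool_calls[0].function.arguments or "", end="", flush=True)
```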
The Empower Functions model family has been tuned to natively produce JSON. We provide utilities in our Python package to build prompts from OpenAI-formatted messages. See below for a basic example; more details can be found here.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from prompt import prompt_messages

device = "cuda"
model_path = 'empower-dev/empower-functions-small'

model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_path)

functions = [
    {
        "name": "get_current_weather",
        "description": "Get the current weather in a given location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "The city and state, e.g. San Francisco, CA",
                },
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["location"],
        },
    }
]

messages = [
    {'role': 'user', 'content': 'What\'s the weather in San Francisco and Los Angeles in Celsius?'},
]

messages = prompt_messages(messages, functions)
model_inputs = tokenizer.apply_chat_template(
    messages, return_tensors="pt").to(model.device)

generated_ids = model.generate(model_inputs, max_new_tokens=128)
decoded = tokenizer.batch_decode(generated_ids)
print(decoded[0])
```
Empower's function models are fine-tuned based on state-of-the-art OSS models. We divided the training into two phases.
First, we perform SFT (supervised fine-tuning) using over 100k rows of hand-curated, high-quality conversations involving function calling. These conversations cover different scenarios such as single-turn, multi-turn, and parallel calling. Specifically, the model is trained to emit dedicated beginning tokens that indicate whether it is calling functions or returning a regular conversation. It then returns function calls as JSON or conversation text as usual, making streaming integration very straightforward. The SFT stage gives the model a very strong foundation covering various scenarios for general use cases.
Next, we apply DPO (Direct Preference Optimization) for trickier scenarios where SFT (supervised fine-tuning) is less effective. For instance, when function specifications include example values for arguments, we want to prevent the model from hallucinating argument values drawn from those examples. We have found DPO to be very effective at correcting such misbehavior with a relatively small amount of data.
Finally, we are committed to continuously optimizing the model for better quality across a wider range of use cases and scenarios :) We can further fine-tune the model based on your specific needs. Please contact us if you have any use-case-specific requirements!
We evaluate our models against the Berkeley Function Calling benchmark, and both the 8B and 70B versions have achieved state-of-the-art performance for their respective sizes: