Merge pull request #46 from THUDM/dev
update with int4 inference
duzx16 authored Oct 29, 2024
2 parents 4c9c14b + cf1fb2c commit ffc091e
Showing 5 changed files with 115 additions and 33 deletions.
4 changes: 4 additions & 0 deletions .gitignore
@@ -0,0 +1,4 @@
*venv
*.DS_Store
*.idea/
test*
44 changes: 35 additions & 9 deletions README.md
@@ -19,18 +19,19 @@ GLM-4-Voice consists of three parts:

## Model List

-| Model | Type | Download |
-|:---------------------:| :---: |:------------------------------------------------------------------------------------------------------------------------------------------------:|
+| Model | Type | Download |
+|:---------------------:|:----------------:|:------------------------------------------------------------------------------------------------------------------------------------------------:|
| GLM-4-Voice-Tokenizer | Speech Tokenizer | [🤗 Huggingface](https://huggingface.co/THUDM/glm-4-voice-tokenizer) [🤖 ModelScope](https://modelscope.cn/models/ZhipuAI/glm-4-voice-tokenizer) |
-| GLM-4-Voice-9B | Chat Model | [🤗 Huggingface](https://huggingface.co/THUDM/glm-4-voice-9b) [🤖 ModelScope](https://modelscope.cn/models/ZhipuAI/glm-4-voice-9b)
-| GLM-4-Voice-Decoder | Speech Decoder | [🤗 Huggingface](https://huggingface.co/THUDM/glm-4-voice-decoder) [🤖 ModelScope](https://modelscope.cn/models/ZhipuAI/glm-4-voice-decoder)
+| GLM-4-Voice-9B | Chat Model | [🤗 Huggingface](https://huggingface.co/THUDM/glm-4-voice-9b) [🤖 ModelScope](https://modelscope.cn/models/ZhipuAI/glm-4-voice-9b) |
+| GLM-4-Voice-Decoder | Speech Decoder | [🤗 Huggingface](https://huggingface.co/THUDM/glm-4-voice-decoder) [🤖 ModelScope](https://modelscope.cn/models/ZhipuAI/glm-4-voice-decoder) |

## Usage
We provide a Web Demo that can be launched directly. Users can input speech or text, and the model will respond with both speech and text.

![](resources/web_demo.png)

### Preparation

First, download the repository
```shell
git clone --recurse-submodules https://github.com/THUDM/GLM-4-Voice
@@ -48,22 +49,39 @@ git clone https://huggingface.co/THUDM/glm-4-voice-decoder
```

### Launch Web Demo
-First, start the model service
+1. Start the model server

```shell
python model_server.py --host localhost --model-path THUDM/glm-4-voice-9b --port 10000 --dtype bfloat16 --device cuda:0
```
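The bfloat16 path assumes a CUDA device with bf16 support and enough free VRAM. A quick sanity check with plain PyTorch (not part of this repository) before picking a dtype:

```python
import torch

# Confirm a CUDA device is visible and supports bfloat16.
assert torch.cuda.is_available(), "a CUDA device is required"
print(torch.cuda.get_device_name(0))
print("bf16 supported:", torch.cuda.is_bf16_supported())

free, total = torch.cuda.mem_get_info(0)  # bytes
print(f"free/total VRAM: {free / 2**30:.1f}/{total / 2**30:.1f} GiB")
```

If bf16 is unsupported or memory is tight, the Int4 path below is the alternative.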

If you need to launch with Int4 precision, run

```shell
-python model_server.py --model-path THUDM/glm-4-voice-9b
+python model_server.py --host localhost --model-path THUDM/glm-4-voice-9b --port 10000 --dtype int4 --device cuda:0
```

This command will automatically download `glm-4-voice-9b`. If network conditions are poor, you can also download it manually and specify the local path with `--model-path`.

-Then, start the web service
+2. Start the web service

```shell
-python web_demo.py
+python web_demo.py --tokenizer-path THUDM/glm-4-voice-tokenizer --model-path THUDM/glm-4-voice-9b --flow-path ./glm-4-voice-decoder
```
-You can then access the web demo at http://127.0.0.1:8888. This command will automatically download `glm-4-voice-tokenizer` and `glm-4-voice-9b`. If network conditions are poor, you can also download them manually and specify local paths with `--tokenizer-path` and `--model-path`.

+You can then access the web demo at http://127.0.0.1:8888.

+This command will automatically download `glm-4-voice-tokenizer` and `glm-4-voice-9b`. Note that `glm-4-voice-decoder` must be downloaded manually.

+If network conditions are poor, you can download all three models manually and specify local paths with `--tokenizer-path`, `--flow-path`, and `--model-path`.
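One way to pre-download all three checkpoints is `huggingface_hub` (a minimal sketch, assuming the package is installed; the local directory names are illustrative):

```python
from huggingface_hub import snapshot_download

# Download each checkpoint into a local directory (names are illustrative).
tokenizer_path = snapshot_download("THUDM/glm-4-voice-tokenizer", local_dir="./glm-4-voice-tokenizer")
model_path = snapshot_download("THUDM/glm-4-voice-9b", local_dir="./glm-4-voice-9b")
flow_path = snapshot_download("THUDM/glm-4-voice-decoder", local_dir="./glm-4-voice-decoder")
```

The returned paths can then be passed to `--tokenizer-path`, `--model-path`, and `--flow-path`.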

### Known Issues

* Gradio's streaming audio playback can be unstable. Audio quality will be higher if you click the audio in the dialog box after generation completes.

## Cases

We provide some conversation cases of GLM-4-Voice, including emotion control, speech-rate adjustment, dialect generation, and more.

* Guide me to relax with a gentle voice
@@ -99,7 +117,15 @@ https://github.com/user-attachments/assets/c98a4604-366b-4304-917f-3c850a82fe9f
https://github.com/user-attachments/assets/d5ff0815-74f8-4738-b0f1-477cfc8dcc2d

## Acknowledgements

Some code in this project comes from:
* [CosyVoice](https://github.com/FunAudioLLM/CosyVoice)
* [transformers](https://github.com/huggingface/transformers)
* [GLM-4](https://github.com/THUDM/GLM-4)

## License

+ Use of the GLM-4 model weights must follow the [Model License](https://huggingface.co/THUDM/glm-4-voice-9b/blob/main/LICENSE).

+ The code in this open-source repository is licensed under the [Apache 2.0](LICENSE) License.

42 changes: 33 additions & 9 deletions README_en.md
@@ -1,4 +1,5 @@
# GLM-4-Voice

GLM-4-Voice is an end-to-end voice model launched by Zhipu AI. GLM-4-Voice can directly understand and generate Chinese and English speech, engage in real-time voice conversations, and change attributes such as emotion, intonation, speech rate, and dialect based on user instructions.

## Model Architecture
Expand All @@ -12,18 +13,20 @@ We provide the three components of GLM-4-Voice:
A more detailed technical report will be published later.

## Model List
-| Model | Type | Download |
-|:---------------------:| :---: |:------------------:|

+| Model | Type | Download |
+|:---------------------:|:----------------:|:--------------------------------------------------------------------:|
| GLM-4-Voice-Tokenizer | Speech Tokenizer | [🤗 Huggingface](https://huggingface.co/THUDM/glm-4-voice-tokenizer) |
-| GLM-4-Voice-9B | Chat Model | [🤗 Huggingface](https://huggingface.co/THUDM/glm-4-voice-9b)
-| GLM-4-Voice-Decoder | Speech Decoder | [🤗 Huggingface](https://huggingface.co/THUDM/glm-4-voice-decoder)
+| GLM-4-Voice-9B | Chat Model | [🤗 Huggingface](https://huggingface.co/THUDM/glm-4-voice-9b) |
+| GLM-4-Voice-Decoder | Speech Decoder | [🤗 Huggingface](https://huggingface.co/THUDM/glm-4-voice-decoder) |

## Usage
We provide a Web Demo that can be launched directly. Users can input speech or text, and the model will respond with both speech and text.

![](resources/web_demo.png)

### Preparation

First, download the repository
```shell
git clone --recurse-submodules https://github.com/THUDM/GLM-4-Voice
@@ -41,16 +44,30 @@ git clone https://huggingface.co/THUDM/glm-4-voice-decoder
```

### Launch Web Demo
-First, start the model service
+1. Start the model server

```shell
python model_server.py --host localhost --model-path THUDM/glm-4-voice-9b --port 10000 --dtype bfloat16 --device cuda:0
```

If you need to launch with Int4 precision, run

```shell
-python model_server.py --model-path THUDM/glm-4-voice-9b
+python model_server.py --host localhost --model-path THUDM/glm-4-voice-9b --port 10000 --dtype int4 --device cuda:0
```
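Under the hood, `--dtype int4` is implemented in this commit with `bitsandbytes` NF4 quantization (see the `model_server.py` diff below). A standalone sketch of the same load, assuming `bitsandbytes` is installed:

```python
import torch
from transformers import AutoModel, AutoTokenizer, BitsAndBytesConfig

# NF4 4-bit weights with double quantization; compute dtype is bfloat16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModel.from_pretrained(
    "THUDM/glm-4-voice-9b",
    trust_remote_code=True,
    quantization_config=bnb_config,
    device_map={"": 0},  # place the whole model on GPU 0
).eval()
tokenizer = AutoTokenizer.from_pretrained("THUDM/glm-4-voice-9b", trust_remote_code=True)
```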

-Then, start the web service
+This command will automatically download `glm-4-voice-9b`. If network conditions are poor, you can manually download it and specify the local path using `--model-path`.

+2. Start the web service

```shell
-python web_demo.py
+python web_demo.py --tokenizer-path THUDM/glm-4-voice-tokenizer --model-path THUDM/glm-4-voice-9b --flow-path ./glm-4-voice-decoder
```
-You can then access the web demo at http://127.0.0.1:8888.

+You can access the web demo at [http://127.0.0.1:8888](http://127.0.0.1:8888).
+This command will automatically download `glm-4-voice-tokenizer` and `glm-4-voice-9b`. Please note that `glm-4-voice-decoder` needs to be downloaded manually.
+If the network connection is poor, you can manually download these three models and specify the local paths using `--tokenizer-path`, `--flow-path`, and `--model-path`.

### Known Issues
* Gradio’s streaming audio playback can be unstable. The audio quality will be higher when clicking on the audio in the dialogue box after generation is complete.
@@ -91,7 +108,14 @@ https://github.com/user-attachments/assets/c98a4604-366b-4304-917f-3c850a82fe9f
https://github.com/user-attachments/assets/d5ff0815-74f8-4738-b0f1-477cfc8dcc2d

## Acknowledgements

Some code in this project is from:
* [CosyVoice](https://github.com/FunAudioLLM/CosyVoice)
* [transformers](https://github.com/huggingface/transformers)
* [GLM-4](https://github.com/THUDM/GLM-4)

## License Agreement

+ The use of GLM-4 model weights must follow the [Model License Agreement](https://huggingface.co/THUDM/glm-4-voice-9b/blob/main/LICENSE).

+ The code in this open-source repository is licensed under the [Apache 2.0](LICENSE) License.
54 changes: 41 additions & 13 deletions model_server.py
@@ -1,17 +1,25 @@
"""
-A model worker executes the model.
+A model worker with transformers libs executes the model.
+Run BF16 inference with:
+python model_server.py --host localhost --model-path THUDM/glm-4-voice-9b --port 10000 --dtype bfloat16 --device cuda:0
+Run Int4 inference with:
+python model_server.py --host localhost --model-path THUDM/glm-4-voice-9b --port 10000 --dtype int4 --device cuda:0
"""
import argparse
import json
import uuid

from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse
-from transformers import AutoModel, AutoTokenizer
+from transformers import AutoModel, AutoTokenizer, BitsAndBytesConfig
+from transformers.generation.streamers import BaseStreamer
import torch
import uvicorn

-from transformers.generation.streamers import BaseStreamer
from threading import Thread
from queue import Queue

@@ -54,10 +62,21 @@ def __next__(self):


class ModelWorker:
-    def __init__(self, model_path, device='cuda'):
+    def __init__(self, model_path, dtype="bfloat16", device='cuda'):
        self.device = device
-        self.glm_model = AutoModel.from_pretrained(model_path, trust_remote_code=True,
-                                                   device=device).to(device).eval()
+        self.bnb_config = BitsAndBytesConfig(
+            load_in_4bit=True,
+            bnb_4bit_use_double_quant=True,
+            bnb_4bit_quant_type="nf4",
+            bnb_4bit_compute_dtype=torch.bfloat16
+        ) if dtype == "int4" else None
+
+        self.glm_model = AutoModel.from_pretrained(
+            model_path,
+            trust_remote_code=True,
+            quantization_config=self.bnb_config if self.bnb_config else None,
+            device_map={"": 0}
+        ).eval()
        self.glm_tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

    @torch.inference_mode()
@@ -73,10 +92,16 @@ def generate_stream(self, params):
        inputs = tokenizer([prompt], return_tensors="pt")
        inputs = inputs.to(self.device)
        streamer = TokenStreamer(skip_prompt=True)
-        thread = Thread(target=model.generate,
-                        kwargs=dict(**inputs, max_new_tokens=int(max_new_tokens),
-                                    temperature=float(temperature), top_p=float(top_p),
-                                    streamer=streamer))
+        thread = Thread(
+            target=model.generate,
+            kwargs=dict(
+                **inputs,
+                max_new_tokens=int(max_new_tokens),
+                temperature=float(temperature),
+                top_p=float(top_p),
+                streamer=streamer
+            )
+        )
        thread.start()
        for token_id in streamer:
            yield (json.dumps({"token_id": token_id, "error_code": 0}) + "\n").encode()
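For reference, a sketch of driving `ModelWorker` in-process, without the HTTP layer. The parameter names mirror the variables consumed in `generate_stream` above; the unpacking of `params` and the exact prompt format are collapsed in this diff, so treat both as assumptions:

```python
import json

# Hypothetical direct use; the prompt content is a placeholder.
worker = ModelWorker("THUDM/glm-4-voice-9b", dtype="int4", device="cuda:0")
params = {"prompt": "...", "temperature": 0.2, "top_p": 0.8, "max_new_tokens": 256}

for chunk in worker.generate_stream_gate(params):
    event = json.loads(chunk.decode())  # one JSON object per line
    print(event)
```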
@@ -91,7 +116,7 @@ def generate_stream_gate(self, params):
"text": "Server Error",
"error_code": 1,
}
yield (json.dumps(ret)+ "\n").encode()
yield (json.dumps(ret) + "\n").encode()


app = FastAPI()
@@ -107,10 +132,13 @@ async def generate_stream(request: Request):

if __name__ == "__main__":
    parser = argparse.ArgumentParser()

    parser.add_argument("--host", type=str, default="localhost")
+    parser.add_argument("--dtype", type=str, default="bfloat16")
+    parser.add_argument("--device", type=str, default="cuda:0")
    parser.add_argument("--port", type=int, default=10000)
    parser.add_argument("--model-path", type=str, default="THUDM/glm-4-voice-9b")
    args = parser.parse_args()

-    worker = ModelWorker(args.model_path)
+    worker = ModelWorker(args.model_path, args.dtype, args.device)
    uvicorn.run(app, host=args.host, port=args.port, log_level="info")
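Since the worker streams newline-delimited JSON objects of the form `{"token_id": ..., "error_code": 0}`, a minimal client sketch looks like this (the `/generate_stream` route is an assumption, as the decorator is collapsed in this excerpt):

```python
import json
import requests

payload = {"prompt": "...", "temperature": 0.2, "top_p": 0.8, "max_new_tokens": 256}

# Stream the response line by line; each non-empty line is one JSON event.
with requests.post("http://localhost:10000/generate_stream", json=payload, stream=True) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line:
            continue
        event = json.loads(line)
        if event.get("error_code"):
            raise RuntimeError(event.get("text", "Server Error"))
        print(event["token_id"])
```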
4 changes: 2 additions & 2 deletions web_demo.py
@@ -8,7 +8,7 @@
from argparse import ArgumentParser

import torchaudio
-from transformers import WhisperFeatureExtractor, AutoTokenizer, AutoModel
+from transformers import WhisperFeatureExtractor, AutoTokenizer
from speech_tokenizer.modeling_whisper import WhisperVQEncoder


@@ -30,7 +30,7 @@
parser.add_argument("--port", type=int, default="8888")
parser.add_argument("--flow-path", type=str, default="./glm-4-voice-decoder")
parser.add_argument("--model-path", type=str, default="THUDM/glm-4-voice-9b")
parser.add_argument("--tokenizer-path", type=str, default="THUDM/glm-4-voice-tokenizer")
parser.add_argument("--tokenizer-path", type= str, default="THUDM/glm-4-voice-tokenizer")
args = parser.parse_args()

flow_config = os.path.join(args.flow_path, "config.yaml")
