For Linux or WSL2, see the upstream moyix/fauxpilot project. This repository also works on Linux and macOS if you use pwsh there.
This is an attempt to build a locally hosted alternative to GitHub Copilot. It uses the SalesForce CodeGen models inside NVIDIA's Triton Inference Server with the FasterTransformer backend.
- Windows PowerShell or pwsh
- Docker
- docker compose (version >= 1.28)
- NVIDIA GPU (Compute Capability >= 6.0, i.e. GTX 10xx or newer)
- 7z-zstd (for Linux and macOS, you need zstd instead)
Note that the VRAM requirements listed by `setup.ps1` are totals -- if you have multiple GPUs, you can split the model across them. So, if you have two NVIDIA RTX 3080 GPUs, you should be able to run the 6B model by putting half on each GPU, as sketched below.
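For example, a minimal sketch of such a split, using the `-NumGpus` option described in the setup step below (model name and path are placeholders):

```powershell
# Hypothetical example: split codegen-6B-multi across two GPUs so that each
# GPU only needs to hold roughly half of the 13GB total VRAM requirement.
.\setup.ps1 -Silent -Model codegen-6B-multi -NumGpus 2 -ModelDir C:\foo
```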
lmao
Okay, fine, we now have some minimal information on the wiki and a discussion forum where you can ask questions. Still no formal support or warranty though!
- Install Docker and Docker Compose. The easiest way is to install Docker Desktop. You can run

  ```powershell
  docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi
  ```

  to test that your CUDA setup is working. This should produce console output like the following:

  ```text
  Fri Aug 26 20:20:28 2022
  +-----------------------------------------------------------------------------+
  | NVIDIA-SMI 515.65.01    Driver Version: 516.94       CUDA Version: 11.7     |
  |-------------------------------+----------------------+----------------------+
  | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
  | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
  |                               |                      |               MIG M. |
  |===============================+======================+======================|
  |   0  NVIDIA GeForce ...  On   | 00000000:2B:00.0  On |                  N/A |
  | 41%   50C    P5    96W / 371W |  21480MiB / 24576MiB |      0%      Default |
  |                               |                      |                  N/A |
  +-------------------------------+----------------------+----------------------+

  +-----------------------------------------------------------------------------+
  | Processes:                                                                  |
  |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
  |        ID   ID                                                   Usage      |
  |=============================================================================|
  |    0   N/A  N/A        88      C   /tritonserver                 N/A        |
  +-----------------------------------------------------------------------------+
  ```
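  You can also confirm that your Docker Compose version meets the >= 1.28 requirement listed above:

  ```powershell
  # Print the installed Docker and Compose versions.
  docker --version
  docker compose version
  ```

  If you installed the older standalone v1 binary, use `docker-compose --version` instead.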
- Install 7z-zstd. As a suggestion, you can add the directory of 7z-zstd (usually `C:\Program Files\7-Zip-Zstandard`) to the `PATH`. Then restart the terminal, open `pwsh`, type `Get-Command -Name 7z`, and press Enter. If everything is OK, you will see some information about `7z.exe` instead of an error or warning message.
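  For reference, a minimal sketch of doing this from pwsh (the path is the usual install location and may differ on your machine):

  ```powershell
  # Add 7z-zstd to PATH for the current session only.
  $Env:Path += ';C:\Program Files\7-Zip-Zstandard'

  # Persist it for your user account so new terminals pick it up too.
  [Environment]::SetEnvironmentVariable('Path',
      [Environment]::GetEnvironmentVariable('Path', 'User') + ';C:\Program Files\7-Zip-Zstandard',
      'User')

  # Verify that 7z.exe now resolves.
  Get-Command -Name 7z
  ```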
- Run the setup script to choose a model to use. This will download the model from HuggingFace and then convert it for use with FasterTransformer:

  ```text
  $ .\setup.ps1
  [1] codegen-350M-mono (2GB total VRAM required; Python-only)
  [2] codegen-350M-multi (2GB total VRAM required; multi-language)
  [3] codegen-2B-mono (7GB total VRAM required; Python-only)
  [4] codegen-2B-multi (7GB total VRAM required; multi-language)
  [5] codegen-6B-mono (13GB total VRAM required; Python-only)
  [6] codegen-6B-multi (13GB total VRAM required; multi-language)
  [7] codegen-16B-mono (32GB total VRAM required; Python-only)
  [8] codegen-16B-multi (32GB total VRAM required; multi-language)
  Enter your choice [6]:
  Enter number of GPUs [1]:
  Where do you want to save the model [C:\Users\Frederisk\Documents\GitHub\fauxpilot\models]?:
  Downloading the model from HuggingFace, this will take a while...
  Done! Now run .\launch.ps1 to start the FauxPilot server.
  ```

  Alternatively, you can set the options by passing arguments. Run `.\setup.ps1 -Help` or `Get-Help -Name .\setup.ps1 -Full` for more details:

  ```powershell
  .\setup.ps1 -Silent -Model codegen-6B-multi -NumGpus 1 -ModelDir C:\foo
  ```
- Then you can just run `.\launch.ps1`. This process can take a considerable amount of time to load. In general, the model is loaded when you see output like this:

  ```text
  ......
  fauxpilot-windows-triton-1  | +-------------------+---------+--------+
  fauxpilot-windows-triton-1  | | Model             | Version | Status |
  fauxpilot-windows-triton-1  | +-------------------+---------+--------+
  fauxpilot-windows-triton-1  | | fastertransformer | 1       | READY  |
  fauxpilot-windows-triton-1  | +-------------------+---------+--------+
  ......
  fauxpilot-triton-1  | I0803 01:51:04.740423 93 grpc_server.cc:4587] Started GRPCInferenceService at 0.0.0.0:8001
  fauxpilot-triton-1  | I0803 01:51:04.740608 93 http_server.cc:3303] Started HTTPService at 0.0.0.0:8000
  fauxpilot-triton-1  | I0803 01:51:04.781561 93 http_server.cc:178] Started Metrics Service at 0.0.0.0:8002
  ```
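  Once the HTTP service is up, you can optionally probe Triton's standard readiness endpoint (assuming port 8000 is reachable from the host, as in the log above):

  ```powershell
  # Returns HTTP 200 once the Triton server reports it is ready.
  Invoke-WebRequest -Uri 'http://localhost:8000/v2/health/ready' -Method Get
  ```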
- Enjoy!
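As a quick smoke test, here is a minimal sketch of querying the completion API from pwsh. This assumes the server mirrors upstream moyix/fauxpilot, which exposes an OpenAI-compatible completions endpoint on port 5000; the port and engine name may differ in your deployment:

```powershell
# Hypothetical smoke test against an OpenAI-compatible completions endpoint.
$body = @{
    prompt     = 'def hello():'
    max_tokens = 16
} | ConvertTo-Json

Invoke-RestMethod -Uri 'http://localhost:5000/v1/engines/codegen/completions' `
    -Method Post -ContentType 'application/json' -Body $body
```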
Yes, it's possible. Please check this issue.
- API: Application Programming Interface
- CC: Compute Capability
- CUDA: Compute Unified Device Architecture
- FT: FasterTransformer
- JSON: JavaScript Object Notation
- gRPC: a Remote Procedure Call framework by Google
- GPT-J: A transformer model trained using Ben Wang's Mesh Transformer JAX
- REST: REpresentational State Transfer
The code logic of this repository is derived from moyix/fauxpilot and was refactored by me.