Enable vllm for CodeTrans (#1626)
Set vLLM as the default LLM serving backend, and add the related Docker Compose files, READMEs, and test scripts.
Issue: https://github.com/opea-project/GenAIExamples/issues/1436
Signed-off-by: letonghan <letong.han@intel.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
@@ -2,6 +2,8 @@

This document outlines the deployment process for a CodeTrans application utilizing the [GenAIComps](https://github.com/opea-project/GenAIComps.git) microservice pipeline on an Intel Gaudi server. The steps include Docker image creation, container deployment via Docker Compose, and service execution using the `llm` microservice. We will publish the Docker images to Docker Hub soon, which will simplify the deployment process for this service.

The default pipeline deploys with vLLM as the LLM serving component. It also provides the option of using a TGI backend for the LLM microservice; refer to the [start-microservice-docker-containers](#start-microservice-docker-containers) section on this page.

## 🚀 Build Docker Images

First of all, you need to build the Docker images locally and install the required Python packages. This step can be skipped once the Docker images are published to Docker Hub.
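As a rough sketch of that build step: the image names below come from the compose files in this PR, but the Dockerfile paths and build contexts are assumptions about the GenAIComps/GenAIExamples repository layouts and may differ from the project's official build instructions.

```bash
# Sketch only: build the example's images locally.
# The Dockerfile paths below are assumptions; check the repositories for the exact locations.
git clone https://github.com/opea-project/GenAIComps.git
git clone https://github.com/opea-project/GenAIExamples.git

# LLM text-generation microservice image (assumed Dockerfile path)
cd GenAIComps
docker build -t opea/llm-textgen:latest -f comps/llms/src/text-generation/Dockerfile .

# CodeTrans megaservice and UI images (assumed Dockerfile paths)
cd ../GenAIExamples/CodeTrans
docker build -t opea/codetrans:latest -f Dockerfile .
docker build -t opea/codetrans-ui:latest -f ./ui/docker/Dockerfile ./ui
```

The vLLM/TGI serving images themselves (`opea/vllm-gaudi`, `ghcr.io/huggingface/tgi-gaudi`) are pulled or built separately.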
@@ -55,6 +57,37 @@ By default, the LLM model is set to a default value as listed below:
Change `LLM_MODEL_ID` below to suit your needs.

For users in China who are unable to download models directly from Hugging Face, you can use [ModelScope](https://www.modelscope.cn/models) or a Hugging Face mirror to download models. vLLM/TGI can load the models either online or offline as described below:

1. Online

```bash
export HF_TOKEN=${your_hf_token}
export HF_ENDPOINT="https://hf-mirror.com"
model_name="mistralai/Mistral-7B-Instruct-v0.3"
# Start vLLM LLM Service
docker run -p 8008:80 -v ./data:/data --name vllm-service -e HF_ENDPOINT=$HF_ENDPOINT -e http_proxy=$http_proxy -e https_proxy=$https_proxy --shm-size 128g opea/vllm:latest --model $model_name --host 0.0.0.0 --port 80
# Start TGI LLM Service
docker run -p 8008:80 -v ./data:/data --name tgi-service -e HF_ENDPOINT=$HF_ENDPOINT -e http_proxy=$http_proxy -e https_proxy=$https_proxy --shm-size 1g ghcr.io/huggingface/text-generation-inference:2.4.0-intel-cpu --model-id $model_name
```

2. Offline

- Search your model name in ModelScope. For example, check [this page](https://www.modelscope.cn/models/rubraAI/Mistral-7B-Instruct-v0.3/files) for the model `mistralai/Mistral-7B-Instruct-v0.3`.

- Click the `Download this model` button and choose a download method to save the model to your local path `/path/to/model`.

- Run the following command to start the LLM service.

```bash
export HF_TOKEN=${your_hf_token}
export model_path="/path/to/model"
# Start vLLM LLM Service
docker run -p 8008:80 -v $model_path:/data --name vllm-service --shm-size 128g opea/vllm:latest --model /data --host 0.0.0.0 --port 80
# Start TGI LLM Service
docker run -p 8008:80 -v $model_path:/data --name tgi-service --shm-size 1g ghcr.io/huggingface/text-generation-inference:2.4.0-intel-cpu --model-id /data
```
### Setup Environment Variables
1. Set the required environment variables:
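   For illustration, the variables referenced by the compose files in this PR can be exported along these lines. This is a sketch only: the values are placeholders, and the `LLM_ENDPOINT` and `LLM_COMPONENT_NAME` values in particular are assumptions, so prefer the example's own environment setup if one is provided.

```bash
# Sketch only: placeholder values for variables used by the compose files in this PR.
export host_ip="your_host_ip"                             # host running the containers
export HUGGINGFACEHUB_API_TOKEN="your_hf_token"           # consumed as HF_TOKEN / HUGGING_FACE_HUB_TOKEN
export LLM_MODEL_ID="mistralai/Mistral-7B-Instruct-v0.3"
export NUM_CARDS=1                                        # Gaudi cards for vLLM tensor parallelism
export LLM_ENDPOINT="http://${host_ip}:8008"              # assumed: LLM backend published on port 8008
export LLM_COMPONENT_NAME="OpeaTextGenService"            # assumed component name for the llm-textgen service
# The vLLM service also reads BLOCK_SIZE, MAX_NUM_SEQS, and MAX_SEQ_LEN_TO_CAPTURE,
# and the TGI variant mounts ${MODEL_CACHE} as its model cache directory.
```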
@@ -87,12 +120,43 @@ Change the `LLM_MODEL_ID` below for your needs.
```bash
cd GenAIExamples/CodeTrans/docker_compose/intel/hpu/gaudi
docker compose up -d
```

If using vLLM as the LLM serving backend:

```bash
docker compose -f compose.yaml up -d
```

If using TGI as the LLM serving backend:

```bash
docker compose -f compose_tgi.yaml up -d
```

### Validate Microservices

1. TGI Service
1. LLM backend Service

During the first startup, this service takes extra time to download, load, and warm up the model. Once that finishes, the service is ready.

Try the command below to check whether the LLM serving backend is ready.

```bash
# vLLM service
docker logs codetrans-gaudi-vllm-service 2>&1 | grep complete
# If the service is ready, you will get a response like the one below.
INFO: Application startup complete.
```

```bash
# TGI service
docker logs codetrans-gaudi-tgi-service | grep Connected
# If the service is ready, you will get a response like the one below.
2024-09-03T02:47:53.402023Z INFO text_generation_router::server: router/src/server.rs:2311: Connected
```

Then try the `cURL` command below to validate the services.

```bash
curl http://${host_ip}:8008/generate \
@@ -2,9 +2,9 @@
# SPDX-License-Identifier: Apache-2.0

services:
  tgi-service:
    image: ghcr.io/huggingface/tgi-gaudi:2.0.6
    container_name: codetrans-tgi-service
  vllm-service:
    image: ${REGISTRY:-opea}/vllm-gaudi:${TAG:-latest}
    container_name: codetrans-gaudi-vllm-service
    ports:
      - "8008:80"
    volumes:
@@ -13,28 +13,27 @@ services:
      no_proxy: ${no_proxy}
      http_proxy: ${http_proxy}
      https_proxy: ${https_proxy}
      HF_TOKEN: ${HUGGINGFACEHUB_API_TOKEN}
      HABANA_VISIBLE_DEVICES: all
      OMPI_MCA_btl_vader_single_copy_mechanism: none
      HUGGING_FACE_HUB_TOKEN: ${HUGGINGFACEHUB_API_TOKEN}
      ENABLE_HPU_GRAPH: true
      LIMIT_HPU_GRAPH: true
      USE_FLASH_ATTENTION: true
      FLASH_ATTENTION_RECOMPUTE: true
      LLM_MODEL_ID: ${LLM_MODEL_ID}
      NUM_CARDS: ${NUM_CARDS}
      VLLM_TORCH_PROFILER_DIR: "/mnt"
    healthcheck:
      test: ["CMD-SHELL", "sleep 500 && exit 0"]
      interval: 1s
      timeout: 505s
      retries: 1
      test: ["CMD-SHELL", "curl -f http://$host_ip:8008/health || exit 1"]
      interval: 10s
      timeout: 10s
      retries: 100
    runtime: habana
    cap_add:
      - SYS_NICE
    ipc: host
    command: --model-id ${LLM_MODEL_ID} --max-input-length 1024 --max-total-tokens 2048
    command: --model $LLM_MODEL_ID --tensor-parallel-size ${NUM_CARDS} --host 0.0.0.0 --port 80 --block-size ${BLOCK_SIZE} --max-num-seqs ${MAX_NUM_SEQS} --max-seq_len-to-capture ${MAX_SEQ_LEN_TO_CAPTURE}
  llm:
    image: ${REGISTRY:-opea}/llm-textgen:${TAG:-latest}
    container_name: llm-textgen-gaudi-server
    container_name: codetrans-xeon-llm-server
    depends_on:
      tgi-service:
      vllm-service:
        condition: service_healthy
    ports:
      - "9000:9000"
@@ -43,18 +42,19 @@ services:
      no_proxy: ${no_proxy}
      http_proxy: ${http_proxy}
      https_proxy: ${https_proxy}
      LLM_ENDPOINT: ${TGI_LLM_ENDPOINT}
      LLM_ENDPOINT: ${LLM_ENDPOINT}
      LLM_MODEL_ID: ${LLM_MODEL_ID}
      HUGGINGFACEHUB_API_TOKEN: ${HUGGINGFACEHUB_API_TOKEN}
      LLM_COMPONENT_NAME: ${LLM_COMPONENT_NAME}
      HF_TOKEN: ${HUGGINGFACEHUB_API_TOKEN}
    restart: unless-stopped
  codetrans-gaudi-backend-server:
    image: ${REGISTRY:-opea}/codetrans:${TAG:-latest}
    container_name: codetrans-gaudi-backend-server
    depends_on:
      - tgi-service
      - vllm-service
      - llm
    ports:
      - "7777:7777"
      - "${BACKEND_SERVICE_PORT:-7777}:7777"
    environment:
      - no_proxy=${no_proxy}
      - https_proxy=${https_proxy}
@@ -69,7 +69,7 @@ services:
    depends_on:
      - codetrans-gaudi-backend-server
    ports:
      - "5173:5173"
      - "${FRONTEND_SERVICE_PORT:-5173}:5173"
    environment:
      - no_proxy=${no_proxy}
      - https_proxy=${https_proxy}
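The new vLLM healthcheck replaces the old sleep-based probe with an HTTP check. If you want to run the same probe by hand from the host, a minimal sketch (assuming `host_ip` is exported and port 8008 is published as in the compose file above):

```bash
# Manual readiness probe, mirroring the vLLM service healthcheck above.
# Assumes host_ip is set and port 8008 is published as in compose.yaml.
curl -f http://${host_ip}:8008/health && echo "LLM serving backend is ready"
```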
99
CodeTrans/docker_compose/intel/hpu/gaudi/compose_tgi.yaml
Normal file
@@ -0,0 +1,99 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

services:
  tgi-service:
    image: ghcr.io/huggingface/tgi-gaudi:2.0.6
    container_name: codetrans-gaudi-tgi-service
    ports:
      - "8008:80"
    volumes:
      - "${MODEL_CACHE}:/data"
    environment:
      no_proxy: ${no_proxy}
      http_proxy: ${http_proxy}
      https_proxy: ${https_proxy}
      HUGGING_FACE_HUB_TOKEN: ${HUGGINGFACEHUB_API_TOKEN}
      HF_HUB_DISABLE_PROGRESS_BARS: 1
      HF_HUB_ENABLE_HF_TRANSFER: 0
      HABANA_VISIBLE_DEVICES: all
      OMPI_MCA_btl_vader_single_copy_mechanism: none
      ENABLE_HPU_GRAPH: true
      LIMIT_HPU_GRAPH: true
      USE_FLASH_ATTENTION: true
      FLASH_ATTENTION_RECOMPUTE: true
    runtime: habana
    cap_add:
      - SYS_NICE
    ipc: host
    command: --model-id ${LLM_MODEL_ID} --max-input-length 2048 --max-total-tokens 4096
  llm:
    image: ${REGISTRY:-opea}/llm-textgen:${TAG:-latest}
    container_name: codetrans-gaudi-llm-server
    depends_on:
      - tgi-service
    ports:
      - "9000:9000"
    ipc: host
    environment:
      no_proxy: ${no_proxy}
      http_proxy: ${http_proxy}
      https_proxy: ${https_proxy}
      LLM_ENDPOINT: ${LLM_ENDPOINT}
      LLM_MODEL_ID: ${LLM_MODEL_ID}
      LLM_COMPONENT_NAME: ${LLM_COMPONENT_NAME}
      HUGGINGFACEHUB_API_TOKEN: ${HUGGINGFACEHUB_API_TOKEN}
    restart: unless-stopped
  codetrans-gaudi-backend-server:
    image: ${REGISTRY:-opea}/codetrans:${TAG:-latest}
    container_name: codetrans-gaudi-backend-server
    depends_on:
      - tgi-service
      - llm
    ports:
      - "${BACKEND_SERVICE_PORT:-7777}:7777"
    environment:
      - no_proxy=${no_proxy}
      - https_proxy=${https_proxy}
      - http_proxy=${http_proxy}
      - MEGA_SERVICE_HOST_IP=${MEGA_SERVICE_HOST_IP}
      - LLM_SERVICE_HOST_IP=${LLM_SERVICE_HOST_IP}
    ipc: host
    restart: always
  codetrans-gaudi-ui-server:
    image: ${REGISTRY:-opea}/codetrans-ui:${TAG:-latest}
    container_name: codetrans-gaudi-ui-server
    depends_on:
      - codetrans-gaudi-backend-server
    ports:
      - "${FRONTEND_SERVICE_PORT:-5173}:5173"
    environment:
      - no_proxy=${no_proxy}
      - https_proxy=${https_proxy}
      - http_proxy=${http_proxy}
      - BASE_URL=${BACKEND_SERVICE_ENDPOINT}
    ipc: host
    restart: always
  codetrans-gaudi-nginx-server:
    image: ${REGISTRY:-opea}/nginx:${TAG:-latest}
    container_name: codetrans-gaudi-nginx-server
    depends_on:
      - codetrans-gaudi-backend-server
      - codetrans-gaudi-ui-server
    ports:
      - "${NGINX_PORT:-80}:80"
    environment:
      - no_proxy=${no_proxy}
      - https_proxy=${https_proxy}
      - http_proxy=${http_proxy}
      - FRONTEND_SERVICE_IP=${FRONTEND_SERVICE_IP}
      - FRONTEND_SERVICE_PORT=${FRONTEND_SERVICE_PORT}
      - BACKEND_SERVICE_NAME=${BACKEND_SERVICE_NAME}
      - BACKEND_SERVICE_IP=${BACKEND_SERVICE_IP}
      - BACKEND_SERVICE_PORT=${BACKEND_SERVICE_PORT}
    ipc: host
    restart: always

networks:
  default:
    driver: bridge
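Tying this new file back to the README instructions above, the TGI variant can be brought up and checked with the commands already shown there:

```bash
cd GenAIExamples/CodeTrans/docker_compose/intel/hpu/gaudi
docker compose -f compose_tgi.yaml up -d
# Wait for the model download and warm-up, then confirm TGI is ready:
docker logs codetrans-gaudi-tgi-service | grep Connected
```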