Enable vLLM Profiling for ChatQnA (#1124)
@@ -432,6 +432,57 @@ curl -X POST "http://${host_ip}:6007/v1/dataprep/delete_file" \
  -H "Content-Type: application/json"
```

### Profile Microservices

To further analyze MicroService performance, users can follow the instructions below to profile the MicroServices.

#### 1. vLLM backend Service

Users can follow the previous section to test the vLLM MicroService or the ChatQnA MegaService.

By default, vLLM profiling is not enabled. Users can start and stop profiling with the following commands.

##### Start vLLM profiling

```bash
curl http://${host_ip}:9009/start_profile \
  -H "Content-Type: application/json" \
  -d '{"model": "Intel/neural-chat-7b-v3-3"}'
```

Users should see docker logs like the following from vllm-service if profiling started correctly.

```bash
INFO api_server.py:361] Starting profiler...
INFO api_server.py:363] Profiler started.
INFO: x.x.x.x:35940 - "POST /start_profile HTTP/1.1" 200 OK
```

After vLLM profiling is started, users can start asking questions and get responses from the vLLM MicroService or the ChatQnA MegaService.

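For example, a minimal sketch of such a request to the vLLM MicroService is shown below. It uses vLLM's OpenAI-compatible `/v1/completions` endpoint on the same port as the profiling commands above; the prompt and `max_tokens` values are illustrative only.

```bash
# Illustrative request sent while profiling is active, so its activity
# shows up in the generated trace. Adjust prompt and max_tokens as needed.
curl http://${host_ip}:9009/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Intel/neural-chat-7b-v3-3", "prompt": "What is deep learning?", "max_tokens": 32}'
```
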
##### Stop vLLM profiling

With the following command, users can stop vLLM profiling and generate a *.pt.trace.json.gz file as the profiling result under the /mnt folder in the vllm-service docker instance.

```bash
# vLLM Service
curl http://${host_ip}:9009/stop_profile \
  -H "Content-Type: application/json" \
  -d '{"model": "Intel/neural-chat-7b-v3-3"}'
```

Users should see docker logs like the following from vllm-service if profiling stopped correctly.

```bash
INFO api_server.py:368] Stopping profiler...
INFO api_server.py:370] Profiler stopped.
INFO: x.x.x.x:41614 - "POST /stop_profile HTTP/1.1" 200 OK
```

After vLLM profiling is stopped, users can use the command below to retrieve the *.pt.trace.json.gz file from the /mnt folder.

```bash
docker cp vllm-service:/mnt/ .
```

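To confirm the traces were copied, a minimal sketch is shown below, assuming the `docker cp` step above created a local `./mnt/` folder; decompressing is optional and only needed for viewers that expect plain JSON.

```bash
# List the copied traces; optionally decompress them (-k keeps the original .gz).
ls ./mnt/*.pt.trace.json.gz
gunzip -k ./mnt/*.pt.trace.json.gz
```
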
##### Check profiling result

Open a web browser and go to "chrome://tracing" or "ui.perfetto.dev", then load the json.gz file; you should be able to see the vLLM profiling result as in the diagram below.



## 🚀 Launch the UI

### Launch with origin port

@@ -86,6 +86,7 @@ services:
      https_proxy: ${https_proxy}
      HF_TOKEN: ${HUGGINGFACEHUB_API_TOKEN}
      LLM_MODEL_ID: ${LLM_MODEL_ID}
      VLLM_TORCH_PROFILER_DIR: "/mnt"
    command: --model $LLM_MODEL_ID --host 0.0.0.0 --port 80
  chatqna-xeon-backend-server:
    image: ${REGISTRY:-opea}/chatqna:${TAG:-latest}

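`VLLM_TORCH_PROFILER_DIR` tells vLLM where to write the torch profiler traces, which the `/start_profile` and `/stop_profile` requests above rely on. For reference, a minimal sketch of passing the same variable when launching the vLLM container directly is shown below; the image name and host port mapping are assumptions, so adjust them to your deployment.

```bash
# Sketch only: run the vLLM serving container with the torch profiler directory set.
# The image name (opea/vllm) and the 9009:80 port mapping are assumptions.
docker run -d --name vllm-service \
  -p 9009:80 \
  -e HF_TOKEN=${HUGGINGFACEHUB_API_TOKEN} \
  -e VLLM_TORCH_PROFILER_DIR=/mnt \
  opea/vllm:latest \
  --model ${LLM_MODEL_ID} --host 0.0.0.0 --port 80
```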