Enable vLLM Profiling for ChatQnA (#1124)
@@ -432,6 +432,57 @@ curl -X POST "http://${host_ip}:6007/v1/dataprep/delete_file" \
  -H "Content-Type: application/json"
```

### Profile Microservices

To further analyze MicroService performance, users can follow the instructions below to profile the MicroServices.

#### 1. vLLM backend Service

Users can follow the previous section to test the vLLM MicroService or the ChatQnA MegaService.

By default, vLLM profiling is not enabled. Users can start and stop profiling with the following commands.

##### Start vLLM profiling

```bash
curl http://${host_ip}:9009/start_profile \
  -H "Content-Type: application/json" \
  -d '{"model": "Intel/neural-chat-7b-v3-3"}'
```

Users should see docker logs like the following from vllm-service if profiling started correctly.

```bash
INFO api_server.py:361] Starting profiler...
INFO api_server.py:363] Profiler started.
INFO: x.x.x.x:35940 - "POST /start_profile HTTP/1.1" 200 OK
```

After vLLM profiling is started, users can start asking questions and get responses from the vLLM MicroService or the ChatQnA MegaService.

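For example, a minimal sketch of such a request to the vLLM MicroService is shown below. It uses vLLM's OpenAI-compatible `/v1/completions` endpoint on the same port as the profiling commands above; the prompt and `max_tokens` values are illustrative only.

```bash
# Illustrative request sent while profiling is active, so its activity
# shows up in the generated trace. Adjust prompt and max_tokens as needed.
curl http://${host_ip}:9009/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Intel/neural-chat-7b-v3-3", "prompt": "What is deep learning?", "max_tokens": 32}'
```
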
##### Stop vLLM profiling

With the following command, users can stop vLLM profiling and generate a *.pt.trace.json.gz file as the profiling result under the /mnt folder in the vllm-service docker instance.

```bash
# vLLM Service
curl http://${host_ip}:9009/stop_profile \
  -H "Content-Type: application/json" \
  -d '{"model": "Intel/neural-chat-7b-v3-3"}'
```

Users should see docker logs like the following from vllm-service if profiling stopped correctly.

```bash
INFO api_server.py:368] Stopping profiler...
INFO api_server.py:370] Profiler stopped.
INFO: x.x.x.x:41614 - "POST /stop_profile HTTP/1.1" 200 OK
```

After vLLM profiling is stopped, users can use the command below to retrieve the *.pt.trace.json.gz file from the /mnt folder.

```bash
docker cp vllm-service:/mnt/ .
```

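To confirm the traces were copied, a minimal sketch is shown below, assuming the `docker cp` step above created a local `./mnt/` folder; decompressing is optional and only needed for viewers that expect plain JSON.

```bash
# List the copied traces; optionally decompress them (-k keeps the original .gz).
ls ./mnt/*.pt.trace.json.gz
gunzip -k ./mnt/*.pt.trace.json.gz
```
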
##### Check profiling result

Open a web browser and go to "chrome://tracing" or "ui.perfetto.dev", then load the json.gz file; you should be able to see the vLLM profiling result as in the diagram below.



## 🚀 Launch the UI

### Launch with origin port

@@ -86,6 +86,7 @@ services:
      https_proxy: ${https_proxy}
      HF_TOKEN: ${HUGGINGFACEHUB_API_TOKEN}
      LLM_MODEL_ID: ${LLM_MODEL_ID}
      VLLM_TORCH_PROFILER_DIR: "/mnt"
    command: --model $LLM_MODEL_ID --host 0.0.0.0 --port 80
  chatqna-xeon-backend-server:
    image: ${REGISTRY:-opea}/chatqna:${TAG:-latest}

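`VLLM_TORCH_PROFILER_DIR` tells vLLM where to write the torch profiler traces, which the `/start_profile` and `/stop_profile` requests above rely on. For reference, a minimal sketch of passing the same variable when launching the vLLM container directly is shown below; the image name and host port mapping are assumptions, so adjust them to your deployment.

```bash
# Sketch only: run the vLLM serving container with the torch profiler directory set.
# The image name (opea/vllm) and the 9009:80 port mapping are assumptions.
docker run -d --name vllm-service \
  -p 9009:80 \
  -e HF_TOKEN=${HUGGINGFACEHUB_API_TOKEN} \
  -e VLLM_TORCH_PROFILER_DIR=/mnt \
  opea/vllm:latest \
  --model ${LLM_MODEL_ID} --host 0.0.0.0 --port 80
```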