Freeze OPEA images tag

Signed-off-by: NeuralChatBot <grp_neural_chat_bot@intel.com>
Adjustments for helm release change (#1173)
2024-11-21 14:24:16 +00:00 · 2024-11-21 16:57:30 +08:00 · 2024-11-21 16:57:29 +08:00 · 2024-11-21 16:57:28 +08:00 · 2024-11-20 10:56:49 +08:00 · 2024-11-19 22:57:25 +08:00
373 changed files with 11431 additions and 1088 deletions
--- a/.github/code_spell_ignore.txt
+++ b/.github/code_spell_ignore.txt
@@ -1,2 +1,2 @@
 ModelIn
-modelin
+modelin
--- a/.github/license_template.txt
+++ b/.github/license_template.txt
@@ -1,2 +1,2 @@
 Copyright (C) 2024 Intel Corporation
-SPDX-License-Identifier: Apache-2.0
+SPDX-License-Identifier: Apache-2.0
--- a/.github/workflows/_example-workflow.yml
+++ b/.github/workflows/_example-workflow.yml
@@ -77,9 +77,9 @@ jobs:
              git clone https://github.com/vllm-project/vllm.git
              cd vllm && git rev-parse HEAD && cd ../
          fi
-          if [[ $(grep -c "vllm-hpu:" ${docker_compose_path}) != 0 ]]; then
+          if [[ $(grep -c "vllm-gaudi:" ${docker_compose_path}) != 0 ]]; then
               git clone https://github.com/HabanaAI/vllm-fork.git
-               cd vllm-fork && git rev-parse HEAD && cd ../
+               cd vllm-fork && git checkout 3c39626 && cd ../
          fi
          git clone https://github.com/opea-project/GenAIComps.git
          cd GenAIComps && git checkout ${{ inputs.opea_branch }} && git rev-parse HEAD && cd ../
--- a/.github/workflows/_get-test-matrix.yml
+++ b/.github/workflows/_get-test-matrix.yml
@@ -14,7 +14,7 @@ on:
      test_mode:
        required: false
        type: string
-        default: 'docker_compose'
+        default: 'compose'
    outputs:
      run_matrix:
        description: "The matrix string"
--- a/.github/workflows/manual-freeze-tag.yml
+++ b/.github/workflows/manual-freeze-tag.yml
@@ -1,13 +1,13 @@
 # Copyright (C) 2024 Intel Corporation
 # SPDX-License-Identifier: Apache-2.0

-name: Freeze OPEA images release tag in readme on manual event
+name: Freeze OPEA images release tag

 on:
  workflow_dispatch:
    inputs:
      tag:
-        default: "latest"
+        default: "1.1.0"
        description: "Tag to apply to images"
        required: true
        type: string
@@ -23,10 +23,6 @@ jobs:
          fetch-depth: 0
          ref: ${{ github.ref }}

-      - uses: actions/setup-python@v5
-        with:
-          python-version: "3.10"
-
      - name: Set up Git
        run: |
          git config --global user.name "NeuralChatBot"
@@ -35,9 +31,10 @@ jobs:

      - name: Run script
        run: |
-          find . -name "*.md" | xargs sed -i "s|^docker\ compose|TAG=${{ github.event.inputs.tag }}\ docker\ compose|g"
-          find . -type f -name "*.yaml" \( -path "*/benchmark/*" -o -path "*/kubernetes/*" \) | xargs sed -i -E 's/(opea\/[A-Za-z0-9\-]*:)latest/\1${{ github.event.inputs.tag }}/g'
-          find . -type f -name "*.md" \( -path "*/benchmark/*" -o -path "*/kubernetes/*" \) | xargs sed -i -E 's/(opea\/[A-Za-z0-9\-]*:)latest/\1${{ github.event.inputs.tag }}/g'
+          IFS='.' read -r major minor patch <<< "${{ github.event.inputs.tag }}"
+          echo "VERSION_MAJOR ${major}"  > version.txt
+          echo "VERSION_MINOR ${minor}" >> version.txt
+          echo "VERSION_PATCH ${patch}" >> version.txt

      - name: Commit changes
        run: |
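A minimal sketch of the split performed in the `Run script` step above, assuming a hypothetical tag value of `1.1.0`. The workflow uses a bash here-string; this sketch uses a heredoc instead so that it also runs under plain `sh`. The temporary file is a stand-in for `version.txt`.

```bash
# Hypothetical tag; in the workflow it comes from github.event.inputs.tag.
tag="1.1.0"

# Split on dots: the three fields land in major, minor and patch.
IFS='.' read -r major minor patch <<EOF
$tag
EOF

# Temporary stand-in for version.txt, so the sketch does not touch the repo.
version_file="$(mktemp)"
{
    echo "VERSION_MAJOR ${major}"
    echo "VERSION_MINOR ${minor}"
    echo "VERSION_PATCH ${patch}"
} > "$version_file"

content="$(cat "$version_file")"
echo "$content"
rm -f "$version_file"
```

A tag without dots, such as `latest`, would leave `minor` and `patch` empty, so the step presumably expects a `major.minor.patch` value.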
--- a/.github/workflows/push-image-build.yml
+++ b/.github/workflows/push-image-build.yml
@@ -8,7 +8,8 @@ on:
    branches: [ 'main' ]
    paths:
      - "**.py"
-      - "**Dockerfile"
+      - "**Dockerfile*"
+      - "**docker_image_build/build.yaml"

 concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}-on-push
@@ -18,7 +19,7 @@ jobs:
  job1:
    uses: ./.github/workflows/_get-test-matrix.yml
    with:
-      test_mode: "docker_image_build/build.yaml"
+      test_mode: "docker_image_build"

  image-build:
    needs: job1
--- a/.github/workflows/scripts/get_test_matrix.sh
+++ b/.github/workflows/scripts/get_test_matrix.sh
@@ -16,8 +16,13 @@ for example in ${examples}; do
    if [[ ! $(find . -type f | grep ${test_mode}) ]]; then continue; fi
    cd tests
    ls -l
-    hardware_list=$(find . -type f -name "test_compose*_on_*.sh" | cut -d/ -f2 | cut -d. -f1 | awk -F'_on_' '{print $2}'| sort -u)
-    echo "Test supported hardware list = ${hardware_list}"
+    if [[ "$test_mode" == "docker_image_build" ]]; then
+        find_name="test_manifest_on_*.sh"
+    else
+        find_name="test_${test_mode}*_on_*.sh"
+    fi
+    hardware_list=$(find . -type f -name "${find_name}" | cut -d/ -f2 | cut -d. -f1 | awk -F'_on_' '{print $2}'| sort -u)
+    echo -e "Test supported hardware list: \n${hardware_list}"

    run_hardware=""
    if [[ $(printf '%s\n' "${changed_files[@]}" | grep ${example} | cut -d'/' -f2 | grep -E '*.py|Dockerfile*|ui|docker_image_build' ) ]]; then
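The hardware-list lookup in this hunk can be sketched as follows. The file names and the `list_hardware` helper are hypothetical and exist only for illustration; the `find | cut | awk | sort` pipeline is copied from the script.

```bash
# Hypothetical tests/ directory with a few empty test scripts.
tests_dir="$(mktemp -d)"
for f in test_compose_on_xeon.sh test_compose_on_gaudi.sh \
         test_compose_vllm_on_gaudi.sh test_manifest_on_xeon.sh; do
    : > "$tests_dir/$f"
done

# Hypothetical helper: pick the file-name pattern for a test mode, then
# extract the part after "_on_" as the hardware name.
list_hardware() {
    mode="$1"
    if [ "$mode" = "docker_image_build" ]; then
        find_name="test_manifest_on_*.sh"
    else
        find_name="test_${mode}*_on_*.sh"
    fi
    (cd "$tests_dir" && find . -type f -name "$find_name" \
        | cut -d/ -f2 | cut -d. -f1 | awk -F'_on_' '{print $2}' | sort -u)
}

# Join the sorted, de-duplicated names onto one line for display.
compose_hw="$(list_hardware compose | paste -s -d ' ' -)"
build_hw="$(list_hardware docker_image_build | paste -s -d ' ' -)"
echo "compose: $compose_hw"
echo "docker_image_build: $build_hw"
rm -rf "$tests_dir"
```

With these sample files, the `compose` mode matches both Gaudi scripts but reports `gaudi` only once because of `sort -u`, while `docker_image_build` looks only at the manifest test.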
--- a/.gitignore
+++ b/.gitignore
@@ -5,4 +5,4 @@
 **/playwright/.cache/
 **/test-results/

-__pycache__/
+__pycache__/
--- a/.prettierignore
+++ b/.prettierignore
@@ -1 +1 @@
-**/kubernetes/
+**/kubernetes/
--- a/.set_env.sh
+++ b/.set_env.sh
@@ -0,0 +1,16 @@
+# Copyright (C) 2024 Intel Corporation
+# SPDX-License-Identifier: Apache-2.0
+#
+# To announce the version of the code, create a version.txt file in the following format:
+#VERSION_MAJOR 1
+#VERSION_MINOR 0
+#VERSION_PATCH 0
+
+VERSION_FILE="version.txt"
+if [ -f $VERSION_FILE ]; then
+    VER_OPEA_MAJOR=$(grep "VERSION_MAJOR" $VERSION_FILE | cut -d " " -f 2)
+    VER_OPEA_MINOR=$(grep "VERSION_MINOR" $VERSION_FILE | cut -d " " -f 2)
+    VER_OPEA_PATCH=$(grep "VERSION_PATCH" $VERSION_FILE | cut -d " " -f 2)
+    export TAG=$VER_OPEA_MAJOR.$VER_OPEA_MINOR
+    echo OPEA Version:$TAG
+fi
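A small sketch of how `.set_env.sh` derives `TAG`, assuming a hypothetical `version.txt` for version 1.1.0 written to a temporary file. The `grep | cut` parsing mirrors the script; everything else is illustrative.

```bash
# Hypothetical version file; the real one is version.txt in the repo root.
version_file="$(mktemp)"
printf 'VERSION_MAJOR 1\nVERSION_MINOR 1\nVERSION_PATCH 0\n' > "$version_file"

# Each line is "<KEY> <value>", so the value is the second space-separated field.
major=$(grep "VERSION_MAJOR" "$version_file" | cut -d " " -f 2)
minor=$(grep "VERSION_MINOR" "$version_file" | cut -d " " -f 2)
patch=$(grep "VERSION_PATCH" "$version_file" | cut -d " " -f 2)

# As in the script, TAG is built from major and minor only.
TAG="$major.$minor"
echo "TAG=$TAG (patch=$patch is parsed but not included)"
rm -f "$version_file"
```

Under these assumptions the exported value is `1.1`, not `1.1.0`, because the patch number is read but left out of `TAG`.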
--- a/AgentQnA/README.md
+++ b/AgentQnA/README.md
@@ -83,29 +83,32 @@ flowchart LR

 ## Deployment with docker

-1. Build agent docker image
+1. Build agent docker image [Optional]

-   Note: this is optional. The docker images will be automatically pulled when running the docker compose commands. This step is only needed if pulling images failed.
+> [!NOTE]
+> This step is optional. The docker images are pulled automatically when you run the docker compose commands, so this step is only needed if pulling the images fails.

-   First, clone the opea GenAIComps repo.
+First, clone the opea GenAIComps repo.

-   ```
-   export WORKDIR=<your-work-directory>
-   cd $WORKDIR
-   git clone https://github.com/opea-project/GenAIComps.git
-   ```
+```
+export WORKDIR=<your-work-directory>
+cd $WORKDIR
+git clone https://github.com/opea-project/GenAIComps.git
+```

-   Then build the agent docker image. Both the supervisor agent and the worker agent will use the same docker image, but when we launch the two agents we will specify different strategies and register different tools.
+Then build the agent docker image. Both the supervisor agent and the worker agent will use the same docker image, but when we launch the two agents we will specify different strategies and register different tools.

-   ```
-   cd GenAIComps
-   docker build -t opea/agent-langchain:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/agent/langchain/Dockerfile .
-   ```
+```
+cd GenAIComps
+docker build -t opea/agent-langchain:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/agent/langchain/Dockerfile .
+```

 2. Set up environment for this example </br>
+
   First, clone this repo.

   ```
+   export WORKDIR=<your-work-directory>
   cd $WORKDIR
   git clone https://github.com/opea-project/GenAIExamples.git
   ```
@@ -113,6 +116,14 @@ flowchart LR
   Second, set up env vars.

   ```
+   # Example: host_ip="192.168.1.1" or export host_ip="External_Public_IP"
+   export host_ip=$(hostname -I | awk '{print $1}')
+   # if you are in a proxy environment, also set the proxy-related environment variables
+   export http_proxy="Your_HTTP_Proxy"
+   export https_proxy="Your_HTTPs_Proxy"
+   # Example: no_proxy="localhost, 127.0.0.1, 192.168.1.1"
+   export no_proxy="Your_No_Proxy"
+
   export TOOLSET_PATH=$WORKDIR/GenAIExamples/AgentQnA/tools/
   # for using open-source llms
   export HUGGINGFACEHUB_API_TOKEN=<your-HF-token>
@@ -147,6 +158,12 @@ flowchart LR
 5. Launch agent services</br>
   We provide two options for `llm_engine` of the agents: 1. open-source LLMs, 2. OpenAI models via API calls.

+   Deploy it on Gaudi or Xeon respectively
+
+   ::::{tab-set}
+   :::{tab-item} Gaudi
+   :sync: Gaudi
+
   To use open-source LLMs on Gaudi2, run commands below.

   ```
@@ -155,6 +172,10 @@ flowchart LR
   bash launch_agent_service_tgi_gaudi.sh
   ```

+   :::
+   :::{tab-item} Xeon
+   :sync: Xeon
+
   To use OpenAI models, run commands below.

   ```
@@ -162,6 +183,9 @@ flowchart LR
   bash launch_agent_service_openai.sh
   ```

+   :::
+   ::::
+
 ## Validate services

 First look at logs of the agent docker containers:
@@ -181,7 +205,7 @@ You should see something like "HTTP server setup successful" if the docker conta
 Second, validate worker agent:

 ```
-curl http://${ip_address}:9095/v1/chat/completions -X POST -H "Content-Type: application/json" -d '{
+curl http://${host_ip}:9095/v1/chat/completions -X POST -H "Content-Type: application/json" -d '{
     "query": "Most recent album by Taylor Swift"
    }'
 ```
@@ -189,7 +213,7 @@ curl http://${ip_address}:9095/v1/chat/completions -X POST -H "Content-Type: app
 Third, validate supervisor agent:

 ```
-curl http://${ip_address}:9090/v1/chat/completions -X POST -H "Content-Type: application/json" -d '{
+curl http://${host_ip}:9090/v1/chat/completions -X POST -H "Content-Type: application/json" -d '{
     "query": "Most recent album by Taylor Swift"
    }'
 ```
--- a/AgentQnA/docker_compose/intel/cpu/xeon/README.md
+++ b/AgentQnA/docker_compose/intel/cpu/xeon/README.md
@@ -1,3 +1,100 @@
-# Deployment on Xeon
+# Single node on-prem deployment with Docker Compose on Xeon Scalable processors

-We deploy the retrieval tool on Xeon. For LLMs, we support OpenAI models via API calls. For instructions on using open-source LLMs, please refer to the deployment guide [here](../../../../README.md).
+This example showcases a hierarchical multi-agent system for question-answering applications. We deploy the example on Xeon. For LLMs, we use OpenAI models via API calls. For instructions on using open-source LLMs, please refer to the deployment guide [here](../../../../README.md).
+
+## Deployment with docker
+
+1. First, clone this repo.
+   ```
+   export WORKDIR=<your-work-directory>
+   cd $WORKDIR
+   git clone https://github.com/opea-project/GenAIExamples.git
+   ```
+2. Set up environment for this example </br>
+
+   ```
+   # Example: host_ip="192.168.1.1" or export host_ip="External_Public_IP"
+   export host_ip=$(hostname -I | awk '{print $1}')
+   # if you are in a proxy environment, also set the proxy-related environment variables
+   export http_proxy="Your_HTTP_Proxy"
+   export https_proxy="Your_HTTPs_Proxy"
+   # Example: no_proxy="localhost, 127.0.0.1, 192.168.1.1"
+   export no_proxy="Your_No_Proxy"
+
+   export TOOLSET_PATH=$WORKDIR/GenAIExamples/AgentQnA/tools/
+   # Set OPENAI_API_KEY if you want to use OpenAI models
+   export OPENAI_API_KEY=<your-openai-key>
+   ```
+
+3. Deploy the retrieval tool (i.e., DocIndexRetriever mega-service)
+
+   First, launch the mega-service.
+
+   ```
+   cd $WORKDIR/GenAIExamples/AgentQnA/retrieval_tool
+   bash launch_retrieval_tool.sh
+   ```
+
+   Then, ingest data into the vector database. Here we provide an example. You can ingest your own data.
+
+   ```
+   bash run_ingest_data.sh
+   ```
+
+4. Launch Tool service
+   In this example, we will use some of the mock APIs provided in the Meta CRAG KDD Challenge to demonstrate the benefits of gaining additional context from mock knowledge graphs.
+   ```
+   docker run -d -p=8080:8000 docker.io/aicrowd/kdd-cup-24-crag-mock-api:v0
+   ```
+5. Launch `Agent` service
+
+   The configurations of the supervisor agent and the worker agent are defined in the docker-compose YAML file. This Xeon example uses OpenAI GPT-4o-mini as the LLM, while the Gaudi example uses llama3.1-70B-instruct served by TGI-Gaudi. To use the OpenAI LLM, run the commands below.
+
+   ```
+   cd $WORKDIR/GenAIExamples/AgentQnA/docker_compose/intel/cpu/xeon
+   bash launch_agent_service_openai.sh
+   ```
+
+6. [Optional] Build `Agent` docker image if pulling images failed.
+
+   ```
+   git clone https://github.com/opea-project/GenAIComps.git
+   cd GenAIComps
+   docker build -t opea/agent-langchain:latest -f comps/agent/langchain/Dockerfile .
+   ```
+
+## Validate services
+
+First look at logs of the agent docker containers:
+
+```
+# worker agent
+docker logs rag-agent-endpoint
+```
+
+```
+# supervisor agent
+docker logs react-agent-endpoint
+```
+
+You should see something like "HTTP server setup successful" if the docker containers are started successfully.
+
+Second, validate worker agent:
+
+```
+curl http://${host_ip}:9095/v1/chat/completions -X POST -H "Content-Type: application/json" -d '{
+     "query": "Most recent album by Taylor Swift"
+    }'
+```
+
+Third, validate supervisor agent:
+
+```
+curl http://${host_ip}:9090/v1/chat/completions -X POST -H "Content-Type: application/json" -d '{
+     "query": "Most recent album by Taylor Swift"
+    }'
+```
+
+## How to register your own tools with agent
+
+You can take a look at the tools yaml and python files in this example. For more details, please refer to the "Provide your own tools" section in the instructions [here](https://github.com/opea-project/GenAIComps/tree/main/comps/agent/langchain/README.md).
--- a/AgentQnA/docker_compose/intel/cpu/xeon/launch_agent_service_openai.sh
+++ b/AgentQnA/docker_compose/intel/cpu/xeon/launch_agent_service_openai.sh
@@ -1,6 +1,9 @@
 # Copyright (C) 2024 Intel Corporation
 # SPDX-License-Identifier: Apache-2.0

+pushd "../../../../../" > /dev/null
+source .set_env.sh
+popd > /dev/null
 export TOOLSET_PATH=$WORKDIR/GenAIExamples/AgentQnA/tools/
 export ip_address=$(hostname -I | awk '{print $1}')
 export recursion_limit_worker=12
--- a/AgentQnA/docker_compose/intel/hpu/gaudi/README.md
+++ b/AgentQnA/docker_compose/intel/hpu/gaudi/README.md
@@ -0,0 +1,105 @@
+# Single node on-prem deployment AgentQnA on Gaudi
+
+This example showcases a hierarchical multi-agent system for question-answering applications. We deploy the example on Gaudi using open-source LLMs.
+For more details, please refer to the deployment guide [here](../../../../README.md).
+
+## Deployment with docker
+
+1. First, clone this repo.
+   ```
+   export WORKDIR=<your-work-directory>
+   cd $WORKDIR
+   git clone https://github.com/opea-project/GenAIExamples.git
+   ```
+2. Set up environment for this example </br>
+
+   ```
+   # Example: host_ip="192.168.1.1" or export host_ip="External_Public_IP"
+   export host_ip=$(hostname -I | awk '{print $1}')
+   # if you are in a proxy environment, also set the proxy-related environment variables
+   export http_proxy="Your_HTTP_Proxy"
+   export https_proxy="Your_HTTPs_Proxy"
+   # Example: no_proxy="localhost, 127.0.0.1, 192.168.1.1"
+   export no_proxy="Your_No_Proxy"
+
+   export TOOLSET_PATH=$WORKDIR/GenAIExamples/AgentQnA/tools/
+   # for using open-source llms
+   export HUGGINGFACEHUB_API_TOKEN=<your-HF-token>
+   # Example: export HF_CACHE_DIR=$WORKDIR so that models are not re-downloaded every time
+   export HF_CACHE_DIR=<directory-where-llms-are-downloaded>
+
+   ```
+
+3. Deploy the retrieval tool (i.e., DocIndexRetriever mega-service)
+
+   First, launch the mega-service.
+
+   ```
+   cd $WORKDIR/GenAIExamples/AgentQnA/retrieval_tool
+   bash launch_retrieval_tool.sh
+   ```
+
+   Then, ingest data into the vector database. Here we provide an example. You can ingest your own data.
+
+   ```
+   bash run_ingest_data.sh
+   ```
+
+4. Launch Tool service
+   In this example, we will use some of the mock APIs provided in the Meta CRAG KDD Challenge to demonstrate the benefits of gaining additional context from mock knowledge graphs.
+   ```
+   docker run -d -p=8080:8000 docker.io/aicrowd/kdd-cup-24-crag-mock-api:v0
+   ```
+5. Launch `Agent` service
+
+   To use open-source LLMs on Gaudi2, run commands below.
+
+   ```
+   cd $WORKDIR/GenAIExamples/AgentQnA/docker_compose/intel/hpu/gaudi
+   bash launch_tgi_gaudi.sh
+   bash launch_agent_service_tgi_gaudi.sh
+   ```
+
+6. [Optional] Build `Agent` docker image if pulling images failed.
+
+   ```
+   git clone https://github.com/opea-project/GenAIComps.git
+   cd GenAIComps
+   docker build -t opea/agent-langchain:latest -f comps/agent/langchain/Dockerfile .
+   ```
+
+## Validate services
+
+First look at logs of the agent docker containers:
+
+```
+# worker agent
+docker logs rag-agent-endpoint
+```
+
+```
+# supervisor agent
+docker logs react-agent-endpoint
+```
+
+You should see something like "HTTP server setup successful" if the docker containers are started successfully.
+
+Second, validate worker agent:
+
+```
+curl http://${host_ip}:9095/v1/chat/completions -X POST -H "Content-Type: application/json" -d '{
+     "query": "Most recent album by Taylor Swift"
+    }'
+```
+
+Third, validate supervisor agent:
+
+```
+curl http://${host_ip}:9090/v1/chat/completions -X POST -H "Content-Type: application/json" -d '{
+     "query": "Most recent album by Taylor Swift"
+    }'
+```
+
+## How to register your own tools with agent
+
+You can take a look at the tools yaml and python files in this example. For more details, please refer to the "Provide your own tools" section in the instructions [here](https://github.com/opea-project/GenAIComps/tree/main/comps/agent/langchain/README.md).
--- a/AgentQnA/docker_compose/intel/hpu/gaudi/launch_agent_service_tgi_gaudi.sh
+++ b/AgentQnA/docker_compose/intel/hpu/gaudi/launch_agent_service_tgi_gaudi.sh
@@ -1,6 +1,9 @@
 # Copyright (C) 2024 Intel Corporation
 # SPDX-License-Identifier: Apache-2.0

+pushd "../../../../../" > /dev/null
+source .set_env.sh
+popd > /dev/null
 WORKPATH=$(dirname "$PWD")/..
 # export WORKDIR=$WORKPATH/../../
 echo "WORKDIR=${WORKDIR}"
--- a/AgentQnA/docker_compose/intel/hpu/gaudi/tgi_gaudi.yaml
+++ b/AgentQnA/docker_compose/intel/hpu/gaudi/tgi_gaudi.yaml
@@ -3,7 +3,7 @@

 services:
  tgi-server:
-    image: ghcr.io/huggingface/tgi-gaudi:2.0.5
+    image: ghcr.io/huggingface/tgi-gaudi:2.0.6
    container_name: tgi-server
    ports:
      - "8085:80"
--- a/AudioQnA/Dockerfile
+++ b/AudioQnA/Dockerfile
@@ -18,7 +18,7 @@ WORKDIR /home/user/
 RUN git clone https://github.com/opea-project/GenAIComps.git

 WORKDIR /home/user/GenAIComps
-RUN pip install --no-cache-dir --upgrade pip && \
+RUN pip install --no-cache-dir --upgrade pip setuptools && \
    pip install --no-cache-dir -r /home/user/GenAIComps/requirements.txt

 COPY ./audioqna.py /home/user/audioqna.py
--- a/AudioQnA/Dockerfile.multilang
+++ b/AudioQnA/Dockerfile.multilang
@@ -18,7 +18,7 @@ WORKDIR /home/user/
 RUN git clone https://github.com/opea-project/GenAIComps.git

 WORKDIR /home/user/GenAIComps
-RUN pip install --no-cache-dir --upgrade pip && \
+RUN pip install --no-cache-dir --upgrade pip setuptools && \
    pip install --no-cache-dir -r /home/user/GenAIComps/requirements.txt

 COPY ./audioqna_multilang.py /home/user/audioqna_multilang.py
--- a/AudioQnA/benchmark/performance/README.md
+++ b/AudioQnA/benchmark/performance/README.md
@@ -0,0 +1,77 @@
+# AudioQnA Benchmarking
+
+This folder contains a collection of scripts to enable inference benchmarking by leveraging a comprehensive benchmarking tool, [GenAIEval](https://github.com/opea-project/GenAIEval/blob/main/evals/benchmark/README.md), that enables throughput analysis to assess inference performance.
+
+By following this guide, you can run benchmarks on your deployment and share the results with the OPEA community.
+
+## Purpose
+
+We aim to run these benchmarks and share them with the OPEA community for three primary reasons:
+
+- To offer insights on inference throughput in real-world scenarios, helping you choose the best service or deployment for your needs.
+- To establish a baseline for validating optimization solutions across different implementations, providing clear guidance on which methods are most effective for your use case.
+- To inspire the community to build upon our benchmarks, allowing us to better quantify new solutions in conjunction with current leading LLMs, serving frameworks, etc.
+
+## Metrics
+
+The benchmark reports the following metrics:
+
+- Number of Concurrent Requests
+- End-to-End Latency: P50, P90, P99 (in milliseconds)
+- End-to-End First Token Latency: P50, P90, P99 (in milliseconds)
+- Average Next Token Latency (in milliseconds)
+- Average Token Latency (in milliseconds)
+- Requests Per Second (RPS)
+- Output Tokens Per Second
+- Input Tokens Per Second
+
+Results will be displayed in the terminal and saved as a CSV file named `1_stats.csv` for easy export to spreadsheets.
+
+## Getting Started
+
+We recommend using Kubernetes to deploy the AudioQnA service, as it offers benefits such as load balancing and improved scalability. However, you can also deploy the service using Docker if that better suits your needs.
+
+### Prerequisites
+
+- Install Kubernetes by following [this guide](https://github.com/opea-project/docs/blob/main/guide/installation/k8s_install/k8s_install_kubespray.md).
+
+- Every node has direct internet access.
+- Set up kubectl on the master node with access to the Kubernetes cluster.
+- Install Python 3.8+ on the master node for running GenAIEval.
+- Ensure all nodes have a local /mnt/models folder, which will be mounted by the pods.
+- Ensure that the container's ulimit can accommodate the number of requests.
+
+```bash
+# To modify the containerd ulimit:
+sudo systemctl edit containerd
+# Add two lines:
+[Service]
+LimitNOFILE=65536:1048576
+
+sudo systemctl daemon-reload; sudo systemctl restart containerd
+```
+
+## Test Steps
+
+Please deploy AudioQnA service before benchmarking.
+
+### Run Benchmark Test
+
+Before the benchmark, we can configure the number of test queries and test output directory by:
+
+```bash
+export USER_QUERIES="[128, 128, 128, 128]"
+export TEST_OUTPUT_DIR="/tmp/benchmark_output"
+```
+
+And then run the benchmark by:
+
+```bash
+bash benchmark.sh -n <node_count>
+```
+
+The argument `-n` refers to the number of test nodes.
+
+### Data collection
+
+All test results are written to the folder `/tmp/benchmark_output`, as configured by the `TEST_OUTPUT_DIR` environment variable in the previous step.
--- a/AudioQnA/benchmark/performance/benchmark.sh
+++ b/AudioQnA/benchmark/performance/benchmark.sh
@@ -0,0 +1,99 @@
+#!/bin/bash
+
+# Copyright (C) 2024 Intel Corporation
+# SPDX-License-Identifier: Apache-2.0
+
+deployment_type="k8s"
+node_number=1
+service_port=8888
+query_per_node=128
+
+benchmark_tool_path="$(pwd)/GenAIEval"
+
+usage() {
+    echo "Usage: $0 [-d deployment_type] [-n node_number] [-i service_ip] [-p service_port]"
+    echo "  -d deployment_type    AudioQnA deployment type, select between k8s and docker (default: k8s)"
+    echo "  -n node_number        Test node number, required only for k8s deployment_type, (default: 1)"
+    echo "  -i service_ip         AudioQnA service ip, required only for docker deployment_type"
+    echo "  -p service_port       AudioQnA service port, required only for docker deployment_type, (default: 8888)"
+    exit 1
+}
+
+while getopts ":d:n:i:p:" opt; do
+    case ${opt} in
+        d )
+            deployment_type=$OPTARG
+            ;;
+        n )
+            node_number=$OPTARG
+            ;;
+        i )
+            service_ip=$OPTARG
+            ;;
+        p )
+            service_port=$OPTARG
+            ;;
+        \? )
+            echo "Invalid option: -$OPTARG" 1>&2
+            usage
+            ;;
+        : )
+            echo "Invalid option: -$OPTARG requires an argument" 1>&2
+            usage
+            ;;
+    esac
+done
+
+if [[ "$deployment_type" == "docker" && -z "$service_ip" ]]; then
+    echo "Error: service_ip is required for docker deployment_type" 1>&2
+    usage
+fi
+
+if [[ "$deployment_type" == "k8s" && ( -n "$service_ip" || "$service_port" != "8888" ) ]]; then
+    echo "Warning: service_ip and service_port are ignored for k8s deployment_type" 1>&2
+fi
+
+function main() {
+    if [[ ! -d ${benchmark_tool_path} ]]; then
+        echo "Benchmark tool not found, setting up..."
+        setup_env
+    fi
+    run_benchmark
+}
+
+function setup_env() {
+    git clone https://github.com/opea-project/GenAIEval.git
+    pushd ${benchmark_tool_path}
+    python3 -m venv stress_venv
+    source stress_venv/bin/activate
+    pip install -r requirements.txt
+    popd
+}
+
+function run_benchmark() {
+    source ${benchmark_tool_path}/stress_venv/bin/activate
+    export DEPLOYMENT_TYPE=${deployment_type}
+    export SERVICE_IP=${service_ip:-"None"}
+    export SERVICE_PORT=${service_port:-"None"}
+    if [[ -z $USER_QUERIES ]]; then
+        user_query=$((query_per_node*node_number))
+        export USER_QUERIES="[${user_query}, ${user_query}, ${user_query}, ${user_query}]"
+        echo "USER_QUERIES not configured, setting to: ${USER_QUERIES}."
+    fi
+    export WARMUP=$(echo $USER_QUERIES | sed -e 's/[][]//g' -e 's/,.*//')
+    if [[ -z $WARMUP ]]; then export WARMUP=0; fi
+    if [[ -z $TEST_OUTPUT_DIR ]]; then
+        if [[ $DEPLOYMENT_TYPE == "k8s" ]]; then
+            export TEST_OUTPUT_DIR="${benchmark_tool_path}/evals/benchmark/benchmark_output/node_${node_number}"
+        else
+            export TEST_OUTPUT_DIR="${benchmark_tool_path}/evals/benchmark/benchmark_output/docker"
+        fi
+        echo "TEST_OUTPUT_DIR not configured, setting to: ${TEST_OUTPUT_DIR}."
+    fi
+
+    envsubst < ./benchmark.yaml > ${benchmark_tool_path}/evals/benchmark/benchmark.yaml
+    cd ${benchmark_tool_path}/evals/benchmark
+    python benchmark.py
+}
+
+main
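A short sketch of how `run_benchmark` derives `USER_QUERIES` and `WARMUP` when `USER_QUERIES` is not set, assuming a hypothetical node count of 2. The arithmetic and the `sed` expression are taken from the script above.

```bash
query_per_node=128   # default from the script
node_number=2        # hypothetical value passed with -n

# Default query list: four rounds of query_per_node * node_number requests.
user_query=$((query_per_node * node_number))
USER_QUERIES="[${user_query}, ${user_query}, ${user_query}, ${user_query}]"

# WARMUP is the first list entry: strip the brackets, then drop
# everything from the first comma onwards.
WARMUP=$(echo "$USER_QUERIES" | sed -e 's/[][]//g' -e 's/,.*//')
echo "USER_QUERIES=$USER_QUERIES WARMUP=$WARMUP"
```

With two nodes, each round issues 256 queries and the warm-up count is also 256.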
--- a/AudioQnA/benchmark/performance/benchmark.yaml
+++ b/AudioQnA/benchmark/performance/benchmark.yaml
@@ -0,0 +1,52 @@
+# Copyright (C) 2024 Intel Corporation
+# SPDX-License-Identifier: Apache-2.0
+
+test_suite_config: # Overall configuration settings for the test suite
+  examples: ["audioqna"]  # The specific test cases being tested, e.g., chatqna, codegen, codetrans, faqgen, audioqna, visualqna
+  deployment_type: "k8s"  # Default is "k8s", can also be "docker"
+  service_ip: None  # Leave as None for k8s, specify for Docker
+  service_port: None  # Leave as None for k8s, specify for Docker
+  warm_ups: 0  # Number of test requests for warm-up
+  run_time: 60m  # The max total run time for the test suite
+  seed:  # The seed for all RNGs
+  user_queries: [1, 2, 4, 8, 16, 32, 64, 128]  # Number of test requests at each concurrency level
+  query_timeout: 120  # Number of seconds to wait for a simulated user to complete any executing task before exiting. 120 sec by default.
+  random_prompt: false  # Use random prompts if true, fixed prompts if false
+  collect_service_metric: false  # Collect service metrics if true, do not collect service metrics if false
+  data_visualization: false # Generate data visualization if true, do not generate data visualization if false
+  llm_model: "Intel/neural-chat-7b-v3-3"  # The LLM model used for the test
+  test_output_dir: "/tmp/benchmark_output"  # The directory to store the test output
+  load_shape:              # Tenant concurrency pattern
+    name: constant           # poisson or constant(locust default load shape)
+    params:                  # Loadshape-specific parameters
+      constant:                # Constant load shape specific parameters, activate only if load_shape is constant
+        concurrent_level: 4      # If user_queries is specified, concurrent_level is target number of requests per user. If not, it is the number of simulated users
+      poisson:                 # Poisson load shape specific parameters, activate only if load_shape is poisson
+        arrival-rate: 1.0        # Request arrival rate
+  namespace: "" # Fill the user-defined namespace. Otherwise, it will be default.
+
+test_cases:
+  audioqna:
+    asr:
+      run_test: true
+      service_name: "asr-svc"  # Replace with your service name
+    llm:
+      run_test: true
+      service_name: "llm-svc"  # Replace with your service name
+      parameters:
+        model_name: "Intel/neural-chat-7b-v3-3"
+        max_new_tokens: 128
+        temperature: 0.01
+        top_k: 10
+        top_p: 0.95
+        repetition_penalty: 1.03
+        streaming: true
+    llmserve:
+      run_test: true
+      service_name: "llm-svc"  # Replace with your service name
+    tts:
+      run_test: true
+      service_name: "tts-svc"  # Replace with your service name
+    e2e:
+      run_test: true
+      service_name: "audioqna-backend-server-svc"  # Replace with your service name
--- a/AudioQnA/docker_compose/intel/cpu/xeon/set_env.sh
+++ b/AudioQnA/docker_compose/intel/cpu/xeon/set_env.sh
@@ -0,0 +1,7 @@
+#!/usr/bin/env bash
+
+# Copyright (C) 2024 Intel Corporation
+# SPDX-License-Identifier: Apache-2.0
+pushd "../../../../../" > /dev/null
+source .set_env.sh
+popd > /dev/null
--- a/AudioQnA/docker_compose/intel/hpu/gaudi/compose.yaml
+++ b/AudioQnA/docker_compose/intel/hpu/gaudi/compose.yaml
@@ -51,7 +51,7 @@ services:
    environment:
      TTS_ENDPOINT: ${TTS_ENDPOINT}
  tgi-service:
-    image: ghcr.io/huggingface/tgi-gaudi:2.0.5
+    image: ghcr.io/huggingface/tgi-gaudi:2.0.6
    container_name: tgi-gaudi-server
    ports:
      - "3006:80"
--- a/AudioQnA/docker_compose/intel/hpu/gaudi/set_env.sh
+++ b/AudioQnA/docker_compose/intel/hpu/gaudi/set_env.sh
@@ -0,0 +1,7 @@
+#!/usr/bin/env bash
+
+# Copyright (C) 2024 Intel Corporation
+# SPDX-License-Identifier: Apache-2.0
+pushd "../../../../../" > /dev/null
+source .set_env.sh
+popd > /dev/null
--- a/AudioQnA/kubernetes/intel/README_gmc.md
+++ b/AudioQnA/kubernetes/intel/README_gmc.md
@@ -25,7 +25,7 @@ The AudioQnA uses the below prebuilt images if you choose a Xeon deployment
 Should you desire to use the Gaudi accelerator, two alternate images are used for the embedding and llm services.
 For Gaudi:

- tgi-service: ghcr.io/huggingface/tgi-gaudi:2.0.5
+- tgi-service: ghcr.io/huggingface/tgi-gaudi:2.0.6
 - whisper-gaudi: opea/whisper-gaudi:latest
 - speecht5-gaudi: opea/speecht5-gaudi:latest

--- a/AudioQnA/kubernetes/intel/hpu/gaudi/manifest/audioqna.yaml
+++ b/AudioQnA/kubernetes/intel/hpu/gaudi/manifest/audioqna.yaml
@@ -271,7 +271,7 @@ spec:
      - envFrom:
        - configMapRef:
            name: audio-qna-config
-        image: ghcr.io/huggingface/tgi-gaudi:2.0.5
+        image: ghcr.io/huggingface/tgi-gaudi:2.0.6
        name: llm-dependency-deploy-demo
        securityContext:
          capabilities:
--- a/AudioQnA/tests/test_compose_on_gaudi.sh
+++ b/AudioQnA/tests/test_compose_on_gaudi.sh
@@ -22,7 +22,7 @@ function build_docker_images() {
    service_list="audioqna whisper-gaudi asr llm-tgi speecht5-gaudi tts"
    docker compose -f build.yaml build ${service_list} --no-cache > ${LOG_PATH}/docker_image_build.log

-    docker pull ghcr.io/huggingface/tgi-gaudi:2.0.5
+    docker pull ghcr.io/huggingface/tgi-gaudi:2.0.6
    docker images && sleep 1s
 }

@@ -100,7 +100,7 @@ function validate_megaservice() {
 #
 #    sed -i "s/localhost/$ip_address/g" playwright.config.ts
 #
-##    conda install -c conda-forge nodejs -y
+##    conda install -c conda-forge nodejs=22.6.0 -y
 #    npm install && npm ci && npx playwright install --with-deps
 #    node -v && npm -v && pip list
 #
--- a/AudioQnA/tests/test_compose_on_xeon.sh
+++ b/AudioQnA/tests/test_compose_on_xeon.sh
@@ -22,7 +22,7 @@ function build_docker_images() {
    service_list="audioqna whisper asr llm-tgi speecht5 tts"
    docker compose -f build.yaml build ${service_list} --no-cache > ${LOG_PATH}/docker_image_build.log

-    docker pull ghcr.io/huggingface/tgi-gaudi:2.0.5
+    docker pull ghcr.io/huggingface/tgi-gaudi:2.0.6
    docker images && sleep 1s
 }

@@ -90,7 +90,7 @@ function validate_megaservice() {
 #
 #    sed -i "s/localhost/$ip_address/g" playwright.config.ts
 #
-##    conda install -c conda-forge nodejs -y
+##    conda install -c conda-forge nodejs=22.6.0 -y
 #    npm install && npm ci && npx playwright install --with-deps
 #    node -v && npm -v && pip list
 #
--- a/AudioQnA/ui/docker/Dockerfile
+++ b/AudioQnA/ui/docker/Dockerfile
@@ -23,4 +23,4 @@ RUN npm run build
 EXPOSE 5173

 # Run the front-end application in preview mode
-CMD ["npm", "run", "preview", "--", "--port", "5173", "--host", "0.0.0.0"]
+CMD ["npm", "run", "preview", "--", "--port", "5173", "--host", "0.0.0.0"]
--- a/AudioQnA/ui/svelte/src/app.postcss
+++ b/AudioQnA/ui/svelte/src/app.postcss
@@ -79,4 +79,4 @@ a.btn {

 .w-12\/12 {
 	width: 100%
-}
+}
--- a/AudioQnA/ui/svelte/src/lib/assets/icons/svg/1.svg
+++ b/AudioQnA/ui/svelte/src/lib/assets/icons/svg/1.svg
@@ -89,4 +89,4 @@
            <stop offset="1" stop-color="#3300FF" stop-opacity="0.2" />
        </linearGradient>
    </defs>
-</svg>
+</svg>
--- a/AudioQnA/ui/svelte/src/lib/assets/icons/svg/2.svg
+++ b/AudioQnA/ui/svelte/src/lib/assets/icons/svg/2.svg
@@ -89,4 +89,4 @@
            <stop offset="1" stop-color="#f3f4f6" stop-opacity="0" />
        </linearGradient>
    </defs>
-</svg>
+</svg>
--- a/AudioQnA/ui/svelte/src/lib/assets/icons/svg/3.svg
+++ b/AudioQnA/ui/svelte/src/lib/assets/icons/svg/3.svg
@@ -76,4 +76,4 @@
            <stop offset="1" stop-color="#9CFFED" stop-opacity="0" />
        </linearGradient>
    </defs>
-</svg>
+</svg>
--- a/AudioQnA/ui/svelte/src/lib/assets/icons/svg/4.svg
+++ b/AudioQnA/ui/svelte/src/lib/assets/icons/svg/4.svg
@@ -76,4 +76,4 @@
            <stop offset="1" stop-color="#6141E1" stop-opacity="0" />
        </linearGradient>
    </defs>
-</svg>
+</svg>
--- a/AudioQnA/ui/svelte/src/lib/assets/icons/svg/5.svg
+++ b/AudioQnA/ui/svelte/src/lib/assets/icons/svg/5.svg
@@ -89,4 +89,4 @@
            <stop offset="1" stop-color="#3300FF" stop-opacity="0" />
        </linearGradient>
    </defs>
-</svg>
+</svg>
--- a/AudioQnA/ui/svelte/src/lib/assets/icons/svg/stop-recording.svg
+++ b/AudioQnA/ui/svelte/src/lib/assets/icons/svg/stop-recording.svg
@@ -3,4 +3,4 @@
    <path
        d="M512 1024a512 512 0 1 1 512-512 512 512 0 0 1-512 512z m0-896a384 384 0 1 0 384 384A384 384 0 0 0 512 128z m128 576h-256a64 64 0 0 1-64-64v-256a64 64 0 0 1 64-64h256a64 64 0 0 1 64 64v256a64 64 0 0 1-64 64z"
        fill="#d81e06" p-id="3104"></path>
-</svg>
+</svg>
--- a/AudioQnA/ui/svelte/src/lib/assets/icons/svg/upload.svg
+++ b/AudioQnA/ui/svelte/src/lib/assets/icons/svg/upload.svg
@@ -1 +1 @@
-<svg t="1713431562066" class="icon" viewBox="0 0 1024 1024" version="1.1" xmlns="http://www.w3.org/2000/svg" p-id="6399" width="32" height="32"><path d="M592 768h-160c-26.6 0-48-21.4-48-48V384h-175.4c-35.6 0-53.4-43-28.2-68.2L484.6 11.4c15-15 39.6-15 54.6 0l304.4 304.4c25.2 25.2 7.4 68.2-28.2 68.2H640v336c0 26.6-21.4 48-48 48z m432-16v224c0 26.6-21.4 48-48 48H48c-26.6 0-48-21.4-48-48V752c0-26.6 21.4-48 48-48h272v16c0 61.8 50.2 112 112 112h160c61.8 0 112-50.2 112-112v-16h272c26.6 0 48 21.4 48 48z m-248 176c0-22-18-40-40-40s-40 18-40 40 18 40 40 40 40-18 40-40z m128 0c0-22-18-40-40-40s-40 18-40 40 18 40 40 40 40-18 40-40z" p-id="6400" fill="#ffffff"></path></svg>
+<svg t="1713431562066" class="icon" viewBox="0 0 1024 1024" version="1.1" xmlns="http://www.w3.org/2000/svg" p-id="6399" width="32" height="32"><path d="M592 768h-160c-26.6 0-48-21.4-48-48V384h-175.4c-35.6 0-53.4-43-28.2-68.2L484.6 11.4c15-15 39.6-15 54.6 0l304.4 304.4c25.2 25.2 7.4 68.2-28.2 68.2H640v336c0 26.6-21.4 48-48 48z m432-16v224c0 26.6-21.4 48-48 48H48c-26.6 0-48-21.4-48-48V752c0-26.6 21.4-48 48-48h272v16c0 61.8 50.2 112 112 112h160c61.8 0 112-50.2 112-112v-16h272c26.6 0 48 21.4 48 48z m-248 176c0-22-18-40-40-40s-40 18-40 40 18 40 40 40 40-18 40-40z m128 0c0-22-18-40-40-40s-40 18-40 40 18 40 40 40 40-18 40-40z" p-id="6400" fill="#ffffff"></path></svg>
--- a/AudioQnA/ui/svelte/src/lib/assets/icons/svg/voice.svg
+++ b/AudioQnA/ui/svelte/src/lib/assets/icons/svg/voice.svg
@@ -6,4 +6,4 @@
    <path
        d="M864 479.776 864 352c0-17.664-14.304-32-32-32s-32 14.336-32 32l0 127.776c0 160.16-129.184 290.464-288 290.464-158.784 0-288-130.304-288-290.464L224 352c0-17.664-14.336-32-32-32s-32 14.336-32 32l0 127.776c0 184.608 140.864 336.48 320 352.832L480 896 288 896c-17.664 0-32 14.304-32 32s14.336 32 32 32l448 0c17.696 0 32-14.304 32-32s-14.304-32-32-32l-192 0 0-63.36C723.136 816.256 864 664.384 864 479.776z"
        fill="#707070" p-id="2962"></path>
-</svg>
+</svg>
--- a/AudioQnA/ui/svelte/src/lib/assets/icons/svg/voiceOff.svg
+++ b/AudioQnA/ui/svelte/src/lib/assets/icons/svg/voiceOff.svg
--- a/AudioQnA/ui/svelte/src/lib/assets/icons/svg/voiceOn.svg
+++ b/AudioQnA/ui/svelte/src/lib/assets/icons/svg/voiceOn.svg
--- a/AvatarChatbot/.gitignore
+++ b/AvatarChatbot/.gitignore
@@ -5,4 +5,4 @@
 docker_compose/intel/cpu/xeon/data
 docker_compose/intel/hpu/gaudi/data
 inputs/
-outputs/
+outputs/
--- a/AvatarChatbot/docker_compose/intel/cpu/xeon/set_env.sh
+++ b/AvatarChatbot/docker_compose/intel/cpu/xeon/set_env.sh
@@ -0,0 +1,7 @@
+#!/usr/bin/env bash
+
+# Copyright (C) 2024 Intel Corporation
+# SPDX-License-Identifier: Apache-2.0
+pushd "../../../../../" > /dev/null
+source .set_env.sh
+popd > /dev/null
--- a/AvatarChatbot/docker_compose/intel/hpu/gaudi/compose.yaml
+++ b/AvatarChatbot/docker_compose/intel/hpu/gaudi/compose.yaml
@@ -15,7 +15,7 @@ services:
      no_proxy: ${no_proxy}
      http_proxy: ${http_proxy}
      https_proxy: ${https_proxy}
-      HABANA_VISIBLE_MODULES: all
+      HABANA_VISIBLE_DEVICES: all
      OMPI_MCA_btl_vader_single_copy_mechanism: none
    runtime: habana
    cap_add:
@@ -39,7 +39,7 @@ services:
      no_proxy: ${no_proxy}
      http_proxy: ${http_proxy}
      https_proxy: ${https_proxy}
-      HABANA_VISIBLE_MODULES: all
+      HABANA_VISIBLE_DEVICES: all
      OMPI_MCA_btl_vader_single_copy_mechanism: none
    runtime: habana
    cap_add:
@@ -54,7 +54,7 @@ services:
    environment:
      TTS_ENDPOINT: ${TTS_ENDPOINT}
  tgi-service:
-    image: ghcr.io/huggingface/tgi-gaudi:2.0.5
+    image: ghcr.io/huggingface/tgi-gaudi:2.0.6
    container_name: tgi-gaudi-server
    ports:
      - "3006:80"
@@ -67,7 +67,7 @@ services:
      HUGGING_FACE_HUB_TOKEN: ${HUGGINGFACEHUB_API_TOKEN}
      HF_HUB_DISABLE_PROGRESS_BARS: 1
      HF_HUB_ENABLE_HF_TRANSFER: 0
-      HABANA_VISIBLE_MODULES: all
+      HABANA_VISIBLE_DEVICES: all
      OMPI_MCA_btl_vader_single_copy_mechanism: none
      ENABLE_HPU_GRAPH: true
      LIMIT_HPU_GRAPH: true
@@ -105,7 +105,7 @@ services:
      no_proxy: ${no_proxy}
      http_proxy: ${http_proxy}
      https_proxy: ${https_proxy}
-      HABANA_VISIBLE_MODULES: all
+      HABANA_VISIBLE_DEVICES: all
      OMPI_MCA_btl_vader_single_copy_mechanism: none
      DEVICE: ${DEVICE}
      INFERENCE_MODE: ${INFERENCE_MODE}
@@ -132,7 +132,7 @@ services:
      no_proxy: ${no_proxy}
      http_proxy: ${http_proxy}
      https_proxy: ${https_proxy}
-      HABANA_VISIBLE_MODULES: all
+      HABANA_VISIBLE_DEVICES: all
      OMPI_MCA_btl_vader_single_copy_mechanism: none
      WAV2LIP_ENDPOINT: ${WAV2LIP_ENDPOINT}
    runtime: habana
--- a/AvatarChatbot/docker_compose/intel/hpu/gaudi/set_env.sh
+++ b/AvatarChatbot/docker_compose/intel/hpu/gaudi/set_env.sh
@@ -0,0 +1,7 @@
+#!/usr/bin/env bash
+
+# Copyright (C) 2024 Intel Corporation
+# SPDX-License-Identifier: Apache-2.0
+pushd "../../../../../" > /dev/null
+source .set_env.sh
+popd > /dev/null
--- a/AvatarChatbot/tests/test_compose_on_gaudi.sh
+++ b/AvatarChatbot/tests/test_compose_on_gaudi.sh
@@ -29,7 +29,7 @@ function build_docker_images() {
    service_list="avatarchatbot whisper-gaudi asr llm-tgi speecht5-gaudi tts wav2lip-gaudi animation"
    docker compose -f build.yaml build ${service_list} --no-cache > ${LOG_PATH}/docker_image_build.log

-    docker pull ghcr.io/huggingface/tgi-gaudi:2.0.5
+    docker pull ghcr.io/huggingface/tgi-gaudi:2.0.6

    docker images && sleep 1s
 }
@@ -74,7 +74,7 @@ function start_services() {
    export FPS=10

    # Start Docker Containers
-    docker compose up -d
+    docker compose up -d > ${LOG_PATH}/start_services_with_compose.log

    n=0
    until [[ "$n" -ge 100 ]]; do
@@ -86,7 +86,6 @@ function start_services() {
       n=$((n+1))
    done

-    # sleep 5m
    echo "All services are up and running"
    sleep 5s
 }
@@ -99,6 +98,7 @@ function validate_megaservice() {
    if [[ $result == *"mp4"* ]]; then
        echo "Result correct."
    else
+        echo "Result wrong, print docker logs."
        docker logs whisper-service > $LOG_PATH/whisper-service.log
        docker logs asr-service > $LOG_PATH/asr-service.log
        docker logs speecht5-service > $LOG_PATH/speecht5-service.log
@@ -107,19 +107,13 @@ function validate_megaservice() {
        docker logs llm-tgi-gaudi-server > $LOG_PATH/llm-tgi-gaudi-server.log
        docker logs wav2lip-service > $LOG_PATH/wav2lip-service.log
        docker logs animation-gaudi-server > $LOG_PATH/animation-gaudi-server.log
-
-        echo "Result wrong."
+        echo "Exit test."
        exit 1
    fi

 }


-#function validate_frontend() {
-
-#}
-
-
 function stop_docker() {
    cd $WORKPATH/docker_compose/intel/hpu/gaudi
    docker compose down
--- a/AvatarChatbot/tests/test_compose_on_xeon.sh
+++ b/AvatarChatbot/tests/test_compose_on_xeon.sh
@@ -29,7 +29,7 @@ function build_docker_images() {
    service_list="avatarchatbot whisper asr llm-tgi speecht5 tts wav2lip animation"
    docker compose -f build.yaml build ${service_list} --no-cache > ${LOG_PATH}/docker_image_build.log

-    docker pull ghcr.io/huggingface/tgi-gaudi:2.0.5
+    docker pull ghcr.io/huggingface/tgi-gaudi:2.0.6

    docker images && sleep 1s
 }
--- a/ChatQnA/Dockerfile
+++ b/ChatQnA/Dockerfile
@@ -18,7 +18,7 @@ WORKDIR /home/user/
 RUN git clone https://github.com/opea-project/GenAIComps.git

 WORKDIR /home/user/GenAIComps
-RUN pip install --no-cache-dir --upgrade pip && \
+RUN pip install --no-cache-dir --upgrade pip setuptools && \
    pip install --no-cache-dir -r /home/user/GenAIComps/requirements.txt && \
    pip install --no-cache-dir langchain_core

--- a/ChatQnA/Dockerfile.guardrails
+++ b/ChatQnA/Dockerfile.guardrails
@@ -18,7 +18,7 @@ WORKDIR /home/user/
 RUN git clone https://github.com/opea-project/GenAIComps.git

 WORKDIR /home/user/GenAIComps
-RUN pip install --no-cache-dir --upgrade pip && \
+RUN pip install --no-cache-dir --upgrade pip setuptools && \
    pip install --no-cache-dir -r /home/user/GenAIComps/requirements.txt && \
    pip install --no-cache-dir langchain_core

--- a/ChatQnA/Dockerfile.without_rerank
+++ b/ChatQnA/Dockerfile.without_rerank
@@ -18,7 +18,7 @@ WORKDIR /home/user/
 RUN git clone https://github.com/opea-project/GenAIComps.git

 WORKDIR /home/user/GenAIComps
-RUN pip install --no-cache-dir --upgrade pip && \
+RUN pip install --no-cache-dir --upgrade pip setuptools && \
    pip install --no-cache-dir -r /home/user/GenAIComps/requirements.txt && \
    pip install --no-cache-dir langchain_core

--- a/ChatQnA/Dockerfile.wrapper
+++ b/ChatQnA/Dockerfile.wrapper
@@ -0,0 +1,32 @@
+# Copyright (C) 2024 Intel Corporation
+# SPDX-License-Identifier: Apache-2.0
+
+FROM python:3.11-slim
+
+RUN apt-get update -y && apt-get install -y --no-install-recommends --fix-missing \
+    libgl1-mesa-glx \
+    libjemalloc-dev \
+    git
+
+RUN useradd -m -s /bin/bash user && \
+    mkdir -p /home/user && \
+    chown -R user /home/user/
+
+WORKDIR /home/user/
+RUN git clone https://github.com/opea-project/GenAIComps.git
+
+WORKDIR /home/user/GenAIComps
+RUN pip install --no-cache-dir --upgrade pip && \
+    pip install --no-cache-dir -r /home/user/GenAIComps/requirements.txt
+
+COPY ./chatqna_wrapper.py /home/user/chatqna.py
+
+ENV PYTHONPATH=$PYTHONPATH:/home/user/GenAIComps
+
+USER user
+
+WORKDIR /home/user
+
+RUN echo 'ulimit -S -n 999999' >> ~/.bashrc
+
+ENTRYPOINT ["python", "chatqna.py"]
--- a/ChatQnA/README.md
+++ b/ChatQnA/README.md
@@ -4,7 +4,26 @@ Chatbots are the most widely adopted use case for leveraging the powerful chat a

 RAG bridges the knowledge gap by dynamically fetching relevant information from external sources, ensuring that responses generated remain factual and current. The core of this architecture are vector databases, which are instrumental in enabling efficient and semantic retrieval of information. These databases store data as vectors, allowing RAG to swiftly access the most pertinent documents or data points based on semantic similarity.

-## Deploy ChatQnA Service
+## 🤖 Automated Terraform Deployment using Intel® Optimized Cloud Modules for **Terraform**
+
+| Cloud Provider       | Intel Architecture                | Intel Optimized Cloud Module for Terraform                                                                                         | Comments                                                             |
+| -------------------- | --------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------- |
+| AWS                  | 4th Gen Intel Xeon with Intel AMX | [AWS Module](https://github.com/intel/terraform-intel-aws-vm/tree/main/examples/gen-ai-xeon-opea-chatqna)                          | Uses Intel/neural-chat-7b-v3-3 by default                            |
+| AWS Falcon2-11B      | 4th Gen Intel Xeon with Intel AMX | [AWS Module with Falcon11B](https://github.com/intel/terraform-intel-aws-vm/tree/main/examples/gen-ai-xeon-opea-chatqna-falcon11B) | Uses TII Falcon2-11B LLM Model                                       |
+| GCP                  | 5th Gen Intel Xeon with Intel AMX | [GCP Module](https://github.com/intel/terraform-intel-gcp-vm/tree/main/examples/gen-ai-xeon-opea-chatqna)                          | Also supports Confidential AI by using Intel® TDX with 4th Gen Xeon |
+| Azure                | 5th Gen Intel Xeon with Intel AMX | Work-in-progress                                                                                                                   | Work-in-progress                                                     |
+| Intel Tiber AI Cloud | 5th Gen Intel Xeon with Intel AMX | Work-in-progress                                                                                                                   | Work-in-progress                                                     |
+
+## Automated Deployment to Ubuntu-based System (if not using Terraform) using Intel® Optimized Cloud Modules for **Ansible**
+
+To deploy to an existing Xeon Ubuntu-based system, use our Intel Optimized Cloud Modules for Ansible. This is the same Ansible playbook used by Terraform.
+Use this if you are not using Terraform and have provisioned your system with another tool or manually, including bare metal.
+| Operating System | Intel Optimized Cloud Module for Ansible |
+|------------------|------------------------------------------|
+| Ubuntu 20.04 | [ChatQnA Ansible Module](https://github.com/intel/optimized-cloud-recipes/tree/main/recipes/ai-opea-chatqna-xeon) |
+| Ubuntu 22.04 | Work-in-progress |
+
+## Manually Deploy ChatQnA Service

 The ChatQnA service can be effortlessly deployed on Intel Gaudi2, Intel Xeon Scalable Processors and Nvidia GPU.

--- a/ChatQnA/benchmark/accuracy/README.md
+++ b/ChatQnA/benchmark/accuracy/README.md
@@ -48,7 +48,7 @@ To setup a LLM model, we can use [tgi-gaudi](https://github.com/huggingface/tgi-
 docker run -p {your_llm_port}:80 --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e PT_HPU_ENABLE_LAZY_COLLECTIVES=true -e OMPI_MCA_btl_vader_single_copy_mechanism=none -e HF_TOKEN={your_hf_token} --cap-add=sys_nice --ipc=host ghcr.io/huggingface/tgi-gaudi:2.0.1 --model-id mistralai/Mixtral-8x7B-Instruct-v0.1 --max-input-tokens 2048 --max-total-tokens 4096 --sharded true --num-shard 2

 # for better performance, set `PREFILL_BATCH_BUCKET_SIZE`, `BATCH_BUCKET_SIZE`, `max-batch-total-tokens`, `max-batch-prefill-tokens`
-docker run -p {your_llm_port}:80 --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e PT_HPU_ENABLE_LAZY_COLLECTIVES=true -e OMPI_MCA_btl_vader_single_copy_mechanism=none -e HF_TOKEN={your_hf_token} -e PREFILL_BATCH_BUCKET_SIZE=1 -e BATCH_BUCKET_SIZE=8 --cap-add=sys_nice --ipc=host ghcr.io/huggingface/tgi-gaudi:2.0.5 --model-id mistralai/Mixtral-8x7B-Instruct-v0.1 --max-input-tokens 2048 --max-total-tokens 4096 --sharded true --num-shard 2 --max-batch-total-tokens 65536 --max-batch-prefill-tokens 2048
+docker run -p {your_llm_port}:80 --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e PT_HPU_ENABLE_LAZY_COLLECTIVES=true -e OMPI_MCA_btl_vader_single_copy_mechanism=none -e HF_TOKEN={your_hf_token} -e PREFILL_BATCH_BUCKET_SIZE=1 -e BATCH_BUCKET_SIZE=8 --cap-add=sys_nice --ipc=host ghcr.io/huggingface/tgi-gaudi:2.0.6 --model-id mistralai/Mixtral-8x7B-Instruct-v0.1 --max-input-tokens 2048 --max-total-tokens 4096 --sharded true --num-shard 2 --max-batch-total-tokens 65536 --max-batch-prefill-tokens 2048
 ```

 ### Prepare Dataset
--- a/ChatQnA/benchmark/performance-deprecated/README.md
+++ b/ChatQnA/benchmark/performance-deprecated/README.md
--- a/ChatQnA/benchmark/performance-deprecated/benchmark.sh
+++ b/ChatQnA/benchmark/performance-deprecated/benchmark.sh
--- a/ChatQnA/benchmark/performance-deprecated/benchmark.yaml
+++ b/ChatQnA/benchmark/performance-deprecated/benchmark.yaml
--- a/ChatQnA/benchmark/performance-deprecated/helm_charts/.helmignore
+++ b/ChatQnA/benchmark/performance-deprecated/helm_charts/.helmignore
--- a/ChatQnA/benchmark/performance-deprecated/helm_charts/Chart.yaml
+++ b/ChatQnA/benchmark/performance-deprecated/helm_charts/Chart.yaml
--- a/ChatQnA/benchmark/performance-deprecated/helm_charts/README.md
+++ b/ChatQnA/benchmark/performance-deprecated/helm_charts/README.md
--- a/ChatQnA/benchmark/performance-deprecated/helm_charts/customize.yaml
+++ b/ChatQnA/benchmark/performance-deprecated/helm_charts/customize.yaml
--- a/ChatQnA/benchmark/performance-deprecated/helm_charts/templates/configmap.yaml
+++ b/ChatQnA/benchmark/performance-deprecated/helm_charts/templates/configmap.yaml
--- a/ChatQnA/benchmark/performance-deprecated/helm_charts/templates/deployment.yaml
+++ b/ChatQnA/benchmark/performance-deprecated/helm_charts/templates/deployment.yaml
--- a/ChatQnA/benchmark/performance-deprecated/helm_charts/templates/service.yaml
+++ b/ChatQnA/benchmark/performance-deprecated/helm_charts/templates/service.yaml
--- a/ChatQnA/benchmark/performance-deprecated/helm_charts/values.yaml
+++ b/ChatQnA/benchmark/performance-deprecated/helm_charts/values.yaml
--- a/ChatQnA/benchmark/performance-deprecated/oob/with_rerank/eight_gaudi/oob_eight_gaudi_with_rerank.yaml
+++ b/ChatQnA/benchmark/performance-deprecated/oob/with_rerank/eight_gaudi/oob_eight_gaudi_with_rerank.yaml
@@ -237,7 +237,7 @@ spec:
        envFrom:
        - configMapRef:
            name: qna-config
-        image: ghcr.io/huggingface/tgi-gaudi:2.0.5
+        image: ghcr.io/huggingface/tgi-gaudi:2.0.6
        imagePullPolicy: IfNotPresent
        name: llm-dependency-deploy
        ports:
@@ -327,7 +327,7 @@ spec:
        envFrom:
        - configMapRef:
            name: qna-config
-        image: ghcr.io/huggingface/tei-gaudi:latest
+        image: ghcr.io/huggingface/tei-gaudi:1.5.0
        imagePullPolicy: IfNotPresent
        name: reranking-dependency-deploy
        ports:
--- a/ChatQnA/benchmark/performance-deprecated/oob/with_rerank/four_gaudi/oob_four_gaudi_with_rerank.yaml
+++ b/ChatQnA/benchmark/performance-deprecated/oob/with_rerank/four_gaudi/oob_four_gaudi_with_rerank.yaml
@@ -237,7 +237,7 @@ spec:
        envFrom:
        - configMapRef:
            name: qna-config
-        image: ghcr.io/huggingface/tgi-gaudi:2.0.5
+        image: ghcr.io/huggingface/tgi-gaudi:2.0.6
        imagePullPolicy: IfNotPresent
        name: llm-dependency-deploy
        ports:
@@ -327,7 +327,7 @@ spec:
        envFrom:
        - configMapRef:
            name: qna-config
-        image: ghcr.io/huggingface/tei-gaudi:latest
+        image: ghcr.io/huggingface/tei-gaudi:1.5.0
        imagePullPolicy: IfNotPresent
        name: reranking-dependency-deploy
        ports:
--- a/ChatQnA/benchmark/performance-deprecated/oob/with_rerank/single_gaudi/oob_single_gaudi_with_rerank.yaml
+++ b/ChatQnA/benchmark/performance-deprecated/oob/with_rerank/single_gaudi/oob_single_gaudi_with_rerank.yaml
@@ -237,7 +237,7 @@ spec:
        envFrom:
        - configMapRef:
            name: qna-config
-        image: ghcr.io/huggingface/tgi-gaudi:2.0.5
+        image: ghcr.io/huggingface/tgi-gaudi:2.0.6
        imagePullPolicy: IfNotPresent
        name: llm-dependency-deploy
        ports:
@@ -327,7 +327,7 @@ spec:
        envFrom:
        - configMapRef:
            name: qna-config
-        image: ghcr.io/huggingface/tei-gaudi:latest
+        image: ghcr.io/huggingface/tei-gaudi:1.5.0
        imagePullPolicy: IfNotPresent
        name: reranking-dependency-deploy
        ports:
--- a/ChatQnA/benchmark/performance-deprecated/oob/with_rerank/two_gaudi/oob_two_gaudi_with_rerank.yaml
+++ b/ChatQnA/benchmark/performance-deprecated/oob/with_rerank/two_gaudi/oob_two_gaudi_with_rerank.yaml
@@ -237,7 +237,7 @@ spec:
        envFrom:
        - configMapRef:
            name: qna-config
-        image: ghcr.io/huggingface/tgi-gaudi:2.0.5
+        image: ghcr.io/huggingface/tgi-gaudi:2.0.6
        imagePullPolicy: IfNotPresent
        name: llm-dependency-deploy
        ports:
@@ -327,7 +327,7 @@ spec:
        envFrom:
        - configMapRef:
            name: qna-config
-        image: ghcr.io/huggingface/tei-gaudi:latest
+        image: ghcr.io/huggingface/tei-gaudi:1.5.0
        imagePullPolicy: IfNotPresent
        name: reranking-dependency-deploy
        ports:
--- a/ChatQnA/benchmark/performance-deprecated/oob/without_rerank/eight_gaudi/oob_eight_gaudi_without_rerank.yaml
+++ b/ChatQnA/benchmark/performance-deprecated/oob/without_rerank/eight_gaudi/oob_eight_gaudi_without_rerank.yaml
@@ -237,7 +237,7 @@ spec:
        envFrom:
        - configMapRef:
            name: qna-config
-        image: ghcr.io/huggingface/tgi-gaudi:2.0.5
+        image: ghcr.io/huggingface/tgi-gaudi:2.0.6
        imagePullPolicy: IfNotPresent
        name: llm-dependency-deploy
        ports:
--- a/ChatQnA/benchmark/performance-deprecated/oob/without_rerank/four_gaudi/oob_four_gaudi_without_rerank.yaml
+++ b/ChatQnA/benchmark/performance-deprecated/oob/without_rerank/four_gaudi/oob_four_gaudi_without_rerank.yaml
@@ -237,7 +237,7 @@ spec:
        envFrom:
        - configMapRef:
            name: qna-config
-        image: ghcr.io/huggingface/tgi-gaudi:2.0.5
+        image: ghcr.io/huggingface/tgi-gaudi:2.0.6
        imagePullPolicy: IfNotPresent
        name: llm-dependency-deploy
        ports:
--- a/ChatQnA/benchmark/performance-deprecated/oob/without_rerank/single_gaudi/oob_single_gaudi_without_rerank.yaml
+++ b/ChatQnA/benchmark/performance-deprecated/oob/without_rerank/single_gaudi/oob_single_gaudi_without_rerank.yaml
@@ -237,7 +237,7 @@ spec:
        envFrom:
        - configMapRef:
            name: qna-config
-        image: ghcr.io/huggingface/tgi-gaudi:2.0.5
+        image: ghcr.io/huggingface/tgi-gaudi:2.0.6
        imagePullPolicy: IfNotPresent
        name: llm-dependency-deploy
        ports:
--- a/ChatQnA/benchmark/performance-deprecated/oob/without_rerank/two_gaudi/oob_two_gaudi_without_rerank.yaml
+++ b/ChatQnA/benchmark/performance-deprecated/oob/without_rerank/two_gaudi/oob_two_gaudi_without_rerank.yaml
@@ -237,7 +237,7 @@ spec:
        envFrom:
        - configMapRef:
            name: qna-config
-        image: ghcr.io/huggingface/tgi-gaudi:2.0.5
+        image: ghcr.io/huggingface/tgi-gaudi:2.0.6
        imagePullPolicy: IfNotPresent
        name: llm-dependency-deploy
        ports:
--- a/ChatQnA/benchmark/performance-deprecated/tuned/with_rerank/eight_gaudi/eight_gaudi_with_rerank.yaml
+++ b/ChatQnA/benchmark/performance-deprecated/tuned/with_rerank/eight_gaudi/eight_gaudi_with_rerank.yaml
@@ -255,7 +255,7 @@ spec:
        envFrom:
        - configMapRef:
            name: qna-config
-        image: ghcr.io/huggingface/tgi-gaudi:2.0.5
+        image: ghcr.io/huggingface/tgi-gaudi:2.0.6
        imagePullPolicy: IfNotPresent
        name: llm-dependency-deploy
        ports:
@@ -345,7 +345,7 @@ spec:
        envFrom:
        - configMapRef:
            name: qna-config
-        image: ghcr.io/huggingface/tei-gaudi:latest
+        image: ghcr.io/huggingface/tei-gaudi:1.5.0
        imagePullPolicy: IfNotPresent
        name: reranking-dependency-deploy
        ports:
--- a/ChatQnA/benchmark/performance-deprecated/tuned/with_rerank/four_gaudi/tuned_four_gaudi_with_rerank.yaml
+++ b/ChatQnA/benchmark/performance-deprecated/tuned/with_rerank/four_gaudi/tuned_four_gaudi_with_rerank.yaml
@@ -255,7 +255,7 @@ spec:
        envFrom:
        - configMapRef:
            name: qna-config
-        image: ghcr.io/huggingface/tgi-gaudi:2.0.5
+        image: ghcr.io/huggingface/tgi-gaudi:2.0.6
        imagePullPolicy: IfNotPresent
        name: llm-dependency-deploy
        ports:
@@ -345,7 +345,7 @@ spec:
        envFrom:
        - configMapRef:
            name: qna-config
-        image: ghcr.io/huggingface/tei-gaudi:latest
+        image: ghcr.io/huggingface/tei-gaudi:1.5.0
        imagePullPolicy: IfNotPresent
        name: reranking-dependency-deploy
        ports:
--- a/ChatQnA/benchmark/performance-deprecated/tuned/with_rerank/single_gaudi/tuned_single_gaudi_with_rerank.yaml
+++ b/ChatQnA/benchmark/performance-deprecated/tuned/with_rerank/single_gaudi/tuned_single_gaudi_with_rerank.yaml
@@ -255,7 +255,7 @@ spec:
        envFrom:
        - configMapRef:
            name: qna-config
-        image: ghcr.io/huggingface/tgi-gaudi:2.0.5
+        image: ghcr.io/huggingface/tgi-gaudi:2.0.6
        imagePullPolicy: IfNotPresent
        name: llm-dependency-deploy
        ports:
@@ -345,7 +345,7 @@ spec:
        envFrom:
        - configMapRef:
            name: qna-config
-        image: ghcr.io/huggingface/tei-gaudi:latest
+        image: ghcr.io/huggingface/tei-gaudi:1.5.0
        imagePullPolicy: IfNotPresent
        name: reranking-dependency-deploy
        ports:
--- a/ChatQnA/benchmark/performance-deprecated/tuned/with_rerank/two_gaudi/tuned_two_gaudi_with_rerank.yaml
+++ b/ChatQnA/benchmark/performance-deprecated/tuned/with_rerank/two_gaudi/tuned_two_gaudi_with_rerank.yaml
@@ -255,7 +255,7 @@ spec:
        envFrom:
        - configMapRef:
            name: qna-config
-        image: ghcr.io/huggingface/tgi-gaudi:2.0.5
+        image: ghcr.io/huggingface/tgi-gaudi:2.0.6
        imagePullPolicy: IfNotPresent
        name: llm-dependency-deploy
        ports:
@@ -345,7 +345,7 @@ spec:
        envFrom:
        - configMapRef:
            name: qna-config
-        image: ghcr.io/huggingface/tei-gaudi:latest
+        image: ghcr.io/huggingface/tei-gaudi:1.5.0
        imagePullPolicy: IfNotPresent
        name: reranking-dependency-deploy
        ports:
--- a/ChatQnA/benchmark/performance-deprecated/tuned/without_rerank/eight_gaudi/tuned_eight_gaudi_without_rerank.yaml
+++ b/ChatQnA/benchmark/performance-deprecated/tuned/without_rerank/eight_gaudi/tuned_eight_gaudi_without_rerank.yaml
@@ -255,7 +255,7 @@ spec:
        envFrom:
        - configMapRef:
            name: qna-config
-        image: ghcr.io/huggingface/tgi-gaudi:2.0.5
+        image: ghcr.io/huggingface/tgi-gaudi:2.0.6
        imagePullPolicy: IfNotPresent
        name: llm-dependency-deploy
        ports:
--- a/ChatQnA/benchmark/performance-deprecated/tuned/without_rerank/four_gaudi/tuned_four_gaudi_without_rerank.yaml
+++ b/ChatQnA/benchmark/performance-deprecated/tuned/without_rerank/four_gaudi/tuned_four_gaudi_without_rerank.yaml
@@ -255,7 +255,7 @@ spec:
        envFrom:
        - configMapRef:
            name: qna-config
-        image: ghcr.io/huggingface/tgi-gaudi:2.0.5
+        image: ghcr.io/huggingface/tgi-gaudi:2.0.6
        imagePullPolicy: IfNotPresent
        name: llm-dependency-deploy
        ports:
--- a/ChatQnA/benchmark/performance-deprecated/tuned/without_rerank/single_gaudi/tuned_single_gaudi_without_rerank.yaml
+++ b/ChatQnA/benchmark/performance-deprecated/tuned/without_rerank/single_gaudi/tuned_single_gaudi_without_rerank.yaml
@@ -255,7 +255,7 @@ spec:
        envFrom:
        - configMapRef:
            name: qna-config
-        image: ghcr.io/huggingface/tgi-gaudi:2.0.5
+        image: ghcr.io/huggingface/tgi-gaudi:2.0.6
        imagePullPolicy: IfNotPresent
        name: llm-dependency-deploy
        ports:
--- a/ChatQnA/benchmark/performance-deprecated/tuned/without_rerank/two_gaudi/tuned_two_gaudi_without_rerank.yaml
+++ b/ChatQnA/benchmark/performance-deprecated/tuned/without_rerank/two_gaudi/tuned_two_gaudi_without_rerank.yaml
@@ -255,7 +255,7 @@ spec:
        envFrom:
        - configMapRef:
            name: qna-config
-        image: ghcr.io/huggingface/tgi-gaudi:2.0.5
+        image: ghcr.io/huggingface/tgi-gaudi:2.0.6
        imagePullPolicy: IfNotPresent
        name: llm-dependency-deploy
        ports:
--- a/ChatQnA/benchmark/performance/kubernetes/intel/gaudi/README.md
+++ b/ChatQnA/benchmark/performance/kubernetes/intel/gaudi/README.md
@@ -0,0 +1,196 @@
+# ChatQnA Benchmarking
+
+This folder contains a collection of Kubernetes manifest files for deploying the ChatQnA service across scalable nodes. It includes a comprehensive [benchmarking tool](https://github.com/opea-project/GenAIEval/blob/main/evals/benchmark/README.md) that enables throughput analysis to assess inference performance.
+
+By following this guide, you can run benchmarks on your deployment and share the results with the OPEA community.
+
+## Purpose
+
+We aim to run these benchmarks and share them with the OPEA community for three primary reasons:
+
+- To offer insights on inference throughput in real-world scenarios, helping you choose the best service or deployment for your needs.
+- To establish a baseline for validating optimization solutions across different implementations, providing clear guidance on which methods are most effective for your use case.
+- To inspire the community to build upon our benchmarks, allowing us to better quantify new solutions in conjunction with current leading LLMs, serving frameworks, etc.
+
+## Metrics
+
+The benchmark reports the following metrics:
+
+- Number of Concurrent Requests
+- End-to-End Latency: P50, P90, P99 (in milliseconds)
+- End-to-End First Token Latency: P50, P90, P99 (in milliseconds)
+- Average Next Token Latency (in milliseconds)
+- Average Token Latency (in milliseconds)
+- Requests Per Second (RPS)
+- Output Tokens Per Second
+- Input Tokens Per Second
+
+Results will be displayed in the terminal and saved as a CSV file named `1_stats.csv` for easy export to spreadsheets.
+
+## Table of Contents
+
+- [Deployment](#deployment)
+  - [Prerequisites](#prerequisites)
+  - [Deployment Scenarios](#deployment-scenarios)
+    - [Case 1: Baseline Deployment with Rerank](#case-1-baseline-deployment-with-rerank)
+    - [Case 2: Baseline Deployment without Rerank](#case-2-baseline-deployment-without-rerank)
+    - [Case 3: Tuned Deployment with Rerank](#case-3-tuned-deployment-with-rerank)
+- [Benchmark](#benchmark)
+  - [Test Configurations](#test-configurations)
+  - [Test Steps](#test-steps)
+    - [Upload Retrieval File](#upload-retrieval-file)
+    - [Run Benchmark Test](#run-benchmark-test)
+    - [Data collection](#data-collection)
+- [Teardown](#teardown)
+
+## Deployment
+
+### Prerequisites
+
+- Kubernetes installation: Use [kubespray](https://github.com/opea-project/docs/blob/main/guide/installation/k8s_install/k8s_install_kubespray.md) or other official Kubernetes installation guides.
+- Helm installation: Follow the [Helm documentation](https://helm.sh/docs/intro/install/#helm) to install Helm.
+- Setup Hugging Face Token
+
+  To access models and APIs from Hugging Face, set your token as an environment variable.
+  ```bash
+  export HF_TOKEN="insert-your-huggingface-token-here"
+  ```
+- Prepare Shared Models (Optional but Strongly Recommended)
+
+  Downloading models simultaneously to multiple nodes in your cluster can overload resources such as network bandwidth, memory and storage. To prevent resource exhaustion, it's recommended to preload the models in advance.
+  ```bash
+  pip install -U "huggingface_hub[cli]"
+  sudo mkdir -p /mnt/models
+  sudo chmod 777 /mnt/models
+  huggingface-cli download --cache-dir /mnt/models Intel/neural-chat-7b-v3-3
+  export MODEL_DIR=/mnt/models
+  ```
+  Once the models are downloaded, you can consider the following methods for sharing them across nodes:
+  - Persistent Volume Claim (PVC): This is the recommended approach for production setups. For more details on using PVC, refer to [PVC](https://github.com/opea-project/GenAIInfra/blob/main/helm-charts/README.md#using-persistent-volume).
+  - Local Host Path: For simpler testing, ensure that each node involved in the deployment follows the steps above to locally prepare the models. After preparing the models, use `--set global.modelUseHostPath=${MODEL_DIR}` in the deployment command.
+
+- Label Nodes
+  ```bash
+  python deploy.py --add-label --num-nodes 2
+  ```
+
+### Deployment Scenarios
+
+The examples below are based on a two-node setup. You can adjust the number of nodes by using the `--num-nodes` option.
+
+By default, these commands use the `default` namespace. To specify a different namespace, pass the `--namespace` flag to the deploy and uninstall commands and to any `kubectl` command. Additionally, update the `namespace` field in `benchmark.yaml` before running the benchmark test.
+
+For additional configuration options, run `python deploy.py --help`
+
+#### Case 1: Baseline Deployment with Rerank
+
+Deploy Command (with node number, Hugging Face token, model directory specified):
+```bash
+python deploy.py --hf-token $HF_TOKEN --model-dir $MODEL_DIR --num-nodes 2 --with-rerank
+```
+Uninstall Command:
+```bash
+python deploy.py --uninstall
+```
+
+#### Case 2: Baseline Deployment without Rerank
+
+```bash
+python deploy.py --hf-token $HF_TOKEN --model-dir $MODEL_DIR --num-nodes 2
+```
+#### Case 3: Tuned Deployment with Rerank
+
+```bash
+python deploy.py --hf-token $HF_TOKEN --model-dir $MODEL_DIR --num-nodes 2 --with-rerank --tuned
+```
+
+## Benchmark
+
+### Test Configurations
+
+| Key      | Value   |
+| -------- | ------- |
+| Workload | ChatQnA |
+| Tag      | V1.1    |
+
+Models configuration
+| Key | Value |
+| ---------- | ------------------ |
+| Embedding | BAAI/bge-base-en-v1.5 |
+| Reranking | BAAI/bge-reranker-base |
+| Inference | Intel/neural-chat-7b-v3-3 |
+
+Benchmark parameters
+| Key | Value |
+| ---------- | ------------------ |
+| LLM input tokens | 1024 |
+| LLM output tokens | 128 |
+
+Number of test requests by scheduled node count:
+| Node count | Concurrency | Query number |
+| ----- | -------- | -------- |
+| 1 | 128 | 640 |
+| 2 | 256 | 1280 |
+| 4 | 512 | 2560 |
+
+More detailed configuration can be found in the configuration file [benchmark.yaml](./benchmark.yaml).
+
+### Test Steps
+
+Use `kubectl get pods` to confirm that all pods are `READY` before starting the test.
+
+#### Upload Retrieval File
+
+Before testing, upload the specified file so that the LLM input is about 1k tokens long.
+
+Get files:
+
+```bash
+wget https://github.com/opea-project/GenAIEval/raw/main/evals/benchmark/data/upload_file_no_rerank.txt
+wget https://github.com/opea-project/GenAIEval/raw/main/evals/benchmark/data/upload_file.txt
+```
+
+Retrieve the `ClusterIP` of the `chatqna-data-prep` service.
+
+```bash
+kubectl get svc
+```
+Expected output:
+```log
+chatqna-data-prep         ClusterIP   xx.xx.xx.xx    <none>        6007/TCP            51m
+```
+
+Use the following `cURL` command to upload the file:
+
+```bash
+cd GenAIEval/evals/benchmark/data
+# RAG with Rerank
+curl -X POST "http://${cluster_ip}:6007/v1/dataprep" \
+     -H "Content-Type: multipart/form-data" \
+     -F "files=@./upload_file.txt"
+# RAG without Rerank
+curl -X POST "http://${cluster_ip}:6007/v1/dataprep" \
+     -H "Content-Type: multipart/form-data" \
+     -F "files=@./upload_file_no_rerank.txt"
+```
+
+#### Run Benchmark Test
+
+Run the benchmark test using:
+```bash
+bash benchmark.sh -n 2
+```
+The `-n` argument specifies the number of test nodes. Required dependencies will be automatically installed when running the benchmark for the first time.
+
+#### Data collection
+
+All test results are written to the `GenAIEval/evals/benchmark/benchmark_output` folder.
+
+## Teardown
+
+After completing the benchmark, use the following command to clean up the environment:
+
+Remove Node Labels:
+```bash
+python deploy.py --delete-label
+```
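Note on the README metrics: the latency percentiles and throughput figures listed under "Metrics" can be illustrated with a small sketch. This is not GenAIEval's implementation; the function names, the nearest-rank percentile method, and the sample numbers are all assumptions made for illustration.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_ms, output_tokens, wall_time_s):
    """Aggregate per-request measurements into README-style metrics (hypothetical layout)."""
    return {
        "p50_ms": percentile(latencies_ms, 50),
        "p90_ms": percentile(latencies_ms, 90),
        "p99_ms": percentile(latencies_ms, 99),
        "rps": len(latencies_ms) / wall_time_s,
        "output_tokens_per_s": sum(output_tokens) / wall_time_s,
    }


if __name__ == "__main__":
    # Ten made-up requests that all finish within a 5-second window.
    latencies = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]
    print(summarize(latencies, [128] * 10, wall_time_s=5.0))
```

The real tool may interpolate percentiles differently, so small differences from `1_stats.csv` would be expected.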
--- a/ChatQnA/benchmark/performance/kubernetes/intel/gaudi/benchmark.sh
+++ b/ChatQnA/benchmark/performance/kubernetes/intel/gaudi/benchmark.sh
@@ -0,0 +1,102 @@
+#!/bin/bash
+
+# Copyright (C) 2024 Intel Corporation
+# SPDX-License-Identifier: Apache-2.0
+
+deployment_type="k8s"
+node_number=1
+service_port=8888
+query_per_node=640
+
+benchmark_tool_path="$(pwd)/GenAIEval"
+
+usage() {
+    echo "Usage: $0 [-d deployment_type] [-n node_number] [-i service_ip] [-p service_port]"
+    echo "  -d deployment_type    ChatQnA deployment type, select between k8s and docker (default: k8s)"
+    echo "  -n node_number        Test node number, required only for k8s deployment_type, (default: 1)"
+    echo "  -i service_ip         chatqna service ip, required only for docker deployment_type"
+    echo "  -p service_port       chatqna service port, required only for docker deployment_type, (default: 8888)"
+    exit 1
+}
+
+while getopts ":d:n:i:p:" opt; do
+    case ${opt} in
+        d )
+            deployment_type=$OPTARG
+            ;;
+        n )
+            node_number=$OPTARG
+            ;;
+        i )
+            service_ip=$OPTARG
+            ;;
+        p )
+            service_port=$OPTARG
+            ;;
+        \? )
+            echo "Invalid option: -$OPTARG" 1>&2
+            usage
+            ;;
+        : )
+            echo "Invalid option: -$OPTARG requires an argument" 1>&2
+            usage
+            ;;
+    esac
+done
+
+if [[ "$deployment_type" == "docker" && -z "$service_ip" ]]; then
+    echo "Error: service_ip is required for docker deployment_type" 1>&2
+    usage
+fi
+
+if [[ "$deployment_type" == "k8s" && -n "$service_ip" ]]; then
+    echo "Warning: service_ip and service_port are ignored for k8s deployment_type" 1>&2
+fi
+
+function main() {
+    if [[ ! -d ${benchmark_tool_path} ]]; then
+        echo "Benchmark tool not found, setting up..."
+        setup_env
+    fi
+    run_benchmark
+}
+
+function setup_env() {
+    git clone https://github.com/opea-project/GenAIEval.git
+    pushd ${benchmark_tool_path}
+    python3 -m venv stress_venv
+    source stress_venv/bin/activate
+    pip install -r requirements.txt
+    popd
+}
+
+function run_benchmark() {
+    source ${benchmark_tool_path}/stress_venv/bin/activate
+    export DEPLOYMENT_TYPE=${deployment_type}
+    export SERVICE_IP=${service_ip:-"None"}
+    export SERVICE_PORT=${service_port:-"None"}
+    export LOAD_SHAPE=${load_shape:-"constant"}
+    export CONCURRENT_LEVEL=${concurrent_level:-5}
+    export ARRIVAL_RATE=${arrival_rate:-1.0}
+    if [[ -z $USER_QUERIES ]]; then
+        user_query=$((query_per_node*node_number))
+        export USER_QUERIES="[${user_query}, ${user_query}, ${user_query}, ${user_query}]"
+        echo "USER_QUERIES not configured, setting to: ${USER_QUERIES}."
+    fi
+    export WARMUP=$(echo $USER_QUERIES | sed -e 's/[][]//g' -e 's/,.*//')
+    if [[ -z $WARMUP ]]; then export WARMUP=0; fi
+    if [[ -z $TEST_OUTPUT_DIR ]]; then
+        if [[ $DEPLOYMENT_TYPE == "k8s" ]]; then
+            export TEST_OUTPUT_DIR="${benchmark_tool_path}/evals/benchmark/benchmark_output/node_${node_number}"
+        else
+            export TEST_OUTPUT_DIR="${benchmark_tool_path}/evals/benchmark/benchmark_output/docker"
+        fi
+        echo "TEST_OUTPUT_DIR not configured, setting to: ${TEST_OUTPUT_DIR}."
+    fi
+
+    envsubst < ./benchmark.yaml > ${benchmark_tool_path}/evals/benchmark/benchmark.yaml
+    cd ${benchmark_tool_path}/evals/benchmark
+    python benchmark.py
+}
+
+main
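Note on `benchmark.sh`: when `USER_QUERIES` is unset, the script multiplies `query_per_node` by the node count, repeats that value four times, and then takes the first list element as `WARMUP`. A minimal Python sketch of the same arithmetic and string handling follows; the helper names are hypothetical and exist only to mirror the shell logic.

```python
def build_user_queries(node_number, query_per_node=640, levels=4):
    """Mirror of the shell default: the same total repeated once per test round."""
    total = query_per_node * node_number
    return "[" + ", ".join([str(total)] * levels) + "]"


def warmup_from(user_queries):
    """Mirror of the sed pipeline: drop brackets, keep the text before the first comma."""
    first = user_queries.replace("[", "").replace("]", "").split(",", 1)[0]
    # The script falls back to 0 when nothing is left.
    return first or "0"


if __name__ == "__main__":
    queries = build_user_queries(node_number=2)
    print(queries)
    print(warmup_from(queries))
```

For two nodes this gives 1280 queries per round, which agrees with the "Query number" column in the README table.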
--- a/ChatQnA/benchmark/performance/kubernetes/intel/gaudi/benchmark.yaml
+++ b/ChatQnA/benchmark/performance/kubernetes/intel/gaudi/benchmark.yaml
@@ -0,0 +1,67 @@
+# Copyright (C) 2024 Intel Corporation
+# SPDX-License-Identifier: Apache-2.0
+
+test_suite_config: # Overall configuration settings for the test suite
+  examples: ["chatqna"]  # The specific test cases being tested, e.g., chatqna, codegen, codetrans, faqgen, audioqna, visualqna
+  deployment_type: ${DEPLOYMENT_TYPE}  # Default is "k8s", can also be "docker"
+  service_ip: ${SERVICE_IP}  # Leave as None for k8s, specify for Docker
+  service_port: ${SERVICE_PORT}  # Leave as None for k8s, specify for Docker
+  warm_ups: ${WARMUP}  # Number of test requests for warm-up
+  run_time: 60m  # The max total run time for the test suite
+  seed:  # The seed for all RNGs
+  user_queries: ${USER_QUERIES}  # Number of test requests at each concurrency level
+  query_timeout: 120  # Number of seconds to wait for a simulated user to complete any executing task before exiting. 120 sec by default.
+  random_prompt: false  # Use random prompts if true, fixed prompts if false
+  collect_service_metric: false  # Collect service metrics if true, do not collect service metrics if false
+  data_visualization: false # Generate data visualization if true, do not generate data visualization if false
+  llm_model: "Intel/neural-chat-7b-v3-3"  # The LLM model used for the test
+  test_output_dir: "${TEST_OUTPUT_DIR}"  # The directory to store the test output
+  load_shape:              # Tenant concurrency pattern
+    name: ${LOAD_SHAPE}      # poisson or constant(locust default load shape)
+    params:                  # Loadshape-specific parameters
+      constant:                # Constant load shape specific parameters, activate only if load_shape.name is constant
+        concurrent_level: ${CONCURRENT_LEVEL}      # If user_queries is specified, concurrent_level is target number of requests per user. If not, it is the number of simulated users
+      poisson:                 # Poisson load shape specific parameters, activate only if load_shape.name is poisson
+        arrival_rate: ${ARRIVAL_RATE}        # Request arrival rate
+
+test_cases:
+  chatqna:
+    embedding:
+      run_test: false
+      service_name: "chatqna-embedding-usvc"  # Replace with your service name
+    embedserve:
+      run_test: false
+      service_name: "chatqna-tei"  # Replace with your service name
+    retriever:
+      run_test: false
+      service_name: "chatqna-retriever-usvc"  # Replace with your service name
+      parameters:
+        search_type: "similarity"
+        k: 4
+        fetch_k: 20
+        lambda_mult: 0.5
+        score_threshold: 0.2
+    reranking:
+      run_test: false
+      service_name: "chatqna-reranking-usvc"  # Replace with your service name
+      parameters:
+        top_n: 1
+    rerankserve:
+      run_test: false
+      service_name: "chatqna-teirerank"  # Replace with your service name
+    llm:
+      run_test: false
+      service_name: "chatqna-llm-uservice"  # Replace with your service name
+      parameters:
+        max_tokens: 128
+        temperature: 0.01
+        top_k: 10
+        top_p: 0.95
+        repetition_penalty: 1.03
+        streaming: true
+    llmserve:
+      run_test: false
+      service_name: "chatqna-tgi"  # Replace with your service name
+    e2e:
+      run_test: true
+      service_name: "chatqna"  # Replace with your service name
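Note on `benchmark.yaml`: the `${...}` placeholders are filled in by `envsubst` in `benchmark.sh` before the file is handed to the benchmark tool. The sketch below imitates that substitution with Python's `string.Template`, assuming (as with `envsubst`) that unset variables become empty strings. The `render` helper is hypothetical, and the template is a shortened fragment rather than the full file.

```python
from collections import defaultdict
from string import Template


def render(template_text, env):
    """Replace ${NAME} placeholders; names missing from env become empty strings."""
    values = defaultdict(str, env)
    return Template(template_text).substitute(values)


if __name__ == "__main__":
    fragment = "deployment_type: ${DEPLOYMENT_TYPE}\nwarm_ups: ${WARMUP}\nservice_ip: ${SERVICE_IP}"
    print(render(fragment, {"DEPLOYMENT_TYPE": "k8s", "WARMUP": "1280"}))
```

Unlike `envsubst`, `Template.substitute` raises on a stray `$` that is not part of a placeholder, so this is only a stand-in for templates like the fragment shown.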
--- a/ChatQnA/benchmark/performance/kubernetes/intel/gaudi/deploy.py
+++ b/ChatQnA/benchmark/performance/kubernetes/intel/gaudi/deploy.py
@@ -0,0 +1,279 @@
+# Copyright (C) 2024 Intel Corporation
+# SPDX-License-Identifier: Apache-2.0
+
+import argparse
+import glob
+import json
+import os
+import shutil
+import subprocess
+import sys
+
+import yaml
+from generate_helm_values import generate_helm_values
+
+
+def run_kubectl_command(command):
+    """Run a kubectl command and return the output."""
+    try:
+        result = subprocess.run(command, check=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)
+        return result.stdout
+    except subprocess.CalledProcessError as e:
+        print(f"Error running command: {command}\n{e.stderr}")
+        sys.exit(1)
+
+
+def get_all_nodes():
+    """Get the list of all nodes in the Kubernetes cluster."""
+    command = ["kubectl", "get", "nodes", "-o", "json"]
+    output = run_kubectl_command(command)
+    nodes = json.loads(output)
+    return [node["metadata"]["name"] for node in nodes["items"]]
+
+
+def add_label_to_node(node_name, label):
+    """Add a label to the specified node."""
+    command = ["kubectl", "label", "node", node_name, label, "--overwrite"]
+    print(f"Labeling node {node_name} with {label}...")
+    run_kubectl_command(command)
+    print(f"Label {label} added to node {node_name} successfully.")
+
+
+def add_labels_to_nodes(node_count=None, label=None, node_names=None):
+    """Add a label to the specified number of nodes or to specified nodes."""
+
+    if node_names:
+        # Add label to the specified nodes
+        for node_name in node_names:
+            add_label_to_node(node_name, label)
+    else:
+        # Fetch the node list and label the specified number of nodes
+        all_nodes = get_all_nodes()
+        if node_count is None or node_count > len(all_nodes):
+            print(f"Error: Node count exceeds the number of available nodes ({len(all_nodes)} available).")
+            sys.exit(1)
+
+        selected_nodes = all_nodes[:node_count]
+        for node_name in selected_nodes:
+            add_label_to_node(node_name, label)
+
+
+def clear_labels_from_nodes(label, node_names=None):
+    """Clear the specified label from specific nodes if provided, otherwise from all nodes."""
+    label_key = label.split("=")[0]  # Extract key from 'key=value' format
+
+    # If specific nodes are provided, use them; otherwise, get all nodes
+    nodes_to_clear = node_names if node_names else get_all_nodes()
+
+    for node_name in nodes_to_clear:
+        # Check if the node has the label by inspecting its metadata
+        command = ["kubectl", "get", "node", node_name, "-o", "json"]
+        node_info = run_kubectl_command(command)
+        node_metadata = json.loads(node_info)
+
+        # Check if the label exists on this node
+        labels = node_metadata["metadata"].get("labels", {})
+        if label_key in labels:
+            # Remove the label from the node
+            command = ["kubectl", "label", "node", node_name, f"{label_key}-"]
+            print(f"Removing label {label_key} from node {node_name}...")
+            run_kubectl_command(command)
+            print(f"Label {label_key} removed from node {node_name} successfully.")
+        else:
+            print(f"Label {label_key} not found on node {node_name}, skipping.")
+
+
+def install_helm_release(release_name, chart_name, namespace, values_file, device_type):
+    """Deploy a Helm release with a specified name and chart.
+
+    Parameters:
+    - release_name: The name of the Helm release.
+    - chart_name: The Helm chart name or path, e.g., "opea/chatqna".
+    - namespace: The Kubernetes namespace for deployment.
+    - values_file: The user values file for deployment.
+    - device_type: The device type (e.g., "gaudi") for specific configurations (optional).
+    """
+
+    # Check if the namespace exists; if not, create it
+    try:
+        # Check if the namespace exists
+        command = ["kubectl", "get", "namespace", namespace]
+        subprocess.run(command, check=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
+    except subprocess.CalledProcessError:
+        # Namespace does not exist, create it
+        print(f"Namespace '{namespace}' does not exist. Creating it...")
+        command = ["kubectl", "create", "namespace", namespace]
+        subprocess.run(command, check=True)
+        print(f"Namespace '{namespace}' created successfully.")
+
+    # Handle gaudi-specific values file if device_type is "gaudi"
+    hw_values_file = None
+    untar_dir = None
+    if device_type == "gaudi":
+        print("Device type is gaudi. Pulling Helm chart to get gaudi-values.yaml...")
+
+        # Combine chart_name with fixed prefix
+        chart_pull_url = f"oci://ghcr.io/opea-project/charts/{chart_name}"
+
+        # Pull and untar the chart
+        subprocess.run(["helm", "pull", chart_pull_url, "--untar"], check=True)
+
+        # Find the untarred directory
+        untar_dirs = glob.glob(f"{chart_name}*")
+        if untar_dirs:
+            untar_dir = untar_dirs[0]
+            hw_values_file = os.path.join(untar_dir, "gaudi-values.yaml")
+            print("gaudi-values.yaml pulled and ready for use.")
+        else:
+            print(f"Error: Could not find untarred directory for {chart_name}")
+            return
+
+    # Prepare the Helm install command
+    command = ["helm", "install", release_name, chart_name, "--namespace", namespace]
+
+    # Append additional values file for gaudi if it exists
+    if hw_values_file:
+        command.extend(["-f", hw_values_file])
+
+    # Append the main values file
+    command.extend(["-f", values_file])
+
+    # Execute the Helm install command
+    try:
+        print(f"Running command: {' '.join(command)}")  # Print full command for debugging
+        subprocess.run(command, check=True)
+        print("Deployment initiated successfully.")
+    except subprocess.CalledProcessError as e:
+        print(f"Error occurred while deploying Helm release: {e}")
+
+    # Cleanup: Remove the untarred directory
+    if untar_dir and os.path.isdir(untar_dir):
+        print(f"Removing temporary directory: {untar_dir}")
+        shutil.rmtree(untar_dir)
+        print("Temporary directory removed successfully.")
+
+
+def uninstall_helm_release(release_name, namespace=None):
+    """Uninstall a Helm release and clean up resources, optionally delete the namespace if not 'default'."""
+    # Default to 'default' namespace if none is specified
+    if not namespace:
+        namespace = "default"
+
+    try:
+        # Uninstall the Helm release
+        command = ["helm", "uninstall", release_name, "--namespace", namespace]
+        print(f"Uninstalling Helm release {release_name} in namespace {namespace}...")
+        run_kubectl_command(command)
+        print(f"Helm release {release_name} uninstalled successfully.")
+
+        # If the namespace is specified and not 'default', delete it
+        if namespace != "default":
+            print(f"Deleting namespace {namespace}...")
+            delete_namespace_command = ["kubectl", "delete", "namespace", namespace]
+            run_kubectl_command(delete_namespace_command)
+            print(f"Namespace {namespace} deleted successfully.")
+        else:
+            print("Namespace is 'default', skipping deletion.")
+
+    except subprocess.CalledProcessError as e:
+        print(f"Error occurred while uninstalling Helm release or deleting namespace: {e}")
+
+
+def main():
+    parser = argparse.ArgumentParser(description="Manage Helm Deployment.")
+    parser.add_argument(
+        "--release-name",
+        type=str,
+        default="chatqna",
+        help="The Helm release name created during deployment (default: chatqna).",
+    )
+    parser.add_argument(
+        "--chart-name",
+        type=str,
+        default="chatqna",
+        help="The chart name to deploy, composed of repo name and chart name (default: chatqna).",
+    )
+    parser.add_argument("--namespace", default="default", help="Kubernetes namespace (default: default).")
+    parser.add_argument("--hf-token", help="Hugging Face API token.")
+    parser.add_argument(
+        "--model-dir", help="Model directory, mounted as volumes for service access to pre-downloaded models"
+    )
+    parser.add_argument("--user-values", help="Path to a user-specified values.yaml file.")
+    parser.add_argument(
+        "--create-values-only", action="store_true", help="Only create the values.yaml file without deploying."
+    )
+    parser.add_argument("--uninstall", action="store_true", help="Uninstall the Helm release.")
+    parser.add_argument("--num-nodes", type=int, default=1, help="Number of nodes to use (default: 1).")
+    parser.add_argument("--node-names", nargs="*", help="Optional specific node names to label.")
+    parser.add_argument("--add-label", action="store_true", help="Add label to specified nodes if this flag is set.")
+    parser.add_argument(
+        "--delete-label", action="store_true", help="Delete label from specified nodes if this flag is set."
+    )
+    parser.add_argument(
+        "--label", default="node-type=opea-benchmark", help="Label to add/delete (default: node-type=opea-benchmark)."
+    )
+    parser.add_argument("--with-rerank", action="store_true", help="Include rerank service in the deployment.")
+    parser.add_argument(
+        "--tuned",
+        action="store_true",
+        help="Modify resources for services and change extraCmdArgs when creating values.yaml.",
+    )
+    parser.add_argument(
+        "--device-type",
+        type=str,
+        choices=["cpu", "gaudi"],
+        default="gaudi",
+        help="Specify the device type for deployment (choices: 'cpu', 'gaudi'; default: gaudi).",
+    )
+
+    args = parser.parse_args()
+
+    # Adjust num-nodes based on node-names if specified
+    if args.node_names:
+        num_node_names = len(args.node_names)
+        if args.num_nodes != 1 and args.num_nodes != num_node_names:
+            parser.error("--num-nodes must match the number of --node-names if both are specified.")
+        else:
+            args.num_nodes = num_node_names
+
+    # Node labeling management
+    if args.add_label:
+        add_labels_to_nodes(args.num_nodes, args.label, args.node_names)
+        return
+    elif args.delete_label:
+        clear_labels_from_nodes(args.label, args.node_names)
+        return
+
+    # Uninstall Helm release if specified
+    if args.uninstall:
+        uninstall_helm_release(args.release_name, args.namespace)
+        return
+
+    # Prepare values.yaml if not uninstalling
+    if args.user_values:
+        values_file_path = args.user_values
+    else:
+        if not args.hf_token:
+            parser.error("--hf-token is required")
+        node_selector = {args.label.split("=")[0]: args.label.split("=")[1]}
+        values_file_path = generate_helm_values(
+            with_rerank=args.with_rerank,
+            num_nodes=args.num_nodes,
+            hf_token=args.hf_token,
+            model_dir=args.model_dir,
+            node_selector=node_selector,
+            tune=args.tuned,
+        )
+
+    # Read back the generated YAML file for verification
+    with open(values_file_path, "r") as file:
+        print("Generated YAML contents:")
+        print(file.read())
+
+    # Deploy unless --create-values-only is specified
+    if not args.create_values_only:
+        install_helm_release(args.release_name, args.chart_name, args.namespace, values_file_path, args.device_type)
+
+
+if __name__ == "__main__":
+    main()
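Note on `deploy.py`: the node selector is built with `args.label.split("=")`, which raises an `IndexError` when the label has no `=`. One possible way to validate the `key=value` form up front is sketched below; `parse_label` is a hypothetical helper and not part of this change.

```python
def parse_label(label):
    """Split a 'key=value' label into a one-entry node selector dict."""
    key, sep, value = label.partition("=")
    if not sep or not key:
        raise ValueError(f"expected 'key=value', got {label!r}")
    return {key: value}


if __name__ == "__main__":
    print(parse_label("node-type=opea-benchmark"))
```

Using `partition` also keeps any further `=` characters inside the value instead of silently dropping them.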
--- a/ChatQnA/benchmark/performance/kubernetes/intel/gaudi/generate_helm_values.py
+++ b/ChatQnA/benchmark/performance/kubernetes/intel/gaudi/generate_helm_values.py
@@ -0,0 +1,163 @@
+# Copyright (C) 2024 Intel Corporation
+# SPDX-License-Identifier: Apache-2.0
+
+import os
+
+import yaml
+
+
+def generate_helm_values(with_rerank, num_nodes, hf_token, model_dir, node_selector=None, tune=False):
+    """Create a values.yaml file based on the provided configuration."""
+
+    # Log the received parameters
+    print("Received parameters:")
+    print(f"with_rerank: {with_rerank}")
+    print(f"num_nodes: {num_nodes}")
+    print(f"node_selector: {node_selector}")  # Log the node_selector
+    print(f"tune: {tune}")
+
+    if node_selector is None:
+        node_selector = {}
+
+    # Construct the base values dictionary
+    values = {
+        "tei": {"nodeSelector": {key: value for key, value in node_selector.items()}},
+        "tgi": {"nodeSelector": {key: value for key, value in node_selector.items()}},
+        "data-prep": {"nodeSelector": {key: value for key, value in node_selector.items()}},
+        "redis-vector-db": {"nodeSelector": {key: value for key, value in node_selector.items()}},
+        "retriever-usvc": {"nodeSelector": {key: value for key, value in node_selector.items()}},
+        "chatqna-ui": {"nodeSelector": {key: value for key, value in node_selector.items()}},
+        "global": {
+            "HUGGINGFACEHUB_API_TOKEN": hf_token,  # Use passed token
+            "modelUseHostPath": model_dir,  # Use passed model directory
+        },
+        "nodeSelector": {key: value for key, value in node_selector.items()},
+    }
+
+    if with_rerank:
+        values["teirerank"] = {"nodeSelector": {key: value for key, value in node_selector.items()}}
+    else:
+        values["image"] = {"repository": "opea/chatqna-without-rerank"}
+
+    default_replicas = [
+        {"name": "chatqna", "replicaCount": 2},
+        {"name": "tei", "replicaCount": 1},
+        {"name": "teirerank", "replicaCount": 1} if with_rerank else None,
+        {"name": "tgi", "replicaCount": 7 if with_rerank else 8},
+        {"name": "data-prep", "replicaCount": 1},
+        {"name": "redis-vector-db", "replicaCount": 1},
+        {"name": "retriever-usvc", "replicaCount": 2},
+    ]
+
+    if num_nodes > 1:
+        # Scale replicas based on number of nodes
+        replicas = [
+            {"name": "chatqna", "replicaCount": 1 * num_nodes},
+            {"name": "tei", "replicaCount": 1 * num_nodes},
+            {"name": "teirerank", "replicaCount": 1} if with_rerank else None,
+            {"name": "tgi", "replicaCount": (8 * num_nodes - 1) if with_rerank else 8 * num_nodes},
+            {"name": "data-prep", "replicaCount": 1},
+            {"name": "redis-vector-db", "replicaCount": 1},
+            {"name": "retriever-usvc", "replicaCount": 1 * num_nodes},
+        ]
+    else:
+        replicas = default_replicas
+
+    # Remove None values for rerank disabled
+    replicas = [r for r in replicas if r]
+
+    # Update values.yaml with replicas
+    for replica in replicas:
+        service_name = replica["name"]
+        if service_name == "chatqna":
+            values["replicaCount"] = replica["replicaCount"]
+            print(replica["replicaCount"])
+        elif service_name in values:
+            values[service_name]["replicaCount"] = replica["replicaCount"]
+
+    # Prepare resource configurations based on tuning
+    resources = []
+    if tune:
+        resources = [
+            {
+                "name": "chatqna",
+                "resources": {
+                    "limits": {"cpu": "16", "memory": "8000Mi"},
+                    "requests": {"cpu": "16", "memory": "8000Mi"},
+                },
+            },
+            {
+                "name": "tei",
+                "resources": {
+                    "limits": {"cpu": "80", "memory": "20000Mi"},
+                    "requests": {"cpu": "80", "memory": "20000Mi"},
+                },
+            },
+            {"name": "teirerank", "resources": {"limits": {"habana.ai/gaudi": 1}}} if with_rerank else None,
+            {"name": "tgi", "resources": {"limits": {"habana.ai/gaudi": 1}}},
+            {"name": "retriever-usvc", "resources": {"requests": {"cpu": "8", "memory": "8000Mi"}}},
+        ]
+
+        # Filter out any None values directly as part of initialization
+        resources = [r for r in resources if r is not None]
+
+        # Add resources for each service if tuning
+        for resource in resources:
+            service_name = resource["name"]
+            if service_name == "chatqna":
+                values["resources"] = resource["resources"]
+            elif service_name in values:
+                values[service_name]["resources"] = resource["resources"]
+
+        # Add extraCmdArgs for tgi service with default values
+        if "tgi" in values:
+            values["tgi"]["extraCmdArgs"] = [
+                "--max-input-length",
+                "1280",
+                "--max-total-tokens",
+                "2048",
+                "--max-batch-total-tokens",
+                "65536",
+                "--max-batch-prefill-tokens",
+                "4096",
+            ]
+
+    yaml_string = yaml.dump(values, default_flow_style=False)
+
+    # Determine the mode based on the 'tune' parameter
+    mode = "tuned" if tune else "oob"
+
+    # Determine the filename based on 'with_rerank' and 'num_nodes'
+    if with_rerank:
+        filename = f"{mode}-{num_nodes}-gaudi-with-rerank-values.yaml"
+    else:
+        filename = f"{mode}-{num_nodes}-gaudi-without-rerank-values.yaml"
+
+    # Write the YAML data to the file
+    with open(filename, "w") as file:
+        file.write(yaml_string)
+
+    # Get the current working directory and construct the file path
+    current_dir = os.getcwd()
+    filepath = os.path.join(current_dir, filename)
+
+    print(f"YAML file {filepath} has been generated.")
+    return filepath  # Optionally return the file path
+
+
+# Main execution for standalone use of generate_helm_values
+if __name__ == "__main__":
+    # Example values for standalone execution
+    with_rerank = True
+    num_nodes = 2
+    hftoken = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
+    modeldir = "/mnt/model"
+    node_selector = {"node-type": "opea-benchmark"}
+    tune = True
+
+    filename = generate_helm_values(with_rerank, num_nodes, hftoken, modeldir, node_selector, tune)
+
+    # Read back the generated YAML file for verification
+    with open(filename, "r") as file:
+        print("Generated YAML contents:")
+        print(file.read())
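Note on `generate_helm_values.py`: the replica counts follow a simple rule that is easier to see in isolation. The sketch below restates the numbers from the function above as a single helper. `replica_plan` is a hypothetical name, and reading the factor of 8 as "accelerator cards per node" is an assumption, not something the source states.

```python
def replica_plan(num_nodes, with_rerank):
    """Restate the replica counts chosen by generate_helm_values as one dict."""
    scaled = num_nodes > 1
    plan = {
        "chatqna": num_nodes if scaled else 2,
        "tei": num_nodes if scaled else 1,
        # One slot out of 8 * num_nodes goes to teirerank when rerank is enabled.
        "tgi": 8 * num_nodes - (1 if with_rerank else 0),
        "data-prep": 1,
        "redis-vector-db": 1,
        "retriever-usvc": num_nodes if scaled else 2,
    }
    if with_rerank:
        plan["teirerank"] = 1
    return plan


if __name__ == "__main__":
    print(replica_plan(num_nodes=2, with_rerank=True))
```

Note the asymmetry this makes visible: a single node gets two `chatqna` and two `retriever-usvc` replicas, while a two-node deployment also gets two of each.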
--- a/ChatQnA/chatqna.py
+++ b/ChatQnA/chatqna.py
@@ -148,6 +148,8 @@ def align_outputs(self, data, cur_node, inputs, runtime_graph, llm_parameters_di

        next_data["inputs"] = prompt

+    elif self.services[cur_node].service_type == ServiceType.LLM and not llm_parameters_dict["streaming"]:
+        next_data["text"] = data["choices"][0]["message"]["content"]
    else:
        next_data = data

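Note on the `chatqna.py` hunk: the new branch handles non-streaming LLM responses by reading the answer text from a chat-completions style body. A minimal sketch of that lookup is below; `extract_text` is a hypothetical helper and the sample payload is made up for illustration.

```python
def extract_text(response):
    """Return the assistant text from a chat-completions style response body."""
    return response["choices"][0]["message"]["content"]


if __name__ == "__main__":
    # Made-up response in the shape the new branch expects.
    sample = {
        "choices": [
            {
                "index": 0,
                "message": {"role": "assistant", "content": "Hello"},
                "finish_reason": "stop",
            }
        ]
    }
    print(extract_text(sample))
```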
--- a/ChatQnA/chatqna.yaml
+++ b/ChatQnA/chatqna.yaml
@@ -19,7 +19,7 @@ opea_micro_services:
  tei-embedding-service:
    host: ${TEI_EMBEDDING_SERVICE_IP}
    ports: ${TEI_EMBEDDING_SERVICE_PORT}
-    image: ghcr.io/huggingface/tei-gaudi:latest
+    image: ghcr.io/huggingface/tei-gaudi:1.5.0
    volumes:
      - "./data:/data"
    runtime: habana
@@ -38,7 +38,7 @@ opea_micro_services:
  tgi-service:
    host: ${TGI_SERVICE_IP}
    ports: ${TGI_SERVICE_PORT}
-    image: ghcr.io/huggingface/tgi-gaudi:2.0.5
+    image: ghcr.io/huggingface/tgi-gaudi:2.0.6
    volumes:
      - "./data:/data"
    runtime: habana
--- a/ChatQnA/chatqna_wrapper.py
+++ b/ChatQnA/chatqna_wrapper.py
@@ -0,0 +1,68 @@
+# Copyright (C) 2024 Intel Corporation
+# SPDX-License-Identifier: Apache-2.0
+
+import os
+
+from comps import ChatQnAGateway, MicroService, ServiceOrchestrator, ServiceType
+
+MEGA_SERVICE_HOST_IP = os.getenv("MEGA_SERVICE_HOST_IP", "0.0.0.0")
+MEGA_SERVICE_PORT = int(os.getenv("MEGA_SERVICE_PORT", 8888))
+EMBEDDING_SERVICE_HOST_IP = os.getenv("EMBEDDING_SERVICE_HOST_IP", "0.0.0.0")
+EMBEDDING_SERVICE_PORT = int(os.getenv("EMBEDDING_SERVICE_PORT", 6000))
+RETRIEVER_SERVICE_HOST_IP = os.getenv("RETRIEVER_SERVICE_HOST_IP", "0.0.0.0")
+RETRIEVER_SERVICE_PORT = int(os.getenv("RETRIEVER_SERVICE_PORT", 7000))
+RERANK_SERVICE_HOST_IP = os.getenv("RERANK_SERVICE_HOST_IP", "0.0.0.0")
+RERANK_SERVICE_PORT = int(os.getenv("RERANK_SERVICE_PORT", 8000))
+LLM_SERVICE_HOST_IP = os.getenv("LLM_SERVICE_HOST_IP", "0.0.0.0")
+LLM_SERVICE_PORT = int(os.getenv("LLM_SERVICE_PORT", 9000))
+
+
+class ChatQnAService:
+    def __init__(self, host="0.0.0.0", port=8000):
+        self.host = host
+        self.port = port
+        self.megaservice = ServiceOrchestrator()
+
+    def add_remote_service(self):
+        embedding = MicroService(
+            name="embedding",
+            host=EMBEDDING_SERVICE_HOST_IP,
+            port=EMBEDDING_SERVICE_PORT,
+            endpoint="/v1/embeddings",
+            use_remote_service=True,
+            service_type=ServiceType.EMBEDDING,
+        )
+        retriever = MicroService(
+            name="retriever",
+            host=RETRIEVER_SERVICE_HOST_IP,
+            port=RETRIEVER_SERVICE_PORT,
+            endpoint="/v1/retrieval",
+            use_remote_service=True,
+            service_type=ServiceType.RETRIEVER,
+        )
+        rerank = MicroService(
+            name="rerank",
+            host=RERANK_SERVICE_HOST_IP,
+            port=RERANK_SERVICE_PORT,
+            endpoint="/v1/reranking",
+            use_remote_service=True,
+            service_type=ServiceType.RERANK,
+        )
+        llm = MicroService(
+            name="llm",
+            host=LLM_SERVICE_HOST_IP,
+            port=LLM_SERVICE_PORT,
+            endpoint="/v1/chat/completions",
+            use_remote_service=True,
+            service_type=ServiceType.LLM,
+        )
+        self.megaservice.add(embedding).add(retriever).add(rerank).add(llm)
+        self.megaservice.flow_to(embedding, retriever)
+        self.megaservice.flow_to(retriever, rerank)
+        self.megaservice.flow_to(rerank, llm)
+        self.gateway = ChatQnAGateway(megaservice=self.megaservice, host="0.0.0.0", port=self.port)
+
+
+if __name__ == "__main__":
+    chatqna = ChatQnAService(host=MEGA_SERVICE_HOST_IP, port=MEGA_SERVICE_PORT)
+    chatqna.add_remote_service()
--- a/ChatQnA/docker_compose/amd/gpu/rocm/README.md
+++ b/ChatQnA/docker_compose/amd/gpu/rocm/README.md
@@ -0,0 +1,432 @@
+# Build and deploy ChatQnA Application on AMD GPU (ROCm)
+
+## Build MegaService of ChatQnA on AMD ROCm GPU
+
+This document outlines the deployment process for a ChatQnA application utilizing the [GenAIComps](https://github.com/opea-project/GenAIComps.git) microservice pipeline on the AMD ROCm GPU platform. The steps include Docker image creation, container deployment via Docker Compose, and service execution to integrate microservices such as embedding, retriever, rerank, and llm. We will publish the Docker images to Docker Hub, which will simplify the deployment process for this service.
+
+Quick Start Deployment Steps:
+
+1. Set up the environment variables.
+2. Run Docker Compose.
+3. Consume the ChatQnA Service.
+
+## Quick Start: 1. Set Up Environment Variables
+
+To set up environment variables for deploying ChatQnA services, follow these steps:
+
+1. Set the required environment variables:
+
+   ```bash
+   # Example: host_ip="192.168.1.1"
+   export HOST_IP=${host_ip}
+   # Your Hugging Face access token
+   export CHATQNA_HUGGINGFACEHUB_API_TOKEN=${your_hf_api_token}
+   ```
+
+2. If you are in a proxy environment, also set the proxy-related environment variables:
+
+   ```bash
+   export http_proxy="Your_HTTP_Proxy"
+   export https_proxy="Your_HTTPs_Proxy"
+   ```
+
+3. Set up other environment variables:
+
+   ```bash
+   source ./set_env.sh
+   ```
+
+## Quick Start: 2. Run Docker Compose
+
+```bash
+docker compose up -d
+```
+
+This automatically downloads the Docker images from Docker Hub:
+
+```bash
+docker pull opea/chatqna:latest
+docker pull opea/chatqna-ui:latest
+```
+
+You can build the Docker images from source yourself in the following cases:
+
+- The Docker image failed to download.
+
+- You want to use a specific version of the Docker image.
+
+Please refer to 'Build Docker Images' below.
+
+## Quick Start: 3. Consume the ChatQnA Service
+
+Prepare and upload a test document:
+
+```bash
+# download pdf file
+wget https://raw.githubusercontent.com/opea-project/GenAIComps/main/comps/retrievers/redis/data/nke-10k-2023.pdf
+# upload pdf file with dataprep
+curl -X POST "http://${host_ip}:6007/v1/dataprep" \
+    -H "Content-Type: multipart/form-data" \
+    -F "files=@./nke-10k-2023.pdf"
+```
+
+Get the MegaService (backend) response:
+
+```bash
+curl http://${host_ip}:8888/v1/chatqna \
+    -H "Content-Type: application/json" \
+    -d '{
+        "messages": "What is the revenue of Nike in 2023?"
+    }'
+```
+
+## 🚀 Build Docker Images
+
+First of all, you need to build the Docker images locally. This step can be skipped once the Docker images are published to Docker Hub.
+
+### 1. Get the GenAIComps Source Code
+
+```bash
+git clone https://github.com/opea-project/GenAIComps.git
+cd GenAIComps
+```
+
+### 2. Build Retriever Image
+
+```bash
+docker build --no-cache -t opea/retriever-redis:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/retrievers/redis/langchain/Dockerfile .
+```
+
+### 3. Build Dataprep Image
+
+```bash
+docker build --no-cache -t opea/dataprep-redis:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/dataprep/redis/langchain/Dockerfile .
+```
+
+### 4. Build MegaService Docker Image
+
+To construct the Mega Service, we utilize the [GenAIComps](https://github.com/opea-project/GenAIComps.git) microservice pipeline within the `chatqna.py` Python script. Build the MegaService Docker image using the command below:
+
+```bash
+git clone https://github.com/opea-project/GenAIExamples.git
+cd GenAIExamples/ChatQnA/docker
+docker build --no-cache -t opea/chatqna:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f Dockerfile .
+cd ../../..
+```
+
+### 5. Build UI Docker Image
+
+Construct the frontend Docker image using the command below:
+
+```bash
+cd GenAIExamples/ChatQnA/ui
+docker build --no-cache -t opea/chatqna-ui:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f ./docker/Dockerfile .
+cd ../../..
+```
+
+### 6. Build React UI Docker Image (Optional)
+
+Construct the frontend Docker image using the command below:
+
+```bash
+cd GenAIExamples/ChatQnA/ui
+docker build --no-cache -t opea/chatqna-react-ui:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f ./docker/Dockerfile.react .
+cd ../../..
+```
+
+### 7. Build Nginx Docker Image
+
+```bash
+cd GenAIComps
+docker build -t opea/nginx:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/nginx/Dockerfile .
+```
+
+Then run the command `docker images`. You should have the following 5 Docker images:
+
+1. `opea/retriever-redis:latest`
+2. `opea/dataprep-redis:latest`
+3. `opea/chatqna:latest`
+4. `opea/chatqna-ui:latest` or `opea/chatqna-react-ui:latest`
+5. `opea/nginx:latest`
+
+## 🚀 Start MicroServices and MegaService
+
+### Required Models
+
+By default, the embedding, reranking and LLM models are set as listed below:
+
+| Service   | Model                     |
+| --------- | ------------------------- |
+| Embedding | BAAI/bge-base-en-v1.5     |
+| Reranking | BAAI/bge-reranker-base    |
+| LLM       | Intel/neural-chat-7b-v3-3 |
+
+Change the `CHATQNA_xxx_MODEL_ID` variables below to suit your needs.
+
+### Setup Environment Variables
+
+1. Set the required environment variables:
+
+   ```bash
+   # Example: host_ip="192.168.1.1"
+   export host_ip="External_Public_IP"
+   # Example: no_proxy="localhost, 127.0.0.1, 192.168.1.1"
+   export no_proxy="Your_No_Proxy"
+   export CHATQNA_HUGGINGFACEHUB_API_TOKEN="Your_Huggingface_API_Token"
+   export HOST_IP=${host_ip}
+   # Example: NGINX_PORT=80
+   export NGINX_PORT=${your_nginx_port}
+   export CHATQNA_TGI_SERVICE_IMAGE="ghcr.io/huggingface/text-generation-inference:2.3.1-rocm"
+   export CHATQNA_EMBEDDING_MODEL_ID="BAAI/bge-base-en-v1.5"
+   export CHATQNA_RERANK_MODEL_ID="BAAI/bge-reranker-base"
+   export CHATQNA_LLM_MODEL_ID="Intel/neural-chat-7b-v3-3"
+   export CHATQNA_TGI_SERVICE_PORT=8008
+   export CHATQNA_TEI_EMBEDDING_PORT=8090
+   export CHATQNA_TEI_EMBEDDING_ENDPOINT="http://${HOST_IP}:${CHATQNA_TEI_EMBEDDING_PORT}"
+   export CHATQNA_TEI_RERANKING_PORT=8808
+   export CHATQNA_REDIS_VECTOR_PORT=16379
+   export CHATQNA_REDIS_VECTOR_INSIGHT_PORT=8001
+   export CHATQNA_REDIS_DATAPREP_PORT=6007
+   export CHATQNA_REDIS_RETRIEVER_PORT=7000
+   export CHATQNA_INDEX_NAME="rag-redis"
+   export CHATQNA_MEGA_SERVICE_HOST_IP=${HOST_IP}
+   export CHATQNA_RETRIEVER_SERVICE_HOST_IP=${HOST_IP}
+   export CHATQNA_BACKEND_SERVICE_PORT=8888
+   export CHATQNA_BACKEND_SERVICE_ENDPOINT="http://127.0.0.1:${CHATQNA_BACKEND_SERVICE_PORT}/v1/chatqna"
+   export CHATQNA_DATAPREP_SERVICE_ENDPOINT="http://127.0.0.1:${CHATQNA_REDIS_DATAPREP_PORT}/v1/dataprep"
+   export CHATQNA_DATAPREP_GET_FILE_ENDPOINT="http://127.0.0.1:${CHATQNA_REDIS_DATAPREP_PORT}/v1/dataprep/get_file"
+   export CHATQNA_DATAPREP_DELETE_FILE_ENDPOINT="http://127.0.0.1:${CHATQNA_REDIS_DATAPREP_PORT}/v1/dataprep/delete_file"
+   export CHATQNA_FRONTEND_SERVICE_IP=${HOST_IP}
+   export CHATQNA_FRONTEND_SERVICE_PORT=5173
+   export CHATQNA_BACKEND_SERVICE_NAME=chatqna
+   export CHATQNA_BACKEND_SERVICE_IP=${HOST_IP}
+   export CHATQNA_REDIS_URL="redis://${HOST_IP}:${CHATQNA_REDIS_VECTOR_PORT}"
+   export CHATQNA_EMBEDDING_SERVICE_HOST_IP=${HOST_IP}
+   export CHATQNA_RERANK_SERVICE_HOST_IP=${HOST_IP}
+   export CHATQNA_LLM_SERVICE_HOST_IP=${HOST_IP}
+   export CHATQNA_NGINX_PORT=5176
+   ```
+
+2. If you are in a proxy environment, also set the proxy-related environment variables:
+
+   ```bash
+   export http_proxy="Your_HTTP_Proxy"
+   export https_proxy="Your_HTTPs_Proxy"
+   ```
+
+3. Note: To limit access to a subset of GPUs, pass each device individually using one or more `--device /dev/dri/renderD<node>` options, where `<node>` is the card index, starting from 128 (https://rocm.docs.amd.com/projects/install-on-linux/en/latest/how-to/docker.html#docker-restrict-gpus). In the `compose.yaml` file, list these devices under `devices` for the tgi-service.
+
+Example of isolating 1 GPU:
+
+```
+      - /dev/dri/card0:/dev/dri/card0
+      - /dev/dri/renderD128:/dev/dri/renderD128
+```
+
+Example of isolating 2 GPUs:
+
+```
+      - /dev/dri/card0:/dev/dri/card0
+      - /dev/dri/renderD128:/dev/dri/renderD128
+      - /dev/dri/card1:/dev/dri/card1
+      - /dev/dri/renderD129:/dev/dri/renderD129
+```
+
+For more information about accessing and restricting AMD GPUs, see https://rocm.docs.amd.com/projects/install-on-linux/en/latest/how-to/docker.html#docker-restrict-gpus
+
+4. Set up other environment variables:
+
+   ```bash
+   source ./set_env.sh
+   ```
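+
+As a rough sketch of the GPU isolation note in step 3 above, the `devices` list of the TGI service in `compose.yaml` might look like the fragment below for a single GPU. The service name `chatqna-tgi-service` and the `/dev/kfd` entry are taken from the compose file in this directory; the `card0`/`renderD128` pair is an assumption and should be replaced with the nodes that actually exist on your host (for example, as listed by `ls /dev/dri`).
+
+```yaml
+services:
+  chatqna-tgi-service:
+    devices:
+      # /dev/kfd is the ROCm compute interface and is shared by all GPUs
+      - /dev/kfd:/dev/kfd
+      # Replaces the catch-all "/dev/dri/:/dev/dri/" mapping with one GPU
+      # (assumed here to be card0 / renderD128)
+      - /dev/dri/card0:/dev/dri/card0
+      - /dev/dri/renderD128:/dev/dri/renderD128
+```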
+
+### Start all the services Docker Containers
+
+```bash
+cd GenAIExamples/ChatQnA/docker_compose/amd/gpu/rocm
+docker compose up -d
+```
+
+### Validate MicroServices and MegaService
+
+1. TEI Embedding Service
+
+   ```bash
+   curl ${host_ip}:8090/embed \
+       -X POST \
+       -d '{"inputs":"What is Deep Learning?"}' \
+       -H 'Content-Type: application/json'
+   ```
+
+2. Retriever Microservice
+
+   To consume the retriever microservice, you need to generate a mock embedding vector with a Python script. The length of the embedding vector
+   is determined by the embedding model.
+   Here we use the model `EMBEDDING_MODEL_ID="BAAI/bge-base-en-v1.5"`, whose vector size is 768.
+
+   Check the vector dimension of your embedding model and set the `your_embedding` dimension equal to it.
+
+   ```bash
+   export your_embedding=$(python3 -c "import random; embedding = [random.uniform(-1, 1) for _ in range(768)]; print(embedding)")
+   curl http://${host_ip}:7000/v1/retrieval \
+     -X POST \
+     -d "{\"text\":\"test\",\"embedding\":${your_embedding}}" \
+     -H 'Content-Type: application/json'
+   ```
+
+3. TEI Reranking Service
+
+   ```bash
+   curl http://${host_ip}:8808/rerank \
+       -X POST \
+       -d '{"query":"What is Deep Learning?", "texts": ["Deep Learning is not...", "Deep learning is..."]}' \
+       -H 'Content-Type: application/json'
+   ```
+
+4. TGI Service
+
+   On first startup, this service takes more time because it downloads the model files. Once the download has finished, the service will be ready.
+
+   Try the command below to check whether the TGI service is ready.
+
+   ```bash
+   docker logs ${CONTAINER_ID} | grep Connected
+   ```
+
+   If the service is ready, you will get a response like the one below.
+
+   ```
+   2024-09-03T02:47:53.402023Z  INFO text_generation_router::server: router/src/server.rs:2311: Connected
+   ```
+
+   Then try the `cURL` command below to validate TGI.
+
+   ```bash
+   curl http://${host_ip}:8008/generate \
+     -X POST \
+     -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":64, "do_sample": true}}' \
+     -H 'Content-Type: application/json'
+   ```
+
+5. MegaService
+
+   ```bash
+   curl http://${host_ip}:8888/v1/chatqna -H "Content-Type: application/json" -d '{
+        "messages": "What is the revenue of Nike in 2023?"
+        }'
+   ```
+
+6. Nginx Service
+
+   ```bash
+   curl http://${host_ip}:${NGINX_PORT}/v1/chatqna \
+       -H "Content-Type: application/json" \
+       -d '{"messages": "What is the revenue of Nike in 2023?"}'
+   ```
+
+7. Dataprep Microservice (Optional)
+
+If you want to update the default knowledge base, you can use the following commands:
+
+Update Knowledge Base via Local File Upload:
+
+```bash
+curl -X POST "http://${host_ip}:6007/v1/dataprep" \
+     -H "Content-Type: multipart/form-data" \
+     -F "files=@./nke-10k-2023.pdf"
+```
+
+This command updates a knowledge base by uploading a local file for processing. Update the file path according to your environment.
+
+Add Knowledge Base via HTTP Links:
+
+```bash
+curl -X POST "http://${host_ip}:6007/v1/dataprep" \
+     -H "Content-Type: multipart/form-data" \
+     -F 'link_list=["https://opea.dev"]'
+```
+
+This command updates a knowledge base by submitting a list of HTTP links for processing.
+
+You can also get the list of files that you uploaded:
+
+```bash
+curl -X POST "http://${host_ip}:6007/v1/dataprep/get_file" \
+     -H "Content-Type: application/json"
+```
+
+To delete the file/link you uploaded:
+
+```bash
+# delete link
+curl -X POST "http://${host_ip}:6007/v1/dataprep/delete_file" \
+     -d '{"file_path": "https://opea.dev"}' \
+     -H "Content-Type: application/json"
+
+# delete file
+curl -X POST "http://${host_ip}:6007/v1/dataprep/delete_file" \
+     -d '{"file_path": "nke-10k-2023.pdf"}' \
+     -H "Content-Type: application/json"
+
+# delete all uploaded files and links
+curl -X POST "http://${host_ip}:6007/v1/dataprep/delete_file" \
+     -d '{"file_path": "all"}' \
+     -H "Content-Type: application/json"
+```
+
+## 🚀 Launch the UI
+
+### Launch with origin port
+
+To access the frontend, open the following URL in your browser: http://{host_ip}:5173. By default, the UI runs on port 5173 internally. If you prefer to use a different host port to access the frontend, you can modify the port mapping in the `compose.yaml` file as shown below:
+
+```yaml
+  chatqna-ui-server:
+    image: opea/chatqna-ui:latest
+    ...
+    ports:
+      - "80:5173"
+```
+
+### Launch with Nginx
+
+If you want to launch the UI using Nginx, open this URL: `http://${host_ip}:${NGINX_PORT}` in your browser to access the frontend.
+
+## 🚀 Launch the Conversational UI (Optional)
+
+To access the Conversational UI (React-based) frontend, modify the UI service in the `compose.yaml` file. Replace the `chatqna-ui-server` service with the `chatqna-react-ui-server` service as per the config below:
+
+```yaml
+chatqna-react-ui-server:
+  image: opea/chatqna-react-ui:latest
+  container_name: chatqna-react-ui-server
+  environment:
+    - APP_BACKEND_SERVICE_ENDPOINT=${CHATQNA_BACKEND_SERVICE_ENDPOINT}
+    - APP_DATA_PREP_SERVICE_URL=${CHATQNA_DATAPREP_SERVICE_ENDPOINT}
+  ports:
+    - "5174:80"
+  depends_on:
+    - chatqna-backend-server
+  ipc: host
+  restart: always
+```
+
+Once the services are up, open the following URL in your browser: http://{host_ip}:5174. By default, the UI runs on port 80 internally. If you prefer to use a different host port to access the frontend, you can modify the port mapping in the `compose.yaml` file as shown below:
+
+```yaml
+  chatqna-react-ui-server:
+    image: opea/chatqna-react-ui:latest
+    ...
+    ports:
+      - "80:80"
+```
+
+![project-screenshot](../../../../assets/img/chat_ui_init.png)
+
+Here is an example of running ChatQnA:
+
+![project-screenshot](../../../../assets/img/chat_ui_response.png)
+
+Here is an example of running ChatQnA with Conversational UI (React):
+
+![project-screenshot](../../../../assets/img/conversation_ui_response.png)
--- a/ChatQnA/docker_compose/amd/gpu/rocm/compose.yaml
+++ b/ChatQnA/docker_compose/amd/gpu/rocm/compose.yaml
@@ -0,0 +1,183 @@
+# Copyright (C) 2024 Advanced Micro Devices, Inc.
+# SPDX-License-Identifier: Apache-2.0
+
+services:
+  chatqna-redis-vector-db:
+    image: redis/redis-stack:7.2.0-v9
+    container_name: redis-vector-db
+    ports:
+      - "${CHATQNA_REDIS_VECTOR_PORT}:6379"
+      - "${CHATQNA_REDIS_VECTOR_INSIGHT_PORT}:8001"
+  chatqna-dataprep-redis-service:
+    image: ${REGISTRY:-opea}/dataprep-redis:${TAG:-latest}
+    container_name: dataprep-redis-server
+    depends_on:
+      - chatqna-redis-vector-db
+      - chatqna-tei-embedding-service
+    ports:
+      - "${CHATQNA_REDIS_DATAPREP_PORT}:6007"
+    environment:
+      no_proxy: ${no_proxy}
+      http_proxy: ${http_proxy}
+      https_proxy: ${https_proxy}
+      REDIS_URL: ${CHATQNA_REDIS_URL}
+      INDEX_NAME: ${CHATQNA_INDEX_NAME}
+      TEI_ENDPOINT: ${CHATQNA_TEI_EMBEDDING_ENDPOINT}
+      HUGGINGFACEHUB_API_TOKEN: ${CHATQNA_HUGGINGFACEHUB_API_TOKEN}
+  chatqna-tei-embedding-service:
+    image: ghcr.io/huggingface/text-embeddings-inference:cpu-1.5
+    container_name: chatqna-tei-embedding-server
+    ports:
+      - "${CHATQNA_TEI_EMBEDDING_PORT}:80"
+    volumes:
+      - "/var/opea/chatqna-service/data:/data"
+    shm_size: 1g
+    ipc: host
+    environment:
+      no_proxy: ${no_proxy}
+      http_proxy: ${http_proxy}
+      https_proxy: ${https_proxy}
+    command: --model-id ${CHATQNA_EMBEDDING_MODEL_ID} --auto-truncate
+    devices:
+      - /dev/kfd:/dev/kfd
+      - /dev/dri/card1:/dev/dri/card1
+      - /dev/dri/renderD136:/dev/dri/renderD136
+    cap_add:
+      - SYS_PTRACE
+    group_add:
+      - video
+    security_opt:
+      - seccomp:unconfined
+  chatqna-retriever:
+    image: ${REGISTRY:-opea}/retriever-redis:${TAG:-latest}
+    container_name: chatqna-retriever-redis-server
+    depends_on:
+      - chatqna-redis-vector-db
+    ports:
+      - "${CHATQNA_REDIS_RETRIEVER_PORT}:7000"
+    ipc: host
+    environment:
+      no_proxy: ${no_proxy}
+      http_proxy: ${http_proxy}
+      https_proxy: ${https_proxy}
+      REDIS_URL: ${CHATQNA_REDIS_URL}
+      INDEX_NAME: ${CHATQNA_INDEX_NAME}
+      TEI_EMBEDDING_ENDPOINT: ${CHATQNA_TEI_EMBEDDING_ENDPOINT}
+    restart: unless-stopped
+  chatqna-tei-reranking-service:
+    image: ghcr.io/huggingface/text-embeddings-inference:cpu-1.5
+    container_name: chatqna-tei-reranking-server
+    ports:
+      - "${CHATQNA_TEI_RERANKING_PORT}:80"
+    volumes:
+      - "/var/opea/chatqna-service/data:/data"
+    shm_size: 1g
+    environment:
+      no_proxy: ${no_proxy}
+      http_proxy: ${http_proxy}
+      https_proxy: ${https_proxy}
+      HUGGINGFACEHUB_API_TOKEN: ${CHATQNA_HUGGINGFACEHUB_API_TOKEN}
+      HF_HUB_DISABLE_PROGRESS_BARS: 1
+      HF_HUB_ENABLE_HF_TRANSFER: 0
+    devices:
+      - /dev/kfd:/dev/kfd
+      - /dev/dri/:/dev/dri/
+    cap_add:
+      - SYS_PTRACE
+    group_add:
+      - video
+    security_opt:
+      - seccomp:unconfined
+    command: --model-id ${CHATQNA_RERANK_MODEL_ID} --auto-truncate
+  chatqna-tgi-service:
+    image: ${CHATQNA_TGI_SERVICE_IMAGE}
+    container_name: chatqna-tgi-server
+    ports:
+      - "${CHATQNA_TGI_SERVICE_PORT}:80"
+    environment:
+      no_proxy: ${no_proxy}
+      http_proxy: ${http_proxy}
+      https_proxy: ${https_proxy}
+      HUGGINGFACEHUB_API_TOKEN: ${CHATQNA_HUGGINGFACEHUB_API_TOKEN}
+      HF_HUB_DISABLE_PROGRESS_BARS: 1
+      HF_HUB_ENABLE_HF_TRANSFER: 0
+    volumes:
+      - "/var/opea/chatqna-service/data:/data"
+    shm_size: 1g
+    devices:
+      - /dev/kfd:/dev/kfd
+      - /dev/dri/:/dev/dri/
+    cap_add:
+      - SYS_PTRACE
+    group_add:
+      - video
+    security_opt:
+      - seccomp:unconfined
+    command: --model-id ${CHATQNA_LLM_MODEL_ID}
+    ipc: host
+  chatqna-backend-server:
+    image: ${REGISTRY:-opea}/chatqna:${TAG:-latest}
+    container_name: chatqna-backend-server
+    depends_on:
+      - chatqna-redis-vector-db
+      - chatqna-tei-embedding-service
+      - chatqna-retriever
+      - chatqna-tei-reranking-service
+      - chatqna-tgi-service
+    ports:
+      - "${CHATQNA_BACKEND_SERVICE_PORT}:8888"
+    environment:
+      - no_proxy=${no_proxy}
+      - https_proxy=${https_proxy}
+      - http_proxy=${http_proxy}
+      - MEGA_SERVICE_HOST_IP=${CHATQNA_MEGA_SERVICE_HOST_IP}
+      - EMBEDDING_SERVER_HOST_IP=${HOST_IP}
+      - EMBEDDING_SERVER_PORT=${CHATQNA_TEI_EMBEDDING_PORT:-80}
+      - RETRIEVER_SERVICE_HOST_IP=${HOST_IP}
+      - RERANK_SERVER_HOST_IP=${HOST_IP}
+      - RERANK_SERVER_PORT=${CHATQNA_TEI_RERANKING_PORT:-80}
+      - LLM_SERVER_HOST_IP=${HOST_IP}
+      - LLM_SERVER_PORT=${CHATQNA_TGI_SERVICE_PORT:-80}
+      - LLM_MODEL=${CHATQNA_LLM_MODEL_ID}
+    ipc: host
+    restart: always
+  chatqna-ui-server:
+    image: ${REGISTRY:-opea}/chatqna-ui:${TAG:-latest}
+    container_name: chatqna-ui-server
+    depends_on:
+      - chatqna-backend-server
+    ports:
+      - "${CHATQNA_FRONTEND_SERVICE_PORT}:5173"
+    environment:
+      - no_proxy=${no_proxy}
+      - https_proxy=${https_proxy}
+      - http_proxy=${http_proxy}
+      - CHAT_BASE_URL=${CHATQNA_BACKEND_SERVICE_ENDPOINT}
+      - UPLOAD_FILE_BASE_URL=${CHATQNA_DATAPREP_SERVICE_ENDPOINT}
+      - GET_FILE=${CHATQNA_DATAPREP_GET_FILE_ENDPOINT}
+      - DELETE_FILE=${CHATQNA_DATAPREP_DELETE_FILE_ENDPOINT}
+    ipc: host
+    restart: always
+  chatqna-nginx-server:
+    image: ${REGISTRY:-opea}/nginx:${TAG:-latest}
+    container_name: chaqna-nginx-server
+    depends_on:
+      - chatqna-backend-server
+      - chatqna-ui-server
+    ports:
+      - "${CHATQNA_NGINX_PORT}:80"
+    environment:
+      - no_proxy=${no_proxy}
+      - https_proxy=${https_proxy}
+      - http_proxy=${http_proxy}
+      - FRONTEND_SERVICE_IP=${CHATQNA_FRONTEND_SERVICE_IP}
+      - FRONTEND_SERVICE_PORT=${CHATQNA_FRONTEND_SERVICE_PORT}
+      - BACKEND_SERVICE_NAME=${CHATQNA_BACKEND_SERVICE_NAME}
+      - BACKEND_SERVICE_IP=${CHATQNA_BACKEND_SERVICE_IP}
+      - BACKEND_SERVICE_PORT=${CHATQNA_BACKEND_SERVICE_PORT}
+    ipc: host
+    restart: always
+
+networks:
+  default:
+    driver: bridge
--- a/ChatQnA/docker_compose/amd/gpu/rocm/set_env.sh
+++ b/ChatQnA/docker_compose/amd/gpu/rocm/set_env.sh
@@ -0,0 +1,34 @@
+#!/usr/bin/env bash
+
+# Copyright (C) 2024 Advanced Micro Devices, Inc.
+# SPDX-License-Identifier: Apache-2.0
+
+export CHATQNA_TGI_SERVICE_IMAGE="ghcr.io/huggingface/text-generation-inference:2.3.1-rocm"
+export CHATQNA_EMBEDDING_MODEL_ID="BAAI/bge-base-en-v1.5"
+export CHATQNA_RERANK_MODEL_ID="BAAI/bge-reranker-base"
+export CHATQNA_LLM_MODEL_ID="Intel/neural-chat-7b-v3-3"
+export CHATQNA_TGI_SERVICE_PORT=18008
+export CHATQNA_TEI_EMBEDDING_PORT=18090
+export CHATQNA_TEI_EMBEDDING_ENDPOINT="http://${HOST_IP}:${CHATQNA_TEI_EMBEDDING_PORT}"
+export CHATQNA_TEI_RERANKING_PORT=18808
+export CHATQNA_REDIS_VECTOR_PORT=16379
+export CHATQNA_REDIS_VECTOR_INSIGHT_PORT=8001
+export CHATQNA_REDIS_DATAPREP_PORT=6007
+export CHATQNA_REDIS_RETRIEVER_PORT=7000
+export CHATQNA_INDEX_NAME="rag-redis"
+export CHATQNA_MEGA_SERVICE_HOST_IP=${HOST_IP}
+export CHATQNA_RETRIEVER_SERVICE_HOST_IP=${HOST_IP}
+export CHATQNA_BACKEND_SERVICE_PORT=18888
+export CHATQNA_BACKEND_SERVICE_ENDPOINT="http://127.0.0.1:${CHATQNA_BACKEND_SERVICE_PORT}/v1/chatqna"
+export CHATQNA_DATAPREP_SERVICE_ENDPOINT="http://127.0.0.1:${CHATQNA_REDIS_DATAPREP_PORT}/v1/dataprep"
+export CHATQNA_DATAPREP_GET_FILE_ENDPOINT="http://127.0.0.1:${CHATQNA_REDIS_DATAPREP_PORT}/v1/dataprep/get_file"
+export CHATQNA_DATAPREP_DELETE_FILE_ENDPOINT="http://127.0.0.1:${CHATQNA_REDIS_DATAPREP_PORT}/v1/dataprep/delete_file"
+export CHATQNA_FRONTEND_SERVICE_IP=${HOST_IP}
+export CHATQNA_FRONTEND_SERVICE_PORT=15173
+export CHATQNA_BACKEND_SERVICE_NAME=chatqna
+export CHATQNA_BACKEND_SERVICE_IP=${HOST_IP}
+export CHATQNA_REDIS_URL="redis://${HOST_IP}:${CHATQNA_REDIS_VECTOR_PORT}"
+export CHATQNA_EMBEDDING_SERVICE_HOST_IP=${HOST_IP}
+export CHATQNA_RERANK_SERVICE_HOST_IP=${HOST_IP}
+export CHATQNA_LLM_SERVICE_HOST_IP=${HOST_IP}
+export CHATQNA_NGINX_PORT=15176
--- a/ChatQnA/docker_compose/intel/cpu/aipc/set_env.sh
+++ b/ChatQnA/docker_compose/intel/cpu/aipc/set_env.sh
@@ -3,6 +3,9 @@
 # Copyright (C) 2024 Intel Corporation
 # SPDX-License-Identifier: Apache-2.0

+pushd "../../../../../" > /dev/null
+source .set_env.sh
+popd > /dev/null

 if [ -z "${your_hf_api_token}" ]; then
    echo "Error: HUGGINGFACEHUB_API_TOKEN is not set. Please set your_hf_api_token."
--- a/ChatQnA/docker_compose/intel/cpu/xeon/README.md
+++ b/ChatQnA/docker_compose/intel/cpu/xeon/README.md
@@ -26,7 +26,6 @@ To set up environment variables for deploying ChatQnA services, follow these ste
   export http_proxy="Your_HTTP_Proxy"
   export https_proxy="Your_HTTPs_Proxy"
   # Example: no_proxy="localhost, 127.0.0.1, 192.168.1.1"
-   # Example: no_proxy="localhost, 127.0.0.1, 192.168.1.1"
   export no_proxy="Your_No_Proxy",chatqna-xeon-ui-server,chatqna-xeon-backend-server,dataprep-redis-service,tei-embedding-service,retriever,tei-reranking-service,tgi-service,vllm_service
   ```

@@ -324,17 +323,17 @@ For details on how to verify the correctness of the response, refer to [how-to-v

   ```bash
   # TGI service
-   curl http://${host_ip}:9009/generate \
+   curl http://${host_ip}:9009/v1/chat/completions \
     -X POST \
-     -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":17, "do_sample": true}}' \
+     -d '{"model": "Intel/neural-chat-7b-v3-3", "messages": [{"role": "user", "content": "What is Deep Learning?"}], "max_tokens":17}' \
     -H 'Content-Type: application/json'
   ```

   ```bash
   # vLLM Service
-   curl http://${host_ip}:9009/v1/completions \
+   curl http://${host_ip}:9009/v1/chat/completions \
     -H "Content-Type: application/json" \
-     -d '{"model": "Intel/neural-chat-7b-v3-3", "prompt": "What is Deep Learning?", "max_tokens": 32, "temperature": 0}'
+     -d '{"model": "Intel/neural-chat-7b-v3-3", "messages": [{"role": "user", "content": "What is Deep Learning?"}]}'
   ```

 5. MegaService
@@ -433,6 +432,66 @@ curl -X POST "http://${host_ip}:6007/v1/dataprep/delete_file" \
     -H "Content-Type: application/json"
 ```

+### Profile Microservices
+
+To further analyze microservice performance, users can follow the instructions below to profile the microservices.
+
+#### 1. vLLM backend Service
+
+Users can follow the previous section to test the vLLM microservice or the ChatQnA MegaService.
+By default, vLLM profiling is not enabled. Users can start and stop profiling with the following commands.
+
+##### Start vLLM profiling
+
+```bash
+curl http://${host_ip}:9009/start_profile \
+  -H "Content-Type: application/json" \
+  -d '{"model": "Intel/neural-chat-7b-v3-3"}'
+```
+
+If profiling is started correctly, the vllm-service Docker logs show the following:
+
+```bash
+INFO api_server.py:361] Starting profiler...
+INFO api_server.py:363] Profiler started.
+INFO:     x.x.x.x:35940 - "POST /start_profile HTTP/1.1" 200 OK
+```
+
+After vLLM profiling is started, users can start asking questions and get responses from the vLLM microservice
+or the ChatQnA MegaService.
+
+##### Stop vLLM profiling
+
+The following command stops vLLM profiling and generates a \*.pt.trace.json.gz file as the profiling result
+under the /mnt folder in the vllm-service Docker container.
+
+```bash
+# vLLM Service
+curl http://${host_ip}:9009/stop_profile \
+  -H "Content-Type: application/json" \
+  -d '{"model": "Intel/neural-chat-7b-v3-3"}'
+```
+
+If profiling is stopped correctly, the vllm-service Docker logs show the following:
+
+```bash
+INFO api_server.py:368] Stopping profiler...
+INFO api_server.py:370] Profiler stopped.
+INFO:     x.x.x.x:41614 - "POST /stop_profile HTTP/1.1" 200 OK
+```
+
+After vLLM profiling is stopped, users can use the command below to copy the \*.pt.trace.json.gz file from the /mnt folder.
+
+```bash
+docker cp  vllm-service:/mnt/ .
+```
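+
+To our knowledge, vLLM only serves the `/start_profile` and `/stop_profile` endpoints when it is launched with the `VLLM_TORCH_PROFILER_DIR` environment variable set, and it writes the trace files to that directory. A minimal sketch of the matching compose setting is shown below; the service name `vllm_service` and the `/mnt` value are assumptions inferred from the names and paths used in this section, so check them against your own compose file.
+
+```yaml
+services:
+  vllm_service:
+    environment:
+      # Directory inside the container where vLLM writes the profiler traces
+      # (assumed to be /mnt, matching the docker cp command above)
+      VLLM_TORCH_PROFILER_DIR: "/mnt"
+```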
+
+##### Check profiling result
+
+Open a web browser, go to "chrome://tracing" or "ui.perfetto.dev", and load the json.gz file. You should be able
+to see the vLLM profiling result, as in the diagram below.
+![image](https://github.com/user-attachments/assets/55c7097e-5574-41dc-97a7-5e87c31bc286)
+
 ## 🚀 Launch the UI

 ### Launch with origin port
--- a/ChatQnA/docker_compose/intel/cpu/xeon/README_pinecone.md
+++ b/ChatQnA/docker_compose/intel/cpu/xeon/README_pinecone.md
@@ -0,0 +1,382 @@
+# Build Mega Service of ChatQnA (with Pinecone) on Xeon
+
+This document outlines the deployment process for a ChatQnA application utilizing the [GenAIComps](https://github.com/opea-project/GenAIComps.git) microservice pipeline on an Intel Xeon server. The steps include Docker image creation, container deployment via Docker Compose, and service execution to integrate microservices such as `embedding`, `retriever`, `rerank`, and `llm`. We will publish the Docker images to Docker Hub soon, which will simplify the deployment process for this service.
+
+## 🚀 Apply Xeon Server on AWS
+
+To provision a Xeon server on AWS, start by creating an AWS account if you don't have one already. Then, head to the [EC2 Console](https://console.aws.amazon.com/ec2/v2/home) to begin the process. Within the EC2 service, select the Amazon EC2 M7i or M7i-flex instance type to leverage the power of 4th Generation Intel Xeon Scalable processors. These instances are optimized for high-performance computing and demanding workloads.
+
+For detailed information about these instance types, you can refer to this [link](https://aws.amazon.com/ec2/instance-types/m7i/). Once you've chosen the appropriate instance type, proceed with configuring your instance settings, including network configurations, security groups, and storage options.
+
+After launching your instance, you can connect to it using SSH (for Linux instances) or Remote Desktop Protocol (RDP) (for Windows instances). From there, you'll have full access to your Xeon server, allowing you to install, configure, and manage your applications as needed.
+
+**Certain ports in the EC2 instance need to be opened up in the security group for the microservices to work with the curl commands.**
+
+> See one example below. Please open up these ports in the EC2 instance based on the IP addresses you want to allow
+
+```
+
+data_prep_service
+=====================
+Port 6007 - Open to 0.0.0.0/0
+Port 6008 - Open to 0.0.0.0/0
+
+tei_embedding_service
+=====================
+Port 6006 - Open to 0.0.0.0/0
+
+embedding
+=========
+Port 6000 - Open to 0.0.0.0/0
+
+retriever
+=========
+Port 7000 - Open to 0.0.0.0/0
+
+tei_xeon_service
+================
+Port 8808 - Open to 0.0.0.0/0
+
+reranking
+=========
+Port 8000 - Open to 0.0.0.0/0
+
+tgi-service
+===========
+Port 9009 - Open to 0.0.0.0/0
+
+llm
+===
+Port 9000 - Open to 0.0.0.0/0
+
+chatqna-xeon-backend-server
+===========================
+Port 8888 - Open to 0.0.0.0/0
+
+chatqna-xeon-ui-server
+======================
+Port 5173 - Open to 0.0.0.0/0
+```
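The port list above can be turned into security-group rules mechanically. The following is a minimal sketch, not part of the README: it only prints `aws ec2 authorize-security-group-ingress` commands for review and does not call AWS. The security group ID and CIDR are placeholder assumptions; a narrower CIDR than `0.0.0.0/0` is usually preferable.

```python
# Ports copied from the example list above; service names are used as labels only.
SERVICE_PORTS = {
    "data_prep_service": [6007, 6008],
    "tei_embedding_service": [6006],
    "embedding": [6000],
    "retriever": [7000],
    "tei_xeon_service": [8808],
    "reranking": [8000],
    "tgi-service": [9009],
    "llm": [9000],
    "chatqna-xeon-backend-server": [8888],
    "chatqna-xeon-ui-server": [5173],
}


def ingress_commands(group_id, cidr):
    """Return one AWS CLI ingress command string per port, in list order."""
    commands = []
    for service, ports in SERVICE_PORTS.items():
        for port in ports:
            commands.append(
                "aws ec2 authorize-security-group-ingress "
                f"--group-id {group_id} --protocol tcp --port {port} --cidr {cidr}"
                f"  # {service}"
            )
    return commands


# "sg-0123456789abcdef0" and "203.0.113.0/24" are hypothetical placeholders.
for command in ingress_commands("sg-0123456789abcdef0", "203.0.113.0/24"):
    print(command)
```

Printing the commands first keeps the step reviewable before anything is applied to a real security group.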
+
+## 🚀 Build Docker Images
+
+First, build the Docker images locally. Start by cloning the GenAIComps repository:
+
+```bash
+git clone https://github.com/opea-project/GenAIComps.git
+cd GenAIComps
+```
+
+### 1. Build Embedding Image
+
+```bash
+docker build --no-cache -t opea/embedding-tei:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/embeddings/tei/langchain/Dockerfile .
+```
+
+### 2. Build Retriever Image
+
+```bash
+docker build --no-cache -t opea/retriever-pinecone:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/retrievers/pinecone/langchain/Dockerfile .
+```
+
+### 3. Build Rerank Image
+
+```bash
+docker build --no-cache -t opea/reranking-tei:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/reranks/tei/Dockerfile .
+```
+
+### 4. Build LLM Image
+
+```bash
+docker build --no-cache -t opea/llm-tgi:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/llms/text-generation/tgi/Dockerfile .
+```
+
+### 5. Build Dataprep Image
+
+```bash
+docker build --no-cache -t opea/dataprep-pinecone:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/dataprep/pinecone/langchain/Dockerfile .
+cd ..
+```
+
+### 6. Build MegaService Docker Image
+
+To construct the Mega Service, we utilize the [GenAIComps](https://github.com/opea-project/GenAIComps.git) microservice pipeline within the `chatqna.py` Python script. Build the MegaService Docker image with the command below:
+
+```bash
+git clone https://github.com/opea-project/GenAIExamples.git
+cd GenAIExamples/ChatQnA/docker
+docker build --no-cache -t opea/chatqna:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f Dockerfile .
+cd ../../..
+```
+
+### 7. Build UI Docker Image
+
+Build the frontend Docker image with the command below:
+
+```bash
+cd GenAIExamples/ChatQnA/docker/ui/
+docker build --no-cache -t opea/chatqna-ui:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f ./docker/Dockerfile .
+cd ../../../..
+```
+
+### 8. Build Conversational React UI Docker Image (Optional)
+
+Build the frontend Docker image that enables a conversational experience with the ChatQnA MegaService using the command below:
+
+**Export the value of the public IP address of your Xeon server to the `host_ip` environment variable**
+
+```bash
+cd GenAIExamples/ChatQnA/docker/ui/
+export BACKEND_SERVICE_ENDPOINT="http://${host_ip}:8888/v1/chatqna"
+export DATAPREP_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/dataprep"
+export DATAPREP_GET_FILE_ENDPOINT="http://${host_ip}:6008/v1/dataprep/get_file"
+docker build --no-cache -t opea/chatqna-conversation-ui:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy --build-arg BACKEND_SERVICE_ENDPOINT=$BACKEND_SERVICE_ENDPOINT --build-arg DATAPREP_SERVICE_ENDPOINT=$DATAPREP_SERVICE_ENDPOINT --build-arg DATAPREP_GET_FILE_ENDPOINT=$DATAPREP_GET_FILE_ENDPOINT -f ./docker/Dockerfile.react .
+cd ../../../..
+```
+
+Then run `docker images`. You should see the following 7 Docker images:
+
+1. `opea/dataprep-pinecone:latest`
+2. `opea/embedding-tei:latest`
+3. `opea/retriever-pinecone:latest`
+4. `opea/reranking-tei:latest`
+5. `opea/llm-tgi:latest`
+6. `opea/chatqna:latest`
+7. `opea/chatqna-ui:latest`
+
+## 🚀 Start Microservices
+
+### Setup Environment Variables
+
+Since `compose_pinecone.yaml` consumes several environment variables, set them up in advance as shown below.
+
+**Export the value of the public IP address of your Xeon server to the `host_ip` environment variable**
+
+> Replace `External_Public_IP` below with the actual IPv4 address
+
+```
+export host_ip="External_Public_IP"
+```
+
+**Export the value of your Huggingface API token to the `your_hf_api_token` environment variable**
+
+> Replace `Your_Huggingface_API_Token` below with your actual Huggingface API token
+
+```
+export your_hf_api_token="Your_Huggingface_API_Token"
+```
+
+**Append the value of the public IP address to the no_proxy list**
+
+```
+export your_no_proxy=${your_no_proxy},"External_Public_IP"
+```
+
+**Get the PINECONE_API_KEY and the INDEX_NAME**
+
+```
+export pinecone_api_key=${api_key}
+export pinecone_index_name=${pinecone_index}
+```
+
+```bash
+export no_proxy=${your_no_proxy}
+export http_proxy=${your_http_proxy}
+export https_proxy=${your_http_proxy}
+export EMBEDDING_MODEL_ID="BAAI/bge-base-en-v1.5"
+export RERANK_MODEL_ID="BAAI/bge-reranker-base"
+export LLM_MODEL_ID="Intel/neural-chat-7b-v3-3"
+export TEI_EMBEDDING_ENDPOINT="http://${host_ip}:6006"
+export TEI_RERANKING_ENDPOINT="http://${host_ip}:8808"
+export TGI_LLM_ENDPOINT="http://${host_ip}:9009"
+export PINECONE_API_KEY=${pinecone_api_key}
+export PINECONE_INDEX_NAME=${pinecone_index_name}
+export INDEX_NAME=${pinecone_index_name}
+export HUGGINGFACEHUB_API_TOKEN=${your_hf_api_token}
+export MEGA_SERVICE_HOST_IP=${host_ip}
+export EMBEDDING_SERVICE_HOST_IP=${host_ip}
+export RETRIEVER_SERVICE_HOST_IP=${host_ip}
+export RERANK_SERVICE_HOST_IP=${host_ip}
+export LLM_SERVICE_HOST_IP=${host_ip}
+export BACKEND_SERVICE_ENDPOINT="http://${host_ip}:8888/v1/chatqna"
+export DATAPREP_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/dataprep"
+export DATAPREP_GET_FILE_ENDPOINT="http://${host_ip}:6008/v1/dataprep/get_file"
+export DATAPREP_DELETE_FILE_ENDPOINT="http://${host_ip}:6009/v1/dataprep/delete_file"
+```
+
+Note: Replace `host_ip` with your external IP address. Do not use localhost.
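The endpoint variables above all derive from `host_ip`, so the "no localhost" rule can be checked before anything is exported. The sketch below is illustrative only: `build_endpoints` is a hypothetical helper, not part of the repository, and it assumes `host_ip` is a literal IP address (any other value raises `ValueError`).

```python
import ipaddress


def build_endpoints(host_ip):
    """Derive the endpoint URLs used above from a single host IP."""
    # Reject loopback values, since the README asks for the external address.
    if host_ip == "localhost" or ipaddress.ip_address(host_ip).is_loopback:
        raise ValueError("host_ip must be the external IP address, not localhost")
    base = f"http://{host_ip}"
    return {
        "TEI_EMBEDDING_ENDPOINT": f"{base}:6006",
        "TEI_RERANKING_ENDPOINT": f"{base}:8808",
        "TGI_LLM_ENDPOINT": f"{base}:9009",
        "BACKEND_SERVICE_ENDPOINT": f"{base}:8888/v1/chatqna",
        "DATAPREP_SERVICE_ENDPOINT": f"{base}:6007/v1/dataprep",
        "DATAPREP_GET_FILE_ENDPOINT": f"{base}:6008/v1/dataprep/get_file",
        "DATAPREP_DELETE_FILE_ENDPOINT": f"{base}:6009/v1/dataprep/delete_file",
    }


# 192.0.2.10 is a documentation-range placeholder, not a real server.
for name, url in build_endpoints("192.0.2.10").items():
    print(f'export {name}="{url}"')
```

The printed `export` lines match the names and ports in the block above, so they can be compared against it directly.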
+
+### Start all the services Docker Containers
+
+> Before running the docker compose command, you need to be in the folder that has the docker compose yaml file
+
+```bash
+cd GenAIExamples/ChatQnA/docker/xeon/
+docker compose -f compose_pinecone.yaml up -d
+```
+
+### Validate Microservices
+
+1. TEI Embedding Service
+
+```bash
+curl ${host_ip}:6006/embed \
+    -X POST \
+    -d '{"inputs":"What is Deep Learning?"}' \
+    -H 'Content-Type: application/json'
+```
+
+2. Embedding Microservice
+
+```bash
+curl http://${host_ip}:6000/v1/embeddings \
+  -X POST \
+  -d '{"text":"hello"}' \
+  -H 'Content-Type: application/json'
+```
+
+3. Retriever Microservice  
+   To validate the retriever microservice, generate a mock embedding vector of length 768 with a Python script:
+
+```Python
+import random
+embedding = [random.uniform(-1, 1) for _ in range(768)]
+print(embedding)
+```
+
+Then substitute your mock embedding vector for the `${your_embedding}` in the following cURL command:
+
+```bash
+curl http://${host_ip}:7000/v1/retrieval \
+  -X POST \
+  -d '{"text":"What is the revenue of Nike in 2023?","embedding":"'"${your_embedding}"'"}' \
+  -H 'Content-Type: application/json'
+```
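The manual substitution above can also be scripted. The sketch below builds the same request body in Python; `mock_embedding` and `retrieval_body` are hypothetical helper names. It mirrors the quoting of the cURL command, in which the vector is passed as a quoted string; whether the service also accepts a bare JSON array is not confirmed here.

```python
import json
import random


def mock_embedding(dim=768, seed=None):
    """Return a list of `dim` floats drawn uniformly from [-1, 1]."""
    rng = random.Random(seed)
    return [rng.uniform(-1, 1) for _ in range(dim)]


def retrieval_body(text, embedding):
    """Build the JSON body, keeping the vector as a string like the cURL command."""
    return json.dumps({"text": text, "embedding": json.dumps(embedding)})


body = retrieval_body("What is the revenue of Nike in 2023?", mock_embedding(seed=0))
print(len(body), "bytes")
```

Passing a fixed `seed` makes the mock vector reproducible between runs, which helps when comparing responses.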
+
+4. TEI Reranking Service
+
+```bash
+curl http://${host_ip}:8808/rerank \
+    -X POST \
+    -d '{"query":"What is Deep Learning?", "texts": ["Deep Learning is not...", "Deep learning is..."]}' \
+    -H 'Content-Type: application/json'
+```
+
+5. Reranking Microservice
+
+```bash
+curl http://${host_ip}:8000/v1/reranking \
+  -X POST \
+  -d '{"initial_query":"What is Deep Learning?", "retrieved_docs": [{"text":"Deep Learning is not..."}, {"text":"Deep learning is..."}]}' \
+  -H 'Content-Type: application/json'
+```
+
+6. TGI Service
+
+```bash
+curl http://${host_ip}:9009/generate \
+  -X POST \
+  -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":17, "do_sample": true}}' \
+  -H 'Content-Type: application/json'
+```
+
+7. LLM Microservice
+
+```bash
+curl http://${host_ip}:9000/v1/chat/completions \
+  -X POST \
+  -d '{"query":"What is Deep Learning?","max_new_tokens":17,"top_k":10,"top_p":0.95,"typical_p":0.95,"temperature":0.01,"repetition_penalty":1.03,"streaming":true}' \
+  -H 'Content-Type: application/json'
+```
+
+8. MegaService
+
+```bash
+curl http://${host_ip}:8888/v1/chatqna -H "Content-Type: application/json" -d '{
+     "messages": "What is the revenue of Nike in 2023?"
+     }'
+```
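The MegaService call above can also be issued from Python with only the standard library. This is a sketch under the same assumptions as the cURL command (port 8888, path `/v1/chatqna`, a `messages` string); `chatqna_request` and `send` are hypothetical helper names, and the 120-second timeout is an arbitrary choice.

```python
import json
import urllib.request


def chatqna_request(host_ip, question):
    """Build the POST request that the cURL command above sends."""
    url = f"http://{host_ip}:8888/v1/chatqna"
    body = json.dumps({"messages": question}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=120):
    """Send the request and return the raw response text (needs a running service)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

With the services running, `print(send(chatqna_request(host_ip, "What is the revenue of Nike in 2023?")))` would show the raw response for a given `host_ip`.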
+
+9. Dataprep Microservice (Optional)
+
+If you want to update the default knowledge base, you can use the following commands:
+
+Update Knowledge Base via Local File Upload:
+
+```bash
+curl -X POST "http://${host_ip}:6007/v1/dataprep" \
+     -H "Content-Type: multipart/form-data" \
+     -F "files=@./nke-10k-2023.pdf"
+```
+
+This command updates a knowledge base by uploading a local file for processing. Update the file path according to your environment.
+
+Add Knowledge Base via HTTP Links:
+
+```bash
+curl -X POST "http://${host_ip}:6007/v1/dataprep" \
+     -H "Content-Type: multipart/form-data" \
+     -F 'link_list=["https://opea.dev"]'
+```
+
+This command updates a knowledge base by submitting a list of HTTP links for processing.
+
+You can also retrieve the list of files you uploaded:
+
+```bash
+curl -X POST "http://${host_ip}:6008/v1/dataprep/get_file" \
+     -H "Content-Type: application/json"
+```
+
+## Enable LangSmith for Monitoring Application (Optional)
+
+LangSmith offers tools to debug, evaluate, and monitor language models and intelligent agents. It can be used to assess benchmark data for each microservice. Before launching your services with `docker compose -f compose_pinecone.yaml up -d`, you need to enable LangSmith tracing by setting the `LANGCHAIN_TRACING_V2` environment variable to true and configuring your LangChain API key.
+
+Here's how you can do it:
+
+1. Install the latest version of LangSmith:
+
+```bash
+pip install -U langsmith
+```
+
+2. Set the necessary environment variables:
+
+```bash
+export LANGCHAIN_TRACING_V2=true
+export LANGCHAIN_API_KEY=ls_...
+```
+
+## 🚀 Launch the UI
+
+To access the frontend, open the following URL in your browser: http://{host_ip}:5173. By default, the UI runs on port 5173 internally. If you prefer to use a different host port to access the frontend, you can modify the port mapping in the `compose_pinecone.yaml` file as shown below:
+
+```yaml
+  chatqna-xeon-ui-server:
+    image: opea/chatqna-ui:latest
+    ...
+    ports:
+      - "80:5173"
+```
+
+## 🚀 Launch the Conversational UI (react)
+
+To access the Conversational UI frontend, open the following URL in your browser: http://{host_ip}:5174. By default, the UI runs on port 80 internally. If you prefer to use a different host port to access the frontend, you can modify the port mapping in the `compose.yaml` file as shown below:
+
+```yaml
+  chatqna-xeon-conversation-ui-server:
+    image: opea/chatqna-conversation-ui:latest
+    ...
+    ports:
+      - "80:80"
+```
+
+![project-screenshot](../../../../assets/img/chat_ui_init.png)
+
+Here is an example of running ChatQnA:
+
+![project-screenshot](../../../../assets/img/chat_ui_response.png)
+
+Here is an example of running ChatQnA with Conversational UI (React):
+
+![project-screenshot](../../../../assets/img/conversation_ui_response.png)
--- a/ChatQnA/docker_compose/intel/cpu/xeon/README_qdrant.md
+++ b/ChatQnA/docker_compose/intel/cpu/xeon/README_qdrant.md
@@ -252,9 +252,9 @@ For details on how to verify the correctness of the response, refer to [how-to-v
   Then try the `cURL` command below to validate TGI.

   ```bash
-   curl http://${host_ip}:6042/generate \
+   curl http://${host_ip}:6042/v1/chat/completions \
     -X POST \
-     -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":17, "do_sample": true}}' \
+     -d '{"model": "Intel/neural-chat-7b-v3-3", "messages": [{"role": "user", "content": "What is Deep Learning?"}], "max_tokens":17}' \
     -H 'Content-Type: application/json'
   ```

--- a/ChatQnA/docker_compose/intel/cpu/xeon/compose_pinecone.yaml
+++ b/ChatQnA/docker_compose/intel/cpu/xeon/compose_pinecone.yaml
@@ -0,0 +1,151 @@
+
+# Copyright (C) 2024 Intel Corporation
+# SPDX-License-Identifier: Apache-2.0
+
+version: "3.8"
+
+services:
+  dataprep-pinecone-service:
+    image: ${REGISTRY:-opea}/dataprep-pinecone:${TAG:-latest}
+    container_name: dataprep-pinecone-server
+    depends_on:
+      - tei-embedding-service
+    ports:
+      - "6007:6007"
+      - "6008:6008"
+      - "6009:6009"
+    environment:
+      no_proxy: ${no_proxy}
+      http_proxy: ${http_proxy}
+      https_proxy: ${https_proxy}
+      PINECONE_API_KEY: ${PINECONE_API_KEY}
+      PINECONE_INDEX_NAME: ${PINECONE_INDEX_NAME}
+      TEI_EMBEDDING_ENDPOINT: http://tei-embedding-service:80
+      LANGCHAIN_API_KEY: ${LANGCHAIN_API_KEY}
+      HUGGINGFACEHUB_API_TOKEN: ${HUGGINGFACEHUB_API_TOKEN}
+  tei-embedding-service:
+    image: ghcr.io/huggingface/text-embeddings-inference:cpu-1.5
+    container_name: tei-embedding-server
+    ports:
+      - "6006:80"
+    volumes:
+      - "./data:/data"
+    shm_size: 1g
+    environment:
+      no_proxy: ${no_proxy}
+      http_proxy: ${http_proxy}
+      https_proxy: ${https_proxy}
+    command: --model-id ${EMBEDDING_MODEL_ID} --auto-truncate
+  retriever:
+    image: ${REGISTRY:-opea}/retriever-pinecone:${TAG:-latest}
+    container_name: retriever-pinecone-server
+    ports:
+      - "7000:7000"
+    ipc: host
+    environment:
+      http_proxy: ${http_proxy}
+      https_proxy: ${https_proxy}
+      PINECONE_API_KEY: ${PINECONE_API_KEY}
+      INDEX_NAME: ${PINECONE_INDEX_NAME}
+      PINECONE_INDEX_NAME: ${PINECONE_INDEX_NAME}
+      LANGCHAIN_API_KEY: ${LANGCHAIN_API_KEY}
+      TEI_EMBEDDING_ENDPOINT: http://tei-embedding-service:80
+      HUGGINGFACEHUB_API_TOKEN: ${HUGGINGFACEHUB_API_TOKEN}
+    restart: unless-stopped
+  tei-reranking-service:
+    image: ghcr.io/huggingface/text-embeddings-inference:cpu-1.5
+    container_name: tei-reranking-server
+    ports:
+      - "8808:80"
+    volumes:
+      - "./data:/data"
+    shm_size: 1g
+    environment:
+      no_proxy: ${no_proxy}
+      http_proxy: ${http_proxy}
+      https_proxy: ${https_proxy}
+      HUGGINGFACEHUB_API_TOKEN: ${HUGGINGFACEHUB_API_TOKEN}
+      HF_HUB_DISABLE_PROGRESS_BARS: 1
+      HF_HUB_ENABLE_HF_TRANSFER: 0
+    command: --model-id ${RERANK_MODEL_ID} --auto-truncate
+  tgi-service:
+    image: ghcr.io/huggingface/text-generation-inference:2.4.0-intel-cpu
+    container_name: tgi-service
+    ports:
+      - "9009:80"
+    volumes:
+      - "./data:/data"
+    shm_size: 1g
+    environment:
+      no_proxy: ${no_proxy}
+      http_proxy: ${http_proxy}
+      https_proxy: ${https_proxy}
+      HF_TOKEN: ${HUGGINGFACEHUB_API_TOKEN}
+      HF_HUB_DISABLE_PROGRESS_BARS: 1
+      HF_HUB_ENABLE_HF_TRANSFER: 0
+    command: --model-id ${LLM_MODEL_ID} --cuda-graphs 0
+  chatqna-xeon-backend-server:
+    image: ${REGISTRY:-opea}/chatqna:${TAG:-latest}
+    container_name: chatqna-xeon-backend-server
+    depends_on:
+      - tei-embedding-service
+      - dataprep-pinecone-service
+      - retriever
+      - tei-reranking-service
+      - tgi-service
+    ports:
+      - "8888:8888"
+    environment:
+      - no_proxy=${no_proxy}
+      - https_proxy=${https_proxy}
+      - http_proxy=${http_proxy}
+      - MEGA_SERVICE_HOST_IP=chatqna-xeon-backend-server
+      - EMBEDDING_SERVER_HOST_IP=tei-embedding-service
+      - EMBEDDING_SERVER_PORT=${EMBEDDING_SERVER_PORT:-80}
+      - RETRIEVER_SERVICE_HOST_IP=retriever
+      - RERANK_SERVER_HOST_IP=tei-reranking-service
+      - RERANK_SERVER_PORT=${RERANK_SERVER_PORT:-80}
+      - LLM_SERVER_HOST_IP=tgi-service
+      - LLM_SERVER_PORT=${LLM_SERVER_PORT:-80}
+      - LOGFLAG=${LOGFLAG}
+      - LLM_MODEL=${LLM_MODEL_ID}
+    ipc: host
+    restart: always
+  chatqna-xeon-ui-server:
+    image: ${REGISTRY:-opea}/chatqna-ui:${TAG:-latest}
+    container_name: chatqna-xeon-ui-server
+    depends_on:
+      - chatqna-xeon-backend-server
+    ports:
+      - "5173:5173"
+    environment:
+      - no_proxy=${no_proxy}
+      - https_proxy=${https_proxy}
+      - http_proxy=${http_proxy}
+    ipc: host
+    restart: always
+  chatqna-xeon-nginx-server:
+    image: ${REGISTRY:-opea}/nginx:${TAG:-latest}
+    container_name: chatqna-xeon-nginx-server
+    depends_on:
+      - chatqna-xeon-backend-server
+      - chatqna-xeon-ui-server
+    ports:
+      - "${NGINX_PORT:-80}:80"
+    environment:
+      - no_proxy=${no_proxy}
+      - https_proxy=${https_proxy}
+      - http_proxy=${http_proxy}
+      - FRONTEND_SERVICE_IP=chatqna-xeon-ui-server
+      - FRONTEND_SERVICE_PORT=5173
+      - BACKEND_SERVICE_NAME=chatqna
+      - BACKEND_SERVICE_IP=chatqna-xeon-backend-server
+      - BACKEND_SERVICE_PORT=8888
+      - DATAPREP_SERVICE_IP=dataprep-pinecone-service
+      - DATAPREP_SERVICE_PORT=6007
+    ipc: host
+    restart: always
+
+networks:
+  default:
+    driver: bridge
--- a/ChatQnA/docker_compose/intel/cpu/xeon/compose_vllm.yaml
+++ b/ChatQnA/docker_compose/intel/cpu/xeon/compose_vllm.yaml
@@ -86,6 +86,7 @@ services:
      https_proxy: ${https_proxy}
      HF_TOKEN: ${HUGGINGFACEHUB_API_TOKEN}
      LLM_MODEL_ID: ${LLM_MODEL_ID}
+      VLLM_TORCH_PROFILER_DIR: "/mnt"
    command: --model $LLM_MODEL_ID --host 0.0.0.0 --port 80
  chatqna-xeon-backend-server:
    image: ${REGISTRY:-opea}/chatqna:${TAG:-latest}
--- a/ChatQnA/docker_compose/intel/cpu/xeon/set_env.sh
+++ b/ChatQnA/docker_compose/intel/cpu/xeon/set_env.sh
@@ -3,6 +3,9 @@
 # Copyright (C) 2024 Intel Corporation
 # SPDX-License-Identifier: Apache-2.0

+pushd "../../../../../" > /dev/null
+source .set_env.sh
+popd > /dev/null

 export EMBEDDING_MODEL_ID="BAAI/bge-base-en-v1.5"
 export RERANK_MODEL_ID="BAAI/bge-reranker-base"
--- a/ChatQnA/docker_compose/intel/hpu/gaudi/README.md
+++ b/ChatQnA/docker_compose/intel/hpu/gaudi/README.md
@@ -192,7 +192,7 @@ For users in China who are unable to download models directly from Huggingface,
   export HF_TOKEN=${your_hf_token}
   export HF_ENDPOINT="https://hf-mirror.com"
   model_name="Intel/neural-chat-7b-v3-3"
-   docker run -p 8008:80 -v ./data:/data --name tgi-service -e HF_ENDPOINT=$HF_ENDPOINT -e http_proxy=$http_proxy -e https_proxy=$https_proxy --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none -e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN -e ENABLE_HPU_GRAPH=true -e LIMIT_HPU_GRAPH=true -e USE_FLASH_ATTENTION=true -e FLASH_ATTENTION_RECOMPUTE=true --cap-add=sys_nice --ipc=host ghcr.io/huggingface/tgi-gaudi:2.0.5 --model-id $model_name --max-input-tokens 1024 --max-total-tokens 2048
+   docker run -p 8008:80 -v ./data:/data --name tgi-service -e HF_ENDPOINT=$HF_ENDPOINT -e http_proxy=$http_proxy -e https_proxy=$https_proxy --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none -e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN -e ENABLE_HPU_GRAPH=true -e LIMIT_HPU_GRAPH=true -e USE_FLASH_ATTENTION=true -e FLASH_ATTENTION_RECOMPUTE=true --cap-add=sys_nice --ipc=host ghcr.io/huggingface/tgi-gaudi:2.0.6 --model-id $model_name --max-input-tokens 1024 --max-total-tokens 2048
   ```

 2. Offline
@@ -206,7 +206,7 @@ For users in China who are unable to download models directly from Huggingface,
     ```bash
     export HF_TOKEN=${your_hf_token}
     export model_path="/path/to/model"
-     docker run -p 8008:80 -v $model_path:/data --name tgi_service --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none -e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN -e ENABLE_HPU_GRAPH=true -e LIMIT_HPU_GRAPH=true -e USE_FLASH_ATTENTION=true -e FLASH_ATTENTION_RECOMPUTE=true --cap-add=sys_nice --ipc=host ghcr.io/huggingface/tgi-gaudi:2.0.5 --model-id /data --max-input-tokens 1024 --max-total-tokens 2048
+     docker run -p 8008:80 -v $model_path:/data --name tgi_service --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none -e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN -e ENABLE_HPU_GRAPH=true -e LIMIT_HPU_GRAPH=true -e USE_FLASH_ATTENTION=true -e FLASH_ATTENTION_RECOMPUTE=true --cap-add=sys_nice --ipc=host ghcr.io/huggingface/tgi-gaudi:2.0.6 --model-id /data --max-input-tokens 1024 --max-total-tokens 2048
     ```

 ### Setup Environment Variables
@@ -326,23 +326,18 @@ For validation details, please refer to [how-to-validate_service](./how_to_valid
   Then try the `cURL` command below to validate services.

   ```bash
-   #TGI Service
-   curl http://${host_ip}:8005/generate \
+   # TGI service
+   curl http://${host_ip}:9009/v1/chat/completions \
     -X POST \
-     -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":64, "do_sample": true}}' \
+     -d '{"model": "'"${LLM_MODEL_ID}"'", "messages": [{"role": "user", "content": "What is Deep Learning?"}], "max_tokens":17}' \
     -H 'Content-Type: application/json'
   ```

   ```bash
-   #vLLM Service
-   curl http://${host_ip}:8007/v1/completions \
+   # vLLM Service
+   curl http://${host_ip}:9009/v1/chat/completions \
     -H "Content-Type: application/json" \
-     -d '{
-     "model": "${LLM_MODEL_ID}",
-     "prompt": "What is Deep Learning?",
-     "max_tokens": 32,
-     "temperature": 0
-     }'
+     -d '{"model": "'"${LLM_MODEL_ID}"'", "messages": [{"role": "user", "content": "What is Deep Learning?"}]}'
   ```

 5. MegaService
@@ -439,6 +434,68 @@ curl http://${host_ip}:9090/v1/guardrails\
  -H 'Content-Type: application/json'
 ```

+### Profile Microservices
+
+To analyze microservice performance further, follow the instructions below to profile the microservices.
+
+#### 1. vLLM backend Service
+
+Follow the previous section to test the vLLM microservice or the ChatQnA MegaService.  
+By default, vLLM profiling is not enabled. Start and stop profiling with the following commands.
+
+##### Start vLLM profiling
+
+```bash
+curl http://${host_ip}:9009/start_profile \
+  -H "Content-Type: application/json" \
+  -d '{"model": "'"${LLM_MODEL_ID}"'"}'
+```
+
+If profiling started correctly, the vllm-service Docker logs show the following:
+
+```bash
+INFO api_server.py:361] Starting profiler...
+INFO api_server.py:363] Profiler started.
+INFO:     x.x.x.x:35940 - "POST /start_profile HTTP/1.1" 200 OK
+```
+
+After vLLM profiling is started, send questions to the vLLM microservice  
+ or the ChatQnA MegaService to collect profiling data from the responses.
+
+##### Stop vLLM profiling
+
+The following command stops vLLM profiling and generates a \*.pt.trace.json.gz file containing the profiling result  
+ under the /mnt folder in the vllm-service Docker container.
+
+```bash
+# vLLM Service
+curl http://${host_ip}:9009/stop_profile \
+  -H "Content-Type: application/json" \
+  -d '{"model": "'"${LLM_MODEL_ID}"'"}'
+```
+
+If profiling stopped correctly, the vllm-service Docker logs show the following:
+
+```bash
+INFO api_server.py:368] Stopping profiler...
+INFO api_server.py:370] Profiler stopped.
+INFO:     x.x.x.x:41614 - "POST /stop_profile HTTP/1.1" 200 OK
+```
+
+After vLLM profiling is stopped, use the command below to copy the \*.pt.trace.json.gz file from the /mnt folder.
+
+```bash
+docker cp  vllm-service:/mnt/ .
+```
+
+##### Check profiling result
+
+Open `chrome://tracing` or `ui.perfetto.dev` in a web browser and load the json.gz file. The vLLM profiling result  
+ should look like the diagrams below.
+![image](https://github.com/user-attachments/assets/487c52c8-d187-46dc-ab3a-43f21d657d41)
+
+![image](https://github.com/user-attachments/assets/e3c51ce5-d704-4eb7-805e-0d88b0c158e3)
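Besides the browser viewers, the copied trace can be summarized from the command line. The sketch below is not part of the README. It assumes the file follows the Chrome Trace Event layout (a top-level `traceEvents` list whose complete events have `"ph": "X"` and a `dur` in microseconds), which is what the viewers named above consume; `top_events` is a hypothetical helper.

```python
import gzip
import json
from collections import defaultdict


def top_events(trace_path, limit=10):
    """Return the `limit` event names with the largest summed duration."""
    with gzip.open(trace_path, "rt", encoding="utf-8") as handle:
        trace = json.load(handle)
    totals = defaultdict(float)
    for event in trace.get("traceEvents", []):
        # Only complete events carry a duration; metadata events are skipped.
        if event.get("ph") == "X" and "dur" in event:
            totals[event.get("name", "<unnamed>")] += event["dur"]
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return ranked[:limit]
```

Calling `top_events` on a trace file copied out of the container's `/mnt` folder returns `(name, total_microseconds)` pairs, which gives a quick view of where time is spent before opening the full trace.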
+
 ## 🚀 Launch the UI

 ### Launch with origin port
--- a/ChatQnA/docker_compose/intel/hpu/gaudi/compose.yaml
+++ b/ChatQnA/docker_compose/intel/hpu/gaudi/compose.yaml
@@ -57,7 +57,7 @@ services:
      HUGGINGFACEHUB_API_TOKEN: ${HUGGINGFACEHUB_API_TOKEN}
    restart: unless-stopped
  tei-reranking-service:
-    image: ghcr.io/huggingface/tei-gaudi:latest
+    image: ghcr.io/huggingface/tei-gaudi:1.5.0
    container_name: tei-reranking-gaudi-server
    ports:
      - "8808:80"
@@ -78,7 +78,7 @@ services:
      MAX_WARMUP_SEQUENCE_LENGTH: 512
    command: --model-id ${RERANK_MODEL_ID} --auto-truncate
  tgi-service:
-    image: ghcr.io/huggingface/tgi-gaudi:2.0.5
+    image: ghcr.io/huggingface/tgi-gaudi:2.0.6
    container_name: tgi-gaudi-server
    ports:
      - "8005:80"
--- a/ChatQnA/docker_compose/intel/hpu/gaudi/compose_guardrails.yaml
+++ b/ChatQnA/docker_compose/intel/hpu/gaudi/compose_guardrails.yaml
@@ -26,7 +26,7 @@ services:
      TEI_ENDPOINT: http://tei-embedding-service:80
      HUGGINGFACEHUB_API_TOKEN: ${HUGGINGFACEHUB_API_TOKEN}
  tgi-guardrails-service:
-    image: ghcr.io/huggingface/tgi-gaudi:2.0.5
+    image: ghcr.io/huggingface/tgi-gaudi:2.0.6
    container_name: tgi-guardrails-server
    ports:
      - "8088:80"
@@ -96,7 +96,7 @@ services:
      HUGGINGFACEHUB_API_TOKEN: ${HUGGINGFACEHUB_API_TOKEN}
    restart: unless-stopped
  tei-reranking-service:
-    image: ghcr.io/huggingface/tei-gaudi:latest
+    image: ghcr.io/huggingface/tei-gaudi:1.5.0
    container_name: tei-reranking-gaudi-server
    ports:
      - "8808:80"
@@ -117,7 +117,7 @@ services:
      MAX_WARMUP_SEQUENCE_LENGTH: 512
    command: --model-id ${RERANK_MODEL_ID} --auto-truncate
  tgi-service:
-    image: ghcr.io/huggingface/tgi-gaudi:2.0.5
+    image: ghcr.io/huggingface/tgi-gaudi:2.0.6
    container_name: tgi-gaudi-server
    ports:
      - "8008:80"
Author	SHA1	Message	Date
NeuralChatBot	bbb4e231d0	Freeze OPEA images tag Signed-off-by: NeuralChatBot <grp_neural_chat_bot@intel.com>	2024-11-21 14:24:16 +00:00
bjzhjing	da10068964	Adjustments for helm release change (#1173 ) Signed-off-by: Cathy Zhang <cathy.zhang@intel.com> (cherry picked from commit `ef2047b070`)	2024-11-21 16:57:30 +08:00
Letong Han	188b568467	Fix Translation Manifest CI with MODEL_ID (#1169 ) Signed-off-by: letonghan <letong.han@intel.com> (cherry picked from commit `94231584aa`)	2024-11-21 16:57:29 +08:00
minmin-intel	9e9af9766f	Fix DocIndexRetriever CI error on Xeon (#1167 ) Signed-off-by: minmin-intel <minmin.hou@intel.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> (cherry picked from commit `c5177c5e2f`)	2024-11-21 16:57:28 +08:00
chen, suyue	cc108b5a18	Fix DBQnA image build (#1165 ) Signed-off-by: chensuyue <suyue.chen@intel.com>	2024-11-20 10:56:49 +08:00
chen, suyue	f70d9c3853	chatqna benchmark for v1.1 release (#1120 ) Signed-off-by: chensuyue <suyue.chen@intel.com> Signed-off-by: Cathy Zhang <cathy.zhang@intel.com>	2024-11-19 22:57:25 +08:00
ZePan110	8808b51e42	Rename image name XXX-hpu to XXX-gaudi (#1154 ) Signed-off-by: ZePan110 <ze.pan@intel.com>	2024-11-19 22:18:41 +08:00
chen, suyue	17d4b0c97f	freeze nodejs version in CI test (#1162 ) Signed-off-by: chensuyue <suyue.chen@intel.com>	2024-11-19 13:22:56 +08:00
Sun, Xuehao	3a03d31f8f	Update manual-freeze-tag workflow (#1161 ) Signed-off-by: Sun, Xuehao <xuehao.sun@intel.com>	2024-11-19 11:00:36 +08:00
dependabot[bot]	179fd84362	Bump gradio from 4.44.0 to 5.5.0 in /DocSum/ui/gradio (#1157 ) Signed-off-by: dependabot[bot] <support@github.com>	2024-11-18 23:50:56 +08:00
chen, suyue	9ba034b22d	fix the docker image name for release image build (#1152 ) Signed-off-by: chensuyue <suyue.chen@intel.com>	2024-11-18 23:48:01 +08:00
jotpalch	c3e6f43ece	Fix command in README for deploying ChatQnA application (#1156 )	2024-11-18 22:59:22 +08:00
Theresa	1ac756a1c7	Rename the GraphRAG UI image (#1155 ) Signed-off-by: ichbinblau <theresa.shan@intel.com>	2024-11-18 20:07:22 +08:00
sgurunat	56f770cb28	ChatQnA with Remote Inference Endpoints (Kubernetes) (#1149 ) Signed-off-by: sgurunat <gurunath.s@intel.com> Co-authored-by: chen, suyue <suyue.chen@intel.com>	2024-11-18 20:06:17 +08:00
XinyaoWa	0cdeb946e4	DocSum Manifest support multimedia (#1158 ) Signed-off-by: Xinyao Wang <xinyao.wang@intel.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>	2024-11-18 18:46:01 +08:00
Artem Astafev	5648839411	Add compose example for FaqGen AMD ROCm (#1126 ) Signed-off-by: artem-astafev <a.astafev@datamonsters.com>	2024-11-18 17:38:21 +08:00
Mustafa	eb91d1f054	Docsum (#1095 ) Signed-off-by: Mustafa <mustafa.cetin@intel.com> Signed-off-by: Harsha Ramayanam <harsha.ramayanam@intel.com> Co-authored-by: Harsha Ramayanam <harsha.ramayanam@intel.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: XinyaoWa <xinyao.wang@intel.com> Co-authored-by: Abolfazl Shahbazi <12436063+ashahba@users.noreply.github.com> Co-authored-by: chen, suyue <suyue.chen@intel.com>	2024-11-18 17:15:42 +08:00
Wang, Kai Lawrence	2587179224	Add instructions of modifying reranking docker image for NVGPU (#1133 ) Signed-off-by: Wang, Kai Lawrence <kai.lawrence.wang@intel.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>	2024-11-18 15:37:32 +08:00
chyundunovDatamonsters	7e62175c2e	Adding files to deploy CodeTrans application on AMD GPU (#1138 ) Signed-off-by: Chingis Yundunov <YundunovCN@sibedge.com>	2024-11-18 14:58:38 +08:00
Louie Tsai	152adf8012	maintain a version info for docker_compose yaml files among release (#1141 ) Signed-off-by: Tsai, Louie <louie.tsai@intel.com>	2024-11-17 22:39:41 -08:00
chyundunovDatamonsters	83172e9a99	Adding files to deploy CodeGen application on AMD GPU (#1130 ) Signed-off-by: Chingis Yundunov <YundunovCN@sibedge.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>	2024-11-18 14:36:23 +08:00
Liang Lv	fb514bb8ba	Add chatqna wrapper for multiple model selection (#1144 ) Signed-off-by: lvliang-intel <liang1.lv@intel.com> Co-authored-by: Ying Hu <ying.hu@intel.com> Co-authored-by: chen, suyue <suyue.chen@intel.com>	2024-11-18 10:48:09 +08:00
Artem Astafev	b1bb6db52d	Add compose example for DocSum amd rocm deployment (#1125 ) Signed-off-by: Artem Astafev <a.astafev@datamonsters.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>	2024-11-18 09:09:12 +08:00
rui2zhang	7949045176	EdgeCraftRAG: Add E2E test cases for EdgeCraftRAG - local LLM and vllm (#1137 ) Signed-off-by: Zhang, Rui <rui2.zhang@intel.com> Signed-off-by: Mingyuan Qi <mingyuan.qi@intel.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Mingyuan Qi <mingyuan.qi@intel.com>	2024-11-17 18:22:32 +08:00
Lianhao Lu	cbe952ec5e	Fail CI manifest test if response content is not expected (#1145 ) Signed-off-by: Lianhao Lu <lianhao.lu@intel.com> Co-authored-by: Abolfazl Shahbazi <12436063+ashahba@users.noreply.github.com>	2024-11-17 12:46:31 +08:00
chen, suyue	3b1a9fe9e1	optimize hardware list for test (#1151 ) Signed-off-by: chensuyue <suyue.chen@intel.com>	2024-11-15 22:46:02 +08:00
chen, suyue	e66d7fe381	fix typo involved in ci workflow (#1150 ) Signed-off-by: chensuyue <suyue.chen@intel.com>	2024-11-15 21:19:29 +08:00
Artem Astafev	6d3a017609	Add compose example for ChatQnA AMD ROCm deployment (#1122 ) Signed-off-by: Artem Astafev <a.astafev@datamonsters.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>	2024-11-15 17:24:06 +08:00
Ying Hu	dbf4ba03fa	Update AgentQnA README.md for refactor doc structure (#1146 ) Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>	2024-11-15 16:30:13 +08:00
XinyaoWa	4f96d9e605	vllm hpu fix version for bug fix (#1142 ) Signed-off-by: Xinyao Wang <xinyao.wang@intel.com>	2024-11-15 15:12:53 +08:00
Ying Hu	a8f4245384	Update README.md for usage experience (#1135 ) Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com>	2024-11-15 14:23:12 +08:00
Mingyuan Qi	096a37aacc	EdgeCraftRAG: Fix multiple issues (#1143 ) Signed-off-by: Mingyuan Qi <mingyuan.qi@intel.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>	2024-11-15 14:01:27 +08:00
rbrugaro	6f8fa6a689	Grag ex1.1 (#1123 ) Signed-off-by: Rita Brugarolas <rita.brugarolas.brufau@intel.com> Signed-off-by: theresa <theresa.shan@intel.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: theresa <theresa.shan@intel.com>	2024-11-15 13:17:06 +08:00
Letong Han	39f68d5d6b	Fix SearchQnA CI Issue (#1134 ) Signed-off-by: letonghan <letong.han@intel.com>	2024-11-15 10:01:27 +08:00
Louie Tsai	00d9bb6128	Enable vLLM Profiling for ChatQnA on Gaudi (#1128 ) Signed-off-by: Tsai, Louie <louie.tsai@intel.com>	2024-11-14 15:46:33 -08:00
Abolfazl Shahbazi	59b624c677	Fix minor documentation build issue (#1139 ) Signed-off-by: Abolfazl Shahbazi <12436063+ashahba@users.noreply.github.com>	2024-11-14 15:29:50 -08:00
chen, suyue	2b2c7ee2f5	upgrade setuptools version to fix CVE-2024-6345 (#999 ) Signed-off-by: chensuyue <suyue.chen@intel.com>	2024-11-14 14:57:16 +08:00
Hoong Tee, Yeoh	6b9a27dd83	DBQnA: Include workflow in README (#956 ) Signed-off-by: Yeoh, Hoong Tee <hoong.tee.yeoh@intel.com>	2024-11-14 14:05:28 +08:00
Yi Yao	5720cd45c0	Add benchmark launcher for AudioQnA (#981 ) Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>	2024-11-14 13:58:51 +08:00
XinyaoWa	73879d3cec	fix faq ui bug (#1118 ) Signed-off-by: Xinyao Wang <xinyao.wang@intel.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>	2024-11-14 10:00:30 +08:00
Lucas Melo	7c9ed04132	ChatQnA - Add Terraform and Ansible Modules information (#970 ) Signed-off-by: chensuyue <suyue.chen@intel.com> Signed-off-by: lucasmelogithub <lucas.melo@intel.com> Co-authored-by: chen, suyue <suyue.chen@intel.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Malini Bhandaru <malini.bhandaru@intel.com>	2024-11-13 11:42:12 -08:00
lvliang-intel	9ff7df9202	Use fixed version of TEI Gaudi for stability (#1101 ) Signed-off-by: lvliang-intel <liang1.lv@intel.com> Co-authored-by: Malini Bhandaru <malini.bhandaru@intel.com>	2024-11-13 10:45:50 -08:00
Abolfazl Shahbazi	b5f95f735e	Fix missing end of file chars (#1106 ) Signed-off-by: Abolfazl Shahbazi <12436063+ashahba@users.noreply.github.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>	2024-11-13 09:40:53 -08:00
chen, suyue	393367e9f1	Fix left issue of tgi version update (#1121 ) Signed-off-by: chensuyue <suyue.chen@intel.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>	2024-11-13 15:42:42 +08:00
Louie Tsai	7adbba6add	Enable vLLM Profiling for ChatQnA (#1124 )	2024-11-13 11:26:31 +08:00
pallavijaini0525	0d52c2f003	Pinecone update to Readme and docker compose for ChatQnA (#540 ) Signed-off-by: pallavi jaini <pallavi.jaini@intel.com> Signed-off-by: AI Workloads <aigoldrush1@g2-r3-2.iind.intel.com> Signed-off-by: Pallavi Jaini <pallavi,jaini@intel.com> Signed-off-by: Pallavi Jaini <pallavi.jaini@intel.com> Signed-off-by: root <root@test-pjaini.535545281608.us-region-2.idcservice.net> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: AI Workloads <aigoldrush1@g2-r3-2.iind.intel.com> Co-authored-by: Pallavi Jaini <pallavi,jaini@intel.com> Co-authored-by: root <root@test-pjaini.535545281608.us-region-2.idcservice.net> Co-authored-by: chen, suyue <suyue.chen@intel.com>	2024-11-13 09:32:37 +08:00
lvliang-intel	1ff85f6a85	Upgrade TGI Gaudi version to v2.0.6 (#1088 ) Signed-off-by: lvliang-intel <liang1.lv@intel.com> Co-authored-by: chen, suyue <suyue.chen@intel.com>	2024-11-12 14:38:22 +08:00
bjzhjing	f7a7f8aa3f	Fix typo (#1117 ) Signed-off-by: Cathy Zhang <cathy.zhang@intel.com>	2024-11-12 09:54:05 +08:00
lvliang-intel	e3187be819	Update ChatQnA manifests using always pull image policy (#1100 ) Signed-off-by: lvliang-intel <liang1.lv@intel.com>	2024-11-11 14:37:14 +08:00
Sihan Chen	abd9d12937	Fix non stream case (#1115 ) Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>	2024-11-11 14:18:42 +08:00
bjzhjing	a7353bbaa4	Refine performance directory (#1017 ) Signed-off-by: Cathy Zhang <cathy.zhang@intel.com>	2024-11-11 13:58:46 +08:00
Letong Han	aa314f6757	[Readme] Update ChatQnA Readme for LLM Endpoint (#1086 ) Signed-off-by: letonghan <letong.han@intel.com>	2024-11-11 13:53:06 +08:00