Add final README.md and set_env.sh script for quickstart review. Previous pull request was 1595. (#1662)

Signed-off-by: Edwards, James A <jaedwards@habana.ai>
Co-authored-by: Edwards, James A <jaedwards@habana.ai>
This commit is contained in:
James Edwards
2025-03-14 18:05:01 -05:00
committed by GitHub
parent 7159ce3731
commit 527b146a80
2 changed files with 293 additions and 542 deletions

View File

@@ -1,94 +1,106 @@
# Build MegaService of ChatQnA on Gaudi
# Example ChatQnA deployments on an Intel® Gaudi® Platform
This document outlines the deployment process for a ChatQnA application utilizing the [GenAIComps](https://github.com/opea-project/GenAIComps.git) microservice pipeline on Intel Gaudi server. The steps include Docker image creation, container deployment via Docker Compose, and service execution to integrate microservices such as `embedding`, `retriever`, `rerank`, and `llm`.
This example covers the single-node on-premises deployment of the ChatQnA example using OPEA components. There are various ways to enable ChatQnA, but this example will focus on four options available for deploying the ChatQnA pipeline to Intel® Gaudi® AI Accelerators. This example begins with a Quick Start section and then documents how to modify deployments, leverage new models and configure the number of allocated devices.
The default pipeline deploys with vLLM as the LLM serving component and leverages the rerank component. It also provides options for not using rerank in the pipeline, leveraging guardrails, or using a TGI backend for the LLM microservice; refer to the [start-all-the-services-docker-containers](#start-all-the-services-docker-containers) section on this page.
This example includes the following sections:
Quick Start:
- [ChatQnA Quick Start Deployment](#chatqna-quick-start-deployment): Demonstrates how to quickly deploy a ChatQnA application/pipeline on an Intel® Gaudi® platform.
- [ChatQnA Docker Compose Files](#chatqna-docker-compose-files): Describes some example deployments and their docker compose files.
- [ChatQnA Service Configuration](#chatqna-service-configuration): Describes the services and possible configuration changes.
1. Set up the environment variables.
2. Run Docker Compose.
3. Consume the ChatQnA Service.
**Note** This example requires access to a properly installed Intel® Gaudi® platform with a functional Docker service configured to use the habanalabs-container-runtime. Please consult the [Intel® Gaudi® software Installation Guide](https://docs.habana.ai/en/v1.20.0/Installation_Guide/Driver_Installation.html) for more information.
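Before deploying, it can be helpful to confirm the platform is ready. A minimal sanity check, assuming the Gaudi driver and habanalabs-container-runtime have been installed per the guide above:
```bash
# Confirm the Gaudi devices and driver are visible on the host
hl-smi
# Confirm Docker is configured with the habana container runtime
docker info 2>/dev/null | grep -i habana
```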
Note: The default LLM is `meta-llama/Meta-Llama-3-8B-Instruct`. Before deploying the application, make sure you have either requested and been granted access to it on [Huggingface](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) or downloaded the model locally from [ModelScope](https://www.modelscope.cn/models). We now support running the latest DeepSeek models, including [deepseek-ai/DeepSeek-R1-Distill-Llama-70B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-70B) and [deepseek-ai/DeepSeek-R1-Distill-Qwen-32B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B), on Gaudi accelerators. To run `deepseek-ai/DeepSeek-R1-Distill-Llama-70B`, update the `LLM_MODEL_ID` and set `NUM_CARDS` to 8 in the [set_env.sh](./set_env.sh) script. To run `deepseek-ai/DeepSeek-R1-Distill-Qwen-32B`, update the `LLM_MODEL_ID` and set `NUM_CARDS` to 4 in the [set_env.sh](./set_env.sh) script.
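For example, to run the larger DeepSeek model, export the two values described in the note above before sourcing [set_env.sh](./set_env.sh) (or enter them at the script's prompts):
```bash
# DeepSeek-R1-Distill-Llama-70B requires NUM_CARDS=8 (see the note above)
export LLM_MODEL_ID="deepseek-ai/DeepSeek-R1-Distill-Llama-70B"
export NUM_CARDS=8
```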
## ChatQnA Quick Start Deployment
## Quick Start: 1. Set Up Environment Variables
This section describes how to quickly deploy and test the ChatQnA service manually on an Intel® Gaudi® platform. The basic steps are:
To set up environment variables for deploying ChatQnA services, follow these steps:
1. [Access the Code](#access-the-code)
2. [Generate a HuggingFace Access Token](#generate-a-huggingface-access-token)
3. [Configure the Deployment Environment](#configure-the-deployment-environment)
4. [Deploy the Services Using Docker Compose](#deploy-the-services-using-docker-compose)
5. [Check the Deployment Status](#check-the-deployment-status)
6. [Test the Pipeline](#test-the-pipeline)
7. [Cleanup the Deployment](#cleanup-the-deployment)
1. Set the required environment variables:
### Access the Code
```bash
# Example: host_ip="192.168.1.1"
export host_ip="External_Public_IP"
export HUGGINGFACEHUB_API_TOKEN="Your_Huggingface_API_Token"
```
Clone the GenAIExamples repository and access the ChatQnA Intel® Gaudi® platform Docker Compose files and supporting scripts:
2. If you are in a proxy environment, also set the proxy-related environment variables:
```bash
git clone https://github.com/opea-project/GenAIExamples.git
cd GenAIExamples/ChatQnA/docker_compose/intel/hpu/gaudi/
```
```bash
export http_proxy="Your_HTTP_Proxy"
export https_proxy="Your_HTTPs_Proxy"
# Example: no_proxy="localhost, 127.0.0.1, 192.168.1.1"
export no_proxy="Your_No_Proxy,chatqna-gaudi-ui-server,chatqna-gaudi-backend-server,dataprep-redis-service,tei-embedding-service,retriever,tei-reranking-service,tgi-service,vllm-service,guardrails"
```
Check out a released version, such as v1.2:
3. Set up other environment variables:
```bash
git checkout v1.2
```
```bash
source ./set_env.sh
```
### Generate a HuggingFace Access Token
4. Change Model for LLM serving
Some HuggingFace resources, such as certain models, are only accessible if you have an access token. If you do not already have a HuggingFace access token, you can create one by first creating an account by following the steps provided at [HuggingFace](https://huggingface.co/) and then generating a [user access token](https://huggingface.co/docs/transformers.js/en/guides/private#step-1-generating-a-user-access-token).
By default, Meta-Llama-3-8B-Instruct is used for LLM serving; the default model can be changed to any other validated LLM model.
Please pick a model from the [validated LLM models](https://github.com/opea-project/GenAIComps/tree/main/comps/llms/src/text-generation#validated-llm-models) table.
To change the default model defined in set_env.sh, either export LLM_MODEL_ID with the new model name or modify set_env.sh, and then repeat step 3.
For example, change to DeepSeek-R1-Distill-Qwen-32B using the following command.
### Configure the Deployment Environment
```bash
export LLM_MODEL_ID="deepseek-ai/DeepSeek-R1-Distill-Qwen-32B"
```
To set up environment variables for deploying ChatQnA services, source the _set_env.sh_ script in this directory:
Please also check [required gaudi cards for different models](https://github.com/opea-project/GenAIComps/tree/main/comps/llms/src/text-generation#system-requirements-for-llm-models) for new models.
It might be necessary to increase the number of Gaudi cards for the model by exporting NUM_CARDS with the required value or by modifying set_env.sh, and then repeating step 3. For example, increase the number of Gaudi cards for DeepSeek-R1-Distill-Qwen-32B using the following command:
```bash
source ./set_env.sh
```
```bash
export NUM_CARDS=4
```
The _set_env.sh_ script will prompt for required and optional environment variables used to configure the ChatQnA services. If a value is not entered, the script will use a default. It will also generate a _.env_ file capturing the chosen configuration. Consult the [ChatQnA Service Configuration](#chatqna-service-configuration) section for information on how service-specific configuration parameters affect deployments.
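Since the generated _.env_ file consists of `export` statements, it can also be sourced in a later shell session to restore the captured configuration without answering the prompts again (a convenience sketch, assuming _set_env.sh_ has already been run once in this directory):
```bash
# Reuse the configuration captured by set_env.sh
source ./.env
echo "LLM: ${LLM_MODEL_ID}, Gaudi cards: ${NUM_CARDS}"
```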
## Quick Start: 2. Run Docker Compose
### Deploy the Services Using Docker Compose
To deploy the ChatQnA services, execute the `docker compose up` command with the appropriate arguments. For a default deployment, execute:
```bash
docker compose up -d
```
To enable Open Telemetry Tracing, the compose.telemetry.yaml file needs to be merged with the default compose.yaml file.
The ChatQnA docker images should automatically be downloaded from the `OPEA registry` and deployed on the Intel® Gaudi® Platform:
> NOTE: To get the supported Grafana Dashboard, run download_opea_dashboard.sh as shown in the commands below.
```bash
./grafana/dashboards/download_opea_dashboard.sh
docker compose -f compose.yaml -f compose.telemetry.yaml up -d
```
```
[+] Running 10/10
✔ Network gaudi_default Created 0.1s
✔ Container tei-reranking-gaudi-server Started 0.7s
✔ Container vllm-gaudi-server Started 0.7s
✔ Container tei-embedding-gaudi-server Started 0.3s
✔ Container redis-vector-db Started 0.6s
✔ Container retriever-redis-server Started 1.1s
✔ Container dataprep-redis-server Started 1.1s
✔ Container chatqna-gaudi-backend-server Started 1.3s
✔ Container chatqna-gaudi-ui-server Started 1.7s
✔ Container chatqna-gaudi-nginx-server Started 1.9s
```
It will automatically download the docker images from Docker Hub:
```bash
docker pull opea/chatqna:latest
docker pull opea/chatqna-ui:latest
```
### Check the Deployment Status
After running docker compose, check if all the containers launched via docker compose have started:
```bash
docker ps -a
```
In the following cases, you can build the docker images from source yourself.
For the default deployment, the following 10 containers should have started:
- The docker image failed to download.
```
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
8365b0a6024d opea/nginx:latest "/docker-entrypoint.…" 2 minutes ago Up 2 minutes 0.0.0.0:80->80/tcp, :::80->80/tcp chatqna-gaudi-nginx-server
f090fe262c74 opea/chatqna-ui:latest "docker-entrypoint.s…" 2 minutes ago Up 2 minutes 0.0.0.0:5173->5173/tcp, :::5173->5173/tcp chatqna-gaudi-ui-server
ec97d7651c96 opea/chatqna:latest "python chatqna.py" 2 minutes ago Up 2 minutes 0.0.0.0:8888->8888/tcp, :::8888->8888/tcp chatqna-gaudi-backend-server
a61fb7dc4fae opea/dataprep:latest "sh -c 'python $( [ …" 2 minutes ago Up 2 minutes 0.0.0.0:6007->5000/tcp, [::]:6007->5000/tcp dataprep-redis-server
d560c232b120 opea/retriever:latest "python opea_retriev…" 2 minutes ago Up 2 minutes 0.0.0.0:7000->7000/tcp, :::7000->7000/tcp retriever-redis-server
a1d7ca2d3787 ghcr.io/huggingface/tei-gaudi:1.5.0 "text-embeddings-rou…" 2 minutes ago Up 2 minutes 0.0.0.0:8808->80/tcp, [::]:8808->80/tcp tei-reranking-gaudi-server
9a9f3fd4fd4c opea/vllm-gaudi:latest "python3 -m vllm.ent…" 2 minutes ago Exited (1) 2 minutes ago vllm-gaudi-server
1ab9bbdf5182 redis/redis-stack:7.2.0-v9 "/entrypoint.sh" 2 minutes ago Up 2 minutes 0.0.0.0:6379->6379/tcp, :::6379->6379/tcp, 0.0.0.0:8001->8001/tcp, :::8001->8001/tcp redis-vector-db
9ee0789d819e ghcr.io/huggingface/text-embeddings-inference:cpu-1.5 "text-embeddings-rou…" 2 minutes ago Up 2 minutes 0.0.0.0:8090->80/tcp, [::]:8090->80/tcp tei-embedding-gaudi-server
```
- You want to use a specific version of the docker image.
### Test the Pipeline
Please refer to the 'Build Docker Images' section below.
## Quick Start: 3. Consume the ChatQnA Service
Once the ChatQnA services are running, test the pipeline using the following command:
```bash
curl http://${host_ip}:8888/v1/chatqna \
@@ -98,504 +110,171 @@ curl http://${host_ip}:8888/v1/chatqna \
}'
```
## 🚀 Build Docker Images
**Note** The value of _host_ip_ was set using the _set_env.sh_ script and can be found in the _.env_ file.
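For reference, a complete request mirroring the MegaService validation example later in this document looks like:
```bash
curl http://${host_ip}:8888/v1/chatqna \
    -H "Content-Type: application/json" \
    -d '{"messages": "What is the revenue of Nike in 2023?"}'
```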
First of all, you need to build the Docker images locally. This step can be skipped once the Docker images are published to Docker Hub.
### Cleanup the Deployment
```bash
git clone https://github.com/opea-project/GenAIComps.git
cd GenAIComps
```
To stop the containers associated with the deployment, execute the following command:
```bash
docker compose -f compose.yaml down
```
### 1. Build Retriever Image
```bash
docker build --no-cache -t opea/retriever:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/retrievers/src/Dockerfile .
```
```
[+] Running 10/10
✔ Container chatqna-gaudi-nginx-server Removed 10.5s
✔ Container dataprep-redis-server Removed 10.5s
✔ Container chatqna-gaudi-ui-server Removed 10.3s
✔ Container chatqna-gaudi-backend-server Removed 10.3s
✔ Container vllm-gaudi-server Removed 0.0s
✔ Container retriever-redis-server Removed 10.4s
✔ Container tei-reranking-gaudi-server Removed 2.0s
✔ Container tei-embedding-gaudi-server Removed 1.2s
✔ Container redis-vector-db Removed 0.4s
✔ Network gaudi_default Removed 0.4s
```
### 2. Build Dataprep Image
All the ChatQnA containers will be stopped and then removed on completion of the "down" command.
## ChatQnA Docker Compose Files
In the context of deploying a ChatQnA pipeline on an Intel® Gaudi® platform, the allocation and utilization of Gaudi devices across different services are important considerations for optimizing performance and resource efficiency. Each of the four example deployments, defined by the example Docker compose yaml files, demonstrates a unique approach to leveraging Gaudi hardware, reflecting different priorities and operational strategies.
### compose.yaml - Default Deployment
The default deployment utilizes Gaudi devices primarily for the `vllm-service`, which handles large language model (LLM) tasks. This service is configured to maximize the use of Gaudi's capabilities, potentially allocating multiple devices to enhance parallel processing and throughput. The `tei-reranking-service` also uses Gaudi hardware (1 card), indicating a balanced approach where both LLM processing and reranking tasks benefit from Gaudi's performance enhancements.
| Service Name | Image Name | Gaudi Use |
| ---------------------------- | ----------------------------------------------------- | ------------ |
| redis-vector-db | redis/redis-stack:7.2.0-v9 | No |
| dataprep-redis-service | opea/dataprep:latest | No |
| tei-embedding-service | ghcr.io/huggingface/text-embeddings-inference:cpu-1.5 | No |
| retriever | opea/retriever:latest | No |
| tei-reranking-service | ghcr.io/huggingface/tei-gaudi:1.5.0 | 1 card |
| vllm-service | opea/vllm-gaudi:latest | Configurable |
| chatqna-gaudi-backend-server | opea/chatqna:latest | No |
| chatqna-gaudi-ui-server | opea/chatqna-ui:latest | No |
| chatqna-gaudi-nginx-server | opea/nginx:latest | No |
### compose_tgi.yaml - TGI Deployment
The TGI (Text Generation Inference) deployment and the default deployment differ primarily in their service configurations and specific focus on handling large language models (LLMs). The TGI deployment includes a unique `tgi-service`, which utilizes the `ghcr.io/huggingface/tgi-gaudi:2.0.6` image and is specifically configured to run on Gaudi hardware. This service is designed to handle LLM tasks with optimizations such as `ENABLE_HPU_GRAPH` and `USE_FLASH_ATTENTION`. The `chatqna-gaudi-backend-server` in the TGI deployment depends on the `tgi-service`, whereas in the default deployment, it relies on the `vllm-service`.
| Service Name | Image Name | Gaudi Specific |
| ---------------------------- | ----------------------------------------------------- | -------------- |
| redis-vector-db | redis/redis-stack:7.2.0-v9 | No |
| dataprep-redis-service | opea/dataprep:latest | No |
| tei-embedding-service | ghcr.io/huggingface/text-embeddings-inference:cpu-1.5 | No |
| retriever | opea/retriever:latest | No |
| tei-reranking-service | ghcr.io/huggingface/tei-gaudi:1.5.0 | 1 card |
| **tgi-service** | ghcr.io/huggingface/tgi-gaudi:2.0.6 | Configurable |
| chatqna-gaudi-backend-server | opea/chatqna:latest | No |
| chatqna-gaudi-ui-server | opea/chatqna-ui:latest | No |
| chatqna-gaudi-nginx-server | opea/nginx:latest | No |
This deployment may allocate more Gaudi resources to the tgi-service to optimize LLM tasks depending on the specific configuration and workload requirements.
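As an illustration of how a service acquires Gaudi devices, the Gaudi-enabled services pair the `habana` container runtime with the `HABANA_VISIBLE_DEVICES` environment variable, mirroring the `docker run` flags shown later in this README. The fragment below is a hypothetical sketch, not a copy of compose_tgi.yaml; consult the compose file itself for the authoritative definition:
```yaml
# Hypothetical fragment showing the Gaudi-specific settings of an LLM service
tgi-service:
  image: ghcr.io/huggingface/tgi-gaudi:2.0.6
  runtime: habana
  environment:
    - HABANA_VISIBLE_DEVICES=all # or a comma-separated list of card indices
    - OMPI_MCA_btl_vader_single_copy_mechanism=none
  cap_add:
    - sys_nice
  ipc: host
```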
### compose_without_rerank.yaml - No ReRank Deployment
The _compose_without_rerank.yaml_ Docker Compose file is distinct from the default deployment primarily due to the exclusion of the reranking service. In this version, the `tei-reranking-service`, which is typically responsible for providing reranking capabilities for text embeddings and is configured to run on Gaudi hardware, is absent. This omission simplifies the service architecture by removing a layer of processing that would otherwise enhance the ranking of text embeddings. Consequently, the `chatqna-gaudi-backend-server` in this deployment uses a specialized image, `opea/chatqna-without-rerank:latest`, tailored to function without the reranking feature, and its dependencies are adjusted accordingly. This streamlined setup focuses on core operations without the additional processing layer provided by reranking, potentially making it more efficient for scenarios where reranking is not essential and freeing Intel® Gaudi® accelerators for other tasks.
| Service Name | Image Name | Gaudi Specific |
| ---------------------------- | ----------------------------------------------------- | -------------- |
| redis-vector-db | redis/redis-stack:7.2.0-v9 | No |
| dataprep-redis-service | opea/dataprep:latest | No |
| tei-embedding-service | ghcr.io/huggingface/text-embeddings-inference:cpu-1.5 | No |
| retriever | opea/retriever:latest | No |
| vllm-service | opea/vllm-gaudi:latest | Configurable |
| chatqna-gaudi-backend-server | **opea/chatqna-without-rerank:latest** | No |
| chatqna-gaudi-ui-server | opea/chatqna-ui:latest | No |
| chatqna-gaudi-nginx-server | opea/nginx:latest | No |
This setup might allow for more Gaudi devices to be dedicated to the `vllm-service`, enhancing LLM processing capabilities and accommodating larger models. However, it also means that the benefits of reranking are sacrificed, which could impact the overall quality of the pipeline's output.
### compose_guardrails.yaml - Guardrails Deployment
The _compose_guardrails.yaml_ Docker Compose file introduces enhancements over the default deployment by incorporating additional services focused on safety and ChatQnA response control. Notably, it includes the `tgi-guardrails-service` and `guardrails` services. The `tgi-guardrails-service` uses the `ghcr.io/huggingface/tgi-gaudi:2.0.6` image and is configured to run on Gaudi hardware, providing functionality to manage input constraints and ensure safe operations within defined limits. The guardrails service, using the `opea/guardrails:latest` image, acts as a safety layer that interfaces with the `tgi-guardrails-service` to enforce safety protocols and manage interactions with the large language model (LLM). Additionally, the `chatqna-gaudi-backend-server` is updated to use the `opea/chatqna-guardrails:latest` image, indicating its design to integrate with these new guardrail services. This backend server now depends on the `tgi-guardrails-service` and `guardrails`, alongside existing dependencies like `redis-vector-db`, `tei-embedding-service`, `retriever`, `tei-reranking-service`, and `vllm-service`. The environment configurations for the backend are also updated to include settings for the guardrail services.
| Service Name | Image Name | Gaudi Specific | Uses LLM |
| ---------------------------- | ----------------------------------------------------- | -------------- | -------- |
| redis-vector-db | redis/redis-stack:7.2.0-v9 | No | No |
| dataprep-redis-service | opea/dataprep:latest | No | No |
| _tgi-guardrails-service_ | ghcr.io/huggingface/tgi-gaudi:2.0.6 | 1 card | Yes |
| _guardrails_ | opea/guardrails:latest | No | No |
| tei-embedding-service | ghcr.io/huggingface/text-embeddings-inference:cpu-1.5 | No | No |
| retriever | opea/retriever:latest | No | No |
| tei-reranking-service | ghcr.io/huggingface/tei-gaudi:1.5.0 | 1 card | No |
| vllm-service | opea/vllm-gaudi:latest | Configurable | Yes |
| chatqna-gaudi-backend-server | opea/chatqna-guardrails:latest | No | No |
| chatqna-gaudi-ui-server | opea/chatqna-ui:latest | No | No |
| chatqna-gaudi-nginx-server | opea/nginx:latest | No | No |
The deployment with guardrails introduces additional Gaudi-specific services, such as the `tgi-guardrails-service`, which necessitates careful consideration of Gaudi allocation. This deployment aims to balance safety and performance, potentially requiring a strategic distribution of Gaudi devices between the guardrail services and the LLM tasks to maintain both operational safety and efficiency.
### Telemetry Enablement - compose.telemetry.yaml and compose_tgi.telemetry.yaml
The telemetry Docker Compose files are incremental configurations designed to enhance existing deployments by integrating telemetry metrics, thereby providing valuable insights into the performance and behavior of certain services. This setup modifies specific services, such as the `tgi-service`, `tei-embedding-service` and `tei-reranking-service`, by adding a command-line argument that specifies an OpenTelemetry Protocol (OTLP) endpoint. This enables these services to export telemetry data to a designated endpoint, facilitating detailed monitoring and analysis. The `chatqna-gaudi-backend-server` is configured with environment variables that enable telemetry and specify the telemetry endpoint, ensuring that the backend server's operations are also monitored.
Additionally, the telemetry files introduce a new service, `jaeger`, which uses the `jaegertracing/all-in-one:latest` image. Jaeger is a powerful open-source tool for tracing and monitoring distributed systems, offering a user-friendly interface for visualizing traces and understanding the flow of requests through the system.
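The OTLP endpoints used by these services are derived in _set_env.sh_ from the host's outbound IP; the relevant excerpt (matching the script at the end of this change) is:
```bash
# Resolve the address Jaeger listens on, then derive the OTLP endpoints
export JAEGER_IP=$(ip route get 8.8.8.8 | grep -oP 'src \K[^ ]+')
export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=grpc://$JAEGER_IP:4317
export TELEMETRY_ENDPOINT=http://$JAEGER_IP:4318/v1/traces
```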
To enable Open Telemetry Tracing, the compose.telemetry.yaml file needs to be merged with the default compose.yaml file at deployment time:
```bash
docker build --no-cache -t opea/dataprep:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/dataprep/src/Dockerfile .
```
### 3. Build Guardrails Docker Image (Optional)
To fortify AI initiatives in production, the Guardrails microservice can secure model inputs and outputs, helping build trustworthy, safe, and secure LLM-based applications.
```bash
docker build -t opea/guardrails:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/guardrails/src/guardrails/Dockerfile .
```
### 4. Build MegaService Docker Image
1. MegaService with Rerank
To construct the MegaService with Rerank, we utilize the [GenAIComps](https://github.com/opea-project/GenAIComps.git) microservice pipeline within the `chatqna.py` Python script. Build the MegaService Docker image using the command below:
```bash
git clone https://github.com/opea-project/GenAIExamples.git
cd GenAIExamples/ChatQnA
docker build --no-cache -t opea/chatqna:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f Dockerfile .
```
2. MegaService with Guardrails
If you want to enable the guardrails microservice in the pipeline, use the command below instead:
```bash
git clone https://github.com/opea-project/GenAIExamples.git
cd GenAIExamples/ChatQnA/
docker build --no-cache -t opea/chatqna-guardrails:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f Dockerfile.guardrails .
```
3. MegaService without Rerank
To construct the MegaService without Rerank, we utilize the [GenAIComps](https://github.com/opea-project/GenAIComps.git) microservice pipeline within the `chatqna_without_rerank.py` Python script. Build the MegaService Docker image with the command below:
```bash
git clone https://github.com/opea-project/GenAIExamples.git
cd GenAIExamples/ChatQnA
docker build --no-cache -t opea/chatqna-without-rerank:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f Dockerfile.without_rerank .
```
### 5. Build UI Docker Image
Construct the frontend Docker image using the command below:
```bash
cd GenAIExamples/ChatQnA/ui
docker build --no-cache -t opea/chatqna-ui:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f ./docker/Dockerfile .
```
### 6. Build Conversational React UI Docker Image (Optional)
Build the frontend Docker image that enables a conversational experience with the ChatQnA MegaService using the command below:
**Export the value of the public IP address of your Gaudi node to the `host_ip` environment variable**
```bash
cd GenAIExamples/ChatQnA/ui
docker build --no-cache -t opea/chatqna-conversation-ui:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f ./docker/Dockerfile.react .
```
### 7. Build Nginx Docker Image
```bash
cd GenAIComps
docker build -t opea/nginx:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/third_parties/nginx/src/Dockerfile .
```
Then run the command `docker images`; you should see the following 5 Docker images:
- `opea/retriever:latest`
- `opea/dataprep:latest`
- `opea/chatqna:latest`
- `opea/chatqna-ui:latest`
- `opea/nginx:latest`
If Conversation React UI is built, you will find one more image:
- `opea/chatqna-conversation-ui:latest`
If Guardrails docker image is built, you will find one more image:
- `opea/guardrails:latest`
## 🚀 Start MicroServices and MegaService
### Required Models
By default, the embedding, reranking and LLM models are set to a default value as listed below:
| Service | Model |
| --------- | ----------------------------------- |
| Embedding | BAAI/bge-base-en-v1.5 |
| Reranking | BAAI/bge-reranker-base |
| LLM | meta-llama/Meta-Llama-3-8B-Instruct |
Change the `xxx_MODEL_ID` values below to suit your needs.
Users in China who are unable to download models directly from Huggingface can use [ModelScope](https://www.modelscope.cn/models) or a Huggingface mirror to download models. vLLM/TGI can load the models either online or offline, as described below:
1. Online
```bash
export HF_TOKEN=${your_hf_token}
export HF_ENDPOINT="https://hf-mirror.com"
model_name="meta-llama/Meta-Llama-3-8B-Instruct"
# Start vLLM LLM Service
docker run -p 8007:80 -v ./data:/data --name vllm-gaudi-server -e HF_ENDPOINT=$HF_ENDPOINT -e http_proxy=$http_proxy -e https_proxy=$https_proxy --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none -e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN -e VLLM_TORCH_PROFILER_DIR="/mnt" --cap-add=sys_nice --ipc=host opea/vllm-gaudi:latest --model $model_name --tensor-parallel-size 1 --host 0.0.0.0 --port 80 --block-size 128 --max-num-seqs 256 --max-seq_len-to-capture 2048
# Start TGI LLM Service
docker run -p 8005:80 -v ./data:/data --name tgi-gaudi-server -e HF_ENDPOINT=$HF_ENDPOINT -e http_proxy=$http_proxy -e https_proxy=$https_proxy --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none -e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN -e ENABLE_HPU_GRAPH=true -e LIMIT_HPU_GRAPH=true -e USE_FLASH_ATTENTION=true -e FLASH_ATTENTION_RECOMPUTE=true --cap-add=sys_nice --ipc=host ghcr.io/huggingface/tgi-gaudi:2.0.6 --model-id $model_name --max-input-tokens 1024 --max-total-tokens 2048
```
2. Offline
- Search your model name in ModelScope. For example, check [this page](https://modelscope.cn/models/LLM-Research/Meta-Llama-3-8B-Instruct/files) for model `Meta-Llama-3-8B-Instruct`.
- Click on `Download this model` button, and choose one way to download the model to your local path `/path/to/model`.
- Run the following command to start the LLM service.
```bash
export HF_TOKEN=${your_hf_token}
export model_path="/path/to/model"
# Start vLLM LLM Service
docker run -p 8007:80 -v $model_path:/data --name vllm-gaudi-server --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none -e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN -e VLLM_TORCH_PROFILER_DIR="/mnt" --cap-add=sys_nice --ipc=host opea/vllm-gaudi:latest --model /data --tensor-parallel-size 1 --host 0.0.0.0 --port 80 --block-size 128 --max-num-seqs 256 --max-seq_len-to-capture 2048
# Start TGI LLM Service
docker run -p 8005:80 -v $model_path:/data --name tgi-gaudi-server --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none -e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN -e ENABLE_HPU_GRAPH=true -e LIMIT_HPU_GRAPH=true -e USE_FLASH_ATTENTION=true -e FLASH_ATTENTION_RECOMPUTE=true --cap-add=sys_nice --ipc=host ghcr.io/huggingface/tgi-gaudi:2.0.6 --model-id /data --max-input-tokens 1024 --max-total-tokens 2048
```
### Setup Environment Variables
1. Set the required environment variables:
```bash
# Example: host_ip="192.168.1.1"
export host_ip="External_Public_IP"
export HUGGINGFACEHUB_API_TOKEN="Your_Huggingface_API_Token"
# Example: NGINX_PORT=80
export NGINX_PORT=${your_nginx_port}
```
2. If you are in a proxy environment, also set the proxy-related environment variables:
```bash
export http_proxy="Your_HTTP_Proxy"
export https_proxy="Your_HTTPs_Proxy"
# Example: no_proxy="localhost, 127.0.0.1, 192.168.1.1"
export no_proxy="Your_No_Proxy,chatqna-gaudi-ui-server,chatqna-gaudi-backend-server,dataprep-redis-service,tei-embedding-service,retriever,tei-reranking-service,tgi-service,vllm-service,guardrails"
```
3. Set up other environment variables:
```bash
source ./set_env.sh
```
### Start all the services Docker Containers
```bash
cd GenAIExamples/ChatQnA/docker_compose/intel/hpu/gaudi/
```
If using vLLM as the LLM serving backend:
```bash
# Start ChatQnA with Rerank Pipeline
docker compose -f compose.yaml up -d
# Start ChatQnA without Rerank Pipeline
docker compose -f compose_without_rerank.yaml up -d
# Start ChatQnA with Rerank Pipeline and Open Telemetry Tracing
docker compose -f compose.yaml -f compose.telemetry.yaml up -d
```
If using TGI as the LLM serving backend:
For a TGI Deployment, this would become:
```bash
docker compose -f compose_tgi.yaml up -d
# Start ChatQnA with Open Telemetry Tracing
docker compose -f compose_tgi.yaml -f compose_tgi.telemetry.yaml up -d
```
If you want to enable the guardrails microservice in the pipeline, use the commands below instead:
## ChatQnA Service Configuration
```bash
cd GenAIExamples/ChatQnA/docker_compose/intel/hpu/gaudi/
docker compose -f compose_guardrails.yaml up -d
```
The table provides a comprehensive overview of the ChatQnA services utilized across the various deployments illustrated in the example Docker Compose files. Each row in the table represents a distinct service, detailing the images that can be used to enable it and a concise description of its function within the deployment architecture. These services collectively enable functionalities such as data storage and management, text embedding, retrieval, reranking, and large language model processing. Additionally, specialized services like `tgi-service` and `guardrails` are included to enhance text generation inference and ensure operational safety, respectively. The table also highlights the integration of telemetry through the `jaeger` service, which provides tracing and monitoring capabilities.
> **_NOTE:_** Users need at least two Gaudi cards to run the ChatQnA successfully.
| Service Name | Possible Image Names | Optional | Description |
| ---------------------------- | ----------------------------------------------------- | -------- | -------------------------------------------------------------------------------------------------- |
| redis-vector-db | redis/redis-stack:7.2.0-v9 | No | Acts as a Redis database for storing and managing data. |
| dataprep-redis-service | opea/dataprep:latest | No | Prepares data and interacts with the Redis database. |
| tei-embedding-service | ghcr.io/huggingface/text-embeddings-inference:cpu-1.5 | No | Provides text embedding services, often using Hugging Face models. |
| retriever | opea/retriever:latest | No | Retrieves data from the Redis database and interacts with embedding services. |
| tei-reranking-service | ghcr.io/huggingface/tei-gaudi:1.5.0 | Yes | Reranks text embeddings, typically using Gaudi hardware for enhanced performance. |
| vllm-service | opea/vllm-gaudi:latest | No | Handles large language model (LLM) tasks, utilizing Gaudi hardware. |
| tgi-service | ghcr.io/huggingface/tgi-gaudi:2.0.6 | Yes | Specific to the TGI deployment, focuses on text generation inference using Gaudi hardware. |
| tgi-guardrails-service | ghcr.io/huggingface/tgi-gaudi:2.0.6 | Yes | Provides guardrails functionality, ensuring safe operations within defined limits. |
| guardrails | opea/guardrails:latest | Yes | Acts as a safety layer, interfacing with the `tgi-guardrails-service` to enforce safety protocols. |
| chatqna-gaudi-backend-server | opea/chatqna:latest | No | Serves as the backend for the ChatQnA application, with variations depending on the deployment. |
| | opea/chatqna-without-rerank:latest | | |
| | opea/chatqna-guardrails:latest | | |
| chatqna-gaudi-ui-server | opea/chatqna-ui:latest | No | Provides the user interface for the ChatQnA application. |
| chatqna-gaudi-nginx-server | opea/nginx:latest | No | Acts as a reverse proxy, managing traffic between the UI and backend services. |
| jaeger | jaegertracing/all-in-one:latest | Yes | Provides tracing and monitoring capabilities for distributed systems. |
### Validate MicroServices and MegaService
Many of these services provide the pipeline support required for all ChatQnA deployments and are not specific to the Intel® Gaudi® platform. Therefore, while the `redis-vector-db`, `dataprep-redis-service`, `retriever`, `chatqna-gaudi-backend-server`, `chatqna-gaudi-ui-server`, `chatqna-gaudi-nginx-server`, and `jaeger` services are configurable, they will not be covered by this example, which focuses on the configuration specifics of the services modified to support the Intel® Gaudi® platform.
Follow the instructions to validate MicroServices.
For validation details, please refer to [how-to-validate_service](./how_to_validate_service.md).
### vllm-service & tgi-service
1. TEI Embedding Service
In the configuration of the `vllm-service` and the `tgi-service`, two variables play a primary role in determining the service's performance and functionality: `LLM_MODEL_ID` and `NUM_CARDS`. Both can be set using the appropriate environment variables. The `LLM_MODEL_ID` parameter specifies the particular large language model (LLM) that the service will utilize, effectively determining the capabilities and characteristics of the language processing tasks it can perform. This model identifier ensures that the service is aligned with the specific requirements of the application, whether it involves text generation, comprehension, or other language-related tasks. The `NUM_CARDS` parameter dictates the number of Gaudi devices allocated to the service. A higher number of Gaudi devices can enhance parallel processing capabilities, reduce latency, and improve throughput.
```bash
curl ${host_ip}:8090/embed \
-X POST \
-d '{"inputs":"What is Deep Learning?"}' \
-H 'Content-Type: application/json'
```
However, developers need to be aware of the models that have been tested with the respective service image supporting the `vllm-service` and `tgi-service`. For example, the documentation for the OPEA GenAIComps v1.0 release specifies the list of [validated LLM models](https://github.com/opea-project/GenAIComps/blob/v1.0/comps/llms/text-generation/README.md#validated-llm-models) for each Gaudi-enabled service image. Specific models may have stringent requirements on the number of Intel® Gaudi® devices required to support them.
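Putting `LLM_MODEL_ID` and `NUM_CARDS` together, switching the deployment to another validated model is a matter of re-exporting both values and re-running Docker Compose (a sketch using values documented earlier in this README):
```bash
# Redeploy with DeepSeek-R1-Distill-Qwen-32B on 4 Gaudi cards
export LLM_MODEL_ID="deepseek-ai/DeepSeek-R1-Distill-Qwen-32B"
export NUM_CARDS=4
docker compose -f compose.yaml up -d # compose reads the new values from the environment
```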
2. Retriever Microservice
#### Deepseek Model Support for Intel® Gaudi® Platform ChatQnA pipeline
To consume the retriever microservice, you need to generate a mock embedding vector with a Python script. The length of the embedding vector
is determined by the embedding model.
Here we use the model `EMBEDDING_MODEL_ID="BAAI/bge-base-en-v1.5"`, whose vector size is 768.
ChatQnA now supports running the latest DeepSeek models, including [deepseek-ai/DeepSeek-R1-Distill-Llama-70B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-70B) and [deepseek-ai/DeepSeek-R1-Distill-Qwen-32B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B) on Gaudi accelerators. To run `deepseek-ai/DeepSeek-R1-Distill-Llama-70B`, set the `LLM_MODEL_ID` appropriately and the `NUM_CARDS` to 8. To run `deepseek-ai/DeepSeek-R1-Distill-Qwen-32B`, update the `LLM_MODEL_ID` appropriately and set the `NUM_CARDS` to 4.
Check the vector dimension of your embedding model and set the dimension of `your_embedding` equal to it (see the sketch after the retrieval example below).
### tei-embedding-service & tei-reranking-service
```bash
export your_embedding=$(python3 -c "import random; embedding = [random.uniform(-1, 1) for _ in range(768)]; print(embedding)")
curl http://${host_ip}:7000/v1/retrieval \
-X POST \
-d "{\"text\":\"test\",\"embedding\":${your_embedding}}" \
-H 'Content-Type: application/json'
```
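If the dimension is unknown, one way to check it empirically is to embed a test string with the running `tei-embedding-service` (exposed on port 8090 in the validation step above) and count the returned values; a sketch:
```bash
# The /embed endpoint returns a list of embedding vectors;
# the length of the first vector is the model's embedding dimension.
curl -s ${host_ip}:8090/embed \
    -X POST \
    -d '{"inputs":"What is Deep Learning?"}' \
    -H 'Content-Type: application/json' |
    python3 -c "import sys, json; print(len(json.load(sys.stdin)[0]))"
```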
The `ghcr.io/huggingface/text-embeddings-inference:cpu-1.5` image supporting the `tei-embedding-service` and `tei-reranking-service` depends on the `EMBEDDING_MODEL_ID` and `RERANK_MODEL_ID` environment variables, respectively, to specify the embedding model and reranking model used for converting text into vector representations and rankings. This choice impacts the quality and relevance of the embeddings and rerankings for various applications. Unlike the `vllm-service`, the `tei-embedding-service` and `tei-reranking-service` each typically acquire only one Gaudi device and do not use the `NUM_CARDS` parameter; embedding and reranking tasks generally do not require extensive parallel processing, and one Gaudi device per service is appropriate. The list of [supported embedding and reranking models](https://github.com/huggingface/tei-gaudi?tab=readme-ov-file#supported-models) can be found on the [huggingface/tei-gaudi](https://github.com/huggingface/tei-gaudi?tab=readme-ov-file#supported-models) website.
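For example, both models can be overridden before deployment; the values below are the documented defaults, and any supported model from the list above follows the same pattern:
```bash
# Defaults from set_env.sh; substitute any supported TEI model
export EMBEDDING_MODEL_ID="BAAI/bge-base-en-v1.5"
export RERANK_MODEL_ID="BAAI/bge-reranker-base"
```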
3. TEI Reranking Service
### tgi-guardrails-service
> Skip for ChatQnA without Rerank pipeline
The `tgi-guardrails-service` uses the `GUARDRAILS_MODEL_ID` parameter to select a [supported model](https://github.com/huggingface/tgi-gaudi?tab=readme-ov-file#tested-models-and-configurations) for the associated `ghcr.io/huggingface/tgi-gaudi:2.0.6` image. Like the `tei-embedding-service` and `tei-reranking-service` services, it doesn't use the `NUM_CARDS` parameter.
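A hedged example of selecting a guardrails model and starting the guardrails pipeline; the model id below is a placeholder, so substitute one from the supported-models list linked above:
```bash
# Placeholder id -- pick a model from the tgi-gaudi supported-models list
export GUARDRAILS_MODEL_ID="your_guardrails_model_id"
docker compose -f compose_guardrails.yaml up -d
```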
```bash
curl http://${host_ip}:8808/rerank \
-X POST \
-d '{"query":"What is Deep Learning?", "texts": ["Deep Learning is not...", "Deep learning is..."]}' \
-H 'Content-Type: application/json'
```
## Conclusion
4. LLM backend Service
In examining the various services and configurations across different deployments, developers should gain a comprehensive understanding of how each component contributes to the overall functionality and performance of a ChatQnA pipeline on an Intel® Gaudi® platform. Key services such as the `vllm-service`, `tei-embedding-service`, `tei-reranking-service`, and `tgi-guardrails-service` each consume Gaudi accelerators, leveraging specific models and hardware resources to optimize their respective tasks. The `LLM_MODEL_ID`, `EMBEDDING_MODEL_ID`, `RERANK_MODEL_ID`, and `GUARDRAILS_MODEL_ID` parameters specify the models used, directly impacting the quality and effectiveness of language processing, embedding, reranking, and safety operations.
On its first startup, this service will take extra time to download, load, and warm up the model. Once that finishes, the service will be ready.
The allocation of Gaudi devices, set by the Gaudi-dependent services and the `NUM_CARDS` parameter supporting the `vllm-service` or `tgi-service`, determines where computational power is utilized to enhance performance.
Try the command below to check whether the LLM service is ready.
```bash
# vLLM service
docker logs vllm-gaudi-server 2>&1 | grep complete
# If the service is ready, you will get the response like below.
INFO: Application startup complete.
```
```bash
# TGI service
docker logs tgi-gaudi-server | grep Connected
# If the service is ready, you will get a response like the one below.
2024-09-03T02:47:53.402023Z INFO text_generation_router::server: router/src/server.rs:2311: Connected
```
Then try the `cURL` command below to validate services.
```bash
# vLLM Service
curl http://${host_ip}:8007/v1/chat/completions \
-X POST \
-d '{"model": ${LLM_MODEL_ID}, "messages": [{"role": "user", "content": "What is Deep Learning?"}], "max_tokens":17}' \
-H 'Content-Type: application/json'
```
```bash
# TGI service
curl http://${host_ip}:8005/v1/chat/completions \
-X POST \
-d '{"model": ${LLM_MODEL_ID}, "messages": [{"role": "user", "content": "What is Deep Learning?"}], "max_tokens":17}' \
-H 'Content-Type: application/json'
```
5. MegaService
```bash
curl http://${host_ip}:8888/v1/chatqna -H "Content-Type: application/json" -d '{
"messages": "What is the revenue of Nike in 2023?"
}'
```
6. Nginx Service
```bash
curl http://${host_ip}:${NGINX_PORT}/v1/chatqna \
-H "Content-Type: application/json" \
-d '{"messages": "What is the revenue of Nike in 2023?"}'
```
7. Dataprep Microservice (Optional)
If you want to update the default knowledge base, you can use the following commands:
Update Knowledge Base via Local File Upload:
```bash
curl -X POST "http://${host_ip}:6007/v1/dataprep/ingest" \
-H "Content-Type: multipart/form-data" \
-F "files=@./nke-10k-2023.pdf"
```
This command updates a knowledge base by uploading a local file for processing. Update the file path according to your environment.
Add Knowledge Base via HTTP Links:
```bash
curl -X POST "http://${host_ip}:6007/v1/dataprep/ingest" \
-H "Content-Type: multipart/form-data" \
-F 'link_list=["https://opea.dev"]'
```
This command updates a knowledge base by submitting a list of HTTP links for processing.
You can also retrieve the list of files/links you uploaded:
```bash
curl -X POST "http://${host_ip}:6007/v1/dataprep/get" \
-H "Content-Type: application/json"
```
Then you will get a JSON response like this. Notice that the returned `name`/`id` of the uploaded link is `https://xxx.txt`.
```json
[
{
"name": "nke-10k-2023.pdf",
"id": "nke-10k-2023.pdf",
"type": "File",
"parent": ""
},
{
"name": "https://opea.dev.txt",
"id": "https://opea.dev.txt",
"type": "File",
"parent": ""
}
]
```
To delete the file/link you uploaded:
```bash
# delete link
curl -X POST "http://${host_ip}:6007/v1/dataprep/delete" \
-d '{"file_path": "https://opea.dev.txt"}' \
-H "Content-Type: application/json"
# delete file
curl -X POST "http://${host_ip}:6007/v1/dataprep/delete" \
-d '{"file_path": "nke-10k-2023.pdf"}' \
-H "Content-Type: application/json"
# delete all uploaded files and links
curl -X POST "http://${host_ip}:6007/v1/dataprep/delete" \
-d '{"file_path": "all"}' \
-H "Content-Type: application/json"
```
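After deleting, the `get` endpoint shown above can be called again to confirm that the file/link list is empty:
```bash
curl -X POST "http://${host_ip}:6007/v1/dataprep/get" \
    -H "Content-Type: application/json"
```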
8. Guardrails (Optional)
```bash
curl http://${host_ip}:9090/v1/guardrails \
-X POST \
-d '{"text":"How do you buy a tiger in the US?","parameters":{"max_new_tokens":32}}' \
-H 'Content-Type: application/json'
```
### Profile Microservices
To further analyze microservice performance, users can follow the instructions below to profile microservices.
#### 1. vLLM backend Service
Users can follow the previous section to test the vLLM microservice or the ChatQnA MegaService.
By default, vLLM profiling is not enabled. Users can start and stop profiling with the following commands.
##### Start vLLM profiling
```bash
curl http://${host_ip}:9009/start_profile \
-H "Content-Type: application/json" \
-d '{"model": ${LLM_MODEL_ID}}'
```
Users should see docker logs like those below from the vllm-service if profiling started correctly.
```bash
INFO api_server.py:361] Starting profiler...
INFO api_server.py:363] Profiler started.
INFO: x.x.x.x:35940 - "POST /start_profile HTTP/1.1" 200 OK
```
After vLLM profiling is started, users can ask questions and get responses from the vLLM microservice or the ChatQnA MegaService.
##### Stop vLLM profiling
With the following command, users can stop vLLM profiling and generate a \*.pt.trace.json.gz file as the profiling result under the /mnt folder in the vllm-service docker instance.
```bash
# vLLM Service
curl http://${host_ip}:9009/stop_profile \
-H "Content-Type: application/json" \
-d '{"model": ${LLM_MODEL_ID}}'
```
Users should see docker logs like those below from the vllm-service if profiling stopped correctly.
```bash
INFO api_server.py:368] Stopping profiler...
INFO api_server.py:370] Profiler stopped.
INFO: x.x.x.x:41614 - "POST /stop_profile HTTP/1.1" 200 OK
```
After vLLM profiling is stopped, users can use the command below to retrieve the \*.pt.trace.json.gz file from the /mnt folder.
```bash
docker cp vllm-service:/mnt/ .
```
##### Check profiling result
Open a web browser and go to "chrome://tracing" or "ui.perfetto.dev", then load the json.gz file. You should see the vLLM profiling result as in the diagrams below.
![image](https://github.com/user-attachments/assets/487c52c8-d187-46dc-ab3a-43f21d657d41)
![image](https://github.com/user-attachments/assets/e3c51ce5-d704-4eb7-805e-0d88b0c158e3)
## 🚀 Launch the UI
### Launch with the default port
To access the frontend, open the following URL in your browser: http://{host_ip}:5173. By default, the UI runs on port 5173 internally. If you prefer to use a different host port to access the frontend, you can modify the port mapping in the `compose.yaml` file as shown below:
```yaml
chatqna-gaudi-ui-server:
image: opea/chatqna-ui:latest
...
ports:
- "80:5173"
```
### Launch with Nginx
If you want to launch the UI using Nginx, open this URL: `http://${host_ip}:${NGINX_PORT}` in your browser to access the frontend.
## 🚀 Launch the Conversational UI (Optional)
To access the Conversational UI (React-based) frontend, modify the UI service in the `compose.yaml` file. Replace the `chatqna-gaudi-ui-server` service with the `chatqna-gaudi-conversation-ui-server` service as per the config below:
```yaml
chatqna-gaudi-conversation-ui-server:
image: opea/chatqna-conversation-ui:latest
container_name: chatqna-gaudi-conversation-ui-server
environment:
- APP_BACKEND_SERVICE_ENDPOINT=${BACKEND_SERVICE_ENDPOINT}
- APP_DATA_PREP_SERVICE_URL=${DATAPREP_SERVICE_ENDPOINT}
ports:
- "5174:80"
depends_on:
- chatqna-gaudi-backend-server
ipc: host
restart: always
```
Once the services are up, open the following URL in your browser: http://{host_ip}:5174. By default, the UI runs on port 80 internally. If you prefer to use a different host port to access the frontend, you can modify the port mapping in the `compose.yaml` file as shown below:
```yaml
chatqna-gaudi-conversation-ui-server:
image: opea/chatqna-conversation-ui:latest
...
ports:
- "80:80"
```
![project-screenshot](../../../../assets/img/chat_ui_init.png)
Here is an example of running ChatQnA:
![project-screenshot](../../../../assets/img/chat_ui_response.png)
Here is an example of running ChatQnA with Conversational UI (React):
![project-screenshot](../../../../assets/img/conversation_ui_response.png)
Overall, the strategic configuration of these services, through careful selection of models and resource allocation, enables a balanced and efficient deployment. This approach ensures that the ChatQnA pipeline can meet diverse operational needs, from high-performance language model processing to robust safety protocols, all while optimizing the use of available hardware resources.

ChatQnA/docker_compose/intel/hpu/gaudi/set_env.sh Normal file → Executable file
View File

@@ -1,22 +1,94 @@
#!/usr/bin/env bash
#/usr/bin/env bash
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0
# Function to prompt for input and set environment variables
prompt_for_env_var() {
  local var_name="$1"
  local prompt_message="$2"
  local default_value="$3"
  local mandatory="$4"
  local value="" # reset on every call so a previous answer is never reused
  if [[ "$mandatory" == "true" ]]; then
    while [[ -z "$value" ]]; do
      read -p "$prompt_message [default: \"${default_value}\"]: " value
      if [[ -z "$value" ]]; then
        echo "Input cannot be empty. Please try again."
      fi
    done
  else
    read -p "$prompt_message [default: \"${default_value}\"]: " value
  fi
  if [[ "$value" == "" ]]; then
    export "$var_name"="$default_value"
  else
    export "$var_name"="$value"
  fi
}
pushd "../../../../../" > /dev/null
source .set_env.sh
popd > /dev/null
# Prompt the user for each required environment variable
prompt_for_env_var "EMBEDDING_MODEL_ID" "Enter the EMBEDDING_MODEL_ID" "BAAI/bge-base-en-v1.5" false
prompt_for_env_var "HUGGINGFACEHUB_API_TOKEN" "Enter the HUGGINGFACEHUB_API_TOKEN" "" true
prompt_for_env_var "RERANK_MODEL_ID" "Enter the RERANK_MODEL_ID" "BAAI/bge-reranker-base" false
prompt_for_env_var "LLM_MODEL_ID" "Enter the LLM_MODEL_ID" "meta-llama/Meta-Llama-3-8B-Instruct" false
prompt_for_env_var "INDEX_NAME" "Enter the INDEX_NAME" "rag-redis" false
prompt_for_env_var "NUM_CARDS" "Enter the number of Gaudi devices" "1" false
prompt_for_env_var "host_ip" "Enter the host_ip" "$(curl ifconfig.me)" false
export EMBEDDING_MODEL_ID="BAAI/bge-base-en-v1.5"
export RERANK_MODEL_ID="BAAI/bge-reranker-base"
export LLM_MODEL_ID="meta-llama/Meta-Llama-3-8B-Instruct"
export INDEX_NAME="rag-redis"
export NUM_CARDS=1
# Set it as a non-null string, such as true, if you want to enable logging facility,
# otherwise, keep it as "" to disable it.
export LOGFLAG=""
# Set OpenTelemetry Tracing Endpoint
export JAEGER_IP=$(ip route get 8.8.8.8 | grep -oP 'src \K[^ ]+')
export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=grpc://$JAEGER_IP:4317
export TELEMETRY_ENDPOINT=http://$JAEGER_IP:4318/v1/traces
export no_proxy="$no_proxy,chatqna-gaudi-ui-server,chatqna-gaudi-backend-server,dataprep-redis-service,tei-embedding-service,retriever,tei-reranking-service,tgi-gaudi-server,vllm-gaudi-server,guardrails,jaeger,prometheus,grafana,node-exporter,gaudi-exporter,$JAEGER_IP"
# Query for enabling http_proxy
prompt_for_env_var "http_proxy" "Enter the http_proxy." "" false
# Query for enabling https_proxy
prompt_for_env_var "https_proxy" "Enter the https_proxy." "" false
# Query for enabling no_proxy
prompt_for_env_var "no_proxy" "Enter the no_proxy." "" false
# Query for enabling logging
read -p "Enable logging? (yes/no): " logging && logging=$(echo "$logging" | tr '[:upper:]' '[:lower:]')
if [[ "$logging" == "yes" || "$logging" == "y" ]]; then
export LOGFLAG=true
else
export LOGFLAG=false
fi
# Query for enabling OpenTelemetry Tracing Endpoint
read -p "Enable OpenTelemetry Tracing Endpoint? (yes/no): " telemetry && telemetry=$(echo "$telemetry" | tr '[:upper:]' '[:lower:]')
if [[ "$telemetry" == "yes" || "$telemetry" == "y" ]]; then
export JAEGER_IP=$(ip route get 8.8.8.8 | grep -oP 'src \K[^ ]+')
export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=grpc://$JAEGER_IP:4317
export TELEMETRY_ENDPOINT=http://$JAEGER_IP:4318/v1/traces
telemetry_flag=true
else
telemetry_flag=false
fi
# Generate the .env file
cat <<EOF > .env
#!/bin/bash
# Set all required ENV values
export TAG=${TAG}
export EMBEDDING_MODEL_ID=${EMBEDDING_MODEL_ID}
export HUGGINGFACEHUB_API_TOKEN=$HUGGINGFACEHUB_API_TOKEN
export RERANK_MODEL_ID=${RERANK_MODEL_ID}
export LLM_MODEL_ID=${LLM_MODEL_ID}
export INDEX_NAME=${INDEX_NAME}
export NUM_CARDS=${NUM_CARDS}
export host_ip=${host_ip}
export http_proxy=${http_proxy}
export https_proxy=${https_proxy}
export no_proxy=${no_proxy}
export LOGFLAG=${LOGFLAG}
export JAEGER_IP=${JAEGER_IP}
export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=${OTEL_EXPORTER_OTLP_TRACES_ENDPOINT}
export TELEMETRY_ENDPOINT=${TELEMETRY_ENDPOINT}
EOF
echo ".env file has been created with the following content:"
cat .env