MultimodalQnA Image and Audio Support Phase 1 (#1071)

Signed-off-by: Melanie Buehler <melanie.h.buehler@intel.com>
Signed-off-by: okhleif-IL <omar.khleif@intel.com>
Signed-off-by: dmsuehir <dina.s.jones@intel.com>
Co-authored-by: Omar Khleif <omar.khleif@intel.com>
Co-authored-by: dmsuehir <dina.s.jones@intel.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Abolfazl Shahbazi <12436063+ashahba@users.noreply.github.com>
This commit is contained in:
Melanie Hart Buehler
2024-11-07 23:54:49 -08:00
committed by GitHub
parent dd9623d3d5
commit bbc95bb708
15 changed files with 472 additions and 156 deletions

View File

@@ -2,7 +2,7 @@
Suppose you possess a set of videos and wish to perform question-answering to extract insights from these videos. To respond to your questions, it typically necessitates comprehension of visual cues within the videos, knowledge derived from the audio content, or often a mix of both these visual elements and auditory facts. The MultimodalQnA framework offers an optimal solution for this purpose.
`MultimodalQnA` addresses your questions by dynamically fetching the most pertinent multimodal information (frames, transcripts, and/or captions) from your collection of videos. For this purpose, MultimodalQnA utilizes [BridgeTower model](https://huggingface.co/BridgeTower/bridgetower-large-itm-mlm-gaudi), a multimodal encoding transformer model which merges visual and textual data into a unified semantic space. During the video ingestion phase, the BridgeTower model embeds both visual cues and auditory facts as texts, and those embeddings are then stored in a vector database. When it comes to answering a question, the MultimodalQnA will fetch its most relevant multimodal content from the vector store and feed it into a downstream Large Vision-Language Model (LVM) as input context to generate a response for the user.
`MultimodalQnA` addresses your questions by dynamically fetching the most pertinent multimodal information (frames, transcripts, and/or captions) from your collection of videos, images, and audio files. For this purpose, MultimodalQnA utilizes the [BridgeTower model](https://huggingface.co/BridgeTower/bridgetower-large-itm-mlm-gaudi), a multimodal encoding transformer model that merges visual and textual data into a unified semantic space. During the ingestion phase, the BridgeTower model embeds both visual cues and auditory facts as texts, and those embeddings are then stored in a vector database. When answering a question, MultimodalQnA fetches the most relevant multimodal content from the vector store and feeds it into a downstream Large Vision-Language Model (LVM) as input context to generate a response for the user.
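For example, once the application is deployed, a question can be posed to the backend gateway with a single request. This is a minimal sketch: the host and `/v1/multimodalqna` route match the `BACKEND_SERVICE_ENDPOINT` used later in this guide, while the simple string form of the `messages` payload is an assumption.
```bash
# Ask the MultimodalQnA gateway a question about the ingested media (hypothetical payload shape)
curl http://${host_ip}:8888/v1/multimodalqna \
    -H "Content-Type: application/json" \
    -d '{"messages": "Which of my videos shows a sports car?"}'
```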
The MultimodalQnA architecture is shown below:
@@ -100,10 +100,12 @@ In the below, we provide a table that describes for each microservice component
By default, the embedding and LVM models are set to a default value as listed below:
| Service | Model |
| -------------------- | ------------------------------------------- |
| embedding-multimodal | BridgeTower/bridgetower-large-itm-mlm-gaudi |
| LVM | llava-hf/llava-v1.6-vicuna-13b-hf |
| Service | HW | Model |
| -------------------- | ----- | ----------------------------------------- |
| embedding-multimodal | Xeon | BridgeTower/bridgetower-large-itm-mlm-itc |
| LVM | Xeon | llava-hf/llava-1.5-7b-hf |
| embedding-multimodal | Gaudi | BridgeTower/bridgetower-large-itm-mlm-itc |
| LVM | Gaudi | llava-hf/llava-v1.6-vicuna-13b-hf |
You can choose other LVM models, such as `llava-hf/llava-1.5-7b-hf` and `llava-hf/llava-1.5-13b-hf`, as needed.
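For example, to switch to one of these alternatives before bringing up the services, export the model ID. This is a sketch that assumes, as in the environment blocks below, that the compose files read `LVM_MODEL_ID`.
```bash
# Override the default LVM with another supported LLaVA checkpoint
export LVM_MODEL_ID="llava-hf/llava-1.5-13b-hf"
```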

View File

@@ -84,16 +84,18 @@ export INDEX_NAME="mm-rag-redis"
export LLAVA_SERVER_PORT=8399
export LVM_ENDPOINT="http://${host_ip}:8399"
export EMBEDDING_MODEL_ID="BridgeTower/bridgetower-large-itm-mlm-itc"
export LVM_MODEL_ID="llava-hf/llava-1.5-7b-hf"
export WHISPER_MODEL="base"
export MM_EMBEDDING_SERVICE_HOST_IP=${host_ip}
export MM_RETRIEVER_SERVICE_HOST_IP=${host_ip}
export LVM_SERVICE_HOST_IP=${host_ip}
export MEGA_SERVICE_HOST_IP=${host_ip}
export BACKEND_SERVICE_ENDPOINT="http://${host_ip}:8888/v1/multimodalqna"
export DATAPREP_INGEST_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/ingest_with_text"
export DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/generate_transcripts"
export DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/generate_captions"
export DATAPREP_GET_VIDEO_ENDPOINT="http://${host_ip}:6007/v1/dataprep/get_videos"
export DATAPREP_DELETE_VIDEO_ENDPOINT="http://${host_ip}:6007/v1/dataprep/delete_videos"
export DATAPREP_GET_FILE_ENDPOINT="http://${host_ip}:6007/v1/dataprep/get_files"
export DATAPREP_DELETE_FILE_ENDPOINT="http://${host_ip}:6007/v1/dataprep/delete_files"
```
Note: Please replace `host_ip` with your external IP address; do not use localhost.
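On Linux, one way to look up and export the external IP is shown below; this mirrors the `hostname -I` lookup used by the test scripts later in this document.
```bash
export host_ip=$(hostname -I | awk '{print $1}')
```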
@@ -274,54 +276,76 @@ curl http://${host_ip}:9399/v1/lvm \
6. dataprep-multimodal-redis
Download a sample video
Download a sample video, image, and audio file, and create a caption file
```bash
export video_fn="WeAreGoingOnBullrun.mp4"
wget http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/WeAreGoingOnBullrun.mp4 -O ${video_fn}
export image_fn="apple.png"
wget https://github.com/docarray/docarray/blob/main/tests/toydata/image-data/apple.png?raw=true -O ${image_fn}
export caption_fn="apple.txt"
echo "This is an apple." > ${caption_fn}
export audio_fn="AudioSample.wav"
wget https://github.com/intel/intel-extension-for-transformers/raw/main/intel_extension_for_transformers/neural_chat/assets/audio/sample.wav -O ${audio_fn}
```
Test dataprep microservice. This command updates a knowledge base by uploading a local video .mp4.
Test the dataprep microservice's transcript generation. This command updates the knowledge base by uploading a local .mp4 video and a .wav audio file.
```bash
curl --silent --write-out "HTTPSTATUS:%{http_code}" \
${DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT} \
-H 'Content-Type: multipart/form-data' \
-X POST -F "files=@./${video_fn}"
-X POST \
-F "files=@./${video_fn}" \
-F "files=@./${audio_fn}"
```
Also, test dataprep microservice with generating caption using lvm microservice
Also, test the dataprep microservice by generating an image caption using the lvm microservice
```bash
curl --silent --write-out "HTTPSTATUS:%{http_code}" \
${DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT} \
-H 'Content-Type: multipart/form-data' \
-X POST -F "files=@./${video_fn}"
-X POST -F "files=@./${image_fn}"
```
Also, you are able to get the list of all videos that you uploaded:
Now, test the microservice by posting a custom caption along with an image
```bash
curl --silent --write-out "HTTPSTATUS:%{http_code}" \
${DATAPREP_INGEST_SERVICE_ENDPOINT} \
-H 'Content-Type: multipart/form-data' \
-X POST -F "files=@./${image_fn}" -F "files=@./${caption_fn}"
```
You can also get a list of all the files that you have uploaded:
```bash
curl -X POST \
-H "Content-Type: application/json" \
${DATAPREP_GET_VIDEO_ENDPOINT}
${DATAPREP_GET_FILE_ENDPOINT}
```
Then you will get the response python-style LIST like this. Notice the name of each uploaded video e.g., `videoname.mp4` will become `videoname_uuid.mp4` where `uuid` is a unique ID for each uploaded video. The same video that are uploaded twice will have different `uuid`.
Then you will get a response containing a Python-style list like the one below. Notice that the name of each uploaded file, e.g. `videoname.mp4`, becomes `videoname_uuid.mp4`, where `uuid` is a unique ID assigned to each uploaded file. The same file uploaded twice will receive different `uuid`s.
```bash
[
"WeAreGoingOnBullrun_7ac553a1-116c-40a2-9fc5-deccbb89b507.mp4",
"WeAreGoingOnBullrun_6d13cf26-8ba2-4026-a3a9-ab2e5eb73a29.mp4"
"WeAreGoingOnBullrun_6d13cf26-8ba2-4026-a3a9-ab2e5eb73a29.mp4",
"apple_fcade6e6-11a5-44a2-833a-3e534cbe4419.png",
"AudioSample_976a85a6-dc3e-43ab-966c-9d81beef780c.wav
]
```
To delete all uploaded videos along with data indexed with `$INDEX_NAME` in REDIS.
To delete all uploaded files along with the data indexed with `$INDEX_NAME` in Redis:
```bash
curl -X POST \
-H "Content-Type: application/json" \
${DATAPREP_DELETE_VIDEO_ENDPOINT}
${DATAPREP_DELETE_FILE_ENDPOINT}
```
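To confirm the deletion, list the files again; the response should now be an empty list.
```bash
curl -X POST \
    -H "Content-Type: application/json" \
    ${DATAPREP_GET_FILE_ENDPOINT}
```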
7. MegaService

View File

@@ -36,6 +36,7 @@ services:
http_proxy: ${http_proxy}
https_proxy: ${https_proxy}
PORT: ${EMBEDDER_PORT}
entrypoint: ["python", "bridgetower_server.py", "--device", "cpu", "--model_name_or_path", $EMBEDDING_MODEL_ID]
restart: unless-stopped
embedding-multimodal:
image: ${REGISTRY:-opea}/embedding-multimodal:${TAG:-latest}
@@ -76,6 +77,7 @@ services:
no_proxy: ${no_proxy}
http_proxy: ${http_proxy}
https_proxy: ${https_proxy}
entrypoint: ["python", "llava_server.py", "--device", "cpu", "--model_name_or_path", $LVM_MODEL_ID]
restart: unless-stopped
lvm-llava-svc:
image: ${REGISTRY:-opea}/lvm-llava-svc:${TAG:-latest}
@@ -125,6 +127,7 @@ services:
- https_proxy=${https_proxy}
- http_proxy=${http_proxy}
- BACKEND_SERVICE_ENDPOINT=${BACKEND_SERVICE_ENDPOINT}
- DATAPREP_INGEST_SERVICE_ENDPOINT=${DATAPREP_INGEST_SERVICE_ENDPOINT}
- DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT=${DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT}
- DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT=${DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT}
ipc: host

View File

@@ -15,13 +15,15 @@ export INDEX_NAME="mm-rag-redis"
export LLAVA_SERVER_PORT=8399
export LVM_ENDPOINT="http://${host_ip}:8399"
export EMBEDDING_MODEL_ID="BridgeTower/bridgetower-large-itm-mlm-itc"
export LVM_MODEL_ID="llava-hf/llava-1.5-7b-hf"
export WHISPER_MODEL="base"
export MM_EMBEDDING_SERVICE_HOST_IP=${host_ip}
export MM_RETRIEVER_SERVICE_HOST_IP=${host_ip}
export LVM_SERVICE_HOST_IP=${host_ip}
export MEGA_SERVICE_HOST_IP=${host_ip}
export BACKEND_SERVICE_ENDPOINT="http://${host_ip}:8888/v1/multimodalqna"
export DATAPREP_INGEST_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/ingest_with_text"
export DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/generate_transcripts"
export DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/generate_captions"
export DATAPREP_GET_VIDEO_ENDPOINT="http://${host_ip}:6007/v1/dataprep/get_videos"
export DATAPREP_DELETE_VIDEO_ENDPOINT="http://${host_ip}:6007/v1/dataprep/delete_videos"
export DATAPREP_GET_FILE_ENDPOINT="http://${host_ip}:6007/v1/dataprep/get_files"
export DATAPREP_DELETE_FILE_ENDPOINT="http://${host_ip}:6007/v1/dataprep/delete_files"

View File

@@ -40,10 +40,11 @@ export MM_RETRIEVER_SERVICE_HOST_IP=${host_ip}
export LVM_SERVICE_HOST_IP=${host_ip}
export MEGA_SERVICE_HOST_IP=${host_ip}
export BACKEND_SERVICE_ENDPOINT="http://${host_ip}:8888/v1/multimodalqna"
export DATAPREP_INGEST_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/ingest_with_text"
export DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/generate_transcripts"
export DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/generate_captions"
export DATAPREP_GET_VIDEO_ENDPOINT="http://${host_ip}:6007/v1/dataprep/get_videos"
export DATAPREP_DELETE_VIDEO_ENDPOINT="http://${host_ip}:6007/v1/dataprep/delete_videos"
export DATAPREP_GET_FILE_ENDPOINT="http://${host_ip}:6007/v1/dataprep/get_files"
export DATAPREP_DELETE_FILE_ENDPOINT="http://${host_ip}:6007/v1/dataprep/delete_files"
```
Note: Please replace `host_ip` with your external IP address; do not use localhost.
@@ -224,56 +225,76 @@ curl http://${host_ip}:9399/v1/lvm \
6. Multimodal Dataprep Microservice
Download a sample video
Download a sample video, image, and audio file, and create a caption file
```bash
export video_fn="WeAreGoingOnBullrun.mp4"
wget http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/WeAreGoingOnBullrun.mp4 -O ${video_fn}
export image_fn="apple.png"
wget https://github.com/docarray/docarray/blob/main/tests/toydata/image-data/apple.png?raw=true -O ${image_fn}
export caption_fn="apple.txt"
echo "This is an apple." > ${caption_fn}
export audio_fn="AudioSample.wav"
wget https://github.com/intel/intel-extension-for-transformers/raw/main/intel_extension_for_transformers/neural_chat/assets/audio/sample.wav -O ${audio_fn}
```
Test dataprep microservice. This command updates a knowledge base by uploading a local video .mp4.
Test dataprep microservice with generating transcript using whisper model
Test the dataprep microservice's transcript generation. This command updates the knowledge base by uploading a local .mp4 video and a .wav audio file.
```bash
curl --silent --write-out "HTTPSTATUS:%{http_code}" \
${DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT} \
-H 'Content-Type: multipart/form-data' \
-X POST -F "files=@./${video_fn}"
-X POST \
-F "files=@./${video_fn}" \
-F "files=@./${audio_fn}"
```
Also, test dataprep microservice with generating caption using lvm-tgi
Also, test the dataprep microservice by generating an image caption using lvm-tgi
```bash
curl --silent --write-out "HTTPSTATUS:%{http_code}" \
${DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT} \
-H 'Content-Type: multipart/form-data' \
-X POST -F "files=@./${video_fn}"
-X POST -F "files=@./${image_fn}"
```
Also, you are able to get the list of all videos that you uploaded:
Now, test the microservice by posting a custom caption along with an image
```bash
curl --silent --write-out "HTTPSTATUS:%{http_code}" \
${DATAPREP_INGEST_SERVICE_ENDPOINT} \
-H 'Content-Type: multipart/form-data' \
-X POST -F "files=@./${image_fn}" -F "files=@./${caption_fn}"
```
You can also get a list of all the files that you have uploaded:
```bash
curl -X POST \
-H "Content-Type: application/json" \
${DATAPREP_GET_VIDEO_ENDPOINT}
${DATAPREP_GET_FILE_ENDPOINT}
```
Then you will get the response python-style LIST like this. Notice the name of each uploaded video e.g., `videoname.mp4` will become `videoname_uuid.mp4` where `uuid` is a unique ID for each uploaded video. The same video that are uploaded twice will have different `uuid`.
Then you will get a response containing a Python-style list like the one below. Notice that the name of each uploaded file, e.g. `videoname.mp4`, becomes `videoname_uuid.mp4`, where `uuid` is a unique ID assigned to each uploaded file. The same file uploaded twice will receive different `uuid`s.
```bash
[
"WeAreGoingOnBullrun_7ac553a1-116c-40a2-9fc5-deccbb89b507.mp4",
"WeAreGoingOnBullrun_6d13cf26-8ba2-4026-a3a9-ab2e5eb73a29.mp4"
"WeAreGoingOnBullrun_6d13cf26-8ba2-4026-a3a9-ab2e5eb73a29.mp4",
"apple_fcade6e6-11a5-44a2-833a-3e534cbe4419.png",
"AudioSample_976a85a6-dc3e-43ab-966c-9d81beef780c.wav
]
```
To delete all uploaded videos along with data indexed with `$INDEX_NAME` in REDIS.
To delete all uploaded files along with the data indexed with `$INDEX_NAME` in Redis:
```bash
curl -X POST \
-H "Content-Type: application/json" \
${DATAPREP_DELETE_VIDEO_ENDPOINT}
${DATAPREP_DELETE_FILE_ENDPOINT}
```
7. MegaService

View File

@@ -36,6 +36,7 @@ services:
http_proxy: ${http_proxy}
https_proxy: ${https_proxy}
PORT: ${EMBEDDER_PORT}
entrypoint: ["python", "bridgetower_server.py", "--device", "hpu", "--model_name_or_path", $EMBEDDING_MODEL_ID]
restart: unless-stopped
embedding-multimodal:
image: ${REGISTRY:-opea}/embedding-multimodal:${TAG:-latest}
@@ -139,6 +140,7 @@ services:
- https_proxy=${https_proxy}
- http_proxy=${http_proxy}
- BACKEND_SERVICE_ENDPOINT=${BACKEND_SERVICE_ENDPOINT}
- DATAPREP_INGEST_SERVICE_ENDPOINT=${DATAPREP_INGEST_SERVICE_ENDPOINT}
- DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT=${DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT}
- DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT=${DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT}
ipc: host

View File

@@ -22,7 +22,8 @@ export MM_RETRIEVER_SERVICE_HOST_IP=${host_ip}
export LVM_SERVICE_HOST_IP=${host_ip}
export MEGA_SERVICE_HOST_IP=${host_ip}
export BACKEND_SERVICE_ENDPOINT="http://${host_ip}:8888/v1/multimodalqna"
export DATAPREP_INGEST_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/ingest_with_text"
export DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/generate_transcripts"
export DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/generate_captions"
export DATAPREP_GET_VIDEO_ENDPOINT="http://${host_ip}:6007/v1/dataprep/get_videos"
export DATAPREP_DELETE_VIDEO_ENDPOINT="http://${host_ip}:6007/v1/dataprep/delete_videos"
export DATAPREP_GET_FILE_ENDPOINT="http://${host_ip}:6007/v1/dataprep/get_files"
export DATAPREP_DELETE_FILE_ENDPOINT="http://${host_ip}:6007/v1/dataprep/delete_files"

View File

@@ -14,12 +14,13 @@ WORKPATH=$(dirname "$PWD")
LOG_PATH="$WORKPATH/tests"
ip_address=$(hostname -I | awk '{print $1}')
export image_fn="apple.png"
export video_fn="WeAreGoingOnBullrun.mp4"
export caption_fn="apple.txt"
function build_docker_images() {
cd $WORKPATH/docker_image_build
git clone https://github.com/opea-project/GenAIComps.git && cd GenAIComps && git checkout "${opea_branch:-"main"}" && cd ../
echo "Build all the images with --no-cache, check docker_image_build.log for details..."
service_list="multimodalqna multimodalqna-ui embedding-multimodal-bridgetower embedding-multimodal retriever-multimodal-redis lvm-tgi dataprep-multimodal-redis"
docker compose -f build.yaml build ${service_list} --no-cache > ${LOG_PATH}/docker_image_build.log
@@ -40,17 +41,18 @@ function setup_env() {
export LLAVA_SERVER_PORT=8399
export LVM_ENDPOINT="http://${host_ip}:8399"
export EMBEDDING_MODEL_ID="BridgeTower/bridgetower-large-itm-mlm-itc"
export LVM_MODEL_ID="llava-hf/llava-v1.6-vicuna-13b-hf"
export LVM_MODEL_ID="llava-hf/llava-v1.6-vicuna-7b-hf"
export WHISPER_MODEL="base"
export MM_EMBEDDING_SERVICE_HOST_IP=${host_ip}
export MM_RETRIEVER_SERVICE_HOST_IP=${host_ip}
export LVM_SERVICE_HOST_IP=${host_ip}
export MEGA_SERVICE_HOST_IP=${host_ip}
export BACKEND_SERVICE_ENDPOINT="http://${host_ip}:8888/v1/multimodalqna"
export DATAPREP_INGEST_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/ingest_with_text"
export DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/generate_transcripts"
export DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/generate_captions"
export DATAPREP_GET_VIDEO_ENDPOINT="http://${host_ip}:6007/v1/dataprep/get_videos"
export DATAPREP_DELETE_VIDEO_ENDPOINT="http://${host_ip}:6007/v1/dataprep/delete_videos"
export DATAPREP_GET_FILE_ENDPOINT="http://${host_ip}:6007/v1/dataprep/get_files"
export DATAPREP_DELETE_FILE_ENDPOINT="http://${host_ip}:6007/v1/dataprep/delete_files"
}
function start_services() {
@@ -63,12 +65,15 @@ function start_services() {
function prepare_data() {
cd $LOG_PATH
echo "Downloading video"
echo "Downloading image and video"
wget https://github.com/docarray/docarray/blob/main/tests/toydata/image-data/apple.png?raw=true -O ${image_fn}
wget http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/WeAreGoingOnBullrun.mp4 -O ${video_fn}
echo "Writing caption file"
echo "This is an apple." > ${caption_fn}
sleep 30s
}
function validate_service() {
local URL="$1"
local EXPECTED_RESULT="$2"
@@ -76,9 +81,15 @@ function validate_service() {
local DOCKER_NAME="$4"
local INPUT_DATA="$5"
if [[ $SERVICE_NAME == *"dataprep-multimodal-redis"* ]]; then
if [[ $SERVICE_NAME == *"dataprep-multimodal-redis-transcript"* ]]; then
cd $LOG_PATH
HTTP_RESPONSE=$(curl --silent --write-out "HTTPSTATUS:%{http_code}" -X POST -F "files=@./${video_fn}" -H 'Content-Type: multipart/form-data' "$URL")
elif [[ $SERVICE_NAME == *"dataprep-multimodal-redis-caption"* ]]; then
cd $LOG_PATH
HTTP_RESPONSE=$(curl --silent --write-out "HTTPSTATUS:%{http_code}" -X POST -F "files=@./${image_fn}" -H 'Content-Type: multipart/form-data' "$URL")
elif [[ $SERVICE_NAME == *"dataprep-multimodal-redis-ingest"* ]]; then
cd $LOG_PATH
HTTP_RESPONSE=$(curl --silent --write-out "HTTPSTATUS:%{http_code}" -X POST -F "files=@./${image_fn}" -F "files=@./apple.txt" -H 'Content-Type: multipart/form-data' "$URL")
elif [[ $SERVICE_NAME == *"dataprep_get"* ]]; then
HTTP_RESPONSE=$(curl --silent --write-out "HTTPSTATUS:%{http_code}" -X POST -H 'Content-Type: application/json' "$URL")
elif [[ $SERVICE_NAME == *"dataprep_del"* ]]; then
@@ -147,27 +158,34 @@ function validate_microservices() {
sleep 1m # the retrieval service is not immediately curl-able, so wait for it to become ready
# test data prep
echo "Data Prep with Generating Transcript"
echo "Data Prep with Generating Transcript for Video"
validate_service \
"${DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT}" \
"Data preparation succeeded" \
"dataprep-multimodal-redis" \
"dataprep-multimodal-redis-transcript" \
"dataprep-multimodal-redis"
echo "Data Prep with Generating Transcript"
echo "Data Prep with Image & Caption Ingestion"
validate_service \
"${DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT}" \
"${DATAPREP_INGEST_SERVICE_ENDPOINT}" \
"Data preparation succeeded" \
"dataprep-multimodal-redis" \
"dataprep-multimodal-redis-ingest" \
"dataprep-multimodal-redis"
echo "Validating get file"
echo "Validating get file returns mp4"
validate_service \
"${DATAPREP_GET_VIDEO_ENDPOINT}" \
"${DATAPREP_GET_FILE_ENDPOINT}" \
'.mp4' \
"dataprep_get" \
"dataprep-multimodal-redis"
echo "Validating get file returns png"
validate_service \
"${DATAPREP_GET_FILE_ENDPOINT}" \
'.png' \
"dataprep_get" \
"dataprep-multimodal-redis"
sleep 1m
# multimodal retrieval microservice
@@ -180,7 +198,7 @@ function validate_microservices() {
"retriever-multimodal-redis" \
"{\"text\":\"test\",\"embedding\":${your_embedding}}"
sleep 10s
sleep 3m
# llava server
echo "Evaluating LLAVA tgi-gaudi"
@@ -200,6 +218,14 @@ function validate_microservices() {
"lvm-tgi" \
'{"retrieved_docs": [], "initial_query": "What is this?", "top_n": 1, "metadata": [{"b64_img_str": "iVBORw0KGgoAAAANSUhEUgAAAAoAAAAKCAYAAACNMs+9AAAAFUlEQVR42mP8/5+hnoEIwDiqkL4KAcT9GO0U4BxoAAAAAElFTkSuQmCC", "transcript_for_inference": "yellow image", "video_id": "8c7461df-b373-4a00-8696-9a2234359fe0", "time_of_frame_ms":"37000000", "source_video":"WeAreGoingOnBullrun_8c7461df-b373-4a00-8696-9a2234359fe0.mp4"}], "chat_template":"The caption of the image is: '\''{context}'\''. {question}"}'
# data prep requiring lvm
echo "Data Prep with Generating Caption for Image"
validate_service \
"${DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT}" \
"Data preparation succeeded" \
"dataprep-multimodal-redis-caption" \
"dataprep-multimodal-redis"
sleep 1m
}
@@ -224,14 +250,22 @@ function validate_megaservice() {
}
function validate_delete {
echo "Validate data prep delete videos"
echo "Validate data prep delete files"
validate_service \
"${DATAPREP_DELETE_VIDEO_ENDPOINT}" \
"${DATAPREP_DELETE_FILE_ENDPOINT}" \
'{"status":true}' \
"dataprep_del" \
"dataprep-multimodal-redis"
}
function delete_data() {
cd $LOG_PATH
echo "Deleting image, video, and caption"
rm -rf ${image_fn}
rm -rf ${video_fn}
rm -rf ${caption_fn}
}
function stop_docker() {
cd $WORKPATH/docker_compose/intel/hpu/gaudi
docker compose -f compose.yaml stop && docker compose -f compose.yaml rm -f
@@ -256,6 +290,7 @@ function main() {
validate_delete
echo "==== delete validated ===="
delete_data
stop_docker
echo y | docker system prune

View File

@@ -14,7 +14,9 @@ WORKPATH=$(dirname "$PWD")
LOG_PATH="$WORKPATH/tests"
ip_address=$(hostname -I | awk '{print $1}')
export image_fn="apple.png"
export video_fn="WeAreGoingOnBullrun.mp4"
export caption_fn="apple.txt"
function build_docker_images() {
cd $WORKPATH/docker_image_build
@@ -37,6 +39,7 @@ function setup_env() {
export INDEX_NAME="mm-rag-redis"
export LLAVA_SERVER_PORT=8399
export LVM_ENDPOINT="http://${host_ip}:8399"
export LVM_MODEL_ID="llava-hf/llava-1.5-7b-hf"
export EMBEDDING_MODEL_ID="BridgeTower/bridgetower-large-itm-mlm-itc"
export WHISPER_MODEL="base"
export MM_EMBEDDING_SERVICE_HOST_IP=${host_ip}
@@ -44,10 +47,11 @@ function setup_env() {
export LVM_SERVICE_HOST_IP=${host_ip}
export MEGA_SERVICE_HOST_IP=${host_ip}
export BACKEND_SERVICE_ENDPOINT="http://${host_ip}:8888/v1/multimodalqna"
export DATAPREP_INGEST_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/ingest_with_text"
export DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/generate_transcripts"
export DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/generate_captions"
export DATAPREP_GET_VIDEO_ENDPOINT="http://${host_ip}:6007/v1/dataprep/get_videos"
export DATAPREP_DELETE_VIDEO_ENDPOINT="http://${host_ip}:6007/v1/dataprep/delete_videos"
export DATAPREP_GET_FILE_ENDPOINT="http://${host_ip}:6007/v1/dataprep/get_files"
export DATAPREP_DELETE_FILE_ENDPOINT="http://${host_ip}:6007/v1/dataprep/delete_files"
}
function start_services() {
@@ -61,12 +65,14 @@ function start_services() {
function prepare_data() {
cd $LOG_PATH
echo "Downloading video"
echo "Downloading image and video"
wget https://github.com/docarray/docarray/blob/main/tests/toydata/image-data/apple.png?raw=true -O ${image_fn}
wget http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/WeAreGoingOnBullrun.mp4 -O ${video_fn}
echo "Writing caption file"
echo "This is an apple." > ${caption_fn}
sleep 1m
}
function validate_service() {
local URL="$1"
local EXPECTED_RESULT="$2"
@@ -74,9 +80,15 @@ function validate_service() {
local DOCKER_NAME="$4"
local INPUT_DATA="$5"
if [[ $SERVICE_NAME == *"dataprep-multimodal-redis"* ]]; then
if [[ $SERVICE_NAME == *"dataprep-multimodal-redis-transcript"* ]]; then
cd $LOG_PATH
HTTP_RESPONSE=$(curl --silent --write-out "HTTPSTATUS:%{http_code}" -X POST -F "files=@./${video_fn}" -H 'Content-Type: multipart/form-data' "$URL")
elif [[ $SERVICE_NAME == *"dataprep-multimodal-redis-caption"* ]]; then
cd $LOG_PATH
HTTP_RESPONSE=$(curl --silent --write-out "HTTPSTATUS:%{http_code}" -X POST -F "files=@./${image_fn}" -H 'Content-Type: multipart/form-data' "$URL")
elif [[ $SERVICE_NAME == *"dataprep-multimodal-redis-ingest"* ]]; then
cd $LOG_PATH
HTTP_RESPONSE=$(curl --silent --write-out "HTTPSTATUS:%{http_code}" -X POST -F "files=@./${image_fn}" -F "files=@./apple.txt" -H 'Content-Type: multipart/form-data' "$URL")
elif [[ $SERVICE_NAME == *"dataprep_get"* ]]; then
HTTP_RESPONSE=$(curl --silent --write-out "HTTPSTATUS:%{http_code}" -X POST -H 'Content-Type: application/json' "$URL")
elif [[ $SERVICE_NAME == *"dataprep_del"* ]]; then
@@ -145,27 +157,34 @@ function validate_microservices() {
sleep 1m # the retrieval service is not immediately curl-able, so wait for it to become ready
# test data prep
echo "Data Prep with Generating Transcript"
echo "Data Prep with Generating Transcript for Video"
validate_service \
"${DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT}" \
"Data preparation succeeded" \
"dataprep-multimodal-redis" \
"dataprep-multimodal-redis-transcript" \
"dataprep-multimodal-redis"
# echo "Data Prep with Generating Caption"
# validate_service \
# "${DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT}" \
# "Data preparation succeeded" \
# "dataprep-multimodal-redis" \
# "dataprep-multimodal-redis"
echo "Validating get file"
echo "Data Prep with Image & Caption Ingestion"
validate_service \
"${DATAPREP_GET_VIDEO_ENDPOINT}" \
"${DATAPREP_INGEST_SERVICE_ENDPOINT}" \
"Data preparation succeeded" \
"dataprep-multimodal-redis-ingest" \
"dataprep-multimodal-redis"
echo "Validating get file returns mp4"
validate_service \
"${DATAPREP_GET_FILE_ENDPOINT}" \
'.mp4' \
"dataprep_get" \
"dataprep-multimodal-redis"
echo "Validating get file returns png"
validate_service \
"${DATAPREP_GET_FILE_ENDPOINT}" \
'.png' \
"dataprep_get" \
"dataprep-multimodal-redis"
sleep 1m
# multimodal retrieval microservice
@@ -178,7 +197,7 @@ function validate_microservices() {
"retriever-multimodal-redis" \
"{\"text\":\"test\",\"embedding\":${your_embedding}}"
sleep 10s
sleep 3m
# llava server
echo "Evaluating lvm-llava"
@@ -198,6 +217,14 @@ function validate_microservices() {
"lvm-llava-svc" \
'{"retrieved_docs": [], "initial_query": "What is this?", "top_n": 1, "metadata": [{"b64_img_str": "iVBORw0KGgoAAAANSUhEUgAAAAoAAAAKCAYAAACNMs+9AAAAFUlEQVR42mP8/5+hnoEIwDiqkL4KAcT9GO0U4BxoAAAAAElFTkSuQmCC", "transcript_for_inference": "yellow image", "video_id": "8c7461df-b373-4a00-8696-9a2234359fe0", "time_of_frame_ms":"37000000", "source_video":"WeAreGoingOnBullrun_8c7461df-b373-4a00-8696-9a2234359fe0.mp4"}], "chat_template":"The caption of the image is: '\''{context}'\''. {question}"}'
# data prep requiring lvm
echo "Data Prep with Generating Caption for Image"
validate_service \
"${DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT}" \
"Data preparation succeeded" \
"dataprep-multimodal-redis-caption" \
"dataprep-multimodal-redis"
sleep 3m
}
@@ -222,14 +249,22 @@ function validate_megaservice() {
}
function validate_delete {
echo "Validate data prep delete videos"
echo "Validate data prep delete files"
validate_service \
"${DATAPREP_DELETE_VIDEO_ENDPOINT}" \
"${DATAPREP_DELETE_FILE_ENDPOINT}" \
'{"status":true}' \
"dataprep_del" \
"dataprep-multimodal-redis"
}
function delete_data() {
cd $LOG_PATH
echo "Deleting image, video, and caption"
rm -rf ${image_fn}
rm -rf ${video_fn}
rm -rf ${caption_fn}
}
function stop_docker() {
cd $WORKPATH/docker_compose/intel/cpu/xeon
docker compose -f compose.yaml stop && docker compose -f compose.yaml rm -f
@@ -254,6 +289,7 @@ function main() {
validate_delete
echo "==== delete validated ===="
delete_data
stop_docker
echo y | docker system prune

View File

@@ -30,6 +30,7 @@ class Conversation:
base64_frame: str = None
skip_next: bool = False
split_video: str = None
image: str = None
def _template_caption(self):
out = ""
@@ -59,6 +60,8 @@ class Conversation:
else:
base64_frame = get_b64_frame_from_timestamp(self.video_file, self.time_of_frame_ms)
self.base64_frame = base64_frame
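# If no frame could be extracted, send an empty string rather than None to the chatbot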
if base64_frame is None:
base64_frame = ""
content.append({"type": "image_url", "image_url": {"url": base64_frame}})
else:
content = message
@@ -137,6 +140,7 @@ class Conversation:
"caption": self.caption,
"base64_frame": self.base64_frame,
"split_video": self.split_video,
"image": self.image,
}
@@ -152,4 +156,5 @@ multimodalqna_conv = Conversation(
time_of_frame_ms=None,
base64_frame=None,
split_video=None,
image=None,
)

View File

@@ -13,7 +13,7 @@ import uvicorn
from conversation import multimodalqna_conv
from fastapi import FastAPI
from fastapi.staticfiles import StaticFiles
from utils import build_logger, moderation_msg, server_error_msg, split_video
from utils import build_logger, make_temp_image, moderation_msg, server_error_msg, split_video
logger = build_logger("gradio_web_server", "gradio_web_server.log")
@@ -47,22 +47,24 @@ def clear_history(state, request: gr.Request):
logger.info(f"clear_history. ip: {request.client.host}")
if state.split_video and os.path.exists(state.split_video):
os.remove(state.split_video)
if state.image and os.path.exists(state.image):
os.remove(state.image)
state = multimodalqna_conv.copy()
return (state, state.to_gradio_chatbot(), "", None) + (disable_btn,) * 1
return (state, state.to_gradio_chatbot(), None, None, None) + (disable_btn,) * 1
def add_text(state, text, request: gr.Request):
logger.info(f"add_text. ip: {request.client.host}. len: {len(text)}")
if len(text) <= 0:
state.skip_next = True
return (state, state.to_gradio_chatbot(), "", None) + (no_change_btn,) * 1
return (state, state.to_gradio_chatbot(), None) + (no_change_btn,) * 1
text = text[:2000] # Hard cut-off
state.append_message(state.roles[0], text)
state.append_message(state.roles[1], None)
state.skip_next = False
return (state, state.to_gradio_chatbot(), "") + (disable_btn,) * 1
return (state, state.to_gradio_chatbot(), None) + (disable_btn,) * 1
def http_bot(state, request: gr.Request):
@@ -73,7 +75,7 @@ def http_bot(state, request: gr.Request):
if state.skip_next:
# This generate call is skipped due to invalid inputs
path_to_sub_videos = state.get_path_to_subvideos()
yield (state, state.to_gradio_chatbot(), path_to_sub_videos) + (no_change_btn,) * 1
yield (state, state.to_gradio_chatbot(), path_to_sub_videos, None) + (no_change_btn,) * 1
return
if len(state.messages) == state.offset + 2:
@@ -97,7 +99,7 @@ def http_bot(state, request: gr.Request):
logger.info(f"==== url request ====\n{gateway_addr}")
state.messages[-1][-1] = ""
yield (state, state.to_gradio_chatbot(), state.split_video) + (disable_btn,) * 1
yield (state, state.to_gradio_chatbot(), state.split_video, state.image) + (disable_btn,) * 1
try:
response = requests.post(
@@ -108,6 +110,7 @@ def http_bot(state, request: gr.Request):
)
print(response.status_code)
print(response.json())
if response.status_code == 200:
response = response.json()
choice = response["choices"][-1]
@@ -123,44 +126,61 @@ def http_bot(state, request: gr.Request):
video_file = metadata["source_video"]
state.video_file = os.path.join(static_dir, metadata["source_video"])
state.time_of_frame_ms = metadata["time_of_frame_ms"]
try:
splited_video_path = split_video(
state.video_file, state.time_of_frame_ms, tmp_dir, f"{state.time_of_frame_ms}__{video_file}"
)
except:
print(f"video {state.video_file} does not exist in UI host!")
splited_video_path = None
state.split_video = splited_video_path
file_ext = os.path.splitext(state.video_file)[-1]
if file_ext == ".mp4":
try:
splited_video_path = split_video(
state.video_file, state.time_of_frame_ms, tmp_dir, f"{state.time_of_frame_ms}__{video_file}"
)
except:
print(f"video {state.video_file} does not exist in UI host!")
splited_video_path = None
state.split_video = splited_video_path
elif file_ext in [".jpg", ".jpeg", ".png", ".gif"]:
try:
output_image_path = make_temp_image(state.video_file, file_ext)
except:
print(f"image {state.video_file} does not exist in UI host!")
output_image_path = None
state.image = output_image_path
else:
raise requests.exceptions.RequestException
except requests.exceptions.RequestException as e:
state.messages[-1][-1] = server_error_msg
yield (state, state.to_gradio_chatbot(), None) + (enable_btn,)
yield (state, state.to_gradio_chatbot(), None, None) + (enable_btn,)
return
state.messages[-1][-1] = message
yield (state, state.to_gradio_chatbot(), state.split_video) + (enable_btn,) * 1
yield (
state,
state.to_gradio_chatbot(),
gr.Video(state.split_video, visible=state.split_video is not None),
gr.Image(state.image, visible=state.image is not None),
) + (enable_btn,) * 1
logger.info(f"{state.messages[-1][-1]}")
return
def ingest_video_gen_transcript(filepath, request: gr.Request):
yield (gr.Textbox(visible=True, value="Please wait for ingesting your uploaded video into database..."))
def ingest_gen_transcript(filepath, filetype, request: gr.Request):
yield (
gr.Textbox(visible=True, value=f"Please wait while your uploaded {filetype} is ingested into the database...")
)
verified_filepath = os.path.normpath(filepath)
if not verified_filepath.startswith(tmp_upload_folder):
print("Found malicious video file name!")
print(f"Found malicious {filetype} file name!")
yield (
gr.Textbox(
visible=True,
value="Your uploaded video's file name has special characters that are not allowed. Please consider update the video file name!",
value=f"Your uploaded {filetype}'s file name has special characters that are not allowed (depending on the OS, some examples are \\, /, :, and *). Please consider changing the file name.",
)
)
return
basename = os.path.basename(verified_filepath)
dest = os.path.join(static_dir, basename)
shutil.copy(verified_filepath, dest)
print("Done copy uploaded file to static folder!")
print("Done copying uploaded file to static folder.")
headers = {
# 'Content-Type': 'multipart/form-data'
}
@@ -172,17 +192,17 @@ def ingest_video_gen_transcript(filepath, request: gr.Request):
if response.status_code == 200:
response = response.json()
print(response)
yield (gr.Textbox(visible=True, value="Video ingestion is done. Saving your uploaded video..."))
yield (gr.Textbox(visible=True, value=f"The {filetype} ingestion is done. Saving your uploaded {filetype}..."))
time.sleep(2)
fn_no_ext = Path(dest).stem
if "video_id_maps" in response and fn_no_ext in response["video_id_maps"]:
new_dst = os.path.join(static_dir, response["video_id_maps"][fn_no_ext])
print(response["video_id_maps"][fn_no_ext])
if "file_id_maps" in response and fn_no_ext in response["file_id_maps"]:
new_dst = os.path.join(static_dir, response["file_id_maps"][fn_no_ext])
print(response["file_id_maps"][fn_no_ext])
os.rename(dest, new_dst)
yield (
gr.Textbox(
visible=True,
value="Congratulation! Your upload is done!\nClick the X button on the top right of the video upload box to upload another video.",
value=f"Congratulations, your upload is done!\nClick the X button on the top right of the {filetype} upload box to upload another {filetype}.",
)
)
return
@@ -190,51 +210,53 @@ def ingest_video_gen_transcript(filepath, request: gr.Request):
yield (
gr.Textbox(
visible=True,
value="Something wrong!\nPlease click the X button on the top right of the video upload boxreupload your video!",
value=f"Something went wrong (server error: {response.status_code})!\nPlease click the X button on the top right of the {filetype} upload box to reupload your {filetype}.",
)
)
time.sleep(2)
return
def ingest_video_gen_caption(filepath, request: gr.Request):
yield (gr.Textbox(visible=True, value="Please wait for ingesting your uploaded video into database..."))
def ingest_gen_caption(filepath, filetype, request: gr.Request):
yield (
gr.Textbox(visible=True, value=f"Please wait while your uploaded {filetype} is ingested into the database...")
)
verified_filepath = os.path.normpath(filepath)
if not verified_filepath.startswith(tmp_upload_folder):
print("Found malicious video file name!")
print(f"Found malicious {filetype} file name!")
yield (
gr.Textbox(
visible=True,
value="Your uploaded video's file name has special characters that are not allowed. Please consider update the video file name!",
value=f"Your uploaded {filetype}'s file name has special characters that are not allowed (depending on the OS, some examples are \\, /, :, and *). Please consider changing the file name.",
)
)
return
basename = os.path.basename(verified_filepath)
dest = os.path.join(static_dir, basename)
shutil.copy(verified_filepath, dest)
print("Done copy uploaded file to static folder!")
print("Done copying uploaded file to static folder.")
headers = {
# 'Content-Type': 'multipart/form-data'
}
files = {
"files": open(dest, "rb"),
}
response = requests.post(dataprep_gen_captiono_addr, headers=headers, files=files)
response = requests.post(dataprep_gen_caption_addr, headers=headers, files=files)
print(response.status_code)
if response.status_code == 200:
response = response.json()
print(response)
yield (gr.Textbox(visible=True, value="Video ingestion is done. Saving your uploaded video..."))
yield (gr.Textbox(visible=True, value=f"The {filetype} ingestion is done. Saving your uploaded {filetype}..."))
time.sleep(2)
fn_no_ext = Path(dest).stem
if "video_id_maps" in response and fn_no_ext in response["video_id_maps"]:
new_dst = os.path.join(static_dir, response["video_id_maps"][fn_no_ext])
print(response["video_id_maps"][fn_no_ext])
if "file_id_maps" in response and fn_no_ext in response["file_id_maps"]:
new_dst = os.path.join(static_dir, response["file_id_maps"][fn_no_ext])
print(response["file_id_maps"][fn_no_ext])
os.rename(dest, new_dst)
yield (
gr.Textbox(
visible=True,
value="Congratulation! Your upload is done!\nClick the X button on the top right of the video upload box to upload another video.",
value=f"Congratulations, your upload is done!\nClick the X button on the top right of the {filetype} upload box to upload another {filetype}.",
)
)
return
@@ -242,48 +264,181 @@ def ingest_video_gen_caption(filepath, request: gr.Request):
yield (
gr.Textbox(
visible=True,
value="Something wrong!\nPlease click the X button on the top right of the video upload boxreupload your video!",
value=f"Something went wrong (server error: {response.status_code})!\nPlease click the X button on the top right of the {filetype} upload box to reupload your {filetype}.",
)
)
time.sleep(2)
return
def clear_uploaded_video(request: gr.Request):
def ingest_with_text(filepath, text, request: gr.Request):
yield (gr.Textbox(visible=True, value="Please wait while your uploaded image is ingested into the database..."))
verified_filepath = os.path.normpath(filepath)
if not verified_filepath.startswith(tmp_upload_folder):
print("Found malicious image file name!")
yield (
gr.Textbox(
visible=True,
value="Your uploaded image's file name has special characters that are not allowed (depending on the OS, some examples are \\, /, :, and *). Please consider changing the file name.",
)
)
return
basename = os.path.basename(verified_filepath)
dest = os.path.join(static_dir, basename)
shutil.copy(verified_filepath, dest)
text_basename = "{}.txt".format(os.path.splitext(basename)[0])
text_dest = os.path.join(static_dir, text_basename)
with open(text_dest, "w") as file:
file.write(text)
print("Done copying uploaded files to static folder!")
headers = {
# 'Content-Type': 'multipart/form-data'
}
files = [("files", (basename, open(dest, "rb"))), ("files", (text_basename, open(text_dest, "rb")))]
try:
response = requests.post(dataprep_ingest_addr, headers=headers, files=files)
finally:
os.remove(text_dest)
print(response.status_code)
if response.status_code == 200:
response = response.json()
print(response)
yield (gr.Textbox(visible=True, value="Image ingestion is done. Saving your uploaded image..."))
time.sleep(2)
fn_no_ext = Path(dest).stem
if "file_id_maps" in response and fn_no_ext in response["file_id_maps"]:
new_dst = os.path.join(static_dir, response["file_id_maps"][fn_no_ext])
print(response["file_id_maps"][fn_no_ext])
os.rename(dest, new_dst)
yield (
gr.Textbox(
visible=True,
value="Congratulations, your upload is done!\nClick the X button on the top right of the image upload box to upload another image.",
)
)
return
else:
yield (
gr.Textbox(
visible=True,
value=f"Something went wrong (server error: {response.status_code})!\nPlease click the X button on the top right of the image upload box to reupload your image!",
)
)
time.sleep(2)
return
def hide_text(request: gr.Request):
return gr.Textbox(visible=False)
with gr.Blocks() as upload_gen_trans:
gr.Markdown("# Ingest Your Own Video - Utilizing Generated Transcripts")
gr.Markdown(
"Please use this interface to ingest your own video if the video has meaningful audio (e.g., announcements, discussions, etc...)"
)
def clear_text(request: gr.Request):
return None
with gr.Blocks() as upload_video:
gr.Markdown("# Ingest Your Own Video Using Generated Transcripts or Captions")
gr.Markdown("Use this interface to ingest your own video and generate transcripts or captions for it")
def select_upload_type(choice, request: gr.Request):
if choice == "transcript":
return gr.Video(sources="upload", visible=True), gr.Video(sources="upload", visible=False)
else:
return gr.Video(sources="upload", visible=False), gr.Video(sources="upload", visible=True)
with gr.Row():
with gr.Column(scale=6):
video_upload = gr.Video(sources="upload", height=512, width=512, elem_id="video_upload")
video_upload_trans = gr.Video(sources="upload", elem_id="video_upload_trans", visible=True)
video_upload_cap = gr.Video(sources="upload", elem_id="video_upload_cap", visible=False)
with gr.Column(scale=3):
text_options_radio = gr.Radio(
[
("Generate transcript (video contains voice)", "transcript"),
("Generate captions (video does not contain voice)", "caption"),
],
label="Text Options",
info="How should text be ingested?",
value="transcript",
)
text_upload_result = gr.Textbox(visible=False, interactive=False, label="Upload Status")
video_upload_trans.upload(
ingest_gen_transcript, [video_upload_trans, gr.Textbox(value="video", visible=False)], [text_upload_result]
)
video_upload_trans.clear(hide_text, [], [text_upload_result])
video_upload_cap.upload(
ingest_gen_caption, [video_upload_cap, gr.Textbox(value="video", visible=False)], [text_upload_result]
)
video_upload_cap.clear(hide_text, [], [text_upload_result])
text_options_radio.change(select_upload_type, [text_options_radio], [video_upload_trans, video_upload_cap])
with gr.Blocks() as upload_image:
gr.Markdown("# Ingest Your Own Image Using Generated or Custom Captions/Labels")
gr.Markdown("Use this interface to ingest your own image and generate a caption for it, or provide your own caption or label")
def select_upload_type(choice, request: gr.Request):
if choice == "gen_caption":
return gr.Image(sources="upload", visible=True), gr.Image(sources="upload", visible=False)
else:
return gr.Image(sources="upload", visible=False), gr.Image(sources="upload", visible=True)
with gr.Row():
with gr.Column(scale=6):
image_upload_cap = gr.Image(type="filepath", sources="upload", elem_id="image_upload_cap", visible=True)
image_upload_text = gr.Image(type="filepath", sources="upload", elem_id="image_upload_cap", visible=False)
with gr.Column(scale=3):
text_options_radio = gr.Radio(
[("Generate caption", "gen_caption"), ("Custom caption or label", "custom_caption")],
label="Text Options",
info="How should text be ingested?",
value="gen_caption",
)
custom_caption = gr.Textbox(visible=True, interactive=True, label="Custom Caption or Label")
text_upload_result = gr.Textbox(visible=False, interactive=False, label="Upload Status")
image_upload_cap.upload(
ingest_gen_caption, [image_upload_cap, gr.Textbox(value="image", visible=False)], [text_upload_result]
)
image_upload_cap.clear(hide_text, [], [text_upload_result])
image_upload_text.upload(ingest_with_text, [image_upload_text, custom_caption], [text_upload_result]).then(
clear_text, [], [custom_caption]
)
image_upload_text.clear(hide_text, [], [text_upload_result])
text_options_radio.change(select_upload_type, [text_options_radio], [image_upload_cap, image_upload_text])
with gr.Blocks() as upload_audio:
gr.Markdown("# Ingest Your Own Audio Using Generated Transcripts")
gr.Markdown("Use this interface to ingest your own audio file and generate a transcript for it")
with gr.Row():
with gr.Column(scale=6):
audio_upload = gr.Audio(type="filepath")
with gr.Column(scale=3):
text_upload_result = gr.Textbox(visible=False, interactive=False, label="Upload Status")
video_upload.upload(ingest_video_gen_transcript, [video_upload], [text_upload_result])
video_upload.clear(clear_uploaded_video, [], [text_upload_result])
audio_upload.upload(
ingest_gen_transcript, [audio_upload, gr.Textbox(value="audio", visible=False)], [text_upload_result]
)
audio_upload.stop_recording(
ingest_gen_transcript, [audio_upload, gr.Textbox(value="audio", visible=False)], [text_upload_result]
)
audio_upload.clear(hide_text, [], [text_upload_result])
with gr.Blocks() as upload_gen_captions:
gr.Markdown("# Ingest Your Own Video - Utilizing Generated Captions")
gr.Markdown(
"Please use this interface to ingest your own video if the video has meaningless audio (e.g., background musics, etc...)"
)
with gr.Blocks() as upload_pdf:
gr.Markdown("# Ingest Your Own PDF")
gr.Markdown("Use this interface to ingest your own PDF file with text, tables, images, and graphs")
with gr.Row():
with gr.Column(scale=6):
video_upload_cap = gr.Video(sources="upload", height=512, width=512, elem_id="video_upload_cap")
image_upload_cap = gr.File()
with gr.Column(scale=3):
text_upload_result_cap = gr.Textbox(visible=False, interactive=False, label="Upload Status")
video_upload_cap.upload(ingest_video_gen_transcript, [video_upload_cap], [text_upload_result_cap])
video_upload_cap.clear(clear_uploaded_video, [], [text_upload_result_cap])
image_upload_cap.upload(
ingest_gen_caption, [image_upload_cap, gr.Textbox(value="PDF", visible=False)], [text_upload_result_cap]
)
image_upload_cap.clear(hide_text, [], [text_upload_result_cap])
with gr.Blocks() as qna:
state = gr.State(multimodalqna_conv.copy())
with gr.Row():
with gr.Column(scale=4):
video = gr.Video(height=512, width=512, elem_id="video")
video = gr.Video(height=512, width=512, elem_id="video", visible=True, label="Media")
image = gr.Image(height=512, width=512, elem_id="image", visible=False, label="Media")
with gr.Column(scale=7):
chatbot = gr.Chatbot(elem_id="chatbot", label="MultimodalQnA Chatbot", height=390)
with gr.Row():
@@ -293,7 +448,8 @@ with gr.Blocks() as qna:
# show_label=False,
# container=False,
label="Query",
info="Enter your query here!",
info="Enter a text query below",
# submit_btn=False,
)
with gr.Column(scale=1, min_width=100):
with gr.Row():
@@ -306,7 +462,7 @@ with gr.Blocks() as qna:
[
state,
],
[state, chatbot, textbox, video, clear_btn],
[state, chatbot, textbox, video, image, clear_btn],
)
submit_btn.click(
@@ -318,17 +474,19 @@ with gr.Blocks() as qna:
[
state,
],
[state, chatbot, video, clear_btn],
[state, chatbot, video, image, clear_btn],
)
with gr.Blocks(css=css) as demo:
gr.Markdown("# MultimodalQnA")
with gr.Tabs():
with gr.TabItem("MultimodalQnA With Your Videos"):
with gr.TabItem("MultimodalQnA"):
qna.render()
with gr.TabItem("Upload Your Own Videos"):
upload_gen_trans.render()
with gr.TabItem("Upload Your Own Videos"):
upload_gen_captions.render()
with gr.TabItem("Upload Video"):
upload_video.render()
with gr.TabItem("Upload Image"):
upload_image.render()
with gr.TabItem("Upload Audio"):
upload_audio.render()
demo.queue()
app = gr.mount_gradio_app(app, demo, path="/")
@@ -343,6 +501,9 @@ if __name__ == "__main__":
parser.add_argument("--share", action="store_true")
backend_service_endpoint = os.getenv("BACKEND_SERVICE_ENDPOINT", "http://localhost:8888/v1/multimodalqna")
dataprep_ingest_endpoint = os.getenv(
"DATAPREP_INGEST_SERVICE_ENDPOINT", "http://localhost:6007/v1/ingest_with_text"
)
dataprep_gen_transcript_endpoint = os.getenv(
"DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT", "http://localhost:6007/v1/generate_transcripts"
)
@@ -353,9 +514,11 @@ if __name__ == "__main__":
logger.info(f"args: {args}")
global gateway_addr
gateway_addr = backend_service_endpoint
global dataprep_ingest_addr
dataprep_ingest_addr = dataprep_ingest_endpoint
global dataprep_gen_transcript_addr
dataprep_gen_transcript_addr = dataprep_gen_transcript_endpoint
global dataprep_gen_captiono_addr
dataprep_gen_captiono_addr = dataprep_gen_caption_endpoint
global dataprep_gen_caption_addr
dataprep_gen_caption_addr = dataprep_gen_caption_endpoint
uvicorn.run(app, host=args.host, port=args.port)

View File

@@ -5,6 +5,7 @@ import base64
import logging
import logging.handlers
import os
import shutil
import sys
from pathlib import Path
@@ -118,6 +119,18 @@ def maintain_aspect_ratio_resize(image, width=None, height=None, inter=cv2.INTER
return cv2.resize(image, dim, interpolation=inter)
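# function to copy an image to a temporary file in the static images folder so the UI can display it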
def make_temp_image(
image_name,
file_ext,
output_image_path: str = "./public/images",
output_image_name: str = "image_tmp",
):
Path(output_image_path).mkdir(parents=True, exist_ok=True)
output_image = os.path.join(output_image_path, "{}.{}".format(output_image_name, file_ext))
shutil.copy(image_name, output_image)
return output_image
# function to split video at a timestamp
def split_video(
video_path,