MultimodalQnA Image and Audio Support Phase 1 (#1071)
Signed-off-by: Melanie Buehler <melanie.h.buehler@intel.com>
Signed-off-by: okhleif-IL <omar.khleif@intel.com>
Signed-off-by: dmsuehir <dina.s.jones@intel.com>
Co-authored-by: Omar Khleif <omar.khleif@intel.com>
Co-authored-by: dmsuehir <dina.s.jones@intel.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Abolfazl Shahbazi <12436063+ashahba@users.noreply.github.com>
parent dd9623d3d5
commit bbc95bb708 (committed via GitHub)
@@ -2,7 +2,7 @@

Suppose you possess a set of videos and wish to perform question-answering to extract insights from them. Answering such questions typically requires comprehension of visual cues within the videos, knowledge derived from the audio content, or often a mix of both visual elements and auditory facts. The MultimodalQnA framework offers an optimal solution for this purpose.

`MultimodalQnA` addresses your questions by dynamically fetching the most pertinent multimodal information (frames, transcripts, and/or captions) from your collection of videos, images, and audio files. For this purpose, MultimodalQnA utilizes the [BridgeTower model](https://huggingface.co/BridgeTower/bridgetower-large-itm-mlm-gaudi), a multimodal encoding transformer that merges visual and textual data into a unified semantic space. During the ingestion phase, the BridgeTower model embeds both visual cues and auditory facts as texts, and those embeddings are then stored in a vector database. When answering a question, MultimodalQnA fetches the most relevant multimodal content from the vector store and feeds it into a downstream Large Vision-Language Model (LVM) as input context to generate a response for the user.

The MultimodalQnA architecture is shown below:
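
For reference, once the full pipeline is up, a question can be posed directly to the MegaService gateway. The sketch below is illustrative only and assumes the gateway accepts an OpenAI-style `messages` field on the `BACKEND_SERVICE_ENDPOINT` defined later in this guide; consult the MegaService step for the exact request schema.

```bash
# Illustrative only: ask a question against the ingested collection
# (assumes the backend accepts a "messages" field; see the MegaService section)
curl http://${host_ip}:8888/v1/multimodalqna \
  -H "Content-Type: application/json" \
  -d '{"messages": "What happens in this video?"}'
```
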
@@ -100,10 +100,12 @@ In the below, we provide a table that describes for each microservice component

By default, the embedding and LVM models are set to the values listed below:

| Service              | HW    | Model                                     |
| -------------------- | ----- | ----------------------------------------- |
| embedding-multimodal | Xeon  | BridgeTower/bridgetower-large-itm-mlm-itc |
| LVM                  | Xeon  | llava-hf/llava-1.5-7b-hf                  |
| embedding-multimodal | Gaudi | BridgeTower/bridgetower-large-itm-mlm-itc |
| LVM                  | Gaudi | llava-hf/llava-v1.6-vicuna-13b-hf         |

You can choose other LVM models, such as `llava-hf/llava-1.5-7b-hf` and `llava-hf/llava-1.5-13b-hf`, as needed.
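
For example, to use a different LLaVA checkpoint than the default, export the corresponding model ID before starting the services. This is a minimal sketch using the `LVM_MODEL_ID` variable from the environment setup below:

```bash
# Swap the default LVM for the larger 13B LLaVA checkpoint
export LVM_MODEL_ID="llava-hf/llava-1.5-13b-hf"
```
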
@@ -84,16 +84,18 @@ export INDEX_NAME="mm-rag-redis"
export LLAVA_SERVER_PORT=8399
export LVM_ENDPOINT="http://${host_ip}:8399"
export EMBEDDING_MODEL_ID="BridgeTower/bridgetower-large-itm-mlm-itc"
export LVM_MODEL_ID="llava-hf/llava-1.5-7b-hf"
export WHISPER_MODEL="base"
export MM_EMBEDDING_SERVICE_HOST_IP=${host_ip}
export MM_RETRIEVER_SERVICE_HOST_IP=${host_ip}
export LVM_SERVICE_HOST_IP=${host_ip}
export MEGA_SERVICE_HOST_IP=${host_ip}
export BACKEND_SERVICE_ENDPOINT="http://${host_ip}:8888/v1/multimodalqna"
export DATAPREP_INGEST_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/ingest_with_text"
export DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/generate_transcripts"
export DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/generate_captions"
export DATAPREP_GET_FILE_ENDPOINT="http://${host_ip}:6007/v1/dataprep/get_files"
export DATAPREP_DELETE_FILE_ENDPOINT="http://${host_ip}:6007/v1/dataprep/delete_files"
```

Note: Please replace `host_ip` with your external IP address; do not use localhost.
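
If you are not sure of the host's external IP, one way to look it up (the same approach used by the example test scripts later on this page) is:

```bash
# Use the first address reported by the host; adjust if you have multiple interfaces
export host_ip=$(hostname -I | awk '{print $1}')
```
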
@@ -274,54 +276,76 @@ curl http://${host_ip}:9399/v1/lvm \

6. dataprep-multimodal-redis

Download a sample video, image, and audio file, and create a caption:

```bash
export video_fn="WeAreGoingOnBullrun.mp4"
wget http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/WeAreGoingOnBullrun.mp4 -O ${video_fn}

export image_fn="apple.png"
wget https://github.com/docarray/docarray/blob/main/tests/toydata/image-data/apple.png?raw=true -O ${image_fn}

export caption_fn="apple.txt"
echo "This is an apple." > ${caption_fn}

export audio_fn="AudioSample.wav"
wget https://github.com/intel/intel-extension-for-transformers/raw/main/intel_extension_for_transformers/neural_chat/assets/audio/sample.wav -O ${audio_fn}
```
Test the dataprep microservice with transcript generation. This command updates the knowledge base by uploading a local .mp4 video and a .wav audio file:

```bash
curl --silent --write-out "HTTPSTATUS:%{http_code}" \
  ${DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT} \
  -H 'Content-Type: multipart/form-data' \
  -X POST \
  -F "files=@./${video_fn}" \
  -F "files=@./${audio_fn}"
```
Also test the dataprep microservice by generating an image caption, which uses the lvm microservice:

```bash
curl --silent --write-out "HTTPSTATUS:%{http_code}" \
  ${DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT} \
  -H 'Content-Type: multipart/form-data' \
  -X POST -F "files=@./${image_fn}"
```

Now test the microservice by posting a custom caption along with an image:

```bash
curl --silent --write-out "HTTPSTATUS:%{http_code}" \
  ${DATAPREP_INGEST_SERVICE_ENDPOINT} \
  -H 'Content-Type: multipart/form-data' \
  -X POST -F "files=@./${image_fn}" -F "files=@./${caption_fn}"
```
You can also get the list of all files that you have uploaded:

```bash
curl -X POST \
  -H "Content-Type: application/json" \
  ${DATAPREP_GET_FILE_ENDPOINT}
```

The response is a Python-style list like the one below. Note that the name of each uploaded file, e.g., `videoname.mp4`, becomes `videoname_uuid.mp4`, where `uuid` is a unique ID assigned to each upload. The same file uploaded twice will receive different `uuid`s.

```bash
[
    "WeAreGoingOnBullrun_7ac553a1-116c-40a2-9fc5-deccbb89b507.mp4",
    "WeAreGoingOnBullrun_6d13cf26-8ba2-4026-a3a9-ab2e5eb73a29.mp4",
    "apple_fcade6e6-11a5-44a2-833a-3e534cbe4419.png",
    "AudioSample_976a85a6-dc3e-43ab-966c-9d81beef780c.wav"
]
```
To delete all uploaded files, along with the data indexed under `$INDEX_NAME` in Redis:

```bash
curl -X POST \
  -H "Content-Type: application/json" \
  ${DATAPREP_DELETE_FILE_ENDPOINT}
```

7. MegaService
@@ -36,6 +36,7 @@ services:
http_proxy: ${http_proxy}
https_proxy: ${https_proxy}
PORT: ${EMBEDDER_PORT}
entrypoint: ["python", "bridgetower_server.py", "--device", "cpu", "--model_name_or_path", $EMBEDDING_MODEL_ID]
restart: unless-stopped
embedding-multimodal:
image: ${REGISTRY:-opea}/embedding-multimodal:${TAG:-latest}

@@ -76,6 +77,7 @@ services:
no_proxy: ${no_proxy}
http_proxy: ${http_proxy}
https_proxy: ${https_proxy}
entrypoint: ["python", "llava_server.py", "--device", "cpu", "--model_name_or_path", $LVM_MODEL_ID]
restart: unless-stopped
lvm-llava-svc:
image: ${REGISTRY:-opea}/lvm-llava-svc:${TAG:-latest}

@@ -125,6 +127,7 @@
- https_proxy=${https_proxy}
- http_proxy=${http_proxy}
- BACKEND_SERVICE_ENDPOINT=${BACKEND_SERVICE_ENDPOINT}
- DATAPREP_INGEST_SERVICE_ENDPOINT=${DATAPREP_INGEST_SERVICE_ENDPOINT}
- DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT=${DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT}
- DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT=${DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT}
ipc: host
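
With the environment variables exported and the compose file updated, the Xeon deployment can be brought up with Docker Compose. This is a sketch; the directory layout shown here follows the test scripts later on this page and may differ in your checkout:

```bash
# Start all MultimodalQnA services defined in the Xeon compose file
cd docker_compose/intel/cpu/xeon
docker compose -f compose.yaml up -d
```
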
@@ -15,13 +15,15 @@ export INDEX_NAME="mm-rag-redis"
export LLAVA_SERVER_PORT=8399
export LVM_ENDPOINT="http://${host_ip}:8399"
export EMBEDDING_MODEL_ID="BridgeTower/bridgetower-large-itm-mlm-itc"
export LVM_MODEL_ID="llava-hf/llava-1.5-7b-hf"
export WHISPER_MODEL="base"
export MM_EMBEDDING_SERVICE_HOST_IP=${host_ip}
export MM_RETRIEVER_SERVICE_HOST_IP=${host_ip}
export LVM_SERVICE_HOST_IP=${host_ip}
export MEGA_SERVICE_HOST_IP=${host_ip}
export BACKEND_SERVICE_ENDPOINT="http://${host_ip}:8888/v1/multimodalqna"
export DATAPREP_INGEST_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/ingest_with_text"
export DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/generate_transcripts"
export DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/generate_captions"
export DATAPREP_GET_FILE_ENDPOINT="http://${host_ip}:6007/v1/dataprep/get_files"
export DATAPREP_DELETE_FILE_ENDPOINT="http://${host_ip}:6007/v1/dataprep/delete_files"
@@ -40,10 +40,11 @@ export MM_RETRIEVER_SERVICE_HOST_IP=${host_ip}
export LVM_SERVICE_HOST_IP=${host_ip}
export MEGA_SERVICE_HOST_IP=${host_ip}
export BACKEND_SERVICE_ENDPOINT="http://${host_ip}:8888/v1/multimodalqna"
export DATAPREP_INGEST_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/ingest_with_text"
export DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/generate_transcripts"
export DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/generate_captions"
export DATAPREP_GET_FILE_ENDPOINT="http://${host_ip}:6007/v1/dataprep/get_files"
export DATAPREP_DELETE_FILE_ENDPOINT="http://${host_ip}:6007/v1/dataprep/delete_files"
```

Note: Please replace `host_ip` with your external IP address; do not use localhost.
@@ -224,56 +225,76 @@ curl http://${host_ip}:9399/v1/lvm \

6. Multimodal Dataprep Microservice

Download a sample video, image, and audio file, and create a caption:

```bash
export video_fn="WeAreGoingOnBullrun.mp4"
wget http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/WeAreGoingOnBullrun.mp4 -O ${video_fn}

export image_fn="apple.png"
wget https://github.com/docarray/docarray/blob/main/tests/toydata/image-data/apple.png?raw=true -O ${image_fn}

export caption_fn="apple.txt"
echo "This is an apple." > ${caption_fn}

export audio_fn="AudioSample.wav"
wget https://github.com/intel/intel-extension-for-transformers/raw/main/intel_extension_for_transformers/neural_chat/assets/audio/sample.wav -O ${audio_fn}
```

Test the dataprep microservice with transcript generation. This command updates the knowledge base by uploading a local .mp4 video and a .wav audio file:

```bash
curl --silent --write-out "HTTPSTATUS:%{http_code}" \
  ${DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT} \
  -H 'Content-Type: multipart/form-data' \
  -X POST \
  -F "files=@./${video_fn}" \
  -F "files=@./${audio_fn}"
```

Also test the dataprep microservice by generating an image caption using lvm-tgi:

```bash
curl --silent --write-out "HTTPSTATUS:%{http_code}" \
  ${DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT} \
  -H 'Content-Type: multipart/form-data' \
  -X POST -F "files=@./${image_fn}"
```

Now test the microservice by posting a custom caption along with an image:

```bash
curl --silent --write-out "HTTPSTATUS:%{http_code}" \
  ${DATAPREP_INGEST_SERVICE_ENDPOINT} \
  -H 'Content-Type: multipart/form-data' \
  -X POST -F "files=@./${image_fn}" -F "files=@./${caption_fn}"
```

You can also get the list of all files that you have uploaded:

```bash
curl -X POST \
  -H "Content-Type: application/json" \
  ${DATAPREP_GET_FILE_ENDPOINT}
```

The response is a Python-style list like the one below. Note that the name of each uploaded file, e.g., `videoname.mp4`, becomes `videoname_uuid.mp4`, where `uuid` is a unique ID assigned to each upload. The same file uploaded twice will receive different `uuid`s.

```bash
[
    "WeAreGoingOnBullrun_7ac553a1-116c-40a2-9fc5-deccbb89b507.mp4",
    "WeAreGoingOnBullrun_6d13cf26-8ba2-4026-a3a9-ab2e5eb73a29.mp4",
    "apple_fcade6e6-11a5-44a2-833a-3e534cbe4419.png",
    "AudioSample_976a85a6-dc3e-43ab-966c-9d81beef780c.wav"
]
```

To delete all uploaded files, along with the data indexed under `$INDEX_NAME` in Redis:

```bash
curl -X POST \
  -H "Content-Type: application/json" \
  ${DATAPREP_DELETE_FILE_ENDPOINT}
```

7. MegaService
@@ -36,6 +36,7 @@ services:
http_proxy: ${http_proxy}
https_proxy: ${https_proxy}
PORT: ${EMBEDDER_PORT}
entrypoint: ["python", "bridgetower_server.py", "--device", "hpu", "--model_name_or_path", $EMBEDDING_MODEL_ID]
restart: unless-stopped
embedding-multimodal:
image: ${REGISTRY:-opea}/embedding-multimodal:${TAG:-latest}

@@ -139,6 +140,7 @@ services:
- https_proxy=${https_proxy}
- http_proxy=${http_proxy}
- BACKEND_SERVICE_ENDPOINT=${BACKEND_SERVICE_ENDPOINT}
- DATAPREP_INGEST_SERVICE_ENDPOINT=${DATAPREP_INGEST_SERVICE_ENDPOINT}
- DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT=${DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT}
- DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT=${DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT}
ipc: host

@@ -22,7 +22,8 @@ export MM_RETRIEVER_SERVICE_HOST_IP=${host_ip}
export LVM_SERVICE_HOST_IP=${host_ip}
export MEGA_SERVICE_HOST_IP=${host_ip}
export BACKEND_SERVICE_ENDPOINT="http://${host_ip}:8888/v1/multimodalqna"
export DATAPREP_INGEST_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/ingest_with_text"
export DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/generate_transcripts"
export DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/generate_captions"
export DATAPREP_GET_FILE_ENDPOINT="http://${host_ip}:6007/v1/dataprep/get_files"
export DATAPREP_DELETE_FILE_ENDPOINT="http://${host_ip}:6007/v1/dataprep/delete_files"
@@ -14,12 +14,13 @@ WORKPATH=$(dirname "$PWD")
|
||||
LOG_PATH="$WORKPATH/tests"
|
||||
ip_address=$(hostname -I | awk '{print $1}')
|
||||
|
||||
export image_fn="apple.png"
|
||||
export video_fn="WeAreGoingOnBullrun.mp4"
|
||||
export caption_fn="apple.txt"
|
||||
|
||||
function build_docker_images() {
|
||||
cd $WORKPATH/docker_image_build
|
||||
git clone https://github.com/opea-project/GenAIComps.git && cd GenAIComps && git checkout "${opea_branch:-"main"}" && cd ../
|
||||
|
||||
echo "Build all the images with --no-cache, check docker_image_build.log for details..."
|
||||
service_list="multimodalqna multimodalqna-ui embedding-multimodal-bridgetower embedding-multimodal retriever-multimodal-redis lvm-tgi dataprep-multimodal-redis"
|
||||
docker compose -f build.yaml build ${service_list} --no-cache > ${LOG_PATH}/docker_image_build.log
|
||||
@@ -40,17 +41,18 @@ function setup_env() {
|
||||
export LLAVA_SERVER_PORT=8399
|
||||
export LVM_ENDPOINT="http://${host_ip}:8399"
|
||||
export EMBEDDING_MODEL_ID="BridgeTower/bridgetower-large-itm-mlm-itc"
|
||||
export LVM_MODEL_ID="llava-hf/llava-v1.6-vicuna-13b-hf"
|
||||
export LVM_MODEL_ID="llava-hf/llava-v1.6-vicuna-7b-hf"
|
||||
export WHISPER_MODEL="base"
|
||||
export MM_EMBEDDING_SERVICE_HOST_IP=${host_ip}
|
||||
export MM_RETRIEVER_SERVICE_HOST_IP=${host_ip}
|
||||
export LVM_SERVICE_HOST_IP=${host_ip}
|
||||
export MEGA_SERVICE_HOST_IP=${host_ip}
|
||||
export BACKEND_SERVICE_ENDPOINT="http://${host_ip}:8888/v1/multimodalqna"
|
||||
export DATAPREP_INGEST_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/ingest_with_text"
|
||||
export DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/generate_transcripts"
|
||||
export DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/generate_captions"
|
||||
export DATAPREP_GET_VIDEO_ENDPOINT="http://${host_ip}:6007/v1/dataprep/get_videos"
|
||||
export DATAPREP_DELETE_VIDEO_ENDPOINT="http://${host_ip}:6007/v1/dataprep/delete_videos"
|
||||
export DATAPREP_GET_FILE_ENDPOINT="http://${host_ip}:6007/v1/dataprep/get_files"
|
||||
export DATAPREP_DELETE_FILE_ENDPOINT="http://${host_ip}:6007/v1/dataprep/delete_files"
|
||||
}
|
||||
|
||||
function start_services() {
|
||||
@@ -63,12 +65,15 @@ function start_services() {
|
||||
|
||||
function prepare_data() {
|
||||
cd $LOG_PATH
|
||||
echo "Downloading video"
|
||||
echo "Downloading image and video"
|
||||
wget https://github.com/docarray/docarray/blob/main/tests/toydata/image-data/apple.png?raw=true -O ${image_fn}
|
||||
wget http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/WeAreGoingOnBullrun.mp4 -O ${video_fn}
|
||||
echo "Writing caption file"
|
||||
echo "This is an apple." > ${caption_fn}
|
||||
|
||||
sleep 30s
|
||||
|
||||
}
|
||||
|
||||
function validate_service() {
|
||||
local URL="$1"
|
||||
local EXPECTED_RESULT="$2"
|
||||
@@ -76,9 +81,15 @@ function validate_service() {
|
||||
local DOCKER_NAME="$4"
|
||||
local INPUT_DATA="$5"
|
||||
|
||||
if [[ $SERVICE_NAME == *"dataprep-multimodal-redis"* ]]; then
|
||||
if [[ $SERVICE_NAME == *"dataprep-multimodal-redis-transcript"* ]]; then
|
||||
cd $LOG_PATH
|
||||
HTTP_RESPONSE=$(curl --silent --write-out "HTTPSTATUS:%{http_code}" -X POST -F "files=@./${video_fn}" -H 'Content-Type: multipart/form-data' "$URL")
|
||||
elif [[ $SERVICE_NAME == *"dataprep-multimodal-redis-caption"* ]]; then
|
||||
cd $LOG_PATH
|
||||
HTTP_RESPONSE=$(curl --silent --write-out "HTTPSTATUS:%{http_code}" -X POST -F "files=@./${image_fn}" -H 'Content-Type: multipart/form-data' "$URL")
|
||||
elif [[ $SERVICE_NAME == *"dataprep-multimodal-redis-ingest"* ]]; then
|
||||
cd $LOG_PATH
|
||||
HTTP_RESPONSE=$(curl --silent --write-out "HTTPSTATUS:%{http_code}" -X POST -F "files=@./${image_fn}" -F "files=@./apple.txt" -H 'Content-Type: multipart/form-data' "$URL")
|
||||
elif [[ $SERVICE_NAME == *"dataprep_get"* ]]; then
|
||||
HTTP_RESPONSE=$(curl --silent --write-out "HTTPSTATUS:%{http_code}" -X POST -H 'Content-Type: application/json' "$URL")
|
||||
elif [[ $SERVICE_NAME == *"dataprep_del"* ]]; then
|
||||
@@ -147,27 +158,34 @@ function validate_microservices() {
|
||||
sleep 1m # retrieval can't curl as expected, try to wait for more time
|
||||
|
||||
# test data prep
|
||||
echo "Data Prep with Generating Transcript"
|
||||
echo "Data Prep with Generating Transcript for Video"
|
||||
validate_service \
|
||||
"${DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT}" \
|
||||
"Data preparation succeeded" \
|
||||
"dataprep-multimodal-redis" \
|
||||
"dataprep-multimodal-redis-transcript" \
|
||||
"dataprep-multimodal-redis"
|
||||
|
||||
echo "Data Prep with Generating Transcript"
|
||||
echo "Data Prep with Image & Caption Ingestion"
|
||||
validate_service \
|
||||
"${DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT}" \
|
||||
"${DATAPREP_INGEST_SERVICE_ENDPOINT}" \
|
||||
"Data preparation succeeded" \
|
||||
"dataprep-multimodal-redis" \
|
||||
"dataprep-multimodal-redis-ingest" \
|
||||
"dataprep-multimodal-redis"
|
||||
|
||||
echo "Validating get file"
|
||||
echo "Validating get file returns mp4"
|
||||
validate_service \
|
||||
"${DATAPREP_GET_VIDEO_ENDPOINT}" \
|
||||
"${DATAPREP_GET_FILE_ENDPOINT}" \
|
||||
'.mp4' \
|
||||
"dataprep_get" \
|
||||
"dataprep-multimodal-redis"
|
||||
|
||||
echo "Validating get file returns png"
|
||||
validate_service \
|
||||
"${DATAPREP_GET_FILE_ENDPOINT}" \
|
||||
'.png' \
|
||||
"dataprep_get" \
|
||||
"dataprep-multimodal-redis"
|
||||
|
||||
sleep 1m
|
||||
|
||||
# multimodal retrieval microservice
|
||||
@@ -180,7 +198,7 @@ function validate_microservices() {
|
||||
"retriever-multimodal-redis" \
|
||||
"{\"text\":\"test\",\"embedding\":${your_embedding}}"
|
||||
|
||||
sleep 10s
|
||||
sleep 3m
|
||||
|
||||
# llava server
|
||||
echo "Evaluating LLAVA tgi-gaudi"
|
||||
@@ -200,6 +218,14 @@ function validate_microservices() {
|
||||
"lvm-tgi" \
|
||||
'{"retrieved_docs": [], "initial_query": "What is this?", "top_n": 1, "metadata": [{"b64_img_str": "iVBORw0KGgoAAAANSUhEUgAAAAoAAAAKCAYAAACNMs+9AAAAFUlEQVR42mP8/5+hnoEIwDiqkL4KAcT9GO0U4BxoAAAAAElFTkSuQmCC", "transcript_for_inference": "yellow image", "video_id": "8c7461df-b373-4a00-8696-9a2234359fe0", "time_of_frame_ms":"37000000", "source_video":"WeAreGoingOnBullrun_8c7461df-b373-4a00-8696-9a2234359fe0.mp4"}], "chat_template":"The caption of the image is: '\''{context}'\''. {question}"}'
|
||||
|
||||
# data prep requiring lvm
|
||||
echo "Data Prep with Generating Caption for Image"
|
||||
validate_service \
|
||||
"${DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT}" \
|
||||
"Data preparation succeeded" \
|
||||
"dataprep-multimodal-redis-caption" \
|
||||
"dataprep-multimodal-redis"
|
||||
|
||||
sleep 1m
|
||||
}
|
||||
|
||||
@@ -224,14 +250,22 @@ function validate_megaservice() {
|
||||
}
|
||||
|
||||
function validate_delete {
|
||||
echo "Validate data prep delete videos"
|
||||
echo "Validate data prep delete files"
|
||||
validate_service \
|
||||
"${DATAPREP_DELETE_VIDEO_ENDPOINT}" \
|
||||
"${DATAPREP_DELETE_FILE_ENDPOINT}" \
|
||||
'{"status":true}' \
|
||||
"dataprep_del" \
|
||||
"dataprep-multimodal-redis"
|
||||
}
|
||||
|
||||
function delete_data() {
|
||||
cd $LOG_PATH
|
||||
echo "Deleting image, video, and caption"
|
||||
rm -rf ${image_fn}
|
||||
rm -rf ${video_fn}
|
||||
rm -rf ${caption_fn}
|
||||
}
|
||||
|
||||
function stop_docker() {
|
||||
cd $WORKPATH/docker_compose/intel/hpu/gaudi
|
||||
docker compose -f compose.yaml stop && docker compose -f compose.yaml rm -f
|
||||
@@ -256,6 +290,7 @@ function main() {
|
||||
validate_delete
|
||||
echo "==== delete validated ===="
|
||||
|
||||
delete_data
|
||||
stop_docker
|
||||
echo y | docker system prune
|
||||
|
||||
|
||||
@@ -14,7 +14,9 @@ WORKPATH=$(dirname "$PWD")
|
||||
LOG_PATH="$WORKPATH/tests"
|
||||
ip_address=$(hostname -I | awk '{print $1}')
|
||||
|
||||
export image_fn="apple.png"
|
||||
export video_fn="WeAreGoingOnBullrun.mp4"
|
||||
export caption_fn="apple.txt"
|
||||
|
||||
function build_docker_images() {
|
||||
cd $WORKPATH/docker_image_build
|
||||
@@ -37,6 +39,7 @@ function setup_env() {
|
||||
export INDEX_NAME="mm-rag-redis"
|
||||
export LLAVA_SERVER_PORT=8399
|
||||
export LVM_ENDPOINT="http://${host_ip}:8399"
|
||||
export LVM_MODEL_ID="llava-hf/llava-1.5-7b-hf"
|
||||
export EMBEDDING_MODEL_ID="BridgeTower/bridgetower-large-itm-mlm-itc"
|
||||
export WHISPER_MODEL="base"
|
||||
export MM_EMBEDDING_SERVICE_HOST_IP=${host_ip}
|
||||
@@ -44,10 +47,11 @@ function setup_env() {
|
||||
export LVM_SERVICE_HOST_IP=${host_ip}
|
||||
export MEGA_SERVICE_HOST_IP=${host_ip}
|
||||
export BACKEND_SERVICE_ENDPOINT="http://${host_ip}:8888/v1/multimodalqna"
|
||||
export DATAPREP_INGEST_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/ingest_with_text"
|
||||
export DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/generate_transcripts"
|
||||
export DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/generate_captions"
|
||||
export DATAPREP_GET_VIDEO_ENDPOINT="http://${host_ip}:6007/v1/dataprep/get_videos"
|
||||
export DATAPREP_DELETE_VIDEO_ENDPOINT="http://${host_ip}:6007/v1/dataprep/delete_videos"
|
||||
export DATAPREP_GET_FILE_ENDPOINT="http://${host_ip}:6007/v1/dataprep/get_files"
|
||||
export DATAPREP_DELETE_FILE_ENDPOINT="http://${host_ip}:6007/v1/dataprep/delete_files"
|
||||
}
|
||||
|
||||
function start_services() {
|
||||
@@ -61,12 +65,14 @@ function start_services() {
|
||||
|
||||
function prepare_data() {
|
||||
cd $LOG_PATH
|
||||
echo "Downloading video"
|
||||
echo "Downloading image and video"
|
||||
wget https://github.com/docarray/docarray/blob/main/tests/toydata/image-data/apple.png?raw=true -O ${image_fn}
|
||||
wget http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/WeAreGoingOnBullrun.mp4 -O ${video_fn}
|
||||
|
||||
echo "Writing caption file"
|
||||
echo "This is an apple." > ${caption_fn}
|
||||
sleep 1m
|
||||
|
||||
}
|
||||
|
||||
function validate_service() {
|
||||
local URL="$1"
|
||||
local EXPECTED_RESULT="$2"
|
||||
@@ -74,9 +80,15 @@ function validate_service() {
|
||||
local DOCKER_NAME="$4"
|
||||
local INPUT_DATA="$5"
|
||||
|
||||
if [[ $SERVICE_NAME == *"dataprep-multimodal-redis"* ]]; then
|
||||
if [[ $SERVICE_NAME == *"dataprep-multimodal-redis-transcript"* ]]; then
|
||||
cd $LOG_PATH
|
||||
HTTP_RESPONSE=$(curl --silent --write-out "HTTPSTATUS:%{http_code}" -X POST -F "files=@./${video_fn}" -H 'Content-Type: multipart/form-data' "$URL")
|
||||
elif [[ $SERVICE_NAME == *"dataprep-multimodal-redis-caption"* ]]; then
|
||||
cd $LOG_PATH
|
||||
HTTP_RESPONSE=$(curl --silent --write-out "HTTPSTATUS:%{http_code}" -X POST -F "files=@./${image_fn}" -H 'Content-Type: multipart/form-data' "$URL")
|
||||
elif [[ $SERVICE_NAME == *"dataprep-multimodal-redis-ingest"* ]]; then
|
||||
cd $LOG_PATH
|
||||
HTTP_RESPONSE=$(curl --silent --write-out "HTTPSTATUS:%{http_code}" -X POST -F "files=@./${image_fn}" -F "files=@./apple.txt" -H 'Content-Type: multipart/form-data' "$URL")
|
||||
elif [[ $SERVICE_NAME == *"dataprep_get"* ]]; then
|
||||
HTTP_RESPONSE=$(curl --silent --write-out "HTTPSTATUS:%{http_code}" -X POST -H 'Content-Type: application/json' "$URL")
|
||||
elif [[ $SERVICE_NAME == *"dataprep_del"* ]]; then
|
||||
@@ -145,27 +157,34 @@ function validate_microservices() {
|
||||
sleep 1m # retrieval can't curl as expected, try to wait for more time
|
||||
|
||||
# test data prep
|
||||
echo "Data Prep with Generating Transcript"
|
||||
echo "Data Prep with Generating Transcript for Video"
|
||||
validate_service \
|
||||
"${DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT}" \
|
||||
"Data preparation succeeded" \
|
||||
"dataprep-multimodal-redis" \
|
||||
"dataprep-multimodal-redis-transcript" \
|
||||
"dataprep-multimodal-redis"
|
||||
|
||||
# echo "Data Prep with Generating Caption"
|
||||
# validate_service \
|
||||
# "${DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT}" \
|
||||
# "Data preparation succeeded" \
|
||||
# "dataprep-multimodal-redis" \
|
||||
# "dataprep-multimodal-redis"
|
||||
|
||||
echo "Validating get file"
|
||||
echo "Data Prep with Image & Caption Ingestion"
|
||||
validate_service \
|
||||
"${DATAPREP_GET_VIDEO_ENDPOINT}" \
|
||||
"${DATAPREP_INGEST_SERVICE_ENDPOINT}" \
|
||||
"Data preparation succeeded" \
|
||||
"dataprep-multimodal-redis-ingest" \
|
||||
"dataprep-multimodal-redis"
|
||||
|
||||
echo "Validating get file returns mp4"
|
||||
validate_service \
|
||||
"${DATAPREP_GET_FILE_ENDPOINT}" \
|
||||
'.mp4' \
|
||||
"dataprep_get" \
|
||||
"dataprep-multimodal-redis"
|
||||
|
||||
echo "Validating get file returns png"
|
||||
validate_service \
|
||||
"${DATAPREP_GET_FILE_ENDPOINT}" \
|
||||
'.png' \
|
||||
"dataprep_get" \
|
||||
"dataprep-multimodal-redis"
|
||||
|
||||
sleep 1m
|
||||
|
||||
# multimodal retrieval microservice
|
||||
@@ -178,7 +197,7 @@ function validate_microservices() {
|
||||
"retriever-multimodal-redis" \
|
||||
"{\"text\":\"test\",\"embedding\":${your_embedding}}"
|
||||
|
||||
sleep 10s
|
||||
sleep 3m
|
||||
|
||||
# llava server
|
||||
echo "Evaluating lvm-llava"
|
||||
@@ -198,6 +217,14 @@ function validate_microservices() {
|
||||
"lvm-llava-svc" \
|
||||
'{"retrieved_docs": [], "initial_query": "What is this?", "top_n": 1, "metadata": [{"b64_img_str": "iVBORw0KGgoAAAANSUhEUgAAAAoAAAAKCAYAAACNMs+9AAAAFUlEQVR42mP8/5+hnoEIwDiqkL4KAcT9GO0U4BxoAAAAAElFTkSuQmCC", "transcript_for_inference": "yellow image", "video_id": "8c7461df-b373-4a00-8696-9a2234359fe0", "time_of_frame_ms":"37000000", "source_video":"WeAreGoingOnBullrun_8c7461df-b373-4a00-8696-9a2234359fe0.mp4"}], "chat_template":"The caption of the image is: '\''{context}'\''. {question}"}'
|
||||
|
||||
# data prep requiring lvm
|
||||
echo "Data Prep with Generating Caption for Image"
|
||||
validate_service \
|
||||
"${DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT}" \
|
||||
"Data preparation succeeded" \
|
||||
"dataprep-multimodal-redis-caption" \
|
||||
"dataprep-multimodal-redis"
|
||||
|
||||
sleep 3m
|
||||
}
|
||||
|
||||
@@ -222,14 +249,22 @@ function validate_megaservice() {
|
||||
}
|
||||
|
||||
function validate_delete {
|
||||
echo "Validate data prep delete videos"
|
||||
echo "Validate data prep delete files"
|
||||
validate_service \
|
||||
"${DATAPREP_DELETE_VIDEO_ENDPOINT}" \
|
||||
"${DATAPREP_DELETE_FILE_ENDPOINT}" \
|
||||
'{"status":true}' \
|
||||
"dataprep_del" \
|
||||
"dataprep-multimodal-redis"
|
||||
}
|
||||
|
||||
function delete_data() {
|
||||
cd $LOG_PATH
|
||||
echo "Deleting image, video, and caption"
|
||||
rm -rf ${image_fn}
|
||||
rm -rf ${video_fn}
|
||||
rm -rf ${caption_fn}
|
||||
}
|
||||
|
||||
function stop_docker() {
|
||||
cd $WORKPATH/docker_compose/intel/cpu/xeon
|
||||
docker compose -f compose.yaml stop && docker compose -f compose.yaml rm -f
|
||||
@@ -254,6 +289,7 @@ function main() {
|
||||
validate_delete
|
||||
echo "==== delete validated ===="
|
||||
|
||||
delete_data
|
||||
stop_docker
|
||||
echo y | docker system prune
|
||||
|
||||
|
||||
@@ -30,6 +30,7 @@ class Conversation:
|
||||
base64_frame: str = None
|
||||
skip_next: bool = False
|
||||
split_video: str = None
|
||||
image: str = None
|
||||
|
||||
def _template_caption(self):
|
||||
out = ""
|
||||
@@ -59,6 +60,8 @@ class Conversation:
|
||||
else:
|
||||
base64_frame = get_b64_frame_from_timestamp(self.video_file, self.time_of_frame_ms)
|
||||
self.base64_frame = base64_frame
|
||||
if base64_frame is None:
|
||||
base64_frame = ""
|
||||
content.append({"type": "image_url", "image_url": {"url": base64_frame}})
|
||||
else:
|
||||
content = message
|
||||
@@ -137,6 +140,7 @@ class Conversation:
|
||||
"caption": self.caption,
|
||||
"base64_frame": self.base64_frame,
|
||||
"split_video": self.split_video,
|
||||
"image": self.image,
|
||||
}
|
||||
|
||||
|
||||
@@ -152,4 +156,5 @@ multimodalqna_conv = Conversation(
|
||||
time_of_frame_ms=None,
|
||||
base64_frame=None,
|
||||
split_video=None,
|
||||
image=None,
|
||||
)
|
||||
|
||||
@@ -13,7 +13,7 @@ import uvicorn
|
||||
from conversation import multimodalqna_conv
|
||||
from fastapi import FastAPI
|
||||
from fastapi.staticfiles import StaticFiles
|
||||
from utils import build_logger, moderation_msg, server_error_msg, split_video
|
||||
from utils import build_logger, make_temp_image, moderation_msg, server_error_msg, split_video
|
||||
|
||||
logger = build_logger("gradio_web_server", "gradio_web_server.log")
|
||||
|
||||
@@ -47,22 +47,24 @@ def clear_history(state, request: gr.Request):
|
||||
logger.info(f"clear_history. ip: {request.client.host}")
|
||||
if state.split_video and os.path.exists(state.split_video):
|
||||
os.remove(state.split_video)
|
||||
if state.image and os.path.exists(state.image):
|
||||
os.remove(state.image)
|
||||
state = multimodalqna_conv.copy()
|
||||
return (state, state.to_gradio_chatbot(), "", None) + (disable_btn,) * 1
|
||||
return (state, state.to_gradio_chatbot(), None, None, None) + (disable_btn,) * 1
|
||||
|
||||
|
||||
def add_text(state, text, request: gr.Request):
|
||||
logger.info(f"add_text. ip: {request.client.host}. len: {len(text)}")
|
||||
if len(text) <= 0:
|
||||
state.skip_next = True
|
||||
return (state, state.to_gradio_chatbot(), "", None) + (no_change_btn,) * 1
|
||||
return (state, state.to_gradio_chatbot(), None) + (no_change_btn,) * 1
|
||||
|
||||
text = text[:2000] # Hard cut-off
|
||||
|
||||
state.append_message(state.roles[0], text)
|
||||
state.append_message(state.roles[1], None)
|
||||
state.skip_next = False
|
||||
return (state, state.to_gradio_chatbot(), "") + (disable_btn,) * 1
|
||||
return (state, state.to_gradio_chatbot(), None) + (disable_btn,) * 1
|
||||
|
||||
|
||||
def http_bot(state, request: gr.Request):
|
||||
@@ -73,7 +75,7 @@ def http_bot(state, request: gr.Request):
|
||||
if state.skip_next:
|
||||
# This generate call is skipped due to invalid inputs
|
||||
path_to_sub_videos = state.get_path_to_subvideos()
|
||||
yield (state, state.to_gradio_chatbot(), path_to_sub_videos) + (no_change_btn,) * 1
|
||||
yield (state, state.to_gradio_chatbot(), path_to_sub_videos, None) + (no_change_btn,) * 1
|
||||
return
|
||||
|
||||
if len(state.messages) == state.offset + 2:
|
||||
@@ -97,7 +99,7 @@ def http_bot(state, request: gr.Request):
|
||||
logger.info(f"==== url request ====\n{gateway_addr}")
|
||||
|
||||
state.messages[-1][-1] = "▌"
|
||||
yield (state, state.to_gradio_chatbot(), state.split_video) + (disable_btn,) * 1
|
||||
yield (state, state.to_gradio_chatbot(), state.split_video, state.image) + (disable_btn,) * 1
|
||||
|
||||
try:
|
||||
response = requests.post(
|
||||
@@ -108,6 +110,7 @@ def http_bot(state, request: gr.Request):
|
||||
)
|
||||
print(response.status_code)
|
||||
print(response.json())
|
||||
|
||||
if response.status_code == 200:
|
||||
response = response.json()
|
||||
choice = response["choices"][-1]
|
||||
@@ -123,44 +126,61 @@ def http_bot(state, request: gr.Request):
|
||||
video_file = metadata["source_video"]
|
||||
state.video_file = os.path.join(static_dir, metadata["source_video"])
|
||||
state.time_of_frame_ms = metadata["time_of_frame_ms"]
|
||||
try:
|
||||
splited_video_path = split_video(
|
||||
state.video_file, state.time_of_frame_ms, tmp_dir, f"{state.time_of_frame_ms}__{video_file}"
|
||||
)
|
||||
except:
|
||||
print(f"video {state.video_file} does not exist in UI host!")
|
||||
splited_video_path = None
|
||||
state.split_video = splited_video_path
|
||||
file_ext = os.path.splitext(state.video_file)[-1]
|
||||
if file_ext == ".mp4":
|
||||
try:
|
||||
splited_video_path = split_video(
|
||||
state.video_file, state.time_of_frame_ms, tmp_dir, f"{state.time_of_frame_ms}__{video_file}"
|
||||
)
|
||||
except:
|
||||
print(f"video {state.video_file} does not exist in UI host!")
|
||||
splited_video_path = None
|
||||
state.split_video = splited_video_path
|
||||
elif file_ext in [".jpg", ".jpeg", ".png", ".gif"]:
|
||||
try:
|
||||
output_image_path = make_temp_image(state.video_file, file_ext)
|
||||
except:
|
||||
print(f"image {state.video_file} does not exist in UI host!")
|
||||
output_image_path = None
|
||||
state.image = output_image_path
|
||||
|
||||
else:
|
||||
raise requests.exceptions.RequestException
|
||||
except requests.exceptions.RequestException as e:
|
||||
state.messages[-1][-1] = server_error_msg
|
||||
yield (state, state.to_gradio_chatbot(), None) + (enable_btn,)
|
||||
yield (state, state.to_gradio_chatbot(), None, None) + (enable_btn,)
|
||||
return
|
||||
|
||||
state.messages[-1][-1] = message
|
||||
yield (state, state.to_gradio_chatbot(), state.split_video) + (enable_btn,) * 1
|
||||
yield (
|
||||
state,
|
||||
state.to_gradio_chatbot(),
|
||||
gr.Video(state.split_video, visible=state.split_video is not None),
|
||||
gr.Image(state.image, visible=state.image is not None),
|
||||
) + (enable_btn,) * 1
|
||||
|
||||
logger.info(f"{state.messages[-1][-1]}")
|
||||
return
|
||||
|
||||
|
||||
def ingest_video_gen_transcript(filepath, request: gr.Request):
|
||||
yield (gr.Textbox(visible=True, value="Please wait for ingesting your uploaded video into database..."))
|
||||
def ingest_gen_transcript(filepath, filetype, request: gr.Request):
|
||||
yield (
|
||||
gr.Textbox(visible=True, value=f"Please wait while your uploaded {filetype} is ingested into the database...")
|
||||
)
|
||||
verified_filepath = os.path.normpath(filepath)
|
||||
if not verified_filepath.startswith(tmp_upload_folder):
|
||||
print("Found malicious video file name!")
|
||||
print(f"Found malicious {filetype} file name!")
|
||||
yield (
|
||||
gr.Textbox(
|
||||
visible=True,
|
||||
value="Your uploaded video's file name has special characters that are not allowed. Please consider update the video file name!",
|
||||
value=f"Your uploaded {filetype}'s file name has special characters that are not allowed (depends on the OS, some examples are \, /, :, and *). Please consider changing the file name.",
|
||||
)
|
||||
)
|
||||
return
|
||||
basename = os.path.basename(verified_filepath)
|
||||
dest = os.path.join(static_dir, basename)
|
||||
shutil.copy(verified_filepath, dest)
|
||||
print("Done copy uploaded file to static folder!")
|
||||
print("Done copying uploaded file to static folder.")
|
||||
headers = {
|
||||
# 'Content-Type': 'multipart/form-data'
|
||||
}
|
||||
@@ -172,17 +192,17 @@ def ingest_video_gen_transcript(filepath, request: gr.Request):
|
||||
if response.status_code == 200:
|
||||
response = response.json()
|
||||
print(response)
|
||||
yield (gr.Textbox(visible=True, value="Video ingestion is done. Saving your uploaded video..."))
|
||||
yield (gr.Textbox(visible=True, value=f"The {filetype} ingestion is done. Saving your uploaded {filetype}..."))
|
||||
time.sleep(2)
|
||||
fn_no_ext = Path(dest).stem
|
||||
if "video_id_maps" in response and fn_no_ext in response["video_id_maps"]:
|
||||
new_dst = os.path.join(static_dir, response["video_id_maps"][fn_no_ext])
|
||||
print(response["video_id_maps"][fn_no_ext])
|
||||
if "file_id_maps" in response and fn_no_ext in response["file_id_maps"]:
|
||||
new_dst = os.path.join(static_dir, response["file_id_maps"][fn_no_ext])
|
||||
print(response["file_id_maps"][fn_no_ext])
|
||||
os.rename(dest, new_dst)
|
||||
yield (
|
||||
gr.Textbox(
|
||||
visible=True,
|
||||
value="Congratulation! Your upload is done!\nClick the X button on the top right of the video upload box to upload another video.",
|
||||
value=f"Congratulations, your upload is done!\nClick the X button on the top right of the {filetype} upload box to upload another {filetype}.",
|
||||
)
|
||||
)
|
||||
return
|
||||
@@ -190,51 +210,53 @@ def ingest_video_gen_transcript(filepath, request: gr.Request):
|
||||
yield (
|
||||
gr.Textbox(
|
||||
visible=True,
|
||||
value="Something wrong!\nPlease click the X button on the top right of the video upload boxreupload your video!",
|
||||
value=f"Something went wrong (server error: {response.status_code})!\nPlease click the X button on the top right of the {filetype} upload box to reupload your {filetype}.",
|
||||
)
|
||||
)
|
||||
time.sleep(2)
|
||||
return
|
||||
|
||||
|
||||
def ingest_video_gen_caption(filepath, request: gr.Request):
|
||||
yield (gr.Textbox(visible=True, value="Please wait for ingesting your uploaded video into database..."))
|
||||
def ingest_gen_caption(filepath, filetype, request: gr.Request):
|
||||
yield (
|
||||
gr.Textbox(visible=True, value=f"Please wait while your uploaded {filetype} is ingested into the database...")
|
||||
)
|
||||
verified_filepath = os.path.normpath(filepath)
|
||||
if not verified_filepath.startswith(tmp_upload_folder):
|
||||
print("Found malicious video file name!")
|
||||
print(f"Found malicious {filetype} file name!")
|
||||
yield (
|
||||
gr.Textbox(
|
||||
visible=True,
|
||||
value="Your uploaded video's file name has special characters that are not allowed. Please consider update the video file name!",
|
||||
value=f"Your uploaded {filetype}'s file name has special characters that are not allowed (depends on the OS, some examples are \, /, :, and *). Please consider changing the file name.",
|
||||
)
|
||||
)
|
||||
return
|
||||
basename = os.path.basename(verified_filepath)
|
||||
dest = os.path.join(static_dir, basename)
|
||||
shutil.copy(verified_filepath, dest)
|
||||
print("Done copy uploaded file to static folder!")
|
||||
print("Done copying uploaded file to static folder.")
|
||||
headers = {
|
||||
# 'Content-Type': 'multipart/form-data'
|
||||
}
|
||||
files = {
|
||||
"files": open(dest, "rb"),
|
||||
}
|
||||
response = requests.post(dataprep_gen_captiono_addr, headers=headers, files=files)
|
||||
response = requests.post(dataprep_gen_caption_addr, headers=headers, files=files)
|
||||
print(response.status_code)
|
||||
if response.status_code == 200:
|
||||
response = response.json()
|
||||
print(response)
|
||||
yield (gr.Textbox(visible=True, value="Video ingestion is done. Saving your uploaded video..."))
|
||||
yield (gr.Textbox(visible=True, value=f"The {filetype} ingestion is done. Saving your uploaded {filetype}..."))
|
||||
time.sleep(2)
|
||||
fn_no_ext = Path(dest).stem
|
||||
if "video_id_maps" in response and fn_no_ext in response["video_id_maps"]:
|
||||
new_dst = os.path.join(static_dir, response["video_id_maps"][fn_no_ext])
|
||||
print(response["video_id_maps"][fn_no_ext])
|
||||
if "file_id_maps" in response and fn_no_ext in response["file_id_maps"]:
|
||||
new_dst = os.path.join(static_dir, response["file_id_maps"][fn_no_ext])
|
||||
print(response["file_id_maps"][fn_no_ext])
|
||||
os.rename(dest, new_dst)
|
||||
yield (
|
||||
gr.Textbox(
|
||||
visible=True,
|
||||
value="Congratulation! Your upload is done!\nClick the X button on the top right of the video upload box to upload another video.",
|
||||
value=f"Congratulations, your upload is done!\nClick the X button on the top right of the {filetype} upload box to upload another {filetype}.",
|
||||
)
|
||||
)
|
||||
return
|
||||
@@ -242,48 +264,181 @@ def ingest_video_gen_caption(filepath, request: gr.Request):
|
||||
yield (
|
||||
gr.Textbox(
|
||||
visible=True,
|
||||
value="Something wrong!\nPlease click the X button on the top right of the video upload boxreupload your video!",
|
||||
value=f"Something went wrong (server error: {response.status_code})!\nPlease click the X button on the top right of the {filetype} upload box to reupload your {filetype}.",
|
||||
)
|
||||
)
|
||||
time.sleep(2)
|
||||
return
|
||||
|
||||
|
||||
def clear_uploaded_video(request: gr.Request):
|
||||
def ingest_with_text(filepath, text, request: gr.Request):
|
||||
yield (gr.Textbox(visible=True, value="Please wait for your uploaded image to be ingested into the database..."))
|
||||
verified_filepath = os.path.normpath(filepath)
|
||||
if not verified_filepath.startswith(tmp_upload_folder):
|
||||
print("Found malicious image file name!")
|
||||
yield (
|
||||
gr.Textbox(
|
||||
visible=True,
|
||||
value="Your uploaded image's file name has special characters that are not allowed (depends on the OS, some examples are \, /, :, and *). Please consider changing the file name.",
|
||||
)
|
||||
)
|
||||
return
|
||||
basename = os.path.basename(verified_filepath)
|
||||
dest = os.path.join(static_dir, basename)
|
||||
shutil.copy(verified_filepath, dest)
|
||||
text_basename = "{}.txt".format(os.path.splitext(basename)[0])
|
||||
text_dest = os.path.join(static_dir, text_basename)
|
||||
with open(text_dest, "w") as file:
|
||||
file.write(text)
|
||||
print("Done copying uploaded files to static folder!")
|
||||
headers = {
|
||||
# 'Content-Type': 'multipart/form-data'
|
||||
}
|
||||
files = [("files", (basename, open(dest, "rb"))), ("files", (text_basename, open(text_dest, "rb")))]
|
||||
try:
|
||||
response = requests.post(dataprep_ingest_addr, headers=headers, files=files)
|
||||
finally:
|
||||
os.remove(text_dest)
|
||||
print(response.status_code)
|
||||
if response.status_code == 200:
|
||||
response = response.json()
|
||||
print(response)
|
||||
yield (gr.Textbox(visible=True, value="Image ingestion is done. Saving your uploaded image..."))
|
||||
time.sleep(2)
|
||||
fn_no_ext = Path(dest).stem
|
||||
if "file_id_maps" in response and fn_no_ext in response["file_id_maps"]:
|
||||
new_dst = os.path.join(static_dir, response["file_id_maps"][fn_no_ext])
|
||||
print(response["file_id_maps"][fn_no_ext])
|
||||
os.rename(dest, new_dst)
|
||||
yield (
|
||||
gr.Textbox(
|
||||
visible=True,
|
||||
value="Congratulations, your upload is done!\nClick the X button on the top right of the image upload box to upload another image.",
|
||||
)
|
||||
)
|
||||
return
|
||||
else:
|
||||
yield (
|
||||
gr.Textbox(
|
||||
visible=True,
|
||||
value=f"Something went wrong (server error: {response.status_code})!\nPlease click the X button on the top right of the image upload box to reupload your image!",
|
||||
)
|
||||
)
|
||||
time.sleep(2)
|
||||
return
|
||||
|
||||
|
||||
def hide_text(request: gr.Request):
|
||||
return gr.Textbox(visible=False)
|
||||
|
||||
|
||||
with gr.Blocks() as upload_gen_trans:
|
||||
gr.Markdown("# Ingest Your Own Video - Utilizing Generated Transcripts")
|
||||
gr.Markdown(
|
||||
"Please use this interface to ingest your own video if the video has meaningful audio (e.g., announcements, discussions, etc...)"
|
||||
)
|
||||
def clear_text(request: gr.Request):
|
||||
return None
|
||||
|
||||
|
||||
with gr.Blocks() as upload_video:
|
||||
gr.Markdown("# Ingest Your Own Video Using Generated Transcripts or Captions")
|
||||
gr.Markdown("Use this interface to ingest your own video and generate transcripts or captions for it")
|
||||
|
||||
def select_upload_type(choice, request: gr.Request):
|
||||
if choice == "transcript":
|
||||
return gr.Video(sources="upload", visible=True), gr.Video(sources="upload", visible=False)
|
||||
else:
|
||||
return gr.Video(sources="upload", visible=False), gr.Video(sources="upload", visible=True)
|
||||
|
||||
with gr.Row():
|
||||
with gr.Column(scale=6):
|
||||
video_upload = gr.Video(sources="upload", height=512, width=512, elem_id="video_upload")
|
||||
video_upload_trans = gr.Video(sources="upload", elem_id="video_upload_trans", visible=True)
|
||||
video_upload_cap = gr.Video(sources="upload", elem_id="video_upload_cap", visible=False)
|
||||
with gr.Column(scale=3):
|
||||
text_options_radio = gr.Radio(
|
||||
[
|
||||
("Generate transcript (video contains voice)", "transcript"),
|
||||
("Generate captions (video does not contain voice)", "caption"),
|
||||
],
|
||||
label="Text Options",
|
||||
info="How should text be ingested?",
|
||||
value="transcript",
|
||||
)
|
||||
text_upload_result = gr.Textbox(visible=False, interactive=False, label="Upload Status")
|
||||
video_upload_trans.upload(
|
||||
ingest_gen_transcript, [video_upload_trans, gr.Textbox(value="video", visible=False)], [text_upload_result]
|
||||
)
|
||||
video_upload_trans.clear(hide_text, [], [text_upload_result])
|
||||
video_upload_cap.upload(
|
||||
ingest_gen_caption, [video_upload_cap, gr.Textbox(value="video", visible=False)], [text_upload_result]
|
||||
)
|
||||
video_upload_cap.clear(hide_text, [], [text_upload_result])
|
||||
text_options_radio.change(select_upload_type, [text_options_radio], [video_upload_trans, video_upload_cap])
|
||||
|
||||
with gr.Blocks() as upload_image:
|
||||
gr.Markdown("# Ingest Your Own Image Using Generated or Custom Captions/Labels")
|
||||
gr.Markdown("Use this interface to ingest your own image and generate a caption for it")
|
||||
|
||||
def select_upload_type(choice, request: gr.Request):
|
||||
if choice == "gen_caption":
|
||||
return gr.Image(sources="upload", visible=True), gr.Image(sources="upload", visible=False)
|
||||
else:
|
||||
return gr.Image(sources="upload", visible=False), gr.Image(sources="upload", visible=True)
|
||||
|
||||
with gr.Row():
|
||||
with gr.Column(scale=6):
|
||||
image_upload_cap = gr.Image(type="filepath", sources="upload", elem_id="image_upload_cap", visible=True)
|
||||
image_upload_text = gr.Image(type="filepath", sources="upload", elem_id="image_upload_cap", visible=False)
|
||||
with gr.Column(scale=3):
|
||||
text_options_radio = gr.Radio(
|
||||
[("Generate caption", "gen_caption"), ("Custom caption or label", "custom_caption")],
|
||||
label="Text Options",
|
||||
info="How should text be ingested?",
|
||||
value="gen_caption",
|
||||
)
|
||||
custom_caption = gr.Textbox(visible=True, interactive=True, label="Custom Caption or Label")
|
||||
text_upload_result = gr.Textbox(visible=False, interactive=False, label="Upload Status")
|
||||
image_upload_cap.upload(
|
||||
ingest_gen_caption, [image_upload_cap, gr.Textbox(value="image", visible=False)], [text_upload_result]
|
||||
)
|
||||
image_upload_cap.clear(hide_text, [], [text_upload_result])
|
||||
image_upload_text.upload(ingest_with_text, [image_upload_text, custom_caption], [text_upload_result]).then(
|
||||
clear_text, [], [custom_caption]
|
||||
)
|
||||
image_upload_text.clear(hide_text, [], [text_upload_result])
|
||||
text_options_radio.change(select_upload_type, [text_options_radio], [image_upload_cap, image_upload_text])
|
||||
|
||||
with gr.Blocks() as upload_audio:
|
||||
gr.Markdown("# Ingest Your Own Audio Using Generated Transcripts")
|
||||
gr.Markdown("Use this interface to ingest your own audio file and generate a transcript for it")
|
||||
with gr.Row():
|
||||
with gr.Column(scale=6):
|
||||
audio_upload = gr.Audio(type="filepath")
|
||||
with gr.Column(scale=3):
|
||||
text_upload_result = gr.Textbox(visible=False, interactive=False, label="Upload Status")
|
||||
video_upload.upload(ingest_video_gen_transcript, [video_upload], [text_upload_result])
|
||||
video_upload.clear(clear_uploaded_video, [], [text_upload_result])
|
||||
audio_upload.upload(
|
||||
ingest_gen_transcript, [audio_upload, gr.Textbox(value="audio", visible=False)], [text_upload_result]
|
||||
)
|
||||
audio_upload.stop_recording(
|
||||
ingest_gen_transcript, [audio_upload, gr.Textbox(value="audio", visible=False)], [text_upload_result]
|
||||
)
|
||||
audio_upload.clear(hide_text, [], [text_upload_result])
|
||||
|
||||
with gr.Blocks() as upload_gen_captions:
|
||||
gr.Markdown("# Ingest Your Own Video - Utilizing Generated Captions")
|
||||
gr.Markdown(
|
||||
"Please use this interface to ingest your own video if the video has meaningless audio (e.g., background musics, etc...)"
|
||||
)
|
||||
with gr.Blocks() as upload_pdf:
|
||||
gr.Markdown("# Ingest Your Own PDF")
|
||||
gr.Markdown("Use this interface to ingest your own PDF file with text, tables, images, and graphs")
|
||||
with gr.Row():
|
||||
with gr.Column(scale=6):
|
||||
video_upload_cap = gr.Video(sources="upload", height=512, width=512, elem_id="video_upload_cap")
|
||||
image_upload_cap = gr.File()
|
||||
with gr.Column(scale=3):
|
||||
text_upload_result_cap = gr.Textbox(visible=False, interactive=False, label="Upload Status")
|
||||
video_upload_cap.upload(ingest_video_gen_transcript, [video_upload_cap], [text_upload_result_cap])
|
||||
video_upload_cap.clear(clear_uploaded_video, [], [text_upload_result_cap])
|
||||
image_upload_cap.upload(
|
||||
ingest_gen_caption, [image_upload_cap, gr.Textbox(value="PDF", visible=False)], [text_upload_result_cap]
|
||||
)
|
||||
image_upload_cap.clear(hide_text, [], [text_upload_result_cap])
|
||||
|
||||
with gr.Blocks() as qna:
|
||||
state = gr.State(multimodalqna_conv.copy())
|
||||
with gr.Row():
|
||||
with gr.Column(scale=4):
|
||||
video = gr.Video(height=512, width=512, elem_id="video")
|
||||
video = gr.Video(height=512, width=512, elem_id="video", visible=True, label="Media")
|
||||
image = gr.Image(height=512, width=512, elem_id="image", visible=False, label="Media")
|
||||
with gr.Column(scale=7):
|
||||
chatbot = gr.Chatbot(elem_id="chatbot", label="MultimodalQnA Chatbot", height=390)
|
||||
with gr.Row():
|
||||
@@ -293,7 +448,8 @@ with gr.Blocks() as qna:
|
||||
# show_label=False,
|
||||
# container=False,
|
||||
label="Query",
|
||||
info="Enter your query here!",
|
||||
info="Enter a text query below",
|
||||
# submit_btn=False,
|
||||
)
|
||||
with gr.Column(scale=1, min_width=100):
|
||||
with gr.Row():
|
||||
@@ -306,7 +462,7 @@ with gr.Blocks() as qna:
|
||||
[
|
||||
state,
|
||||
],
|
||||
[state, chatbot, textbox, video, clear_btn],
|
||||
[state, chatbot, textbox, video, image, clear_btn],
|
||||
)
|
||||
|
||||
submit_btn.click(
|
||||
@@ -318,17 +474,19 @@ with gr.Blocks() as qna:
|
||||
[
|
||||
state,
|
||||
],
|
||||
[state, chatbot, video, clear_btn],
|
||||
[state, chatbot, video, image, clear_btn],
|
||||
)
|
||||
with gr.Blocks(css=css) as demo:
|
||||
gr.Markdown("# MultimodalQnA")
|
||||
with gr.Tabs():
|
||||
with gr.TabItem("MultimodalQnA With Your Videos"):
|
||||
with gr.TabItem("MultimodalQnA"):
|
||||
qna.render()
|
||||
with gr.TabItem("Upload Your Own Videos"):
|
||||
upload_gen_trans.render()
|
||||
with gr.TabItem("Upload Your Own Videos"):
|
||||
upload_gen_captions.render()
|
||||
with gr.TabItem("Upload Video"):
|
||||
upload_video.render()
|
||||
with gr.TabItem("Upload Image"):
|
||||
upload_image.render()
|
||||
with gr.TabItem("Upload Audio"):
|
||||
upload_audio.render()
|
||||
|
||||
demo.queue()
|
||||
app = gr.mount_gradio_app(app, demo, path="/")
|
||||
@@ -343,6 +501,9 @@ if __name__ == "__main__":
|
||||
parser.add_argument("--share", action="store_true")
|
||||
|
||||
backend_service_endpoint = os.getenv("BACKEND_SERVICE_ENDPOINT", "http://localhost:8888/v1/multimodalqna")
|
||||
dataprep_ingest_endpoint = os.getenv(
|
||||
"DATAPREP_INGEST_SERVICE_ENDPOINT", "http://localhost:6007/v1/ingest_with_text"
|
||||
)
|
||||
dataprep_gen_transcript_endpoint = os.getenv(
|
||||
"DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT", "http://localhost:6007/v1/generate_transcripts"
|
||||
)
|
||||
@@ -353,9 +514,11 @@ if __name__ == "__main__":
|
||||
logger.info(f"args: {args}")
|
||||
global gateway_addr
|
||||
gateway_addr = backend_service_endpoint
|
||||
global dataprep_ingest_addr
|
||||
dataprep_ingest_addr = dataprep_ingest_endpoint
|
||||
global dataprep_gen_transcript_addr
|
||||
dataprep_gen_transcript_addr = dataprep_gen_transcript_endpoint
|
||||
global dataprep_gen_captiono_addr
|
||||
dataprep_gen_captiono_addr = dataprep_gen_caption_endpoint
|
||||
global dataprep_gen_caption_addr
|
||||
dataprep_gen_caption_addr = dataprep_gen_caption_endpoint
|
||||
|
||||
uvicorn.run(app, host=args.host, port=args.port)
|
||||
|
||||
@@ -5,6 +5,7 @@ import base64
|
||||
import logging
|
||||
import logging.handlers
|
||||
import os
|
||||
import shutil
|
||||
import sys
|
||||
from pathlib import Path
|
||||
|
||||
@@ -118,6 +119,18 @@ def maintain_aspect_ratio_resize(image, width=None, height=None, inter=cv2.INTER
|
||||
return cv2.resize(image, dim, interpolation=inter)
|
||||
|
||||
|
||||
def make_temp_image(
|
||||
image_name,
|
||||
file_ext,
|
||||
output_image_path: str = "./public/images",
|
||||
output_image_name: str = "image_tmp",
|
||||
):
|
||||
Path(output_image_path).mkdir(parents=True, exist_ok=True)
|
||||
output_image = os.path.join(output_image_path, "{}.{}".format(output_image_name, file_ext))
|
||||
shutil.copy(image_name, output_image)
|
||||
return output_image
|
||||
|
||||
|
||||
# function to split video at a timestamp
|
||||
def split_video(
|
||||
video_path,
|
||||