Add dataprep support for CLIP-based models for the VideoRAGQnA example for v1.0 (#621)

* dataprep service

Signed-off-by: srinarayan-srikanthan <srinarayan.srikanthan@intel.com>

* dataprep updates

Signed-off-by: srinarayan-srikanthan <srinarayan.srikanthan@intel.com>

* rearranged dirs

Signed-off-by: srinarayan-srikanthan <srinarayan.srikanthan@intel.com>

* added readme

Signed-off-by: srinarayan-srikanthan <srinarayan.srikanthan@intel.com>

* removed checks

Signed-off-by: srinarayan-srikanthan <srinarayan.srikanthan@intel.com>

* added features

Signed-off-by: srinarayan-srikanthan <srinarayan.srikanthan@intel.com>

* added get method

Signed-off-by: srinarayan-srikanthan <srinarayan.srikanthan@intel.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add dim at init, rm unused

Signed-off-by: BaoHuiling <huiling.bao@intel.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add wait after connect DB

Signed-off-by: BaoHuiling <huiling.bao@intel.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove unused

Signed-off-by: BaoHuiling <huiling.bao@intel.com>

* Update comps/dataprep/vdms/README.md

Co-authored-by: XinyuYe-Intel <xinyu.ye@intel.com>
Signed-off-by: BaoHuiling <huiling.bao@intel.com>

* add test script for mm case

Signed-off-by: BaoHuiling <huiling.bao@intel.com>

* add return value and update readme

Signed-off-by: BaoHuiling <huiling.bao@intel.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* check bug

Signed-off-by: BaoHuiling <huiling.bao@intel.com>

* fix mm-script

Signed-off-by: BaoHuiling <huiling.bao@intel.com>

* add into dataprep workflow

Signed-off-by: BaoHuiling <huiling.bao@intel.com>

* rm whitespace

Signed-off-by: BaoHuiling <huiling.bao@intel.com>

* updated readme and added test script

Signed-off-by: srinarayan-srikanthan <srinarayan.srikanthan@intel.com>

* removed unused file

Signed-off-by: srinarayan-srikanthan <srinarayan.srikanthan@intel.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* move test script

Signed-off-by: BaoHuiling <huiling.bao@intel.com>

* restructured repo

Signed-off-by: srinarayan-srikanthan <srinarayan.srikanthan@intel.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* updates path in test script

Signed-off-by: srinarayan-srikanthan <srinarayan.srikanthan@intel.com>

* add name for build

Signed-off-by: BaoHuiling <huiling.bao@intel.com>

---------

Signed-off-by: srinarayan-srikanthan <srinarayan.srikanthan@intel.com>
Signed-off-by: BaoHuiling <huiling.bao@intel.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: BaoHuiling <huiling.bao@intel.com>
Co-authored-by: XinyuYe-Intel <xinyu.ye@intel.com>
Authored by sri-intel on 2024-09-11 01:19:04 -04:00, committed by GitHub
parent 4165c7d8c1
commit f84d91a3b0
20 changed files with 1475 additions and 0 deletions


@@ -23,3 +23,7 @@ services:
build:
dockerfile: comps/dataprep/pinecone/langchain/Dockerfile
image: ${REGISTRY:-opea}/dataprep-pinecone:${TAG:-latest}
dataprep-vdms:
build:
dockerfile: comps/dataprep/vdms/multimodal_langchain/docker/Dockerfile
image: ${REGISTRY:-opea}/dataprep-vdms:${TAG:-latest}


@@ -0,0 +1,189 @@
# Dataprep Microservice with VDMS
For the dataprep microservice, we currently provide one framework: `Langchain`.
<!-- We also provide `Langchain_ray` which uses ray to parallel the data prep for multi-file performance improvement(observed 5x - 15x speedup by processing 1000 files/links.). -->
The folders are organized consistently, so you can set up the dataprep microservice by following the instructions below.
# 🚀1. Start Microservice with Python (Option 1)
## 1.1 Install Requirements
Install the single-process version (for processing 1-10 files):
```bash
apt-get update
apt-get install -y default-jre tesseract-ocr libtesseract-dev poppler-utils
cd langchain
pip install -r requirements.txt
```
<!-- - option 2: Install multi-process version (for >10 files processing)
```bash
cd langchain_ray; pip install -r requirements_ray.txt
``` -->
## 1.2 Start VDMS Server
Please refer to this [readme](../../vectorstores/langchain/vdms/README.md).
## 1.3 Setup Environment Variables
```bash
export http_proxy=${your_http_proxy}
export https_proxy=${your_http_proxy}
export VDMS_HOST=${host_ip}
export VDMS_PORT=55555
export COLLECTION_NAME=${your_collection_name}
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_PROJECT="opea/gen-ai-comps:dataprep"
export PYTHONPATH=${path_to_comps}
```
## 1.4 Start Document Preparation Microservice for VDMS with Python Script
Start the document preparation microservice for VDMS with the command below (single-process version, for processing 1-10 files):
```bash
python prepare_doc_vdms.py
```
<!-- - option 2: Start multi-process version (for >10 files processing)
```bash
python prepare_doc_redis_on_ray.py
``` -->
# 🚀2. Start Microservice with Docker (Option 2)
## 2.1 Start VDMS Server
Please refer to this [readme](../../vectorstores/langchain/vdms/README.md).
## 2.2 Setup Environment Variables
```bash
export http_proxy=${your_http_proxy}
export https_proxy=${your_http_proxy}
export VDMS_HOST=${host_ip}
export VDMS_PORT=55555
export TEI_ENDPOINT=${your_tei_endpoint}
export COLLECTION_NAME=${your_collection_name}
export SEARCH_ENGINE="FaissFlat"
export DISTANCE_STRATEGY="L2"
export PYTHONPATH=${path_to_comps}
```
## 2.3 Build Docker Image
- Build the Docker image with Langchain (single-process version, for processing 1-10 files):
```bash
cd ../../../
docker build -t opea/dataprep-vdms:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/dataprep/vdms/langchain/Dockerfile .
```
<!-- - option 2: Start multi-process version (for >10 files processing)
```bash
cd ../../../../
docker build -t opea/dataprep-on-ray-vdms:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/dataprep/vdms/langchain_ray/Dockerfile .
``` -->
## 2.4 Run Docker with CLI
Start the single-process version (for processing 1-10 files):
```bash
docker run -d --name="dataprep-vdms-server" -p 6007:6007 --runtime=runc --ipc=host \
-e http_proxy=$http_proxy -e https_proxy=$https_proxy -e TEI_ENDPOINT=$TEI_ENDPOINT \
-e COLLECTION_NAME=$COLLECTION_NAME -e VDMS_HOST=$VDMS_HOST -e VDMS_PORT=$VDMS_PORT \
opea/dataprep-vdms:latest
```
<!-- - option 2: Start multi-process version (for >10 files processing)
```bash
docker run -d --name="dataprep-vdms-server" -p 6007:6007 --runtime=runc --ipc=host \
-e http_proxy=$http_proxy -e https_proxy=$https_proxy \
-e COLLECTION_NAME=$COLLECTION_NAME -e VDMS_HOST=$VDMS_HOST -e VDMS_PORT=$VDMS_PORT \
-e TIMEOUT_SECONDS=600 opea/dataprep-on-ray-vdms:latest
``` -->
# 🚀3. Check Microservice Status
```bash
docker container logs -f dataprep-vdms-server
```
# 🚀4. Consume Microservice
Once the document preparation microservice for VDMS is started, you can use the commands below to invoke the microservice, which converts documents to embeddings and saves them to the database.
Make sure the file path after `files=@` is correct.
- Single file upload
```bash
curl -X POST \
-H "Content-Type: multipart/form-data" \
-F "files=@./file1.txt" \
http://localhost:6007/v1/dataprep
```
You can specify `chunk_size` and `chunk_overlap` with the following command.
```bash
curl -X POST \
-H "Content-Type: multipart/form-data" \
-F "files=@./LLAMA2_page6.pdf" \
-F "chunk_size=1500" \
-F "chunk_overlap=100" \
http://localhost:6007/v1/dataprep
```
- Multiple file upload
```bash
curl -X POST \
-H "Content-Type: multipart/form-data" \
-F "files=@./file1.txt" \
-F "files=@./file2.txt" \
-F "files=@./file3.txt" \
http://localhost:6007/v1/dataprep
```
- Links upload (not currently supported for llama_index)
```bash
curl -X POST \
-F 'link_list=["https://www.ces.tech/"]' \
http://localhost:6007/v1/dataprep
```
or
```python
import requests
import json
proxies = {"http": ""}
url = "http://localhost:6007/v1/dataprep"
urls = [
    "https://towardsdatascience.com/no-gpu-no-party-fine-tune-bert-for-sentiment-analysis-with-vertex-ai-custom-jobs-d8fc410e908b?source=rss----7f60cf5620c9---4"
]
payload = {"link_list": json.dumps(urls)}
try:
    resp = requests.post(url=url, data=payload, proxies=proxies)
    print(resp.text)
    resp.raise_for_status()  # Raise an exception for unsuccessful HTTP status codes
    print("Request successful!")
except requests.exceptions.RequestException as e:
    print("An error occurred:", e)
```
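File uploads can also be scripted. A minimal sketch, assuming the service runs at `http://localhost:6007`; the file name and chunking values are placeholders:
```python
import requests

url = "http://localhost:6007/v1/dataprep"
proxies = {"http": ""}

# Upload one local file and override the default chunking parameters.
with open("./file1.txt", "rb") as f:
    files = [("files", ("file1.txt", f, "text/plain"))]
    data = {"chunk_size": 1500, "chunk_overlap": 100}
    resp = requests.post(url=url, files=files, data=data, proxies=proxies)

resp.raise_for_status()
print(resp.json())  # expected: {"status": 200, "message": "Data preparation succeeded"}
```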


@@ -0,0 +1,39 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0
FROM python:3.11-slim
ENV LANG=C.UTF-8
ARG ARCH="cpu"
RUN apt-get update -y && apt-get install -y --no-install-recommends --fix-missing \
build-essential \
libcairo2-dev \
libgl1-mesa-glx \
libjemalloc-dev \
vim
RUN useradd -m -s /bin/bash user && \
mkdir -p /home/user && \
chown -R user /home/user/
USER user
COPY comps /home/user/comps
RUN pip install --no-cache-dir --upgrade pip setuptools && \
if [ ${ARCH} = "cpu" ]; then pip install torch torchvision --index-url https://download.pytorch.org/whl/cpu; fi && \
pip install --no-cache-dir -r /home/user/comps/dataprep/vdms/langchain/requirements.txt
ENV PYTHONPATH=/home/user
USER root
RUN mkdir -p /home/user/comps/dataprep/vdms/langchain/uploaded_files && chown -R user /home/user/comps/dataprep/vdms/langchain
USER user
WORKDIR /home/user/comps/dataprep/vdms/langchain
ENTRYPOINT ["python", "prepare_doc_vdms.py"]


@@ -0,0 +1,2 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0


@@ -0,0 +1,33 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0
import os
def getEnv(key, default_value=None):
env_value = os.getenv(key, default=default_value)
print(f"{key}: {env_value}")
return env_value
# Embedding model
EMBED_MODEL = getEnv("EMBED_MODEL", "BAAI/bge-base-en-v1.5")
# VDMS configuration
VDMS_HOST = getEnv("VDMS_HOST", "localhost")
VDMS_PORT = int(getEnv("VDMS_PORT", 55555))
COLLECTION_NAME = getEnv("COLLECTION_NAME", "rag-vdms")
SEARCH_ENGINE = getEnv("SEARCH_ENGINE", "FaissFlat")
DISTANCE_STRATEGY = getEnv("DISTANCE_STRATEGY", "L2")
# LLM/Embedding endpoints
TGI_LLM_ENDPOINT = getEnv("TGI_LLM_ENDPOINT", "http://localhost:8080")
TGI_LLM_ENDPOINT_NO_RAG = getEnv("TGI_LLM_ENDPOINT_NO_RAG", "http://localhost:8081")
TEI_EMBEDDING_ENDPOINT = getEnv("TEI_ENDPOINT")
# chunk parameters
CHUNK_SIZE = getEnv("CHUNK_SIZE", 1500)
CHUNK_OVERLAP = getEnv("CHUNK_OVERLAP", 100)
current_file_path = os.path.abspath(__file__)
parent_dir = os.path.dirname(current_file_path)


@@ -0,0 +1,28 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0
version: "3"
services:
vdms-vector-db:
image: intellabs/vdms:latest
container_name: vdms-vector-db
ports:
- "55555:55555"
dataprep-vdms:
image: opea/dataprep-vdms:latest
container_name: dataprep-vdms-server
ports:
- "6007:6007"
ipc: host
environment:
no_proxy: ${no_proxy}
http_proxy: ${http_proxy}
https_proxy: ${https_proxy}
VDMS_HOST: ${VDMS_HOST}
VDMS_PORT: ${VDMS_PORT}
COLLECTION_NAME: ${COLLECTION_NAME}
restart: unless-stopped
networks:
default:
driver: bridge


@@ -0,0 +1,166 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0
import json
import os
from typing import List, Optional, Union
from config import COLLECTION_NAME, DISTANCE_STRATEGY, EMBED_MODEL, SEARCH_ENGINE, VDMS_HOST, VDMS_PORT
from fastapi import Body, File, Form, HTTPException, UploadFile
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceBgeEmbeddings, HuggingFaceEmbeddings, HuggingFaceHubEmbeddings
from langchain_community.vectorstores.vdms import VDMS, VDMS_Client
from langchain_text_splitters import HTMLHeaderTextSplitter
from comps import CustomLogger, DocPath, opea_microservices, register_microservice
from comps.dataprep.utils import (
create_upload_folder,
document_loader,
encode_filename,
get_separators,
get_tables_result,
parse_html,
save_content_to_local_disk,
)
tei_embedding_endpoint = os.getenv("TEI_ENDPOINT")
client = VDMS_Client(VDMS_HOST, int(VDMS_PORT))
logger = CustomLogger("prepare_doc_vdms")
logflag = os.getenv("LOGFLAG", False)
upload_folder = "./uploaded_files/"
def ingest_data_to_vdms(doc_path: DocPath):
"""Ingest document to VDMS."""
path = doc_path.path
print(f"Parsing document {doc_path}.")
if path.endswith(".html"):
headers_to_split_on = [
("h1", "Header 1"),
("h2", "Header 2"),
("h3", "Header 3"),
]
text_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
else:
text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=doc_path.chunk_size, chunk_overlap=doc_path.chunk_overlap, add_start_index=True, separators=get_separators()
)
content = document_loader(path)
chunks = text_splitter.split_text(content)
if doc_path.process_table and path.endswith(".pdf"):
table_chunks = get_tables_result(path, doc_path.table_strategy)
chunks = chunks + table_chunks
print("Done preprocessing. Created ", len(chunks), " chunks of the original pdf")
# Create vectorstore
if tei_embedding_endpoint:
# create embeddings using TEI endpoint service
embedder = HuggingFaceHubEmbeddings(model=tei_embedding_endpoint)
else:
# create embeddings using local embedding model
embedder = HuggingFaceBgeEmbeddings(model_name=EMBED_MODEL)
# Batch size
batch_size = 32
num_chunks = len(chunks)
for i in range(0, num_chunks, batch_size):
batch_chunks = chunks[i : i + batch_size]
batch_texts = batch_chunks
_ = VDMS.from_texts(
client=client,
embedding=embedder,
collection_name=COLLECTION_NAME,
distance_strategy=DISTANCE_STRATEGY,
engine=SEARCH_ENGINE,
texts=batch_texts,
)
print(f"Processed batch {i//batch_size + 1}/{(num_chunks-1)//batch_size + 1}")
@register_microservice(
name="opea_service@prepare_doc_vdms",
endpoint="/v1/dataprep",
host="0.0.0.0",
port=6007,
)
async def ingest_documents(
files: Optional[Union[UploadFile, List[UploadFile]]] = File(None),
link_list: Optional[str] = Form(None),
chunk_size: int = Form(1500),
chunk_overlap: int = Form(100),
process_table: bool = Form(False),
table_strategy: str = Form("fast"),
):
if logflag:
logger.info(f"[ upload ] files:{files}")
logger.info(f"[ upload ] link_list:{link_list}")
if files:
if not isinstance(files, list):
files = [files]
uploaded_files = []
for file in files:
encode_file = encode_filename(file.filename)
doc_id = "file:" + encode_file
if logflag:
logger.info(f"[ upload ] processing file {doc_id}")
save_path = upload_folder + encode_file
await save_content_to_local_disk(save_path, file)
ingest_data_to_vdms(
DocPath(
path=save_path,
chunk_size=chunk_size,
chunk_overlap=chunk_overlap,
process_table=process_table,
table_strategy=table_strategy,
)
)
uploaded_files.append(save_path)
if logflag:
logger.info(f"[ upload ] Successfully saved file {save_path}")
result = {"status": 200, "message": "Data preparation succeeded"}
if logflag:
logger.info(result)
return result
if link_list:
link_list = json.loads(link_list) # Parse JSON string to list
if not isinstance(link_list, list):
raise HTTPException(status_code=400, detail=f"Link_list {link_list} should be a list.")
for link in link_list:
encoded_link = encode_filename(link)
doc_id = "file:" + encoded_link + ".txt"
if logflag:
logger.info(f"[ upload ] processing link {doc_id}")
# check whether the link file already exists
save_path = upload_folder + encoded_link + ".txt"
content = parse_html([link])[0][0]
await save_content_to_local_disk(save_path, content)
ingest_data_to_vdms(
DocPath(
path=save_path,
chunk_size=chunk_size,
chunk_overlap=chunk_overlap,
process_table=process_table,
table_strategy=table_strategy,
)
)
if logflag:
logger.info(f"[ upload ] Successfully saved link list {link_list}")
return {"status": 200, "message": "Data preparation succeeded"}
raise HTTPException(status_code=400, detail="Must provide either a file or a string list.")
if __name__ == "__main__":
create_upload_folder(upload_folder)
opea_microservices["opea_service@prepare_doc_vdms"].start()


@@ -0,0 +1,37 @@
beautifulsoup4
cairosvg
decord
docarray[full]
docx2txt
easyocr
einops
fastapi
huggingface_hub
langchain
langchain-community
langchain-core
langchain-text-splitters
langsmith
markdown
numpy
opencv-python
opentelemetry-api
opentelemetry-exporter-otlp
opentelemetry-sdk
pandas
Pillow
prometheus-fastapi-instrumentator
pymupdf
pyspark
python-bidi==0.4.2
python-docx
python-pptx
PyYAML
sentence_transformers
shortuuid
tqdm
typing
tzlocal
unstructured[all-docs]==0.11.5
uvicorn
vdms


@@ -0,0 +1,40 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0
FROM python:3.11-slim
ENV LANG=C.UTF-8
ARG ARCH="cpu"
RUN apt-get update -y && apt-get install -y --no-install-recommends --fix-missing \
build-essential \
libcairo2-dev \
libgl1-mesa-glx \
libjemalloc-dev \
vim
RUN useradd -m -s /bin/bash user && \
mkdir -p /home/user && \
chown -R user /home/user/
USER user
COPY comps /home/user/comps
RUN pip install --no-cache-dir --upgrade pip setuptools && \
if [ ${ARCH} = "cpu" ]; then pip install torch torchvision --index-url https://download.pytorch.org/whl/cpu; fi && \
pip install --no-cache-dir -r /home/user/comps/dataprep/vdms/multimodal_langchain/requirements.txt
ENV PYTHONPATH=/home/user
USER root
RUN mkdir -p /home/user/comps/dataprep/vdms/multimodal_langchain/uploaded_files && chown -R user /home/user/comps/dataprep/vdms/multimodal_langchain
USER user
WORKDIR /home/user/comps/dataprep/vdms/multimodal_langchain
ENTRYPOINT ["python", "ingest_videos.py"]


@@ -0,0 +1,124 @@
# Multimodal Dataprep Microservice with VDMS
For the dataprep microservice, we currently provide one framework: `Langchain`.
# 🚀1. Start Microservice with Python (Option 1)
## 1.1 Install Requirements
- Install the single-process version (for processing 1-10 files):
```bash
apt-get update
apt-get install -y default-jre tesseract-ocr libtesseract-dev poppler-utils
pip install -r requirements.txt
```
## 1.2 Start VDMS Server
```bash
docker run -d --name="vdms-vector-db" -p 55555:55555 intellabs/vdms:latest
```
## 1.3 Setup Environment Variables
```bash
export http_proxy=${your_http_proxy}
export https_proxy=${your_http_proxy}
export host_ip=$(hostname -I | awk '{print $1}')
export VDMS_HOST=${host_ip}
export VDMS_PORT=55555
export INDEX_NAME="rag-vdms"
export your_hf_api_token="{your_hf_token}"
export PYTHONPATH=${path_to_comps}
```
## 1.4 Start Data Preparation Microservice for VDMS with Python Script
Start the data preparation microservice for VDMS with the command below:
```bash
python ingest_videos.py
```
# 🚀2. Start Microservice with Docker (Option 2)
## 2.1 Start VDMS Server
```bash
docker run -d --name="vdms-vector-db" -p 55555:55555 intellabs/vdms:latest
```
## 2.2 Setup Environment Variables
```bash
export http_proxy=${your_http_proxy}
export https_proxy=${your_http_proxy}
export host_ip=$(hostname -I | awk '{print $1}')
export VDMS_HOST=${host_ip}
export VDMS_PORT=55555
export INDEX_NAME="rag-vdms"
export your_hf_api_token="{your_hf_token}"
```
## 2.3 Build Docker Image
- Build the Docker image:
```bash
cd ../../../
docker build -t opea/dataprep-vdms:latest --network host --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/dataprep/vdms/multimodal_langchain/Dockerfile .
```
## 2.4 Run Docker Compose
```bash
docker compose -f comps/dataprep/vdms/multimodal_langchain/docker-compose-dataprep-vdms.yaml up -d
```
# 🚀3. Check Microservice Status
```bash
docker container logs -f dataprep-vdms-server
```
# 🚀4. Consume Microservice
Once the data preparation microservice for VDMS is started, you can use the commands below to invoke the microservice, which converts videos to embeddings and saves them to the database.
Make sure the file path after `files=@` is correct.
- Single file upload
```bash
curl -X POST \
-H "Content-Type: multipart/form-data" \
-F "files=@./file1.mp4" \
http://localhost:6007/v1/dataprep
```
- Multiple file upload
```bash
curl -X POST \
-H "Content-Type: multipart/form-data" \
-F "files=@./file1.mp4" \
-F "files=@./file2.mp4" \
-F "files=@./file3.mp4" \
http://localhost:6007/v1/dataprep
```
- List of uploaded files
```bash
curl -X GET http://localhost:6007/v1/dataprep/get_videos
```
- Download uploaded files
Use a file name from the list returned by the `get_videos` endpoint:
```bash
curl -X GET http://localhost:6007/v1/dataprep/get_file/${filename}
```
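The video endpoints can also be exercised from Python. A minimal sketch, assuming the service runs at `http://localhost:6007` and that `file1.mp4` is a placeholder video:
```python
import requests

base = "http://localhost:6007/v1/dataprep"

# Upload an mp4 video (the service only accepts .mp4 files).
with open("./file1.mp4", "rb") as f:
    resp = requests.post(base, files=[("files", ("file1.mp4", f, "video/mp4"))])
print(resp.json())  # expected: {"message": "Videos ingested successfully"}

# List the uploaded videos, then download the first one.
videos = requests.get(f"{base}/get_videos").json()
if videos:
    data = requests.get(f"{base}/get_file/{videos[0]}").content
    with open(videos[0], "wb") as out:
        out.write(data)
```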


@@ -0,0 +1,2 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0


@@ -0,0 +1,30 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0
# Path to all videos
videos: uploaded_files/
# Do you want to extract frames of videos (True if not done already, else False)
generate_frames: True
# How do you want to generate feature embeddings?
embeddings:
vclip_model_name: "openai/clip-vit-base-patch32"
vclip_num_frm: 64
vector_dimensions: 512
path: "uploaded_files/embeddings"
# VL-branch config
vl_branch:
cfg_path: embedding/video_llama_config/video_llama_eval_only_vl.yaml
model_type: "llama_v2"
# Path to store metadata files
meta_output_dir: uploaded_files/video_metadata/
# Chunk duration defines how often, in seconds, an embedding is taken along the video
chunk_duration: 30
# Clip duration defines the length, in seconds, of the clip that is embedded at each interval
clip_duration: 10
# e.g. for every <chunk_duration> seconds, the frames from the first <clip_duration> seconds of that interval are embedded (see the sketch after this file)
vector_db:
  choice_of_db: "vdms" # Supported databases: [vdms]
# LLM path
model_path: meta-llama/Llama-2-7b-chat-hf
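To make the chunk/clip relationship concrete, here is a rough sketch of how the intervals fall out of the two settings above (the actual ingestion code works on frame indices; the helper below is only illustrative):
```python
def intervals(total_seconds, chunk_duration=30, clip_duration=10):
    """Return (start, end) pairs in seconds: one clip at the start of every chunk."""
    result, start = [], 0
    while start < total_seconds:
        result.append((start, min(start + clip_duration, total_seconds)))
        start += chunk_duration
    return result

# A 65-second video with the defaults above yields three embedded clips.
print(intervals(65))  # [(0, 10), (30, 40), (60, 65)]
```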


@@ -0,0 +1,28 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0
version: "3"
services:
vdms-vector-db:
image: intellabs/vdms:latest
container_name: vdms-vector-db
ports:
- "55555:55555"
dataprep-vdms:
image: opea/dataprep-vdms:latest
container_name: dataprep-vdms-server
ports:
- "6007:6007"
ipc: host
environment:
no_proxy: ${no_proxy}
http_proxy: ${http_proxy}
https_proxy: ${https_proxy}
VDMS_HOST: ${VDMS_HOST}
VDMS_PORT: ${VDMS_PORT}
INDEX_NAME: ${INDEX_NAME}
restart: unless-stopped
networks:
default:
driver: bridge


@@ -0,0 +1,177 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0
import json
import os
import shutil
import time
import uuid
from pathlib import Path
from typing import Any, Dict, Iterable, List, Optional, Type, Union
from fastapi import File, HTTPException, UploadFile
from fastapi.responses import FileResponse
from tqdm import tqdm
from utils import store_embeddings
from utils.utils import process_all_videos, read_config
from utils.vclip import vCLIP
from comps import opea_microservices, register_microservice
VECTORDB_SERVICE_HOST_IP = os.getenv("VDMS_HOST", "0.0.0.0")
VECTORDB_SERVICE_PORT = os.getenv("VDMS_PORT", 55555)
collection_name = os.getenv("INDEX_NAME", "rag-vdms")
def setup_vclip_model(config, device="cpu"):
model = vCLIP(config)
return model
def read_json(path):
with open(path) as f:
x = json.load(f)
return x
def store_into_vectordb(vs, metadata_file_path, dimensions):
GMetadata = read_json(metadata_file_path)
total_videos = len(GMetadata.keys())
for idx, (video, data) in enumerate(tqdm(GMetadata.items())):
metadata_list = []
ids = []
data["video"] = video
video_name_list = [data["video_path"]]
metadata_list = [data]
if vs.selected_db == "vdms":
vs.video_db.add_videos(
paths=video_name_list,
metadatas=metadata_list,
start_time=[data["timestamp"]],
clip_duration=[data["clip_duration"]],
)
else:
print(f"ERROR: selected_db {vs.selected_db} not supported. Supported:[vdms]")
# clean up tmp_ folders containing frames (jpeg)
for i in os.listdir():
if i.startswith("tmp_"):
print("removing tmp_*")
os.system("rm -r tmp_*")
break
def generate_video_id():
"""Generates a unique identifier for a video file."""
return str(uuid.uuid4())
def generate_embeddings(config, dimensions, vs):
process_all_videos(config)
global_metadata_file_path = os.path.join(config["meta_output_dir"], "metadata.json")
print(f"global metadata file available at {global_metadata_file_path}")
store_into_vectordb(vs, global_metadata_file_path, dimensions)
@register_microservice(name="opea_service@prepare_videodoc_vdms", endpoint="/v1/dataprep", host="0.0.0.0", port=6007)
async def process_videos(files: List[UploadFile] = File(None)):
"""Ingest videos to VDMS."""
config = read_config("./config.yaml")
meanclip_cfg = {
"model_name": config["embeddings"]["vclip_model_name"],
"num_frm": config["embeddings"]["vclip_num_frm"],
}
generate_frames = config["generate_frames"]
path = config["videos"]
meta_output_dir = config["meta_output_dir"]
emb_path = config["embeddings"]["path"]
host = VECTORDB_SERVICE_HOST_IP
port = int(VECTORDB_SERVICE_PORT)
selected_db = config["vector_db"]["choice_of_db"]
vector_dimensions = config["embeddings"]["vector_dimensions"]
print(f"Parsing videos {path}.")
# Saving videos
if files:
video_files = []
for file in files:
if os.path.splitext(file.filename)[1] == ".mp4":
video_files.append(file)
else:
raise HTTPException(
status_code=400, detail=f"File {file.filename} is not an mp4 file. Please upload mp4 files only."
)
for video_file in video_files:
video_id = generate_video_id()
video_name = os.path.splitext(video_file.filename)[0]
video_file_name = f"{video_name}_{video_id}.mp4"
video_dir_name = os.path.splitext(video_file_name)[0]
# Save video file in upload_directory
with open(os.path.join(path, video_file_name), "wb") as f:
shutil.copyfileobj(video_file.file, f)
# Creating DB
print(
"Creating DB with video embedding and metadata support, \nIt may take few minutes to download and load all required models if you are running for first time.",
flush=True,
)
print("Connecting to {} at {}:{}".format(selected_db, host, port), flush=True)
# init meanclip model
model = setup_vclip_model(meanclip_cfg, device="cpu")
vs = store_embeddings.VideoVS(
host, port, selected_db, model, collection_name, embedding_dimensions=vector_dimensions
)
print("done creating DB, sleep 5s", flush=True)
time.sleep(5)
generate_embeddings(config, vector_dimensions, vs)
return {"message": "Videos ingested successfully"}
@register_microservice(
name="opea_service@prepare_videodoc_vdms",
endpoint="/v1/dataprep/get_videos",
host="0.0.0.0",
port=6007,
methods=["GET"],
)
async def rag_get_file_structure():
"""Returns list of names of uploaded videos saved on the server."""
config = read_config("./config.yaml")
if not Path(config["videos"]).exists():
print("No file uploaded, return empty list.")
return []
uploaded_videos = os.listdir(config["videos"])
mp4_files = [file for file in uploaded_videos if file.endswith(".mp4")]
return mp4_files
@register_microservice(
name="opea_service@prepare_videodoc_vdms",
endpoint="/v1/dataprep/get_file/{filename}",
host="0.0.0.0",
port=6007,
methods=["GET"],
)
async def rag_get_file(filename: str):
"""Download the file from remote."""
config = read_config("./config.yaml")
UPLOAD_DIR = config["videos"]
file_path = os.path.join(UPLOAD_DIR, filename)
if os.path.exists(file_path):
return FileResponse(path=file_path, filename=filename)
else:
return {"error": "File not found"}
if __name__ == "__main__":
opea_microservices["opea_service@prepare_videodoc_vdms"].start()


@@ -0,0 +1,37 @@
beautifulsoup4
cairosvg
decord
docarray[full]
docx2txt
easyocr
einops
fastapi
huggingface_hub
langchain
langchain-community
langchain-core
langchain-text-splitters
langsmith
markdown
numpy
opencv-python
opentelemetry-api
opentelemetry-exporter-otlp
opentelemetry-sdk
pandas
Pillow
prometheus-fastapi-instrumentator
pymupdf
pyspark
python-bidi==0.4.2
python-docx
python-pptx
PyYAML
sentence_transformers
shortuuid
tqdm
typing
tzlocal
unstructured[all-docs]==0.11.5
uvicorn
vdms


@@ -0,0 +1,138 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0
import os
import time
from typing import Any, Dict, Iterable, List, Optional
import numpy as np
import torch
import torchvision.transforms as T
from decord import VideoReader, cpu
from langchain.pydantic_v1 import BaseModel, root_validator
from langchain_community.vectorstores import VDMS
from langchain_community.vectorstores.vdms import VDMS_Client
from langchain_core.embeddings import Embeddings
from PIL import Image
toPIL = T.ToPILImage()
# 'similarity', 'similarity_score_threshold' (needs threshold), 'mmr'
class vCLIPEmbeddings(BaseModel, Embeddings):
"""MeanCLIP Embeddings model."""
model: Any
@root_validator(allow_reuse=True)
def validate_environment(cls, values: Dict) -> Dict:
"""Validate that open_clip and torch libraries are installed."""
try:
# Use the provided model if present
if "model" not in values:
raise ValueError("Model must be provided during initialization.")
except ImportError:
raise ImportError("Please ensure CLIP model is loaded")
return values
def embed_documents(self, texts: List[str]) -> List[List[float]]:
model_device = next(self.model.clip.parameters()).device
text_features = self.model.get_text_embeddings(texts)
return text_features.detach().numpy()
def embed_query(self, text: str) -> List[float]:
return self.embed_documents([text])[0]
def embed_video(self, paths: List[str], **kwargs: Any) -> List[List[float]]:
# Open images directly as PIL images
video_features = []
for vid_path in sorted(paths):
# Encode the video to get the embeddings
model_device = next(self.model.parameters()).device
# Preprocess the video for the model
clip_images = self.load_video_for_vclip(
vid_path,
num_frm=self.model.num_frm,
max_img_size=224,
start_time=kwargs.get("start_time", None),
clip_duration=kwargs.get("clip_duration", None),
)
embeddings_tensor = self.model.get_video_embeddings([clip_images])
# Convert tensor to list and add to the video_features list
embeddings_list = embeddings_tensor.tolist()
video_features.append(embeddings_list)
return video_features
def load_video_for_vclip(self, vid_path, num_frm=4, max_img_size=224, **kwargs):
# Load video with VideoReader
import decord
decord.bridge.set_bridge("torch")
vr = VideoReader(vid_path, ctx=cpu(0))
fps = vr.get_avg_fps()
num_frames = len(vr)
start_idx = int(fps * kwargs.get("start_time", [0])[0])
end_idx = start_idx + int(fps * kwargs.get("clip_duration", [num_frames])[0])
frame_idx = np.linspace(start_idx, end_idx, num=num_frm, endpoint=False, dtype=int) # Uniform sampling
clip_images = []
# read images
temp_frms = vr.get_batch(frame_idx.astype(int).tolist())
for idx in range(temp_frms.shape[0]):
im = temp_frms[idx] # H W C
clip_images.append(toPIL(im.permute(2, 0, 1)))
return clip_images
class VideoVS:
def __init__(
self,
host,
port,
selected_db,
video_retriever_model,
collection_name,
embedding_dimensions: int = 512,
chosen_video_search_type="similarity",
):
self.host = host
self.port = port
self.selected_db = selected_db
self.chosen_video_search_type = chosen_video_search_type
self.constraints = None
self.video_collection = collection_name
self.video_embedder = vCLIPEmbeddings(model=video_retriever_model)
self.chosen_video_search_type = chosen_video_search_type
self.embedding_dimensions = embedding_dimensions
# initialize_db
self.get_db_client()
self.init_db()
def get_db_client(self):
if self.selected_db == "vdms":
print("Connecting to VDMS db server . . .")
self.client = VDMS_Client(host=self.host, port=self.port)
def init_db(self):
print("Loading db instances")
if self.selected_db == "vdms":
self.video_db = VDMS(
client=self.client,
embedding=self.video_embedder,
collection_name=self.video_collection,
engine="FaissFlat",
distance_strategy="IP",
embedding_dimensions=self.embedding_dimensions,
)


@@ -0,0 +1,138 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0
import datetime
import json
import os
import random
import time as t
import cv2
import yaml
from tqdm import tqdm
from tzlocal import get_localzone
def read_config(path):
with open(path, "r") as f:
config = yaml.safe_load(f)
return config
def calculate_intervals(video_path, chunk_duration, clip_duration):
cap = cv2.VideoCapture(video_path)
if not cap.isOpened():
print("Error: Could not open video.")
return
fps = cap.get(cv2.CAP_PROP_FPS)
total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
total_seconds = total_frames / fps
intervals = []
chunk_frames = int(chunk_duration * fps)
clip_frames = int(clip_duration * fps)
for start_frame in range(0, total_frames, chunk_frames):
end_frame = min(start_frame + clip_frames, total_frames)
start_time = start_frame / fps
end_time = end_frame / fps
intervals.append((start_frame, end_frame, start_time, end_time))
cap.release()
return intervals
def process_all_videos(config):
path = config["videos"]
meta_output_dir = config["meta_output_dir"]
selected_db = config["vector_db"]["choice_of_db"]
emb_path = config["embeddings"]["path"]
chunk_duration = config["chunk_duration"]
clip_duration = config["clip_duration"]
videos = [file for file in os.listdir(path) if file.endswith(".mp4")] # TODO: increase supported video formats
# print (f'Total {len(videos)} videos will be processed')
metadata = {}
for i, each_video in enumerate(tqdm(videos)):
metadata[each_video] = {}
keyname = each_video
video_path = os.path.join(path, each_video)
date_time = datetime.datetime.now() # FIXME CHECK: is this correct?
# date_time = t.ctime(os.stat(video_path).st_ctime)
# Get the local timezone of the machine
local_timezone = get_localzone()
time_format = "%a %b %d %H:%M:%S %Y"
if not isinstance(date_time, datetime.datetime):
date_time = datetime.datetime.strptime(date_time, time_format)
time = date_time.strftime("%H:%M:%S")
hours, minutes, seconds = map(float, time.split(":"))
date = date_time.strftime("%Y-%m-%d")
year, month, day = map(int, date.split("-"))
if clip_duration is not None and chunk_duration is not None and clip_duration <= chunk_duration:
interval_count = 0
metadata.pop(each_video)
for start_frame, end_frame, start_time, end_time in calculate_intervals(
video_path, chunk_duration, clip_duration
):
keyname = os.path.splitext(os.path.basename(video_path))[0] + f"_interval_{interval_count}"
metadata[keyname] = {"timestamp": start_time}
metadata[keyname].update(
{
"date": date,
"year": year,
"month": month,
"day": day,
"time": time,
"hours": hours,
"minutes": minutes,
"seconds": seconds,
}
)
if selected_db == "vdms":
# Localize the current time to the local timezone of the machine
# Tahani might not need this
current_time_local = date_time.replace(tzinfo=datetime.timezone.utc).astimezone(local_timezone)
# Convert the localized time to ISO 8601 format with timezone offset
iso_date_time = current_time_local.isoformat()
metadata[keyname]["date_time"] = {"_date": str(iso_date_time)}
# Open the video file
cap = cv2.VideoCapture(video_path)
if int(cv2.__version__.split(".")[0]) < 3:
fps = cap.get(cv2.cv.CV_CAP_PROP_FPS)
else:
fps = cap.get(cv2.CAP_PROP_FPS)
total_frames = cap.get(cv2.CAP_PROP_FRAME_COUNT)
# get the duration
metadata[keyname].update(
{
"clip_duration": (min(total_frames, end_frame) - start_frame) / fps,
"fps": fps,
"total_frames": total_frames,
#'embedding_path': os.path.join(emb_path, each_video+".pt"),
"video_path": f"{os.path.join(path,each_video)}",
}
)
cap.release()
interval_count += 1
metadata[keyname].update(
{
"fps": fps,
"total_frames": total_frames,
#'embedding_path': os.path.join(emb_path, each_video+".pt"),
"video_path": f"{os.path.join(path,each_video)}",
}
)
os.makedirs(meta_output_dir, exist_ok=True)
metadata_file = os.path.join(meta_output_dir, "metadata.json")
with open(metadata_file, "w") as f:
json.dump(metadata, f, indent=4)


@@ -0,0 +1,56 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0
import argparse
import json
import os
import sys
import numpy as np
import torch
import torchvision.transforms as T
import yaml
from decord import VideoReader, cpu
from transformers import AutoProcessor, AutoTokenizer, CLIPModel
toPIL = T.ToPILImage()
import torch.nn as nn
from einops import rearrange
class vCLIP(nn.Module):
def __init__(self, cfg):
super().__init__()
self.num_frm = cfg["num_frm"]
self.model_name = cfg["model_name"]
self.clip = CLIPModel.from_pretrained(self.model_name)
self.processor = AutoProcessor.from_pretrained(self.model_name)
self.tokenizer = AutoTokenizer.from_pretrained(self.model_name)
def get_text_embeddings(self, texts):
"""Input is list of texts."""
text_inputs = self.tokenizer(texts, padding=True, return_tensors="pt")
text_features = self.clip.get_text_features(**text_inputs)
return text_features
def get_image_embeddings(self, images):
"""Input is list of images."""
image_inputs = self.processor(images=images, return_tensors="pt")
image_features = self.clip.get_image_features(**image_inputs)
return image_features
def get_video_embeddings(self, frames_batch):
"""Input is list of list of frames in video."""
self.batch_size = len(frames_batch)
vid_embs = []
for frames in frames_batch:
frame_embeddings = self.get_image_embeddings(frames)
frame_embeddings = rearrange(frame_embeddings, "(b n) d -> b n d", b=len(frames_batch))
# Normalize, mean aggregate and return normalized video_embeddings
frame_embeddings = frame_embeddings / frame_embeddings.norm(dim=-1, keepdim=True)
video_embeddings = frame_embeddings.mean(dim=1)
video_embeddings = video_embeddings / video_embeddings.norm(dim=-1, keepdim=True)
vid_embs.append(video_embeddings)
return torch.cat(vid_embs, dim=0)
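For orientation, a minimal usage sketch of this wrapper; the import path matches how `ingest_videos.py` imports it, and the configuration mirrors the defaults in `config.yaml`:
```python
from utils.vclip import vCLIP  # same import used by ingest_videos.py

cfg = {"model_name": "openai/clip-vit-base-patch32", "num_frm": 64}
model = vCLIP(cfg)

# One 512-dimensional embedding per input string for the base CLIP model.
text_emb = model.get_text_embeddings(["a person walking on the beach"])
print(text_emb.shape)  # torch.Size([1, 512])
```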


@@ -0,0 +1,83 @@
#!/bin/bash
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0
set -x
WORKPATH=$(dirname "$PWD")
LOG_PATH="$WORKPATH/tests"
ip_address=$(hostname -I | awk '{print $1}')
function build_docker_images() {
cd $WORKPATH
echo $(pwd)
docker build --no-cache -t opea/dataprep-vdms:comps --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/dataprep/vdms/langchain/Dockerfile .
if [ $? -ne 0 ]; then
echo "opea/dataprep-vdms built fail"
exit 1
else
echo "opea/dataprep-vdms built successful"
fi
docker pull intellabs/vdms:latest
}
function start_service() {
VDMS_PORT=5043
docker run -d --name="test-comps-dataprep-vdms" -p $VDMS_PORT:55555 intellabs/vdms:latest
dataprep_service_port=5013
COLLECTION_NAME="test-comps"
docker run -d --name="test-comps-dataprep-vdms-server" -e COLLECTION_NAME=$COLLECTION_NAME -e no_proxy=$no_proxy -e http_proxy=$http_proxy -e https_proxy=$https_proxy -e VDMS_HOST=$ip_address -e VDMS_PORT=$VDMS_PORT -p ${dataprep_service_port}:6007 --ipc=host opea/dataprep-vdms:comps
sleep 30s
}
function validate_microservice() {
cd $LOG_PATH
echo "Deep learning is a subset of machine learning that utilizes neural networks with multiple layers to analyze various levels of abstract data representations. It enables computers to identify patterns and make decisions with minimal human intervention by learning from large amounts of data." > $LOG_PATH/dataprep_file.txt
dataprep_service_port=5013
URL="http://$ip_address:$dataprep_service_port/v1/dataprep"
HTTP_STATUS=$(http_proxy="" curl -s -o /dev/null -w "%{http_code}" -X POST -F 'files=@./dataprep_file.txt' -H 'Content-Type: multipart/form-data' ${URL} )
if [ "$HTTP_STATUS" -eq 200 ]; then
echo "[ dataprep-upload-file ] HTTP status is 200. Checking content..."
local CONTENT=$(http_proxy="" curl -s -X POST -F 'files=@./dataprep_file.txt' -H 'Content-Type: multipart/form-data' ${URL} | tee ${LOG_PATH}/dataprep-upload-file.log)
if echo "$CONTENT" | grep "Data preparation succeeded"; then
echo "[ dataprep-upload-file ] Content is correct."
else
echo "[ dataprep-upload-file ] Content is not correct. Received content was $CONTENT"
docker logs test-comps-dataprep-vdms-server >> ${LOG_PATH}/dataprep-upload-file.log
docker logs test-comps-dataprep-vdms >> ${LOG_PATH}/dataprep-upload-file_vdms.log
exit 1
fi
else
echo "[ dataprep-upload-file ] HTTP status is not 200. Received status was $HTTP_STATUS"
docker logs test-comps-dataprep-vdms-server >> ${LOG_PATH}/dataprep-upload-file.log
docker logs test-comps-dataprep-vdms >> ${LOG_PATH}/dataprep-upload-file_vdms.log
exit 1
fi
rm ./dataprep_file.txt
}
function stop_docker() {
cid=$(docker ps -aq --filter "name=test-comps-dataprep-vdms*")
if [[ ! -z "$cid" ]]; then docker stop $cid && docker rm $cid && sleep 1s; fi
}
function main() {
stop_docker
build_docker_images
start_service
validate_microservice
stop_docker
echo y | docker system prune
}
main


@@ -0,0 +1,124 @@
#!/bin/bash
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0
set -x
WORKPATH=$(dirname "$PWD")
LOG_PATH="$WORKPATH/tests"
ip_address=$(hostname -I | awk '{print $1}')
function build_docker_images() {
cd $WORKPATH
echo $(pwd)
docker build --no-cache -t opea/dataprep-vdms:comps --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/dataprep/vdms/multimodal_langchain/Dockerfile .
if [ $? -ne 0 ]; then
echo "opea/dataprep-vdms built fail"
exit 1
else
echo "opea/dataprep-vdms built successful"
fi
docker pull intellabs/vdms:latest
}
function start_service() {
VDMS_PORT=5043
docker run -d --name="test-comps-dataprep-vdms" -p $VDMS_PORT:55555 intellabs/vdms:latest
dataprep_service_port=5013
COLLECTION_NAME="test-comps"
docker run -d --name="test-comps-dataprep-vdms-server" -e COLLECTION_NAME=$COLLECTION_NAME -e no_proxy=$no_proxy -e http_proxy=$http_proxy -e https_proxy=$https_proxy -e VDMS_HOST=$ip_address -e VDMS_PORT=$VDMS_PORT -p ${dataprep_service_port}:6007 --ipc=host opea/dataprep-vdms:comps
sleep 30s
}
function validate_microservice() {
cd $LOG_PATH
wget https://github.com/DAMO-NLP-SG/Video-LLaMA/raw/main/examples/silence_girl.mp4 -O silence_girl.mp4
sleep 5
# test /v1/dataprep upload file
URL="http://$ip_address:$dataprep_service_port/v1/dataprep"
response=$(http_proxy="" curl -s -w "\n%{http_code}" -X POST -F 'files=@./silence_girl.mp4' -H 'Content-Type: multipart/form-data' ${URL})
CONTENT=$(echo "$response" | sed -e '$ d')
HTTP_STATUS=$(echo "$response" | tail -n 1)
if [ "$HTTP_STATUS" -eq 200 ]; then
echo "[ dataprep-upload-videos ] HTTP status is 200. Checking content..."
if echo "$CONTENT" | grep "Videos ingested successfully"; then
echo "[ dataprep-upload-videos ] Content is correct."
else
echo "[ dataprep-upload-videos ] Content is not correct. Received content was $CONTENT"
docker logs test-comps-dataprep-vdms-server >> ${LOG_PATH}/dataprep-upload-videos.log
docker logs test-comps-dataprep-vdms >> ${LOG_PATH}/dataprep-upload-videos_vdms.log
exit 1
fi
else
echo "[ dataprep-upload-videos ] HTTP status is not 200. Received status was $HTTP_STATUS"
docker logs test-comps-dataprep-vdms-server >> ${LOG_PATH}/dataprep-get-videos.log
docker logs test-comps-dataprep-vdms >> ${LOG_PATH}/dataprep-upload-videos_vdms.log
exit 1
fi
sleep 1s
rm ./silence_girl.mp4
# test /v1/dataprep/get_videos
URL="http://$ip_address:$dataprep_service_port/v1/dataprep/get_videos"
response=$(http_proxy="" curl -s -w "\n%{http_code}" -X GET ${URL})
CONTENT=$(echo "$response" | sed -e '$ d')
HTTP_STATUS=$(echo "$response" | tail -n 1)
if [ "$HTTP_STATUS" -eq 200 ]; then
echo "[ dataprep-get-videos ] HTTP status is 200. Checking content..."
if echo "$CONTENT" | grep "silence_girl"; then
echo "[ dataprep-get-videos ] Content is correct."
else
echo "[ dataprep-get-videos ] Content is not correct. Received content was $CONTENT"
docker logs test-comps-dataprep-vdms-server >> ${LOG_PATH}/dataprep-get-videos.log
exit 1
fi
else
echo "[ dataprep-get-videos ] HTTP status is not 200. Received status was $HTTP_STATUS"
docker logs test-comps-dataprep-vdms-server >> ${LOG_PATH}/dataprep-get-videos.log
exit 1
fi
# test /v1/dataprep/get_file/{filename}
file_list=$CONTENT
filename=$(echo $file_list | sed 's/^\[//;s/\]$//;s/,.*//;s/"//g')
URL="http://$ip_address:$dataprep_service_port/v1/dataprep/get_file/${filename}"
http_proxy="" wget ${URL}
CONTENT=$(ls)
if echo "$CONTENT" | grep "silence_girl"; then
echo "[ download_file ] Content is correct."
else
echo "[ download_file ] Content is not correct. $CONTENT"
docker logs test-comps-dataprep-vdms-server >> ${LOG_PATH}/download_file.log
exit 1
fi
}
function stop_docker() {
cid=$(docker ps -aq --filter "name=test-comps-dataprep-vdms*")
if [[ ! -z "$cid" ]]; then docker stop $cid && docker rm $cid && sleep 1s; fi
}
function main() {
stop_docker
build_docker_images
start_service
validate_microservice
stop_docker
echo y | docker system prune
}
main