Add dataprep support for CLIP-based models for the VideoRAGQnA example for v1.0 (#621)

* dataprep service

Signed-off-by: srinarayan-srikanthan <srinarayan.srikanthan@intel.com>

* dataprep updates

Signed-off-by: srinarayan-srikanthan <srinarayan.srikanthan@intel.com>

* rearranged dirs

Signed-off-by: srinarayan-srikanthan <srinarayan.srikanthan@intel.com>

* added readme

Signed-off-by: srinarayan-srikanthan <srinarayan.srikanthan@intel.com>

* removed checks

Signed-off-by: srinarayan-srikanthan <srinarayan.srikanthan@intel.com>

* added features

Signed-off-by: srinarayan-srikanthan <srinarayan.srikanthan@intel.com>

* added get method

Signed-off-by: srinarayan-srikanthan <srinarayan.srikanthan@intel.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add dim at init, rm unused

Signed-off-by: BaoHuiling <huiling.bao@intel.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add wait after connect DB

Signed-off-by: BaoHuiling <huiling.bao@intel.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove unused

Signed-off-by: BaoHuiling <huiling.bao@intel.com>

* Update comps/dataprep/vdms/README.md

Co-authored-by: XinyuYe-Intel <xinyu.ye@intel.com>
Signed-off-by: BaoHuiling <huiling.bao@intel.com>

* add test script for mm case

Signed-off-by: BaoHuiling <huiling.bao@intel.com>

* add return value and update readme

Signed-off-by: BaoHuiling <huiling.bao@intel.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* check bug

Signed-off-by: BaoHuiling <huiling.bao@intel.com>

* fix mm-script

Signed-off-by: BaoHuiling <huiling.bao@intel.com>

* add into dataprep workflow

Signed-off-by: BaoHuiling <huiling.bao@intel.com>

* rm whitespace

Signed-off-by: BaoHuiling <huiling.bao@intel.com>

* updated readme and added test script

Signed-off-by: srinarayan-srikanthan <srinarayan.srikanthan@intel.com>

* removed unused file

Signed-off-by: srinarayan-srikanthan <srinarayan.srikanthan@intel.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* move test script

Signed-off-by: BaoHuiling <huiling.bao@intel.com>

* restructured repo

Signed-off-by: srinarayan-srikanthan <srinarayan.srikanthan@intel.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* updates path in test script

Signed-off-by: srinarayan-srikanthan <srinarayan.srikanthan@intel.com>

* add name for build

Signed-off-by: BaoHuiling <huiling.bao@intel.com>

---------

Signed-off-by: srinarayan-srikanthan <srinarayan.srikanthan@intel.com>
Signed-off-by: BaoHuiling <huiling.bao@intel.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: BaoHuiling <huiling.bao@intel.com>
Co-authored-by: XinyuYe-Intel <xinyu.ye@intel.com>
Authored by sri-intel on 2024-09-11 01:19:04 -04:00, committed by GitHub
parent 4165c7d8c1
commit f84d91a3b0
20 changed files with 1475 additions and 0 deletions


@@ -23,3 +23,7 @@ services:
build:
dockerfile: comps/dataprep/pinecone/langchain/Dockerfile
image: ${REGISTRY:-opea}/dataprep-pinecone:${TAG:-latest}
dataprep-vdms:
build:
dockerfile: comps/dataprep/vdms/multimodal_langchain/docker/Dockerfile
image: ${REGISTRY:-opea}/dataprep-vdms:${TAG:-latest}


@@ -0,0 +1,189 @@
# Dataprep Microservice with VDMS
For the dataprep microservice, we currently provide one framework: `Langchain`.
<!-- We also provide `Langchain_ray` which uses ray to parallel the data prep for multi-file performance improvement(observed 5x - 15x speedup by processing 1000 files/links.). -->
The folders are organized consistently, so you can set up the dataprep microservice by following the instructions below.
# 🚀1. Start Microservice with Python (Option 1)
## 1.1 Install Requirements
Install the single-process version (for processing 1-10 files):
```bash
apt-get update
apt-get install -y default-jre tesseract-ocr libtesseract-dev poppler-utils
cd langchain
pip install -r requirements.txt
```
<!-- - option 2: Install multi-process version (for >10 files processing)
```bash
cd langchain_ray; pip install -r requirements_ray.txt
``` -->
## 1.2 Start VDMS Server
Please refer to this [readme](../../vectorstores/langchain/vdms/README.md).
## 1.3 Setup Environment Variables
```bash
export http_proxy=${your_http_proxy}
export https_proxy=${your_http_proxy}
export VDMS_HOST=${host_ip}
export VDMS_PORT=55555
export COLLECTION_NAME=${your_collection_name}
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_PROJECT="opea/gen-ai-comps:dataprep"
export PYTHONPATH=${path_to_comps}
```
## 1.4 Start Document Preparation Microservice for VDMS with Python Script
Start the document preparation microservice for VDMS with the command below (single-process version, for processing 1-10 files):
```bash
python prepare_doc_vdms.py
```
<!-- - option 2: Start multi-process version (for >10 files processing)
```bash
python prepare_doc_redis_on_ray.py
``` -->
# 🚀2. Start Microservice with Docker (Option 2)
## 2.1 Start VDMS Server
Please refer to this [readme](../../vectorstores/langchain/vdms/README.md).
## 2.2 Setup Environment Variables
```bash
export http_proxy=${your_http_proxy}
export https_proxy=${your_http_proxy}
export VDMS_HOST=${host_ip}
export VDMS_PORT=55555
export TEI_ENDPOINT=${your_tei_endpoint}
export COLLECTION_NAME=${your_collection_name}
export SEARCH_ENGINE="FaissFlat"
export DISTANCE_STRATEGY="L2"
export PYTHONPATH=${path_to_comps}
```
## 2.3 Build Docker Image
- Build the Docker image with Langchain (single-process version, for processing 1-10 files):
```bash
cd ../../../
docker build -t opea/dataprep-vdms:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/dataprep/vdms/langchain/Dockerfile .
```
<!-- - option 2: Start multi-process version (for >10 files processing)
```bash
cd ../../../../
docker build -t opea/dataprep-on-ray-vdms:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/dataprep/vdms/langchain_ray/Dockerfile .
``` -->
## 2.4 Run Docker with CLI
Start the single-process version (for processing 1-10 files):
```bash
docker run -d --name="dataprep-vdms-server" -p 6007:6007 --runtime=runc --ipc=host \
-e http_proxy=$http_proxy -e https_proxy=$https_proxy -e TEI_ENDPOINT=$TEI_ENDPOINT \
-e COLLECTION_NAME=$COLLECTION_NAME -e VDMS_HOST=$VDMS_HOST -e VDMS_PORT=$VDMS_PORT \
opea/dataprep-vdms:latest
```
<!-- - option 2: Start multi-process version (for >10 files processing)
```bash
docker run -d --name="dataprep-vdms-server" -p 6007:6007 --runtime=runc --ipc=host \
-e http_proxy=$http_proxy -e https_proxy=$https_proxy \
-e COLLECTION_NAME=$COLLECTION_NAME -e VDMS_HOST=$VDMS_HOST -e VDMS_PORT=$VDMS_PORT \
-e TIMEOUT_SECONDS=600 opea/dataprep-on-ray-vdms:latest
``` -->
# 🚀3. Check Microservice Status
```bash
docker container logs -f dataprep-vdms-server
```
# 🚀4. Consume Microservice
Once the document preparation microservice for VDMS is started, you can use the commands below to invoke the microservice, which converts documents to embeddings and saves them to the database.
Make sure the file path after `files=@` is correct.
- Single file upload
```bash
curl -X POST \
-H "Content-Type: multipart/form-data" \
-F "files=@./file1.txt" \
http://localhost:6007/v1/dataprep
```
You can specify `chunk_size` and `chunk_overlap` with the following command.
```bash
curl -X POST \
-H "Content-Type: multipart/form-data" \
-F "files=@./LLAMA2_page6.pdf" \
-F "chunk_size=1500" \
-F "chunk_overlap=100" \
http://localhost:6007/v1/dataprep
```
- Multiple file upload
```bash
curl -X POST \
-H "Content-Type: multipart/form-data" \
-F "files=@./file1.txt" \
-F "files=@./file2.txt" \
-F "files=@./file3.txt" \
http://localhost:6007/v1/dataprep
```
- Links upload (not currently supported for llama_index)
```bash
curl -X POST \
-F 'link_list=["https://www.ces.tech/"]' \
http://localhost:6007/v1/dataprep
```
or
```python
import requests
import json
proxies = {"http": ""}
url = "http://localhost:6007/v1/dataprep"
urls = [
    "https://towardsdatascience.com/no-gpu-no-party-fine-tune-bert-for-sentiment-analysis-with-vertex-ai-custom-jobs-d8fc410e908b?source=rss----7f60cf5620c9---4"
]
payload = {"link_list": json.dumps(urls)}
try:
    resp = requests.post(url=url, data=payload, proxies=proxies)
    print(resp.text)
    resp.raise_for_status()  # Raise an exception for unsuccessful HTTP status codes
    print("Request successful!")
except requests.exceptions.RequestException as e:
    print("An error occurred:", e)
```
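File uploads can also be scripted. A minimal sketch, assuming the service runs at `http://localhost:6007`; the file name and chunking values are placeholders:
```python
import requests

url = "http://localhost:6007/v1/dataprep"
proxies = {"http": ""}

# Upload one local file and override the default chunking parameters.
with open("./file1.txt", "rb") as f:
    files = [("files", ("file1.txt", f, "text/plain"))]
    data = {"chunk_size": 1500, "chunk_overlap": 100}
    resp = requests.post(url=url, files=files, data=data, proxies=proxies)

resp.raise_for_status()
print(resp.json())  # expected: {"status": 200, "message": "Data preparation succeeded"}
```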


@@ -0,0 +1,39 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0
FROM python:3.11-slim
ENV LANG=C.UTF-8
ARG ARCH="cpu"
RUN apt-get update -y && apt-get install -y --no-install-recommends --fix-missing \
build-essential \
libcairo2-dev \
libgl1-mesa-glx \
libjemalloc-dev \
vim
RUN useradd -m -s /bin/bash user && \
mkdir -p /home/user && \
chown -R user /home/user/
USER user
COPY comps /home/user/comps
RUN pip install --no-cache-dir --upgrade pip setuptools && \
if [ ${ARCH} = "cpu" ]; then pip install torch torchvision --index-url https://download.pytorch.org/whl/cpu; fi && \
pip install --no-cache-dir -r /home/user/comps/dataprep/vdms/langchain/requirements.txt
ENV PYTHONPATH=/home/user
USER root
RUN mkdir -p /home/user/comps/dataprep/vdms/langchain/uploaded_files && chown -R user /home/user/comps/dataprep/vdms/langchain
USER user
WORKDIR /home/user/comps/dataprep/vdms/langchain
ENTRYPOINT ["python", "prepare_doc_vdms.py"]


@@ -0,0 +1,2 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0


@@ -0,0 +1,33 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0
import os
def getEnv(key, default_value=None):
env_value = os.getenv(key, default=default_value)
print(f"{key}: {env_value}")
return env_value
# Embedding model
EMBED_MODEL = getEnv("EMBED_MODEL", "BAAI/bge-base-en-v1.5")
# VDMS configuration
VDMS_HOST = getEnv("VDMS_HOST", "localhost")
VDMS_PORT = int(getEnv("VDMS_PORT", 55555))
COLLECTION_NAME = getEnv("COLLECTION_NAME", "rag-vdms")
SEARCH_ENGINE = getEnv("SEARCH_ENGINE", "FaissFlat")
DISTANCE_STRATEGY = getEnv("DISTANCE_STRATEGY", "L2")
# LLM/Embedding endpoints
TGI_LLM_ENDPOINT = getEnv("TGI_LLM_ENDPOINT", "http://localhost:8080")
TGI_LLM_ENDPOINT_NO_RAG = getEnv("TGI_LLM_ENDPOINT_NO_RAG", "http://localhost:8081")
TEI_EMBEDDING_ENDPOINT = getEnv("TEI_ENDPOINT")
# chunk parameters
CHUNK_SIZE = getEnv("CHUNK_SIZE", 1500)
CHUNK_OVERLAP = getEnv("CHUNK_OVERLAP", 100)
current_file_path = os.path.abspath(__file__)
parent_dir = os.path.dirname(current_file_path)


@@ -0,0 +1,28 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0
version: "3"
services:
vdms-vector-db:
image: intellabs/vdms:latest
container_name: vdms-vector-db
ports:
- "55555:55555"
dataprep-vdms:
image: opea/dataprep-vdms:latest
container_name: dataprep-vdms-server
ports:
- "6007:6007"
ipc: host
environment:
no_proxy: ${no_proxy}
http_proxy: ${http_proxy}
https_proxy: ${https_proxy}
VDMS_HOST: ${VDMS_HOST}
VDMS_PORT: ${VDMS_PORT}
COLLECTION_NAME: ${COLLECTION_NAME}
restart: unless-stopped
networks:
default:
driver: bridge


@@ -0,0 +1,166 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0
import json
import os
from typing import List, Optional, Union
from config import COLLECTION_NAME, DISTANCE_STRATEGY, EMBED_MODEL, SEARCH_ENGINE, VDMS_HOST, VDMS_PORT
from fastapi import Body, File, Form, HTTPException, UploadFile
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceBgeEmbeddings, HuggingFaceEmbeddings, HuggingFaceHubEmbeddings
from langchain_community.vectorstores.vdms import VDMS, VDMS_Client
from langchain_text_splitters import HTMLHeaderTextSplitter
from comps import CustomLogger, DocPath, opea_microservices, register_microservice
from comps.dataprep.utils import (
create_upload_folder,
document_loader,
encode_filename,
get_separators,
get_tables_result,
parse_html,
save_content_to_local_disk,
)
tei_embedding_endpoint = os.getenv("TEI_ENDPOINT")
client = VDMS_Client(VDMS_HOST, int(VDMS_PORT))
logger = CustomLogger("prepare_doc_vdms")
logflag = os.getenv("LOGFLAG", False)
upload_folder = "./uploaded_files/"
def ingest_data_to_vdms(doc_path: DocPath):
"""Ingest document to VDMS."""
path = doc_path.path
print(f"Parsing document {doc_path}.")
if path.endswith(".html"):
headers_to_split_on = [
("h1", "Header 1"),
("h2", "Header 2"),
("h3", "Header 3"),
]
text_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
else:
text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=doc_path.chunk_size, chunk_overlap=doc_path.chunk_overlap, add_start_index=True, separators=get_separators()
)
content = document_loader(path)
chunks = text_splitter.split_text(content)
if doc_path.process_table and path.endswith(".pdf"):
table_chunks = get_tables_result(path, doc_path.table_strategy)
chunks = chunks + table_chunks
print("Done preprocessing. Created ", len(chunks), " chunks of the original pdf")
# Create vectorstore
if tei_embedding_endpoint:
# create embeddings using TEI endpoint service
embedder = HuggingFaceHubEmbeddings(model=tei_embedding_endpoint)
else:
# create embeddings using local embedding model
embedder = HuggingFaceBgeEmbeddings(model_name=EMBED_MODEL)
# Batch size
batch_size = 32
num_chunks = len(chunks)
for i in range(0, num_chunks, batch_size):
batch_chunks = chunks[i : i + batch_size]
batch_texts = batch_chunks
_ = VDMS.from_texts(
client=client,
embedding=embedder,
collection_name=COLLECTION_NAME,
distance_strategy=DISTANCE_STRATEGY,
engine=SEARCH_ENGINE,
texts=batch_texts,
)
print(f"Processed batch {i//batch_size + 1}/{(num_chunks-1)//batch_size + 1}")
@register_microservice(
name="opea_service@prepare_doc_vdms",
endpoint="/v1/dataprep",
host="0.0.0.0",
port=6007,
)
async def ingest_documents(
files: Optional[Union[UploadFile, List[UploadFile]]] = File(None),
link_list: Optional[str] = Form(None),
chunk_size: int = Form(1500),
chunk_overlap: int = Form(100),
process_table: bool = Form(False),
table_strategy: str = Form("fast"),
):
if logflag:
logger.info(f"[ upload ] files:{files}")
logger.info(f"[ upload ] link_list:{link_list}")
if files:
if not isinstance(files, list):
files = [files]
uploaded_files = []
for file in files:
encode_file = encode_filename(file.filename)
doc_id = "file:" + encode_file
if logflag:
logger.info(f"[ upload ] processing file {doc_id}")
save_path = upload_folder + encode_file
await save_content_to_local_disk(save_path, file)
ingest_data_to_vdms(
DocPath(
path=save_path,
chunk_size=chunk_size,
chunk_overlap=chunk_overlap,
process_table=process_table,
table_strategy=table_strategy,
)
)
uploaded_files.append(save_path)
if logflag:
logger.info(f"[ upload ] Successfully saved file {save_path}")
result = {"status": 200, "message": "Data preparation succeeded"}
if logflag:
logger.info(result)
return result
if link_list:
link_list = json.loads(link_list) # Parse JSON string to list
if not isinstance(link_list, list):
raise HTTPException(status_code=400, detail=f"Link_list {link_list} should be a list.")
for link in link_list:
encoded_link = encode_filename(link)
doc_id = "file:" + encoded_link + ".txt"
if logflag:
logger.info(f"[ upload ] processing link {doc_id}")
# check whether the link file already exists
save_path = upload_folder + encoded_link + ".txt"
content = parse_html([link])[0][0]
await save_content_to_local_disk(save_path, content)
ingest_data_to_vdms(
DocPath(
path=save_path,
chunk_size=chunk_size,
chunk_overlap=chunk_overlap,
process_table=process_table,
table_strategy=table_strategy,
)
)
if logflag:
logger.info(f"[ upload ] Successfully saved link list {link_list}")
return {"status": 200, "message": "Data preparation succeeded"}
raise HTTPException(status_code=400, detail="Must provide either a file or a string list.")
if __name__ == "__main__":
create_upload_folder(upload_folder)
opea_microservices["opea_service@prepare_doc_vdms"].start()


@@ -0,0 +1,37 @@
beautifulsoup4
cairosvg
decord
docarray[full]
docx2txt
easyocr
einops
fastapi
huggingface_hub
langchain
langchain-community
langchain-core
langchain-text-splitters
langsmith
markdown
numpy
opencv-python
opentelemetry-api
opentelemetry-exporter-otlp
opentelemetry-sdk
pandas
Pillow
prometheus-fastapi-instrumentator
pymupdf
pyspark
python-bidi==0.4.2
python-docx
python-pptx
PyYAML
sentence_transformers
shortuuid
tqdm
typing
tzlocal
unstructured[all-docs]==0.11.5
uvicorn
vdms


@@ -0,0 +1,40 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0
FROM python:3.11-slim
ENV LANG=C.UTF-8
ARG ARCH="cpu"
RUN apt-get update -y && apt-get install -y --no-install-recommends --fix-missing \
build-essential \
libcairo2-dev \
libgl1-mesa-glx \
libjemalloc-dev \
vim
RUN useradd -m -s /bin/bash user && \
mkdir -p /home/user && \
chown -R user /home/user/
USER user
COPY comps /home/user/comps
RUN pip install --no-cache-dir --upgrade pip setuptools && \
if [ ${ARCH} = "cpu" ]; then pip install torch torchvision --index-url https://download.pytorch.org/whl/cpu; fi && \
pip install --no-cache-dir -r /home/user/comps/dataprep/vdms/multimodal_langchain/requirements.txt
ENV PYTHONPATH=/home/user
USER root
RUN mkdir -p /home/user/comps/dataprep/vdms/multimodal_langchain/uploaded_files && chown -R user /home/user/comps/dataprep/vdms/multimodal_langchain
USER user
WORKDIR /home/user/comps/dataprep/vdms/multimodal_langchain
ENTRYPOINT ["python", "ingest_videos.py"]


@@ -0,0 +1,124 @@
# Multimodal Dataprep Microservice with VDMS
For the dataprep microservice, we currently provide one framework: `Langchain`.
# 🚀1. Start Microservice with Python (Option 1)
## 1.1 Install Requirements
- Install the single-process version (for processing 1-10 files):
```bash
apt-get update
apt-get install -y default-jre tesseract-ocr libtesseract-dev poppler-utils
pip install -r requirements.txt
```
## 1.2 Start VDMS Server
```bash
docker run -d --name="vdms-vector-db" -p 55555:55555 intellabs/vdms:latest
```
## 1.3 Setup Environment Variables
```bash
export http_proxy=${your_http_proxy}
export https_proxy=${your_http_proxy}
export host_ip=$(hostname -I | awk '{print $1}')
export VDMS_HOST=${host_ip}
export VDMS_PORT=55555
export INDEX_NAME="rag-vdms"
export your_hf_api_token="{your_hf_token}"
export PYTHONPATH=${path_to_comps}
```
## 1.4 Start Data Preparation Microservice for VDMS with Python Script
Start the data preparation microservice for VDMS with the command below:
```bash
python ingest_videos.py
```
# 🚀2. Start Microservice with Docker (Option 2)
## 2.1 Start VDMS Server
```bash
docker run -d --name="vdms-vector-db" -p 55555:55555 intellabs/vdms:latest
```
## 2.2 Setup Environment Variables
```bash
export http_proxy=${your_http_proxy}
export https_proxy=${your_http_proxy}
export host_ip=$(hostname -I | awk '{print $1}')
export VDMS_HOST=${host_ip}
export VDMS_PORT=55555
export INDEX_NAME="rag-vdms"
export your_hf_api_token="{your_hf_token}"
```
## 2.3 Build Docker Image
- Build the Docker image:
```bash
cd ../../../
docker build -t opea/dataprep-vdms:latest --network host --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/dataprep/vdms/multimodal_langchain/Dockerfile .
```
## 2.4 Run Docker Compose
```bash
docker compose -f comps/dataprep/vdms/multimodal_langchain/docker-compose-dataprep-vdms.yaml up -d
```
# 🚀3. Check Microservice Status
```bash
docker container logs -f dataprep-vdms-server
```
# 🚀4. Consume Microservice
Once the data preparation microservice for VDMS is started, you can use the commands below to invoke the microservice, which converts videos to embeddings and saves them to the database.
Make sure the file path after `files=@` is correct.
- Single file upload
```bash
curl -X POST \
-H "Content-Type: multipart/form-data" \
-F "files=@./file1.mp4" \
http://localhost:6007/v1/dataprep
```
- Multiple file upload
```bash
curl -X POST \
-H "Content-Type: multipart/form-data" \
-F "files=@./file1.mp4" \
-F "files=@./file2.mp4" \
-F "files=@./file3.mp4" \
http://localhost:6007/v1/dataprep
```
- List of uploaded files
```bash
curl -X GET http://localhost:6007/v1/dataprep/get_videos
```
- Download uploaded files
Use a file name from the list returned by the `get_videos` endpoint:
```bash
curl -X GET http://localhost:6007/v1/dataprep/get_file/${filename}
```
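The video endpoints can also be exercised from Python. A minimal sketch, assuming the service runs at `http://localhost:6007` and that `file1.mp4` is a placeholder video:
```python
import requests

base = "http://localhost:6007/v1/dataprep"

# Upload an mp4 video (the service only accepts .mp4 files).
with open("./file1.mp4", "rb") as f:
    resp = requests.post(base, files=[("files", ("file1.mp4", f, "video/mp4"))])
print(resp.json())  # expected: {"message": "Videos ingested successfully"}

# List the uploaded videos, then download the first one.
videos = requests.get(f"{base}/get_videos").json()
if videos:
    data = requests.get(f"{base}/get_file/{videos[0]}").content
    with open(videos[0], "wb") as out:
        out.write(data)
```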


@@ -0,0 +1,2 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0


@@ -0,0 +1,30 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0
# Path to all videos
videos: uploaded_files/
# Do you want to extract frames of videos (True if not done already, else False)
generate_frames: True
# How do you want to generate feature embeddings?
embeddings:
vclip_model_name: "openai/clip-vit-base-patch32"
vclip_num_frm: 64
vector_dimensions: 512
path: "uploaded_files/embeddings"
# VL-branch config
vl_branch:
cfg_path: embedding/video_llama_config/video_llama_eval_only_vl.yaml
model_type: "llama_v2"
# Path to store metadata files
meta_output_dir: uploaded_files/video_metadata/
# Chunk duration defines how often, in seconds, an embedding is taken along the video
chunk_duration: 30
# Clip duration defines the length, in seconds, of the clip that is embedded at each interval
clip_duration: 10
# e.g. for every <chunk_duration> seconds, the frames from the first <clip_duration> seconds of that interval are embedded (see the sketch after this file)
vector_db:
  choice_of_db: "vdms" # Supported databases: [vdms]
# LLM path
model_path: meta-llama/Llama-2-7b-chat-hf
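To make the chunk/clip relationship concrete, here is a rough sketch of how the intervals fall out of the two settings above (the actual ingestion code works on frame indices; the helper below is only illustrative):
```python
def intervals(total_seconds, chunk_duration=30, clip_duration=10):
    """Return (start, end) pairs in seconds: one clip at the start of every chunk."""
    result, start = [], 0
    while start < total_seconds:
        result.append((start, min(start + clip_duration, total_seconds)))
        start += chunk_duration
    return result

# A 65-second video with the defaults above yields three embedded clips.
print(intervals(65))  # [(0, 10), (30, 40), (60, 65)]
```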


@@ -0,0 +1,28 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0
version: "3"
services:
vdms-vector-db:
image: intellabs/vdms:latest
container_name: vdms-vector-db
ports:
- "55555:55555"
dataprep-vdms:
image: opea/dataprep-vdms:latest
container_name: dataprep-vdms-server
ports:
- "6007:6007"
ipc: host
environment:
no_proxy: ${no_proxy}
http_proxy: ${http_proxy}
https_proxy: ${https_proxy}
VDMS_HOST: ${VDMS_HOST}
VDMS_PORT: ${VDMS_PORT}
INDEX_NAME: ${INDEX_NAME}
restart: unless-stopped
networks:
default:
driver: bridge


@@ -0,0 +1,177 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0
import json
import os
import shutil
import time
import uuid
from pathlib import Path
from typing import Any, Dict, Iterable, List, Optional, Type, Union
from fastapi import File, HTTPException, UploadFile
from fastapi.responses import FileResponse
from tqdm import tqdm
from utils import store_embeddings
from utils.utils import process_all_videos, read_config
from utils.vclip import vCLIP
from comps import opea_microservices, register_microservice
VECTORDB_SERVICE_HOST_IP = os.getenv("VDMS_HOST", "0.0.0.0")
VECTORDB_SERVICE_PORT = os.getenv("VDMS_PORT", 55555)
collection_name = os.getenv("INDEX_NAME", "rag-vdms")
def setup_vclip_model(config, device="cpu"):
model = vCLIP(config)
return model
def read_json(path):
with open(path) as f:
x = json.load(f)
return x
def store_into_vectordb(vs, metadata_file_path, dimensions):
GMetadata = read_json(metadata_file_path)
total_videos = len(GMetadata.keys())
for idx, (video, data) in enumerate(tqdm(GMetadata.items())):
metadata_list = []
ids = []
data["video"] = video
video_name_list = [data["video_path"]]
metadata_list = [data]
if vs.selected_db == "vdms":
vs.video_db.add_videos(
paths=video_name_list,
metadatas=metadata_list,
start_time=[data["timestamp"]],
clip_duration=[data["clip_duration"]],
)
else:
print(f"ERROR: selected_db {vs.selected_db} not supported. Supported:[vdms]")
# clean up tmp_ folders containing frames (jpeg)
for i in os.listdir():
if i.startswith("tmp_"):
print("removing tmp_*")
os.system("rm -r tmp_*")
break
def generate_video_id():
"""Generates a unique identifier for a video file."""
return str(uuid.uuid4())
def generate_embeddings(config, dimensions, vs):
process_all_videos(config)
global_metadata_file_path = os.path.join(config["meta_output_dir"], "metadata.json")
print(f"global metadata file available at {global_metadata_file_path}")
store_into_vectordb(vs, global_metadata_file_path, dimensions)
@register_microservice(name="opea_service@prepare_videodoc_vdms", endpoint="/v1/dataprep", host="0.0.0.0", port=6007)
async def process_videos(files: List[UploadFile] = File(None)):
"""Ingest videos to VDMS."""
config = read_config("./config.yaml")
meanclip_cfg = {
"model_name": config["embeddings"]["vclip_model_name"],
"num_frm": config["embeddings"]["vclip_num_frm"],
}
generate_frames = config["generate_frames"]
path = config["videos"]
meta_output_dir = config["meta_output_dir"]
emb_path = config["embeddings"]["path"]
host = VECTORDB_SERVICE_HOST_IP
port = int(VECTORDB_SERVICE_PORT)
selected_db = config["vector_db"]["choice_of_db"]
vector_dimensions = config["embeddings"]["vector_dimensions"]
print(f"Parsing videos {path}.")
# Saving videos
if files:
video_files = []
for file in files:
if os.path.splitext(file.filename)[1] == ".mp4":
video_files.append(file)
else:
raise HTTPException(
status_code=400, detail=f"File {file.filename} is not an mp4 file. Please upload mp4 files only."
)
for video_file in video_files:
video_id = generate_video_id()
video_name = os.path.splitext(video_file.filename)[0]
video_file_name = f"{video_name}_{video_id}.mp4"
video_dir_name = os.path.splitext(video_file_name)[0]
# Save video file in upload_directory
with open(os.path.join(path, video_file_name), "wb") as f:
shutil.copyfileobj(video_file.file, f)
# Creating DB
print(
"Creating DB with video embedding and metadata support, \nIt may take few minutes to download and load all required models if you are running for first time.",
flush=True,
)
print("Connecting to {} at {}:{}".format(selected_db, host, port), flush=True)
# init meanclip model
model = setup_vclip_model(meanclip_cfg, device="cpu")
vs = store_embeddings.VideoVS(
host, port, selected_db, model, collection_name, embedding_dimensions=vector_dimensions
)
print("done creating DB, sleep 5s", flush=True)
time.sleep(5)
generate_embeddings(config, vector_dimensions, vs)
return {"message": "Videos ingested successfully"}
@register_microservice(
name="opea_service@prepare_videodoc_vdms",
endpoint="/v1/dataprep/get_videos",
host="0.0.0.0",
port=6007,
methods=["GET"],
)
async def rag_get_file_structure():
"""Returns list of names of uploaded videos saved on the server."""
config = read_config("./config.yaml")
if not Path(config["videos"]).exists():
print("No file uploaded, return empty list.")
return []
uploaded_videos = os.listdir(config["videos"])
mp4_files = [file for file in uploaded_videos if file.endswith(".mp4")]
return mp4_files
@register_microservice(
name="opea_service@prepare_videodoc_vdms",
endpoint="/v1/dataprep/get_file/{filename}",
host="0.0.0.0",
port=6007,
methods=["GET"],
)
async def rag_get_file(filename: str):
"""Download the file from remote."""
config = read_config("./config.yaml")
UPLOAD_DIR = config["videos"]
file_path = os.path.join(UPLOAD_DIR, filename)
if os.path.exists(file_path):
return FileResponse(path=file_path, filename=filename)
else:
return {"error": "File not found"}
if __name__ == "__main__":
opea_microservices["opea_service@prepare_videodoc_vdms"].start()


@@ -0,0 +1,37 @@
beautifulsoup4
cairosvg
decord
docarray[full]
docx2txt
easyocr
einops
fastapi
huggingface_hub
langchain
langchain-community
langchain-core
langchain-text-splitters
langsmith
markdown
numpy
opencv-python
opentelemetry-api
opentelemetry-exporter-otlp
opentelemetry-sdk
pandas
Pillow
prometheus-fastapi-instrumentator
pymupdf
pyspark
python-bidi==0.4.2
python-docx
python-pptx
PyYAML
sentence_transformers
shortuuid
tqdm
typing
tzlocal
unstructured[all-docs]==0.11.5
uvicorn
vdms


@@ -0,0 +1,138 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0
import os
import time
from typing import Any, Dict, Iterable, List, Optional
import numpy as np
import torch
import torchvision.transforms as T
from decord import VideoReader, cpu
from langchain.pydantic_v1 import BaseModel, root_validator
from langchain_community.vectorstores import VDMS
from langchain_community.vectorstores.vdms import VDMS_Client
from langchain_core.embeddings import Embeddings
from PIL import Image
toPIL = T.ToPILImage()
# 'similarity', 'similarity_score_threshold' (needs threshold), 'mmr'
class vCLIPEmbeddings(BaseModel, Embeddings):
"""MeanCLIP Embeddings model."""
model: Any
@root_validator(allow_reuse=True)
def validate_environment(cls, values: Dict) -> Dict:
"""Validate that open_clip and torch libraries are installed."""
try:
# Use the provided model if present
if "model" not in values:
raise ValueError("Model must be provided during initialization.")
except ImportError:
raise ImportError("Please ensure CLIP model is loaded")
return values
def embed_documents(self, texts: List[str]) -> List[List[float]]:
model_device = next(self.model.clip.parameters()).device
text_features = self.model.get_text_embeddings(texts)
return text_features.detach().numpy()
def embed_query(self, text: str) -> List[float]:
return self.embed_documents([text])[0]
def embed_video(self, paths: List[str], **kwargs: Any) -> List[List[float]]:
# Open images directly as PIL images
video_features = []
for vid_path in sorted(paths):
# Encode the video to get the embeddings
model_device = next(self.model.parameters()).device
# Preprocess the video for the model
clip_images = self.load_video_for_vclip(
vid_path,
num_frm=self.model.num_frm,
max_img_size=224,
start_time=kwargs.get("start_time", None),
clip_duration=kwargs.get("clip_duration", None),
)
embeddings_tensor = self.model.get_video_embeddings([clip_images])
# Convert tensor to list and add to the video_features list
embeddings_list = embeddings_tensor.tolist()
video_features.append(embeddings_list)
return video_features
def load_video_for_vclip(self, vid_path, num_frm=4, max_img_size=224, **kwargs):
# Load video with VideoReader
import decord
decord.bridge.set_bridge("torch")
vr = VideoReader(vid_path, ctx=cpu(0))
fps = vr.get_avg_fps()
num_frames = len(vr)
start_idx = int(fps * kwargs.get("start_time", [0])[0])
end_idx = start_idx + int(fps * kwargs.get("clip_duration", [num_frames])[0])
frame_idx = np.linspace(start_idx, end_idx, num=num_frm, endpoint=False, dtype=int) # Uniform sampling
clip_images = []
# read images
temp_frms = vr.get_batch(frame_idx.astype(int).tolist())
for idx in range(temp_frms.shape[0]):
im = temp_frms[idx] # H W C
clip_images.append(toPIL(im.permute(2, 0, 1)))
return clip_images
class VideoVS:
def __init__(
self,
host,
port,
selected_db,
video_retriever_model,
collection_name,
embedding_dimensions: int = 512,
chosen_video_search_type="similarity",
):
self.host = host
self.port = port
self.selected_db = selected_db
self.chosen_video_search_type = chosen_video_search_type
self.constraints = None
self.video_collection = collection_name
self.video_embedder = vCLIPEmbeddings(model=video_retriever_model)
self.chosen_video_search_type = chosen_video_search_type
self.embedding_dimensions = embedding_dimensions
# initialize_db
self.get_db_client()
self.init_db()
def get_db_client(self):
if self.selected_db == "vdms":
print("Connecting to VDMS db server . . .")
self.client = VDMS_Client(host=self.host, port=self.port)
def init_db(self):
print("Loading db instances")
if self.selected_db == "vdms":
self.video_db = VDMS(
client=self.client,
embedding=self.video_embedder,
collection_name=self.video_collection,
engine="FaissFlat",
distance_strategy="IP",
embedding_dimensions=self.embedding_dimensions,
)


@@ -0,0 +1,138 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0
import datetime
import json
import os
import random
import time as t
import cv2
import yaml
from tqdm import tqdm
from tzlocal import get_localzone
def read_config(path):
with open(path, "r") as f:
config = yaml.safe_load(f)
return config
def calculate_intervals(video_path, chunk_duration, clip_duration):
cap = cv2.VideoCapture(video_path)
if not cap.isOpened():
print("Error: Could not open video.")
return
fps = cap.get(cv2.CAP_PROP_FPS)
total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
total_seconds = total_frames / fps
intervals = []
chunk_frames = int(chunk_duration * fps)
clip_frames = int(clip_duration * fps)
for start_frame in range(0, total_frames, chunk_frames):
end_frame = min(start_frame + clip_frames, total_frames)
start_time = start_frame / fps
end_time = end_frame / fps
intervals.append((start_frame, end_frame, start_time, end_time))
cap.release()
return intervals
def process_all_videos(config):
path = config["videos"]
meta_output_dir = config["meta_output_dir"]
selected_db = config["vector_db"]["choice_of_db"]
emb_path = config["embeddings"]["path"]
chunk_duration = config["chunk_duration"]
clip_duration = config["clip_duration"]
videos = [file for file in os.listdir(path) if file.endswith(".mp4")] # TODO: increase supported video formats
# print (f'Total {len(videos)} videos will be processed')
metadata = {}
for i, each_video in enumerate(tqdm(videos)):
metadata[each_video] = {}
keyname = each_video
video_path = os.path.join(path, each_video)
date_time = datetime.datetime.now() # FIXME CHECK: is this correct?
# date_time = t.ctime(os.stat(video_path).st_ctime)
# Get the local timezone of the machine
local_timezone = get_localzone()
time_format = "%a %b %d %H:%M:%S %Y"
if not isinstance(date_time, datetime.datetime):
date_time = datetime.datetime.strptime(date_time, time_format)
time = date_time.strftime("%H:%M:%S")
hours, minutes, seconds = map(float, time.split(":"))
date = date_time.strftime("%Y-%m-%d")
year, month, day = map(int, date.split("-"))
if clip_duration is not None and chunk_duration is not None and clip_duration <= chunk_duration:
interval_count = 0
metadata.pop(each_video)
for start_frame, end_frame, start_time, end_time in calculate_intervals(
video_path, chunk_duration, clip_duration
):
keyname = os.path.splitext(os.path.basename(video_path))[0] + f"_interval_{interval_count}"
metadata[keyname] = {"timestamp": start_time}
metadata[keyname].update(
{
"date": date,
"year": year,
"month": month,
"day": day,
"time": time,
"hours": hours,
"minutes": minutes,
"seconds": seconds,
}
)
if selected_db == "vdms":
# Localize the current time to the local timezone of the machine
# Tahani might not need this
current_time_local = date_time.replace(tzinfo=datetime.timezone.utc).astimezone(local_timezone)
# Convert the localized time to ISO 8601 format with timezone offset
iso_date_time = current_time_local.isoformat()
metadata[keyname]["date_time"] = {"_date": str(iso_date_time)}
# Open the video file
cap = cv2.VideoCapture(video_path)
if int(cv2.__version__.split(".")[0]) < 3:
fps = cap.get(cv2.cv.CV_CAP_PROP_FPS)
else:
fps = cap.get(cv2.CAP_PROP_FPS)
total_frames = cap.get(cv2.CAP_PROP_FRAME_COUNT)
# get the duration
metadata[keyname].update(
{
"clip_duration": (min(total_frames, end_frame) - start_frame) / fps,
"fps": fps,
"total_frames": total_frames,
#'embedding_path': os.path.join(emb_path, each_video+".pt"),
"video_path": f"{os.path.join(path,each_video)}",
}
)
cap.release()
interval_count += 1
metadata[keyname].update(
{
"fps": fps,
"total_frames": total_frames,
#'embedding_path': os.path.join(emb_path, each_video+".pt"),
"video_path": f"{os.path.join(path,each_video)}",
}
)
os.makedirs(meta_output_dir, exist_ok=True)
metadata_file = os.path.join(meta_output_dir, "metadata.json")
with open(metadata_file, "w") as f:
json.dump(metadata, f, indent=4)


@@ -0,0 +1,56 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0
import argparse
import json
import os
import sys
import numpy as np
import torch
import torchvision.transforms as T
import yaml
from decord import VideoReader, cpu
from transformers import AutoProcessor, AutoTokenizer, CLIPModel
toPIL = T.ToPILImage()
import torch.nn as nn
from einops import rearrange
class vCLIP(nn.Module):
def __init__(self, cfg):
super().__init__()
self.num_frm = cfg["num_frm"]
self.model_name = cfg["model_name"]
self.clip = CLIPModel.from_pretrained(self.model_name)
self.processor = AutoProcessor.from_pretrained(self.model_name)
self.tokenizer = AutoTokenizer.from_pretrained(self.model_name)
def get_text_embeddings(self, texts):
"""Input is list of texts."""
text_inputs = self.tokenizer(texts, padding=True, return_tensors="pt")
text_features = self.clip.get_text_features(**text_inputs)
return text_features
def get_image_embeddings(self, images):
"""Input is list of images."""
image_inputs = self.processor(images=images, return_tensors="pt")
image_features = self.clip.get_image_features(**image_inputs)
return image_features
def get_video_embeddings(self, frames_batch):
"""Input is list of list of frames in video."""
self.batch_size = len(frames_batch)
vid_embs = []
for frames in frames_batch:
frame_embeddings = self.get_image_embeddings(frames)
frame_embeddings = rearrange(frame_embeddings, "(b n) d -> b n d", b=len(frames_batch))
# Normalize, mean aggregate and return normalized video_embeddings
frame_embeddings = frame_embeddings / frame_embeddings.norm(dim=-1, keepdim=True)
video_embeddings = frame_embeddings.mean(dim=1)
video_embeddings = video_embeddings / video_embeddings.norm(dim=-1, keepdim=True)
vid_embs.append(video_embeddings)
return torch.cat(vid_embs, dim=0)
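For orientation, a minimal usage sketch of this wrapper; the import path matches how `ingest_videos.py` imports it, and the configuration mirrors the defaults in `config.yaml`:
```python
from utils.vclip import vCLIP  # same import used by ingest_videos.py

cfg = {"model_name": "openai/clip-vit-base-patch32", "num_frm": 64}
model = vCLIP(cfg)

# One 512-dimensional embedding per input string for the base CLIP model.
text_emb = model.get_text_embeddings(["a person walking on the beach"])
print(text_emb.shape)  # torch.Size([1, 512])
```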


@@ -0,0 +1,83 @@
#!/bin/bash
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0
set -x
WORKPATH=$(dirname "$PWD")
LOG_PATH="$WORKPATH/tests"
ip_address=$(hostname -I | awk '{print $1}')
function build_docker_images() {
cd $WORKPATH
echo $(pwd)
docker build --no-cache -t opea/dataprep-vdms:comps --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/dataprep/vdms/langchain/Dockerfile .
if [ $? -ne 0 ]; then
echo "opea/dataprep-vdms built fail"
exit 1
else
echo "opea/dataprep-vdms built successful"
fi
docker pull intellabs/vdms:latest
}
function start_service() {
VDMS_PORT=5043
docker run -d --name="test-comps-dataprep-vdms" -p $VDMS_PORT:55555 intellabs/vdms:latest
dataprep_service_port=5013
COLLECTION_NAME="test-comps"
docker run -d --name="test-comps-dataprep-vdms-server" -e COLLECTION_NAME=$COLLECTION_NAME -e no_proxy=$no_proxy -e http_proxy=$http_proxy -e https_proxy=$https_proxy -e VDMS_HOST=$ip_address -e VDMS_PORT=$VDMS_PORT -p ${dataprep_service_port}:6007 --ipc=host opea/dataprep-vdms:comps
sleep 30s
}
function validate_microservice() {
cd $LOG_PATH
echo "Deep learning is a subset of machine learning that utilizes neural networks with multiple layers to analyze various levels of abstract data representations. It enables computers to identify patterns and make decisions with minimal human intervention by learning from large amounts of data." > $LOG_PATH/dataprep_file.txt
dataprep_service_port=5013
URL="http://$ip_address:$dataprep_service_port/v1/dataprep"
HTTP_STATUS=$(http_proxy="" curl -s -o /dev/null -w "%{http_code}" -X POST -F 'files=@./dataprep_file.txt' -H 'Content-Type: multipart/form-data' ${URL} )
if [ "$HTTP_STATUS" -eq 200 ]; then
echo "[ dataprep-upload-file ] HTTP status is 200. Checking content..."
local CONTENT=$(http_proxy="" curl -s -X POST -F 'files=@./dataprep_file.txt' -H 'Content-Type: multipart/form-data' ${URL} | tee ${LOG_PATH}/dataprep-upload-file.log)
if echo "$CONTENT" | grep "Data preparation succeeded"; then
echo "[ dataprep-upload-file ] Content is correct."
else
echo "[ dataprep-upload-file ] Content is not correct. Received content was $CONTENT"
docker logs test-comps-dataprep-vdms-server >> ${LOG_PATH}/dataprep-upload-file.log
docker logs test-comps-dataprep-vdms >> ${LOG_PATH}/dataprep-upload-file_vdms.log
exit 1
fi
else
echo "[ dataprep-upload-file ] HTTP status is not 200. Received status was $HTTP_STATUS"
docker logs test-comps-dataprep-vdms-server >> ${LOG_PATH}/dataprep-upload-file.log
docker logs test-comps-dataprep-vdms >> ${LOG_PATH}/dataprep-upload-file_vdms.log
exit 1
fi
rm ./dataprep_file.txt
}
function stop_docker() {
cid=$(docker ps -aq --filter "name=test-comps-dataprep-vdms*")
if [[ ! -z "$cid" ]]; then docker stop $cid && docker rm $cid && sleep 1s; fi
}
function main() {
stop_docker
build_docker_images
start_service
validate_microservice
stop_docker
echo y | docker system prune
}
main


@@ -0,0 +1,124 @@
#!/bin/bash
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0
set -x
WORKPATH=$(dirname "$PWD")
LOG_PATH="$WORKPATH/tests"
ip_address=$(hostname -I | awk '{print $1}')
function build_docker_images() {
cd $WORKPATH
echo $(pwd)
docker build --no-cache -t opea/dataprep-vdms:comps --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/dataprep/vdms/multimodal_langchain/Dockerfile .
if [ $? -ne 0 ]; then
echo "opea/dataprep-vdms built fail"
exit 1
else
echo "opea/dataprep-vdms built successful"
fi
docker pull intellabs/vdms:latest
}
function start_service() {
VDMS_PORT=5043
docker run -d --name="test-comps-dataprep-vdms" -p $VDMS_PORT:55555 intellabs/vdms:latest
dataprep_service_port=5013
COLLECTION_NAME="test-comps"
docker run -d --name="test-comps-dataprep-vdms-server" -e COLLECTION_NAME=$COLLECTION_NAME -e no_proxy=$no_proxy -e http_proxy=$http_proxy -e https_proxy=$https_proxy -e VDMS_HOST=$ip_address -e VDMS_PORT=$VDMS_PORT -p ${dataprep_service_port}:6007 --ipc=host opea/dataprep-vdms:comps
sleep 30s
}
function validate_microservice() {
cd $LOG_PATH
wget https://github.com/DAMO-NLP-SG/Video-LLaMA/raw/main/examples/silence_girl.mp4 -O silence_girl.mp4
sleep 5
# test /v1/dataprep upload file
URL="http://$ip_address:$dataprep_service_port/v1/dataprep"
response=$(http_proxy="" curl -s -w "\n%{http_code}" -X POST -F 'files=@./silence_girl.mp4' -H 'Content-Type: multipart/form-data' ${URL})
CONTENT=$(echo "$response" | sed -e '$ d')
HTTP_STATUS=$(echo "$response" | tail -n 1)
if [ "$HTTP_STATUS" -eq 200 ]; then
echo "[ dataprep-upload-videos ] HTTP status is 200. Checking content..."
if echo "$CONTENT" | grep "Videos ingested successfully"; then
echo "[ dataprep-upload-videos ] Content is correct."
else
echo "[ dataprep-upload-videos ] Content is not correct. Received content was $CONTENT"
docker logs test-comps-dataprep-vdms-server >> ${LOG_PATH}/dataprep-upload-videos.log
docker logs test-comps-dataprep-vdms >> ${LOG_PATH}/dataprep-upload-videos_vdms.log
exit 1
fi
else
echo "[ dataprep-upload-videos ] HTTP status is not 200. Received status was $HTTP_STATUS"
docker logs test-comps-dataprep-vdms-server >> ${LOG_PATH}/dataprep-get-videos.log
docker logs test-comps-dataprep-vdms >> ${LOG_PATH}/dataprep-upload-videos_vdms.log
exit 1
fi
sleep 1s
rm ./silence_girl.mp4
# test /v1/dataprep/get_videos
URL="http://$ip_address:$dataprep_service_port/v1/dataprep/get_videos"
response=$(http_proxy="" curl -s -w "\n%{http_code}" -X GET ${URL})
CONTENT=$(echo "$response" | sed -e '$ d')
HTTP_STATUS=$(echo "$response" | tail -n 1)
if [ "$HTTP_STATUS" -eq 200 ]; then
echo "[ dataprep-get-videos ] HTTP status is 200. Checking content..."
if echo "$CONTENT" | grep "silence_girl"; then
echo "[ dataprep-get-videos ] Content is correct."
else
echo "[ dataprep-get-videos ] Content is not correct. Received content was $CONTENT"
docker logs test-comps-dataprep-vdms-server >> ${LOG_PATH}/dataprep-get-videos.log
exit 1
fi
else
echo "[ dataprep-get-videos ] HTTP status is not 200. Received status was $HTTP_STATUS"
docker logs test-comps-dataprep-vdms-server >> ${LOG_PATH}/dataprep-get-videos.log
exit 1
fi
# test /v1/dataprep/get_file/{filename}
file_list=$CONTENT
filename=$(echo $file_list | sed 's/^\[//;s/\]$//;s/,.*//;s/"//g')
URL="http://$ip_address:$dataprep_service_port/v1/dataprep/get_file/${filename}"
http_proxy="" wget ${URL}
CONTENT=$(ls)
if echo "$CONTENT" | grep "silence_girl"; then
echo "[ download_file ] Content is correct."
else
echo "[ download_file ] Content is not correct. $CONTENT"
docker logs test-comps-dataprep-vdms-server >> ${LOG_PATH}/download_file.log
exit 1
fi
}
function stop_docker() {
cid=$(docker ps -aq --filter "name=test-comps-dataprep-vdms*")
if [[ ! -z "$cid" ]]; then docker stop $cid && docker rm $cid && sleep 1s; fi
}
function main() {
stop_docker
build_docker_images
start_service
validate_microservice
stop_docker
echo y | docker system prune
}
main