update examples accuracy (#941)

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
This commit is contained in:
lkk
2024-10-14 13:20:50 +08:00
committed by GitHub
parent 441f8cc6ba
commit 088ab98f31
12 changed files with 784 additions and 14 deletions

View File

@@ -1,4 +1,4 @@
# AudioQnA accuracy Evaluation
# AudioQnA Accuracy
AudioQnA is an example that demonstrates the integration of Generative AI (GenAI) models for performing question answering (QnA) on audio scenes, which involves Automatic Speech Recognition (ASR) and Text-to-Speech (TTS). The following is the pipeline for evaluating the ASR accuracy.
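
ASR accuracy is commonly reported as word error rate (WER), i.e. the word-level edit distance between the reference transcript and the ASR output, normalized by the reference length. Purely as an illustration of that idea (not necessarily the metric computed by this example's script), a minimal sketch:

```python
# Hypothetical illustration of word error rate (WER); not necessarily the
# metric implemented by this example's evaluation script.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,        # deletion
                dp[i][j - 1] + 1,        # insertion
                dp[i - 1][j - 1] + cost,  # substitution
            )
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)


print(wer("how are you today", "how are you to day"))  # 0.5 (2 errors / 4 reference words)
```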

View File

@@ -0,0 +1,5 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0
python online_evaluate.py

View File

@@ -0,0 +1,170 @@
# ChatQnA Accuracy
ChatQnA is a Retrieval-Augmented Generation (RAG) pipeline that enhances generative models with external information retrieval.
To evaluate its accuracy, we use two recently published datasets and more than ten popular, comprehensive metrics:
- Dataset
  - [MultiHop](https://arxiv.org/pdf/2401.15391) (English dataset)
  - [CRUD](https://arxiv.org/abs/2401.17043) (Chinese dataset)
- Metrics (measuring the accuracy of both context retrieval and response generation)
  - Evaluation of retrieval/reranking (see the sketch after this list)
    - MRR@10
    - MAP@10
    - Hits@10
    - Hits@4
    - LLM-as-a-Judge
  - Evaluation of the generated response from the end-to-end pipeline
    - BLEU
    - ROUGE(L)
    - LLM-as-a-Judge
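
To make the retrieval metrics concrete, here is a simplified, illustrative sketch (not the project's actual `evals.metrics.retrieval` implementation, which decides relevance in its own way): Hits@k checks whether any gold passage appears in the top-k retrieved chunks, MRR@10 takes the reciprocal rank of the first relevant chunk, and MAP@10 averages precision at each relevant position.

```python
# Simplified sketch of the retrieval metrics above; the project's own
# RetrievalBaseMetric may differ in details such as how relevance between
# a gold fact and a retrieved chunk is decided.
def retrieval_metrics(golden_context, retrieved_docs, k=10):
    topk = retrieved_docs[:k]
    # Here "relevant" simply means a gold fact is contained in the retrieved chunk.
    relevant = [any(gold in doc for gold in golden_context) for doc in topk]

    hits_at_k = 1.0 if any(relevant) else 0.0
    hits_at_4 = 1.0 if any(relevant[:4]) else 0.0

    # MRR@k: reciprocal rank of the first relevant document.
    mrr = next((1.0 / (i + 1) for i, r in enumerate(relevant) if r), 0.0)

    # MAP@k: mean of precision values at each relevant position.
    precisions = [sum(relevant[: i + 1]) / (i + 1) for i, r in enumerate(relevant) if r]
    map_at_k = sum(precisions) / len(precisions) if precisions else 0.0

    return {"Hits@10": hits_at_k, "Hits@4": hits_at_4, "MRR@10": mrr, "MAP@10": map_at_k}
```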
## Prerequisite
### Environment
```bash
git clone https://github.com/opea-project/GenAIEval
cd GenAIEval
pip install -r requirements.txt
pip install -e .
```
## MultiHop (English dataset)
[MultiHop-RAG](https://arxiv.org/pdf/2401.15391): a QA dataset to evaluate retrieval and reasoning across documents with metadata in the RAG pipelines. It contains 2556 queries, with evidence for each query distributed across 2 to 4 documents. The queries also involve document metadata, reflecting complex scenarios commonly found in real-world RAG applications.
### Launch Service of RAG System
Please refer to this [guide](https://github.com/opea-project/GenAIExamples/blob/main/ChatQnA/README.md) to launch the service of `ChatQnA`.
### Launch Service of LLM-as-a-Judge
To set up an LLM as the judge, we can use [tgi-gaudi](https://github.com/huggingface/tgi-gaudi) to launch a service. For example, the following command sets up the [mistralai/Mixtral-8x7B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) model on 2 Gaudi2 cards:
```bash
# please set your llm_port and hf_token
docker run -p {your_llm_port}:80 --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e PT_HPU_ENABLE_LAZY_COLLECTIVES=true -e OMPI_MCA_btl_vader_single_copy_mechanism=none -e HF_TOKEN={your_hf_token} --cap-add=sys_nice --ipc=host ghcr.io/huggingface/tgi-gaudi:2.0.1 --model-id mistralai/Mixtral-8x7B-Instruct-v0.1 --max-input-tokens 2048 --max-total-tokens 4096 --sharded true --num-shard 2
# for better performance, set `PREFILL_BATCH_BUCKET_SIZE`, `BATCH_BUCKET_SIZE`, `max-batch-total-tokens`, `max-batch-prefill-tokens`
docker run -p {your_llm_port}:80 --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e PT_HPU_ENABLE_LAZY_COLLECTIVES=true -e OMPI_MCA_btl_vader_single_copy_mechanism=none -e HF_TOKEN={your_hf_token} -e PREFILL_BATCH_BUCKET_SIZE=1 -e BATCH_BUCKET_SIZE=8 --cap-add=sys_nice --ipc=host ghcr.io/huggingface/tgi-gaudi:2.0.5 --model-id mistralai/Mixtral-8x7B-Instruct-v0.1 --max-input-tokens 2048 --max-total-tokens 4096 --sharded true --num-shard 2 --max-batch-total-tokens 65536 --max-batch-prefill-tokens 2048
```
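
Before running the evaluation, it is worth confirming that the judge endpoint responds. A minimal, hedged check against TGI's `/generate` route (the host, port, and prompt below are placeholders):

```python
# Minimal sanity check of the TGI /generate endpoint used as the LLM judge.
# Replace the host/port with your own {your_llm_port} value.
import requests

llm_endpoint = "http://localhost:8080/generate"  # placeholder URL
payload = {
    "inputs": "What is Deep Learning?",
    "parameters": {"max_new_tokens": 64, "temperature": 0.1},
}
resp = requests.post(llm_endpoint, json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["generated_text"])
```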
### Prepare Dataset
We use the evaluation dataset from the [MultiHop-RAG](https://github.com/yixuantt/MultiHop-RAG) repo. Use the command below to prepare the dataset.
```bash
git clone https://github.com/yixuantt/MultiHop-RAG.git
```
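
For orientation, `corpus.json` holds the documents to ingest and `MultiHopRAG.json` holds the queries with gold answers and evidence. A quick, hedged look at the query file (field names follow those read by `eval_multihop.py` later in this commit):

```python
# Quick look at the MultiHop-RAG evaluation file; field names mirror those
# accessed by eval_multihop.py (query, answer, question_type, evidence_list).
import json

with open("MultiHop-RAG/dataset/MultiHopRAG.json") as f:
    queries = json.load(f)

print(len(queries), "queries")
sample = queries[0]
print(sample["query"])
print(sample["answer"], sample["question_type"], len(sample["evidence_list"]))
```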
### Evaluation
Use the command below to run the evaluation. Note that for the first run, the `--ingest_docs` argument should be added to ingest the documents into the vector database; for subsequent runs it should be omitted. Set `--retrieval_metrics` to get retrieval-related metrics (MRR@10/MAP@10/Hits@10/Hits@4). Set `--ragas_metrics` and `--llm_endpoint` to get end-to-end RAG pipeline metrics (faithfulness/answer_relevancy/...), which are judged by LLMs; a sketch of how these metrics are driven is given at the end of this section. `--limits` defaults to 100, meaning only 100 examples are evaluated by LLM-as-a-Judge, since that step is very time-consuming.
If you are using Docker Compose to deploy the `ChatQnA` system, you can simply run the evaluation as follows:
```bash
python eval_multihop.py --docs_path MultiHop-RAG/dataset/corpus.json --dataset_path MultiHop-RAG/dataset/MultiHopRAG.json --ingest_docs --retrieval_metrics --ragas_metrics --llm_endpoint http://{llm_as_judge_ip}:{llm_as_judge_port}/generate
```
If you are using Kubernetes manifests/Helm to deploy the `ChatQnA` system, you must specify additional arguments as follows:
```bash
python eval_multihop.py --docs_path MultiHop-RAG/dataset/corpus.json --dataset_path MultiHop-RAG/dataset/MultiHopRAG.json --ingest_docs --retrieval_metrics --ragas_metrics --llm_endpoint http://{llm_as_judge_ip}:{llm_as_judge_port}/generate --database_endpoint http://{your_dataprep_ip}:{your_dataprep_port}/v1/dataprep --embedding_endpoint http://{your_embedding_ip}:{your_embedding_port}/v1/embeddings --tei_embedding_endpoint http://{your_tei_embedding_ip}:{your_tei_embedding_port} --retrieval_endpoint http://{your_retrieval_ip}:{your_retrieval_port}/v1/retrieval --service_url http://{your_chatqna_ip}:{your_chatqna_port}/v1/chatqna
```
The default values for arguments are:
|Argument|Default value|
|--------|-------------|
|service_url|http://localhost:8888/v1/chatqna|
|database_endpoint|http://localhost:6007/v1/dataprep|
|embedding_endpoint|http://localhost:6000/v1/embeddings|
|tei_embedding_endpoint|http://localhost:8090|
|retrieval_endpoint|http://localhost:7000/v1/retrieval|
|reranking_endpoint|http://localhost:8000/v1/reranking|
|output_dir|./output|
|temperature|0.1|
|max_new_tokens|1280|
|chunk_size|256|
|chunk_overlap|100|
|search_type|similarity|
|retrival_k|10|
|fetch_k|20|
|lambda_mult|0.5|
|dataset_path|None|
|docs_path|None|
|limits|100|
You can check the details of all arguments with the command below:
```bash
python eval_multihop.py --help
```
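
Under the hood, the end-to-end metrics are produced by `RagasMetric` from `evals.metrics.ragas`, which takes batched questions, generated answers, ground truths, and retrieved contexts. A condensed sketch, adapted from the `get_ragas_metrics` methods in the evaluation scripts later in this commit (the endpoint URLs and sample strings are placeholders):

```python
# Condensed from the get_ragas_metrics() methods in eval_multihop.py /
# eval_crud.py shown later in this commit; endpoints and strings are placeholders.
from langchain_huggingface import HuggingFaceEndpointEmbeddings
from evals.metrics.ragas import RagasMetric

embeddings = HuggingFaceEndpointEmbeddings(model="http://localhost:8090")  # TEI embedding endpoint
metric = RagasMetric(
    threshold=0.5,
    model="http://localhost:8085",  # LLM-as-a-Judge endpoint
    embeddings=embeddings,
    metrics=["faithfulness", "answer_relevancy"],
)
ragas_inputs = {
    "question": ["placeholder question"],
    "answer": ["placeholder generated answer"],
    "ground_truth": ["placeholder reference answer"],
    "contexts": [["placeholder retrieved context"]],
}
print(metric.measure(ragas_inputs))
```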
## CRUD (Chinese dataset)
[CRUD-RAG](https://arxiv.org/abs/2401.17043) is a Chinese benchmark for RAG (Retrieval-Augmented Generation) systems. This example utilizes CRUD-RAG to evaluate the RAG system.
### Prepare Dataset
We use the evaluation dataset from the [CRUD-RAG](https://github.com/IAAR-Shanghai/CRUD_RAG) repo. Use the commands below to prepare the dataset.
```bash
git clone https://github.com/IAAR-Shanghai/CRUD_RAG
mkdir data/
cp CRUD_RAG/data/crud_split/split_merged.json data/
cp -r CRUD_RAG/data/80000_docs/ data/
python process_crud_dataset.py
```
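
The prepared `split_merged.json` file groups examples by task; `eval_crud.py` (later in this commit) reads the `questanswer_1doc` split for question answering and `event_summary` for summarization. A quick, hedged peek at the structure:

```python
# Peek at the prepared CRUD evaluation data; split names mirror those read
# by eval_crud.py ("questanswer_1doc" for QA, "event_summary" for summarization).
import json

with open("data/split_merged.json") as f:
    all_datasets = json.load(f)

print(list(all_datasets.keys()))
qa_split = all_datasets["questanswer_1doc"]
print(type(qa_split), len(qa_split))
# Per eval_crud.py, each QA example carries "questions", "answers" and "news1" fields.
```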
### Launch Service of RAG System
Please refer to this [guide](https://github.com/opea-project/GenAIExamples/blob/main/ChatQnA/README.md) to launch the `ChatQnA` service. For the Chinese dataset, you should replace the English embedding and LLM models with Chinese ones, for example `EMBEDDING_MODEL_ID="BAAI/bge-base-zh-v1.5"` and `LLM_MODEL_ID=Qwen/Qwen2-7B-Instruct`.
### Evaluation
Use the command below to run the evaluation. Note that for the first run, the `--ingest_docs` argument should be added to ingest the documents into the vector database; for subsequent runs it should be omitted.
If you are using Docker Compose to deploy the `ChatQnA` system, you can simply run the evaluation as follows:
```bash
python eval_crud.py --dataset_path ./data/split_merged.json --docs_path ./data/80000_docs --ingest_docs
# if you want to get ragas metrics
python eval_crud.py --dataset_path ./data/split_merged.json --docs_path ./data/80000_docs --contain_original_data --llm_endpoint "http://{llm_as_judge_ip}:{llm_as_judge_port}" --ragas_metrics
```
If you are using Kubernetes manifests/Helm to deploy the `ChatQnA` system, you must specify additional arguments as follows:
```bash
python eval_crud.py --dataset_path ./data/split_merged.json --docs_path ./data/80000_docs --ingest_docs --database_endpoint http://{your_dataprep_ip}:{your_dataprep_port}/v1/dataprep --embedding_endpoint http://{your_embedding_ip}:{your_embedding_port}/v1/embeddings --retrieval_endpoint http://{your_retrieval_ip}:{your_retrieval_port}/v1/retrieval --service_url http://{your_chatqna_ip}:{your_chatqna_port}/v1/chatqna
```
The default values for arguments are:
|Argument|Default value|
|--------|-------------|
|service_url|http://localhost:8888/v1/chatqna|
|database_endpoint|http://localhost:6007/v1/dataprep|
|embedding_endpoint|http://localhost:6000/v1/embeddings|
|retrieval_endpoint|http://localhost:7000/v1/retrieval|
|reranking_endpoint|http://localhost:8000/v1/reranking|
|output_dir|./output|
|temperature|0.1|
|max_new_tokens|1280|
|chunk_size|256|
|chunk_overlap|100|
|dataset_path|./data/split_merged.json|
|docs_path|./data/80000_docs|
|tasks|["question_answering"]|
You can check the details of all arguments with the command below:
```bash
python eval_crud.py --help
```
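
Results are written under `--output_dir` (default `./output`), one JSON file per task. A hedged sketch for inspecting the saved question-answering results (the exact file schema is defined by the evaluator base class; the keys below mirror the in-memory result dict used in `eval_crud.py`):

```python
# Inspect the saved CRUD evaluation output; eval_crud.py writes one
# {task}.json file per task under --output_dir. The keys below are an
# assumption based on how results are accessed in eval_crud.py.
import json

with open("output/question_answering.json") as f:
    results = json.load(f)

print(results["overall"])       # aggregate scores printed during the run
print(len(results["results"]))  # per-example records (schema set by the evaluator)
```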
## Acknowledgements
This example is mostly adapted from the [MultiHop-RAG](https://github.com/yixuantt/MultiHop-RAG) and [CRUD-RAG](https://github.com/IAAR-Shanghai/CRUD_RAG) repos. We thank the authors for their great work!

View File

@@ -0,0 +1,210 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0
import argparse
import json
import os

from evals.evaluation.rag_eval import Evaluator
from evals.evaluation.rag_eval.template import CRUDTemplate
from evals.metrics.ragas import RagasMetric
from tqdm import tqdm


class CRUD_Evaluator(Evaluator):
    def get_ground_truth_text(self, data: dict):
        if self.task == "summarization":
            ground_truth_text = data["summary"]
        elif self.task == "question_answering":
            ground_truth_text = data["answers"]
        elif self.task == "continuation":
            ground_truth_text = data["continuing"]
        elif self.task == "hallucinated_modified":
            ground_truth_text = data["hallucinatedMod"]
        else:
            raise NotImplementedError(
                f"Unknown task {self.task}, only support "
                "summarization, question_answering, continuation and hallucinated_modified."
            )
        return ground_truth_text

    def get_query(self, data: dict):
        if self.task == "summarization":
            query = data["text"]
        elif self.task == "question_answering":
            query = data["questions"]
        elif self.task == "continuation":
            query = data["beginning"]
        elif self.task == "hallucinated_modified":
            query = data["newsBeginning"]
        else:
            raise NotImplementedError(
                f"Unknown task {self.task}, only support "
                "summarization, question_answering, continuation and hallucinated_modified."
            )
        return query

    def get_document(self, data: dict):
        if self.task == "summarization":
            document = data["text"]
        elif self.task == "question_answering":
            document = data["news1"]
        elif self.task == "continuation":
            document = data["beginning"]
        elif self.task == "hallucinated_modified":
            document = data["newsBeginning"]
        else:
            raise NotImplementedError(
                f"Unknown task {self.task}, only support "
                "summarization, question_answering, continuation and hallucinated_modified."
            )
        return document

    def get_template(self):
        if self.task == "summarization":
            template = CRUDTemplate.get_summarization_template()
        elif self.task == "question_answering":
            template = CRUDTemplate.get_question_answering_template()
        elif self.task == "continuation":
            template = CRUDTemplate.get_continuation_template()
        else:
            raise NotImplementedError(
                f"Unknown task {self.task}, only support "
                "summarization, question_answering, continuation and hallucinated_modified."
            )
        return template

    def post_process(self, result):
        return result.split("<response>")[-1].split("</response>")[0].strip()

    def get_ragas_metrics(self, results, arguments):
        from langchain_huggingface import HuggingFaceEndpointEmbeddings

        embeddings = HuggingFaceEndpointEmbeddings(model=arguments.tei_embedding_endpoint)
        metric = RagasMetric(
            threshold=0.5,
            model=arguments.llm_endpoint,
            embeddings=embeddings,
            metrics=["faithfulness", "answer_relevancy"],
        )
        all_answer_relevancy = 0
        all_faithfulness = 0
        ragas_inputs = {
            "question": [],
            "answer": [],
            "ground_truth": [],
            "contexts": [],
        }
        valid_results = self.remove_invalid(results["results"])
        for data in tqdm(valid_results):
            data = data["original_data"]
            query = self.get_query(data)
            generated_text = data["generated_text"]
            ground_truth = data["ground_truth_text"]
            retrieved_documents = data["retrieved_documents"]
            ragas_inputs["question"].append(query)
            ragas_inputs["answer"].append(generated_text)
            ragas_inputs["ground_truth"].append(ground_truth)
            ragas_inputs["contexts"].append(retrieved_documents[:3])
        ragas_metrics = metric.measure(ragas_inputs)
        return ragas_metrics


def args_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--service_url", type=str, default="http://localhost:8888/v1/chatqna", help="Service URL address."
    )
    parser.add_argument("--output_dir", type=str, default="./output", help="Directory to save evaluation results.")
    parser.add_argument(
        "--temperature", type=float, default=0.1, help="Controls the randomness of the model's text generation"
    )
    parser.add_argument(
        "--max_new_tokens", type=int, default=1280, help="Maximum number of new tokens to be generated by the model"
    )
    parser.add_argument(
        "--chunk_size", type=int, default=256, help="the maximum number of characters that a chunk can contain"
    )
    parser.add_argument(
        "--chunk_overlap",
        type=int,
        default=100,
        help="the number of characters that should overlap between two adjacent chunks",
    )
    parser.add_argument("--dataset_path", default="../data/split_merged.json", help="Path to the dataset")
    parser.add_argument("--docs_path", default="../data/80000_docs", help="Path to the retrieval documents")

    # Retriever related options
    parser.add_argument("--tasks", default=["question_answering"], nargs="+", help="Task to perform")
    parser.add_argument("--ingest_docs", action="store_true", help="Whether to ingest documents to vector database")
    parser.add_argument(
        "--database_endpoint", type=str, default="http://localhost:6007/v1/dataprep", help="Service URL address."
    )
    parser.add_argument(
        "--embedding_endpoint", type=str, default="http://localhost:6000/v1/embeddings", help="Service URL address."
    )
    parser.add_argument(
        "--retrieval_endpoint", type=str, default="http://localhost:7000/v1/retrieval", help="Service URL address."
    )
    parser.add_argument(
        "--tei_embedding_endpoint",
        type=str,
        default="http://localhost:8090",
        help="Service URL address of tei embedding.",
    )
    parser.add_argument("--ragas_metrics", action="store_true", help="Whether to compute ragas metrics.")
    parser.add_argument("--llm_endpoint", type=str, default=None, help="Service URL address.")
    parser.add_argument(
        "--show_progress_bar", action="store", default=True, type=bool, help="Whether to show a progress bar"
    )
    parser.add_argument("--contain_original_data", action="store_true", help="Whether to contain original data")
    args = parser.parse_args()
    return args


def main():
    args = args_parser()
    if os.path.isfile(args.dataset_path):
        with open(args.dataset_path) as f:
            all_datasets = json.load(f)
    else:
        raise FileNotFoundError(f"Evaluation dataset file {args.dataset_path} not exist.")
    os.makedirs(args.output_dir, exist_ok=True)
    for task in args.tasks:
        if task == "question_answering":
            dataset = all_datasets["questanswer_1doc"]
        elif task == "summarization":
            dataset = all_datasets["event_summary"]
        else:
            raise NotImplementedError(
                f"Unknown task {task}, only support "
                "summarization, question_answering, continuation and hallucinated_modified."
            )
        output_save_path = os.path.join(args.output_dir, f"{task}.json")
        evaluator = CRUD_Evaluator(dataset=dataset, output_path=output_save_path, task=task)
        if args.ingest_docs:
            CRUD_Evaluator.ingest_docs(args.docs_path, args.database_endpoint, args.chunk_size, args.chunk_overlap)
        results = evaluator.evaluate(
            args, show_progress_bar=args.show_progress_bar, contain_original_data=args.contain_original_data
        )
        print(results["overall"])
        if args.ragas_metrics:
            ragas_metrics = evaluator.get_ragas_metrics(results, args)
            print(ragas_metrics)
        print(f"Evaluation results of task {task} saved to {output_save_path}.")


if __name__ == "__main__":
    main()

View File

@@ -0,0 +1,279 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0
import argparse
import json
import os

import requests
from evals.evaluation.rag_eval import Evaluator
from evals.metrics.ragas import RagasMetric
from evals.metrics.retrieval import RetrievalBaseMetric
from tqdm import tqdm


class MultiHop_Evaluator(Evaluator):
    def get_ground_truth_text(self, data: dict):
        return data["answer"]

    def get_query(self, data: dict):
        return data["query"]

    def get_template(self):
        return None

    def get_reranked_documents(self, query, docs, arguments):
        data = {
            "initial_query": query,
            "retrieved_docs": [{"text": doc} for doc in docs],
            "top_n": 10,
        }
        headers = {"Content-Type": "application/json"}
        response = requests.post(arguments.reranking_endpoint, data=json.dumps(data), headers=headers)
        if response.ok:
            reranked_documents = response.json()["documents"]
            return reranked_documents
        else:
            print(f"Request for retrieval failed due to {response.text}.")
            return []

    def get_retrieved_documents(self, query, arguments):
        data = {"text": query}
        headers = {"Content-Type": "application/json"}
        response = requests.post(arguments.embedding_endpoint, data=json.dumps(data), headers=headers)
        if response.ok:
            embedding = response.json()["embedding"]
        else:
            print(f"Request for embedding failed due to {response.text}.")
            return []
        data = {
            "text": query,
            "embedding": embedding,
            "search_type": arguments.search_type,
            "k": arguments.retrival_k,
            "fetch_k": arguments.fetch_k,
            "lambda_mult": arguments.lambda_mult,
        }
        response = requests.post(arguments.retrieval_endpoint, data=json.dumps(data), headers=headers)
        if response.ok:
            retrieved_documents = response.json()["retrieved_docs"]
            return [doc["text"] for doc in retrieved_documents]
        else:
            print(f"Request for retrieval failed due to {response.text}.")
            return []

    def get_retrieval_metrics(self, all_queries, arguments):
        print("start to retrieve...")
        metric = RetrievalBaseMetric()
        hits_at_10 = 0
        hits_at_4 = 0
        map_at_10 = 0
        mrr_at_10 = 0
        total = 0
        for data in tqdm(all_queries):
            if data["question_type"] == "null_query":
                continue
            query = data["query"]
            retrieved_documents = self.get_retrieved_documents(query, arguments)
            if arguments.rerank:
                retrieved_documents = self.get_reranked_documents(query, retrieved_documents, arguments)
            golden_context = [each["fact"] for each in data["evidence_list"]]
            test_case = {
                "input": query,
                "golden_context": golden_context,
                "retrieval_context": retrieved_documents,
            }
            results = metric.measure(test_case)
            hits_at_10 += results["Hits@10"]
            hits_at_4 += results["Hits@4"]
            map_at_10 += results["MAP@10"]
            mrr_at_10 += results["MRR@10"]
            total += 1

        # Calculate average metrics over all queries
        hits_at_10 = hits_at_10 / total
        hits_at_4 = hits_at_4 / total
        map_at_10 = map_at_10 / total
        mrr_at_10 = mrr_at_10 / total
        return {
            "Hits@10": hits_at_10,
            "Hits@4": hits_at_4,
            "MAP@10": map_at_10,
            "MRR@10": mrr_at_10,
        }

    def evaluate(self, all_queries, arguments):
        results = []
        accuracy = 0
        index = 0
        for data in tqdm(all_queries):
            if data["question_type"] == "null_query":
                continue
            generated_text = self.send_request(data, arguments)
            data["generated_text"] = generated_text
            # same method with paper: https://github.com/yixuantt/MultiHop-RAG/issues/8
            if data["answer"] in generated_text:
                accuracy += 1
            result = {"id": index, **self.scoring(data)}
            results.append(result)
            index += 1
        valid_results = self.remove_invalid(results)
        try:
            overall = self.compute_overall(valid_results) if len(valid_results) > 0 else {}
        except Exception as e:
            print(repr(e))
            overall = dict()
        overall.update({"accuracy": accuracy / len(results)})
        return overall

    def get_ragas_metrics(self, all_queries, arguments):
        from langchain_huggingface import HuggingFaceEndpointEmbeddings

        embeddings = HuggingFaceEndpointEmbeddings(model=arguments.tei_embedding_endpoint)
        metric = RagasMetric(threshold=0.5, model=arguments.llm_endpoint, embeddings=embeddings)
        all_answer_relevancy = 0
        all_faithfulness = 0
        ragas_inputs = {
            "question": [],
            "answer": [],
            "ground_truth": [],
            "contexts": [],
        }
        for data in tqdm(all_queries):
            if data["question_type"] == "null_query":
                continue
            retrieved_documents = self.get_retrieved_documents(data["query"], arguments)
            generated_text = self.send_request(data, arguments)
            data["generated_text"] = generated_text
            ragas_inputs["question"].append(data["query"])
            ragas_inputs["answer"].append(generated_text)
            ragas_inputs["ground_truth"].append(data["answer"])
            ragas_inputs["contexts"].append(retrieved_documents[:3])
            if len(ragas_inputs["question"]) >= arguments.limits:
                break
        ragas_metrics = metric.measure(ragas_inputs)
        return ragas_metrics


def args_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--service_url", type=str, default="http://localhost:8888/v1/chatqna", help="Service URL address."
    )
    parser.add_argument("--output_dir", type=str, default="./output", help="Directory to save evaluation results.")
    parser.add_argument(
        "--temperature", type=float, default=0.1, help="Controls the randomness of the model's text generation"
    )
    parser.add_argument(
        "--max_new_tokens", type=int, default=1280, help="Maximum number of new tokens to be generated by the model"
    )
    parser.add_argument(
        "--chunk_size", type=int, default=256, help="the maximum number of characters that a chunk can contain"
    )
    parser.add_argument(
        "--chunk_overlap",
        type=int,
        default=100,
        help="the number of characters that should overlap between two adjacent chunks",
    )
    parser.add_argument("--search_type", type=str, default="similarity", help="similarity type")
    parser.add_argument("--retrival_k", type=int, default=10, help="Number of Documents to return.")
    parser.add_argument(
        "--fetch_k", type=int, default=20, help="Number of Documents to fetch to pass to MMR algorithm."
    )
    parser.add_argument(
        "--lambda_mult",
        type=float,
        default=0.5,
        help="Number between 0 and 1 that determines the degree of diversity among the results with 0 corresponding to maximum diversity and 1 to minimum diversity. Defaults to 0.5.",
    )
    parser.add_argument("--dataset_path", default=None, help="Path to the dataset")
    parser.add_argument("--docs_path", default=None, help="Path to the retrieval documents")

    # Retriever related options
    parser.add_argument("--ingest_docs", action="store_true", help="Whether to ingest documents to vector database")
    parser.add_argument("--retrieval_metrics", action="store_true", help="Whether to compute retrieval metrics.")
    parser.add_argument("--ragas_metrics", action="store_true", help="Whether to compute ragas metrics.")
    parser.add_argument("--limits", type=int, default=100, help="Number of examples to be evaluated by llm-as-judge")
    parser.add_argument(
        "--database_endpoint", type=str, default="http://localhost:6007/v1/dataprep", help="Service URL address."
    )
    parser.add_argument(
        "--embedding_endpoint", type=str, default="http://localhost:6000/v1/embeddings", help="Service URL address."
    )
    parser.add_argument(
        "--tei_embedding_endpoint",
        type=str,
        default="http://localhost:8090",
        help="Service URL address of tei embedding.",
    )
    parser.add_argument(
        "--retrieval_endpoint", type=str, default="http://localhost:7000/v1/retrieval", help="Service URL address."
    )
    parser.add_argument("--rerank", action="store_true", help="Whether to use rerank microservice.")
    parser.add_argument(
        "--reranking_endpoint", type=str, default="http://localhost:8000/v1/reranking", help="Service URL address."
    )
    parser.add_argument("--llm_endpoint", type=str, default=None, help="Service URL address.")
    parser.add_argument(
        "--show_progress_bar", action="store", default=True, type=bool, help="Whether to show a progress bar"
    )
    parser.add_argument("--contain_original_data", action="store_true", help="Whether to contain original data")
    args = parser.parse_args()
    return args


def main():
    args = args_parser()
    evaluator = MultiHop_Evaluator()

    with open(args.docs_path, "r") as file:
        doc_data = json.load(file)
    documents = []
    for doc in doc_data:
        metadata = {"title": doc["title"], "published_at": doc["published_at"], "source": doc["source"]}
        documents.append(doc["body"])

    # save docs to a tmp file
    tmp_corpus_file = "tmp_corpus.txt"
    with open(tmp_corpus_file, "w") as f:
        for doc in documents:
            f.write(doc + "\n")
    if args.ingest_docs:
        evaluator.ingest_docs(tmp_corpus_file, args.database_endpoint, args.chunk_size, args.chunk_overlap)

    with open(args.dataset_path, "r") as file:
        all_queries = json.load(file)

    # get retrieval quality
    if args.retrieval_metrics:
        retrieval_metrics = evaluator.get_retrieval_metrics(all_queries, args)
        print(retrieval_metrics)

    # get rag quality
    if args.ragas_metrics:
        ragas_metrics = evaluator.get_ragas_metrics(all_queries, args)
        print(ragas_metrics)


if __name__ == "__main__":
    main()

View File

@@ -0,0 +1,9 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0
import os

# Rename every document in the CRUD corpus so the files carry a .txt extension.
path = os.path.join(os.path.dirname(__file__), "./data/80000_docs")
for file in os.listdir(path):
    src_file = os.path.join(path, file)
    os.rename(src_file, src_file + ".txt")

View File

@@ -0,0 +1,64 @@
#!/bin/bash
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0
set -x
function main {
    init_params "$@"
    # run_benchmark
    echo $dataset
    if [[ ${dataset} == "MultiHop" ]]; then
        run_multihop
    elif [[ ${dataset} == "crud" ]]; then
        run_crud
    fi
}

# init params
function init_params {
    for var in "$@"
    do
        case $var in
            --dataset=*)
                dataset=$( echo $var |cut -f2 -d=)
                ;;
            *)
                echo "Error: No such parameter: ${var}"
                exit 1
                ;;
        esac
    done
}

# run_multihop
function run_multihop {
    git clone https://github.com/yixuantt/MultiHop-RAG.git
    python eval_multihop.py \
        --docs_path MultiHop-RAG/dataset/corpus.json \
        --dataset_path MultiHop-RAG/dataset/MultiHopRAG.json \
        --ingest_docs \
        --retrieval_metrics
}

# run_crud
function run_crud {
    git clone https://github.com/IAAR-Shanghai/CRUD_RAG
    mkdir data/
    cp CRUD_RAG/data/crud_split/split_merged.json data/
    cp -r CRUD_RAG/data/80000_docs/ data/
    python process_crud_dataset.py
    python eval_crud.py \
        --dataset_path ./data/split_merged.json \
        --docs_path ./data/80000_docs \
        --ingest_docs
}

main "$@"

View File

@@ -1,4 +1,4 @@
# CodeGen accuracy Evaluation
# CodeGen Accuracy
## Evaluation Framework
@@ -13,7 +13,7 @@ Please refer to [CodeGen Examples](https://github.com/opea-project/GenAIExamples
Use the `curl` command to test the CodeGen service and ensure that it has started properly
```bash
export CODEGEN_ENDPOINT = "http://${your_ip}:7778/v1/codegen"
export CODEGEN_ENDPOINT="http://${your_ip}:7778/v1/codegen"
curl $CODEGEN_ENDPOINT \
-H "Content-Type: application/json" \
-d '{"messages": "Implement a high-level API for a TODO list application. The API takes as input an operation request and updates the TODO list in place. If the request is invalid, raise an exception."}'
@@ -24,7 +24,7 @@ curl $CODEGEN_ENDPOINT \
For evaluating models on coding tasks, or specifically coding LLMs, we follow the [bigcode-evaluation-harness](https://github.com/bigcode-project/bigcode-evaluation-harness) and provide both command-line and function-call usage. [HumanEval](https://huggingface.co/datasets/openai_humaneval), [HumanEval+](https://huggingface.co/datasets/evalplus/humanevalplus), [InstructHumanEval](https://huggingface.co/datasets/codeparrot/instructhumaneval), [APPS](https://huggingface.co/datasets/codeparrot/apps), [MBPP](https://huggingface.co/datasets/mbpp), [MBPP+](https://huggingface.co/datasets/evalplus/mbppplus), and [DS-1000](https://github.com/HKUNLP/DS-1000/) are available for both completion (left-to-right) and insertion (FIM) modes.
#### command line usage
#### Environment
```shell
git clone https://github.com/opea-project/GenAIEval
@@ -32,15 +32,14 @@ cd GenAIEval
pip install -r requirements.txt
pip install -e .
cd evals/evaluation/bigcode_evaluation_harness/examples
python main.py --model Qwen/CodeQwen1.5-7B-Chat \
--tasks humaneval \
--codegen_url $CODEGEN_ENDPOINT \
--max_length_generation 2048 \
--batch_size 1 \
--save_generations \
--save_references \
--allow_code_execution
```
#### Evaluation
```
export CODEGEN_ENDPOINT="http://${your_ip}:7778/v1/codegen"
export CODEGEN_MODEL=your_model
bash run_acc.sh $CODEGEN_MODEL $CODEGEN_ENDPOINT
```
**_Note:_** Currently, our framework is designed to execute tasks in full. To ensure the accuracy of results, we advise against using the 'limit' or 'limit_start' parameters to restrict the number of test samples.
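
For reference, the same smoke test as the `curl` command above can be issued from Python (a minimal sketch; the endpoint URL is a placeholder and the payload mirrors the `curl` example):

```python
# Python equivalent of the curl smoke test for the CodeGen service above;
# replace the host/port with your own CODEGEN_ENDPOINT value.
import requests

codegen_endpoint = "http://localhost:7778/v1/codegen"  # placeholder
payload = {"messages": "Implement a high-level API for a TODO list application."}
resp = requests.post(codegen_endpoint, json=payload, timeout=300)
resp.raise_for_status()
print(resp.text)  # raw response body; the service may return streamed chunks
```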

View File

@@ -0,0 +1,17 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0
#
from evals.evaluation.bigcode_evaluation_harness import evaluate, setup_parser
def main():
    eval_args = setup_parser()
    results = evaluate(eval_args)
    print(results)


if __name__ == "__main__":
    main()

View File

@@ -0,0 +1,13 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0
python main.py --model $1 \
    --tasks humaneval \
    --codegen_url $2 \
    --max_length_generation 2048 \
    --batch_size 1 \
    --save_generations \
    --save_references \
    --allow_code_execution

View File

@@ -1,4 +1,4 @@
# FaqGen Evaluation
# FaqGen Accuracy
## Dataset

View File

@@ -0,0 +1,4 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0
python evaluate.py