update examples accuracy (#941)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
@@ -1,4 +1,4 @@
-# AudioQnA accuracy Evaluation
+# AudioQnA Accuracy

 AudioQnA is an example that demonstrates the integration of Generative AI (GenAI) models for performing question answering (QnA) on audio scenes; it combines Automatic Speech Recognition (ASR) and Text-to-Speech (TTS). The following is the pipeline for evaluating ASR accuracy.

AudioQnA/benchmark/accuracy/run_acc.sh (new file, 5 lines)
@@ -0,0 +1,5 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

python online_evaluate.py
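
run_acc.sh above simply invokes online_evaluate.py. As a rough illustration of what an ASR accuracy check involves, word error rate (WER) between reference transcripts and ASR hypotheses can be computed with the jiwer package; this is a minimal sketch, not the actual logic of online_evaluate.py:

```python
# Minimal WER sketch (assumes `pip install jiwer`); online_evaluate.py's real
# implementation may normalize text and aggregate over a full dataset differently.
from jiwer import wer

references = ["the quick brown fox jumps over the lazy dog"]
hypotheses = ["the quick brown fox jump over a lazy dog"]

error_rate = wer(references, hypotheses)  # fraction of substituted/deleted/inserted words
print(f"WER: {error_rate:.3f}")  # lower is better; 0.0 is a perfect transcript
```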

ChatQnA/benchmark/accuracy/README.md (new file, 170 lines)
@@ -0,0 +1,170 @@
# ChatQnA Accuracy

ChatQnA is a Retrieval-Augmented Generation (RAG) pipeline, which can enhance generative models through external information retrieval.

For evaluating accuracy, we use two recently published datasets and more than ten popular, comprehensive metrics:

- Datasets
  - [MultiHop](https://arxiv.org/pdf/2401.15391) (English dataset)
  - [CRUD](https://arxiv.org/abs/2401.17043) (Chinese dataset)
- Metrics (measuring the accuracy of both context retrieval and response generation; a small sketch of the retrieval metrics follows this list)
  - Evaluation of retrieval/reranking
    - MRR@10
    - MAP@10
    - Hits@10
    - Hits@4
    - LLM-as-a-Judge
  - Evaluation of the generated response from the end-to-end pipeline
    - BLEU
    - ROUGE-L
    - LLM-as-a-Judge
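
To make the retrieval metrics concrete, here is a small sketch of how Hits@k and MRR@10 could be computed for a single query from a ranked list of retrieved passages and the golden evidence. It assumes a golden fact counts as found when its text appears inside a retrieved passage; the `RetrievalBaseMetric` used by the evaluation scripts may apply a different matching rule.

```python
# Illustrative Hits@k and MRR@10 for one query (simple substring matching assumed;
# the real RetrievalBaseMetric in GenAIEval may match differently).
def hits_at_k(retrieved: list[str], golden: list[str], k: int) -> float:
    """Fraction of golden facts found anywhere in the top-k retrieved passages."""
    top_k = retrieved[:k]
    found = sum(1 for fact in golden if any(fact in doc for doc in top_k))
    return found / len(golden) if golden else 0.0


def mrr_at_10(retrieved: list[str], golden: list[str]) -> float:
    """Reciprocal rank of the first retrieved passage containing any golden fact."""
    for rank, doc in enumerate(retrieved[:10], start=1):
        if any(fact in doc for fact in golden):
            return 1.0 / rank
    return 0.0
```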

## Prerequisite

### Environment

```bash
git clone https://github.com/opea-project/GenAIEval
cd GenAIEval
pip install -r requirements.txt
pip install -e .
```

## MultiHop (English dataset)

[MultiHop-RAG](https://arxiv.org/pdf/2401.15391) is a QA dataset for evaluating retrieval and reasoning across documents with metadata in RAG pipelines. It contains 2556 queries, with evidence for each query distributed across 2 to 4 documents. The queries also involve document metadata, reflecting complex scenarios commonly found in real-world RAG applications.

### Launch Service of RAG System

Please refer to this [guide](https://github.com/opea-project/GenAIExamples/blob/main/ChatQnA/README.md) to launch the `ChatQnA` service.

### Launch Service of LLM-as-a-Judge

To set up an LLM, we can use [tgi-gaudi](https://github.com/huggingface/tgi-gaudi) to launch a service. For example, the following command sets up the [mistralai/Mixtral-8x7B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) model on 2 Gaudi2 cards:

```bash
# please set your llm_port and hf_token

docker run -p {your_llm_port}:80 --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e PT_HPU_ENABLE_LAZY_COLLECTIVES=true -e OMPI_MCA_btl_vader_single_copy_mechanism=none -e HF_TOKEN={your_hf_token} --cap-add=sys_nice --ipc=host ghcr.io/huggingface/tgi-gaudi:2.0.1 --model-id mistralai/Mixtral-8x7B-Instruct-v0.1 --max-input-tokens 2048 --max-total-tokens 4096 --sharded true --num-shard 2

# for better performance, set `PREFILL_BATCH_BUCKET_SIZE`, `BATCH_BUCKET_SIZE`, `max-batch-total-tokens`, `max-batch-prefill-tokens`
docker run -p {your_llm_port}:80 --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e PT_HPU_ENABLE_LAZY_COLLECTIVES=true -e OMPI_MCA_btl_vader_single_copy_mechanism=none -e HF_TOKEN={your_hf_token} -e PREFILL_BATCH_BUCKET_SIZE=1 -e BATCH_BUCKET_SIZE=8 --cap-add=sys_nice --ipc=host ghcr.io/huggingface/tgi-gaudi:2.0.5 --model-id mistralai/Mixtral-8x7B-Instruct-v0.1 --max-input-tokens 2048 --max-total-tokens 4096 --sharded true --num-shard 2 --max-batch-total-tokens 65536 --max-batch-prefill-tokens 2048
```
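
Before running the evaluation it is worth confirming that the judge endpoint answers requests. A quick sanity check of TGI's `/generate` API might look like the following (host and port are placeholders for your own deployment):

```python
# Sanity check of the LLM-as-a-Judge TGI service (replace host/port with your values).
import requests

llm_endpoint = "http://localhost:8085"  # i.e. http://{llm_as_judge_ip}:{your_llm_port}
payload = {"inputs": "What is a RAG pipeline?", "parameters": {"max_new_tokens": 64}}

resp = requests.post(f"{llm_endpoint}/generate", json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["generated_text"])
```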

### Prepare Dataset

We use the evaluation dataset from the [MultiHop-RAG](https://github.com/yixuantt/MultiHop-RAG) repo; use the command below to prepare the dataset.

```bash
git clone https://github.com/yixuantt/MultiHop-RAG.git
```

### Evaluation

Use the command below to run the evaluation. Note that on the first run the argument `--ingest_docs` should be added to ingest the documents into the vector database; on subsequent runs it should be omitted. Set `--retrieval_metrics` to get retrieval-related metrics (MRR@10/MAP@10/Hits@10/Hits@4). Set `--ragas_metrics` and `--llm_endpoint` to get end-to-end RAG pipeline metrics (faithfulness/answer_relevancy/...), which are judged by LLMs. `--limits` defaults to 100, meaning only 100 examples are evaluated by LLM-as-a-judge, since it is very time-consuming.

If you are using Docker Compose to deploy the `ChatQnA` system, you can simply run the evaluation as follows:

```bash
python eval_multihop.py --docs_path MultiHop-RAG/dataset/corpus.json --dataset_path MultiHop-RAG/dataset/MultiHopRAG.json --ingest_docs --retrieval_metrics --ragas_metrics --llm_endpoint http://{llm_as_judge_ip}:{llm_as_judge_port}/generate
```

If you are using Kubernetes manifests/Helm to deploy the `ChatQnA` system, you must specify more arguments, as follows:

```bash
python eval_multihop.py --docs_path MultiHop-RAG/dataset/corpus.json --dataset_path MultiHop-RAG/dataset/MultiHopRAG.json --ingest_docs --retrieval_metrics --ragas_metrics --llm_endpoint http://{llm_as_judge_ip}:{llm_as_judge_port}/generate --database_endpoint http://{your_dataprep_ip}:{your_dataprep_port}/v1/dataprep --embedding_endpoint http://{your_embedding_ip}:{your_embedding_port}/v1/embeddings --tei_embedding_endpoint http://{your_tei_embedding_ip}:{your_tei_embedding_port} --retrieval_endpoint http://{your_retrieval_ip}:{your_retrieval_port}/v1/retrieval --service_url http://{your_chatqna_ip}:{your_chatqna_port}/v1/chatqna
```

The default values for the arguments are:

|Argument|Default value|
|--------|-------------|
|service_url|http://localhost:8888/v1/chatqna|
|database_endpoint|http://localhost:6007/v1/dataprep|
|embedding_endpoint|http://localhost:6000/v1/embeddings|
|tei_embedding_endpoint|http://localhost:8090|
|retrieval_endpoint|http://localhost:7000/v1/retrieval|
|reranking_endpoint|http://localhost:8000/v1/reranking|
|output_dir|./output|
|temperature|0.1|
|max_new_tokens|1280|
|chunk_size|256|
|chunk_overlap|100|
|search_type|similarity|
|retrival_k|10|
|fetch_k|20|
|lambda_mult|0.5|
|dataset_path|None|
|docs_path|None|
|limits|100|

You can check the argument details with the command below:

```bash
python eval_multihop.py --help
```

## CRUD (Chinese dataset)

[CRUD-RAG](https://arxiv.org/abs/2401.17043) is a Chinese benchmark for RAG (Retrieval-Augmented Generation) systems. This example utilizes CRUD-RAG to evaluate the RAG system.

### Prepare Dataset

We use the evaluation dataset from the [CRUD-RAG](https://github.com/IAAR-Shanghai/CRUD_RAG) repo; use the commands below to prepare the dataset.

```bash
git clone https://github.com/IAAR-Shanghai/CRUD_RAG
mkdir data/
cp CRUD_RAG/data/crud_split/split_merged.json data/
cp -r CRUD_RAG/data/80000_docs/ data/
python process_crud_dataset.py
```

### Launch Service of RAG System

Please refer to this [guide](https://github.com/opea-project/GenAIExamples/blob/main/ChatQnA/README.md) to launch the `ChatQnA` service. For the Chinese dataset, you should replace the English embedding and LLM models with Chinese ones, for example `EMBEDDING_MODEL_ID="BAAI/bge-base-zh-v1.5"` and `LLM_MODEL_ID=Qwen/Qwen2-7B-Instruct`.

### Evaluation

Use the command below to run the evaluation. Note that on the first run the argument `--ingest_docs` should be added to ingest the documents into the vector database; on subsequent runs it should be omitted.

If you are using Docker Compose to deploy the `ChatQnA` system, you can simply run the evaluation as follows:

```bash
python eval_crud.py --dataset_path ./data/split_merged.json --docs_path ./data/80000_docs --ingest_docs

# if you want to get ragas metrics
python eval_crud.py --dataset_path ./data/split_merged.json --docs_path ./data/80000_docs --contain_original_data --llm_endpoint "http://{llm_as_judge_ip}:{llm_as_judge_port}" --ragas_metrics
```

If you are using Kubernetes manifests/Helm to deploy the `ChatQnA` system, you must specify more arguments, as follows:

```bash
python eval_crud.py --dataset_path ./data/split_merged.json --docs_path ./data/80000_docs --ingest_docs --database_endpoint http://{your_dataprep_ip}:{your_dataprep_port}/v1/dataprep --embedding_endpoint http://{your_embedding_ip}:{your_embedding_port}/v1/embeddings --retrieval_endpoint http://{your_retrieval_ip}:{your_retrieval_port}/v1/retrieval --service_url http://{your_chatqna_ip}:{your_chatqna_port}/v1/chatqna
```

The default values for the arguments are:

|Argument|Default value|
|--------|-------------|
|service_url|http://localhost:8888/v1/chatqna|
|database_endpoint|http://localhost:6007/v1/dataprep|
|embedding_endpoint|http://localhost:6000/v1/embeddings|
|retrieval_endpoint|http://localhost:7000/v1/retrieval|
|reranking_endpoint|http://localhost:8000/v1/reranking|
|output_dir|./output|
|temperature|0.1|
|max_new_tokens|1280|
|chunk_size|256|
|chunk_overlap|100|
|dataset_path|./data/split_merged.json|
|docs_path|./data/80000_docs|
|tasks|["question_answering"]|

You can check the argument details with the command below:

```bash
python eval_crud.py --help
```

## Acknowledgements

This example is mostly adapted from the [MultiHop-RAG](https://github.com/yixuantt/MultiHop-RAG) and [CRUD-RAG](https://github.com/IAAR-Shanghai/CRUD_RAG) repos; we thank the authors for their great work!

ChatQnA/benchmark/accuracy/eval_crud.py (new file, 210 lines)
@@ -0,0 +1,210 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0


import argparse
import json
import os

from evals.evaluation.rag_eval import Evaluator
from evals.evaluation.rag_eval.template import CRUDTemplate
from evals.metrics.ragas import RagasMetric
from tqdm import tqdm


class CRUD_Evaluator(Evaluator):
    def get_ground_truth_text(self, data: dict):
        if self.task == "summarization":
            ground_truth_text = data["summary"]
        elif self.task == "question_answering":
            ground_truth_text = data["answers"]
        elif self.task == "continuation":
            ground_truth_text = data["continuing"]
        elif self.task == "hallucinated_modified":
            ground_truth_text = data["hallucinatedMod"]
        else:
            raise NotImplementedError(
                f"Unknown task {self.task}, only support "
                "summarization, question_answering, continuation and hallucinated_modified."
            )
        return ground_truth_text

    def get_query(self, data: dict):
        if self.task == "summarization":
            query = data["text"]
        elif self.task == "question_answering":
            query = data["questions"]
        elif self.task == "continuation":
            query = data["beginning"]
        elif self.task == "hallucinated_modified":
            query = data["newsBeginning"]
        else:
            raise NotImplementedError(
                f"Unknown task {self.task}, only support "
                "summarization, question_answering, continuation and hallucinated_modified."
            )
        return query

    def get_document(self, data: dict):
        if self.task == "summarization":
            document = data["text"]
        elif self.task == "question_answering":
            document = data["news1"]
        elif self.task == "continuation":
            document = data["beginning"]
        elif self.task == "hallucinated_modified":
            document = data["newsBeginning"]
        else:
            raise NotImplementedError(
                f"Unknown task {self.task}, only support "
                "summarization, question_answering, continuation and hallucinated_modified."
            )
        return document

    def get_template(self):
        if self.task == "summarization":
            template = CRUDTemplate.get_summarization_template()
        elif self.task == "question_answering":
            template = CRUDTemplate.get_question_answering_template()
        elif self.task == "continuation":
            template = CRUDTemplate.get_continuation_template()
        else:
            raise NotImplementedError(
                f"Unknown task {self.task}, only support "
                "summarization, question_answering, continuation and hallucinated_modified."
            )
        return template

    def post_process(self, result):
        # Keep only the text between the <response> and </response> tags.
        return result.split("<response>")[-1].split("</response>")[0].strip()

    def get_ragas_metrics(self, results, arguments):
        from langchain_huggingface import HuggingFaceEndpointEmbeddings

        embeddings = HuggingFaceEndpointEmbeddings(model=arguments.tei_embedding_endpoint)

        metric = RagasMetric(
            threshold=0.5,
            model=arguments.llm_endpoint,
            embeddings=embeddings,
            metrics=["faithfulness", "answer_relevancy"],
        )

        all_answer_relevancy = 0
        all_faithfulness = 0
        ragas_inputs = {
            "question": [],
            "answer": [],
            "ground_truth": [],
            "contexts": [],
        }

        valid_results = self.remove_invalid(results["results"])

        for data in tqdm(valid_results):
            data = data["original_data"]

            query = self.get_query(data)
            generated_text = data["generated_text"]
            ground_truth = data["ground_truth_text"]
            retrieved_documents = data["retrieved_documents"]

            ragas_inputs["question"].append(query)
            ragas_inputs["answer"].append(generated_text)
            ragas_inputs["ground_truth"].append(ground_truth)
            ragas_inputs["contexts"].append(retrieved_documents[:3])

        ragas_metrics = metric.measure(ragas_inputs)
        return ragas_metrics


def args_parser():
    parser = argparse.ArgumentParser()

    parser.add_argument(
        "--service_url", type=str, default="http://localhost:8888/v1/chatqna", help="Service URL address."
    )
    parser.add_argument("--output_dir", type=str, default="./output", help="Directory to save evaluation results.")
    parser.add_argument(
        "--temperature", type=float, default=0.1, help="Controls the randomness of the model's text generation"
    )
    parser.add_argument(
        "--max_new_tokens", type=int, default=1280, help="Maximum number of new tokens to be generated by the model"
    )
    parser.add_argument(
        "--chunk_size", type=int, default=256, help="the maximum number of characters that a chunk can contain"
    )
    parser.add_argument(
        "--chunk_overlap",
        type=int,
        default=100,
        help="the number of characters that should overlap between two adjacent chunks",
    )
    parser.add_argument("--dataset_path", default="../data/split_merged.json", help="Path to the dataset")
    parser.add_argument("--docs_path", default="../data/80000_docs", help="Path to the retrieval documents")

    # Retriever related options
    parser.add_argument("--tasks", default=["question_answering"], nargs="+", help="Task to perform")
    parser.add_argument("--ingest_docs", action="store_true", help="Whether to ingest documents to vector database")
    parser.add_argument(
        "--database_endpoint", type=str, default="http://localhost:6007/v1/dataprep", help="Service URL address."
    )
    parser.add_argument(
        "--embedding_endpoint", type=str, default="http://localhost:6000/v1/embeddings", help="Service URL address."
    )
    parser.add_argument(
        "--retrieval_endpoint", type=str, default="http://localhost:7000/v1/retrieval", help="Service URL address."
    )
    parser.add_argument(
        "--tei_embedding_endpoint",
        type=str,
        default="http://localhost:8090",
        help="Service URL address of tei embedding.",
    )
    parser.add_argument("--ragas_metrics", action="store_true", help="Whether to compute ragas metrics.")
    parser.add_argument("--llm_endpoint", type=str, default=None, help="Service URL address.")
    parser.add_argument(
        "--show_progress_bar", action="store", default=True, type=bool, help="Whether to show a progress bar"
    )
    parser.add_argument("--contain_original_data", action="store_true", help="Whether to contain original data")

    args = parser.parse_args()
    return args


def main():
    args = args_parser()
    if os.path.isfile(args.dataset_path):
        with open(args.dataset_path) as f:
            all_datasets = json.load(f)
    else:
        raise FileNotFoundError(f"Evaluation dataset file {args.dataset_path} does not exist.")
    os.makedirs(args.output_dir, exist_ok=True)
    for task in args.tasks:
        if task == "question_answering":
            dataset = all_datasets["questanswer_1doc"]
        elif task == "summarization":
            dataset = all_datasets["event_summary"]
        else:
            raise NotImplementedError(
                f"Unknown task {task}, only support "
                "summarization, question_answering, continuation and hallucinated_modified."
            )
        output_save_path = os.path.join(args.output_dir, f"{task}.json")
        evaluator = CRUD_Evaluator(dataset=dataset, output_path=output_save_path, task=task)
        if args.ingest_docs:
            CRUD_Evaluator.ingest_docs(args.docs_path, args.database_endpoint, args.chunk_size, args.chunk_overlap)
        results = evaluator.evaluate(
            args, show_progress_bar=args.show_progress_bar, contain_original_data=args.contain_original_data
        )
        print(results["overall"])
        if args.ragas_metrics:
            ragas_metrics = evaluator.get_ragas_metrics(results, args)
            print(ragas_metrics)
        print(f"Evaluation results of task {task} saved to {output_save_path}.")


if __name__ == "__main__":
    main()

ChatQnA/benchmark/accuracy/eval_multihop.py (new file, 279 lines)
@@ -0,0 +1,279 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

import argparse
import json
import os

import requests
from evals.evaluation.rag_eval import Evaluator
from evals.metrics.ragas import RagasMetric
from evals.metrics.retrieval import RetrievalBaseMetric
from tqdm import tqdm


class MultiHop_Evaluator(Evaluator):
    def get_ground_truth_text(self, data: dict):
        return data["answer"]

    def get_query(self, data: dict):
        return data["query"]

    def get_template(self):
        return None

    def get_reranked_documents(self, query, docs, arguments):
        data = {
            "initial_query": query,
            "retrieved_docs": [{"text": doc} for doc in docs],
            "top_n": 10,
        }
        headers = {"Content-Type": "application/json"}

        response = requests.post(arguments.reranking_endpoint, data=json.dumps(data), headers=headers)
        if response.ok:
            reranked_documents = response.json()["documents"]
            return reranked_documents
        else:
            print(f"Request for reranking failed due to {response.text}.")
            return []

    def get_retrieved_documents(self, query, arguments):
        data = {"text": query}
        headers = {"Content-Type": "application/json"}
        response = requests.post(arguments.embedding_endpoint, data=json.dumps(data), headers=headers)
        if response.ok:
            embedding = response.json()["embedding"]
        else:
            print(f"Request for embedding failed due to {response.text}.")
            return []
        data = {
            "text": query,
            "embedding": embedding,
            "search_type": arguments.search_type,
            "k": arguments.retrival_k,
            "fetch_k": arguments.fetch_k,
            "lambda_mult": arguments.lambda_mult,
        }
        response = requests.post(arguments.retrieval_endpoint, data=json.dumps(data), headers=headers)
        if response.ok:
            retrieved_documents = response.json()["retrieved_docs"]
            return [doc["text"] for doc in retrieved_documents]
        else:
            print(f"Request for retrieval failed due to {response.text}.")
            return []

    def get_retrieval_metrics(self, all_queries, arguments):
        print("start to retrieve...")
        metric = RetrievalBaseMetric()
        hits_at_10 = 0
        hits_at_4 = 0
        map_at_10 = 0
        mrr_at_10 = 0
        total = 0
        for data in tqdm(all_queries):
            if data["question_type"] == "null_query":
                continue
            query = data["query"]
            retrieved_documents = self.get_retrieved_documents(query, arguments)
            if arguments.rerank:
                retrieved_documents = self.get_reranked_documents(query, retrieved_documents, arguments)
            golden_context = [each["fact"] for each in data["evidence_list"]]
            test_case = {
                "input": query,
                "golden_context": golden_context,
                "retrieval_context": retrieved_documents,
            }
            results = metric.measure(test_case)
            hits_at_10 += results["Hits@10"]
            hits_at_4 += results["Hits@4"]
            map_at_10 += results["MAP@10"]
            mrr_at_10 += results["MRR@10"]
            total += 1

        # Calculate average metrics over all queries
        hits_at_10 = hits_at_10 / total
        hits_at_4 = hits_at_4 / total
        map_at_10 = map_at_10 / total
        mrr_at_10 = mrr_at_10 / total

        return {
            "Hits@10": hits_at_10,
            "Hits@4": hits_at_4,
            "MAP@10": map_at_10,
            "MRR@10": mrr_at_10,
        }

    def evaluate(self, all_queries, arguments):
        results = []
        accuracy = 0
        index = 0
        for data in tqdm(all_queries):
            if data["question_type"] == "null_query":
                continue

            generated_text = self.send_request(data, arguments)
            data["generated_text"] = generated_text

            # same method as the paper: https://github.com/yixuantt/MultiHop-RAG/issues/8
            if data["answer"] in generated_text:
                accuracy += 1
            result = {"id": index, **self.scoring(data)}
            results.append(result)
            index += 1

        valid_results = self.remove_invalid(results)

        try:
            overall = self.compute_overall(valid_results) if len(valid_results) > 0 else {}
        except Exception as e:
            print(repr(e))
            overall = dict()

        overall.update({"accuracy": accuracy / len(results)})
        return overall

    def get_ragas_metrics(self, all_queries, arguments):
        from langchain_huggingface import HuggingFaceEndpointEmbeddings

        embeddings = HuggingFaceEndpointEmbeddings(model=arguments.tei_embedding_endpoint)

        metric = RagasMetric(threshold=0.5, model=arguments.llm_endpoint, embeddings=embeddings)
        all_answer_relevancy = 0
        all_faithfulness = 0
        ragas_inputs = {
            "question": [],
            "answer": [],
            "ground_truth": [],
            "contexts": [],
        }

        for data in tqdm(all_queries):
            if data["question_type"] == "null_query":
                continue
            retrieved_documents = self.get_retrieved_documents(data["query"], arguments)
            generated_text = self.send_request(data, arguments)
            data["generated_text"] = generated_text

            ragas_inputs["question"].append(data["query"])
            ragas_inputs["answer"].append(generated_text)
            ragas_inputs["ground_truth"].append(data["answer"])
            ragas_inputs["contexts"].append(retrieved_documents[:3])

            if len(ragas_inputs["question"]) >= arguments.limits:
                break

        ragas_metrics = metric.measure(ragas_inputs)
        return ragas_metrics


def args_parser():
    parser = argparse.ArgumentParser()

    parser.add_argument(
        "--service_url", type=str, default="http://localhost:8888/v1/chatqna", help="Service URL address."
    )
    parser.add_argument("--output_dir", type=str, default="./output", help="Directory to save evaluation results.")
    parser.add_argument(
        "--temperature", type=float, default=0.1, help="Controls the randomness of the model's text generation"
    )
    parser.add_argument(
        "--max_new_tokens", type=int, default=1280, help="Maximum number of new tokens to be generated by the model"
    )
    parser.add_argument(
        "--chunk_size", type=int, default=256, help="the maximum number of characters that a chunk can contain"
    )
    parser.add_argument(
        "--chunk_overlap",
        type=int,
        default=100,
        help="the number of characters that should overlap between two adjacent chunks",
    )
    parser.add_argument("--search_type", type=str, default="similarity", help="similarity type")
    parser.add_argument("--retrival_k", type=int, default=10, help="Number of Documents to return.")
    parser.add_argument(
        "--fetch_k", type=int, default=20, help="Number of Documents to fetch to pass to MMR algorithm."
    )
    parser.add_argument(
        "--lambda_mult",
        type=float,
        default=0.5,
        help="Number between 0 and 1 that determines the degree of diversity among the results with 0 corresponding to maximum diversity and 1 to minimum diversity. Defaults to 0.5.",
    )
    parser.add_argument("--dataset_path", default=None, help="Path to the dataset")
    parser.add_argument("--docs_path", default=None, help="Path to the retrieval documents")

    # Retriever related options
    parser.add_argument("--ingest_docs", action="store_true", help="Whether to ingest documents to vector database")
    parser.add_argument("--retrieval_metrics", action="store_true", help="Whether to compute retrieval metrics.")
    parser.add_argument("--ragas_metrics", action="store_true", help="Whether to compute ragas metrics.")
    parser.add_argument("--limits", type=int, default=100, help="Number of examples to be evaluated by llm-as-judge")
    parser.add_argument(
        "--database_endpoint", type=str, default="http://localhost:6007/v1/dataprep", help="Service URL address."
    )
    parser.add_argument(
        "--embedding_endpoint", type=str, default="http://localhost:6000/v1/embeddings", help="Service URL address."
    )
    parser.add_argument(
        "--tei_embedding_endpoint",
        type=str,
        default="http://localhost:8090",
        help="Service URL address of tei embedding.",
    )
    parser.add_argument(
        "--retrieval_endpoint", type=str, default="http://localhost:7000/v1/retrieval", help="Service URL address."
    )
    parser.add_argument("--rerank", action="store_true", help="Whether to use rerank microservice.")
    parser.add_argument(
        "--reranking_endpoint", type=str, default="http://localhost:8000/v1/reranking", help="Service URL address."
    )
    parser.add_argument("--llm_endpoint", type=str, default=None, help="Service URL address.")
    parser.add_argument(
        "--show_progress_bar", action="store", default=True, type=bool, help="Whether to show a progress bar"
    )
    parser.add_argument("--contain_original_data", action="store_true", help="Whether to contain original data")

    args = parser.parse_args()
    return args


def main():
    args = args_parser()

    evaluator = MultiHop_Evaluator()

    with open(args.docs_path, "r") as file:
        doc_data = json.load(file)

    documents = []
    for doc in doc_data:
        metadata = {"title": doc["title"], "published_at": doc["published_at"], "source": doc["source"]}
        documents.append(doc["body"])

    # save docs to a tmp file
    tmp_corpus_file = "tmp_corpus.txt"
    with open(tmp_corpus_file, "w") as f:
        for doc in documents:
            f.write(doc + "\n")

    if args.ingest_docs:
        evaluator.ingest_docs(tmp_corpus_file, args.database_endpoint, args.chunk_size, args.chunk_overlap)

    with open(args.dataset_path, "r") as file:
        all_queries = json.load(file)

    # get retrieval quality
    if args.retrieval_metrics:
        retrieval_metrics = evaluator.get_retrieval_metrics(all_queries, args)
        print(retrieval_metrics)

    # get rag quality
    if args.ragas_metrics:
        ragas_metrics = evaluator.get_ragas_metrics(all_queries, args)
        print(ragas_metrics)


if __name__ == "__main__":
    main()

ChatQnA/benchmark/accuracy/process_crud_dataset.py (new file, 9 lines)
@@ -0,0 +1,9 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

import os

# Rename every downloaded CRUD document so that it carries a .txt extension.
path = os.path.join(os.path.dirname(__file__), "./data/80000_docs")
for file in os.listdir(path):
    src_file = os.path.join(path, file)
    os.rename(src_file, src_file + ".txt")

ChatQnA/benchmark/accuracy/run_acc.sh (new file, 64 lines)
@@ -0,0 +1,64 @@
#!/bin/bash
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

set -x

function main {

    init_params "$@"
    # run_benchmark
    echo $dataset
    if [[ ${dataset} == "MultiHop" ]]; then
        run_multihop
    elif [[ ${dataset} == "crud" ]]; then
        run_crud
    fi

}

# init params
function init_params {
    for var in "$@"
    do
        case $var in
            --dataset=*)
                dataset=$(echo $var | cut -f2 -d=)
                ;;
            *)
                echo "Error: No such parameter: ${var}"
                exit 1
                ;;
        esac
    done
}

# run_multihop
function run_multihop {
    git clone https://github.com/yixuantt/MultiHop-RAG.git

    python eval_multihop.py \
        --docs_path MultiHop-RAG/dataset/corpus.json \
        --dataset_path MultiHop-RAG/dataset/MultiHopRAG.json \
        --ingest_docs \
        --retrieval_metrics

}

# run_crud
function run_crud {

    git clone https://github.com/IAAR-Shanghai/CRUD_RAG
    mkdir data/
    cp CRUD_RAG/data/crud_split/split_merged.json data/
    cp -r CRUD_RAG/data/80000_docs/ data/
    python process_crud_dataset.py

    python eval_crud.py \
        --dataset_path ./data/split_merged.json \
        --docs_path ./data/80000_docs \
        --ingest_docs
}

main "$@"
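
As written, this harness is driven by a single `--dataset` flag, e.g. `bash run_acc.sh --dataset=MultiHop` or `bash run_acc.sh --dataset=crud`; any other parameter makes `init_params` exit with an error.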
@@ -1,4 +1,4 @@
-# CodeGen accuracy Evaluation
+# CodeGen Accuracy

 ## Evaluation Framework

@@ -13,7 +13,7 @@ Please refer to [CodeGen Examples](https://github.com/opea-project/GenAIExamples
 Use the `curl` command to test the codegen service and ensure that it has started properly:

 ```bash
-export CODEGEN_ENDPOINT = "http://${your_ip}:7778/v1/codegen"
+export CODEGEN_ENDPOINT="http://${your_ip}:7778/v1/codegen"
 curl $CODEGEN_ENDPOINT \
   -H "Content-Type: application/json" \
   -d '{"messages": "Implement a high-level API for a TODO list application. The API takes as input an operation request and updates the TODO list in place. If the request is invalid, raise an exception."}'

@@ -24,7 +24,7 @@ curl $CODEGEN_ENDPOINT \
 For evaluating the models on coding tasks or specifically coding LLMs, we follow the [bigcode-evaluation-harness](https://github.com/bigcode-project/bigcode-evaluation-harness) and provide both command line usage and function call usage. [HumanEval](https://huggingface.co/datasets/openai_humaneval), [HumanEval+](https://huggingface.co/datasets/evalplus/humanevalplus), [InstructHumanEval](https://huggingface.co/datasets/codeparrot/instructhumaneval), [APPS](https://huggingface.co/datasets/codeparrot/apps), [MBPP](https://huggingface.co/datasets/mbpp), [MBPP+](https://huggingface.co/datasets/evalplus/mbppplus), and [DS-1000](https://github.com/HKUNLP/DS-1000/) are available for both completion (left-to-right) and insertion (FIM) mode.

-#### command line usage
+#### Environment

 ```shell
 git clone https://github.com/opea-project/GenAIEval
@@ -32,15 +32,14 @@ cd GenAIEval
 pip install -r requirements.txt
 pip install -e .

-cd evals/evaluation/bigcode_evaluation_harness/examples
-python main.py --model Qwen/CodeQwen1.5-7B-Chat \
-               --tasks humaneval \
-               --codegen_url $CODEGEN_ENDPOINT \
-               --max_length_generation 2048 \
-               --batch_size 1 \
-               --save_generations \
-               --save_references \
-               --allow_code_execution
 ```
+
+#### Evaluation
+
+```
+export CODEGEN_ENDPOINT="http://${your_ip}:7778/v1/codegen"
+export CODEGEN_MODEL=your_model
+bash run_acc.sh $CODEGEN_MODEL $CODEGEN_ENDPOINT
+```

 **_Note:_** Currently, our framework is designed to execute tasks in full. To ensure the accuracy of results, we advise against using the 'limit' or 'limit_start' parameters to restrict the number of test samples.

CodeGen/benchmark/accuracy/main.py (new file, 17 lines)
@@ -0,0 +1,17 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

#
from evals.evaluation.bigcode_evaluation_harness import evaluate, setup_parser


def main():
    eval_args = setup_parser()
    results = evaluate(eval_args)
    print(results)


if __name__ == "__main__":
    main()
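
main.py is the function-call entry point into the bigcode evaluation harness. As a hedged sketch, it could also be driven programmatically by supplying the same flags run_acc.sh passes on the command line; the flag names below are taken from run_acc.sh, and it is assumed here that setup_parser() reads them from sys.argv:

```python
# Illustrative only: feed the harness the same arguments run_acc.sh would pass.
# Assumes setup_parser() parses sys.argv; the exact interface may differ.
import sys

from evals.evaluation.bigcode_evaluation_harness import evaluate, setup_parser

sys.argv = [
    "main.py",
    "--model", "Qwen/CodeQwen1.5-7B-Chat",
    "--tasks", "humaneval",
    "--codegen_url", "http://localhost:7778/v1/codegen",
    "--max_length_generation", "2048",
    "--batch_size", "1",
    "--save_generations",
    "--save_references",
    "--allow_code_execution",
]
results = evaluate(setup_parser())
print(results)
```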

CodeGen/benchmark/accuracy/run_acc.sh (new file, 13 lines)
@@ -0,0 +1,13 @@

# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

python main.py --model $1 \
    --tasks humaneval \
    --codegen_url $2 \
    --max_length_generation 2048 \
    --batch_size 1 \
    --save_generations \
    --save_references \
    --allow_code_execution
@@ -1,4 +1,4 @@
-# FaqGen Evaluation
+# FaqGen Accuracy

 ## Dataset

FaqGen/benchmark/accuracy/run_acc.sh (new file, 4 lines)
@@ -0,0 +1,4 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

python evaluate.py