update examples accuracy (#941)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
@@ -1,4 +1,4 @@
-# AudioQnA accuracy Evaluation
+# AudioQnA Accuracy

 AudioQnA is an example that demonstrates the integration of Generative AI (GenAI) models for performing question answering (QnA) on audio scenes; it combines Automatic Speech Recognition (ASR) and Text-to-Speech (TTS). The following is the pipeline for evaluating ASR accuracy.

AudioQnA/benchmark/accuracy/run_acc.sh (new file, 5 lines)
@@ -0,0 +1,5 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

python online_evaluate.py
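
run_acc.sh above simply invokes online_evaluate.py. As a rough illustration of what an ASR accuracy check involves, word error rate (WER) between reference transcripts and ASR hypotheses can be computed with the jiwer package; this is a minimal sketch, not the actual logic of online_evaluate.py:

```python
# Minimal WER sketch (assumes `pip install jiwer`); online_evaluate.py's real
# implementation may normalize text and aggregate over a full dataset differently.
from jiwer import wer

references = ["the quick brown fox jumps over the lazy dog"]
hypotheses = ["the quick brown fox jump over a lazy dog"]

error_rate = wer(references, hypotheses)  # fraction of substituted/deleted/inserted words
print(f"WER: {error_rate:.3f}")  # lower is better; 0.0 is a perfect transcript
```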

ChatQnA/benchmark/accuracy/README.md (new file, 170 lines)
@@ -0,0 +1,170 @@
# ChatQnA Accuracy

ChatQnA is a Retrieval-Augmented Generation (RAG) pipeline, which can enhance generative models through external information retrieval.

For evaluating accuracy, we use two recently published datasets and more than ten popular, comprehensive metrics:

- Datasets
  - [MultiHop](https://arxiv.org/pdf/2401.15391) (English dataset)
  - [CRUD](https://arxiv.org/abs/2401.17043) (Chinese dataset)
- Metrics (measuring the accuracy of both context retrieval and response generation; a small sketch of the retrieval metrics follows this list)
  - Evaluation of retrieval/reranking
    - MRR@10
    - MAP@10
    - Hits@10
    - Hits@4
    - LLM-as-a-Judge
  - Evaluation of the generated response from the end-to-end pipeline
    - BLEU
    - ROUGE-L
    - LLM-as-a-Judge
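
To make the retrieval metrics concrete, here is a small sketch of how Hits@k and MRR@10 could be computed for a single query from a ranked list of retrieved passages and the golden evidence. It assumes a golden fact counts as found when its text appears inside a retrieved passage; the `RetrievalBaseMetric` used by the evaluation scripts may apply a different matching rule.

```python
# Illustrative Hits@k and MRR@10 for one query (simple substring matching assumed;
# the real RetrievalBaseMetric in GenAIEval may match differently).
def hits_at_k(retrieved: list[str], golden: list[str], k: int) -> float:
    """Fraction of golden facts found anywhere in the top-k retrieved passages."""
    top_k = retrieved[:k]
    found = sum(1 for fact in golden if any(fact in doc for doc in top_k))
    return found / len(golden) if golden else 0.0


def mrr_at_10(retrieved: list[str], golden: list[str]) -> float:
    """Reciprocal rank of the first retrieved passage containing any golden fact."""
    for rank, doc in enumerate(retrieved[:10], start=1):
        if any(fact in doc for fact in golden):
            return 1.0 / rank
    return 0.0
```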

## Prerequisite

### Environment

```bash
git clone https://github.com/opea-project/GenAIEval
cd GenAIEval
pip install -r requirements.txt
pip install -e .
```

## MultiHop (English dataset)

[MultiHop-RAG](https://arxiv.org/pdf/2401.15391) is a QA dataset for evaluating retrieval and reasoning across documents with metadata in RAG pipelines. It contains 2556 queries, with evidence for each query distributed across 2 to 4 documents. The queries also involve document metadata, reflecting complex scenarios commonly found in real-world RAG applications.

### Launch Service of RAG System

Please refer to this [guide](https://github.com/opea-project/GenAIExamples/blob/main/ChatQnA/README.md) to launch the `ChatQnA` service.

### Launch Service of LLM-as-a-Judge

To set up an LLM, we can use [tgi-gaudi](https://github.com/huggingface/tgi-gaudi) to launch a service. For example, the following command sets up the [mistralai/Mixtral-8x7B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) model on 2 Gaudi2 cards:

```bash
# please set your llm_port and hf_token

docker run -p {your_llm_port}:80 --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e PT_HPU_ENABLE_LAZY_COLLECTIVES=true -e OMPI_MCA_btl_vader_single_copy_mechanism=none -e HF_TOKEN={your_hf_token} --cap-add=sys_nice --ipc=host ghcr.io/huggingface/tgi-gaudi:2.0.1 --model-id mistralai/Mixtral-8x7B-Instruct-v0.1 --max-input-tokens 2048 --max-total-tokens 4096 --sharded true --num-shard 2

# for better performance, set `PREFILL_BATCH_BUCKET_SIZE`, `BATCH_BUCKET_SIZE`, `max-batch-total-tokens`, `max-batch-prefill-tokens`
docker run -p {your_llm_port}:80 --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e PT_HPU_ENABLE_LAZY_COLLECTIVES=true -e OMPI_MCA_btl_vader_single_copy_mechanism=none -e HF_TOKEN={your_hf_token} -e PREFILL_BATCH_BUCKET_SIZE=1 -e BATCH_BUCKET_SIZE=8 --cap-add=sys_nice --ipc=host ghcr.io/huggingface/tgi-gaudi:2.0.5 --model-id mistralai/Mixtral-8x7B-Instruct-v0.1 --max-input-tokens 2048 --max-total-tokens 4096 --sharded true --num-shard 2 --max-batch-total-tokens 65536 --max-batch-prefill-tokens 2048
```
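
Before running the evaluation it is worth confirming that the judge endpoint answers requests. A quick sanity check of TGI's `/generate` API might look like the following (host and port are placeholders for your own deployment):

```python
# Sanity check of the LLM-as-a-Judge TGI service (replace host/port with your values).
import requests

llm_endpoint = "http://localhost:8085"  # i.e. http://{llm_as_judge_ip}:{your_llm_port}
payload = {"inputs": "What is a RAG pipeline?", "parameters": {"max_new_tokens": 64}}

resp = requests.post(f"{llm_endpoint}/generate", json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["generated_text"])
```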

### Prepare Dataset

We use the evaluation dataset from the [MultiHop-RAG](https://github.com/yixuantt/MultiHop-RAG) repo; use the command below to prepare the dataset.

```bash
git clone https://github.com/yixuantt/MultiHop-RAG.git
```

### Evaluation

Use the command below to run the evaluation. Note that on the first run the argument `--ingest_docs` should be added to ingest the documents into the vector database; on subsequent runs it should be omitted. Set `--retrieval_metrics` to get retrieval-related metrics (MRR@10/MAP@10/Hits@10/Hits@4). Set `--ragas_metrics` and `--llm_endpoint` to get end-to-end RAG pipeline metrics (faithfulness/answer_relevancy/...), which are judged by LLMs. `--limits` defaults to 100, meaning only 100 examples are evaluated by LLM-as-a-judge, since it is very time-consuming.

If you are using Docker Compose to deploy the `ChatQnA` system, you can simply run the evaluation as follows:

```bash
python eval_multihop.py --docs_path MultiHop-RAG/dataset/corpus.json --dataset_path MultiHop-RAG/dataset/MultiHopRAG.json --ingest_docs --retrieval_metrics --ragas_metrics --llm_endpoint http://{llm_as_judge_ip}:{llm_as_judge_port}/generate
```

If you are using Kubernetes manifests/Helm to deploy the `ChatQnA` system, you must specify more arguments, as follows:

```bash
python eval_multihop.py --docs_path MultiHop-RAG/dataset/corpus.json --dataset_path MultiHop-RAG/dataset/MultiHopRAG.json --ingest_docs --retrieval_metrics --ragas_metrics --llm_endpoint http://{llm_as_judge_ip}:{llm_as_judge_port}/generate --database_endpoint http://{your_dataprep_ip}:{your_dataprep_port}/v1/dataprep --embedding_endpoint http://{your_embedding_ip}:{your_embedding_port}/v1/embeddings --tei_embedding_endpoint http://{your_tei_embedding_ip}:{your_tei_embedding_port} --retrieval_endpoint http://{your_retrieval_ip}:{your_retrieval_port}/v1/retrieval --service_url http://{your_chatqna_ip}:{your_chatqna_port}/v1/chatqna
```

The default values for the arguments are:

|Argument|Default value|
|--------|-------------|
|service_url|http://localhost:8888/v1/chatqna|
|database_endpoint|http://localhost:6007/v1/dataprep|
|embedding_endpoint|http://localhost:6000/v1/embeddings|
|tei_embedding_endpoint|http://localhost:8090|
|retrieval_endpoint|http://localhost:7000/v1/retrieval|
|reranking_endpoint|http://localhost:8000/v1/reranking|
|output_dir|./output|
|temperature|0.1|
|max_new_tokens|1280|
|chunk_size|256|
|chunk_overlap|100|
|search_type|similarity|
|retrival_k|10|
|fetch_k|20|
|lambda_mult|0.5|
|dataset_path|None|
|docs_path|None|
|limits|100|

You can check the argument details with the command below:

```bash
python eval_multihop.py --help
```

## CRUD (Chinese dataset)

[CRUD-RAG](https://arxiv.org/abs/2401.17043) is a Chinese benchmark for RAG (Retrieval-Augmented Generation) systems. This example utilizes CRUD-RAG to evaluate the RAG system.

### Prepare Dataset

We use the evaluation dataset from the [CRUD-RAG](https://github.com/IAAR-Shanghai/CRUD_RAG) repo; use the commands below to prepare the dataset.

```bash
git clone https://github.com/IAAR-Shanghai/CRUD_RAG
mkdir data/
cp CRUD_RAG/data/crud_split/split_merged.json data/
cp -r CRUD_RAG/data/80000_docs/ data/
python process_crud_dataset.py
```

### Launch Service of RAG System

Please refer to this [guide](https://github.com/opea-project/GenAIExamples/blob/main/ChatQnA/README.md) to launch the `ChatQnA` service. For the Chinese dataset, you should replace the English embedding and LLM models with Chinese ones, for example `EMBEDDING_MODEL_ID="BAAI/bge-base-zh-v1.5"` and `LLM_MODEL_ID=Qwen/Qwen2-7B-Instruct`.

### Evaluation

Use the command below to run the evaluation. Note that on the first run the argument `--ingest_docs` should be added to ingest the documents into the vector database; on subsequent runs it should be omitted.

If you are using Docker Compose to deploy the `ChatQnA` system, you can simply run the evaluation as follows:

```bash
python eval_crud.py --dataset_path ./data/split_merged.json --docs_path ./data/80000_docs --ingest_docs

# if you want to get ragas metrics
python eval_crud.py --dataset_path ./data/split_merged.json --docs_path ./data/80000_docs --contain_original_data --llm_endpoint "http://{llm_as_judge_ip}:{llm_as_judge_port}" --ragas_metrics
```

If you are using Kubernetes manifests/Helm to deploy the `ChatQnA` system, you must specify more arguments, as follows:

```bash
python eval_crud.py --dataset_path ./data/split_merged.json --docs_path ./data/80000_docs --ingest_docs --database_endpoint http://{your_dataprep_ip}:{your_dataprep_port}/v1/dataprep --embedding_endpoint http://{your_embedding_ip}:{your_embedding_port}/v1/embeddings --retrieval_endpoint http://{your_retrieval_ip}:{your_retrieval_port}/v1/retrieval --service_url http://{your_chatqna_ip}:{your_chatqna_port}/v1/chatqna
```

The default values for the arguments are:

|Argument|Default value|
|--------|-------------|
|service_url|http://localhost:8888/v1/chatqna|
|database_endpoint|http://localhost:6007/v1/dataprep|
|embedding_endpoint|http://localhost:6000/v1/embeddings|
|retrieval_endpoint|http://localhost:7000/v1/retrieval|
|reranking_endpoint|http://localhost:8000/v1/reranking|
|output_dir|./output|
|temperature|0.1|
|max_new_tokens|1280|
|chunk_size|256|
|chunk_overlap|100|
|dataset_path|./data/split_merged.json|
|docs_path|./data/80000_docs|
|tasks|["question_answering"]|

You can check the argument details with the command below:

```bash
python eval_crud.py --help
```

## Acknowledgements

This example is mostly adapted from the [MultiHop-RAG](https://github.com/yixuantt/MultiHop-RAG) and [CRUD-RAG](https://github.com/IAAR-Shanghai/CRUD_RAG) repos; we thank the authors for their great work!

ChatQnA/benchmark/accuracy/eval_crud.py (new file, 210 lines)
@@ -0,0 +1,210 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0


import argparse
import json
import os

from evals.evaluation.rag_eval import Evaluator
from evals.evaluation.rag_eval.template import CRUDTemplate
from evals.metrics.ragas import RagasMetric
from tqdm import tqdm


class CRUD_Evaluator(Evaluator):
    def get_ground_truth_text(self, data: dict):
        if self.task == "summarization":
            ground_truth_text = data["summary"]
        elif self.task == "question_answering":
            ground_truth_text = data["answers"]
        elif self.task == "continuation":
            ground_truth_text = data["continuing"]
        elif self.task == "hallucinated_modified":
            ground_truth_text = data["hallucinatedMod"]
        else:
            raise NotImplementedError(
                f"Unknown task {self.task}, only support "
                "summarization, question_answering, continuation and hallucinated_modified."
            )
        return ground_truth_text

    def get_query(self, data: dict):
        if self.task == "summarization":
            query = data["text"]
        elif self.task == "question_answering":
            query = data["questions"]
        elif self.task == "continuation":
            query = data["beginning"]
        elif self.task == "hallucinated_modified":
            query = data["newsBeginning"]
        else:
            raise NotImplementedError(
                f"Unknown task {self.task}, only support "
                "summarization, question_answering, continuation and hallucinated_modified."
            )
        return query

    def get_document(self, data: dict):
        if self.task == "summarization":
            document = data["text"]
        elif self.task == "question_answering":
            document = data["news1"]
        elif self.task == "continuation":
            document = data["beginning"]
        elif self.task == "hallucinated_modified":
            document = data["newsBeginning"]
        else:
            raise NotImplementedError(
                f"Unknown task {self.task}, only support "
                "summarization, question_answering, continuation and hallucinated_modified."
            )
        return document

    def get_template(self):
        if self.task == "summarization":
            template = CRUDTemplate.get_summarization_template()
        elif self.task == "question_answering":
            template = CRUDTemplate.get_question_answering_template()
        elif self.task == "continuation":
            template = CRUDTemplate.get_continuation_template()
        else:
            raise NotImplementedError(
                f"Unknown task {self.task}, only support "
                "summarization, question_answering, continuation and hallucinated_modified."
            )
        return template

    def post_process(self, result):
        # Keep only the text between the <response> and </response> tags.
        return result.split("<response>")[-1].split("</response>")[0].strip()

    def get_ragas_metrics(self, results, arguments):
        from langchain_huggingface import HuggingFaceEndpointEmbeddings

        embeddings = HuggingFaceEndpointEmbeddings(model=arguments.tei_embedding_endpoint)

        metric = RagasMetric(
            threshold=0.5,
            model=arguments.llm_endpoint,
            embeddings=embeddings,
            metrics=["faithfulness", "answer_relevancy"],
        )

        all_answer_relevancy = 0
        all_faithfulness = 0
        ragas_inputs = {
            "question": [],
            "answer": [],
            "ground_truth": [],
            "contexts": [],
        }

        valid_results = self.remove_invalid(results["results"])

        for data in tqdm(valid_results):
            data = data["original_data"]

            query = self.get_query(data)
            generated_text = data["generated_text"]
            ground_truth = data["ground_truth_text"]
            retrieved_documents = data["retrieved_documents"]

            ragas_inputs["question"].append(query)
            ragas_inputs["answer"].append(generated_text)
            ragas_inputs["ground_truth"].append(ground_truth)
            ragas_inputs["contexts"].append(retrieved_documents[:3])

        ragas_metrics = metric.measure(ragas_inputs)
        return ragas_metrics


def args_parser():
    parser = argparse.ArgumentParser()

    parser.add_argument(
        "--service_url", type=str, default="http://localhost:8888/v1/chatqna", help="Service URL address."
    )
    parser.add_argument("--output_dir", type=str, default="./output", help="Directory to save evaluation results.")
    parser.add_argument(
        "--temperature", type=float, default=0.1, help="Controls the randomness of the model's text generation"
    )
    parser.add_argument(
        "--max_new_tokens", type=int, default=1280, help="Maximum number of new tokens to be generated by the model"
    )
    parser.add_argument(
        "--chunk_size", type=int, default=256, help="the maximum number of characters that a chunk can contain"
    )
    parser.add_argument(
        "--chunk_overlap",
        type=int,
        default=100,
        help="the number of characters that should overlap between two adjacent chunks",
    )
    parser.add_argument("--dataset_path", default="../data/split_merged.json", help="Path to the dataset")
    parser.add_argument("--docs_path", default="../data/80000_docs", help="Path to the retrieval documents")

    # Retriever related options
    parser.add_argument("--tasks", default=["question_answering"], nargs="+", help="Task to perform")
    parser.add_argument("--ingest_docs", action="store_true", help="Whether to ingest documents to vector database")
    parser.add_argument(
        "--database_endpoint", type=str, default="http://localhost:6007/v1/dataprep", help="Service URL address."
    )
    parser.add_argument(
        "--embedding_endpoint", type=str, default="http://localhost:6000/v1/embeddings", help="Service URL address."
    )
    parser.add_argument(
        "--retrieval_endpoint", type=str, default="http://localhost:7000/v1/retrieval", help="Service URL address."
    )
    parser.add_argument(
        "--tei_embedding_endpoint",
        type=str,
        default="http://localhost:8090",
        help="Service URL address of tei embedding.",
    )
    parser.add_argument("--ragas_metrics", action="store_true", help="Whether to compute ragas metrics.")
    parser.add_argument("--llm_endpoint", type=str, default=None, help="Service URL address.")
    parser.add_argument(
        "--show_progress_bar", action="store", default=True, type=bool, help="Whether to show a progress bar"
    )
    parser.add_argument("--contain_original_data", action="store_true", help="Whether to contain original data")

    args = parser.parse_args()
    return args


def main():
    args = args_parser()
    if os.path.isfile(args.dataset_path):
        with open(args.dataset_path) as f:
            all_datasets = json.load(f)
    else:
        raise FileNotFoundError(f"Evaluation dataset file {args.dataset_path} does not exist.")
    os.makedirs(args.output_dir, exist_ok=True)
    for task in args.tasks:
        if task == "question_answering":
            dataset = all_datasets["questanswer_1doc"]
        elif task == "summarization":
            dataset = all_datasets["event_summary"]
        else:
            raise NotImplementedError(
                f"Unknown task {task}, only support "
                "summarization, question_answering, continuation and hallucinated_modified."
            )
        output_save_path = os.path.join(args.output_dir, f"{task}.json")
        evaluator = CRUD_Evaluator(dataset=dataset, output_path=output_save_path, task=task)
        if args.ingest_docs:
            CRUD_Evaluator.ingest_docs(args.docs_path, args.database_endpoint, args.chunk_size, args.chunk_overlap)
        results = evaluator.evaluate(
            args, show_progress_bar=args.show_progress_bar, contain_original_data=args.contain_original_data
        )
        print(results["overall"])
        if args.ragas_metrics:
            ragas_metrics = evaluator.get_ragas_metrics(results, args)
            print(ragas_metrics)
        print(f"Evaluation results of task {task} saved to {output_save_path}.")


if __name__ == "__main__":
    main()

ChatQnA/benchmark/accuracy/eval_multihop.py (new file, 279 lines)
@@ -0,0 +1,279 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

import argparse
import json
import os

import requests
from evals.evaluation.rag_eval import Evaluator
from evals.metrics.ragas import RagasMetric
from evals.metrics.retrieval import RetrievalBaseMetric
from tqdm import tqdm


class MultiHop_Evaluator(Evaluator):
    def get_ground_truth_text(self, data: dict):
        return data["answer"]

    def get_query(self, data: dict):
        return data["query"]

    def get_template(self):
        return None

    def get_reranked_documents(self, query, docs, arguments):
        data = {
            "initial_query": query,
            "retrieved_docs": [{"text": doc} for doc in docs],
            "top_n": 10,
        }
        headers = {"Content-Type": "application/json"}

        response = requests.post(arguments.reranking_endpoint, data=json.dumps(data), headers=headers)
        if response.ok:
            reranked_documents = response.json()["documents"]
            return reranked_documents
        else:
            print(f"Request for reranking failed due to {response.text}.")
            return []

    def get_retrieved_documents(self, query, arguments):
        data = {"text": query}
        headers = {"Content-Type": "application/json"}
        response = requests.post(arguments.embedding_endpoint, data=json.dumps(data), headers=headers)
        if response.ok:
            embedding = response.json()["embedding"]
        else:
            print(f"Request for embedding failed due to {response.text}.")
            return []
        data = {
            "text": query,
            "embedding": embedding,
            "search_type": arguments.search_type,
            "k": arguments.retrival_k,
            "fetch_k": arguments.fetch_k,
            "lambda_mult": arguments.lambda_mult,
        }
        response = requests.post(arguments.retrieval_endpoint, data=json.dumps(data), headers=headers)
        if response.ok:
            retrieved_documents = response.json()["retrieved_docs"]
            return [doc["text"] for doc in retrieved_documents]
        else:
            print(f"Request for retrieval failed due to {response.text}.")
            return []

    def get_retrieval_metrics(self, all_queries, arguments):
        print("start to retrieve...")
        metric = RetrievalBaseMetric()
        hits_at_10 = 0
        hits_at_4 = 0
        map_at_10 = 0
        mrr_at_10 = 0
        total = 0
        for data in tqdm(all_queries):
            if data["question_type"] == "null_query":
                continue
            query = data["query"]
            retrieved_documents = self.get_retrieved_documents(query, arguments)
            if arguments.rerank:
                retrieved_documents = self.get_reranked_documents(query, retrieved_documents, arguments)
            golden_context = [each["fact"] for each in data["evidence_list"]]
            test_case = {
                "input": query,
                "golden_context": golden_context,
                "retrieval_context": retrieved_documents,
            }
            results = metric.measure(test_case)
            hits_at_10 += results["Hits@10"]
            hits_at_4 += results["Hits@4"]
            map_at_10 += results["MAP@10"]
            mrr_at_10 += results["MRR@10"]
            total += 1

        # Calculate average metrics over all queries
        hits_at_10 = hits_at_10 / total
        hits_at_4 = hits_at_4 / total
        map_at_10 = map_at_10 / total
        mrr_at_10 = mrr_at_10 / total

        return {
            "Hits@10": hits_at_10,
            "Hits@4": hits_at_4,
            "MAP@10": map_at_10,
            "MRR@10": mrr_at_10,
        }

    def evaluate(self, all_queries, arguments):
        results = []
        accuracy = 0
        index = 0
        for data in tqdm(all_queries):
            if data["question_type"] == "null_query":
                continue

            generated_text = self.send_request(data, arguments)
            data["generated_text"] = generated_text

            # same method as the paper: https://github.com/yixuantt/MultiHop-RAG/issues/8
            if data["answer"] in generated_text:
                accuracy += 1
            result = {"id": index, **self.scoring(data)}
            results.append(result)
            index += 1

        valid_results = self.remove_invalid(results)

        try:
            overall = self.compute_overall(valid_results) if len(valid_results) > 0 else {}
        except Exception as e:
            print(repr(e))
            overall = dict()

        overall.update({"accuracy": accuracy / len(results)})
        return overall

    def get_ragas_metrics(self, all_queries, arguments):
        from langchain_huggingface import HuggingFaceEndpointEmbeddings

        embeddings = HuggingFaceEndpointEmbeddings(model=arguments.tei_embedding_endpoint)

        metric = RagasMetric(threshold=0.5, model=arguments.llm_endpoint, embeddings=embeddings)
        all_answer_relevancy = 0
        all_faithfulness = 0
        ragas_inputs = {
            "question": [],
            "answer": [],
            "ground_truth": [],
            "contexts": [],
        }

        for data in tqdm(all_queries):
            if data["question_type"] == "null_query":
                continue
            retrieved_documents = self.get_retrieved_documents(data["query"], arguments)
            generated_text = self.send_request(data, arguments)
            data["generated_text"] = generated_text

            ragas_inputs["question"].append(data["query"])
            ragas_inputs["answer"].append(generated_text)
            ragas_inputs["ground_truth"].append(data["answer"])
            ragas_inputs["contexts"].append(retrieved_documents[:3])

            if len(ragas_inputs["question"]) >= arguments.limits:
                break

        ragas_metrics = metric.measure(ragas_inputs)
        return ragas_metrics


def args_parser():
    parser = argparse.ArgumentParser()

    parser.add_argument(
        "--service_url", type=str, default="http://localhost:8888/v1/chatqna", help="Service URL address."
    )
    parser.add_argument("--output_dir", type=str, default="./output", help="Directory to save evaluation results.")
    parser.add_argument(
        "--temperature", type=float, default=0.1, help="Controls the randomness of the model's text generation"
    )
    parser.add_argument(
        "--max_new_tokens", type=int, default=1280, help="Maximum number of new tokens to be generated by the model"
    )
    parser.add_argument(
        "--chunk_size", type=int, default=256, help="the maximum number of characters that a chunk can contain"
    )
    parser.add_argument(
        "--chunk_overlap",
        type=int,
        default=100,
        help="the number of characters that should overlap between two adjacent chunks",
    )
    parser.add_argument("--search_type", type=str, default="similarity", help="similarity type")
    parser.add_argument("--retrival_k", type=int, default=10, help="Number of Documents to return.")
    parser.add_argument(
        "--fetch_k", type=int, default=20, help="Number of Documents to fetch to pass to MMR algorithm."
    )
    parser.add_argument(
        "--lambda_mult",
        type=float,
        default=0.5,
        help="Number between 0 and 1 that determines the degree of diversity among the results with 0 corresponding to maximum diversity and 1 to minimum diversity. Defaults to 0.5.",
    )
    parser.add_argument("--dataset_path", default=None, help="Path to the dataset")
    parser.add_argument("--docs_path", default=None, help="Path to the retrieval documents")

    # Retriever related options
    parser.add_argument("--ingest_docs", action="store_true", help="Whether to ingest documents to vector database")
    parser.add_argument("--retrieval_metrics", action="store_true", help="Whether to compute retrieval metrics.")
    parser.add_argument("--ragas_metrics", action="store_true", help="Whether to compute ragas metrics.")
    parser.add_argument("--limits", type=int, default=100, help="Number of examples to be evaluated by llm-as-judge")
    parser.add_argument(
        "--database_endpoint", type=str, default="http://localhost:6007/v1/dataprep", help="Service URL address."
    )
    parser.add_argument(
        "--embedding_endpoint", type=str, default="http://localhost:6000/v1/embeddings", help="Service URL address."
    )
    parser.add_argument(
        "--tei_embedding_endpoint",
        type=str,
        default="http://localhost:8090",
        help="Service URL address of tei embedding.",
    )
    parser.add_argument(
        "--retrieval_endpoint", type=str, default="http://localhost:7000/v1/retrieval", help="Service URL address."
    )
    parser.add_argument("--rerank", action="store_true", help="Whether to use rerank microservice.")
    parser.add_argument(
        "--reranking_endpoint", type=str, default="http://localhost:8000/v1/reranking", help="Service URL address."
    )
    parser.add_argument("--llm_endpoint", type=str, default=None, help="Service URL address.")
    parser.add_argument(
        "--show_progress_bar", action="store", default=True, type=bool, help="Whether to show a progress bar"
    )
    parser.add_argument("--contain_original_data", action="store_true", help="Whether to contain original data")

    args = parser.parse_args()
    return args


def main():
    args = args_parser()

    evaluator = MultiHop_Evaluator()

    with open(args.docs_path, "r") as file:
        doc_data = json.load(file)

    documents = []
    for doc in doc_data:
        metadata = {"title": doc["title"], "published_at": doc["published_at"], "source": doc["source"]}
        documents.append(doc["body"])

    # save docs to a tmp file
    tmp_corpus_file = "tmp_corpus.txt"
    with open(tmp_corpus_file, "w") as f:
        for doc in documents:
            f.write(doc + "\n")

    if args.ingest_docs:
        evaluator.ingest_docs(tmp_corpus_file, args.database_endpoint, args.chunk_size, args.chunk_overlap)

    with open(args.dataset_path, "r") as file:
        all_queries = json.load(file)

    # get retrieval quality
    if args.retrieval_metrics:
        retrieval_metrics = evaluator.get_retrieval_metrics(all_queries, args)
        print(retrieval_metrics)

    # get rag quality
    if args.ragas_metrics:
        ragas_metrics = evaluator.get_ragas_metrics(all_queries, args)
        print(ragas_metrics)


if __name__ == "__main__":
    main()

ChatQnA/benchmark/accuracy/process_crud_dataset.py (new file, 9 lines)
@@ -0,0 +1,9 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

import os

# Rename every downloaded CRUD document so that it carries a .txt extension.
path = os.path.join(os.path.dirname(__file__), "./data/80000_docs")
for file in os.listdir(path):
    src_file = os.path.join(path, file)
    os.rename(src_file, src_file + ".txt")

ChatQnA/benchmark/accuracy/run_acc.sh (new file, 64 lines)
@@ -0,0 +1,64 @@
#!/bin/bash
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

set -x

function main {

    init_params "$@"
    # run_benchmark
    echo $dataset
    if [[ ${dataset} == "MultiHop" ]]; then
        run_multihop
    elif [[ ${dataset} == "crud" ]]; then
        run_crud
    fi

}

# init params
function init_params {
    for var in "$@"
    do
        case $var in
            --dataset=*)
                dataset=$(echo $var | cut -f2 -d=)
                ;;
            *)
                echo "Error: No such parameter: ${var}"
                exit 1
                ;;
        esac
    done
}

# run_multihop
function run_multihop {
    git clone https://github.com/yixuantt/MultiHop-RAG.git

    python eval_multihop.py \
        --docs_path MultiHop-RAG/dataset/corpus.json \
        --dataset_path MultiHop-RAG/dataset/MultiHopRAG.json \
        --ingest_docs \
        --retrieval_metrics

}

# run_crud
function run_crud {

    git clone https://github.com/IAAR-Shanghai/CRUD_RAG
    mkdir data/
    cp CRUD_RAG/data/crud_split/split_merged.json data/
    cp -r CRUD_RAG/data/80000_docs/ data/
    python process_crud_dataset.py

    python eval_crud.py \
        --dataset_path ./data/split_merged.json \
        --docs_path ./data/80000_docs \
        --ingest_docs
}

main "$@"
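
As written, this harness is driven by a single `--dataset` flag, e.g. `bash run_acc.sh --dataset=MultiHop` or `bash run_acc.sh --dataset=crud`; any other parameter makes `init_params` exit with an error.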
@@ -1,4 +1,4 @@
-# CodeGen accuracy Evaluation
+# CodeGen Accuracy

 ## Evaluation Framework

@@ -13,7 +13,7 @@ Please refer to [CodeGen Examples](https://github.com/opea-project/GenAIExamples
 Use the `curl` command to test the codegen service and ensure that it has started properly:

 ```bash
-export CODEGEN_ENDPOINT = "http://${your_ip}:7778/v1/codegen"
+export CODEGEN_ENDPOINT="http://${your_ip}:7778/v1/codegen"
 curl $CODEGEN_ENDPOINT \
   -H "Content-Type: application/json" \
   -d '{"messages": "Implement a high-level API for a TODO list application. The API takes as input an operation request and updates the TODO list in place. If the request is invalid, raise an exception."}'

@@ -24,7 +24,7 @@ curl $CODEGEN_ENDPOINT \
 For evaluating the models on coding tasks or specifically coding LLMs, we follow the [bigcode-evaluation-harness](https://github.com/bigcode-project/bigcode-evaluation-harness) and provide both command line usage and function call usage. [HumanEval](https://huggingface.co/datasets/openai_humaneval), [HumanEval+](https://huggingface.co/datasets/evalplus/humanevalplus), [InstructHumanEval](https://huggingface.co/datasets/codeparrot/instructhumaneval), [APPS](https://huggingface.co/datasets/codeparrot/apps), [MBPP](https://huggingface.co/datasets/mbpp), [MBPP+](https://huggingface.co/datasets/evalplus/mbppplus), and [DS-1000](https://github.com/HKUNLP/DS-1000/) are available for both completion (left-to-right) and insertion (FIM) mode.

-#### command line usage
+#### Environment

 ```shell
 git clone https://github.com/opea-project/GenAIEval
@@ -32,15 +32,14 @@ cd GenAIEval
 pip install -r requirements.txt
 pip install -e .

-cd evals/evaluation/bigcode_evaluation_harness/examples
-python main.py --model Qwen/CodeQwen1.5-7B-Chat \
-               --tasks humaneval \
-               --codegen_url $CODEGEN_ENDPOINT \
-               --max_length_generation 2048 \
-               --batch_size 1 \
-               --save_generations \
-               --save_references \
-               --allow_code_execution
 ```
+
+#### Evaluation
+
+```
+export CODEGEN_ENDPOINT="http://${your_ip}:7778/v1/codegen"
+export CODEGEN_MODEL=your_model
+bash run_acc.sh $CODEGEN_MODEL $CODEGEN_ENDPOINT
+```

 **_Note:_** Currently, our framework is designed to execute tasks in full. To ensure the accuracy of results, we advise against using the 'limit' or 'limit_start' parameters to restrict the number of test samples.

CodeGen/benchmark/accuracy/main.py (new file, 17 lines)
@@ -0,0 +1,17 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

#
from evals.evaluation.bigcode_evaluation_harness import evaluate, setup_parser


def main():
    eval_args = setup_parser()
    results = evaluate(eval_args)
    print(results)


if __name__ == "__main__":
    main()
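
main.py is the function-call entry point into the bigcode evaluation harness. As a hedged sketch, it could also be driven programmatically by supplying the same flags run_acc.sh passes on the command line; the flag names below are taken from run_acc.sh, and it is assumed here that setup_parser() reads them from sys.argv:

```python
# Illustrative only: feed the harness the same arguments run_acc.sh would pass.
# Assumes setup_parser() parses sys.argv; the exact interface may differ.
import sys

from evals.evaluation.bigcode_evaluation_harness import evaluate, setup_parser

sys.argv = [
    "main.py",
    "--model", "Qwen/CodeQwen1.5-7B-Chat",
    "--tasks", "humaneval",
    "--codegen_url", "http://localhost:7778/v1/codegen",
    "--max_length_generation", "2048",
    "--batch_size", "1",
    "--save_generations",
    "--save_references",
    "--allow_code_execution",
]
results = evaluate(setup_parser())
print(results)
```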

CodeGen/benchmark/accuracy/run_acc.sh (new file, 13 lines)
@@ -0,0 +1,13 @@

# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

python main.py --model $1 \
    --tasks humaneval \
    --codegen_url $2 \
    --max_length_generation 2048 \
    --batch_size 1 \
    --save_generations \
    --save_references \
    --allow_code_execution
@@ -1,4 +1,4 @@
-# FaqGen Evaluation
+# FaqGen Accuracy

 ## Dataset

FaqGen/benchmark/accuracy/run_acc.sh (new file, 4 lines)
@@ -0,0 +1,4 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

python evaluate.py