update examples accuracy (#941)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
@@ -1,4 +1,4 @@
# AudioQnA accuracy Evaluation
# AudioQnA Accuracy

AudioQnA is an example that demonstrates the integration of Generative AI (GenAI) models for performing question answering (QnA) on audio scenes; the pipeline contains Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) components. The following is the pipeline for evaluating ASR accuracy.
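ASR accuracy is typically reported as Word Error Rate (WER) between a reference transcript and the ASR hypothesis. The evaluation itself is driven by `run_acc.sh`/`online_evaluate.py` below; the snippet here is only a minimal illustration using the `jiwer` package, an assumption rather than necessarily what the script uses:

```python
# Minimal WER illustration (pip install jiwer); for intuition only.
from jiwer import wer

reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jump over a lazy dog"

print(f"WER: {wer(reference, hypothesis):.2%}")  # fraction of word-level edits required
```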
5
AudioQnA/benchmark/accuracy/run_acc.sh
Normal file
@@ -0,0 +1,5 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

python online_evaluate.py
170
ChatQnA/benchmark/accuracy/README.md
Normal file
@@ -0,0 +1,170 @@
# ChatQnA Accuracy

ChatQnA is a Retrieval-Augmented Generation (RAG) pipeline, which can enhance generative models through external information retrieval.

To evaluate accuracy, we use two recently published datasets and more than ten popular, comprehensive metrics:

- Dataset
  - [MultiHop](https://arxiv.org/pdf/2401.15391) (English dataset)
  - [CRUD](https://arxiv.org/abs/2401.17043) (Chinese dataset)
- Metrics (measuring the accuracy of both context retrieval and response generation)
  - Evaluation of retrieval/reranking (see the sketch after this list)
    - MRR@10
    - MAP@10
    - Hits@10
    - Hits@4
    - LLM-as-a-Judge
  - Evaluation of the generated response from the end-to-end pipeline
    - BLEU
    - ROUGE-L
    - LLM-as-a-Judge
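For intuition, the retrieval metrics above can be sketched in a few lines of plain Python. This is an illustration only (the actual computation in this example is performed by `RetrievalBaseMetric` from GenAIEval), and the substring-based relevance check is a simplifying assumption:

```python
# Illustrative computation of the retrieval metrics for a single query.
# Relevance is simplified to substring matching between a golden fact and a
# retrieved chunk; the GenAIEval implementation may differ in detail.
from typing import Dict, List


def is_hit(golden: str, chunk: str) -> bool:
    # A retrieved chunk counts as relevant if it contains the golden fact.
    return golden in chunk


def retrieval_metrics(golden_context: List[str], retrieved: List[str]) -> Dict[str, float]:
    top10 = retrieved[:10]

    # Hits@k: was at least one relevant chunk retrieved within the top k?
    hits_at_10 = float(any(is_hit(g, d) for g in golden_context for d in top10))
    hits_at_4 = float(any(is_hit(g, d) for g in golden_context for d in retrieved[:4]))

    # MRR@10: reciprocal rank of the first relevant chunk in the top 10.
    mrr_at_10 = 0.0
    for rank, doc in enumerate(top10, start=1):
        if any(is_hit(g, doc) for g in golden_context):
            mrr_at_10 = 1.0 / rank
            break

    # MAP@10: mean precision at the ranks where relevant chunks appear.
    relevant_seen, precision_sum = 0, 0.0
    for rank, doc in enumerate(top10, start=1):
        if any(is_hit(g, doc) for g in golden_context):
            relevant_seen += 1
            precision_sum += relevant_seen / rank
    map_at_10 = precision_sum / relevant_seen if relevant_seen else 0.0

    return {"Hits@10": hits_at_10, "Hits@4": hits_at_4, "MRR@10": mrr_at_10, "MAP@10": map_at_10}
```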
## Prerequisite

### Environment

```bash
git clone https://github.com/opea-project/GenAIEval
cd GenAIEval
pip install -r requirements.txt
pip install -e .
```
## MultiHop (English dataset)

[MultiHop-RAG](https://arxiv.org/pdf/2401.15391): a QA dataset for evaluating retrieval and reasoning across documents with metadata in RAG pipelines. It contains 2556 queries, with the evidence for each query distributed across 2 to 4 documents. The queries also involve document metadata, reflecting complex scenarios commonly found in real-world RAG applications.

### Launch Service of RAG System

Please refer to this [guide](https://github.com/opea-project/GenAIExamples/blob/main/ChatQnA/README.md) to launch the `ChatQnA` service.
### Launch Service of LLM-as-a-Judge

To set up the judge LLM, we can use [tgi-gaudi](https://github.com/huggingface/tgi-gaudi) to launch a service. For example, the following command sets up the [mistralai/Mixtral-8x7B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) model on 2 Gaudi2 cards:

```bash
# please set your llm_port and hf_token

docker run -p {your_llm_port}:80 --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e PT_HPU_ENABLE_LAZY_COLLECTIVES=true -e OMPI_MCA_btl_vader_single_copy_mechanism=none -e HF_TOKEN={your_hf_token} --cap-add=sys_nice --ipc=host ghcr.io/huggingface/tgi-gaudi:2.0.1 --model-id mistralai/Mixtral-8x7B-Instruct-v0.1 --max-input-tokens 2048 --max-total-tokens 4096 --sharded true --num-shard 2

# for better performance, set `PREFILL_BATCH_BUCKET_SIZE`, `BATCH_BUCKET_SIZE`, `max-batch-total-tokens`, `max-batch-prefill-tokens`
docker run -p {your_llm_port}:80 --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e PT_HPU_ENABLE_LAZY_COLLECTIVES=true -e OMPI_MCA_btl_vader_single_copy_mechanism=none -e HF_TOKEN={your_hf_token} -e PREFILL_BATCH_BUCKET_SIZE=1 -e BATCH_BUCKET_SIZE=8 --cap-add=sys_nice --ipc=host ghcr.io/huggingface/tgi-gaudi:2.0.5 --model-id mistralai/Mixtral-8x7B-Instruct-v0.1 --max-input-tokens 2048 --max-total-tokens 4096 --sharded true --num-shard 2 --max-batch-total-tokens 65536 --max-batch-prefill-tokens 2048
```
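Once the judge service is running, a quick request to TGI's `/generate` route confirms that the endpoint you will later pass as `--llm_endpoint` is reachable. A minimal sketch; the host, port, and prompt below are placeholders:

```python
# Sanity check of the LLM-as-a-Judge endpoint (text-generation-inference /generate API).
import requests

LLM_ENDPOINT = "http://localhost:8085/generate"  # placeholder for http://{llm_as_judge_ip}:{your_llm_port}/generate

payload = {
    "inputs": "Answer briefly: what is Retrieval-Augmented Generation?",
    "parameters": {"max_new_tokens": 64, "temperature": 0.1},
}
resp = requests.post(LLM_ENDPOINT, json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["generated_text"])
```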
### Prepare Dataset

We use the evaluation dataset from the [MultiHop-RAG](https://github.com/yixuantt/MultiHop-RAG) repo; use the command below to prepare the dataset.

```bash
git clone https://github.com/yixuantt/MultiHop-RAG.git
```
### Evaluation

Use the commands below to run the evaluation. Note that on the first run, the `--ingest_docs` argument should be added to ingest the documents into the vector database; on subsequent runs, it should be omitted. Set `--retrieval_metrics` to get retrieval-related metrics (MRR@10/MAP@10/Hits@10/Hits@4). Set `--ragas_metrics` and `--llm_endpoint` to get end-to-end RAG pipeline metrics (faithfulness/answer_relevancy/...), which are judged by an LLM. `--limits` defaults to 100, which means only 100 examples are evaluated by LLM-as-a-Judge, since this step is very time consuming.

If you are using Docker Compose to deploy the `ChatQnA` system, you can simply run the evaluation as follows:

```bash
python eval_multihop.py --docs_path MultiHop-RAG/dataset/corpus.json --dataset_path MultiHop-RAG/dataset/MultiHopRAG.json --ingest_docs --retrieval_metrics --ragas_metrics --llm_endpoint http://{llm_as_judge_ip}:{llm_as_judge_port}/generate
```

If you are using Kubernetes manifests/Helm to deploy the `ChatQnA` system, you must specify additional arguments, as follows:

```bash
python eval_multihop.py --docs_path MultiHop-RAG/dataset/corpus.json --dataset_path MultiHop-RAG/dataset/MultiHopRAG.json --ingest_docs --retrieval_metrics --ragas_metrics --llm_endpoint http://{llm_as_judge_ip}:{llm_as_judge_port}/generate --database_endpoint http://{your_dataprep_ip}:{your_dataprep_port}/v1/dataprep --embedding_endpoint http://{your_embedding_ip}:{your_embedding_port}/v1/embeddings --tei_embedding_endpoint http://{your_tei_embedding_ip}:{your_tei_embedding_port} --retrieval_endpoint http://{your_retrieval_ip}:{your_retrieval_port}/v1/retrieval --service_url http://{your_chatqna_ip}:{your_chatqna_port}/v1/chatqna
```
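Before starting a long evaluation run, it can also help to sanity-check the `--service_url` endpoint with a single request. A minimal sketch, assuming the ChatQnA megaservice accepts a `{"messages": ...}` JSON payload (the same style as the CodeGen `curl` example later in this repository); host, port, and question are placeholders:

```python
# Hypothetical sanity check of the ChatQnA megaservice endpoint (--service_url).
import requests

SERVICE_URL = "http://localhost:8888/v1/chatqna"  # default --service_url

payload = {"messages": "What is the revenue of Nike in 2023?"}
resp = requests.post(SERVICE_URL, json=payload, timeout=120)
resp.raise_for_status()
# The exact response format depends on the deployment; print the raw body.
print(resp.text)
```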
The default values for arguments are:

|Argument|Default value|
|--------|-------------|
|service_url|http://localhost:8888/v1/chatqna|
|database_endpoint|http://localhost:6007/v1/dataprep|
|embedding_endpoint|http://localhost:6000/v1/embeddings|
|tei_embedding_endpoint|http://localhost:8090|
|retrieval_endpoint|http://localhost:7000/v1/retrieval|
|reranking_endpoint|http://localhost:8000/v1/reranking|
|output_dir|./output|
|temperature|0.1|
|max_new_tokens|1280|
|chunk_size|256|
|chunk_overlap|100|
|search_type|similarity|
|retrival_k|10|
|fetch_k|20|
|lambda_mult|0.5|
|dataset_path|None|
|docs_path|None|
|limits|100|
You can check the argument details with the command below:

```bash
python eval_multihop.py --help
```
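For reference, the `--ragas_metrics` option is backed by the `RagasMetric` helper from GenAIEval; the sketch below mirrors how `eval_multihop.py` and `eval_crud.py` invoke it, with placeholder endpoints and a single made-up sample:

```python
# Sketch of the ragas metric computation used by eval_multihop.py / eval_crud.py.
from evals.metrics.ragas import RagasMetric
from langchain_huggingface import HuggingFaceEndpointEmbeddings

# --tei_embedding_endpoint and --llm_endpoint values below are placeholders.
embeddings = HuggingFaceEndpointEmbeddings(model="http://localhost:8090")

metric = RagasMetric(
    threshold=0.5,
    model="http://localhost:8085/generate",
    embeddings=embeddings,
    metrics=["faithfulness", "answer_relevancy"],
)

ragas_inputs = {
    "question": ["What does RAG stand for?"],
    "answer": ["RAG stands for Retrieval-Augmented Generation."],
    "ground_truth": ["Retrieval-Augmented Generation."],
    "contexts": [["RAG (Retrieval-Augmented Generation) augments an LLM with retrieved documents."]],
}
print(metric.measure(ragas_inputs))
```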
## CRUD (Chinese dataset)

[CRUD-RAG](https://arxiv.org/abs/2401.17043) is a Chinese benchmark for RAG (Retrieval-Augmented Generation) systems. This example utilizes CRUD-RAG for evaluating the RAG system.

### Prepare Dataset

We use the evaluation dataset from the [CRUD-RAG](https://github.com/IAAR-Shanghai/CRUD_RAG) repo; use the commands below to prepare the dataset.

```bash
git clone https://github.com/IAAR-Shanghai/CRUD_RAG
mkdir data/
cp CRUD_RAG/data/crud_split/split_merged.json data/
cp -r CRUD_RAG/data/80000_docs/ data/
python process_crud_dataset.py
```
### Launch Service of RAG System

Please refer to this [guide](https://github.com/opea-project/GenAIExamples/blob/main/ChatQnA/README.md) to launch the `ChatQnA` service. For the Chinese dataset, you should replace the English embedding and LLM models with Chinese ones, for example, `EMBEDDING_MODEL_ID="BAAI/bge-base-zh-v1.5"` and `LLM_MODEL_ID=Qwen/Qwen2-7B-Instruct`.
### Evaluation

Use the commands below to run the evaluation. Note that on the first run, the `--ingest_docs` argument should be added to ingest the documents into the vector database; on subsequent runs, it should be omitted.

If you are using Docker Compose to deploy the `ChatQnA` system, you can simply run the evaluation as follows:

```bash
python eval_crud.py --dataset_path ./data/split_merged.json --docs_path ./data/80000_docs --ingest_docs

# if you want to get ragas metrics
python eval_crud.py --dataset_path ./data/split_merged.json --docs_path ./data/80000_docs --contain_original_data --llm_endpoint "http://{llm_as_judge_ip}:{llm_as_judge_port}" --ragas_metrics
```

If you are using Kubernetes manifests/Helm to deploy the `ChatQnA` system, you must specify additional arguments, as follows:

```bash
python eval_crud.py --dataset_path ./data/split_merged.json --docs_path ./data/80000_docs --ingest_docs --database_endpoint http://{your_dataprep_ip}:{your_dataprep_port}/v1/dataprep --embedding_endpoint http://{your_embedding_ip}:{your_embedding_port}/v1/embeddings --retrieval_endpoint http://{your_retrieval_ip}:{your_retrieval_port}/v1/retrieval --service_url http://{your_chatqna_ip}:{your_chatqna_port}/v1/chatqna
```
The default values for arguments are:

|Argument|Default value|
|--------|-------------|
|service_url|http://localhost:8888/v1/chatqna|
|database_endpoint|http://localhost:6007/v1/dataprep|
|embedding_endpoint|http://localhost:6000/v1/embeddings|
|retrieval_endpoint|http://localhost:7000/v1/retrieval|
|reranking_endpoint|http://localhost:8000/v1/reranking|
|output_dir|./output|
|temperature|0.1|
|max_new_tokens|1280|
|chunk_size|256|
|chunk_overlap|100|
|dataset_path|./data/split_merged.json|
|docs_path|./data/80000_docs|
|tasks|["question_answering"]|
You can check the argument details with the command below:

```bash
python eval_crud.py --help
```
## Acknowledgements

This example is mostly adapted from the [MultiHop-RAG](https://github.com/yixuantt/MultiHop-RAG) and [CRUD-RAG](https://github.com/IAAR-Shanghai/CRUD_RAG) repos; we thank the authors for their great work!
210
ChatQnA/benchmark/accuracy/eval_crud.py
Normal file
@@ -0,0 +1,210 @@
|
||||
#!/usr/bin/env python
|
||||
# -*- coding: utf-8 -*-
|
||||
# Copyright (C) 2024 Intel Corporation
|
||||
# SPDX-License-Identifier: Apache-2.0
|
||||
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import os
|
||||
|
||||
from evals.evaluation.rag_eval import Evaluator
|
||||
from evals.evaluation.rag_eval.template import CRUDTemplate
|
||||
from evals.metrics.ragas import RagasMetric
|
||||
from tqdm import tqdm
|
||||
|
||||
|
||||
class CRUD_Evaluator(Evaluator):
|
||||
def get_ground_truth_text(self, data: dict):
|
||||
if self.task == "summarization":
|
||||
ground_truth_text = data["summary"]
|
||||
elif self.task == "question_answering":
|
||||
ground_truth_text = data["answers"]
|
||||
elif self.task == "continuation":
|
||||
ground_truth_text = data["continuing"]
|
||||
elif self.task == "hallucinated_modified":
|
||||
ground_truth_text = data["hallucinatedMod"]
|
||||
else:
|
||||
raise NotImplementedError(
|
||||
f"Unknown task {self.task}, only support "
|
||||
"summarization, question_answering, continuation and hallucinated_modified."
|
||||
)
|
||||
return ground_truth_text
|
||||
|
||||
def get_query(self, data: dict):
|
||||
if self.task == "summarization":
|
||||
query = data["text"]
|
||||
elif self.task == "question_answering":
|
||||
query = data["questions"]
|
||||
elif self.task == "continuation":
|
||||
query = data["beginning"]
|
||||
elif self.task == "hallucinated_modified":
|
||||
query = data["newsBeginning"]
|
||||
else:
|
||||
raise NotImplementedError(
|
||||
f"Unknown task {self.task}, only support "
|
||||
"summarization, question_answering, continuation and hallucinated_modified."
|
||||
)
|
||||
return query
|
||||
|
||||
def get_document(self, data: dict):
|
||||
if self.task == "summarization":
|
||||
document = data["text"]
|
||||
elif self.task == "question_answering":
|
||||
document = data["news1"]
|
||||
elif self.task == "continuation":
|
||||
document = data["beginning"]
|
||||
elif self.task == "hallucinated_modified":
|
||||
document = data["newsBeginning"]
|
||||
else:
|
||||
raise NotImplementedError(
|
||||
f"Unknown task {self.task}, only support "
|
||||
"summarization, question_answering, continuation and hallucinated_modified."
|
||||
)
|
||||
return document
|
||||
|
||||
def get_template(self):
|
||||
if self.task == "summarization":
|
||||
template = CRUDTemplate.get_summarization_template()
|
||||
elif self.task == "question_answering":
|
||||
template = CRUDTemplate.get_question_answering_template()
|
||||
elif self.task == "continuation":
|
||||
template = CRUDTemplate.get_continuation_template()
|
||||
else:
|
||||
raise NotImplementedError(
|
||||
f"Unknown task {self.task}, only support "
|
||||
"summarization, question_answering, continuation and hallucinated_modified."
|
||||
)
|
||||
return template
|
||||
|
||||
def post_process(self, result):
|
||||
return result.split("<response>")[-1].split("</response>")[0].strip()
|
||||
|
||||
def get_ragas_metrics(self, results, arguments):
|
||||
from langchain_huggingface import HuggingFaceEndpointEmbeddings
|
||||
|
||||
embeddings = HuggingFaceEndpointEmbeddings(model=arguments.tei_embedding_endpoint)
|
||||
|
||||
metric = RagasMetric(
|
||||
threshold=0.5,
|
||||
model=arguments.llm_endpoint,
|
||||
embeddings=embeddings,
|
||||
metrics=["faithfulness", "answer_relevancy"],
|
||||
)
|
||||
|
||||
all_answer_relevancy = 0
|
||||
all_faithfulness = 0
|
||||
ragas_inputs = {
|
||||
"question": [],
|
||||
"answer": [],
|
||||
"ground_truth": [],
|
||||
"contexts": [],
|
||||
}
|
||||
|
||||
valid_results = self.remove_invalid(results["results"])
|
||||
|
||||
for data in tqdm(valid_results):
|
||||
data = data["original_data"]
|
||||
|
||||
query = self.get_query(data)
|
||||
generated_text = data["generated_text"]
|
||||
ground_truth = data["ground_truth_text"]
|
||||
retrieved_documents = data["retrieved_documents"]
|
||||
|
||||
ragas_inputs["question"].append(query)
|
||||
ragas_inputs["answer"].append(generated_text)
|
||||
ragas_inputs["ground_truth"].append(ground_truth)
|
||||
ragas_inputs["contexts"].append(retrieved_documents[:3])
|
||||
|
||||
ragas_metrics = metric.measure(ragas_inputs)
|
||||
return ragas_metrics
|
||||
|
||||
|
||||
def args_parser():
|
||||
parser = argparse.ArgumentParser()
|
||||
|
||||
parser.add_argument(
|
||||
"--service_url", type=str, default="http://localhost:8888/v1/chatqna", help="Service URL address."
|
||||
)
|
||||
parser.add_argument("--output_dir", type=str, default="./output", help="Directory to save evaluation results.")
|
||||
parser.add_argument(
|
||||
"--temperature", type=float, default=0.1, help="Controls the randomness of the model's text generation"
|
||||
)
|
||||
parser.add_argument(
|
||||
"--max_new_tokens", type=int, default=1280, help="Maximum number of new tokens to be generated by the model"
|
||||
)
|
||||
parser.add_argument(
|
||||
"--chunk_size", type=int, default=256, help="the maximum number of characters that a chunk can contain"
|
||||
)
|
||||
parser.add_argument(
|
||||
"--chunk_overlap",
|
||||
type=int,
|
||||
default=100,
|
||||
help="the number of characters that should overlap between two adjacent chunks",
|
||||
)
|
||||
parser.add_argument("--dataset_path", default="../data/split_merged.json", help="Path to the dataset")
|
||||
parser.add_argument("--docs_path", default="../data/80000_docs", help="Path to the retrieval documents")
|
||||
|
||||
# Retriever related options
|
||||
parser.add_argument("--tasks", default=["question_answering"], nargs="+", help="Task to perform")
|
||||
parser.add_argument("--ingest_docs", action="store_true", help="Whether to ingest documents to vector database")
|
||||
parser.add_argument(
|
||||
"--database_endpoint", type=str, default="http://localhost:6007/v1/dataprep", help="Service URL address."
|
||||
)
|
||||
parser.add_argument(
|
||||
"--embedding_endpoint", type=str, default="http://localhost:6000/v1/embeddings", help="Service URL address."
|
||||
)
|
||||
parser.add_argument(
|
||||
"--retrieval_endpoint", type=str, default="http://localhost:7000/v1/retrieval", help="Service URL address."
|
||||
)
|
||||
parser.add_argument(
|
||||
"--tei_embedding_endpoint",
|
||||
type=str,
|
||||
default="http://localhost:8090",
|
||||
help="Service URL address of tei embedding.",
|
||||
)
|
||||
parser.add_argument("--ragas_metrics", action="store_true", help="Whether to compute ragas metrics.")
|
||||
parser.add_argument("--llm_endpoint", type=str, default=None, help="Service URL address.")
|
||||
parser.add_argument(
|
||||
"--show_progress_bar", action="store", default=True, type=bool, help="Whether to show a progress bar"
|
||||
)
|
||||
parser.add_argument("--contain_original_data", action="store_true", help="Whether to contain original data")
|
||||
|
||||
args = parser.parse_args()
|
||||
return args
|
||||
|
||||
|
||||
def main():
|
||||
args = args_parser()
|
||||
if os.path.isfile(args.dataset_path):
|
||||
with open(args.dataset_path) as f:
|
||||
all_datasets = json.load(f)
|
||||
else:
|
||||
raise FileNotFoundError(f"Evaluation dataset file {args.dataset_path} not exist.")
|
||||
os.makedirs(args.output_dir, exist_ok=True)
|
||||
for task in args.tasks:
|
||||
if task == "question_answering":
|
||||
dataset = all_datasets["questanswer_1doc"]
|
||||
elif task == "summarization":
|
||||
dataset = all_datasets["event_summary"]
|
||||
else:
|
||||
raise NotImplementedError(
|
||||
f"Unknown task {task}, only support "
|
||||
"summarization, question_answering, continuation and hallucinated_modified."
|
||||
)
|
||||
output_save_path = os.path.join(args.output_dir, f"{task}.json")
|
||||
evaluator = CRUD_Evaluator(dataset=dataset, output_path=output_save_path, task=task)
|
||||
if args.ingest_docs:
|
||||
CRUD_Evaluator.ingest_docs(args.docs_path, args.database_endpoint, args.chunk_size, args.chunk_overlap)
|
||||
results = evaluator.evaluate(
|
||||
args, show_progress_bar=args.show_progress_bar, contain_original_data=args.contain_original_data
|
||||
)
|
||||
print(results["overall"])
|
||||
if args.ragas_metrics:
|
||||
ragas_metrics = evaluator.get_ragas_metrics(results, args)
|
||||
print(ragas_metrics)
|
||||
print(f"Evaluation results of task {task} saved to {output_save_path}.")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
279
ChatQnA/benchmark/accuracy/eval_multihop.py
Normal file
@@ -0,0 +1,279 @@
|
||||
#!/usr/bin/env python
|
||||
# -*- coding: utf-8 -*-
|
||||
# Copyright (C) 2024 Intel Corporation
|
||||
# SPDX-License-Identifier: Apache-2.0
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import os
|
||||
|
||||
import requests
|
||||
from evals.evaluation.rag_eval import Evaluator
|
||||
from evals.metrics.ragas import RagasMetric
|
||||
from evals.metrics.retrieval import RetrievalBaseMetric
|
||||
from tqdm import tqdm
|
||||
|
||||
|
||||
class MultiHop_Evaluator(Evaluator):
|
||||
def get_ground_truth_text(self, data: dict):
|
||||
return data["answer"]
|
||||
|
||||
def get_query(self, data: dict):
|
||||
return data["query"]
|
||||
|
||||
def get_template(self):
|
||||
return None
|
||||
|
||||
def get_reranked_documents(self, query, docs, arguments):
|
||||
data = {
|
||||
"initial_query": query,
|
||||
"retrieved_docs": [{"text": doc} for doc in docs],
|
||||
"top_n": 10,
|
||||
}
|
||||
headers = {"Content-Type": "application/json"}
|
||||
|
||||
response = requests.post(arguments.reranking_endpoint, data=json.dumps(data), headers=headers)
|
||||
if response.ok:
|
||||
reranked_documents = response.json()["documents"]
|
||||
return reranked_documents
|
||||
else:
|
||||
print(f"Request for retrieval failed due to {response.text}.")
|
||||
return []
|
||||
|
||||
def get_retrieved_documents(self, query, arguments):
|
||||
data = {"text": query}
|
||||
headers = {"Content-Type": "application/json"}
|
||||
response = requests.post(arguments.embedding_endpoint, data=json.dumps(data), headers=headers)
|
||||
if response.ok:
|
||||
embedding = response.json()["embedding"]
|
||||
else:
|
||||
print(f"Request for embedding failed due to {response.text}.")
|
||||
return []
|
||||
data = {
|
||||
"text": query,
|
||||
"embedding": embedding,
|
||||
"search_type": arguments.search_type,
|
||||
"k": arguments.retrival_k,
|
||||
"fetch_k": arguments.fetch_k,
|
||||
"lambda_mult": arguments.lambda_mult,
|
||||
}
|
||||
response = requests.post(arguments.retrieval_endpoint, data=json.dumps(data), headers=headers)
|
||||
if response.ok:
|
||||
retrieved_documents = response.json()["retrieved_docs"]
|
||||
return [doc["text"] for doc in retrieved_documents]
|
||||
else:
|
||||
print(f"Request for retrieval failed due to {response.text}.")
|
||||
return []
|
||||
|
||||
def get_retrieval_metrics(self, all_queries, arguments):
|
||||
print("start to retrieve...")
|
||||
metric = RetrievalBaseMetric()
|
||||
hits_at_10 = 0
|
||||
hits_at_4 = 0
|
||||
map_at_10 = 0
|
||||
mrr_at_10 = 0
|
||||
total = 0
|
||||
for data in tqdm(all_queries):
|
||||
if data["question_type"] == "null_query":
|
||||
continue
|
||||
query = data["query"]
|
||||
retrieved_documents = self.get_retrieved_documents(query, arguments)
|
||||
if arguments.rerank:
|
||||
retrieved_documents = self.get_reranked_documents(query, retrieved_documents, arguments)
|
||||
golden_context = [each["fact"] for each in data["evidence_list"]]
|
||||
test_case = {
|
||||
"input": query,
|
||||
"golden_context": golden_context,
|
||||
"retrieval_context": retrieved_documents,
|
||||
}
|
||||
results = metric.measure(test_case)
|
||||
hits_at_10 += results["Hits@10"]
|
||||
hits_at_4 += results["Hits@4"]
|
||||
map_at_10 += results["MAP@10"]
|
||||
mrr_at_10 += results["MRR@10"]
|
||||
total += 1
|
||||
|
||||
# Calculate average metrics over all queries
|
||||
hits_at_10 = hits_at_10 / total
|
||||
hits_at_4 = hits_at_4 / total
|
||||
map_at_10 = map_at_10 / total
|
||||
mrr_at_10 = mrr_at_10 / total
|
||||
|
||||
return {
|
||||
"Hits@10": hits_at_10,
|
||||
"Hits@4": hits_at_4,
|
||||
"MAP@10": map_at_10,
|
||||
"MRR@10": mrr_at_10,
|
||||
}
|
||||
|
||||
def evaluate(self, all_queries, arguments):
|
||||
results = []
|
||||
accuracy = 0
|
||||
index = 0
|
||||
for data in tqdm(all_queries):
|
||||
if data["question_type"] == "null_query":
|
||||
continue
|
||||
|
||||
generated_text = self.send_request(data, arguments)
|
||||
data["generated_text"] = generated_text
|
||||
|
||||
# same method with paper: https://github.com/yixuantt/MultiHop-RAG/issues/8
|
||||
if data["answer"] in generated_text:
|
||||
accuracy += 1
|
||||
result = {"id": index, **self.scoring(data)}
|
||||
results.append(result)
|
||||
index += 1
|
||||
|
||||
valid_results = self.remove_invalid(results)
|
||||
|
||||
try:
|
||||
overall = self.compute_overall(valid_results) if len(valid_results) > 0 else {}
|
||||
except Exception as e:
|
||||
print(repr(e))
|
||||
overall = dict()
|
||||
|
||||
overall.update({"accuracy": accuracy / len(results)})
|
||||
return overall
|
||||
|
||||
def get_ragas_metrics(self, all_queries, arguments):
|
||||
from langchain_huggingface import HuggingFaceEndpointEmbeddings
|
||||
|
||||
embeddings = HuggingFaceEndpointEmbeddings(model=arguments.tei_embedding_endpoint)
|
||||
|
||||
metric = RagasMetric(threshold=0.5, model=arguments.llm_endpoint, embeddings=embeddings)
|
||||
all_answer_relevancy = 0
|
||||
all_faithfulness = 0
|
||||
ragas_inputs = {
|
||||
"question": [],
|
||||
"answer": [],
|
||||
"ground_truth": [],
|
||||
"contexts": [],
|
||||
}
|
||||
|
||||
for data in tqdm(all_queries):
|
||||
if data["question_type"] == "null_query":
|
||||
continue
|
||||
retrieved_documents = self.get_retrieved_documents(data["query"], arguments)
|
||||
generated_text = self.send_request(data, arguments)
|
||||
data["generated_text"] = generated_text
|
||||
|
||||
ragas_inputs["question"].append(data["query"])
|
||||
ragas_inputs["answer"].append(generated_text)
|
||||
ragas_inputs["ground_truth"].append(data["answer"])
|
||||
ragas_inputs["contexts"].append(retrieved_documents[:3])
|
||||
|
||||
if len(ragas_inputs["question"]) >= arguments.limits:
|
||||
break
|
||||
|
||||
ragas_metrics = metric.measure(ragas_inputs)
|
||||
return ragas_metrics
|
||||
|
||||
|
||||
def args_parser():
|
||||
parser = argparse.ArgumentParser()
|
||||
|
||||
parser.add_argument(
|
||||
"--service_url", type=str, default="http://localhost:8888/v1/chatqna", help="Service URL address."
|
||||
)
|
||||
parser.add_argument("--output_dir", type=str, default="./output", help="Directory to save evaluation results.")
|
||||
parser.add_argument(
|
||||
"--temperature", type=float, default=0.1, help="Controls the randomness of the model's text generation"
|
||||
)
|
||||
parser.add_argument(
|
||||
"--max_new_tokens", type=int, default=1280, help="Maximum number of new tokens to be generated by the model"
|
||||
)
|
||||
parser.add_argument(
|
||||
"--chunk_size", type=int, default=256, help="the maximum number of characters that a chunk can contain"
|
||||
)
|
||||
parser.add_argument(
|
||||
"--chunk_overlap",
|
||||
type=int,
|
||||
default=100,
|
||||
help="the number of characters that should overlap between two adjacent chunks",
|
||||
)
|
||||
parser.add_argument("--search_type", type=str, default="similarity", help="similarity type")
|
||||
parser.add_argument("--retrival_k", type=int, default=10, help="Number of Documents to return.")
|
||||
parser.add_argument(
|
||||
"--fetch_k", type=int, default=20, help="Number of Documents to fetch to pass to MMR algorithm."
|
||||
)
|
||||
parser.add_argument(
|
||||
"--lambda_mult",
|
||||
type=float,
|
||||
default=0.5,
|
||||
help="Number between 0 and 1 that determines the degree of diversity among the results with 0 corresponding to maximum diversity and 1 to minimum diversity. Defaults to 0.5.",
|
||||
)
|
||||
parser.add_argument("--dataset_path", default=None, help="Path to the dataset")
|
||||
parser.add_argument("--docs_path", default=None, help="Path to the retrieval documents")
|
||||
|
||||
# Retriever related options
|
||||
parser.add_argument("--ingest_docs", action="store_true", help="Whether to ingest documents to vector database")
|
||||
parser.add_argument("--retrieval_metrics", action="store_true", help="Whether to compute retrieval metrics.")
|
||||
parser.add_argument("--ragas_metrics", action="store_true", help="Whether to compute ragas metrics.")
|
||||
parser.add_argument("--limits", type=int, default=100, help="Number of examples to be evaluated by llm-as-judge")
|
||||
parser.add_argument(
|
||||
"--database_endpoint", type=str, default="http://localhost:6007/v1/dataprep", help="Service URL address."
|
||||
)
|
||||
parser.add_argument(
|
||||
"--embedding_endpoint", type=str, default="http://localhost:6000/v1/embeddings", help="Service URL address."
|
||||
)
|
||||
parser.add_argument(
|
||||
"--tei_embedding_endpoint",
|
||||
type=str,
|
||||
default="http://localhost:8090",
|
||||
help="Service URL address of tei embedding.",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--retrieval_endpoint", type=str, default="http://localhost:7000/v1/retrieval", help="Service URL address."
|
||||
)
|
||||
parser.add_argument("--rerank", action="store_true", help="Whether to use rerank microservice.")
|
||||
parser.add_argument(
|
||||
"--reranking_endpoint", type=str, default="http://localhost:8000/v1/reranking", help="Service URL address."
|
||||
)
|
||||
parser.add_argument("--llm_endpoint", type=str, default=None, help="Service URL address.")
|
||||
parser.add_argument(
|
||||
"--show_progress_bar", action="store", default=True, type=bool, help="Whether to show a progress bar"
|
||||
)
|
||||
parser.add_argument("--contain_original_data", action="store_true", help="Whether to contain original data")
|
||||
|
||||
args = parser.parse_args()
|
||||
return args
|
||||
|
||||
|
||||
def main():
|
||||
args = args_parser()
|
||||
|
||||
evaluator = MultiHop_Evaluator()
|
||||
|
||||
with open(args.docs_path, "r") as file:
|
||||
doc_data = json.load(file)
|
||||
|
||||
documents = []
|
||||
for doc in doc_data:
|
||||
metadata = {"title": doc["title"], "published_at": doc["published_at"], "source": doc["source"]}
|
||||
documents.append(doc["body"])
|
||||
|
||||
# save docs to a tmp file
|
||||
tmp_corpus_file = "tmp_corpus.txt"
|
||||
with open(tmp_corpus_file, "w") as f:
|
||||
for doc in documents:
|
||||
f.write(doc + "\n")
|
||||
|
||||
if args.ingest_docs:
|
||||
evaluator.ingest_docs(tmp_corpus_file, args.database_endpoint, args.chunk_size, args.chunk_overlap)
|
||||
|
||||
with open(args.dataset_path, "r") as file:
|
||||
all_queries = json.load(file)
|
||||
|
||||
# get retrieval quality
|
||||
if args.retrieval_metrics:
|
||||
retrieval_metrics = evaluator.get_retrieval_metrics(all_queries, args)
|
||||
print(retrieval_metrics)
|
||||
|
||||
# get rag quality
|
||||
if args.ragas_metrics:
|
||||
ragas_metrics = evaluator.get_ragas_metrics(all_queries, args)
|
||||
print(ragas_metrics)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
9
ChatQnA/benchmark/accuracy/process_crud_dataset.py
Normal file
@@ -0,0 +1,9 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

import os

path = os.path.join(os.path.dirname(__file__), "./data/80000_docs")
for file in os.listdir(path):
    src_file = os.path.join(path, file)
    os.rename(src_file, src_file + ".txt")
64
ChatQnA/benchmark/accuracy/run_acc.sh
Normal file
@@ -0,0 +1,64 @@
|
||||
#!/bin/bash
|
||||
# Copyright (C) 2024 Intel Corporation
|
||||
# SPDX-License-Identifier: Apache-2.0
|
||||
|
||||
set -x
|
||||
|
||||
function main {
|
||||
|
||||
init_params "$@"
|
||||
# run_benchmark
|
||||
echo $dataset
|
||||
if [[ ${dataset} == "MultiHop" ]]; then
|
||||
run_multihop
|
||||
elif [[ ${dataset} == "crud" ]]; then
|
||||
run_crud
|
||||
fi
|
||||
|
||||
}
|
||||
|
||||
# init params
|
||||
function init_params {
|
||||
for var in "$@"
|
||||
do
|
||||
case $var in
|
||||
--dataset=*)
|
||||
dataset=$( echo $var |cut -f2 -d=)
|
||||
;;
|
||||
*)
|
||||
echo "Error: No such parameter: ${var}"
|
||||
exit 1
|
||||
;;
|
||||
esac
|
||||
done
|
||||
}
|
||||
|
||||
# run_multihop
|
||||
function run_multihop {
|
||||
git clone https://github.com/yixuantt/MultiHop-RAG.git
|
||||
|
||||
python eval_multihop.py \
|
||||
--docs_path MultiHop-RAG/dataset/corpus.json \
|
||||
--dataset_path MultiHop-RAG/dataset/MultiHopRAG.json \
|
||||
--ingest_docs \
|
||||
--retrieval_metrics
|
||||
|
||||
}
|
||||
|
||||
# run_crud
|
||||
function run_crud {
|
||||
|
||||
git clone https://github.com/IAAR-Shanghai/CRUD_RAG
|
||||
mkdir data/
|
||||
cp CRUD_RAG/data/crud_split/split_merged.json data/
|
||||
cp -r CRUD_RAG/data/80000_docs/ data/
|
||||
python process_crud_dataset.py
|
||||
|
||||
python eval_crud.py \
|
||||
--dataset_path ./data/split_merged.json \
|
||||
--docs_path ./data/80000_docs \
|
||||
--ingest_docs
|
||||
}
|
||||
|
||||
|
||||
main "$@"
|
||||
@@ -1,4 +1,4 @@
# CodeGen accuracy Evaluation
# CodeGen Accuracy

## Evaluation Framework

@@ -13,7 +13,7 @@ Please refer to [CodeGen Examples](https://github.com/opea-project/GenAIExamples
Use the `curl` command to test the codegen service and ensure that it has started properly:

```bash
export CODEGEN_ENDPOINT = "http://${your_ip}:7778/v1/codegen"
export CODEGEN_ENDPOINT="http://${your_ip}:7778/v1/codegen"
curl $CODEGEN_ENDPOINT \
  -H "Content-Type: application/json" \
  -d '{"messages": "Implement a high-level API for a TODO list application. The API takes as input an operation request and updates the TODO list in place. If the request is invalid, raise an exception."}'
@@ -24,7 +24,7 @@ curl $CODEGEN_ENDPOINT \

For evaluating models on coding tasks, or coding LLMs specifically, we follow the [bigcode-evaluation-harness](https://github.com/bigcode-project/bigcode-evaluation-harness) and provide both command-line and function-call usage. [HumanEval](https://huggingface.co/datasets/openai_humaneval), [HumanEval+](https://huggingface.co/datasets/evalplus/humanevalplus), [InstructHumanEval](https://huggingface.co/datasets/codeparrot/instructhumaneval), [APPS](https://huggingface.co/datasets/codeparrot/apps), [MBPP](https://huggingface.co/datasets/mbpp), [MBPP+](https://huggingface.co/datasets/evalplus/mbppplus), and [DS-1000](https://github.com/HKUNLP/DS-1000/) are available in both completion (left-to-right) and insertion (FIM) modes.

#### command line usage
#### Environment

```shell
git clone https://github.com/opea-project/GenAIEval
@@ -32,15 +32,14 @@ cd GenAIEval
pip install -r requirements.txt
pip install -e .

cd evals/evaluation/bigcode_evaluation_harness/examples
python main.py --model Qwen/CodeQwen1.5-7B-Chat \
    --tasks humaneval \
    --codegen_url $CODEGEN_ENDPOINT \
    --max_length_generation 2048 \
    --batch_size 1 \
    --save_generations \
    --save_references \
    --allow_code_execution
```

#### Evaluation

```
export CODEGEN_ENDPOINT="http://${your_ip}:7778/v1/codegen"
export CODEGEN_MODEL=your_model
bash run_acc.sh $CODEGEN_MODEL $CODEGEN_ENDPOINT
```

**_Note:_** Currently, our framework is designed to execute tasks in full. To ensure the accuracy of results, we advise against using the 'limit' or 'limit_start' parameters to restrict the number of test samples.
17
CodeGen/benchmark/accuracy/main.py
Normal file
@@ -0,0 +1,17 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

#
from evals.evaluation.bigcode_evaluation_harness import evaluate, setup_parser


def main():
    eval_args = setup_parser()
    results = evaluate(eval_args)
    print(results)


if __name__ == "__main__":
    main()
13
CodeGen/benchmark/accuracy/run_acc.sh
Normal file
@@ -0,0 +1,13 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

python main.py --model $1 \
    --tasks humaneval \
    --codegen_url $2 \
    --max_length_generation 2048 \
    --batch_size 1 \
    --save_generations \
    --save_references \
    --allow_code_execution
@@ -1,4 +1,4 @@
# FaqGen Evaluation
# FaqGen Accuracy

## Dataset
4
FaqGen/benchmark/accuracy/run_acc.sh
Normal file
@@ -0,0 +1,4 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

python evaluate.py