update examples accuracy (#941)

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
This commit is contained in:
lkk
2024-10-14 13:20:50 +08:00
committed by GitHub
parent 441f8cc6ba
commit 088ab98f31
12 changed files with 784 additions and 14 deletions

View File

@@ -1,4 +1,4 @@
# AudioQnA accuracy Evaluation
# AudioQnA Accuracy
AudioQnA is an example that demonstrates the integration of Generative AI (GenAI) models for performing question answering (QnA) on audio scenes, which involves Automatic Speech Recognition (ASR) and Text-to-Speech (TTS). The following is the pipeline for evaluating the ASR accuracy.
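
ASR accuracy is commonly reported as word error rate (WER), i.e. the word-level edit distance between the reference transcript and the ASR output, normalized by the reference length. Purely as an illustration of that idea (not necessarily the metric computed by this example's script), a minimal sketch:

```python
# Hypothetical illustration of word error rate (WER); not necessarily the
# metric implemented by this example's evaluation script.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,        # deletion
                dp[i][j - 1] + 1,        # insertion
                dp[i - 1][j - 1] + cost,  # substitution
            )
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)


print(wer("how are you today", "how are you to day"))  # 0.5 (2 errors / 4 reference words)
```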

View File

@@ -0,0 +1,5 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0
python online_evaluate.py

View File

@@ -0,0 +1,170 @@
# ChatQnA Accuracy
ChatQnA is a Retrieval-Augmented Generation (RAG) pipeline that enhances generative models with external information retrieval.
To evaluate its accuracy, we use two recently published datasets and more than ten popular, comprehensive metrics:
- Dataset
  - [MultiHop](https://arxiv.org/pdf/2401.15391) (English dataset)
  - [CRUD](https://arxiv.org/abs/2401.17043) (Chinese dataset)
- Metrics (measuring the accuracy of both context retrieval and response generation)
  - Evaluation of retrieval/reranking (see the sketch after this list)
    - MRR@10
    - MAP@10
    - Hits@10
    - Hits@4
    - LLM-as-a-Judge
  - Evaluation of the generated response from the end-to-end pipeline
    - BLEU
    - ROUGE(L)
    - LLM-as-a-Judge
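
To make the retrieval metrics concrete, here is a simplified, illustrative sketch (not the project's actual `evals.metrics.retrieval` implementation, which decides relevance in its own way): Hits@k checks whether any gold passage appears in the top-k retrieved chunks, MRR@10 takes the reciprocal rank of the first relevant chunk, and MAP@10 averages precision at each relevant position.

```python
# Simplified sketch of the retrieval metrics above; the project's own
# RetrievalBaseMetric may differ in details such as how relevance between
# a gold fact and a retrieved chunk is decided.
def retrieval_metrics(golden_context, retrieved_docs, k=10):
    topk = retrieved_docs[:k]
    # Here "relevant" simply means a gold fact is contained in the retrieved chunk.
    relevant = [any(gold in doc for gold in golden_context) for doc in topk]

    hits_at_k = 1.0 if any(relevant) else 0.0
    hits_at_4 = 1.0 if any(relevant[:4]) else 0.0

    # MRR@k: reciprocal rank of the first relevant document.
    mrr = next((1.0 / (i + 1) for i, r in enumerate(relevant) if r), 0.0)

    # MAP@k: mean of precision values at each relevant position.
    precisions = [sum(relevant[: i + 1]) / (i + 1) for i, r in enumerate(relevant) if r]
    map_at_k = sum(precisions) / len(precisions) if precisions else 0.0

    return {"Hits@10": hits_at_k, "Hits@4": hits_at_4, "MRR@10": mrr, "MAP@10": map_at_k}
```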
## Prerequisite
### Environment
```bash
git clone https://github.com/opea-project/GenAIEval
cd GenAIEval
pip install -r requirements.txt
pip install -e .
```
## MultiHop (English dataset)
[MultiHop-RAG](https://arxiv.org/pdf/2401.15391): a QA dataset to evaluate retrieval and reasoning across documents with metadata in the RAG pipelines. It contains 2556 queries, with evidence for each query distributed across 2 to 4 documents. The queries also involve document metadata, reflecting complex scenarios commonly found in real-world RAG applications.
### Launch Service of RAG System
Please refer to this [guide](https://github.com/opea-project/GenAIExamples/blob/main/ChatQnA/README.md) to launch the service of `ChatQnA`.
### Launch Service of LLM-as-a-Judge
To set up an LLM as the judge, we can use [tgi-gaudi](https://github.com/huggingface/tgi-gaudi) to launch a service. For example, the following command sets up the [mistralai/Mixtral-8x7B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) model on 2 Gaudi2 cards:
```bash
# please set your llm_port and hf_token
docker run -p {your_llm_port}:80 --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e PT_HPU_ENABLE_LAZY_COLLECTIVES=true -e OMPI_MCA_btl_vader_single_copy_mechanism=none -e HF_TOKEN={your_hf_token} --cap-add=sys_nice --ipc=host ghcr.io/huggingface/tgi-gaudi:2.0.1 --model-id mistralai/Mixtral-8x7B-Instruct-v0.1 --max-input-tokens 2048 --max-total-tokens 4096 --sharded true --num-shard 2
# for better performance, set `PREFILL_BATCH_BUCKET_SIZE`, `BATCH_BUCKET_SIZE`, `max-batch-total-tokens`, `max-batch-prefill-tokens`
docker run -p {your_llm_port}:80 --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e PT_HPU_ENABLE_LAZY_COLLECTIVES=true -e OMPI_MCA_btl_vader_single_copy_mechanism=none -e HF_TOKEN={your_hf_token} -e PREFILL_BATCH_BUCKET_SIZE=1 -e BATCH_BUCKET_SIZE=8 --cap-add=sys_nice --ipc=host ghcr.io/huggingface/tgi-gaudi:2.0.5 --model-id mistralai/Mixtral-8x7B-Instruct-v0.1 --max-input-tokens 2048 --max-total-tokens 4096 --sharded true --num-shard 2 --max-batch-total-tokens 65536 --max-batch-prefill-tokens 2048
```
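
Before running the evaluation, it is worth confirming that the judge endpoint responds. A minimal, hedged check against TGI's `/generate` route (the host, port, and prompt below are placeholders):

```python
# Minimal sanity check of the TGI /generate endpoint used as the LLM judge.
# Replace the host/port with your own {your_llm_port} value.
import requests

llm_endpoint = "http://localhost:8080/generate"  # placeholder URL
payload = {
    "inputs": "What is Deep Learning?",
    "parameters": {"max_new_tokens": 64, "temperature": 0.1},
}
resp = requests.post(llm_endpoint, json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["generated_text"])
```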
### Prepare Dataset
We use the evaluation dataset from the [MultiHop-RAG](https://github.com/yixuantt/MultiHop-RAG) repo. Use the command below to prepare the dataset.
```bash
git clone https://github.com/yixuantt/MultiHop-RAG.git
```
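
For orientation, `corpus.json` holds the documents to ingest and `MultiHopRAG.json` holds the queries with gold answers and evidence. A quick, hedged look at the query file (field names follow those read by `eval_multihop.py` later in this commit):

```python
# Quick look at the MultiHop-RAG evaluation file; field names mirror those
# accessed by eval_multihop.py (query, answer, question_type, evidence_list).
import json

with open("MultiHop-RAG/dataset/MultiHopRAG.json") as f:
    queries = json.load(f)

print(len(queries), "queries")
sample = queries[0]
print(sample["query"])
print(sample["answer"], sample["question_type"], len(sample["evidence_list"]))
```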
### Evaluation
Use the command below to run the evaluation. Note that for the first run, the `--ingest_docs` argument should be added to ingest the documents into the vector database; for subsequent runs it should be omitted. Set `--retrieval_metrics` to get retrieval-related metrics (MRR@10/MAP@10/Hits@10/Hits@4). Set `--ragas_metrics` and `--llm_endpoint` to get end-to-end RAG pipeline metrics (faithfulness/answer_relevancy/...), which are judged by LLMs; a sketch of how these metrics are driven is given at the end of this section. `--limits` defaults to 100, meaning only 100 examples are evaluated by LLM-as-a-Judge, since that step is very time-consuming.
If you are using Docker Compose to deploy the `ChatQnA` system, you can simply run the evaluation as follows:
```bash
python eval_multihop.py --docs_path MultiHop-RAG/dataset/corpus.json --dataset_path MultiHop-RAG/dataset/MultiHopRAG.json --ingest_docs --retrieval_metrics --ragas_metrics --llm_endpoint http://{llm_as_judge_ip}:{llm_as_judge_port}/generate
```
If you are using Kubernetes manifests/Helm to deploy the `ChatQnA` system, you must specify additional arguments as follows:
```bash
python eval_multihop.py --docs_path MultiHop-RAG/dataset/corpus.json --dataset_path MultiHop-RAG/dataset/MultiHopRAG.json --ingest_docs --retrieval_metrics --ragas_metrics --llm_endpoint http://{llm_as_judge_ip}:{llm_as_judge_port}/generate --database_endpoint http://{your_dataprep_ip}:{your_dataprep_port}/v1/dataprep --embedding_endpoint http://{your_embedding_ip}:{your_embedding_port}/v1/embeddings --tei_embedding_endpoint http://{your_tei_embedding_ip}:{your_tei_embedding_port} --retrieval_endpoint http://{your_retrieval_ip}:{your_retrieval_port}/v1/retrieval --service_url http://{your_chatqna_ip}:{your_chatqna_port}/v1/chatqna
```
The default values for arguments are:
|Argument|Default value|
|--------|-------------|
|service_url|http://localhost:8888/v1/chatqna|
|database_endpoint|http://localhost:6007/v1/dataprep|
|embedding_endpoint|http://localhost:6000/v1/embeddings|
|tei_embedding_endpoint|http://localhost:8090|
|retrieval_endpoint|http://localhost:7000/v1/retrieval|
|reranking_endpoint|http://localhost:8000/v1/reranking|
|output_dir|./output|
|temperature|0.1|
|max_new_tokens|1280|
|chunk_size|256|
|chunk_overlap|100|
|search_type|similarity|
|retrival_k|10|
|fetch_k|20|
|lambda_mult|0.5|
|dataset_path|None|
|docs_path|None|
|limits|100|
You can check the details of all arguments with the command below:
```bash
python eval_multihop.py --help
```
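
Under the hood, the end-to-end metrics are produced by `RagasMetric` from `evals.metrics.ragas`, which takes batched questions, generated answers, ground truths, and retrieved contexts. A condensed sketch, adapted from the `get_ragas_metrics` methods in the evaluation scripts later in this commit (the endpoint URLs and sample strings are placeholders):

```python
# Condensed from the get_ragas_metrics() methods in eval_multihop.py /
# eval_crud.py shown later in this commit; endpoints and strings are placeholders.
from langchain_huggingface import HuggingFaceEndpointEmbeddings
from evals.metrics.ragas import RagasMetric

embeddings = HuggingFaceEndpointEmbeddings(model="http://localhost:8090")  # TEI embedding endpoint
metric = RagasMetric(
    threshold=0.5,
    model="http://localhost:8085",  # LLM-as-a-Judge endpoint
    embeddings=embeddings,
    metrics=["faithfulness", "answer_relevancy"],
)
ragas_inputs = {
    "question": ["placeholder question"],
    "answer": ["placeholder generated answer"],
    "ground_truth": ["placeholder reference answer"],
    "contexts": [["placeholder retrieved context"]],
}
print(metric.measure(ragas_inputs))
```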
## CRUD (Chinese dataset)
[CRUD-RAG](https://arxiv.org/abs/2401.17043) is a Chinese benchmark for RAG (Retrieval-Augmented Generation) systems. This example utilizes CRUD-RAG to evaluate the RAG system.
### Prepare Dataset
We use the evaluation dataset from the [CRUD-RAG](https://github.com/IAAR-Shanghai/CRUD_RAG) repo. Use the commands below to prepare the dataset.
```bash
git clone https://github.com/IAAR-Shanghai/CRUD_RAG
mkdir data/
cp CRUD_RAG/data/crud_split/split_merged.json data/
cp -r CRUD_RAG/data/80000_docs/ data/
python process_crud_dataset.py
```
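
The prepared `split_merged.json` file groups examples by task; `eval_crud.py` (later in this commit) reads the `questanswer_1doc` split for question answering and `event_summary` for summarization. A quick, hedged peek at the structure:

```python
# Peek at the prepared CRUD evaluation data; split names mirror those read
# by eval_crud.py ("questanswer_1doc" for QA, "event_summary" for summarization).
import json

with open("data/split_merged.json") as f:
    all_datasets = json.load(f)

print(list(all_datasets.keys()))
qa_split = all_datasets["questanswer_1doc"]
print(type(qa_split), len(qa_split))
# Per eval_crud.py, each QA example carries "questions", "answers" and "news1" fields.
```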
### Launch Service of RAG System
Please refer to this [guide](https://github.com/opea-project/GenAIExamples/blob/main/ChatQnA/README.md) to launch the `ChatQnA` service. For the Chinese dataset, you should replace the English embedding and LLM models with Chinese ones, for example `EMBEDDING_MODEL_ID="BAAI/bge-base-zh-v1.5"` and `LLM_MODEL_ID=Qwen/Qwen2-7B-Instruct`.
### Evaluation
Use the command below to run the evaluation. Note that for the first run, the `--ingest_docs` argument should be added to ingest the documents into the vector database; for subsequent runs it should be omitted.
If you are using Docker Compose to deploy the `ChatQnA` system, you can simply run the evaluation as follows:
```bash
python eval_crud.py --dataset_path ./data/split_merged.json --docs_path ./data/80000_docs --ingest_docs
# if you want to get ragas metrics
python eval_crud.py --dataset_path ./data/split_merged.json --docs_path ./data/80000_docs --contain_original_data --llm_endpoint "http://{llm_as_judge_ip}:{llm_as_judge_port}" --ragas_metrics
```
If you are using Kubernetes manifests/Helm to deploy the `ChatQnA` system, you must specify additional arguments as follows:
```bash
python eval_crud.py --dataset_path ./data/split_merged.json --docs_path ./data/80000_docs --ingest_docs --database_endpoint http://{your_dataprep_ip}:{your_dataprep_port}/v1/dataprep --embedding_endpoint http://{your_embedding_ip}:{your_embedding_port}/v1/embeddings --retrieval_endpoint http://{your_retrieval_ip}:{your_retrieval_port}/v1/retrieval --service_url http://{your_chatqna_ip}:{your_chatqna_port}/v1/chatqna
```
The default values for arguments are:
|Argument|Default value|
|--------|-------------|
|service_url|http://localhost:8888/v1/chatqna|
|database_endpoint|http://localhost:6007/v1/dataprep|
|embedding_endpoint|http://localhost:6000/v1/embeddings|
|retrieval_endpoint|http://localhost:7000/v1/retrieval|
|reranking_endpoint|http://localhost:8000/v1/reranking|
|output_dir|./output|
|temperature|0.1|
|max_new_tokens|1280|
|chunk_size|256|
|chunk_overlap|100|
|dataset_path|./data/split_merged.json|
|docs_path|./data/80000_docs|
|tasks|["question_answering"]|
You can check the details of all arguments with the command below:
```bash
python eval_crud.py --help
```
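
Results are written under `--output_dir` (default `./output`), one JSON file per task. A hedged sketch for inspecting the saved question-answering results (the exact file schema is defined by the evaluator base class; the keys below mirror the in-memory result dict used in `eval_crud.py`):

```python
# Inspect the saved CRUD evaluation output; eval_crud.py writes one
# {task}.json file per task under --output_dir. The keys below are an
# assumption based on how results are accessed in eval_crud.py.
import json

with open("output/question_answering.json") as f:
    results = json.load(f)

print(results["overall"])       # aggregate scores printed during the run
print(len(results["results"]))  # per-example records (schema set by the evaluator)
```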
## Acknowledgements
This example is mostly adapted from the [MultiHop-RAG](https://github.com/yixuantt/MultiHop-RAG) and [CRUD-RAG](https://github.com/IAAR-Shanghai/CRUD_RAG) repos. We thank the authors for their great work!

View File

@@ -0,0 +1,210 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0
import argparse
import json
import os

from evals.evaluation.rag_eval import Evaluator
from evals.evaluation.rag_eval.template import CRUDTemplate
from evals.metrics.ragas import RagasMetric
from tqdm import tqdm


class CRUD_Evaluator(Evaluator):
    def get_ground_truth_text(self, data: dict):
        if self.task == "summarization":
            ground_truth_text = data["summary"]
        elif self.task == "question_answering":
            ground_truth_text = data["answers"]
        elif self.task == "continuation":
            ground_truth_text = data["continuing"]
        elif self.task == "hallucinated_modified":
            ground_truth_text = data["hallucinatedMod"]
        else:
            raise NotImplementedError(
                f"Unknown task {self.task}, only support "
                "summarization, question_answering, continuation and hallucinated_modified."
            )
        return ground_truth_text

    def get_query(self, data: dict):
        if self.task == "summarization":
            query = data["text"]
        elif self.task == "question_answering":
            query = data["questions"]
        elif self.task == "continuation":
            query = data["beginning"]
        elif self.task == "hallucinated_modified":
            query = data["newsBeginning"]
        else:
            raise NotImplementedError(
                f"Unknown task {self.task}, only support "
                "summarization, question_answering, continuation and hallucinated_modified."
            )
        return query

    def get_document(self, data: dict):
        if self.task == "summarization":
            document = data["text"]
        elif self.task == "question_answering":
            document = data["news1"]
        elif self.task == "continuation":
            document = data["beginning"]
        elif self.task == "hallucinated_modified":
            document = data["newsBeginning"]
        else:
            raise NotImplementedError(
                f"Unknown task {self.task}, only support "
                "summarization, question_answering, continuation and hallucinated_modified."
            )
        return document

    def get_template(self):
        if self.task == "summarization":
            template = CRUDTemplate.get_summarization_template()
        elif self.task == "question_answering":
            template = CRUDTemplate.get_question_answering_template()
        elif self.task == "continuation":
            template = CRUDTemplate.get_continuation_template()
        else:
            raise NotImplementedError(
                f"Unknown task {self.task}, only support "
                "summarization, question_answering, continuation and hallucinated_modified."
            )
        return template

    def post_process(self, result):
        return result.split("<response>")[-1].split("</response>")[0].strip()

    def get_ragas_metrics(self, results, arguments):
        from langchain_huggingface import HuggingFaceEndpointEmbeddings

        embeddings = HuggingFaceEndpointEmbeddings(model=arguments.tei_embedding_endpoint)
        metric = RagasMetric(
            threshold=0.5,
            model=arguments.llm_endpoint,
            embeddings=embeddings,
            metrics=["faithfulness", "answer_relevancy"],
        )
        all_answer_relevancy = 0
        all_faithfulness = 0
        ragas_inputs = {
            "question": [],
            "answer": [],
            "ground_truth": [],
            "contexts": [],
        }
        valid_results = self.remove_invalid(results["results"])
        for data in tqdm(valid_results):
            data = data["original_data"]
            query = self.get_query(data)
            generated_text = data["generated_text"]
            ground_truth = data["ground_truth_text"]
            retrieved_documents = data["retrieved_documents"]
            ragas_inputs["question"].append(query)
            ragas_inputs["answer"].append(generated_text)
            ragas_inputs["ground_truth"].append(ground_truth)
            ragas_inputs["contexts"].append(retrieved_documents[:3])
        ragas_metrics = metric.measure(ragas_inputs)
        return ragas_metrics


def args_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--service_url", type=str, default="http://localhost:8888/v1/chatqna", help="Service URL address."
    )
    parser.add_argument("--output_dir", type=str, default="./output", help="Directory to save evaluation results.")
    parser.add_argument(
        "--temperature", type=float, default=0.1, help="Controls the randomness of the model's text generation"
    )
    parser.add_argument(
        "--max_new_tokens", type=int, default=1280, help="Maximum number of new tokens to be generated by the model"
    )
    parser.add_argument(
        "--chunk_size", type=int, default=256, help="the maximum number of characters that a chunk can contain"
    )
    parser.add_argument(
        "--chunk_overlap",
        type=int,
        default=100,
        help="the number of characters that should overlap between two adjacent chunks",
    )
    parser.add_argument("--dataset_path", default="../data/split_merged.json", help="Path to the dataset")
    parser.add_argument("--docs_path", default="../data/80000_docs", help="Path to the retrieval documents")

    # Retriever related options
    parser.add_argument("--tasks", default=["question_answering"], nargs="+", help="Task to perform")
    parser.add_argument("--ingest_docs", action="store_true", help="Whether to ingest documents to vector database")
    parser.add_argument(
        "--database_endpoint", type=str, default="http://localhost:6007/v1/dataprep", help="Service URL address."
    )
    parser.add_argument(
        "--embedding_endpoint", type=str, default="http://localhost:6000/v1/embeddings", help="Service URL address."
    )
    parser.add_argument(
        "--retrieval_endpoint", type=str, default="http://localhost:7000/v1/retrieval", help="Service URL address."
    )
    parser.add_argument(
        "--tei_embedding_endpoint",
        type=str,
        default="http://localhost:8090",
        help="Service URL address of tei embedding.",
    )
    parser.add_argument("--ragas_metrics", action="store_true", help="Whether to compute ragas metrics.")
    parser.add_argument("--llm_endpoint", type=str, default=None, help="Service URL address.")
    parser.add_argument(
        "--show_progress_bar", action="store", default=True, type=bool, help="Whether to show a progress bar"
    )
    parser.add_argument("--contain_original_data", action="store_true", help="Whether to contain original data")
    args = parser.parse_args()
    return args


def main():
    args = args_parser()
    if os.path.isfile(args.dataset_path):
        with open(args.dataset_path) as f:
            all_datasets = json.load(f)
    else:
        raise FileNotFoundError(f"Evaluation dataset file {args.dataset_path} not exist.")
    os.makedirs(args.output_dir, exist_ok=True)
    for task in args.tasks:
        if task == "question_answering":
            dataset = all_datasets["questanswer_1doc"]
        elif task == "summarization":
            dataset = all_datasets["event_summary"]
        else:
            raise NotImplementedError(
                f"Unknown task {task}, only support "
                "summarization, question_answering, continuation and hallucinated_modified."
            )
        output_save_path = os.path.join(args.output_dir, f"{task}.json")
        evaluator = CRUD_Evaluator(dataset=dataset, output_path=output_save_path, task=task)
        if args.ingest_docs:
            CRUD_Evaluator.ingest_docs(args.docs_path, args.database_endpoint, args.chunk_size, args.chunk_overlap)
        results = evaluator.evaluate(
            args, show_progress_bar=args.show_progress_bar, contain_original_data=args.contain_original_data
        )
        print(results["overall"])
        if args.ragas_metrics:
            ragas_metrics = evaluator.get_ragas_metrics(results, args)
            print(ragas_metrics)
        print(f"Evaluation results of task {task} saved to {output_save_path}.")


if __name__ == "__main__":
    main()

View File

@@ -0,0 +1,279 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0
import argparse
import json
import os

import requests
from evals.evaluation.rag_eval import Evaluator
from evals.metrics.ragas import RagasMetric
from evals.metrics.retrieval import RetrievalBaseMetric
from tqdm import tqdm


class MultiHop_Evaluator(Evaluator):
    def get_ground_truth_text(self, data: dict):
        return data["answer"]

    def get_query(self, data: dict):
        return data["query"]

    def get_template(self):
        return None

    def get_reranked_documents(self, query, docs, arguments):
        data = {
            "initial_query": query,
            "retrieved_docs": [{"text": doc} for doc in docs],
            "top_n": 10,
        }
        headers = {"Content-Type": "application/json"}
        response = requests.post(arguments.reranking_endpoint, data=json.dumps(data), headers=headers)
        if response.ok:
            reranked_documents = response.json()["documents"]
            return reranked_documents
        else:
            print(f"Request for retrieval failed due to {response.text}.")
            return []

    def get_retrieved_documents(self, query, arguments):
        data = {"text": query}
        headers = {"Content-Type": "application/json"}
        response = requests.post(arguments.embedding_endpoint, data=json.dumps(data), headers=headers)
        if response.ok:
            embedding = response.json()["embedding"]
        else:
            print(f"Request for embedding failed due to {response.text}.")
            return []
        data = {
            "text": query,
            "embedding": embedding,
            "search_type": arguments.search_type,
            "k": arguments.retrival_k,
            "fetch_k": arguments.fetch_k,
            "lambda_mult": arguments.lambda_mult,
        }
        response = requests.post(arguments.retrieval_endpoint, data=json.dumps(data), headers=headers)
        if response.ok:
            retrieved_documents = response.json()["retrieved_docs"]
            return [doc["text"] for doc in retrieved_documents]
        else:
            print(f"Request for retrieval failed due to {response.text}.")
            return []

    def get_retrieval_metrics(self, all_queries, arguments):
        print("start to retrieve...")
        metric = RetrievalBaseMetric()
        hits_at_10 = 0
        hits_at_4 = 0
        map_at_10 = 0
        mrr_at_10 = 0
        total = 0
        for data in tqdm(all_queries):
            if data["question_type"] == "null_query":
                continue
            query = data["query"]
            retrieved_documents = self.get_retrieved_documents(query, arguments)
            if arguments.rerank:
                retrieved_documents = self.get_reranked_documents(query, retrieved_documents, arguments)
            golden_context = [each["fact"] for each in data["evidence_list"]]
            test_case = {
                "input": query,
                "golden_context": golden_context,
                "retrieval_context": retrieved_documents,
            }
            results = metric.measure(test_case)
            hits_at_10 += results["Hits@10"]
            hits_at_4 += results["Hits@4"]
            map_at_10 += results["MAP@10"]
            mrr_at_10 += results["MRR@10"]
            total += 1

        # Calculate average metrics over all queries
        hits_at_10 = hits_at_10 / total
        hits_at_4 = hits_at_4 / total
        map_at_10 = map_at_10 / total
        mrr_at_10 = mrr_at_10 / total
        return {
            "Hits@10": hits_at_10,
            "Hits@4": hits_at_4,
            "MAP@10": map_at_10,
            "MRR@10": mrr_at_10,
        }

    def evaluate(self, all_queries, arguments):
        results = []
        accuracy = 0
        index = 0
        for data in tqdm(all_queries):
            if data["question_type"] == "null_query":
                continue
            generated_text = self.send_request(data, arguments)
            data["generated_text"] = generated_text
            # same method with paper: https://github.com/yixuantt/MultiHop-RAG/issues/8
            if data["answer"] in generated_text:
                accuracy += 1
            result = {"id": index, **self.scoring(data)}
            results.append(result)
            index += 1
        valid_results = self.remove_invalid(results)
        try:
            overall = self.compute_overall(valid_results) if len(valid_results) > 0 else {}
        except Exception as e:
            print(repr(e))
            overall = dict()
        overall.update({"accuracy": accuracy / len(results)})
        return overall

    def get_ragas_metrics(self, all_queries, arguments):
        from langchain_huggingface import HuggingFaceEndpointEmbeddings

        embeddings = HuggingFaceEndpointEmbeddings(model=arguments.tei_embedding_endpoint)
        metric = RagasMetric(threshold=0.5, model=arguments.llm_endpoint, embeddings=embeddings)
        all_answer_relevancy = 0
        all_faithfulness = 0
        ragas_inputs = {
            "question": [],
            "answer": [],
            "ground_truth": [],
            "contexts": [],
        }
        for data in tqdm(all_queries):
            if data["question_type"] == "null_query":
                continue
            retrieved_documents = self.get_retrieved_documents(data["query"], arguments)
            generated_text = self.send_request(data, arguments)
            data["generated_text"] = generated_text
            ragas_inputs["question"].append(data["query"])
            ragas_inputs["answer"].append(generated_text)
            ragas_inputs["ground_truth"].append(data["answer"])
            ragas_inputs["contexts"].append(retrieved_documents[:3])
            if len(ragas_inputs["question"]) >= arguments.limits:
                break
        ragas_metrics = metric.measure(ragas_inputs)
        return ragas_metrics


def args_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--service_url", type=str, default="http://localhost:8888/v1/chatqna", help="Service URL address."
    )
    parser.add_argument("--output_dir", type=str, default="./output", help="Directory to save evaluation results.")
    parser.add_argument(
        "--temperature", type=float, default=0.1, help="Controls the randomness of the model's text generation"
    )
    parser.add_argument(
        "--max_new_tokens", type=int, default=1280, help="Maximum number of new tokens to be generated by the model"
    )
    parser.add_argument(
        "--chunk_size", type=int, default=256, help="the maximum number of characters that a chunk can contain"
    )
    parser.add_argument(
        "--chunk_overlap",
        type=int,
        default=100,
        help="the number of characters that should overlap between two adjacent chunks",
    )
    parser.add_argument("--search_type", type=str, default="similarity", help="similarity type")
    parser.add_argument("--retrival_k", type=int, default=10, help="Number of Documents to return.")
    parser.add_argument(
        "--fetch_k", type=int, default=20, help="Number of Documents to fetch to pass to MMR algorithm."
    )
    parser.add_argument(
        "--lambda_mult",
        type=float,
        default=0.5,
        help="Number between 0 and 1 that determines the degree of diversity among the results with 0 corresponding to maximum diversity and 1 to minimum diversity. Defaults to 0.5.",
    )
    parser.add_argument("--dataset_path", default=None, help="Path to the dataset")
    parser.add_argument("--docs_path", default=None, help="Path to the retrieval documents")

    # Retriever related options
    parser.add_argument("--ingest_docs", action="store_true", help="Whether to ingest documents to vector database")
    parser.add_argument("--retrieval_metrics", action="store_true", help="Whether to compute retrieval metrics.")
    parser.add_argument("--ragas_metrics", action="store_true", help="Whether to compute ragas metrics.")
    parser.add_argument("--limits", type=int, default=100, help="Number of examples to be evaluated by llm-as-judge")
    parser.add_argument(
        "--database_endpoint", type=str, default="http://localhost:6007/v1/dataprep", help="Service URL address."
    )
    parser.add_argument(
        "--embedding_endpoint", type=str, default="http://localhost:6000/v1/embeddings", help="Service URL address."
    )
    parser.add_argument(
        "--tei_embedding_endpoint",
        type=str,
        default="http://localhost:8090",
        help="Service URL address of tei embedding.",
    )
    parser.add_argument(
        "--retrieval_endpoint", type=str, default="http://localhost:7000/v1/retrieval", help="Service URL address."
    )
    parser.add_argument("--rerank", action="store_true", help="Whether to use rerank microservice.")
    parser.add_argument(
        "--reranking_endpoint", type=str, default="http://localhost:8000/v1/reranking", help="Service URL address."
    )
    parser.add_argument("--llm_endpoint", type=str, default=None, help="Service URL address.")
    parser.add_argument(
        "--show_progress_bar", action="store", default=True, type=bool, help="Whether to show a progress bar"
    )
    parser.add_argument("--contain_original_data", action="store_true", help="Whether to contain original data")
    args = parser.parse_args()
    return args


def main():
    args = args_parser()
    evaluator = MultiHop_Evaluator()

    with open(args.docs_path, "r") as file:
        doc_data = json.load(file)
    documents = []
    for doc in doc_data:
        metadata = {"title": doc["title"], "published_at": doc["published_at"], "source": doc["source"]}
        documents.append(doc["body"])

    # save docs to a tmp file
    tmp_corpus_file = "tmp_corpus.txt"
    with open(tmp_corpus_file, "w") as f:
        for doc in documents:
            f.write(doc + "\n")
    if args.ingest_docs:
        evaluator.ingest_docs(tmp_corpus_file, args.database_endpoint, args.chunk_size, args.chunk_overlap)

    with open(args.dataset_path, "r") as file:
        all_queries = json.load(file)

    # get retrieval quality
    if args.retrieval_metrics:
        retrieval_metrics = evaluator.get_retrieval_metrics(all_queries, args)
        print(retrieval_metrics)

    # get rag quality
    if args.ragas_metrics:
        ragas_metrics = evaluator.get_ragas_metrics(all_queries, args)
        print(ragas_metrics)


if __name__ == "__main__":
    main()

View File

@@ -0,0 +1,9 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0
import os

# Rename every document in the CRUD corpus so the files carry a .txt extension.
path = os.path.join(os.path.dirname(__file__), "./data/80000_docs")
for file in os.listdir(path):
    src_file = os.path.join(path, file)
    os.rename(src_file, src_file + ".txt")

View File

@@ -0,0 +1,64 @@
#!/bin/bash
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0
set -x
function main {
    init_params "$@"
    # run_benchmark
    echo $dataset
    if [[ ${dataset} == "MultiHop" ]]; then
        run_multihop
    elif [[ ${dataset} == "crud" ]]; then
        run_crud
    fi
}

# init params
function init_params {
    for var in "$@"
    do
        case $var in
            --dataset=*)
                dataset=$( echo $var |cut -f2 -d=)
                ;;
            *)
                echo "Error: No such parameter: ${var}"
                exit 1
                ;;
        esac
    done
}

# run_multihop
function run_multihop {
    git clone https://github.com/yixuantt/MultiHop-RAG.git
    python eval_multihop.py \
        --docs_path MultiHop-RAG/dataset/corpus.json \
        --dataset_path MultiHop-RAG/dataset/MultiHopRAG.json \
        --ingest_docs \
        --retrieval_metrics
}

# run_crud
function run_crud {
    git clone https://github.com/IAAR-Shanghai/CRUD_RAG
    mkdir data/
    cp CRUD_RAG/data/crud_split/split_merged.json data/
    cp -r CRUD_RAG/data/80000_docs/ data/
    python process_crud_dataset.py
    python eval_crud.py \
        --dataset_path ./data/split_merged.json \
        --docs_path ./data/80000_docs \
        --ingest_docs
}

main "$@"

View File

@@ -1,4 +1,4 @@
# CodeGen accuracy Evaluation
# CodeGen Accuracy
## Evaluation Framework
@@ -13,7 +13,7 @@ Please refer to [CodeGen Examples](https://github.com/opea-project/GenAIExamples
Use the `curl` command to test the CodeGen service and ensure that it has started properly
```bash
export CODEGEN_ENDPOINT = "http://${your_ip}:7778/v1/codegen"
export CODEGEN_ENDPOINT="http://${your_ip}:7778/v1/codegen"
curl $CODEGEN_ENDPOINT \
-H "Content-Type: application/json" \
-d '{"messages": "Implement a high-level API for a TODO list application. The API takes as input an operation request and updates the TODO list in place. If the request is invalid, raise an exception."}'
@@ -24,7 +24,7 @@ curl $CODEGEN_ENDPOINT \
For evaluating models on coding tasks, or specifically coding LLMs, we follow the [bigcode-evaluation-harness](https://github.com/bigcode-project/bigcode-evaluation-harness) and provide both command-line and function-call usage. [HumanEval](https://huggingface.co/datasets/openai_humaneval), [HumanEval+](https://huggingface.co/datasets/evalplus/humanevalplus), [InstructHumanEval](https://huggingface.co/datasets/codeparrot/instructhumaneval), [APPS](https://huggingface.co/datasets/codeparrot/apps), [MBPP](https://huggingface.co/datasets/mbpp), [MBPP+](https://huggingface.co/datasets/evalplus/mbppplus), and [DS-1000](https://github.com/HKUNLP/DS-1000/) are available for both completion (left-to-right) and insertion (FIM) modes.
#### command line usage
#### Environment
```shell
git clone https://github.com/opea-project/GenAIEval
@@ -32,15 +32,14 @@ cd GenAIEval
pip install -r requirements.txt
pip install -e .
cd evals/evaluation/bigcode_evaluation_harness/examples
python main.py --model Qwen/CodeQwen1.5-7B-Chat \
--tasks humaneval \
--codegen_url $CODEGEN_ENDPOINT \
--max_length_generation 2048 \
--batch_size 1 \
--save_generations \
--save_references \
--allow_code_execution
```
#### Evaluation
```
export CODEGEN_ENDPOINT="http://${your_ip}:7778/v1/codegen"
export CODEGEN_MODEL=your_model
bash run_acc.sh $CODEGEN_MODEL $CODEGEN_ENDPOINT
```
**_Note:_** Currently, our framework is designed to execute tasks in full. To ensure the accuracy of results, we advise against using the 'limit' or 'limit_start' parameters to restrict the number of test samples.
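
For reference, the same smoke test as the `curl` command above can be issued from Python (a minimal sketch; the endpoint URL is a placeholder and the payload mirrors the `curl` example):

```python
# Python equivalent of the curl smoke test for the CodeGen service above;
# replace the host/port with your own CODEGEN_ENDPOINT value.
import requests

codegen_endpoint = "http://localhost:7778/v1/codegen"  # placeholder
payload = {"messages": "Implement a high-level API for a TODO list application."}
resp = requests.post(codegen_endpoint, json=payload, timeout=300)
resp.raise_for_status()
print(resp.text)  # raw response body; the service may return streamed chunks
```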

View File

@@ -0,0 +1,17 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0
#
from evals.evaluation.bigcode_evaluation_harness import evaluate, setup_parser
def main():
    eval_args = setup_parser()
    results = evaluate(eval_args)
    print(results)


if __name__ == "__main__":
    main()

View File

@@ -0,0 +1,13 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0
python main.py --model $1 \
    --tasks humaneval \
    --codegen_url $2 \
    --max_length_generation 2048 \
    --batch_size 1 \
    --save_generations \
    --save_references \
    --allow_code_execution

View File

@@ -1,4 +1,4 @@
# FaqGen Evaluation
# FaqGen Accuracy
## Dataset

View File

@@ -0,0 +1,4 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0
python evaluate.py