
FaqGen Accuracy

Dataset

We evaluate performance on the QA dataset Squad_v2, generating FAQs from the "context" column of the validation split, which contains 1204 unique records.

First, download the dataset and place it under "./data".

Extract the unique "context" entries, which will be saved to 'data/sqv2_context.json':

python get_context.py
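
For reference, here is a minimal sketch of this extraction step, assuming the dataset is loaded with the Hugging Face datasets library; the actual get_context.py may read the files under "./data" directly, and the output schema of 'data/sqv2_context.json' is an assumption here.

# Hypothetical sketch of the context-extraction step; the real
# get_context.py may differ in how it loads the data and formats the output.
import json
from datasets import load_dataset

# Load the Squad_v2 validation split (~11.9k rows, 1204 unique contexts).
data = load_dataset("squad_v2", split="validation")

# Deduplicate the "context" column while preserving row order.
unique_contexts = list(dict.fromkeys(data["context"]))

with open("data/sqv2_context.json", "w") as f:
    json.dump(unique_contexts, f, indent=2)
print(f"Saved {len(unique_contexts)} unique contexts")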

Generate FAQs

Launch FaqGen microservice

Please refer to the FaqGen microservice documentation to set up a microservice endpoint.

export FAQ_ENDPOINT="http://${your_ip}:9000/v1/faqgen"

Generate FAQs with microservice

Use the microservice endpoint to generate FAQs for the dataset.

python generate_FAQ.py
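
Conceptually, this step posts each extracted context to the FaqGen endpoint and stores the raw responses. Below is a minimal sketch, assuming a {"query": ...} request payload and an intermediate file 'data/sqv2_faq_raw.json'; both are assumptions, and the exact request schema depends on the FaqGen microservice version.

# Hypothetical sketch of FAQ generation via the microservice.
# The payload fields ("query", "max_new_tokens") and the intermediate
# file name are assumptions, not the confirmed generate_FAQ.py behavior.
import json
import os
import requests

endpoint = os.environ["FAQ_ENDPOINT"]  # set above, e.g. http://${your_ip}:9000/v1/faqgen

with open("data/sqv2_context.json") as f:
    contexts = json.load(f)

results = []
for context in contexts:
    resp = requests.post(endpoint, json={"query": context, "max_new_tokens": 128}, timeout=300)
    resp.raise_for_status()
    results.append({"context": context, "faq": resp.text})

with open("data/sqv2_faq_raw.json", "w") as f:
    json.dump(results, f, indent=2)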

Post-process the output to extract the generated FAQs, which will be saved to 'data/sqv2_faq.json'.

python post_process_FAQ.py
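
A minimal sketch of the clean-up, continuing the assumptions above (raw responses in 'data/sqv2_faq_raw.json'); the exact clean-up rules in post_process_FAQ.py may differ.

# Hypothetical sketch of the post-processing step.
import json

with open("data/sqv2_faq_raw.json") as f:
    raw = json.load(f)

cleaned = []
for item in raw:
    # Strip surrounding whitespace and a trailing end-of-sequence marker,
    # if the model emitted one, from the generated FAQ text.
    faq = item["faq"].strip().removesuffix("</s>").strip()
    cleaned.append({"context": item["context"], "faq": faq})

with open("data/sqv2_faq.json", "w") as f:
    json.dump(cleaned, f, indent=2)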

Evaluate with Ragas

Launch TGI service

We use "mistralai/Mixtral-8x7B-Instruct-v0.1" as the LLM referee to evaluate the generated FAQs. First, launch an LLM endpoint on Gaudi:

export HUGGING_FACE_HUB_TOKEN="your_huggingface_token"
bash launch_tgi.sh

Get the endpoint:

export LLM_ENDPOINT="http://${ip_address}:8082"

Verify the service:

curl http://${ip_address}:8082/generate \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":128}}' \
    -H 'Content-Type: application/json'
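
The same check can be done from Python using the standard TGI /generate API; this is a convenience sketch, not part of the published scripts.

# Verify the TGI endpoint from Python; TGI's /generate API returns a JSON
# object with a "generated_text" field.
import os
import requests

resp = requests.post(
    f"{os.environ['LLM_ENDPOINT']}/generate",
    json={"inputs": "What is Deep Learning?", "parameters": {"max_new_tokens": 128}},
    headers={"Content-Type": "application/json"},
)
print(resp.json()["generated_text"])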

Evaluate

Evaluate the performance with the LLM referee:

python evaluate.py
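
For orientation, here is a minimal sketch of a Ragas run against a TGI-backed referee. The wiring shown (HuggingFaceEndpoint, LangchainLLMWrapper, HuggingFaceEmbeddings, the context_utilization import, and the embedding model choice) is an assumption about the installed langchain/ragas versions, the rubrics metric reported below is omitted, and evaluate.py may be wired differently.

# Hypothetical sketch of Ragas evaluation with the TGI-hosted referee.
# API names below are assumptions about the installed langchain/ragas versions.
import os
from datasets import Dataset
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.llms import HuggingFaceEndpoint
from ragas import evaluate
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import answer_relevancy, context_utilization, faithfulness

# Referee LLM served by TGI on Gaudi (endpoint exported above).
referee = HuggingFaceEndpoint(endpoint_url=os.environ["LLM_ENDPOINT"], max_new_tokens=512)

# Each row pairs a generated FAQ (answer) with its source context; the
# "question" column carries the FAQ-generation instruction.
dataset = Dataset.from_dict({
    "question": ["Generate FAQs for the given context."] * 2,
    "answer": ["<generated FAQ 1>", "<generated FAQ 2>"],
    "contexts": [["<context 1>"], ["<context 2>"]],
})

scores = evaluate(
    dataset,
    metrics=[answer_relevancy, faithfulness, context_utilization],
    llm=LangchainLLMWrapper(referee),
    embeddings=HuggingFaceEmbeddings(model_name="BAAI/bge-base-en-v1.5"),
)
print(scores)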

Performance Result

Here are the tested results for your reference:

| answer_relevancy | faithfulness | context_utilization | reference_free_rubrics_score |
| ---------------- | ------------ | ------------------- | ---------------------------- |
| 0.7191           | 0.9681       | 0.8964              | 4.4125                       |