update examples accuracy (#941)

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
commit 088ab98f31 (parent 441f8cc6ba)
Author: lkk
Date: 2024-10-14 13:20:50 +08:00
Committed by: GitHub
12 changed files with 784 additions and 14 deletions

View File

@@ -1,4 +1,4 @@
-# CodeGen accuracy Evaluation
+# CodeGen Accuracy
 ## Evaluation Framework
@@ -13,7 +13,7 @@ Please refer to [CodeGen Examples](https://github.com/opea-project/GenAIExamples
 Use the `curl` command to test the codegen service and ensure that it has started properly:
 ```bash
-export CODEGEN_ENDPOINT = "http://${your_ip}:7778/v1/codegen"
+export CODEGEN_ENDPOINT="http://${your_ip}:7778/v1/codegen"
 curl $CODEGEN_ENDPOINT \
     -H "Content-Type: application/json" \
     -d '{"messages": "Implement a high-level API for a TODO list application. The API takes as input an operation request and updates the TODO list in place. If the request is invalid, raise an exception."}'
@@ -24,7 +24,7 @@ curl $CODEGEN_ENDPOINT \
 To evaluate models on coding tasks, and coding LLMs in particular, we follow the [bigcode-evaluation-harness](https://github.com/bigcode-project/bigcode-evaluation-harness) and provide both command-line and function-call usage. [HumanEval](https://huggingface.co/datasets/openai_humaneval), [HumanEval+](https://huggingface.co/datasets/evalplus/humanevalplus), [InstructHumanEval](https://huggingface.co/datasets/codeparrot/instructhumaneval), [APPS](https://huggingface.co/datasets/codeparrot/apps), [MBPP](https://huggingface.co/datasets/mbpp), [MBPP+](https://huggingface.co/datasets/evalplus/mbppplus), and [DS-1000](https://github.com/HKUNLP/DS-1000/) are available in both completion (left-to-right) and insertion (FIM) modes.
-#### command line usage
+#### Environment
 ```shell
 git clone https://github.com/opea-project/GenAIEval
@@ -32,15 +32,14 @@ cd GenAIEval
 pip install -r requirements.txt
 pip install -e .
-cd evals/evaluation/bigcode_evaluation_harness/examples
-python main.py --model Qwen/CodeQwen1.5-7B-Chat \
-    --tasks humaneval \
-    --codegen_url $CODEGEN_ENDPOINT \
-    --max_length_generation 2048 \
-    --batch_size 1 \
-    --save_generations \
-    --save_references \
-    --allow_code_execution
 ```
+#### Evaluation
+```
+export CODEGEN_ENDPOINT="http://${your_ip}:7778/v1/codegen"
+export CODEGEN_MODEL=your_model
+bash run_acc.sh $CODEGEN_MODEL $CODEGEN_ENDPOINT
+```
 **_Note:_** Currently, our framework is designed to execute tasks in full. To ensure accurate results, we advise against using the `limit` or `limit_start` parameters to restrict the number of test samples.

View File

@@ -0,0 +1,17 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0
#
from evals.evaluation.bigcode_evaluation_harness import evaluate, setup_parser


def main():
    # Parse the bigcode-evaluation-harness CLI flags, run the evaluation,
    # and print the resulting metrics.
    eval_args = setup_parser()
    results = evaluate(eval_args)
    print(results)


if __name__ == "__main__":
    main()
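The "function-call usage" mentioned in the README goes through the same two entry points this script uses. A minimal programmatic sketch, under the assumption (suggested by the script above) that `setup_parser()` reads the harness flags from `sys.argv`:

```python
import sys

from evals.evaluation.bigcode_evaluation_harness import evaluate, setup_parser

# Mirror the flags from run_acc.sh below; populating sys.argv is an assumed
# workaround, since setup_parser() takes no explicit argument list here.
sys.argv = [
    "main.py",
    "--model", "Qwen/CodeQwen1.5-7B-Chat",
    "--tasks", "humaneval",
    "--codegen_url", "http://localhost:7778/v1/codegen",
    "--max_length_generation", "2048",
    "--batch_size", "1",
    "--save_generations",
    "--save_references",
    "--allow_code_execution",
]

results = evaluate(setup_parser())
print(results)
```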

View File

@@ -0,0 +1,13 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0
# Positional args: $1 = model name, $2 = codegen endpoint URL.
python main.py --model "$1" \
    --tasks humaneval \
    --codegen_url "$2" \
    --max_length_generation 2048 \
    --batch_size 1 \
    --save_generations \
    --save_references \
    --allow_code_execution
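As wired up in the README's Evaluation section, the script takes two positional arguments, the model name and the codegen endpoint, e.g. `bash run_acc.sh Qwen/CodeQwen1.5-7B-Chat "http://${your_ip}:7778/v1/codegen"`.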