Enhance the CodeGen example for the VSCode plugin's public release (#18)

* update codegen readme and code

Signed-off-by: lvliang-intel <liang1.lv@intel.com>

* update readme

Signed-off-by: lvliang-intel <liang1.lv@intel.com>

* update readme

Signed-off-by: lvliang-intel <liang1.lv@intel.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* clean the server code

Signed-off-by: lvliang-intel <liang1.lv@intel.com>

* refine document

Signed-off-by: lvliang-intel <liang1.lv@intel.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update readme

Signed-off-by: lvliang-intel <liang1.lv@intel.com>

---------

Signed-off-by: lvliang-intel <liang1.lv@intel.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Author: lvliang-intel
Date: 2024-03-28 23:12:26 +08:00 (committed by GitHub)
Parent: 393f6f80cd
Commit: b91a010fd8
6 changed files with 112 additions and 165 deletions

@@ -1,4 +1,22 @@
Code generation is a noteworthy application of Large Language Model (LLM) technology. In this example, we present a Copilot application to showcase how code generation can be executed on the Intel Gaudi2 platform. This CodeGen use case involves code generation utilizing open source models such as "m-a-p/OpenCodeInterpreter-DS-6.7B", "deepseek-ai/deepseek-coder-33b-instruct" and Text Generation Inference on Intel Gaudi2.
# Code Generation
Code-generating LLMs are specialized AI models designed for the task of generating computer code. Such models undergo training with datasets that encompass repositories, specialized documentation, programming code, relevant web content, and other related data. They possess a deep understanding of various programming languages, coding patterns, and software development concepts. Code LLMs are engineered to assist developers and programmers. When these LLMs are seamlessly integrated into the developer's Integrated Development Environment (IDE), they possess a comprehensive understanding of the coding context, which includes elements such as comments, function names, and variable names. This contextual awareness empowers them to provide more refined and contextually relevant coding suggestions.
Capabilities of LLMs in Coding:
- Code Generation: streamline coding by generating code from natural-language task descriptions, enabling even non-programmers to describe what they need.
- Code Completion: accelerate coding by suggesting contextually relevant snippets as developers type.
- Code Translation and Modernization: translate and modernize code across multiple programming languages, aiding interoperability and updating legacy projects.
- Code Summarization: extract key insights from codebases, improving readability and developer productivity.
- Code Refactoring: offer suggestions for code refactoring, enhancing code performance and efficiency.
- AI-Assisted Testing: assist in creating test cases, ensuring code robustness and accelerating development cycles.
- Error Detection and Debugging: detect errors in code and provide detailed descriptions and potential fixes, expediting debugging processes.
In this example, we present a Code Copilot application to showcase how code generation can be executed on the Intel Gaudi2 platform. This CodeGen use case performs code generation with open source models such as "m-a-p/OpenCodeInterpreter-DS-6.7B" and "deepseek-ai/deepseek-coder-33b-instruct", served via Text Generation Inference (TGI) on Intel Gaudi2.
The CodeGen architecture is shown below:
![architecture](https://i.imgur.com/G9ozwFX.png)
# Environment Setup
@@ -15,7 +33,7 @@ docker pull ghcr.io/huggingface/tgi-gaudi:1.2.1
Alternatively, you can build the Docker image yourself with:
```bash
bash ./tgi_gaudi/build_docker.sh
bash ./serving/tgi_gaudi/build_docker.sh
```
## Launch TGI Gaudi Service
@@ -23,13 +41,13 @@ bash ./tgi_gaudi/build_docker.sh
### Launch a local server instance on 1 Gaudi card:
```bash
bash ./tgi_gaudi/launch_tgi_service.sh
bash ./serving/tgi_gaudi/launch_tgi_service.sh
```
### Launch a local server instance on 4 Gaudi cards:
```bash
bash ./tgi_gaudi/launch_tgi_service.sh 4 9000 "deepseek-ai/deepseek-coder-33b-instruct"
bash ./serving/tgi_gaudi/launch_tgi_service.sh 4 9000 "deepseek-ai/deepseek-coder-33b-instruct"
```
### Customize TGI Gaudi Service
@@ -43,7 +61,7 @@ The ./tgi_gaudi/launch_tgi_service.sh script accepts three parameters:
You have the flexibility to customize these parameters according to your specific needs. Additionally, you can set the TGI Gaudi endpoint by exporting the environment variable `TGI_ENDPOINT`:
```bash
export TGI_ENDPOINT="xxx.xxx.xxx.xxx:8080"
export TGI_ENDPOINT="http://xxx.xxx.xxx.xxx:8080"
```
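Before wiring the endpoint into the Copilot backend, it can be useful to sanity-check that the TGI Gaudi service responds. The sketch below posts a small request to TGI's `/generate` REST route; it is illustrative only, the host/port and prompt are placeholders, and `max_new_tokens=32` is just an arbitrary test value.

```python
# Minimal sketch: probe a running TGI instance (placeholder endpoint and prompt).
import os

import requests

# Reuse TGI_ENDPOINT if it is exported, otherwise fall back to a local default.
tgi_endpoint = os.getenv("TGI_ENDPOINT", "http://localhost:8080")

payload = {
    "inputs": "# write a python function that adds two numbers\n",
    "parameters": {"max_new_tokens": 32},
}
resp = requests.post(f"{tgi_endpoint}/generate", json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["generated_text"])
```

If this returns generated text, the endpoint is ready to be used by the Copilot backend.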
## Launch Copilot Docker
@@ -75,13 +93,13 @@ export HUGGINGFACEHUB_API_TOKEN=<token>
nohup python server.py &
```
## Install Copilot VSCode extension offline
The Copilot backend defaults to listening on port 8000, but you can adjust the port number as needed.
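As a quick smoke test outside the IDE, you can also post a prompt directly to the `/v1/code_generation` route defined in the server code of this example. This is a hedged sketch: it assumes the default port 8000 mentioned above, that the request body only needs the `prompt` field (other `ChatCompletionRequest` fields are optional), and that the response model serializes its `response` field as JSON.

```python
# Minimal sketch: call the Copilot backend directly (assumes the default port 8000).
import requests

backend = "http://localhost:8000"  # adjust to the host running server.py

payload = {"prompt": "# write a quicksort function in python\n"}
resp = requests.post(f"{backend}/v1/code_generation", json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["response"])  # generated code returned by the backend
```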
Copy the vsix file `copilot-0.0.1.vsix` to your local machine and install it in VSCode as shown below.
# Install Copilot VSCode extension from Plugin Marketplace
![Install-screenshot](https://i.imgur.com/JXQ3rqE.jpg)
Install `Neural Copilot` in VSCode as shown below.
We will also be releasing the plugin on the Visual Studio Code Marketplace to facilitate installation.
![Install-screenshot](https://i.imgur.com/cnHRAdD.png)
# How to use
@@ -90,7 +108,7 @@ We will be also releasing the plugin in Visual Studio Code plugin market to faci
Please adjust the service URL in the extension settings based on the endpoint of the code generation backend service.
![Setting-screenshot](https://i.imgur.com/4hjvKPu.png)
![Setting-screenshot](https://i.imgur.com/JfJVFV3.png)
![Setting-screenshot](https://i.imgur.com/AQZuzqd.png)
## Customize
@@ -98,7 +116,7 @@ The Copilot enables users to input their corresponding sensitive information and
![Customize](https://i.imgur.com/PkObak9.png)
## Code suggestion
## Code Suggestion
To trigger inline completion, you'll need to type `# {your keyword}` (start with your programming language's comment keyword, such as `//` in C++ or `#` in Python). Make sure Inline Suggest is enabled in the VS Code settings.
For example:
@@ -123,9 +141,11 @@ To provide programmers with a smooth experience, the Copilot supports multiple w
## Chat with AI assistant
You can start a conversation with the AI programming assistant by clicking on the robot icon in the plugin bar on the left:
![icon](https://i.imgur.com/f7rzfCQ.png)
Then you can see the conversation window on the left, where you can chat with the AI assistant:
![dialog](https://i.imgur.com/aiYzU60.png)
There are 4 areas worth noting:

@@ -14,4 +14,4 @@
#!/bin/bash
docker build . -t copilot:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy
docker build . -t intel/gen-ai-examples:copilot --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy

@@ -15,13 +15,9 @@
# See the License for the specific language governing permissions and
# limitations under the License.
import json
import os
import types
from concurrent import futures
from typing import Optional
import requests
from fastapi import APIRouter, FastAPI
from fastapi.responses import RedirectResponse, StreamingResponse
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
@@ -37,84 +33,6 @@ app.add_middleware(
)
class CodeGenAPIRouter(APIRouter):
    def __init__(self, entrypoint) -> None:
        super().__init__()
        self.entrypoint = entrypoint
        print(f"[codegen - router] Initializing API Router, entrypoint={entrypoint}")
        # Define LLM
        self.llm = HuggingFaceEndpoint(
            endpoint_url=entrypoint,
            max_new_tokens=512,
            top_k=10,
            top_p=0.95,
            typical_p=0.95,
            temperature=0.01,
            repetition_penalty=1.03,
            streaming=True,
        )
        print("[codegen - router] LLM initialized.")

    def is_generator(self, obj):
        return isinstance(obj, types.GeneratorType)

    def handle_chat_completion_request(self, request: ChatCompletionRequest):
        try:
            print(f"Predicting chat completion using prompt '{request.prompt}'")
            buffered_texts = ""
            if request.stream:
                generator = self.llm(request.prompt, callbacks=[StreamingStdOutCallbackHandler()])
                if not self.is_generator(generator):
                    generator = (generator,)

                def stream_generator():
                    nonlocal buffered_texts
                    for output in generator:
                        yield f"data: {output}\n\n"
                    yield "data: [DONE]\n\n"

                return StreamingResponse(stream_generator(), media_type="text/event-stream")
            else:
                response = self.llm(request.prompt)
        except Exception as e:
            print(f"An error occurred: {e}")
        else:
            print("Chat completion finished.")
            return ChatCompletionResponse(response=response)

tgi_endpoint = os.getenv("TGI_ENDPOINT", "http://localhost:8080")
router = CodeGenAPIRouter(tgi_endpoint)
app.include_router(router)

def check_completion_request(request: BaseModel) -> Optional[str]:
    if request.temperature is not None and request.temperature < 0:
        return f"Param Error: {request.temperature} is less than the minimum of 0 --- 'temperature'"
    if request.temperature is not None and request.temperature > 2:
        return f"Param Error: {request.temperature} is greater than the maximum of 2 --- 'temperature'"
    if request.top_p is not None and request.top_p < 0:
        return f"Param Error: {request.top_p} is less than the minimum of 0 --- 'top_p'"
    if request.top_p is not None and request.top_p > 1:
        return f"Param Error: {request.top_p} is greater than the maximum of 1 --- 'top_p'"
    if request.top_k is not None and (not isinstance(request.top_k, int)):
        return f"Param Error: {request.top_k} is not valid under any of the given schemas --- 'top_k'"
    if request.top_k is not None and request.top_k < 1:
        return f"Param Error: {request.top_k} is less than the minimum of 1 --- 'top_k'"
    if request.max_new_tokens is not None and (not isinstance(request.max_new_tokens, int)):
        return f"Param Error: {request.max_new_tokens} is not valid under any of the given schemas --- 'max_new_tokens'"
    return None
def filter_code_format(code):
    language_prefixes = {
        "go": "```go",
@@ -145,30 +63,80 @@ def filter_code_format(code):
    return code
class CodeGenAPIRouter(APIRouter):
    def __init__(self, entrypoint) -> None:
        super().__init__()
        self.entrypoint = entrypoint
        print(f"[codegen - router] Initializing API Router, entrypoint={entrypoint}")
        # Define LLM
        callbacks = [StreamingStdOutCallbackHandler()]
        self.llm = HuggingFaceEndpoint(
            endpoint_url=entrypoint,
            max_new_tokens=1024,
            top_k=10,
            top_p=0.95,
            typical_p=0.95,
            temperature=0.01,
            repetition_penalty=1.03,
            streaming=True,
            callbacks=callbacks,
        )
        print("[codegen - router] LLM initialized.")

    def handle_chat_completion_request(self, request: ChatCompletionRequest):
        try:
            print(f"Predicting chat completion using prompt '{request.prompt}'")
            if request.stream:

                async def stream_generator():
                    for chunk in self.llm.stream(request.prompt):
                        yield f"data: {chunk}\n\n"
                    yield "data: [DONE]\n\n"

                return StreamingResponse(stream_generator(), media_type="text/event-stream")
            else:
                result = self.llm(request.prompt)
                response = filter_code_format(result)
        except Exception as e:
            print(f"An error occurred: {e}")
        else:
            print("Chat completion finished.")
            return ChatCompletionResponse(response=response)

tgi_endpoint = os.getenv("TGI_ENDPOINT", "http://localhost:8080")
router = CodeGenAPIRouter(tgi_endpoint)

def check_completion_request(request: BaseModel) -> Optional[str]:
    if request.temperature is not None and request.temperature < 0:
        return f"Param Error: {request.temperature} is less than the minimum of 0 --- 'temperature'"
    if request.temperature is not None and request.temperature > 2:
        return f"Param Error: {request.temperature} is greater than the maximum of 2 --- 'temperature'"
    if request.top_p is not None and request.top_p < 0:
        return f"Param Error: {request.top_p} is less than the minimum of 0 --- 'top_p'"
    if request.top_p is not None and request.top_p > 1:
        return f"Param Error: {request.top_p} is greater than the maximum of 1 --- 'top_p'"
    if request.top_k is not None and (not isinstance(request.top_k, int)):
        return f"Param Error: {request.top_k} is not valid under any of the given schemas --- 'top_k'"
    if request.top_k is not None and request.top_k < 1:
        return f"Param Error: {request.top_k} is less than the minimum of 1 --- 'top_k'"
    if request.max_new_tokens is not None and (not isinstance(request.max_new_tokens, int)):
        return f"Param Error: {request.max_new_tokens} is not valid under any of the given schemas --- 'max_new_tokens'"
    return None
# router /v1/code_generation only supports non-streaming mode.
@router.post("/v1/code_generation")
async def code_generation_endpoint(chat_request: ChatCompletionRequest):
    if router.use_deepspeed:
        responses = []

        def send_request(port):
            try:
                url = f"http://{router.host}:{port}/v1/code_generation"
                response = requests.post(url, json=chat_request.dict())
                response.raise_for_status()
                json_response = json.loads(response.content)
                cleaned_code = filter_code_format(json_response["response"])
                chat_completion_response = ChatCompletionResponse(response=cleaned_code)
                responses.append(chat_completion_response)
            except requests.exceptions.RequestException as e:
                print(f"Error sending/receiving on port {port}: {e}")

        with futures.ThreadPoolExecutor(max_workers=router.world_size) as executor:
            worker_ports = [router.port + i + 1 for i in range(router.world_size)]
            executor.map(send_request, worker_ports)
        if responses:
            return responses[0]
    else:
        ret = check_completion_request(chat_request)
        if ret is not None:
            raise RuntimeError("Invalid parameter.")
@@ -178,56 +146,15 @@ async def code_generation_endpoint(chat_request: ChatCompletionRequest):
# router /v1/code_chat supports both non-streaming and streaming mode.
@router.post("/v1/code_chat")
async def code_chat_endpoint(chat_request: ChatCompletionRequest):
    if router.use_deepspeed:
        if chat_request.stream:
            responses = []

            def generate_stream(port):
                url = f"http://{router.host}:{port}/v1/code_generation"
                response = requests.post(url, json=chat_request.dict(), stream=True, timeout=1000)
                responses.append(response)

            with futures.ThreadPoolExecutor(max_workers=router.world_size) as executor:
                worker_ports = [router.port + i + 1 for i in range(router.world_size)]
                executor.map(generate_stream, worker_ports)
            while not responses:
                pass

            def generate():
                if responses[0]:
                    for chunk in responses[0].iter_lines(decode_unicode=False, delimiter=b"\0"):
                        if chunk:
                            yield f"data: {chunk}\n\n"
                    yield "data: [DONE]\n\n"

            return StreamingResponse(generate(), media_type="text/event-stream")
        else:
            responses = []

            def send_request(port):
                try:
                    url = f"http://{router.host}:{port}/v1/code_generation"
                    response = requests.post(url, json=chat_request.dict())
                    response.raise_for_status()
                    json_response = json.loads(response.content)
                    chat_completion_response = ChatCompletionResponse(response=json_response["response"])
                    responses.append(chat_completion_response)
                except requests.exceptions.RequestException as e:
                    print(f"Error sending/receiving on port {port}: {e}")

            with futures.ThreadPoolExecutor(max_workers=router.world_size) as executor:
                worker_ports = [router.port + i + 1 for i in range(router.world_size)]
                executor.map(send_request, worker_ports)
            if responses:
                return responses[0]
    else:
        ret = check_completion_request(chat_request)
        if ret is not None:
            raise RuntimeError("Invalid parameter.")
        return router.handle_chat_completion_request(chat_request)

app.include_router(router)

@app.get("/")
async def redirect_root_to_docs():
    return RedirectResponse("/docs")
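For reference, here is a hedged client sketch for consuming the streaming `/v1/code_chat` route above. It assumes the backend's default port 8000 and the SSE framing emitted by `stream_generator` (`data: ...` events terminated by `data: [DONE]`); the prompt is a placeholder.

```python
# Minimal sketch: consume the SSE stream from /v1/code_chat (assumes port 8000).
import requests

backend = "http://localhost:8000"
payload = {"prompt": "# write a function that reverses a string\n", "stream": True}

with requests.post(f"{backend}/v1/code_chat", json=payload, stream=True, timeout=300) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines(decode_unicode=True):
        if not line:
            continue  # skip blank separator lines between SSE events
        data = line.removeprefix("data: ")
        if data == "[DONE]":
            break
        print(data, end="", flush=True)  # print each streamed chunk as it arrives
```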

Binary file not shown.