Enhance the CodeGen example for the VSCode plugin's public release (#18)

* update codegen readme and code

Signed-off-by: lvliang-intel <liang1.lv@intel.com>

* update readme

Signed-off-by: lvliang-intel <liang1.lv@intel.com>

* update readme

Signed-off-by: lvliang-intel <liang1.lv@intel.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* clean the server code

Signed-off-by: lvliang-intel <liang1.lv@intel.com>

* refine document

Signed-off-by: lvliang-intel <liang1.lv@intel.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update readme

Signed-off-by: lvliang-intel <liang1.lv@intel.com>

---------

Signed-off-by: lvliang-intel <liang1.lv@intel.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Author: lvliang-intel
Date: 2024-03-28 23:12:26 +08:00 (committed by GitHub)
Parent: 393f6f80cd
Commit: b91a010fd8
6 changed files with 112 additions and 165 deletions

@@ -1,4 +1,22 @@
Code generation is a noteworthy application of Large Language Model (LLM) technology. In this example, we present a Copilot application to showcase how code generation can be executed on the Intel Gaudi2 platform. This CodeGen use case involves code generation utilizing open source models such as "m-a-p/OpenCodeInterpreter-DS-6.7B", "deepseek-ai/deepseek-coder-33b-instruct" and Text Generation Inference on Intel Gaudi2.
# Code Generation
Code-generating LLMs are specialized AI models designed for the task of generating computer code. Such models undergo training with datasets that encompass repositories, specialized documentation, programming code, relevant web content, and other related data. They possess a deep understanding of various programming languages, coding patterns, and software development concepts. Code LLMs are engineered to assist developers and programmers. When these LLMs are seamlessly integrated into the developer's Integrated Development Environment (IDE), they possess a comprehensive understanding of the coding context, which includes elements such as comments, function names, and variable names. This contextual awareness empowers them to provide more refined and contextually relevant coding suggestions.
Capabilities of LLMs in Coding:
- Code Generation: streamline coding by generating code from natural-language task descriptions, enabling even non-programmers to describe what they need.
- Code Completion: accelerate coding by suggesting contextually relevant snippets as developers type.
- Code Translation and Modernization: translate and modernize code across multiple programming languages, aiding interoperability and updating legacy projects.
- Code Summarization: extract key insights from codebases, improving readability and developer productivity.
- Code Refactoring: offer suggestions for code refactoring, enhancing code performance and efficiency.
- AI-Assisted Testing: assist in creating test cases, ensuring code robustness and accelerating development cycles.
- Error Detection and Debugging: detect errors in code and provide detailed descriptions and potential fixes, expediting debugging processes.
In this example, we present a Code Copilot application to showcase how code generation can be executed on the Intel Gaudi2 platform. This CodeGen use case performs code generation with open source models such as "m-a-p/OpenCodeInterpreter-DS-6.7B" and "deepseek-ai/deepseek-coder-33b-instruct", served via Text Generation Inference (TGI) on Intel Gaudi2.
The CodeGen architecture is shown below:
![architecture](https://i.imgur.com/G9ozwFX.png)
# Environment Setup
@@ -15,7 +33,7 @@ docker pull ghcr.io/huggingface/tgi-gaudi:1.2.1
Alternatively, you can build the Docker image yourself with:
```bash
bash ./tgi_gaudi/build_docker.sh
bash ./serving/tgi_gaudi/build_docker.sh
```
## Launch TGI Gaudi Service
@@ -23,13 +41,13 @@ bash ./tgi_gaudi/build_docker.sh
### Launch a local server instance on 1 Gaudi card:
```bash
bash ./tgi_gaudi/launch_tgi_service.sh
bash ./serving/tgi_gaudi/launch_tgi_service.sh
```
### Launch a local server instance on 4 Gaudi cards:
```bash
bash ./tgi_gaudi/launch_tgi_service.sh 4 9000 "deepseek-ai/deepseek-coder-33b-instruct"
bash ./serving/tgi_gaudi/launch_tgi_service.sh 4 9000 "deepseek-ai/deepseek-coder-33b-instruct"
```
### Customize TGI Gaudi Service
@@ -43,7 +61,7 @@ The ./tgi_gaudi/launch_tgi_service.sh script accepts three parameters:
You have the flexibility to customize these parameters according to your specific needs. Additionally, you can set the TGI Gaudi endpoint by exporting the environment variable `TGI_ENDPOINT`:
```bash
export TGI_ENDPOINT="xxx.xxx.xxx.xxx:8080"
export TGI_ENDPOINT="http://xxx.xxx.xxx.xxx:8080"
```
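Before wiring the endpoint into the Copilot backend, it can be useful to sanity-check that the TGI Gaudi service responds. The sketch below posts a small request to TGI's `/generate` REST route; it is illustrative only, the host/port and prompt are placeholders, and `max_new_tokens=32` is just an arbitrary test value.

```python
# Minimal sketch: probe a running TGI instance (placeholder endpoint and prompt).
import os

import requests

# Reuse TGI_ENDPOINT if it is exported, otherwise fall back to a local default.
tgi_endpoint = os.getenv("TGI_ENDPOINT", "http://localhost:8080")

payload = {
    "inputs": "# write a python function that adds two numbers\n",
    "parameters": {"max_new_tokens": 32},
}
resp = requests.post(f"{tgi_endpoint}/generate", json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["generated_text"])
```

If this returns generated text, the endpoint is ready to be used by the Copilot backend.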
## Launch Copilot Docker
@@ -75,13 +93,13 @@ export HUGGINGFACEHUB_API_TOKEN=<token>
nohup python server.py &
```
## Install Copilot VSCode extension offline
The Copilot backend defaults to listening on port 8000, but you can adjust the port number as needed.
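As a quick smoke test outside the IDE, you can also post a prompt directly to the `/v1/code_generation` route defined in the server code of this example. This is a hedged sketch: it assumes the default port 8000 mentioned above, that the request body only needs the `prompt` field (other `ChatCompletionRequest` fields are optional), and that the response model serializes its `response` field as JSON.

```python
# Minimal sketch: call the Copilot backend directly (assumes the default port 8000).
import requests

backend = "http://localhost:8000"  # adjust to the host running server.py

payload = {"prompt": "# write a quicksort function in python\n"}
resp = requests.post(f"{backend}/v1/code_generation", json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["response"])  # generated code returned by the backend
```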
Copy the vsix file `copilot-0.0.1.vsix` to your local machine and install it in VSCode as shown below.
# Install Copilot VSCode extension from Plugin Marketplace
![Install-screenshot](https://i.imgur.com/JXQ3rqE.jpg)
Install `Neural Copilot` in VSCode as shown below.
We will also be releasing the plugin on the Visual Studio Code Marketplace to facilitate installation.
![Install-screenshot](https://i.imgur.com/cnHRAdD.png)
# How to use
@@ -90,7 +108,7 @@ We will be also releasing the plugin in Visual Studio Code plugin market to faci
Please adjust the service URL in the extension settings based on the endpoint of the code generation backend service.
![Setting-screenshot](https://i.imgur.com/4hjvKPu.png)
![Setting-screenshot](https://i.imgur.com/JfJVFV3.png)
![Setting-screenshot](https://i.imgur.com/AQZuzqd.png)
## Customize
@@ -98,7 +116,7 @@ The Copilot enables users to input their corresponding sensitive information and
![Customize](https://i.imgur.com/PkObak9.png)
## Code suggestion
## Code Suggestion
To trigger inline completion, you'll need to type `# {your keyword}` (start with your programming language's comment keyword, such as `//` in C++ or `#` in Python). Make sure Inline Suggest is enabled in the VS Code settings.
For example:
@@ -123,9 +141,11 @@ To provide programmers with a smooth experience, the Copilot supports multiple w
## Chat with AI assistant
You can start a conversation with the AI programming assistant by clicking on the robot icon in the plugin bar on the left:
![icon](https://i.imgur.com/f7rzfCQ.png)
Then you can see the conversation window on the left, where you can chat with the AI assistant:
![dialog](https://i.imgur.com/aiYzU60.png)
There are 4 areas worth noting:

@@ -14,4 +14,4 @@
#!/bin/bash
docker build . -t copilot:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy
docker build . -t intel/gen-ai-examples:copilot --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy

@@ -15,13 +15,9 @@
# See the License for the specific language governing permissions and
# limitations under the License.
import json
import os
import types
from concurrent import futures
from typing import Optional
import requests
from fastapi import APIRouter, FastAPI
from fastapi.responses import RedirectResponse, StreamingResponse
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
@@ -37,84 +33,6 @@ app.add_middleware(
)
class CodeGenAPIRouter(APIRouter):
    def __init__(self, entrypoint) -> None:
        super().__init__()
        self.entrypoint = entrypoint
        print(f"[codegen - router] Initializing API Router, entrypoint={entrypoint}")
        # Define LLM
        self.llm = HuggingFaceEndpoint(
            endpoint_url=entrypoint,
            max_new_tokens=512,
            top_k=10,
            top_p=0.95,
            typical_p=0.95,
            temperature=0.01,
            repetition_penalty=1.03,
            streaming=True,
        )
        print("[codegen - router] LLM initialized.")

    def is_generator(self, obj):
        return isinstance(obj, types.GeneratorType)

    def handle_chat_completion_request(self, request: ChatCompletionRequest):
        try:
            print(f"Predicting chat completion using prompt '{request.prompt}'")
            buffered_texts = ""
            if request.stream:
                generator = self.llm(request.prompt, callbacks=[StreamingStdOutCallbackHandler()])
                if not self.is_generator(generator):
                    generator = (generator,)

                def stream_generator():
                    nonlocal buffered_texts
                    for output in generator:
                        yield f"data: {output}\n\n"
                    yield "data: [DONE]\n\n"

                return StreamingResponse(stream_generator(), media_type="text/event-stream")
            else:
                response = self.llm(request.prompt)
        except Exception as e:
            print(f"An error occurred: {e}")
        else:
            print("Chat completion finished.")
            return ChatCompletionResponse(response=response)

tgi_endpoint = os.getenv("TGI_ENDPOINT", "http://localhost:8080")
router = CodeGenAPIRouter(tgi_endpoint)
app.include_router(router)

def check_completion_request(request: BaseModel) -> Optional[str]:
    if request.temperature is not None and request.temperature < 0:
        return f"Param Error: {request.temperature} is less than the minimum of 0 --- 'temperature'"
    if request.temperature is not None and request.temperature > 2:
        return f"Param Error: {request.temperature} is greater than the maximum of 2 --- 'temperature'"
    if request.top_p is not None and request.top_p < 0:
        return f"Param Error: {request.top_p} is less than the minimum of 0 --- 'top_p'"
    if request.top_p is not None and request.top_p > 1:
        return f"Param Error: {request.top_p} is greater than the maximum of 1 --- 'top_p'"
    if request.top_k is not None and (not isinstance(request.top_k, int)):
        return f"Param Error: {request.top_k} is not valid under any of the given schemas --- 'top_k'"
    if request.top_k is not None and request.top_k < 1:
        return f"Param Error: {request.top_k} is less than the minimum of 1 --- 'top_k'"
    if request.max_new_tokens is not None and (not isinstance(request.max_new_tokens, int)):
        return f"Param Error: {request.max_new_tokens} is not valid under any of the given schemas --- 'max_new_tokens'"
    return None
def filter_code_format(code):
    language_prefixes = {
        "go": "```go",
@@ -145,30 +63,80 @@ def filter_code_format(code):
    return code
class CodeGenAPIRouter(APIRouter):
    def __init__(self, entrypoint) -> None:
        super().__init__()
        self.entrypoint = entrypoint
        print(f"[codegen - router] Initializing API Router, entrypoint={entrypoint}")
        # Define LLM
        callbacks = [StreamingStdOutCallbackHandler()]
        self.llm = HuggingFaceEndpoint(
            endpoint_url=entrypoint,
            max_new_tokens=1024,
            top_k=10,
            top_p=0.95,
            typical_p=0.95,
            temperature=0.01,
            repetition_penalty=1.03,
            streaming=True,
            callbacks=callbacks,
        )
        print("[codegen - router] LLM initialized.")

    def handle_chat_completion_request(self, request: ChatCompletionRequest):
        try:
            print(f"Predicting chat completion using prompt '{request.prompt}'")
            if request.stream:

                async def stream_generator():
                    for chunk in self.llm.stream(request.prompt):
                        yield f"data: {chunk}\n\n"
                    yield "data: [DONE]\n\n"

                return StreamingResponse(stream_generator(), media_type="text/event-stream")
            else:
                result = self.llm(request.prompt)
                response = filter_code_format(result)
        except Exception as e:
            print(f"An error occurred: {e}")
        else:
            print("Chat completion finished.")
            return ChatCompletionResponse(response=response)

tgi_endpoint = os.getenv("TGI_ENDPOINT", "http://localhost:8080")
router = CodeGenAPIRouter(tgi_endpoint)

def check_completion_request(request: BaseModel) -> Optional[str]:
    if request.temperature is not None and request.temperature < 0:
        return f"Param Error: {request.temperature} is less than the minimum of 0 --- 'temperature'"
    if request.temperature is not None and request.temperature > 2:
        return f"Param Error: {request.temperature} is greater than the maximum of 2 --- 'temperature'"
    if request.top_p is not None and request.top_p < 0:
        return f"Param Error: {request.top_p} is less than the minimum of 0 --- 'top_p'"
    if request.top_p is not None and request.top_p > 1:
        return f"Param Error: {request.top_p} is greater than the maximum of 1 --- 'top_p'"
    if request.top_k is not None and (not isinstance(request.top_k, int)):
        return f"Param Error: {request.top_k} is not valid under any of the given schemas --- 'top_k'"
    if request.top_k is not None and request.top_k < 1:
        return f"Param Error: {request.top_k} is less than the minimum of 1 --- 'top_k'"
    if request.max_new_tokens is not None and (not isinstance(request.max_new_tokens, int)):
        return f"Param Error: {request.max_new_tokens} is not valid under any of the given schemas --- 'max_new_tokens'"
    return None
# router /v1/code_generation only supports non-streaming mode.
@router.post("/v1/code_generation")
async def code_generation_endpoint(chat_request: ChatCompletionRequest):
    if router.use_deepspeed:
        responses = []

        def send_request(port):
            try:
                url = f"http://{router.host}:{port}/v1/code_generation"
                response = requests.post(url, json=chat_request.dict())
                response.raise_for_status()
                json_response = json.loads(response.content)
                cleaned_code = filter_code_format(json_response["response"])
                chat_completion_response = ChatCompletionResponse(response=cleaned_code)
                responses.append(chat_completion_response)
            except requests.exceptions.RequestException as e:
                print(f"Error sending/receiving on port {port}: {e}")

        with futures.ThreadPoolExecutor(max_workers=router.world_size) as executor:
            worker_ports = [router.port + i + 1 for i in range(router.world_size)]
            executor.map(send_request, worker_ports)
        if responses:
            return responses[0]
    else:
        ret = check_completion_request(chat_request)
        if ret is not None:
            raise RuntimeError("Invalid parameter.")
@@ -178,56 +146,15 @@ async def code_generation_endpoint(chat_request: ChatCompletionRequest):
# router /v1/code_chat supports both non-streaming and streaming mode.
@router.post("/v1/code_chat")
async def code_chat_endpoint(chat_request: ChatCompletionRequest):
    if router.use_deepspeed:
        if chat_request.stream:
            responses = []

            def generate_stream(port):
                url = f"http://{router.host}:{port}/v1/code_generation"
                response = requests.post(url, json=chat_request.dict(), stream=True, timeout=1000)
                responses.append(response)

            with futures.ThreadPoolExecutor(max_workers=router.world_size) as executor:
                worker_ports = [router.port + i + 1 for i in range(router.world_size)]
                executor.map(generate_stream, worker_ports)
            while not responses:
                pass

            def generate():
                if responses[0]:
                    for chunk in responses[0].iter_lines(decode_unicode=False, delimiter=b"\0"):
                        if chunk:
                            yield f"data: {chunk}\n\n"
                    yield "data: [DONE]\n\n"

            return StreamingResponse(generate(), media_type="text/event-stream")
        else:
            responses = []

            def send_request(port):
                try:
                    url = f"http://{router.host}:{port}/v1/code_generation"
                    response = requests.post(url, json=chat_request.dict())
                    response.raise_for_status()
                    json_response = json.loads(response.content)
                    chat_completion_response = ChatCompletionResponse(response=json_response["response"])
                    responses.append(chat_completion_response)
                except requests.exceptions.RequestException as e:
                    print(f"Error sending/receiving on port {port}: {e}")

            with futures.ThreadPoolExecutor(max_workers=router.world_size) as executor:
                worker_ports = [router.port + i + 1 for i in range(router.world_size)]
                executor.map(send_request, worker_ports)
            if responses:
                return responses[0]
    else:
        ret = check_completion_request(chat_request)
        if ret is not None:
            raise RuntimeError("Invalid parameter.")
        return router.handle_chat_completion_request(chat_request)

app.include_router(router)

@app.get("/")
async def redirect_root_to_docs():
    return RedirectResponse("/docs")
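For reference, here is a hedged client sketch for consuming the streaming `/v1/code_chat` route above. It assumes the backend's default port 8000 and the SSE framing emitted by `stream_generator` (`data: ...` events terminated by `data: [DONE]`); the prompt is a placeholder.

```python
# Minimal sketch: consume the SSE stream from /v1/code_chat (assumes port 8000).
import requests

backend = "http://localhost:8000"
payload = {"prompt": "# write a function that reverses a string\n", "stream": True}

with requests.post(f"{backend}/v1/code_chat", json=payload, stream=True, timeout=300) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines(decode_unicode=True):
        if not line:
            continue  # skip blank separator lines between SSE events
        data = line.removeprefix("data: ")
        if data == "[DONE]":
            break
        print(data, end="", flush=True)  # print each streamed chunk as it arrives
```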

Binary file not shown.