Add documentation

.env.example (Normal file, 3)

@@ -0,0 +1,3 @@
OPENAI_API_KEY=
SPEECH_KEY=
SPEECH_REGION=

.gitignore (vendored, 1)

@@ -32,6 +32,7 @@ yarn-error.log*
# env files (can opt-in for committing if needed)
.env*
!.env.example

# vercel
.vercel

README.md (139)

@@ -1,36 +1,137 @@
<div align="center">
<a href="https://github.com/othneildrew/Best-README-Template">
<img src="assets/avatar.png" alt="Logo" height="80">
</a>

<h3 align="center">Conversational AI Avatar</h3>

<p align="center">
A prototype demonstrating the workflow of an end-to-end AI avatar, using speech-to-text, text-to-speech, and an agentic workflow.
</p>

Demo Video:

</div>

<!-- TABLE OF CONTENTS -->
<details>
<summary>Table of Contents</summary>
<ol>
<li>
<a href="#about-the-project">About The Project</a>
<ul>
<li><a href="#key-features">Key Features</a></li>
<li><a href="#objectives">Objectives</a></li>
<li><a href="#high-level-architecture">High Level Architecture</a></li>
<li><a href="#built-with">Built With</a></li>
</ul>
</li>
<li>
<a href="#getting-started">Getting Started</a>
</li>
<li><a href="#known-issues">Known Issues</a></li>
<li><a href="#roadmap">Roadmap</a></li>
</ol>
</details>

<!-- ABOUT THE PROJECT -->
## About The Project

This project is a prototype designed to demonstrate the workflow of an end-to-end
AI avatar. It integrates various technologies to enable speech-to-text,
text-to-speech, and an agentic workflow, creating a seamless conversational experience.

### Key Features
- Conversation through **speech or text**
- **3D avatar** for a more engaging user interaction
- Avatar **lip-sync** for more immersive conversational responses

### Objectives
- To explore the integration of different AI technologies that together enable a seamless conversational experience.
- To create a user-friendly and interactive AI avatar.
- To demonstrate the potential applications of conversational AI in various domains such as learning platforms.

### High Level Architecture

<p align="center">
<img src="assets/architecture-dark.png" align="middle" width="1000" />
</p>

The high-level architecture of the Conversational AI Avatar consists of the following components:

1. **Speech-to-Text (STT) Module**: Converts spoken language into text. It currently uses the OpenAI Whisper API, but since the model is open source, it could also run locally (a transcription sketch follows this list).

2. **LLM Agents**: This module is crucial for maintaining coherent and contextually relevant conversations with the AI avatar. It uses LangChain and LangGraph for the agentic workflow, together with OpenAI LLM models (an agent sketch follows this list).\
Note: The components highlighted in orange are currently not implemented due to the rapid prototyping phase. However, they are crucial for enhancing conversation flows and actions, and would provide a more immersive experience.

3. **Text-to-Speech (TTS) and Lip Sync Module**: This module covers both speech synthesis and lip synchronization. They are merged in this high-level architecture because we use the Microsoft Cognitive Services Speech SDK, which already gives us speech synthesis and timestamped visemes for the lip sync (a viseme sketch follows this list). In the future we could try other advanced TTS methods, such as the ElevenLabs API. However, viseme prediction is a challenging problem with limited solutions; we could look for open-source alternatives or create a custom solution.

4. **3D Avatar Module**: A visual representation of the AI enhances the immersive experience. This prototype uses a *Ready Player Me* avatar, which can be exported with an armature and morph targets to create an expressive and interactive avatar. The avatar is rendered using Three.js, a WebGL library (a morph-target sketch follows this list).
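
For illustration, here is a minimal transcription sketch using the `openai` Node SDK; the file name and helper are hypothetical, not necessarily how this repo wires it up:

```ts
// Sketch: server-side speech-to-text with the OpenAI Whisper API.
// Assumes OPENAI_API_KEY is set (see .env.example).
import fs from "node:fs";
import OpenAI from "openai";

const openai = new OpenAI(); // picks up OPENAI_API_KEY from the environment

async function transcribe(path: string): Promise<string> {
  const result = await openai.audio.transcriptions.create({
    file: fs.createReadStream(path),
    model: "whisper-1",
  });
  return result.text;
}

transcribe("recording.webm").then(console.log); // "recording.webm" is illustrative
```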
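
The agent sketch below shows the general LangChain + LangGraph shape with a single chat node; the graph layout and model name are illustrative, not the repo's actual workflow:

```ts
// Sketch: a one-node LangGraph workflow that calls an OpenAI chat model.
import { ChatOpenAI } from "@langchain/openai";
import { StateGraph, MessagesAnnotation, START, END } from "@langchain/langgraph";

const model = new ChatOpenAI({ model: "gpt-4o-mini" }); // model name is illustrative

const graph = new StateGraph(MessagesAnnotation)
  .addNode("chat", async (state) => {
    // Append the model's reply to the running message history.
    const reply = await model.invoke(state.messages);
    return { messages: [reply] };
  })
  .addEdge(START, "chat")
  .addEdge("chat", END)
  .compile();

const result = await graph.invoke({
  messages: [{ role: "user", content: "Introduce yourself in one sentence." }],
});
console.log(result.messages.at(-1)?.content);
```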
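
The viseme sketch below uses the Speech SDK's `visemeReceived` callback; audio playback wiring is omitted and the sentence is illustrative:

```ts
// Sketch: synthesize speech and collect timestamped visemes for lip sync.
// SPEECH_KEY and SPEECH_REGION come from .env.example.
import * as sdk from "microsoft-cognitiveservices-speech-sdk";

const config = sdk.SpeechConfig.fromSubscription(
  process.env.SPEECH_KEY!,
  process.env.SPEECH_REGION!,
);
const synthesizer = new sdk.SpeechSynthesizer(config);

const visemes: { timeMs: number; visemeId: number }[] = [];
synthesizer.visemeReceived = (_sender, e) => {
  // audioOffset is in 100-nanosecond ticks; convert to milliseconds.
  visemes.push({ timeMs: e.audioOffset / 10_000, visemeId: e.visemeId });
};

synthesizer.speakTextAsync(
  "Hello, I am your avatar.",
  (result) => {
    // result.audioData (ArrayBuffer) plus the viseme list drive the avatar.
    console.log(result.audioData.byteLength, visemes.length);
    synthesizer.close();
  },
  (err) => {
    console.error(err);
    synthesizer.close();
  },
);
```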
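
Finally, the morph-target sketch: once visemes arrive, the avatar's blend shapes can be driven through Three.js. The morph-target name is illustrative and depends on the Ready Player Me export:

```ts
import * as THREE from "three";

// Sketch: set a named morph target (blend shape) on every mesh that has it.
function applyViseme(avatar: THREE.Object3D, morphName: string, weight: number) {
  avatar.traverse((child) => {
    const mesh = child as THREE.Mesh;
    if (!mesh.morphTargetDictionary || !mesh.morphTargetInfluences) return;
    const index = mesh.morphTargetDictionary[morphName];
    if (index !== undefined) {
      mesh.morphTargetInfluences[index] = weight; // 0 = neutral, 1 = fully applied
    }
  });
}

// e.g. open the jaw while a vowel viseme is active (name depends on the export):
// applyViseme(avatarScene, "jawOpen", 0.8);
```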

The flow of the application is the following:

<p align="center">
<img src="assets/sequence-dark.png" align="middle" width="1000" />
</p>

### Built With

* [![LangChain][LangChain]][LangChain-url]
* [![LangGraph][LangGraph]][LangGraph-url]
* [![OpenAI][OpenAI]][OpenAI-url]
* [![Next][Next.js]][Next-url]

## Getting Started

You need API keys for two different services: OpenAI and Azure.

Refer to the following links to learn how to create the keys:
- https://platform.openai.com/docs/quickstart
- https://azure.microsoft.com/en-us/free

Copy `.env.example` to create a `.env.local` at the root of the repository and add the OpenAI and Azure keys.
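
As a quick sanity check, server-side code can read the keys like this (a sketch; the variable names match `.env.example`):

```ts
// Server-only access to the keys in .env.local; never expose them to the client.
const openaiKey = process.env.OPENAI_API_KEY;
const speechKey = process.env.SPEECH_KEY;
const speechRegion = process.env.SPEECH_REGION;

if (!openaiKey || !speechKey || !speechRegion) {
  throw new Error("Missing keys in .env.local; copy .env.example and fill it in.");
}
```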

First, install the dependencies:

```bash
npm install
# or
yarn
```

Then, run the development server:

```bash
npm run dev
# or
yarn dev
```

Open [http://localhost:3000](http://localhost:3000) with your browser to see the webapp.

You can start editing the page by modifying `app/page.tsx`. The page auto-updates as you edit the file.

## Known Issues

* Sometimes the LLM gives a response with values that aren't accepted in HTTP headers. The current workflow of sending result values in the headers along with the audio stream should be changed to a "*multipart/mixed*" content type to avoid this issue; an interim workaround is sketched below.
* React Three Fiber hasn't updated its React dependency to React 19 yet, so there may be some errors when installing the npm packages.
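
Until that change lands, one interim workaround is to base64-encode the text before putting it in a header, since base64 output contains only header-safe ASCII. A sketch (the route shape and `synthesizeReply` helper are hypothetical):

```ts
// Hypothetical helper: runs STT -> agent -> TTS and returns audio plus reply text.
declare function synthesizeReply(
  text: string,
): Promise<{ audio: ArrayBuffer; replyText: string }>;

// Sketch of a Next.js route handler that keeps arbitrary Unicode out of headers.
export async function POST(req: Request): Promise<Response> {
  const { text } = await req.json();
  const { audio, replyText } = await synthesizeReply(text);

  return new Response(audio, {
    headers: {
      "Content-Type": "audio/mpeg",
      // Base64 guarantees the header value is plain ASCII.
      "X-Reply-Text": Buffer.from(replyText, "utf8").toString("base64"),
    },
  });
}
```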

## Roadmap

- [ ] Expand with short-term and long-term memory. Conversation summarization and user-based memory should work great.
- [ ] Test a reasoning model like DeepSeek to stream the avatar's thinking and get better responses in general.
- [ ] Implement a more agentic workflow: understand the user's needs, such as search, retrieving information, task saving, etc.
- [ ] Add tool calling for actions inside the immersive experience, e.g. moving the avatar around a room or across rooms, writing something on a blackboard, pointing at a place, etc.
- [ ] Make the avatar more emotional. Using agentic workflows, we can add an agent that breaks the conversation into parts and classifies the emotion of each part. Then we can update the facial expressions via morph targets based on the classifications.
- [ ] Test text-to-speech using ElevenLabs and obtain visemes with an open-source solution.

<!-- MARKDOWN LINKS & IMAGES -->
<!-- https://www.markdownguide.org/basic-syntax/#reference-style-links -->
[Next.js]: https://img.shields.io/badge/next.js-000000?style=for-the-badge&logo=nextdotjs&logoColor=white
[Next-url]: https://nextjs.org/
[LangChain]: https://img.shields.io/badge/langchain-1C3C3C?style=for-the-badge&logo=langchain&logoColor=white
[LangChain-url]: https://www.langchain.com/
[LangGraph]: https://img.shields.io/badge/langgraph-1C3C3C?style=for-the-badge&logo=langgraph&logoColor=white
[LangGraph-url]: https://www.langchain.com/
[OpenAI]: https://img.shields.io/badge/openai-412991?style=for-the-badge&logo=openai&logoColor=white
[OpenAI-url]: https://platform.openai.com/

BIN assets/architecture-dark.png (Normal file, 943 KiB)
BIN assets/avatar.png (Normal file, 80 KiB)
BIN assets/sequence-dark.png (Normal file, 384 KiB)