Voice Chat Implementation
This document outlines OmniButler's voice communication channel, which uses Twilio for telephony integration.
Overview
OmniButler's voice interface enables users to interact with the system through phone calls, providing access to financial information and AI assistant capabilities through speech. The system uses Twilio for call handling, Google's Speech-to-Text for transcription, and text-to-speech for responses.
System Architecture
```mermaid
sequenceDiagram
participant User
participant Twilio
participant OmniButler
participant Transcription
participant LLMService
participant TTS
User->>Twilio: Place call
Twilio->>OmniButler: Voice webhook
OmniButler->>Twilio: TwiML response (greeting)
Twilio->>User: Play greeting
User->>Twilio: Speak command
Twilio->>OmniButler: Audio stream
OmniButler->>Transcription: Process audio
Transcription->>OmniButler: Transcribed text
OmniButler->>LLMService: Process command
LLMService->>OmniButler: Generate response
OmniButler->>TTS: Convert to speech
TTS->>OmniButler: Audio response
OmniButler->>Twilio: TwiML with response
Twilio->>User: Play response audio
```
Key Components
1. Voice Router System
The voice system consists of multiple specialized routers:
```python
from fastapi import APIRouter

from .command_router import router as command_router
from .respond import router as respond_router
from .send_email import router as send_email_router
from .transcribe import router as transcribe_router

# Combine the specialized sub-routers into a single voice router.
voice_router = APIRouter()
voice_router.include_router(respond_router)
voice_router.include_router(transcribe_router)
voice_router.include_router(command_router)
voice_router.include_router(send_email_router)
```
2. Component Functions
Transcription Service
- Receives audio from Twilio
- Uses Google Speech-to-Text for transcription
- Processes audio in chunks for real-time transcription
- Handles different audio formats and languages
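The chunked processing mentioned above can be sketched as a small generator. This is a minimal illustration, not OmniButler's actual code: the function name `iter_audio_chunks`, the 100 ms default, and the assumed audio format (8 kHz, 8-bit mu-law, one byte per sample) are all assumptions made for the example.

```python
from typing import Iterator

# Assumption: 8 kHz, 8-bit mu-law audio (one byte per sample).
SAMPLE_RATE_HZ = 8000
BYTES_PER_SAMPLE = 1


def iter_audio_chunks(audio: bytes, chunk_ms: int = 100) -> Iterator[bytes]:
    """Yield consecutive slices of `audio`, each about `chunk_ms` long.

    The final chunk may be shorter than the others.
    """
    if chunk_ms <= 0:
        raise ValueError("chunk_ms must be positive")
    chunk_size = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * chunk_ms // 1000
    for start in range(0, len(audio), chunk_size):
        yield audio[start:start + chunk_size]
```

Each chunk could then be forwarded to a streaming transcription client as it arrives, rather than waiting for the full recording.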
Command Router
- Parses transcribed text to identify commands
- Routes commands to appropriate handlers
- Provides fallback responses for unrecognized commands
- Maintains call context for multi-turn interactions
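As a rough sketch of these responsibilities, the example below routes transcribed text by keyword, falls back to a generic reply, and records each utterance in a per-call context. The names `CallContext`, `ROUTES`, `route_command`, and the handler functions are hypothetical; the real router may use a different matching strategy (for example, LLM-based intent detection).

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class CallContext:
    """Per-call state kept across turns (hypothetical structure)."""
    call_sid: str
    history: List[str] = field(default_factory=list)


Handler = Callable[[str, CallContext], str]


def handle_balance(text: str, ctx: CallContext) -> str:
    return "Here is your balance summary."


def handle_email(text: str, ctx: CallContext) -> str:
    return "Who should receive the email?"


def handle_fallback(text: str, ctx: CallContext) -> str:
    return "Sorry, I did not catch that. Could you rephrase?"


# Keyword -> handler, checked in insertion order (illustrative only).
ROUTES: Dict[str, Handler] = {
    "balance": handle_balance,
    "email": handle_email,
}


def route_command(text: str, ctx: CallContext) -> str:
    """Record the utterance, then dispatch to the first matching handler."""
    ctx.history.append(text)
    lowered = text.lower()
    for keyword, handler in ROUTES.items():
        if keyword in lowered:
            return handler(text, ctx)
    return handle_fallback(text, ctx)
```

Keeping the history on the context object is one simple way to support multi-turn interactions, since later handlers can inspect what the caller said earlier in the same call.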
Response Generation
- Processes user queries through the LLM service
- Formats responses for voice output
- Uses SSML for natural-sounding speech with proper intonation
- Handles different response types (confirmations, errors, data responses)
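To illustrate the SSML formatting step, here is a minimal sketch that wraps sentences in standard SSML elements (`<speak>`, `<s>`, `<break>`) and escapes characters that would otherwise break the markup. The helper name `to_ssml` and the 300 ms pause are assumptions; which SSML tags are honored depends on the TTS engine in use.

```python
from typing import Sequence
from xml.sax.saxutils import escape


def to_ssml(sentences: Sequence[str], pause_ms: int = 300) -> str:
    """Wrap each sentence in <s> and separate sentences with a short pause."""
    pause = f'<break time="{pause_ms}ms"/>'
    # escape() protects &, < and > so user data cannot corrupt the markup.
    body = pause.join(f"<s>{escape(sentence)}</s>" for sentence in sentences)
    return f"<speak>{body}</speak>"
```

For example, `to_ssml(["Done.", "Anything else?"])` produces two `<s>` elements separated by a 300 ms break.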
Email Integration
- Enables sending emails through voice commands
- Handles recipient selection, subject composition, and message dictation
- Provides confirmation before sending
- Supports email templates for common scenarios
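The dictation flow described above (recipient, subject, message, then confirmation) maps naturally onto a small state machine. The sketch below is illustrative only: `Step`, `EmailDraft`, and `advance` are hypothetical names, the prompts are placeholders, and the code deliberately stops short of sending anything.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Step(Enum):
    RECIPIENT = auto()
    SUBJECT = auto()
    BODY = auto()
    CONFIRM = auto()
    DONE = auto()
    CANCELLED = auto()


@dataclass
class EmailDraft:
    step: Step = Step.RECIPIENT
    recipient: str = ""
    subject: str = ""
    body: str = ""


def advance(draft: EmailDraft, utterance: str) -> str:
    """Consume one transcribed utterance and return the next voice prompt."""
    text = utterance.strip()
    if draft.step is Step.RECIPIENT:
        draft.recipient, draft.step = text, Step.SUBJECT
        return "What is the subject?"
    if draft.step is Step.SUBJECT:
        draft.subject, draft.step = text, Step.BODY
        return "Please dictate your message."
    if draft.step is Step.BODY:
        draft.body, draft.step = text, Step.CONFIRM
        return f"Send email to {draft.recipient} about {draft.subject}? Say yes or no."
    if draft.step is Step.CONFIRM:
        # Transcripts often carry trailing punctuation, e.g. "Yes."
        answer = text.lower().strip(" .!")
        if answer in {"yes", "yeah", "send"}:
            draft.step = Step.DONE  # The caller would hand the draft to the mailer here.
            return "Your email has been queued for sending."
        draft.step = Step.CANCELLED
        return "Okay, I have discarded the draft."
    return "This email session has already ended."
```

Treating anything other than an explicit "yes" as a cancellation is a conservative choice for this sketch: a misheard confirmation discards the draft instead of sending an unintended email.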
Call Flow
- User initiates a call to the Twilio-provisioned phone number
- Twilio connects to OmniButler through the voice webhook
- Initial greeting is played to the user
- User speaks a command or query
- Audio is transcribed to text
- Command is processed:
    - Intent is identified
    - Parameters are extracted
    - LLM generates appropriate response
- Response is converted to speech using TTS
- Audio response is played to the user
- Conversation continues with further commands or ends with user hanging up
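A single turn of this flow can be sketched as a function that chains the three services. The services are passed in as plain callables so the example stays self-contained; `handle_turn` and its parameter names are hypothetical and stand in for the real transcription, LLM, and TTS integrations.

```python
from typing import Callable


def handle_turn(
    audio: bytes,
    transcribe: Callable[[bytes], str],
    generate_reply: Callable[[str], str],
    synthesize: Callable[[str], bytes],
) -> bytes:
    """Run one conversational turn: audio in, spoken reply out."""
    text = transcribe(audio)
    if not text.strip():
        # Nothing usable was transcribed, so ask the caller to repeat.
        return synthesize("Sorry, I did not hear anything. Please try again.")
    reply = generate_reply(text)
    return synthesize(reply)
```

Injecting the services this way also makes the flow easy to exercise in tests with stub functions, without placing a real call.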
Implementation Details
Voice Endpoints
The system exposes several endpoints for voice interaction:
- `/voice/respond`: Handles initial call setup and responses
- `/voice/transcribe`: Processes speech-to-text conversion
- `/voice/command`: Routes and processes identified commands
- `/voice/send-email`: Handles email dictation and sending
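As an illustration of what an initial call-setup response might look like, the sketch below builds a TwiML document using only the standard library. It uses the TwiML `<Say>` and `<Gather>` verbs; the function name `build_greeting_twiml`, the wording of the prompts, and the choice of `/voice/transcribe` as the `action` URL are assumptions, not confirmed details of the implementation.

```python
import xml.etree.ElementTree as ET


def build_greeting_twiml(greeting: str, action_url: str = "/voice/transcribe") -> str:
    """Return TwiML that plays a greeting and then listens for speech."""
    response = ET.Element("Response")
    # <Gather input="speech"> collects the caller's speech and posts the
    # result to action_url (assumed endpoint for this sketch).
    gather = ET.SubElement(
        response,
        "Gather",
        {"input": "speech", "action": action_url, "method": "POST"},
    )
    say = ET.SubElement(gather, "Say")
    say.text = greeting
    # If the caller stays silent, Twilio continues with the next verb.
    fallback = ET.SubElement(response, "Say")
    fallback.text = "I did not hear anything. Goodbye."
    return ET.tostring(response, encoding="unicode")
```

A FastAPI handler could return this string with the `application/xml` media type so that Twilio interprets it as call instructions.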
Twilio Integration
The voice system leverages several Twilio capabilities:
- TwiML for controlling call flow
- Speech recognition for basic commands
- Audio streaming for advanced transcription
- Media handling for processing audio
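For the audio streaming path, one plausible setup is Twilio Media Streams, which delivers JSON messages over a WebSocket and carries audio as a base64-encoded payload inside `media` events. Whether OmniButler uses Media Streams is an assumption here, and `extract_audio` is a hypothetical helper; the sketch only shows how such a message could be decoded.

```python
import base64
import json
from typing import Optional


def extract_audio(message: str) -> Optional[bytes]:
    """Return raw audio bytes from a "media" event, or None for other events."""
    data = json.loads(message)
    if data.get("event") != "media":
        # Events such as "connected", "start" and "stop" carry no audio.
        return None
    return base64.b64decode(data["media"]["payload"])
```

The decoded bytes could then be buffered or forwarded to the transcription service.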
Speech Processing
Speech processing involves:
- Recording user speech through Twilio
- Transcribing with Google Speech-to-Text
- Processing transcribed text with NLP
- Generating responses with the LLM
- Converting responses to speech with proper SSML formatting
Current Status
The voice channel is functionally implemented but currently inactive in production. It requires maintenance before being re-enabled:
- Update Twilio phone number configurations
- Review and optimize speech recognition settings
- Enhance error handling for failed transcriptions
- Update prompts for voice-specific interaction patterns
Future Enhancements
- Voice Authentication
    - Biometric voice verification
    - Secure PIN or passphrase authentication
    - Multi-factor authentication options
- Enhanced Voice Interaction
    - Interruption handling
    - Barge-in capability
    - Context-aware conversation memory
    - Background noise filtering
- Specialized Voice Interfaces
    - Custom voice interfaces for specific tasks
    - Transaction-specific voice workflows
    - Voice alerts and notifications
- Personalization
    - Voice profile recognition
    - User-specific voice preferences
    - Custom vocabulary for frequent terms