Voice Chat Implementation

This document outlines OmniButler's voice communication channel, which uses Twilio for telephony integration.

Overview

OmniButler's voice interface lets users interact with the system by phone, giving them spoken access to financial information and AI assistant capabilities. The system uses Twilio for call handling, Google Speech-to-Text for transcription, and a text-to-speech (TTS) service for responses.

System Architecture

```mermaid
sequenceDiagram
participant User
participant Twilio
participant OmniButler
participant Transcription
participant LLMService
participant TTS

User->>Twilio: Place call
Twilio->>OmniButler: Voice webhook
OmniButler->>Twilio: TwiML response (greeting)
Twilio->>User: Play greeting
User->>Twilio: Speak command
Twilio->>OmniButler: Audio stream
OmniButler->>Transcription: Process audio
Transcription->>OmniButler: Transcribed text
OmniButler->>LLMService: Process command
LLMService->>OmniButler: Generated response
OmniButler->>TTS: Convert to speech
TTS->>OmniButler: Audio response
OmniButler->>Twilio: TwiML with response
Twilio->>User: Play response audio

```

Key Components

1. Voice Router System

The voice system consists of multiple specialized routers:

```python
from fastapi import APIRouter

# Each sub-router lives in its own module inside the voice package.
from .command_router import router as command_router
from .respond import router as respond_router
from .send_email import router as send_email_router
from .transcribe import router as transcribe_router

# Aggregate router that exposes every voice endpoint.
voice_router = APIRouter()

voice_router.include_router(respond_router)
voice_router.include_router(transcribe_router)
voice_router.include_router(command_router)
voice_router.include_router(send_email_router)
```

2. Component Functions

Transcription Service

- Receives audio from Twilio
- Uses Google Speech-to-Text for transcription
- Processes audio in chunks for real-time transcription
- Handles different audio formats and languages
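
To make the chunked processing concrete, here is a minimal sketch of splitting an audio buffer into fixed-duration slices before handing them to a transcriber. The audio format (8 kHz, 8-bit mu-law, one byte per sample), the chunk duration, and the name `iter_audio_chunks` are assumptions for illustration; the source does not state which format or chunk size OmniButler uses.

```python
from typing import Iterator

# Assumed for illustration: 8 kHz, 8-bit mu-law audio (one byte per sample).
SAMPLE_RATE_HZ = 8000
CHUNK_MS = 100  # hypothetical chunk duration


def iter_audio_chunks(audio: bytes, chunk_ms: int = CHUNK_MS) -> Iterator[bytes]:
    """Yield consecutive fixed-duration slices of a raw audio buffer."""
    chunk_size = SAMPLE_RATE_HZ * chunk_ms // 1000  # bytes per chunk
    if chunk_size <= 0:
        raise ValueError("chunk_ms is too small to hold a single sample")
    for start in range(0, len(audio), chunk_size):
        # The final slice may be shorter than chunk_size.
        yield audio[start:start + chunk_size]


# One second of silence splits into ten 100 ms chunks.
one_second = bytes(SAMPLE_RATE_HZ)
chunks = list(iter_audio_chunks(one_second))
```

Each chunk could then be forwarded to a streaming recognizer as it arrives instead of waiting for the whole utterance.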

Command Router

- Parses transcribed text to identify commands
- Routes commands to appropriate handlers
- Provides fallback responses for unrecognized commands
- Maintains call context for multi-turn interactions
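
The routing behavior described above could look roughly like the following sketch. It uses simple keyword matching as a stand-in for whatever intent detection the real router performs; `CallContext`, `KEYWORD_HANDLERS`, and the handler functions are hypothetical names, not taken from the OmniButler codebase.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class CallContext:
    """Per-call state kept across turns (hypothetical structure)."""
    call_id: str
    history: List[str] = field(default_factory=list)


Handler = Callable[[str, CallContext], str]


def handle_balance(text: str, ctx: CallContext) -> str:
    return "Fetching your account balance."


def handle_email(text: str, ctx: CallContext) -> str:
    return "Who should receive the email?"


def handle_fallback(text: str, ctx: CallContext) -> str:
    return "Sorry, I did not understand that. Please try again."


# Keyword -> handler, checked in insertion order (illustrative only).
KEYWORD_HANDLERS: Dict[str, Handler] = {
    "balance": handle_balance,
    "email": handle_email,
}


def route_command(text: str, ctx: CallContext) -> str:
    """Record the utterance, then dispatch to the first matching handler."""
    ctx.history.append(text)
    lowered = text.lower()
    for keyword, handler in KEYWORD_HANDLERS.items():
        if keyword in lowered:
            return handler(text, ctx)
    return handle_fallback(text, ctx)


ctx = CallContext(call_id="CA-example")
reply = route_command("What is my balance?", ctx)
```

Keeping the utterance history on the context object is one way to give later turns access to earlier ones; a production router would likely store richer state.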

Response Generation

- Processes user queries through the LLM service
- Formats responses for voice output
- Uses SSML for natural-sounding speech with proper intonation
- Handles different response types (confirmations, errors, data responses)
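
As an illustration of the SSML formatting step, the sketch below wraps sentences in standard SSML elements (`<speak>`, `<prosody>`, `<s>`, `<break>`) and escapes characters that would otherwise break the markup. The function name `to_ssml` and its defaults are assumptions, and which SSML elements are honored depends on the TTS engine in use.

```python
from typing import Iterable
from xml.sax.saxutils import escape


def to_ssml(sentences: Iterable[str], pause_ms: int = 300, rate: str = "medium") -> str:
    """Wrap plain-text sentences in SSML with a pause between them."""
    # Escape &, < and > so response text cannot corrupt the markup.
    parts = [f"<s>{escape(sentence)}</s>" for sentence in sentences]
    pause = f'<break time="{pause_ms}ms"/>'
    body = pause.join(parts)
    return f'<speak><prosody rate="{rate}">{body}</prosody></speak>'


ssml = to_ssml(["Your transfer is complete.", "Anything else?"])
```

Escaping matters for financial responses in particular, since amounts and merchant names often contain `&` or `<`.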

Email Integration

- Enables sending emails through voice commands
- Handles recipient selection, subject composition, and message dictation
- Provides confirmation before sending
- Supports email templates for common scenarios
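
The dictation-then-confirm flow can be modeled as a small state machine. This is a sketch under assumptions: `EmailDraft`, `next_prompt`, and `apply_utterance` are hypothetical names, the accepted confirmation phrases are made up, and actual sending and template support are left out.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical set of phrases treated as a confirmation.
CONFIRM_PHRASES = {"yes", "yes please", "send it"}


@dataclass
class EmailDraft:
    recipient: Optional[str] = None
    subject: Optional[str] = None
    body: Optional[str] = None
    confirmed: bool = False


def next_prompt(draft: EmailDraft) -> str:
    """Return the prompt for the first piece of information still missing."""
    if draft.recipient is None:
        return "Who should receive the email?"
    if draft.subject is None:
        return "What is the subject?"
    if draft.body is None:
        return "Please dictate your message."
    if not draft.confirmed:
        return f"Send email to {draft.recipient} with subject {draft.subject}? Say yes or no."
    return "Email confirmed and ready to send."


def apply_utterance(draft: EmailDraft, utterance: str) -> EmailDraft:
    """Store the utterance in the first empty field, or treat it as a confirmation."""
    text = utterance.strip()
    if draft.recipient is None:
        draft.recipient = text
    elif draft.subject is None:
        draft.subject = text
    elif draft.body is None:
        draft.body = text
    elif not draft.confirmed:
        draft.confirmed = text.lower() in CONFIRM_PHRASES
    return draft
```

Because nothing is marked as confirmed until the caller explicitly agrees, any other answer leaves the draft unsent and repeats the confirmation prompt.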

Call Flow

1. User initiates a call to the Twilio-provisioned phone number
2. Twilio connects to OmniButler through the voice webhook
3. Initial greeting is played to the user
4. User speaks a command or query
5. Audio is transcribed to text
6. Command is processed:
    - Intent is identified
    - Parameters are extracted
    - LLM generates appropriate response
7. Response is converted to speech using TTS
8. Audio response is played to the user
9. Conversation continues with further commands or ends with user hanging up
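
The greeting and listening steps of this flow correspond to a TwiML document returned from the voice webhook. The sketch below builds one with the standard library rather than the Twilio helper library, so it stays self-contained. `build_greeting_twiml`, the greeting text, and the choice of `/voice/command` as the `action` URL are assumptions; the source does not show how the real webhook wires these together.

```python
import xml.etree.ElementTree as ET


def build_greeting_twiml(greeting: str, action_url: str) -> str:
    """Build TwiML that plays a greeting and gathers the caller's speech."""
    response = ET.Element("Response")
    # <Gather input="speech"> asks Twilio to recognize speech and POST the
    # result to action_url.
    gather = ET.SubElement(response, "Gather", {
        "input": "speech",
        "action": action_url,
        "method": "POST",
        "speechTimeout": "auto",
    })
    say = ET.SubElement(gather, "Say")
    say.text = greeting
    # Reached only if the caller says nothing before the gather times out.
    fallback = ET.SubElement(response, "Say")
    fallback.text = "I did not hear anything. Goodbye."
    body = ET.tostring(response, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>' + body


twiml = build_greeting_twiml("Welcome to OmniButler. How can I help?", "/voice/command")
```

A webhook handler would return this string with an XML content type; each later turn can answer with a similar document so the conversation loops until the caller hangs up.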

Implementation Details

Voice Endpoints

The system exposes several endpoints for voice interaction:

- `/voice/respond`: Handles initial call setup and responses
- `/voice/transcribe`: Processes speech-to-text conversion
- `/voice/command`: Routes and processes identified commands
- `/voice/send-email`: Handles email dictation and sending

Twilio Integration

The voice system leverages several Twilio capabilities:

- TwiML for controlling call flow
- Speech recognition for basic commands
- Audio streaming for advanced transcription
- Media handling for processing audio
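
If the audio streaming mentioned above uses Twilio Media Streams (an assumption; the source does not name the mechanism), each WebSocket message is a JSON object, and `media` events carry base64-encoded audio in `media.payload`. A minimal sketch of pulling the raw bytes out of such a message, with `extract_audio` as a hypothetical helper name:

```python
import base64
import json
from typing import Optional


def extract_audio(message: str) -> Optional[bytes]:
    """Return decoded audio bytes for a 'media' event, or None for other events."""
    event = json.loads(message)
    if event.get("event") != "media":
        # Other event types (for example 'start' or 'stop') carry no audio.
        return None
    return base64.b64decode(event["media"]["payload"])


# Example message shaped like a media event (payload is made-up data).
sample = json.dumps({
    "event": "media",
    "media": {"payload": base64.b64encode(b"\xff\x7f\x00").decode("ascii")},
})
audio = extract_audio(sample)
```

The decoded bytes are what a transcription service would consume, typically after buffering several messages together.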

Speech Processing

Speech processing involves:

1. Recording user speech through Twilio
2. Transcribing with Google Speech-to-Text
3. Processing transcribed text with NLP
4. Generating responses with the LLM
5. Converting responses to speech with proper SSML formatting
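
These stages compose naturally into a single per-turn function. The sketch below takes each stage as a callable so that real services can be swapped for stubs; `process_turn`, `normalize`, and the stub implementations are hypothetical, and the whitespace and case normalization is only a placeholder for the NLP step.

```python
from typing import Callable


def normalize(text: str) -> str:
    """Placeholder for the NLP step: lowercase and collapse whitespace."""
    return " ".join(text.lower().split())


def process_turn(
    audio: bytes,
    transcribe: Callable[[bytes], str],
    generate: Callable[[str], str],
    synthesize: Callable[[str], str],
    reprompt: str = "Sorry, I did not catch that. Could you repeat it?",
) -> str:
    """Run one conversational turn: audio in, speech markup out."""
    text = normalize(transcribe(audio))
    if not text:
        # An empty transcription skips the LLM and asks the caller to repeat.
        return synthesize(reprompt)
    return synthesize(generate(text))


# Stub stages standing in for Speech-to-Text, the LLM service, and TTS.
def fake_transcribe(audio: bytes) -> str:
    return "  What is my BALANCE  "


def fake_generate(text: str) -> str:
    return f"You asked: {text}"


def fake_synthesize(text: str) -> str:
    return f"<speak>{text}</speak>"


result = process_turn(b"\x00", fake_transcribe, fake_generate, fake_synthesize)
```

Passing the stages in as arguments keeps the pipeline testable without network access to Twilio, Google, or the LLM service.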

Current Status

The voice channel is functionally implemented but currently inactive in production. It requires maintenance before being re-enabled:

- Update Twilio phone number configurations
- Review and optimize speech recognition settings
- Enhance error handling for failed transcriptions
- Update prompts for voice-specific interaction patterns

Future Enhancements

1. Voice Authentication
    - Biometric voice verification
    - Secure PIN or passphrase authentication
    - Multi-factor authentication options
2. Enhanced Voice Interaction
    - Interruption handling
    - Barge-in capability
    - Context-aware conversation memory
    - Background noise filtering
3. Specialized Voice Interfaces
    - Custom voice interfaces for specific tasks
    - Transaction-specific voice workflows
    - Voice alerts and notifications
4. Personalization
    - Voice profile recognition
    - User-specific voice preferences
    - Custom vocabulary for frequent terms