Voice Chat Implementation

This document outlines OmniButler's voice communication channel, which uses Twilio for telephony integration.

Overview

OmniButler's voice interface lets users reach their financial information and the AI assistant by phone. Twilio handles the call, Google's Speech-to-Text transcribes the caller's speech, and a text-to-speech service voices the responses.

System Architecture

```mermaid
sequenceDiagram
    participant User
    participant Twilio
    participant OmniButler
    participant Transcription
    participant LLMService
    participant TTS

    User->>Twilio: Place call
    Twilio->>OmniButler: Voice webhook
    OmniButler->>Twilio: TwiML response (greeting)
    Twilio->>User: Play greeting
    User->>Twilio: Speak command
    Twilio->>OmniButler: Audio stream
    OmniButler->>Transcription: Process audio
    Transcription->>OmniButler: Transcribed text
    OmniButler->>LLMService: Process command
    LLMService->>OmniButler: Generate response
    OmniButler->>TTS: Convert to speech
    TTS->>OmniButler: Audio response
    OmniButler->>Twilio: TwiML with response
    Twilio->>User: Play response audio
```

Key Components

1. Voice Router System

The voice system combines four specialized FastAPI routers into a single voice_router:

from fastapi import APIRouter

# Each submodule defines the router for one part of the voice flow
from .command_router import router as command_router
from .respond import router as respond_router
from .send_email import router as send_email_router
from .transcribe import router as transcribe_router

# Aggregate router exposing all voice endpoints
voice_router = APIRouter()

voice_router.include_router(respond_router)
voice_router.include_router(transcribe_router)
voice_router.include_router(command_router)
voice_router.include_router(send_email_router)

2. Component Functions

Transcription Service

  • Receives audio from Twilio
  • Uses Google Speech-to-Text for transcription
  • Processes audio in chunks for real-time transcription
  • Handles different audio formats and languages
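
The chunked processing described above might look roughly like the following sketch. It assumes the audio arrives as raw 8 kHz, 8-bit mu-law bytes, which is common for telephony but is not confirmed by this document; the function name and the 100 ms chunk size are hypothetical.

```python
from typing import Iterator

SAMPLE_RATE_HZ = 8000   # assumption: 8 kHz telephony audio
BYTES_PER_SAMPLE = 1    # assumption: 8-bit mu-law encoding


def iter_audio_chunks(audio: bytes, chunk_ms: int = 100) -> Iterator[bytes]:
    """Yield fixed-duration chunks of raw audio; the last chunk may be shorter."""
    if chunk_ms <= 0:
        raise ValueError("chunk_ms must be positive")
    chunk_size = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * chunk_ms // 1000
    for start in range(0, len(audio), chunk_size):
        yield audio[start:start + chunk_size]
```

Each chunk could then be forwarded to a streaming transcription client as it becomes available, rather than waiting for the full recording.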

Command Router

  • Parses transcribed text to identify commands
  • Routes commands to appropriate handlers
  • Provides fallback responses for unrecognized commands
  • Maintains call context for multi-turn interactions

Response Generation

  • Processes user queries through the LLM service
  • Formats responses for voice output
  • Uses SSML for natural-sounding speech with proper intonation
  • Handles different response types (confirmations, errors, data responses)
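
To make the SSML step concrete, the sketch below wraps sentences in standard SSML elements with a short pause between them. The function name and pause length are hypothetical, and whether the outer `<speak>` element is required depends on the TTS engine in use, which this document does not specify.

```python
from typing import List
from xml.sax.saxutils import escape


def to_ssml(sentences: List[str], pause_ms: int = 300) -> str:
    """Wrap non-empty sentences in <s> tags, separated by <break> pauses."""
    pause = f'<break time="{pause_ms}ms"/>'
    body = pause.join(
        f"<s>{escape(s.strip())}</s>" for s in sentences if s.strip()
    )
    return f"<speak>{body}</speak>"
```

Escaping the text matters here: LLM output can contain characters such as `&` or `<` that would otherwise produce invalid markup.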

Email Integration

  • Enables sending emails through voice commands
  • Handles recipient selection, subject composition, and message dictation
  • Provides confirmation before sending
  • Supports email templates for common scenarios
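
The dictation flow (recipient, subject, message, then confirmation) can be modeled as a small state machine. The following is a sketch under assumptions: the step names, prompts, and `EmailDraft` class are hypothetical and do not reflect the actual send_email router.

```python
from dataclasses import dataclass, field
from typing import Dict

# Hypothetical dictation steps, in the order they are collected
STEPS = ["recipient", "subject", "body", "confirm"]
PROMPTS = {
    "recipient": "Who should receive the email?",
    "subject": "What is the subject?",
    "body": "Please dictate your message.",
    "confirm": "Say yes to send or no to cancel.",
}


@dataclass
class EmailDraft:
    step: str = "recipient"
    fields: Dict[str, str] = field(default_factory=dict)
    status: str = "in_progress"  # in_progress | ready_to_send | cancelled


def advance(draft: EmailDraft, utterance: str) -> str:
    """Consume one transcribed utterance and return the next spoken prompt."""
    if draft.status != "in_progress":
        return "This draft is already closed."
    text = utterance.strip()
    if draft.step == "confirm":
        # Only an explicit yes sends; anything else cancels the draft
        if text.lower() in {"yes", "send", "send it"}:
            draft.status = "ready_to_send"
            return "Sending your email."
        draft.status = "cancelled"
        return "Okay, I cancelled the email."
    draft.fields[draft.step] = text
    draft.step = STEPS[STEPS.index(draft.step) + 1]
    return PROMPTS[draft.step]
```

Treating every non-affirmative answer as a cancellation is a deliberately conservative choice for a sketch; a production flow would more likely reprompt on unclear answers.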

Call Flow

  1. User initiates a call to the Twilio-provisioned phone number
  2. Twilio connects to OmniButler through the voice webhook
  3. Initial greeting is played to the user
  4. User speaks a command or query
  5. Audio is transcribed to text
  6. Command is processed:
       • Intent is identified
       • Parameters are extracted
       • LLM generates appropriate response
  7. Response is converted to speech using TTS
  8. Audio response is played to the user
  9. Conversation continues with further commands or ends with user hanging up
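
One conversational turn in this flow can be sketched as a pipeline of three pluggable stages. The function and parameter names are hypothetical; in the real system each stage would be backed by the transcription, LLM, and TTS services respectively.

```python
from typing import Callable


def handle_turn(
    audio: bytes,
    transcribe: Callable[[bytes], str],
    generate_reply: Callable[[str], str],
    synthesize: Callable[[str], bytes],
) -> bytes:
    """Run one turn of the call: caller audio in, response audio out."""
    text = transcribe(audio)        # speech to text
    reply = generate_reply(text)    # intent, parameters, and LLM response
    return synthesize(reply)        # text to speech
```

Passing the stages in as callables keeps the turn logic testable with simple stubs, for example `handle_turn(b"...", lambda a: "hello", str.upper, str.encode)`.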

Implementation Details

Voice Endpoints

The system exposes several endpoints for voice interaction:

  • /voice/respond - Handles initial call setup and responses
  • /voice/transcribe - Processes speech-to-text conversion
  • /voice/command - Routes and processes identified commands
  • /voice/send-email - Handles email dictation and sending

Twilio Integration

The voice system leverages several Twilio capabilities:

  • TwiML for controlling call flow
  • Speech recognition for basic commands
  • Audio streaming for advanced transcription
  • Media handling for processing audio
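
As a rough illustration of TwiML-driven call flow, the sketch below builds a greeting that uses Twilio's `<Gather>` verb with speech input. It uses only the standard library to produce the XML; the function name and the choice of action URL are assumptions, and the real endpoints may build their responses with Twilio's helper library instead.

```python
import xml.etree.ElementTree as ET


def greeting_twiml(action_url: str, greeting: str) -> str:
    """Build TwiML that speaks a greeting and gathers the caller's speech."""
    response = ET.Element("Response")
    gather = ET.SubElement(
        response,
        "Gather",
        {"input": "speech", "action": action_url, "method": "POST"},
    )
    ET.SubElement(gather, "Say").text = greeting
    # Reached only if the caller stays silent and <Gather> times out
    ET.SubElement(response, "Say").text = "I didn't catch that. Goodbye."
    return ET.tostring(response, encoding="unicode")
```

For example, `greeting_twiml("/voice/command", "Welcome to OmniButler.")` would ask Twilio to post the recognized speech to the command endpoint, assuming that is how the endpoints are wired together.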

Speech Processing

Speech processing involves:

  1. Recording user speech through Twilio
  2. Transcribing with Google Speech-to-Text
  3. Processing transcribed text with NLP
  4. Generating responses with the LLM
  5. Converting responses to speech with proper SSML formatting

Current Status

The voice channel is functionally implemented but currently inactive in production. It requires maintenance before being re-enabled:

  1. Update Twilio phone number configurations
  2. Review and optimize speech recognition settings
  3. Enhance error handling for failed transcriptions
  4. Update prompts for voice-specific interaction patterns
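
For the transcription error handling item, one possible approach is to reprompt the caller a limited number of times before ending the call. The sketch below is an assumption about how that could work, not a description of existing behavior; the function name, thresholds, and prompts are hypothetical.

```python
from typing import Optional, Tuple


def decide_next_step(
    transcript: Optional[str],
    confidence: float,
    failed_attempts: int,
    min_confidence: float = 0.6,
    max_attempts: int = 3,
) -> Tuple[str, str]:
    """Return (action, prompt); action is 'proceed', 'reprompt', or 'hangup'.

    failed_attempts counts earlier failures in this call, not this one.
    """
    usable = bool(transcript and transcript.strip()) and confidence >= min_confidence
    if usable:
        return "proceed", ""
    if failed_attempts + 1 >= max_attempts:
        return "hangup", "Sorry, I'm having trouble understanding. Please try again later."
    return "reprompt", "Sorry, I didn't catch that. Could you repeat it?"
```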

Future Enhancements

  1. Voice Authentication
       • Biometric voice verification
       • Secure PIN or passphrase authentication
       • Multi-factor authentication options

  2. Enhanced Voice Interaction
       • Interruption handling
       • Barge-in capability
       • Context-aware conversation memory
       • Background noise filtering

  3. Specialized Voice Interfaces
       • Custom voice interfaces for specific tasks
       • Transaction-specific voice workflows
       • Voice alerts and notifications

  4. Personalization
       • Voice profile recognition
       • User-specific voice preferences
       • Custom vocabulary for frequent terms