Voice-Only Agents
Voice-only agents provide hands-free, conversational AI experiences optimized for spoken interactions. These agents are perfect for scenarios where users need to multitask or prefer natural speech over text-based communication.
Voice Agent Configuration
The voice agent setup is accessible through the main agent configuration interface, as shown in the companion settings.
Voice-Only Mode Toggle
In the left sidebar, you can see the “Voice only” toggle that enables pure voice interaction:
- Voice Only Toggle: Switches the agent to speech-only mode
- Companion Voice: Select from available voice personalities
- Real-time Processing: Enable immediate speech-to-speech interaction
Voice Selection Options
The voice dropdown adapts to the model family you select:
Gemini Voice Set
- Sportsman, Customer support, Sarah, Brooke, Katie, Zemo, ajith, duaila, azj, ajz, sjl, brit, Swissen
- Available whenever a Gemini 2.5 model powers the agent
- Blends friendly assistants (Customer support, Sarah) with energetic presenters (Sportsman, Zemo)
OpenAI Realtime Voice Set
- Alloy, Echo, Shimmer, Ash, Ballad, Coral, Sage, Verse, Cedar, Marin
- Automatically displayed when GPT Realtime, GPT‑4o Realtime, or GPT Realtime Mini is active
- Covers professional narrators (Alloy, Ash), expressive hosts (Ballad, Coral), and supportive guides (Sage, Verse)
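The way the dropdown follows the model family can be pictured as a simple lookup. The sketch below is illustrative only: the function name `voices_for_model` and its name-matching rules are assumptions, not a documented API. Only the voice and model names are taken from the lists above.

```python
# Voice names copied from the lists above.
GEMINI_VOICES = [
    "Sportsman", "Customer support", "Sarah", "Brooke", "Katie", "Zemo",
    "ajith", "duaila", "azj", "ajz", "sjl", "brit", "Swissen",
]
OPENAI_REALTIME_VOICES = [
    "Alloy", "Echo", "Shimmer", "Ash", "Ballad",
    "Coral", "Sage", "Verse", "Cedar", "Marin",
]


def voices_for_model(model_name: str) -> list:
    """Return the voice set for a model name (hypothetical matching rules)."""
    name = model_name.lower()
    if name.startswith("gemini 2.5"):
        return GEMINI_VOICES
    if "realtime" in name:
        return OPENAI_REALTIME_VOICES
    # This page does not say which voices apply to other model families,
    # so the sketch returns an empty list rather than guessing.
    return []


print(voices_for_model("Gemini 2.5 Flash")[:3])
print(voices_for_model("GPT Realtime Mini")[:3])
```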
Real-Time Voice Models
Voice-only agents can leverage advanced real-time models:
OpenAI Realtime Models
- GPT Realtime Mini: Fastest response time for highly interactive conversations
- GPT‑4o Realtime: Balanced latency and quality for premium experiences
- GPT Realtime: General-purpose realtime model with strong reasoning
Gemini Models
- Gemini 2.5 Flash Lite: Lightweight option for responsive experiences
- Gemini 2.5 Flash: Balanced speed and quality
- Gemini 2.5 Pro: Highest reasoning capability in the Gemini lineup
Groq-Hosted Models
- GPT OSS 20B / 120B: High-performance open-source GPT derivatives
- Qwen3‑32B: Strong multilingual and reasoning support
- Moonshotai Kimi K2: Alternative option with a distinctive response style
Voice Agent Benefits
Hands-Free Operation
- Multitasking: Users can interact while working on other tasks
- Accessibility: Perfect for users with visual impairments or mobility limitations
- Convenience: Natural conversation without typing or screen interaction
- Mobile-Friendly: Ideal for on-the-go interactions
Natural Conversation Flow
- Real-time Responses: Immediate feedback like human conversation
- Interruption Handling: Agents can handle mid-sentence interruptions
- Context Awareness: Maintain conversation context across turns
- Emotional Nuance: Voice conveys tone and emotion better than text
Implementation Strategies
Voice-First Design Principles
When configuring voice-only agents, consider these design principles:
Conversation Flow Optimization
Structure conversations for voice interaction:
- Clear Opening: “Hi, how can I help you?” works well for voice
- Guided Discovery: Suggest specific questions users can ask
- Confirmation Loops: Verify understanding through speech
- Natural Closing: End conversations gracefully
Use Cases for Voice-Only Agents
Customer Support Hotlines
- 24/7 Availability: Replace or supplement human phone support
- Quick Triage: Route calls based on spoken requests
- Information Retrieval: Answer frequently asked questions
- Escalation Management: Transfer complex issues to humans
Smart Speaker Integration
- Home Automation: Control connected devices through voice
- Information Services: Weather, news, and general inquiries
- Entertainment: Music, podcasts, and interactive content
- Productivity: Calendar management, reminders, and tasks
Automotive Applications
- Hands-Free Assistance: Safe interaction while driving
- Navigation Help: Provide directions and traffic updates
- Vehicle Control: Adjust settings through voice commands
- Emergency Support: Quick access to help when needed
Healthcare and Wellness
- Symptom Checking: Initial health assessments through conversation
- Medication Reminders: Voice-activated pill reminders
- Mental Health Support: Conversational therapy and check-ins
- Accessibility Services: Support for users with disabilities
Technical Implementation
Voice Processing Pipeline
Voice-only agents follow this processing flow:
- Speech Recognition: Convert user speech to text
- Intent Understanding: Process natural language input
- Response Generation: Create appropriate textual response
- Text-to-Speech: Convert response to natural speech
- Audio Delivery: Stream audio back to user
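The five stages above can be sketched as a chain of functions. This is a minimal illustration, not the platform's implementation: `run_voice_turn` and the `fake_*` stages are hypothetical stand-ins so the example runs without any speech service.

```python
from typing import Callable, Dict, Iterable, Iterator


def run_voice_turn(
    audio_in: bytes,
    recognize: Callable[[bytes], str],
    understand: Callable[[str], Dict[str, str]],
    generate: Callable[[Dict[str, str]], str],
    synthesize: Callable[[str], Iterable[bytes]],
) -> Iterator[bytes]:
    """Run one user turn through the five stages and yield audio chunks."""
    transcript = recognize(audio_in)      # 1. speech recognition
    intent = understand(transcript)       # 2. intent understanding
    reply_text = generate(intent)         # 3. response generation
    for chunk in synthesize(reply_text):  # 4. text-to-speech
        yield chunk                       # 5. audio delivery, chunk by chunk


# Stub stages (assumptions for the demo only).
def fake_recognize(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_understand(text: str) -> Dict[str, str]:
    topic = "hours" if "open" in text.lower() else "unknown"
    return {"topic": topic, "text": text}


def fake_generate(intent: Dict[str, str]) -> str:
    if intent["topic"] == "hours":
        return "We are open nine to five."
    return "Sorry, could you rephrase that?"


def fake_synthesize(text: str) -> Iterator[bytes]:
    for word in text.split():
        yield word.encode("utf-8")  # one "audio chunk" per word


chunks = list(run_voice_turn(
    b"When are you open?",
    fake_recognize, fake_understand, fake_generate, fake_synthesize,
))
print(len(chunks), "chunks delivered")
```

Writing the turn as a generator means the caller can start playing the first chunk before the rest of the reply has been synthesized.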
Audio Quality Considerations
Optimize for voice quality:
- Clear Audio Input: Ensure good microphone quality
- Noise Cancellation: Handle background noise appropriately
- Speech Rate: Adjust speaking speed for clarity
- Volume Leveling: Maintain consistent audio levels
- Echo Handling: Prevent audio feedback loops
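As one concrete illustration of volume leveling, the sketch below scales a buffer of samples so its loudest peak reaches a target level. It assumes samples are floats in the range -1.0 to 1.0, and `level_volume` is a hypothetical helper. Simple peak normalization is used here for brevity; production systems often rely on loudness-based methods instead.

```python
from typing import List


def level_volume(samples: List[float], target_peak: float = 0.9) -> List[float]:
    """Scale samples so the largest absolute value equals target_peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        # Silence (or an empty buffer): nothing to scale.
        return list(samples)
    gain = target_peak / peak
    return [s * gain for s in samples]


quiet_clip = [0.1, -0.2, 0.05]
leveled = level_volume(quiet_clip)
print(max(abs(s) for s in leveled))
```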
Best Practices for Voice Agents
Conversation Design
- Be Conversational: Use natural speech patterns, not robotic responses
- Stay Concise: Listeners cannot skim or re-read spoken output, so keep each response short
- Provide Context: Help users understand what’s happening
- Handle Errors Gracefully: When misunderstanding occurs, clarify politely
- Use Confirmations: Verify important information verbally
Voice Personality Development
Match voice characteristics to your brand:
- Professional Services: Use clear, authoritative voices (Sage, Echo)
- Customer Service: Choose friendly, helpful tones (Coral, Shimmer)
- Healthcare: Select calm, reassuring voices (Ash, Ballad)
- Entertainment: Pick engaging, expressive options (Ballad, Coral)
Accessibility Considerations
- Clear Pronunciation: Ensure technical terms are spoken clearly
- Adjustable Speed: Allow users to control speaking pace
- Repeat Options: Enable users to request information again
- Simple Navigation: Keep voice menus straightforward
- Error Recovery: Provide clear paths when users get lost
Testing Voice-Only Agents
Quality Assurance Process
- Speech Recognition Accuracy: Test with various accents and speech patterns
- Response Appropriateness: Verify answers are suitable for voice delivery
- Conversation Flow: Ensure natural dialogue progression
- Error Handling: Test recovery from misunderstood input
- Performance: Check response times and audio quality
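Speech recognition accuracy is commonly summarized as word error rate (WER): the number of word-level substitutions, insertions, and deletions needed to turn the recognized transcript into the reference, divided by the number of reference words. The sketch below is a minimal implementation; the function name and the lowercase, whitespace-split normalization are assumptions for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("turn on the lights", "turn on the light"))
```

Running the same reference phrases through recordings from speakers with different accents and comparing the resulting rates is one way to make the first check in the list measurable.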
User Testing Strategy
- Diverse User Groups: Test with different demographics and abilities
- Real-World Scenarios: Simulate actual usage conditions
- Background Noise: Test performance in noisy environments
- Extended Conversations: Verify context retention over longer interactions
- Edge Cases: Test unusual requests and conversation patterns
Performance Optimization
Latency Reduction
Minimize delay in voice interactions:
- Model Selection: Choose real-time optimized models
- Streaming Responses: Deliver audio as it’s generated
- Predictive Processing: Anticipate likely user responses
- Network Optimization: Ensure reliable connectivity
- Local Processing: Cache common responses when possible
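Caching common responses can be as simple as keeping synthesized audio keyed by the reply text. The sketch below assumes a hypothetical `synthesize` callable that turns text into audio bytes; the class name and the whitespace normalization are illustrative choices, not part of any documented API.

```python
from typing import Callable, Dict, List


class ResponseAudioCache:
    """Keep synthesized audio for replies that are spoken repeatedly."""

    def __init__(self, synthesize: Callable[[str], bytes]) -> None:
        self._synthesize = synthesize
        self._cache: Dict[str, bytes] = {}

    def get_audio(self, reply_text: str) -> bytes:
        # Collapse whitespace so trivially different strings share an entry.
        key = " ".join(reply_text.split())
        if key not in self._cache:
            self._cache[key] = self._synthesize(key)
        return self._cache[key]


# Stub synthesizer that records how often it is called.
calls: List[str] = []


def fake_synthesize(text: str) -> bytes:
    calls.append(text)
    return text.encode("utf-8")


cache = ResponseAudioCache(fake_synthesize)
first = cache.get_audio("Hi, how can I help you?")
second = cache.get_audio("Hi,  how can I help you?")
print(len(calls))
```

A real deployment would also bound the cache size and invalidate entries when the selected voice changes, since the same text sounds different in each voice.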
Scalability Planning
- Concurrent Users: Plan for multiple simultaneous voice sessions
- Resource Management: Monitor CPU and bandwidth usage
- Queue Management: Handle peak usage periods gracefully
- Fallback Systems: Provide alternatives when voice fails
- Analytics Integration: Track usage patterns and performance metrics
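One way to plan for concurrent sessions is to cap how many run at once and let the rest wait their turn. The sketch below uses `asyncio.Semaphore` from the Python standard library. The `run_sessions` helper, the limit of 2, and the sleep standing in for a conversation are all assumptions for illustration.

```python
import asyncio
from typing import Iterable


async def run_sessions(
    session_ids: Iterable[int],
    max_concurrent: int,
    work_seconds: float = 0.01,
) -> int:
    """Run all sessions under a concurrency cap and return the observed peak."""
    limiter = asyncio.Semaphore(max_concurrent)
    active = 0
    peak = 0

    async def one_session(session_id: int) -> None:
        nonlocal active, peak
        async with limiter:  # waits here while the cap is reached
            active += 1
            peak = max(peak, active)
            await asyncio.sleep(work_seconds)  # stands in for the conversation
            active -= 1

    await asyncio.gather(*(one_session(s) for s in session_ids))
    return peak


peak_sessions = asyncio.run(run_sessions(range(5), max_concurrent=2))
print(peak_sessions)
```

Sessions beyond the cap simply queue on the semaphore, which is a basic form of the queue management mentioned above.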
Integration with Other Features
Multi-Modal Fallbacks
Even voice-only agents can benefit from multi-modal capabilities:
- Text Alternatives: Provide text options when voice fails
- Visual Confirmations: Send follow-up messages for important actions
- Rich Content: Share links or documents via other channels
- Screen Sharing: Enable visual support when needed
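A text alternative for failed voice delivery can be sketched as a try-then-fall-back wrapper. Everything named here is hypothetical: `VoiceDeliveryError`, `deliver_reply`, and the `speak` and `send_text` callables stand in for whatever channels a deployment actually uses.

```python
from typing import Callable, List


class VoiceDeliveryError(Exception):
    """Hypothetical error raised when audio cannot be delivered."""


def deliver_reply(
    reply_text: str,
    speak: Callable[[str], None],
    send_text: Callable[[str], None],
) -> str:
    """Try the voice channel first; fall back to text if it fails."""
    try:
        speak(reply_text)
        return "voice"
    except VoiceDeliveryError:
        send_text(reply_text)
        return "text"


# Demo: a speaker that always fails, and a list acting as the text channel.
sent_messages: List[str] = []


def broken_speaker(text: str) -> None:
    raise VoiceDeliveryError("audio channel unavailable")


channel = deliver_reply("Your order has shipped.", broken_speaker, sent_messages.append)
print(channel, sent_messages)
```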
Data Collection
Voice agents can collect information through conversation:
- Verbal Forms: Gather information through natural dialogue
- Confirmation Steps: Verify collected data audibly
- Privacy Compliance: Handle sensitive information appropriately
- Data Validation: Confirm spellings and details verbally
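A verbal form with audible confirmation can be sketched as a loop that asks for each field, reads the answer back, and re-asks if the user rejects it. The `collect_fields` helper and the `ask` and `confirm` callables are hypothetical; in a real agent they would wrap speech output and recognition.

```python
from typing import Callable, Dict, List, Optional


def collect_fields(
    field_names: List[str],
    ask: Callable[[str], str],
    confirm: Callable[[str], bool],
    max_attempts: int = 3,
) -> Dict[str, Optional[str]]:
    """Ask for each field and keep it only once the user confirms it."""
    collected: Dict[str, Optional[str]] = {}
    for name in field_names:
        collected[name] = None  # stays None if never confirmed
        for _ in range(max_attempts):
            answer = ask(f"What is your {name}?").strip()
            if answer and confirm(f"I heard {answer}. Is that right?"):
                collected[name] = answer
                break
    return collected


# Scripted demo: the first name is misheard and rejected once.
answers = iter(["Jon Smith", "John Smith", "555 0100"])
confirmations = iter([False, True, True])

form = collect_fields(
    ["name", "phone number"],
    ask=lambda prompt: next(answers),
    confirm=lambda prompt: next(confirmations),
)
print(form)
```

Capping the attempts keeps a user from being stuck in a loop; a field that is still `None` afterwards is a natural trigger for escalation or a text fallback.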
