Voice-Only Agents

Voice-only agents provide hands-free, conversational AI experiences optimized for spoken interactions. These agents are perfect for scenarios where users need to multitask or prefer natural speech over text-based communication.

Voice Agent Configuration

The voice agent setup is accessible through the main agent configuration interface, as shown in the companion settings.

Voice-Only Mode Toggle

The left sidebar contains the “Voice only” toggle, which enables pure voice interaction, along with the related voice settings:
  • Voice Only Toggle: Switches the agent to speech-only mode
  • Companion Voice: Select from available voice personalities
  • Real-time Processing: Enable immediate speech-to-speech interaction

Voice Selection Options

The voice dropdown adapts to the model family you select:

Gemini Voice Set

  • Sportsman, Customer support, Sarah, Brooke, Katie, Zemo, ajith, duaila, azj, ajz, sjl, brit, Swissen
  • Available whenever a Gemini 2.5 model powers the agent
  • Blend friendly assistants (Customer support, Sarah) with energetic presenters (Sportsman, Zemo)

OpenAI Realtime Voice Set

  • Alloy, Echo, Shimmer, Ash, Ballad, Coral, Sage, Verse, Cedar, Marin
  • Automatically displayed when GPT Realtime, GPT‑4o Realtime, or GPT Realtime Mini is active
  • Covers professional narrators (Alloy, Ash), expressive hosts (Ballad, Coral), and supportive guides (Sage, Verse)
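The way the dropdown adapts to the model family can be pictured as a simple lookup. The sketch below is illustrative only: the `voices_for_model` helper and its name-based family detection are assumptions made for this example, not the platform's API. The voice names are the ones listed above.

```python
# Hypothetical helper: maps a model name to the voice set listed above.
GEMINI_VOICES = ["Sportsman", "Customer support", "Sarah", "Brooke", "Katie",
                 "Zemo", "ajith", "duaila", "azj", "ajz", "sjl", "brit", "Swissen"]
OPENAI_REALTIME_VOICES = ["Alloy", "Echo", "Shimmer", "Ash", "Ballad",
                          "Coral", "Sage", "Verse", "Cedar", "Marin"]


def voices_for_model(model_name: str) -> list[str]:
    """Return the voice list for a model, based on its family (assumed naming)."""
    name = model_name.lower()
    if name.startswith("gemini 2.5"):
        return GEMINI_VOICES
    if "realtime" in name:
        return OPENAI_REALTIME_VOICES
    # The text above does not say which voices other model families receive.
    return []
```

For example, `voices_for_model("GPT Realtime Mini")` returns the OpenAI Realtime list, while an unrecognized family returns an empty list.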

Real-Time Voice Models

Voice-only agents can leverage advanced real-time models:

OpenAI Realtime Models

  • GPT Realtime Mini: Fastest response time for highly interactive conversations
  • GPT‑4o Realtime: Balanced latency and quality for premium experiences
  • GPT Realtime: General-purpose realtime model with strong reasoning

Gemini Models

  • Gemini 2.5 Flash Lite: Lightweight option for responsive experiences
  • Gemini 2.5 Flash: Balanced speed and quality
  • Gemini 2.5 Pro: Highest reasoning capability in the Gemini lineup

Groq-Hosted Models

  • GPT OSS 20B / 120B: High-performance open-weight GPT models
  • Qwen3‑32B: Strong multilingual and reasoning support
  • Moonshot AI Kimi K2: Alternative option with a distinctive response style

These models enable natural, flowing conversations with minimal delay between user speech and agent responses.

Voice Agent Benefits

Hands-Free Operation

  • Multitasking: Users can interact while working on other tasks
  • Accessibility: Well suited to users with visual impairments or mobility limitations
  • Convenience: Natural conversation without typing or screen interaction
  • Mobile-Friendly: Ideal for on-the-go interactions

Natural Conversation Flow

  • Real-time Responses: Immediate feedback like human conversation
  • Interruption Handling: Agents can handle mid-sentence interruptions
  • Context Awareness: Maintain conversation context across turns
  • Emotional Nuance: Voice conveys tone and emotion better than text

Implementation Strategies

Voice-First Design Principles

When configuring voice-only agents, follow these voice interaction guidelines:
  • Use conversational, natural language
  • Keep responses concise but complete
  • Include verbal confirmations for important actions
  • Design for listening, not reading
  • Handle background noise gracefully
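One way to apply "design for listening, not reading" is to post-process a text response before it is spoken. The following is a minimal sketch under stated assumptions: the `prepare_for_speech` function, the three-sentence limit, and the "a link" placeholder are all hypothetical choices for illustration, not platform behavior.

```python
import re


def prepare_for_speech(text: str, max_sentences: int = 3) -> str:
    """Strip visual formatting and keep only the first few sentences."""
    # URLs are hard to follow when read aloud, so replace them with a short phrase.
    # (Simplification: punctuation directly after a URL is swallowed with it.)
    text = re.sub(r"https?://\S+", "a link", text)
    # Drop common markdown markers that have no spoken equivalent.
    text = re.sub(r"[*_#`>]", "", text)
    # Collapse line breaks and repeated spaces into single spaces.
    text = re.sub(r"\s+", " ", text).strip()
    # Split after sentence-ending punctuation and keep the first few sentences.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return " ".join(sentences[:max_sentences])
```

Anything trimmed this way can be offered on request ("Would you like more detail?") rather than spoken up front.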

Conversation Flow Optimization

Structure conversations for voice interaction:
  • Clear Opening: “Hi, how can I help you?” works well for voice
  • Guided Discovery: Suggest specific questions users can ask
  • Confirmation Loops: Verify understanding through speech
  • Natural Closing: End conversations gracefully

Use Cases for Voice-Only Agents

Customer Support Hotlines

  • 24/7 Availability: Replace or supplement human phone support
  • Quick Triage: Route calls based on spoken requests
  • Information Retrieval: Answer frequently asked questions
  • Escalation Management: Transfer complex issues to humans
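As a rough illustration of triage with escalation, the sketch below routes a transcribed request by keyword and hands everything else to a human. The queue names, keywords, and the `route_call` function are hypothetical; a production agent would more likely rely on the model's own intent understanding than on keyword matching.

```python
import re

# Hypothetical queues and trigger words, for illustration only.
ROUTES = {
    "billing": {"invoice", "refund", "charge"},
    "technical": {"error", "crash", "login"},
}


def route_call(transcript: str) -> str:
    """Pick a queue from the words in a transcript; escalate when nothing matches."""
    words = set(re.findall(r"[a-z']+", transcript.lower()))
    for queue, keywords in ROUTES.items():
        if words & keywords:
            return queue
    return "human_agent"  # escalation path for requests the agent cannot classify
```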

Smart Speaker Integration

  • Home Automation: Control connected devices through voice
  • Information Services: Weather, news, and general inquiries
  • Entertainment: Music, podcasts, and interactive content
  • Productivity: Calendar management, reminders, and tasks

Automotive Applications

  • Hands-Free Assistance: Safe interaction while driving
  • Navigation Help: Provide directions and traffic updates
  • Vehicle Control: Adjust settings through voice commands
  • Emergency Support: Quick access to help when needed

Healthcare and Wellness

  • Symptom Checking: Initial health assessments through conversation
  • Medication Reminders: Voice-activated pill reminders
  • Mental Health Support: Conversational therapy and check-ins
  • Accessibility Services: Support for users with disabilities

Technical Implementation

Voice Processing Pipeline

Conceptually, voice-only agents follow this processing flow (speech-to-speech real-time models may combine several of these steps internally):
  1. Speech Recognition: Convert user speech to text
  2. Intent Understanding: Process natural language input
  3. Response Generation: Create appropriate textual response
  4. Text-to-Speech: Convert response to natural speech
  5. Audio Delivery: Stream audio back to user
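The five steps above can be sketched as a single function that chains three pluggable stages. This is a minimal illustration, not the platform's implementation: `run_voice_turn` and its `transcribe`, `generate_reply`, and `synthesize` parameters are hypothetical names, and the stages used in the example are stand-in lambdas rather than real speech services.

```python
from typing import Callable, Iterable, Iterator


def run_voice_turn(
    audio_in: bytes,
    transcribe: Callable[[bytes], str],
    generate_reply: Callable[[str], str],
    synthesize: Callable[[str], Iterable[bytes]],
) -> Iterator[bytes]:
    """One conversational turn: speech in, streamed speech out."""
    user_text = transcribe(audio_in)        # step 1: speech recognition
    reply_text = generate_reply(user_text)  # steps 2-3: understand intent, generate response
    for chunk in synthesize(reply_text):    # step 4: text-to-speech
        yield chunk                         # step 5: audio delivery, one chunk at a time


# Stand-in stages so the sketch runs without any speech service.
chunks = list(run_voice_turn(
    b"raw-audio",
    transcribe=lambda audio: "what time is it",
    generate_reply=lambda text: f"You asked: {text}",
    synthesize=lambda text: (word.encode() for word in text.split()),
))
```

Because the function is a generator, the caller can start playing the first audio chunk before synthesis of the full reply has finished.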

Audio Quality Considerations

Optimize for voice quality:
  • Clear Audio Input: Ensure good microphone quality
  • Noise Cancellation: Handle background noise appropriately
  • Speech Rate: Adjust speaking speed for clarity
  • Volume Leveling: Maintain consistent audio levels
  • Echo Handling: Prevent audio feedback loops
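Volume leveling, for instance, can be approximated by scaling audio to a target loudness. The sketch below assumes 16-bit PCM samples held in a Python list and uses root-mean-square (RMS) level as the loudness measure; the `level_volume` function and the default target value are illustrative assumptions, and real systems typically apply smoother, time-varying gain.

```python
import math


def level_volume(samples: list[int], target_rms: float = 3000.0) -> list[int]:
    """Scale 16-bit PCM samples so their RMS matches target_rms, clipping to the int16 range."""
    if not samples:
        return []
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms == 0:
        return list(samples)  # pure silence: there is nothing to scale
    gain = target_rms / rms
    # Clip after scaling so loud peaks cannot overflow the 16-bit range.
    return [max(-32768, min(32767, round(s * gain))) for s in samples]
```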

Best Practices for Voice Agents

Conversation Design

  1. Be Conversational: Use natural speech patterns, not robotic responses
  2. Stay Concise: Listeners cannot skim or re-read, so keep each response short
  3. Provide Context: Help users understand what’s happening
  4. Handle Errors Gracefully: When misunderstanding occurs, clarify politely
  5. Use Confirmations: Verify important information verbally

Voice Personality Development

Match voice characteristics to your brand:
  • Professional Services: Use clear, authoritative voices (Alloy, Ash)
  • Customer Service: Choose friendly, helpful tones (Coral, Shimmer)
  • Healthcare: Select calm, reassuring voices (Sage, Verse)
  • Entertainment: Pick engaging, expressive options (Ballad, Coral)

Accessibility Considerations

  • Clear Pronunciation: Ensure technical terms are spoken clearly
  • Adjustable Speed: Allow users to control speaking pace
  • Repeat Options: Enable users to request information again
  • Simple Navigation: Keep voice menus straightforward
  • Error Recovery: Provide clear paths when users get lost

Testing Voice-Only Agents

Quality Assurance Process

  1. Speech Recognition Accuracy: Test with various accents and speech patterns
  2. Response Appropriateness: Verify answers are suitable for voice delivery
  3. Conversation Flow: Ensure natural dialogue progression
  4. Error Handling: Test recovery from misunderstood input
  5. Performance: Check response times and audio quality
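Speech recognition accuracy is commonly quantified as word error rate (WER): the word-level edit distance between a reference transcript and the recognized text, divided by the number of reference words. The sketch below is one straightforward way to compute it; the `word_error_rate` function name and the lowercase, whitespace-based tokenization are assumptions made for this example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)
```

Computing this per accent or per noise condition makes it easier to see where recognition degrades.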

User Testing Strategy

  • Diverse User Groups: Test with different demographics and abilities
  • Real-World Scenarios: Simulate actual usage conditions
  • Background Noise: Test performance in noisy environments
  • Extended Conversations: Verify context retention over longer interactions
  • Edge Cases: Test unusual requests and conversation patterns

Performance Optimization

Latency Reduction

Minimize delay in voice interactions:
  • Model Selection: Choose real-time optimized models
  • Streaming Responses: Deliver audio as it’s generated
  • Predictive Processing: Anticipate likely user responses
  • Network Optimization: Ensure reliable connectivity
  • Local Processing: Cache common responses when possible
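Streaming responses can be illustrated by cutting a model's token stream into sentences, so each sentence can be sent to speech synthesis as soon as it is complete instead of waiting for the full reply. This is a sketch under assumptions: `sentences_from_stream` is a hypothetical helper, and it treats punctuation followed by whitespace as a sentence boundary, which is a simplification (abbreviations such as "Dr. Smith" would be split).

```python
import re
from typing import Iterable, Iterator

# A sentence ends at ., ! or ? followed by whitespace, so "3.5" is not split.
_SENTENCE_END = re.compile(r"[.!?]\s")


def sentences_from_stream(tokens: Iterable[str]) -> Iterator[str]:
    """Yield each sentence as soon as it is complete."""
    buffer = ""
    for token in tokens:
        buffer += token
        match = _SENTENCE_END.search(buffer)
        while match:
            yield buffer[: match.start() + 1].strip()
            buffer = buffer[match.end():]
            match = _SENTENCE_END.search(buffer)
    if buffer.strip():
        yield buffer.strip()  # flush whatever remains when the stream ends
```

The first sentence becomes available after only a few tokens, which is what reduces the perceived delay.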

Scalability Planning

  • Concurrent Users: Plan for multiple simultaneous voice sessions
  • Resource Management: Monitor CPU and bandwidth usage
  • Queue Management: Handle peak usage periods gracefully
  • Fallback Systems: Provide alternatives when voice fails
  • Analytics Integration: Track usage patterns and performance metrics
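Capping concurrent sessions and falling back when capacity runs out can be sketched with Python's `asyncio.Semaphore`. The `SessionLimiter` class, its parameters, and the `demo` coroutine are hypothetical names for illustration; the session and fallback bodies are placeholders for real voice and text handling.

```python
import asyncio


class SessionLimiter:
    """Caps concurrent voice sessions; callers that cannot get a slot in time fall back."""

    def __init__(self, max_sessions: int, wait_seconds: float = 0.05):
        self._slots = asyncio.Semaphore(max_sessions)
        self._wait = wait_seconds

    async def run(self, session, fallback):
        try:
            # Wait briefly for a free slot instead of queueing indefinitely.
            await asyncio.wait_for(self._slots.acquire(), timeout=self._wait)
        except asyncio.TimeoutError:
            return await fallback()
        try:
            return await session()
        finally:
            self._slots.release()


async def demo():
    limiter = SessionLimiter(max_sessions=1)

    async def voice_session():
        await asyncio.sleep(0.2)  # placeholder for a live voice conversation
        return "voice"

    async def text_fallback():
        return "text fallback"

    # Two callers compete for one slot: the second one times out and falls back.
    return await asyncio.gather(
        limiter.run(voice_session, text_fallback),
        limiter.run(voice_session, text_fallback),
    )
```

Running `asyncio.run(demo())` shows the first caller getting the voice session and the second receiving the fallback.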

Integration with Other Features

Multi-Modal Fallbacks

Even voice-only agents can benefit from multi-modal capabilities:
  • Text Alternatives: Provide text options when voice fails
  • Visual Confirmations: Send follow-up messages for important actions
  • Rich Content: Share links or documents via other channels
  • Screen Sharing: Enable visual support when needed

Data Collection

Voice agents can collect information through conversation:
  • Verbal Forms: Gather information through natural dialogue
  • Confirmation Steps: Verify collected data audibly
  • Privacy Compliance: Handle sensitive information appropriately
  • Data Validation: Confirm spellings and details verbally
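A verbal form with a confirmation step might look like the sketch below. Everything here is a hypothetical illustration: `collect_field` and `spell_out` are invented names, and `ask` stands for whatever function speaks a prompt and returns the transcribed reply.

```python
from typing import Callable, Optional


def spell_out(value: str) -> str:
    """Render a value character by character so it can be confirmed aloud."""
    return ", ".join(ch.upper() for ch in value if not ch.isspace())


def collect_field(
    prompt: str, ask: Callable[[str], str], max_attempts: int = 3
) -> Optional[str]:
    """Ask for a value, read it back spelled out, and retry until the user confirms."""
    for _ in range(max_attempts):
        answer = ask(prompt).strip()
        reply = ask(f"I heard {spell_out(answer)}. Is that right?")
        if reply.strip().lower() in {"yes", "yeah", "correct"}:
            return answer
    return None  # unconfirmed after all attempts: hand off to a fallback channel
```

Returning `None` after repeated failures gives the caller a clear signal to switch to a text alternative or a human agent.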

Voice-only agents represent the future of natural, accessible AI interaction. By leveraging advanced real-time voice models and thoughtful conversation design, these agents can provide compelling user experiences that feel more like talking with a knowledgeable assistant than interacting with a computer.