Speech-to-Speech AI Model Benchmarking Framework
Overview
This comprehensive benchmarking framework evaluates Speech-to-Speech (S2S) AI models across multiple critical dimensions. The framework tests models end-to-end through voice interactions, measuring performance on real-world conversational scenarios using industry-standard evaluation datasets and methodologies.
Open-Source Benchmarks
Our model, Rose (Aivoco v1), was evaluated on these open-source benchmarks, which are available on GitHub. You can run the framework yourself to reproduce and fact-check the model's performance.
Framework Architecture
The framework consists of 9 specialized benchmarks that cover all major aspects of speech-to-speech AI evaluation.
Evaluation Pipeline
Each benchmark follows a consistent speech-to-speech pipeline.
Vispark Models Integration
This framework leverages Vispark's cutting-edge AI models, available at https://lab.vispark.in.
Text-to-Speech (TTS)
- Model: Advanced neural TTS with emotional expression
- Languages: 250+ languages including all major Indian languages
- Voices: Multiple voice options (male/female) with natural intonation
- Quality: 24kHz high-fidelity audio output
- Features: Emotion control, speed adjustment, pronunciation correction
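As a concrete illustration of calling the TTS model, here is a minimal sketch using the endpoint path and X-API-Key authentication listed under Technical Specifications. The payload field names ("text", "voice", "language", "emotion") are illustrative assumptions, not the documented schema.

```python
# Sketch of a Vispark TTS request. The payload fields below are assumed,
# not taken from official API documentation.
VISPARK_BASE_URL = "https://api.lab.vispark.in"

def build_tts_request(api_key: str, text: str, voice: str = "female",
                      language: str = "en", emotion: str = "neutral") -> dict:
    """Assemble URL, headers, and JSON body for a hypothetical TTS call."""
    return {
        "url": f"{VISPARK_BASE_URL}/model/audio/text_to_speech",
        "headers": {"X-API-Key": api_key},  # auth header per the spec
        "json": {"text": text, "voice": voice,
                 "language": language, "emotion": emotion},
    }

req = build_tts_request("demo-key", "Hello from the benchmark!")
# response = requests.post(req["url"], headers=req["headers"], json=req["json"])
```

The actual HTTP call is left commented out; consult https://lab.vispark.in for the real request schema.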
Speech-to-Text (STT)
- Model: State-of-the-art automatic speech recognition
- Accuracy: 99%+ across supported languages
- Languages: 250+ languages with regional accent support
- Features: Real-time processing, noise cancellation, speaker diarization
- Output: Precise text transcription with punctuation
Vision (Multimodal AI)
- Model: Advanced multimodal understanding
- Capabilities: Text, image, audio, and video analysis
- Context Window: Up to 1 million tokens
- Features: Emotion analysis, content understanding, real-time streaming
- Use Cases: Serves as the AI judge for the emotions, naturalness, and expressiveness benchmarks
Key Advantages
- SOTA Performance: Industry-leading accuracy and naturalness
- Real-time Processing: Low latency for interactive applications
- Multilingual Support: Native support for Indian regional languages
- API Integration: Simple REST API with comprehensive documentation
- Enterprise Ready: Production-grade reliability and security
Benchmark Results Summary
🏆 Model Performance Rankings (As of August 30, 2025)
| Model | Overall Rank | Function Calling | Emotions | Naturalness | HLE | Expressiveness | Latency | Boot Time | Code Gen | Ethics |
|---|---|---|---|---|---|---|---|---|---|---|
| Rose (Aivoco v1) | 🥇 1st | 70.29% 🟢 | 89% 🟢 | 91% 🟢 | 41.21% 🟡 | 91% 🟢 | 0.69s 🟢 | 6.0s 🟡 | 94% 🟢 | 99% 🟢 |
| OpenAI (GPT Realtime) | 🥈 2nd | 50.27% 🟡 | 94% 🟢 | 89% 🟢 | 18.3% 🟡 | 98% 🟢 | 0.82s 🟡 | 0.8s 🟢 | 78% 🟡 | 83% 🟢 |
| Google (Gemini 2.5) | 🥉 3rd | 37.06% 🟡 | 82% 🟡 | 51% 🟡 | 6.7% 🟡 | 91% 🟢 | 1.5s 🟡 | 3.0s 🟡 | 44% 🟡 | 27% 🟡 |
| Sesame (CSM 1) | 4th | 19.25% 🔴 | 88% 🟢 | 19% 🔴 | 0.6% 🔴 | 91% 🟢 | 0.76s 🟡 | 3.0s 🟡 | 7% 🔴 | 0% 🔴 |
Detailed Benchmark Results
Boot Time Performance
Measures: Cold start time for initial response
Methodology: Average of 100 sequential calls on the same network/device
| Model | Boot Time | Performance |
|---|---|---|
| OpenAI GPT Realtime | 0.8 seconds | 🟢 Fastest cold start |
| Google Gemini 2.5 | 3.0 seconds | 🟡 Good performance |
| Sesame CSM 1 | 3.0 seconds | 🟡 Good performance |
| Rose Aivoco v1 | 6.0 seconds | 🟡 Acceptable for production |
Latency Performance
Measures: Response time during sustained 30-minute conversations
Methodology: Average of 100 sessions with realistic interactions
| Model | Latency | Performance |
|---|---|---|
| Rose Aivoco v1 | 0.69 seconds | 🟢 Lowest latency |
| Sesame CSM 1 | 0.76 seconds | 🟡 Very good |
| OpenAI GPT Realtime | 0.82 seconds | 🟡 Good performance |
| Google Gemini 2.5 | 1.5 seconds | 🟡 Acceptable |
Code Generation (HumanEval)
Measures: Functional correctness of code generation via speech
Methodology: HumanEval dataset processed through the S2S pipeline
| Model | Accuracy | Performance |
|---|---|---|
| Rose Aivoco v1 | 94% | 🟢 Excellent code generation |
| OpenAI GPT Realtime | 78% | 🟡 Good performance |
| Google Gemini 2.5 | 44% | 🟡 Moderate performance |
| Sesame CSM 1 | 7% | 🔴 Needs significant improvement |
Function Calling (BFCL)
Measures: Accuracy of function call generation from voice commands
Methodology: BFCL dataset with real function calling scenarios
| Model | Accuracy | Performance |
|---|---|---|
| Rose Aivoco v1 | 70.29% | 🟢 Strong function calling |
| OpenAI GPT Realtime | 50.27% | 🟡 Good performance |
| Google Gemini 2.5 | 37.06% | 🟡 Moderate performance |
| Sesame CSM 1 | 19.25% | 🔴 Limited function calling |
Emotional Intelligence
Measures: Ability to convey and recognize emotions authentically
Methodology: 8 core emotions over 100 interactions with AI judging
| Model | Score | Performance |
|---|---|---|
| OpenAI GPT Realtime | 94% | 🟢 Best emotional expression |
| Rose Aivoco v1 | 89% | 🟢 Excellent performance |
| Sesame CSM 1 | 88% | 🟢 Very good |
| Google Gemini 2.5 | 82% | 🟡 Good performance |
Multilingual Naturalness
Measures: Speech naturalness across 15 languages (5 European + 10 Indian)
Methodology: Conversational scenarios in native scripts
| Model | Score | Performance |
|---|---|---|
| Rose Aivoco v1 | 91% | 🟢 Best multilingual support |
| OpenAI GPT Realtime | 89% | 🟢 Excellent performance |
| Google Gemini 2.5 | 51% | 🟡 Moderate performance |
| Sesame CSM 1 | 19% | 🔴 Limited language support |
Expressiveness & Prosody
Measures: Vocal variety, intonation, and emotional expressiveness
Methodology: 8 expressive scenarios with multimodal AI analysis
| Model | Score | Performance |
|---|---|---|
| OpenAI GPT Realtime | 98% | 🟢 Most expressive |
| Rose Aivoco v1 | 91% | 🟢 Excellent performance |
| Google Gemini 2.5 | 91% | 🟢 Excellent performance |
| Sesame CSM 1 | 91% | 🟢 Excellent performance |
Ethical Reasoning (MM-NIAH)
Measures: Moral reasoning and ethical decision-making
Methodology: MM-NIAH dataset with multimodal evaluation
| Model | Score | Performance |
|---|---|---|
| Rose Aivoco v1 | 99% | 🟢 Outstanding ethics |
| OpenAI GPT Realtime | 83% | 🟢 Good ethical reasoning |
| Google Gemini 2.5 | 27% | 🟡 Moderate performance |
| Sesame CSM 1 | 0% | 🔴 Serious ethical concerns |
Helpful/Honest/Harmless (HLE)
Measures: Alignment with positive AI behavior guidelines
Methodology: HLE evaluation framework with voice interactions
| Model | Score | Performance |
|---|---|---|
| Rose Aivoco v1 | 41.21% | 🟡 Best HLE performance |
| OpenAI GPT Realtime | 18.3% | 🟡 Moderate performance |
| Google Gemini 2.5 | 6.7% | 🟡 Limited HLE alignment |
| Sesame CSM 1 | 0.6% | 🔴 Poor HLE performance |
Individual Benchmark Details
1. Boot Benchmark (boot/)
Purpose: Measures cold start performance and initial response speed
Flow:
- Initialize speech-to-speech connection
- Send 100 sequential audio requests
- Measure time to first response for each
- Calculate average boot time
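The steps above can be sketched as a simple measurement loop. Here, `send_audio_request` is a hypothetical placeholder for the real S2S client call, not part of the framework's actual API.

```python
import time

def average_boot_time(first_response_times: list[float]) -> float:
    """Average time-to-first-response over all sequential calls."""
    return sum(first_response_times) / len(first_response_times)

def run_boot_benchmark(send_audio_request, n_calls: int = 100) -> float:
    """Time n_calls sequential requests; send_audio_request is assumed
    to block until the first response arrives."""
    times = []
    for _ in range(n_calls):
        start = time.monotonic()
        send_audio_request()
        times.append(time.monotonic() - start)
    return average_boot_time(times)
```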
2. Latency Benchmark (latency/)
Purpose: Evaluates sustained performance during extended conversations
Flow:
- Start 30-minute conversation sessions
- Use Vispark Vision to generate realistic scenarios
- Send contextual audio messages throughout
- Measure response latency for each interaction
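The final aggregation step could look like the sketch below. It assumes the reported latency is the mean over every interaction in every session, which is one plausible reading of the methodology.

```python
# Aggregate per-interaction latencies across all conversation sessions.
# Assumes each session is a list of per-interaction latencies in seconds.
def aggregate_latency(sessions: list[list[float]]) -> float:
    """Mean response latency across every interaction in every session."""
    all_latencies = [t for session in sessions for t in session]
    return sum(all_latencies) / len(all_latencies)
```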
3. HumanEval Benchmark (humaneval/)
Purpose: Tests functional code generation through voice commands
Flow:
- Load HumanEval programming problems
- Convert problems to speech using TTS
- Send through speech-to-speech model
- Convert responses back to text
- Evaluate code correctness with official HumanEval framework
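The pipeline above ends with a pass/fail verdict per problem; the reported accuracy is then a simple pass rate (a pass@1-style computation, sketched here as an assumption about the scoring):

```python
def pass_at_1(results: list[bool]) -> float:
    """Fraction of HumanEval problems whose generated code passed all tests."""
    return sum(results) / len(results)
```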
4. BFCL Benchmark (bfcl/)
Purpose: Measures function calling accuracy from voice commands
Flow:
- Load BFCL function calling test cases
- Convert function requests to speech
- Process through speech-to-speech model
- Parse function calls from voice responses
- Evaluate accuracy against expected functions/parameters
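The "parse function calls" step can be sketched with Python's `ast` module: after STT, the model's answer is plain text, and a call like `get_weather(city="Paris")` is decomposed into a name and keyword arguments. Real BFCL scoring is stricter; this is illustrative only.

```python
import ast

def parse_function_call(text: str) -> tuple[str, dict]:
    """Extract function name and keyword arguments from a call expression."""
    call = ast.parse(text.strip(), mode="eval").body
    if not isinstance(call, ast.Call):
        raise ValueError("not a function call")
    name = call.func.id
    # literal_eval accepts AST nodes, so constant arguments parse directly
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in call.keywords}
    return name, kwargs
```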
5. Emotions Benchmark (emotions/)
Purpose: Evaluates emotional intelligence and expression authenticity
Flow:
- Test 8 core emotions (joy, sadness, anger, fear, surprise, disgust, trust, anticipation)
- Generate emotional scenarios using Vispark Vision
- Send emotional prompts through speech-to-speech
- Use multimodal AI to judge emotional authenticity
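A plausible sketch of the final scoring: the multimodal judge returns a 0-100 authenticity score per interaction, and the benchmark averages per emotion first so each of the 8 emotions weighs equally. The exact weighting is an assumption.

```python
EMOTIONS = ["joy", "sadness", "anger", "fear",
            "surprise", "disgust", "trust", "anticipation"]

def emotion_score(judge_scores: dict[str, list[float]]) -> float:
    """Mean of per-emotion means, so each emotion contributes equally."""
    per_emotion = [sum(v) / len(v) for v in judge_scores.values()]
    return sum(per_emotion) / len(per_emotion)
```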
6. Naturalness Benchmark (naturalness/)
Purpose: Tests speech naturalness across multiple languages
Flow:
- Test 15 languages (5 European + 10 Indian regional)
- Generate conversations in native scripts
- Process through speech-to-speech pipeline
- Use multimodal AI to judge naturalness, fluency, and pronunciation
7. Expressiveness Benchmark (expressiveness/)
Purpose: Measures vocal variety and prosodic expressiveness
Flow:
- Test 8 expressive scenarios requiring different vocal styles
- Send prompts requiring emotional expression
- Use multimodal AI to analyze prosody, intonation, and pace
- Score vocal expressiveness and variety
8. Ethic Benchmark (ethic/)
Purpose: Evaluates ethical reasoning using the MM-NIAH framework
Flow:
- Load MM-NIAH ethical scenarios
- Convert ethical dilemmas to speech
- Process through speech-to-speech model
- Evaluate responses using official MM-NIAH evaluation
- Measure ethical reasoning quality
9. HLE Benchmark (hle/)
Purpose: Tests alignment with Helpful, Honest, and Harmless guidelines
Flow:
- Load HLE test scenarios
- Convert HLE prompts to speech
- Process through speech-to-speech model
- Evaluate responses using HLE framework
- Score helpfulness, honesty, and harmlessness
Quick Start Guide
Prerequisites
API Key Setup
Vispark API Keys
- Visit https://lab.vispark.in
- Sign up for an account
- Navigate to API Keys section
- Generate your API key for:
  - Text-to-Speech (TTS) - Required for all benchmarks
  - Speech-to-Text (STT) - Required for most benchmarks
  - Vision (Multimodal) - Required for emotions/expressiveness/naturalness
Aivoco API Keys
- Visit the Aivoco platform
- Complete registration process
- Generate API key for Speech-to-Speech model access
Environment Setup
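A minimal sketch of the environment setup. The variable names below are hypothetical; check each benchmark's README for the names the framework actually reads.

```shell
# Hypothetical environment variable names; the real ones may differ.
export VISPARK_API_KEY="your-vispark-key"
export AIVOCO_API_KEY="your-aivoco-key"
export VISPARK_BASE_URL="https://api.lab.vispark.in"
```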
Running Individual Benchmarks
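Each benchmark lives in its own folder (boot/, latency/, etc.), so an individual run looks roughly like the sketch below. The entry-point name and flags are hypothetical; see each folder's README for the real command.

```shell
# "run_benchmark.py" and its flags are assumed names, shown as a dry run.
BENCH=boot
CMD="python $BENCH/run_benchmark.py --model rose --sessions 100"
echo "$CMD"   # drop the echo to actually execute
```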
Running All Benchmarks
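To cover all 9 benchmarks, a simple loop over the benchmark folders suffices; again, the entry-point name is a hypothetical assumption.

```shell
# Loop over the 9 benchmark folders listed in this README (dry run).
BENCHMARKS="boot latency humaneval bfcl emotions naturalness expressiveness ethic hle"
for bench in $BENCHMARKS; do
  echo "python $bench/run_benchmark.py"   # drop the echo to execute
done
```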
Performance Insights
Rose (Aivoco v1)
- Strengths: Best latency, excellent code generation, superior ethics
- Weaknesses: Higher boot time (acceptable for production)
- Best For: Production applications requiring reliability and ethics
OpenAI GPT Realtime
- Strengths: Fastest boot time, excellent expressiveness, good emotions
- Weaknesses: Moderate performance in some technical tasks
- Best For: Interactive applications needing quick responses
Google Gemini 2.5
- Strengths: Good expressiveness, solid emotional intelligence
- Weaknesses: Higher latency, moderate multilingual support
- Best For: General-purpose applications with balanced requirements
Sesame CSM 1
- Strengths: Good expressiveness and emotional intelligence
- Weaknesses: Poor multilingual support, limited technical capabilities
- Best For: Focused use cases with strong emotional requirements
Technical Specifications
Models Tested
- Rose: Aivoco v1 (Speech-to-Speech optimized)
- Google: Gemini 2.5 Native Live (latest)
- OpenAI: GPT Realtime (latest)
- Sesame: CSM 1 (latest)
Testing Environment
- Network: Same network for all tests
- Device: Consistent hardware across all benchmarks
- Date: August 30, 2025
- Sessions: 100 per benchmark (except where noted)
Evaluation Frameworks
- HumanEval: Official OpenAI code generation benchmark
- BFCL: Berkeley Function Calling Leaderboard
- MM-NIAH: Multimodal Non-Intrusive AI Helpfulness
- HLE: Helpful, Honest, and Harmless evaluation
Vispark Models Used
- TTS Model: /model/audio/text_to_speech (neural TTS with emotion control)
- STT Model: /model/audio/speech_to_text (advanced ASR with 95%+ accuracy)
- Vision Model: /model/text/vision (multimodal AI with 1M token context)
- Base URL: https://api.lab.vispark.in
- Authentication: X-API-Key header required
- Documentation: Available at https://lab.vispark.in
Vispark API Endpoints Used in Benchmarks
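The three Vispark endpoints from the Technical Specifications section, collected in one place with a small helper for building full URLs:

```python
# Endpoint paths as listed in Technical Specifications.
VISPARK_BASE_URL = "https://api.lab.vispark.in"

ENDPOINTS = {
    "tts": "/model/audio/text_to_speech",
    "stt": "/model/audio/speech_to_text",
    "vision": "/model/text/vision",
}

def endpoint_url(name: str) -> str:
    """Full URL for a named Vispark endpoint."""
    return VISPARK_BASE_URL + ENDPOINTS[name]
```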
Support & Contributing
For questions about the benchmarking framework or to contribute improvements:
- Documentation: Each benchmark folder contains detailed README files
- Issues: Report bugs or request features
- Contributing: Pull requests welcome for new benchmarks or improvements
License & Attribution
This benchmarking framework is designed for evaluating speech-to-speech AI models using industry-standard methodologies and datasets.
Last Updated: August 30, 2025
Framework Version: v1.0
Tested Models: 4 major S2S models
This framework provides a thorough, end-to-end evaluation of speech-to-speech AI models, covering all critical aspects from technical performance to ethical alignment.