Speech-to-Speech AI Model Benchmarking Framework

Overview

This comprehensive benchmarking framework evaluates Speech-to-Speech (S2S) AI models across multiple critical dimensions. The framework tests models end-to-end through voice interactions, measuring performance on real-world conversational scenarios using industry-standard evaluation datasets and methodologies.

Open-Source Benchmarks

Our model, Rose (Aivoco v1), was evaluated on open-source benchmarks; the full framework is available on GitHub.
You can run the framework yourself to reproduce the results and fact-check our model's reported performance.

Framework Architecture

The framework consists of 9 specialized benchmarks that cover all major aspects of speech-to-speech AI evaluation:
├── 🔧 Boot Performance (Cold Start)
├── ⏱️ Latency (Sustained Performance)
├── 💻 Code Generation (HumanEval)
├── 🛠️ Function Calling (BFCL)
├── 😊 Emotional Intelligence
├── 🌍 Multilingual Naturalness
├── 🎭 Expressiveness & Prosody
├── 🤝 Ethical Reasoning (MM-NIAH)
└── 🎯 Helpful/Honest/Harmless (HLE)

Evaluation Pipeline

Each benchmark follows a consistent Speech-to-Speech pipeline:
User Query → Text-to-Speech → Speech-to-Speech Model → Speech-to-Text → Evaluation Framework
     ↓              ↓                  ↓                      ↓              ↓
   Prompt      Vispark TTS       Aivoco Model            Vispark STT    Real Datasets
Access the open-source benchmarking framework at https://github.com/aivocofounders/sts-bench.
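
As a concrete illustration, here is a minimal Python sketch of one pass through this pipeline. It assumes the requests library and the Vispark endpoints documented under "Vispark API Endpoints" below; the response field names ("audio", "text") and the Aivoco speech-to-speech call are assumptions, not documented behavior:

import base64
import os

import requests

VISPARK = "https://api.lab.vispark.in"
HEADERS = {"X-API-Key": os.environ["VISPARK_API_KEY"]}

def tts(text: str) -> bytes:
    # User query -> speech (Vispark TTS); body format documented below.
    r = requests.post(f"{VISPARK}/model/audio/text_to_speech",
                      headers=HEADERS,
                      json={"size": "small", "text": text, "voice": "girl"})
    r.raise_for_status()
    return base64.b64decode(r.json()["audio"])  # response field name assumed

def stt(audio: bytes) -> str:
    # Speech -> text (Vispark STT).
    r = requests.post(f"{VISPARK}/model/audio/speech_to_text",
                      headers=HEADERS,
                      json={"audio": base64.b64encode(audio).decode()})
    r.raise_for_status()
    return r.json()["text"]  # response field name assumed

def run_pipeline(prompt: str, s2s_model) -> str:
    # s2s_model: a callable wrapping the Aivoco speech-to-speech API
    # (speech in, speech out); not shown here.
    return stt(s2s_model(tts(prompt)))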

Vispark Models Integration

This framework leverages Vispark’s cutting-edge AI models available at https://lab.vispark.in:

Text-to-Speech (TTS)

  • Model: Advanced neural TTS with emotional expression
  • Languages: 250+ languages including all major Indian languages
  • Voices: Multiple voice options (male/female) with natural intonation
  • Quality: 24 kHz high-fidelity audio output
  • Features: Emotion control, speed adjustment, pronunciation correction

Speech-to-Text (STT)

  • Model: State-of-the-art automatic speech recognition
  • Accuracy: 99%+ accuracy across supported languages
  • Languages: 250+ languages with regional accent support
  • Features: Real-time processing, noise cancellation, speaker diarization
  • Output: Precise text transcription with punctuation

Vision (Multimodal AI)

  • Model: Advanced multimodal understanding
  • Capabilities: Text, image, audio, and video analysis
  • Context Window: Up to 1 million tokens
  • Features: Emotion analysis, content understanding, real-time streaming
  • Use Cases: Judge for emotions, naturalness, and expressiveness benchmarks

Key Advantages

  • SOTA Performance: Industry-leading accuracy and naturalness
  • Real-time Processing: Low latency for interactive applications
  • Multilingual Support: Native support for Indian regional languages
  • API Integration: Simple REST API with comprehensive documentation
  • Enterprise Ready: Production-grade reliability and security

Benchmark Results Summary

🏆 Model Performance Rankings (As of August 30, 2025)

| Model | Overall Rank | Function Calling | Emotions | Naturalness | HLE | Expressiveness | Latency | Boot Time | Code Gen | Ethics |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Rose (Aivoco v1) | 🥇 1st | 70.29% 🟢 | 89% 🟢 | 91% 🟢 | 41.21% 🟡 | 91% 🟢 | 0.69s 🟢 | 6.0s 🟡 | 94% 🟢 | 99% 🟢 |
| OpenAI (GPT Realtime) | 🥈 2nd | 50.27% 🟡 | 94% 🟢 | 89% 🟢 | 18.3% 🟡 | 98% 🟢 | 0.82s 🟡 | 0.8s 🟢 | 78% 🟡 | 83% 🟢 |
| Google (Gemini 2.5) | 🥉 3rd | 37.06% 🟡 | 82% 🟡 | 51% 🟡 | 6.7% 🟡 | 91% 🟢 | 1.5s 🟡 | 3.0s 🟡 | 44% 🟡 | 27% 🟡 |
| Sesame (CSM 1) | 4th | 19.25% 🔴 | 88% 🟢 | 19% 🔴 | 0.6% 🔴 | 91% 🟢 | 0.76s 🟡 | 3.0s 🟡 | 7% 🔴 | 0% 🔴 |
Legend: 🟢 Excellent (Top 25%) 🟡 Good (25-75%) 🔴 Needs Improvement (Bottom 25%)

Detailed Benchmark Results

Boot Time Performance

Measures: Cold start time for the initial response
Methodology: Average of 100 sequential calls on the same network/device

| Model | Boot Time | Performance |
| --- | --- | --- |
| OpenAI GPT Realtime | 0.8 seconds | 🟢 Fastest cold start |
| Google Gemini 2.5 | 3.0 seconds | 🟡 Good performance |
| Sesame CSM 1 | 3.0 seconds | 🟡 Good performance |
| Rose Aivoco v1 | 6.0 seconds | 🟡 Acceptable for production |

Latency Performance

Measures: Response time during sustained 30-minute conversations
Methodology: Average of 100 sessions with realistic interactions

| Model | Latency | Performance |
| --- | --- | --- |
| Rose Aivoco v1 | 0.69 seconds | 🟢 Lowest latency |
| Sesame CSM 1 | 0.76 seconds | 🟡 Very good |
| OpenAI GPT Realtime | 0.82 seconds | 🟡 Good performance |
| Google Gemini 2.5 | 1.5 seconds | 🟡 Acceptable |

Code Generation (HumanEval)

Measures: Functional correctness of code generated via speech
Methodology: HumanEval dataset processed through the S2S pipeline

| Model | Accuracy | Performance |
| --- | --- | --- |
| Rose Aivoco v1 | 94% | 🟢 Excellent code generation |
| OpenAI GPT Realtime | 78% | 🟡 Good performance |
| Google Gemini 2.5 | 44% | 🟡 Moderate performance |
| Sesame CSM 1 | 7% | 🔴 Needs significant improvement |

Function Calling (BFCL)

Measures: Accuracy of function call generation from voice commands
Methodology: BFCL dataset with real function calling scenarios

| Model | Accuracy | Performance |
| --- | --- | --- |
| Rose Aivoco v1 | 70.29% | 🟢 Strong function calling |
| OpenAI GPT Realtime | 50.27% | 🟡 Good performance |
| Google Gemini 2.5 | 37.06% | 🟡 Moderate performance |
| Sesame CSM 1 | 19.25% | 🔴 Limited function calling |

Emotional Intelligence

Measures: Ability to convey and recognize emotions authentically
Methodology: 8 core emotions over 100 interactions with AI judging

| Model | Score | Performance |
| --- | --- | --- |
| OpenAI GPT Realtime | 94% | 🟢 Best emotional expression |
| Rose Aivoco v1 | 89% | 🟢 Excellent performance |
| Sesame CSM 1 | 88% | 🟢 Very good |
| Google Gemini 2.5 | 82% | 🟡 Good performance |

Multilingual Naturalness

Measures: Speech naturalness across 15 languages (5 European + 10 Indian)
Methodology: Conversational scenarios in native scripts

| Model | Score | Performance |
| --- | --- | --- |
| Rose Aivoco v1 | 91% | 🟢 Best multilingual support |
| OpenAI GPT Realtime | 89% | 🟢 Excellent performance |
| Google Gemini 2.5 | 51% | 🟡 Moderate performance |
| Sesame CSM 1 | 19% | 🔴 Limited language support |

Expressiveness & Prosody

Measures: Vocal variety, intonation, and emotional expressiveness
Methodology: 8 expressive scenarios with multimodal AI analysis

| Model | Score | Performance |
| --- | --- | --- |
| OpenAI GPT Realtime | 98% | 🟢 Most expressive |
| Rose Aivoco v1 | 91% | 🟢 Excellent performance |
| Google Gemini 2.5 | 91% | 🟢 Excellent performance |
| Sesame CSM 1 | 91% | 🟢 Excellent performance |

Ethical Reasoning (MM-NIAH)

Measures: Moral reasoning and ethical decision-making
Methodology: MM-NIAH dataset with multimodal evaluation

| Model | Score | Performance |
| --- | --- | --- |
| Rose Aivoco v1 | 99% | 🟢 Outstanding ethics |
| OpenAI GPT Realtime | 83% | 🟢 Good ethical reasoning |
| Google Gemini 2.5 | 27% | 🟡 Moderate performance |
| Sesame CSM 1 | 0% | 🔴 Serious ethical concerns |

Helpful/Honest/Harmless (HLE)

Measures: Alignment with positive AI behavior guidelines
Methodology: HLE evaluation framework with voice interactions

| Model | Score | Performance |
| --- | --- | --- |
| Rose Aivoco v1 | 41.21% | 🟡 Best HLE performance |
| OpenAI GPT Realtime | 18.3% | 🟡 Moderate performance |
| Google Gemini 2.5 | 6.7% | 🟡 Limited HLE alignment |
| Sesame CSM 1 | 0.6% | 🔴 Poor HLE performance |

Individual Benchmark Details

1. Boot Benchmark (boot/)

Purpose: Measures cold start performance and initial response speed
Flow:
  1. Initialize speech-to-speech connection
  2. Send 100 sequential audio requests
  3. Measure time to first response for each
  4. Calculate average boot time
Key Metrics: Average first response time
Best For: Production deployment evaluation
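
A minimal timing harness for this flow might look like the following sketch; `session` and `send_request` are hypothetical stand-ins for the model-specific connection and request logic:

import statistics
import time

def boot_benchmark(session, send_request, n_calls: int = 100) -> float:
    # session: an initialized speech-to-speech connection (step 1).
    # send_request(session): sends one audio request and blocks until the
    # first chunk of the reply arrives (time to first response).
    times = []
    for _ in range(n_calls):
        t0 = time.perf_counter()
        send_request(session)
        times.append(time.perf_counter() - t0)
    return statistics.mean(times)  # average first-response time, in seconds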

2. Latency Benchmark (latency/)

Purpose: Evaluates sustained performance during extended conversations
Flow:
  1. Start 30-minute conversation sessions
  2. Use Vispark Vision to generate realistic scenarios
  3. Send contextual audio messages throughout
  4. Measure response latency for each interaction
Key Metrics: Average response time over 100 sessions
Best For: Long conversation reliability testing
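
The sustained-latency measurement differs from the boot harness mainly in scope: it averages per-turn response times across many long sessions. A sketch, again with hypothetical callables:

import statistics
import time

def latency_benchmark(start_session, next_turn, send_and_wait,
                      n_sessions: int = 100) -> float:
    # start_session(): opens one 30-minute conversation session.
    # next_turn(session): yields the next contextual audio message, or
    # None once the session's time budget is spent.
    # send_and_wait(session, audio): blocks until the reply begins.
    latencies = []
    for _ in range(n_sessions):
        session = start_session()
        while (audio := next_turn(session)) is not None:
            t0 = time.perf_counter()
            send_and_wait(session, audio)
            latencies.append(time.perf_counter() - t0)
    return statistics.mean(latencies)  # mean per-turn latency, in seconds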

3. HumanEval Benchmark (humaneval/)

Purpose: Tests functional code generation through voice commands
Flow:
  1. Load HumanEval programming problems
  2. Convert problems to speech using TTS
  3. Send through speech-to-speech model
  4. Convert responses back to text
  5. Evaluate code correctness with official HumanEval framework
Key Metrics: Pass@k accuracy rates
Best For: Programming assistant evaluation
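
Pass@k is computed with the standard unbiased estimator from the HumanEval paper, pass@k = 1 - C(n-c, k) / C(n, k) for n sampled completions of which c pass the unit tests, evaluated stably as a product:

import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    # n: total completions sampled per problem; c: number passing the
    # unit tests; k: the k in pass@k.
    if n - c < k:
        return 1.0  # every size-k subset contains a correct sample
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))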

4. BFCL Benchmark (bfcl/)

Purpose: Measures function calling accuracy from voice commands
Flow:
  1. Load BFCL function calling test cases
  2. Convert function requests to speech
  3. Process through speech-to-speech model
  4. Parse function calls from voice responses
  5. Evaluate accuracy against expected functions/parameters
Key Metrics: Function call accuracy, parameter extraction
Best For: Voice-controlled application testing
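
Step 4 ("parse function calls from voice responses") is the fragile part of this pipeline, since the call has to be recovered from an STT transcript. One way to score it is sketched below, assuming the model is prompted to speak its call as JSON (the call format here is an illustrative assumption, not the BFCL specification):

import json

def score_function_call(transcript: str, expected: dict) -> bool:
    # Extract the first {...} span from the transcript and compare it to
    # the expected call, e.g. {"name": "...", "arguments": {...}}.
    try:
        call = json.loads(transcript[transcript.index("{"):
                                     transcript.rindex("}") + 1])
    except (ValueError, json.JSONDecodeError):
        return False  # no parseable call in the transcript
    return (call.get("name") == expected["name"]
            and call.get("arguments") == expected["arguments"])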

5. Emotions Benchmark (emotions/)

Purpose: Evaluates emotional intelligence and expression authenticity
Flow:
  1. Test 8 core emotions (joy, sadness, anger, fear, surprise, disgust, trust, anticipation)
  2. Generate emotional scenarios using Vispark Vision
  3. Send emotional prompts through speech-to-speech
  4. Use multimodal AI to judge emotional authenticity
Key Metrics: Emotional accuracy scores per emotion type
Best For: Empathetic AI and emotional support applications
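
The judging step (step 4) can be driven through the Vispark Vision endpoint, whose request body format is documented under "Vispark API Endpoints" below; the content-item shape and response parsing in this sketch are assumptions:

import os

import requests

def judge_emotion(audio_b64: str, emotion: str) -> dict:
    # Ask the multimodal judge to rate how authentically the model's
    # reply expresses the target emotion.
    r = requests.post(
        "https://api.lab.vispark.in/model/text/vision",
        headers={"X-API-Key": os.environ["VISPARK_API_KEY"]},
        json={
            "size": "medium",
            "system_message": "You are a strict judge of emotional authenticity.",
            "content": [
                {"type": "audio", "data": audio_b64},  # item shape assumed
                {"type": "text",
                 "text": f"Rate 0-100 how authentically this clip expresses {emotion}."},
            ],
        },
    )
    r.raise_for_status()
    return r.json()  # score extraction depends on the actual response schema

The same judging pattern applies to the naturalness and expressiveness benchmarks below, with the prompt swapped accordingly.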

6. Naturalness Benchmark (naturalness/)

Purpose: Tests speech naturalness across multiple languages
Flow:
  1. Test 15 languages (5 European + 10 Indian regional)
  2. Generate conversations in native scripts
  3. Process through speech-to-speech pipeline
  4. Use multimodal AI to judge naturalness, fluency, and pronunciation
Key Metrics: Naturalness scores per language
Best For: Multilingual application deployment

7. Expressiveness Benchmark (expressiveness/)

Purpose: Measures vocal variety and prosodic expressiveness
Flow:
  1. Test 8 expressive scenarios requiring different vocal styles
  2. Send prompts requiring emotional expression
  3. Use multimodal AI to analyze prosody, intonation, and pace
  4. Score vocal expressiveness and variety
Key Metrics: Expressiveness scores, prosodic quality
Best For: Entertainment and presentation applications

8. Ethics Benchmark (ethic/)

Purpose: Evaluates ethical reasoning using the MM-NIAH framework
Flow:
  1. Load MM-NIAH ethical scenarios
  2. Convert ethical dilemmas to speech
  3. Process through speech-to-speech model
  4. Evaluate responses using official MM-NIAH evaluation
  5. Measure ethical reasoning quality
Key Metrics: Ethical accuracy, moral reasoning quality
Best For: Safety-critical application evaluation

9. HLE Benchmark (hle/)

Purpose: Tests alignment with Helpful, Honest, and Harmless guidelines
Flow:
  1. Load HLE test scenarios
  2. Convert HLE prompts to speech
  3. Process through speech-to-speech model
  4. Evaluate responses using HLE framework
  5. Score helpfulness, honesty, and harmlessness
Key Metrics: HLE compliance scores
Best For: Responsible AI deployment evaluation

Quick Start Guide

Prerequisites

pip install -r requirements.txt

API Key Setup

Vispark API Keys

  1. Visit https://lab.vispark.in
  2. Sign up for an account
  3. Navigate to API Keys section
  4. Generate your API key for:
    • Text-to-Speech (TTS) - Required for all benchmarks
    • Speech-to-Text (STT) - Required for most benchmarks
    • Vision (Multimodal) - Required for emotions/expressiveness/naturalness

Aivoco API Keys

  1. Visit the Aivoco platform
  2. Complete registration process
  3. Generate API key for Speech-to-Speech model access

Environment Setup

# Create .env file with your API keys
cp .env.example .env

# Edit .env with your actual API keys:
# AIVOCO_API_KEY=your_aivoco_key
# VISPARK_API_KEY=your_vispark_key
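
In Python, one common way to pick these keys up at runtime is via python-dotenv (an assumption about tooling; any mechanism that exports the two variables works):

import os

from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads key=value pairs from ./.env into the environment

AIVOCO_API_KEY = os.environ["AIVOCO_API_KEY"]    # raises KeyError if unset
VISPARK_API_KEY = os.environ["VISPARK_API_KEY"]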

Running Individual Benchmarks

# Run any benchmark
cd [benchmark_name]
python main.py

# Examples:
cd boot && python main.py
cd humaneval && python main.py
cd emotions && python main.py

Running All Benchmarks

# Run benchmarks sequentially
for dir in */; do
    if [ -f "$dir/main.py" ]; then
        echo "Running $dir"
        cd "$dir"
        python main.py
        cd ..
    fi
done

Performance Insights

Rose (Aivoco v1)

  • Strengths: Best latency, excellent code generation, superior ethics
  • Weaknesses: Higher boot time (acceptable for production)
  • Best For: Production applications requiring reliability and ethics

OpenAI GPT Realtime

  • Strengths: Fastest boot time, excellent expressiveness, good emotions
  • Weaknesses: Moderate performance in some technical tasks
  • Best For: Interactive applications needing quick responses

Google Gemini 2.5

  • Strengths: Good expressiveness, solid emotional intelligence
  • Weaknesses: Higher latency, moderate multilingual support
  • Best For: General-purpose applications with balanced requirements

Sesame CSM 1

  • Strengths: Good expressiveness and emotional intelligence
  • Weaknesses: Poor multilingual support, limited technical capabilities
  • Best For: Focused use cases with strong emotional requirements

Technical Specifications

Models Tested

  • Rose: Aivoco v1 (Speech-to-Speech optimized)
  • Google: Gemini 2.5 Native Live (latest)
  • OpenAI: GPT Realtime (latest)
  • Sesame: CSM 1 (latest)

Testing Environment

  • Network: Same network for all tests
  • Device: Consistent hardware across all benchmarks
  • Date: August 30, 2025
  • Sessions: 100 per benchmark (except where noted)

Evaluation Frameworks

  • HumanEval: Official OpenAI code generation benchmark
  • BFCL: Berkeley Function Calling Leaderboard
  • MM-NIAH: Multimodal Non-Intrusive AI Helpfulness
  • HLE: Helpful, Honest, and Harmless evaluation

Vispark Models Used

  • TTS Model: /model/audio/text_to_speech - Neural TTS with emotion control
  • STT Model: /model/audio/speech_to_text - Advanced ASR with 99%+ accuracy
  • Vision Model: /model/text/vision - Multimodal AI with 1M token context
  • Base URL: https://api.lab.vispark.in
  • Authentication: X-API-Key header required
  • Documentation: Available at https://lab.vispark.in

Vispark API Endpoints Used in Benchmarks

# Text-to-Speech
POST https://api.lab.vispark.in/model/audio/text_to_speech
Headers: X-API-Key: your_key
Body: {"size": "small/large", "text": "your_text", "voice": "boy/girl"}

# Speech-to-Text
POST https://api.lab.vispark.in/model/audio/speech_to_text
Headers: X-API-Key: your_key
Body: {"audio": "base64_encoded_audio"}

# Vision (Multimodal AI)
POST https://api.lab.vispark.in/model/text/vision
Headers: X-API-Key: your_key
Body: {"size": "small/medium/large", "content": [...], "system_message": "..."}

Support & Contributing

For questions about the benchmarking framework or to contribute improvements:
  • Documentation: Each benchmark folder contains detailed README files
  • Issues: Report bugs or request features
  • Contributing: Pull requests welcome for new benchmarks or improvements

License & Attribution

This benchmarking framework is designed for evaluating speech-to-speech AI models using industry-standard methodologies and datasets.

Last Updated: August 30, 2025
Framework Version: v1.0
Tested Models: 4 major S2S models

This framework aims to provide a thorough evaluation of speech-to-speech AI models, covering critical aspects from technical performance to ethical alignment.