Speech-to-Speech AI Model Benchmarking Framework

Overview

This comprehensive benchmarking framework evaluates Speech-to-Speech (S2S) AI models across multiple critical dimensions. The framework tests models end-to-end through voice interactions, measuring performance on real-world conversational scenarios using industry-standard evaluation datasets and methodologies.

Open-Source Benchmarks

Our model, Rose (Aivoco v1), was evaluated on open-source benchmarks; the full framework is available on GitHub.
You can run the framework yourself to reproduce the results and fact-check our model's reported performance.

Framework Architecture

The framework consists of 9 specialized benchmarks that cover all major aspects of speech-to-speech AI evaluation:
├── 🔧 Boot Performance (Cold Start)
├── ⏱️ Latency (Sustained Performance)
├── 💻 Code Generation (HumanEval)
├── 🛠️ Function Calling (BFCL)
├── 😊 Emotional Intelligence
├── 🌍 Multilingual Naturalness
├── 🎭 Expressiveness & Prosody
├── 🤝 Ethical Reasoning (MM-NIAH)
└── 🎯 Helpful/Honest/Harmless (HLE)

Evaluation Pipeline

Each benchmark follows a consistent Speech-to-Speech pipeline:
User Query → Text-to-Speech → Speech-to-Speech Model → Speech-to-Text → Evaluation Framework
     ↓              ↓                  ↓                      ↓              ↓
   Prompt      Vispark TTS       Aivoco Model            Vispark STT    Real Datasets
Access the open-source benchmarking framework at https://github.com/aivocofounders/sts-bench.
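
As a concrete illustration, here is a minimal Python sketch of one pass through this pipeline. It assumes the requests library and the Vispark endpoints documented under "Vispark API Endpoints" below; the response field names ("audio", "text") and the Aivoco speech-to-speech call are assumptions, not documented behavior:

import base64
import os

import requests

VISPARK = "https://api.lab.vispark.in"
HEADERS = {"X-API-Key": os.environ["VISPARK_API_KEY"]}

def tts(text: str) -> bytes:
    # User query -> speech (Vispark TTS); body format documented below.
    r = requests.post(f"{VISPARK}/model/audio/text_to_speech",
                      headers=HEADERS,
                      json={"size": "small", "text": text, "voice": "girl"})
    r.raise_for_status()
    return base64.b64decode(r.json()["audio"])  # response field name assumed

def stt(audio: bytes) -> str:
    # Speech -> text (Vispark STT).
    r = requests.post(f"{VISPARK}/model/audio/speech_to_text",
                      headers=HEADERS,
                      json={"audio": base64.b64encode(audio).decode()})
    r.raise_for_status()
    return r.json()["text"]  # response field name assumed

def run_pipeline(prompt: str, s2s_model) -> str:
    # s2s_model: a callable wrapping the Aivoco speech-to-speech API
    # (speech in, speech out); not shown here.
    return stt(s2s_model(tts(prompt)))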

Vispark Models Integration

This framework leverages Vispark’s cutting-edge AI models available at https://lab.vispark.in:

Text-to-Speech (TTS)

  • Model: Advanced neural TTS with emotional expression
  • Languages: 250+ languages including all major Indian languages
  • Voices: Multiple voice options (male/female) with natural intonation
  • Quality: 24 kHz high-fidelity audio output
  • Features: Emotion control, speed adjustment, pronunciation correction

Speech-to-Text (STT)

  • Model: State-of-the-art automatic speech recognition
  • Accuracy: 99%+ accuracy across supported languages
  • Languages: 250+ languages with regional accent support
  • Features: Real-time processing, noise cancellation, speaker diarization
  • Output: Precise text transcription with punctuation

Vision (Multimodal AI)

  • Model: Advanced multimodal understanding
  • Capabilities: Text, image, audio, and video analysis
  • Context Window: Up to 1 million tokens
  • Features: Emotion analysis, content understanding, real-time streaming
  • Use Cases: Judge for emotions, naturalness, and expressiveness benchmarks

Key Advantages

  • SOTA Performance: Industry-leading accuracy and naturalness
  • Real-time Processing: Low latency for interactive applications
  • Multilingual Support: Native support for Indian regional languages
  • API Integration: Simple REST API with comprehensive documentation
  • Enterprise Ready: Production-grade reliability and security

Benchmark Results Summary

🏆 Model Performance Rankings (As of August 30, 2025)

| Model | Overall Rank | Function Calling | Emotions | Naturalness | HLE | Expressiveness | Latency | Boot Time | Code Gen | Ethics |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Rose (Aivoco v1) | 🥇 1st | 70.29% 🟢 | 89% 🟢 | 91% 🟢 | 41.21% 🟡 | 91% 🟢 | 0.69s 🟢 | 6.0s 🟡 | 94% 🟢 | 99% 🟢 |
| OpenAI (GPT Realtime) | 🥈 2nd | 50.27% 🟡 | 94% 🟢 | 89% 🟢 | 18.3% 🟡 | 98% 🟢 | 0.82s 🟡 | 0.8s 🟢 | 78% 🟡 | 83% 🟢 |
| Google (Gemini 2.5) | 🥉 3rd | 37.06% 🟡 | 82% 🟡 | 51% 🟡 | 6.7% 🟡 | 91% 🟢 | 1.5s 🟡 | 3.0s 🟡 | 44% 🟡 | 27% 🟡 |
| Sesame (CSM 1) | 4th | 19.25% 🔴 | 88% 🟢 | 19% 🔴 | 0.6% 🔴 | 91% 🟢 | 0.76s 🟡 | 3.0s 🟡 | 7% 🔴 | 0% 🔴 |
Legend: 🟢 Excellent (Top 25%) 🟡 Good (25-75%) 🔴 Needs Improvement (Bottom 25%)

Detailed Benchmark Results

Boot Time Performance

Measures: Cold start time for the initial response
Methodology: Average of 100 sequential calls on the same network/device

| Model | Boot Time | Performance |
| --- | --- | --- |
| OpenAI GPT Realtime | 0.8 seconds | 🟢 Fastest cold start |
| Google Gemini 2.5 | 3.0 seconds | 🟡 Good performance |
| Sesame CSM 1 | 3.0 seconds | 🟡 Good performance |
| Rose Aivoco v1 | 6.0 seconds | 🟡 Acceptable for production |

Latency Performance

Measures: Response time during sustained 30-minute conversations
Methodology: Average of 100 sessions with realistic interactions

| Model | Latency | Performance |
| --- | --- | --- |
| Rose Aivoco v1 | 0.69 seconds | 🟢 Lowest latency |
| Sesame CSM 1 | 0.76 seconds | 🟡 Very good |
| OpenAI GPT Realtime | 0.82 seconds | 🟡 Good performance |
| Google Gemini 2.5 | 1.5 seconds | 🟡 Acceptable |

Code Generation (HumanEval)

Measures: Functional correctness of code generated via speech
Methodology: HumanEval dataset processed through the S2S pipeline

| Model | Accuracy | Performance |
| --- | --- | --- |
| Rose Aivoco v1 | 94% | 🟢 Excellent code generation |
| OpenAI GPT Realtime | 78% | 🟡 Good performance |
| Google Gemini 2.5 | 44% | 🟡 Moderate performance |
| Sesame CSM 1 | 7% | 🔴 Needs significant improvement |

Function Calling (BFCL)

Measures: Accuracy of function call generation from voice commands
Methodology: BFCL dataset with real function calling scenarios

| Model | Accuracy | Performance |
| --- | --- | --- |
| Rose Aivoco v1 | 70.29% | 🟢 Strong function calling |
| OpenAI GPT Realtime | 50.27% | 🟡 Good performance |
| Google Gemini 2.5 | 37.06% | 🟡 Moderate performance |
| Sesame CSM 1 | 19.25% | 🔴 Limited function calling |

Emotional Intelligence

Measures: Ability to convey and recognize emotions authentically
Methodology: 8 core emotions over 100 interactions with AI judging

| Model | Score | Performance |
| --- | --- | --- |
| OpenAI GPT Realtime | 94% | 🟢 Best emotional expression |
| Rose Aivoco v1 | 89% | 🟢 Excellent performance |
| Sesame CSM 1 | 88% | 🟢 Very good |
| Google Gemini 2.5 | 82% | 🟡 Good performance |

Multilingual Naturalness

Measures: Speech naturalness across 15 languages (5 European + 10 Indian)
Methodology: Conversational scenarios in native scripts

| Model | Score | Performance |
| --- | --- | --- |
| Rose Aivoco v1 | 91% | 🟢 Best multilingual support |
| OpenAI GPT Realtime | 89% | 🟢 Excellent performance |
| Google Gemini 2.5 | 51% | 🟡 Moderate performance |
| Sesame CSM 1 | 19% | 🔴 Limited language support |

Expressiveness & Prosody

Measures: Vocal variety, intonation, and emotional expressiveness
Methodology: 8 expressive scenarios with multimodal AI analysis

| Model | Score | Performance |
| --- | --- | --- |
| OpenAI GPT Realtime | 98% | 🟢 Most expressive |
| Rose Aivoco v1 | 91% | 🟢 Excellent performance |
| Google Gemini 2.5 | 91% | 🟢 Excellent performance |
| Sesame CSM 1 | 91% | 🟢 Excellent performance |

Ethical Reasoning (MM-NIAH)

Measures: Moral reasoning and ethical decision-making
Methodology: MM-NIAH dataset with multimodal evaluation

| Model | Score | Performance |
| --- | --- | --- |
| Rose Aivoco v1 | 99% | 🟢 Outstanding ethics |
| OpenAI GPT Realtime | 83% | 🟢 Good ethical reasoning |
| Google Gemini 2.5 | 27% | 🟡 Moderate performance |
| Sesame CSM 1 | 0% | 🔴 Serious ethical concerns |

Helpful/Honest/Harmless (HLE)

Measures: Alignment with positive AI behavior guidelines
Methodology: HLE evaluation framework with voice interactions

| Model | Score | Performance |
| --- | --- | --- |
| Rose Aivoco v1 | 41.21% | 🟡 Best HLE performance |
| OpenAI GPT Realtime | 18.3% | 🟡 Moderate performance |
| Google Gemini 2.5 | 6.7% | 🟡 Limited HLE alignment |
| Sesame CSM 1 | 0.6% | 🔴 Poor HLE performance |

Individual Benchmark Details

1. Boot Benchmark (boot/)

Purpose: Measures cold start performance and initial response speed
Flow:
  1. Initialize speech-to-speech connection
  2. Send 100 sequential audio requests
  3. Measure time to first response for each
  4. Calculate average boot time
Key Metrics: Average first response time
Best For: Production deployment evaluation
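
A minimal timing harness for this flow might look like the following sketch; `session` and `send_request` are hypothetical stand-ins for the model-specific connection and request logic:

import statistics
import time

def boot_benchmark(session, send_request, n_calls: int = 100) -> float:
    # session: an initialized speech-to-speech connection (step 1).
    # send_request(session): sends one audio request and blocks until the
    # first chunk of the reply arrives (time to first response).
    times = []
    for _ in range(n_calls):
        t0 = time.perf_counter()
        send_request(session)
        times.append(time.perf_counter() - t0)
    return statistics.mean(times)  # average first-response time, in seconds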

2. Latency Benchmark (latency/)

Purpose: Evaluates sustained performance during extended conversations
Flow:
  1. Start 30-minute conversation sessions
  2. Use Vispark Vision to generate realistic scenarios
  3. Send contextual audio messages throughout
  4. Measure response latency for each interaction
Key Metrics: Average response time over 100 sessions
Best For: Long conversation reliability testing
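
The sustained-latency measurement differs from the boot harness mainly in scope: it averages per-turn response times across many long sessions. A sketch, again with hypothetical callables:

import statistics
import time

def latency_benchmark(start_session, next_turn, send_and_wait,
                      n_sessions: int = 100) -> float:
    # start_session(): opens one 30-minute conversation session.
    # next_turn(session): yields the next contextual audio message, or
    # None once the session's time budget is spent.
    # send_and_wait(session, audio): blocks until the reply begins.
    latencies = []
    for _ in range(n_sessions):
        session = start_session()
        while (audio := next_turn(session)) is not None:
            t0 = time.perf_counter()
            send_and_wait(session, audio)
            latencies.append(time.perf_counter() - t0)
    return statistics.mean(latencies)  # mean per-turn latency, in seconds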

3. HumanEval Benchmark (humaneval/)

Purpose: Tests functional code generation through voice commands
Flow:
  1. Load HumanEval programming problems
  2. Convert problems to speech using TTS
  3. Send through speech-to-speech model
  4. Convert responses back to text
  5. Evaluate code correctness with official HumanEval framework
Key Metrics: Pass@k accuracy rates
Best For: Programming assistant evaluation
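
Pass@k is computed with the standard unbiased estimator from the HumanEval paper, pass@k = 1 - C(n-c, k) / C(n, k) for n sampled completions of which c pass the unit tests, evaluated stably as a product:

import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    # n: total completions sampled per problem; c: number passing the
    # unit tests; k: the k in pass@k.
    if n - c < k:
        return 1.0  # every size-k subset contains a correct sample
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))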

4. BFCL Benchmark (bfcl/)

Purpose: Measures function calling accuracy from voice commands
Flow:
  1. Load BFCL function calling test cases
  2. Convert function requests to speech
  3. Process through speech-to-speech model
  4. Parse function calls from voice responses
  5. Evaluate accuracy against expected functions/parameters
Key Metrics: Function call accuracy, parameter extraction
Best For: Voice-controlled application testing
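
Step 4 ("parse function calls from voice responses") is the fragile part of this pipeline, since the call has to be recovered from an STT transcript. One way to score it is sketched below, assuming the model is prompted to speak its call as JSON (the call format here is an illustrative assumption, not the BFCL specification):

import json

def score_function_call(transcript: str, expected: dict) -> bool:
    # Extract the first {...} span from the transcript and compare it to
    # the expected call, e.g. {"name": "...", "arguments": {...}}.
    try:
        call = json.loads(transcript[transcript.index("{"):
                                     transcript.rindex("}") + 1])
    except (ValueError, json.JSONDecodeError):
        return False  # no parseable call in the transcript
    return (call.get("name") == expected["name"]
            and call.get("arguments") == expected["arguments"])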

5. Emotions Benchmark (emotions/)

Purpose: Evaluates emotional intelligence and expression authenticity
Flow:
  1. Test 8 core emotions (joy, sadness, anger, fear, surprise, disgust, trust, anticipation)
  2. Generate emotional scenarios using Vispark Vision
  3. Send emotional prompts through speech-to-speech
  4. Use multimodal AI to judge emotional authenticity
Key Metrics: Emotional accuracy scores per emotion type
Best For: Empathetic AI and emotional support applications
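
The judging step (step 4) can be driven through the Vispark Vision endpoint, whose request body format is documented under "Vispark API Endpoints" below; the content-item shape and response parsing in this sketch are assumptions:

import os

import requests

def judge_emotion(audio_b64: str, emotion: str) -> dict:
    # Ask the multimodal judge to rate how authentically the model's
    # reply expresses the target emotion.
    r = requests.post(
        "https://api.lab.vispark.in/model/text/vision",
        headers={"X-API-Key": os.environ["VISPARK_API_KEY"]},
        json={
            "size": "medium",
            "system_message": "You are a strict judge of emotional authenticity.",
            "content": [
                {"type": "audio", "data": audio_b64},  # item shape assumed
                {"type": "text",
                 "text": f"Rate 0-100 how authentically this clip expresses {emotion}."},
            ],
        },
    )
    r.raise_for_status()
    return r.json()  # score extraction depends on the actual response schema

The same judging pattern applies to the naturalness and expressiveness benchmarks below, with the prompt swapped accordingly.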

6. Naturalness Benchmark (naturalness/)

Purpose: Tests speech naturalness across multiple languages
Flow:
  1. Test 15 languages (5 European + 10 Indian regional)
  2. Generate conversations in native scripts
  3. Process through speech-to-speech pipeline
  4. Use multimodal AI to judge naturalness, fluency, and pronunciation
Key Metrics: Naturalness scores per language
Best For: Multilingual application deployment

7. Expressiveness Benchmark (expressiveness/)

Purpose: Measures vocal variety and prosodic expressiveness
Flow:
  1. Test 8 expressive scenarios requiring different vocal styles
  2. Send prompts requiring emotional expression
  3. Use multimodal AI to analyze prosody, intonation, and pace
  4. Score vocal expressiveness and variety
Key Metrics: Expressiveness scores, prosodic quality
Best For: Entertainment and presentation applications

8. Ethics Benchmark (ethic/)

Purpose: Evaluates ethical reasoning using the MM-NIAH framework
Flow:
  1. Load MM-NIAH ethical scenarios
  2. Convert ethical dilemmas to speech
  3. Process through speech-to-speech model
  4. Evaluate responses using official MM-NIAH evaluation
  5. Measure ethical reasoning quality
Key Metrics: Ethical accuracy, moral reasoning quality
Best For: Safety-critical application evaluation

9. HLE Benchmark (hle/)

Purpose: Tests alignment with Helpful, Honest, and Harmless guidelines
Flow:
  1. Load HLE test scenarios
  2. Convert HLE prompts to speech
  3. Process through speech-to-speech model
  4. Evaluate responses using HLE framework
  5. Score helpfulness, honesty, and harmlessness
Key Metrics: HLE compliance scores
Best For: Responsible AI deployment evaluation

Quick Start Guide

Prerequisites

pip install -r requirements.txt

API Key Setup

Vispark API Keys

  1. Visit https://lab.vispark.in
  2. Sign up for an account
  3. Navigate to API Keys section
  4. Generate your API key for:
    • Text-to-Speech (TTS) - Required for all benchmarks
    • Speech-to-Text (STT) - Required for most benchmarks
    • Vision (Multimodal) - Required for emotions/expressiveness/naturalness

Aivoco API Keys

  1. Visit the Aivoco platform
  2. Complete registration process
  3. Generate API key for Speech-to-Speech model access

Environment Setup

# Create .env file with your API keys
cp .env.example .env

# Edit .env with your actual API keys:
# AIVOCO_API_KEY=your_aivoco_key
# VISPARK_API_KEY=your_vispark_key
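
In Python, one common way to pick these keys up at runtime is via python-dotenv (an assumption about tooling; any mechanism that exports the two variables works):

import os

from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads key=value pairs from ./.env into the environment

AIVOCO_API_KEY = os.environ["AIVOCO_API_KEY"]    # raises KeyError if unset
VISPARK_API_KEY = os.environ["VISPARK_API_KEY"]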

Running Individual Benchmarks

# Run any benchmark
cd [benchmark_name]
python main.py

# Examples:
cd boot && python main.py
cd humaneval && python main.py
cd emotions && python main.py

Running All Benchmarks

# Run benchmarks sequentially
for dir in */; do
    if [ -f "$dir/main.py" ]; then
        echo "Running $dir"
        cd "$dir"
        python main.py
        cd ..
    fi
done

Performance Insights

Rose (Aivoco v1)

  • Strengths: Best latency, excellent code generation, superior ethics
  • Weaknesses: Higher boot time (acceptable for production)
  • Best For: Production applications requiring reliability and ethics

OpenAI GPT Realtime

  • Strengths: Fastest boot time, excellent expressiveness, good emotions
  • Weaknesses: Moderate performance in some technical tasks
  • Best For: Interactive applications needing quick responses

Google Gemini 2.5

  • Strengths: Good expressiveness, solid emotional intelligence
  • Weaknesses: Higher latency, moderate multilingual support
  • Best For: General-purpose applications with balanced requirements

Sesame CSM 1

  • Strengths: Good expressiveness and emotional intelligence
  • Weaknesses: Poor multilingual support, limited technical capabilities
  • Best For: Focused use cases with strong emotional requirements

Technical Specifications

Models Tested

  • Rose: Aivoco v1 (Speech-to-Speech optimized)
  • Google: Gemini 2.5 Native Live (latest)
  • OpenAI: GPT Realtime (latest)
  • Sesame: CSM 1 (latest)

Testing Environment

  • Network: Same network for all tests
  • Device: Consistent hardware across all benchmarks
  • Date: August 30, 2025
  • Sessions: 100 per benchmark (except where noted)

Evaluation Frameworks

  • HumanEval: Official OpenAI code generation benchmark
  • BFCL: Berkeley Function Calling Leaderboard
  • MM-NIAH: Multimodal Non-Intrusive AI Helpfulness
  • HLE: Helpful, Honest, and Harmless evaluation

Vispark Models Used

  • TTS Model: /model/audio/text_to_speech - Neural TTS with emotion control
  • STT Model: /model/audio/speech_to_text - Advanced ASR with 99%+ accuracy
  • Vision Model: /model/text/vision - Multimodal AI with 1M token context
  • Base URL: https://api.lab.vispark.in
  • Authentication: X-API-Key header required
  • Documentation: Available at https://lab.vispark.in

Vispark API Endpoints Used in Benchmarks

# Text-to-Speech
POST https://api.lab.vispark.in/model/audio/text_to_speech
Headers: X-API-Key: your_key
Body: {"size": "small/large", "text": "your_text", "voice": "boy/girl"}

# Speech-to-Text
POST https://api.lab.vispark.in/model/audio/speech_to_text
Headers: X-API-Key: your_key
Body: {"audio": "base64_encoded_audio"}

# Vision (Multimodal AI)
POST https://api.lab.vispark.in/model/text/vision
Headers: X-API-Key: your_key
Body: {"size": "small/medium/large", "content": [...], "system_message": "..."}

Support & Contributing

For questions about the benchmarking framework or to contribute improvements:
  • Documentation: Each benchmark folder contains detailed README files
  • Issues: Report bugs or request features
  • Contributing: Pull requests welcome for new benchmarks or improvements

License & Attribution

This benchmarking framework is designed for evaluating speech-to-speech AI models using industry-standard methodologies and datasets.

Last Updated: August 30, 2025
Framework Version: v1.0
Tested Models: 4 major S2S models

This framework aims to provide a thorough evaluation of speech-to-speech AI models, covering critical aspects from technical performance to ethical alignment.