NVIDIA Granary Setup Guide: 1M Hour Speech Dataset

NVIDIA's recent release of the Granary dataset represents a massive leap forward for multilingual speech AI development. With approximately 1 million hours of audio across 25 European languages, this open-source corpus addresses one of the biggest challenges in speech recognition: data scarcity for underrepresented languages like Croatian, Estonian, and Maltese.
The dataset is accompanied by two production-ready models: Canary-1b-v2 for high-quality transcription and translation, and Parakeet-tdt-0.6b-v3 for real-time processing. What makes this release particularly valuable is data efficiency: according to NVIDIA, models trained on Granary reach target accuracy levels with roughly half the training data required by comparable datasets, which matters a great deal for developers building multilingual applications.
This comprehensive guide will walk you through setting up and using NVIDIA's Granary dataset and accompanying models to build your own multilingual speech recognition applications. We'll cover everything from initial setup to deploying a working speech-to-text system that can handle 25 different languages.
Prerequisites and System Requirements
Before diving into the setup process, ensure your system meets the necessary requirements for running NVIDIA's speech models effectively. You'll need a modern GPU with at least 8GB of VRAM for optimal performance, though the models can run on smaller configurations with reduced batch sizes.
Your development environment should include Python 3.8 or higher, CUDA 11.8 or later, and at least 50GB of free storage for working with dataset subsets. The complete Granary corpus contains nearly 1 million hours of audio and is far larger than that, so plan to stream it or download only the language subsets you need rather than fetching everything locally.
Install the essential dependencies by creating a new virtual environment and installing the required packages:
python -m venv granary-env
source granary-env/bin/activate # On Windows: granary-env\Scripts\activate
pip install torch torchaudio transformers datasets huggingface_hub
pip install nemo_toolkit librosa soundfile
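The Dockerfile in the integration section later copies a requirements.txt into the image, so it is worth pinning these dependencies in one now. A minimal example mirroring the packages installed above (add version pins that match your CUDA setup):
# requirements.txt -- example only; pin versions for reproducible builds
torch
torchaudio
transformers
datasets
huggingface_hub
nemo_toolkit
librosa
soundfile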
Verify your CUDA installation and GPU availability:
import torch
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"GPU count: {torch.cuda.device_count()}")
if torch.cuda.is_available():
    print(f"Current GPU: {torch.cuda.get_device_name()}")
Downloading and Setting Up the Granary Dataset
The Granary dataset is hosted on Hugging Face, making it easily accessible through their datasets library. The complete dataset is substantial, so we'll start with a subset for initial experimentation before scaling to the full corpus.
First, authenticate with Hugging Face if you haven't already:
huggingface-cli login
Download a sample subset of the Granary dataset for initial testing:
from datasets import load_dataset

# Load a small subset for testing (1000 samples).
# Check the dataset card on Hugging Face for the exact repo id, configs,
# and column names -- they may differ from the generic names used here.
dataset = load_dataset("nvidia/granary", split="train[:1000]")
print(f"Dataset size: {len(dataset)}")
print(f"Available languages: {set(dataset['language'])}")
print(f"Sample entry: {dataset[0]}")
For production use, download specific language subsets or the complete dataset:
# Download a specific language subset (e.g., German).
# Language subsets are typically exposed as dataset configs; confirm the
# exact config name on the dataset card.
german_dataset = load_dataset("nvidia/granary",
                              "de",
                              split="train",
                              streaming=True)

# For the full dataset (warning: very large download)
# full_dataset = load_dataset("nvidia/granary", split="train")
Create a local directory structure to organize your downloaded data:
mkdir -p granary_project/{data,models,outputs,scripts}
cd granary_project
Setting Up NVIDIA Canary-1b-v2 for High-Quality Transcription
Canary-1b-v2 is NVIDIA's billion-parameter model optimized for accuracy in transcription and translation tasks. This model excels at complex multilingual scenarios where precision is more important than speed.
Download and initialize the Canary model:
from transformers import AutoModel, AutoProcessor
import torch

model_name = "nvidia/canary-1b-v2"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the model and processor.
# Note: the Canary checkpoints are published in NeMo format; if the generic
# transformers loaders do not accept this checkpoint, load it through the
# NeMo toolkit installed earlier instead.
processor = AutoProcessor.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model = model.to(device)
model.eval()

print(f"Canary model loaded on {device}")
print(f"Model parameters: {sum(p.numel() for p in model.parameters()):,}")

Create a function to transcribe audio files using Canary:
import librosa

def transcribe_with_canary(audio_path, target_language="auto"):
    """
    Transcribe an audio file using NVIDIA Canary-1b-v2.

    Args:
        audio_path: Path to the audio file
        target_language: Target language code (e.g., 'en', 'de', 'fr')
    """
    try:
        # Load and preprocess audio at 16 kHz mono
        audio, sample_rate = librosa.load(audio_path, sr=16000)

        # Process audio through the model
        inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
        inputs = {k: v.to(device) for k, v in inputs.items()}

        with torch.no_grad():
            outputs = model(**inputs)

        # Decode the transcription
        transcription = processor.decode(outputs.logits.argmax(dim=-1)[0])

        return {
            "transcription": transcription,
            "language": target_language,
            "confidence": torch.softmax(outputs.logits, dim=-1).max().item()
        }
    except Exception as e:
        return {"error": str(e)}

# Test transcription
result = transcribe_with_canary("sample_audio.wav", "en")
if "error" not in result:
    print(f"Transcription: {result['transcription']}")
    print(f"Confidence: {result['confidence']:.3f}")
else:
    print(f"Transcription failed: {result['error']}")
Setting Up Parakeet-tdt-0.6b-v3 for Real-Time Processing
Parakeet-tdt-0.6b-v3 is designed for high-throughput, low-latency applications. With 600 million parameters, it's streamlined for real-time transcription while maintaining good accuracy across all 25 supported languages.
Load and configure the Parakeet model:
parakeet_model_name = "nvidia/parakeet-tdt-0.6b-v3"

# Load Parakeet for real-time processing (like Canary, this checkpoint is
# published in NeMo format; fall back to the NeMo toolkit if needed)
parakeet_processor = AutoProcessor.from_pretrained(parakeet_model_name)
parakeet_model = AutoModel.from_pretrained(parakeet_model_name)
parakeet_model = parakeet_model.to(device)
parakeet_model.eval()

print(f"Parakeet model loaded with {sum(p.numel() for p in parakeet_model.parameters()):,} parameters")
Implement real-time audio processing with streaming capabilities:
import threading
import queue
import time

class RealTimeTranscriber:
    def __init__(self, model, processor, chunk_duration=1.0):
        self.model = model
        self.processor = processor
        self.chunk_duration = chunk_duration
        self.audio_queue = queue.Queue()
        self.results_queue = queue.Queue()
        self.is_running = False
        self.chunk_count = 0

    def start_transcription(self):
        """Start the real-time transcription process"""
        self.is_running = True
        transcription_thread = threading.Thread(target=self._transcription_worker)
        transcription_thread.start()
        return transcription_thread

    def _transcription_worker(self):
        """Background worker for processing audio chunks"""
        while self.is_running:
            try:
                audio_chunk = self.audio_queue.get(timeout=0.1)

                # Process audio chunk
                inputs = self.processor(audio_chunk,
                                        sampling_rate=16000,
                                        return_tensors="pt")
                inputs = {k: v.to(device) for k, v in inputs.items()}

                with torch.no_grad():
                    outputs = self.model(**inputs)

                transcription = self.processor.decode(
                    outputs.logits.argmax(dim=-1)[0]
                )

                # Store result with timestamp and a running chunk counter
                self.chunk_count += 1
                result = {
                    "text": transcription,
                    "timestamp": time.time(),
                    "chunk_id": self.chunk_count
                }
                self.results_queue.put(result)
            except queue.Empty:
                continue
            except Exception as e:
                print(f"Transcription error: {e}")

    def add_audio_chunk(self, audio_data):
        """Add an audio chunk to the processing queue"""
        self.audio_queue.put(audio_data)

    def get_latest_result(self):
        """Get the latest transcription result"""
        try:
            return self.results_queue.get_nowait()
        except queue.Empty:
            return None

    def stop(self):
        """Stop the transcription process"""
        self.is_running = False

# Initialize real-time transcriber
rt_transcriber = RealTimeTranscriber(parakeet_model, parakeet_processor)
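To feed the transcriber real microphone audio, you can capture fixed-length chunks with a library such as sounddevice. This is a minimal sketch under that assumption (pip install sounddevice, plus a 16 kHz-capable mono input device); it is not part of the original setup:
# Assumes: pip install sounddevice, and a working 16 kHz mono input device
import sounddevice as sd
import numpy as np

def capture_microphone_chunks(transcriber, chunk_seconds=1.0, total_seconds=10.0):
    """Record fixed-length chunks from the default microphone and queue them."""
    samples_per_chunk = int(16000 * chunk_seconds)
    num_chunks = int(total_seconds / chunk_seconds)
    for _ in range(num_chunks):
        # Blocking record of one chunk at 16 kHz mono
        chunk = sd.rec(samples_per_chunk, samplerate=16000, channels=1, dtype="float32")
        sd.wait()
        transcriber.add_audio_chunk(np.squeeze(chunk))

# Example: start the worker, capture 10 seconds of audio, then stop
# thread = rt_transcriber.start_transcription()
# capture_microphone_chunks(rt_transcriber, total_seconds=10.0)
# rt_transcriber.stop()
# thread.join()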
Building a Complete Multilingual Speech Recognition Application
Now let's combine both models to create a comprehensive speech recognition application that can handle both batch processing and real-time transcription across all 25 supported languages.
Create the main application class:
import os
import json
from datetime import datetime

class MultilingualSpeechApp:
    def __init__(self):
        self.canary_model = model
        self.canary_processor = processor
        self.parakeet_model = parakeet_model
        self.parakeet_processor = parakeet_processor
        # The 25 European languages covered by Granary
        self.supported_languages = [
            'bg', 'hr', 'cs', 'da', 'nl', 'en', 'et', 'fi', 'fr', 'de',
            'el', 'hu', 'it', 'lv', 'lt', 'mt', 'pl', 'pt', 'ro', 'ru',
            'sk', 'sl', 'es', 'sv', 'uk'
        ]

    def detect_language(self, audio_path):
        """Automatically detect the language of audio input"""
        try:
            audio, _ = librosa.load(audio_path, sr=16000)
            inputs = self.canary_processor(audio, sampling_rate=16000, return_tensors="pt")
            with torch.no_grad():
                outputs = self.canary_model(**inputs)
            # Language detection logic would go here
            # For now, return 'auto' for automatic detection
            return 'auto'
        except Exception as e:
            print(f"Language detection error: {e}")
            return 'en'  # Default to English

    def batch_transcribe(self, audio_files, output_format="json"):
        """Process multiple audio files in batch"""
        results = []
        for audio_file in audio_files:
            print(f"Processing {audio_file}...")

            # Detect language
            detected_lang = self.detect_language(audio_file)

            # Transcribe with Canary for high accuracy
            result = transcribe_with_canary(audio_file, detected_lang)
            result.update({
                "file": audio_file,
                "processed_at": datetime.now().isoformat(),
                "model": "canary-1b-v2"
            })
            results.append(result)

        # Save results
        output_file = f"batch_results_{datetime.now().strftime('%Y%m%d_%H%M%S')}.{output_format}"
        if output_format == "json":
            with open(output_file, 'w') as f:
                json.dump(results, f, indent=2)
        else:
            with open(output_file, 'w') as f:
                for result in results:
                    f.write(f"{result['file']}: {result.get('transcription', result.get('error', ''))}\n")

        print(f"Results saved to {output_file}")
        return results

    def stream_transcribe(self, duration_minutes=5):
        """Start real-time transcription from the microphone"""
        print(f"Starting real-time transcription for {duration_minutes} minutes...")
        print("Speak into your microphone...")

        rt_transcriber = RealTimeTranscriber(self.parakeet_model,
                                             self.parakeet_processor)

        # Start transcription
        transcription_thread = rt_transcriber.start_transcription()

        # Simulate audio input (replace with actual microphone capture,
        # e.g. the sounddevice sketch shown earlier)
        start_time = time.time()
        while time.time() - start_time < duration_minutes * 60:
            time.sleep(1)

            # Get latest transcription result
            result = rt_transcriber.get_latest_result()
            if result:
                print(f"[{result['timestamp']:.2f}] {result['text']}")

        rt_transcriber.stop()
        transcription_thread.join()
        print("Real-time transcription stopped.")

# Initialize the application
speech_app = MultilingualSpeechApp()
Test the application with sample audio files:
# Example usage for batch processing
sample_files = ["sample1.wav", "sample2.wav", "sample3.wav"]
batch_results = speech_app.batch_transcribe(sample_files)
# Example usage for real-time transcription
# speech_app.stream_transcribe(duration_minutes=2)
Advanced Configuration and Optimization
For production deployments, you'll want to optimize performance and handle various edge cases. Here are several advanced configurations to improve your application's robustness and efficiency.
Implement model quantization for faster inference:
import torch.quantization

def quantize_model(model):
    """Apply dynamic quantization to reduce model size and improve CPU inference speed"""
    quantized_model = torch.quantization.quantize_dynamic(
        model,
        {torch.nn.Linear},
        dtype=torch.qint8
    )
    return quantized_model

# Apply quantization (optional, for CPU deployment optimization)
if not torch.cuda.is_available():
    parakeet_model = quantize_model(parakeet_model)
    print("Model quantized for CPU inference")
Add comprehensive error handling and logging:
import logging

# Set up logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

class EnhancedSpeechApp(MultilingualSpeechApp):
    def __init__(self):
        super().__init__()
        self.error_count = 0
        self.success_count = 0

    def transcribe_with_fallback(self, audio_path):
        """Transcribe with fallback to the alternative model if the primary fails"""
        try:
            # Try Canary first for best quality
            result = transcribe_with_canary(audio_path)
            if 'error' not in result:
                self.success_count += 1
                return result
        except Exception as e:
            logger.warning(f"Canary failed for {audio_path}: {e}")

        try:
            # Fall back to Parakeet
            audio, _ = librosa.load(audio_path, sr=16000)
            inputs = self.parakeet_processor(audio, sampling_rate=16000, return_tensors="pt")
            inputs = {k: v.to(device) for k, v in inputs.items()}
            with torch.no_grad():
                outputs = self.parakeet_model(**inputs)
            transcription = self.parakeet_processor.decode(
                outputs.logits.argmax(dim=-1)[0]
            )
            self.success_count += 1
            return {
                "transcription": transcription,
                "model": "parakeet-fallback",
                "confidence": 0.8  # Nominal lower confidence for the fallback path
            }
        except Exception as e:
            self.error_count += 1
            logger.error(f"Both models failed for {audio_path}: {e}")
            return {"error": f"Transcription failed: {e}"}

    def get_statistics(self):
        """Return processing statistics"""
        total = self.success_count + self.error_count
        success_rate = self.success_count / total if total > 0 else 0
        return {
            "total_processed": total,
            "successful": self.success_count,
            "failed": self.error_count,
            "success_rate": success_rate
        }
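A quick usage sketch for the enhanced application, assuming the sample files referenced earlier exist locally:
# Run the fallback pipeline over a few files and report statistics
enhanced_app = EnhancedSpeechApp()
for audio_file in ["sample1.wav", "sample2.wav"]:
    output = enhanced_app.transcribe_with_fallback(audio_file)
    print(audio_file, "->", output.get("transcription", output.get("error")))
print(enhanced_app.get_statistics())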
Troubleshooting Common Issues
When working with large speech models and datasets, you may encounter several common issues. Here's how to diagnose and resolve the most frequent problems.
Memory Issues: If you're running out of GPU memory, reduce the batch size and clear the CUDA cache between runs:
def handle_memory_error():
    """Reduce memory usage when a CUDA out-of-memory error occurs"""
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        # Reduce batch size
        batch_size = 1
        print("Reduced batch size to 1 due to memory constraints")
        return batch_size
    return None
Audio Format Compatibility: Ensure audio files are in the correct format:
def validate_audio_file(file_path):
    """Validate and convert an audio file if necessary"""
    try:
        audio, sr = librosa.load(file_path)

        # Check sample rate
        if sr != 16000:
            audio = librosa.resample(audio, orig_sr=sr, target_sr=16000)
            print(f"Resampled {file_path} from {sr}Hz to 16kHz")

        # Check duration
        duration = len(audio) / 16000
        if duration > 30:  # Limit to 30 seconds for processing
            audio = audio[:30 * 16000]
            print(f"Truncated {file_path} to 30 seconds")

        return audio, True
    except Exception as e:
        print(f"Audio validation failed for {file_path}: {e}")
        return None, False
Performance Monitoring: Track processing times and throughput:
import time
from contextlib import contextmanager

@contextmanager
def timer():
    start = time.time()
    yield
    end = time.time()
    print(f"Processing took {end - start:.2f} seconds")

# Usage example
with timer():
    result = transcribe_with_canary("test_audio.wav")
Integration with Popular Frameworks
The NVIDIA Granary models integrate well with existing machine learning frameworks and deployment platforms. Here's how to connect your speech recognition system with popular tools and services.
For automated workflow integration, you can connect the speech system to various APIs and services:
import requests
import asyncio
import aiohttp

class SpeechWorkflowIntegrator:
    def __init__(self, speech_app):
        self.speech_app = speech_app

    async def process_and_send_webhook(self, audio_file, webhook_url):
        """Process audio and send results to a webhook endpoint"""
        result = self.speech_app.transcribe_with_fallback(audio_file)

        async with aiohttp.ClientSession() as session:
            async with session.post(webhook_url, json=result) as response:
                if response.status == 200:
                    print(f"Successfully sent transcription for {audio_file}")
                else:
                    print(f"Webhook failed with status {response.status}")
        return result

    def integrate_with_slack(self, slack_token, channel):
        """Return a helper that posts transcription results to a Slack channel"""
        def send_to_slack(text):
            url = "https://slack.com/api/chat.postMessage"
            headers = {"Authorization": f"Bearer {slack_token}"}
            data = {
                "channel": channel,
                "text": f"🎤 Speech Transcription: {text}"
            }
            response = requests.post(url, headers=headers, json=data)
            return response.json()
        return send_to_slack
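A short usage sketch, with a placeholder webhook URL, Slack token, and channel (substitute your own values):
# Process one file and forward the result to a webhook (URL is a placeholder)
integrator = SpeechWorkflowIntegrator(EnhancedSpeechApp())
asyncio.run(integrator.process_and_send_webhook("sample1.wav",
                                                "https://example.com/webhook"))

# Post a transcription to Slack (token and channel are placeholders)
# notify = integrator.integrate_with_slack("xoxb-your-token", "#transcripts")
# notify("Meeting audio transcribed successfully")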
Deploy your application using Docker for consistent environments:
# Dockerfile for NVIDIA Granary Speech App
FROM nvcr.io/nvidia/pytorch:23.08-py3
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
EXPOSE 8000
CMD ["python", "app.py"]
Performance Benchmarks and Optimization
Understanding the performance characteristics of both Canary and Parakeet models helps you choose the right model for your specific use case. Here are benchmark results and optimization strategies.
The Canary-1b-v2 model typically runs at a real-time factor of roughly 0.1 on modern GPUs, meaning a 10-second audio clip takes about 1 second to transcribe. Parakeet-tdt-0.6b-v3 is built for streaming and high-throughput use and generally processes audio considerably faster, which is why it is the better fit for latency-sensitive applications. Exact throughput depends heavily on your GPU, batch size, and audio length, so benchmark on your own hardware.
Create a benchmark script to measure performance on your hardware:
def benchmark_models(audio_files, iterations=3):
    """Benchmark both models across multiple iterations"""
    results = {"canary": [], "parakeet": []}

    for iteration in range(iterations):
        print(f"Benchmark iteration {iteration + 1}/{iterations}")

        # Benchmark Canary
        start_time = time.time()
        for audio_file in audio_files:
            transcribe_with_canary(audio_file)
        canary_time = time.time() - start_time
        results["canary"].append(canary_time)

        # Benchmark Parakeet
        start_time = time.time()
        for audio_file in audio_files:
            audio, _ = librosa.load(audio_file, sr=16000)
            inputs = parakeet_processor(audio, sampling_rate=16000, return_tensors="pt")
            inputs = {k: v.to(device) for k, v in inputs.items()}
            with torch.no_grad():
                parakeet_model(**inputs)
        parakeet_time = time.time() - start_time
        results["parakeet"].append(parakeet_time)

    # Calculate averages
    avg_canary = sum(results["canary"]) / len(results["canary"])
    avg_parakeet = sum(results["parakeet"]) / len(results["parakeet"])

    print(f"Average Canary time: {avg_canary:.2f}s")
    print(f"Average Parakeet time: {avg_parakeet:.2f}s")
    print(f"Speedup ratio: {avg_canary/avg_parakeet:.2f}x")

    return results
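Run the benchmark against a few representative clips from your own data to get numbers for your hardware:
# Example: benchmark both models on two local audio files
benchmark_results = benchmark_models(["sample1.wav", "sample2.wav"], iterations=3)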
Next Steps and Advanced Applications
With your multilingual speech recognition system now operational, consider these advanced applications and improvements. The Granary dataset opens possibilities for specialized domain adaptation, where you can fine-tune the models for specific industries like healthcare, legal, or technical documentation.
For production deployment, implement model serving with a framework like TorchServe and optimize inference with TensorRT. Consider using NVIDIA's Triton Inference Server for scalable model deployment that can handle multiple concurrent requests efficiently.
The combination of high-quality transcription from Canary and real-time processing from Parakeet creates opportunities for hybrid applications. You might use Parakeet for initial real-time transcription and then run Canary in the background for higher-quality final transcripts.
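As a rough sketch of that two-pass pattern, reusing the models and helpers loaded earlier in this guide (the draft/final split shown here is illustrative, not a prescribed API):
def two_pass_transcribe(audio_path):
    """Fast draft with Parakeet, then a higher-quality final pass with Canary."""
    # First pass: quick draft transcript with Parakeet
    audio, _ = librosa.load(audio_path, sr=16000)
    inputs = parakeet_processor(audio, sampling_rate=16000, return_tensors="pt")
    inputs = {k: v.to(device) for k, v in inputs.items()}
    with torch.no_grad():
        draft_outputs = parakeet_model(**inputs)
    draft = parakeet_processor.decode(draft_outputs.logits.argmax(dim=-1)[0])

    # Second pass: higher-quality transcript with Canary
    final = transcribe_with_canary(audio_path)

    return {"draft": draft, "final": final.get("transcription", draft)}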
Future enhancements could include speaker diarization to identify different speakers in multi-person conversations, emotion recognition to analyze the emotional tone of speech, and integration with large language models for summarization and analysis of transcribed content.
The open-source nature of the Granary dataset also means you can contribute back to the community by sharing improvements, reporting issues, or extending support to additional languages. NVIDIA's collaborative approach with Carnegie Mellon University and Fondazione Bruno Kessler demonstrates the power of open research in advancing speech technology.
Consider exploring the dataset's potential for cross-lingual applications, where you train models to translate speech directly from one language to another without intermediate text representation. The rich multilingual nature of Granary makes it an ideal foundation for such advanced speech-to-speech translation systems.
This comprehensive setup guide provides the foundation for building sophisticated multilingual speech applications using NVIDIA's cutting-edge Granary dataset and models. The combination of detailed technical implementation and practical optimization strategies equips you with everything needed to deploy production-ready speech recognition systems across 25 European languages.