faster-whisper
Repository · Free · Faster Whisper transcription with CTranslate2
Capabilities (13 decomposed)
ctranslate2-accelerated speech-to-text transcription
Medium confidence — Reimplements OpenAI's Whisper ASR model using CTranslate2, a specialized inference engine for Transformer models that applies operator-level optimizations (graph compilation, memory pooling, quantization-aware kernels) to achieve 4x faster transcription than the original implementation while maintaining identical accuracy. The WhisperModel class wraps CTranslate2's compiled model format, enabling CPU and GPU inference with automatic device selection and fallback mechanisms.
Uses CTranslate2's compiled model format with operator-level kernel optimizations and memory pooling rather than PyTorch's dynamic graph execution, enabling 4x speedup through reduced memory allocations and fused operations. Includes automatic model conversion pipeline from Hugging Face Hub with 13+ pre-optimized variants.
4x faster than openai/whisper on CPU, maintains identical accuracy, requires no FFmpeg installation, and provides pre-converted models eliminating conversion overhead for end users.
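A minimal usage sketch against the documented WhisperModel API (model size, device, and file name are illustrative):

```python
from faster_whisper import WhisperModel

# Load a pre-converted CTranslate2 model; "cuda"/"float16" is typical for
# GPU inference, while "cpu"/"int8" suits CPU-only machines.
model = WhisperModel("large-v3", device="cuda", compute_type="float16")

# transcribe() returns a lazy generator of segments plus metadata.
segments, info = model.transcribe("audio.mp3", beam_size=5)
for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
```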
batched parallel transcription with dynamic scheduling
Medium confidence — The BatchedInferencePipeline class wraps a WhisperModel and feeds groups of audio segments through the CTranslate2 inference engine simultaneously, achieving a 3-5x additional speedup over sequential WhisperModel transcription. The batch size is configured per call, trading GPU/CPU memory for throughput.
Integrates directly with CTranslate2's batch inference API rather than orchestrating parallelism at the Python level, avoiding Python-level serialization overhead.
3-5x faster than sequential WhisperModel for batch jobs and requires no external orchestration framework (vs. Ray/Dask); batch size must be chosen to fit available memory (see Known Limitations).
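A sketch of the batched pipeline per the documented API; batch_size and the file name are illustrative:

```python
from faster_whisper import WhisperModel, BatchedInferencePipeline

model = WhisperModel("large-v3", device="cuda", compute_type="float16")
batched_model = BatchedInferencePipeline(model=model)

# batch_size controls how many audio segments are decoded per forward
# pass; larger values raise throughput at the cost of memory.
segments, info = batched_model.transcribe("audio.mp3", batch_size=16)
for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
```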
pyav-based audio decoding without ffmpeg dependency
Medium confidence — Implements audio decoding using PyAV (Python bindings for FFmpeg libraries) bundled as a dependency, eliminating the need for a separate FFmpeg installation. The decode_audio() utility supports 100+ audio formats (MP3, WAV, FLAC, M4A, OGG, OPUS, AIFF, etc.) and automatically resamples to 16kHz mono, handling format detection, channel mixing, and sample rate conversion in a single pass.
Bundles PyAV as a dependency, eliminating separate FFmpeg installation while supporting 100+ audio formats. Implements single-pass decoding with automatic resampling to 16kHz mono, avoiding multi-step preprocessing pipelines.
No FFmpeg installation required (vs. openai/whisper, which shells out to a system FFmpeg binary), supports 100+ formats natively, and single-pass preprocessing reduces I/O overhead vs. separate decode-then-resample steps.
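A sketch of the decode_audio() utility (file name illustrative):

```python
from faster_whisper import decode_audio

# Decodes any PyAV-supported container/codec and resamples to 16 kHz
# mono in a single pass; the result is a float32 NumPy array of samples.
samples = decode_audio("podcast.m4a", sampling_rate=16000)
print(samples.shape, samples.dtype)
```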
model conversion pipeline from pytorch to ctranslate2 format
Medium confidence — Provides model conversion utilities that transform OpenAI's PyTorch Whisper checkpoints into optimized CTranslate2 format, applying graph compilation, operator fusion, and quantization during conversion. The conversion is a one-time offline operation that generates hardware-optimized model files, enabling fast inference without requiring PyTorch at runtime.
Implements offline conversion pipeline that applies graph compilation, operator fusion, and quantization at conversion time, generating hardware-optimized models. Pre-converted models available for download, eliminating conversion step for end users.
Offline conversion enables aggressive optimization (operator fusion, graph compilation) not possible at runtime, pre-converted models eliminate user-side conversion complexity, and quantization chosen at conversion is baked into the saved weights rather than recomputed on every load.
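A one-time conversion sketch using CTranslate2's Python converter class (the ct2-transformers-converter CLI is the equivalent command-line route); the checkpoint name, output directory, and copied files are illustrative:

```python
from ctranslate2.converters import TransformersConverter

# Convert a Hugging Face Whisper checkpoint into CTranslate2 format,
# baking int8 quantization into the saved weights.
converter = TransformersConverter(
    "openai/whisper-tiny",
    copy_files=["tokenizer.json", "preprocessor_config.json"],
)
converter.convert("whisper-tiny-ct2", quantization="int8")
```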
output format generation (json, srt, vtt) with configurable timestamps
Medium confidence — Provides format_timestamp() utility and output formatting options that convert transcription results into standard subtitle formats (SRT, VTT) and JSON, with configurable timestamp precision and segment boundaries. The formatter handles edge cases like overlapping segments, missing timestamps, and language-specific formatting rules.
Provides unified formatting interface supporting multiple output formats (SRT, VTT, JSON) with configurable timestamp precision and segment boundaries. Handles edge cases like overlapping segments and missing timestamps automatically.
Single utility handles multiple output formats (vs. separate tools for each format), configurable timestamp precision enables use cases from video editing to accessibility, and automatic edge case handling reduces post-processing.
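A sketch of SRT assembly from segments, assuming format_timestamp lives in faster_whisper.utils with the same signature as the upstream Whisper helper; the SRT loop itself is illustrative user code rather than a built-in exporter:

```python
from faster_whisper import WhisperModel
from faster_whisper.utils import format_timestamp

model = WhisperModel("small", device="cpu", compute_type="int8")
segments, _ = model.transcribe("audio.mp3")

# SRT blocks use HH:MM:SS,mmm timestamps with a comma decimal marker.
for i, segment in enumerate(segments, start=1):
    start = format_timestamp(segment.start, always_include_hours=True, decimal_marker=",")
    end = format_timestamp(segment.end, always_include_hours=True, decimal_marker=",")
    print(f"{i}\n{start} --> {end}\n{segment.text.strip()}\n")
```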
silero vad-based voice activity detection and silence removal
Medium confidence — Integrates Silero VAD v6 model to detect speech segments and remove silence from audio before transcription, reducing processing time by ~50% by skipping non-speech regions. The VAD pipeline operates as a preprocessing stage that segments audio into speech/non-speech chunks, filters out silence, and passes only active speech regions to the Whisper encoder, reducing token count and inference cost.
Uses Silero VAD v6 as a preprocessing stage integrated into the audio pipeline, not as post-processing filtering. Segments audio into speech chunks before encoding, reducing token count and Whisper encoder load proportionally to silence duration.
~50% faster transcription on audio with >30% silence, requires no external VAD library installation (Silero bundled), and operates at inference time rather than requiring separate preprocessing steps.
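A sketch of the documented VAD options (threshold value illustrative):

```python
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")

# vad_filter=True runs Silero VAD before encoding; vad_parameters tunes
# it, e.g. how long a pause must last before it counts as silence.
segments, _ = model.transcribe(
    "meeting.wav",
    vad_filter=True,
    vad_parameters=dict(min_silence_duration_ms=500),
)
```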
word-level timestamp alignment via cross-attention mechanism
Medium confidence — Extracts word-level timestamps by analyzing cross-attention weights between the Whisper decoder and encoder outputs, mapping each decoded token to its corresponding audio time region. The mechanism leverages the Transformer's attention patterns to align subword tokens to audio frames, then aggregates token-level alignments into word-level boundaries without requiring external alignment models or post-processing.
Extracts alignment directly from Whisper's cross-attention weights without external alignment models (vs. forced alignment tools like Montreal Forced Aligner). Operates during inference, not as post-processing, enabling real-time timestamp generation.
No external alignment model required, timestamps are produced during transcription rather than in a separate alignment pass, and accuracy tracks Whisper's own token predictions.
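A sketch of the documented word_timestamps flag (file name illustrative):

```python
from faster_whisper import WhisperModel

model = WhisperModel("small")

# word_timestamps=True attaches per-word start/end times derived from
# the decoder's cross-attention alignment.
segments, _ = model.transcribe("audio.mp3", word_timestamps=True)
for segment in segments:
    for word in segment.words:
        print("[%.2fs -> %.2fs] %s" % (word.start, word.end, word.word))
```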
multi-language auto-detection with 99-language support
Medium confidence — Automatically detects the language of input audio by processing the first 30 seconds through Whisper's language identification head, which outputs probability scores across 99 supported languages. The detection runs as a lightweight preprocessing step before full transcription, enabling single-pass multilingual pipelines without requiring language hints or separate language detection models.
Leverages Whisper's built-in language identification head (trained on 99 languages) rather than external language detection models. Runs as lightweight preprocessing step using only the first 30 seconds of audio, enabling fast language routing.
Supports 99 languages natively (vs. 50-60 for most external language ID tools), requires no additional model downloads, and integrates seamlessly into transcription pipeline.
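A sketch of automatic detection; leaving the language argument unset triggers the identification pass, and the returned info carries the result (file name illustrative):

```python
from faster_whisper import WhisperModel

model = WhisperModel("small")

# With no explicit language, the language ID head scores the first
# 30 seconds and the top-scoring language is used for transcription.
segments, info = model.transcribe("interview.ogg")
print("Detected language '%s' with probability %f"
      % (info.language, info.language_probability))
```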
quantization-aware model compression with int8 and float16 precision
Medium confidence — Supports 8-bit integer quantization (int8) and float16 precision modes, reducing model size by 35-50% and memory footprint proportionally while maintaining >99% accuracy. Quantization can be baked in at the CTranslate2 conversion stage or selected at load time via the compute_type argument, enabling hardware-accelerated quantized inference on CPUs and GPUs that support int8 operations.
Saving quantized weights at conversion time (offline) avoids Python-level quantization overhead at inference, while compute_type still allows a different precision to be selected when loading. Pre-converted quantized models are available for download, eliminating the conversion step for users.
35-50% memory reduction with <1% accuracy loss, hardware-accelerated int8 inference (vs. software quantization), and pre-converted models eliminate user-side conversion complexity.
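A sketch of the documented compute_type options (model size illustrative):

```python
from faster_whisper import WhisperModel

# GPU inference in half precision.
model = WhisperModel("large-v3", device="cuda", compute_type="float16")

# GPU inference with int8 weights and float16 computation.
model = WhisperModel("large-v3", device="cuda", compute_type="int8_float16")

# CPU inference with int8 quantization.
model = WhisperModel("large-v3", device="cpu", compute_type="int8")
```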
hotword and prefix biasing for domain-specific transcription
Medium confidence — Accepts a hotwords string and optional prefix text that bias the Whisper decoder toward recognizing specific terms or continuing with expected text patterns. The hotwords are injected into the decoder's prompt context at decode time (and are ignored when a prefix is supplied), raising the likelihood of domain terms and prefix-consistent sequences and enabling domain-specific transcription without fine-tuning.
Applies biasing through the decoder's prompt context during beam search decoding rather than through post-processing or fine-tuning. Hotwords and prefix are applied per-transcription without model reloading, enabling dynamic domain switching.
No fine-tuning required, dynamic hotword updates per session, and prompt-level biasing integrates directly with Whisper's beam search (vs. post-processing filtering which may break coherence).
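A sketch of per-call biasing; per the library's transcribe() signature, hotwords takes a single string of terms (file name and terms illustrative):

```python
from faster_whisper import WhisperModel

model = WhisperModel("small")

# Domain terms are passed as one string and fed into the decoder's
# prompt context; they are ignored if prefix is also given.
segments, _ = model.transcribe(
    "standup.wav",
    hotwords="CTranslate2 Kubernetes gRPC",
)
```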
configurable beam search decoding with temperature fallback
Medium confidence — Implements beam search decoding with configurable beam width (default 5) and a temperature-based fallback mechanism. If a decode fails the quality checks (e.g., the compression-ratio or average log-probability thresholds), the decoder automatically retries with the next, higher temperature from a configurable schedule, ensuring robustness across diverse audio conditions without requiring user intervention.
Implements automatic fallback from beam search to temperature sampling without user intervention, ensuring transcription robustness across edge-case audio. Beam width and temperature are configurable per-transcription, enabling dynamic strategy adjustment.
Automatic fallback mechanism eliminates transcription failures on problematic audio (vs. fixed beam search which may fail), and per-transcription configuration enables adaptive strategies without model reloading.
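A sketch of the decoding controls; the values shown are the library defaults (file name illustrative):

```python
from faster_whisper import WhisperModel

model = WhisperModel("small")
segments, _ = model.transcribe(
    "noisy.wav",
    beam_size=5,
    # Temperatures are tried in order; a higher one is used only when
    # the current decode fails the quality checks below.
    temperature=[0.0, 0.2, 0.4, 0.6, 0.8, 1.0],
    compression_ratio_threshold=2.4,  # reject highly repetitive output
    log_prob_threshold=-1.0,          # reject low-confidence output
)
```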
stereo diarization with left/right channel separation
Medium confidence — Processes stereo audio by separating left and right channels and transcribing each independently, then merging results with channel labels to enable speaker diarization without external speaker separation models. The mechanism treats each channel as a separate audio stream, assigns speaker labels based on channel identity, and reconstructs the timeline with speaker boundaries.
Implements channel-based diarization by processing stereo channels independently and merging results with speaker labels, avoiding external speaker separation models. Operates at audio preprocessing stage, not post-processing.
No external speaker diarization model required, simple channel-based approach for pre-separated audio, and integrated into transcription pipeline without additional inference overhead.
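A sketch of channel-split diarization using the documented split_stereo option; the speaker labels are illustrative:

```python
from faster_whisper import WhisperModel, decode_audio

# split_stereo=True returns the left and right channels as separate
# 16 kHz mono arrays.
left, right = decode_audio("call.wav", split_stereo=True)

model = WhisperModel("small")
for speaker, channel in (("agent", left), ("customer", right)):
    segments, _ = model.transcribe(channel)
    for segment in segments:
        print("[%s] %.2fs -> %.2fs: %s"
              % (speaker, segment.start, segment.end, segment.text))
```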
automatic model downloading and caching from hugging face hub
Medium confidence — Provides download_model() utility that automatically fetches pre-converted CTranslate2 models from Hugging Face Hub, caches them locally with integrity verification, and manages model versioning. The caching mechanism uses content-addressable storage (hash-based paths) to prevent corruption and enable atomic updates, with configurable cache directory and automatic cleanup of unused models.
Uses content-addressable caching with hash-based paths and integrity verification, enabling atomic updates and corruption detection. Integrates directly with Hugging Face Hub API, eliminating manual model conversion for end users.
Automatic model download and caching with zero user setup, hash-based integrity verification prevents corruption, and pre-converted models eliminate conversion overhead vs. manual PyTorch-to-CTranslate2 conversion.
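A sketch of the download-and-cache flow (model size and cache path illustrative):

```python
from faster_whisper import WhisperModel, download_model

# Fetch the pre-converted CTranslate2 files from Hugging Face Hub, or
# reuse the local cache; the returned directory feeds WhisperModel.
model_dir = download_model("large-v3", cache_dir="/opt/models")
model = WhisperModel(model_dir)
```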
Capabilities are decomposed by AI analysis. Each maps to specific user intents and improves with match feedback.
Related Artifacts (sharing capabilities)
Artifacts that share capabilities with faster-whisper, ranked by overlap. Discovered automatically through the match graph.
faster-whisper-tiny.en
automatic-speech-recognition model. 1,112,112 downloads.
Mistral: Voxtral Small 24B 2507
Voxtral Small is an enhancement of Mistral Small 3, incorporating state-of-the-art audio input capabilities while retaining best-in-class text performance. It excels at speech transcription, translation and audio understanding. Input audio...
Qwen3-ASR-1.7B
automatic-speech-recognition model. 1,774,899 downloads.
whisperX
Free
Whisper CLI
OpenAI speech recognition CLI.
Scribewave
AI-Powered Transcription and Language...
Best For
- ✓ developers building production ASR pipelines with latency constraints
- ✓ teams deploying speech recognition on edge devices or cost-sensitive infrastructure
- ✓ researchers benchmarking Whisper performance across hardware configurations
- ✓ batch processing pipelines (e.g., daily transcription jobs, content moderation workflows)
- ✓ teams with large audio datasets requiring high-throughput processing
- ✓ applications where latency per file is less critical than overall throughput
- ✓ applications handling user-uploaded audio of unknown format
- ✓ deployment environments where FFmpeg installation is restricted or unavailable
Known Limitations
- ⚠ CTranslate2 compilation step required during model loading (~5-10s on first run), adds startup latency
- ⚠ Model format is CTranslate2-specific; cannot directly use PyTorch checkpoints without conversion
- ⚠ GPU acceleration requires CUDA 11.0+ or compatible hardware; CPU fallback is slower than GPU by 8-15x
- ⚠ No dynamic model switching mid-transcription; must reload model class to change variants
- ⚠ Batching introduces 100-500ms latency overhead per batch due to queue aggregation; unsuitable for real-time streaming
- ⚠ Batch size must be tuned manually based on GPU memory; no automatic adaptive batching
UnfragileRank
UnfragileRank is computed from adoption signals, documentation quality, ecosystem connectivity, match graph feedback, and freshness. No artifact can pay for a higher rank.
Package Details
About
Faster Whisper transcription with CTranslate2
Alternatives to faster-whisper
- This repository contains hand-curated resources for Prompt Engineering with a focus on Generative Pre-trained Transformer (GPT), ChatGPT, PaLM, etc. Compare →
- World's first open-source, agentic video production system. 12 pipelines, 52 tools, 500+ agent skills. Turn your AI coding assistant into a full video production studio. Compare →