Deepgram API
Free speech-to-text API: Nova-3, real-time streaming, diarization, sentiment, 36+ languages.
Capabilities (18 decomposed)
streaming-speech-to-text-transcription-with-real-time-processing
Medium confidence. Converts live audio streams to text via the WebSocket (WSS) protocol with low-latency processing. Deepgram's Flux models process audio chunks incrementally, detecting natural speech boundaries and returning partial transcripts in real time without waiting for audio completion. Supports 150-225 concurrent WebSocket connections depending on tier, enabling high-throughput voice applications.
Flux models are purpose-built for conversational speech with turn-taking detection and interruption handling, processing audio incrementally via WebSocket to return partial results before audio ends — unlike batch-only APIs. Supports 10-language multilingual conversations within a single stream without language switching overhead.
Faster real-time response than Google Cloud Speech-to-Text or AWS Transcribe because Flux models emit partial transcripts mid-speech rather than waiting for audio completion, enabling immediate downstream processing.
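A minimal Python streaming sketch, assuming the documented wss://api.deepgram.com/v1/listen endpoint, Token auth header, and JSON CloseStream control message; the query parameters and response field layout are assumptions to check against the current API reference:

```python
# Minimal live-streaming sketch using the `websockets` library.
import asyncio
import json
import websockets

API_KEY = "YOUR_DEEPGRAM_API_KEY"  # placeholder
URL = "wss://api.deepgram.com/v1/listen?encoding=linear16&sample_rate=16000"

async def stream(pcm_chunks):
    # Note: older `websockets` versions take extra_headers=; newer ones additional_headers=.
    async with websockets.connect(
        URL, extra_headers={"Authorization": f"Token {API_KEY}"}
    ) as ws:
        async def send_audio():
            for chunk in pcm_chunks:  # raw 16-bit PCM frames
                await ws.send(chunk)
            await ws.send(json.dumps({"type": "CloseStream"}))  # assumed close message

        async def read_partials():
            async for message in ws:
                result = json.loads(message)
                # Partial transcripts arrive before the audio ends;
                # the field path below is an assumption.
                alt = result.get("channel", {}).get("alternatives", [{}])[0]
                print(alt.get("transcript", ""))

        await asyncio.gather(send_audio(), read_partials())
```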
batch-audio-transcription-with-speaker-diarization
Medium confidence. Processes pre-recorded audio files via REST API with automatic speaker identification and segmentation. Nova-3 models analyze complete audio files to detect multiple speakers, assign speaker labels, and return structured transcripts with speaker turns and timing information. Handles background noise, crosstalk, and far-field audio through deep learning-based noise robustness.
Nova-3 Multilingual model automatically detects language across 45+ languages without pre-configuration, and speaker diarization works across all supported languages — enabling single API call for multilingual multi-speaker content. Handles far-field and noisy audio through specialized training.
More cost-effective than Whisper Cloud for batch processing (Nova-3 pricing undercuts Whisper), and includes speaker diarization natively without separate API calls or post-processing.
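A hedged REST sketch for batch diarization; the diarize/smart_format parameters and the response field layout are assumptions drawn from Deepgram's public docs:

```python
# Batch transcription with speaker diarization via REST, using `requests`.
import requests

API_KEY = "YOUR_DEEPGRAM_API_KEY"  # placeholder
with open("meeting.wav", "rb") as audio:
    resp = requests.post(
        "https://api.deepgram.com/v1/listen",
        params={"model": "nova-3", "diarize": "true", "smart_format": "true"},
        headers={"Authorization": f"Token {API_KEY}", "Content-Type": "audio/wav"},
        data=audio,
    )
resp.raise_for_status()

# Each word is expected to carry a speaker index and timing (assumed fields).
words = resp.json()["results"]["channels"][0]["alternatives"][0]["words"]
for w in words:
    print(f'speaker {w.get("speaker")}: {w["word"]} [{w["start"]:.2f}-{w["end"]:.2f}s]')
```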
custom-model-training-for-proprietary-speech-patterns
Medium confidence. Deepgram offers custom model training for organizations with proprietary speech patterns, accents, or domain-specific audio characteristics. Custom models are trained on customer-provided datasets and deployed as dedicated endpoints. Enables organizations to achieve higher accuracy on edge-case audio (heavy accents, background noise, specialized vocabulary) that generic models struggle with.
Custom models are trained on customer data and deployed as isolated endpoints, ensuring proprietary speech patterns remain private and not mixed into public models. Deepgram handles full training pipeline including data validation, model optimization, and endpoint provisioning.
More private than using public models (no data leakage to competitors); more cost-effective than building in-house speech recognition infrastructure; faster than training custom models from scratch because Deepgram provides pre-trained foundation.
smart-formatting-for-readable-transcripts
Medium confidence. Automatically applies formatting rules to transcripts to improve readability without manual post-processing. Converts numbers to digits, adds punctuation, capitalizes proper nouns, and formats currency/dates according to locale. Smart formatting operates on raw transcription output, transforming 'one thousand two hundred thirty four dollars' to '$1,234' and 'the meeting is on january fifteenth' to 'The meeting is on January 15th'.
Smart formatting is applied during transcription post-processing, not as a separate API call; it is integrated into the response pipeline to avoid added latency. It handles multiple formatting types (numbers, dates, currency, punctuation) in a single pass.
More efficient than calling separate text formatting API because formatting is built into Deepgram's response; more accurate than regex-based post-processing because formatting rules understand speech context.
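A small sketch showing the effect of toggling the (assumed) smart_format flag on the same file:

```python
# Transcribe the same file with and without smart formatting; parameter
# names and response paths are assumptions from Deepgram's docs.
import requests

def transcribe(smart_format: bool) -> str:
    with open("invoice_call.wav", "rb") as audio:
        resp = requests.post(
            "https://api.deepgram.com/v1/listen",
            params={"model": "nova-3", "smart_format": str(smart_format).lower()},
            headers={"Authorization": "Token YOUR_DEEPGRAM_API_KEY",
                     "Content-Type": "audio/wav"},
            data=audio,
        )
    return resp.json()["results"]["channels"][0]["alternatives"][0]["transcript"]

print(transcribe(False))  # e.g. "one thousand two hundred thirty four dollars"
print(transcribe(True))   # e.g. "$1,234"
```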
multi-language-support-within-single-conversation-stream
Medium confidence. Flux Multilingual model supports 10 languages (English, Spanish, German, French, Hindi, Russian, Portuguese, Japanese, Italian, Dutch) within a single WebSocket stream, automatically detecting language switches mid-conversation. Enables applications to handle multilingual users without requiring separate connections or language pre-specification. Language detection happens continuously throughout the stream.
Flux Multilingual detects language switches continuously within a single stream without reconnection or model switching — language detection is per-segment, not per-stream. Enables seamless multilingual conversations without user intervention.
More seamless than competitors requiring separate API calls per language or manual language selection; lower latency than sequential language detection because detection is integrated into transcription model.
concurrent-connection-management-with-tiered-rate-limits
Medium confidence. Deepgram enforces concurrent connection limits that vary by API type and subscription tier. WebSocket STT supports 150 (free/pay-as-you-go) or 225 (Growth tier) concurrent connections; REST STT/TTS is limited to 50 concurrent requests; the Voice Agent API to 45 (free) or 60 (Growth) concurrent sessions; Audio Intelligence to 10 concurrent regardless of tier. Developers must manage connection pooling and queuing to respect these limits.
Concurrency limits are enforced per API type and tier, with WebSocket getting higher limits than REST, which reflects Deepgram's architecture, where WebSocket is more efficient for streaming. Audio Intelligence has a universal 10-connection cap, creating an asymmetric bottleneck.
More transparent than some competitors about concurrency limits; Growth tier upgrade provides meaningful concurrency increase for WebSocket (150→225) but not for REST or Audio Intelligence.
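A common client-side guard for these caps is a semaphore sized to the tier limit, so excess work queues instead of tripping rate-limit errors; a sketch with a hypothetical run_deepgram_stream placeholder:

```python
import asyncio

WEBSOCKET_LIMIT = 150  # free / pay-as-you-go cap from this listing; adjust per tier
sem = asyncio.Semaphore(WEBSOCKET_LIMIT)

async def run_deepgram_stream(stream_id: str) -> None:
    # Hypothetical placeholder for an actual WebSocket session
    # (see the streaming sketch earlier in this listing).
    await asyncio.sleep(0.1)

async def transcribe_stream(stream_id: str) -> None:
    async with sem:  # blocks while 150 streams are already in flight
        await run_deepgram_stream(stream_id)

async def main(stream_ids):
    await asyncio.gather(*(transcribe_stream(s) for s in stream_ids))

# asyncio.run(main([f"call-{i}" for i in range(500)]))
```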
freemium-tier-with-200-dollar-credit-and-no-expiration
Medium confidence. Deepgram offers a free tier with a $200 credit that never expires; no credit card is required to sign up. The free tier includes access to all public models (Flux, Nova-3) and all endpoints (STT, TTS, Voice Agent, Audio Intelligence) at full concurrency limits (150 WebSocket STT, 50 REST, etc.). Developers can build and test production applications without payment until the credit is exhausted.
Non-expiring $200 credit is unusual in the industry — most competitors offer monthly free tier or time-limited trial. No credit card requirement lowers barrier to entry for developers.
More generous than Google Cloud Speech-to-Text free tier (60 minutes/month) or AWS Transcribe free tier (250 minutes/month); non-expiring credit is better than time-limited trials because developers can work at their own pace.
pay-as-you-go-pricing-with-growth-tier-discounts
Medium confidence. Deepgram offers two pricing models: pay-as-you-go (per-minute consumption) and a Growth tier (pre-paid annual credits with a 10-20% discount). Pay-as-you-go pricing ranges from $0.0048/min (Nova-3 Monolingual) to $0.0078/min (Flux Multilingual) for STT. The Growth tier offers the same models at discounted rates ($0.0042-$0.0068/min) with a pre-paid annual commitment. Pricing is per minute of audio processed, not per request.
Pricing is per minute of audio processed, not per API call, which keeps costs transparent and predictable for high-volume applications. The Growth tier discount (10-20%) is modest compared to some competitors, but pay-as-you-go requires no minimum commitment.
More transparent than competitors with opaque enterprise pricing; per-minute pricing is fairer than per-request pricing for long-form audio; the Growth tier discount is smaller than some competitors' (AWS, Google) but comes without long-term contract lock-in.
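A worked cost comparison using the per-minute rates quoted above:

```python
# Monthly STT cost at 100,000 minutes, using this listing's quoted rates.
MINUTES = 100_000

payg_nova3   = 0.0048 * MINUTES  # $480.00 pay-as-you-go, Nova-3 Monolingual
payg_flux    = 0.0078 * MINUTES  # $780.00 pay-as-you-go, Flux Multilingual
growth_nova3 = 0.0042 * MINUTES  # $420.00 Growth tier, i.e. 12.5% below PAYG

print(f"Nova-3 PAYG:     ${payg_nova3:,.2f}")
print(f"Flux Multi PAYG: ${payg_flux:,.2f}")
print(f"Nova-3 Growth:   ${growth_nova3:,.2f}")
```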
deepgram-cli-with-28-api-commands-and-mcp-server
Medium confidence. Deepgram CLI is a command-line tool with 28 built-in commands for transcription, synthesis, and management tasks. Includes integrated MCP (Model Context Protocol) server, enabling AI agents to call Deepgram APIs directly without custom integration code. CLI supports both interactive and scripted usage, with output formatting options (JSON, text, etc.).
Built-in MCP server enables AI agents to call Deepgram without custom integration — agents can use Deepgram as a native tool via MCP protocol. CLI includes 28 commands covering common operations, reducing need for custom scripts.
More convenient than calling REST API directly from shell scripts; MCP integration is more modern than webhook-based integrations, enabling AI agents to use Deepgram as a native capability.
sdk-support-across-five-languages-with-feature-parity
Medium confidence. Deepgram provides official SDKs for Python, JavaScript, Go, .NET, and Java. SDKs abstract HTTP/WebSocket complexity, handle authentication, manage connection pooling, and provide language-idiomatic APIs. Feature parity across SDKs is claimed but not verified; specific version numbers and supported features per SDK are not documented.
SDKs are available for five major languages, providing language-idiomatic APIs rather than forcing developers to use raw HTTP. WebSocket connection management is abstracted, reducing complexity for streaming use cases.
More convenient than raw HTTP clients because SDKs handle authentication, connection pooling, and error handling; available across more languages than some competitors (e.g., ElevenLabs).
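A sketch of the rough shape of a prerecorded call with the official Python SDK; class, method, and option names vary across SDK versions, so treat this as an assumption to check against the SDK README:

```python
# Rough v3-era Python SDK shape (names may differ in your SDK version).
from deepgram import DeepgramClient, PrerecordedOptions

deepgram = DeepgramClient("YOUR_DEEPGRAM_API_KEY")
options = PrerecordedOptions(model="nova-3", smart_format=True, diarize=True)
response = deepgram.listen.prerecorded.v("1").transcribe_url(
    {"url": "https://example.com/audio.wav"}, options
)
print(response.results.channels[0].alternatives[0].transcript)
```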
automatic-language-detection-and-multilingual-transcription
Medium confidence. Automatically identifies spoken language from audio without pre-configuration, supporting 45+ languages in Nova-3 Multilingual model or 10 languages in Flux Multilingual for real-time. Detection happens during initial audio processing; language is returned in response metadata and used to optimize transcription accuracy for that language's phonetics and vocabulary.
Nova-3 Multilingual detects from 45+ languages automatically, while Flux Multilingual handles 10 languages in real-time streaming — Deepgram's approach embeds language detection into the transcription model rather than as a separate preprocessing step, reducing latency.
Faster than Google Cloud Speech-to-Text's language detection because detection and transcription happen in a single model pass rather than sequential API calls; supports more languages than most competitors' auto-detection (45+ vs. typical 20-30).
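A hedged batch sketch; detect_language=true and the detected_language metadata field are assumptions from Deepgram's docs:

```python
import requests

with open("voicemail.mp3", "rb") as audio:
    resp = requests.post(
        "https://api.deepgram.com/v1/listen",
        params={"model": "nova-3", "detect_language": "true"},
        headers={"Authorization": "Token YOUR_DEEPGRAM_API_KEY",
                 "Content-Type": "audio/mpeg"},
        data=audio,
    )

channel = resp.json()["results"]["channels"][0]
print(channel.get("detected_language"))           # e.g. "es" (assumed field)
print(channel["alternatives"][0]["transcript"])
```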
conversational-turn-detection-and-interruption-handling
Medium confidence. Flux models detect natural speech boundaries and turn-taking in conversations, automatically identifying when a speaker has finished talking and when another speaker begins. Built-in interruption handling allows overlapping speech to be processed without requiring explicit silence detection thresholds. Enables voice agents to know when to stop listening and trigger response generation without timeout-based heuristics.
Flux models are trained specifically on conversational speech patterns to detect natural turn boundaries without explicit silence thresholds — unlike generic STT models that require fixed timeout windows. Handles overlapping speech (interruptions) as a first-class feature rather than edge case.
More natural than Whisper or Google Cloud Speech-to-Text because turn detection is built into the model rather than requiring post-processing heuristics; eliminates latency from silence timeout windows.
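A consumer-side sketch of what this enables; the TurnEnd and Interruption event names and fields below are invented for illustration, since this listing does not document Flux's actual message schema:

```python
import json

def handle_flux_message(raw: str, trigger_llm) -> None:
    msg = json.loads(raw)
    if msg.get("type") == "TurnEnd":  # hypothetical event name
        # The model decided the speaker finished; hand off to the LLM now
        # instead of waiting out a fixed silence timeout.
        trigger_llm(msg.get("transcript", ""))
    elif msg.get("type") == "Interruption":  # hypothetical event name
        # Caller started talking over agent audio; stop playback.
        print("barge-in detected, cancel TTS playback")
```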
unified-voice-agent-orchestration-with-stt-llm-tts-integration
Medium confidence. Voice Agent API combines speech-to-text, LLM integration, and text-to-speech in a single WebSocket connection, orchestrating the full conversational loop. Audio input flows to Flux STT model, transcript is sent to configured LLM (provider UNKNOWN), LLM response is streamed to TTS model, and synthesized audio is returned to client — all within one persistent connection without intermediate API calls.
Single WebSocket connection handles STT→LLM→TTS pipeline without intermediate REST calls, reducing latency and connection overhead. Flux models' turn detection integrates with LLM triggering — agent knows when to stop listening and start generating response.
Simpler than building voice agents with separate Deepgram STT + OpenAI LLM + ElevenLabs TTS APIs because orchestration is built-in; lower latency than sequential API calls because all components share one connection.
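A sketch of the kind of settings message a client might send after opening the agent WebSocket; every key below is an assumed configuration shape, not a confirmed schema, and the LLM provider is left unspecified as in the listing:

```python
import json

settings = {
    "type": "Settings",  # assumed message type
    "agent": {
        "listen": {"model": "flux-general"},  # STT leg (assumed model name)
        "think":  {"provider": "..."},        # LLM provider UNKNOWN per this listing
        "speak":  {"model": "aura-2"},        # TTS leg (assumed model name)
    },
}
# await ws.send(json.dumps(settings))  # sent once over the persistent connection
```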
text-to-speech-synthesis-with-streaming-input
Medium confidence. Converts text to natural-sounding audio via REST or WebSocket API. Supports streaming text input (partial text can be sent before full response is available), enabling real-time audio generation as LLM generates response tokens. Multiple voices and languages available (specific count and list not documented). Synthesized audio is returned as audio stream (format UNKNOWN).
Supports streaming text input via WebSocket, enabling audio generation to begin before full text is available — useful for real-time LLM response streaming. Integration with Voice Agent API allows TTS to receive LLM output directly without intermediate buffering.
Streaming text input is less common among competitors (ElevenLabs, Google Cloud TTS); it enables lower latency for LLM-to-speech pipelines by starting audio generation before the LLM completes.
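A minimal REST sketch; the /v1/speak endpoint and model parameter are assumptions from Deepgram's public docs, and the output format is not specified in this listing:

```python
import requests

resp = requests.post(
    "https://api.deepgram.com/v1/speak",
    params={"model": "aura-asteria-en"},  # assumed voice/model name
    headers={"Authorization": "Token YOUR_DEEPGRAM_API_KEY",
             "Content-Type": "application/json"},
    json={"text": "Your order has shipped and should arrive Thursday."},
)
resp.raise_for_status()
with open("reply.mp3", "wb") as f:  # container/format is an assumption
    f.write(resp.content)
```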
sentiment-analysis-on-transcribed-speech
Medium confidence. Audio Intelligence endpoint analyzes transcribed speech to detect emotional tone and sentiment (positive, negative, neutral). Processes audio or transcript to extract sentiment signals, returning sentiment labels and confidence scores. Operates as post-processing on transcription output or as standalone analysis on pre-transcribed text.
Sentiment analysis operates on speech audio directly (not just text), capturing vocal tone and prosody cues that text-only sentiment misses. Integrates with speaker diarization to attribute sentiment to specific speakers.
More accurate than text-only sentiment because it captures vocal tone, emphasis, and prosody; integrated with Deepgram's transcription pipeline so no separate audio upload needed.
topic-detection-and-content-categorization
Medium confidence. Audio Intelligence endpoint automatically identifies topics and themes discussed in audio conversations. Analyzes transcribed speech to extract key topics, categorize conversation content, and return topic labels with relevance scores. Enables automatic routing, content classification, and conversation summarization without manual tagging.
Topic detection integrates with speaker diarization and sentiment analysis to provide multi-dimensional conversation analysis in single API call. Operates on speech audio directly, capturing context from tone and pacing that text-only approaches miss.
More efficient than separate text classification APIs because topics are extracted during transcription processing rather than requiring separate text analysis pass.
automatic-summarization-of-audio-conversations
Medium confidence. Audio Intelligence endpoint generates abstractive summaries of audio conversations, condensing key points and action items from transcribed speech. Summarization operates on full transcript or speaker segments, extracting essential information and generating concise natural language summaries without manual review.
Summarization operates on speech audio with speaker context (from diarization) and sentiment (from sentiment analysis), enabling summaries that attribute statements to speakers and highlight emotional context. Single API call generates summary without separate LLM call.
More integrated than calling separate LLM for summarization because summary generation is optimized for speech patterns and includes speaker attribution natively.
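A single hedged call covering the three Audio Intelligence capabilities above (sentiment, topics, summarization); parameter names and response paths are assumptions to verify against the docs:

```python
import requests

with open("support_call.wav", "rb") as audio:
    resp = requests.post(
        "https://api.deepgram.com/v1/listen",
        params={"model": "nova-3", "diarize": "true",
                "sentiment": "true", "topics": "true", "summarize": "v2"},
        headers={"Authorization": "Token YOUR_DEEPGRAM_API_KEY",
                 "Content-Type": "audio/wav"},
        data=audio,
    )

results = resp.json()["results"]
print(results.get("summary", {}).get("short"))  # assumed summary field
print(results.get("sentiments"))                # per-segment sentiment (assumed)
print(results.get("topics"))                    # topic labels (assumed)
```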
keyterm-prompting-for-domain-specific-accuracy
Medium confidence. Allows developers to provide domain-specific keywords or phrases that the STT model should prioritize during transcription. Keyterm prompting biases the model's decoding toward specified terms, improving accuracy for technical jargon, product names, or domain-specific vocabulary that might otherwise be misrecognized. Implemented as optional parameter in transcription requests.
Keyterm prompting is built into Deepgram's STT models as a native feature, not post-processing — the model's decoding process directly incorporates keyterm bias during transcription rather than correcting afterward. Works across all languages and models.
More effective than post-processing keyword replacement because bias is applied during model inference; more flexible than fine-tuned custom models because keyterms can be updated per-request without retraining.
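A hedged sketch; the repeatable keyterm query parameter is an assumption based on Nova-3's documentation (earlier models used a separate keywords parameter):

```python
import requests

with open("standup.wav", "rb") as audio:
    resp = requests.post(
        "https://api.deepgram.com/v1/listen",
        # List-of-tuples form lets `requests` repeat the keyterm parameter.
        params=[("model", "nova-3"),
                ("keyterm", "Kubernetes"),
                ("keyterm", "Istio sidecar"),
                ("keyterm", "Deepgram")],
        headers={"Authorization": "Token YOUR_DEEPGRAM_API_KEY",
                 "Content-Type": "audio/wav"},
        data=audio,
    )
print(resp.json()["results"]["channels"][0]["alternatives"][0]["transcript"])
```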
Capabilities are decomposed by AI analysis. Each maps to specific user intents and improves with match feedback.
Related Artifacts (sharing capabilities)
Artifacts that share capabilities with Deepgram API, ranked by overlap. Discovered automatically through the match graph.
ElevenLabs
Ultra-realistic AI voice synthesis with cloning and multilingual TTS.
OpenAI: GPT Audio
The gpt-audio model is OpenAI's first generally available audio model. The new snapshot features an upgraded decoder for more natural sounding voices and maintains better voice consistency. Audio is priced...
Limitless
An AI memory assistant for recording conversations and meetings, generating summaries, and searching past interactions across apps and an optional wearable.
ElevenLabs API
Most realistic AI voice API — TTS, voice cloning, 29 languages, streaming, dubbing.
MiniMax
Multimodal foundation models for text, speech, video, and music generation
Rev AI
Speech-to-text API built on decade of human transcription data.
Best For
- ✓voice agent developers building conversational AI
- ✓contact center platforms requiring live call transcription
- ✓teams building real-time meeting transcription tools
- ✓podcast and audio content platforms
- ✓legal and compliance teams processing recorded depositions or interviews
- ✓research teams analyzing multi-speaker audio datasets
- ✓enterprise organizations with large proprietary audio datasets
- ✓specialized industries (medical, legal, technical) with domain-specific speech
Known Limitations
- ⚠WebSocket connections limited to 150 concurrent (free/pay-as-you-go) or 225 (Growth tier) — scaling beyond requires multiple API keys or tier upgrade
- ⚠Latency metrics not publicly specified; 'ultra-low latency' is a marketing claim without SLA guarantees
- ⚠Audio format support and sample rate constraints not documented
- ⚠No built-in persistence — transcripts must be captured and stored by client application
- ⚠REST API limited to 50 concurrent requests on both free/pay-as-you-go and Growth tiers; upgrading the tier does not increase the REST limit
- ⚠Maximum audio duration not specified — may require chunking for very long files
About
AI speech-to-text and text-to-speech API built around the Nova-3 and Flux models. Features real-time streaming, speaker diarization, sentiment analysis, topic detection, and summarization. Supports 36+ languages.