Deepgram API
Free speech-to-text API: Nova-3, real-time streaming, diarization, sentiment, 36+ languages.
Capabilities (18 decomposed)
streaming-speech-to-text-transcription-with-real-time-processing
Medium confidence. Converts live audio streams to text via the WebSocket (WSS) protocol with low-latency processing. Deepgram's Flux models process audio chunks incrementally, detecting natural speech boundaries and returning partial transcripts in real time without waiting for audio completion. Supports 150-225 concurrent WebSocket connections depending on tier, enabling high-throughput voice applications.
Flux models are purpose-built for conversational speech with turn-taking detection and interruption handling, processing audio incrementally via WebSocket to return partial results before audio ends — unlike batch-only APIs. Supports 10-language multilingual conversations within a single stream without language switching overhead.
Faster real-time response than Google Cloud Speech-to-Text or AWS Transcribe because Flux models emit partial transcripts mid-speech rather than waiting for audio completion, enabling immediate downstream processing.
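A minimal Python streaming sketch, assuming the documented wss://api.deepgram.com/v1/listen endpoint, Token auth header, and JSON CloseStream control message; the query parameters and response field layout are assumptions to check against the current API reference:

```python
# Minimal live-streaming sketch using the `websockets` library.
import asyncio
import json
import websockets

API_KEY = "YOUR_DEEPGRAM_API_KEY"  # placeholder
URL = "wss://api.deepgram.com/v1/listen?encoding=linear16&sample_rate=16000"

async def stream(pcm_chunks):
    # Note: older `websockets` versions take extra_headers=; newer ones additional_headers=.
    async with websockets.connect(
        URL, extra_headers={"Authorization": f"Token {API_KEY}"}
    ) as ws:
        async def send_audio():
            for chunk in pcm_chunks:  # raw 16-bit PCM frames
                await ws.send(chunk)
            await ws.send(json.dumps({"type": "CloseStream"}))  # assumed close message

        async def read_partials():
            async for message in ws:
                result = json.loads(message)
                # Partial transcripts arrive before the audio ends;
                # the field path below is an assumption.
                alt = result.get("channel", {}).get("alternatives", [{}])[0]
                print(alt.get("transcript", ""))

        await asyncio.gather(send_audio(), read_partials())
```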
batch-audio-transcription-with-speaker-diarization
Medium confidence. Processes pre-recorded audio files via REST API with automatic speaker identification and segmentation. Nova-3 models analyze complete audio files to detect multiple speakers, assign speaker labels, and return structured transcripts with speaker turns and timing information. Handles background noise, crosstalk, and far-field audio through deep learning-based noise robustness.
Nova-3 Multilingual model automatically detects language across 45+ languages without pre-configuration, and speaker diarization works across all supported languages — enabling single API call for multilingual multi-speaker content. Handles far-field and noisy audio through specialized training.
More cost-effective than Whisper Cloud for batch processing (Nova-3 pricing undercuts Whisper), and includes speaker diarization natively without separate API calls or post-processing.
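A hedged REST sketch for batch diarization; the diarize/smart_format parameters and the response field layout are assumptions drawn from Deepgram's public docs:

```python
# Batch transcription with speaker diarization via REST, using `requests`.
import requests

API_KEY = "YOUR_DEEPGRAM_API_KEY"  # placeholder
with open("meeting.wav", "rb") as audio:
    resp = requests.post(
        "https://api.deepgram.com/v1/listen",
        params={"model": "nova-3", "diarize": "true", "smart_format": "true"},
        headers={"Authorization": f"Token {API_KEY}", "Content-Type": "audio/wav"},
        data=audio,
    )
resp.raise_for_status()

# Each word is expected to carry a speaker index and timing (assumed fields).
words = resp.json()["results"]["channels"][0]["alternatives"][0]["words"]
for w in words:
    print(f'speaker {w.get("speaker")}: {w["word"]} [{w["start"]:.2f}-{w["end"]:.2f}s]')
```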
custom-model-training-for-proprietary-speech-patterns
Medium confidence. Deepgram offers custom model training for organizations with proprietary speech patterns, accents, or domain-specific audio characteristics. Custom models are trained on customer-provided datasets and deployed as dedicated endpoints. Enables organizations to achieve higher accuracy on edge-case audio (heavy accents, background noise, specialized vocabulary) that generic models struggle with.
Custom models are trained on customer data and deployed as isolated endpoints, ensuring proprietary speech patterns remain private and not mixed into public models. Deepgram handles full training pipeline including data validation, model optimization, and endpoint provisioning.
More private than using public models (no data leakage to competitors); more cost-effective than building in-house speech recognition infrastructure; faster than training custom models from scratch because Deepgram provides pre-trained foundation.
smart-formatting-for-readable-transcripts
Medium confidence. Automatically applies formatting rules to transcripts to improve readability without manual post-processing. Converts numbers to digits, adds punctuation, capitalizes proper nouns, and formats currency/dates according to locale. Smart formatting operates on raw transcription output, transforming 'one thousand two hundred thirty four dollars' to '$1,234' and 'the meeting is on january fifteenth' to 'The meeting is on January 15th'.
Smart formatting is applied during transcription post-processing, not as a separate API call; it is integrated into the response pipeline to avoid added latency. It handles multiple formatting types (numbers, dates, currency, punctuation) in a single pass.
More efficient than calling separate text formatting API because formatting is built into Deepgram's response; more accurate than regex-based post-processing because formatting rules understand speech context.
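A small sketch showing the effect of toggling the (assumed) smart_format flag on the same file:

```python
# Transcribe the same file with and without smart formatting; parameter
# names and response paths are assumptions from Deepgram's docs.
import requests

def transcribe(smart_format: bool) -> str:
    with open("invoice_call.wav", "rb") as audio:
        resp = requests.post(
            "https://api.deepgram.com/v1/listen",
            params={"model": "nova-3", "smart_format": str(smart_format).lower()},
            headers={"Authorization": "Token YOUR_DEEPGRAM_API_KEY",
                     "Content-Type": "audio/wav"},
            data=audio,
        )
    return resp.json()["results"]["channels"][0]["alternatives"][0]["transcript"]

print(transcribe(False))  # e.g. "one thousand two hundred thirty four dollars"
print(transcribe(True))   # e.g. "$1,234"
```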
multi-language-support-within-single-conversation-stream
Medium confidence. Flux Multilingual model supports 10 languages (English, Spanish, German, French, Hindi, Russian, Portuguese, Japanese, Italian, Dutch) within a single WebSocket stream, automatically detecting language switches mid-conversation. Enables applications to handle multilingual users without requiring separate connections or language pre-specification. Language detection happens continuously throughout the stream.
Flux Multilingual detects language switches continuously within a single stream without reconnection or model switching — language detection is per-segment, not per-stream. Enables seamless multilingual conversations without user intervention.
More seamless than competitors requiring separate API calls per language or manual language selection; lower latency than sequential language detection because detection is integrated into transcription model.
concurrent-connection-management-with-tiered-rate-limits
Medium confidence. Deepgram enforces concurrent connection limits that vary by API type and subscription tier. WebSocket STT supports 150 (free/pay-as-you-go) or 225 (Growth tier) concurrent connections; REST STT/TTS is limited to 50 concurrent requests; the Voice Agent API to 45 (free) or 60 (Growth) concurrent sessions; Audio Intelligence to 10 concurrent regardless of tier. Developers must manage connection pooling and queuing to respect these limits.
Concurrency limits are enforced per API type and tier, with WebSocket getting higher limits than REST, which reflects Deepgram's architecture, where WebSocket is more efficient for streaming. Audio Intelligence has a universal 10-connection cap, creating an asymmetric bottleneck.
More transparent than some competitors about concurrency limits; Growth tier upgrade provides meaningful concurrency increase for WebSocket (150→225) but not for REST or Audio Intelligence.
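A common client-side guard for these caps is a semaphore sized to the tier limit, so excess work queues instead of tripping rate-limit errors; a sketch with a hypothetical run_deepgram_stream placeholder:

```python
import asyncio

WEBSOCKET_LIMIT = 150  # free / pay-as-you-go cap from this listing; adjust per tier
sem = asyncio.Semaphore(WEBSOCKET_LIMIT)

async def run_deepgram_stream(stream_id: str) -> None:
    # Hypothetical placeholder for an actual WebSocket session
    # (see the streaming sketch earlier in this listing).
    await asyncio.sleep(0.1)

async def transcribe_stream(stream_id: str) -> None:
    async with sem:  # blocks while 150 streams are already in flight
        await run_deepgram_stream(stream_id)

async def main(stream_ids):
    await asyncio.gather(*(transcribe_stream(s) for s in stream_ids))

# asyncio.run(main([f"call-{i}" for i in range(500)]))
```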
freemium-tier-with-200-dollar-credit-and-no-expiration
Medium confidence. Deepgram offers a free tier with a $200 credit that never expires; no credit card is required to sign up. The free tier includes access to all public models (Flux, Nova-3) and all endpoints (STT, TTS, Voice Agent, Audio Intelligence) at full concurrency limits (150 WebSocket STT, 50 REST, etc.). Developers can build and test production applications without payment until the credit is exhausted.
Non-expiring $200 credit is unusual in the industry — most competitors offer monthly free tier or time-limited trial. No credit card requirement lowers barrier to entry for developers.
More generous than Google Cloud Speech-to-Text free tier (60 minutes/month) or AWS Transcribe free tier (250 minutes/month); non-expiring credit is better than time-limited trials because developers can work at their own pace.
pay-as-you-go-pricing-with-growth-tier-discounts
Medium confidence. Deepgram offers two pricing models: pay-as-you-go (per-minute consumption) and a Growth tier (pre-paid annual credits with a 10-20% discount). Pay-as-you-go pricing ranges from $0.0048/min (Nova-3 Monolingual) to $0.0078/min (Flux Multilingual) for STT. The Growth tier offers the same models at discounted rates ($0.0042-$0.0068/min) with a pre-paid annual commitment. Pricing is per minute of audio processed, not per request.
Pricing is per minute of audio processed, not per API call, which keeps costs transparent and predictable for high-volume applications. The Growth tier discount (10-20%) is modest compared to some competitors, but pay-as-you-go requires no minimum commitment.
More transparent than competitors with opaque enterprise pricing; per-minute pricing is fairer than per-request pricing for long-form audio; the Growth tier discount is smaller than some competitors' (AWS, Google) but comes without long-term contract lock-in.
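A worked cost comparison using the per-minute rates quoted above:

```python
# Monthly STT cost at 100,000 minutes, using this listing's quoted rates.
MINUTES = 100_000

payg_nova3   = 0.0048 * MINUTES  # $480.00 pay-as-you-go, Nova-3 Monolingual
payg_flux    = 0.0078 * MINUTES  # $780.00 pay-as-you-go, Flux Multilingual
growth_nova3 = 0.0042 * MINUTES  # $420.00 Growth tier, i.e. 12.5% below PAYG

print(f"Nova-3 PAYG:     ${payg_nova3:,.2f}")
print(f"Flux Multi PAYG: ${payg_flux:,.2f}")
print(f"Nova-3 Growth:   ${growth_nova3:,.2f}")
```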
deepgram-cli-with-28-api-commands-and-mcp-server
Medium confidence. Deepgram CLI is a command-line tool with 28 built-in commands for transcription, synthesis, and management tasks. Includes integrated MCP (Model Context Protocol) server, enabling AI agents to call Deepgram APIs directly without custom integration code. CLI supports both interactive and scripted usage, with output formatting options (JSON, text, etc.).
Built-in MCP server enables AI agents to call Deepgram without custom integration — agents can use Deepgram as a native tool via MCP protocol. CLI includes 28 commands covering common operations, reducing need for custom scripts.
More convenient than calling REST API directly from shell scripts; MCP integration is more modern than webhook-based integrations, enabling AI agents to use Deepgram as a native capability.
sdk-support-across-five-languages-with-feature-parity
Medium confidence. Deepgram provides official SDKs for Python, JavaScript, Go, .NET, and Java. SDKs abstract HTTP/WebSocket complexity, handle authentication, manage connection pooling, and provide language-idiomatic APIs. Feature parity across SDKs is claimed but not verified; specific version numbers and supported features per SDK are not documented.
SDKs are available for five major languages, providing language-idiomatic APIs rather than forcing developers to use raw HTTP. WebSocket connection management is abstracted, reducing complexity for streaming use cases.
More convenient than raw HTTP clients because SDKs handle authentication, connection pooling, and error handling; available across more languages than some competitors (e.g., ElevenLabs).
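A sketch of the rough shape of a prerecorded call with the official Python SDK; class, method, and option names vary across SDK versions, so treat this as an assumption to check against the SDK README:

```python
# Rough v3-era Python SDK shape (names may differ in your SDK version).
from deepgram import DeepgramClient, PrerecordedOptions

deepgram = DeepgramClient("YOUR_DEEPGRAM_API_KEY")
options = PrerecordedOptions(model="nova-3", smart_format=True, diarize=True)
response = deepgram.listen.prerecorded.v("1").transcribe_url(
    {"url": "https://example.com/audio.wav"}, options
)
print(response.results.channels[0].alternatives[0].transcript)
```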
automatic-language-detection-and-multilingual-transcription
Medium confidence. Automatically identifies spoken language from audio without pre-configuration, supporting 45+ languages in Nova-3 Multilingual model or 10 languages in Flux Multilingual for real-time. Detection happens during initial audio processing; language is returned in response metadata and used to optimize transcription accuracy for that language's phonetics and vocabulary.
Nova-3 Multilingual detects from 45+ languages automatically, while Flux Multilingual handles 10 languages in real-time streaming — Deepgram's approach embeds language detection into the transcription model rather than as a separate preprocessing step, reducing latency.
Faster than Google Cloud Speech-to-Text's language detection because detection and transcription happen in a single model pass rather than sequential API calls; supports more languages than most competitors' auto-detection (45+ vs. typical 20-30).
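A hedged batch sketch; detect_language=true and the detected_language metadata field are assumptions from Deepgram's docs:

```python
import requests

with open("voicemail.mp3", "rb") as audio:
    resp = requests.post(
        "https://api.deepgram.com/v1/listen",
        params={"model": "nova-3", "detect_language": "true"},
        headers={"Authorization": "Token YOUR_DEEPGRAM_API_KEY",
                 "Content-Type": "audio/mpeg"},
        data=audio,
    )

channel = resp.json()["results"]["channels"][0]
print(channel.get("detected_language"))           # e.g. "es" (assumed field)
print(channel["alternatives"][0]["transcript"])
```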
conversational-turn-detection-and-interruption-handling
Medium confidence. Flux models detect natural speech boundaries and turn-taking in conversations, automatically identifying when a speaker has finished talking and when another speaker begins. Built-in interruption handling allows overlapping speech to be processed without requiring explicit silence detection thresholds. Enables voice agents to know when to stop listening and trigger response generation without timeout-based heuristics.
Flux models are trained specifically on conversational speech patterns to detect natural turn boundaries without explicit silence thresholds — unlike generic STT models that require fixed timeout windows. Handles overlapping speech (interruptions) as a first-class feature rather than edge case.
More natural than Whisper or Google Cloud Speech-to-Text because turn detection is built into the model rather than requiring post-processing heuristics; eliminates latency from silence timeout windows.
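A consumer-side sketch of what this enables; the TurnEnd and Interruption event names and fields below are invented for illustration, since this listing does not document Flux's actual message schema:

```python
import json

def handle_flux_message(raw: str, trigger_llm) -> None:
    msg = json.loads(raw)
    if msg.get("type") == "TurnEnd":  # hypothetical event name
        # The model decided the speaker finished; hand off to the LLM now
        # instead of waiting out a fixed silence timeout.
        trigger_llm(msg.get("transcript", ""))
    elif msg.get("type") == "Interruption":  # hypothetical event name
        # Caller started talking over agent audio; stop playback.
        print("barge-in detected, cancel TTS playback")
```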
unified-voice-agent-orchestration-with-stt-llm-tts-integration
Medium confidence. Voice Agent API combines speech-to-text, LLM integration, and text-to-speech in a single WebSocket connection, orchestrating the full conversational loop. Audio input flows to Flux STT model, transcript is sent to configured LLM (provider UNKNOWN), LLM response is streamed to TTS model, and synthesized audio is returned to client — all within one persistent connection without intermediate API calls.
Single WebSocket connection handles STT→LLM→TTS pipeline without intermediate REST calls, reducing latency and connection overhead. Flux models' turn detection integrates with LLM triggering — agent knows when to stop listening and start generating response.
Simpler than building voice agents with separate Deepgram STT + OpenAI LLM + ElevenLabs TTS APIs because orchestration is built-in; lower latency than sequential API calls because all components share one connection.
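A sketch of the kind of settings message a client might send after opening the agent WebSocket; every key below is an assumed configuration shape, not a confirmed schema, and the LLM provider is left unspecified as in the listing:

```python
import json

settings = {
    "type": "Settings",  # assumed message type
    "agent": {
        "listen": {"model": "flux-general"},  # STT leg (assumed model name)
        "think":  {"provider": "..."},        # LLM provider UNKNOWN per this listing
        "speak":  {"model": "aura-2"},        # TTS leg (assumed model name)
    },
}
# await ws.send(json.dumps(settings))  # sent once over the persistent connection
```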
text-to-speech-synthesis-with-streaming-input
Medium confidence. Converts text to natural-sounding audio via REST or WebSocket API. Supports streaming text input (partial text can be sent before full response is available), enabling real-time audio generation as LLM generates response tokens. Multiple voices and languages available (specific count and list not documented). Synthesized audio is returned as audio stream (format UNKNOWN).
Supports streaming text input via WebSocket, enabling audio generation to begin before full text is available — useful for real-time LLM response streaming. Integration with Voice Agent API allows TTS to receive LLM output directly without intermediate buffering.
Streaming text input is less common among competitors (ElevenLabs, Google Cloud TTS); it enables lower latency for LLM-to-speech pipelines by starting audio generation before the LLM completes.
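A minimal REST sketch; the /v1/speak endpoint and model parameter are assumptions from Deepgram's public docs, and the output format is not specified in this listing:

```python
import requests

resp = requests.post(
    "https://api.deepgram.com/v1/speak",
    params={"model": "aura-asteria-en"},  # assumed voice/model name
    headers={"Authorization": "Token YOUR_DEEPGRAM_API_KEY",
             "Content-Type": "application/json"},
    json={"text": "Your order has shipped and should arrive Thursday."},
)
resp.raise_for_status()
with open("reply.mp3", "wb") as f:  # container/format is an assumption
    f.write(resp.content)
```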
sentiment-analysis-on-transcribed-speech
Medium confidence. Audio Intelligence endpoint analyzes transcribed speech to detect emotional tone and sentiment (positive, negative, neutral). Processes audio or transcript to extract sentiment signals, returning sentiment labels and confidence scores. Operates as post-processing on transcription output or as standalone analysis on pre-transcribed text.
Sentiment analysis operates on speech audio directly (not just text), capturing vocal tone and prosody cues that text-only sentiment misses. Integrates with speaker diarization to attribute sentiment to specific speakers.
More accurate than text-only sentiment because it captures vocal tone, emphasis, and prosody; integrated with Deepgram's transcription pipeline so no separate audio upload needed.
topic-detection-and-content-categorization
Medium confidence. Audio Intelligence endpoint automatically identifies topics and themes discussed in audio conversations. Analyzes transcribed speech to extract key topics, categorize conversation content, and return topic labels with relevance scores. Enables automatic routing, content classification, and conversation summarization without manual tagging.
Topic detection integrates with speaker diarization and sentiment analysis to provide multi-dimensional conversation analysis in single API call. Operates on speech audio directly, capturing context from tone and pacing that text-only approaches miss.
More efficient than separate text classification APIs because topics are extracted during transcription processing rather than requiring separate text analysis pass.
automatic-summarization-of-audio-conversations
Medium confidence. Audio Intelligence endpoint generates abstractive summaries of audio conversations, condensing key points and action items from transcribed speech. Summarization operates on full transcript or speaker segments, extracting essential information and generating concise natural language summaries without manual review.
Summarization operates on speech audio with speaker context (from diarization) and sentiment (from sentiment analysis), enabling summaries that attribute statements to speakers and highlight emotional context. Single API call generates summary without separate LLM call.
More integrated than calling separate LLM for summarization because summary generation is optimized for speech patterns and includes speaker attribution natively.
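A single hedged call covering the three Audio Intelligence capabilities above (sentiment, topics, summarization); parameter names and response paths are assumptions to verify against the docs:

```python
import requests

with open("support_call.wav", "rb") as audio:
    resp = requests.post(
        "https://api.deepgram.com/v1/listen",
        params={"model": "nova-3", "diarize": "true",
                "sentiment": "true", "topics": "true", "summarize": "v2"},
        headers={"Authorization": "Token YOUR_DEEPGRAM_API_KEY",
                 "Content-Type": "audio/wav"},
        data=audio,
    )

results = resp.json()["results"]
print(results.get("summary", {}).get("short"))  # assumed summary field
print(results.get("sentiments"))                # per-segment sentiment (assumed)
print(results.get("topics"))                    # topic labels (assumed)
```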
keyterm-prompting-for-domain-specific-accuracy
Medium confidence. Allows developers to provide domain-specific keywords or phrases that the STT model should prioritize during transcription. Keyterm prompting biases the model's decoding toward specified terms, improving accuracy for technical jargon, product names, or domain-specific vocabulary that might otherwise be misrecognized. Implemented as optional parameter in transcription requests.
Keyterm prompting is built into Deepgram's STT models as a native feature, not post-processing — the model's decoding process directly incorporates keyterm bias during transcription rather than correcting afterward. Works across all languages and models.
More effective than post-processing keyword replacement because bias is applied during model inference; more flexible than fine-tuned custom models because keyterms can be updated per-request without retraining.
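A hedged sketch; the repeatable keyterm query parameter is an assumption based on Nova-3's documentation (earlier models used a separate keywords parameter):

```python
import requests

with open("standup.wav", "rb") as audio:
    resp = requests.post(
        "https://api.deepgram.com/v1/listen",
        # List-of-tuples form lets `requests` repeat the keyterm parameter.
        params=[("model", "nova-3"),
                ("keyterm", "Kubernetes"),
                ("keyterm", "Istio sidecar"),
                ("keyterm", "Deepgram")],
        headers={"Authorization": "Token YOUR_DEEPGRAM_API_KEY",
                 "Content-Type": "audio/wav"},
        data=audio,
    )
print(resp.json()["results"]["channels"][0]["alternatives"][0]["transcript"])
```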
Capabilities are decomposed by AI analysis. Each maps to specific user intents and improves with match feedback.
Related Artifacts (sharing capabilities)
Artifacts that share capabilities with Deepgram API, ranked by overlap. Discovered automatically through the match graph.
ElevenLabs
Ultra-realistic AI voice synthesis with cloning and multilingual TTS.
OpenAI: GPT Audio
The gpt-audio model is OpenAI's first generally available audio model. The new snapshot features an upgraded decoder for more natural sounding voices and maintains better voice consistency. Audio is priced...
Limitless
An AI memory assistant for recording conversations and meetings, generating summaries, and searching past interactions across apps and an optional wearable.
ElevenLabs API
Most realistic AI voice API — TTS, voice cloning, 29 languages, streaming, dubbing.
MiniMax
Multimodal foundation models for text, speech, video, and music generation
Rev AI
Speech-to-text API built on decade of human transcription data.
Best For
- ✓voice agent developers building conversational AI
- ✓contact center platforms requiring live call transcription
- ✓teams building real-time meeting transcription tools
- ✓podcast and audio content platforms
- ✓legal and compliance teams processing recorded depositions or interviews
- ✓research teams analyzing multi-speaker audio datasets
- ✓enterprise organizations with large proprietary audio datasets
- ✓specialized industries (medical, legal, technical) with domain-specific speech
Known Limitations
- ⚠WebSocket connections limited to 150 concurrent (free/pay-as-you-go) or 225 (Growth tier) — scaling beyond requires multiple API keys or tier upgrade
- ⚠Latency metrics not publicly specified; 'ultra-low latency' is a marketing claim without SLA guarantees
- ⚠Audio format support and sample rate constraints not documented
- ⚠No built-in persistence — transcripts must be captured and stored by client application
- ⚠REST API limited to 50 concurrent requests on both free/pay-as-you-go and Growth tiers; upgrading the tier does not increase the REST limit
- ⚠Maximum audio duration not specified — may require chunking for very long files
About
AI speech-to-text and text-to-speech API built around the Nova-3 and Flux models. Features real-time streaming, speaker diarization, sentiment analysis, topic detection, and summarization. Supports 36+ languages.