AudioLM: a Language Modeling Approach to Audio Generation (AudioLM)
⭐ 09/2022: [AudioGen: Textually Guided Audio Generation (AudioGen)](https://arxiv.org/abs/2209.15352)
Capabilities (8 decomposed)
hybrid-tokenization audio encoding with dual-stream representation
Medium confidence: Converts raw audio waveforms into discrete token sequences using a hybrid scheme combining masked language model activations (w2v-BERT, for long-term coherence and semantic structure) with neural audio codec codes (SoundStream, for acoustic fidelity). This dual-stream tokenization enables the language model to capture both structural continuity and high-quality synthesis, avoiding the quality degradation that occurs when using either codec tokens or LM tokens alone. The tokenization process discretizes continuous audio representations into a vocabulary suitable for autoregressive language modeling.
Uses a hybrid dual-stream tokenization combining masked LM activations with neural codec codes, rather than relying on a single tokenization source. This architectural choice explicitly addresses the trade-off between structural coherence (from LM tokens) and acoustic quality (from codec tokens) that single-stream approaches face.
Outperforms single-codec tokenization approaches (like Jukebox's VQ-VAE) by preserving long-term semantic structure through LM tokens, while maintaining acoustic quality through codec tokens—a design choice not present in prior audio generation systems.
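As a minimal sketch, the dual-stream idea above can be illustrated by merging two token streams into one vocabulary for a single language model. The names, vocabulary sizes, and simple concatenation below are illustrative assumptions only; AudioLM's actual pipeline models semantic and acoustic tokens in hierarchical stages rather than one flat sequence.

```python
import numpy as np

# Hypothetical vocabulary sizes; the page notes the real codebook
# sizes are not disclosed, so these values are illustrative only.
N_SEMANTIC = 1024   # masked-LM (semantic) token vocabulary
N_ACOUSTIC = 1024   # neural-codec (acoustic) token vocabulary

def build_hybrid_sequence(semantic_tokens, acoustic_tokens):
    """Concatenate the two token streams into one sequence for a single
    language model, offsetting acoustic IDs so the vocabularies stay
    disjoint: semantic IDs in [0, N_SEMANTIC), acoustic IDs above that."""
    semantic = np.asarray(semantic_tokens)
    acoustic = np.asarray(acoustic_tokens) + N_SEMANTIC
    return np.concatenate([semantic, acoustic])

seq = build_hybrid_sequence([3, 17, 255], [0, 9])
print(seq.tolist())  # [3, 17, 255, 1024, 1033]
```

Keeping the two vocabularies disjoint lets one softmax head emit either kind of token while the model can still tell the streams apart by ID range.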
autoregressive audio continuation generation from prompt conditioning
Medium confidence: Generates coherent audio continuations by treating audio generation as a language modeling task: tokenizes a short audio prompt using the hybrid scheme, then autoregressively samples tokens from a transformer-based language model conditioned on the prompt tokens, finally decoding the generated token sequence back to raw waveform. The model learns to predict statistically plausible next tokens given preceding context, enabling it to extend audio with natural prosody, speaker consistency, and structural coherence without requiring transcripts or symbolic representations.
Applies language modeling directly to raw audio tokens rather than requiring intermediate representations (text, phonemes, MIDI, or symbolic notation). The model learns audio structure end-to-end from raw waveforms, enabling it to capture prosodic and acoustic patterns that symbolic approaches miss.
Generates more natural prosody and speaker consistency than text-to-speech baselines because it conditions directly on audio rather than text, and maintains longer-term coherence than codec-only models because it uses LM tokens that capture semantic structure.
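The autoregressive continuation loop described above can be sketched generically. Here `next_token_fn` is a toy stand-in for a trained transformer's sampling step, not AudioLM's actual model:

```python
def continue_tokens(prompt, next_token_fn, n_steps):
    """Autoregressive continuation: repeatedly predict the next token
    from the running context and append it. `next_token_fn` stands in
    for a trained model's prediction/sampling head."""
    tokens = list(prompt)
    for _ in range(n_steps):
        tokens.append(next_token_fn(tokens))
    return tokens

# Toy stand-in model: emits the last token plus one, modulo a tiny vocab.
toy_model = lambda ctx: (ctx[-1] + 1) % 8
print(continue_tokens([2, 5], toy_model, 4))  # [2, 5, 6, 7, 0, 1]
```

In the real system the generated token sequence would then be decoded back to a waveform by the codec decoder.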
speaker-identity preservation across unseen speaker continuations
Medium confidence: Maintains consistent speaker identity when generating audio continuations for speakers not seen during training, achieved through the language model's learned ability to capture speaker-specific acoustic patterns in the token sequence. The hybrid tokenization preserves speaker characteristics in both the masked LM tokens (which encode prosodic and structural patterns) and codec tokens (which encode acoustic timbre), allowing the model to implicitly learn speaker embeddings without explicit speaker conditioning or speaker ID inputs.
Achieves speaker identity preservation implicitly through the language model's learned token distributions, without requiring explicit speaker embeddings, speaker ID conditioning, or speaker-specific fine-tuning. The hybrid tokenization naturally encodes speaker characteristics in both semantic (LM) and acoustic (codec) token streams.
Outperforms speaker-agnostic baselines and matches or exceeds speaker-conditional models while requiring no explicit speaker metadata or conditioning mechanisms, making it more practical for zero-shot speaker adaptation scenarios.
prosody-aware speech generation with intonation and rhythm preservation
Medium confidence: Generates speech continuations that preserve and extend the prosodic characteristics (intonation patterns, rhythm, stress, and timing) of the input prompt by learning prosodic patterns implicitly through the language model's token predictions. The masked LM tokens capture long-term prosodic structure (sentence-level intonation contours, stress patterns), while codec tokens preserve fine-grained acoustic prosody (pitch trajectories, duration variations). The autoregressive generation process naturally extends these prosodic patterns into the continuation.
Preserves prosody implicitly through dual-stream tokenization rather than using explicit prosody features or separate prosody models. The language model learns to predict prosodic continuations as part of the token sequence, enabling natural prosody extension without separate prosody conditioning.
Generates more natural prosody than text-to-speech systems because it learns from raw audio patterns rather than text, and avoids the prosody artifacts common in concatenative or unit-selection synthesis approaches.
piano music generation from raw audio without symbolic representation
Medium confidence: Generates coherent piano music continuations from short audio prompts by applying the same language modeling approach used for speech, but trained on piano music audio without requiring MIDI, sheet music, or symbolic notation. The model learns musical structure (harmony, melody, rhythm, phrasing) directly from raw waveforms, discovering patterns in the acoustic signal that correspond to musical concepts. Generation proceeds autoregressively by predicting next tokens conditioned on the prompt, producing audio that maintains harmonic consistency and musical coherence.
Generates music directly from raw audio without symbolic representation (MIDI, sheet music), learning musical structure end-to-end from acoustic patterns. This approach captures acoustic properties (timbre, dynamics, articulation) that symbolic approaches lose, but sacrifices explicit control over musical parameters.
Captures acoustic nuances and performance characteristics that symbolic music generation systems miss, but lacks the fine-grained control and interpretability of MIDI-based approaches such as MuseNet.
long-context audio coherence through masked language model pre-training
Medium confidence: Maintains long-term coherence and semantic plausibility in generated audio by leveraging a masked language model pre-trained on audio, which learns to predict missing tokens in the middle of sequences. This pre-training objective forces the model to understand long-range dependencies and global structure in audio, enabling it to generate continuations that are not just locally plausible but globally coherent. The masked LM tokens in the hybrid representation explicitly encode this long-range structure, which the autoregressive generation process extends naturally.
Uses masked language model pre-training on audio to explicitly learn long-range dependencies, rather than relying solely on autoregressive training which can suffer from exposure bias and local coherence bias. The hybrid tokenization preserves these learned long-range patterns through dedicated LM tokens.
Maintains longer-range coherence than pure codec-based or autoregressive-only approaches because the masked LM pre-training objective explicitly optimizes for understanding global structure, not just local acoustic plausibility.
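A hedged sketch of the masked-prediction objective described above. The `MASK_ID` sentinel and independent per-token masking are illustrative simplifications; real audio masked LMs such as w2v-BERT mask contiguous spans of continuous features rather than independent discrete tokens.

```python
import numpy as np

MASK_ID = -1  # hypothetical sentinel for masked positions

def mask_tokens(tokens, mask_prob, rng):
    """Randomly replace a fraction of tokens with MASK_ID; the model is
    trained to reconstruct the originals from bidirectional context,
    which forces it to learn long-range structure."""
    tokens = np.asarray(tokens)
    mask = rng.random(tokens.shape) < mask_prob
    corrupted = np.where(mask, MASK_ID, tokens)
    return corrupted, mask

rng = np.random.default_rng(0)
corrupted, mask = mask_tokens(list(range(10)), 0.3, rng)
# Training loss is computed only at the masked positions:
targets = np.arange(10)[mask]
```

Because the reconstruction target can sit far from any unmasked context, the objective rewards representations that carry global structure, not just local acoustics.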
transcript-free audio generation without annotation requirements
Medium confidence: Generates audio without requiring transcripts, phonetic annotations, or any text-based metadata, operating entirely on raw waveforms. This eliminates the annotation bottleneck present in text-to-speech and phoneme-based systems, allowing the model to learn directly from unlabeled audio corpora. The language model operates on discrete audio tokens that implicitly encode linguistic and acoustic information, discovering phonetic and linguistic structure without explicit supervision.
Eliminates transcript and annotation requirements by learning directly from raw audio, using self-supervised pre-training (masked language modeling) to discover linguistic and acoustic structure without explicit supervision. This is a fundamental architectural choice that differs from text-to-speech and phoneme-based approaches.
Scales to unlabeled audio corpora that would be prohibitively expensive to transcribe, and avoids transcription errors that degrade text-to-speech quality, but sacrifices explicit content control that text-based systems provide.
end-to-end raw waveform processing without intermediate representations
Medium confidence: Processes audio entirely in raw waveform form without converting to spectrograms, mel-frequency cepstral coefficients (MFCCs), or other intermediate acoustic features. The tokenization step converts raw waveforms directly to discrete tokens, and generation produces raw waveforms directly from tokens, avoiding information loss and artifacts introduced by intermediate representations. This end-to-end approach preserves fine-grained acoustic details and enables the model to learn directly from the raw signal.
Operates entirely on raw waveforms without intermediate acoustic feature extraction, using neural codecs to discretize the signal directly. This architectural choice differs from spectrogram-based approaches and preserves acoustic details that feature-based methods lose.
Preserves fine-grained acoustic details and avoids spectrogram reconstruction artifacts, but requires more computational resources and careful codec design compared to mel-spectrogram pipelines such as Glow-TTS paired with a WaveGlow vocoder.
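Neural codecs of this kind commonly discretize each embedding frame with residual vector quantization, where each stage encodes the residual left by the previous one. The sketch below uses random codebooks and is purely illustrative, not the codec actually used here:

```python
import numpy as np

def residual_vq(frame, codebooks):
    """Quantize one embedding frame with residual VQ: each stage picks
    the nearest codebook entry to the remaining residual, so a few
    small codebooks approximate one very large effective vocabulary."""
    residual = frame.astype(float)
    codes = []
    for cb in codebooks:
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        codes.append(idx)
        residual = residual - cb[idx]  # pass the leftover to the next stage
    return codes

rng = np.random.default_rng(1)
codebooks = [rng.normal(size=(16, 4)) for _ in range(3)]  # 3 stages of 16 entries
frame = rng.normal(size=4)
codes = residual_vq(frame, codebooks)
print(len(codes))  # one index per quantization stage
```

With 3 stages of 16 entries each, the effective vocabulary is 16³ = 4096 combinations at the storage cost of three 4-bit indices per frame, which is why such codecs suit token-based language modeling.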
Capabilities are decomposed by AI analysis. Each maps to specific user intents and improves with match feedback.
Related Artifacts (sharing capabilities)
Artifacts that share capabilities with AudioLM: a Language Modeling Approach to Audio Generation (AudioLM), ranked by overlap. Discovered automatically through the match graph.
F5-TTS
Text-to-speech model. 661,227 downloads.
Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers (VALL-E)
[MusicLM: Generating Music From Text (MusicLM)](https://arxiv.org/abs/2301.11325)
AudioCraft
A one-stop code base for generative audio needs, by Meta. Includes MusicGen for music and AudioGen for sounds. #opensource
ChatTTS
A generative speech model for daily dialogue.
speecht5_tts
Text-to-speech model. 222,752 downloads.
XTTS-v2
Text-to-speech model. 6,991,040 downloads.
Best For
- ✓ researchers building audio generation systems using language modeling approaches
- ✓ teams developing audio continuation and inpainting applications
- ✓ audio ML engineers exploring discrete representation learning for synthesis
- ✓ audio production workflows requiring speech/music continuation and inpainting
- ✓ speech synthesis and voice cloning applications needing speaker-consistent generation
- ✓ music composition tools for piano or other instruments trained on raw audio
- ✓ research into long-term coherence in neural audio generation
Known Limitations
- ⚠ Tokenization scheme specifics (vocabulary size, token rate, quantization levels) not disclosed in the paper
- ⚠ Trade-off between codec fidelity and LM structure quality not quantified with metrics
- ⚠ Unclear how the tokenization generalizes to audio domains beyond speech and piano music
- ⚠ No information on the computational cost of dual-stream encoding relative to single-stream alternatives
- ⚠ Requires an audio prompt as input; cannot generate audio from scratch or from text descriptions
- ⚠ Prompt length requirements not specified; unclear whether the model supports variable-length conditioning
Requirements
Input / Output
UnfragileRank
UnfragileRank is computed from adoption signals, documentation quality, ecosystem connectivity, match graph feedback, and freshness. No artifact can pay for a higher rank.
About
Categories
Alternatives to AudioLM: a Language Modeling Approach to Audio Generation (AudioLM)
Data Sources