AudioLM: a Language Modeling Approach to Audio Generation (AudioLM)
⭐ 09/2022: [AudioGen: Textually Guided Audio Generation (AudioGen)](https://arxiv.org/abs/2209.15352)
Capabilities (8 decomposed)
hybrid-tokenization audio encoding with dual-stream representation
Medium confidence: Converts raw audio waveforms into discrete token sequences using a hybrid scheme combining masked language model activations (w2v-BERT, for long-term coherence and semantic structure) with neural audio codec codes (SoundStream, for acoustic fidelity). This dual-stream tokenization enables the language model to capture both structural continuity and high-quality synthesis, avoiding the quality degradation that occurs when using either codec tokens or LM tokens alone. The tokenization process discretizes continuous audio representations into a vocabulary suitable for autoregressive language modeling.
Uses a hybrid dual-stream tokenization combining masked LM activations with neural codec codes, rather than relying on a single tokenization source. This architectural choice explicitly addresses the trade-off between structural coherence (from LM tokens) and acoustic quality (from codec tokens) that single-stream approaches face.
Outperforms single-codec tokenization approaches (like Jukebox's VQ-VAE) by preserving long-term semantic structure through LM tokens, while maintaining acoustic quality through codec tokens—a design choice not present in prior audio generation systems.
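As a minimal sketch, the dual-stream idea above can be illustrated by merging two token streams into one vocabulary for a single language model. The names, vocabulary sizes, and simple concatenation below are illustrative assumptions only; AudioLM's actual pipeline models semantic and acoustic tokens in hierarchical stages rather than one flat sequence.

```python
import numpy as np

# Hypothetical vocabulary sizes; the page notes the real codebook
# sizes are not disclosed, so these values are illustrative only.
N_SEMANTIC = 1024   # masked-LM (semantic) token vocabulary
N_ACOUSTIC = 1024   # neural-codec (acoustic) token vocabulary

def build_hybrid_sequence(semantic_tokens, acoustic_tokens):
    """Concatenate the two token streams into one sequence for a single
    language model, offsetting acoustic IDs so the vocabularies stay
    disjoint: semantic IDs in [0, N_SEMANTIC), acoustic IDs above that."""
    semantic = np.asarray(semantic_tokens)
    acoustic = np.asarray(acoustic_tokens) + N_SEMANTIC
    return np.concatenate([semantic, acoustic])

seq = build_hybrid_sequence([3, 17, 255], [0, 9])
print(seq.tolist())  # [3, 17, 255, 1024, 1033]
```

Keeping the two vocabularies disjoint lets one softmax head emit either kind of token while the model can still tell the streams apart by ID range.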
autoregressive audio continuation generation from prompt conditioning
Medium confidence: Generates coherent audio continuations by treating audio generation as a language modeling task: tokenizes a short audio prompt using the hybrid scheme, then autoregressively samples tokens from a transformer-based language model conditioned on the prompt tokens, finally decoding the generated token sequence back to raw waveform. The model learns to predict statistically plausible next tokens given preceding context, enabling it to extend audio with natural prosody, speaker consistency, and structural coherence without requiring transcripts or symbolic representations.
Applies language modeling directly to raw audio tokens rather than requiring intermediate representations (text, phonemes, MIDI, or symbolic notation). The model learns audio structure end-to-end from raw waveforms, enabling it to capture prosodic and acoustic patterns that symbolic approaches miss.
Generates more natural prosody and speaker consistency than text-to-speech baselines because it conditions directly on audio rather than text, and maintains longer-term coherence than codec-only models because it uses LM tokens that capture semantic structure.
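The autoregressive continuation loop described above can be sketched generically. Here `next_token_fn` is a toy stand-in for a trained transformer's sampling step, not AudioLM's actual model:

```python
def continue_tokens(prompt, next_token_fn, n_steps):
    """Autoregressive continuation: repeatedly predict the next token
    from the running context and append it. `next_token_fn` stands in
    for a trained model's prediction/sampling head."""
    tokens = list(prompt)
    for _ in range(n_steps):
        tokens.append(next_token_fn(tokens))
    return tokens

# Toy stand-in model: emits the last token plus one, modulo a tiny vocab.
toy_model = lambda ctx: (ctx[-1] + 1) % 8
print(continue_tokens([2, 5], toy_model, 4))  # [2, 5, 6, 7, 0, 1]
```

In the real system the generated token sequence would then be decoded back to a waveform by the codec decoder.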
speaker-identity preservation across unseen speaker continuations
Medium confidence: Maintains consistent speaker identity when generating audio continuations for speakers not seen during training, achieved through the language model's learned ability to capture speaker-specific acoustic patterns in the token sequence. The hybrid tokenization preserves speaker characteristics in both the masked LM tokens (which encode prosodic and structural patterns) and codec tokens (which encode acoustic timbre), allowing the model to implicitly learn speaker embeddings without explicit speaker conditioning or speaker ID inputs.
Achieves speaker identity preservation implicitly through the language model's learned token distributions, without requiring explicit speaker embeddings, speaker ID conditioning, or speaker-specific fine-tuning. The hybrid tokenization naturally encodes speaker characteristics in both semantic (LM) and acoustic (codec) token streams.
Outperforms speaker-agnostic baselines and matches or exceeds speaker-conditional models while requiring no explicit speaker metadata or conditioning mechanisms, making it more practical for zero-shot speaker adaptation scenarios.
prosody-aware speech generation with intonation and rhythm preservation
Medium confidence: Generates speech continuations that preserve and extend the prosodic characteristics (intonation patterns, rhythm, stress, and timing) of the input prompt by learning prosodic patterns implicitly through the language model's token predictions. The masked LM tokens capture long-term prosodic structure (sentence-level intonation contours, stress patterns), while codec tokens preserve fine-grained acoustic prosody (pitch trajectories, duration variations). The autoregressive generation process naturally extends these prosodic patterns into the continuation.
Preserves prosody implicitly through dual-stream tokenization rather than using explicit prosody features or separate prosody models. The language model learns to predict prosodic continuations as part of the token sequence, enabling natural prosody extension without separate prosody conditioning.
Generates more natural prosody than text-to-speech systems because it learns from raw audio patterns rather than text, and avoids the prosody artifacts common in concatenative or unit-selection synthesis approaches.
piano music generation from raw audio without symbolic representation
Medium confidence: Generates coherent piano music continuations from short audio prompts by applying the same language modeling approach used for speech, but trained on piano music audio without requiring MIDI, sheet music, or symbolic notation. The model learns musical structure (harmony, melody, rhythm, phrasing) directly from raw waveforms, discovering patterns in the acoustic signal that correspond to musical concepts. Generation proceeds autoregressively by predicting next tokens conditioned on the prompt, producing audio that maintains harmonic consistency and musical coherence.
Generates music directly from raw audio without symbolic representation (MIDI, sheet music), learning musical structure end-to-end from acoustic patterns. This approach captures acoustic properties (timbre, dynamics, articulation) that symbolic approaches lose, but sacrifices explicit control over musical parameters.
Captures acoustic nuances and performance characteristics that symbolic music generation systems miss, but lacks the fine-grained control and interpretability of MIDI-based approaches such as MuseNet.
long-context audio coherence through masked language model pre-training
Medium confidence: Maintains long-term coherence and semantic plausibility in generated audio by leveraging a masked language model pre-trained on audio, which learns to predict missing tokens in the middle of sequences. This pre-training objective forces the model to understand long-range dependencies and global structure in audio, enabling it to generate continuations that are not just locally plausible but globally coherent. The masked LM tokens in the hybrid representation explicitly encode this long-range structure, which the autoregressive generation process extends naturally.
Uses masked language model pre-training on audio to explicitly learn long-range dependencies, rather than relying solely on autoregressive training which can suffer from exposure bias and local coherence bias. The hybrid tokenization preserves these learned long-range patterns through dedicated LM tokens.
Maintains longer-range coherence than pure codec-based or autoregressive-only approaches because the masked LM pre-training objective explicitly optimizes for understanding global structure, not just local acoustic plausibility.
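A hedged sketch of the masked-prediction objective described above. The `MASK_ID` sentinel and independent per-token masking are illustrative simplifications; real audio masked LMs such as w2v-BERT mask contiguous spans of continuous features rather than independent discrete tokens.

```python
import numpy as np

MASK_ID = -1  # hypothetical sentinel for masked positions

def mask_tokens(tokens, mask_prob, rng):
    """Randomly replace a fraction of tokens with MASK_ID; the model is
    trained to reconstruct the originals from bidirectional context,
    which forces it to learn long-range structure."""
    tokens = np.asarray(tokens)
    mask = rng.random(tokens.shape) < mask_prob
    corrupted = np.where(mask, MASK_ID, tokens)
    return corrupted, mask

rng = np.random.default_rng(0)
corrupted, mask = mask_tokens(list(range(10)), 0.3, rng)
# Training loss is computed only at the masked positions:
targets = np.arange(10)[mask]
```

Because the reconstruction target can sit far from any unmasked context, the objective rewards representations that carry global structure, not just local acoustics.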
transcript-free audio generation without annotation requirements
Medium confidence: Generates audio without requiring transcripts, phonetic annotations, or any text-based metadata, operating entirely on raw waveforms. This eliminates the annotation bottleneck present in text-to-speech and phoneme-based systems, allowing the model to learn directly from unlabeled audio corpora. The language model operates on discrete audio tokens that implicitly encode linguistic and acoustic information, discovering phonetic and linguistic structure without explicit supervision.
Eliminates transcript and annotation requirements by learning directly from raw audio, using self-supervised pre-training (masked language modeling) to discover linguistic and acoustic structure without explicit supervision. This is a fundamental architectural choice that differs from text-to-speech and phoneme-based approaches.
Scales to unlabeled audio corpora that would be prohibitively expensive to transcribe, and avoids transcription errors that degrade text-to-speech quality, but sacrifices explicit content control that text-based systems provide.
end-to-end raw waveform processing without intermediate representations
Medium confidence: Processes audio entirely in raw waveform form without converting to spectrograms, mel-frequency cepstral coefficients (MFCCs), or other intermediate acoustic features. The tokenization step converts raw waveforms directly to discrete tokens, and generation produces raw waveforms directly from tokens, avoiding information loss and artifacts introduced by intermediate representations. This end-to-end approach preserves fine-grained acoustic details and enables the model to learn directly from the raw signal.
Operates entirely on raw waveforms without intermediate acoustic feature extraction, using neural codecs to discretize the signal directly. This architectural choice differs from spectrogram-based approaches and preserves acoustic details that feature-based methods lose.
Preserves fine-grained acoustic details and avoids spectrogram reconstruction artifacts, but requires more computational resources and careful codec design compared to mel-spectrogram pipelines such as Glow-TTS paired with a WaveGlow vocoder.
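Neural codecs of this kind commonly discretize each embedding frame with residual vector quantization, where each stage encodes the residual left by the previous one. The sketch below uses random codebooks and is purely illustrative, not the codec actually used here:

```python
import numpy as np

def residual_vq(frame, codebooks):
    """Quantize one embedding frame with residual VQ: each stage picks
    the nearest codebook entry to the remaining residual, so a few
    small codebooks approximate one very large effective vocabulary."""
    residual = frame.astype(float)
    codes = []
    for cb in codebooks:
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        codes.append(idx)
        residual = residual - cb[idx]  # pass the leftover to the next stage
    return codes

rng = np.random.default_rng(1)
codebooks = [rng.normal(size=(16, 4)) for _ in range(3)]  # 3 stages of 16 entries
frame = rng.normal(size=4)
codes = residual_vq(frame, codebooks)
print(len(codes))  # one index per quantization stage
```

With 3 stages of 16 entries each, the effective vocabulary is 16³ = 4096 combinations at the storage cost of three 4-bit indices per frame, which is why such codecs suit token-based language modeling.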
Capabilities are decomposed by AI analysis. Each maps to specific user intents and improves with match feedback.
Related Artifacts (sharing capabilities)
Artifacts that share capabilities with AudioLM: a Language Modeling Approach to Audio Generation (AudioLM), ranked by overlap. Discovered automatically through the match graph.
F5-TTS
Text-to-speech model. 661,227 downloads.
Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers (VALL-E)
[MusicLM: Generating Music From Text (MusicLM)](https://arxiv.org/abs/2301.11325)
AudioCraft
A one-stop code base for generative audio needs, by Meta. Includes MusicGen for music and AudioGen for sounds. #opensource
ChatTTS
A generative speech model for daily dialogue.
speecht5_tts
Text-to-speech model. 222,752 downloads.
XTTS-v2
Text-to-speech model. 6,991,040 downloads.
Best For
- ✓ researchers building audio generation systems using language modeling approaches
- ✓ teams developing audio continuation and inpainting applications
- ✓ audio ML engineers exploring discrete representation learning for synthesis
- ✓ audio production workflows requiring speech/music continuation and inpainting
- ✓ speech synthesis and voice cloning applications needing speaker-consistent generation
- ✓ music composition tools for piano or other instruments trained on raw audio
- ✓ research into long-term coherence in neural audio generation
Known Limitations
- ⚠ Tokenization scheme specifics (vocabulary size, token rate, quantization levels) not disclosed in the paper
- ⚠ Trade-off between codec fidelity and LM structure quality not quantified with metrics
- ⚠ Unclear how the tokenization generalizes to audio domains beyond speech and piano music
- ⚠ No information on the computational cost of dual-stream encoding relative to single-stream alternatives
- ⚠ Requires an audio prompt as input; cannot generate audio from scratch or from text descriptions
- ⚠ Prompt length requirements not specified; unclear whether the model supports variable-length conditioning
Requirements
Input / Output
UnfragileRank
UnfragileRank is computed from adoption signals, documentation quality, ecosystem connectivity, match graph feedback, and freshness. No artifact can pay for a higher rank.
About
Categories
Alternatives to AudioLM: a Language Modeling Approach to Audio Generation (AudioLM)
Data Sources