AudioLM: a Language Modeling Approach to Audio Generation (AudioLM)Product24/100 via “autoregressive audio continuation generation from prompt conditioning”
* ⭐ 09/2022: [AudioGen: Textually Guided Audio Generation (AudioGen)](https://arxiv.org/abs/2209.15352)
Unique: Applies language modeling directly to raw audio tokens rather than requiring intermediate representations (text, phonemes, MIDI, or symbolic notation). The model learns audio structure end-to-end from raw waveforms, enabling it to capture prosodic and acoustic patterns that symbolic approaches miss.
vs others: Generates more natural prosody and speaker consistency than text-to-speech baselines because it conditions directly on audio rather than text, and maintains longer-term coherence than codec-only models because it uses LM tokens that capture semantic structure.