EnCodec-Based Neural Audio Waveform Reconstruction

1

Bark | Repository | 55/100

via “encodec-based neural audio waveform reconstruction”

Open-source text-to-audio — speech, music, sound effects, 13+ languages, runs locally.

Unique: Leverages Meta's EnCodec neural codec to reconstruct high-quality waveforms from discrete tokens, enabling end-to-end generative audio without the artifacts a separate vocoder stage can introduce.

vs others: The neural codec decoder is claimed to produce fewer artifacts than mel-spectrogram vocoders (WaveGlow, HiFi-GAN); learned compression maintains perceptual quality at lower bitrates than hand-crafted codecs.

2

AudioCraft | Repository | 55/100

via “neural audio compression with encodec”

Meta's library for music and audio generation.

Unique: Uses residual vector quantization across multiple codebooks (typically 4), where each codebook quantizes the residual left by the previous one, so dropping later codebooks lowers the bitrate at a gradual cost in quality. Trained end-to-end with reconstruction and adversarial losses for realistic reconstruction.

vs others: Achieves better perceptual quality than traditional codecs (MP3, AAC) at equivalent bitrates and enables discrete token representation required for language model-based generation; more efficient than raw waveform processing.
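
Residual vector quantization can be sketched in a few lines. The toy example below is an illustration only, not AudioCraft's implementation: the codebooks are random rather than learned, the sizes (4 codebooks, 16 entries, 8 dimensions) are hypothetical, and a zero entry is added to each codebook so that adding a stage can never increase the error.

```python
import random

def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared L2 distance)."""
    return min(range(len(codebook)),
               key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)))

def rvq_encode(vec, codebooks):
    """Quantize stage by stage; each stage codes the residual of the previous one."""
    residual = list(vec)
    codes = []
    for cb in codebooks:
        idx = nearest(cb, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, cb[idx])]
    return codes

def rvq_decode(codes, codebooks):
    """Reconstruct by summing the selected entry from each codebook."""
    out = [0.0] * len(codebooks[0][0])
    for idx, cb in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, cb[idx])]
    return out

def sq_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Hypothetical toy sizes; real codecs learn these codebooks from data.
rng = random.Random(0)
DIM, ENTRIES, STAGES = 8, 16, 4
codebooks = [[[rng.gauss(0, 0.5 ** s) for _ in range(DIM)] for _ in range(ENTRIES)]
             for s in range(STAGES)]
for cb in codebooks:
    cb[0] = [0.0] * DIM  # zero entry: a stage may choose to add nothing

x = [rng.gauss(0, 1) for _ in range(DIM)]
codes = rvq_encode(x, codebooks)

# Decoding with only the first k codebooks gives a coarser, lower-bitrate result.
errors = [sq_error(x, rvq_decode(codes[:k], codebooks[:k]))
          for k in range(1, STAGES + 1)]
```

Truncating `codes` to its first `k` entries is what makes the bitrate variable: the decoder still works, just with a larger reconstruction error.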

3

chatterbox | Model | 49/100

via “neural vocoding with waveform reconstruction”

Text-to-speech model. 2,108,297 downloads.

Unique: Uses a pre-trained, frozen neural vocoder rather than training vocoding jointly with TTS, enabling modular architecture where vocoder can be swapped without retraining the TTS model. Vocoder is optimized for mel-spectrogram inversion specifically, not general audio generation.

vs others: Faster and higher quality than Griffin-Lim phase reconstruction (the traditional signal-processing approach), but as a two-stage pipeline it may be slower than end-to-end models such as VITS that generate waveforms directly from text.

4

OmniVoice | Model | 49/100

via “neural vocoder integration for waveform generation”

Text-to-speech model. 2,090,369 downloads.

Unique: Integrates modular neural vocoder architecture (HiFi-GAN) with acoustic model, enabling vocoder swapping for quality/latency optimization without retraining acoustic components

vs others: Achieves audio quality comparable to monolithic end-to-end architectures while keeping the vocoder modular for experimentation and optimization.
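
The modularity described here amounts to a narrow interface between the acoustic model and the vocoder. The sketch below is a hypothetical illustration, not OmniVoice's code: `toy_acoustic_model`, `silence_vocoder`, `sine_vocoder`, and the hop length are all made-up stand-ins that only show how a vocoder can be swapped without touching the acoustic side.

```python
import math
from typing import List, Protocol

Mel = List[List[float]]  # frames x mel bins (toy representation)

class Vocoder(Protocol):
    """Anything that maps a mel-spectrogram to waveform samples."""
    def __call__(self, mel: Mel) -> List[float]: ...

HOP = 4  # hypothetical hop length: waveform samples produced per mel frame

def silence_vocoder(mel: Mel) -> List[float]:
    """Trivial stand-in vocoder that outputs silence of the right length."""
    return [0.0] * (len(mel) * HOP)

def sine_vocoder(mel: Mel) -> List[float]:
    """Toy stand-in: one sine cycle per frame, scaled by the frame's mean energy."""
    out: List[float] = []
    for frame in mel:
        amp = sum(frame) / len(frame)
        out.extend(amp * math.sin(2 * math.pi * n / HOP) for n in range(HOP))
    return out

def toy_acoustic_model(text: str) -> Mel:
    """Hypothetical acoustic model: one 2-bin frame per character."""
    return [[(ord(ch) % 7) / 7.0, (ord(ch) % 3) / 3.0] for ch in text]

def synthesize(text: str, vocoder: Vocoder) -> List[float]:
    """The acoustic model is fixed; only the vocoder argument changes."""
    return vocoder(toy_acoustic_model(text))

fast = synthesize("hello", silence_vocoder)
better = synthesize("hello", sine_vocoder)
```

Both calls share the same acoustic model output, which is the property that lets a vocoder be traded for quality or latency without retraining anything upstream.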

5

VibeVoice-Realtime-0.5B | Model | 48/100

via “mel-spectrogram to waveform vocoding with neural upsampling”

Text-to-speech model. 1,152,993 downloads.

Unique: Uses learned neural vocoding instead of traditional signal processing (Griffin-Lim, WORLD) — enables end-to-end differentiable TTS pipeline and better generalization to diverse speaker characteristics. Optimized for 0.5B-scale inference with depthwise-separable convolutions and pruned residual blocks, achieving <100ms latency on mobile GPUs.

vs others: Faster and more natural-sounding than Griffin-Lim (traditional) while using 10x fewer parameters than HiFi-GAN or UnivNet, making it suitable for edge deployment where model size and latency are critical.

6

Kokoro-82M-bf16 | Model | 43/100

via “mel-spectrogram to waveform vocoding”

Text-to-speech model. 469,583 downloads.

Unique: Uses a non-autoregressive vocoder (likely a HiFi-GAN variant) that generates entire waveforms in a single forward pass, reportedly achieving a 50-100x speedup over autoregressive alternatives like WaveNet. The vocoder is optimized for MLX inference, leveraging GPU acceleration on Apple silicon to produce 24 kHz audio at real-time or faster-than-real-time speeds.

vs others: Faster than WaveGlow or WaveNet vocoders while maintaining comparable audio quality; more efficient than traditional signal processing vocoders (WORLD, STRAIGHT) because neural vocoding requires no explicit pitch extraction or spectral envelope modeling.
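
The speed gap between autoregressive and single-pass vocoders comes largely from how many steps must run one after another. The arithmetic below is a rough sketch: the sample rate and timings are illustrative values, not measurements of this model, and `real_time_factor` is simply synthesis time divided by audio duration.

```python
def sequential_steps(audio_seconds: float, sample_rate: int, autoregressive: bool) -> int:
    """Number of model invocations that cannot be parallelized.

    An autoregressive vocoder emits one sample per step, each depending on
    the previous ones; a non-autoregressive vocoder needs one forward pass.
    """
    return int(audio_seconds * sample_rate) if autoregressive else 1

def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF below 1.0 means faster than real time."""
    return synthesis_seconds / audio_seconds

# Illustrative numbers only.
SAMPLE_RATE = 24_000
ar_steps = sequential_steps(2.0, SAMPLE_RATE, autoregressive=True)
nar_steps = sequential_steps(2.0, SAMPLE_RATE, autoregressive=False)
rtf = real_time_factor(synthesis_seconds=0.5, audio_seconds=2.0)
```

A single forward pass is of course much more expensive than one autoregressive step, so the step count explains why parallel hardware helps rather than predicting the exact speedup.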

7

Fun-CosyVoice3-0.5B-2512 | Model | 43/100

via “neural vocoder waveform synthesis”

Text-to-speech model. 267,330 downloads.

Unique: Employs a flow-matching or diffusion-based vocoder architecture (vs. GAN-based vocoders like HiFi-GAN) that reaches comparable quality through iterative refinement rather than single-pass generation, within a model of 0.5B parameters overall.

vs others: Comparable audio quality to HiFi-GAN, which at on the order of 10M parameters is far smaller than the full 0.5B model; faster inference than autoregressive vocoders (WaveNet) due to parallel generation; more stable training than GAN-based approaches, which can suffer from mode collapse.

8

mms-tts-hat | Model | 42/100

via “neural vocoder integration for waveform synthesis”

Text-to-speech model. 436,984 downloads.

Unique: Integrates a multilingual neural vocoder trained on diverse language acoustic characteristics, enabling consistent waveform quality across 1100+ languages without language-specific vocoder variants — most TTS systems either use language-specific vocoders or apply generic vocoders that may not handle tonal languages or unusual phonetic features well

vs others: Produces higher-quality waveforms than traditional DSP-based vocoders (Griffin-Lim, WORLD) and maintains quality across diverse languages, though with higher computational cost than lightweight vocoders like WaveRNN

9

MeloTTS-Japanese | Model | 40/100

via “mel-spectrogram to waveform vocoding with neural upsampling”

Text-to-speech model. 210,673 downloads.

Unique: Uses a pre-trained HiFi-GAN vocoder optimized for Japanese speech characteristics, with transposed convolution layers trained on Japanese phonetic distributions to minimize artifacts specific to Japanese phoneme transitions (e.g., geminate consonants, pitch accent patterns). The vocoder is fine-tuned on mel-spectrograms from the TTS encoder, ensuring tight integration and minimal spectral mismatch.

vs others: Faster than WaveNet or WaveGlow vocoders (100-200x speedup) while maintaining comparable audio quality; more efficient than Griffin-Lim phase reconstruction (eliminates iterative optimization); produces cleaner audio than simple linear interpolation by learning non-linear upsampling patterns from data.
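
The comparison with linear interpolation can be made concrete. HiFi-GAN-style vocoders upsample with transposed convolutions, and with a fixed triangular kernel a stride-2 transposed convolution reproduces linear interpolation exactly; a trained vocoder learns its kernels instead. The function below is a minimal pure-Python sketch with a hand-picked kernel, not code from MeloTTS.

```python
from typing import List

def transposed_conv1d(x: List[float], kernel: List[float], stride: int) -> List[float]:
    """1-D transposed convolution: each input sample stamps a scaled copy of
    the kernel into the output at offset i * stride, and overlaps are summed."""
    out = [0.0] * ((len(x) - 1) * stride + len(kernel))
    for i, sample in enumerate(x):
        for j, weight in enumerate(kernel):
            out[i * stride + j] += sample * weight
    return out

coarse = [1.0, 3.0, 5.0]

# Triangular kernel: the interior of the output is linear interpolation.
linear = transposed_conv1d(coarse, [0.5, 1.0, 0.5], stride=2)
interior = linear[1:6]  # drop the two edge samples, which see only one input

# A different (here arbitrary) kernel gives a different upsampling curve;
# in a neural vocoder these weights are learned from data.
other = transposed_conv1d(coarse, [0.2, 1.0, 0.8], stride=2)
```

The output has `(len(x) - 1) * stride + len(kernel)` samples, so stacking several such layers multiplies the frame rate up to the audio sample rate.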

10

AudioCraft | Repository | 26/100

via “audio codec compression with discrete token representation”

A one-stop code base for generative audio needs, by Meta. Includes MusicGen for music and AudioGen for sounds. #opensource

Unique: Combines convolutional autoencoders with vector quantization to create a learned codec that produces discrete tokens suitable for language model training, rather than using traditional codecs (MP3, AAC) or continuous latent representations that don't integrate naturally with transformer architectures

vs others: More efficient than raw waveform generation because it reduces sequence length by 50-100x, and more flexible than traditional audio codecs because the discrete representation is learned end-to-end for the downstream task rather than optimized for human perception alone
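
The sequence-length reduction follows from simple arithmetic on the codec configuration. The sketch below uses values recalled from the published EnCodec 24 kHz setup (encoder strides 2, 4, 5, 8; 1024-entry codebooks; 8 codebooks at 6 kbps); treat them as assumptions and check them against the model you actually use.

```python
import math
from typing import Dict, Sequence

def codec_stats(sample_rate: int, strides: Sequence[int],
                codebook_size: int, n_codebooks: int) -> Dict[str, float]:
    """Derive frame rate, token rate, bitrate, and length reduction."""
    hop = math.prod(strides)                  # samples per latent frame
    frame_rate = sample_rate / hop            # latent frames per second
    tokens_per_second = frame_rate * n_codebooks
    bits_per_token = math.log2(codebook_size)
    return {
        "hop": hop,
        "frame_rate": frame_rate,
        "tokens_per_second": tokens_per_second,
        "kbps": tokens_per_second * bits_per_token / 1000,
        # How much shorter the token sequence is than the raw sample sequence.
        "reduction": sample_rate / tokens_per_second,
    }

# Assumed EnCodec 24 kHz configuration at 6 kbps.
stats = codec_stats(sample_rate=24_000, strides=(2, 4, 5, 8),
                    codebook_size=1024, n_codebooks=8)
```

Under these assumptions one second of audio becomes 600 tokens instead of 24,000 samples, a 40x reduction; using fewer codebooks shortens the sequence further at a lower bitrate.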

11

High Fidelity Neural Audio Compression (EnCodec) | Product | 22/100

via “real-time streaming audio encoding with quantized latent representation”

Unique: Uses a single multiscale spectrogram adversary instead of traditional multi-discriminator approaches, combined with a novel loss balancer that decouples each loss's weight from the scale of its gradient, enabling more stable training of the quantized latent space. The streaming architecture supports real-time encoding and decoding without buffering entire audio segments.

vs others: Outperforms baseline codecs across speech, noisy speech, and music domains according to MUSHRA subjective evaluation, while maintaining real-time performance on standard hardware — a capability gap for traditional neural codecs that typically require offline processing or significant computational overhead.
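
The loss balancer idea can be sketched as gradient rescaling: each loss's gradient is normalized, then scaled by its share of the total weight, so a weight expresses a fraction of the overall gradient rather than a multiplier on an arbitrarily scaled loss. The version below is a simplification under stated assumptions: it uses the instantaneous gradient norm where the paper describes a moving average, and the loss names and values are made up.

```python
import math
from typing import Dict, List

def l2_norm(vec: List[float]) -> float:
    return math.sqrt(sum(x * x for x in vec))

def balance_gradients(grads: Dict[str, List[float]], weights: Dict[str, float],
                      total_norm: float = 1.0, eps: float = 1e-12) -> Dict[str, List[float]]:
    """Rescale each gradient to norm total_norm * weight / sum(weights).

    Simplification: uses the current norm, not a running average of norms.
    """
    total_weight = sum(weights[name] for name in grads)
    balanced = {}
    for name, grad in grads.items():
        scale = total_norm * (weights[name] / total_weight) / (l2_norm(grad) + eps)
        balanced[name] = [scale * x for x in grad]
    return balanced

# Made-up gradients with very different natural scales.
grads = {"reconstruction": [3.0, 4.0], "adversarial": [0.0, 2000.0]}
weights = {"reconstruction": 1.0, "adversarial": 3.0}
balanced = balance_gradients(grads, weights)
```

After balancing, the adversarial term contributes three quarters of the gradient norm and the reconstruction term one quarter, regardless of the 400x difference in their raw magnitudes.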

12

Bark | Repository | 21/100

via “encodec-based audio tokenization and reconstruction”

A transformer-based text-to-audio model. #opensource

13

Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers (VALL-E) | Model | 17/100

via “neural vocoder-based waveform reconstruction from discrete tokens”

Unique: Decouples vocoding from token prediction, allowing the vocoder to be trained independently on high-quality audio and enabling efficient parallel processing, unlike end-to-end models where waveform generation is tightly coupled to acoustic modeling

vs others: Faster and more stable than WaveNet-style autoregressive vocoders (parallel generation instead of sequential) and produces higher quality audio than simple upsampling or interpolation methods because it learns the complex mapping from discrete tokens to natural waveforms
