Phoneme-Level Duration and Pitch Prediction with Linguistic Features

1. higgs-audio-v2-generation-3B-base (Model, 48/100)

via “mel-spectrogram generation with duration and pitch prediction”

Text-to-speech model. 295,715 downloads.

Unique: Uses auxiliary prediction heads for duration and pitch that are trained jointly with the main decoder. This lets the model learn prosody implicitly, without explicit phoneme-to-frame alignment annotations, and allows inference-time prosody scaling by modulating the predicted values.

vs others: More flexible than fixed-duration TTS and avoids the alignment brittleness of older attention-based Tacotron models by learning duration distributions end-to-end; more controllable than end-to-end models that do not expose their pitch or duration predictions.
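The inference-time prosody scaling mentioned above can be illustrated with a minimal sketch. This is not the model's actual API: the function names `apply_prosody_scaling` and `length_regulate`, the parameters `duration_scale` and `pitch_scale`, and the convention that 0.0 Hz marks an unvoiced phoneme are all assumptions made for illustration.

```python
from typing import List, Tuple


def apply_prosody_scaling(
    durations: List[float],
    pitch_hz: List[float],
    duration_scale: float = 1.0,
    pitch_scale: float = 1.0,
) -> Tuple[List[int], List[float]]:
    """Scale per-phoneme predictions before they reach the decoder (hypothetical helper)."""
    if len(durations) != len(pitch_hz):
        raise ValueError("durations and pitch_hz must have the same length")
    # Round to whole frames, keeping at least one frame per phoneme.
    scaled_durations = [max(1, round(d * duration_scale)) for d in durations]
    # Assumption: unvoiced phonemes are marked with 0.0 Hz and stay unvoiced.
    scaled_pitch = [p * pitch_scale if p > 0 else 0.0 for p in pitch_hz]
    return scaled_durations, scaled_pitch


def length_regulate(values: List[float], durations: List[int]) -> List[float]:
    """Repeat each phoneme-level value once per frame assigned to that phoneme."""
    frames: List[float] = []
    for value, n_frames in zip(values, durations):
        frames.extend([value] * n_frames)
    return frames


# Made-up predictor outputs for three phonemes (illustrative values only).
predicted_durations = [3.2, 6.0, 2.4]    # frames
predicted_pitch = [120.0, 0.0, 180.0]    # Hz, 0.0 = unvoiced

# Slow speech down by 1.5x and raise pitch by 10%.
durations, pitch = apply_prosody_scaling(
    predicted_durations, predicted_pitch, duration_scale=1.5, pitch_scale=1.1
)
frame_pitch = length_regulate(pitch, durations)
```

A `duration_scale` above 1.0 slows speech down and a `pitch_scale` above 1.0 raises the voice. Because the scaling is applied to predicted values rather than to the waveform, the decoder still generates the spectrogram from a consistent set of inputs.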

2. Qwen3-TTS-12Hz-1.7B-VoiceDesign (Model, 45/100)

via “efficient transformer-based acoustic feature prediction”

Text-to-speech model. 514,586 downloads.

Unique: Achieves multilingual acoustic prediction in a single 1.7B model rather than language-specific variants, suggesting shared linguistic-acoustic representations learned across languages. The architecture likely uses cross-lingual attention or shared embeddings to generalize prosodic patterns across typologically different languages.

vs others: More parameter-efficient than maintaining separate language-specific TTS models (e.g., one each for English, Mandarin, and Spanish) while maintaining competitive quality. This reduces deployment complexity and memory footprint compared to alternatives like Tacotron 2 or Transformer-TTS, which are typically trained separately for each language.

3. MeloTTS-Japanese (Model, 41/100)

via “phoneme-level duration and pitch prediction with linguistic features”

Text-to-speech model. 210,673 downloads.

Unique: Implements duration and pitch prediction as separate feed-forward networks operating on linguistic embeddings from the text encoder, enabling joint optimization with the mel-spectrogram decoder via multi-task learning. The pitch predictor generates frame-level F0 values that are directly supervised during training, allowing the model to learn Japanese pitch accent patterns from data rather than relying on rule-based accent assignment.

vs others: More flexible than rule-based prosody systems (e.g., Festival, MARY TTS) by learning prosody patterns from data; faster than sequence-to-sequence pitch prediction models (feed-forward vs. RNN/Transformer) while maintaining comparable accuracy; enables fine-grained prosody control that commercial APIs typically don't expose.
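The multi-task objective described for this entry can be sketched as a weighted sum of a spectrogram loss, a duration loss, and a pitch loss. This is a minimal illustration, not MeloTTS code: the function names, the loss weights, the log-domain duration target, and the masking of unvoiced frames (target F0 of 0) are all assumptions.

```python
import math
from typing import List, Sequence


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equal-length sequences."""
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def multitask_loss(
    mel_pred: Sequence[float],
    mel_target: Sequence[float],
    log_dur_pred: Sequence[float],
    dur_target: Sequence[float],
    f0_pred: Sequence[float],
    f0_target: Sequence[float],
    dur_weight: float = 1.0,    # hypothetical weight
    pitch_weight: float = 1.0,  # hypothetical weight
) -> float:
    """Combine decoder, duration, and pitch losses into one training objective."""
    # Main decoder loss on (flattened) mel-spectrogram values.
    mel_loss = mse(mel_pred, mel_target)

    # Assumption: durations are predicted in the log domain, log(d + 1).
    log_dur_target: List[float] = [math.log(d + 1.0) for d in dur_target]
    dur_loss = mse(log_dur_pred, log_dur_target)

    # Assumption: frame-level F0 is supervised only on voiced frames (target > 0).
    voiced = [(p, t) for p, t in zip(f0_pred, f0_target) if t > 0]
    pitch_loss = mse([p for p, _ in voiced], [t for _, t in voiced]) if voiced else 0.0

    return mel_loss + dur_weight * dur_loss + pitch_weight * pitch_loss


# Toy example with made-up values.
example_loss = multitask_loss(
    mel_pred=[0.0, 1.0],
    mel_target=[0.0, 2.0],
    log_dur_pred=[math.log(2.0), math.log(4.0)],
    dur_target=[1.0, 3.0],
    f0_pred=[100.0, 50.0, 200.0],
    f0_target=[110.0, 0.0, 200.0],
    pitch_weight=0.01,
)
```

In a real training loop the three terms would be computed on tensors and backpropagated together, so the shared text encoder receives gradients from all three tasks.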
