BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for ASR (BigSSL)
Capabilities (5 decomposed)
Large-scale semi-supervised ASR pre-training with unlabeled audio
Medium confidence: Pre-trains Conformer models (up to 8 billion parameters) on approximately 1 million hours of unlabeled audio using self-supervised learning objectives to learn generalizable speech representations. The approach combines SSL pre-training with subsequent self-training (pseudo-labeling) and fine-tuning stages, enabling downstream ASR tasks to achieve state-of-the-art performance with dramatically reduced labeled data requirements (demonstrated at 3% of typical supervised training data).
Applies a three-stage pipeline (SSL pre-training → self-training → fine-tuning) to Conformer models of up to 8B parameters trained on 1M hours of unlabeled audio, achieving state-of-the-art ASR with only 3% of the typical labeled training data; the specific SSL objective and self-training methodology are not disclosed in the abstract, but this represents a frontier-scale semi-supervised approach for speech.
Achieves better ASR performance than supervised-only baselines while requiring 97% less labeled data, outperforming prior state-of-the-art when using full training sets; advantage over alternatives depends on access to massive unlabeled audio corpora and computational resources
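The abstract describes this three-stage recipe only at a high level. As a rough illustration, the Python sketch below outlines how SSL pre-training, self-training, and fine-tuning could be chained; the `Utterance` type, the model interface (`ssl_loss`, `transcribe`, `asr_loss`, `update`), and the choice of objectives are assumptions made for exposition, not BigSSL's released implementation.

```python
# Hypothetical sketch of a three-stage semi-supervised ASR recipe
# (SSL pre-training -> self-training -> fine-tuning). Placeholder interfaces only.
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class Utterance:
    audio: List[float]            # raw waveform samples (placeholder)
    text: Optional[str] = None    # transcript; None for unlabeled audio


def pretrain_ssl(model, unlabeled: Iterable[Utterance], steps: int):
    """Stage 1: self-supervised pre-training on unlabeled audio only."""
    for _, utt in zip(range(steps), unlabeled):
        loss = model.ssl_loss(utt.audio)          # e.g. masked-prediction or contrastive objective
        model.update(loss)
    return model


def generate_pseudo_labels(model, unlabeled: Iterable[Utterance]) -> List[Utterance]:
    """Stage 2: self-training -- transcribe unlabeled audio with the pre-trained model."""
    return [Utterance(audio=utt.audio, text=model.transcribe(utt.audio))
            for utt in unlabeled]


def finetune(model, labeled: Iterable[Utterance], steps: int):
    """Stage 3: supervised fine-tuning on the small labeled set plus pseudo-labels."""
    for _, utt in zip(range(steps), labeled):
        loss = model.asr_loss(utt.audio, utt.text)  # e.g. CTC or RNN-T loss
        model.update(loss)
    return model


def train_semi_supervised(model, unlabeled, labeled, ssl_steps, ft_steps):
    """Orchestrates the full pipeline described above."""
    model = pretrain_ssl(model, unlabeled, ssl_steps)
    pseudo = generate_pseudo_labels(model, unlabeled)
    return finetune(model, list(labeled) + pseudo, ft_steps)
```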
Cross-domain speech representation transfer learning
Medium confidence: Learns generalizable speech representations during pre-training that transfer effectively across diverse downstream tasks spanning multiple speech domains, dataset sizes that vary over multiple orders of magnitude, and non-ASR applications. The pre-trained representations enable fine-tuning on downstream tasks with minimal labeled data, demonstrating broad generalization across a wide range of speech characteristics and task types.
Pre-trained representations generalize across a 'wide range of speech domains' and 'multiple orders of magnitudes of dataset sizes' without documented domain-specific tuning; specific domains and generalization boundaries are not disclosed, but the claim of broad cross-domain transferability is rare among speech models.
Generalizes across more diverse speech domains and dataset sizes than task-specific supervised models, but specific comparative benchmarks and failure modes unknown from abstract
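A common way to use such transferable representations on a small downstream task is to freeze the pre-trained encoder and train only a lightweight task head. The PyTorch sketch below illustrates that pattern under stated assumptions: `PretrainedSpeechEncoder` is a toy stand-in for a pre-trained Conformer encoder (no BigSSL checkpoint is loaded here), and mean-pooling plus a linear head is one common design choice rather than the paper's.

```python
import torch
import torch.nn as nn


class PretrainedSpeechEncoder(nn.Module):
    """Toy stand-in for a pre-trained speech encoder (hypothetical; not a released checkpoint)."""
    def __init__(self, feat_dim: int = 80, hidden: int = 512):
        super().__init__()
        self.proj = nn.Linear(feat_dim, hidden)          # placeholder for Conformer blocks

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, time, feat_dim) -> (batch, time, hidden)
        return torch.relu(self.proj(feats))


class DownstreamClassifier(nn.Module):
    """Small task head trained on limited labels while the encoder stays frozen."""
    def __init__(self, encoder: nn.Module, hidden: int, num_classes: int):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():              # freeze the pre-trained weights
            p.requires_grad = False
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        reps = self.encoder(feats)                       # (batch, time, hidden)
        pooled = reps.mean(dim=1)                        # mean-pool over time
        return self.head(pooled)                         # (batch, num_classes)


encoder = PretrainedSpeechEncoder()
model = DownstreamClassifier(encoder, hidden=512, num_classes=10)
# Only the head parameters receive gradient updates during fine-tuning.
optimizer = torch.optim.Adam(model.head.parameters(), lr=1e-3)
```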
Self-training with pseudo-labeling for unlabeled audio
Medium confidence: Applies pseudo-labeling to unlabeled audio, using the pre-trained model to generate synthetic transcriptions, then uses these pseudo-labeled examples as an additional training signal during fine-tuning. This self-training stage bridges the gap between pre-training and task-specific fine-tuning, leveraging the model's own predictions on unlabeled data to improve downstream performance without requiring human annotation.
Integrates pseudo-labeling as middle stage between SSL pre-training and supervised fine-tuning in three-stage pipeline; specific pseudo-label generation and filtering mechanisms not disclosed, but represents systematic approach to leveraging unlabeled data in semi-supervised ASR
More systematic than ad-hoc pseudo-labeling by grounding in pre-trained representations; effectiveness vs alternatives depends on undisclosed pseudo-label quality control mechanisms
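Because pseudo-label quality control is not specified in the abstract, the snippet below shows one widely used heuristic as an assumption: keep only hypotheses whose average per-token log-probability clears a threshold, discarding very short outputs. Language-model rescoring or agreement between multiple decoding passes would be equally plausible filters.

```python
from typing import List, Tuple


def filter_pseudo_labels(
    hypotheses: List[Tuple[str, List[float]]],   # (transcript, per-token log-probabilities)
    min_avg_logprob: float = -0.3,
    min_length: int = 3,
) -> List[str]:
    """Keep pseudo-transcripts whose average token log-probability clears a threshold.

    This is one common heuristic (an assumption here); the paper does not specify
    its pseudo-label quality control.
    """
    kept = []
    for text, token_logprobs in hypotheses:
        if len(token_logprobs) < min_length:
            continue                                 # drop very short, often noisy outputs
        avg = sum(token_logprobs) / len(token_logprobs)
        if avg >= min_avg_logprob:
            kept.append(text)
    return kept


# Example with two decoded hypotheses and synthetic log-probabilities.
hyps = [
    ("turn the volume down", [-0.1, -0.2, -0.15, -0.1]),
    ("uh", [-1.2]),
]
print(filter_pseudo_labels(hyps))   # -> ['turn the volume down']
```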
State-of-the-art ASR performance benchmarking on public datasets
Medium confidence: Achieves state-of-the-art results on unspecified public ASR benchmarks, demonstrating that the semi-supervised approach outperforms prior best-known results. The paper reports SoTA performance both when using only 3% of the labeled training data (on a task with 34k hours of labeled data) and when using the full training set, indicating the approach improves over prior work across different data regimes.
Demonstrates SoTA on public benchmarks using semi-supervised approach with 8B-parameter Conformer; specific benchmarks and performance metrics not disclosed, limiting ability to assess magnitude of improvement
Outperforms prior state-of-the-art on unspecified benchmarks; comparative advantage unclear without benchmark and baseline details
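ASR benchmark results of this kind are conventionally reported as word error rate (WER). For reference, the self-contained function below computes WER via a word-level Levenshtein alignment; it is generic evaluation code and is not tied to whichever public benchmarks the paper actually used.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference word count,
    computed with a standard Levenshtein alignment over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)


print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))  # ~0.167
```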
Data-efficient ASR with 97% labeled data reduction
Medium confidence: Achieves state-of-the-art ASR performance using only 3% of the labeled training data required by supervised baselines (demonstrated on a 34k-hour task), representing a 97% reduction in annotation requirements. This data efficiency is achieved through the combination of SSL pre-training on 1M hours of unlabeled audio and self-training, enabling organizations to build high-quality ASR systems with minimal human annotation.
Achieves 97% reduction in labeled data requirements (3% of supervised baseline) through combination of 1M-hour SSL pre-training and self-training; specific baseline and task characteristics not disclosed, but represents significant claimed efficiency improvement
Requires substantially less labeled data than supervised-only ASR baselines; advantage magnitude depends on unlabeled data availability and computational resources for pre-training
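The 3%-of-labels setting can be approximated in practice by subsampling a labeled corpus before fine-tuning. The helper below is a sketch; uniform random sampling and the fixed seed are assumptions, since the paper does not describe how its labeled subset was selected.

```python
import random
from typing import List, Tuple


def subsample_labeled_set(utterances: List[Tuple[str, str]],
                          fraction: float = 0.03,
                          seed: int = 0) -> List[Tuple[str, str]]:
    """Randomly keep `fraction` of (audio_path, transcript) pairs, mimicking the
    'fine-tune on 3% of the labeled data' setting. Random sampling is an assumption."""
    rng = random.Random(seed)
    k = max(1, int(len(utterances) * fraction))
    return rng.sample(utterances, k)


# Example: on a 34k-hour corpus, a 3% subset corresponds to roughly 1k hours of audio.
corpus = [(f"utt_{i}.wav", f"transcript {i}") for i in range(1000)]
print(len(subsample_labeled_set(corpus)))   # -> 30
```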
Capabilities are decomposed by AI analysis. Each maps to specific user intents and improves with match feedback.
Related Artifacts (sharing capabilities)
Artifacts that share capabilities with BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for ASR (BigSSL), ranked by overlap. Discovered automatically through the match graph.
wav2vec2-base-960h
automatic-speech-recognition model. 1,195,671 downloads.
wav2vec2-large-xlsr-53-russian
automatic-speech-recognition model. 5,044,932 downloads.
w2v-bert-2.0
feature-extraction model. 3,225,462 downloads.
SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language... (SpeechT5)
[WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing (WavLM)](https://ieeexplore.ieee.org/abstract/document/9814838)
wav2vec2-large-xlsr-korean
automatic-speech-recognition model. 1,262,349 downloads.
wav2vec2-large-xlsr-53-chinese-zh-cn
automatic-speech-recognition model. 1,993,708 downloads.
Best For
- ✓ML research teams with access to large unlabeled audio datasets (1M+ hours)
- ✓Organizations building multilingual or multi-domain ASR systems seeking to minimize labeled data dependency
- ✓Teams with sufficient computational infrastructure to train 8B-parameter models
- ✓Teams building multi-domain speech systems (e.g., customer service, medical transcription, broadcast media)
- ✓Researchers evaluating transfer learning effectiveness across heterogeneous speech tasks
- ✓Organizations with limited labeled data in specific domains but access to general pre-trained models
- ✓Teams with abundant unlabeled audio in target domain but limited labeled data
- ✓Organizations seeking to improve ASR on specific domains (medical, legal, customer service) with minimal annotation
Known Limitations
- ⚠Requires approximately 1 million hours of unlabeled audio for effective pre-training; effectiveness with smaller datasets unknown
- ⚠Computational cost and training time for 8B-parameter Conformer models not specified; likely requires weeks of GPU/TPU compute
- ⚠No documented failure modes, domain shift robustness limits, or performance degradation patterns
- ⚠Inference memory requirements for 8B-parameter models substantial (estimated 16-32GB VRAM minimum); streaming inference capability unknown
- ⚠Specific self-supervised learning objective used in pre-training not disclosed in abstract; reproducibility requires full paper
- ⚠Specific downstream tasks and domains evaluated not disclosed; generalization claims cannot be independently verified from abstract
Requirements
Input / Output
UnfragileRank
UnfragileRank is computed from adoption signals, documentation quality, ecosystem connectivity, match graph feedback, and freshness. No artifact can pay for a higher rank.
About
* ⭐ 08/2022: [MuLan: A Joint Embedding of Music Audio and Natural Language (MuLan)](https://arxiv.org/abs/2208.12415)
Categories
Alternatives to BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for ASR (BigSSL)
Data Sources