Capability
Multimodal Observation Tokenization With Flexible Sensor Composition
Two artifacts provide this capability.
Generalist robot policy model from Open X-Embodiment.
Unique: Implements a modular tokenizer architecture where image tokenizers (learned codebooks or pretrained vision models) and proprioception tokenizers (linear/MLP projections) are independently trained and composed, allowing flexible sensor configuration without retraining the transformer backbone. Supports variable numbers of cameras through dynamic token concatenation.
vs others: More flexible than end-to-end vision models that require fixed camera configurations, and more efficient than raw pixel processing by reducing observation dimensionality 100-1000x while preserving task-relevant information through learned tokenization.
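The composition scheme described above can be sketched in a few lines. This is a minimal illustration, not the actual implementation: the class names, token width, and pooling logic are all hypothetical stand-ins, and the real image tokenizer would be a learned codebook or a frozen pretrained vision encoder rather than the toy projection used here.

```python
import numpy as np

DIM = 64  # shared token embedding width (assumed for illustration)

class ImageTokenizer:
    """Hypothetical stand-in for a learned codebook or frozen vision encoder."""
    def __init__(self, tokens_per_image=16):
        self.tokens_per_image = tokens_per_image

    def __call__(self, image: np.ndarray) -> np.ndarray:
        # A real tokenizer would discretize patches against a codebook; here we
        # just pool pixels into a fixed number of DIM-wide tokens.
        flat = image.reshape(-1).astype(np.float32)
        flat = np.resize(flat, self.tokens_per_image * DIM)
        return flat.reshape(self.tokens_per_image, DIM)

class ProprioTokenizer:
    """Linear projection of the joint state into a single token (hypothetical)."""
    def __init__(self, state_dim=7, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((state_dim, DIM)).astype(np.float32) * 0.02

    def __call__(self, state: np.ndarray) -> np.ndarray:
        return (state.astype(np.float32) @ self.W).reshape(1, DIM)

def tokenize_observation(images, state, img_tok, prop_tok):
    """Dynamic concatenation: any number of cameras yields a variable-length
    token sequence; the transformer backbone needs no retraining when the
    sensor set changes."""
    tokens = [img_tok(img) for img in images] + [prop_tok(state)]
    return np.concatenate(tokens, axis=0)

img_tok, prop_tok = ImageTokenizer(), ProprioTokenizer()
two_cams = tokenize_observation(
    [np.zeros((64, 64, 3))] * 2, np.zeros(7), img_tok, prop_tok)
three_cams = tokenize_observation(
    [np.zeros((64, 64, 3))] * 3, np.zeros(7), img_tok, prop_tok)
print(two_cams.shape)    # (33, 64): 2 cameras x 16 tokens + 1 proprio token
print(three_cams.shape)  # (49, 64): camera added, tokenizers unchanged
```

The point of the design is visible in the last two calls: adding a camera only lengthens the token sequence, so the same tokenizers and backbone serve both sensor configurations.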