Capability
Multimodal Observation Tokenization With Flexible Sensor Composition
Two artifacts provide this capability.
Generalist robot policy model from Open X-Embodiment.
Unique: Implements a modular tokenizer architecture where image tokenizers (learned codebooks or pretrained vision models) and proprioception tokenizers (linear/MLP projections) are independently trained and composed, allowing flexible sensor configuration without retraining the transformer backbone. Supports variable numbers of cameras through dynamic token concatenation.
vs others: More flexible than end-to-end vision models that require fixed camera configurations, and more efficient than raw pixel processing by reducing observation dimensionality 100-1000x while preserving task-relevant information through learned tokenization.
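The composition scheme described above can be sketched in a few lines. This is a minimal illustration, not the actual implementation: the class names, token width, and pooling logic are all hypothetical stand-ins, and the real image tokenizer would be a learned codebook or a frozen pretrained vision encoder rather than the toy projection used here.

```python
import numpy as np

DIM = 64  # shared token embedding width (assumed for illustration)

class ImageTokenizer:
    """Hypothetical stand-in for a learned codebook or frozen vision encoder."""
    def __init__(self, tokens_per_image=16):
        self.tokens_per_image = tokens_per_image

    def __call__(self, image: np.ndarray) -> np.ndarray:
        # A real tokenizer would discretize patches against a codebook; here we
        # just pool pixels into a fixed number of DIM-wide tokens.
        flat = image.reshape(-1).astype(np.float32)
        flat = np.resize(flat, self.tokens_per_image * DIM)
        return flat.reshape(self.tokens_per_image, DIM)

class ProprioTokenizer:
    """Linear projection of the joint state into a single token (hypothetical)."""
    def __init__(self, state_dim=7, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((state_dim, DIM)).astype(np.float32) * 0.02

    def __call__(self, state: np.ndarray) -> np.ndarray:
        return (state.astype(np.float32) @ self.W).reshape(1, DIM)

def tokenize_observation(images, state, img_tok, prop_tok):
    """Dynamic concatenation: any number of cameras yields a variable-length
    token sequence; the transformer backbone needs no retraining when the
    sensor set changes."""
    tokens = [img_tok(img) for img in images] + [prop_tok(state)]
    return np.concatenate(tokens, axis=0)

img_tok, prop_tok = ImageTokenizer(), ProprioTokenizer()
two_cams = tokenize_observation(
    [np.zeros((64, 64, 3))] * 2, np.zeros(7), img_tok, prop_tok)
three_cams = tokenize_observation(
    [np.zeros((64, 64, 3))] * 3, np.zeros(7), img_tok, prop_tok)
print(two_cams.shape)    # (33, 64): 2 cameras x 16 tokens + 1 proprio token
print(three_cams.shape)  # (49, 64): camera added, tokenizers unchanged
```

The point of the design is visible in the last two calls: adding a camera only lengthens the token sequence, so the same tokenizers and backbone serve both sensor configurations.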