Capability
General Instruction Following And Human Preference Alignment
2 artifacts provide this capability.
Top Matches
Alibaba's 32B reasoning model with chain-of-thought.
Unique: Uses a two-stage RL training approach in which the second stage applies a general reward model and rule-based verifiers to align the model with human preferences across diverse tasks, allowing a reasoning model to retain instruction-following capability beyond specialized domains.
vs others: Balances strong reasoning with general instruction following through preference-aligned training, supporting use cases that need both transparent reasoning and practical task execution without separate specialized models.
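The second-stage reward described above can be sketched conceptually as a blend of a rule-based verifier and a general reward-model score. This is a toy illustration, not Alibaba's actual training code: `rule_verifier`, `general_reward_model`, and `second_stage_reward` are hypothetical names, and the scoring heuristics stand in for real learned models.

```python
# Conceptual sketch (not the model's actual training code): a second-stage
# RL reward that blends a rule-based verifier with a general reward model.
# All function names and heuristics below are hypothetical illustrations.

def rule_verifier(prompt: str, response: str) -> float:
    """Toy rule-based verifier: checks a simple instruction embedded in
    the prompt and returns 1.0 on compliance, 0.0 on violation."""
    if "answer in one word" in prompt.lower():
        return 1.0 if len(response.split()) == 1 else 0.0
    return 1.0  # no checkable rule -> treat as passing

def general_reward_model(prompt: str, response: str) -> float:
    """Stand-in for a learned preference model: a crude prompt/response
    word-overlap heuristic scaled into [0, 1]."""
    overlap = len(set(prompt.lower().split()) & set(response.lower().split()))
    return min(1.0, overlap / 5)

def second_stage_reward(prompt: str, response: str, w_rule: float = 0.5) -> float:
    """Combined reward for the second RL stage: a weighted mix of the
    rule-based check and the general preference score."""
    return (w_rule * rule_verifier(prompt, response)
            + (1 - w_rule) * general_reward_model(prompt, response))
```

Under this sketch, a response that satisfies the embedded instruction scores higher than one that ignores it, which is the mechanism that lets a reasoning model keep instruction-following behavior outside its specialized domains.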