General-Purpose Reranking with Instruction-Following Capability
3 artifacts provide this capability.
via “general instruction following and human preference alignment”
Alibaba's 32B reasoning model with chain-of-thought.
Unique: Uses a two-stage RL training approach in which the second stage applies a general reward model and rule-based verifiers to align the model with human preferences across diverse tasks, so the reasoning model retains instruction-following capability beyond specialized domains.
vs others: Balances strong reasoning with general instruction following through preference-aligned training, supporting use cases that need both transparent reasoning and practical task execution, without requiring separate specialized models.
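As a concrete illustration of reranking driven by instruction following, the sketch below shows one common pattern: build a listwise reranking prompt that embeds a user instruction alongside the query and candidate documents, then parse the model's ranked indices defensively. This is a minimal, hypothetical sketch, not the model's actual API: `call_model` would be any chat-completion endpoint, and the prompt wording, index format, and parsing rules are all assumptions.

```python
def build_rerank_prompt(instruction: str, query: str, docs: list[str]) -> str:
    """Assemble a listwise reranking prompt; the wording here is illustrative."""
    lines = [
        "You are a reranker. Follow the user's instruction when ordering documents.",
        f"Instruction: {instruction}",
        f"Query: {query}",
        "Documents:",
    ]
    for i, doc in enumerate(docs):
        lines.append(f"[{i}] {doc}")
    lines.append("Reply with the document indices, best first, comma-separated.")
    return "\n".join(lines)


def parse_ranking(reply: str, n_docs: int) -> list[int]:
    """Extract a full permutation of document indices from a free-form reply.

    Keeps only valid, first-seen indices, then appends any the model omitted,
    so downstream code always receives a complete ordering.
    """
    seen: list[int] = []
    for tok in reply.replace("[", " ").replace("]", " ").replace(",", " ").split():
        if tok.isdigit() and int(tok) < n_docs and int(tok) not in seen:
            seen.append(int(tok))
    seen += [i for i in range(n_docs) if i not in seen]
    return seen


if __name__ == "__main__":
    docs = ["a recipe for pasta", "RL training for LLMs", "a travel blog"]
    prompt = build_rerank_prompt(
        "prefer technical content", "reinforcement learning", docs
    )
    # In practice the prompt would go to a chat-completion API (hypothetical
    # `call_model`); here we parse a hand-written reply to keep the sketch offline.
    order = parse_ranking("1, 2, 0", len(docs))
    print(order)  # [1, 2, 0]
```

The defensive parser matters in practice: reasoning models often wrap their answer in explanation or chain-of-thought text, so extracting indices token by token and back-filling omissions is more robust than expecting an exact format.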