Offline Reinforcement Learning of High-Quality Behaviors Under Robust Style Alignment


Abstract

We study offline reinforcement learning of style-conditioned policies using explicit style supervision via subtrajectory labeling functions. In this setting, aligning style with high task performance is particularly challenging due to distribution shift and inherent conflicts between style and reward. Existing methods, despite introducing numerous definitions of style, often fail to reconcile these objectives effectively. To address these challenges, we propose a unified definition of behavior style and instantiate it in a practical framework. Building on this, we introduce Style-Conditioned Implicit Q-Learning (SCIQL), which leverages offline goal-conditioned RL techniques, such as hindsight relabeling and value learning, and combines them with a new Gated Advantage Weighted Regression mechanism to efficiently optimize task performance while preserving style alignment. Experiments demonstrate that SCIQL achieves superior performance on both objectives compared to prior offline methods.

Style Alignment

SCIQL achieves significantly higher style alignment than previous style-conditioned offline reinforcement learning and imitation learning methods. Select an environment and a style label below to visualize the stylized rollouts performed at inference by our trained SCIQL($\lambda$) policies:


Style Conditioned Task Performance Optimization

Thanks to its Gated Advantage Weighted Regression mechanism, SCIQL is able to optimize task performance while preserving style alignment, even on high-dimensional tasks. For instance, for the head-height labels in HumEnv-Simple, SORL ($\beta=0$) struggles to maintain a standing position using label information alone. The inclusion of reward signals, which correlate with the standing posture required for the sprint task, helps SORL ($\beta > 0$) recover alignment on Label 1, but with limited task performance. In contrast, SCIQL achieves near-perfect alignment in all configurations and surpasses SORL in task performance across variants while strictly preserving style alignment.

Interactive rollouts — styles: Label 0, Label 1; baselines: SORL ($\beta=0$), SORL ($\beta=3$); ours: SCIQL ($\lambda$), SCIQL ($\lambda > r$).

Challenges

  • Style definition: Existing approaches trade off interpretability, labeling cost, alignment measurement, and credit assignment, making a general definition difficult.
  • Distribution shift: Style conditioning exacerbates offline RL distribution shift, creating mismatches between visited states and target styles that hinder robust alignment.
  • Task–style misalignment: Task performance and style alignment often conflict, and prior solutions sacrifice alignment when optimizing for task performance.

Contributions

  • General formulation: We cast stylized policy learning as a generalization of goal-conditioned RL, showing that style alignment corresponds to optimizing a style occupancy measure.
  • Data programming with labeling functions: We instantiate our definition using labeling functions on trajectory windows, which mitigates credit assignment challenges and enables fast, interpretable style annotations for both training and evaluation (an illustrative labeling function is sketched after this list).
  • SCIQL algorithm: We introduce Style-Conditioned Implicit Q-Learning (SCIQL), which leverages advantage signals, style relabeling, and trajectory stitching to achieve robust style alignment.
  • GAWR method: We propose Gated Advantage Weighted Regression (GAWR), using advantage functions as gates to improve task performance while preserving style alignment.
  • Empirical validation: We provide diverse stylized RL tasks and show through extensive experiments that SCIQL outperforms prior work on both style alignment and style-conditioned task performance optimization, with clean JAX implementations and datasets.
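
To make the labeling-function idea concrete, here is a minimal sketch of how a style label could be computed from a subtrajectory window. The window layout, the head-height column, and the 0.9 m threshold are illustrative assumptions rather than the exact functions shipped with our datasets.

```python
import numpy as np

def head_height_label(window: np.ndarray) -> int:
    """Illustrative labeling function over a subtrajectory window.

    Assumes `window` has shape (T, obs_dim) with the head height stored
    in the last observation column; the 0.9 m threshold is arbitrary.
    Returns 0 for a "low head" window and 1 for a "high head" window.
    """
    mean_head_height = window[:, -1].mean()
    return int(mean_head_height > 0.9)

# Usage: annotate every length-20 window of a trajectory array `traj`.
# labels = [head_height_label(traj[t:t + 20]) for t in range(len(traj) - 19)]
```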

Algorithm

  • Task Value Learning (IQL): We employ Implicit Q-Learning to learn the task utility. By using expectile regression, we estimate the value of the best actions within the dataset support without querying out-of-distribution samples, providing a stable signal for task performance optimization.
Figure: Task-value objective (IQL head) and learned task values.
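
For reference, a minimal JAX sketch of the expectile value objective used by IQL is shown below; `value_fn`, `q_target`, and the batch keys are placeholder names, and `tau` is the expectile hyperparameter.

```python
import jax.numpy as jnp

def expectile_loss(diff, tau=0.7):
    # Asymmetric squared error: residuals where Q > V are up-weighted, so
    # V(s) tracks an upper expectile of Q(s, a) over in-dataset actions.
    weight = jnp.where(diff > 0, tau, 1.0 - tau)
    return weight * diff ** 2

def task_value_loss(v_params, value_fn, q_target, batch, tau=0.7):
    # q_target: frozen task Q-network evaluated on in-dataset actions only,
    # which avoids querying out-of-distribution samples.
    q = q_target(batch["observations"], batch["actions"])
    v = value_fn(v_params, batch["observations"])
    return expectile_loss(q - v, tau).mean()
```
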
  • Style Value Estimation: In parallel, we train a style-conditioned value function. This critic estimates the cumulative probability or "occupancy" of satisfying the specific behavioral constraints defined by our labeling functions, allowing the agent to quantify its alignment with the desired style.
Figure: Style-value objective and learned style values.
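
One plausible instantiation of this critic is a TD objective on a binary style reward produced by the labeling function, as sketched below; the batch keys, the scalar label encoding, and the plain squared TD error are assumptions, not the exact objective used by SCIQL.

```python
import jax.numpy as jnp

def style_value_loss(params, target_params, style_value_fn, batch, gamma=0.99):
    # Binary style reward: 1 when the label of the current window matches
    # the conditioning style z, 0 otherwise.
    style_reward = (batch["window_labels"] == batch["styles"]).astype(jnp.float32)

    # Bootstrapped target: discounted occupancy of style-satisfying windows.
    v_next = style_value_fn(target_params, batch["next_observations"], batch["styles"])
    td_target = style_reward + gamma * batch["masks"] * v_next

    v = style_value_fn(params, batch["observations"], batch["styles"])
    return ((td_target - v) ** 2).mean()
```
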
  • Gated Policy Extraction (GAWR): To balance the trade-off, we propose Gated Advantage Weighted Regression. This step uses the style value as a hard gate (filtering out actions that do not meet the style threshold) and the task advantage as a soft weight, ensuring the policy maximizes task reward only among style-compliant actions.
Figure: Gated Advantage Weighted Regression (GAWR) objective.
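
A minimal sketch of the GAWR policy objective is given below; the gate threshold `eps`, the temperature `beta`, and the weight clipping constant are illustrative hyperparameters, and the callables are placeholders for the learned networks.

```python
import jax.numpy as jnp

def gawr_loss(policy_params, log_prob_fn, batch,
              task_advantage, style_value, eps=0.5, beta=3.0):
    # Hard gate: drop transitions whose style value falls below the threshold.
    gate = (style_value >= eps).astype(jnp.float32)
    # Soft weight: exponentiated task advantage, clipped as in AWR-style methods.
    weight = jnp.minimum(jnp.exp(beta * task_advantage), 100.0)
    # Weighted behavior cloning restricted to style-compliant actions.
    log_probs = log_prob_fn(policy_params, batch["observations"],
                            batch["actions"], batch["styles"])
    return -(gate * weight * log_probs).mean()
```
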
  • SCIQL Pipeline: The full architecture integrates offline learning with style relabeling to improve task performance while preserving style alignment. In practice, this can be done within one global loop.
Figure: SCIQL with GAWR pipeline.
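
At a high level, the global loop could look like the following sketch, where the sampling and update callables stand in for the components described above and are passed in by the caller.

```python
def train_sciql(dataset, num_steps, sample_with_style_relabeling,
                update_task_value, update_task_q,
                update_style_value, update_policy_gawr):
    """Hypothetical SCIQL training loop; every callable is a placeholder."""
    for _ in range(num_steps):
        # Sample a batch and relabel styles in hindsight via the labeling
        # functions applied to subtrajectory windows.
        batch = sample_with_style_relabeling(dataset)

        update_task_value(batch)   # 1) task value (expectile regression)
        update_task_q(batch)       # 1) task Q-function (TD backup)
        update_style_value(batch)  # 2) style-conditioned value
        update_policy_gawr(batch)  # 3) gated policy extraction (GAWR)
```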

Experimental Results

  • Style Alignment Comparison: SCIQL achieves superior style alignment across all datasets, significantly outperforming baselines by leveraging value-based trajectory stitching.
    • Necessity of Conditioning: The large gap between BC and CBC confirms that explicit style conditioning is essential.
    • Limits of Baselines: Methods like SORL ($\beta=0$) and BCPMI perform similarly to CBC as they lack effective style relabeling mechanisms.
    • Value Learning Advantage: While SCBC improves upon baselines via stitching, SCIQL's dominance proves that value learning significantly enhances policy extraction.
    • Robustness to Noise: In halfcheetah-vary, SCIQL maintains high alignment despite noisy styles, whereas baselines suffer performance drops.
Figure: Style alignment results.
  • Style-Conditioned Task Performance Optimization (SCTPO): We compare SCIQL variants (using GAWR) against SORL baselines, demonstrating that GAWR effectively decouples task performance maximization from style alignment degradation.
    • Superior Trade-off ($\lambda > r$): Significantly improves task performance over the base model ($\lambda$) while maintaining better style alignment than all SORL variants.
    • Prevention of Style Collapse: Unlike SORL, GAWR enables SCIQL to aggressively learn the task without sacrificing the behavioral prior.
    • High-Performance Mode ($r > \lambda$): Achieves task performance on par with or superior to the strongest SORL baselines.
Figure: Style-conditioned task performance optimization (SCTPO) results.
  • Pareto Analysis & Metric Trade-offs: We quantify the asymmetric trade-offs between task and style using complementary metrics, confirming that SCIQL shifts the frontier closer to theoretical perfection.
    • Hypervolume (HV): SCIQL achieves a substantial improvement of +41.2% to +163.9% compared to SORL.
    • Distance to ideal point: The $\lambda > r$ configuration reduces the Euclidean distance to the ideal point $(100, 100)$ by 18 to 28% compared to the best SORL variant.
    • Asymmetric Optimization: Crucially, SCIQL ($\lambda > r$) boosts task performance while preserving the high style alignment of the base policy.
Figure: Pareto fronts and hypervolumes of SORL and SCIQL.
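
For reference, both metrics can be computed as in the sketch below, assuming each method is summarized by (style alignment, task performance) points on a 0-100 scale; the reference point of the hypervolume and the example values are assumptions.

```python
import numpy as np

def distance_to_ideal(point, ideal=(100.0, 100.0)):
    # Euclidean distance from a (style, task) pair to the ideal point; lower is better.
    return float(np.linalg.norm(np.asarray(point, dtype=float) - np.asarray(ideal, dtype=float)))

def hypervolume_2d(points, reference=(0.0, 0.0)):
    # Area dominated by a set of (style, task) points with respect to the
    # reference point, where larger is better on both axes.
    rx, ry = reference
    pts = [(x, y) for x, y in points if x > rx and y > ry]
    hv, y_frontier = 0.0, ry
    for x, y in sorted(pts, reverse=True):  # sweep from the largest x downward
        if y > y_frontier:
            hv += (x - rx) * (y - y_frontier)
            y_frontier = y
    return hv

# Illustrative usage with made-up (style alignment, task return) pairs:
# hypervolume_2d([(95.0, 60.0), (90.0, 75.0)])  # dominated area
# distance_to_ideal((95.0, 60.0))               # distance to (100, 100)
```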

Conclusion

We propose a novel, general definition of behavior styles within the sequential decision-making framework and instantiate it with labeling functions, which yield interpretable styles with low labeling cost and easy alignment measurement while avoiding unnecessary credit assignment issues by relying on subtrajectory labels.

We then present the SCIQL algorithm, which leverages Gated AWR to address long-horizon decision making and trajectory stitching while providing superior performance in both style alignment and style-conditioned task performance compared to previous work.

We think that our framework opens the door to several interesting research directions:

  • Multiplicity of criteria: An interesting next step would be to scale the framework to many style criteria at once.
  • Enhanced representations: Finding mechanisms to enhance the representational span of labeling functions could also be interesting.
  • Zero-shot capabilities: Finally, it would be worthwhile to explore zero-shot mechanisms that generate style-conditioned policies on the fly.