Precise: SDE-Consistent Stochastic Sampling for RL Post-Training of Flow-Matching Models

Jade Zou1,2,*,‡, Tao Huang2,*,‡, Weijie Kong2,*, Junzhe Li1,2, Yue Wu2, Qi Tian2, Jiangfeng Xiong2, Jianwei Zhang2,†, Liefeng Bo2, Zhao Zhong2,§
1Peking University 2Tencent Hunyuan
*Equal contribution Work done during internship at Tencent Hunyuan Project leader §Corresponding author
Overview of Precise sampler design and faster reward optimization
13.1–53.2%
less wall-clock time
to match best prior sampler
SOTA
PickScore & HPSv2.1
on SD3.5-M and FLUX.2 Klein
6–30 NFE
SDE-consistent few-step sampler
robust across NFE and exploration strength

Abstract

Reinforcement learning (RL) has become an effective way to improve prompt alignment and perceptual quality in diffusion and flow-matching generators. A critical step for applying online RL to flow matching is turning the deterministic sampling trajectory into a stochastic policy, typically by replacing the reverse-time Ordinary Differential Equation (ODE) with a Stochastic Differential Equation (SDE). The stochastic sampler, controlling the exploration behavior and denoising dynamics, is thus part of the policy, and its design can significantly affect the reward optimization performance. We break down the sampler design into two interdependent components: choosing the right amount of stochastic exploration, and discretizing the resulting SDE faithfully at the small step counts used in RL. To address the first component, we analyze the inherent tension between exploration and stability in denoising and derive an SDE schedule that balances the two. Turning to the discretization challenge, we use a toy example to show that existing samplers can deviate from the flow-matching process, either by introducing excessive discretization noise or by relying on heuristic rules that do not guarantee convergence to the data distribution. To address these issues, we propose Precise, a new stochastic sampler that balances effective exploration with stability. Crucially, Precise keeps the denoising trajectory SDE-consistent through a novel approximation that freezes the clean-latent posterior mean, resolving the excess noise issue in standard samplers.

Extensive experiments demonstrate that this formulation leads to significantly faster and more stable reward optimization via reinforcement learning, achieving state-of-the-art alignment scores (e.g., PickScore, HPSv2.1) while requiring 13.1–53.2% less wall-clock training time to match the best in-domain performance of prior samplers.

Motivation and design space

Introduction & analysis (figures)

Method

A logSNR-balanced exploration schedule + an SDE-consistent closed-form transition, coupled through a frozen clean-latent posterior mean approximation.

1. Setup: flow-matching path and reverse SDE

We use the linear flow-matching interpolation $\mathbf{z}_t=\alpha_t\mathbf{z}_0+\sigma_t\boldsymbol{\epsilon}$ with $\alpha_t=1-t$, $\sigma_t=t$, and $\boldsymbol{\epsilon}\sim\mathcal{N}(\mathbf{0},\mathbf{I})$. A flow-matching network predicts the target velocity $\mathbf{u}_t=\boldsymbol{\epsilon}-\mathbf{z}_0$ from $(\mathbf{z}_t,t)$:

\[ \mathbf{z}_t \;=\; (1-t)\,\mathbf{z}_0 + t\,\boldsymbol{\epsilon}. \]

To enable on-policy RL, Flow-GRPO and Dance-GRPO replace the deterministic ODE with the reverse-time SDE

\[ \mathrm{d}\mathbf{z}_t \;=\; \Bigl( \mathbf{u}_t \;-\; \tfrac{1}{2}\,\varepsilon_t^{2}\,\nabla_{\mathbf{z}}\log p_t(\mathbf{z}_t) \Bigr)\mathrm{d}t \;+\; \varepsilon_t\,\mathrm{d}\mathbf{w}_t, \]

where $\varepsilon_t \ge 0$ is the exploration schedule and $\mathbf{w}_t$ is Brownian motion. Defining the clean-latent posterior mean $\hat{\mathbf{z}}_0(t)\triangleq\mathbb{E}[\mathbf{z}_0\mid\mathbf{z}_t]$, the noise and score directions admit closed forms

\[ \hat{\boldsymbol{\epsilon}}(t) \;=\; \frac{\mathbf{z}_t-(1-t)\hat{\mathbf{z}}_0(t)}{t}, \qquad \nabla_{\mathbf{z}}\log p_t(\mathbf{z}_t) \;=\; \frac{(1-t)\hat{\mathbf{z}}_0(t)-\mathbf{z}_t}{t^{2}}. \]

Two coupled choices then define the on-policy sampler: the schedule $\varepsilon_t$, and the finite-step rule that discretizes the reverse SDE at the small NFE budget used in RL.

2. Exploration schedule: constant score-to-velocity progress ratio

We measure denoising progress with the log signal-to-noise ratio $\lambda_t=\log\bigl((1-t)^{2}/t^{2}\bigr)$. A first-order expansion of an infinitesimal reverse step shows that the velocity term contributes $\Delta\lambda_{\mathrm{vel}}=2\Delta t/(t(1-t))$ to the logSNR, while the score and stochastic terms cancel each other at first order with magnitudes $\pm\,\varepsilon_t^{2}\Delta t/t^{2}$. The net first-order progress is therefore independent of $\varepsilon_t$, but the score-induced "extra denoising work" asked of the model scales like $\varepsilon_t^{2}/t^{2}$. We pin this work to be a constant fraction of the velocity progress, $R(t)\triangleq\Delta\lambda_{\mathrm{sco}}/\Delta\lambda_{\mathrm{vel}}=\mathrm{const}$, giving

\[ \boxed{\;\varepsilon_t \;=\; \eta\sqrt{\dfrac{t}{1-t}},\qquad R(t)\;=\;\dfrac{\eta^{2}}{2}.\;} \]

The scalar $\eta>0$ thus directly controls the score-to-velocity progress ratio. In practice values around $\eta\!\approx\!\sqrt{2}$ make the score-induced progress comparable to the velocity progress and work well across backbones and rewards.

3. Failure modes of prior samplers

Euler–Maruyama injects excess discretization noise. The standard Euler step used by Flow-GRPO and Dance-GRPO freezes the velocity and score at time $t$, ignoring the feedback of the freshly injected noise on $\mathbf{u}_t$ and $\nabla\log p_t$. On the toy point-mass data law $\mathbf{z}_\star$, plugging the corresponding $(\mathbf{u}_t,\nabla\log p_t)$ into the Euler step yields

\[ \mathbf{z}_{t'}^{\mathrm{Euler}} \;=\; (1-t')\mathbf{z}_\star +\Bigl(t' - \tfrac{\varepsilon_t^{2}\Delta t}{2t}\Bigr)\boldsymbol{\epsilon} +\varepsilon_t\sqrt{\Delta t}\,\mathbf{w}. \]

Because $\mathbf{w}\!\perp\!\boldsymbol{\epsilon}$, the effective noise coefficient becomes $\sqrt{(t'-\varepsilon_t^{2}\Delta t/(2t))^{2}+\varepsilon_t^{2}\Delta t}>t'$ whenever $\varepsilon_t>0$— the sampler adds excess noise even with a perfect model.

CPS preserves coefficients but contracts the marginal. Coefficient-preserving samplers use posterior-mean updates of the form $\mathbf{z}_{t'}^{\mathrm{CPS}}=(1-t')\hat{\mathbf{z}}_0(t)+k_1\hat{\boldsymbol{\epsilon}}(t)+k_2\mathbf{w}$ with $k_1^{2}+k_2^{2}={t'}^{2}$. They are exact under point-mass data, but for non-degenerate distributions the posterior mean discards residual uncertainty in $\mathbf{z}_0\mid\mathbf{z}_t$, so the covariance contracts by exactly

\[ \mathrm{Cov}_{\text{target}}-\mathrm{Cov}_{\text{CPS}} \;=\; (1-t')^{2}\,\mathbb{E}_{\mathbf{z}_t}\!\bigl[\mathrm{Cov}(\mathbf{z}_0\mid\mathbf{z}_t)\bigr], \]

which manifests as a persistent marginal bias on the double-ring toy (see Motivation carousel).

4. Precise: a frozen-posterior-mean closed-form transition

The Euler failure is feedback blindness; the CPS failure is variance collapse. We want a finite-step rule that keeps the reverse-SDE dynamics yet has a stable per-step anchor. Our key observation: the clean-latent posterior mean $\hat{\mathbf{z}}_0(t)$ moves only mildly under isotropic perturbations of $\mathbf{z}_t$, because estimating it is exactly the denoising task the network was pretrained on. We therefore freeze $\hat{\mathbf{z}}_0(t)$ over a single reverse step — but, unlike CPS, we use it only as a linearization anchor; the residual stochastic variance is still set by the SDE transition. Rewriting the reverse SDE around this anchor gives

\[ \mathrm{d}\mathbf{z}_t \;=\; \Bigl( \frac{\mathbf{z}_t-\hat{\mathbf{z}}_0(t)}{t} +\frac{\varepsilon_t^{2}}{2t^{2}}\bigl(\mathbf{z}_t-(1-t)\hat{\mathbf{z}}_0(t)\bigr) \Bigr)\mathrm{d}t +\varepsilon_t\,\mathrm{d}\mathbf{w}_t. \]

SD3.5-M clean-latent stability under forward-style renoising vs logSNR
Empirical check. On SD3.5-M, the RMS movement of $\hat{\mathbf{z}}_0$ under forward-style renoising stays below the movement of the noisy latent across the entire logSNR range — the frozen-mean approximation is locally valid.
Assumption (Frozen posterior mean on one step). For a reverse step $t\!\to\!t'

Under this assumption the SDE becomes linear in $\mathbf{z}_t$ and can be solved in closed form.

Theorem (Exact transition under frozen posterior mean). With $A(t',t)\triangleq\int_{t'}^{t}\varepsilon_s^{2}/s^{2}\,\mathrm{d}s$ and $\mathbf{w}\sim\mathcal{N}(\mathbf{0},\mathbf{I})$ independent of $\mathbf{z}_t$, \[ \mathbf{z}_{t'} \;=\; (1-t')\hat{\mathbf{z}}_0(t) \;+\;\frac{t'}{t}\,e^{-A(t',t)/2}\bigl(\mathbf{z}_t-(1-t)\hat{\mathbf{z}}_0(t)\bigr) \;+\; t'\sqrt{1-e^{-A(t',t)}}\,\mathbf{w}. \]

Plugging in the logSNR-balanced schedule $\varepsilon_t=\eta\sqrt{t/(1-t)}$ gives $A(t',t)=\eta^{2}\log\!\bigl(t(1-t')/(t'(1-t))\bigr)$, and the transition collapses into the clean update rule actually used by Precise:

\[ \boxed{\; \mathbf{z}_{t'} \;=\; (1-t')\,\hat{\mathbf{z}}_0(t) \;+\; t'\,\rho(t',t)\,\hat{\boldsymbol{\epsilon}}(t) \;+\; t'\sqrt{1-\rho(t',t)^{2}}\,\mathbf{w}, \qquad \rho(t',t) \;=\; \Bigl(\dfrac{t'(1-t)}{t(1-t')}\Bigr)^{\eta^{2}/2}. \;} \]

The two terms after $\hat{\mathbf{z}}_0(t)$ have squared norms that sum to ${t'}^{2}$, which exactly matches the flow-matching marginal at $t'$: no excess discretization noise. And because $\hat{\mathbf{z}}_0(t)$ is used only as a step-local anchor, posterior covariance is preserved—no CPS-style marginal contraction.

5. Algorithm

Algorithm 1  Precise one-step reverse update
Require: current latent $\mathbf{z}_t$, current time $t$, next time $t' < t$, exploration strength $\eta$
Require: flow-matching velocity model $v_\theta(\cdot,\cdot)$
1: $\hat{\mathbf{u}}_t \leftarrow v_\theta(\mathbf{z}_t,\,t)$ // model velocity
2: $\hat{\mathbf{z}}_0(t) \leftarrow \mathbf{z}_t - t\,\hat{\mathbf{u}}_t$ // clean-latent prediction
3: $\hat{\boldsymbol{\epsilon}}(t) \leftarrow \mathbf{z}_t + (1-t)\,\hat{\mathbf{u}}_t$ // noise prediction
4: $\rho \leftarrow \bigl(t'(1-t)\,/\,t(1-t')\bigr)^{\eta^{2}/2}$ // closed-form coupling
5: sample $\mathbf{w}\sim\mathcal{N}(\mathbf{0},\mathbf{I})$
6: $\mathbf{z}_{t'} \leftarrow (1-t')\,\hat{\mathbf{z}}_0(t) + t'\rho\,\hat{\boldsymbol{\epsilon}}(t) + t'\sqrt{1-\rho^{2}}\,\mathbf{w}$
7: return $\mathbf{z}_{t'}$

SD3.5-M main results

Experiments — main table, training curves, and head-to-head (replace placeholder table PNGs from your LaTeX build).

FLUX.2 Klein

20 NFE / 1000 iterations

Ablations and robustness

Schedule + discretization; NFE and η sweeps

Qualitative comparisons

Appendix-style contact sheets (PickScore training)

BibTeX

@misc{zou2026precisesdeconsistentstochasticsampling,
  title={Precise: SDE-Consistent Stochastic Sampling for RL Post-Training of Flow-Matching Models},
  author={Jade Zou and Tao Huang and Weijie Kong and Junzhe Li and Yue Wu and Qi Tian and Jiangfeng Xiong and Jianwei Zhang and Liefeng Bo and Zhao Zhong},
  year={2026},
  eprint={2605.23522},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2605.23522},
}