ABC: Any-Subset Autoregression via Non-Markovian Diffusion Bridges in Continuous Time and Space

Anonymous Institution
Preprint 2026
Probabilistic Graphical Model for ABC

ABC models arbitrary cylinder marginal distributions of a continuous-time, continuous-space stochastic process. This generalizes autoregressive modeling to continuous time, and generalizes diffusion models to the non-Markovian case.

Abstract

Generative modeling of continuous-time, continuous-space stochastic processes (e.g. videos, weather forecasts) conditioned on partial observations (e.g. the first and last frames) is a fundamental challenge. Existing approaches (e.g. diffusion models) suffer from key limitations: (1) their noise-to-data evolution fails to capture the structural similarity between states that are close in physical time, and is unstable to integrate in low-step regimes; (2) the injected random noise is insensitive to the physical time elapsed in the process, resulting in incorrect dynamics; (3) they overlook conditioning on arbitrary subsets of states (e.g. irregularly sampled timesteps, future observations). We propose ABC: Any-Subset Autoregression via Non-Markovian Diffusion Bridges in Continuous Time and Space. Crucially, we model the process with a single SDE whose time variable and intermediate states track the physical time and the process states. This has provable advantages: (1) the starting point for generating future states is the already-close previous state, rather than uninformative noise; (2) the injected noise scales with the physical time elapsed, encouraging physically plausible dynamics in which time-adjacent states are similar. We derive the SDE dynamics via changes of measure on path space, yielding another advantage: (3) path-dependent conditioning on arbitrary subsets of the state history and/or future. To learn these dynamics, we derive a path- and time-dependent extension of denoising score matching. Our experiments show ABC's superiority over competing methods across multiple domains, including video generation and weather forecasting.
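
To make the two inductive biases concrete, below is a minimal, hypothetical sketch of a data-to-data transition between two observed states at physical times s < u, simulated with Euler-Maruyama. The analytic Brownian-bridge drift, the function name, and the NumPy setting are illustrative assumptions standing in for the paper's learned, path-conditioned drift; the point is only that generation starts from the previous state (not noise) and that the injected noise scales with the physical time step.

import numpy as np

def bridge_sample(x_s, x_u, s, u, sigma, n_steps=250, rng=None):
    # Data-to-data transition from state x_s (time s) to x_u (time u).
    # Illustrative Brownian-bridge dynamics: the drift pulls the current
    # state toward the known endpoint x_u, and the injected noise scales
    # with the physical time step dt. Sketch only, not the paper's SDE.
    rng = np.random.default_rng() if rng is None else rng
    dt = (u - s) / n_steps
    x, t = np.array(x_s, dtype=float), s
    path = [x.copy()]
    for _ in range(n_steps):
        drift = (x_u - x) / (u - t)                                  # pull toward the future observation
        noise = sigma * np.sqrt(dt) * rng.standard_normal(x.shape)   # physical-time-scaled noise
        x = x + drift * dt + noise
        t += dt
        path.append(x.copy())
    return np.stack(path)

# Example: interpolate a 1-D state from 0.0 (t = 0) to 1.0 (t = 1).
traj = bridge_sample(np.zeros(1), np.ones(1), 0.0, 1.0, sigma=0.4)

In the continuous-time limit the bridge hits the conditioning state x_u exactly; with a finite number of Euler steps the final state is close to x_u up to the last noise increment.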

Non-Causal Generation Results

Infilling results, with prompt frames every eight frames. As theoretically predicted, ABC's inductive biases of time-adaptive volatility and data-to-data transitions provide clear benefits over competing methods that lack them. The autoregressively conditioned diffusion bridges show flickering artifacts (due to their one-size-fits-all volatility coefficient), while the noise-to-data diffusion is incoherent.

Video Generation Parameters
  • Sky Videos: Latent diffusion on the Sky-Timelapse dataset. Top row is Ground Truth; second row is ABC (σ = 0.4, no Brownian Bridge); third row is Autoregressively Chained Diffusion Bridges (σ = 0.4); fourth row is Noise-to-Data Diffusion (exponentially decaying noise, B = 4.0, K = 2.5). We generate 32 frames, with ground-truth conditioning given every eight frames, plus the final frame (see the mask sketch after this list). We allow the SDEs to overwrite the conditioning frames, to test the consistency of the model trajectory at conditioning times. Samples are generated with 250 discretization steps of the SDE. All methods use the same DiT architecture.
  • Talking Head Videos: Latent diffusion on the CelebV-HQ dataset. Top row is Ground Truth; second row is ABC (σ = 0.5, no Brownian Bridge); third row is Autoregressively Chained Diffusion Bridges (σ = 0.5); fourth row is Noise-to-Data Diffusion (cosine decaying noise, α = 3.0, ε = 0.04). We generate 32 frames, with ground-truth conditioning given every four frames, plus the final frame. We teacher-force on the conditioning frames. Samples are generated with 250 discretization steps of the SDE. All methods use the same DiT architecture.
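
As a concrete illustration of the infilling setup above, here is a hypothetical helper that builds the boolean conditioning mask over a clip; the function name and signature are assumptions, not the released code. With the defaults it reproduces the Sky Videos setting (32 frames, conditioning every eight frames plus the final frame); stride=4 gives the Talking Head setting.

def infilling_mask(n_frames=32, stride=8):
    # Mark which frames are observed (given as ground truth) versus generated.
    # Frames 0, stride, 2*stride, ... and the final frame are conditioning frames.
    # Illustrative helper, not the paper's released code.
    mask = [i % stride == 0 for i in range(n_frames)]
    mask[-1] = True  # always condition on the final frame
    return mask

# 32 frames, conditioning every 8 frames plus the last frame -> 5 observed frames
print(sum(infilling_mask()))  # 5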

Causal Generation and Alternative Prompting Strategies

Comparison of methods on causal generation, with the first four frames as prompt. As in the previous results, ABC performs best. Conditional diffusion bridges still suffer from flickering and texture-loss artifacts, while the noise-to-data diffusion completely degenerates.

Ablation on prompting strategy. Full conditioning on the path history is best. Conditioning on a Monte Carlo sample of the path trajectory displays temporal inconsistencies (the first two columns implausibly reverse direction). Conditioning on only the most recent frame (plus the initial frame) shares the temporal inconsistencies, plus artifacts from the lack of history (the third column's clouds dissolve, and the fourth column's clouds lose texture). Conditioning only on the eight-frame prefix, without updating memory, is worst.
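
The four prompting strategies amount to different rules for selecting which already-available frames enter the conditioning set at each rollout step. The sketch below is an illustrative rendering of those rules, assuming 0-indexed frames and an 8-frame prompt; the function and strategy names are assumptions, not the paper's code.

import random

def conditioning_indices(strategy, generated_upto, prompt_len=8):
    # Select which frame indices (among those generated so far) to condition on.
    #   'full'   -- full path history (all frames generated so far)
    #   'mc'     -- initial frame, most recent frame, and one random frame in between
    #   'markov' -- initial frame and most recent frame only (near-Markovian)
    #   'prefix' -- the fixed prompt frames only, never updated
    # Illustrative sketch of the ablation settings, not the released code.
    last = generated_upto - 1
    if strategy == 'full':
        return list(range(generated_upto))
    if strategy == 'mc':
        middle = random.randrange(1, last) if last > 1 else last
        return sorted({0, middle, last})
    if strategy == 'markov':
        return sorted({0, last})
    if strategy == 'prefix':
        return list(range(prompt_len))
    raise ValueError(f"unknown strategy: {strategy}")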

Video Generation Parameters
  • Comparison of Methods on Causal Generation of Sky Videos (Left): Latent diffusion on the Sky-Timelapse dataset. Top row is Ground Truth; second row is ABC (σ = 0.4, no Brownian Bridge); third row is Autoregressively Chained Diffusion Bridges (σ = 0.4); fourth row is Noise-to-Data Diffusion (cosine decaying noise, α = 3.0, ε = 0.04). We generate 28 frames, conditioned on a 4-frame prefix. Samples are generated with 250 discretization steps of the SDE. All methods use the same DiT architecture.
  • Ablation on Prompting Strategy (Right): Also latent diffusion on Sky-Timelapse. Uses ABC (σ = 0.4, no Brownian Bridge) for all frames, with the first eight frames as prompt in a causal rollout, and 500 diffusion steps. Top row is full-path autoregressive conditioning; second row conditions on a Monte Carlo sample of the path trajectory (initial frame, most recent frame, and a randomly selected frame in between); third row is near-Markovian conditioning (initial frame and most recent frame); fourth row is prefix conditioning (first eight frames only).

BibTeX

@article{abc2026,
  title={ABC: Any-Subset Autoregression via Non-Markovian Diffusion Bridges in Continuous Time and Space},
  author={Anonymous Authors},
  journal={Preprint},
  year={2026},
  url={https://abc-diffusion.github.io/}
}