introduce figcaption

Yan Lin 2026-02-06 09:11:05 +01:00
parent 05dea86964
commit d8ea74211f
14 changed files with 81 additions and 63 deletions

@@ -26,11 +26,11 @@ The goal of a multi-modal Transformer is to create a model that can accept multi
![](multi-modal-fusion.webp)
> An example of "conventional" multi-modal fusion. Different modalities are processed by separate models and fused at some point. Source: *Xiang, Hao, Runsheng Xu, and Jiaqi Ma. "HM-ViT: Hetero-modal vehicle-to-vehicle cooperative perception with vision transformer." CVPR, 2023.*
{% cap() %}An example of "conventional" multi-modal fusion. Different modalities are processed by separate models and fused at some point. Source: *Xiang, Hao, Runsheng Xu, and Jiaqi Ma. "HM-ViT: Hetero-modal vehicle-to-vehicle cooperative perception with vision transformer." CVPR, 2023.*{% end %}
![](video-poet.webp)
> An example of a Transformer that can handle multi-modal inputs and outputs. Different modalities are all projected into tokens and subsequently processed by a unified Transformer encoder. Source: *Kondratyuk, Dan, Lijun Yu, et al. "VideoPoet: A Large Language Model for Zero-Shot Video Generation," ICML, 2024.*
{% cap() %}An example of a Transformer that can handle multi-modal inputs and outputs. Different modalities are all projected into tokens and subsequently processed by a unified Transformer encoder. Source: *Kondratyuk, Dan, Lijun Yu, et al. "VideoPoet: A Large Language Model for Zero-Shot Video Generation," ICML, 2024.*{% end %}
Beyond multi-modal processing, a multi-function Transformer can, for example, function as both a language model (auto-regressive generation) and diffusion denoiser (score-matching generation) simultaneously, supporting two of the most common generation schemes used today.
@@ -40,13 +40,13 @@ A fundamental challenge in unifying multiple modalities within a single Transfor
![](qkv-attention.webp)
> Illustration of the QKV self-attention mechanism in Transformer. [Source](https://en.wikipedia.org/wiki/Attention_(machine_learning))
{% cap() %}Illustration of the QKV self-attention mechanism in Transformer. [Source](https://en.wikipedia.org/wiki/Attention_(machine_learning)){% end %}
The most common method for mapping language into the embedding space is through tokenization and token embedding. A tokenizer maps a word or word fragment into a discrete token index, and an index-fetching embedding layer (implemented in frameworks like PyTorch with `nn.Embedding`) maps this index into a fixed-dimension embedding vector. In principle, all discrete features can be mapped into the embedding space using this approach.
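As a minimal sketch of this index-fetching lookup in PyTorch (the vocabulary size, embedding dimension, and token indices below are made up for illustration):

```python
import torch
import torch.nn as nn

# Hypothetical vocabulary of 50k tokens, each mapped to a 512-d embedding vector.
vocab_size, d_model = 50_000, 512
embedding = nn.Embedding(vocab_size, d_model)

# A "tokenized" sentence: discrete token indices (values made up for illustration).
token_ids = torch.tensor([[101, 2023, 2003, 1037, 7099, 102]])   # shape (1, 6)

# Index fetching: each index selects one row of the embedding table.
token_embeddings = embedding(token_ids)                           # shape (1, 6, 512)
```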
![](token-embedding.webp)
> Visualization of tokenizer and index-fetching embedding layer. [Source](https://medium.com/@hunter-j-phillips/the-embedding-layer-27d9c980d124)
{% cap() %}Visualization of tokenizer and index-fetching embedding layer. [Source](https://medium.com/@hunter-j-phillips/the-embedding-layer-27d9c980d124){% end %}
### Vector Quantization
@@ -121,7 +121,7 @@ One approach to reverse vector quantization is readily available in VQ-VAE, sinc
![](magvit.webp)
> The encoder-decoder structure of MAGVIT (*Yu et al., "MAGVIT"*), a visual VQ-VAE model. A 3D-VQ encoder quantizes a video into discrete tokens, and a 3D-VQ decoder maps them back to the pixel space.
{% cap() %}The encoder-decoder structure of MAGVIT (*Yu et al., "MAGVIT"*), a visual VQ-VAE model. A 3D-VQ encoder quantizes a video into discrete tokens, and a 3D-VQ decoder maps them back to the pixel space.{% end %}
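To make the quantization step concrete, here is a minimal sketch of the nearest-neighbor codebook lookup at the core of VQ-VAE-style models; the codebook size and feature shapes are illustrative, not MAGVIT's actual configuration:

```python
import torch

def vector_quantize(z, codebook):
    """Map continuous features z (N, D) to their nearest codebook entries (K, D)."""
    dists = torch.cdist(z, codebook) ** 2        # pairwise squared distances, (N, K)
    indices = dists.argmin(dim=1)                # discrete token indices, (N,)
    return codebook[indices], indices            # quantized vectors (N, D) and their indices

codebook = torch.randn(1024, 64)    # hypothetical codebook: 1024 entries of dimension 64
features = torch.randn(16, 64)      # pretend encoder outputs for 16 spatial positions
z_q, tokens = vector_quantize(features, codebook)
```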
### Efficiency Enhancement
@@ -133,7 +133,7 @@ Another workaround follows the idea of compression. Take video generation as an
![](video-lavit.webp)
> Key frames and motion vectors used in *Jin et al., "Video-LaVIT."*
{% cap() %}Key frames and motion vectors used in *Jin et al., "Video-LaVIT."*{% end %}
## Fuse with Diffusion Models
@@ -143,7 +143,7 @@ An intriguing question arises: why not integrate the structures of language mode
![](transfusion.webp)
> A Transformer capable of functioning as a language model and a diffusion denoiser at the same time. Source: *Zhou, Chunting, Lili Yu, et al. "Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model," ICLR, 2025.*
{% cap() %}A Transformer capable of functioning as a language model and a diffusion denoiser at the same time. Source: *Zhou, Chunting, Lili Yu, et al. "Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model," ICLR, 2025.*{% end %}
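The gist of such a dual objective can be sketched as follows. This is only a toy illustration of combining a next-token loss on text tokens with a denoising loss on image latents, not Transfusion's actual training recipe; all shapes and the equal loss weighting are assumptions:

```python
import torch
import torch.nn.functional as F

# Toy tensors standing in for the outputs/targets of a unified Transformer.
batch, seq_len, vocab, latent_dim = 2, 8, 100, 16

text_logits = torch.randn(batch, seq_len, vocab)          # next-token predictions for text
pred_noise  = torch.randn(batch, latent_dim)               # predicted noise on image latents

next_tokens = torch.randint(0, vocab, (batch, seq_len))    # shifted text targets
true_noise  = torch.randn(batch, latent_dim)                # noise actually added to the latents

lm_loss   = F.cross_entropy(text_logits.reshape(-1, vocab), next_tokens.reshape(-1))
diff_loss = F.mse_loss(pred_noise, true_noise)
loss = lm_loss + diff_loss  # one backbone optimizes both generation schemes jointly
```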
## Conclusion

@@ -16,12 +16,12 @@ $$
where $\mu$ is the drift component that is deterministic, and $\sigma$ is the diffusion term driven by Brownian motion (denoted by $W_t$) that is stochastic. This differential equation specifies a *time-dependent vector (velocity) field* telling how a data point $x_t$ should be moved as time $t$ evolves from $t=0$ to $t=1$ (i.e., a *flow* from $x_0$ to $x_1$). Below we give an illustration where $x_t$ is 1-dimensional:
![Vector field between two distributions](vector-field.webp)
> Vector field between two distributions specified by a differential equation.
{% cap() %}Vector field between two distributions specified by a differential equation.{% end %}
When $\sigma(x_t,t)\equiv 0$, we get an *ordinary differential equation (ODE)* where the vector field is deterministic, i.e., the movement of $x_t$ is fully determined by $\mu$ and $t$. Otherwise, we get a *stochastic differential equation (SDE)* where the movement of $x_t$ has a certain level of randomness. Extending the previous illustration, below we show the difference in flow of $x_t$ under ODE and SDE:
![ODE vs SDE movements](ode-sde-difference.webp)
> Difference of movements in vector fields specified by ODE and SDE. *Source: Song, Yang, et al. "Score-based generative modeling through stochastic differential equations."* Note that their time is reversed.
{% cap() %}Difference of movements in vector fields specified by ODE and SDE. *Source: Song, Yang, et al. "Score-based generative modeling through stochastic differential equations."* Note that their time is reversed.{% end %}
As you would imagine, once we manage to solve the differential equation, even if we still cannot have a closed form of $p(x_1)$, we can sample from $p(x_1)$ by sampling a data point $x_0$ from $p(x_0)$ and get the generated data point $x_1$ by calculating the following forward-time integral with an integration technique of our choice:
@@ -32,7 +32,7 @@ $$
Or more intuitively, moving $x_0$ towards $x_1$ along time in the vector field:
![Flow of data point](flow-data-point.webp)
> A flow of a data point moving from $x_0$ towards $x_1$ in the vector field.
{% cap() %}A flow of a data point moving from $x_0$ towards $x_1$ in the vector field.{% end %}
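As a minimal numerical sketch, a fixed-step Euler scheme (Euler-Maruyama when $\sigma \neq 0$) can carry out this forward-time integration; the drift and diffusion terms below are toy choices purely for illustration:

```python
import numpy as np

def mu(x, t):
    return x * (1.0 - t)                  # toy drift term, for illustration only

def sigma(x, t):
    return 0.1                             # toy diffusion term; set to 0 to recover the ODE case

def integrate(x0, n_steps=100, seed=0):
    """Move x0 from t=0 to t=1 with Euler-Maruyama; reduces to plain Euler when sigma is 0."""
    rng = np.random.default_rng(seed)
    x, dt = x0, 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        dW = rng.normal(scale=np.sqrt(dt))        # Brownian motion increment
        x = x + mu(x, t) * dt + sigma(x, t) * dW
    return x                                       # a generated sample from p(x_1)

x1 = integrate(x0=float(np.random.default_rng(1).normal()))
```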
## ODE and Flow Matching
@@ -81,12 +81,12 @@ $$
Although the ground truth vector field is designed to be straight, in practice it usually is not. When the data space is high-dimensional and the target distribution $p(x_1)$ is complex, there will be multiple pairs of $(x_0, x_1)$ that result in the same intermediate data point $x_t$, thus multiple velocities $x_1-x_0$. At the end of the day, the actual ground truth velocity at $x_t$ will be the average of all possible velocities $x_1-x_0$ that pass through $x_t$. This will lead to a "curvy" vector field, illustrated as follows:
![Curvy vector field](curvy-vector-field.webp)
> Left: multiple vectors passing through the same intermediate data point. Right: the resulting ground truth vector field. *Source: Geng, Zhengyang, et al. "Mean Flows for One-step Generative Modeling."* Note $z_t$ and $v$ in the figure correspond to $x_t$ and $\mu$ in this post, respectively.
{% cap() %}Left: multiple vectors passing through the same intermediate data point. Right: the resulting ground truth vector field. *Source: Geng, Zhengyang, et al. "Mean Flows for One-step Generative Modeling."* Note $z_t$ and $v$ in the figure correspond to $x_t$ and $\mu$ in this post, respectively.{% end %}
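For reference, a minimal flow matching training loop with the straight-line interpolation path looks roughly as follows; the averaging over all $(x_0, x_1)$ pairs that share an $x_t$ happens implicitly through the regression, and the tiny MLP and toy 2D distributions are stand-ins:

```python
import torch
import torch.nn as nn

# Stand-in velocity network on toy 2D data; real models are far larger.
model = nn.Sequential(nn.Linear(3, 64), nn.SiLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for _ in range(100):
    x1 = torch.randn(256, 2) * 0.5 + 2.0      # toy samples from the target distribution
    x0 = torch.randn(256, 2)                   # samples from the source distribution
    t = torch.rand(256, 1)
    xt = (1 - t) * x0 + t * x1                 # straight-line interpolation between the pair
    target = x1 - x0                            # conditional velocity of this particular pair
    pred = model(torch.cat([xt, t], dim=1))
    # The regression implicitly averages all (x0, x1) pairs passing through the same xt.
    loss = ((pred - target) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```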
As we discussed, when you calculate the ODE integral, you are using the instantaneous velocity--tangent of the curves in the vector field--of each step. You would imagine this will lead to subpar performance when using a small number $N$ of steps, as demonstrated below:
![Few-step sampling failure](few-step-sampling.webp)
> Naive flow matching models fail at few-step sampling. *Source: Frans, Kevin, et al. "One step diffusion via shortcut models."*
{% cap() %}Naive flow matching models fail at few-step sampling. *Source: Frans, Kevin, et al. "One step diffusion via shortcut models."*{% end %}
### Shortcut Vector Field
@@ -130,14 +130,14 @@ $$
where $\text{sg}$ denotes stop gradient, i.e., detaching $\mathbf{u}_\text{target}$ from backpropagation, making it a pseudo ground truth. Below is an illustration of the training process provided in the original paper.
![Shortcut model training](shortcut-training.webp)
> Training of the shortcut models with self-consistency loss.
{% cap() %}Training of the shortcut models with self-consistency loss.{% end %}
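A rough sketch of forming this self-consistency target is below; the network interface (taking an extra step-size input $d$) and the toy 2D setup are assumptions for illustration, and the accompanying ordinary flow matching term of the full loss is omitted:

```python
import torch
import torch.nn as nn

# Stand-in shortcut network s(x_t, t, d) that predicts a velocity for step size d.
net = nn.Sequential(nn.Linear(4, 64), nn.SiLU(), nn.Linear(64, 2))

def s(x, t, d):
    return net(torch.cat([x, t, d], dim=1))

x_t = torch.randn(128, 2)
t = torch.rand(128, 1) * 0.5
d = torch.full((128, 1), 0.25)                # small step size

# Two consecutive small shortcuts form the target for one big (2d) shortcut.
v1 = s(x_t, t, d)
x_mid = x_t + d * v1                           # follow the first small shortcut
v2 = s(x_mid, t + d, d)
u_target = ((v1 + v2) / 2).detach()            # sg: stop gradient, a pseudo ground truth

loss_consistency = ((s(x_t, t, 2 * d) - u_target) ** 2).mean()
```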
#### Mean Flow
Mean flow is another work that shares the idea of learning velocities for large-step shortcuts, but with a stronger theoretical foundation and a different approach to training.
![Average velocity illustration](average-velocity.webp)
> Illustration of the average velocity provided in the original paper.
{% cap() %}Illustration of the average velocity provided in the original paper.{% end %}
Mean flow defines an *average velocity* as a shortcut between times $t$ and $r$ where $t$ and $r$ are independent:
@@ -257,7 +257,7 @@ One caveat of training a "shortcut SDE" is that the ideal result of one-step sam
Below are some preliminary results I obtained from a set of amorphous material generation experiments. You don't need to understand the figure--just know that it shows that applying the idea of learning shortcuts to SDE does yield better results compared to the vanilla SDE when using few-step sampling.
![SDE shortcut results](sde-results.webp)
> Structural functions of generated materials, sampled in 10 steps.
{% cap() %}Structural functions of generated materials, sampled in 10 steps.{% end %}
---

@@ -15,7 +15,7 @@ Most diffusion models work by coupling a forward diffusion process and a reverse
![](diffusion-process.webp)
> The two processes in a typical diffusion model. *Source: Ho, Jain, and Abbeel, "Denoising Diffusion Probabilistic Models."*
{% cap() %}The two processes in a typical diffusion model. *Source: Ho, Jain, and Abbeel, "Denoising Diffusion Probabilistic Models."*{% end %}
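A minimal sketch of the forward (noising) process under the common DDPM-style variance-preserving parameterization is below; the schedule values and tensor shapes are illustrative assumptions:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)              # illustrative linear noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)     # cumulative signal-retention factors

def forward_diffuse(x0, t):
    """Sample x_t ~ q(x_t | x_0) in closed form: sqrt(a_bar)*x0 + sqrt(1 - a_bar)*eps."""
    eps = torch.randn_like(x0)
    a_bar = alphas_bar[t].view(-1, *([1] * (x0.dim() - 1)))
    return a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps, eps

x0 = torch.randn(8, 3, 32, 32)                      # stand-in for a batch of clean images
t = torch.randint(0, T, (8,))                       # one random timestep per sample
x_t, noise = forward_diffuse(x0, t)                 # the denoiser learns to recover `noise` from x_t
```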
### Understanding DMs
@@ -23,7 +23,7 @@ There are many ways to understand how Diffusion Models (DMs) work. One of the mo
![](ode-sde-flow.webp)
> Illustrated ODE and SDE flow of a diffusion model on 1-dimensional data. *Source: Song et al., "Score-Based Generative Modeling through Stochastic Differential Equations."*
{% cap() %}Illustrated ODE and SDE flow of a diffusion model on 1-dimensional data. *Source: Song et al., "Score-Based Generative Modeling through Stochastic Differential Equations."*{% end %}
### DMs Scale Poorly with Few Steps
@@ -37,13 +37,13 @@ Nevertheless, it is observed that their performance typically suffers catastroph
![](few-steps-results.webp)
> Images generated by conventional DMs with only a few steps of reverse process. *Source: Frans et al., "One Step Diffusion via Shortcut Models."*
{% cap() %}Images generated by conventional DMs with only a few steps of reverse process. *Source: Frans et al., "One Step Diffusion via Shortcut Models."*{% end %}
To understand why DMs scale poorly with few reverse process steps, we can return to the vector field perspective of DMs. When the target data distribution is complex, the vector field typically contains numerous intersections. When a given $X_t$ and $t$ lies at one of these intersections, the vector points in the averaged direction of all candidates. This causes the generated data to approach the mean of the training data when only a few reverse process steps are used. Another explanation is that the learned vector field is highly curved. Using only a few reverse process steps means attempting to approximate these curves with polylines, which is inherently difficult.
![](dm-scale-poorly.webp)
> Illustration of why DMs scale poorly with few reverse process steps. *Source: Frans et al., "One Step Diffusion via Shortcut Models."*
{% cap() %}Illustration of why DMs scale poorly with few reverse process steps. *Source: Frans et al., "One Step Diffusion via Shortcut Models."*{% end %}
We will introduce two branches of methods that aim to scale DMs down to few or even a single reverse process step: **distillation-based**, which distills a pre-trained DM into a one-step model; and **end-to-end-based**, which trains a one-step DM from scratch.
@@ -73,7 +73,7 @@ This procedure produces increasingly straight flows that can be simulated with v
![](reflow-iterations.webp)
> Illustrations of vector fields after different numbers of reflow iterations. *Source: Liu, Gong, and Liu, "Flow Straight and Fast."*
{% cap() %}Illustrations of vector fields after different numbers of reflow iterations. *Source: Liu, Gong, and Liu, "Flow Straight and Fast."*{% end %}
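A rough sketch of one reflow round, under assumed stand-ins (a toy 2D setup, Euler simulation, and a small network playing the role of the pre-trained flow model):

```python
import torch
import torch.nn as nn

def sample_ode(model, x0, n_steps=100):
    """Simulate the current flow from t=0 to t=1 with simple Euler steps."""
    x, dt = x0, 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((x.size(0), 1), i * dt)
        x = x + model(torch.cat([x, t], dim=1)) * dt
    return x

# Stand-in velocity network, assumed to be an already-trained flow model.
model = nn.Sequential(nn.Linear(3, 64), nn.SiLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# One reflow round: re-pair each noise sample with the model's own output for it,
# then retrain flow matching on these deterministic couplings to straighten the flow.
x0 = torch.randn(512, 2)
with torch.no_grad():
    x1 = sample_ode(model, x0)

for _ in range(100):
    t = torch.rand(512, 1)
    xt = (1 - t) * x0 + t * x1
    pred = model(torch.cat([xt, t], dim=1))
    loss = ((pred - (x1 - x0)) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```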
In practice, distillation-based methods are usually trained in two stages: first train a normal DM, and later distill one-step capabilities into it. This introduces additional computational overhead and complexity.
@@ -93,7 +93,7 @@ In theory, without altering the fundamental formulation of DMs, the learnable de
![](consistency-model.webp)
> A consistency model that learns to map any point on the ODE trajectory to the clean sample. *Source: Song et al., "Consistency Models."*
{% cap() %}A consistency model that learns to map any point on the ODE trajectory to the clean sample. *Source: Song et al., "Consistency Models."*{% end %}
Formally, CMs learn a function $f_\theta(x_t,t)$ that maps noisy data $x_t$ at time $t$ directly to the clean data $x_0$, satisfying:
@@ -135,6 +135,6 @@ Based on this insight, on top of $x_t$ and $t$, shortcut models additionally inc
![](shortcut-training.webp)
> Illustration of the training process of shortcut models. *Source: Frans et al., "One Step Diffusion via Shortcut Models."*
{% cap() %}Illustration of the training process of shortcut models. *Source: Frans et al., "One Step Diffusion via Shortcut Models."*{% end %}
Both consistency models and shortcut models can be seamlessly scaled between one-step and multi-step generation to balance quality and efficiency.
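As a final sketch, few-step sampling with a step-size-conditioned (shortcut-style) model might look like the following; the model interface and toy shapes are assumptions, not taken from either paper:

```python
import torch

@torch.no_grad()
def sample(model, shape, n_steps):
    """Few-step sampling: fewer, larger steps trade quality for speed."""
    x = torch.randn(shape)                      # start from the source (noise) distribution
    d = 1.0 / n_steps                            # step size the model is conditioned on
    for i in range(n_steps):
        t = torch.full((shape[0], 1), i * d)
        step = torch.full((shape[0], 1), d)
        x = x + d * model(x, t, step)            # follow the predicted shortcut for this step
    return x

# Stand-in model for demonstration; a real one would be a trained network.
toy_model = lambda x, t, d: -x
one_step = sample(toy_model, (16, 2), n_steps=1)   # fastest, lowest quality
few_step = sample(toy_model, (16, 2), n_steps=8)   # more steps, higher quality
```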