diff --git a/claude.md b/claude.md new file mode 100644 index 0000000..b12c21e --- /dev/null +++ b/claude.md @@ -0,0 +1,26 @@ +I am in the process of migrating content from my previous Quartz 4-based blog site in /Users/yanlin/Documents/Projects/personal-blog to this Zola-based blog site. + +For each blog post, follow this migration process: +1. Create an empty bundle (a directory with the same name as the old markdown file) under the section matching the original one +2. Copy the old markdown file to the bundle as index.md (first copy the file using `cp` directly, then edit) +3. Edit the frontmatter: + +``` ++++ +title = "(the original title)" +date = (the old created field) +description = "(leave blank)" ++++ +``` + +4. Find, copy, and rename the images used in the post to the bundle +5. Replace the old Obsidian-flavor markdown links (images ![[]] and internal links [[]]) with standard markdown links +6. Turn callout blocks into standard markdown quote blocks, e.g., >[!note], >[!TLDR], >[!quote] → > **Note:**, > **TL;DR:**, > **References:**; e.g., > [!tip] Videos -> > **Videos:**, > [!info] Extended Reading -> > **Extended Reading:** +7. For multiline math equations (those with \\), wrap the whole equation like below to avoid Zola's processing: + +``` +{% math() %} +f_{\{q,k\}}(x_m, m) = \begin{pmatrix} \cos m\theta & -\sin m\theta \\ \sin m\theta & \cos m\theta \end{pmatrix} \begin{pmatrix} W_{\{q,k\}}^{(11)} & W_{\{q,k\}}^{(12)} \\ W_{\{q,k\}}^{(21)} & W_{\{q,k\}}^{(22)} \end{pmatrix} \begin{pmatrix} x_m^{(1)} \\ x_m^{(2)} \end{pmatrix} +{% end %} +``` + diff --git a/content/ais/_index.md b/content/ai-system/_index.md similarity index 100% rename from content/ais/_index.md rename to content/ai-system/_index.md diff --git a/content/ml-tech/multi-modal-transformer/index.md b/content/ml-tech/multi-modal-transformer/index.md new file mode 100644 index 0000000..97c5e01 --- /dev/null +++ b/content/ml-tech/multi-modal-transformer/index.md @@ -0,0 +1,152 @@ ++++ +title = "Multi-modal Transformers" +date = 2025-06-06 +description = "" ++++ + +Transformers have gained immense popularity within deep learning and AI communities in recent years. Since their introduction in *Vaswani et al., "Attention Is All You Need"*, they have proven to be powerful sequential models across diverse domains, with thousands of variations and "improved versions." The rise of Large Language Models (LLMs), which largely use Transformers as their foundation, has led to another surge in research around this architecture. This trend has even led the graph learning and Computer Vision (CV) communities to move beyond their established foundation models (i.e., GNNs and CNNs) and embrace Transformers. This explains the increasing prevalence of graph Transformers and image Transformers today. + +> Han et al., "A Survey on Vision Transformer"; Khan et al., "Transformers in Vision"; Yun et al., "Graph Transformer Networks." + +Beyond "chasing the trend," using Transformer as a unified foundation model offers several advantages: + +- Transformers excel at capturing long-term dependencies. Unlike GNNs and CNNs which require deeper network structures for longer context, Transformers natively support global dependency modeling through their self-attention mechanism. They also avoid the over-smoothing and vanishing gradient problems that hinder context length scaling in other network architectures. +- Transformers process sequences in parallel rather than sequentially, enabling full utilization of GPU acceleration.
This advantage can be further enhanced with techniques like those described in *Dao et al., "FlashAttention."* +- Transformers are flexible network structures. They don't inherently enforce sequentiality: without positional encoding, a Transformer treats all orderings of its input as equivalent. Through strategic permutation and positional encoding, Transformers can adapt to a wide range of structured and unstructured data. +- The development of LLMs has made many open-weight Transformer models available with strong natural language understanding capabilities. These Transformers can be prompted and fine-tuned to model other modalities such as spatiotemporal data and images while retaining their language modeling abilities, creating opportunities for developing multi-modal foundation models. +- From a practical perspective, using Transformer as a foundation allows reuse of technical infrastructure and optimizations developed over years, including efficient architecture designs, training pipelines, and specialized hardware. + +In this article, we will briefly explore techniques for unifying multiple modalities (e.g., natural language and images) and multiple functionalities (e.g., language models and diffusion denoisers) within a single Transformer. These techniques are largely sourced from recent oral papers presented at ICML, ICLR, and CVPR conferences. I assume readers have general knowledge of basic concepts in ML and neural networks, Transformers, LLMs, and diffusion models. + +Since images and language modalities represent continuous and discrete data respectively, we will use them as examples throughout this article. Keep in mind that the techniques introduced can be readily extended to other modalities, including spatiotemporal data. + +## General Goal + +The goal of a multi-modal Transformer is to create a model that can accept multi-modal inputs and produce multi-modal outputs. For example, instead of using a CNN-based image encoder and a Transformer-based language encoder to map image and language modalities to the latent space separately, a multi-modal Transformer would be able to process the combination of image and language (sentence) as a single sequence. + +![](multi-modal-fusion.png) + +> An example of "conventional" multi-modal fusion. Each modality is processed by a separate model, and the results are fused at some point. Source: *Xiang, Hao, Runsheng Xu, and Jiaqi Ma. "HM-ViT: Hetero-modal vehicle-to-vehicle cooperative perception with vision transformer." CVPR, 2023.* + +![](video-poet.png) + +> An example of a Transformer that can handle multi-modal inputs and outputs. Different modalities are all projected into tokens and subsequently processed by a unified Transformer encoder. Source: *Kondratyuk, Dan, Lijun Yu, et al. "VideoPoet: A Large Language Model for Zero-Shot Video Generation," ICML, 2024.* + +Beyond multi-modal processing, a multi-function Transformer can, for example, function as both a language model (auto-regressive generation) and diffusion denoiser (score-matching generation) simultaneously, supporting two of the most common generation schemes used today. + +## Modality Embedding + +A fundamental challenge in unifying multiple modalities within a single Transformer is how to represent different modalities in the same embedding space. For the "QKV" self-attention mechanism to work properly, each item in the input sequence must be represented by an embedding vector of the same dimension, matching the "model dimension" of the Transformer.
+ +![](qkv-attention.png) + +> Illustration of the QKV self-attention mechanism in Transformer. [Source](https://en.wikipedia.org/wiki/Attention_(machine_learning)) + +The most common method for mapping language into the embedding space is through tokenization and token embedding. A tokenizer maps a word or word fragment into a discrete token index, and an index-fetching embedding layer (implemented in frameworks like PyTorch with `nn.Embedding`) maps this index into a fixed-dimension embedding vector. In principle, all discrete features can be mapped into the embedding space using this approach. + +![](token-embedding.png) + +> Visualization of tokenizer and index-fetching embedding layer. [Source](https://medium.com/@hunter-j-phillips/the-embedding-layer-27d9c980d124) + +### Vector Quantization + +For continuous features, one intuitive approach is to first tokenize them into discrete tokens, thereby unifying the embedding process across both discrete and continuous features. **Vector quantization**, introduced in VQ-VAE, is one of the most common methods for this purpose. + +> Van Den Oord, Aaron, and Oriol Vinyals. "Neural discrete representation learning." NeurIPS, 2017. + +Vector quantization maintains a "codebook" $\boldsymbol C \in \mathbb R^{n\times d}$, which functions similarly to the index-fetching embedding layer, where $n$ is the total number of unique tokens, and $d$ is the embedding size. A given continuous vector $\boldsymbol{z}\in\mathbb R^{d}$ is quantized into a discrete value $i\in\mathbb [0,n-1]$ by finding the closest row vector in $\boldsymbol C$ to $\boldsymbol{z}$, and that row vector $\boldsymbol C_i$ is fetched as the embedding for $\boldsymbol{z}$. Formally: +$$ +i = \arg\min_j ||\boldsymbol z - \boldsymbol C_j||_2 +$$ +![](vector-quantization.png) + +### Lookup-Free Quantization + +A significant limitation of vector quantization is that it requires calculating distances between the given continuous vectors and the entire codebook, which becomes computationally expensive for large-scale codebooks. This creates tension with the need for expanded codebooks to represent complex modalities such as images and videos. Research has shown that simply increasing the number of unique tokens doesn't always improve codebook performance. + +> "A simple trick for training a larger codebook involves decreasing the code embedding dimension when increasing the vocabulary size." Source: *Yu, Lijun, Jose Lezama, et al. "Language Model Beats Diffusion - Tokenizer Is Key to Visual Generation," ICLR, 2024.* + +Building on this insight, **Lookup-Free Quantization** (LFQ) eliminates the embedding dimension of codebooks (essentially reducing the embedding dimension to 0) and directly calculates the discrete index $i$ by individually quantizing each dimension of $\boldsymbol z$ into a binary digit. The index $i$ can then be computed by converting the binary representation to decimal. Formally: +$$ +i=\sum_{j=1}^{d} 2^{(j-1)}\cdot 𝟙(z_j > 0) +$$ + +> For example, given a continuous vector $\boldsymbol z=\langle -0.52, 1.50, 0.53, -1.32\rangle$, we first quantize each dimension into $\langle 0, 1, 1, 0\rangle$, based on the sign of each dimension. The token index of $\boldsymbol z$ is simply the decimal equivalent of the binary 0110, which is 6. + +However, this approach introduces another challenge: we still need an index-fetching embedding layer to map these token indices into embedding vectors for the Transformer. 
This, combined with the typically large number of unique tokens when using LFQ—a 32-dimensional $\boldsymbol z$ will result in $2^{32}=4,294,967,296$ unique tokens—creates significant efficiency problems. One solution is to factorize the token space. Effectively, this means splitting the binary digits into multiple parts, embedding each part separately, and concatenating the resulting embedding vectors. For example, with a 32-dimensional $\boldsymbol z$, if we quantize and embed its first and last 16 dimensions separately, we "only" need to handle $2^{16}*2= 131,072$ unique tokens. + +Note that this section doesn't extensively explain how to map raw continuous features into the vector $\boldsymbol{z}$, as these techniques are relatively straightforward and depend on the specific feature type—for example, fully-connected layers for numerical features, or CNN/GNN with feature flattening for structured data. + +### Quantization over Linear Projection + +You might be asking—why can't we simply use linear projections to map the raw continuous features into the embedding space? What are the benefits of quantizing continuous features into discrete tokens? + +Although Transformers are regarded as universal sequential models, they were designed for discrete tokens in their first introduction in *Vaswani et al., "Attention Is All You Need"*. Empirically, they have optimal performance when dealing with tokens, compared to continuous features. This is supported by many research papers claiming that quantizing continuous features improves the performance of Transformers, and works demonstrating Transformers' subpar performance when applied directly to continuous features. + +> Mao, Chengzhi, Lu Jiang, Mostafa Dehghani, Carl Vondrick, Rahul Sukthankar, and Irfan Essa. "Discrete Representations Strengthen Vision Transformer Robustness," ICLR, 2022. + +> Ilbert, Romain, Ambroise Odonnat, et al. "SAMformer: Unlocking the Potential of Transformers in Time Series Forecasting with Sharpness-Aware Minimization and Channel-Wise Attention," ICML, 2024. + +On the other hand, unifying different modalities into tokens is especially beneficial in the context of Transformer-based "foundation models," since it preserves the auto-regressive next-token prediction architecture of LLMs. Combined with special tokens such as "start of sentence" and "end of sentence," the Transformer model is flexible in generating contents of mixed modalities with varied length. + +> For example, by quantizing videos into discrete tokens and combining the token space of videos and language, one can create a unified Transformer model that generates both videos and language in one sequence. The start and end points of video and language sub-sequences are fully determined by the model, based on the specific input prompt. This structure would be difficult to replicate if we used tokenization for language but linear projection for videos. + +## Transformer Backbone + +After different modalities are mapped into the same embedding space, they can be arranged into a sequence of embedding vectors and input into a Transformer backbone. We don't discuss the variations of Transformer structure and improvement techniques here, as they are numerous, and ultimately function similarly as sequential models. + +> Lan et al., "ALBERT"; Ye et al., "Differential Transformer"; Kitaev, Kaiser, and Levskaya, "Reformer"; Su et al., "RoFormer"; Dai et al., "Transformer-XL." 
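To make the arrangement described above concrete, here is a minimal PyTorch sketch of embedding text tokens and already-quantized visual tokens into the same model dimension and feeding them to one backbone; the vocabulary sizes, dimensions, and layer counts are arbitrary assumptions for illustration:

```python
import torch
import torch.nn as nn

d_model = 512
text_embed = nn.Embedding(32000, d_model)   # text token ids -> embeddings
image_embed = nn.Embedding(8192, d_model)   # visual token ids (e.g., from VQ/LFQ) -> embeddings
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True),
    num_layers=6,
)

text_ids = torch.randint(0, 32000, (1, 16))   # toy text token ids
image_ids = torch.randint(0, 8192, (1, 64))   # toy visual token ids
seq = torch.cat([text_embed(text_ids), image_embed(image_ids)], dim=1)  # one mixed sequence
hidden = backbone(seq)  # (1, 80, d_model): both modalities processed by the same Transformer
```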
+ +As we know, the "full" Transformer structure proposed in *Vaswani et al., "Attention Is All You Need"* includes an encoder and a decoder. They perform self-attention within their respective input sequences, and the decoder additionally performs cross-attention between its input sequence and the memory sequence derived from the encoder's output. Some early language models use encoder-only structure (like *Devlin et al., "BERT"*) focused on outputting embedding vectors or encoder-decoder structure (like *Chung et al., "Scaling Instruction-Finetuned Language Models"*) for generating natural language output. Most modern large language models and foundation models use decoder-only structure (like *Brown et al., "Language Models Are Few-Shot Learners"*), focusing on auto-regressive generation of language output. + +The encoder-only structure theoretically excels at representation learning, and its produced embedding vectors can be applied to various downstream tasks. Recent developments have gradually moved towards decoder-only structure, centered around the idea of building models that are capable of directly generating the required final output of every downstream task. + +> For example, to perform sentiment analysis, BERT will compute an embedding vector for the query sentence, and the embedding vector can be used in a dedicated classifier to predict the sentiment label. GPT, on the other hand, can directly answer the question "what is the sentiment associated with the query sentence?" Comparatively, GPT is more versatile in most cases and can easily perform zero-shot prediction. + +Nevertheless, representation learning is still a relevant topic. The general understanding is that decoder-only structure cannot perform conventional representation learning, for example mapping a sentence into a fixed-dimension embedding vector. Yet, there are a few works in the latest ICLR that shed light on the utilization of LLMs as representation learning or embedding models: + +> Gao, Leo, Tom Dupre la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, and Jeffrey Wu. "Scaling and Evaluating Sparse Autoencoders," 2024. [Link](https://openreview.net/forum?id=tcsZt9ZNKD) + +> Li, Ziyue, and Tianyi Zhou. "Your Mixture-of-Experts LLM Is Secretly an Embedding Model for Free," 2024. [Link](https://openreview.net/forum?id=eFGQ97z5Cd) + +> Zhang, Jie, Dongrui Liu, Chen Qian, Linfeng Zhang, Yong Liu, Yu Qiao, and Jing Shao. "REEF: Representation Encoding Fingerprints for Large Language Models," 2024. [Link](https://openreview.net/forum?id=SnDmPkOJ0T) + +## Output Layer + +For language generation, Transformers typically use classifier output layers, mapping the latent vector of each item in the output sequence back to tokens. As we've established in the "modality embedding" section, the optimal method to embed continuous features is to quantize them into discrete tokens. Correspondingly, an intuitive method to output continuous features is to map these discrete tokens back to the continuous feature space, essentially reversing the vector quantization process. + +### Reverse Vector Quantization + +One approach to reverse vector quantization is readily available in VQ-VAE, since it is an auto-encoder. Given a token $i$, we can look up its embedding in the codebook as $\boldsymbol C_i$, then apply a decoder network to map $\boldsymbol C_i$ back to the continuous feature vector $\boldsymbol z$. 
The decoder network can either be pre-trained within the VQ-VAE framework (pre-train the VQ-VAE tokenizer, encoder, and decoder using auto-encoding loss functions) or trained end-to-end along with the whole Transformer. In the NLP and CV communities, the pre-training approach is more popular, since there are many large-scale pre-trained auto-encoders available. + +![](magvit.png) + +> The encoder-decoder structure of MAGVIT (*Yu et al., "MAGVIT"*), a visual VQ-VAE model. A 3D-VQ encoder quantizes a video into discrete tokens, and a 3D-VQ decoder maps them back to the pixel space. + +### Efficiency Enhancement + +For continuous feature generation, unlike language generation where the output tokens are the final output, we are essentially representing the final output with a limited-size token space. Thus, for complicated continuous features like images and videos, we have to expand the token space or use more tokens to represent one image or one video frame to improve generation quality, which can result in efficiency challenges. + +There are several workarounds to improve the efficiency of multi-modal outputs. One approach is to generate low-resolution outputs first, then use a separate super-resolution module to improve the quality of the output. This approach is explored in *Kondratyuk et al., "VideoPoet"* and *Tian et al., "Visual Autoregressive Modeling"*. Interestingly, the overall idea is very similar to NVIDIA's DLSS, where the graphics card renders a low-resolution frame (e.g., 1080p) using the conventional rasterization pipeline, then a super-resolution model increases the frame's resolution (e.g., 4K) utilizing the graphics card's tensor hardware, improving games' overall frame rate. + +Another workaround follows the idea of compression. Take video generation as an example. The model generates full features for key frames, and light-weight features for motion vectors that describe subtle differences from those key frames. This is essentially how inter-frame compressed video codecs work, taking advantage of temporal redundancy between neighboring frames. + +![](video-lavit.png) + +> Key frames and motion vectors used in *Jin et al., "Video-LaVIT."* + +## Fuse with Diffusion Models + +Despite continuous efforts to enable representation and generation of images and videos with a language model structure (auto-regressive), current research indicates that diffusion models (more broadly speaking, score-matching generative models) outperform language models on continuous feature generation. Score-matching generative models have their own separate and substantial community, with strong theoretical foundations and numerous variations emerging each year, such as stochastic differential equations, Bayesian flow, and rectified flow. In short, score-matching generative models are clearly here to stay alongside language models. + +An intriguing question arises: why not integrate the structures of language models and diffusion models into one Transformer to reach the best of both worlds? *Zhou et al. in "Transfusion"* explored this idea. The approach is straightforward: build a Transformer that can handle both language and image inputs and outputs. The language component functions as a language model, while the image component serves as a denoiser network for diffusion models. The model is trained by combining the language modeling loss and DDPM loss, enabling it to function either as a language model or a text-to-image denoiser.
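A highly simplified sketch of how such a combined objective can be written down, with hypothetical tensor names, masks, and weighting (an illustration of the idea, not the paper's exact recipe):

```python
import torch.nn.functional as F

def combined_loss(text_logits, text_targets, eps_pred, eps_true, text_mask, image_mask, lam=1.0):
    # language-modeling loss on text positions, DDPM-style noise-prediction loss on image positions
    lm_loss = F.cross_entropy(text_logits[text_mask], text_targets[text_mask])
    ddpm_loss = F.mse_loss(eps_pred[image_mask], eps_true[image_mask])
    return lm_loss + lam * ddpm_loss
```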
+ +![](transfusion.png) + +> A Transformer capable of function as a language model and a diffusion denoiser at the same time. Source: *Zhou, Chunting, Lili Yu, et al. "Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model," ICLR, 2025.* + +## Conclusion + +In conclusion, the evolution of Transformers into versatile foundation models capable of handling multiple modalities and functionalities represents a significant advancement in AI research. By enabling a single architecture to process diverse data types through techniques like vector quantization and lookup-free quantization, researchers have created models that can seamlessly integrate language, images, and other modalities within the same embedding space. + +In our research domain, we encounter even more diverse and domain-specific multi-modal data, such as traffic flows, trajectories, and real-world agent interactions. A unified Transformer for such data presents a promising solution for creating "foundation models" that generalize across diverse tasks and scenarios. However, domain-specific challenges, including data encoding and decoding, computational efficiency, and scalability, must be addressed to realize this potential. diff --git a/content/ml-tech/multi-modal-transformer/magvit.png b/content/ml-tech/multi-modal-transformer/magvit.png new file mode 100644 index 0000000..9e97efb Binary files /dev/null and b/content/ml-tech/multi-modal-transformer/magvit.png differ diff --git a/content/ml-tech/multi-modal-transformer/multi-modal-fusion.png b/content/ml-tech/multi-modal-transformer/multi-modal-fusion.png new file mode 100644 index 0000000..7db5437 Binary files /dev/null and b/content/ml-tech/multi-modal-transformer/multi-modal-fusion.png differ diff --git a/content/ml-tech/multi-modal-transformer/qkv-attention.png b/content/ml-tech/multi-modal-transformer/qkv-attention.png new file mode 100644 index 0000000..a7ef23b Binary files /dev/null and b/content/ml-tech/multi-modal-transformer/qkv-attention.png differ diff --git a/content/ml-tech/multi-modal-transformer/token-embedding.png b/content/ml-tech/multi-modal-transformer/token-embedding.png new file mode 100644 index 0000000..9315270 Binary files /dev/null and b/content/ml-tech/multi-modal-transformer/token-embedding.png differ diff --git a/content/ml-tech/multi-modal-transformer/transfusion.png b/content/ml-tech/multi-modal-transformer/transfusion.png new file mode 100644 index 0000000..47ff387 Binary files /dev/null and b/content/ml-tech/multi-modal-transformer/transfusion.png differ diff --git a/content/ml-tech/multi-modal-transformer/vector-quantization.png b/content/ml-tech/multi-modal-transformer/vector-quantization.png new file mode 100644 index 0000000..0e41599 Binary files /dev/null and b/content/ml-tech/multi-modal-transformer/vector-quantization.png differ diff --git a/content/ml-tech/multi-modal-transformer/video-lavit.png b/content/ml-tech/multi-modal-transformer/video-lavit.png new file mode 100644 index 0000000..e1c27c4 Binary files /dev/null and b/content/ml-tech/multi-modal-transformer/video-lavit.png differ diff --git a/content/ml-tech/multi-modal-transformer/video-poet.png b/content/ml-tech/multi-modal-transformer/video-poet.png new file mode 100644 index 0000000..11d9b92 Binary files /dev/null and b/content/ml-tech/multi-modal-transformer/video-poet.png differ diff --git a/content/ml-tech/ode-sde/average-velocity.png b/content/ml-tech/ode-sde/average-velocity.png new file mode 100644 index 0000000..b827212 Binary files /dev/null and 
b/content/ml-tech/ode-sde/average-velocity.png differ diff --git a/content/ml-tech/ode-sde/curvy-vector-field.png b/content/ml-tech/ode-sde/curvy-vector-field.png new file mode 100644 index 0000000..cec64f6 Binary files /dev/null and b/content/ml-tech/ode-sde/curvy-vector-field.png differ diff --git a/content/ml-tech/ode-sde/few-step-sampling.png b/content/ml-tech/ode-sde/few-step-sampling.png new file mode 100644 index 0000000..c526615 Binary files /dev/null and b/content/ml-tech/ode-sde/few-step-sampling.png differ diff --git a/content/ml-tech/ode-sde/flow-data-point.png b/content/ml-tech/ode-sde/flow-data-point.png new file mode 100644 index 0000000..a28bb55 Binary files /dev/null and b/content/ml-tech/ode-sde/flow-data-point.png differ diff --git a/content/ml-tech/ode-sde/index.md b/content/ml-tech/ode-sde/index.md new file mode 100644 index 0000000..cded50e --- /dev/null +++ b/content/ml-tech/ode-sde/index.md @@ -0,0 +1,286 @@ ++++ +title = "Shortcuts in ODE and SDE" +date = 2025-07-01 +description = "" ++++ +> **TL;DR:** In the context of generative modeling, we examine ODEs, SDEs, and two recent works that share the idea of learning shortcuts that traverse through vector fields defined by ODEs faster. We then discuss the generalization of this idea to both ODE- and SDE-based models. + +## Differential Equations + +Let's start with a general scenario of **generative modeling**: suppose you want to generate data $x$ that follows a distribution $p(x)$. In many cases, the exact form of $p(x)$ is unknown. What you can do is follow the idea of *normalizing flow*: start from a very simple, closed-form distribution $p(x_0)$ (for example, a standard normal distribution), transform this distribution through time $t\in [0, 1]$ with intermediate distributions $p(x_t)$, and finally obtain the estimated distribution $p(x_1)$. By doing this, you are essentially trying to solve a *differential equation (DE)* that depends on time: + +$$ +dx_t=\mu(x_t,t)dt+\sigma(x_t,t)dW_t,\quad x_0\sim p(x_0) +$$ + +where $\mu$ is the drift component that is deterministic, and $\sigma$ is the diffusion term driven by Brownian motion (denoted by $W_t$) that is stochastic. This differential equation specifies a *time-dependent vector (velocity) field* telling how a data point $x_t$ should be moved as time $t$ evolves from $t=0$ to $t=1$ (i.e., a *flow* from $x_0$ to $x_1$). Below we give an illustration where $x_t$ is 1-dimensional: + +![Vector field between two distributions](vector-field.png) +> Vector field between two distributions specified by a differential equation. + +When $\sigma(x_t,t)\equiv 0$, we get an *ordinary differential equation (ODE)* where the vector field is deterministic, i.e., the movement of $x_t$ is fully determined by $\mu$ and $t$. Otherwise, we get a *stochastic differential equation (SDE)* where the movement of $x_t$ has a certain level of randomness. Extending the previous illustration, below we show the difference in flow of $x_t$ under ODE and SDE: + +![ODE vs SDE movements](ode-sde-difference.png) +> Difference of movements in vector fields specified by ODE and SDE. *Source: Song, Yang, et al. "Score-based generative modeling through stochastic differential equations."* Note that their time is reversed. 
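As a toy illustration of this difference (with a hand-picked drift and diffusion that are not taken from any model discussed here), both kinds of trajectories can be simulated with simple Euler-style updates:

```python
import torch

mu = lambda x, t: -2.0 * (x - 1.0)  # hand-picked drift pulling x_t towards 1
sigma = 0.5                         # hand-picked, constant diffusion scale
N = 100
dt = 1.0 / N

x_ode = x_sde = torch.randn(())     # both trajectories start from the same x_0 ~ N(0, 1)
for k in range(N):
    t = k * dt
    x_ode = x_ode + dt * mu(x_ode, t)                                          # deterministic move
    x_sde = x_sde + dt * mu(x_sde, t) + (dt ** 0.5) * sigma * torch.randn(())  # move + Brownian kick
print(float(x_ode), float(x_sde))   # same drift, but the SDE path fluctuates around the ODE path
```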
As you would imagine, once we manage to solve the differential equation, even if we still cannot have a closed form of $p(x_1)$, we can sample from $p(x_1)$ by sampling a data point $x_0$ from $p(x_0)$ and get the generated data point $x_1$ by calculating the following forward-time integral with an integration technique of our choice: + +$$ +x_1 = x_0 + \int_0^1 \mu(x_t,t)dt + \int_0^1 \sigma(x_t,t)dW_t +$$ + +Or more intuitively, moving $x_0$ towards $x_1$ along time in the vector field: + +![Flow of data point](flow-data-point.png) +> A flow of a data point moving from $x_0$ towards $x_1$ in the vector field. + +## ODE and Flow Matching + +### ODE in Generative Modeling + +For now, let's focus on the ODE formulation since it is notationally simpler compared to SDE. Recall the ODE of our generative model: + +$$ +\frac{dx_t}{dt}=\mu(x_t,t) +$$ + +Essentially, $\mu$ is the vector field. For every possible combination of data point $x_t$ and time $t$, $\mu(x_t,t)$ is the instantaneous velocity with which the point will move. To generate a data point $x_1$, we perform the integral: + +$$ +x_1=x_0+\int_0^1 \mu(x_t,t)dt +$$ + +To calculate this integral, a simple and widely adopted method is the Euler method. Choose $N$ time steps $0=t_0 < t_1 < \cdots < t_N=1$, and iteratively move the data point along the instantaneous velocity of each step: + +$$ +x_{t_{k+1}}=x_{t_k}+(t_{k+1}-t_k)\mu(x_{t_k},t_k) +$$ + +> **Note:** There are other methods to calculate the integral, of course. For example, one can use the solvers in the `torchdiffeq` Python package. + +### Flow Matching + +In many scenarios, the exact form of the vector field $\mu$ is unknown. The general idea of *flow matching* is to find a ground truth vector field that defines the *flow* transporting $p(x_0)$ to $p(x_1)$, and build a neural network $\mu_\theta$ that is trained to *match* the ground truth vector field, hence the name. In practice, this is usually done by independently sampling $x_0$ from the noise and $x_1$ from the training data, calculating the intermediate data point $x_t$ and the ground truth velocity $\mu(x_t,t)$, and minimizing the deviation between $\mu_\theta(x_t,t)$ and $\mu(x_t,t)$. + +Ideally, the ground truth vector field should be as straight as possible, so we can use a small number $N$ of steps to calculate the ODE integral. Thus, the ground truth velocity is usually defined following the optimal transport flow map: + +$$ +x_t=tx_1+(1-t)x_0,\quad\mu(x_t,t)=x_1-x_0 +$$ + +And a neural network $\mu_\theta$ is trained to match the ground truth vectors as: + +$$ +\mathcal L=\mathbb E_{x_0,x_1,t}\| \mu_\theta(x_t,t)-(x_1-x_0)\|^2 +$$ + +### Curvy Vector Field + +Although the ground truth vector field is designed to be straight, in practice it usually is not. When the data space is high-dimensional and the target distribution $p(x_1)$ is complex, there will be multiple pairs of $(x_0, x_1)$ that result in the same intermediate data point $x_t$, thus multiple velocities $x_1-x_0$. At the end of the day, the actual ground truth velocity at $x_t$ will be the average of all possible velocities $x_1-x_0$ that pass through $x_t$. This will lead to a "curvy" vector field, illustrated as follows: + +![Curvy vector field](curvy-vector-field.png) +> Left: multiple vectors passing through the same intermediate data point. Right: the resulting ground truth vector field. *Source: Geng, Zhengyang, et al. "Mean Flows for One-step Generative Modeling."* Note $z_t$ and $v$ in the figure correspond to $x_t$ and $\mu$ in this post, respectively. + +As we discussed, when you calculate the ODE integral, you are using the instantaneous velocity--the tangent of the curves in the vector field--at each step.
You would imagine this will lead to subpar performance when using a small number $N$ of steps, as demonstrated below: + +![Few-step sampling failure](few-step-sampling.png) +> Native flow matching models fail at few-step sampling. *Source: Frans, Kevin, et al. "One step diffusion via shortcut models."* + +### Shortcut Vector Field + +If we cannot straighten the ground truth vector field, can we tackle the problem of few-step sampling by learning velocities that properly jump across long time steps instead of learning the instantaneous velocities? Yes, we can. + +#### Shortcut Models + +Shortcut models implement the above idea by training a network $u_\theta(x_t,t,\Delta t)$ to match the *velocities that jump across long time steps* (termed *shortcuts* in the paper). A ground truth shortcut $u(x_t,t,\Delta t)$ will be the velocity pointing from $x_t$ to $x_{t+\Delta t}$, formally: + +$$ +u(x_t,t,\Delta t)=\frac{1}{\Delta t}\int_t^{t+\Delta t}\mu(x_\tau,\tau)d\tau +$$ + +Ideally, you can transform $x_0$ to $x_1$ within one step with the learned shortcuts: + +$$ +x_1\approx x_0+u_\theta(x_0,0,1) +$$ + +> **Note:** Of course, in practice shortcut models face the same problem mentioned in the [Curvy Vector Field](#curvy-vector-field): the same data point $x_1$ corresponds to multiple shortcut velocities to different data points $x_0$, making the ground truth shortcut velocity at $x_1$ the average of all possibilities. So, shortcut models have a performance advantage with few sampling steps compared to conventional flow matching models, but their one-step performance typically still lags behind their multi-step performance. + +The theory is quite straightforward. The tricky part is in the model training. First, the network expands from learning all possibilities of velocities at $(x_t,t)$ to all velocities at $(x_t,t, \Delta t)$ with $\Delta t\in [0, 1-t]$. Essentially, the shortcut vector field has one more dimension than the instantaneous vector field, making the learning space larger. Second, computing the ground truth shortcut involves evaluating an integral, which can be computationally heavy. + +To tackle these challenges, shortcut models introduce *self-consistency shortcuts*: one shortcut with step size $2\Delta t$ should equal two consecutive shortcuts both with step size $\Delta t$: + +$$ +u(x_t,t,2\Delta t)=u(x_t,t,\Delta t)/2+u(x_{t+\Delta t},t+\Delta t,\Delta t)/2 +$$ + +The model is then trained with the combination of matching instantaneous velocities and self-consistency shortcuts as below. Notice that we don't train a separate network for matching the instantaneous vectors but leverage the fact that the shortcut $u(x_t,t,\Delta t)$ is the instantaneous velocity when $\Delta t\rightarrow 0$. + +$$ +\mathcal{L} = \mathbb{E}_{x_0,x_1,t,\Delta t} [ \underbrace{\| u_\theta(x_t, t, 0) - (x_1 - x_0)\|^2}_{\text{Flow-Matching}} + +\underbrace{\|u_\theta(x_t, t, 2\Delta t) - \text{sg}(\mathbf{u}_{\text{target}})\|^2}_{\text{Self-Consistency}}], +$$ +$$ +\quad \mathbf{u}_{\text{target}} = u_\theta(x_t, t, \Delta t)/2 + u_\theta(x'_{t+\Delta t}, t + \Delta t, \Delta t)/2 \quad +\text{and} \quad x'_{t+\Delta t} = x_t + \Delta t \cdot u_\theta(x_t, t, \Delta t) +$$ + +Where $\text{sg}$ is stop gradient, i.e., detach $\mathbf{u}_\text{target}$ from back propagation, making it a pseudo ground truth. Below is an illustration of the training process provided in the original paper. + +![Shortcut model training](shortcut-training.png) +> Training of the shortcut models with self-consistency loss.
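Putting the pieces together, a minimal PyTorch sketch of this training objective could look as follows; the interface of `u_theta` and the tensor shapes are assumptions:

```python
import torch

def shortcut_loss(u_theta, x0, x1, t, dt):
    # x0: noise, x1: data; t and dt: (B, 1) tensors with t + 2*dt <= 1
    x_t = t * x1 + (1 - t) * x0                  # optimal-transport interpolation

    # flow-matching term: dt = 0 recovers the instantaneous velocity
    fm = ((u_theta(x_t, t, torch.zeros_like(dt)) - (x1 - x0)) ** 2).mean()

    # self-consistency term: one 2*dt shortcut should match two consecutive dt shortcuts
    u1 = u_theta(x_t, t, dt)
    x_next = x_t + dt * u1
    u2 = u_theta(x_next, t + dt, dt)
    target = (u1 / 2 + u2 / 2).detach()          # sg(.) as a pseudo ground truth
    sc = ((u_theta(x_t, t, 2 * dt) - target) ** 2).mean()
    return fm + sc
```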
+ +#### Mean Flow + +Mean flow is another work sharing the idea of learning velocities that take large-step-size shortcuts, but with a stronger theoretical foundation and a different approach to training. + +![Average velocity illustration](average-velocity.png) +> Illustration of the average velocity provided in the original paper. + +Mean flow defines an *average velocity* as a shortcut between times $t$ and $r$ where $t$ and $r$ are independent: + +$$ +u(x_t,r,t)=\frac{1}{t-r}\int_{r}^t \mu(x_\tau,\tau)d\tau +$$ + +This average velocity is essentially equivalent to a *shortcut* in shortcut models given $\Delta t=t-r$. What differentiates mean flow from shortcut models is that mean flow aims to provide a ground truth of the vector field defined by $u(x_t,r,t)$, and to directly train a network $u_\theta(x_t,r,t)$ to match that ground truth. + +We transform the above equation by differentiating both sides with respect to $t$ and rearranging components, and get: + +$$ +u(x_t,r,t)=\mu(x_t,t)+(r-t)\frac{d}{dt}u(x_t,r,t) +$$ + +We get the average velocity on the left, and the instantaneous velocity and the time derivative components on the right. This defines the ground truth average vector field, and our goal now is to calculate the right side. We already know that the ground truth instantaneous velocity $\mu(x_t,t)=x_1-x_0$. To compute the time derivative component, we can expand it in terms of partial derivatives: + +$$ +\frac{d}{dt}u(x_t,r,t)=\frac{dx_t}{dt}\partial_x u+\frac{dr}{dt}\partial_r u+\frac{dt}{dt}\partial_t u +$$ + +From the ODE definition, $dx_t/dt=\mu(x_t,t)$, and $dt/dt=1$. Since $t$ and $r$ are independent, ${dr}/{dt}=0$. Thus, we have: + +$$ +\frac{d}{dt}u(x_t,r,t)=\mu(x_t,t)\partial_x u+\partial_t u +$$ + +This means the time derivative component is the vector product between $[\partial_x u,\partial_r u,\partial_t u]$ and $[\mu,0,1]$. In practice, this can be computed using the Jacobian vector product (JVP) functions in NN libraries, such as the `torch.func.jvp` function in PyTorch. In summary, the mean flow loss function is: + +$$ +\mathcal L=\mathbb E_{x_t,r,t}\|u_\theta(x_t,r,t)-\text{sg}(\mu(x_t,t)+(r-t)(\mu(x_t,t)\partial_x u_\theta+\partial_t u_\theta))\|^2 +$$ + +Notice that the JVP computation inside $\text{sg}$ is performed with the network $u_\theta$ itself. In this regard, this loss function shares a similar idea with the self-consistency loss in shortcut models--supervising the network with output produced by the network itself. + +> **Note:** While the loss function of mean flow is directly derived from the integral definition of shortcuts/average velocities, the self-consistency loss in shortcut models is also implicitly simulating the integral definition. If we expand a shortcut $u(x_t,t,\Delta t)$ following the idea of self-consistency recursively $n$ times, splitting it into $2^n$ sub-shortcuts: +> $$ +> u(x_t,t,\Delta t)=\frac{1}{2^n}\sum_{k=0}^{2^n-1}u(x_{t+k\Delta t/2^n},t+k\Delta t/2^n,\Delta t/2^n) +> $$ +> When $n\rightarrow +\infty$ we essentially recover the integral definition. + +#### Extended Reading: Rectified Flow + +Both shortcut models and mean flow are built on top of the ground truth curvy ODE field. They don't modify the field $\mu$, but rather try to learn shortcut velocities that can traverse the field with fewer Euler steps. This is reflected in their loss function design: shortcut models' loss function specifically includes a standard flow matching component, and mean flow's loss function is derived from the relationship between vector fields $\mu$ and $u$.
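As a concrete aside before continuing, here is a minimal PyTorch sketch of the mean flow loss using `torch.func.jvp` as mentioned above; the interface of `u_theta` and the tensor shapes are assumptions:

```python
import torch

def mean_flow_loss(u_theta, x0, x1, r, t):
    # x0: noise, x1: data; r, t: (B, 1) time tensors with r <= t
    x_t = t * x1 + (1 - t) * x0   # optimal-transport interpolation
    v = x1 - x0                   # ground-truth instantaneous velocity mu(x_t, t)

    # forward-mode JVP along the tangent (v, 0, 1) yields the total time derivative d/dt u(x_t, r, t)
    _, dudt = torch.func.jvp(
        u_theta, (x_t, r, t), (v, torch.zeros_like(r), torch.ones_like(t))
    )
    target = (v + (r - t) * dudt).detach()   # sg(.): treated as a pseudo ground truth
    u = u_theta(x_t, r, t)                   # ordinary forward pass kept in the autograd graph
    return ((u - target) ** 2).mean()
```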
+ +Rectified flow, another family of flow matching models that aims to achieve one-step sampling, is fundamentally different in this regard. It aims to replace the original ground truth ODE field with a new one with straight flows. Ideally, the resulting ODE field has zero curvature, enabling one-step integration with the simple Euler method. This usually involves augmentation of the training data and a repeated *reflow* process. + +We won't discuss rectified flow in further detail in this post, but it's worth pointing out its difference from shortcut models and mean flow. + +## SDE and Score Matching + +### SDE in Generative Modeling + +SDE, as its name suggests, is a differential equation with a stochastic component. Recall the general differential equation we introduced at the beginning: + +$$ +dx_t=\mu(x_t,t)dt+\sigma(x_t,t)dW_t +$$ + +In practice, the diffusion term $\sigma$ usually only depends on $t$, so we will use the simpler formula going forward: + +$$ +dx_t=\mu(x_t,t)dt+\sigma(t)dW_t +$$ + +$W_t$ is the Brownian motion (a.k.a. standard Wiener process). In practice, its behavior over time $t$ can be described as $W_{t+\Delta t}-W_t\sim \mathcal N(0, \Delta t)$. This is the source of SDE's stochasticity, and also why people like to call the family of SDE-based generative models *diffusion models*, since Brownian motion is derived from physical diffusion processes. + +In the context of generative modeling, the stochasticity in SDE means it can theoretically handle augmented data or data that is stochastic in nature (e.g., financial data) more gracefully. Practically, it also enables techniques such as stochastic control guidance. At the same time, it also means SDE is mathematically more complicated compared to ODE. We no longer have a deterministic vector field $\mu$ specifying flows of data points $x_0$ moving towards $x_1$. Instead, both $\mu$ and $\sigma$ have to be designed to ensure that the SDE leads to the target distribution $p(x_1)$ we want. + +To solve the SDE, similar to the Euler method used for solving ODE, we can use the Euler-Maruyama method: + +$$ +x_{t_{k+1}}=x_{t_k}+(t_{k+1}-t_k)\mu(x_t,t)+\sqrt{t_{k+1}-t_k}\sigma(t)\epsilon,\quad \epsilon\sim\mathcal N(0,1) +$$ + +In other words, we move the data point guided by the velocity $\mu(x_t,t)$ plus a bit of Gaussian noise scaled by $\sqrt{t_{k+1}-t_k}\sigma(t)$. + +### Score Matching + +In SDE, the exact form of the vector field $\mu$ is still (quite likely) unknown. To solve this, the general idea is consistent with flow matching: we want to find the ground truth $\mu(x_t,t)$ and build a neural network $\mu_\theta(x_t,t)$ to match it. + +Score matching models implement this idea by parameterizing $\mu$ as: + +$$ +\mu(x_t,t)=v(x_t,t)+\frac{\sigma^2(t)}{2}\nabla \log p(x_t) +$$ + +where $v(x_t,t)$ is a velocity similar to that in ODE, and $\nabla \log p(x_t)$ is the *score (a.k.a. informant)* of $x_t$. Without going too deep into the relevant theories, think of the score as a "compass" that points in the direction where $x_t$ becomes more likely to belong to the distribution $p(x_1)$. The beauty of introducing the score is that depending on the definition of ground truth $x_t$, the velocity $v$ can be derived from the score, or vice versa. 
Then, we only have to focus on building a learnable *score function* $s_\theta(x_t,t)$ to *match* the ground truth score using the loss function below, hence the name score matching: + +$$ +\mathcal L=\mathbb E_{x_t,t} \|s_\theta(x_t,t)-\nabla \log p(x_t)\|^2 +$$ + +For example, if we have time-dependent coefficients $\alpha_t$ and $\beta_t$ (termed noise schedulers in most diffusion models), and define that $x_t$ follows the distribution given a clean data point $x_1$: + +$$ +p(x_t)=\mathcal N(\alpha_t x_1,\beta^2_t) +$$ + +then we will have: + +$$ +\nabla \log p(x_t)=-\frac{x_t-\alpha_t x_1}{\beta^2_t},\quad +v(x_t,t) = \left(\beta_t^2 \frac{\partial_t\alpha_t}{\alpha_t} - (\partial_t\beta_t) \beta_t\right) \nabla \log p(x_t) + \frac{\partial_t\alpha_t}{\alpha_t} x_t +$$ + +Some works also propose to re-parameterize the score function with noise $\epsilon$ sampled from a standard normal distribution, so that the neural network can be a learnable denoiser $\epsilon_\theta(x_t,t)$ that matches the noise rather than the score. Since $s_\theta=-\epsilon_\theta / \sigma(t)$, both approaches are equivalent. + +### Shortcuts in SDE + +Most existing efforts sharing the idea of shortcut vector fields are grounded in ODEs. However, given the correlations between SDE and ODE, learning an SDE that follows the same idea should be straightforward. Generally speaking, SDE training, similar to ODE, focuses on the deterministic drift component $\mu$. One should be able to, for example, use the same mean flow loss function to train a score function for solving an SDE. + +> **Note:** Needless to say, generalizing shortcut models and mean flow to flow matching models with ground truth vector fields other than optimal transport flow requires no modification either, since most such models (e.g., Bayesian flow) are ultimately grounded in ODE. + +One caveat of training a "shortcut SDE" is that the ideal result of one-step sampling contradicts the stochastic nature of SDE--if you are going to perform the sampling in one step, you are probably better off using ODE to begin with. Still, I believe it would be useful to train an SDE so that its benefits versus ODE are preserved, while still enabling the lowering of sampling steps $N$ for improved computational efficiency. + +Below are some preliminary results I obtained from a set of amorphous material generation experiments. You don't need to understand the figure--just know that it shows that applying the idea of learning shortcuts to SDE does yield better results compared to the vanilla SDE when using few-step sampling. + +![SDE shortcut results](sde-results.png) +> Structural functions of generated materials, sampled in 10 steps. + +--- + +**References:** + +1. Holderrieth and Erives, "An Introduction to Flow Matching and Diffusion Models." +2. Song and Ermon, "Generative Modeling by Estimating Gradients of the Data Distribution." +3. Rezende, Danilo, and Shakir Mohamed. "Variational inference with normalizing flows." +4. https://en.wikipedia.org/wiki/Differential_equation +5. https://en.wikipedia.org/wiki/Brownian_motion +6. https://en.wikipedia.org/wiki/Vector_field +7. https://en.wikipedia.org/wiki/Vector_flow +8. https://en.wikipedia.org/wiki/Ordinary_differential_equation +9. https://en.wikipedia.org/wiki/Stochastic_differential_equation +10. https://en.wikipedia.org/wiki/Euler_method +11. https://github.com/rtqichen/torchdiffeq +12. Lipman, Yaron, et al. "Flow matching for generative modeling." +13. Frans, Kevin, et al. 
"One step diffusion via shortcut models." +14. Geng, Zhengyang, et al. "Mean Flows for One-step Generative Modeling." +15. Liu, Xingchao, Chengyue Gong, and Qiang Liu. "Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow." +16. Ho, Jonathan, Ajay Jain, and Pieter Abbeel. "Denoising diffusion probabilistic models." +17. https://en.wikipedia.org/wiki/Diffusion_process +18. Huang et al., "Symbolic Music Generation with Non-Differentiable Rule Guided Diffusion." +19. https://en.wikipedia.org/wiki/Euler–Maruyama_method +20. Song et al., "Score-Based Generative Modeling through Stochastic Differential Equations." +21. https://en.wikipedia.org/wiki/Informant_(statistics) diff --git a/content/ml-tech/ode-sde/ode-sde-difference.png b/content/ml-tech/ode-sde/ode-sde-difference.png new file mode 100644 index 0000000..6ca288d Binary files /dev/null and b/content/ml-tech/ode-sde/ode-sde-difference.png differ diff --git a/content/ml-tech/ode-sde/sde-results.png b/content/ml-tech/ode-sde/sde-results.png new file mode 100644 index 0000000..8ab6e07 Binary files /dev/null and b/content/ml-tech/ode-sde/sde-results.png differ diff --git a/content/ml-tech/ode-sde/shortcut-training.png b/content/ml-tech/ode-sde/shortcut-training.png new file mode 100644 index 0000000..1350273 Binary files /dev/null and b/content/ml-tech/ode-sde/shortcut-training.png differ diff --git a/content/ml-tech/ode-sde/vector-field.png b/content/ml-tech/ode-sde/vector-field.png new file mode 100644 index 0000000..f4f151b Binary files /dev/null and b/content/ml-tech/ode-sde/vector-field.png differ diff --git a/content/ml-tech/one-step-diffusion-models/consistency-model.png b/content/ml-tech/one-step-diffusion-models/consistency-model.png new file mode 100644 index 0000000..8c5b463 Binary files /dev/null and b/content/ml-tech/one-step-diffusion-models/consistency-model.png differ diff --git a/content/ml-tech/one-step-diffusion-models/diffusion-process.png b/content/ml-tech/one-step-diffusion-models/diffusion-process.png new file mode 100644 index 0000000..5b261ec Binary files /dev/null and b/content/ml-tech/one-step-diffusion-models/diffusion-process.png differ diff --git a/content/ml-tech/one-step-diffusion-models/dm-scale-poorly.png b/content/ml-tech/one-step-diffusion-models/dm-scale-poorly.png new file mode 100644 index 0000000..da6aedf Binary files /dev/null and b/content/ml-tech/one-step-diffusion-models/dm-scale-poorly.png differ diff --git a/content/ml-tech/one-step-diffusion-models/few-steps-results.png b/content/ml-tech/one-step-diffusion-models/few-steps-results.png new file mode 100644 index 0000000..f8dd53f Binary files /dev/null and b/content/ml-tech/one-step-diffusion-models/few-steps-results.png differ diff --git a/content/ml-tech/one-step-diffusion-models/index.md b/content/ml-tech/one-step-diffusion-models/index.md new file mode 100644 index 0000000..232fb2d --- /dev/null +++ b/content/ml-tech/one-step-diffusion-models/index.md @@ -0,0 +1,140 @@ ++++ +title = "One Step Diffusion Models" +date = 2025-05-12 +description = "" ++++ + +> **TL;DR:** +> Despite the promising performance of diffusion models on continuous modality generation, one deficiency that is holding them back is their requirement for multi-step denoising processes, which can be computationally expensive. In this article, we examine recent works that aim to build diffusion models capable of performing sampling in one or a few steps. 
+ +## Background + +Diffusion models (DMs), or more broadly speaking, score-matching generative models, have become the de facto framework for building deep generative models. They demonstrate exceptional generation performance, especially on continuous modalities including images, videos, audio, and spatiotemporal data. + +Most diffusion models work by coupling a forward diffusion process and a reverse denoising diffusion process. The forward diffusion process gradually adds noise to the ground truth clean data $X_0$, until noisy data $X_T$ that follows a relatively simple distribution is reached. The reverse denoising diffusion process starts from the noisy data $X_T$, and removes the noise component step-by-step until clean generated data $X_0$ is reached. The reverse process is essentially a sequential (Monte-Carlo) sampling process, meaning its steps cannot be parallelized within a single generation, which can be inefficient when the number of steps is large. + +![](diffusion-process.png) + +> The two processes in a typical diffusion model. *Source: Ho, Jain, and Abbeel, "Denoising Diffusion Probabilistic Models."* + +### Understanding DMs + +There are many ways to understand how Diffusion Models (DMs) work. One of the most common and intuitive approaches is that a DM learns an ordinary differential equation (ODE) or a stochastic differential equation (SDE) that transforms noise into data. Imagine a vector field between the noise $X_T$ and clean data $X_0$. By training on a sufficiently large number of timesteps $t\in [0,T]$, a DM is able to learn the vector (tangent) towards the cleaner data $X_{t-\Delta t}$, given any specific timestep $t$ and the corresponding noisy data $X_t$. This idea is easy to illustrate in a simplified 1-dimensional data scenario. + +![](ode-sde-flow.png) + +> Illustrated ODE and SDE flow of a diffusion model on 1-dimensional data. *Source: Song et al., "Score-Based Generative Modeling through Stochastic Differential Equations."* + +### DMs Scale Poorly with Few Steps + +Vanilla DDPM, which is essentially a discrete-timestep DM, can only perform the reverse process using the same number of steps it is trained on, typically thousands. DDIM introduces a reparameterization scheme that enables skipping steps during the reverse process of DDPM. Continuous-timestep DMs like Stochastic Differential Equations (SDE) naturally possess the capability of using fewer steps in the reverse process compared to the forward process/training. + +> Ho, Jain, and Abbeel, "Denoising Diffusion Probabilistic Models." +> Song, Meng, and Ermon, "Denoising Diffusion Implicit Models." +> Song et al., "Score-Based Generative Modeling through Stochastic Differential Equations." + +Nevertheless, it is observed that their performance typically suffers catastrophic degradation when reducing the number of reverse process steps to single digits. + +![](few-steps-results.png) + +> Images generated by conventional DMs with only a few reverse process steps. *Source: Frans et al., "One Step Diffusion via Shortcut Models."* + +To understand why DMs scale poorly with few reverse process steps, we can return to the vector field perspective of DMs. When the target data distribution is complex, the vector field typically contains numerous intersections. When a given $X_t$ and $t$ lands on one of these intersections, the learned vector points in the averaged direction of all candidates. This causes the generated data to approach the mean of the training data when only a few reverse process steps are used.
Another explanation is that the learned vector field is highly curved. Using only a few reverse process steps means attempting to approximate these curves with polylines, which is inherently difficult. + +![](dm-scale-poorly.png) + +> Illustration of why DMs scale poorly with few reverse process steps. *Source: Frans et al., "One Step Diffusion via Shortcut Models."* + +We will introduce two branches of methods that aim to scale DMs down to a few or even a single reverse process step: **distillation-based**, which distills a pre-trained DM into a one-step model; and **end-to-end-based**, which trains a one-step DM from scratch. + +## Distillation + +Distillation-based methods are also called **rectified flow** methods. Their idea follows the above insight of "curved ODE vector field": if the curved vectors (flows) are hindering the scaling of reverse process steps, can we try to straighten these vectors so that they are easy to approximate with polylines or even straight lines? + +*Liu, Gong, and Liu, "Flow Straight and Fast"* implements this idea, focusing on learning an ODE that follows straight vectors as much as possible. In the context of continuous time DMs where $T=1$ and $t\in[0,1]$, suppose the clean data $X_0$ and noise $X_1$ each follows a data distribution, $X_0\sim \pi_0$ and $X_1\sim \pi_1$. The "straight vectors" can be achieved by solving a nonlinear least squares optimization problem: + +{% math() %} +\min_{v} \int_{0}^{1} \mathbb{E}\left[\left\|\left(X_{1}-X_{0}\right)-v\left(X_{t}, t\right)\right\|^{2}\right] \mathrm{d} t, +{% end %} + +{% math() %} +\quad X_{t}=t X_{1}+(1-t) X_{0} +{% end %} + +Where $v$ is the vector field of the ODE $dZ_t = v(Z_t,t)dt$. + +Though straightforward, this approach struggles when the clean data distribution $\pi_0$ is very complicated: the ideal result of completely straight vectors can be hard to achieve. To address this, a "reflow" procedure is introduced. This procedure iteratively trains new rectified flows using data generated by previously obtained flows: + +$$ +Z^{(k+1)} = \text{RectFlow}((Z_0^{(k)}, Z_1^{(k)})) +$$ + +This procedure produces increasingly straight flows that can be simulated with very few steps, ideally one step after several iterations. + +![](reflow-iterations.png) + +> Illustrations of vector fields after different numbers of reflow iterations. *Source: Liu, Gong, and Liu, "Flow Straight and Fast."* + +In practice, distillation-based methods are usually trained in two stages: first train a normal DM, and later distill one-step capabilities into it. This introduces additional computational overhead and complexity. + +## End-to-end + +Compared to distillation-based methods, end-to-end-based methods train a one-step-capable diffusion model (DM) within a single training run. Various techniques are used to implement such methods. We will focus on two of them: **consistency models** and **shortcut models**. + +### Consistency Models + +In discrete-timestep diffusion models (DMs), three components in the reverse denoising diffusion process are interchangeable through reparameterization: the noise component $\epsilon_t$ to remove, the less noisy previous step $x_{t-1}$, and the predicted clean sample $x_0$. This interchangeability is enabled by the following equation: + +{% math() %} +x_t = \sqrt{\bar{\alpha}_t} \, x_0 + \sqrt{1 - \bar{\alpha}_t} \, \epsilon_t +{% end %} + +In theory, without altering the fundamental formulation of DMs, the learnable denoiser network can be designed to predict any of these three components.
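For instance, a minimal sketch of converting between a noise prediction and a clean-sample prediction via the equation above (assuming `alpha_bar_t` is a tensor holding $\bar{\alpha}_t$, broadcastable against `x_t`):

```python
import torch

def eps_to_x0(x_t, eps_pred, alpha_bar_t):
    # invert x_t = sqrt(a)*x0 + sqrt(1-a)*eps to recover the predicted clean sample
    return (x_t - (1 - alpha_bar_t).sqrt() * eps_pred) / alpha_bar_t.sqrt()

def x0_to_eps(x_t, x0_pred, alpha_bar_t):
    # the same relation solved for the noise component instead
    return (x_t - alpha_bar_t.sqrt() * x0_pred) / (1 - alpha_bar_t).sqrt()
```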
Consistency models (CMs) follow this principle by training the denoiser to specifically predict the clean sample $x_0$. The benefit of this approach is that CMs can naturally scale to perform the reverse process with few steps or even a single step. + +![](consistency-model.png) + +> A consistency model that learns to map any point on the ODE trajectory to the clean sample. *Source: Song et al., "Consistency Models."* + +Formally, CMs learn a function $f_\theta(x_t,t)$ that maps noisy data $x_t$ at time $t$ directly to the clean data $x_0$, satisfying: + +{% math() %} +f_\theta(x_t, t) = f_\theta(x_{t'}, t') \quad \forall t, t' +{% end %} + +The model must also obey the differential consistency condition: + +{% math() %} +\frac{d}{dt} f_\theta(x_t, t) = 0 +{% end %} + +CMs are trained by minimizing the discrepancy between outputs at adjacent times, with the loss function: + +{% math() %} +\mathcal{L} = \mathbb{E} \left[ d\left(f_\theta(x_t, t), f_\theta(x_{t'}, t')\right) \right] +{% end %} + +Similar to continuous-timestep DMs and discrete-timestep DMs, CMs also have continuous-time and discrete-time variants. Discrete-time CMs are easier to train, but are more sensitive to timestep scheduling and suffer from discretization errors. Continuous-time CMs, on the other hand, suffer from instability during training. + +For a deeper discussion of the differences between the two variants of CMs, and how to stabilize continuous-time CMs, please refer to *Lu and Song, "Simplifying, Stabilizing and Scaling Continuous-Time Consistency Models."* + +### Shortcut Models + +Similar to distillation-based methods, the core idea of shortcut models is inspired by the "curved vector field" problem, but shortcut models take a different approach to solving it. + +Shortcut models are introduced in *Frans et al., "One Step Diffusion via Shortcut Models."* The paper presents the insight that conventional DMs' poor performance when jumping with large step sizes stems from their lack of awareness of the step size they are expected to jump forward. Since they are only trained to comply with small step sizes, they learn only the tangents in the curved vector field, not the "correct direction" when a large step size is used. + +Based on this insight, on top of $x_t$ and $t$, shortcut models additionally include the step size $d$ as part of the condition for the denoiser network. At small step sizes ($d\rightarrow 0$), the model behaves like a standard flow-matching model, learning the expected tangent from noise to data. For larger step sizes, the model learns that one large step should equal two consecutive smaller steps (self-consistency), creating a binary recursive formulation. The model is trained by combining the standard flow matching loss when $d=0$ and the self-consistency loss when $d>0$: + +{% math() %} +\mathcal{L} = \mathbb{E} [ \underbrace{\| s_\theta(x_t, t, 0) - (x_1 - x_0)\|^2}_{\text{Flow-Matching}} + \underbrace{\|s_\theta(x_t, t, 2d) - \mathbf{s}_{\text{target}}\|^2}_{\text{Self-Consistency}}], +{% end %} + +{% math() %} +\mathbf{s}_{\text{target}} = s_\theta(x_t, t, d)/2 + s_\theta(x'_{t+d}, t + d, d)/2 \quad \text{and} \quad x'_{t+d} = x_t + s_\theta(x_t, t, d)d +{% end %} + +![](shortcut-training.png) + +> Illustration of the training process of shortcut models. *Source: Frans et al., "One Step Diffusion via Shortcut Models."* + +Both consistency models and shortcut models can be seamlessly scaled between one-step and multi-step generation to balance quality and efficiency.
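As an illustration, a few-step sampling loop with such a step-size-conditioned model might look like the sketch below; it assumes the convention used in the shortcut formulation above, where $t$ runs from the noise end ($t=0$) to the data end ($t=1$), and the interface of `s_theta` is hypothetical:

```python
import torch

@torch.no_grad()
def sample_few_steps(s_theta, shape, n_steps=4):
    x = torch.randn(shape)   # start from pure noise
    d = 1.0 / n_steps        # constant step size
    for k in range(n_steps):
        t = torch.full((shape[0], 1), k * d)
        x = x + d * s_theta(x, t, torch.full_like(t, d))  # one shortcut jump of size d
    return x
```

Setting `n_steps=1` recovers one-step generation, while larger values trade compute for quality.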
diff --git a/content/ml-tech/one-step-diffusion-models/ode-sde-flow.png b/content/ml-tech/one-step-diffusion-models/ode-sde-flow.png new file mode 100644 index 0000000..c3c493c Binary files /dev/null and b/content/ml-tech/one-step-diffusion-models/ode-sde-flow.png differ diff --git a/content/ml-tech/one-step-diffusion-models/reflow-iterations.png b/content/ml-tech/one-step-diffusion-models/reflow-iterations.png new file mode 100644 index 0000000..2d3efe8 Binary files /dev/null and b/content/ml-tech/one-step-diffusion-models/reflow-iterations.png differ diff --git a/content/ml-tech/one-step-diffusion-models/shortcut-training.png b/content/ml-tech/one-step-diffusion-models/shortcut-training.png new file mode 100644 index 0000000..55b3d0c Binary files /dev/null and b/content/ml-tech/one-step-diffusion-models/shortcut-training.png differ diff --git a/content/ml-tech/rotary-pe/index.md b/content/ml-tech/rotary-pe/index.md new file mode 100644 index 0000000..ec1b1b4 --- /dev/null +++ b/content/ml-tech/rotary-pe/index.md @@ -0,0 +1,173 @@ ++++ +title = "Encoding Relative Positions with RoPE" +date = 2025-11-19 +description = "" ++++ + +The [Transformer](https://en.wikipedia.org/wiki/Transformer_(deep_learning)) network is position-agnostic. +In other words, it doesn't care about the order of the input sequence. +You can easily see why from how the attention weights are calculated in the Attention mechanism, the primary component of the Transformer: + +$$ +\text{softmax}\left(\frac{\mathbf{q}_m^T \mathbf{k}_n}{\sqrt{|D|}}\right) +$$ + +It only depends on the content of the queries and keys, not on where they are positioned in the sequence. +But in many cases, we want the network to be able to distinguish the positions of tokens in a sequence. +For example, you certainly don't want the network to interpret the sentence "Jack beats John" exactly the same as "John beats Jack". + +Thus the idea of positional encoding is born: we can explicitly include information about the positions of queries/keys in their content. +The network can then distinguish "1-Jack 2-beats 3-John" from "1-John 2-beats 3-Jack" even though the attention mechanism itself is order-agnostic. + +## Vanilla Positional Encoding (PE) + +### Formulation + +The original Transformer paper recognized this limitation in position-awareness and introduced the vanilla positional encoding (PE for short). +For an input token at position $pos$, PE is a multi-dimensional vector whose even and odd dimensions are calculated as follows: + +$$ +PE_{(pos,2i)} = \sin(pos/10000^{2i/d_{\text{model}}}) +$$ + +$$ +PE_{(pos,2i+1)} = \cos(pos/10000^{2i/d_{\text{model}}}) +$$ + +This vector is then directly added to the token embedding vector. + +To build intuition for how PE works, consider an analogy to old-fashioned electricity meters or car odometers. +Imagine a mechanical meter with multiple rotating wheels. The rightmost wheel rotates the fastest, completing a full rotation for each unit of position. The next wheel rotates slower, completing a rotation every 10 units. The wheel to its left rotates even slower, once per 100 units, and so on. Each wheel to the left rotates at an increasingly slower rate than the one before it. + +![](odometer.png) + +In the vanilla PE formulation, different dimensions correspond to these different "wheels" rotating at different frequencies determined by $10000^{2i/d_{\text{model}}}$. +The sine and cosine functions encode the continuous rotation angle of each wheel. 
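+Here is a minimal NumPy sketch of the formulas above (the function name and the tiny model dimension are illustrative only):
+
+```python
+import numpy as np
+
+def sinusoidal_pe(pos, d_model):
+    """Vanilla PE vector for one position, following the formulas above."""
+    i = np.arange(d_model // 2)                    # index of each dimension pair
+    angle = pos / (10000 ** (2 * i / d_model))     # rotation angle of each "wheel"
+    pe = np.empty(d_model)
+    pe[0::2] = np.sin(angle)                       # even dimensions
+    pe[1::2] = np.cos(angle)                       # odd dimensions
+    return pe
+
+# The fast "wheels" (early dimensions) change a lot between nearby positions,
+# while the slow "wheels" (later dimensions) only change over long distances.
+print(np.round(sinusoidal_pe(2, 8) - sinusoidal_pe(1, 8), 3))
+print(np.round(sinusoidal_pe(100, 8) - sinusoidal_pe(1, 8), 3))
+```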
+This multi-scale representation allows the model to capture both fine-grained positional differences (nearby tokens) and coarse-grained ones (distant tokens) simultaneously. Just as you can read the exact count from an odometer by looking at all wheels together, the model can determine relative positions by examining the patterns across all PE dimensions. +It's worth noting that PE shares a very similar idea with Fourier Features. + +### Relative Position Information + +Computing the dot-product of two PE vectors reveals that the result only depends on the difference of the position indices (in other words, relative positions): + +$$ +PE_i \cdot PE_j = \sum_k \cos(\theta_k (j-i)) +$$ + +Where $\theta_k$ is the frequency. Since the dot-product is the primary way different tokens interact with each other in the Attention mechanism, the Transformer should be able to interpret the relative positions between tokens. +However, relative position information is not the only thing the Transformer receives. Since PE vectors are added to token embedding vectors, the absolute positions are hardcoded into each token. +This causes problems when you try to extend a Transformer to sequences longer than the longest sequence it saw during training. Intuitively, if a network only sees absolute position indices from 1 to 100 during training, it will have no idea what to do when it receives a position index of 500 during inference. + +## Rotary Position Embedding (RoPE) + +RoPE is proposed to achieve one goal: let the Transformer interpret only relative position information, while maintaining the benefits of PE (that is, it is a non-learned encoding, adds very little computational overhead, and does not require modifying the Attention mechanism). + +Remember that the dot-product of PE vectors is already relative; the problem is that the PE vectors are first added to the token embedding vectors. +RoPE is designed so that the dot-product of the query and key vectors is purely relative. Formally: + +$$ +\langle f_q(\mathbf{x}_m, m), f_k(\mathbf{x}_n, n) \rangle = g(\mathbf{x}_m, \mathbf{x}_n, m - n). +$$ + +A query/key vector under RoPE is calculated as follows, assuming the vector is 2-dimensional. + +{% math() %} +f_{\{q,k\}}(\mathbf{x}_m, m) = \begin{pmatrix} \cos m\theta & -\sin m\theta \\ \sin m\theta & \cos m\theta \end{pmatrix} \begin{pmatrix} W_{\{q,k\}}^{(11)} & W_{\{q,k\}}^{(12)} \\ W_{\{q,k\}}^{(21)} & W_{\{q,k\}}^{(22)} \end{pmatrix} \begin{pmatrix} x_m^{(1)} \\ x_m^{(2)} \end{pmatrix} +{% end %} + +This essentially rotates the input token embedding vector by an angle determined by the pre-defined frequency and the token's position index. +The dot-product of two rotated vectors depends on their angle difference, which is determined by their relative positions, making the interaction purely relative. +You can also understand RoPE with the rotating meters analogy above, since it is literally rotating vectors as if they were meter hands. +After receiving those vectors, the Transformer is like an electrician who only cares about the relative angle difference of the meter hands between two readings, rather than the absolute position of the hands at each reading. + +![](rope-rotation.png) + +RoPE can be extended to arbitrary $d$ dimensions by dividing the vector space into multiple 2-dimensional sub-spaces. 
+ +{% math() %} +f_{\{q,k\}}(\mathbf{x}_m, m) = \begin{pmatrix} +\cos m\theta_1 & -\sin m\theta_1 & 0 & 0 & \cdots & 0 & 0 \\ +\sin m\theta_1 & \cos m\theta_1 & 0 & 0 & \cdots & 0 & 0 \\ +0 & 0 & \cos m\theta_2 & -\sin m\theta_2 & \cdots & 0 & 0 \\ +0 & 0 & \sin m\theta_2 & \cos m\theta_2 & \cdots & 0 & 0 \\ +\vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ +0 & 0 & 0 & 0 & \cdots & \cos m\theta_{d/2} & -\sin m\theta_{d/2} \\ +0 & 0 & 0 & 0 & \cdots & \sin m\theta_{d/2} & \cos m\theta_{d/2} +\end{pmatrix} +\mathbf{W}_{\{q,k\}} \mathbf{x}_m +{% end %} + +The frequency $\theta_i$ gradually decreases from $\theta_1$ to $\theta_{d/2}$, just like in PE. +This means the leading dimensions have higher frequencies and thus rotate faster, while the trailing dimensions have lower frequencies and thus rotate slower. + +As a purely relative positional encoding, RoPE inherently improves the Transformer's generalizability to sequences longer than the longest training sequence. +For example, even if the Transformer only saw sequences no longer than 100 tokens during training, it at least understands the concept of relative distances up to 100. This allows it to reason about the relationship between two tokens at positions 500 and 550 during inference, since their relative distance (50) falls within the trained range. + +## Extending RoPE + +Absolute positions are essentially relative positions with respect to the first position. Thus, RoPE is not totally free from the limitation that prevents PE from generalizing to sequences longer than those seen during training. +In other words, if the network only learned relative position differences up to 100 during training, it won't be able to make use of context more than 100 tokens away during inference, which is still a problem, especially for large language models. + +Since RoPE's first mainstream adoption in LLaMA, many efforts to extend RoPE to context lengths beyond the training length have emerged. +Ideally, we want to extend RoPE without fine-tuning the Transformer, or at least fine-tune with a much smaller training set and far fewer epochs than the original training. + +### Positional Interpolation (PI) + +PI is a straightforward extension of RoPE: if the network can only interpret relative position differences (context) up to a certain length, then we simply squeeze the target extended context during inference to fit that length. +Formally, if $L$ is the training context length and we want to extend it to $L'$ during inference, PI scales every input position index $m$ to $\frac{L}{L'}m$. + +You can easily see the limitation of PI: the network cannot directly understand the compressed relative positions without fine-tuning. +For example, if $L'=2L$, then a relative position of 2 will be compressed to 1 by the scaling, and a relative position of 1 becomes 0.5, which the network never encountered during training. +Thus, fine-tuning is necessary for PI to work effectively. + +### Yet Another RoPE Extension (YaRN) + +YaRN is the result of multiple "informal" techniques proposed on Reddit and GitHub ([NTK-aware interpolation](https://www.reddit.com/r/LocalLLaMA/comments/14lz7j5/ntkaware_scaled_rope_allows_llama_models_to_have/) and [NTK-by-parts interpolation](https://github.com/jquesnelle/yarn/pull/1)) that were later formalized in a research paper. + +The intuition is to find a more "intelligent" way to implement positional interpolation. 
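+As a point of reference for what plain PI does, here is a minimal NumPy sketch (the helper name and values are illustrative only) showing that rescaling position indices by $L/L'$ slows down every frequency by the same factor:
+
+```python
+import numpy as np
+
+def rope_angles(m, d_model, base=10000.0, scale=1.0):
+    """Rotation angle of each 2-D sub-space for a (possibly rescaled) position m."""
+    i = np.arange(d_model // 2)
+    theta = base ** (-2 * i / d_model)     # per-pair frequency, from fast to slow
+    return (scale * m) * theta
+
+L_train, L_target = 100, 200
+pi_scale = L_train / L_target              # PI squeezes positions by L / L'
+
+# The angle advanced per token is halved in every dimension alike,
+# for the fast "wheels" (local order) as much as for the slow ones.
+plain_step = rope_angles(2, 8) - rope_angles(1, 8)
+pi_step = rope_angles(2, 8, scale=pi_scale) - rope_angles(1, 8, scale=pi_scale)
+print(plain_step)
+print(pi_step)                             # exactly plain_step * 0.5
+```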
+In real-world applications of large language models, contexts positioned farther away from the current position (i.e., a larger relative position difference) are usually less important than contexts positioned closer (i.e., a smaller relative position difference). +Thus, even though interpolating RoPE inevitably degrades the Transformer's performance, we should minimize the degradation for smaller relative positions. + +YaRN achieves this by recognizing that different dimensions of RoPE serve different purposes. Remember the odometer analogy where each wheel rotates at a different speed? The fast-rotating wheels (high frequencies) are crucial for distinguishing nearby tokens, while the slow-rotating wheels (low frequencies) encode long-range positions. PI's problem is that it slows down all wheels equally, making even nearby tokens harder to distinguish. + +YaRN's solution is selective interpolation. It divides the RoPE dimensions into three groups based on their wavelengths. +Dimensions with very short wavelengths (high frequencies) are not interpolated at all. These fast-rotating "wheels" need to stay fast to preserve the ability to distinguish adjacent tokens. +Dimensions with wavelengths longer than the training context are interpolated fully, just like PI. These slow-rotating "wheels" can afford to rotate even slower to accommodate longer contexts. After interpolation, the network might interpret a relative position of 10000 tokens as, say, 5000 tokens, but both represent very distant context, so the impact on performance should be small. +Finally, dimensions in between get a smooth blend of both strategies. + +This way, the network maintains its ability to understand local relationships while gaining the capability to handle much longer contexts. YaRN also introduces a temperature parameter in the attention mechanism that helps maintain consistent performance across the extended context window. + +### Resonance RoPE + +YaRN solves the extrapolation problem by not interpolating the high-frequency dimensions. But there's still an issue even with the dimensions YaRN leaves unchanged. + +The problem is RoPE's non-integer wavelengths. Because of the common base value 10,000, most dimensions have wavelengths like 6.28 or 15.7 tokens. +Back to the odometer analogy: imagine a wheel that rotates every 10.3 positions instead of exactly 10. At position 10.3, it shows the same angle as position 0. At position 20.6, same as position 0 again. +But during training on sequences up to length 64, the wheel returns to its starting angle only at the fractional positions 10.3, 20.6, 30.9, 41.2, 51.5, 61.8 — never exactly at a token position — so the angles it shows never exactly repeat. When inference moves to positions beyond 64, such as 72 or 82, the wheel shows rotation angles the model never encountered during training. + +Resonance RoPE addresses this by rounding wavelengths to the nearest integer. +A wavelength of 10.3 becomes 10. Now positions 0, 10, 20, 30... all show identical rotation angles. When the model sees position 80 or 120 during inference, its angle aligns perfectly with angles already seen during training. The model doesn't need to generalize to new rotation angles. +This applies to all dimensions with wavelengths shorter than the training length. For these dimensions, Resonance RoPE provably eliminates the feature gap between training and inference positions. The rounding happens offline during model setup, so there's no computational cost. + +![](resonance-rope.png) + +Resonance RoPE works with any RoPE-based method. 
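+A minimal NumPy sketch of the rounding step described above (under my own naming; the official implementation may differ):
+
+```python
+import numpy as np
+
+def resonance_thetas(d_model, train_len, base=10000.0):
+    """Snap wavelengths shorter than the training length to integers."""
+    i = np.arange(d_model // 2)
+    theta = base ** (-2 * i / d_model)            # original RoPE frequencies
+    wavelength = 2 * np.pi / theta                # tokens per full "wheel" rotation
+    rounded = np.where(
+        wavelength < train_len,                   # only the fast, pre-critical dimensions
+        np.maximum(np.round(wavelength), 1.0),    # e.g. 10.3 -> 10
+        wavelength,                               # leave long wavelengths for YaRN-style scaling
+    )
+    return 2 * np.pi / rounded                    # computed once, offline
+
+print(resonance_thetas(d_model=8, train_len=64))
+```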
Combined with YaRN, it provides a complete solution: YaRN handles the long-wavelength dimensions, Resonance handles the short-wavelength ones. +Experiments show the combination consistently outperforms YaRN alone on long-context tasks. + +### LongRoPE + +Both YaRN and Resonance RoPE rely on hand-crafted rules to determine how different dimensions should be scaled. YaRN divides dimensions into three groups with fixed boundaries, and Resonance rounds wavelengths to integers. LongRoPE takes a different approach: instead of manually designing the scaling strategy, it uses evolutionary search to find optimal rescale factors for each dimension automatically. + +The search process treats the rescale factors as parameters to optimize. Starting from an initial population of candidates, LongRoPE evaluates each candidate's perplexity on validation data and evolves better solutions over iterations. This automated approach discovered non-uniform scaling patterns that outperform hand-crafted rules, enabling LongRoPE to extend context windows to 2048k tokens (over 2 million). + +LongRoPE also introduces a progressive extension strategy. Rather than jumping directly from the training length to the target length, it extends in stages: first from 4k to 256k with evolutionary search, then with a second search on the fine-tuned model to reach 2048k. The model needs only around 1000 fine-tuning steps at the 256k stage to adapt, making the extension process both effective and efficient. This progressive approach reduces the risk of performance degradation that can occur with aggressive single-step extensions. + +![](longrope.png) + +> **References:** +> +> 1. RoFormer: Enhanced Transformer with Rotary Position Embedding (2024). Su, Jianlin and Ahmed, Murtadha and Lu, Yu and Pan, Shengfeng and Bo, Wen and Liu, Yunfeng. +> 2. Extending Context Window of Large Language Models via Positional Interpolation (2023). Chen, Shouyuan and Wong, Sherman and Chen, Liangjian and Tian, Yuandong. +> 3. YaRN: Efficient Context Window Extension of Large Language Models (2023). Peng, Bowen and Quesnelle, Jeffrey and Fan, Honglu and Shippole, Enrico. +> 4. Resonance RoPE: Improving Context Length Generalization of Large Language Models (2024). Wang, Suyuchen and Kobyzev, Ivan and Lu, Peng and Rezagholizadeh, Mehdi and Liu, Bang. +> 5. LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens (2024). Ding, Yiran and Zhang, Li Lyna and Zhang, Chengruidong and Xu, Yuanyuan and Shang, Ning and Xu, Jiahang and Yang, Fan and Yang, Mao. diff --git a/content/ml-tech/rotary-pe/longrope.png b/content/ml-tech/rotary-pe/longrope.png new file mode 100644 index 0000000..c1cd2af Binary files /dev/null and b/content/ml-tech/rotary-pe/longrope.png differ diff --git a/content/ml-tech/rotary-pe/odometer.png b/content/ml-tech/rotary-pe/odometer.png new file mode 100644 index 0000000..c54d7ba Binary files /dev/null and b/content/ml-tech/rotary-pe/odometer.png differ diff --git a/content/ml-tech/rotary-pe/resonance-rope.png b/content/ml-tech/rotary-pe/resonance-rope.png new file mode 100644 index 0000000..2ca24c0 Binary files /dev/null and b/content/ml-tech/rotary-pe/resonance-rope.png differ diff --git a/content/ml-tech/rotary-pe/rope-rotation.png b/content/ml-tech/rotary-pe/rope-rotation.png new file mode 100644 index 0000000..41d63e3 Binary files /dev/null and b/content/ml-tech/rotary-pe/rope-rotation.png differ