draft new ml-tech/discrete-diffusion post
This commit is contained in:
parent 159a139846
commit 91d7f7d07c
10 changed files with 132 additions and 42 deletions
BIN  content/ml-tech/discrete-diffusion/category-diffusion.png  Normal file (binary file not shown; 233 KiB)
BIN  content/ml-tech/discrete-diffusion/category-diffusion.webp  Normal file (binary file not shown; 52 KiB)
BIN  content/ml-tech/discrete-diffusion/ddpm.png  Normal file (binary file not shown; 259 KiB)
BIN  content/ml-tech/discrete-diffusion/ddpm.webp  Normal file (binary file not shown; 40 KiB)
BIN  content/ml-tech/discrete-diffusion/diffusion-lm.png  Normal file (binary file not shown; 83 KiB)
BIN  content/ml-tech/discrete-diffusion/diffusion-lm.webp  Normal file (binary file not shown; 28 KiB)
97  content/ml-tech/discrete-diffusion/index.md  Normal file
@@ -0,0 +1,97 @@
+++
title = "Diffusion Models in Discrete Space"
date = 2026-02-07
description = ""
+++

> **Disclaimer:**
> This post won't dive deep into the nitty-gritty of diffusion models (e.g., noise schedules, re-parameterization of scores, the difference between flow matching and score matching). It focuses on the high-level idea of how to adapt diffusion models to discrete space to build generative models for discrete data.
> Also, I will primarily use diffusion-model/score-matching-flavored terminology rather than flow-matching-flavored terminology.

At this point you are probably quite familiar with how diffusion models and score/flow matching models work in general.
Just in case: diffusion models generate data following a target distribution through a multi-step Markov denoising process, where the model starts with random noise and gradually removes noise until clean data is reached.
For training, a noising process is also used, which adds noise to clean data; the two processes are conceptually inverses of each other.
[This post](../ode-sde) discussed the differential-equation view of diffusion models: they learn a time-continuous map from noise to data.



{% cap() %}The noising and denoising process illustrated in DDPM [1].{% end %}

"Adding noise" in practice typically means sampling a certain magnitude of Gaussian noise and adding it to the data features.
Obviously this only works on continuous features.
What if we want to adapt diffusion models to data with discrete features?
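To make the continuous-feature assumption concrete, here is a minimal PyTorch sketch of one-shot DDPM-style forward noising; the shapes and the `alpha_bar_t` value are illustrative assumptions of mine, not tied to any particular implementation.

```python
import torch

def forward_noise(x0: torch.Tensor, alpha_bar_t: float) -> torch.Tensor:
    """DDPM-style forward noising: x_t = sqrt(a_bar)*x_0 + sqrt(1 - a_bar)*eps."""
    eps = torch.randn_like(x0)  # Gaussian noise with the same shape as x0
    return (alpha_bar_t ** 0.5) * x0 + ((1 - alpha_bar_t) ** 0.5) * eps

x0_continuous = torch.randn(4, 16)        # 4 samples with 16 continuous features
x_t = forward_noise(x0_continuous, 0.5)   # works fine

x0_discrete = torch.randint(0, 10, (4,))  # 4 samples, each a category label in {0..9}
# forward_noise(x0_discrete.float(), 0.5) # "label 3.7" has no meaning: Gaussian noise
#                                         # is not interpretable on category labels
```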
## Latent Diffusion Models



{% cap() %}The diffusion process of Diffusion-LM [2], which works on the embedding space of discrete features.{% end %}

The most straightforward way to adapt diffusion models to discrete features is to encode them into continuous space, run a standard diffusion model on the continuous features, and decode the continuous features back into discrete ones.
Both the encoding and the decoding step can be either learned or non-learned.

More specifically, suppose you have a clean state $X_0$ composed of discrete features. Using an encoder that maps discrete features to continuous embeddings, you get a continuous latent representation $E_0$ of $X_0$. The forward and backward diffusion processes are then carried out entirely on $E_t$.
For inference, the reverse diffusion process produces a clean $\hat E_0$, which you can decode back to discrete space to get $\hat X_0$.

The encoder and decoder can be as simple as a one-hot embedding and an $\arg \max$, respectively.
Alternatively, one can use a learned lookup-table embedding layer for the encoder and a classification head for the decoder.
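Here is a minimal sketch of the non-learned variant: one-hot encode, run an off-the-shelf continuous reverse process (represented by a placeholder `denoise_step` of my own, standing in for a trained denoiser), and decode with $\arg \max$.

```python
import torch
import torch.nn.functional as F

K = 10  # number of categories (assumed)

def encode(x_discrete: torch.Tensor) -> torch.Tensor:
    # Non-learned encoder: category label -> one-hot vector in R^K.
    return F.one_hot(x_discrete, num_classes=K).float()

def decode(e_continuous: torch.Tensor) -> torch.Tensor:
    # Non-learned decoder: pick the closest category via argmax.
    return e_continuous.argmax(dim=-1)

def denoise_step(e_t: torch.Tensor, t: int) -> torch.Tensor:
    # Placeholder for one reverse step of any standard continuous diffusion model;
    # a real system would call a trained denoiser network here.
    return e_t * 0.9

e0 = encode(torch.randint(0, K, (4,)))  # clean discrete data -> continuous latents (for training)

# Inference: start from Gaussian noise in the embedding space, denoise, then decode.
e_t = torch.randn(4, K)
for t in reversed(range(100)):
    e_t = denoise_step(e_t, t)
x_hat = decode(e_t)
```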
> There are some technical tricks out there that can improve performance or stability. For example, instead of following DDPM and parameterizing the reverse diffusion process with the noise predicted by a denoiser network, one can train the network to predict $\hat E_0$ given $E_t$ at any step $t$, which makes it easier to impose constraints on $\hat E_0$ and $\hat X_0$.
> During inference, at each step, one can clamp the predicted $\hat E_0$ to its closest embedding vector in the lookup table, so that inference is explicitly geared towards producing meaningful embedding vectors.
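A sketch of that clamping step, assuming a lookup table `embedding_table` of shape `(K, d)` and a predicted `e0_hat` of shape `(batch, d)` (both names are mine):

```python
import torch

def clamp_to_table(e0_hat: torch.Tensor, embedding_table: torch.Tensor) -> torch.Tensor:
    # Snap each predicted embedding to the nearest row of the lookup table.
    dists = torch.cdist(e0_hat, embedding_table)  # (batch, K) pairwise L2 distances
    nearest = dists.argmin(dim=-1)                # (batch,) index of the closest embedding
    return embedding_table[nearest]               # (batch, d) clamped embeddings
```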
Since the encode/decode steps and the diffusion process are largely disentangled, this framework technically lets you use any standard diffusion model to handle data with discrete features.
At the same time, it creates a gap between what the diffusion model is trained on and the final discrete output target, which intuitively is not the most natural or best-performing way to adapt diffusion models to discrete features.
## Diffusion on Categories



{% cap() %}Illustration of diffusion processes operating on categorical distributions.{% end %}

There are also diffusion models that operate directly on the discrete space [3, 4]; in other words, they build a pair of Markov processes that manipulate category labels directly.

To simplify, let's suppose a data point $x$ contains just one category label.
During the forward diffusion process, given the clean $x_0$, "adding noise" to obtain $x_t$ means either (a) staying at the current label with probability $1-\beta_t$, or (b) switching to another category with probability $\beta_t$, where $\beta_t$ is given by the noise schedule.
> Case (b) can be as simple as picking a category uniformly at random, or it can be more complex, following a specific noise schedule that leans more heavily towards certain categories depending on the step $t$.
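For concreteness, a minimal sketch of this forward step with uniform resampling in case (b); the batch size, number of categories, and `beta_t` value are placeholders of my own:

```python
import torch

def forward_noise_categorical(x0: torch.Tensor, beta_t: float, K: int) -> torch.Tensor:
    """One forward step: keep each label with prob 1 - beta_t, else resample uniformly."""
    resample = torch.rand(x0.shape) < beta_t       # which entries get corrupted
    random_labels = torch.randint(0, K, x0.shape)  # uniform replacement labels
    return torch.where(resample, random_labels, x0)

x0 = torch.randint(0, 10, (8,))  # 8 data points, each a single label in {0..9}
x_t = forward_noise_categorical(x0, beta_t=0.3, K=10)
```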
During the reverse diffusion process, given $x_t$, a network predicts a distribution over categories $p_\theta(x_{t-1}|x_t)$, and the process advances by one step by sampling from this distribution to get $\hat x_{t-1}$.

For training, one can re-parameterize $\hat x_{t-1}$ by letting the network predict the categorical distribution of $\hat x_0$, so that the output layer of the network is essentially a classification head and the network can be trained with a cross-entropy loss.
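Under the $\hat x_0$-prediction parameterization, training is essentially classification. Below is a hedged sketch with a toy stand-in network of my own (`denoiser`); timestep conditioning and the proper noise schedule are simplified away:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

K, d = 10, 64
# Toy denoiser: noisy label x_t -> logits over the clean label x_0 (no timestep conditioning).
denoiser = nn.Sequential(nn.Embedding(K, d), nn.ReLU(), nn.Linear(d, K))
opt = torch.optim.Adam(denoiser.parameters(), lr=1e-3)

for step in range(100):
    x0 = torch.randint(0, K, (32,))                        # clean labels (toy data)
    beta_t = 0.3                                           # in practice drawn from the noise schedule
    resample = torch.rand(32) < beta_t                     # same forward process as above
    x_t = torch.where(resample, torch.randint(0, K, (32,)), x0)
    logits = denoiser(x_t)                                 # predicted distribution over x_0
    loss = F.cross_entropy(logits, x0)                     # standard classification loss
    opt.zero_grad()
    loss.backward()
    opt.step()
```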
### Score-matching on Categories

Building on the above general framework, SEDD [6] frames the reverse process and trains the denoiser differently. Recall that in standard score matching [5], each reverse step follows:

$$x_{t-\Delta t} = x_t + \nabla_{x_t} \log p_t(x_t) \cdot \Delta t$$

where $\nabla_{x_t} \log p_t(x_t)$ is the score, intuitively the direction towards higher density (i.e., closer to the target distribution) at $x_t$.
SEDD defines an analogous score in discrete space. Given $x_t$, the discrete score for each category $c$ is formulated as $p_t(c)/p_t(x_t)$, measuring how strongly $x_t$ should jump towards category $c$ at time $t$ to get closer to the target data distribution.

A network is trained to estimate this score, just like in score matching: {% m() %}s_\theta(x_t, t)_c\approx p_t(c)/p_t(x_t){% end %}. In practice one would use $p_{t|0}(c \mid x_0) / p_{t|0}(x_t \mid x_0)$ as the training target, which is easy to compute given a known $x_0$ and the noise schedule.
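To see why this target is cheap to compute, here is a sketch assuming a uniform-transition forward process with cumulative keep-probability `alpha_bar_t`; the closed form below is specific to that assumed kernel, not necessarily SEDD's exact setup:

```python
import torch

def score_target(x0: torch.Tensor, x_t: torch.Tensor, alpha_bar_t: float, K: int) -> torch.Tensor:
    """Target ratios p_{t|0}(c|x0) / p_{t|0}(x_t|x0) for every category c, per sample."""
    # Uniform kernel: p_{t|0}(c|x0) = alpha_bar_t * 1[c == x0] + (1 - alpha_bar_t) / K
    probs = torch.full((x0.shape[0], K), (1 - alpha_bar_t) / K)
    probs[torch.arange(x0.shape[0]), x0] += alpha_bar_t
    denom = probs.gather(1, x_t.unsqueeze(1))  # p_{t|0}(x_t|x0), shape (batch, 1)
    return probs / denom                       # shape (batch, K)
```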
At each reverse step, the probability of sampling each category for the next step follows a formulation similar to the score-matching reverse step:

$$p(x_{t-\Delta t} = c) = \delta_{x_t}(c) + \Delta t \cdot Q_t(x_t, c) \cdot s_\theta(x_t, t)_c$$

The first term is the probability of staying at the current category, and the second term is the probability of jumping to category $c$, scaled by the step size. $Q_t(x_t, c)$ is the noise factor given the specific noise schedule.
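A naive sketch of that update; the clamp-and-renormalize at the end is my own crude fix to keep the result a valid distribution, which real samplers handle more carefully:

```python
import torch
import torch.nn.functional as F

def reverse_step(x_t: torch.Tensor, scores: torch.Tensor, Q_row: torch.Tensor, dt: float) -> torch.Tensor:
    """Euler-style reverse step: stay at x_t or jump to c, guided by the learned scores.

    x_t:    (batch,) current labels
    scores: (batch, K) network estimates of p_t(c) / p_t(x_t)
    Q_row:  (batch, K) noise-schedule factor Q_t(x_t, c) for each candidate c
    """
    K = scores.shape[-1]
    probs = F.one_hot(x_t, K).float() + dt * Q_row * scores  # delta term + jump term
    probs = probs.clamp(min=0.0)                             # crude: drop negative mass...
    probs = probs / probs.sum(dim=-1, keepdim=True)          # ...and renormalize
    return torch.multinomial(probs, num_samples=1).squeeze(-1)
```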
### ODE Flow on Categories

Within the above frameworks, during inference you sample from a categorical distribution to obtain the next, cleaner state.

If you want no stochasticity in the sampling steps, similar to how you would use a flow-matching ODE instead of a score-matching SDE, you can use discrete flows [7]. The key difference is that instead of sampling from a distribution, you use $\arg \max$ to deterministically pick the category for the cleaner step.

Of course $\arg \max$ is non-differentiable, so you cannot directly train a model through its output. A trick to bypass this is to backpropagate through a softmax with low temperature instead, which approximates $\arg \max$ while remaining differentiable. The true $\arg \max$ is still used during inference.
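A tiny sketch of that relaxation (the temperature is an arbitrary choice of mine): training backpropagates through a low-temperature softmax, while inference takes the hard $\arg \max$.

```python
import torch

logits = torch.randn(4, 10, requires_grad=True)  # (batch, K) unnormalized category scores

# Training-time surrogate: a low-temperature softmax is a smooth stand-in for argmax.
tau = 0.1
soft_pick = torch.softmax(logits / tau, dim=-1)  # nearly one-hot, but differentiable
soft_pick.sum().backward()                       # gradients flow back to the logits

# Inference: the true, non-differentiable argmax.
hard_pick = logits.argmax(dim=-1)                # (batch,) deterministic category choices
```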
> **References:**
> 1. Ho, Jonathan, Ajay Jain, and Pieter Abbeel. “Denoising Diffusion Probabilistic Models.”
> 2. Li, Xiang Lisa, John Thickstun, Ishaan Gulrajani, Percy Liang, and Tatsunori B. Hashimoto. “Diffusion-LM Improves Controllable Text Generation.”
> 3. Hoogeboom, Emiel, Didrik Nielsen, Priyank Jaini, Patrick Forré, and Max Welling. “Argmax Flows and Multinomial Diffusion: Learning Categorical Distributions.”
> 4. Austin, Jacob, Daniel D. Johnson, Jonathan Ho, Daniel Tarlow, and Rianne van den Berg. “Structured Denoising Diffusion Models in Discrete State-Spaces.”
> 5. Song, Yang, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. “Score-Based Generative Modeling through Stochastic Differential Equations.”
> 6. Lou, Aaron, Chenlin Meng, and Stefano Ermon. “Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution.”
> 7. Tran, Dustin, Keyon Vafa, Kumar Agrawal, Laurent Dinh, and Ben Poole. “Discrete Flows: Invertible Generative Models of Discrete Data.”
@@ -96,12 +96,9 @@ Seeing the insane computational load required by LLMs, there are lots of techniq
- Learning Rate Schedule: A modified trapezoidal (Warmup-Stable-Decay) schedule with a short warmup period, constant learning rate for the majority of training, and a decay phase at the end
- Batch Size Schedule: Gradually increases batch size during training from smaller to larger values; this can accelerate training progress by updating weights more frequently in early stages with smaller batches.

---

## References

1. Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference (2025). Warner, Benjamin and Chaffin, Antoine and Clavié, Benjamin and Weller, Orion and Hallström, Oskar and Taghadouini, Said and Gallagher, Alexis and Biswas, Raja and Ladhak, Faisal and Aarsen, Tom and Adams, Griffin Thomas and Howard, Jeremy and Poli, Iacopo.
2. NeoBERT: A Next Generation BERT (2025). Breton, Lola Le and Fournier, Quentin and Morris, John Xavier and Mezouar, Mariam El and Chandar, Sarath.
3. Nomic Embed: Training a Reproducible Long Context Text Embedder (2025). Nussbaum, Zach and Morris, John Xavier and Mulyar, Andriy and Duderstadt, Brandon.
4. Unveiling the Potential of BERT-family: A New Recipe for Building Scalable, General and Competitive Large Language Models (2025). Xiao, Yisheng and Li, Juntao and Hu, Wenpeng and Luo, Zhunchen and Zhang, Min.
> **References:**
> 1. Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference (2025). Warner, Benjamin and Chaffin, Antoine and Clavié, Benjamin and Weller, Orion and Hallström, Oskar and Taghadouini, Said and Gallagher, Alexis and Biswas, Raja and Ladhak, Faisal and Aarsen, Tom and Adams, Griffin Thomas and Howard, Jeremy and Poli, Iacopo.
> 2. NeoBERT: A Next Generation BERT (2025). Breton, Lola Le and Fournier, Quentin and Morris, John Xavier and Mezouar, Mariam El and Chandar, Sarath.
> 3. Nomic Embed: Training a Reproducible Long Context Text Embedder (2025). Nussbaum, Zach and Morris, John Xavier and Mulyar, Andriy and Duderstadt, Brandon.
> 4. Unveiling the Potential of BERT-family: A New Recipe for Building Scalable, General and Competitive Large Language Models (2025). Xiao, Yisheng and Li, Juntao and Hu, Wenpeng and Luo, Zhunchen and Zhang, Min.
@@ -259,28 +259,26 @@ Below are some preliminary results I obtained from a set of amorphous material g

{% cap() %}Structural functions of generated materials, sampled in 10 steps.{% end %}

---

## References

1. Holderrieth and Erives, "An Introduction to Flow Matching and Diffusion Models."
2. Song and Ermon, "Generative Modeling by Estimating Gradients of the Data Distribution."
3. Rezende, Danilo, and Shakir Mohamed. "Variational inference with normalizing flows."
4. https://en.wikipedia.org/wiki/Differential_equation
5. https://en.wikipedia.org/wiki/Brownian_motion
6. https://en.wikipedia.org/wiki/Vector_field
7. https://en.wikipedia.org/wiki/Vector_flow
8. https://en.wikipedia.org/wiki/Ordinary_differential_equation
9. https://en.wikipedia.org/wiki/Stochastic_differential_equation
10. https://en.wikipedia.org/wiki/Euler_method
11. https://github.com/rtqichen/torchdiffeq
12. Lipman, Yaron, et al. "Flow matching for generative modeling."
13. Frans, Kevin, et al. "One step diffusion via shortcut models."
14. Geng, Zhengyang, et al. "Mean Flows for One-step Generative Modeling."
15. Liu, Xingchao, Chengyue Gong, and Qiang Liu. "Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow."
16. Ho, Jonathan, Ajay Jain, and Pieter Abbeel. "Denoising diffusion probabilistic models."
17. https://en.wikipedia.org/wiki/Diffusion_process
18. Huang et al., "Symbolic Music Generation with Non-Differentiable Rule Guided Diffusion."
19. https://en.wikipedia.org/wiki/Euler–Maruyama_method
20. Song et al., "Score-Based Generative Modeling through Stochastic Differential Equations."
21. https://en.wikipedia.org/wiki/Informant_(statistics)
> **References:**
>
> 1. Holderrieth and Erives, "An Introduction to Flow Matching and Diffusion Models."
> 2. Song and Ermon, "Generative Modeling by Estimating Gradients of the Data Distribution."
> 3. Rezende, Danilo, and Shakir Mohamed. "Variational inference with normalizing flows."
> 4. https://en.wikipedia.org/wiki/Differential_equation
> 5. https://en.wikipedia.org/wiki/Brownian_motion
> 6. https://en.wikipedia.org/wiki/Vector_field
> 7. https://en.wikipedia.org/wiki/Vector_flow
> 8. https://en.wikipedia.org/wiki/Ordinary_differential_equation
> 9. https://en.wikipedia.org/wiki/Stochastic_differential_equation
> 10. https://en.wikipedia.org/wiki/Euler_method
> 11. https://github.com/rtqichen/torchdiffeq
> 12. Lipman, Yaron, et al. "Flow matching for generative modeling."
> 13. Frans, Kevin, et al. "One step diffusion via shortcut models."
> 14. Geng, Zhengyang, et al. "Mean Flows for One-step Generative Modeling."
> 15. Liu, Xingchao, Chengyue Gong, and Qiang Liu. "Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow."
> 16. Ho, Jonathan, Ajay Jain, and Pieter Abbeel. "Denoising diffusion probabilistic models."
> 17. https://en.wikipedia.org/wiki/Diffusion_process
> 18. Huang et al., "Symbolic Music Generation with Non-Differentiable Rule Guided Diffusion."
> 19. https://en.wikipedia.org/wiki/Euler–Maruyama_method
> 20. Song et al., "Score-Based Generative Modeling through Stochastic Differential Equations."
> 21. https://en.wikipedia.org/wiki/Informant_(statistics)
@@ -164,12 +164,10 @@ LongRoPE also introduces a progressive extension strategy. Rather than jumping d



---

## References

1. RoFormer: Enhanced transformer with Rotary Position Embedding (2024). Su, Jianlin and Ahmed, Murtadha and Lu, Yu and Pan, Shengfeng and Bo, Wen and Liu, Yunfeng.
2. Extending context window of large language models via positional interpolation (2023). Chen, Shouyuan and Wong, Sherman and Chen, Liangjian and Tian, Yuandong.
3. YaRN: Efficient Context Window Extension of Large Language Models (2023). Peng, Bowen and Quesnelle, Jeffrey and Fan, Honglu and Shippole, Enrico.
4. Resonance rope: Improving context length generalization of large language models (2024). Wang, Suyuchen and Kobyzev, Ivan and Lu, Peng and Rezagholizadeh, Mehdi and Liu, Bang.
5. LongRoPE: Extending LLM Context Window Beyond 3 Million Tokens (2024). Ding, Yiran and Zhang, Li Lyna and Zhang, Chengruidong and Xu, Yuanyuan and Shang, Ning and Xu, Jiahang and Yang, Fan and Yang, Mao.
> **References:**
>
> 1. RoFormer: Enhanced transformer with Rotary Position Embedding (2024). Su, Jianlin and Ahmed, Murtadha and Lu, Yu and Pan, Shengfeng and Bo, Wen and Liu, Yunfeng.
> 2. Extending context window of large language models via positional interpolation (2023). Chen, Shouyuan and Wong, Sherman and Chen, Liangjian and Tian, Yuandong.
> 3. YaRN: Efficient Context Window Extension of Large Language Models (2023). Peng, Bowen and Quesnelle, Jeffrey and Fan, Honglu and Shippole, Enrico.
> 4. Resonance rope: Improving context length generalization of large language models (2024). Wang, Suyuchen and Kobyzev, Ivan and Lu, Peng and Rezagholizadeh, Mehdi and Liu, Bang.
> 5. LongRoPE: Extending LLM Context Window Beyond 3 Million Tokens (2024). Ding, Yiran and Zhang, Li Lyna and Zhang, Chengruidong and Xu, Yuanyuan and Shang, Ning and Xu, Jiahang and Yang, Fan and Yang, Mao.