introduce figcaption

Yan Lin 2026-02-06 09:11:05 +01:00
parent 05dea86964
commit d8ea74211f
14 changed files with 81 additions and 63 deletions

View file

@@ -11,16 +11,16 @@ chapter = "Chapter 5"
End-to-end learning means training a model to perform a task from input to output, supervising only on how the output aligns with the task's ground truth.
End-to-end is typically the most straightforward option for building a deep learning method for a certain task, and that also applies to most tasks related to spatiotemporal trajectories.
-<img src="end-to-end.webp" alt="end-to-end" style="max-width: min(500px, 100%);">
+{{ img(src="end-to-end.webp", alt="end-to-end", width="500px") }}
-> Illustration of end-to-end learning of spatiotemporal trajectories.
+{% cap() %}Illustration of end-to-end learning of spatiotemporal trajectories.{% end %}
In this post we will categorize end-to-end learning tasks of trajectories from a technical standpoint: prediction, classification, and imputation.
For each category of tasks, we will give a general problem formulation and the general framework for solving it, and briefly discuss the motivation and use cases of more specific downstream applications that fit into the category.
![categories](categories.webp)
-> Schema overview of the three categories of end-to-end trajectory learning tasks.
+{% cap() %}Schema overview of the three categories of end-to-end trajectory learning tasks.{% end %}
{{ toc() }}

View file

@@ -12,7 +12,7 @@ A spatiotemporal trajectory is a sequence, with each item being a timestamped lo
![trajectory](trajectory.webp)
-> Examples of human (left) and vehicle (right) trajectories.
+{% cap() %}Examples of human (left) and vehicle (right) trajectories.{% end %}
With the development of technologies and devices that can record trajectories, such as GPS-equipped mobile phones, human society has gathered large-scale trajectory data.
Such data can be valuable for traffic planning and city development.
@@ -80,7 +80,7 @@ Anomaly detection identifies paths or behaviors that deviate from normal pattern
![tasks](tasks.webp)
-> Illustration of trajectory-related tasks.
+{% cap() %}Illustration of trajectory-related tasks.{% end %}
## Deep Learning for Trajectories

View file

@@ -33,7 +33,7 @@ where each $t_i \in \{1, 2, \ldots, |\mathcal{V}|\}$ is an index into the vocabu
![token](./token.webp)
-> An example of tokenized sentences. Source: [Tiktokenizer](https://tiktokenizer.vercel.app/)
+{% cap() %}An example of tokenized sentences. Source: [Tiktokenizer](https://tiktokenizer.vercel.app/){% end %}
Different tokenization strategies exist, each with trade-offs between vocabulary size and sequence length. Word-level tokenization is intuitive but requires a very large vocabulary. Character-level tokenization has a small vocabulary but produces very long sequences. Modern LLMs use subword tokenization, which breaks words into meaningful subunits that balance vocabulary size and sequence length. For a detailed treatment of tokenization, see [Karpathy's tutorial](https://www.youtube.com/watch?v=zduSFxRajkE).
@@ -62,7 +62,7 @@ The dot product of two rotated vectors depends only on the difference of their r
![rope](./rope-rotation.webp)
-> RoPE encodes positional information by rotating input vectors.
+{% cap() %}RoPE encodes positional information by rotating input vectors.{% end %}
This relative formulation improves generalization to longer sequences. A model trained on sequences up to length $L$ has learned to interpret relative distances up to $L$. When processing longer sequences at inference time, it can still reason about token pairs whose relative distance falls within the trained range, even if their absolute positions exceed $L$. Various techniques like positional interpolation and YaRN further extend this capability by carefully scaling the rotation frequencies.
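For reference, the relative-position property this caption and paragraph rely on can be stated compactly: for a query $\mathbf{q}$ at position $m$ and a key $\mathbf{k}$ at position $n$, with $R_\theta$ a 2D rotation by angle $\theta$ (symbols introduced here for illustration, not taken from the post),

$$
\langle R_{m\theta}\,\mathbf{q},\; R_{n\theta}\,\mathbf{k} \rangle = \mathbf{q}^{\top} R_{m\theta}^{\top} R_{n\theta}\,\mathbf{k} = \langle \mathbf{q},\; R_{(n-m)\theta}\,\mathbf{k} \rangle,
$$

so the attention score depends only on the relative offset $n - m$, which is what allows trained relative distances to transfer to longer sequences.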
@@ -85,7 +85,7 @@ There are also hybrid approaches that combine both objectives. For example, GLM
![glm](glm.webp)
-> Blank-infilling objective used by GLM.
+{% cap() %}Blank-infilling objective used by GLM.{% end %}
> **References:**
> - Vaswani, Ashish, Noam Shazeer, Niki Parmar, et al. "Attention Is All You Need."
@@ -164,7 +164,7 @@ where {% m() %}\mathbf{e}_i{% end %} is the query and {% m() %}\mathbf{E}_{\text
![trajcogn](./trajcogn.webp)
-> The reprogramming module and anchor words in TrajCogn.
+{% cap() %}The reprogramming module and anchor words in TrajCogn.{% end %}
An alternative approach is to directly serialize trajectory features into text and feed them through the LLM's native tokenizer. For features that are already textual, such as POI names or addresses, this is straightforward. For continuous features like coordinates, one can simply convert numbers to their string representations. This technically works, but LLMs are not well-known for their ability to reason about raw numbers, let alone spatial relationships encoded in coordinate pairs. A trajectory represented as a sequence of "(lat, lng)" strings may be parseable by the model, but whether it can extract meaningful spatial patterns from such input is questionable.
@@ -172,7 +172,7 @@ To address this limitation, one can transform trajectory data into modalities th
![traj-mllm](./trajmllm.webp)
-> Mixture of textual and visual representations of trajectory features in Traj-MLLM.
+{% cap() %}Mixture of textual and visual representations of trajectory features in Traj-MLLM.{% end %}
### Efficient Fine-tuning on Trajectories
@@ -213,7 +213,7 @@ Here $g_{[m]}$ represents missing points between $g_1$ and $g_4$, and the model
![uvtm](./uvtm.webp)
-> Implementation of the above generation paradigm in UVTM.
+{% cap() %}Implementation of the above generation paradigm in UVTM.{% end %}
Under this formulation, we can frame different tasks with different masking patterns. For origin-destination travel time estimation, the input is $(l_o, t_o, [m]), (l_d, [m], [m])$, where only the origin coordinate, departure time, and destination coordinate are known, and the model generates the arrival time as part of the destination tuple. For trajectory recovery, known sparse points have complete tuples while gaps between them are marked with $g_{[m]}$, and the model generates the intermediate points. For trajectory prediction, historical points are complete and a single $g_{[m]}$ at the end signals that future points should be generated.

View file

@@ -14,9 +14,9 @@ This is also very relevant in the context of trajectory learning, since the avai
Self-supervised learning is also widely used to pre-train deep learning models. Put into the perspective of trajectories, self-supervised learning can help models get a general understanding of trajectory sequences or components in trajectories; the pre-trained models can later be fine-tuned on specific tasks, or be used directly for unsupervised tasks like clustering.
-<img src="self-supervised.webp" alt="self-supervised" style="max-width: min(500px, 100%);">
+{{ img(src="self-supervised.webp", alt="self-supervised", width="500px") }}
-> Illustration of self-supervised learning of spatiotemporal trajectories.
+{% cap() %}Illustration of self-supervised learning of spatiotemporal trajectories.{% end %}
Most widely-adopted self-supervised learning frameworks for trajectories originate from the natural language processing (NLP), graph learning, and computer vision (CV) domains.
In this post we categorize self-supervised learning methods for trajectories based on the framework they adhere to: static word embedding, graph node embedding, contextual word embedding, auto-encoding, and contrastive learning.
@@ -67,14 +67,14 @@ It constructs the Huffman tree by recursively partitioning the geographic space
![poi2vec](poi2vec.webp)
-> Construction of the geography-aware binary tree in POI2Vec.
+{% cap() %}Construction of the geography-aware binary tree in POI2Vec.{% end %}
_TALE_ incorporates temporal periodicity with a time-aware hierarchical softmax structure. Many locations exhibit strong temporal patterns: office buildings are visited during work hours, restaurants peak at meal times, and entertainment venues are active at night. TALE captures these patterns by replacing the standard Huffman tree used in hierarchical softmax with a temporal tree.
The temporal tree has a root node at the top level, followed by time nodes corresponding to equal-length time slices of a day. Below each time node, a Huffman subtree organizes the locations that are visited during that time slice, based on their visit frequencies within that slice.
![tale](tale.webp)
-> The temporal tree in TALE.
+{% cap() %}The temporal tree in TALE.{% end %}
Predicting a visit $(l, t)$ requires traversing a path through this tree, which decomposes into two stages:
@@ -175,7 +175,7 @@ The second design is an additional self-supervised objective called masked hour
![ctle](ctle.webp)
-> The model architecture of CTLE.
+{% cap() %}The model architecture of CTLE.{% end %}
### Applications: Capturing Location Polysemy
@@ -259,7 +259,7 @@ The core idea of contrastive learning is to learn representations by comparing p
![cmc](cmc.webp)
-> Contrasting different views of the same data point versus another data point in contrastive multiview coding.
+{% cap() %}Contrasting different views of the same data point versus another data point in contrastive multiview coding.{% end %}
The _contrastive multiview coding_ framework formalizes this with the InfoNCE loss. Given a data point $\mathbf{x}$, we apply two different augmentations to obtain views $\mathbf{x}^{(1)}$ and $\mathbf{x}^{(2)}$. An encoder $f$ maps each view to an embedding, and the model is trained to identify the positive pair among a set of negatives. For a batch of $N$ data points, the loss for a positive pair $(i, j)$ is:
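For reference, the standard InfoNCE (NT-Xent) form used by such frameworks, with $\mathbf{z}_i$ and $\mathbf{z}_j$ the embeddings of the two views of the same data point, $\mathrm{sim}$ cosine similarity, and $\tau$ a temperature (notation introduced here, possibly differing slightly from the post's), is:

$$
\ell_{i,j} = -\log \frac{\exp\!\big(\mathrm{sim}(\mathbf{z}_i, \mathbf{z}_j)/\tau\big)}{\sum_{k=1,\, k \neq i}^{2N} \exp\!\big(\mathrm{sim}(\mathbf{z}_i, \mathbf{z}_k)/\tau\big)}.
$$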

View file

@@ -18,13 +18,13 @@ With `cd` aliased to `zoxide`, I only need to type `cd n` (supposing that `~/.co
![zoxide jump](zoxide-jump.webp)
-> Fuzzy directory jump with `zoxide`.
+{% cap() %}Fuzzy directory jump with `zoxide`.{% end %}
Internally `zoxide` records my visits to directories in a SQLite database and sorts them based on frequency. If the first hit is not what I want, I can also interactively select from the matched list.
![zoxide select](zoxide-select.webp)
-> Candidate selection screen of `zoxide`.
+{% cap() %}Candidate selection screen of `zoxide`.{% end %}
## `du` -> `ncdu`
@@ -35,7 +35,7 @@ It can totally be an alternative to those fancy disk space analyzers as well.
![ncdu](ncdu.webp)
-> Interface of `ncdu`.
+{% cap() %}Interface of `ncdu`.{% end %}
## `top` -> `btop`
@@ -43,13 +43,13 @@ It can totally be an alternative to those fancy disk space analyzers as well.
![htop](htop.webp)
-> The default look of `top`.
+{% cap() %}The default look of `top`.{% end %}
[`btop`](https://github.com/aristocratos/btop) might be the most "nerdy-looking" `top` alternative out of the box. It can be a handy tool if you are trying to make people believe you are a hacker.
![btop](btop.webp)
-> Interface of `btop`, with the gruvbox theme.
+{% cap() %}Interface of `btop`, with the gruvbox theme.{% end %}
At the same time, it is very feature-rich and configurable. To some extent, it is also an alternative to bandwidth monitoring tools like `iftop` and disk utilization tools like `df`.
@@ -59,13 +59,13 @@ I think there is nothing wrong with the classic `ls`. So, as an alternative, [`e
![eza list](eza-list.webp)
-> `eza` adds icons and color to the `ls` command.
+{% cap() %}`eza` adds icons and color to the `ls` command.{% end %}
It can replace the `tree` command as well.
![eza tree](eza-tree.webp)
-> File tree display of `eza`.
+{% cap() %}File tree display of `eza`.{% end %}
## `vim` -> `nvim`
@@ -79,8 +79,8 @@ Syntax highlighting, file browser, fuzzy search, intelligent autocompletion, deb
![neovim](nvim-1.webp)
-> Interface of `neovim`, with neotree file browser.
+{% cap() %}Interface of `neovim`, with neotree file browser.{% end %}
![neovim fuzzy search](nvim-2.webp)
-> Interface of `neovim`, with telescope search.
+{% cap() %}Interface of `neovim`, with telescope search.{% end %}

View file

@@ -8,7 +8,7 @@ This is a very concise walkthrough of my main home server running NixOS. I assum
![cover](cover.webp)
-> `neofetch` screen of my home server.
+{% cap() %}`neofetch` screen of my home server.{% end %}
My home server (or many would rather call it a NAS) serves common home server purposes: bulk storage, basic file sharing, media streaming service, and photo backup.
@@ -18,7 +18,7 @@ Below is a recent photo of my home server, living in the utility closet together
![server photo](server-photo.webp)
-> Real-world photo of the home server.
+{% cap() %}Real-world photo of the home server.{% end %}
It is essentially an Intel N305 custom motherboard with SATA back panel and a 3D-printed enclosure. I bought it on Taobao last time I went back to China to visit my family.
The exact hardware is not very important here; as long as you stick to common hardware, it should be relatively straightforward to install NixOS and replicate my setup.
@@ -191,7 +191,7 @@ For photo backup I use [Immich](https://immich.app/). It is a self-hosted altern
![immich](immich.webp)
-> Web interface of Immich.
+{% cap() %}Web interface of Immich.{% end %}
Right now Immich is the only service I am running with containers rather than native Nix modules (as you can see in [this configuration file](https://github.com/Logan-Lin/nix-config/blob/master/hosts/nixos/hs/containers.nix)). Technically it is possible to set up Immich with pure Nix modules, but for this type of service that relies on specific versions of databases (in this case, PostgreSQL with vector support), I feel containers are the easier route.
And to be honest, I don't think there is much benefit in going with a pure Nix module here (especially for Immich, which you can still [declare its config](https://github.com/Logan-Lin/nix-config/blob/master/config/immich.nix) even with containers), other than fulfilling the purism many Nix users seem to have.
@@ -220,7 +220,9 @@ I do have a [login display module](https://github.com/Logan-Lin/nix-config/blob/
![login display](login-display.webp)
-> Information displayed at `ssh` login.
+{% cap() %}
+Information displayed at `ssh` login.
+{% end %}
## Why NixOS?
@@ -232,5 +234,7 @@ Compared to other Linux distributions, NixOS is quite suitable for setting up a
![terminal comparison](terminal-comparison.webp)
-> Local (left) and `ssh`-connected server (right) terminal interface.
-> It looks completely identical (why not), to the point I have to set up visual hints (like the highlighted tmux hostname display) to remind myself which host I am currently on.
+{% cap() %}
+Local (left) and `ssh`-connected server (right) terminal interface.
+They look completely identical (why not).
+{% end %}

View file

@@ -9,7 +9,7 @@ Not anymore, since last night I refactored the whole site using [Zola](https://g
![compare](compare.webp)
-> Comparison of how the blog site looks before (left) and after (right) the refactor.
+{% cap() %}Comparison of how the blog site looks before (left) and after (right) the refactor.{% end %}
Aside from artistic changes, the main reason behind this refactor is to use a static site generator (SSG) that has fewer dependencies and straightforward control of templates.
I will dive deeper into the rationale and the refactor process below.
@@ -74,4 +74,4 @@ Zola itself is very lightweight. As for the generated blog site, the only extern
![compare-speed](compare-speed.webp)
-> Load speed comparison of my homepage and the refactored blog site.
+{% cap() %}Load speed comparison of my homepage and the refactored blog site.{% end %}

View file

@@ -25,7 +25,7 @@ Also, from my experience, all cloud storage services I've used frequently run in
![Syncthing web UI](syncthing.webp)
-> Web UI of Syncthing.
+{% cap() %}Web UI of Syncthing.{% end %}
## Examples
@@ -40,7 +40,7 @@ Once you use a local file sync service to sync the vault folder of Obsidian, it
![Obsidian](obsidian.webp)
-> Obsidian desktop UI.
+{% cap() %}Obsidian desktop UI.{% end %}
### Reference Management: Zotero
@@ -48,13 +48,13 @@ Once you use a local file sync service to sync the vault folder of Obsidian, it
![Zotero](zotero.webp)
-> Zotero desktop UI, showing list of papers.
+{% cap() %}Zotero desktop UI, showing list of papers.{% end %}
Zotero has a built-in cloud sync functionality, but their price for storage upgrades is quite high. One thing you might not know is that Zotero stores metadata and attachments in the same folder. You can use Syncthing to sync that folder, and completely ignore the official cloud sync functionality.
![Zotero folder structure](zotero-files.webp)
-> The folder containing both metadata and raw PDF of papers in Zotero.
+{% cap() %}The folder containing both metadata and raw PDF of papers in Zotero.{% end %}
### Paper Writing: Overleaf vs. Local Text Editor
@@ -81,7 +81,7 @@ clean:
![LaTeX editing](latex.webp)
-> LaTeX editing with `neovim` on the left, compiled PDF view on the right.
+{% cap() %}LaTeX editing with `neovim` on the left, compiled PDF view on the right.{% end %}
Overleaf also provides two types of Git integration for you to sync your local changes with Overleaf projects: sync with a GitHub repo, or directly as a remote git repo. It's totally viable to have a mixed setup, where you primarily use local editors and most of your collaborators use Overleaf.
@@ -91,7 +91,7 @@ Overleaf also provides two types of Git integration for you to sync your local c
![Calibre](calibre.webp)
-> Calibre desktop UI, showing list of books.
+{% cap() %}Calibre desktop UI, showing list of books.{% end %}
Similar to Zotero, Calibre stores all the books and metadata of a library in a local folder, so there is nothing stopping you from syncing the folder across multiple computers. Although this is something the software explicitly advises against (a line when you select the location for a library reads: "Note that putting the calibre library on a Networked drive is not safe"), from my experience, as long as you don't try to open and modify the same library on two synced computers simultaneously, you won't run into any issues.
@@ -106,7 +106,7 @@ Of course, to save some space, I will always transcode each file to AAC 256k bef
![foobar2000](foobar2000.webp)
-> foobar2000 running on an iPad mini.
+{% cap() %}foobar2000 running on an iPad mini.{% end %}
## Limitations
## Limitations ## Limitations

View file

@@ -26,11 +26,11 @@ The goal of a multi-modal Transformer is to create a model that can accept multi
![](multi-modal-fusion.webp)
-> An example of "conventional" multi-modal fusion. Different modalities are processed by separate models and fused at some point. Source: *Xiang, Hao, Runsheng Xu, and Jiaqi Ma. "HM-ViT: Hetero-modal vehicle-to-vehicle cooperative perception with vision transformer." CVPR, 2023.*
+{% cap() %}An example of "conventional" multi-modal fusion. Different modalities are processed by separate models and fused at some point. Source: *Xiang, Hao, Runsheng Xu, and Jiaqi Ma. "HM-ViT: Hetero-modal vehicle-to-vehicle cooperative perception with vision transformer." CVPR, 2023.*{% end %}
![](video-poet.webp)
-> An example of a Transformer that can handle multi-modal inputs and outputs. Different modalities are all projected into tokens and subsequently processed by a unified Transformer encoder. Source: *Kondratyuk, Dan, Lijun Yu, et al. "VideoPoet: A Large Language Model for Zero-Shot Video Generation," ICML, 2024.*
+{% cap() %}An example of a Transformer that can handle multi-modal inputs and outputs. Different modalities are all projected into tokens and subsequently processed by a unified Transformer encoder. Source: *Kondratyuk, Dan, Lijun Yu, et al. "VideoPoet: A Large Language Model for Zero-Shot Video Generation," ICML, 2024.*{% end %}
Beyond multi-modal processing, a multi-function Transformer can, for example, function as both a language model (auto-regressive generation) and diffusion denoiser (score-matching generation) simultaneously, supporting two of the most common generation schemes used today.
@@ -40,13 +40,13 @@ A fundamental challenge in unifying multiple modalities within a single Transfor
![](qkv-attention.webp)
-> Illustration of the QKV self-attention mechanism in Transformer. [Source](https://en.wikipedia.org/wiki/Attention_(machine_learning))
+{% cap() %}Illustration of the QKV self-attention mechanism in Transformer. [Source](https://en.wikipedia.org/wiki/Attention_(machine_learning)){% end %}
The most common method for mapping language into the embedding space is through tokenization and token embedding. A tokenizer maps a word or word fragment into a discrete token index, and an index-fetching embedding layer (implemented in frameworks like PyTorch with `nn.Embedding`) maps this index into a fixed-dimension embedding vector. In principle, all discrete features can be mapped into the embedding space using this approach.
![](token-embedding.webp)
-> Visualization of tokenizer and index-fetching embedding layer. [Source](https://medium.com/@hunter-j-phillips/the-embedding-layer-27d9c980d124)
+{% cap() %}Visualization of tokenizer and index-fetching embedding layer. [Source](https://medium.com/@hunter-j-phillips/the-embedding-layer-27d9c980d124){% end %}
### Vector Quantization
@@ -121,7 +121,7 @@ One approach to reverse vector quantization is readily available in VQ-VAE, sinc
![](magvit.webp)
-> The encoder-decoder structure of MAGVIT (*Yu et al., "MAGVIT"*), a visual VQ-VAE model. A 3D-VQ encoder quantizes a video into discrete tokens, and a 3D-VQ decoder maps them back to the pixel space.
+{% cap() %}The encoder-decoder structure of MAGVIT (*Yu et al., "MAGVIT"*), a visual VQ-VAE model. A 3D-VQ encoder quantizes a video into discrete tokens, and a 3D-VQ decoder maps them back to the pixel space.{% end %}
### Efficiency Enhancement
@@ -133,7 +133,7 @@ Another workaround follows the idea of compression. Take video generation as an
![](video-lavit.webp)
-> Key frames and motion vectors used in *Jin et al., "Video-LaVIT."*
+{% cap() %}Key frames and motion vectors used in *Jin et al., "Video-LaVIT."*{% end %}
## Fuse with Diffusion Models
@@ -143,7 +143,7 @@ An intriguing question arises: why not integrate the structures of language mode
![](transfusion.webp)
-> A Transformer capable of functioning as a language model and a diffusion denoiser at the same time. Source: *Zhou, Chunting, Lili Yu, et al. "Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model," ICLR, 2025.*
+{% cap() %}A Transformer capable of functioning as a language model and a diffusion denoiser at the same time. Source: *Zhou, Chunting, Lili Yu, et al. "Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model," ICLR, 2025.*{% end %}
## Conclusion

View file

@@ -16,12 +16,12 @@ $$
where $\mu$ is the drift component that is deterministic, and $\sigma$ is the diffusion term driven by Brownian motion (denoted by $W_t$) that is stochastic. This differential equation specifies a *time-dependent vector (velocity) field* telling how a data point $x_t$ should be moved as time $t$ evolves from $t=0$ to $t=1$ (i.e., a *flow* from $x_0$ to $x_1$). Below we give an illustration where $x_t$ is 1-dimensional:
![Vector field between two distributions](vector-field.webp)
-> Vector field between two distributions specified by a differential equation.
+{% cap() %}Vector field between two distributions specified by a differential equation.{% end %}
When $\sigma(x_t,t)\equiv 0$, we get an *ordinary differential equation (ODE)* where the vector field is deterministic, i.e., the movement of $x_t$ is fully determined by $\mu$ and $t$. Otherwise, we get a *stochastic differential equation (SDE)* where the movement of $x_t$ has a certain level of randomness. Extending the previous illustration, below we show the difference in flow of $x_t$ under ODE and SDE:
![ODE vs SDE movements](ode-sde-difference.webp)
-> Difference of movements in vector fields specified by ODE and SDE. *Source: Song, Yang, et al. "Score-based generative modeling through stochastic differential equations."* Note that their time is reversed.
+{% cap() %}Difference of movements in vector fields specified by ODE and SDE. *Source: Song, Yang, et al. "Score-based generative modeling through stochastic differential equations."* Note that their time is reversed.{% end %}
As you would imagine, once we manage to solve the differential equation, even if we still cannot have a closed form of $p(x_1)$, we can sample from $p(x_1)$ by sampling a data point $x_0$ from $p(x_0)$ and get the generated data point $x_1$ by calculating the following forward-time integral with an integration technique of our choice:
@@ -32,7 +32,7 @@ $$
Or more intuitively, moving $x_0$ towards $x_1$ along time in the vector field:
![Flow of data point](flow-data-point.webp)
-> A flow of data point moving from $x_0$ towards $x_1$ in the vector field.
+{% cap() %}A flow of data point moving from $x_0$ towards $x_1$ in the vector field.{% end %}
## ODE and Flow Matching
@@ -81,12 +81,12 @@ $$
Although the ground truth vector field is designed to be straight, in practice it usually is not. When the data space is high-dimensional and the target distribution $p(x_1)$ is complex, there will be multiple pairs of $(x_0, x_1)$ that result in the same intermediate data point $x_t$, thus multiple velocities $x_1-x_0$. At the end of the day, the actual ground truth velocity at $x_t$ will be the average of all possible velocities $x_1-x_0$ that pass through $x_t$. This will lead to a "curvy" vector field, illustrated as follows:
![Curvy vector field](curvy-vector-field.webp)
-> Left: multiple vectors passing through the same intermediate data point. Right: the resulting ground truth vector field. *Source: Geng, Zhengyang, et al. "Mean Flows for One-step Generative Modeling."* Note $z_t$ and $v$ in the figure correspond to $x_t$ and $\mu$ in this post, respectively.
+{% cap() %}Left: multiple vectors passing through the same intermediate data point. Right: the resulting ground truth vector field. *Source: Geng, Zhengyang, et al. "Mean Flows for One-step Generative Modeling."* Note $z_t$ and $v$ in the figure correspond to $x_t$ and $\mu$ in this post, respectively.{% end %}
As we discussed, when you calculate the ODE integral, you are using the instantaneous velocity--tangent of the curves in the vector field--of each step. You would imagine this will lead to subpar performance when using a small number $N$ of steps, as demonstrated below:
![Few-step sampling failure](few-step-sampling.webp)
-> Native flow matching models fail at few-step sampling. *Source: Frans, Kevin, et al. "One step diffusion via shortcut models."*
+{% cap() %}Native flow matching models fail at few-step sampling. *Source: Frans, Kevin, et al. "One step diffusion via shortcut models."*{% end %}
### Shortcut Vector Field
@@ -130,14 +130,14 @@ $$
where $\text{sg}$ is stop gradient, i.e., detaching $\mathbf{u}_\text{target}$ from backpropagation, making it a pseudo ground truth. Below is an illustration of the training process provided in the original paper.
![Shortcut model training](shortcut-training.webp)
-> Training of the shortcut models with self-consistency loss.
+{% cap() %}Training of the shortcut models with self-consistency loss.{% end %}
#### Mean Flow
Mean flow is another work sharing the idea of learning velocities that take large step size shortcuts, but with a stronger theoretical foundation and a different approach to training.
![Average velocity illustration](average-velocity.webp)
-> Illustration of the average velocity provided in the original paper.
+{% cap() %}Illustration of the average velocity provided in the original paper.{% end %}
Mean flow defines an *average velocity* as a shortcut between times $t$ and $r$ where $t$ and $r$ are independent:
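For reference, the defining relation of this average velocity, transcribed into this post's notation with $x_t$ the state and $\mu$ the instantaneous velocity (the original paper writes it with $z_t$ and $v$), is:

$$
\mathbf{u}(x_t, r, t) = \frac{1}{t - r} \int_r^t \mu(x_\tau, \tau)\, \mathrm{d}\tau .
$$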
@@ -257,7 +257,7 @@ One caveat of training a "shortcut SDE" is that the ideal result of one-step sam
Below are some preliminary results I obtained from a set of amorphous material generation experiments. You don't need to understand the figure--just know that it shows that applying the idea of learning shortcuts to SDE does yield better results compared to the vanilla SDE when using few-step sampling.
![SDE shortcut results](sde-results.webp)
-> Structural functions of generated materials, sampled in 10 steps.
+{% cap() %}Structural functions of generated materials, sampled in 10 steps.{% end %}
---

View file

@@ -15,7 +15,7 @@ Most diffusion models work by coupling a forward diffusion process and a reverse
![](diffusion-process.webp)
-> The two processes in a typical diffusion model. *Source: Ho, Jain, and Abbeel, "Denoising Diffusion Probabilistic Models."*
+{% cap() %}The two processes in a typical diffusion model. *Source: Ho, Jain, and Abbeel, "Denoising Diffusion Probabilistic Models."*{% end %}
### Understanding DMs
@@ -23,7 +23,7 @@ There are many ways to understand how Diffusion Models (DMs) work. One of the mo
![](ode-sde-flow.webp)
-> Illustrated ODE and SDE flow of a diffusion model on 1-dimensional data. *Source: Song et al., "Score-Based Generative Modeling through Stochastic Differential Equations."*
+{% cap() %}Illustrated ODE and SDE flow of a diffusion model on 1-dimensional data. *Source: Song et al., "Score-Based Generative Modeling through Stochastic Differential Equations."*{% end %}
### DMs Scale Poorly with Few Steps
@@ -37,13 +37,13 @@ Nevertheless, it is observed that their performance typically suffers catastroph
![](few-steps-results.webp)
-> Images generated by conventional DMs with only a few steps of reverse process. *Source: Frans et al., "One Step Diffusion via Shortcut Models."*
+{% cap() %}Images generated by conventional DMs with only a few steps of reverse process. *Source: Frans et al., "One Step Diffusion via Shortcut Models."*{% end %}
To understand why DMs scale poorly with few reverse process steps, we can return to the vector field perspective of DMs. When the target data distribution is complex, the vector field typically contains numerous intersections. When a given $X_t$ and $t$ fall at one of these intersections, the vector points in the averaged direction of all candidates. This causes the generated data to approach the mean of the training data when only a few reverse process steps are used. Another explanation is that the learned vector field is highly curved. Using only a few reverse process steps means attempting to approximate these curves with polylines, which is inherently difficult.
![](dm-scale-poorly.webp)
-> Illustration of why DMs scale poorly with few reverse process steps. *Source: Frans et al., "One Step Diffusion via Shortcut Models."*
+{% cap() %}Illustration of why DMs scale poorly with few reverse process steps. *Source: Frans et al., "One Step Diffusion via Shortcut Models."*{% end %}
We will introduce two branches of methods that aim to scale DMs down to a few or even a single reverse process step: **distillation-based**, which distills a pre-trained DM into a one-step model; and **end-to-end-based**, which trains a one-step DM from scratch.
@@ -73,7 +73,7 @@ This procedure produces increasingly straight flows that can be simulated with v
![](reflow-iterations.webp)
-> Illustrations of vector fields after different numbers of reflow iterations. *Source: Liu, Gong, and Liu, "Flow Straight and Fast."*
+{% cap() %}Illustrations of vector fields after different numbers of reflow iterations. *Source: Liu, Gong, and Liu, "Flow Straight and Fast."*{% end %}
In practice, distillation-based methods are usually trained in two stages: first train a normal DM, and later distill one-step capabilities into it. This introduces additional computational overhead and complexity.
@@ -93,7 +93,7 @@ In theory, without altering the fundamental formulation of DMs, the learnable de
![](consistency-model.webp)
-> A consistency model that learns to map any point on the ODE trajectory to the clean sample. *Source: Song et al., "Consistency Models."*
+{% cap() %}A consistency model that learns to map any point on the ODE trajectory to the clean sample. *Source: Song et al., "Consistency Models."*{% end %}
Formally, CMs learn a function $f_\theta(x_t,t)$ that maps noisy data $x_t$ at time $t$ directly to the clean data $x_0$, satisfying:
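For reference, the property this sentence introduces is usually written as the self-consistency condition, stated here in the post's convention of clean data at $t = 0$ (the original paper anchors it at a small $\epsilon > 0$ instead):

$$
f_\theta(x_t, t) = f_\theta(x_{t'}, t') \quad \text{for all } t, t' \text{ on the same ODE trajectory}, \qquad f_\theta(x_0, 0) = x_0 .
$$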
@@ -135,6 +135,6 @@ Based on this insight, on top of $x_t$ and $t$, shortcut models additionally inc
![](shortcut-training.webp)
-> Illustration of the training process of shortcut models. *Source: Frans et al., "One Step Diffusion via Shortcut Models."*
+{% cap() %}Illustration of the training process of shortcut models. *Source: Frans et al., "One Step Diffusion via Shortcut Models."*{% end %}
Both consistency models and shortcut models can be seamlessly scaled between one-step and multi-step generation to balance quality and efficiency.

View file

@@ -76,8 +76,18 @@ main {
}
img {
+    display: block;
    max-width: 100%;
    height: auto;
+    margin: 0 auto;
+}
+
+figcaption {
+    text-align: center;
+    font-size: 0.9rem;
+    color: var(--muted);
+    margin-top: 0;
+    margin-bottom: 1.5rem;
}
h1, h2, h3, h4 {
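For orientation, a sketch of the markup these new rules are written against, assuming a post pairs an image (plain markdown or the `img` shortcode below) with a `{% cap() %}` caption directly underneath; `display: block` plus `margin: 0 auto` centers the image, `margin-top: 0` keeps the caption tight under it, and `margin-bottom: 1.5rem` separates it from the following text. The inner `<p>` wrapping depends on Zola's markdown rendering and is an assumption:

```html
<!-- hypothetical rendered output of an image followed by {% cap() %}...{% end %} -->
<p><img src="trajectory.webp" alt="trajectory"></p>
<figcaption>
  <p>Examples of human (left) and vehicle (right) trajectories.</p>
</figcaption>
```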

View file

@@ -0,0 +1 @@
+<figcaption>{{ body | markdown | safe }}</figcaption>
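This is the template behind the `{% cap() %}` calls in the content changes above. A caption is written in a post as:

```markdown
{% cap() %}Illustration of end-to-end learning of spatiotemporal trajectories.{% end %}
```

and the shortcode body is run through Zola's `markdown` filter before being wrapped in `<figcaption>`, so inline markdown (links, `code`, emphasis) inside captions keeps working; the exact wrapping of the rendered body (for example, an inner `<p>`) depends on that filter's behavior and is not pinned down here.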

View file

@@ -0,0 +1,3 @@
+<p>
+    <img src="{{ src }}" alt="{{ alt | default(value='') }}" style="max-width: min({{ width | default(value='500px') }}, 100%);">
+</p>
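This mirrors the inline `<img>` tags it replaces in the posts above. A call site looks like:

```markdown
{{ img(src="end-to-end.webp", alt="end-to-end", width="500px") }}
```

and renders to roughly:

```html
<p>
    <img src="end-to-end.webp" alt="end-to-end" style="max-width: min(500px, 100%);">
</p>
```

With `alt` or `width` omitted, the `default` filters fall back to an empty alt text and a 500px cap, so a minimal call such as `{{ img(src="cover.webp") }}` (a hypothetical example, not taken from this diff) would still render with the standard sizing.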