introduce figcaption

Yan Lin 2026-02-06 09:11:05 +01:00
parent 05dea86964
commit d8ea74211f
14 changed files with 81 additions and 63 deletions

@@ -11,16 +11,16 @@ chapter = "Chapter 5"
End-to-end learning means training a model to perform a task from input to output, supervising only on how the output aligns with the task's ground truth.
End-to-end is typically the most straightforward option for building a deep learning method for a given task, and this also applies to most tasks related to spatiotemporal trajectories.
<img src="end-to-end.webp" alt="end-to-end" style="max-width: min(500px, 100%);">
{{ img(src="end-to-end.webp", alt="end-to-end", width="500px") }}
> Illustration of end-to-end learning of spatiotemporal trajectories.
{% cap() %}Illustration of end-to-end learning of spatiotemporal trajectories.{% end %}
In this post we will categorize end-to-end trajectory learning tasks from a technical standpoint: prediction, classification, and imputation.
For each category of tasks, we will give a general problem formulation and a general framework for solving it, and briefly discuss the motivation and use cases of more specific downstream applications that fit into the category.
![categories](categories.webp)
-> Schema overview of the three categories of end-to-end trajectory learning tasks.
+{% cap() %}Schema overview of the three categories of end-to-end trajectory learning tasks.{% end %}
{{ toc() }}

@@ -12,7 +12,7 @@ A spatiotemporal trajectory is a sequence, with each item being a timestamped lo
![trajectory](trajectory.webp)
-> Examples of human (left) and vehicle (right) trajectories.
+{% cap() %}Examples of human (left) and vehicle (right) trajectories.{% end %}
With the development of technologies and devices that can record trajectories, such as GPS-equipped mobile phones, large-scale trajectory data has accumulated.
Such data can be valuable for traffic planning and city development.
@@ -80,7 +80,7 @@ Anomaly detection identifies paths or behaviors that deviate from normal pattern
![tasks](tasks.webp)
-> Illustration of trajectory-related tasks.
+{% cap() %}Illustration of trajectory-related tasks.{% end %}
## Deep Learning for Trajectories

@@ -33,7 +33,7 @@ where each $t_i \in \{1, 2, \ldots, |\mathcal{V}|\}$ is an index into the vocabu
![token](./token.webp)
-> An example of tokenized sentences. Source: [Tiktokenizer](https://tiktokenizer.vercel.app/)
+{% cap() %}An example of tokenized sentences. Source: [Tiktokenizer](https://tiktokenizer.vercel.app/){% end %}
Different tokenization strategies exist, each with trade-offs between vocabulary size and sequence length. Word-level tokenization is intuitive but requires a very large vocabulary. Character-level tokenization has a small vocabulary but produces very long sequences. Modern LLMs use subword tokenization, which breaks words into meaningful subunits that balance vocabulary size and sequence length. For a detailed treatment of tokenization, see [Karpathy's tutorial](https://www.youtube.com/watch?v=zduSFxRajkE).
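To make the trade-off concrete, here is a small Python sketch, assuming the `tiktoken` library is available (an outside dependency chosen for illustration, not something the post itself uses):

```python
# Comparing tokenization granularities on the same sentence.
# Requires the `tiktoken` package (pip install tiktoken).
import tiktoken

text = "Trajectories are sequences of timestamped locations."

char_tokens = list(text)  # character-level: tiny vocabulary, long sequence

enc = tiktoken.get_encoding("cl100k_base")  # BPE used by GPT-4-era models
subword_tokens = enc.encode(text)           # subword-level: balanced trade-off

print(len(char_tokens))     # ~50 character tokens
print(len(subword_tokens))  # ~10 subword tokens
print(enc.decode(subword_tokens) == text)  # lossless round-trip: True
```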
@@ -62,7 +62,7 @@ The dot product of two rotated vectors depends only on the difference of their r
![rope](./rope-rotation.webp)
-> RoPE encodes positional information by rotating input vectors.
+{% cap() %}RoPE encodes positional information by rotating input vectors.{% end %}
This relative formulation improves generalization to longer sequences. A model trained on sequences up to length $L$ has learned to interpret relative distances up to $L$. When processing longer sequences at inference time, it can still reason about token pairs whose relative distance falls within the trained range, even if their absolute positions exceed $L$. Various techniques like positional interpolation and YaRN further extend this capability by carefully scaling the rotation frequencies.
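The relative-position property is easy to verify numerically. Below is a minimal NumPy sketch of the rotation (a simplified stand-in for a real RoPE implementation, assuming the common base frequency of 10000): rotating a query and a key by their positions and taking the dot product gives the same value whenever the positional offset is the same.

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Rotate consecutive 2D pairs of x by position-dependent angles,
    as in Rotary Position Embedding. x: (d,) with d even."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)  # one frequency per 2D pair
    cos, sin = np.cos(pos * freqs), np.sin(pos * freqs)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=(2, 64))
# Same relative offset (4), different absolute positions:
print(rope_rotate(q, 3) @ rope_rotate(k, 7))      # some value v
print(rope_rotate(q, 103) @ rope_rotate(k, 107))  # ≈ the same v
```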
@@ -85,7 +85,7 @@ There are also hybrid approaches that combine both objectives. For example, GLM
![glm](glm.webp)
-> Blank-infilling objective used by GLM.
+{% cap() %}Blank-infilling objective used by GLM.{% end %}
> **References:**
> - Vaswani, Ashish, Noam Shazeer, Niki Parmar, et al. "Attention Is All You Need."
@@ -164,7 +164,7 @@ where {% m() %}\mathbf{e}_i{% end %} is the query and {% m() %}\mathbf{E}_{\text
![trajcogn](./trajcogn.webp)
-> The reprogramming module and anchor words in TrajCogn.
+{% cap() %}The reprogramming module and anchor words in TrajCogn.{% end %}
An alternative approach is to directly serialize trajectory features into text and feed them through the LLM's native tokenizer. For features that are already textual, such as POI names or addresses, this is straightforward. For continuous features like coordinates, one can simply convert numbers to their string representations. This technically works, but LLMs are not well-known for their ability to reason about raw numbers, let alone spatial relationships encoded in coordinate pairs. A trajectory represented as a sequence of "(lat, lng)" strings may be parseable by the model, but whether it can extract meaningful spatial patterns from such input is questionable.
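For concreteness, a minimal sketch of such a naive serialization; the points, template, and rounding precision are made up purely for illustration, not any method's actual format:

```python
# A naive text serialization of a trajectory, of the kind questioned above.
trajectory = [
    (39.9042, 116.4074, "08:15"),  # (lat, lng, time) -- made-up points
    (39.9100, 116.4130, "08:20"),
    (39.9165, 116.4201, "08:25"),
]

serialized = "; ".join(
    f"({lat:.4f}, {lng:.4f}) at {t}" for lat, lng, t in trajectory
)
print(serialized)
# (39.9042, 116.4074) at 08:15; (39.9100, 116.4130) at 08:20; ...
# The string is perfectly parseable, but a subword tokenizer splits the
# digits into opaque pieces; nothing in the representation exposes the
# spatial distance between consecutive points.
```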
@@ -172,7 +172,7 @@ To address this limitation, one can transform trajectory data into modalities th
![traj-mllm](./trajmllm.webp)
-> Mixture of textual and visual representations of trajectory features in Traj-MLLM.
+{% cap() %}Mixture of textual and visual representations of trajectory features in Traj-MLLM.{% end %}
### Efficient Fine-tuning on Trajectories
@@ -213,7 +213,7 @@ Here $g_{[m]}$ represents missing points between $g_1$ and $g_4$, and the model
![uvtm](./uvtm.webp)
-> Implementation of the above generation paradigm in UVTM.
+{% cap() %}Implementation of the above generation paradigm in UVTM.{% end %}
Under this formulation, we can frame different tasks with different masking patterns. For origin-destination travel time estimation, the input is $(l_o, t_o, [m]), (l_d, [m], [m])$, where only the origin coordinate, departure time, and destination coordinate are known, and the model generates the arrival time as part of the destination tuple. For trajectory recovery, known sparse points have complete tuples while gaps between them are marked with $g_{[m]}$, and the model generates the intermediate points. For trajectory prediction, historical points are complete and a single $g_{[m]}$ at the end signals that future points should be generated.
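A sketch of how these masking patterns could be written down as literal input sequences, following the notation from the text; the `M` and `GAP` markers and the tuple layout are illustrative stand-ins, not UVTM's actual input encoding:

```python
M, GAP = "[m]", "g[m]"  # masked feature; masked gap between known points

# Origin-destination travel time estimation: origin location and
# departure time known; the model generates the arrival time.
od_tte = [("l_o", "t_o", M), ("l_d", M, M)]

# Trajectory recovery: sparse known points with masked gaps between
# them; the model generates the intermediate points.
recovery = ["g_1", GAP, "g_4", GAP, "g_7"]

# Trajectory prediction: complete history, one gap at the end signals
# that future points should be generated.
prediction = ["g_1", "g_2", "g_3", GAP]
```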

@@ -14,9 +14,9 @@ This is also very relevant in the context of trajectory learning, since the avai
Self-supervised learning is also widely used to pre-train deep learning models. In the context of trajectories, self-supervised learning can help models acquire a general understanding of trajectory sequences or the components within trajectories; the pre-trained models can later be fine-tuned on specific tasks, or be used directly for unsupervised tasks like clustering.
<img src="self-supervised.webp" alt="self-supervised" style="max-width: min(500px, 100%);">
{{ img(src="self-supervised.webp", alt="self-supervised", width="500px") }}
> Illustration of self-supervised learning of spatiotemporal trajectories.
{% cap() %}Illustration of self-supervised learning of spatiotemporal trajectories.{% end %}
Most widely-adopted self-supervised learning frameworks for trajectories originate from the natural language processing (NLP), graph learning, and computer vision (CV) domains.
In this post we categorize self-supervised learning methods for trajectories based on the framework they adhere to: static word embedding, graph node embedding, contextual word embedding, auto-encoding, and contrastive learning.
@@ -67,14 +67,14 @@ It constructs the Huffman tree by recursively partitioning the geographic space
![poi2vec](poi2vec.webp)
-> Construction of the geography-aware binary tree in POI2Vec.
+{% cap() %}Construction of the geography-aware binary tree in POI2Vec.{% end %}
_TALE_ incorporates temporal periodicity with a time-aware hierarchical softmax structure. Many locations exhibit strong temporal patterns: office buildings are visited during work hours, restaurants peak at meal times, and entertainment venues are active at night. TALE captures these patterns by replacing the standard Huffman tree used in hierarchical softmax with a temporal tree.
The temporal tree has a root node at the top level, followed by time nodes corresponding to equal-length time slices of a day. Below each time node, a Huffman subtree organizes the locations that are visited during that time slice, based on their visit frequencies within that slice.
![tale](tale.webp)
-> The temporal tree in TALE.
+{% cap() %}The temporal tree in TALE.{% end %}
Predicting a visit $(l, t)$ requires traversing a path through this tree, which decomposes into two stages:
@@ -175,7 +175,7 @@ The second design is an additional self-supervised objective called masked hour
![ctle](ctle.webp)
-> The model architecture of CTLE.
+{% cap() %}The model architecture of CTLE.{% end %}
### Applications: Capturing Location Polysemy
@@ -259,7 +259,7 @@ The core idea of contrastive learning is to learn representations by comparing p
![cmc](cmc.webp)
-> Contrasting different views of the same data point versus another data point in contrastive multiview coding.
+{% cap() %}Contrasting different views of the same data point versus another data point in contrastive multiview coding.{% end %}
The _contrastive multiview coding_ framework formalizes this with the InfoNCE loss. Given a data point $\mathbf{x}$, we apply two different augmentations to obtain views $\mathbf{x}^{(1)}$ and $\mathbf{x}^{(2)}$. An encoder $f$ maps each view to an embedding, and the model is trained to identify the positive pair among a set of negatives. For a batch of $N$ data points, the loss for a positive pair $(i, j)$ is:
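For reference, a minimal NumPy sketch of the standard InfoNCE (NT-Xent) computation, under the usual assumptions of cosine similarity and a temperature hyperparameter; this follows the common SimCLR-style formulation rather than any specific method discussed here:

```python
import numpy as np

def info_nce(z1, z2, tau=0.1):
    """InfoNCE / NT-Xent loss for N points with two views each.
    z1, z2: (N, d) embeddings of the two augmented views."""
    z = np.concatenate([z1, z2], axis=0)           # (2N, d)
    z /= np.linalg.norm(z, axis=1, keepdims=True)  # cosine similarity
    sim = z @ z.T / tau                            # (2N, 2N) scores
    np.fill_diagonal(sim, -np.inf)                 # drop self-similarity
    n = len(z1)
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])  # other view
    # Row-wise log-softmax, evaluated at each row's positive column.
    m = sim.max(axis=1, keepdims=True)
    log_denom = m[:, 0] + np.log(np.exp(sim - m).sum(axis=1))
    return -(sim[np.arange(2 * n), pos] - log_denom).mean()

rng = np.random.default_rng(0)
z1, z2 = rng.normal(size=(2, 8, 16))  # batch of N=8, dimension d=16
print(info_nce(z1, z2))
```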