compress images into webp
BIN
content/homelab/modern-unix-cmd/btop.webp
Normal file
|
After Width: | Height: | Size: 150 KiB |
BIN
content/homelab/modern-unix-cmd/eza-list.webp
Normal file
|
After Width: | Height: | Size: 52 KiB |
BIN
content/homelab/modern-unix-cmd/eza-tree.webp
Normal file
|
After Width: | Height: | Size: 29 KiB |
BIN
content/homelab/modern-unix-cmd/htop.webp
Normal file
|
After Width: | Height: | Size: 188 KiB |
|
|
@ -16,11 +16,11 @@ Let's say I am currently in `~/Documents/Projects/personal-blog` and I want to j
|
|||
With the classic `cd`, I will have to type the whole path.
|
||||
With `cd` aliased to `zoxide`, I only need to type `cd n` (supposing that `~/.config/nix` is the most frequently visited directory among all matched directories).
|
||||
|
||||

|
||||

|
||||
|
||||
Internally, `zoxide` records my directory visits in a local database and ranks them by how frequently I visit them. If the first hit is not what I want, I can also interactively select from the matched list.
|
||||
|
||||

|
||||

|
||||
|
||||
## `du` -> `ncdu`
|
||||
|
||||
|
|
@ -29,17 +29,17 @@ Internally `zoxide` records my visits to directories in a SQLite database and so
|
|||
[`ncdu`](https://dev.yorhel.nl/ncdu) is an interactive alternative to `du`, and is very usable out of the box. Interestingly, I also feel it is a touch faster than `du`.
|
||||
It can totally be an alternative to those fancy disk space analyzers as well.
|
||||
|
||||

|
||||

|
||||
|
||||
## `top` -> `btop`
|
||||
|
||||
`top` is quite basic and looks "unexciting". `htop` also ships with most Unix/Linux systems and looks better.
|
||||
|
||||

|
||||

|
||||
|
||||
[`btop`](https://github.com/aristocratos/btop) might be the most "nerdy-looking" `top` alternative out of the box. It can be a handy tool if you are trying to make people believe you are a hacker.
|
||||
|
||||

|
||||

|
||||
|
||||
At the same time, it is very feature-rich and configurable. To some extent, it is also an alternative to bandwidth monitoring tools like `iftop` and disk utilization tools like `df`.
|
||||
|
||||
|
|
@ -47,11 +47,11 @@ At the same time, it is very feature-rich and configurable. To some extent, it i
|
|||
|
||||
I think there is nothing wrong with the classic `ls`. As an alternative, [`eza`](https://github.com/eza-community/eza) just adds a few quality-of-life improvements, like file type icons, Git status, and (based on personal taste) prettier colors.
|
||||
|
||||

|
||||

|
||||
|
||||
It can replace the `tree` command as well.
|
||||
|
||||

|
||||

|
||||
|
||||
## `vim` -> `nvim`
|
||||
|
||||
|
|
@ -63,6 +63,6 @@ You can simply use `vim` keybindings in many editors or IDEs. `vim` itself can f
|
|||
To put it simply, it is a TUI editor that can truly be your only text editor. With countless plugins and ways to configure it, it can be a basic text editor, or a fully-featured development IDE, or anything in-between.
|
||||
Syntax highlighting, file browser, fuzzy search, intelligent autocompletion, debugging, AI™ integration. You name it, `neovim` has it.
|
||||
|
||||

|
||||

|
||||
|
||||

|
||||

|
||||
|
|
|
|||
BIN
content/homelab/modern-unix-cmd/ncdu.webp
Normal file
|
After Width: | Height: | Size: 63 KiB |
BIN
content/homelab/modern-unix-cmd/nvim-1.webp
Normal file
|
After Width: | Height: | Size: 115 KiB |
BIN
content/homelab/modern-unix-cmd/nvim-2.webp
Normal file
|
After Width: | Height: | Size: 112 KiB |
BIN
content/homelab/modern-unix-cmd/zoxide-jump.webp
Normal file
|
After Width: | Height: | Size: 16 KiB |
BIN
content/homelab/modern-unix-cmd/zoxide-select.webp
Normal file
|
After Width: | Height: | Size: 47 KiB |
BIN
content/homelab/nixos-home-server/cover.webp
Normal file
|
After Width: | Height: | Size: 33 KiB |
BIN
content/homelab/nixos-home-server/immich.webp
Normal file
|
After Width: | Height: | Size: 344 KiB |
|
|
@ -6,7 +6,7 @@ description = "How I built a NixOS-based Home Server/Nas"
|
|||
|
||||
This is a very concise walkthrough of my main home server running NixOS. I assume the reader already has basic knowledge about NixOS.
|
||||
|
||||

|
||||

|
||||
|
||||
My home server (or many would rather call it a NAS) serves common home server purposes: bulk storage, basic file sharing, media streaming service, and photo backup.
|
||||
|
||||
|
|
@ -14,7 +14,7 @@ My home server (or many would rather call it a NAS) serves common home server pu
|
|||
|
||||
Below is a recent photo of my home server, living in the utility closet together with my network equipment.
|
||||
|
||||

|
||||

|
||||
|
||||
It is essentially a custom Intel N305 motherboard with a SATA back panel and a 3D-printed enclosure. I bought it on Taobao last time I went back to China to visit my family.
|
||||
The exact hardware is not very important here: as long as you stick to common hardware, it should be relatively straightforward to install NixOS and replicate my setup.
|
||||
|
|
@ -185,7 +185,7 @@ Transmission seems to be more stable, but its interface is so barebones and is m
|
|||
|
||||
For photo backup I use [Immich](https://immich.app/). It is a self-hosted alternative to iCloud Photos and Google Photos. Aside from basic photo backup and management, it also has many nice touches, such as face detection, CLIP-based image embedding for semantic search, and recently added OCR for text search. It also comes with quite robust mobile apps for both iOS and Android.
|
||||
|
||||

|
||||

|
||||
|
||||
Right now Immich is the only service I am running with containers rather than native Nix modules (as you can see in [this configuration file](https://github.com/Logan-Lin/nix-config/blob/master/hosts/nixos/hs/containers.nix)). Technically it is possible to set up Immich with pure Nix modules, but for this type of service that relies on specific versions of databases (in this case, PostgreSQL with vector support), I feel containers are the easier route.
|
||||
And to be honest, I don't think there is much benefit to going with pure Nix modules here (especially for Immich, since you can still [declare its config](https://github.com/Logan-Lin/nix-config/blob/master/config/immich.nix) even with containers), other than satisfying the purism many Nix users seem to have.
|
||||
|
|
@ -212,7 +212,7 @@ The P2P nature of Tailscale also means that, if you have no interest in creating
|
|||
I don't want to complicate things, so I haven't set up any automated system to check the health status of my home server and send notifications if anything goes wrong.
|
||||
I do have a [login display module](https://github.com/Logan-Lin/nix-config/blob/master/modules/login-display.nix) that reports important status information every time I SSH into my home server.
|
||||
|
||||

|
||||

|
||||
|
||||
## Why NixOS?
|
||||
|
||||
|
|
@ -222,6 +222,6 @@ Compared to purposefully built home server systems (like Unraid) and pre-built h
|
|||
|
||||
Compared to other Linux distributions, NixOS is quite suitable for setting up a home server. Since it is declarative, setting up many things is probably easier than you think. In other words, for the most part, you only have to care about **what** you want to achieve, not **how** you are going to achieve it (this is, of course, primarily thanks to the amazing NixOS community). On the other hand, most of the configuration is fully self-contained and tracked in your Nix config repo (supposing you use Git). So it is much less prone to oversights during configuration, and you also don't have to explicitly remember your setup for future reference. Before switching my home server to NixOS, I had been using nix-darwin on my MacBook for a while, so I also get to reuse a lot of custom modules, like the [neovim module](https://github.com/Logan-Lin/nix-config/blob/master/modules/nvim.nix).
|
||||
|
||||

|
||||

|
||||
|
||||
> It looks completely identical (why not), to the point that I have to set up visual hints (like the highlighted tmux hostname display) to remind myself which host I am currently on.
|
||||
|
|
|
|||
BIN
content/homelab/nixos-home-server/login-display.webp
Normal file
|
After Width: | Height: | Size: 17 KiB |
BIN
content/homelab/nixos-home-server/server-photo.webp
Normal file
|
After Width: | Height: | Size: 261 KiB |
BIN
content/homelab/nixos-home-server/terminal-comparison.webp
Normal file
|
After Width: | Height: | Size: 146 KiB |
BIN
content/homelab/replace-cloud-w-sync/calibre.webp
Normal file
|
After Width: | Height: | Size: 186 KiB |
BIN
content/homelab/replace-cloud-w-sync/foobar2000.webp
Normal file
|
After Width: | Height: | Size: 978 KiB |
|
|
@ -23,7 +23,7 @@ As long as the service can achieve one functionality: always keep a full copy of
|
|||
I do want to recommend a service for this purpose: [Syncthing](https://syncthing.net/). This is a peer-to-peer file sync service, which means it has minimal reliance on cloud infrastructure, and your data never has to be stored on computers that are not yours.
|
||||
Also, from my experience, every cloud storage service I've used frequently runs into stability issues when I try to sync tons of small files at once (e.g., a Git repo), but Syncthing has never been unstable no matter how much I abuse it.
|
||||
|
||||

|
||||

|
||||
|
||||
## Examples
|
||||
|
||||
|
|
@ -36,17 +36,17 @@ One replacement for Notion that has no reliance on cloud what so ever is [Obsidi
|
|||
Every type of data needed by Obsidian, including the notes themselves, settings, plugins, and GUI customization, is stored locally (even better, in plain text).
|
||||
Once you use a local file sync service to sync your Obsidian vault folder, it works like the cloud in that everything is always in sync, but without any of the cloud's downsides.
|
||||
|
||||

|
||||

|
||||
|
||||
### Reference Management: Zotero
|
||||
|
||||
[Zotero](https://www.zotero.org/) is reference management software that can be used in a variety of scenarios. I largely use it to manage the academic papers I need to read.
|
||||
|
||||

|
||||

|
||||
|
||||
Zotero has built-in cloud sync functionality, but the price for storage upgrades is quite high. One thing you might not know is that Zotero stores metadata and attachments in the same folder. You can use Syncthing to sync that folder and completely ignore the official cloud sync functionality.
|
||||
|
||||

|
||||

|
||||
|
||||
### Paper Writing: Overleaf vs. Local Text Editor
|
||||
|
||||
|
|
@ -71,7 +71,7 @@ clean:
|
|||
rm -rf out
|
||||
```
|
||||
|
||||

|
||||

|
||||
|
||||
Overleaf also provides two types of Git integration for you to sync your local changes with Overleaf projects: sync with a GitHub repo, or directly as a remote git repo. It's totally viable to have a mixed setup, where you primarily use local editors and most of your collaborators use Overleaf.
|
||||
|
||||
|
|
@ -79,7 +79,7 @@ Overleaf also provides two types of Git integration for you to sync your local c
|
|||
|
||||
[Calibre](https://calibre-ebook.com/) is book management software that lets you organize your book collection and edit metadata, along with many handy features like bulk format conversion.
|
||||
|
||||

|
||||

|
||||
|
||||
Similar to Zotero, Calibre stores all the books and metadata of a library in a local folder, so there is nothing stopping you from syncing the folder across multiple computers. Although the software explicitly advises against this (when you select a location for a library, it warns: "Note that putting the calibre library on a Networked drive is not safe"), in my experience, as long as you don't open and modify the same library on two synced computers simultaneously, you won't run into any issues.
|
||||
|
||||
|
|
@ -92,7 +92,7 @@ It is my own cloud infrastructure, nevertheless it is a cloud infrastructure so
|
|||
Now I use a simpler yet more robust setup. I just sync all my music files in a folder through Syncthing, and use a local music player like [foobar2000](https://www.foobar2000.org/) to read that folder.
|
||||
Of course, to save some space, I always transcode each file to AAC 256k before putting it in the sync folder.
|
||||
|
||||

|
||||

|
||||
|
||||
## Limitations
|
||||
|
||||
|
|
|
|||
BIN
content/homelab/replace-cloud-w-sync/latex.webp
Normal file
|
After Width: | Height: | Size: 223 KiB |
BIN
content/homelab/replace-cloud-w-sync/obsidian.webp
Normal file
|
After Width: | Height: | Size: 135 KiB |
BIN
content/homelab/replace-cloud-w-sync/syncthing.webp
Normal file
|
After Width: | Height: | Size: 68 KiB |
BIN
content/homelab/replace-cloud-w-sync/zotero-files.webp
Normal file
|
After Width: | Height: | Size: 113 KiB |
BIN
content/homelab/replace-cloud-w-sync/zotero.webp
Normal file
|
After Width: | Height: | Size: 191 KiB |
|
|
@ -118,7 +118,7 @@ SEASONNFO
|
|||
|
||||
# Handle thumbnail: rename and copy as posters
|
||||
local thumb_file=""
|
||||
for ext in jpg webp png; do
|
||||
for ext in webp jpg png; do
|
||||
if [[ -f "$dir/$name_noext.$ext" ]]; then
|
||||
thumb_file="$dir/$name_noext.$ext"
|
||||
break
|
||||
|
|
@ -129,11 +129,11 @@ SEASONNFO
|
|||
local thumb_ext="${thumb_file##*.}"
|
||||
mv "$thumb_file" "$dir/$name_noext-thumb.$thumb_ext" 2>/dev/null
|
||||
|
||||
if [[ ! -f "$series_dir/poster.jpg" ]] && [[ ! -f "$series_dir/poster.webp" ]] && [[ ! -f "$series_dir/poster.png" ]]; then
|
||||
if [[ ! -f "$series_dir/poster.webp" ]] && [[ ! -f "$series_dir/poster.webp" ]] && [[ ! -f "$series_dir/poster.webp" ]]; then
|
||||
cp "$dir/$name_noext-thumb.$thumb_ext" "$series_dir/poster.$thumb_ext"
|
||||
fi
|
||||
|
||||
if [[ ! -f "$season_dir/poster.jpg" ]] && [[ ! -f "$season_dir/poster.webp" ]] && [[ ! -f "$season_dir/poster.png" ]]; then
|
||||
if [[ ! -f "$season_dir/poster.webp" ]] && [[ ! -f "$season_dir/poster.webp" ]] && [[ ! -f "$season_dir/poster.webp" ]]; then
|
||||
cp "$dir/$name_noext-thumb.$thumb_ext" "$season_dir/poster.$thumb_ext"
|
||||
fi
|
||||
fi
|
||||
|
|
@ -144,15 +144,15 @@ I also include the thumbnail extraction logic in the implementation. The end res
|
|||
|
||||
```
|
||||
PMM_LORD
|
||||
├── poster.jpg
|
||||
├── poster.webp
|
||||
├── Season 2025
|
||||
│ ├── poster.jpg
|
||||
│ ├── poster.webp
|
||||
│ ├── S2025E1031 - 【PGN】狐狸鸣泣之时——寂静岭f.mp4
|
||||
│ ├── S2025E1031 - 【PGN】狐狸鸣泣之时——寂静岭f.nfo
|
||||
│ ├── S2025E1031 - 【PGN】狐狸鸣泣之时——寂静岭f-thumb.jpg
|
||||
│ ├── S2025E1031 - 【PGN】狐狸鸣泣之时——寂静岭f-thumb.webp
|
||||
│ ├── S2025E1127 - 【PGN】卡洛斯传奇略人区——宝可梦传说ZA.mp4
|
||||
│ ├── S2025E1127 - 【PGN】卡洛斯传奇略人区——宝可梦传说ZA.nfo
|
||||
│ ├── S2025E1127 - 【PGN】卡洛斯传奇略人区——宝可梦传说ZA-thumb.jpg
|
||||
│ ├── S2025E1127 - 【PGN】卡洛斯传奇略人区——宝可梦传说ZA-thumb.webp
|
||||
│ └── season.nfo
|
||||
└── tvshow.nfo
|
||||
|
||||
|
|
@ -175,14 +175,14 @@ And the episode `.nfo` file will record the title, upload date, and video descri
|
|||
|
||||
Finally, in Jellyfin/Emby, we can set up a "TV show" type library, but uncheck all the metadata fetching sources so that the information will only be provided by the `.nfo` files.
|
||||
|
||||

|
||||

|
||||
|
||||
And it works!
|
||||
|
||||

|
||||

|
||||
|
||||
And of course it also works nicely with third-party clients like Infuse on my mobile devices/TV.
|
||||
|
||||

|
||||

|
||||
|
||||
I packaged my custom yt-dlp setup into a Nix module, which you can take a look at via [this link](https://github.com/Logan-Lin/nix-config/blob/master/modules/yt-dlp.nix) if interested.
|
||||
|
|
|
|||
BIN
content/homelab/yt-dlp-tv-show/infuse-client.webp
Normal file
|
After Width: | Height: | Size: 96 KiB |
BIN
content/homelab/yt-dlp-tv-show/jellyfin-result.webp
Normal file
|
After Width: | Height: | Size: 89 KiB |
BIN
content/homelab/yt-dlp-tv-show/jellyfin-settings.webp
Normal file
|
After Width: | Height: | Size: 36 KiB |
|
|
@ -24,11 +24,11 @@ Since images and language modalities represent continuous and discrete data resp
|
|||
|
||||
The goal of a multi-modal Transformer is to create a model that can accept multi-modal inputs and produce multi-modal outputs. For example, instead of using a CNN-based image encoder and a Transformer-based language encoder to map image and language modalities to the latent space separately, a multi-modal Transformer would be able to process the combination of image and language (sentence) as a single sequence.
|
||||
|
||||

|
||||

|
||||
|
||||
> An example of "conventional" multi-modal fusion. Each modality is processed by a separate model and fused at some point. Source: *Xiang, Hao, Runsheng Xu, and Jiaqi Ma. "HM-ViT: Hetero-modal vehicle-to-vehicle cooperative perception with vision transformer." CVPR, 2023.*
|
||||
|
||||

|
||||

|
||||
|
||||
> An example of a Transformer that can handle multi-modal inputs and outputs. Different modalities are all projected into tokens and subsequently processed by a unified Transformer encoder. Source: *Kondratyuk, Dan, Lijun Yu, et al. "VideoPoet: A Large Language Model for Zero-Shot Video Generation," ICML, 2024.*
|
||||
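In its simplest form, the unified approach illustrated above boils down to projecting each modality into token embeddings of the same dimension and concatenating them into one sequence for a single Transformer. A minimal sketch (shapes and names are illustrative, not from any specific model):

```python
import torch

def fuse_modalities(text_emb, image_emb):
    # text_emb: (batch, n_text_tokens, d); image_emb: (batch, n_image_tokens, d)
    # Treat image tokens and text tokens as one sequence so a single
    # Transformer encoder can attend across both modalities.
    return torch.cat([image_emb, text_emb], dim=1)  # (batch, n_image + n_text, d)
```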
|
||||
|
|
@ -38,13 +38,13 @@ Beyond multi-modal processing, a multi-function Transformer can, for example, fu
|
|||
|
||||
A fundamental challenge in unifying multiple modalities within a single Transformer is how to represent different modalities in the same embedding space. For the "QKV" self-attention mechanism to work properly, each item in the input sequence must be represented by an embedding vector of the same dimension, matching the "model dimension" of the Transformer.
|
||||
|
||||

|
||||

|
||||
|
||||
> Illustration of the QKV self-attention mechanism in Transformer. [Source](https://en.wikipedia.org/wiki/Attention_(machine_learning))
|
||||
|
||||
The most common method for mapping language into the embedding space is through tokenization and token embedding. A tokenizer maps a word or word fragment into a discrete token index, and an index-fetching embedding layer (implemented in frameworks like PyTorch with `nn.Embedding`) maps this index into a fixed-dimension embedding vector. In principle, all discrete features can be mapped into the embedding space using this approach.
|
||||
|
||||

|
||||

|
||||
|
||||
> Visualization of tokenizer and index-fetching embedding layer. [Source](https://medium.com/@hunter-j-phillips/the-embedding-layer-27d9c980d124)
|
||||
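A minimal PyTorch sketch of the index-fetching embedding layer described above (vocabulary size, dimensions, and token indices are illustrative):

```python
import torch
import torch.nn as nn

vocab_size, d_model = 32000, 768                # illustrative sizes
embed = nn.Embedding(vocab_size, d_model)       # index-fetching embedding layer

token_ids = torch.tensor([[101, 2009, 2003]])   # hypothetical tokenizer output
token_embeddings = embed(token_ids)             # shape: (1, 3, 768)
```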
|
||||
|
|
@ -58,7 +58,7 @@ Vector quantization maintains a "codebook" $\boldsymbol C \in \mathbb R^{n\times
|
|||
$$
|
||||
i = \arg\min_j ||\boldsymbol z - \boldsymbol C_j||_2
|
||||
$$
|
||||
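A minimal PyTorch sketch of this nearest-code lookup (codebook size and feature dimension are illustrative):

```python
import torch

def vq_tokenize(z, codebook):
    # z: (batch, d) continuous features; codebook: (n, d) code vectors.
    # Compute L2 distances to every codebook entry and pick the nearest
    # entry's index as the discrete token.
    dists = torch.cdist(z, codebook)   # (batch, n)
    return dists.argmin(dim=-1)        # (batch,)

codebook = torch.randn(1024, 256)      # n=1024 codes, d=256 dims (illustrative)
tokens = vq_tokenize(torch.randn(8, 256), codebook)
```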

|
||||

|
||||
|
||||
### Lookup-Free Quantization
|
||||
|
||||
|
|
@ -119,7 +119,7 @@ For language generation, Transformers typically use classifier output layers, ma
|
|||
|
||||
One approach to reverse vector quantization is readily available in VQ-VAE, since it is an auto-encoder. Given a token $i$, we can look up its embedding in the codebook as $\boldsymbol C_i$, then apply a decoder network to map $\boldsymbol C_i$ back to the continuous feature vector $\boldsymbol z$. The decoder network can either be pre-trained within the VQ-VAE framework (training the VQ-VAE tokenizer, encoder, and decoder with auto-encoding losses) or trained end-to-end along with the whole Transformer. In the NLP and CV communities, the pre-training approach is more popular, since there are many large-scale pre-trained auto-encoders available.
|
||||
|
||||

|
||||

|
||||
|
||||
> The encoder-decoder structure of MAGVIT (*Yu et al., "MAGVIT"*), a visual VQ-VAE model. A 3D-VQ encoder quantizes a video into discrete tokens, and a 3D-VQ decoder maps them back to the pixel space.
|
||||
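A minimal sketch of this reverse direction, assuming a trained codebook tensor and a decoder callable are available (names are illustrative):

```python
import torch

def detokenize(tokens, codebook, decoder):
    # tokens: (batch, seq_len) discrete indices; codebook: (n, d) code vectors.
    # Look up each token's code vector, then let the (pre-trained or jointly
    # trained) decoder map the quantized features back to continuous data.
    z_q = codebook[tokens]      # (batch, seq_len, d) embedding lookup
    return decoder(z_q)         # e.g., back to pixel space for images/video
```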
|
||||
|
|
@ -131,7 +131,7 @@ There are several workarounds to improve the efficiency of multi-modal outputs.
|
|||
|
||||
Another workaround follows the idea of compression. Take video generation as an example: the model generates full features for key frames and lightweight features for motion vectors that describe subtle differences from those key frames. This is essentially how inter-frame compressed video codecs work, taking advantage of temporal redundancy between neighboring frames.
|
||||
|
||||

|
||||

|
||||
|
||||
> Key frames and motion vectors used in *Jin et al., "Video-LaVIT."*
|
||||
|
||||
|
|
@ -141,7 +141,7 @@ Despite continuous efforts to enable representation and generation of images and
|
|||
|
||||
An intriguing question arises: why not integrate the structures of language models and diffusion models into one Transformer to reach the best of both worlds? *Zhou et al. in "Transfusion"* explored this idea. The approach is straightforward: build a Transformer that can handle both language and image inputs and outputs. The language component functions as a language model, while the image component serves as a denoiser network for diffusion models. The model is trained by combining the language modeling loss and DDPM loss, enabling it to function either as a language model or a text-to-image denoiser.
|
||||
|
||||

|
||||

|
||||
|
||||
> A Transformer capable of functioning as a language model and a diffusion denoiser at the same time. Source: *Zhou, Chunting, Lili Yu, et al. "Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model," ICLR, 2025.*
|
||||
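A rough sketch of the combined objective, assuming the Transformer's outputs have already been split into a language-model head over text positions and a noise-prediction head over image positions; `lam` stands in for the loss-balancing weight (an assumption on my part, not the paper's exact variable name):

```python
import torch.nn.functional as F

def transfusion_style_loss(text_logits, text_targets, noise_pred, noise, lam=1.0):
    # Next-token cross-entropy on text positions plus DDPM noise-prediction
    # MSE on image positions, produced by a single shared Transformer.
    lm_loss = F.cross_entropy(text_logits.flatten(0, 1), text_targets.flatten())
    ddpm_loss = F.mse_loss(noise_pred, noise)
    return lm_loss + lam * ddpm_loss
```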
|
||||
|
|
|
|||
BIN
content/ml-tech/multi-modal-transformer/magvit.webp
Normal file
|
After Width: | Height: | Size: 85 KiB |
BIN
content/ml-tech/multi-modal-transformer/multi-modal-fusion.webp
Normal file
|
After Width: | Height: | Size: 56 KiB |
BIN
content/ml-tech/multi-modal-transformer/qkv-attention.webp
Normal file
|
After Width: | Height: | Size: 43 KiB |
BIN
content/ml-tech/multi-modal-transformer/token-embedding.webp
Normal file
|
After Width: | Height: | Size: 48 KiB |
BIN
content/ml-tech/multi-modal-transformer/transfusion.webp
Normal file
|
After Width: | Height: | Size: 25 KiB |
BIN
content/ml-tech/multi-modal-transformer/vector-quantization.webp
Normal file
|
After Width: | Height: | Size: 72 KiB |
BIN
content/ml-tech/multi-modal-transformer/video-lavit.webp
Normal file
|
After Width: | Height: | Size: 115 KiB |
BIN
content/ml-tech/multi-modal-transformer/video-poet.webp
Normal file
|
After Width: | Height: | Size: 64 KiB |
BIN
content/ml-tech/new-bert/ar-mask.webp
Normal file
|
After Width: | Height: | Size: 34 KiB |
|
|
@ -69,7 +69,7 @@ These are basically free performance improvement to BERT.
|
|||
|
||||
Vanilla BERT uses the original Transformer layer normalization design: a layer normalization is applied after each residual connection. Some modernized BERT models use an alternative design called pre-layer normalization, which moves the normalization layer inside the residual connection.
|
||||
|
||||

|
||||

|
||||
|
||||
> On layer normalization in the transformer architecture (2020). Xiong, Ruibin and Yang, Yunchang and He, Di and Zheng, Kai and Zheng, Shuxin and Xing, Chen and Zhang, Huishuai and Lan, Yanyan and Wang, Liwei and Liu, Tieyan.
|
||||
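For concreteness, a minimal pre-LN Transformer block sketch in PyTorch; this is a simplified illustration of the design, not any specific model's implementation:

```python
import torch.nn as nn

class PreLNBlock(nn.Module):
    # Pre-layer normalization: LayerNorm sits inside the residual branch,
    # in contrast to the original post-LN design (norm after the residual add).
    def __init__(self, d_model, n_heads, d_ff):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]   # residual add happens after the sub-layer
        x = x + self.ff(self.norm2(x))
        return x
```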
|
||||
|
|
@ -87,7 +87,7 @@ Another aspect of improvement is how the masked tokens are selected. Vanilla BER
|
|||
|
||||
If you were to train BERT to perform generative tasks, randomly masking and recovering tokens in input sequences might not be enough, and you should consider more generation-oriented pre-training tasks. An intuitive design is an AR-like generation task where a long, consecutive sub-sequence is fully masked and has to be recovered.
|
||||
|
||||

|
||||

|
||||
|
||||
> Unveiling the Potential of BERT-family: A New Recipe for Building Scalable, General and Competitive Large Language Models (2025). Xiao, Yisheng and Li, Juntao and Hu, Wenpeng and Luo, Zhunchen and Zhang, Min.
|
||||
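A small sketch of what such a generation-oriented masking step could look like; the `span_ratio` parameter and the suffix-style placement are illustrative assumptions, not the paper's exact recipe:

```python
import torch

def mask_consecutive_span(tokens, mask_id, span_ratio=0.5):
    # Mask one long consecutive sub-sequence for AR-like recovery,
    # instead of scattering random masks across the sequence.
    seq_len = tokens.shape[-1]
    span_len = max(1, int(seq_len * span_ratio))
    start = torch.randint(0, seq_len - span_len + 1, (1,)).item()
    corrupted = tokens.clone()
    corrupted[..., start:start + span_len] = mask_id
    return corrupted, (start, start + span_len)   # masked input + target span
```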
|
||||
|
|
|
|||
BIN
content/ml-tech/new-bert/normalization.webp
Normal file
|
After Width: | Height: | Size: 27 KiB |
BIN
content/ml-tech/ode-sde/average-velocity.webp
Normal file
|
After Width: | Height: | Size: 28 KiB |
BIN
content/ml-tech/ode-sde/curvy-vector-field.webp
Normal file
|
After Width: | Height: | Size: 52 KiB |
BIN
content/ml-tech/ode-sde/few-step-sampling.webp
Normal file
|
After Width: | Height: | Size: 80 KiB |
BIN
content/ml-tech/ode-sde/flow-data-point.webp
Normal file
|
After Width: | Height: | Size: 55 KiB |
|
|
@ -15,12 +15,12 @@ $$
|
|||
|
||||
where $\mu$ is the drift component that is deterministic, and $\sigma$ is the diffusion term driven by Brownian motion (denoted by $W_t$) that is stochastic. This differential equation specifies a *time-dependent vector (velocity) field* telling how a data point $x_t$ should be moved as time $t$ evolves from $t=0$ to $t=1$ (i.e., a *flow* from $x_0$ to $x_1$). Below we give an illustration where $x_t$ is 1-dimensional:
|
||||
|
||||

|
||||

|
||||
> Vector field between two distributions specified by a differential equation.
|
||||
|
||||
When $\sigma(x_t,t)\equiv 0$, we get an *ordinary differential equation (ODE)* where the vector field is deterministic, i.e., the movement of $x_t$ is fully determined by $\mu$ and $t$. Otherwise, we get a *stochastic differential equation (SDE)* where the movement of $x_t$ has a certain level of randomness. Extending the previous illustration, below we show the difference in flow of $x_t$ under ODE and SDE:
|
||||
|
||||

|
||||

|
||||
> Difference of movements in vector fields specified by ODE and SDE. *Source: Song, Yang, et al. "Score-based generative modeling through stochastic differential equations."* Note that their time is reversed.
|
||||
|
||||
As you would imagine, once we manage to solve the differential equation, even if we still cannot have a closed form of $p(x_1)$, we can sample from $p(x_1)$ by sampling a data point $x_0$ from $p(x_0)$ and get the generated data point $x_1$ by calculating the following forward-time integral with an integration technique of our choice:
|
||||
|
|
@ -31,7 +31,7 @@ $$
|
|||
|
||||
Or more intuitively, moving $x_0$ towards $x_1$ along time in the vector field:
|
||||
|
||||

|
||||

|
||||
> A flow of data point moving from $x_0$ towards $x_1$ in the vector field.
|
||||
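To make this sampling procedure concrete, here is a minimal Euler (ODE) / Euler-Maruyama (SDE) integration sketch, assuming the drift $\mu$ and diffusion $\sigma$ are available as Python callables `mu(x, t)` and `sigma(x, t)`:

```python
import torch

def sample_flow(mu, x0, sigma=None, n_steps=100):
    # Integrate dx = mu(x, t) dt (+ sigma(x, t) dW) forward from t=0 to t=1.
    # x0 is a batch of samples drawn from p(x_0).
    x, dt = x0, 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        x = x + mu(x, t) * dt
        if sigma is not None:
            # Brownian increment dW has standard deviation sqrt(dt)
            x = x + sigma(x, t) * torch.randn_like(x) * dt ** 0.5
    return x
```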
|
||||
## ODE and Flow Matching
|
||||
|
|
@ -80,12 +80,12 @@ $$
|
|||
|
||||
Although the ground truth vector field is designed to be straight, in practice it usually is not. When the data space is high-dimensional and the target distribution $p(x_1)$ is complex, there will be multiple pairs of $(x_0, x_1)$ that result in the same intermediate data point $x_t$, thus multiple velocities $x_1-x_0$. At the end of the day, the actual ground truth velocity at $x_t$ will be the average of all possible velocities $x_1-x_0$ that pass through $x_t$. This will lead to a "curvy" vector field, illustrated as follows:
|
||||
|
||||

|
||||

|
||||
> Left: multiple vectors passing through the same intermediate data point. Right: the resulting ground truth vector field. *Source: Geng, Zhengyang, et al. "Mean Flows for One-step Generative Modeling."* Note $z_t$ and $v$ in the figure correspond to $x_t$ and $\mu$ in this post, respectively.
|
||||
|
||||
As we discussed, when you calculate the ODE integral, you are using the instantaneous velocity--tangent of the curves in the vector field--of each step. You would imagine this will lead to subpar performance when using a small number $N$ of steps, as demonstrated below:
|
||||
|
||||

|
||||

|
||||
> Native flow matching models fail at few-step sampling. *Source: Frans, Kevin, et al. "One step diffusion via shortcut models."*
|
||||
|
||||
### Shortcut Vector Field
|
||||
|
|
@ -129,14 +129,14 @@ $$
|
|||
|
||||
Where $\text{sg}$ is stop gradient, i.e., detach $\mathbf{u}_\text{target}$ from back propagation, making it a pseudo ground truth. Below is an illustration of the training process provided in the original paper.
|
||||
|
||||

|
||||

|
||||
> Training of the shortcut models with self-consistency loss.
|
||||
|
||||
#### Mean Flow
|
||||
|
||||
Mean flow is another work sharing the idea of learning velocities that take large step size shortcuts but with a stronger theoretical foundation and a different approach to training.
|
||||
|
||||

|
||||

|
||||
> Illustration of the average velocity provided in the original paper.
|
||||
|
||||
Mean flow defines an *average velocity* as a shortcut between times $t$ and $r$ where $t$ and $r$ are independent:
|
||||
|
|
@ -256,7 +256,7 @@ One caveat of training a "shortcut SDE" is that the ideal result of one-step sam
|
|||
|
||||
Below are some preliminary results I obtained from a set of amorphous material generation experiments. You don't need to understand the figure--just know that it shows that applying the idea of learning shortcuts to SDE does yield better results compared to the vanilla SDE when using few-step sampling.
|
||||
|
||||

|
||||

|
||||
> Structural functions of generated materials, sampled in 10 steps.
|
||||
|
||||
---
|
||||
|
|
|
|||
BIN
content/ml-tech/ode-sde/ode-sde-difference.webp
Normal file
|
After Width: | Height: | Size: 103 KiB |
BIN
content/ml-tech/ode-sde/sde-results.webp
Normal file
|
After Width: | Height: | Size: 26 KiB |
BIN
content/ml-tech/ode-sde/shortcut-training.webp
Normal file
|
After Width: | Height: | Size: 93 KiB |
BIN
content/ml-tech/ode-sde/vector-field.webp
Normal file
|
After Width: | Height: | Size: 50 KiB |
BIN
content/ml-tech/one-step-diffusion-models/consistency-model.webp
Normal file
|
After Width: | Height: | Size: 103 KiB |
BIN
content/ml-tech/one-step-diffusion-models/diffusion-process.webp
Normal file
|
After Width: | Height: | Size: 34 KiB |
BIN
content/ml-tech/one-step-diffusion-models/dm-scale-poorly.webp
Normal file
|
After Width: | Height: | Size: 80 KiB |
BIN
content/ml-tech/one-step-diffusion-models/few-steps-results.webp
Normal file
|
After Width: | Height: | Size: 49 KiB |
|
|
@ -13,7 +13,7 @@ Diffusion models (DMs), or more broadly speaking, score-matching generative mode
|
|||
|
||||
Most diffusion models work by coupling a forward diffusion process and a reverse denoising diffusion process. The forward diffusion process gradually adds noise to the ground truth clean data $X_0$, until noisy data $X_T$ that follows a relatively simple distribution is reached. The reverse denoising diffusion process starts from the noisy data $X_T$, and removes the noise component step-by-step until clean generated data $X_0$ is reached. The reverse process is essentially a Monte-Carlo process, meaning it cannot be parallelized for each generation, which can be inefficient for a process with a large number of steps.
|
||||
|
||||

|
||||

|
||||
|
||||
> The two processes in a typical diffusion model. *Source: Ho, Jain, and Abbeel, "Denoising Diffusion Probabilistic Models."*
|
||||
|
||||
|
|
@ -21,7 +21,7 @@ Most diffusion models work by coupling a forward diffusion process and a reverse
|
|||
|
||||
There are many ways to understand how Diffusion Models (DMs) work. One of the most common and intuitive approaches is that a DM learns an ordinary differential equation (ODE) or a stochastic differential equation (SDE) that transforms noise into data. Imagine a vector field between the noise $X_T$ and clean data $X_0$. By training on a sufficiently large number of timesteps $t\in [0,T]$, a DM is able to learn the vector (tangent) pointing towards the cleaner data $X_{t-\Delta t}$, given any specific timestep $t$ and the corresponding noisy data $X_t$. This idea is easy to illustrate in a simplified 1-dimensional data scenario.
|
||||
|
||||

|
||||

|
||||
|
||||
> Illustrated ODE and SDE flow of a diffusion model on 1-dimensional data. *Source: Song et al., "Score-Based Generative Modeling through Stochastic Differential Equations."*
|
||||
|
||||
|
|
@ -35,13 +35,13 @@ Vanilla DDPM, which is essentially a discrete-timestep DM, can only perform the
|
|||
|
||||
Nevertheless, it is observed that their performance typically suffers catastrophic degradation when reducing the number of reverse process steps to single digits.
|
||||
|
||||

|
||||

|
||||
|
||||
> Images generated by conventional DMs with only a few steps of reverse process. *Source: Frans et al., "One Step Diffusion via Shortcut Models."*
|
||||
|
||||
To understand why DMs scale poorly with few reverse process steps, we can return to the vector field perspective of DMs. When the target data distribution is complex, the vector field typically contains numerous intersections. When a given $X_t$ and $t$ lands on one of these intersections, the vector points in the averaged direction of all candidates. This causes the generated data to approach the mean of the training data when only a few reverse process steps are used. Another explanation is that the learned vector field is highly curved. Using only a few reverse process steps means attempting to approximate these curves with polylines, which is inherently difficult.
|
||||
|
||||

|
||||

|
||||
|
||||
> Illustration of why DMs scale poorly with few reverse process steps. *Source: Frans et al., "One Step Diffusion via Shortcut Models."*
|
||||
|
||||
|
|
@ -71,7 +71,7 @@ $$
|
|||
|
||||
This procedure produces increasingly straight flows that can be simulated with very few steps, ideally one step after several iterations.
|
||||
|
||||

|
||||

|
||||
|
||||
> Illustrations of vector fields after different times of reflow processes. *Source: Liu, Gong, and Liu, "Flow Straight and Fast."*
|
||||
|
||||
|
|
@ -91,7 +91,7 @@ x_t = \sqrt{\bar{\alpha}_t} \, x_0 + \sqrt{1 - \bar{\alpha}_t} \, \epsilon_t
|
|||
|
||||
In theory, without altering the fundamental formulation of DMs, the learnable denoiser network can be designed to predict any of these three components. Consistency models (CMs) follow this principle by training the denoiser to specifically predict the clean sample $x_0$. The benefit of this approach is that CMs can naturally scale to perform the reverse process with few steps or even a single step.
|
||||
|
||||

|
||||

|
||||
|
||||
> A consistency model that learns to map any point on the ODE trajectory to the clean sample. *Source: Song et al., "Consistency Models."*
|
||||
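As a quick sanity check on why the parameterization choice is flexible, here is the re-arrangement in code, using the standard DDPM notation from the formula above (tensor shapes assumed broadcastable):

```python
import torch

def x0_from_eps(x_t, eps_pred, alpha_bar_t):
    # Re-arranging x_t = sqrt(a)*x_0 + sqrt(1-a)*eps with a = alpha_bar_t:
    # given x_t, predicting either x_0 or eps determines the other,
    # so the denoiser may be parameterized to predict whichever is convenient.
    return (x_t - torch.sqrt(1 - alpha_bar_t) * eps_pred) / torch.sqrt(alpha_bar_t)
```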
|
||||
|
|
@ -133,7 +133,7 @@ Based on this insight, on top of $x_t$ and $t$, shortcut models additionally inc
|
|||
\mathbf{s}_{\text{target}} = s_\theta(x_t, t, d)/2 + s_\theta(x'_{t+d}, t + d, d)/2 \quad \text{and} \quad x'_{t+d} = x_t + s_\theta(x_t, t, d)d
|
||||
{% end %}
|
||||
|
||||

|
||||

|
||||
|
||||
> Illustration of the training process of shortcut models. *Source: Frans et al., "One Step Diffusion via Shortcut Models."*
|
||||
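A minimal sketch of how the self-consistency target above could be computed, assuming `s_theta(x, t, d)` is the shortcut network as a Python callable; the names are illustrative, not taken from the paper's code:

```python
import torch

def shortcut_consistency_target(s_theta, x_t, t, d):
    # Pseudo ground truth for the 2d-step shortcut, composed from two d-step shortcuts.
    with torch.no_grad():                          # stop-gradient: detach the target
        x_next = x_t + s_theta(x_t, t, d) * d      # follow one d-step shortcut
        s_target = 0.5 * s_theta(x_t, t, d) + 0.5 * s_theta(x_next, t + d, d)
    return s_target

# training sketch: regress the 2d-step prediction onto the composed target
# loss = ((s_theta(x_t, t, 2 * d) - s_target) ** 2).mean()
```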
|
||||
|
|
|
|||
BIN
content/ml-tech/one-step-diffusion-models/ode-sde-flow.webp
Normal file
|
After Width: | Height: | Size: 102 KiB |
BIN
content/ml-tech/one-step-diffusion-models/reflow-iterations.webp
Normal file
|
After Width: | Height: | Size: 57 KiB |
BIN
content/ml-tech/one-step-diffusion-models/shortcut-training.webp
Normal file
|
After Width: | Height: | Size: 97 KiB |
|
|
@ -39,7 +39,7 @@ This vector is then directly added to the token embedding vector.
|
|||
To build intuition for how PE works, consider an analogy to old-fashioned electricity meters or car odometers.
|
||||
Imagine a mechanical meter with multiple rotating wheels. The rightmost wheel rotates the fastest, completing a full rotation for each unit of position. The next wheel rotates slower, completing a rotation every 10 units. The wheel to its left rotates even slower, once per 100 units, and so on. Each wheel to the left rotates at an increasingly slower rate than the one before it.
|
||||
|
||||

|
||||

|
||||
|
||||
In the vanilla PE formulation, different dimensions correspond to these different "wheels" rotating at different frequencies determined by $10000^{2i/d_{\text{model}}}$.
|
||||
The sine and cosine functions encode the continuous rotation angle of each wheel.
|
||||
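To make the analogy concrete, here is a minimal sketch of the vanilla sinusoidal PE, where each pair of dimensions acts as one "wheel" with its own rotation frequency (a simplified illustration, not any particular library's implementation):

```python
import torch

def sinusoidal_pe(seq_len, d_model, base=10000.0):
    pos = torch.arange(seq_len).unsqueeze(1).float()   # (seq_len, 1) positions
    i = torch.arange(0, d_model, 2).float()            # even dimension indices
    angle = pos / base ** (i / d_model)                # per-"wheel" rotation angle
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angle)
    pe[:, 1::2] = torch.cos(angle)
    return pe                                          # added to token embeddings
```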
|
|
@ -80,7 +80,7 @@ The dot-product of two rotated vectors depends on their angle difference, which
|
|||
You can also understand RoPE with the rotating meters analogy above, since it is literally rotating vectors as if they were meter hands.
|
||||
After receiving those vectors, the Transformer is like an electrician, who only cares about the relative angle difference of meter hands between two reads, rather than the absolute positions of the meter hands at each read.
|
||||
|
||||

|
||||

|
||||
|
||||
RoPE can be extended to arbitrary $d$ dimensions, by dividing the vector space into multiple 2-dimensional sub-spaces.
|
||||
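A minimal sketch of that rotation for a single query or key vector; `pos` is assumed to be a scalar token position here for simplicity:

```python
import torch

def rope_rotate(x, pos, base=10000.0):
    # Rotate each 2D sub-space of x by an angle proportional to the token position.
    d = x.shape[-1]
    freqs = 1.0 / base ** (torch.arange(0, d, 2).float() / d)  # one frequency per sub-space
    theta = pos * freqs
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * torch.cos(theta) - x2 * torch.sin(theta)
    out[..., 1::2] = x1 * torch.sin(theta) + x2 * torch.cos(theta)
    return out
```

The dot product between `rope_rotate(q, m)` and `rope_rotate(k, n)` then depends only on the offset between `m` and `n`, which is the relative-position property described above.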
|
||||
|
|
@ -149,7 +149,7 @@ Resonance RoPE addresses this by rounding wavelengths to the nearest integer.
|
|||
A wavelength of 10.3 becomes 10. Now positions 0, 10, 20, 30... all show identical rotation angles. When the model sees position 80 or 120 during inference, these align perfectly with positions seen during training. The model doesn't need to generalize to new rotation angles.
|
||||
This applies to all dimensions with wavelengths shorter than the training length. For these dimensions, Resonance RoPE provably eliminates the feature gap between training and inference positions. The rounding happens offline during model setup, so there's no computational cost.
|
||||
|
||||

|
||||

|
||||
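A rough sketch of the wavelength-rounding step as I read it, assuming standard RoPE frequencies; the function name and clamping are my own illustrative choices, not the paper's API:

```python
import math
import torch

def resonance_rope_freqs(d, base=10000.0):
    # Round each sub-space's wavelength to the nearest integer so its rotation
    # angle repeats exactly at integer token positions seen during training.
    freqs = 1.0 / base ** (torch.arange(0, d, 2).float() / d)
    wavelengths = 2 * math.pi / freqs
    return 2 * math.pi / torch.round(wavelengths).clamp(min=1.0)
```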
|
||||
Resonance RoPE works with any RoPE-based method. Combined with YaRN, it provides a complete solution: YaRN handles the long-wavelength dimensions, Resonance handles the short-wavelength ones.
|
||||
Experiments show the combination consistently outperforms YaRN alone on long-context tasks.
|
||||
|
|
@ -162,7 +162,7 @@ The search process treats the rescale factors as parameters to optimize. Startin
|
|||
|
||||
LongRoPE also introduces a progressive extension strategy. Rather than jumping directly from the training length to the target length, it extends in stages: first from 4k to 256k with evolutionary search, then applies the same factors to reach 2048k. The model only needs 1000 fine-tuning steps at 256k tokens to adapt, making the extension process both effective and efficient. This progressive approach reduces the risk of performance degradation that can occur with aggressive single-step extensions.
|
||||
|
||||

|
||||

|
||||
|
||||
> **References:**
|
||||
>
|
||||
|
|
|
|||
BIN
content/ml-tech/rotary-pe/longrope.webp
Normal file
|
After Width: | Height: | Size: 98 KiB |
BIN
content/ml-tech/rotary-pe/odometer.webp
Normal file
|
After Width: | Height: | Size: 20 KiB |
BIN
content/ml-tech/rotary-pe/resonance-rope.webp
Normal file
|
After Width: | Height: | Size: 97 KiB |
BIN
content/ml-tech/rotary-pe/rope-rotation.webp
Normal file
|
After Width: | Height: | Size: 46 KiB |
BIN
content/ml-tech/train-multi-modal-llm/blip-bootstrap.webp
Normal file
|
After Width: | Height: | Size: 22 KiB |
BIN
content/ml-tech/train-multi-modal-llm/deepseek-ocr.webp
Normal file
|
After Width: | Height: | Size: 58 KiB |
BIN
content/ml-tech/train-multi-modal-llm/diffusion-captions.webp
Normal file
|
After Width: | Height: | Size: 84 KiB |
BIN
content/ml-tech/train-multi-modal-llm/image-text-pair.webp
Normal file
|
After Width: | Height: | Size: 49 KiB |
|
|
@ -14,11 +14,11 @@ The most straight-forward method to bridge multi-modal data and text is to train
|
|||
|
||||
For images, it is relatively easy to find a large-scale image dataset where each image is coupled with a text description. For example, you can scrape images from Wikipedia, where they often come with descriptions, or from social media, where users write descriptions.
|
||||
|
||||

|
||||

|
||||
|
||||
There are some practices that can improve the efficiency of this training step. You do not necessarily have to train an LLM from scratch; instead, you can train only the adaptation layer between a pre-trained image encoder (like CLIP) and a text-only pre-trained LLM, like the design in LLaVA shown below.
|
||||
|
||||

|
||||

|
||||
|
||||
> Liu, Haotian, et al. "Visual instruction tuning." _Advances in neural information processing systems_ 36 (2023): 34892-34916.
|
||||
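A minimal sketch of such an adaptation layer; LLaVA itself uses a simple linear or MLP projection, and the exact sizes and two-layer shape here are illustrative assumptions:

```python
import torch.nn as nn

class VisionToLLMAdapter(nn.Module):
    # Project frozen image-encoder features into the LLM embedding space so
    # image tokens can be concatenated with text token embeddings.
    def __init__(self, vision_dim, llm_dim):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, image_features):      # (batch, n_patches, vision_dim)
        return self.proj(image_features)    # (batch, n_patches, llm_dim)
```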
|
||||
|
|
@ -30,13 +30,13 @@ If you have at least a few data-text pairs to begin with, there are methods to e
|
|||
|
||||
You can first train a smaller LLM with the data-text pairs at hand, then use it to generate more descriptions for unlabeled data. For example, with limited image-text pairs, you can first train an image captioner and apply it to unlabeled images to generate more image-text pairs. Images without text descriptions are far more plentiful than those with them.
|
||||
|
||||

|
||||

|
||||
|
||||
> Li, Junnan, et al. "Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation." _International conference on machine learning_. PMLR, 2022.
|
||||
|
||||
Even crazier, you can train a new (or use an off-the-shelf) conditional diffusion model that generates images given descriptions. It is relatively easy to make up descriptions using text-only LLMs.
|
||||
|
||||

|
||||

|
||||
|
||||
> Ma, Feipeng, et al. "Image captioning with multi-context synthetic data." _Proceedings of the AAAI Conference on Artificial Intelligence_. Vol. 38. No. 5. 2024.
|
||||
|
||||
|
|
@ -44,7 +44,7 @@ Based on the idea of instruction-tuning that is widely use to train LLMs, LLaVA
|
|||
- Original text description
|
||||
- Description of bounding boxes, as a textual representation of the spatial relationships of objects
|
||||
|
||||

|
||||

|
||||
|
||||
> Liu, Haotian, et al. "Visual instruction tuning." _Advances in neural information processing systems_ 36 (2023): 34892-34916.
|
||||
|
||||
|
|
@ -60,7 +60,7 @@ You can try to apply the vast available self-supervising methods that have been
|
|||
|
||||
STIC also demonstrates an interesting implementation of self-supervised learning: Use LLMs to generate positive and negative (less preferred) captions of the same image, which can then be used to perform contrastive learning or [direct preference optimization (DPO)](https://arxiv.org/abs/2305.18290).
|
||||
|
||||

|
||||

|
||||
|
||||
> Deng, Yihe, et al. "Enhancing large vision language models with self-training on image comprehension." _Advances in Neural Information Processing Systems_ 37 (2024): 131369-131397.
|
||||
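For reference, a minimal sketch of the DPO objective applied to such (preferred, less-preferred) caption pairs, assuming per-caption log-probabilities are available from the trained model and a frozen reference model:

```python
import torch.nn.functional as F

def dpo_loss(logp_pos, logp_neg, ref_logp_pos, ref_logp_neg, beta=0.1):
    # Push the policy to prefer the positive caption over the negative one,
    # measured relative to a frozen reference model.
    margin = beta * ((logp_pos - ref_logp_pos) - (logp_neg - ref_logp_neg))
    return -F.logsigmoid(margin).mean()
```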
|
||||
|
|
@ -72,7 +72,7 @@ Here a work that is not directly related to the topic of this post, but I feel m
|
|||
|
||||
DeepSeek-OCR is a recently published and very interesting work. The core idea is that, when feeding text into an LLM, it can actually be more token-efficient to paste the text into a Word document, take a screenshot, and feed the image to the LLM, rather than feeding the text directly.
|
||||
|
||||

|
||||

|
||||
|
||||
> Wei, Haoran, Yaofeng Sun, and Yukun Li. "DeepSeek-OCR: Contexts Optical Compression." _arXiv preprint arXiv:2510.18234_ (2025).
|
||||
|
||||
|
|
|
|||
BIN
content/ml-tech/train-multi-modal-llm/llava-architecture.webp
Normal file
|
After Width: | Height: | Size: 14 KiB |
BIN
content/ml-tech/train-multi-modal-llm/llava-instruction.webp
Normal file
|
After Width: | Height: | Size: 167 KiB |
BIN
content/ml-tech/train-multi-modal-llm/stic-self-training.webp
Normal file
|
After Width: | Height: | Size: 82 KiB |