compress images into webp
BIN
content/homelab/modern-unix-cmd/btop.webp
Normal file
|
After Width: | Height: | Size: 150 KiB |
BIN
content/homelab/modern-unix-cmd/eza-list.webp
Normal file
|
After Width: | Height: | Size: 52 KiB |
BIN
content/homelab/modern-unix-cmd/eza-tree.webp
Normal file
|
After Width: | Height: | Size: 29 KiB |
BIN
content/homelab/modern-unix-cmd/htop.webp
Normal file
|
After Width: | Height: | Size: 188 KiB |
|
|
@ -16,11 +16,11 @@ Let's say I am currently in `~/Documents/Projects/personal-blog` and I want to j
|
|||
With the classic `cd`, I will have to type the whole path.
|
||||
With `cd` aliased to `zoxide`, I only need to type `cd n` (supposing that `~/.config/nix` is the most frequently visited directory among all matched directories).
|
||||
|
||||

|
||||

|
||||
|
||||
Internally, `zoxide` records my directory visits in a local database and ranks them by how frequently I visit them. If the first hit is not what I want, I can also interactively select from the matched list.
|
||||
|
||||

|
||||

|
||||
|
||||
## `du` -> `ncdu`
|
||||
|
||||
|
|
@ -29,17 +29,17 @@ Internally `zoxide` records my visits to directories in a SQLite database and so
|
|||
[`ncdu`](https://dev.yorhel.nl/ncdu) is an interactive alternative to `du`, and is very usable out of the box. Interestingly, I also feel it is a touch faster than `du`.
|
||||
It can totally be an alternative to those fancy disk space analyzers as well.
|
||||
|
||||

|
||||

|
||||
|
||||
## `top` -> `btop`
|
||||
|
||||
`top` is quite basic and looks "unexciting". `htop` also ships with most Unix/Linux systems and looks better.
|
||||
|
||||

|
||||

|
||||
|
||||
[`btop`](https://github.com/aristocratos/btop) might be the most "nerdy-looking" `top` alternative out of the box. It can be a handy tool if you are trying to make people believe you are a hacker.
|
||||
|
||||

|
||||

|
||||
|
||||
At the same time, it is very feature-rich and configurable. To some extent, it is also an alternative to bandwidth monitoring tools like `iftop` and disk utilization tools like `df`.
|
||||
|
||||
|
|
@ -47,11 +47,11 @@ At the same time, it is very feature-rich and configurable. To some extent, it i
|
|||
|
||||
I think there is nothing wrong with the classic `ls`. As an alternative, [`eza`](https://github.com/eza-community/eza) just adds a few quality-of-life improvements, like file type icons, Git status, and (based on personal taste) prettier colors.
|
||||
|
||||

|
||||

|
||||
|
||||
It can replace the `tree` command as well.
|
||||
|
||||

|
||||

|
||||
|
||||
## `vim` -> `nvim`
|
||||
|
||||
|
|
@ -63,6 +63,6 @@ You can simply use `vim` keybindings in many editors or IDEs. `vim` itself can f
|
|||
To put it simply, it is a TUI editor that can truly be your only text editor. With countless plugins and ways to configure it, it can be a basic text editor, or a fully-featured development IDE, or anything in-between.
|
||||
Syntax highlighting, file browser, fuzzy search, intelligent autocompletion, debugging, AI™ integration. You name it, `neovim` has it.
|
||||
|
||||

|
||||

|
||||
|
||||

|
||||

|
||||
|
|
|
|||
BIN
content/homelab/modern-unix-cmd/ncdu.webp
Normal file
|
After Width: | Height: | Size: 63 KiB |
BIN
content/homelab/modern-unix-cmd/nvim-1.webp
Normal file
|
After Width: | Height: | Size: 115 KiB |
BIN
content/homelab/modern-unix-cmd/nvim-2.webp
Normal file
|
After Width: | Height: | Size: 112 KiB |
BIN
content/homelab/modern-unix-cmd/zoxide-jump.webp
Normal file
|
After Width: | Height: | Size: 16 KiB |
BIN
content/homelab/modern-unix-cmd/zoxide-select.webp
Normal file
|
After Width: | Height: | Size: 47 KiB |
BIN
content/homelab/nixos-home-server/cover.webp
Normal file
|
After Width: | Height: | Size: 33 KiB |
BIN
content/homelab/nixos-home-server/immich.webp
Normal file
|
After Width: | Height: | Size: 344 KiB |
|
|
@ -6,7 +6,7 @@ description = "How I built a NixOS-based Home Server/Nas"
|
|||
|
||||
This is a very concise walkthrough of my main home server running NixOS. I assume the reader already has basic knowledge about NixOS.
|
||||
|
||||

|
||||

|
||||
|
||||
My home server (or many would rather call it a NAS) serves common home server purposes: bulk storage, basic file sharing, media streaming service, and photo backup.
|
||||
|
||||
|
|
@ -14,7 +14,7 @@ My home server (or many would rather call it a NAS) serves common home server pu
|
|||
|
||||
Below is a recent photo of my home server, living in the utility closet together with my network equipment.
|
||||
|
||||

|
||||

|
||||
|
||||
It is essentially a custom Intel N305 motherboard with a SATA back panel and a 3D-printed enclosure. I bought it on Taobao last time I went back to China to visit my family.
|
||||
The exact hardware is not very important here: as long as you stick to common hardware, it should be relatively straightforward to install NixOS and replicate my setup.
|
||||
|
|
@ -185,7 +185,7 @@ Transmission seems to be more stable, but its interface is so barebones and is m
|
|||
|
||||
For photo backup I use [Immich](https://immich.app/). It is a self-hosted alternative to iCloud Photos and Google Photos. Aside from basic photo backup and management, it also has many nice touches, such as face detection, CLIP-based image embedding for semantic search, and recently added OCR for text search. It also comes with quite robust mobile apps for both iOS and Android.
|
||||
|
||||

|
||||

|
||||
|
||||
Right now Immich is the only service I am running with containers rather than native Nix modules (as you can see in [this configuration file](https://github.com/Logan-Lin/nix-config/blob/master/hosts/nixos/hs/containers.nix)). Technically it is possible to set up Immich with pure Nix modules, but for this type of service that relies on specific versions of databases (in this case, PostgreSQL with vector support), I feel containers are the easier route.
|
||||
And to be honest, I don't think there is much benefit to going with pure Nix modules here (especially for Immich, since you can still [declare its config](https://github.com/Logan-Lin/nix-config/blob/master/config/immich.nix) even with containers), other than satisfying the purism many Nix users seem to have.
|
||||
|
|
@ -212,7 +212,7 @@ The P2P nature of Tailscale also means that, if you have no interest in creating
|
|||
I don't want to complicate things, so I haven't set up any automated system to check the health status of my home server and send notifications if anything goes wrong.
|
||||
I do have a [login display module](https://github.com/Logan-Lin/nix-config/blob/master/modules/login-display.nix) that reports important status information every time I SSH into my home server.
|
||||
|
||||

|
||||

|
||||
|
||||
## Why NixOS?
|
||||
|
||||
|
|
@ -222,6 +222,6 @@ Compared to purposefully built home server systems (like Unraid) and pre-built h
|
|||
|
||||
Compared to other Linux distributions, NixOS is quite suitable for setting up a home server. Since it is declarative, setting up many things is probably easier than you think. In other words, for the most part, you only have to care about **what** you want to achieve, not **how** you are going to achieve it (this is, of course, primarily thanks to the amazing NixOS community). On the other hand, most of the configuration is fully self-contained and tracked in your Nix config repo (supposing you use Git). So it is much less prone to oversights during configuration, and you also don't have to explicitly remember your setup for future reference. Before switching my home server to NixOS, I had been using nix-darwin on my MacBook for a while, so I also get to reuse a lot of custom modules, like the [neovim module](https://github.com/Logan-Lin/nix-config/blob/master/modules/nvim.nix).
|
||||
|
||||

|
||||

|
||||
|
||||
> It looks completely identical (why not), to the point that I have to set up visual hints (like the highlighted tmux hostname display) to remind myself which host I am currently on.
|
||||
|
|
|
|||
BIN
content/homelab/nixos-home-server/login-display.webp
Normal file
|
After Width: | Height: | Size: 17 KiB |
BIN
content/homelab/nixos-home-server/server-photo.webp
Normal file
|
After Width: | Height: | Size: 261 KiB |
BIN
content/homelab/nixos-home-server/terminal-comparison.webp
Normal file
|
After Width: | Height: | Size: 146 KiB |
BIN
content/homelab/replace-cloud-w-sync/calibre.webp
Normal file
|
After Width: | Height: | Size: 186 KiB |
BIN
content/homelab/replace-cloud-w-sync/foobar2000.webp
Normal file
|
After Width: | Height: | Size: 978 KiB |
|
|
@ -23,7 +23,7 @@ As long as the service can achieve one functionality: always keep a full copy of
|
|||
I do want to recommend a service for this purpose: [Syncthing](https://syncthing.net/). This is a peer-to-peer file sync service, which means it has minimal reliance on cloud infrastructure, and your data never has to be stored on computers that are not yours.
|
||||
Also, from my experience, every cloud storage service I've used frequently runs into stability issues when I try to sync tons of small files at once (e.g., a Git repo), but Syncthing has never been unstable no matter how much I abuse it.
|
||||
|
||||

|
||||

|
||||
|
||||
## Examples
|
||||
|
||||
|
|
@ -36,17 +36,17 @@ One replacement for Notion that has no reliance on cloud what so ever is [Obsidi
|
|||
Every type of data needed by Obsidian, including the notes themselves, settings, plugins, and GUI customization, is stored locally (even better, in plain text).
|
||||
Once you use a local file sync service to sync your Obsidian vault folder, it works like the cloud in that everything is always in sync, but without any of the cloud's downsides.
|
||||
|
||||

|
||||

|
||||
|
||||
### Reference Management: Zotero
|
||||
|
||||
[Zotero](https://www.zotero.org/) is reference management software that can be used in a variety of scenarios. I largely use it to manage the academic papers I need to read.
|
||||
|
||||

|
||||

|
||||
|
||||
Zotero has built-in cloud sync functionality, but the price for storage upgrades is quite high. One thing you might not know is that Zotero stores metadata and attachments in the same folder. You can use Syncthing to sync that folder and completely ignore the official cloud sync functionality.
|
||||
|
||||

|
||||

|
||||
|
||||
### Paper Writing: Overleaf vs. Local Text Editor
|
||||
|
||||
|
|
@ -71,7 +71,7 @@ clean:
|
|||
rm -rf out
|
||||
```
|
||||
|
||||

|
||||

|
||||
|
||||
Overleaf also provides two types of Git integration for you to sync your local changes with Overleaf projects: sync with a GitHub repo, or directly as a remote git repo. It's totally viable to have a mixed setup, where you primarily use local editors and most of your collaborators use Overleaf.
|
||||
|
||||
|
|
@ -79,7 +79,7 @@ Overleaf also provides two types of Git integration for you to sync your local c
|
|||
|
||||
[Calibre](https://calibre-ebook.com/) is book management software that lets you organize your book collection and edit metadata, along with many handy features like bulk format conversion.
|
||||
|
||||

|
||||

|
||||
|
||||
Similar to Zotero, Calibre stores all the books and metadata of a library in a local folder, so there is nothing stopping you from syncing the folder across multiple computers. Although the software explicitly advises against this (when you select a location for a library, it warns: "Note that putting the calibre library on a Networked drive is not safe"), in my experience, as long as you don't open and modify the same library on two synced computers simultaneously, you won't run into any issues.
|
||||
|
||||
|
|
@ -92,7 +92,7 @@ It is my own cloud infrastructure, nevertheless it is a cloud infrastructure so
|
|||
Now I use a simpler yet more robust setup. I just sync all my music files in a folder through Syncthing, and use a local music player like [foobar2000](https://www.foobar2000.org/) to read that folder.
|
||||
Of course, to save some space, I always transcode each file to AAC 256k before putting it in the sync folder.
|
||||
|
||||

|
||||

|
||||
|
||||
## Limitations
|
||||
|
||||
|
|
|
|||
BIN
content/homelab/replace-cloud-w-sync/latex.webp
Normal file
|
After Width: | Height: | Size: 223 KiB |
BIN
content/homelab/replace-cloud-w-sync/obsidian.webp
Normal file
|
After Width: | Height: | Size: 135 KiB |
BIN
content/homelab/replace-cloud-w-sync/syncthing.webp
Normal file
|
After Width: | Height: | Size: 68 KiB |
BIN
content/homelab/replace-cloud-w-sync/zotero-files.webp
Normal file
|
After Width: | Height: | Size: 113 KiB |
BIN
content/homelab/replace-cloud-w-sync/zotero.webp
Normal file
|
After Width: | Height: | Size: 191 KiB |
|
|
@ -118,7 +118,7 @@ SEASONNFO
|
|||
|
||||
# Handle thumbnail: rename and copy as posters
|
||||
local thumb_file=""
|
||||
for ext in jpg webp png; do
|
||||
for ext in webp jpg png; do
|
||||
if [[ -f "$dir/$name_noext.$ext" ]]; then
|
||||
thumb_file="$dir/$name_noext.$ext"
|
||||
break
|
||||
|
|
@ -129,11 +129,11 @@ SEASONNFO
|
|||
local thumb_ext="${thumb_file##*.}"
|
||||
mv "$thumb_file" "$dir/$name_noext-thumb.$thumb_ext" 2>/dev/null
|
||||
|
||||
if [[ ! -f "$series_dir/poster.jpg" ]] && [[ ! -f "$series_dir/poster.webp" ]] && [[ ! -f "$series_dir/poster.png" ]]; then
|
||||
if [[ ! -f "$series_dir/poster.webp" ]] && [[ ! -f "$series_dir/poster.webp" ]] && [[ ! -f "$series_dir/poster.webp" ]]; then
|
||||
cp "$dir/$name_noext-thumb.$thumb_ext" "$series_dir/poster.$thumb_ext"
|
||||
fi
|
||||
|
||||
if [[ ! -f "$season_dir/poster.jpg" ]] && [[ ! -f "$season_dir/poster.webp" ]] && [[ ! -f "$season_dir/poster.png" ]]; then
|
||||
if [[ ! -f "$season_dir/poster.webp" ]] && [[ ! -f "$season_dir/poster.webp" ]] && [[ ! -f "$season_dir/poster.webp" ]]; then
|
||||
cp "$dir/$name_noext-thumb.$thumb_ext" "$season_dir/poster.$thumb_ext"
|
||||
fi
|
||||
fi
|
||||
|
|
@ -144,15 +144,15 @@ I also include the thumbnail extraction logic in the implementation. The end res
|
|||
|
||||
```
|
||||
PMM_LORD
|
||||
├── poster.jpg
|
||||
├── poster.webp
|
||||
├── Season 2025
|
||||
│ ├── poster.jpg
|
||||
│ ├── poster.webp
|
||||
│ ├── S2025E1031 - 【PGN】狐狸鸣泣之时——寂静岭f.mp4
|
||||
│ ├── S2025E1031 - 【PGN】狐狸鸣泣之时——寂静岭f.nfo
|
||||
│ ├── S2025E1031 - 【PGN】狐狸鸣泣之时——寂静岭f-thumb.jpg
|
||||
│ ├── S2025E1031 - 【PGN】狐狸鸣泣之时——寂静岭f-thumb.webp
|
||||
│ ├── S2025E1127 - 【PGN】卡洛斯传奇略人区——宝可梦传说ZA.mp4
|
||||
│ ├── S2025E1127 - 【PGN】卡洛斯传奇略人区——宝可梦传说ZA.nfo
|
||||
│ ├── S2025E1127 - 【PGN】卡洛斯传奇略人区——宝可梦传说ZA-thumb.jpg
|
||||
│ ├── S2025E1127 - 【PGN】卡洛斯传奇略人区——宝可梦传说ZA-thumb.webp
|
||||
│ └── season.nfo
|
||||
└── tvshow.nfo
|
||||
|
||||
|
|
@ -175,14 +175,14 @@ And the episode `.nfo` file will record the title, upload date, and video descri
|
|||
|
||||
Finally, in Jellyfin/Emby, we can set up a "TV show" type library, but uncheck all the metadata fetching sources so that the information will only be provided by the `.nfo` files.
|
||||
|
||||

|
||||

|
||||
|
||||
And it works!
|
||||
|
||||

|
||||

|
||||
|
||||
And of course it also works nicely with third-party clients like Infuse on my mobile devices/TV.
|
||||
|
||||

|
||||

|
||||
|
||||
I packaged my custom yt-dlp setup into a Nix module, which you can take a look at via [this link](https://github.com/Logan-Lin/nix-config/blob/master/modules/yt-dlp.nix) if interested.
|
||||
|
|
|
|||
BIN
content/homelab/yt-dlp-tv-show/infuse-client.webp
Normal file
|
After Width: | Height: | Size: 96 KiB |
BIN
content/homelab/yt-dlp-tv-show/jellyfin-result.webp
Normal file
|
After Width: | Height: | Size: 89 KiB |
BIN
content/homelab/yt-dlp-tv-show/jellyfin-settings.webp
Normal file
|
After Width: | Height: | Size: 36 KiB |
|
|
@ -24,11 +24,11 @@ Since images and language modalities represent continuous and discrete data resp
|
|||
|
||||
The goal of a multi-modal Transformer is to create a model that can accept multi-modal inputs and produce multi-modal outputs. For example, instead of using a CNN-based image encoder and a Transformer-based language encoder to map image and language modalities to the latent space separately, a multi-modal Transformer would be able to process the combination of image and language (sentence) as a single sequence.
|
||||
|
||||

|
||||

|
||||
|
||||
> An example of "conventional" multi-modal fusion. Each modality is processed by a separate model and fused at some point. Source: *Xiang, Hao, Runsheng Xu, and Jiaqi Ma. "HM-ViT: Hetero-modal vehicle-to-vehicle cooperative perception with vision transformer." CVPR, 2023.*
|
||||
|
||||

|
||||

|
||||
|
||||
> An example of a Transformer that can handle multi-modal inputs and outputs. Different modalities are all projected into tokens and subsequently processed by a unified Transformer encoder. Source: *Kondratyuk, Dan, Lijun Yu, et al. "VideoPoet: A Large Language Model for Zero-Shot Video Generation," ICML, 2024.*
|
||||
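In its simplest form, the unified approach illustrated above boils down to projecting each modality into token embeddings of the same dimension and concatenating them into one sequence for a single Transformer. A minimal sketch (shapes and names are illustrative, not from any specific model):

```python
import torch

def fuse_modalities(text_emb, image_emb):
    # text_emb: (batch, n_text_tokens, d); image_emb: (batch, n_image_tokens, d)
    # Treat image tokens and text tokens as one sequence so a single
    # Transformer encoder can attend across both modalities.
    return torch.cat([image_emb, text_emb], dim=1)  # (batch, n_image + n_text, d)
```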
|
||||
|
|
@ -38,13 +38,13 @@ Beyond multi-modal processing, a multi-function Transformer can, for example, fu
|
|||
|
||||
A fundamental challenge in unifying multiple modalities within a single Transformer is how to represent different modalities in the same embedding space. For the "QKV" self-attention mechanism to work properly, each item in the input sequence must be represented by an embedding vector of the same dimension, matching the "model dimension" of the Transformer.
|
||||
|
||||

|
||||

|
||||
|
||||
> Illustration of the QKV self-attention mechanism in Transformer. [Source](https://en.wikipedia.org/wiki/Attention_(machine_learning))
|
||||
|
||||
The most common method for mapping language into the embedding space is through tokenization and token embedding. A tokenizer maps a word or word fragment into a discrete token index, and an index-fetching embedding layer (implemented in frameworks like PyTorch with `nn.Embedding`) maps this index into a fixed-dimension embedding vector. In principle, all discrete features can be mapped into the embedding space using this approach.
|
||||
|
||||

|
||||

|
||||
|
||||
> Visualization of tokenizer and index-fetching embedding layer. [Source](https://medium.com/@hunter-j-phillips/the-embedding-layer-27d9c980d124)
|
||||
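A minimal PyTorch sketch of the index-fetching embedding layer described above (vocabulary size, dimensions, and token indices are illustrative):

```python
import torch
import torch.nn as nn

vocab_size, d_model = 32000, 768                # illustrative sizes
embed = nn.Embedding(vocab_size, d_model)       # index-fetching embedding layer

token_ids = torch.tensor([[101, 2009, 2003]])   # hypothetical tokenizer output
token_embeddings = embed(token_ids)             # shape: (1, 3, 768)
```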
|
||||
|
|
@ -58,7 +58,7 @@ Vector quantization maintains a "codebook" $\boldsymbol C \in \mathbb R^{n\times
|
|||
$$
|
||||
i = \arg\min_j ||\boldsymbol z - \boldsymbol C_j||_2
|
||||
$$
|
||||
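A minimal PyTorch sketch of this nearest-code lookup (codebook size and feature dimension are illustrative):

```python
import torch

def vq_tokenize(z, codebook):
    # z: (batch, d) continuous features; codebook: (n, d) code vectors.
    # Compute L2 distances to every codebook entry and pick the nearest
    # entry's index as the discrete token.
    dists = torch.cdist(z, codebook)   # (batch, n)
    return dists.argmin(dim=-1)        # (batch,)

codebook = torch.randn(1024, 256)      # n=1024 codes, d=256 dims (illustrative)
tokens = vq_tokenize(torch.randn(8, 256), codebook)
```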

|
||||

|
||||
|
||||
### Lookup-Free Quantization
|
||||
|
||||
|
|
@ -119,7 +119,7 @@ For language generation, Transformers typically use classifier output layers, ma
|
|||
|
||||
One approach to reverse vector quantization is readily available in VQ-VAE, since it is an auto-encoder. Given a token $i$, we can look up its embedding in the codebook as $\boldsymbol C_i$, then apply a decoder network to map $\boldsymbol C_i$ back to the continuous feature vector $\boldsymbol z$. The decoder network can either be pre-trained within the VQ-VAE framework (training the VQ-VAE tokenizer, encoder, and decoder with auto-encoding losses) or trained end-to-end along with the whole Transformer. In the NLP and CV communities, the pre-training approach is more popular, since there are many large-scale pre-trained auto-encoders available.
|
||||
|
||||

|
||||

|
||||
|
||||
> The encoder-decoder structure of MAGVIT (*Yu et al., "MAGVIT"*), a visual VQ-VAE model. A 3D-VQ encoder quantizes a video into discrete tokens, and a 3D-VQ decoder maps them back to the pixel space.
|
||||
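A minimal sketch of this reverse direction, assuming a trained codebook tensor and a decoder callable are available (names are illustrative):

```python
import torch

def detokenize(tokens, codebook, decoder):
    # tokens: (batch, seq_len) discrete indices; codebook: (n, d) code vectors.
    # Look up each token's code vector, then let the (pre-trained or jointly
    # trained) decoder map the quantized features back to continuous data.
    z_q = codebook[tokens]      # (batch, seq_len, d) embedding lookup
    return decoder(z_q)         # e.g., back to pixel space for images/video
```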
|
||||
|
|
@ -131,7 +131,7 @@ There are several workarounds to improve the efficiency of multi-modal outputs.
|
|||
|
||||
Another workaround follows the idea of compression. Take video generation as an example: the model generates full features for key frames and lightweight features for motion vectors that describe subtle differences from those key frames. This is essentially how inter-frame compressed video codecs work, taking advantage of temporal redundancy between neighboring frames.
|
||||
|
||||

|
||||

|
||||
|
||||
> Key frames and motion vectors used in *Jin et al., "Video-LaVIT."*
|
||||
|
||||
|
|
@ -141,7 +141,7 @@ Despite continuous efforts to enable representation and generation of images and
|
|||
|
||||
An intriguing question arises: why not integrate the structures of language models and diffusion models into one Transformer to reach the best of both worlds? *Zhou et al. in "Transfusion"* explored this idea. The approach is straightforward: build a Transformer that can handle both language and image inputs and outputs. The language component functions as a language model, while the image component serves as a denoiser network for diffusion models. The model is trained by combining the language modeling loss and DDPM loss, enabling it to function either as a language model or a text-to-image denoiser.
|
||||
|
||||

|
||||

|
||||
|
||||
> A Transformer capable of functioning as a language model and a diffusion denoiser at the same time. Source: *Zhou, Chunting, Lili Yu, et al. "Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model," ICLR, 2025.*
|
||||
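A rough sketch of the combined objective, assuming the Transformer's outputs have already been split into a language-model head over text positions and a noise-prediction head over image positions; `lam` stands in for the loss-balancing weight (an assumption on my part, not the paper's exact variable name):

```python
import torch.nn.functional as F

def transfusion_style_loss(text_logits, text_targets, noise_pred, noise, lam=1.0):
    # Next-token cross-entropy on text positions plus DDPM noise-prediction
    # MSE on image positions, produced by a single shared Transformer.
    lm_loss = F.cross_entropy(text_logits.flatten(0, 1), text_targets.flatten())
    ddpm_loss = F.mse_loss(noise_pred, noise)
    return lm_loss + lam * ddpm_loss
```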
|
||||
|
|
|
|||
BIN
content/ml-tech/multi-modal-transformer/magvit.webp
Normal file
|
After Width: | Height: | Size: 85 KiB |
BIN
content/ml-tech/multi-modal-transformer/multi-modal-fusion.webp
Normal file
|
After Width: | Height: | Size: 56 KiB |
BIN
content/ml-tech/multi-modal-transformer/qkv-attention.webp
Normal file
|
After Width: | Height: | Size: 43 KiB |
BIN
content/ml-tech/multi-modal-transformer/token-embedding.webp
Normal file
|
After Width: | Height: | Size: 48 KiB |
BIN
content/ml-tech/multi-modal-transformer/transfusion.webp
Normal file
|
After Width: | Height: | Size: 25 KiB |
BIN
content/ml-tech/multi-modal-transformer/vector-quantization.webp
Normal file
|
After Width: | Height: | Size: 72 KiB |
BIN
content/ml-tech/multi-modal-transformer/video-lavit.webp
Normal file
|
After Width: | Height: | Size: 115 KiB |
BIN
content/ml-tech/multi-modal-transformer/video-poet.webp
Normal file
|
After Width: | Height: | Size: 64 KiB |
BIN
content/ml-tech/new-bert/ar-mask.webp
Normal file
|
After Width: | Height: | Size: 34 KiB |
|
|
@ -69,7 +69,7 @@ These are basically free performance improvement to BERT.
|
|||
|
||||
Vanilla BERT uses the original Transformer layer normalization design: a layer normalization is applied after each residual connection. Some modernized BERT models use an alternative design called pre-layer normalization, which moves the normalization layer inside the residual connection.
|
||||
|
||||

|
||||

|
||||
|
||||
> On layer normalization in the transformer architecture (2020). Xiong, Ruibin and Yang, Yunchang and He, Di and Zheng, Kai and Zheng, Shuxin and Xing, Chen and Zhang, Huishuai and Lan, Yanyan and Wang, Liwei and Liu, Tieyan.
|
||||
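For concreteness, a minimal pre-LN Transformer block sketch in PyTorch; this is a simplified illustration of the design, not any specific model's implementation:

```python
import torch.nn as nn

class PreLNBlock(nn.Module):
    # Pre-layer normalization: LayerNorm sits inside the residual branch,
    # in contrast to the original post-LN design (norm after the residual add).
    def __init__(self, d_model, n_heads, d_ff):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]   # residual add happens after the sub-layer
        x = x + self.ff(self.norm2(x))
        return x
```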
|
||||
|
|
@ -87,7 +87,7 @@ Another aspect of improvement is how the masked tokens are selected. Vanilla BER
|
|||
|
||||
If you were to train BERT to perform generative tasks, randomly masking and recovering tokens in input sequences might not be enough, and you should consider more generation-oriented pre-training tasks. An intuitive design is an AR-like generation task where a long, consecutive sub-sequence is fully masked and has to be recovered.
|
||||
|
||||

|
||||

|
||||
|
||||
> Unveiling the Potential of BERT-family: A New Recipe for Building Scalable, General and Competitive Large Language Models (2025). Xiao, Yisheng and Li, Juntao and Hu, Wenpeng and Luo, Zhunchen and Zhang, Min.
|
||||
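A small sketch of what such a generation-oriented masking step could look like; the `span_ratio` parameter and the suffix-style placement are illustrative assumptions, not the paper's exact recipe:

```python
import torch

def mask_consecutive_span(tokens, mask_id, span_ratio=0.5):
    # Mask one long consecutive sub-sequence for AR-like recovery,
    # instead of scattering random masks across the sequence.
    seq_len = tokens.shape[-1]
    span_len = max(1, int(seq_len * span_ratio))
    start = torch.randint(0, seq_len - span_len + 1, (1,)).item()
    corrupted = tokens.clone()
    corrupted[..., start:start + span_len] = mask_id
    return corrupted, (start, start + span_len)   # masked input + target span
```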
|
||||
|
|
|
|||
BIN
content/ml-tech/new-bert/normalization.webp
Normal file
|
After Width: | Height: | Size: 27 KiB |
BIN
content/ml-tech/ode-sde/average-velocity.webp
Normal file
|
After Width: | Height: | Size: 28 KiB |
BIN
content/ml-tech/ode-sde/curvy-vector-field.webp
Normal file
|
After Width: | Height: | Size: 52 KiB |
BIN
content/ml-tech/ode-sde/few-step-sampling.webp
Normal file
|
After Width: | Height: | Size: 80 KiB |
BIN
content/ml-tech/ode-sde/flow-data-point.webp
Normal file
|
After Width: | Height: | Size: 55 KiB |
|
|
@ -15,12 +15,12 @@ $$
|
|||
|
||||
where $\mu$ is the drift component that is deterministic, and $\sigma$ is the diffusion term driven by Brownian motion (denoted by $W_t$) that is stochastic. This differential equation specifies a *time-dependent vector (velocity) field* telling how a data point $x_t$ should be moved as time $t$ evolves from $t=0$ to $t=1$ (i.e., a *flow* from $x_0$ to $x_1$). Below we give an illustration where $x_t$ is 1-dimensional:
|
||||
|
||||

|
||||

|
||||
> Vector field between two distributions specified by a differential equation.
|
||||
|
||||
When $\sigma(x_t,t)\equiv 0$, we get an *ordinary differential equation (ODE)* where the vector field is deterministic, i.e., the movement of $x_t$ is fully determined by $\mu$ and $t$. Otherwise, we get a *stochastic differential equation (SDE)* where the movement of $x_t$ has a certain level of randomness. Extending the previous illustration, below we show the difference in flow of $x_t$ under ODE and SDE:
|
||||
|
||||

|
||||

|
||||
> Difference of movements in vector fields specified by ODE and SDE. *Source: Song, Yang, et al. "Score-based generative modeling through stochastic differential equations."* Note that their time is reversed.
|
||||
|
||||
As you would imagine, once we manage to solve the differential equation, even if we still cannot have a closed form of $p(x_1)$, we can sample from $p(x_1)$ by sampling a data point $x_0$ from $p(x_0)$ and get the generated data point $x_1$ by calculating the following forward-time integral with an integration technique of our choice:
|
||||
|
|
@ -31,7 +31,7 @@ $$
|
|||
|
||||
Or more intuitively, moving $x_0$ towards $x_1$ along time in the vector field:
|
||||
|
||||

|
||||

|
||||
> A flow of data point moving from $x_0$ towards $x_1$ in the vector field.
|
||||
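To make this sampling procedure concrete, here is a minimal Euler (ODE) / Euler-Maruyama (SDE) integration sketch, assuming the drift $\mu$ and diffusion $\sigma$ are available as Python callables `mu(x, t)` and `sigma(x, t)`:

```python
import torch

def sample_flow(mu, x0, sigma=None, n_steps=100):
    # Integrate dx = mu(x, t) dt (+ sigma(x, t) dW) forward from t=0 to t=1.
    # x0 is a batch of samples drawn from p(x_0).
    x, dt = x0, 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        x = x + mu(x, t) * dt
        if sigma is not None:
            # Brownian increment dW has standard deviation sqrt(dt)
            x = x + sigma(x, t) * torch.randn_like(x) * dt ** 0.5
    return x
```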
|
||||
## ODE and Flow Matching
|
||||
|
|
@ -80,12 +80,12 @@ $$
|
|||
|
||||
Although the ground truth vector field is designed to be straight, in practice it usually is not. When the data space is high-dimensional and the target distribution $p(x_1)$ is complex, there will be multiple pairs of $(x_0, x_1)$ that result in the same intermediate data point $x_t$, thus multiple velocities $x_1-x_0$. At the end of the day, the actual ground truth velocity at $x_t$ will be the average of all possible velocities $x_1-x_0$ that pass through $x_t$. This will lead to a "curvy" vector field, illustrated as follows:
|
||||
|
||||

|
||||

|
||||
> Left: multiple vectors passing through the same intermediate data point. Right: the resulting ground truth vector field. *Source: Geng, Zhengyang, et al. "Mean Flows for One-step Generative Modeling."* Note $z_t$ and $v$ in the figure correspond to $x_t$ and $\mu$ in this post, respectively.
|
||||
|
||||
As we discussed, when you calculate the ODE integral, you are using the instantaneous velocity--tangent of the curves in the vector field--of each step. You would imagine this will lead to subpar performance when using a small number $N$ of steps, as demonstrated below:
|
||||
|
||||

|
||||

|
||||
> Native flow matching models fail at few-step sampling. *Source: Frans, Kevin, et al. "One step diffusion via shortcut models."*
|
||||
|
||||
### Shortcut Vector Field
|
||||
|
|
@ -129,14 +129,14 @@ $$
|
|||
|
||||
Where $\text{sg}$ is stop gradient, i.e., detach $\mathbf{u}_\text{target}$ from back propagation, making it a pseudo ground truth. Below is an illustration of the training process provided in the original paper.
|
||||
|
||||

|
||||

|
||||
> Training of the shortcut models with self-consistency loss.
|
||||
|
||||
#### Mean Flow
|
||||
|
||||
Mean flow is another work sharing the idea of learning velocities that take large step size shortcuts but with a stronger theoretical foundation and a different approach to training.
|
||||
|
||||

|
||||

|
||||
> Illustration of the average velocity provided in the original paper.
|
||||
|
||||
Mean flow defines an *average velocity* as a shortcut between times $t$ and $r$ where $t$ and $r$ are independent:
|
||||
|
|
@ -256,7 +256,7 @@ One caveat of training a "shortcut SDE" is that the ideal result of one-step sam
|
|||
|
||||
Below are some preliminary results I obtained from a set of amorphous material generation experiments. You don't need to understand the figure--just know that it shows that applying the idea of learning shortcuts to SDE does yield better results compared to the vanilla SDE when using few-step sampling.
|
||||
|
||||

|
||||

|
||||
> Structural functions of generated materials, sampled in 10 steps.
|
||||
|
||||
---
|
||||
|
|
|
|||
BIN
content/ml-tech/ode-sde/ode-sde-difference.webp
Normal file
|
After Width: | Height: | Size: 103 KiB |
BIN
content/ml-tech/ode-sde/sde-results.webp
Normal file
|
After Width: | Height: | Size: 26 KiB |
BIN
content/ml-tech/ode-sde/shortcut-training.webp
Normal file
|
After Width: | Height: | Size: 93 KiB |
BIN
content/ml-tech/ode-sde/vector-field.webp
Normal file
|
After Width: | Height: | Size: 50 KiB |
BIN
content/ml-tech/one-step-diffusion-models/consistency-model.webp
Normal file
|
After Width: | Height: | Size: 103 KiB |
BIN
content/ml-tech/one-step-diffusion-models/diffusion-process.webp
Normal file
|
After Width: | Height: | Size: 34 KiB |
BIN
content/ml-tech/one-step-diffusion-models/dm-scale-poorly.webp
Normal file
|
After Width: | Height: | Size: 80 KiB |
BIN
content/ml-tech/one-step-diffusion-models/few-steps-results.webp
Normal file
|
After Width: | Height: | Size: 49 KiB |
|
|
@ -13,7 +13,7 @@ Diffusion models (DMs), or more broadly speaking, score-matching generative mode
|
|||
|
||||
Most diffusion models work by coupling a forward diffusion process and a reverse denoising diffusion process. The forward diffusion process gradually adds noise to the ground truth clean data $X_0$, until noisy data $X_T$ that follows a relatively simple distribution is reached. The reverse denoising diffusion process starts from the noisy data $X_T$, and removes the noise component step-by-step until clean generated data $X_0$ is reached. The reverse process is essentially a Monte-Carlo process, meaning it cannot be parallelized for each generation, which can be inefficient for a process with a large number of steps.
|
||||
|
||||

|
||||

|
||||
|
||||
> The two processes in a typical diffusion model. *Source: Ho, Jain, and Abbeel, "Denoising Diffusion Probabilistic Models."*
|
||||
|
||||
|
|
@ -21,7 +21,7 @@ Most diffusion models work by coupling a forward diffusion process and a reverse
|
|||
|
||||
There are many ways to understand how Diffusion Models (DMs) work. One of the most common and intuitive approaches is that a DM learns an ordinary differential equation (ODE) or a stochastic differential equation (SDE) that transforms noise into data. Imagine a vector field between the noise $X_T$ and clean data $X_0$. By training on a sufficiently large number of timesteps $t\in [0,T]$, a DM is able to learn the vector (tangent) pointing towards the cleaner data $X_{t-\Delta t}$, given any specific timestep $t$ and the corresponding noisy data $X_t$. This idea is easy to illustrate in a simplified 1-dimensional data scenario.
|
||||
|
||||

|
||||

|
||||
|
||||
> Illustrated ODE and SDE flow of a diffusion model on 1-dimensional data. *Source: Song et al., "Score-Based Generative Modeling through Stochastic Differential Equations."*
|
||||
|
||||
|
|
@ -35,13 +35,13 @@ Vanilla DDPM, which is essentially a discrete-timestep DM, can only perform the
|
|||
|
||||
Nevertheless, it is observed that their performance typically suffers catastrophic degradation when reducing the number of reverse process steps to single digits.
|
||||
|
||||

|
||||

|
||||
|
||||
> Images generated by conventional DMs with only a few steps of reverse process. *Source: Frans et al., "One Step Diffusion via Shortcut Models."*
|
||||
|
||||
To understand why DMs scale poorly with few reverse process steps, we can return to the vector field perspective of DMs. When the target data distribution is complex, the vector field typically contains numerous intersections. When a given $X_t$ and $t$ lands on one of these intersections, the vector points in the averaged direction of all candidates. This causes the generated data to approach the mean of the training data when only a few reverse process steps are used. Another explanation is that the learned vector field is highly curved. Using only a few reverse process steps means attempting to approximate these curves with polylines, which is inherently difficult.
|
||||
|
||||

|
||||

|
||||
|
||||
> Illustration of why DMs scale poorly with few reverse process steps. *Source: Frans et al., "One Step Diffusion via Shortcut Models."*
|
||||
|
||||
|
|
@ -71,7 +71,7 @@ $$
|
|||
|
||||
This procedure produces increasingly straight flows that can be simulated with very few steps, ideally one step after several iterations.
|
||||
|
||||

|
||||

|
||||
|
||||
> Illustrations of vector fields after different times of reflow processes. *Source: Liu, Gong, and Liu, "Flow Straight and Fast."*
|
||||
|
||||
|
|
@ -91,7 +91,7 @@ x_t = \sqrt{\bar{\alpha}_t} \, x_0 + \sqrt{1 - \bar{\alpha}_t} \, \epsilon_t
|
|||
|
||||
In theory, without altering the fundamental formulation of DMs, the learnable denoiser network can be designed to predict any of these three components. Consistency models (CMs) follow this principle by training the denoiser to specifically predict the clean sample $x_0$. The benefit of this approach is that CMs can naturally scale to perform the reverse process with few steps or even a single step.
|
||||
|
||||

|
||||

|
||||
|
||||
> A consistency model that learns to map any point on the ODE trajectory to the clean sample. *Source: Song et al., "Consistency Models."*
|
||||
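As a quick sanity check on why the parameterization choice is flexible, here is the re-arrangement in code, using the standard DDPM notation from the formula above (tensor shapes assumed broadcastable):

```python
import torch

def x0_from_eps(x_t, eps_pred, alpha_bar_t):
    # Re-arranging x_t = sqrt(a)*x_0 + sqrt(1-a)*eps with a = alpha_bar_t:
    # given x_t, predicting either x_0 or eps determines the other,
    # so the denoiser may be parameterized to predict whichever is convenient.
    return (x_t - torch.sqrt(1 - alpha_bar_t) * eps_pred) / torch.sqrt(alpha_bar_t)
```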
|
||||
|
|
@ -133,7 +133,7 @@ Based on this insight, on top of $x_t$ and $t$, shortcut models additionally inc
|
|||
\mathbf{s}_{\text{target}} = s_\theta(x_t, t, d)/2 + s_\theta(x'_{t+d}, t + d, d)/2 \quad \text{and} \quad x'_{t+d} = x_t + s_\theta(x_t, t, d)d
|
||||
{% end %}
|
||||
|
||||

|
||||

|
||||
|
||||
> Illustration of the training process of shortcut models. *Source: Frans et al., "One Step Diffusion via Shortcut Models."*
|
||||
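A minimal sketch of how the self-consistency target above could be computed, assuming `s_theta(x, t, d)` is the shortcut network as a Python callable; the names are illustrative, not taken from the paper's code:

```python
import torch

def shortcut_consistency_target(s_theta, x_t, t, d):
    # Pseudo ground truth for the 2d-step shortcut, composed from two d-step shortcuts.
    with torch.no_grad():                          # stop-gradient: detach the target
        x_next = x_t + s_theta(x_t, t, d) * d      # follow one d-step shortcut
        s_target = 0.5 * s_theta(x_t, t, d) + 0.5 * s_theta(x_next, t + d, d)
    return s_target

# training sketch: regress the 2d-step prediction onto the composed target
# loss = ((s_theta(x_t, t, 2 * d) - s_target) ** 2).mean()
```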
|
||||
|
|
|
|||
BIN
content/ml-tech/one-step-diffusion-models/ode-sde-flow.webp
Normal file
|
After Width: | Height: | Size: 102 KiB |
BIN
content/ml-tech/one-step-diffusion-models/reflow-iterations.webp
Normal file
|
After Width: | Height: | Size: 57 KiB |
BIN
content/ml-tech/one-step-diffusion-models/shortcut-training.webp
Normal file
|
After Width: | Height: | Size: 97 KiB |
|
|
@ -39,7 +39,7 @@ This vector is then directly added to the token embedding vector.
|
|||
To build intuition for how PE works, consider an analogy to old-fashioned electricity meters or car odometers.
|
||||
Imagine a mechanical meter with multiple rotating wheels. The rightmost wheel rotates the fastest, completing a full rotation for each unit of position. The next wheel rotates slower, completing a rotation every 10 units. The wheel to its left rotates even slower, once per 100 units, and so on. Each wheel to the left rotates at an increasingly slower rate than the one before it.
|
||||
|
||||

|
||||

|
||||
|
||||
In the vanilla PE formulation, different dimensions correspond to these different "wheels" rotating at different frequencies determined by $10000^{2i/d_{\text{model}}}$.
|
||||
The sine and cosine functions encode the continuous rotation angle of each wheel.
|
||||
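To make the analogy concrete, here is a minimal sketch of the vanilla sinusoidal PE, where each pair of dimensions acts as one "wheel" with its own rotation frequency (a simplified illustration, not any particular library's implementation):

```python
import torch

def sinusoidal_pe(seq_len, d_model, base=10000.0):
    pos = torch.arange(seq_len).unsqueeze(1).float()   # (seq_len, 1) positions
    i = torch.arange(0, d_model, 2).float()            # even dimension indices
    angle = pos / base ** (i / d_model)                # per-"wheel" rotation angle
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angle)
    pe[:, 1::2] = torch.cos(angle)
    return pe                                          # added to token embeddings
```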
|
|
@ -80,7 +80,7 @@ The dot-product of two rotated vectors depends on their angle difference, which
|
|||
You can also understand RoPE with the rotating meters analogy above, since it is literally rotating vectors as if they were meter hands.
|
||||
After receiving those vectors, the Transformer is like an electrician, who only cares about the relative angle difference of meter hands between two reads, rather than the absolute positions of the meter hands at each read.
|
||||
|
||||

|
||||

|
||||
|
||||
RoPE can be extended to arbitrary $d$ dimensions, by dividing the vector space into multiple 2-dimensional sub-spaces.
|
||||
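A minimal sketch of that rotation for a single query or key vector; `pos` is assumed to be a scalar token position here for simplicity:

```python
import torch

def rope_rotate(x, pos, base=10000.0):
    # Rotate each 2D sub-space of x by an angle proportional to the token position.
    d = x.shape[-1]
    freqs = 1.0 / base ** (torch.arange(0, d, 2).float() / d)  # one frequency per sub-space
    theta = pos * freqs
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * torch.cos(theta) - x2 * torch.sin(theta)
    out[..., 1::2] = x1 * torch.sin(theta) + x2 * torch.cos(theta)
    return out
```

The dot product between `rope_rotate(q, m)` and `rope_rotate(k, n)` then depends only on the offset between `m` and `n`, which is the relative-position property described above.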
|
||||
|
|
@ -149,7 +149,7 @@ Resonance RoPE addresses this by rounding wavelengths to the nearest integer.
|
|||
A wavelength of 10.3 becomes 10. Now positions 0, 10, 20, 30... all show identical rotation angles. When the model sees position 80 or 120 during inference, these align perfectly with positions seen during training. The model doesn't need to generalize to new rotation angles.
|
||||
This applies to all dimensions with wavelengths shorter than the training length. For these dimensions, Resonance RoPE provably eliminates the feature gap between training and inference positions. The rounding happens offline during model setup, so there's no computational cost.
|
||||
|
||||

|
||||

|
||||
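A rough sketch of the wavelength-rounding step as I read it, assuming standard RoPE frequencies; the function name and clamping are my own illustrative choices, not the paper's API:

```python
import math
import torch

def resonance_rope_freqs(d, base=10000.0):
    # Round each sub-space's wavelength to the nearest integer so its rotation
    # angle repeats exactly at integer token positions seen during training.
    freqs = 1.0 / base ** (torch.arange(0, d, 2).float() / d)
    wavelengths = 2 * math.pi / freqs
    return 2 * math.pi / torch.round(wavelengths).clamp(min=1.0)
```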
|
||||
Resonance RoPE works with any RoPE-based method. Combined with YaRN, it provides a complete solution: YaRN handles the long-wavelength dimensions, Resonance handles the short-wavelength ones.
|
||||
Experiments show the combination consistently outperforms YaRN alone on long-context tasks.
|
||||
|
|
@ -162,7 +162,7 @@ The search process treats the rescale factors as parameters to optimize. Startin
|
|||
|
||||
LongRoPE also introduces a progressive extension strategy. Rather than jumping directly from the training length to the target length, it extends in stages: first from 4k to 256k with evolutionary search, then applies the same factors to reach 2048k. The model only needs 1000 fine-tuning steps at 256k tokens to adapt, making the extension process both effective and efficient. This progressive approach reduces the risk of performance degradation that can occur with aggressive single-step extensions.
|
||||
|
||||

|
||||

|
||||
|
||||
> **References:**
|
||||
>
|
||||
|
|
|
|||
BIN
content/ml-tech/rotary-pe/longrope.webp
Normal file
|
After Width: | Height: | Size: 98 KiB |
BIN
content/ml-tech/rotary-pe/odometer.webp
Normal file
|
After Width: | Height: | Size: 20 KiB |
BIN
content/ml-tech/rotary-pe/resonance-rope.webp
Normal file
|
After Width: | Height: | Size: 97 KiB |
BIN
content/ml-tech/rotary-pe/rope-rotation.webp
Normal file
|
After Width: | Height: | Size: 46 KiB |
BIN
content/ml-tech/train-multi-modal-llm/blip-bootstrap.webp
Normal file
|
After Width: | Height: | Size: 22 KiB |
BIN
content/ml-tech/train-multi-modal-llm/deepseek-ocr.webp
Normal file
|
After Width: | Height: | Size: 58 KiB |
BIN
content/ml-tech/train-multi-modal-llm/diffusion-captions.webp
Normal file
|
After Width: | Height: | Size: 84 KiB |
BIN
content/ml-tech/train-multi-modal-llm/image-text-pair.webp
Normal file
|
After Width: | Height: | Size: 49 KiB |
|
|
@ -14,11 +14,11 @@ The most straight-forward method to bridge multi-modal data and text is to train
|
|||
|
||||
For images, it is relatively easy to find a large-scale image dataset where each image is coupled with a text description. For example, you can scrape images from Wikipedia, where they often come with descriptions, or from social media, where users write descriptions.
|
||||
|
||||

|
||||

|
||||
|
||||
There are some practices that can improve the efficiency of this training step. You do not necessarily have to train an LLM from scratch; instead, you can train only the adaptation layer between a pre-trained image encoder (like CLIP) and a text-only pre-trained LLM, like the design in LLaVA shown below.
|
||||
|
||||

|
||||

|
||||
|
||||
> Liu, Haotian, et al. "Visual instruction tuning." _Advances in neural information processing systems_ 36 (2023): 34892-34916.
|
||||
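A minimal sketch of such an adaptation layer; LLaVA itself uses a simple linear or MLP projection, and the exact sizes and two-layer shape here are illustrative assumptions:

```python
import torch.nn as nn

class VisionToLLMAdapter(nn.Module):
    # Project frozen image-encoder features into the LLM embedding space so
    # image tokens can be concatenated with text token embeddings.
    def __init__(self, vision_dim, llm_dim):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, image_features):      # (batch, n_patches, vision_dim)
        return self.proj(image_features)    # (batch, n_patches, llm_dim)
```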
|
||||
|
|
@ -30,13 +30,13 @@ If you have at least a few data-text pairs to begin with, there are methods to e
|
|||
|
||||
You can first train a smaller LLM with the data-text pairs at hand, then use it to generate more descriptions for unlabeled data. For example, with limited image-text pairs, you can first train an image captioner and apply it to unlabeled images to generate more image-text pairs. Images without text descriptions are far more plentiful than those with them.
|
||||
|
||||

|
||||

|
||||
|
||||
> Li, Junnan, et al. "Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation." _International conference on machine learning_. PMLR, 2022.
|
||||
|
||||
Even crazier, you can train a new (or use an off-the-shelf) conditional diffusion model that generates images given descriptions. It is relatively easy to make up descriptions using text-only LLMs.
|
||||
|
||||

|
||||

|
||||
|
||||
> Ma, Feipeng, et al. "Image captioning with multi-context synthetic data." _Proceedings of the AAAI Conference on Artificial Intelligence_. Vol. 38. No. 5. 2024.
|
||||
|
||||
|
|
@ -44,7 +44,7 @@ Based on the idea of instruction-tuning that is widely use to train LLMs, LLaVA
|
|||
- Original text description
|
||||
- Description of bounding boxes, as a textual representation of the spatial relationships of objects
|
||||
|
||||

|
||||

|
||||
|
||||
> Liu, Haotian, et al. "Visual instruction tuning." _Advances in neural information processing systems_ 36 (2023): 34892-34916.
|
||||
|
||||
|
|
@ -60,7 +60,7 @@ You can try to apply the vast available self-supervising methods that have been
|
|||
|
||||
STIC also demonstrates an interesting implementation of self-supervised learning: Use LLMs to generate positive and negative (less preferred) captions of the same image, which can then be used to perform contrastive learning or [direct preference optimization (DPO)](https://arxiv.org/abs/2305.18290).
|
||||
|
||||

|
||||

|
||||
|
||||
> Deng, Yihe, et al. "Enhancing large vision language models with self-training on image comprehension." _Advances in Neural Information Processing Systems_ 37 (2024): 131369-131397.
|
||||
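For reference, a minimal sketch of the DPO objective applied to such (preferred, less-preferred) caption pairs, assuming per-caption log-probabilities are available from the trained model and a frozen reference model:

```python
import torch.nn.functional as F

def dpo_loss(logp_pos, logp_neg, ref_logp_pos, ref_logp_neg, beta=0.1):
    # Push the policy to prefer the positive caption over the negative one,
    # measured relative to a frozen reference model.
    margin = beta * ((logp_pos - ref_logp_pos) - (logp_neg - ref_logp_neg))
    return -F.logsigmoid(margin).mean()
```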
|
||||
|
|
@ -72,7 +72,7 @@ Here a work that is not directly related to the topic of this post, but I feel m
|
|||
|
||||
DeepSeek-OCR is a recently published and very interesting work. The core idea is that, when feeding text into an LLM, it can actually be more token-efficient to paste the text into a Word document, take a screenshot, and feed the image to the LLM, rather than feeding the text directly.
|
||||
|
||||

|
||||

|
||||
|
||||
> Wei, Haoran, Yaofeng Sun, and Yukun Li. "DeepSeek-OCR: Contexts Optical Compression." _arXiv preprint arXiv:2510.18234_ (2025).
|
||||
|
||||
|
|
|
|||
BIN
content/ml-tech/train-multi-modal-llm/llava-architecture.webp
Normal file
|
After Width: | Height: | Size: 14 KiB |
BIN
content/ml-tech/train-multi-modal-llm/llava-instruction.webp
Normal file
|
After Width: | Height: | Size: 167 KiB |
BIN
content/ml-tech/train-multi-modal-llm/stic-self-training.webp
Normal file
|
After Width: | Height: | Size: 82 KiB |