compress more images to webp

Yan Lin 2026-01-30 22:13:23 +01:00
parent ee7245f82f
commit fa9090cacb
54 changed files with 45 additions and 45 deletions

View file

@@ -311,7 +311,7 @@ MCP was introduced by Anthropic in 2024 and has rapidly become the standard for
MCP's architecture is composed of three types of applications: hosts, servers, and clients. **Hosts** are AI applications that users interact with directly, such as Claude Code and IDEs. These applications contain LLMs that need access to external capabilities. **Servers** are external applications that expose specific capabilities to AI models through standardized interfaces. These might include database connectors, file system access tools, or API integrations with third-party services. **Clients** live within host applications and manage connections between hosts and servers. Each client maintains a dedicated one-to-one connection with a specific server, similar to how we saw individual connections in our previous protocol examples.
-![](mcp-architecture.png)
+![](mcp-architecture.webp)
MCP servers can provide three types of capabilities to AI systems: resources, tools, and prompts. **Resources** act like read-only data sources, similar to HTTP `GET` endpoints. They provide contextual information without performing significant computation or causing side effects. For example, a file system resource might provide access to documentation, while a database resource could offer read-only access to customer data. **Tools** are executable functions that AI models can call to perform specific actions. Unlike resources, tools can modify state, perform computations, or interact with external services. Examples include sending emails, creating calendar events, or running data analysis scripts. **Prompts** are pre-defined templates that help AI systems use resources and tools most effectively. They provide structured ways to accomplish common tasks and can be shared across different AI applications.
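To make these three capability types concrete, below is a minimal sketch of an MCP server built with the official Python SDK's FastMCP helper. The server name and the example resource, tool, and prompt are hypothetical; only the decorator pattern follows the SDK.
```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("demo-server")  # hypothetical server name

@mcp.resource("docs://readme")
def readme() -> str:
    """A read-only data source, analogous to an HTTP GET endpoint."""
    return "Project documentation goes here."

@mcp.tool()
def send_email(to: str, subject: str, body: str) -> str:
    """An executable action that may cause side effects."""
    # A real implementation would call an email service here.
    return f"Sent '{subject}' to {to}"

@mcp.prompt()
def summarize(text: str) -> str:
    """A reusable template that guides how the model approaches a task."""
    return f"Summarize the following text in three bullet points:\n\n{text}"

if __name__ == "__main__":
    mcp.run()  # serves the capabilities over the default stdio transport
```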


View file

@@ -11,7 +11,7 @@ Imagine you've trained a new version of your AI model that should be faster and
Simply replacing your old system with the new one is risky. In July 2024, a [routine software update from cybersecurity firm CrowdStrike](https://en.wikipedia.org/wiki/2024_CrowdStrike-related_IT_outages) caused widespread system crashes, grounding flights and disrupting hospitals worldwide. While AI deployments might not have such dramatic impacts, pushing an untested model update to all users simultaneously can lead to degraded user experience, complete service outages, or lost trust if users encounter errors.
-![CrowdStrike outage](crowdstrike.png)
+![CrowdStrike outage](crowdstrike.webp)
This is where deployment strategies come in. These are industry-proven patterns that major tech companies use to update their systems safely. They let you roll out updates gradually to minimize impact, test new versions without affecting real users, compare performance between versions, and switch back quickly if something goes wrong.
@@ -27,7 +27,7 @@ Let's explore four fundamental deployment patterns that you can use when updatin
In a blue-green deployment, you maintain two identical production environments called "blue" and "green." At any time, only one is live and serving user traffic. When you want to deploy a new version of your AI system, you deploy it to the idle environment, test it thoroughly, and then switch all traffic to that environment in one instant cutover. The switch is typically done by updating your load balancer or DNS settings to point to the new environment.
-![Blue-green deployment](blue-green.png)
+![Blue-green deployment](blue-green.webp)
Suppose your blue environment is currently serving users with version 1.0 of your AI model. You deploy version 2.0 to the green environment and run tests to verify everything works correctly. Once you're confident, you update your load balancer to route all traffic to green. Now green is live and blue sits idle. If users report issues with version 2.0, you can immediately switch traffic back to blue. The entire rollback takes seconds.
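Conceptually, the cutover is just swapping which backend a single pointer refers to. A minimal sketch in Python, with hypothetical internal URLs standing in for the two environments:
```python
ENVIRONMENTS = {
    "blue": "http://blue.internal:8000",    # hypothetical URL, live (v1.0)
    "green": "http://green.internal:8000",  # hypothetical URL, idle (v2.0)
}
live = "blue"

def cutover() -> None:
    """Flip all traffic to the other environment in one instant step."""
    global live
    live = "green" if live == "blue" else "blue"

def backend_for(path: str) -> str:
    """Every request is forwarded to whichever environment is live."""
    return ENVIRONMENTS[live] + path

print(backend_for("/classify"))  # served by blue (v1.0)
cutover()                        # after testing green, flip traffic
print(backend_for("/classify"))  # served by green; flipping back is the rollback
```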
@@ -41,7 +41,7 @@ The term "[canary deployment](https://semaphore.io/blog/what-is-canary-deploymen
In a canary deployment, you gradually roll out a new version to an increasing percentage of users. You might start by routing 5% of traffic to the new version while 95% continues using the old version. You monitor the canary group closely for errors, performance issues, or user complaints. If everything looks good, you increase the percentage to 25%, then 50%, then 100%. If problems emerge at any stage, you can halt the rollout and route all traffic back to the old version.
-![Canary deployment](canary.png)
+![Canary deployment](canary.webp)
Imagine you've deployed a new AI model that you believe is more accurate. You configure your load balancer to send 10% of requests to the new model while the rest go to the old model. Over the next few hours, you monitor response times, error rates, and user feedback from the canary group. The new model performs well, so you increase to 50%. After another day of monitoring shows no issues, you complete the rollout to 100% of users.
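The traffic split itself can be as simple as a weighted coin flip in your routing layer. A sketch, with hypothetical backend URLs:
```python
import random

CANARY_PERCENT = 10  # raise to 25, 50, 100 as monitoring stays healthy

def pick_backend() -> str:
    """Route roughly CANARY_PERCENT of requests to the new model."""
    if random.uniform(0, 100) < CANARY_PERCENT:
        return "http://model-v2.internal:8000"  # canary
    return "http://model-v1.internal:8000"      # stable

# Halting the rollout is just setting CANARY_PERCENT back to 0.
print(pick_backend())
```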
@@ -55,7 +55,7 @@ The challenge with canary deployment is that it requires good monitoring and met
In a shadow deployment, you deploy the new version alongside your current production system. Every request that comes to your system gets processed by both versions. Users receive responses only from the stable version, while responses from the new version are logged and analyzed but never used. This lets you test how the new version behaves under real production load and compare its performance to the current version without any user impact.
-![Shadow deployment](shadow.png)
+![Shadow deployment](shadow.webp)
Suppose you've built a new AI model and want to check that it produces better results before showing it to users. You deploy it in shadow mode, where every user request gets sent to both the old model and the new model. Users see only the old model's responses. Meanwhile, you collect data comparing response times, resource usage, and output quality between the two models. After a week of shadow testing shows the new model is faster and more accurate, you confidently move it to production.
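A sketch of the mirroring logic, using the `requests` library and hypothetical URLs; the key property is that the shadow call runs in the background and its outcome never reaches the user:
```python
import threading
import requests

STABLE = "http://model-v1.internal:8000/classify"  # hypothetical URLs
SHADOW = "http://model-v2.internal:8000/classify"

def shadow_call(payload: dict) -> None:
    try:
        r = requests.post(SHADOW, json=payload, timeout=10)
        # Only logged for later comparison, never returned to the user.
        print("shadow:", r.status_code, "latency:", r.elapsed)
    except requests.RequestException as exc:
        print("shadow failed:", exc)  # shadow failures must not affect users

def handle(payload: dict) -> dict:
    threading.Thread(target=shadow_call, args=(payload,), daemon=True).start()
    return requests.post(STABLE, json=payload, timeout=10).json()
```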
@@ -69,7 +69,7 @@ The downside is infrastructure cost and complexity. You're running two complete
In A/B testing deployment, you run two versions of your system side by side and split users between them. Unlike canary deployment where the goal is to gradually roll out a new version safely, A/B testing aims to compare performance between versions to make data-driven decisions. You might run both versions at 50/50 for weeks or months, collecting metrics on user satisfaction, response quality, speed, or business outcomes. The version that performs better according to your chosen metrics becomes the winner.
-![A/B testing](ab-testing.png)
+![A/B testing](ab-testing.webp)
Suppose you have two AI models: model A is faster but slightly less accurate, while model B is more accurate but slower. You're not sure which one will provide better user experience. You deploy both models and randomly assign 50% of users to each. Over the next month, you track metrics like user satisfaction ratings, task completion rates, and how often users retry their requests. The data shows that users with model B complete tasks more successfully and rate their experience higher, even though responses take a bit longer. Based on this evidence, you choose model B as the primary model.
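One practical detail: users should be assigned to a variant deterministically, so the same person always sees the same model for the whole experiment. A hash-based sketch:
```python
import hashlib

def assign_variant(user_id: str) -> str:
    """Stable 50/50 bucketing: the same user always gets the same model."""
    digest = hashlib.sha256(user_id.encode()).digest()
    return "model-a" if digest[0] < 128 else "model-b"

print(assign_variant("user-42"))  # identical on every request from this user
```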


View file

@@ -9,10 +9,10 @@ description = ""
Unless you have been living off-grid for the last few years, you are probably tired of hearing about "AI computers" or something similar.
-![](ai-pc-1.png)
-![](ai-pc-2.png)
-![](ai-pc-3.png)
-![](ai-pc-4.png)
+![](ai-pc-1.webp)
+![](ai-pc-2.webp)
+![](ai-pc-3.webp)
+![](ai-pc-4.webp)
Despite those vendors trying to convince you that you need a new generation of computers to catch up with the AI hype, the underlying design has barely changed. In the last year of WWII, [John von Neumann](https://en.wikipedia.org/wiki/John_von_Neumann) introduced the [Von Neumann architecture](https://www.geeksforgeeks.org/computer-organization-architecture/computer-organization-von-neumann-architecture/). 80 years later, most computers on Earth are still based on this architecture, including most so-called AI computers.
@@ -24,7 +24,7 @@ In 1945, John von Neumann documented what would become the most influential comp
The illustration below shows the Von Neumann architecture. To help you understand the concepts in this architecture, we will use an analogy to a restaurant kitchen. Imagine a busy restaurant kitchen, with orders and recipes (instructions) coming in and ingredients (data) ready to be cooked: chefs (CPU) follow the recipes and prepare dishes, a pantry and a counter (memory unit) store ingredients and recipes, waiters (input/output devices) bring in orders and deliver dishes, and corridors (bus) connect all staff and rooms.
-![](von-neumann.png)
+![](von-neumann.webp)
### Instruction & Data
@@ -85,13 +85,13 @@ A [**bus system**](https://www.geeksforgeeks.org/computer-organization-architect
Another analogy for any of you who have played [Factorio](https://www.factorio.com/) (a factory management/automation game): for scalable production, you will usually also have a bus system connecting storage boxes, I/O endpoints, and the machines actually producing or consuming stuff. Such a system makes it easy to add new sub-systems to existing ones.
-![](factorio-bus.png)
+![](factorio-bus.webp)
### Von Neumann Architecture in Practice
To showcase how this architecture is implemented in the real world, we will use the [Raspberry Pi 5](https://www.raspberrypi.com/products/raspberry-pi-5/)--a small yet complete computer--as an example.
-![](raspberry-pi.png)
+![](raspberry-pi.webp)
To start, we have the **CPU** in the center-left of the board (labelled *BCM2712 processor* in the figure). It is worth noting that, like most modern CPUs, this one has multiple cores: like multiple chefs working together.
@@ -130,17 +130,17 @@ The fundamental mismatch between CPU architecture and AI workload calls for spec
The GPU is the representative type of hardware specialized for AI computing. As its name suggests, it was originally designed for processing computer graphics--more specifically, to accelerate 3D graphics rendering for video games in the 1980s. Rendering a 3D video game involves calculating lighting, shading, and texture mapping, and displaying millions of pixels, with [highly optimized algorithms](https://developer.nvidia.com/gpugems/gpugems3/part-ii-light-and-shadows/chapter-10-parallel-split-shadow-maps-programmable-gpus) that break such calculations into small units that are composed of simple instructions and can be done in parallel.
-![](gpu-rendering.png)
+![](gpu-rendering.webp)
To compute such algorithms more efficiently, GPUs are designed to excel at parallel processing. While [a modern CPU](https://www.amd.com/en/products/processors/desktops/ryzen/9000-series/amd-ryzen-9-9950x3d.html) usually features fewer than 100 powerful cores, [a modern GPU](https://www.nvidia.com/en-us/geforce/graphics-cards/50-series/rtx-5090/) usually contains thousands of weak cores. Each core can only handle simple instructions--just like a primary school student--but all the cores combined can finish a parallelized task much faster than a CPU.
-![](cpu-vs-gpu.png)
+![](cpu-vs-gpu.webp)
The memory on a GPU is also designed around high bandwidth, so that large chunks of data can be accessed quickly. For example, the bandwidth of [DDR memory](https://en.wikipedia.org/wiki/DDR5_SDRAM) for CPUs sits around 50 to 100 GB/s, while the [GDDR memory](https://en.wikipedia.org/wiki/GDDR7_SDRAM) for GPUs can deliver up to 1.5 TB/s, and the [HBM memory](https://en.wikipedia.org/wiki/High_Bandwidth_Memory) specifically designed for AI workloads can deliver up to 2 TB/s.
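As a rough intuition pump for those numbers, consider how long a single pass over the weights of a 14 GB model (roughly a 7B-parameter model at 16-bit precision, an assumption for illustration) takes at each bandwidth:
```python
MODEL_SIZE_GB = 14  # ~7B parameters at 16 bits each

for name, gb_per_s in [("DDR (CPU)", 100), ("GDDR7 (GPU)", 1500), ("HBM", 2000)]:
    ms = MODEL_SIZE_GB / gb_per_s * 1000
    print(f"{name}: {ms:.0f} ms per full read of the weights")
# DDR (CPU): 140 ms, GDDR7 (GPU): 9 ms, HBM: 7 ms
```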
Interestingly, computer graphics' need for parallel processing and high bandwidth aligns quite well with AI computing. Thus, the GPU has become the dominant type of specialized hardware for AI workloads in recent years. Sadly, this means major GPU brands don't give a sh\*t about gamers and general consumers anymore.
-![](nvidia-jensen.png)
+![](nvidia-jensen.webp)
### Tensor Processing Unit (TPU)
@@ -148,7 +148,7 @@ Although GPU accidentally became perfect for AI workloads by repurposing compute
One example is Google's [TPU](https://cloud.google.com/tpu). The TPU adopts an architecture where thousands of simple processor cores are aligned in a grid, and the incoming data and instructions flow through the grid like waves: each processor core does a small calculation and passes the result to its neighbors.
-![](tpu-architecture.png)
+![](tpu-architecture.webp)
Hardware like the TPU is highly specialized for AI computing, which means it can be more efficient for AI workloads than GPUs, which still need to handle graphics and other general computing tasks. However, this also means TPUs are impractical for any other tasks. Nowadays TPUs are largely found in data centers, especially those built by Google itself.


View file

@@ -12,7 +12,7 @@ In [Interact with AI Systems](@/ai-system/interact-with-ai-systems/index.md) we'
> **Example:**
> ChatGPT can be accessed through OpenAI's official website, mobile/desktop apps, other AI-based applications (such as Perplexity), Python scripts, or even command line scripts, all through the same family of APIs OpenAI has published.
-![API overview](api-overview.png)
+![API overview](api-overview.webp)
## The Three Pillars of APIs
@@ -30,7 +30,7 @@ Finally, we have ports. Just as some people run several businesses in the same l
We should also briefly address the [difference between a URL and a domain](https://www.geeksforgeeks.org/computer-networks/difference-between-domain-name-and-url/) here. Think of the domain `api.openai.com` as the building address like *Fredrik Bajers Vej 7K* that usually corresponds to a certain group of hardware resources. The full URL is like an address with floor and room number like *Fredrik Bajers Vej 7K, 3.2.50*, which in the below example specifies the version of the API (v1) and the specific function (conversation completion).
-![URL structure](url-structure.png)
+![URL structure](url-structure.webp)
> **Videos:**
> - [The OSI model of computer networks](https://www.youtube.com/watch?v=keeqnciDVOo)
@@ -177,11 +177,11 @@ Before we proceed to integrate interactions with APIs into our applications, we
[Postman](https://www.postman.com/) is a popular API testing tool. To send an API request with Postman, fill in the components of an [HTTP Request](#http-request) into its interface:
-![Postman request](postman-request.png)
+![Postman request](postman-request.webp)
Click send, and after a while you should be able to see the response with components of an [HTTP Response](#http-response):
-![Postman response](postman-response.png)
+![Postman response](postman-response.webp)
Feel free to explore other functionalities of Postman yourself. Apart from being able to send API requests in a graphical user interface, you can also form a collection of requests for reuse and structured testing. Postman also comes with collaboration tools that can come in handy when developing in a team. Alternatives to Postman include [Hoppscotch](https://hoppscotch.io/) and [Insomnia](https://insomnia.rest/), [among others](https://apisyouwonthate.com/blog/http-clients-alternatives-to-postman/), all with similar core functionalities.
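For comparison, the same request you assemble in Postman's interface takes only a few lines with the `requests` library; the model name and key below are placeholders:
```python
import requests

response = requests.post(
    "https://api.openai.com/v1/chat/completions",          # URL
    headers={"Authorization": "Bearer YOUR_API_KEY"},      # headers
    json={                                                 # body
        "model": "gpt-4o-mini",
        "messages": [{"role": "user", "content": "Hello!"}],
    },
)
print(response.status_code)  # e.g. 200
print(response.json())       # the parsed response body
```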


View file

@@ -17,13 +17,13 @@ In this module we will learn how to deploy our system onto the ["cloud"](https:
When we talk about "the cloud," we're really talking about computers in [data centers](https://en.wikipedia.org/wiki/Data_center) that you can access over the internet. The term comes from old network diagrams where engineers would draw a cloud shape to represent "the internet" or any network whose internal details weren't important at that moment. Over time, this symbol became associated with computing resources accessed remotely.
-![Cloud diagram](cloud-diagram.png)
+![Cloud diagram](cloud-diagram.webp)
Cloud infrastructure emerged from a practical problem. Companies like Amazon and Google built massive computing facilities to handle peak loads (holiday shopping spikes, search traffic surges), but these expensive resources sat mostly idle during normal times. They realized they could rent out this spare capacity to others, and the modern cloud industry was born. What started as monetizing excess capacity evolved into a fundamental shift in how we provision computing resources.
The key technical innovation that made cloud practical is [virtualization](https://en.wikipedia.org/wiki/Virtualization). This technology allows one physical machine to be divided into many isolated virtual machines, each acting like a separate computer with its own operating system. A single powerful server might run dozens of virtual machines for different customers simultaneously. This sharing model dramatically improved efficiency, since physical servers could be fully utilized rather than sitting idle.
-![Virtualization](virtualization.png)
+![Virtualization](virtualization.webp)
You might recall from [Packaging & containerization](@/ai-system/packaging-containerization/index.md) that containers also provide isolation, but they work at a different level. Virtual machines virtualize the entire hardware, giving each VM its own complete operating system. Containers, in contrast, share the host's operating system kernel and only isolate the application and its dependencies. This makes VMs heavier but more isolated, suitable for running entirely different operating systems or providing stronger security boundaries. Containers are lighter and faster, ideal for packaging applications. In practice, cloud infrastructure often uses both: VMs to divide physical servers among customers, and containers running inside those VMs to package and deploy applications.
@@ -129,7 +129,7 @@ We'll use the image classification API server from [Wrap AI Models with APIs](@/
When creating a VM through your cloud provider's interface, you'll need to make several decisions about its configuration. These choices affect both performance and cost, but the good news is you can always resize or recreate your VM later if your needs change.
-![VM creation](vm-creation.png)
+![VM creation](vm-creation.webp)
**Operating System**: Choose a Linux distribution. [Ubuntu LTS (Long Term Support)](https://ubuntu.com/about/release-cycle) versions like 22.04 or 24.04 are excellent choices because they receive security updates for five years and have extensive community documentation. Most cloud providers offer Ubuntu as a one-click option. Other good alternatives include Debian or Rocky Linux, but Ubuntu's popularity means you'll find more tutorials and troubleshooting help online.
@@ -462,7 +462,7 @@ Your API server is now running and accessible at `http://your-server-ip:8000`. T
**Professional Expectations**: Users expect to see a padlock icon in their browser's address bar. Browsers display prominent warnings for HTTP sites, damaging trust before users even interact with your service. Search engines also penalize HTTP sites in rankings.
-![HTTPS warning](https-warning.png)
+![HTTPS warning](https-warning.webp)
To make your API production-ready, you need HTTPS, which requires a domain name and an SSL/TLS certificate. Let's walk through the process.
@@ -488,7 +488,7 @@ Before obtaining an SSL certificate, you need a domain name. Certificates are ti
DuckDNS also provides an API for updating your IP if it changes, useful for home servers. The main limitation is that your domain will be longer (e.g., `my-ai-api.duckdns.org`) and less professional than a custom domain. For learning and testing HTTPS setup, DuckDNS is perfect.
-![DuckDNS](duckdns.png)
+![DuckDNS](duckdns.webp)
**Paid Option: Domain Registrars**
@@ -500,7 +500,7 @@ For production applications, consider purchasing your own domain. As of 2024, se
When choosing a registrar, focus on renewal prices, not just first-year promotional rates.
-![Cloudflare registrar](cloudflare-registrar.png)
+![Cloudflare registrar](cloudflare-registrar.webp)
**Setting Up DNS Records**


View file

@@ -22,7 +22,7 @@ Before diving into implementation, we need to understand what these terms mean a
The term "edge" simply refers to devices at the boundary of a network, where data is actually created or where people interact with systems. Think smartphones, smart home devices, sensors in factories, self-driving cars, or a server sitting in an office closet.
-![Edge computing diagram](edge-computing.png)
+![Edge computing diagram](edge-computing.webp)
You've probably seen edge computing in action without realizing it. A [Raspberry Pi](https://www.raspberrypi.com/), that tiny €50 computer the size of a credit card, can run AI models for home projects. The [Raspberry Pi AI Camera](https://www.raspberrypi.com/documentation/accessories/ai-camera.html) runs object detection directly on the camera itself, spotting people, cars, or pets in real-time without ever sending video to the cloud. Tech YouTuber Jeff Geerling has built some impressive setups, like [a Raspberry Pi AI PC with multiple neural processors](https://www.jeffgeerling.com/blog/2024/55-tops-raspberry-pi-ai-pc-4-tpus-2-npus) for local AI processing. For more demanding applications, [Nvidia Jetson](https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/) boards pack serious GPU power into a small package. The [Jetson community](https://developer.nvidia.com/embedded/community/jetson-projects) has built everything from bird identification systems that recognize 80 species by sound, to indoor robots that map your home and remember where you left things.
@@ -34,11 +34,11 @@ Here's something you might not have realized: you've been doing self-hosted depl
The beauty of self-hosting is that it works at any scale. At the simplest level, you're repurposing hardware you already have. That old laptop collecting dust in a drawer? Install Linux on it and you have a perfectly capable home server. An old desktop that would otherwise go to the landfill can run your AI models, host your files, or serve your applications. Even a Raspberry Pi or a NAS (Network Attached Storage) device can run containerized services.
-![Home server setup](home-server.png)
+![Home server setup](home-server.webp)
But self-hosting isn't just about recycling old hardware. Building a new system from scratch can make economic sense too. Consider storage: major cloud providers charge around €18-24 per terabyte per month (budget providers like Backblaze start around €5/TB). If you need 10TB of storage from a major provider, that's €180-240 monthly, adding up to €2,160-2,880 per year. You could build a dedicated storage server with multiple hard drives for €900-1,400, breaking even in under a year. After that, it's essentially free (minus electricity). Plus, transferring files over your home network is dramatically faster than uploading or downloading from the cloud. Gigabit ethernet gives you around 100MB/s transfer speeds, while most home internet uploads max out at 10-50MB/s.
-![Storage cost comparison](storage-cost.png)
+![Storage cost comparison](storage-cost.webp)
Beyond economics, self-hosting gives you complete control. Your data stays on your hardware, in your home or office. There are no monthly bills that can suddenly increase, no vendor lock-in forcing you to use proprietary APIs, and no worrying about whether a cloud provider will shut down your account. For learners, self-hosting offers hands-on experience with real infrastructure that you can't get from managed cloud services. And if you need specialized hardware like GPUs for AI work, owning the equipment often makes more sense than paying cloud providers' premium hourly rates, especially if you're using it regularly.
@@ -61,11 +61,11 @@ The hardware you choose depends on your use case, budget, and what you might alr
For learning and light workloads, a **[Raspberry Pi](https://www.raspberrypi.com/products/raspberry-pi-5/)** (around €50-95 for the Pi 5 with 4-8GB RAM) is hard to beat. It's tiny, power-efficient (using about 3-5 watts), and runs a full Linux operating system. Perfect for running lightweight AI models, home automation, or small API servers. The Pi 5 with 8GB RAM can comfortably handle our image classification API from earlier modules.
-![Raspberry Pi](raspberry-pi.png)
+![Raspberry Pi](raspberry-pi.webp)
If you need more power for AI workloads, **[Nvidia Jetson](https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/jetson-orin/)** boards (around €230-240 for the [Jetson Orin Nano Super Developer Kit](https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/jetson-orin/nano-super-developer-kit/)) come with integrated GPUs designed specifically for AI inference. They're overkill for simple projects but shine when running larger models or processing video streams in real-time.
-![Nvidia Jetson](jetson.png)
+![Nvidia Jetson](jetson.webp)
Don't overlook that **old laptop or desktop** sitting unused. An x86 machine from the last decade probably has more RAM and storage than a Raspberry Pi, runs cooler than a gaming desktop, and costs nothing if you already own it. Laptops are particularly attractive because they're power-efficient and come with a built-in battery (basically a free UPS). [Repurposing an old laptop as a Linux server](https://dev.to/jayesh_w/this-is-how-i-turned-my-old-laptop-into-a-server-1elf) is a popular project that teaches you server management without any upfront cost. Old workstations with dedicated GPUs can even handle serious AI workloads.
@@ -97,7 +97,7 @@ For your custom images, you have two options. The simple approach is building di
A quick way to check if an image supports your architecture: look at the image's Docker Hub page. For example, the [official Python image](https://hub.docker.com/_/python) shows supported platforms including `linux/amd64` (x86), `linux/arm64` (64-bit ARM like Raspberry Pi 4/5), and `linux/arm/v7` (32-bit ARM like older Pis). If your architecture isn't listed, you'll need to build the image yourself or find an alternative.
-![Docker Hub platform support](docker-hub-platforms.png)
+![Docker Hub platform support](docker-hub-platforms.webp)
### Deploying Your Container
@@ -115,7 +115,7 @@ If you just want to use your services at home or within your organization's netw
Every device on your network gets a local IP address, usually something like `192.168.1.100` or `10.0.0.50`. To find your device's IP, SSH into it and run `ip addr show` (or `ip a` for short), which shows all network interfaces and their addresses. Look for the interface connected to your network (often `eth0` for ethernet or `wlan0` for WiFi) and find the line starting with `inet`. Alternatively, check your router's admin interface, which usually lists all connected devices with their IPs and hostnames.
-![IP address output](ip-addr.png)
+![IP address output](ip-addr.webp)
Once you have the IP, access your service just like you would a cloud server, but using the local address. If your API runs on port 8000, visit `http://192.168.1.100:8000` from any device on the same network. SSH works the same way: `ssh username@192.168.1.100`. This is the same remote access concept we covered in [Cloud Deployment](@/ai-system/cloud-deployment/index.md), just with a local IP instead of a public one.


View file

@@ -8,7 +8,7 @@ description = ""
In October 2025, millions of people worldwide woke up to find ChatGPT unresponsive. Snapchat wouldn't load. Fortnite servers were down. Even some banking apps stopped working. All thanks to [a single issue in an AWS data center](https://9to5mac.com/2025/10/20/alexa-snapchat-fortnite-chatgpt-and-more-taken-down-by-major-aws-outage/) that cascaded across hundreds of services. For over half a day, these services were unavailable, and there was nothing users could do except wait, or find alternatives.
-![](aws-outage.png)
+![](aws-outage.webp)
Now imagine this happens to your AI API server. You've successfully deployed it to the cloud following [Cloud Deployment](@/ai-system/cloud-deployment/index.md), users are accessing it, and everything seems great. Then at 2 AM on a Saturday, something breaks. How long until users give up and try a competitor's service? How many will come back? In today's world where alternatives are just a Google search away, reliability is essential for survival.
@@ -64,7 +64,7 @@ Why does MTTR matter so much? Because modern research shows that downtime is ver
For your AI API server, MTTR includes several steps. First, you notice something is wrong (through monitoring alerts or user complaints). Then you log in to your server remotely and check the logs. Next, you identify the root cause. Then you apply the fix and verify that it works. Finally, you confirm that users can access the service again. The faster you can complete this cycle, the lower your MTTR and the better your availability.
-![](mttr-process.png)
+![](mttr-process.webp)
#### The Availability Formula
@@ -179,7 +179,7 @@ Imagine you have a room lit by a single light bulb. If that bulb burns out, the
A SPOF is any component in your system that, if it fails, causes everything to stop working. SPOFs are dangerous because they're often invisible until they actually fail. Your system runs fine for months, everything seems great, and then one day that critical component breaks and suddenly users can't access your service.
-![](spof-diagram.png)
+![](spof-diagram.webp)
We can use the AI API server deployed in [Cloud Deployment](@/ai-system/cloud-deployment/index.md) as an example to identify the potential SPOFs. If you're running everything on one virtual machine and it crashes (out of memory, hardware failure, data center issue), your entire service goes down. Users get connection errors and can't make any requests. If the database file gets corrupted (disk failure, power outage during write, software bug), you lose all your request history and any user data. The API might crash or return errors because it can't access the database. If the model file is deleted or corrupted, your API can still accept requests but can't make predictions. Every classification request fails. If the internet connection to your VM fails (ISP issue, data center network problem), users can't reach your service even though it's running perfectly. If your API calls another service (maybe for extra features) and that service goes down, your API might become unusable even though your own code is working fine.
@@ -225,7 +225,7 @@ Both are valuable. Redundancy keeps your service running when components fail. B
Instead of running your AI API on a single cloud VM, you run it on two or more VMs simultaneously. A [load balancer](https://aws.amazon.com/what-is/load-balancing/) sits in front, distributing incoming requests across all healthy servers. When one server crashes, the load balancer stops sending traffic to it and routes everything to the remaining servers, and your API keeps responding to requests. Users might not even notice the problem. That's the beauty of redundancy: your service keeps running, and you can fix the failed server later.
-![](load-balancer.png)
+![](load-balancer.webp)
Suppose you currently run your containerized API on one cloud VM. Here's how to add hardware redundancy. Deploy the same Docker container on a second VM, maybe in a different availability zone or even region. Set up a load balancer using [Nginx](https://nginx.org/en/docs/http/load_balancing.html), cloud load balancers (like [AWS ELB](https://aws.amazon.com/elasticloadbalancing/)), or simple [DNS round-robin](https://en.wikipedia.org/wiki/Round-robin_DNS). Configure health checks so the load balancer pings each server periodically (like `GET /health`). If a server doesn't respond, traffic stops going to it. If your API is stateless (each request independent), this just works. If you store state, you'll need shared storage or session replication.
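The health check endpoint the load balancer pings can be a one-liner in FastAPI; a minimal sketch (a more thorough check might also verify the model and database are reachable):
```python
from fastapi import FastAPI

app = FastAPI()

@app.get("/health")
def health() -> dict:
    # A 200 response tells the load balancer this instance can take traffic;
    # a timeout or error takes it out of the rotation.
    return {"status": "ok"}
```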
@@ -280,7 +280,7 @@ Set this to run automatically at 2 AM every day, and now if your database corrup
Security experts recommend the 3-2-1 rule for critical data. Keep 3 copies of your data (original plus two backups), on 2 different storage types (like local disk plus cloud storage), with 1 off-site backup (survives building fire, flood, or local disaster). For your AI API, this might look like keeping your original SQLite database on your cloud VM (`/app/data/ai_api.db`), a daily snapshot on the same VM but a different disk/partition (Backup 1), and another daily snapshot uploaded to cloud storage like AWS S3 or Google Cloud Storage (Backup 2). This protects against several scenarios. If you accidentally delete something, restore from Backup 1 on the same VM (very fast). If a disk fails, restore from Backup 2 in cloud storage (a bit slower). If your VM is terminated, restore from Backup 2 and rebuild the VM. If an entire data center fails, Backup 2 is in a different region and remains accessible. The cloud storage backup is particularly important. If your entire VM is deleted (you accidentally terminate it, cloud provider has issues, account compromised), your local backups disappear too. Cloud storage in a different region survives these disasters.
-![](backup-321.png)
+![](backup-321.webp)
Backups enable recovery (they reduce MTTR). But [replication](https://www.geeksforgeeks.org/system-design/database-replication-and-their-types-in-system-design/) prevents downtime in the first place (it increases MTBF). With replication, you maintain two or more copies of your database that stay continuously synchronized. How does it work? The primary database handles all write operations (create, update, delete). Replica databases continuously receive updates from the primary and stay in sync. Replicas can handle read operations, spreading the load. If the primary fails, you promote a replica to become the new primary.


View file

@@ -13,7 +13,7 @@ But before you go ahead and deploy your AI API server on all kinds of machines a
Recall the AI API server we implemented in [Wrap AI Models with APIs](@/ai-system/wrap-ai-with-api/index.md). You probably ran it directly on your machine, installed the required Python packages locally, and hoped everything would still work a few months later. But what happens when you update your machine's operating system, or you want to deploy it on a different machine with a different operating system, or when your group members try to run it but have conflicting Python versions? Saying *it (used to) work on my machine* certainly doesn't help.
-![Works on my machine](works-on-my-machine.png)
+![Works on my machine](works-on-my-machine.webp)
Are there techniques that can ensure that the runtime environment for our programs is consistent regardless of the operating system and OS-level runtime, so we can deploy our programs to any computer with the confidence that they just work? Yes, and you guessed it: packaging and containerization techniques.
@@ -33,7 +33,7 @@ Containers solve this by creating isolated environments that package your applic
Think of containers like this: at a traditional Chinese dinner, everyone shares dishes from the center of the table. But, what if one person needs gluten-free soy sauce while another needs regular? What if someone accidentally adds peanuts to a shared dish when another guest has allergies? Containers are like giving each person their own Western-style plated meal with exactly the seasonings and portions they need. No sharing conflicts, no contamination between dishes, and everyone gets precisely what works for them, while still sitting at the same table.
-![Container analogy](container-analogy.png)
+![Container analogy](container-analogy.webp)
The benefits of containers quickly made containerization the industry standard for large-scale software deployment. Today, there is a very high chance that one of the applications you use every day is running in containers. It is [reported](https://www.docker.com/blog/2025-docker-state-of-app-dev/) that by 2025, container usage in the IT industry had reached 92%. With the help of containers, companies can deploy updates without downtime, handle more users by scaling automatically, and run the same software reliably across different hardware infrastructures.
@@ -60,7 +60,7 @@ Similarly, each container image is a system of layers. Each layer represents a s
Since containers running on one machine usually have layers in common, especially base layers such as the Python runtime, they share those common layers so that only one copy of each exists. Duplicate layers do not have to be stored, which saves storage space. Also, updating a container doesn't involve rebuilding the whole image, just the layers that have been modified.
-![Container layers](container-layers.png)
+![Container layers](container-layers.webp)
> **Extended Reading:**
> When a container runs, it obviously needs to modify files in the layers, like storing temporary data. But it seems that this would break the reusability of layers. Thus, there is actually a [temporary writable layer](https://medium.com/@princetiwari97940/understanding-docker-storage-image-layers-copy-on-write-and-how-data-changes-work-caf38c2a3477) on top of the read-only layers when a container is running. All changes happen in this writable layer while the container runs, and the underlying layers of the image itself are untouched.


View file

@@ -11,7 +11,7 @@ In the previous two modules we've seen many industry-standard API techniques and
> **Example:**
> This blog site is a fully self-hosted website with basic HTTP-based APIs for handling `GET` requests from browsers. When you visit this post, your browser essentially sends a `GET` request to my server and the server responds with the HTML body for the browser to render. Knowing how to implement your own APIs enables you to do lots of cool stuff that you can control however you want!
-![API server kitchen analogy](api-server-kitchen.png)
+![API server kitchen analogy](api-server-kitchen.webp)
APIs are served by API servers—applications that listen for incoming API requests and produce the corresponding responses. They are like kitchens that maintain order and delivery windows for accepting and fulfilling orders, but usually keep how an order is prepared behind closed doors. The publicly accessible APIs you've been playing with in previous modules are nothing magical: they are served by API servers run by providers on one or more machines identified by the APIs' corresponding domains. We will compare a few choices of Python frameworks for implementing API servers, and focus on one of them to demonstrate how to implement the API fundamentals you learned in previous modules in practice.
@@ -93,7 +93,7 @@ uvicorn main:app --reload --host 127.0.0.1 --port 8000
Where `main:app` points to the `app` object we implemented in the `main` program. `--reload` tells the server to automatically restart itself after we modify `main.py` for ease of development. `127.0.0.1` is the IP of "localhost"—the computer we run the server on, and `--host 127.0.0.1` means the server will only accept requests sent from the same computer. `8000` is the port our server listens on, in other words, the port used to identify our server application. You can now try to send a `GET` request to `http://127.0.0.1:8000` with another Python application and the `requests` library, or by accessing the URL in your browser, and you should be able to see the message.
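That "another Python application" can be as small as this, assuming the server above is running:
```python
import requests

response = requests.get("http://127.0.0.1:8000")
print(response.status_code)  # 200
print(response.json())       # the message returned by the root endpoint
```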
-![FastAPI browser response](fastapi-browser.png)
+![FastAPI browser response](fastapi-browser.webp)
You will also be able to see the log messages from your server:
@@ -401,7 +401,7 @@ async def classify_image(request: ImageRequest):
Now you have a little image classification API server! I sent it a picture of a Spanish-style seafood casserole I made yesterday (it's delicious, by the way) by encoding the image to `base64` format.
-![Seafood casserole](seafood-casserole.png)
+![Seafood casserole](seafood-casserole.webp)
And I got the classification result from the server:
