migrate all posts from ai-system

Yan Lin 2026-01-30 19:42:30 +01:00
parent 49f48aa27c
commit 2f908fc616
65 changed files with 3949 additions and 30 deletions


@@ -1,26 +0,0 @@
I am in the process of migrating content from my previous Quartz 4-based blog site in /Users/yanlin/Documents/Projects/personal-blog to this Zola-based blog site.
For each blog post, follow this migration process:
1. Create an empty bundle (a directory with the same name as the old markdown file) under the same section as the original
2. Copy the old markdown file to the bundle as index.md (first copy the file using `cp` directly, then edit)
3. Edit the frontmatter:
```
+++
title = "(the original title)"
date = (the old created field)
description = "(leave blank)"
+++
```
4. Find, copy, and rename the images used in the post to the bundle
5. Replace the old Obsidian-flavor markdown links (images ![[]] and internal links [[]]) with standard markdown links
6. Turn callout blocks into standard markdown quote blocks, e.g., `>[!note]`, `>[!TLDR]`, `>[!quote]` → `> **Note:**`, `> **TL;DR:**`, `> **References:**`; similarly `> [!tip] Videos` → `> **Videos:**`, `> [!info] Extended Reading` → `> **Extended Reading:**`
7. For multiline math equations (those with \\), wrap the whole equation like below to avoid Zola's processing:
```
{% math() %}
f_{\{q,k\}}(x_m, m) = \begin{pmatrix} \cos m\theta & -\sin m\theta \\ \sin m\theta & \cos m\theta \end{pmatrix} \begin{pmatrix} W_{\{q,k\}}^{(11)} & W_{\{q,k\}}^{(12)} \\ W_{\{q,k\}}^{(21)} & W_{\{q,k\}}^{(22)} \end{pmatrix} \begin{pmatrix} x_m^{(1)} \\ x_m^{(2)} \end{pmatrix}
{% end %}
```


@@ -3,3 +3,26 @@ title = "AI Systems"
sort_by = "date"
paginate_by = 10
+++
Companion literature for the [DAKI3 - AI Systems & Infrastructure](https://www.moodle.aau.dk/course/view.php?id=57016) course at Aalborg University in the form of blog posts.
## Course Outline
- Phase A: [Interact with AI Systems](@/ai-system/interact-with-ai-systems/index.md)
- Module 1: [API Fundamentals](@/ai-system/api-fundamentals/index.md)
- Module 2: [Advanced APIs in the Era of AI](@/ai-system/advanced-apis/index.md)
- Module 3: [Wrap AI Models with APIs](@/ai-system/wrap-ai-with-api/index.md)
- Phase B: [Infrastructure & Deployment of AI](@/ai-system/infrastructure-deployment/index.md)
- Module 4: [AI Compute Hardware](@/ai-system/ai-compute-hardware/index.md)
- Module 5: [Packaging & Containerization](@/ai-system/packaging-containerization/index.md)
- Module 6: [Cloud Deployment](@/ai-system/cloud-deployment/index.md)
- Module 7: [Edge & Self-hosted Deployment](@/ai-system/edge-self-hosted-deployment/index.md)
- Module 8: [Mini Project](@/ai-system/mini-project/index.md)
- Phase C: [Production-ready AI Systems](@/ai-system/production-ready-systems/index.md)
- Module 9: [High Availability & Reliability](@/ai-system/high-availability/index.md)
- Module 10: [Advanced Deployment Strategies](@/ai-system/advanced-deployment/index.md)
## Other Materials
- [Reference implementation of exercises](https://github.com/orgs/AI-Systems-Infrastructure/repositories)
- [Exam Format](@/ai-system/exam/index.md)


@@ -0,0 +1,368 @@
+++
title = "A.2-Advanced APIs in the Era of AI"
date = 2025-09-11
description = ""
+++
> **TL;DR:**
> Advanced API patterns and techniques enable high-performance, real-time, and message-driven communication essential for modern AI systems—like subscription services that deliver continuous updates rather than requiring individual requests.
In [API Fundamentals](@/ai-system/api-fundamentals/index.md) we established the three pillars of APIs and learned how to interact with them using basic HTTP methods. While these fundamentals work well for simple request-response patterns, modern AI systems demand more sophisticated communication approaches. Consider how sending a letter and waiting for a response works for occasional communication, but becomes impractical when you need continuous updates—like sending hourly request letters to a weather service instead of receiving automatic daily forecasts.
> **Example:**
> When you send a request to OpenAI/Anthropic's API, you wait for a few seconds for the complete response to appear. However, when you interact with ChatGPT/Claude on their official web/mobile app, their responses are continuously streamed to you word-by-word. In reality, the streaming behavior is also achievable through APIs.
![](stream1.gif)
In this module, we'll explore advanced API techniques that enable more flexible communication patterns, especially relevant for modern AI systems. We'll start with additional fundamentals like rate limiting and versioning, then move to implementing streaming and message-driven protocols. We'll also touch on the shiny new star of AI communication protocols, the Model Context Protocol. Finally, we'll examine architectures that make it possible to process high-throughput data efficiently.
## Additional Fundamentals
Before exploring advanced protocols, let's examine some additional fundamental concepts that we encountered in the previous module but didn't explore in depth. These concepts are particularly relevant for AI APIs.
### API Versioning
As we've established, APIs are essential means of communication in the digital world, and most API-based interactions happen automatically—you wouldn't expect there to be a human behind the millions of API requests and responses happening every second. For the digital world to keep working correctly on its own, the specification of each API must stay consistent. Yet it is also impractical to never update an API to incorporate new features or changes, especially for AI services, where new features and updates to AI models are constantly introduced. [API versioning](https://www.postman.com/api-platform/api-versioning/) is a process to tackle this dilemma.
API versioning is the practice of managing different iterations of APIs, allowing providers to introduce changes and new features without breaking existing interactions. Think of it like maintaining backward compatibility—old systems continue working while new features become available in newer versions.
There are a few common [versioning strategies](https://api7.ai/learning-center/api-101/api-versioning) you will witness when exploring existing AI APIs.
**URL path versioning** is the most straightforward approach, embedding version information directly in the endpoint URL. For example, `https://api.example.com/v1/generate` versus `https://api.example.com/v2/generate`. This makes the version immediately visible and easy to understand. You probably have noticed that both OpenAI and Anthropic use this versioning approach.
**Header-based versioning** keeps URLs unchanged by specifying versions through HTTP headers like `API-Version: 2.1` or `Accept: application/vnd.api+json;version=2`. This approach is more flexible but less transparent.
**Query parameter versioning** uses URL parameters such as `?version=1.2` or `?api_version=latest`. While simple to implement, it can clutter URLs and may not feel as clean as other approaches. This approach also doesn't fit nicely with the REST standard we introduced before.
**Model-specific versioning** is particularly relevant for AI services, where different model versions (like `gpt-3.5-turbo` vs `gpt-4o`) represent distinct capabilities. This is usually specified with a key in the request body.
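To make these strategies concrete, below is a rough client-side sketch of what each looks like with the `requests` library. The endpoint reuses the placeholder domain from above, and the header name, parameter, and request bodies are illustrative assumptions, not any specific provider's API.
```python
import requests

body = {"model": "example-model-v2", "prompt": "Hello!"}  # model-specific versioning via a key in the body

# URL path versioning: the version is part of the endpoint itself
requests.post("https://api.example.com/v2/generate", json=body)

# Header-based versioning: same URL, version negotiated through a header
requests.post("https://api.example.com/generate", headers={"API-Version": "2.1"}, json=body)

# Query parameter versioning: version passed as a URL parameter
requests.post("https://api.example.com/generate", params={"version": "1.2"}, json=body)
```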
### Rate Limiting
As its name suggests, [rate limiting](https://www.truefoundry.com/blog/rate-limiting-in-llm-gateway) is a strategy implemented by API providers to control the number of requests processed within a given time frame. Rate limiting is particularly important in AI services because advanced AI models are computationally expensive, and without proper limits, a few heavy users could overwhelm the entire service. You might not have encountered rate limiting during practice in the previous module since usage costs typically hit budget limits first. However, understanding rate limiting becomes crucial when scaling applications.
Rate limiting strategies vary across providers, with different rules typically applied to different AI models and user tiers. Take a look at [OpenAI](https://platform.openai.com/docs/guides/rate-limits) and [Anthropic](https://docs.anthropic.com/en/api/rate-limits#tier-1)'s rate limiting strategies for reference. Generally speaking, there are a few types of rate limiting:
- **Request-based**: X requests per minute/hour, common for many APIs
- **Token-based**: Limit by input/output tokens, common for conversational AI services where processing power is directly related to the number of tokens used
- **Concurrent requests**: Maximum simultaneous connections, more frequently seen in data storage services
- **Resource-based**: GPU time or compute units, common for cloud computing services
There are also different algorithms for determining when the rate limit is hit and recovered:
- **Fixed window**: A fixed limit within specific time frames (e.g., 100 requests per minute, reset every minute). Easy to implement but can cause traffic spikes at window boundaries.
- **Sliding window**: Continuously calculates usage based on recent activity, providing smoother request distribution and preventing burst abuse.
- **Token bucket**: Allows requests only when tokens are available in a virtual "bucket," with tokens replenished at a fixed rate. This allows short bursts while maintaining overall rate control.
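As a concrete example of the last algorithm, here is a minimal, self-contained token bucket sketch in Python. It is an illustration of the idea, not any particular provider's implementation.
```python
import time

class TokenBucket:
    """Minimal token-bucket rate limiter: a request is allowed only if a token is available."""

    def __init__(self, capacity: int, refill_rate: float):
        self.capacity = capacity        # maximum burst size
        self.refill_rate = refill_rate  # tokens added per second
        self.tokens = float(capacity)
        self.last_refill = time.monotonic()

    def allow_request(self) -> bool:
        now = time.monotonic()
        # Replenish tokens based on elapsed time, without exceeding capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Allow bursts of up to 10 requests, refilling 2 tokens per second
bucket = TokenBucket(capacity=10, refill_rate=2)
for i in range(15):
    print(i, "allowed" if bucket.allow_request() else "rate limited")
```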
> **Videos:**
> - [API versioning explained](https://www.youtube.com/watch?v=vsb4ZkUytrU)
> - [Rate limiting algorithms](https://www.youtube.com/watch?v=mQCJJqUfn9Y)
> **Note:**
> We will get a more concrete understanding of API versioning and rate limiting later in Module 3: [Wrap AI Models with APIs](@/ai-system/wrap-ai-with-api/index.md) when we have to implement these strategies ourselves.
## Advanced API Protocols
With these fundamentals in mind, let's explore advanced protocols that enable more sophisticated communication patterns.
### Streaming Protocols
Returning to the example at the beginning, word-by-word streaming is achievable through APIs using streaming protocols. Such protocols are widely supported in conversational AI APIs, since most AI models for conversation are grounded in [next token prediction (NTP) architecture](https://huggingface.co/blog/alonsosilva/nexttokenprediction), and they fit the natural way humans read text. We will take a look at two prominent streaming protocols: Server-Sent Events (SSE) and WebSocket.
#### Server-Sent Events
[Server-Sent Events (SSE)](https://dev.to/debajit13/deep-dive-into-server-sent-events-sse-52) enables a client to receive a continuous stream of data from a server, and is the technique used by most conversational AI services (chatbots) to stream text word-by-word to users. SSE is lightweight and easy to adopt since it is based on the HTTP protocol, but it only supports unidirectional communication from one application to another. SSE starts when a receiver application opens a connection to the sender application, with the sender responding and keeping the connection open. The sender then sends new data through the connection and the receiver automatically receives it.
Below is an example of enabling SSE-based streaming, extending the code in [API Fundamentals](@/ai-system/api-fundamentals/index.md#interact-with-apis-with-python):
```python
import os
import requests
import json
url = "https://api.anthropic.com/v1/messages"
headers = {
"x-api-key": os.getenv("API_KEY"),
"Content-Type": "application/json",
"Accept": "text/event-stream", # Accept SSE format
"User-Agent": "SomeAIApp/1.0",
"anthropic-version": "2023-06-01"
}
json_body = {
"model": "claude-sonnet-4-20250514",
"max_tokens": 2048,
"temperature": 0.7,
"stream": True, # Enable streaming
"messages": [
{
"role": "user",
"content": "Explain the concept of APIs."
}
]
}
try:
response = requests.post(
url,
headers=headers,
json=json_body,
timeout=30,
stream=True # Enable streaming in requests
)
response.raise_for_status()
print("Streaming response:")
for line in response.iter_lines():
if line:
line = line.decode('utf-8')
if line.startswith('data: '):
data = line[6:] # Remove 'data: ' prefix
if data == '[DONE]':
break
try:
event_data = json.loads(data)
# Extract and print the content delta
if 'delta' in event_data and 'text' in event_data['delta']:
print(event_data['delta']['text'], end='', flush=True)
except json.JSONDecodeError:
continue
print("\nStreaming complete!")
except requests.exceptions.RequestException as e:
print(f"Request failed: {e}")
```
The key differences from the regular POST request are:
- `"stream": True` in the request body to enable streaming
- `"Accept": "text/event-stream"` header to specify SSE format
- `stream=True` parameter in `requests.post()` to handle streaming responses
- Using `response.iter_lines()` to process the continuous stream of data
- Parsing the SSE format where each chunk starts with `data: `
See it work in action:
![](streaming-demo.gif)
> **Extended Reading:**
> Take a look at the official documents for streaming messages from [OpenAI](https://platform.openai.com/docs/guides/streaming-responses?api-mode=responses) and [Anthropic](https://docs.anthropic.com/en/docs/build-with-claude/streaming), which provide different approaches towards implementing SSE-based text streaming.
#### WebSocket
You might have played with ChatGPT's [voice mode](https://help.openai.com/en/articles/8400625-voice-mode-faq), where you can talk with ChatGPT and interrupt it, just like calling someone on the phone in the real world. This is unachievable with unidirectional protocols like SSE. Instead, it can be achieved through bidirectional streaming protocols such as WebSocket.
Unlike SSE, which is built on top of HTTP, [WebSocket](https://www.geeksforgeeks.org/web-tech/what-is-web-socket-and-how-it-is-different-from-the-http/) is a communication protocol of its own. For two applications to establish a WebSocket connection, one application first sends a standard HTTP request with upgrade headers, while the other application agrees to upgrade and maintains the connection through the WebSocket lifecycle. To create a WebSocket connection in Python, we can no longer use the `requests` package, since it is built specifically for HTTP. Instead, we use the `websocket-client` package (imported as `websocket`). Below is a basic example of connecting to [OpenAI's real-time API](https://platform.openai.com/docs/guides/realtime?connection-example=ws#connect-with-websockets):
```python
import os
import json
import websocket
OPENAI_API_KEY = os.getenv("API_KEY")
url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview-2024-12-17"
headers = [
"Authorization: Bearer " + OPENAI_API_KEY,
"OpenAI-Beta: realtime=v1"
]
def on_open(ws):
print("Connected to server.")
# Send a request to the API when the connection opens
payload = {
"type": "response.create",
"response": {
"modalities": ["text"],
"instructions": "Say hello!"
}
}
ws.send(json.dumps(payload))
def on_message(ws, message):
data = json.loads(message)
print("Received event:", json.dumps(data, indent=2))
ws = websocket.WebSocketApp(
url,
header=headers,
on_open=on_open,
on_message=on_message
)
ws.run_forever()
```
> **Videos:**
> - [Comparison between SSE and WebSocket](https://www.youtube.com/watch?v=X_DdIXrmWOo&t=102s)
> **Extended Reading:**
> [WebRTC](https://www.tutorialspoint.com/webrtc/webrtc_quick_guide.htm) is another real-time protocol that provides [peer-to-peer connections](https://www.geeksforgeeks.org/computer-networks/what-is-p2p-peer-to-peer-process/) between applications. Compared to WebSocket, which is more suitable for connections between servers or between a server and a client, WebRTC excels at streaming data between clients without relying on server architectures, and is widely used in video calling and live streaming software.
### Message-driven Protocols
While streaming protocols excel at delivering continuous data between applications—similar to two people communicating through phone calls—there are scenarios where data from multiple applications needs to be distributed to multiple other applications, like journalists producing newsletters for a publisher who then delivers them to subscribers. Direct communication between each application would be impractical in such cases. This is where [message-driven protocols](https://www.videosdk.live/developer-hub/websocket/messaging-protocols) come into play. We will introduce MQTT (Message Queuing Telemetry Transport) as a representative message-driven protocol, and take a look at Apache Kafka as a comprehensive message-driven system.
#### MQTT
[MQTT (Message Queuing Telemetry Transport)](https://www.emqx.com/en/blog/the-easiest-guide-to-getting-started-with-mqtt) is a publish-subscribe message protocol designed for resource-constrained devices like low-power computers and smart home devices. It operates on the publish-subscribe (pub-sub) pattern, where publishers send messages on specific topics without knowing who will receive them, while subscribers express interest by subscribing to specific topics. MQTT requires brokers to operate—devices or applications that receive messages from publishers and deliver them to subscribers. MQTT has various applications in IoT (Internet of Things) communications and can be utilized in AI systems where its pub-sub pattern is needed.
To implement MQTT in Python, you can use the `paho-mqtt` library and a public broker such as HiveMQ's at `broker.hivemq.com`. Below is an [example implementation](https://github.com/tigoe/mqtt-examples) of a publisher and a subscriber. Both can be run as multiple instances on multiple devices.
```python
# publisher.py
import paho.mqtt.client as mqtt
broker = 'broker.hivemq.com'
port = 1883
topic = 'demo/ai-systems'
client = mqtt.Client()
client.connect(broker, port)
client.publish(topic, 'This is a very important message!')
client.disconnect()
```
```python
# subscriber.py
import paho.mqtt.client as mqtt
def on_message(client, userdata, message):
print(f"Received: {message.payload.decode()} on topic {message.topic}")
broker = 'broker.hivemq.com'
port = 1883
topic = 'demo/ai-systems'
client = mqtt.Client()
client.connect(broker, port)
client.subscribe(topic)
client.on_message = on_message
client.loop_forever()
```
#### Apache Kafka
Similar to MQTT, [Apache Kafka](https://www.geeksforgeeks.org/apache-kafka/apache-kafka/) also follows the pub-sub pattern to deliver messages. Unlike MQTT, Kafka is a comprehensive computing system that goes beyond a protocol and is capable of handling large amounts of messages with low latency.
Conceptually, Kafka is composed of three types of applications: producers (similar to MQTT's publishers), consumers (similar to MQTT's subscribers), and brokers. Their respective roles are very similar to those in MQTT. As a high-performance system, Kafka is usually built on top of a clustering architecture, where multiple computers work together to avoid system overload and maintain consistent speed even with messages produced at high rates. Due to its performance advantages, it is used in many large-scale IT infrastructures such as Netflix and Uber for streaming and processing real-time events.
Implementing a Kafka system with Python is a bit more involved. Usually you need to run ZooKeeper (Apache's cluster coordination service) and the Kafka brokers separately, since Kafka's Python library `kafka-python` only provides client interfaces to an existing Kafka cluster. Once you have those set up, implementing producers and consumers is similar to implementing publishers and subscribers in MQTT. Below is an example implementation of a producer and a consumer.
```python
# producer.py
import os
from kafka import KafkaProducer
import json
import time
# Create a Kafka producer
producer = KafkaProducer(
bootstrap_servers=f"{os.getenv('KAFKA_ADDRESS')}:9092",
value_serializer=lambda v: json.dumps(v).encode('utf-8')
)
# Produce messages
for i in range(10):
message = {'number': i, 'message': f'Hello Kafka! Message {i}'}
    producer.send('demo-ai-systems', value=message)  # Kafka topic names cannot contain '/'
print(f'Produced: {message}')
time.sleep(1)
# Ensure all messages are sent
producer.flush()
producer.close()
print("All messages sent successfully!")
```
```python
# consumer.py
import os
from kafka import KafkaConsumer
import json
# Create a Kafka consumer
consumer = KafkaConsumer(
    'demo-ai-systems',  # must match the producer's topic name
bootstrap_servers=f"{os.getenv('KAFKA_ADDRESS')}:9092",
auto_offset_reset='earliest',
enable_auto_commit=True,
group_id='demo-consumer-group',
value_deserializer=lambda x: json.loads(x.decode('utf-8'))
)
print("Waiting for messages...")
# Consume messages
for message in consumer:
message_value = message.value
print(f'Consumed: {message_value}')
```
> **Videos:**
> - [MQTT protocol explained](https://www.youtube.com/watch?v=0mlWIuPw34Y)
> - [Kafka basics](https://www.youtube.com/watch?v=uvb00oaa3k8)
### Model Context Protocol
Recent advancements in conversational AI models—large language models (LLMs)—have shown great potential in solving complex tasks. Their utilization is highly dependent on the comprehensiveness of the information they are given and the diversity of actions they can perform. When you interact with LLMs through the conversation APIs we introduced earlier, you can manually feed as much information as possible into the conversation context and instruct LLMs to tell you what to do in natural language. However, this process doesn't align with the philosophy of APIs: it is neither automatic nor reproducible, which means it cannot scale to production-level applications. The [Model Context Protocol (MCP)](https://modelcontextprotocol.io/introduction) addresses this challenge.
MCP was introduced by Anthropic in 2024 and has rapidly become the standard for conversational AI models to integrate with external information sources and tools. Built on [JSON-RPC 2.0](https://www.jsonrpc.org/specification), MCP provides a standardized approach that eliminates the need for custom integrations between every AI system and external service. While similar functionality could be achieved through hardcoded custom interactions using conventional API techniques, MCP's widespread adoption stems from its development simplicity and standardized approach.
MCP's architecture is composed of three types of applications: hosts, servers, and clients. **Hosts** are AI applications that users interact with directly, such as Claude Code and IDEs. These applications contain LLMs that need access to external capabilities. **Servers** are external applications that expose specific capabilities to AI models through standardized interfaces. These might include database connectors, file system access tools, or API integrations with third-party services. **Clients** live within host applications and manage connections between hosts and servers. Each client maintains a dedicated one-to-one connection with a specific server, similar to how we saw individual connections in our previous protocol examples.
![](mcp-architecture.png)
MCP servers can provide three types of capabilities to AI systems: resources, tools, and prompts. **Resources** act like read-only data sources, similar to HTTP `GET` endpoints. They provide contextual information without performing significant computation or causing side effects. For example, a file system resource might provide access to documentation, while a database resource could offer read-only access to customer data. **Tools** are executable functions that AI models can call to perform specific actions. Unlike resources, tools can modify state, perform computations, or interact with external services. Examples include sending emails, creating calendar events, or running data analysis scripts. **Prompts** are pre-defined templates that help AI systems use resources and tools most effectively. They provide structured ways to accomplish common tasks and can be shared across different AI applications.
MCP supports two primary communication methods depending on deployment needs: **stdio (Standard Input/Output)** for local integrations when clients and servers run on the same machine, and **HTTP with SSE** for remote connections—leveraging the same SSE protocol we explored earlier for streaming responses.
Implementing MCP servers and clients with Python is relatively straightforward. Examples of a [weather server](https://github.com/modelcontextprotocol/quickstart-resources/blob/main/weather-server-python/weather.py) and an [MCP client](https://github.com/modelcontextprotocol/quickstart-resources/blob/main/mcp-client-python/client.py) are provided in the official quick start tutorials.
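To give a flavor of what this looks like, below is a minimal server sketch assuming the official `mcp` Python SDK and its `FastMCP` helper; the server name, tool, and resource are made up for illustration.
```python
# server.py - a toy MCP server exposing one tool and one resource over stdio
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("demo-server")  # hypothetical server name

@mcp.tool()
def add(a: int, b: int) -> int:
    """A tool: an executable function the AI model can call."""
    return a + b

@mcp.resource("config://app-info")
def app_info() -> str:
    """A resource: read-only context, similar to an HTTP GET endpoint."""
    return "Demo MCP server for the AI Systems course."

if __name__ == "__main__":
    # stdio transport: suitable when the client and server run on the same machine
    mcp.run(transport="stdio")
```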
> **Videos:**
> - [MCP protocol explained](https://www.youtube.com/watch?v=HyzlYwjoXOQ)
> **Extended Reading:**
> https://modelcontextprotocol.io/specification/ provides complete technical details of MCP, while https://modelcontextprotocol.io/docs/ provides tutorials and documentation for building MCP servers and clients.
>
> There are lots of public MCP servers run by major companies, such as [Zapier](https://zapier.com/mcp) and [Notion](https://www.notion.com/help/notion-mcp). Feel free to take a look at lists of MCP servers:
> - https://github.com/punkpeye/awesome-mcp-servers
> - https://github.com/wong2/awesome-mcp-servers
>
> Should you always use MCP for connecting LLMs with external resources and tools? Maybe not. Take a look at blog posts discussing this topic:
> - https://lucumr.pocoo.org/2025/7/3/tools/
> - https://decodingml.substack.com/p/stop-building-ai-agents
## High-Performance Data Pipelines
Building on these protocol foundations, we now turn to the infrastructure needed to handle large-scale data processing. In production environments, protocols alone might be insufficient for processing massive datasets, potentially creating bottlenecks in AI systems. High-performance data pipelines address this challenge by providing the processing power needed for large-scale data operations. We've already examined one such system (Kafka) above. Here we'll explore two additional systems from Apache: Hadoop and Spark. While Kafka excels at delivering high-throughput messages, Hadoop and Spark are designed to analyze large-scale data with high speed and performance.
### Apache Hadoop
[Hadoop](https://www.geeksforgeeks.org/data-engineering/hadoop-an-introduction/) is a framework for storing and processing large amounts of data in a distributed computing environment (clustering). In essence, it is a collection of open-source software built around the key idea of using a clustering architecture to handle massive amounts of data. Without going deep into its hardware infrastructure, there are two core layers in Hadoop: a storage layer called HDFS, and a computation layer called MapReduce.
**Hadoop Distributed File System (HDFS)** is the architecture for storing large amounts of data in a cluster. It breaks large files into smaller blocks (usually 128 MB or 256 MB) and stores them across multiple machines. Each block is replicated multiple times (typically 3) to ensure fault tolerance—a common clustering practice where a few node failures won't compromise data integrity. It's like buying [three copies of a DVD](https://en.namu.wiki/w/%EC%9D%B4%EC%A6%88%EB%AF%B8%20%EC%BD%94%EB%82%98%ED%83%80#:~:text=I%20need%20at%20least%20three%20copies%20of%20the%20same%20thing.%20First%20of%20all%2C%20one%20sheet%20must%20be%20kept%20in%20a%20special%20case%20for%20permanent%20preservation%2C%20and%20the%20other%20sheet%20should%20be%20taken%20out%20occasionally%20and%20used%20for%20viewing%20purposes.) and storing them in your house and your friend's house so you're unlikely to lose them.
**MapReduce** is the computation layer for efficiently processing large amounts of data in a cluster. Input data is divided into chunks and processed in parallel, with each worker processing a chunk and producing key-value pairs. These key-value pairs are then grouped to generate final results. Think of how big IT companies split a large software project into multiple modules for every employee to work on individually, then merge everyone's work into the final product. A common way to interact with Hadoop systems from Python is writing [MapReduce jobs](https://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/).
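As a sketch of what such a job looks like, here is the classic word-count example written for Hadoop Streaming, which pipes data through the mapper and reducer via standard input and output (the scripts themselves are generic and not tied to any particular cluster setup):
```python
# mapper.py - emits one (word, 1) pair per word, tab-separated
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```
```python
# reducer.py - sums counts per word; Hadoop Streaming feeds it the sorted mapper output
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")
```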
### Apache Spark
While [Spark](https://www.geeksforgeeks.org/dbms/overview-of-apache-spark/) and Hadoop are both designed for large-scale data workloads, they have [distinct architectural approaches and differences in detailed functionalities](https://www.geeksforgeeks.org/cloud-computing/difference-between-hadoop-and-spark/).
To begin, unlike Hadoop with its HDFS, Spark doesn't have its own native file system but can be integrated with external storage systems including HDFS or databases. This makes its implementation and deployment more flexible. Hadoop's efficiency is tied to its HDFS data architecture, whereas Spark's storage efficiency comes primarily from keeping intermediate data in memory rather than on disk, which is usually much faster.
Spark's computation architecture is also different from Hadoop's. There are two key concepts: RDDs (Resilient Distributed Datasets) and the DAG (Directed Acyclic Graph) scheduler. RDDs are essentially immutable collections of data distributed across a cluster of machines, similar to jobs assigned to individual employees that do not conflict with one another. The DAG scheduler is Spark's brain for figuring out how to compute the results, similar to how a management team figures out how to split a big project into multiple jobs. Spark provides built-in APIs for several programming languages, including Python via the [`pyspark`](https://www.datacamp.com/tutorial/pyspark-tutorial-getting-started-with-pyspark) library.
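Below is a minimal `pyspark` sketch of the same word-count idea; the input path is a placeholder for your environment. The transformations only build the DAG, and nothing runs until an action such as `collect()` triggers execution.
```python
# word_count.py - a minimal PySpark job sketch
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCount").getOrCreate()

# An RDD of lines, distributed across the cluster (placeholder path)
lines = spark.sparkContext.textFile("hdfs:///data/sample.txt")

# Transformations only describe the computation; the DAG scheduler plans it
counts = (
    lines.flatMap(lambda line: line.split())
         .map(lambda word: (word, 1))
         .reduceByKey(lambda a, b: a + b)
)

# collect() is an action: it triggers the actual distributed execution
for word, count in counts.collect():
    print(word, count)

spark.stop()
```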
> **Videos:**
> - [Apache Spark basics](https://www.youtube.com/watch?v=IELMSD2kdmk)
> - [Apache Hadoop basics](https://www.youtube.com/watch?v=aReuLtY0YMI)
## Exercise
Upgrade the chatbot program you implemented in [API Fundamentals](@/ai-system/api-fundamentals/index.md) to demonstrate the advanced API concepts covered in this module.
**Exercise: Streaming Chatbot Enhancement**
Upgrade your chatbot from [API Fundamentals](@/ai-system/api-fundamentals/index.md) to implement streaming capabilities:
- **SSE Implementation**: Use Server-Sent Events as demonstrated in the [Server-Sent Events](#server-sent-events) section to receive responses word-by-word instead of waiting for complete responses
- **Stream Processing**: Parse the streaming response format and handle the continuous data flow appropriately, including proper handling of connection termination signals


@@ -0,0 +1,86 @@
+++
title = "C.10-Advanced Deployment Strategies"
date = 2025-11-08
description = ""
+++
> **TL;DR:**
> Learn industry-standard deployment patterns that let you roll out AI system updates safely, test new versions without user impact, and minimize risks when deploying to production.
Imagine you've trained a new version of your AI model that should be faster and more accurate. You're excited to deploy it to production and let your users benefit from the improvements. But what if the new version has an unexpected bug? What if it performs worse on certain types of inputs you didn't test? What if the "faster" model actually uses more memory and crashes under load?
Simply replacing your old system with the new one is risky. In July 2024, a [routine software update from cybersecurity firm CrowdStrike](https://en.wikipedia.org/wiki/2024_CrowdStrike-related_IT_outages) caused widespread system crashes, grounding flights and disrupting hospitals worldwide. While AI deployments might not have such dramatic impacts, pushing an untested model update to all users simultaneously can lead to degraded user experience, complete service outages, or lost trust if users encounter errors.
![CrowdStrike outage](crowdstrike.png)
This is where deployment strategies come in. These are industry-proven patterns that major tech companies use to update their systems safely. They let you roll out updates gradually to minimize impact, test new versions without affecting real users, compare performance between versions, and switch back quickly if something goes wrong.
Throughout this course, we've built up the knowledge to deploy AI systems professionally, from [understanding APIs](/ai-system/api-fundamentals/) to [deploying to the cloud](/ai-system/cloud-deployment/) to [ensuring reliability](/ai-system/high-availability/). These deployment strategies represent the final piece of how companies keep their services running smoothly while continuously improving them.
Let's explore four fundamental deployment patterns that you can use when updating your AI systems.
## The Four Deployment Strategies
### Blue-Green Deployment
[Blue-green deployment](https://octopus.com/devops/software-deployments/blue-green-deployment/) is like having two identical stages at a concert venue. While one stage hosts the live performance, the other sits ready backstage. When it's time to switch acts, you simply rotate the stages. If anything goes wrong with the new act, you can instantly rotate back to the previous one.
In a blue-green deployment, you maintain two identical production environments called "blue" and "green." At any time, only one is live and serving user traffic. When you want to deploy a new version of your AI system, you deploy it to the idle environment, test it thoroughly, and then switch all traffic to that environment in one instant cutover. The switch is typically done by updating your load balancer or DNS settings to point to the new environment.
![Blue-green deployment](blue-green.png)
Suppose your blue environment is currently serving users with version 1.0 of your AI model. You deploy version 2.0 to the green environment and run tests to verify everything works correctly. Once you're confident, you update your load balancer to route all traffic to green. Now green is live and blue sits idle. If users report issues with version 2.0, you can immediately switch traffic back to blue. The entire rollback takes seconds.
The main advantage of blue-green deployment is the instant rollback capability. Because the old version remains fully deployed and ready, you can switch back the moment something goes wrong. This also gives you a clean, predictable deployment process with minimal downtime during the switch.
The downside is cost and resource requirements. You need to maintain two complete production environments, effectively doubling your infrastructure. You also need a sophisticated routing mechanism to handle the traffic switch cleanly. For these reasons, blue-green deployment works best when you need maximum reliability and can afford the infrastructure overhead, or when your deployment is small enough that duplicating it is inexpensive.
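In practice the cutover happens at the load balancer or DNS level, but as a toy in-process illustration of the idea (all names and URLs below are hypothetical):
```python
ENVIRONMENTS = {
    "blue": "http://blue.internal:8000",    # currently live, running v1.0
    "green": "http://green.internal:8000",  # idle, running the new v2.0
}
live = "blue"

def switch_traffic(target: str) -> None:
    """Cut all traffic over to the target environment in one step."""
    global live
    if target not in ENVIRONMENTS:
        raise ValueError(f"Unknown environment: {target}")
    live = target
    print(f"All traffic now routed to {target} ({ENVIRONMENTS[target]})")

switch_traffic("green")  # deployment verified, cut over to the new version
switch_traffic("blue")   # something went wrong, instant rollback
```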
### Canary Deployment
The term "[canary deployment](https://semaphore.io/blog/what-is-canary-deployment)" comes from the old coal mining practice of bringing canary birds into mines. These birds were more sensitive to toxic gases than humans, so if the canary showed distress, miners knew to evacuate. In software deployment, the canary is a small group of users who receive the new version first. If they encounter problems, you know to stop the rollout before it affects everyone.
In a canary deployment, you gradually roll out a new version to an increasing percentage of users. You might start by routing 5% of traffic to the new version while 95% continues using the old version. You monitor the canary group closely for errors, performance issues, or user complaints. If everything looks good, you increase the percentage to 25%, then 50%, then 100%. If problems emerge at any stage, you can halt the rollout and route all traffic back to the old version.
![Canary deployment](canary.png)
Imagine you've deployed a new AI model that you believe is more accurate. You configure your load balancer to send 10% of requests to the new model while the rest go to the old model. Over the next few hours, you monitor response times, error rates, and user feedback from the canary group. The new model performs well, so you increase to 50%. After another day of monitoring shows no issues, you complete the rollout to 100% of users.
The main advantage of canary deployment is risk reduction through gradual exposure. If your new version has bugs or performance issues, only a small fraction of users encounter them. You catch problems early when the blast radius is small, rather than impacting your entire user base at once. This approach also lets you monitor real-world performance under actual production load, which is more reliable than synthetic testing.
The challenge with canary deployment is that it requires good monitoring and metrics to detect problems quickly. You need to track error rates, performance, and user experience for both the canary and stable groups. The gradual rollout also takes time, which might not be suitable if you need to deploy an urgent fix. Additionally, some users will experience the new version while others won't, which can complicate support and debugging if users report different behaviors.
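A minimal sketch of percentage-based routing, with placeholder handlers standing in for the two model versions; real systems usually do this at the load balancer or API gateway rather than in application code:
```python
import random

CANARY_PERCENTAGE = 10  # share of traffic routed to the new version; increase as confidence grows

def handle_with_stable_model(request: str) -> str:
    return f"[v1] response to {request!r}"  # placeholder for the current model

def handle_with_new_model(request: str) -> str:
    return f"[v2] response to {request!r}"  # placeholder for the canary model

def route_request(request: str) -> str:
    """Send a small, adjustable fraction of requests to the canary version."""
    if random.uniform(0, 100) < CANARY_PERCENTAGE:
        return handle_with_new_model(request)
    return handle_with_stable_model(request)

for i in range(5):
    print(route_request(f"query {i}"))
```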
### Shadow Deployment
[Shadow deployment](https://devops.com/what-is-a-shadow-deployment/) is like a dress rehearsal for a theater production. The actors perform the entire show with full lighting, props, and costumes, but the audience seats remain empty. This lets the production team find problems and check timing without any risk to the actual performance. Similarly, shadow deployment runs your new version in production with real traffic, but the results are never shown to users.
In a shadow deployment, you deploy the new version alongside your current production system. Every request that comes to your system gets processed by both versions. Users receive responses only from the stable version, while responses from the new version are logged and analyzed but never used. This lets you test how the new version behaves under real production load and compare its performance to the current version without any user impact.
![Shadow deployment](shadow.png)
Suppose you've built a new AI model and want to check that it produces better results before showing it to users. You deploy it in shadow mode, where every user request gets sent to both the old model and the new model. Users see only the old model's responses. Meanwhile, you collect data comparing response times, resource usage, and output quality between the two models. After a week of shadow testing shows the new model is faster and more accurate, you confidently move it to production.
The main advantage of shadow deployment is zero user risk. Since users never see the new version's output, bugs or poor performance have no impact on user experience. You get to test with real production traffic patterns rather than fake tests, which reveals issues that might not appear in staging environments. This also gives you detailed performance metrics for comparison before making the switch.
The downside is infrastructure cost and complexity. You're running two complete systems and processing every request twice, which doubles your compute costs during the shadow period. You also need advanced infrastructure to copy traffic and collect comparison metrics. Shadow deployment is most useful when you need high confidence before switching to a new version, such as testing a very different AI model architecture or checking performance improvements before a major update.
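Conceptually, a shadow setup duplicates each request to both versions but only ever returns the stable response; below is a simplified sketch with placeholder model functions:
```python
import logging

logging.basicConfig(level=logging.INFO)

def stable_model(request: str) -> str:
    return f"[stable] answer to {request!r}"  # placeholder for the current model

def shadow_model(request: str) -> str:
    return f"[shadow] answer to {request!r}"  # placeholder for the new model

def handle_request(request: str) -> str:
    """Users only ever see the stable response; the shadow result is logged for comparison."""
    stable_response = stable_model(request)
    try:
        shadow_response = shadow_model(request)
        logging.info("shadow comparison: stable=%r shadow=%r", stable_response, shadow_response)
    except Exception:
        # Failures in the shadow version must never affect users
        logging.exception("shadow version failed")
    return stable_response

print(handle_request("What is an API?"))
```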
### A/B Testing
[A/B testing](https://www.enov8.com/blog/a-b-testing-the-good-the-bad/) is like a taste test between two recipes. Instead of asking which recipe people think they'll prefer, you give half your customers recipe A and the other half recipe B, then measure which group comes back more often or spends more money. The data tells you which recipe actually performs better, not just which one sounds better on paper.
In A/B testing deployment, you run two versions of your system side by side and split users between them. Unlike canary deployment where the goal is to gradually roll out a new version safely, A/B testing aims to compare performance between versions to make data-driven decisions. You might run both versions at 50/50 for weeks or months, collecting metrics on user satisfaction, response quality, speed, or business outcomes. The version that performs better according to your chosen metrics becomes the winner.
![A/B testing](ab-testing.png)
Suppose you have two AI models: model A is faster but slightly less accurate, while model B is more accurate but slower. You're not sure which one will provide better user experience. You deploy both models and randomly assign 50% of users to each. Over the next month, you track metrics like user satisfaction ratings, task completion rates, and how often users retry their requests. The data shows that users with model B complete tasks more successfully and rate their experience higher, even though responses take a bit longer. Based on this evidence, you choose model B as the primary model.
The main advantage of A/B testing is that it removes guesswork from deployment decisions. Instead of assuming one version is better, you measure actual user behavior and outcomes. This is especially valuable when you're making tradeoffs, like speed versus accuracy, or when you've made changes that should improve the user experience but you're not certain. The statistical approach gives you confidence that observed differences are real and not just random chance.
The challenge with A/B testing is that it requires careful planning and longer timelines. You need to define what metrics matter, determine how much data you need for reliable results, and run the test long enough to reach statistical significance. You also need infrastructure to split traffic reliably and track metrics for each group separately. Some users will get the worse version during the test period, which is an acceptable tradeoff when you're trying to learn which version is actually better. A/B testing works best when you're comparing similar versions or incremental improvements, not when one version is clearly experimental or risky.
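One common implementation detail is deterministic bucketing, so the same user always lands in the same group across sessions; a small illustrative sketch:
```python
import hashlib

def assign_variant(user_id: str, split: float = 0.5) -> str:
    """Deterministically assign a user to variant A or B.

    Hashing the user ID keeps each user in the same group every time,
    which is important for consistent measurement.
    """
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # map the hash to [0, 1]
    return "model_A" if bucket < split else "model_B"

for user in ["alice", "bob", "carol"]:
    print(user, "->", assign_variant(user))
```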
## Choosing Your Strategy
Which strategy should you use? It depends on your situation. Blue-green works when you need instant rollback and can afford duplicate infrastructure. Canary is the most common choice for routine updates, balancing safety with cost. Shadow deployment gives you zero-risk testing when validating major changes. A/B testing helps when you need data to choose between competing versions.
These strategies often work better together. A common pattern is to start with shadow deployment to validate your new version works correctly, then move to canary deployment for gradual rollout, all within blue-green infrastructure for instant rollback if needed. You might use A/B testing during the canary phase to gather comparison data between versions.
The key is matching the strategy to your needs based on how risky the change is, what infrastructure you can afford, and how much confidence you need before fully committing to the new version.


@@ -0,0 +1,183 @@
+++
title = "B.4-AI Compute Hardware"
date = 2025-09-25
description = ""
+++
> **TL;DR:**
> Modern "AI computers" aren't fundamentally different: they still use the 80-year-old Von Neumann architecture. While CPUs excel at sequential processing, AI workloads need massive parallel computation and high memory bandwidth. This mismatch led to specialized hardware.
Unless you have been living off-grid for the last few years, you are probably tired of hearing about "AI computers" or something similar.
![](ai-pc-1.png)
![](ai-pc-2.png)
![](ai-pc-3.png)
![](ai-pc-4.png)
Despite those vendors trying to convince you that you need a new generation of computers to catch up with the AI hype, not much has fundamentally changed. In the last year of WWII, [John von Neumann](https://en.wikipedia.org/wiki/John_von_Neumann) introduced the [Von Neumann architecture](https://www.geeksforgeeks.org/computer-organization-architecture/computer-organization-von-neumann-architecture/). 80 years later, most computers on Earth are still based on this architecture, including most so-called AI computers.
Admittedly, the capabilities of computer hardware have grown rapidly ever since the architecture was introduced, and that growth is one of the most important motivations and foundations for the development of AI models and systems. But for this course, we will need to start from the basics and take a look at the architecture that started everything.
## Computer Architecture
In 1945, John von Neumann documented what would become the most influential computer architecture design in history: the Von Neumann architecture. This architecture established the foundation that still governs most computers today, from smartphones to supercomputers.
The illustration below shows the Von Neumann architecture. To help you understand the concepts in this architecture, we will use the analogy of a restaurant kitchen. Imagine a busy restaurant kitchen, with orders and recipes (instructions) coming in and ingredients (data) ready to be cooked, chefs (CPU) following the orders and recipes to prepare dishes, a pantry and a counter (memory unit) for storing ingredients and recipes, waiters (input/output devices) bringing in orders and delivering dishes, and corridors (bus) connecting all staff and rooms.
![](von-neumann.png)
### Instruction & Data
For a computer to finish a certain task, it needs two types of information: instructions and data.
[**Instructions**](https://www.geeksforgeeks.org/computer-organization-architecture/computer-organization-basic-computer-instructions/) tell the computer exactly what operations to perform, like recipes in a restaurant. A recipe is usually a step-by-step guide on how to handle the ingredients and cooking tools:
```
1. Cut onion into pieces
2. Heat up pan to medium heat
3. Add 2 tablespoons oil
4. Sauté onions until golden
```
Instructions are likewise a step-by-step specification of how to handle and process data:
```
1. LOAD dkk_price
2. MULTIPLY dkk_price by conversion_factor
3. STORE result in usd_price
4. DISPLAY usd_price
```
The computer also needs [**data**](https://thequickadvisor.com/what-is-the-difference-between-an-instruction-and-data/) itself, representing the information that needs to be processed, like ingredients in a restaurant. For the above recipe, you will need ingredients:
```
- 2 large onions
- Olive oil
```
And the computer will need data:
```
- dkk_price: 599
- conversion_factor: 0.1570
- usd_price: to be calculated
```
### Central Processing Unit (CPU)
This is the brain of the computer, similar to the group of chefs in the restaurant. CPUs can be composed of [a variety of sub-units](https://www.geeksforgeeks.org/computer-science-fundamentals/central-processing-unit-cpu/), especially modern ones, but here we will focus on two essential types: the control unit (CU) and the arithmetic logic unit (ALU).
[**Control unit (CU)**](https://www.geeksforgeeks.org/computer-organization-architecture/introduction-of-control-unit-and-its-design/) is like the executive chef who reads orders and recipes, understands what needs to be done in order to fulfill the orders, and coordinates all the staff and equipment to perform each step of the process. To be more specific, CU is in charge of processes including: retrieving the next instruction from memory, interpreting the instruction's [operation code](https://en.wikipedia.org/wiki/Opcode) and [operands](https://en.wikipedia.org/wiki/Operand#Computer_science), and coordinating the execution by sending signals to other components.
[**Arithmetic logic unit (ALU)**](https://www.learncomputerscienceonline.com/arithmetic-logic-unit/) is like the chefs who do the actual cooking, processing ingredients following the commands from the CU. The ALU typically handles a variety of computational operations including: arithmetic (addition, subtraction, multiplication, division), logical (AND, OR, NOT, XOR), comparison (equal to, greater than, less than), and bit manipulation (shifts, rotations, etc.).
### Memory
[**Memory**](https://www.geeksforgeeks.org/computer-science-fundamentals/computer-memory/) is where both instructions and data are stored, like a comprehensive pantry where both ingredients and recipe books are stored. The memory will also have an address system, similar to the pantry having a unified shelving system so that all staff can more easily access it. More specifically, a memory system will have characteristics including: unified address space (both instructions and data use the same addressing scheme), random access (any memory location can be accessed directly in constant time), and volatile storage (contents are lost when power is removed).
### Input/Output
An [**input/output (I/O) system**](https://www.geeksforgeeks.org/operating-systems/i-o-hardware-in-operating-system/) manages communication between the computer and the external world, similar to waiters in the restaurant who bring in orders and deliver finished dishes. From an abstract standpoint, an I/O system will have I/O controllers for device management and protocol handling, and I/O methods for different types of interactions between I/O devices and the computer. From a physical standpoint, you have your common input devices like keyboard, mouse, trackpad, microphone, and camera, and output devices like monitor, speaker, and printer.
### Bus
A [**bus system**](https://www.geeksforgeeks.org/computer-organization-architecture/what-is-a-computer-bus/) provides the communication pathways across all components in a computer, similar to the corridors in the kitchen for staff to move around, communicate with other staff, access different components, and carry cooking tools, ingredients, and dishes. Such system can be roughly categorized into three sub-systems: address bus (specifies the memory or I/O device location to access), data bus (carries actual data transferred between components), and control bus (carries control signals and coordinates different components).
Another analogy for any of you who have played [Factorio](https://www.factorio.com/) (a factory management/automation game): for scalable production, you will usually also have a bus system connecting storage boxes, I/O endpoints, and the machines actually producing or consuming stuff. Such a system makes it easy to add new sub-systems to existing ones.
![](factorio-bus.png)
### Von Neumann Architecture in Practice
To showcase how this architecture is implemented in the real world, we will use the [Raspberry Pi 5](https://www.raspberrypi.com/products/raspberry-pi-5/), a small yet complete computer, as an example.
![](raspberry-pi.png)
To start, we have the **CPU** in the center-left of the board (labelled *BCM2712 processor* in the figure). It is worth noting that, like most modern CPUs, this one has multiple cores: like multiple chefs working together.
We then have the **memory** (labelled *RAM* in the figure), which you might notice is positioned very close to the CPU. This lowers access latency, similar to how placing the counters closer to the chefs in the kitchen means quicker access to the things they need.
There are also lots of **I/O interfaces** on the board, like the *PCI Express interface* for high-speed peripherals, the *Ethernet and USB connectors*, and the *MIPI DSI/CSI connectors* for connecting cameras. The connection between the Raspberry Pi and the I/O devices is also managed by the *Raspberry Pi RP1 I/O controller*.
And if you look very closely, you can see traces everywhere on the board; these are the physical implementation of the **bus system**. The traces are essentially copper wires connecting all components together.
> **Videos:**
> - [History of John von Neumann](https://www.youtube.com/watch?v=QhBvuW-kCbM)
> - [Computer architecture explained](https://www.youtube.com/playlist?list=PL9vTTBa7QaQOoMfpP3ztvgyQkPWDPfJez)
> - [Computer architecture explained in Minecraft](https://www.youtube.com/watch?v=dV_lf1kyV9M)
## Limitations of Generic Hardware
As we mentioned, most modern computers still fundamentally adhere to the Von Neumann architecture. But there are indeed limitations of generic computing hardware, especially CPUs, for heavy AI workloads. There are two major aspects in which CPUs are not well suited for AI computing.
### [Sequential Processing vs. Parallel Demands](https://www.starburst.io/blog/parallel-vs-sequential-processing/)
CPUs excel at sequential processing, which means they can execute complex instructions one after another. Think of a university professor capable of solving complex math problems: they can handle almost any problem thrown at them, but only one problem at a time. Of course, modern CPUs usually have multiple cores, but the number of cores usually sits around 8 for the consumer tier and 64 for the professional server tier.
On the other hand, AI models (especially neural networks) heavily rely on matrix-related computation. For example, matrix operations account for roughly 45-60% of runtime in the Transformer architecture behind most large language models. These operations usually involve only relatively simple instructions like add and multiply, but each one includes thousands of independent calculations that could happen in parallel. Imagine giving a thousand simple equations to a professor: each equation is trivial for them, but working through all of them still takes a long time. A group of hundreds of primary school students, though each incapable of solving complex equations, will probably deal with the thousand equations faster.
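As a rough, deliberately unfair illustration of this gap (not a benchmark), compare multiplying two matrices one element at a time in pure Python with handing the same product to NumPy, which dispatches the work to optimized, parallel-friendly kernels:
```python
import time
import numpy as np

n = 200
a = np.random.rand(n, n)
b = np.random.rand(n, n)

# "Professor" style: one simple calculation at a time, in pure Python loops
start = time.perf_counter()
c_loop = [[sum(a[i, k] * b[k, j] for k in range(n)) for j in range(n)] for i in range(n)]
print(f"Sequential Python loops: {time.perf_counter() - start:.2f} s")

# "Classroom of students" style: the same matrix product via vectorized kernels
start = time.perf_counter()
c_fast = a @ b
print(f"Vectorized matmul:       {time.perf_counter() - start:.4f} s")
```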
### [Memory Bus Bottleneck](https://medium.com/riselab/ai-and-memory-wall-2cb4265cb0b8)
Remember the bus system connecting different components in a computer? These buses are usually designed to be low latency, especially the bus between CPUs and memory chips. Since CPUs are usually in charge of executing complex instructions which involve fetching and storing data scattered in different locations of the memory, latency is a more important metric for CPUs.
However, as mentioned, AI models heavily rely on large-scale parallel operations on matrices, which are usually stored in relatively contiguous blocks of memory. The typical memory bus's advantage of low latency becomes a disadvantage here, since a low-latency memory bus usually comes with the downside of low bandwidth. In other words, the ability to move a large chunk of data quickly is a more critical metric for most AI models.
## Specialized Hardware
The fundamental mismatch between CPU architecture and AI workloads calls for specialized hardware to speed up AI computing. Essentially, we need hardware that excels at parallel processing and has high-bandwidth memory.
### Graphics Processing Unit (GPU)
The GPU is the representative type of hardware specialized for AI computing. As its name suggests, it was originally designed for processing computer graphics: more specifically, it was designed in the 1980s to accelerate 3D graphics rendering for video games. Rendering a 3D video game involves calculating lighting, shading, and texture mapping, and displaying millions of pixels, using [highly optimized algorithms](https://developer.nvidia.com/gpugems/gpugems3/part-ii-light-and-shadows/chapter-10-parallel-split-shadow-maps-programmable-gpus) that break such calculation into small units composed of simple instructions that can be executed in parallel.
![](gpu-rendering.png)
To compute such algorithms more efficiently, GPUs are designed to excel at parallel processing. While [a modern CPU](https://www.amd.com/en/products/processors/desktops/ryzen/9000-series/amd-ryzen-9-9950x3d.html) usually features fewer than 100 powerful cores, [a modern GPU](https://www.nvidia.com/en-us/geforce/graphics-cards/50-series/rtx-5090/) usually contains thousands of weak cores. Each core can only handle simple instructions, just like a primary school student, but all the cores combined can finish a parallelized task much faster than a CPU.
![](cpu-vs-gpu.png)
The memory on a GPU is also designed around high bandwidth, so that large chunks of data can be accessed quickly. For example, the bandwidth of [DDR memory](https://en.wikipedia.org/wiki/DDR5_SDRAM) for CPUs sits around 50 to 100 GB/s, while the [GDDR memory](https://en.wikipedia.org/wiki/GDDR7_SDRAM) for GPUs can deliver up to 1.5 TB/s, and the [HBM memory](https://en.wikipedia.org/wiki/High_Bandwidth_Memory) designed specifically for AI workloads can deliver up to 2 TB/s.
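To get a feel for why this matters, here is a back-of-the-envelope calculation using the bandwidth figures above. The model size is an assumption purely for illustration:

```python
# Rough time to read a hypothetical 7-billion-parameter model, stored in
# 16-bit precision (2 bytes per parameter), once from memory.
model_bytes = 7e9 * 2  # ~14 GB of weights (an assumption for illustration)

bandwidths_gb_per_s = {"DDR (CPU)": 100, "GDDR7 (GPU)": 1500, "HBM": 2000}
for name, gb_per_s in bandwidths_gb_per_s.items():
    seconds = model_bytes / (gb_per_s * 1e9)
    print(f"{name}: ~{seconds * 1000:.0f} ms per full pass over the weights")
```

Since generating a single token of output requires touching essentially all of the weights, that per-pass time directly bounds how fast the model can respond.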
Interestingly, the need of computer graphics for parallel processing and high bandwidth aligns quite well with AI computing. Thus, the GPU has become the dominant type of specialized hardware for AI workloads in recent years. Sadly, this also means major GPU brands don't give a sh\*t about gamers and general consumers anymore.
![](nvidia-jensen.png)
### Tensor Processing Unit (TPU)
Although GPUs accidentally became a great fit for AI workloads by repurposing computer graphics hardware, as the AI industry rapidly grows, companies are also introducing hardware designed specifically for AI computing.
One example is Google's [TPU](https://cloud.google.com/tpu). The TPU adopts an architecture in which thousands of simple processor cores are arranged in a grid (a design known as a systolic array), and incoming data and instructions flow through the grid like waves: each processor core does a small calculation and passes the result to its neighbors.
![](tpu-architecture.png)
Hardware like the TPU is highly specialized for AI computing, which means it can be more efficient for AI workloads than GPUs, which still need to handle graphics and other general computing tasks. However, this also means it is impractical for almost any other task. Nowadays TPUs are mostly found in data centers, especially Google's own.
### Neural Processing Unit (NPU)
While TPUs target data centers and high-performance computers, AI models are also increasingly integrated into personal computing devices such as PCs and smartphones (regardless of whether we want them or not), and a class of hardware has emerged for those devices that emphasizes power efficiency. The specialized AI computing hardware in such devices is usually the NPU.
As mentioned, the goal of NPUs is to deliver AI computing acceleration while consuming minimal power and physical space. To achieve this, on top of the specialization shared by most AI computing hardware, they are also built around miniaturization: NPUs focus on running pre-trained models rather than training new ones, and they usually use low-precision arithmetic such as 8-bit or even 4-bit instead of full 32-bit precision.
As for the specific architecture, designs differ between companies. For example, Apple calls its NPU the [Neural Engine](https://en.wikipedia.org/wiki/Neural_Engine), integrated into its smartphone chips since the iPhone 8. Qualcomm calls its NPU the [AI Engine](https://www.qualcomm.com/products/technology/processors/ai-engine), which works together with the GPUs in its chips. Nowadays you will also see NPUs integrated into so-called "AI computers", such as [Apple's M4](https://en.wikipedia.org/wiki/Apple_M4#NPU) desktop chips, [AMD's Ryzen AI series](https://www.amd.com/en/partner/articles/ryzen-ai-300-series-processors.html) laptop chips, and Qualcomm's [Snapdragon X Elite](https://www.qualcomm.com/products/mobile/snapdragon/laptops-and-tablets/snapdragon-x-elite) laptop chips.
### Return to Von Neumann Architecture
Despite all the hyped-up specialized hardware for AI computing, most modern computers still fundamentally adhere to the Von Neumann architecture at the system level. Whether a computer integrates GPUs, TPUs, or NPUs, this hardware still connects to the CPU via the bus system, shares the unified memory address space, and is ultimately managed and coordinated by the CPU. The CPU remains the "executive chef" coordinating the system, while specialized processors act like highly skilled sous chefs handling specific tasks.
The Von Neumann architecture's genius lies not in its specific components, but in its modular design that continues to accommodate new types of processing units as computational needs evolve. Just like in [Factorio](https://www.factorio.com/), new assembly lines might need to be built to produce the new products introduced by game updates, but the bus system remains the gold-standard architecture if you want your factory to be scalable and productive.
> **Videos:**
> - [Comparison of computing hardware](https://www.youtube.com/watch?v=r5NQecwZs1A)
## Exercise
**Run an AI model on different types of hardware provided by Google Colab.**
Spin up [Google Colab](https://colab.research.google.com/), an interactive playground (essentially a Jupyter Notebook) for running Python code, which offers different types of hardware (CPU, GPU, and TPU) and enough free computing hours for us to play with.
Have fun tinkering around with an AI model on it and try out different types of hardware:
- Run the image analysis model we used in [Wrap AI Models with APIs](@/ai-system/wrap-ai-with-api/index.md).
- Calculate the theoretical size of the model (hint: can be achieved by calculating the number of parameters in the model).
- Change your runtime to different types of hardware (CPU, GPU, and TPU) and rerun the model.
- Record the time the model needs to process one image and compare the time across different types of hardware (a minimal sketch covering the size calculation and timing follows this list).
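If you get stuck on the size and timing parts, here is a minimal sketch of the idea. It assumes PyTorch and uses a torchvision ResNet-18 as a stand-in for the image analysis model from the earlier module (swap in your actual model), and it only covers the CPU and GPU runtimes; the TPU runtime needs extra setup (e.g., `torch_xla`):

```python
import time
import torch
from torchvision import models

# Stand-in model; replace with the image analysis model from the earlier module.
model = models.resnet18(weights="IMAGENET1K_V1").eval()

# Theoretical size: number of parameters times bytes per parameter (4 for float32).
num_params = sum(p.numel() for p in model.parameters())
print(f"Parameters: {num_params:,} (~{num_params * 4 / 1e6:.1f} MB at 32-bit)")

# Time a single forward pass on the current runtime (CPU or CUDA GPU).
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
image = torch.rand(1, 3, 224, 224, device=device)  # random stand-in for a real image
with torch.no_grad():
    start = time.perf_counter()
    model(image)
    if device == "cuda":
        torch.cuda.synchronize()  # wait for the GPU to finish before reading the clock
    print(f"One forward pass on {device}: {time.perf_counter() - start:.3f} s")
```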

View file

@ -0,0 +1,418 @@
+++
title = "A.1-API Fundamentals"
date = 2025-09-03
description = ""
+++
> **TL;DR:**
> APIs are standardized interfaces that enable applications to communicate across different programming languages and infrastructure, serving as a universal postal system for the digital world.
In [Interact with AI Systems](@/ai-system/interact-with-ai-systems/index.md) we've established that we need a standardized interaction method beyond direct function calls. That method is what we call **Application Programming Interfaces (APIs)**. If applications need to communicate like humans do but face barriers like different programming languages and different deployment infrastructure, APIs are like a universal postal office that knows where everyone lives and how they prefer to receive and send messages.
> **Example:**
> ChatGPT can be accessed through OpenAI's official website, mobile/desktop apps, other AI-based applications (such as Perplexity), Python scripts, or even command line scripts, all through the same family of APIs OpenAI has published.
![API overview](api-overview.png)
## The Three Pillars of APIs
When humans communicate through letters, three pillars are needed: where to send the letters (recipient's address), how to send the letters (postal services and delivery methods), and the letter itself (format and content of the message). Similarly, APIs also need three pillars to work: where to send the message (network fundamentals), how to send the message (HTTP protocol & methods), and a "common knowledge" of how the APIs should be designed and used (standards & design principles).
### Network Fundamentals
Just like you need an address to send a letter, APIs need addresses too. Without going too deep into computer networking, we will focus on three core concepts: IP addresses, domains, and ports.
An **[IP address](https://www.geeksforgeeks.org/computer-science-fundamentals/what-is-an-ip-address/)** is a unique identifier assigned to each device connected to a network, telling applications where to find each other. Think of it as a street address such as *Fredrik Bajers Vej 7K, 9220 Aalborg East, Denmark*. An IPv4 address looks something like `65.108.210.169`.
Technically speaking, APIs can identify themselves solely with IP addresses. The problem is that IP addresses are difficult for humans to read and remember, just like street addresses are usually too long for us to remember. We usually prefer a shorter, semantic-rich name like *Aalborg University*. Similarly, domain names provide this human-friendly alternative. A **[domain](https://www.geeksforgeeks.org/computer-networks/introduction-to-domain-name/)** is also a unique identifier pointing to some network resource and usually has one (or more) corresponding IP address(es). In the ChatGPT example above, `api.openai.com` is the domain name of the API, pointing to IP addresses like `162.159.140.245` and `172.66.0.243`.
Finally, we have ports. Just as some people run several businesses in the same location and have multiple corresponding mailboxes, computers run multiple applications simultaneously. A **[port](https://www.geeksforgeeks.org/computer-networks/what-is-ports-in-networking/)** is used to identify which specific application should receive the incoming message, and each IP address can have up to 65,535 ports. Typically we don't have to specify a port when calling an API, since there are default ports assigned to certain services, protocols, and applications. For example, HTTPS-based APIs usually run on port 443.
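As a small illustration (assuming Python; the addresses you get back will differ depending on where and when you run it), the standard library can show you the IP addresses a domain resolves to, together with the default HTTPS port:

```python
import socket

# Ask the operating system's resolver which IP addresses api.openai.com maps to,
# pairing them with 443, the default HTTPS port.
for family, _, _, _, sockaddr in socket.getaddrinfo("api.openai.com", 443, proto=socket.IPPROTO_TCP):
    print(family.name, sockaddr)  # e.g. AF_INET ('162.159.140.245', 443)
```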
We should also briefly address the [difference between a URL and a domain](https://www.geeksforgeeks.org/computer-networks/difference-between-domain-name-and-url/) here. Think of the domain `api.openai.com` as the building address like *Fredrik Bajers Vej 7K* that usually corresponds to a certain group of hardware resources. The full URL is like an address with floor and room number like *Fredrik Bajers Vej 7K, 3.2.50*, which in the example below specifies the API version (v1) and the specific function (chat completion).
![URL structure](url-structure.png)
> **Videos:**
> - [The OSI model of computer networks](https://www.youtube.com/watch?v=keeqnciDVOo)
> - [IP address explained](https://www.youtube.com/watch?v=7_-qWlvQQtY)
> - [How domains are mapped to IP addresses](https://www.youtube.com/watch?v=mpQZVYPuDGU)
> - [Network ports explained](https://www.youtube.com/watch?v=h5vq9hFROEA)
> - [Understanding URLs](https://www.youtube.com/watch?v=5Jr-_Za5yQM)
> **Extended Reading:**
> If you are interested in concepts in computer networking that we left behind, take a look at these materials:
> - https://www.geeksforgeeks.org/computer-networks/open-systems-interconnection-model-osi/
> - https://www.geeksforgeeks.org/computer-networks/basics-computer-networking/
> - https://learn.microsoft.com/en-us/training/modules/network-fundamentals/
### HTTP Protocol & Methods
To send a letter in the real world, you first have to choose from available postal services, which you will probably choose based on price, delivery time, previous experiences, etc. For APIs, you usually won't spend time choosing postal services (transfer protocols) since they are largely standardized, and that one standard protocol used in most APIs is called **[HTTP (HyperText Transfer Protocol)](https://www.geeksforgeeks.org/html/what-is-http/)**.
What you do have to choose is the **HTTP method**, similar to how a postal service usually offers multiple delivery methods. Two methods you will frequently encounter when using AI service APIs are `GET` and `POST`. `GET` means the API call wants to retrieve information; for example, you can check OpenAI's available AI models by sending a `GET` request to `https://api.openai.com/v1/models`. `POST` is for sending data and expecting a response, and it will be the primary method we use to send data to AI services and retrieve their responses.
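As a quick, hedged sketch of that `GET` call using `curl` (it assumes you have an OpenAI API key exported as the `OPENAI_API_KEY` environment variable; the authentication header is explained in the next section):

```bash
curl https://api.openai.com/v1/models \
  -H "Authorization: Bearer $OPENAI_API_KEY"
```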
#### HTTP Request
Besides providing multiple methods, HTTP as a postal service for APIs also standardizes how each envelope is addressed, in the form of several [HTTP request components](https://proxyelite.info/understanding-http-requests-what-are-they-made-of/): request line, headers, and body.
The **request line** will be something like this:
```
POST https://api.openai.com/v1/chat/completions HTTP/1.1
```
This contains the method, the URL stating where to send the request, and the protocol version.
The **[headers](https://en.wikipedia.org/wiki/List_of_HTTP_header_fields)** are like the information you write on the envelope, and will be something like this:
```
Authorization: Bearer sk-abc1234567890qwerty
Content-Type: application/json
Accept: application/json
User-Agent: SomeAIApp/1.0
```
Here, `Authorization` is for identifying the user and protecting the API and is usually where we specify our API keys. `Content-Type` and `Accept` specify the format of data we're sending and the expected response, respectively. `User-Agent` identifies the type of application or client we are using to interact with the API.
> **Extended Reading:**
> Just the `Authorization` header alone could cost us a few modules if we were to explore all types of authorization. For now, just think of it as a place to enter our API keys. We will dive deeper into this topic when we implement our own API server in Module 3: [Wrap AI Models with APIs](@/ai-system/wrap-ai-with-api/index.md), and if you are curious, here are some materials that you can look into:
> - https://apidog.com/blog/http-authorization-header/
> - https://swagger.io/docs/specification/v3_0/authentication/bearer-authentication/
> - https://auth0.com/intro-to-iam/what-is-oauth-2
For the `GET` method, the request line and headers (or sometimes just the request line) are enough. For the `POST` method, since we are sending data, we also need the **body**, which is the content of the letter itself. As you may have noticed, the headers state that the format of the body will be `application/json`, which means our body will look like this:
```json
{
"model": "gpt-4",
"messages": [
{"role": "user", "content": "Write a haiku about APIs"}
],
"temperature": 0.7,
"max_tokens": 50
}
```
The format of this JSON object is specified by the provider of the APIs. There are other [content types](https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Headers/Content-Type) that might be more suitable for certain types of data. Generally speaking, JSON is the most popular one since it's machine-parseable and human-friendly.
#### HTTP Response
Now you've sent the envelope (HTTP request) through the postal service called HTTP. The recipient will send a response letter back to you (HTTP response) if everything is working correctly, and if not, the postal service will at least write back telling you what's wrong. Akin to the HTTP request, an [HTTP response](https://www.tutorialspoint.com/http/http_responses.htm) is composed of a few components: status line, response headers, and response body.
The **status line** looks like this:
```
HTTP/1.1 200 OK
```
It is composed of the HTTP protocol version, a status code, and a reason phrase. Both the status code and the reason phrase provide immediate information about how your request went, and they correspond one-to-one.
The **response headers** are like the headers in the request, providing metadata about the response. They might look something like this:
```
Content-Type: application/json
Content-Length: 1247
```
The types of headers included in a response depend on the design of the API service and are largely relevant to the purpose of the API. For example, ChatGPT's API provides information about the AI model and your current usage in its response headers.
The **response body** is similar to the body in the request, containing the data the API provider sends back to you. A response body from the ChatGPT API with JSON format will look like this:
```json
{
"id": "chatcmpl-6pHh8Cw1ZKcO45PiAavgbhZMz3YRs",
"object": "chat.completion",
"created": 1677649420,
"model": "gpt-3.5-turbo-0613",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Hello! How can I help you today?"
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 12,
"completion_tokens": 13,
"total_tokens": 25
}
}
```
Again, the format of this JSON object is specific to API providers and functions you requested.
> **Videos:**
> - [HTTP explained](https://www.youtube.com/watch?v=KvGi-UDfy00)
> - [HTTP request explained](https://www.youtube.com/watch?v=DBhEFG7zjFU)
> **Note:**
> You might have noticed that we've been saying HTTP protocol throughout the above section, but the URLs we are calling start with HTTPS. HTTPS is an extension of HTTP that additionally encrypts messages. Think of it as writing letters in a way that only you and the recipient can understand. Nowadays, almost all public APIs use HTTPS and most software blocks all non-secure HTTP communications. We will come back to HTTP and HTTPS when we are deploying our own APIs in Module 6: [Cloud Deployment](@/ai-system/cloud-deployment/index.md).
### Standards & Design Principles
In communication, beyond mandatory rules (e.g., a shared language) we also rely on "common knowledge": for example, how an address is written (street and building number, then post code and city/area, finally country) and how a letter is structured (greetings and sign-offs). You can technically refuse to adhere to such common knowledge, but it might lead to miscommunication and confusion, or you will need to attach a document explaining how and why you do things differently. Similarly, when working with APIs, there are standards and design principles that are not mandatory but make APIs more predictable and intuitive, reducing the need for users and developers to extensively study the API documentation.
We'll briefly touch on one of the more prominent and widely adopted standards: **[REST (Representational State Transfer)](https://amplication.com/blog/rest-apis-what-why-and-how)**. Core REST principles include uniform interface, statelessness, cacheability, and layered system.
**Uniform interface** ensures all interactions between applications follow a consistent pattern, for example, making the formulation of URLs intuitive and HTTP methods consistent. API URLs that follow this principle include:
```
GET /v1/models # Get all models
GET /v1/models/gpt-4 # Get specific model
POST /v1/chat/completions # Create a chat completion
GET /v1/files # List uploaded files
POST /v1/files # Upload a new file
```
And bad examples include:
```
POST /getModels # Action in URL
GET /model?action=delete&id=123 # Action as parameter
POST /api?method=chat # Generic endpoint
```
**Statelessness** requires that each request contain all information necessary to understand and process the request. One example is that OpenAI's chat completion API always requires the full chat history to be provided in the body:
```json
{
"model": "gpt-4",
"messages": [
{"role": "user", "content": "Hello"},
{"role": "assistant", "content": "Hi there!"},
{"role": "user", "content": "How are you?"}
]
}
```
**Cacheability** means HTTP responses should clearly define themselves as cacheable or non-cacheable. This can make the communication and computation of applications more efficient. Especially for AI APIs, frequently requested AI outputs can be flagged as cacheable so they don't need to be recalculated.
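For example, a response that is safe to reuse for an hour might carry headers like these (the values are illustrative):

```
Cache-Control: max-age=3600
ETag: "abc123"
```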
**Layered system** allows the architecture to be composed of multiple hierarchical layers, where each layer has specific roles and cannot see beyond the immediate layer it's communicating with. Typical AI APIs will include authentication layers for security, caching layers for reuse of frequently accessed AI results, and rate limiting layers to prevent abuse.
> **Videos:**
> - [What is a REST API?](https://www.youtube.com/watch?v=lsMQRaeKNDk)
> **Extended Reading:**
> If you want to use a more SQL query-like API interaction method, where you explicitly define the type and scope of data you want and receive exactly that, consider GraphQL:
> - https://graphql.org/learn/
## Interact with APIs in Practice
Now that we've established the basic concepts related to APIs, let's look at how to interact with them in practice.
### API Testing Tools
Before we integrate API interactions into our applications, we can play around with API testing tools to get a better idea of how the APIs behave. These tools will also come in handy when we implement our own APIs and want to test them ourselves before publishing them to the public.
[Postman](https://www.postman.com/) is a popular API testing tool. To send an API request with Postman, fill in the components of an [HTTP Request](#http-request) into its interface:
![Postman request](postman-request.png)
Click send, and after a while you should be able to see the response with components of an [HTTP Response](#http-response):
![Postman response](postman-response.png)
Feel free to explore other functionalities of Postman yourself. Apart from being able to send API requests in a graphical user interface, you can also form a collection of requests for reuse and structured testing. Postman also comes with collaboration tools that can come in handy when developing in a team. Alternatives to Postman include [Hoppscotch](https://hoppscotch.io/) and [Insomnia](https://insomnia.rest/), [among others](https://apisyouwonthate.com/blog/http-clients-alternatives-to-postman/), all with similar core functionalities.
### Interact with APIs with Python
To interact with APIs in a Python program, a universal method is to use the [`requests` package](https://docs.python-requests.org/en/latest/index.html). It is not a built-in package and you will have to install it with a package manager of your choice.
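For example, with pip:

```bash
pip install requests
```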
#### Sending `GET` Request
Below is an example of sending a `GET` request:
```python
import os
import requests
url = "https://api.anthropic.com/v1/messages"
headers = {
"x-api-key": os.getenv("API_KEY"),
"Content-Type": "application/json",
"Accept": "application/json",
"User-Agent": "SomeAIApp/1.0",
"anthropic-version": "2023-06-01"
}
try:
response = requests.get(url, headers=headers)
print(f"Status Code: {response.status_code}")
print(f"Response Headers: {response.headers}")
print(f"Response Body: {response.text}")
except requests.exceptions.RequestException as e:
print(f"GET request failed: {e}")
```
Let's break down each part and see how it connects to the concepts covered earlier.
The URL:
```python
url = "https://api.anthropic.com/v1/messages"
```
This maps directly to [Network Fundamentals](#network-fundamentals). The domain `api.anthropic.com` identifies where the API server is located, and the path `/v1/messages` specifies which specific endpoint (function) we want to access. This is like addressing a letter to a specific department in a building.
The headers:
```python
headers = {
"x-api-key": os.getenv("API_KEY"),
"Content-Type": "application/json",
"Accept": "application/json",
"User-Agent": "SomeAIApp/1.0",
"anthropic-version": "2023-06-01"
}
```
These are the [HTTP Request](#http-request) headers, metadata about the request. The `x-api-key` handles authorization (proving who you are), `Content-Type` and `Accept` specify we're working with JSON format, `User-Agent` identifies our application, and `anthropic-version` specifies the API version. Note the security best practice: using `os.getenv("API_KEY")` to retrieve the API key from environment variables rather than hardcoding it in your code.
The response:
```python
response = requests.get(url, headers=headers)
print(f"Status Code: {response.status_code}")
print(f"Response Headers: {response.headers}")
print(f"Response Body: {response.text}")
```
The response object contains all three components of an [HTTP Response](#http-response): `response.status_code` gives us the status code (e.g., 200 means success), `response.headers` provides the response headers with metadata about the response, and `response.text` contains the response body with the actual data the API returned.
#### Sending `POST` Request
And an example of sending a `POST` request:
```python
import os
import requests
import json
url = "https://api.anthropic.com/v1/messages"
headers = {
"x-api-key": os.getenv("API_KEY"),
"Content-Type": "application/json",
"Accept": "application/json",
"User-Agent": "SomeAIApp/1.0",
"anthropic-version": "2023-06-01"
}
json_body = {
"model": "claude-sonnet-4-20250514",
"max_tokens": 2048,
"temperature": 0.7,
"messages": [
{
"role": "user",
"content": "Explain the concept of APIs."
}
]
}
try:
response = requests.post(
url,
headers=headers,
json=json_body,
timeout=30 # 30 second timeout
)
response.raise_for_status() # Raises HTTPError for bad responses
result = response.json()
print("Success!")
print(f"Content: {result.get('content', [{}])[0].get('text', 'No content')}")
except requests.exceptions.Timeout:
print("Request timed out")
except requests.exceptions.HTTPError as e:
print(f"HTTP error occurred: {e}")
print(f"Response content: {response.text}")
except requests.exceptions.RequestException as e:
print(f"Request failed: {e}")
except json.JSONDecodeError:
print("Failed to decode JSON response")
```
Let's break down each part to understand how `POST` requests differ from `GET` requests and how they map to HTTP concepts.
The request body:
```python
json_body = {
"model": "claude-sonnet-4-20250514",
"max_tokens": 2048,
"temperature": 0.7,
"messages": [
{
"role": "user",
"content": "Explain the concept of APIs."
}
]
}
```
This is the [HTTP Request](#http-request) body, the actual data we're sending to the API. Unlike `GET` requests which only have headers, `POST` requests include a body with the information needed to process the request. Notice how we include the full `messages` array, following the [statelessness principle](#standards-design-principles) where each request contains all necessary information.
Sending the request:
```python
response = requests.post(
url,
headers=headers,
json=json_body,
timeout=30
)
```
The `requests.post()` function combines all the [HTTP Request](#http-request) components: the URL specifies where to send it, `headers` provides the metadata, and `json=json_body` automatically converts our Python dictionary to JSON format and sets it as the request body. The `timeout` parameter ensures we don't wait forever if something goes wrong.
Response handling and error management:
```python
response.raise_for_status()
result = response.json()
print(f"Content: {result.get('content', [{}])[0].get('text', 'No content')}")
```
The `raise_for_status()` method checks the [HTTP Response](#http-response) status code and raises an exception for error codes (4xx or 5xx). The `response.json()` parses the response body from JSON format into a Python dictionary, making it easy to extract specific fields.
Different error types:
```python
except requests.exceptions.Timeout:
print("Request timed out")
except requests.exceptions.HTTPError as e:
print(f"HTTP error occurred: {e}")
except requests.exceptions.RequestException as e:
print(f"Request failed: {e}")
except json.JSONDecodeError:
print("Failed to decode JSON response")
```
Different exceptions handle different failure scenarios: `Timeout` for when requests take too long, `HTTPError` for bad [HTTP Response](#http-response) status codes (caught by `raise_for_status()`), `RequestException` for general network problems, and `JSONDecodeError` for malformed response bodies. This demonstrates robust error handling practices for API interactions.
> **Videos:**
> - [Python `requests` library tutorial](https://www.youtube.com/watch?v=tb8gHvYlCFs)
> **Extended Reading:**
> To get started with AI APIs, you'll need to register accounts, obtain API keys, and familiarize yourself with provider documentation. Here are the two major AI API platforms to explore:
>
> - [OpenAI platform](https://platform.openai.com/welcome)
> - [Anthropic developer console](https://console.anthropic.com/)
>
> And their documentation:
> - [OpenAI API Documentation](https://platform.openai.com/docs/overview)
> - [Anthropic API Documentation](https://docs.anthropic.com/en/api/overview)
## Exercise
Build two Python programs that demonstrate the API fundamentals covered in this module.
**Exercise 1: Command-line Chatbot**
Build a chatbot using an AI API of your choice (e.g., OpenAI, Anthropic, or others) that takes user input from the command line and displays the response (a bare-bones starting point is sketched after the requirement list below). It should demonstrate:
- **HTTP Request Components**: Properly implement headers including Authorization, Content-Type, and User-Agent as shown in the [HTTP Request](#http-request) section
- **REST Statelessness**: Follow the statelessness principle by including full conversation history in each request
- **HTTP Status Code Handling**: Handle different status codes with user-friendly messages referencing the [HTTP Response](#http-response) section
- **Response Processing**: Parse and display relevant response (content, usage tokens, model information)
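Here is a bare-bones starting point, assuming the Anthropic messages API shown earlier (adapt the URL, headers, and body format for your chosen provider). It deliberately omits the friendly status-code handling and response details the exercise asks you to add:

```python
import os
import requests

URL = "https://api.anthropic.com/v1/messages"
HEADERS = {
    "x-api-key": os.getenv("API_KEY"),
    "Content-Type": "application/json",
    "anthropic-version": "2023-06-01",
}

messages = []  # the full history is re-sent every turn (statelessness principle)
while True:
    user_input = input("You: ")
    if user_input.lower() in {"quit", "exit"}:
        break
    messages.append({"role": "user", "content": user_input})
    body = {
        "model": "claude-sonnet-4-20250514",
        "max_tokens": 1024,
        "messages": messages,
    }
    response = requests.post(URL, headers=HEADERS, json=body, timeout=30)
    response.raise_for_status()  # replace with friendlier status-code handling
    reply = response.json()["content"][0]["text"]
    messages.append({"role": "assistant", "content": reply})
    print(f"Assistant: {reply}")
```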
**Exercise 2: Image Analysis Tool**
Build a command-line tool that analyzes image content using a multi-modal AI API of your choice. It should take a file path from the command line and display the analysis result (a description of the image, class labels, or similar). It should demonstrate the same points as above, plus:
- **Content-Type Handling**: Choose a proper input format for images (file upload, base64 encoding, or URL references); the base64 option is sketched below
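If you go the base64 route, the encoding itself is a few lines of standard library code. The file name and the dictionary field names here are illustrative; follow your provider's documented request format:

```python
import base64

# Read the image file and encode it as a base64 string suitable for a JSON body.
with open("photo.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

# Illustrative structure only; each provider defines its own field names.
image_block = {"type": "image", "media_type": "image/jpeg", "data": image_b64}
```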
**Implementation Tips:**
Both programs should demonstrate robust practices:
- **Security**: Follow Authorization header best practices from the [HTTP Request](#http-request) section by using environment variables for API keys instead of hardcoding them
- **Transparency**: Implement using the `requests` package following the request-response patterns shown in the [Interact with APIs with Python](#interact-with-apis-with-python) section, which provides better understanding of HTTP fundamentals than provider-specific SDKs

View file

@ -0,0 +1,860 @@
+++
title = "B.6-Cloud Deployment"
date = 2025-10-09
description = ""
+++
> **TL;DR:**
> The "cloud" is just computers in data centers you can rent remotely. Learn how to deploy your containerized AI API to cloud VMs, covering remote access, Docker installation, and production-ready HTTPS setup.
After [AI compute hardware](@/ai-system/ai-compute-hardware/index.md) and [Packaging & containerization](@/ai-system/packaging-containerization/index.md), we now have the confidence that we can deploy our AI system to a computer other than our own PC. Beyond the fact that we can, there are also good reasons why we might want to. AI systems, especially API servers like the one we implemented earlier, usually need to run 24/7, which is not something you should rely on your own PC for. Running AI systems also consumes a lot of computational resources, which in turn makes computers produce heat and noise, something you probably don't enjoy at home.
In this module we will learn how to deploy our system to the ["cloud"](https://en.wikipedia.org/wiki/Cloud_computing). This is probably a buzzword you have heard for quite a while, only recently overtaken by the "AI" hype. You will learn that cloud deployment has nothing to do with clouds in the sky: cloud infrastructure is essentially composed of computers inside data centers that are set up to be accessed remotely, so cloud deployment (in most cases) comes down to deploying onto a remote computer.
## Cloud Infrastructure
### What is "the Cloud"?
When we talk about "the cloud," we're really talking about computers in [data centers](https://en.wikipedia.org/wiki/Data_center) that you can access over the internet. The term comes from old network diagrams where engineers would draw a cloud shape to represent "the internet" or any network whose internal details weren't important at that moment. Over time, this symbol became associated with computing resources accessed remotely.
![Cloud diagram](cloud-diagram.png)
Cloud infrastructure emerged from a practical problem. Companies like Amazon and Google built massive computing facilities to handle peak loads (holiday shopping spikes, search traffic surges), but these expensive resources sat mostly idle during normal times. They realized they could rent out this spare capacity to others, and the modern cloud industry was born. What started as monetizing excess capacity evolved into a fundamental shift in how we provision computing resources.
The key technical innovation that made cloud practical is [virtualization](https://en.wikipedia.org/wiki/Virtualization). This technology allows one physical machine to be divided into many isolated virtual machines, each acting like a separate computer with its own operating system. A single powerful server might run dozens of virtual machines for different customers simultaneously. This sharing model dramatically improved efficiency, since physical servers could be fully utilized rather than sitting idle.
![Virtualization](virtualization.png)
You might recall from [Packaging & containerization](@/ai-system/packaging-containerization/index.md) that containers also provide isolation, but they work at a different level. Virtual machines virtualize the entire hardware, giving each VM its own complete operating system. Containers, in contrast, share the host's operating system kernel and only isolate the application and its dependencies. This makes VMs heavier but more isolated, suitable for running entirely different operating systems or providing stronger security boundaries. Containers are lighter and faster, ideal for packaging applications. In practice, cloud infrastructure often uses both: VMs to divide physical servers among customers, and containers running inside those VMs to package and deploy applications.
Cloud infrastructure is built in three layers. The **physical layer** forms the foundation: thousands of servers organized in racks inside data centers, connected by high-speed networks, with massive storage arrays and redundant power and cooling systems. The **virtualization layer** sits on top, where [hypervisors](https://en.wikipedia.org/wiki/Hypervisor) create and manage virtual machines, allocating slices of physical resources while ensuring isolation between customers. The **management layer** ties everything together with APIs for programmatic control, orchestration systems for resource allocation, monitoring tools for health tracking, and billing systems that measure usage.
Together, these layers transform a pile of hardware into a self-service platform where you can spin up a server in seconds with a few clicks or API calls.
> **Videos:**
> - [Cloud computing introduction](https://www.youtube.com/watch?v=N0SYCyS2xZA)
> - [Virtual machines vs. containers](https://www.youtube.com/watch?v=eyNBf1sqdBQ)
> **Extended Reading:**
> To dive deeper into cloud infrastructure architecture:
> - [What is a Data Center?](https://aws.amazon.com/what-is/data-center/) from AWS explains the physical infrastructure
> - [Understanding Hypervisors](https://www.redhat.com/en/topics/virtualization/what-is-a-hypervisor) from Red Hat covers virtualization technology in detail
> - [What is Cloud Computing?](https://aws.amazon.com/what-is-cloud-computing/) from AWS provides a comprehensive overview
### Major Cloud Providers
Three companies dominate the cloud infrastructure market, each with distinct strengths.
**[Amazon Web Services (AWS)](https://aws.amazon.com/)** is the market leader, launched in 2006 when Amazon started renting out its excess computing capacity. AWS offers the most comprehensive service catalog with over 200 services covering everything from basic compute to specialized AI tools. This breadth makes AWS powerful but can also be overwhelming for beginners. The platform is known for its maturity, global reach with data centers in dozens of regions, and extensive documentation. Most enterprise companies use AWS in some capacity.
**[Google Cloud Platform (GCP)](https://cloud.google.com/)** entered the market later but brought Google's expertise in handling massive scale. GCP excels in data analytics and AI/ML services, offering tools like BigQuery for data warehousing and Vertex AI for machine learning. The platform tends to be more developer-friendly with cleaner interfaces and better default configurations. For AI system deployment, GCP's strengths in machine learning infrastructure and competitive GPU pricing make it particularly attractive.
**[Microsoft Azure](https://azure.microsoft.com/)** holds strong appeal for enterprises already using Microsoft products. Azure integrates seamlessly with Windows Server, Active Directory, and Office 365. This makes it the natural choice for organizations with existing Microsoft infrastructure. Azure has grown rapidly and now rivals AWS in service offerings, with particular strength in hybrid cloud scenarios where companies need to connect on-premises systems with cloud resources.
> **Extended Reading:**
> Beyond the "big three," many alternatives exist for different needs:
>
> **Affordable and Simple**: [DigitalOcean](https://www.digitalocean.com/) and [Linode](https://www.linode.com/) offer straightforward interfaces and competitive pricing, ideal for startups and smaller projects.
>
> **GPU-Focused for AI**: [Lambda Labs](https://lambdalabs.com/) and [CoreWeave](https://www.coreweave.com/) specialize in providing cost-effective GPU instances optimized for machine learning workloads.
>
> **European Providers**: For those prioritizing data sovereignty and GDPR compliance, European providers offer compelling alternatives. [Hetzner](https://www.hetzner.com/) (Germany) is known for exceptional price-performance ratios with data centers across Europe. [OVHcloud](https://www.ovhcloud.com/) (France) operates one of Europe's largest cloud infrastructures. [Scaleway](https://www.scaleway.com/) (France) positions itself as a European alternative with strong AI capabilities. These providers often cost significantly less than US hyperscalers while keeping data within EU jurisdiction.
### Common Cloud Services
Cloud providers offer various service types, each with different tradeoffs between control, convenience, and cost. Understanding these options helps you choose the right approach for deploying your AI systems.
**Virtual Machines** provide dedicated computing instances that behave like traditional servers. You get full control over the operating system and can install whatever software you need. This familiarity makes VMs approachable if you're comfortable with traditional server management. However, you're responsible for all maintenance, security patches, and configuration. You also pay for the VM whether it's actively processing requests or sitting idle. Examples include [EC2](https://aws.amazon.com/ec2/) on AWS, [Compute Engine](https://cloud.google.com/compute) on GCP, and [Virtual Machines](https://azure.microsoft.com/en-us/products/virtual-machines) on Azure.
**Container Services** let you run the Docker containers we learned about in [Packaging & containerization](@/ai-system/packaging-containerization/index.md). The cloud provider manages the underlying infrastructure while you focus on your containerized applications. Many container services offer automatic scaling, spinning up more containers when traffic increases and shutting them down when traffic drops. This means you only pay for actual usage. The learning curve can be steeper than VMs, and debugging containerized applications in production requires different skills. Examples include [ECS/EKS](https://aws.amazon.com/containers/) on AWS, [Cloud Run](https://cloud.google.com/run) on GCP, and [Container Instances](https://azure.microsoft.com/en-us/products/container-instances) on Azure.
**GPU Instances** are virtual machines with attached graphics processing units, essential for training large AI models or running inference on complex models. Without buying expensive hardware upfront, you get access to cutting-edge GPUs. The downside is cost. GPU instances can run hundreds of dollars per day, and during peak times (like when new AI research creates demand), they may be unavailable. Examples include [P-series and G-series instances](https://aws.amazon.com/ec2/instance-types/) on AWS, [A2 and G2 instances](https://cloud.google.com/compute/docs/gpus) on GCP, and [NC and ND-series](https://azure.microsoft.com/en-us/pricing/details/virtual-machines/series/) on Azure.
**Managed AI Services** provide pre-configured platforms specifically for deploying machine learning models. These services handle infrastructure scaling, model versioning, monitoring, and often include tools for A/B testing different model versions. They're the easiest way to deploy AI systems, requiring minimal DevOps knowledge. The tradeoff is less flexibility and potential vendor lock-in, as these platforms often use proprietary APIs. Examples include [SageMaker](https://aws.amazon.com/sagemaker/) on AWS, [Vertex AI](https://cloud.google.com/vertex-ai) on GCP, and [Azure Machine Learning](https://azure.microsoft.com/en-us/products/machine-learning) on Azure.
**Object Storage** provides scalable storage for large datasets, model files, and other unstructured data. Unlike traditional file systems, object storage is designed for durability and massive scale. Files are typically replicated across multiple data centers, making data loss extremely unlikely. Storage costs are remarkably cheap, often a few cents per gigabyte per month. However, object storage isn't designed for real-time access. Operations have higher latency than local disks, making it suitable for storing training data and model weights but not for serving predictions. Examples include [S3](https://aws.amazon.com/s3/) on AWS, [Cloud Storage](https://cloud.google.com/storage) on GCP, and [Blob Storage](https://azure.microsoft.com/en-us/products/storage/blobs) on Azure.
> **Extended Reading:**
> Beyond compute and storage services, cloud providers offer specialized services for different use cases:
>
> [**Serverless computing**](https://aws.amazon.com/serverless/) (like AWS Lambda or Google Cloud Functions) runs your code in response to events without any server management. You write individual functions and pay only for execution time measured in milliseconds. This is fundamentally different from VMs or containers where you manage long-running processes.
>
> [**Managed databases**](https://aws.amazon.com/products/databases/) (like RDS, Cloud SQL, or Cosmos DB) handle database administration automatically. Unlike object storage which stores files, these provide structured data storage with queries, transactions, and relational integrity.
>
> [**Content delivery networks**](https://aws.amazon.com/cloudfront/) (like CloudFront or Cloud CDN) cache and serve your content from servers distributed worldwide. Rather than running your application, they focus on delivering static assets (images, videos, model outputs) with minimal latency to users anywhere.
### Choosing the Right Service
When selecting a cloud service for your AI system, you need to balance several considerations.
Your technical expertise matters because some services require deep knowledge of server management while others abstract that away. Think about whether you're comfortable SSHing into a server, managing operating system updates, and debugging infrastructure issues, or whether you'd prefer to focus purely on your application code.
Your scaling requirements also play a role. If you expect steady, predictable traffic, a simple always-on server works fine. But if traffic fluctuates dramatically (say, high during business hours and nearly zero at night), you might benefit from services that scale automatically.
Budget is obviously important, but it's not just about the total amount you can spend. Consider whether you need predictable monthly costs for planning purposes, or whether you're comfortable with variable bills that depend on actual usage.
The level of control you need over the infrastructure influences this decision too. Some applications require specific system configurations, custom networking setups, or particular security arrangements that only low-level services like VMs can provide.
Finally, your timeline matters. How quickly do you need to get your system running? Some services let you deploy in hours, while others require days or weeks of setup and learning.
While managed services and usage-based pricing sound appealing with their promises of convenience and "only pay for what you use," there are significant benefits to starting with simpler services that offer strong control and transparent pricing.
**Virtual machines with fixed pricing provide cost predictability.** When you rent a VM at a fixed monthly rate, you know exactly what you'll pay. There are no surprises. You can run your application, make mistakes during development, and experiment freely without worrying about an unexpected bill at month's end. This predictability is particularly valuable when you're learning or running services with steady traffic patterns.
**Direct control means you understand what's happening.** With a VM, you manage the operating system, install software, and configure everything yourself. While this requires more work upfront, it builds your understanding of how systems actually work. You can troubleshoot issues by logging into the server, checking processes, and examining logs directly. This transparency makes debugging much simpler compared to managed services where problems might be hidden behind abstraction layers.
**Beware of usage-based pricing pitfalls.** The cloud industry has numerous horror stories of unexpected bills. In 2024, a developer woke up to find a [\$104,500 bill from Netlify](https://serverlesshorrors.com/) for a simple documentation site. Another case saw [Cloudflare demanding \$120,000 within 24 hours](https://serverlesshorrors.com/). AWS Lambda functions can see costs [spike 11x from network delays alone](https://www.serverless.com/blog/understanding-and-controlling-aws-lambda-costs). Even a misconfigured S3 bucket resulted in [\$1,300 in charges from unauthorized requests](https://www.tyolab.com/blog/2025/01-24-the-dark-side-of-cloud-computing-the-unexpected-cost/) in a single day. These aren't rare edge cases. They happen regularly because usage-based pricing makes costs difficult to predict and easy to lose track of.
**Vendor lock-in with managed services.** When you use managed AI platforms like SageMaker or Vertex AI, you often write code that depends on their proprietary APIs. [Research shows 71% of organizations](https://journalofcloudcomputing.springeropen.com/articles/10.1186/s13677-016-0054-z) cite vendor lock-in as a deterrent to adopting more cloud services. Migrating away requires rewriting significant portions of your application. Data formats may be incompatible. Features you relied on might not exist elsewhere. The switching costs become so high that you're effectively locked into that provider's ecosystem, even if prices increase or service quality declines.
For the image classification API server we've built in this course, a sensible starting point would be a small VM running your Docker container. You get full control, predictable monthly costs (often $5-20 for basic instances), and the ability to scale up by switching to a larger VM when needed. This approach teaches you cloud fundamentals without the risk of surprise bills or vendor lock-in. As you gain experience and your requirements grow clearer, you can make informed decisions about whether managed services justify their additional complexity and cost uncertainty.
> **Videos:**
> - [Business model of cloud computing](https://www.youtube.com/watch?v=4Wa5DivljOM)
> - [Usage-based pricing pitfalls](https://www.youtube.com/watch?v=SCIfWhAheVw)
> **Extended Reading:**
> For deeper exploration of cloud economics and vendor lock-in:
> - [The Dark Side of Cloud Computing: Unexpected Costs](https://www.tyolab.com/blog/2025/01-24-the-dark-side-of-cloud-computing-the-unexpected-cost/) examines billing horror stories and lessons learned
> - [Critical Analysis of Vendor Lock-in](https://journalofcloudcomputing.springeropen.com/articles/10.1186/s13677-016-0054-z) provides academic perspective on cloud migration challenges
> - [Understanding AWS Lambda Costs](https://www.serverless.com/blog/understanding-and-controlling-aws-lambda-costs) breaks down serverless pricing complexities
## Cloud Deployment in Practice
Now that you understand cloud infrastructure and have chosen to start with VMs for their transparency and cost predictability, let's walk through actually deploying your containerized AI API server. While cloud providers differ in their web interfaces and specific features, the core deployment process remains remarkably similar across platforms. Whether you're using AWS, GCP, Azure, Hetzner, or any other provider, you'll follow the same fundamental steps: create a VM, access it via SSH, install Docker, and run your container.
We'll use the image classification API server from [Wrap AI Models with APIs](@/ai-system/wrap-ai-with-api/index.md) that we containerized in [Packaging & containerization](@/ai-system/packaging-containerization/index.md) as our running example. The beauty of containers is that once you have your Dockerfile and image ready, deployment becomes straightforward regardless of where you're running it.
### Selecting Your Virtual Machine
When creating a VM through your cloud provider's interface, you'll need to make several decisions about its configuration. These choices affect both performance and cost, but the good news is you can always resize or recreate your VM later if your needs change.
![VM creation](vm-creation.png)
**Operating System**: Choose a Linux distribution. [Ubuntu LTS (Long Term Support)](https://ubuntu.com/about/release-cycle) versions like 22.04 or 24.04 are excellent choices because they receive security updates for five years and have extensive community documentation. Most cloud providers offer Ubuntu as a one-click option. Other good alternatives include Debian or Rocky Linux, but Ubuntu's popularity means you'll find more tutorials and troubleshooting help online.
**CPU and Memory**: For running our containerized AI API server without GPU acceleration, start with a modest configuration. A VM with 2-4 virtual CPUs and 4-8 GB of RAM handles most small to medium traffic loads comfortably. Remember, you're running the model inference on CPU, not training it. If you find performance lacking later, you can upgrade to a larger instance. Starting small keeps costs down while you're learning and testing.
**Storage**: Allocate 20-30 GB of disk space. This covers the operating system (typically 5-10 GB), Docker itself (a few GB), your container images (varies by model size, but usually under 5 GB for our API server), and room for logs and temporary files. Most providers charge extra for additional storage beyond a base amount, so don't over-allocate. You can expand storage later if needed.
**Network Configuration**: Ensure your VM gets a public IP address so you can access it from the internet. Most providers assign one automatically, but some require you to explicitly request it. You'll also need to configure security groups or firewall rules to allow incoming traffic on specific ports. At minimum, open port 22 for SSH access (so you can log in) and port 8000 for your API server. Many providers default to blocking all incoming traffic for security, so you must explicitly allow these ports.
**Authentication**: Most providers offer SSH key-based authentication during VM creation. If given the option, provide your public SSH key now. This is more secure than password authentication and saves setup time later. If you don't have an SSH key yet, you can generate one locally before creating the VM (more on this in the next section).
A typical small VM suitable for our purposes costs $5-20 per month depending on the provider and region. European providers like Hetzner often offer better price-performance ratios than the major cloud providers for basic VMs. Start with the smallest configuration that meets the minimum requirements. You can always scale up, but you can't get money back for over-provisioning.
### Accessing Your Remote Server
Once your VM is created, you need a way to access it remotely to install software and configure it. This is done through [SSH (Secure Shell)](https://en.wikipedia.org/wiki/Secure_Shell), a protocol that lets you securely connect to and control a remote computer over the internet.
**What is SSH?** Think of SSH as a secure remote control for your server. It encrypts all communication between your local computer and the remote server, so passwords and commands can't be intercepted. When you SSH into a server, you get a command-line interface just as if you were sitting at that machine's keyboard. This is how system administrators manage servers around the world.
**Your First Connection**: After your VM is created, your cloud provider will give you its public IP address (something like `203.0.113.42`). You'll also need a username, which varies by provider. Many VMs default to the `root` user (the administrator account with full system privileges). Ubuntu VMs from major cloud providers typically use `ubuntu`, Azure often uses `azureuser`, and some providers let you choose during creation. To connect, open a terminal on your local machine and run:
```bash
ssh username@203.0.113.42
```
Replace `username` with your actual username and `203.0.113.42` with your server's IP address. The first time you connect, you'll see a warning asking if you trust this server. Type `yes` to continue. If you set up password authentication, you'll be prompted for a password. Once authenticated, you'll see a command prompt indicating you're now controlling the remote server.
**SSH Keys (More Secure)**: Password authentication works, but [SSH keys](https://www.digitalocean.com/community/tutorials/how-to-configure-ssh-key-based-authentication-on-a-linux-server) are more secure and convenient. An SSH key pair consists of a private key (kept secret on your computer) and a public key (shared with servers). Think of it like a special lock and key: you give servers a copy of the lock (public key), and only your key (private key) can open it.
To generate an SSH key pair on your local machine:
```bash
ssh-keygen -t ed25519 -C "your-email@example.com"
```
This creates two files in `~/.ssh/`: `id_ed25519` (private key, never share this) and `id_ed25519.pub` (public key). When prompted for a passphrase, you can press Enter to skip it for convenience, though adding one provides extra security. If you created your VM without providing a public key, you can add it now by logging in with password authentication and running:
```bash
mkdir -p ~/.ssh
chmod 700 ~/.ssh
echo "your-public-key-content-here" >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
```
Paste the entire contents of your `id_ed25519.pub` file in place of `your-public-key-content-here`. From then on, you can SSH without passwords.
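Alternatively, if your local machine has `ssh-copy-id` installed, it performs those same steps in one command (it will prompt for your password once):

```bash
ssh-copy-id -i ~/.ssh/id_ed25519.pub username@203.0.113.42
```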
**First Tasks After Login**: When you first access your new server, perform these essential setup steps:
```bash
# Update package lists and upgrade existing packages
sudo apt update && sudo apt upgrade -y
```
This ensures your system has the latest security patches. The `sudo` command runs commands as administrator (root user). On Ubuntu, the default user has sudo privileges. If you're logged in as root directly, you can omit `sudo` from commands, though it's still good practice to create a regular user for daily work:
```bash
# Create a new user (skip if you already have a non-root user)
adduser yourname
# Give the new user sudo privileges
usermod -aG sudo yourname
```
Working as root for routine tasks is risky because it's too easy to accidentally damage the system with a mistyped command.
You should also configure a basic firewall using UFW (Uncomplicated Firewall), which comes pre-installed on Ubuntu:
```bash
# Allow SSH so you don't lock yourself out
sudo ufw allow 22/tcp
# Allow your API server port
sudo ufw allow 8000/tcp
# Enable the firewall
sudo ufw enable
```
This firewall runs on the VM itself and adds an additional layer of protection beyond your cloud provider's security groups. Now you have a freshly configured, secure server ready for installing Docker and deploying your application.
### Installing Docker
With your server configured, the next step is installing Docker so you can run containerized applications. The [official Docker installation guide for Ubuntu](https://docs.docker.com/engine/install/ubuntu/) provides several installation methods, but for production servers, we'll use the repository method rather than convenience scripts.
**Why Not Convenience Scripts?** You might find guides suggesting you can install Docker with a single command using `curl https://get.docker.com | sh`. While this works, Docker's own documentation warns against using it in production environments. The script doesn't give you control over which version gets installed and can behave unexpectedly during system updates. For learning and production deployments, taking the proper approach builds better habits.
**Installation Steps**: First, remove any old or conflicting Docker installations:
```bash
sudo apt remove docker docker-engine docker.io containerd runc
```
If these packages aren't installed, apt will simply report they're not found. That's fine. Next, install prerequisite packages and add Docker's official repository:
```bash
# Install packages to allow apt to use repositories over HTTPS
sudo apt install -y ca-certificates curl gnupg lsb-release
# Add Docker's official GPG key
sudo install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
sudo chmod a+r /etc/apt/keyrings/docker.gpg
# Set up the Docker repository
echo \
"deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu \
$(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
```
These commands set up Docker's official package repository so apt knows where to download Docker from. Now install Docker Engine:
```bash
# Update apt package index
sudo apt update
# Install Docker Engine, containerd, and Docker Compose
sudo apt install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
```
This installs several components: `docker-ce` is the Docker Engine itself, `docker-ce-cli` provides the command-line interface, `containerd.io` is the container runtime, and the plugins add useful features like building images and managing multi-container applications.
**Verify Installation**: Test that Docker installed correctly by running the hello-world container:
```bash
sudo docker run hello-world
```
You should see a message explaining that Docker successfully pulled and ran a test image. This confirms everything is working.
**Adding Your User to the Docker Group**: By default, Docker commands require root privileges (hence `sudo docker`). For convenience, you can add your user to the `docker` group:
```bash
# Add your user to the docker group
sudo usermod -aG docker $USER
# Apply the group change (or log out and back in)
newgrp docker
```
Now you can run Docker commands without `sudo`. Be aware this is a security consideration: users in the docker group effectively have root-level privileges because they can run containers with full system access. For a personal learning server, this trade-off is acceptable. In multi-user production environments, you'd want more careful access controls.
Test that you can run Docker without sudo:
```bash
docker run hello-world
```
If this works without requiring a password, you're all set. Docker is now installed and ready to run your containerized applications.
### Deploying Your Container
Now comes the exciting part: actually running your containerized AI API server on the cloud. You have two main options for getting your container image onto the server.
**Option 1: Pull from a Registry** (Recommended). If you pushed your image to Docker Hub or another registry as described in [Packaging & containerization](@/ai-system/packaging-containerization/index.md), you can pull it directly on your server:
```bash
docker pull yourusername/my-ai-classifier:v1.0
```
This downloads your image from the registry. It's the cleanest approach because the image building happened on your local machine or in an automated build system, and the server just needs to run it.
**Option 2: Build on the Server**. If you haven't published your image to a registry, you can build it directly on the server. First, transfer your project files (Dockerfile, requirements.txt, main.py) to the server using SCP (Secure Copy):
```bash
# Run this on your local machine, not the server
scp -r ~/path/to/my-ai-api username@203.0.113.42:~/
```
Then SSH into your server and build the image:
```bash
cd ~/my-ai-api
docker build -t my-ai-classifier:v1.0 .
```
Building on the server works but uses server resources and takes longer. For production workflows, using a registry is cleaner and allows you to test images locally before deploying them.
**Running Your Container**: Once you have the image (either pulled or built), run it with the following command:
```bash
docker run -d -p 8000:8000 --restart unless-stopped --name ai-api my-ai-classifier:v1.0
```
Let's break down what each flag does:
- `-d` runs the container in detached mode (in the background)
- `-p 8000:8000` maps port 8000 on the host to port 8000 in the container, making your API accessible
- `--restart unless-stopped` tells Docker to automatically restart the container if it crashes or when the server reboots (but not if you manually stopped it)
- `--name ai-api` gives the container a friendly name so you can reference it easily
- `my-ai-classifier:v1.0` is the image name and tag to run
If your application needs persistent data (like the SQLite database from our API server), mount a volume:
```bash
docker run -d -p 8000:8000 --restart unless-stopped \
-v ~/ai-data:/app/data \
--name ai-api my-ai-classifier:v1.0
```
This creates a directory `~/ai-data` on your server and mounts it to `/app/data` inside the container, so database files persist even if the container is recreated.
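To confirm the mount is wired up correctly, you can compare the directory from both sides once the API has handled a few requests (the exact file names depend on what your application writes under `/app/data`):
```bash
# List the data directory inside the container...
docker exec ai-api ls -l /app/data
# ...and the mounted directory on the host; both should show the same files
ls -l ~/ai-data
```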
**Verification**: Check that your container is running:
```bash
docker ps
```
You should see your `ai-api` container listed with status "Up". View the container's logs to ensure it started properly:
```bash
docker logs ai-api
```
You should see output from uvicorn indicating the server started successfully. Now test the API locally on the server:
```bash
curl http://localhost:8000
```
If you get a response (likely your API's root endpoint message), the container is working. Finally, test from your local machine by visiting `http://203.0.113.42:8000` in your browser (replace with your server's actual IP). If you see your API respond, congratulations! Your containerized AI application is now running on the cloud.
If you can't access it externally, double-check that your cloud provider's security group allows incoming traffic on port 8000, and that UFW allows it (`sudo ufw status` should show port 8000 allowed).
### Production Considerations
Your container is running, but a production deployment requires thinking beyond just getting it started. Here are essential practices for keeping your application running reliably.
**Container Persistence**: We used `--restart unless-stopped` when running the container, which handles two important scenarios. If your application crashes due to a bug or runs out of memory, Docker automatically restarts it. More importantly, when you reboot your server for system updates, the container starts back up automatically. Without this flag, you'd have to manually run `docker start ai-api` after every server restart.
You can verify the restart policy is working:
```bash
# View container details including restart policy
docker inspect ai-api | grep -A 5 RestartPolicy
```
**Basic Monitoring**: Regularly check your container's health with these commands:
```bash
# Check if the container is running
docker ps
# View recent logs (last 50 lines)
docker logs --tail 50 ai-api
# Follow logs in real-time
docker logs -f ai-api
# Check resource usage
docker stats ai-api
```
The `docker stats` command shows CPU and memory usage. If you notice memory climbing steadily over days, you might have a memory leak in your application. For our API server, memory usage should stay relatively stable.
Monitor disk space regularly because Docker images and logs consume space:
```bash
# Check overall disk usage
df -h
# See Docker's disk usage
docker system df
```
If disk space becomes an issue, clean up unused Docker resources:
```bash
# Remove unused images
docker image prune -a
# Remove everything unused (images, containers, networks)
docker system prune
```
Be careful with `docker system prune` as it removes all stopped containers and unused images. Only run it when you're sure you don't need them.
**Simple Maintenance**: Eventually you'll need to update your application. Here's the basic process:
```bash
# Pull the new version of your image
docker pull yourusername/my-ai-classifier:v2.0
# Stop and remove the old container
docker stop ai-api
docker rm ai-api
# Run the new version
docker run -d -p 8000:8000 --restart unless-stopped \
-v ~/ai-data:/app/data \
--name ai-api yourusername/my-ai-classifier:v2.0
```
This approach causes downtime while you switch containers. For most learning and small production use cases, a few seconds of downtime during off-peak hours is acceptable.
**Backing Up Data**: If your container uses volumes for persistent data (like our SQLite database), back up those directories regularly:
```bash
# Create a backup directory
mkdir -p ~/backups
# Backup the data directory
tar -czf ~/backups/ai-data-$(date +%Y%m%d).tar.gz ~/ai-data
```
Run this as a cron job for automatic daily backups. You can also copy backups to your local machine:
```bash
# On your local machine
scp username@203.0.113.42:~/backups/ai-data-*.tar.gz ~/local-backups/
```
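For the cron job mentioned above, one approach is to wrap the tar command in a small script and register it with `crontab`; a sketch, where `~/backup-ai-data.sh` is simply a file name chosen for this example:
```bash
# Create a small backup script (same tar command as above)
cat > ~/backup-ai-data.sh << 'EOF'
#!/bin/bash
mkdir -p ~/backups
tar -czf ~/backups/ai-data-$(date +%Y%m%d).tar.gz ~/ai-data
EOF
chmod +x ~/backup-ai-data.sh
# Schedule it to run daily at 03:00
(crontab -l 2>/dev/null; echo "0 3 * * * $HOME/backup-ai-data.sh") | crontab -
```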
With these basic practices, you have a solid foundation for running containerized applications in production. Your system will automatically recover from crashes, you can monitor its health, perform updates, and protect against data loss.
> **Extended Reading:**
> For more advanced production practices as your deployment grows:
>
> [**BorgBackup**](https://www.borgbackup.org/) provides automated, encrypted, and deduplicated backups. Unlike simple tar backups, Borg only stores changed data, saving significant space for daily backups. The [quickstart guide](https://borgbackup.readthedocs.io/en/stable/quickstart.html) shows how to set up encrypted repositories with automated cron jobs, ideal for production backup strategies.
>
> [**Docker Log Rotation**](https://docs.docker.com/engine/logging/configure/) prevents logs from consuming all disk space. Configure maximum log sizes and file counts in your container run command or daemon.json to automatically rotate and compress logs. The `local` logging driver is recommended for production as it handles rotation by default.
>
> **Zero-Downtime Deployments** using blue-green strategies allow you to update containers without service interruption. By running both old and new versions simultaneously and switching traffic with a reverse proxy like Nginx, you eliminate downtime during updates. Tutorials for [Docker Compose blue-green deployments](https://thomasbandt.com/blue-green-deployments) show practical implementations.
## Enabling HTTPS for Production
Your API server is now running and accessible at `http://your-server-ip:8000`. This works for testing, but it's unusable for production. Modern web browsers enforce strict security policies that make HTTP APIs impractical for real applications.
**The Mixed Content Problem**: If your frontend website is served over HTTPS (which it must be for users to trust it), browsers will block any HTTP requests it tries to make. This is called [mixed content blocking](https://developer.mozilla.org/en-US/docs/Web/Security/Mixed_content). In 2024, approximately 93% of all web requests use HTTPS, and browsers like Firefox automatically upgrade or block non-HTTPS resources. You simply cannot have a modern web application that makes HTTP API calls from an HTTPS page.
**Security Implications**: HTTP traffic is transmitted in plain text. Anyone between your users and your server (your ISP, coffee shop WiFi, or malicious actors) can read and modify the data. With an AI API potentially handling sensitive information or user data, this is unacceptable. HTTPS encrypts all communication, ensuring data integrity and confidentiality.
**Professional Expectations**: Users expect to see a padlock icon in their browser's address bar. Browsers display prominent warnings for HTTP sites, damaging trust before users even interact with your service. Search engines also penalize HTTP sites in rankings.
![HTTPS warning](https-warning.png)
To make your API production-ready, you need HTTPS, which requires a domain name and an SSL/TLS certificate. Let's walk through the process.
### Understanding SSL/TLS Basics
[HTTPS](https://en.wikipedia.org/wiki/Transport_Layer_Security) works through SSL/TLS certificates, which are digital documents that prove you own a domain and enable encrypted communication. When a user connects to `https://yourdomain.com`, their browser and your server perform a "handshake" where they exchange certificates and establish an encrypted connection. All subsequent data flows through this encrypted channel, preventing eavesdropping and tampering.
**Certificate Authorities** (CAs) are trusted organizations that issue certificates after verifying you control a domain. Historically, SSL certificates cost hundreds of dollars per year, creating a barrier for small projects and hobbyists. This changed in 2016 when [Let's Encrypt](https://letsencrypt.org/), a nonprofit CA, began offering free automated certificates. Today, Let's Encrypt has issued certificates to over 700 million websites, making HTTPS accessible to everyone.
**The Role of Reverse Proxies**: Your containerized application runs on port 8000 inside the server, listening for plain HTTP requests. We don't want to modify the container to handle HTTPS directly because managing certificates inside containers is complex and inflexible. Instead, we'll use a reverse proxy (Nginx) that sits in front of your container. The proxy handles HTTPS on port 443 (the standard HTTPS port), terminates the SSL connection, and forwards decrypted requests to your container on port 8000. Your container never knows HTTPS is involved, keeping the architecture simple.
### Getting a Domain Name
Before obtaining an SSL certificate, you need a domain name. Certificates are tied to specific domains, not IP addresses. You have both free and paid options.
**Free Option: DuckDNS**
[DuckDNS](https://www.duckdns.org/) provides free subdomains perfect for learning and personal projects. You get a domain like `yourname.duckdns.org` without paying anything. The service is simple:
1. Visit duckdns.org and log in with GitHub, Google, or Twitter (no separate registration needed)
2. Choose an available subdomain name
3. Point it to your server's IP address through their web interface
DuckDNS also provides an API for updating your IP if it changes, useful for home servers. The main limitation is that your domain will be longer (e.g., `my-ai-api.duckdns.org`) and less professional than a custom domain. For learning and testing HTTPS setup, DuckDNS is perfect.
![DuckDNS](duckdns.png)
**Paid Option: Domain Registrars**
For production applications, consider purchasing your own domain. As of 2024, several registrars offer competitive pricing:
- [**Cloudflare Registrar**](https://www.cloudflare.com/products/registrar/): Sells domains at cost with no markup. A .com domain costs around $10/year. Highly recommended by developers for transparent pricing and excellent DNS management tools.
- **Porkbun**: Known for consistent pricing with no renewal hikes. .com domains around $11/year.
- **Namecheap**: Popular choice with good features and support. .com domains around $16/year for renewals. Includes free WHOIS privacy.
When choosing a registrar, focus on renewal prices, not just first-year promotional rates.
![Cloudflare registrar](cloudflare-registrar.png)
**Setting Up DNS Records**
Once you have a domain, you need to point it to your server:
1. Find your server's public IP address (shown in your cloud provider's dashboard)
2. In your domain provider's DNS settings, create an **A record**:
- Name: `@` (or leave blank for root domain) or a subdomain like `api`
- Type: A
- Value: Your server's IP address (e.g., `203.0.113.42`)
- TTL: 3600 (or default)
For DuckDNS, you simply enter your IP in their web interface. For other registrars, navigate to the DNS management section of your dashboard. DNS changes can take a few minutes to a few hours to propagate worldwide, though they're usually effective within 15 minutes.
You can verify DNS is working by pinging your domain:
```bash
ping yourdomain.com
```
If it resolves to your server's IP address, DNS is configured correctly.
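If `ping` doesn't answer (some providers block ICMP in their security groups), you can query DNS directly instead; on Ubuntu, `dig` comes from the `dnsutils` package and `nslookup` is a common alternative:
```bash
# Ask DNS which IP the domain currently resolves to
dig +short yourdomain.com
# Or, if dig isn't installed
nslookup yourdomain.com
```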
### Manual SSL Setup
The traditional approach to HTTPS uses Nginx as a reverse proxy and Certbot to obtain SSL certificates from Let's Encrypt. This method requires manual configuration but provides full transparency and control over how everything works.
#### Setting Up Nginx Reverse Proxy
With your domain pointing to your server, the next step is installing and configuring Nginx to act as a reverse proxy. Nginx will accept incoming requests on ports 80 (HTTP) and 443 (HTTPS) and forward them to your Docker container on port 8000.
**Install Nginx**:
```bash
sudo apt update
sudo apt install nginx -y
```
Nginx starts automatically after installation. You can verify it's running:
```bash
sudo systemctl status nginx
```
**Create Nginx Configuration**:
Create a configuration file for your domain:
```bash
sudo nano /etc/nginx/sites-available/your-domain
```
Add the following configuration (replace `yourdomain.com` with your actual domain):
```nginx
server {
listen 80;
server_name yourdomain.com;
location / {
proxy_pass http://localhost:8000;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
}
}
```
This configuration tells Nginx to:
- Listen on port 80 for HTTP requests
- Accept requests for your domain
- Forward all requests to `localhost:8000` (where your Docker container is running)
- Pass along important headers so your application knows the original client's IP and protocol
**Enable the Configuration**:
```bash
# Create a symbolic link to enable the site
sudo ln -s /etc/nginx/sites-available/your-domain /etc/nginx/sites-enabled/
# Test the configuration for syntax errors
sudo nginx -t
# If the test passes, reload Nginx
sudo systemctl reload nginx
```
**Test the Proxy**:
Now you should be able to access your API through your domain name using HTTP:
```bash
curl http://yourdomain.com
```
You should see the response from your API server. Your browser should also work at `http://yourdomain.com`. The container is still running on port 8000, but Nginx is now proxying requests to it from port 80.
At this point, you have a working reverse proxy, but you're still using HTTP. The next step is adding SSL certificates for HTTPS.
#### Obtaining SSL Certificates with Certbot
Certbot is the official tool for obtaining Let's Encrypt certificates. It automates the entire process, including modifying your Nginx configuration to enable HTTPS.
**Install Certbot**:
```bash
sudo apt install certbot python3-certbot-nginx -y
```
The `python3-certbot-nginx` package includes the Nginx plugin that allows Certbot to automatically configure Nginx for HTTPS.
**Obtain and Install Certificate**:
Before running Certbot, ensure traffic on ports 80 and 443 is allowed. Let's Encrypt uses port 80 for domain validation, and port 443 is for HTTPS traffic. You need to configure this in two places:
First, update your server's firewall (UFW):
```bash
sudo ufw allow 80/tcp
sudo ufw allow 443/tcp
```
Second, ensure your cloud provider's security group or firewall rules also allow these ports. In your cloud provider's dashboard, check the security group attached to your VM and verify that inbound rules allow TCP traffic on ports 80 and 443 from anywhere (0.0.0.0/0). Without this, Let's Encrypt cannot reach your server to validate domain ownership.
Now run Certbot:
```bash
sudo certbot --nginx -d yourdomain.com
```
Replace `yourdomain.com` with your actual domain. If you're using a subdomain (like `api.yourdomain.com`), use that instead.
Certbot will:
1. Ask for your email address (for renewal notifications)
2. Ask you to agree to the terms of service
3. Ask if you want to receive EFF newsletters (optional)
4. Validate that you control the domain by placing a temporary file on your server and verifying Let's Encrypt can access it via HTTP
5. Obtain the SSL certificate
6. Automatically modify your Nginx configuration to use the certificate
7. Set up automatic HTTP to HTTPS redirection
**How Domain Validation Works**:
Let's Encrypt needs to verify you control the domain before issuing a certificate. The HTTP-01 challenge works by:
1. Certbot creates a file in `/var/www/html/.well-known/acme-challenge/`
2. Let's Encrypt's servers request this file via `http://yourdomain.com/.well-known/acme-challenge/[random-string]`
3. If the file is successfully retrieved, domain ownership is proved
4. The certificate is issued
This is why your domain must already be pointing to your server's IP and port 80 must be accessible from the internet.
**Automatic Renewal**:
Let's Encrypt certificates expire every 90 days, but Certbot automatically sets up a systemd timer to renew certificates before they expire. You can test the renewal process:
```bash
sudo certbot renew --dry-run
```
If this command succeeds, automatic renewal is configured correctly. You don't need to do anything else; certificates will renew automatically in the background.
Check the renewal timer status:
```bash
sudo systemctl status certbot.timer
```
**What Certbot Changed**:
After Certbot finishes, your Nginx configuration file (`/etc/nginx/sites-available/your-domain`) will look significantly different. Certbot added:
- A new `server` block listening on port 443 for HTTPS
- Paths to your SSL certificate and private key
- SSL security settings
- A redirect from HTTP (port 80) to HTTPS (port 443)
You can view the updated configuration:
```bash
sudo cat /etc/nginx/sites-available/your-domain
```
Your API is now accessible via HTTPS at `https://yourdomain.com`.
### Automatic SSL Setup
While the Nginx and Certbot approach works well, it requires manual configuration for each domain and updating Nginx configuration files. Traefik offers an alternative approach designed specifically for Docker environments, where SSL certificates are obtained and renewed automatically through container labels.
**What is Traefik?** Traefik is a modern reverse proxy built for dynamic container environments. Unlike Nginx which requires configuration files, Traefik watches your Docker containers and configures itself automatically based on labels you add to those containers. When a new container starts with appropriate labels, Traefik immediately begins routing traffic to it and can automatically request an SSL certificate.
**Why Choose Traefik?** Traefik excels in environments running multiple containerized services. Instead of editing Nginx configuration files and running Certbot for each new domain, you simply add labels to your Docker container. Traefik handles the rest: routing, SSL certificates, renewals, and load balancing. For a single API server, this might seem like overkill, but it demonstrates modern cloud-native patterns and scales effortlessly as your infrastructure grows.
#### Setting Up Traefik
First, create a Docker network that both Traefik and your application containers will use:
```bash
docker network create traefik-network
```
Create a directory to store SSL certificates:
```bash
mkdir ~/traefik-certs
chmod 700 ~/traefik-certs  # directories need the execute bit to be entered; 700 keeps it private to your user
```
Now start the Traefik container with the necessary configuration:
```bash
docker run -d \
--name traefik \
--network traefik-network \
-p 80:80 \
-p 443:443 \
-v /var/run/docker.sock:/var/run/docker.sock:ro \
-v ~/traefik-certs:/letsencrypt \
--restart unless-stopped \
traefik:v3.0 \
--providers.docker=true \
--providers.docker.exposedbydefault=false \
--entrypoints.web.address=:80 \
--entrypoints.websecure.address=:443 \
--certificatesresolvers.letsencrypt.acme.tlschallenge=true \
--certificatesresolvers.letsencrypt.acme.email=your-email@example.com \
--certificatesresolvers.letsencrypt.acme.storage=/letsencrypt/acme.json
```
Replace `your-email@example.com` with your actual email address for Let's Encrypt notifications.
This command tells Traefik to:
- Watch Docker for containers (via the Docker socket mounted with `-v`)
- Listen on ports 80 and 443
- Use the TLS challenge for Let's Encrypt validation
- Store certificates in the mounted volume
#### Configuring Your Application Container
Instead of running your container with `-p 8000:8000`, you connect it to the Traefik network and add labels that tell Traefik how to route traffic. Stop your existing container and restart it with Traefik labels:
```bash
# Stop the existing container
docker stop ai-api
docker rm ai-api
# Start with Traefik labels
docker run -d \
--name ai-api \
--network traefik-network \
--label "traefik.enable=true" \
--label "traefik.http.routers.api.rule=Host(\`yourdomain.com\`)" \
--label "traefik.http.routers.api.entrypoints=websecure" \
--label "traefik.http.routers.api.tls=true" \
--label "traefik.http.routers.api.tls.certresolver=letsencrypt" \
--label "traefik.http.services.api.loadbalancer.server.port=8000" \
--restart unless-stopped \
my-ai-classifier:v1.0
```
Notice:
- No `-p 8000:8000` port publishing (Traefik handles external access)
- `--network traefik-network` connects the container to Traefik's network, which is required for the two containers to talk to each other
The critical part is the labels, which configure routing and SSL. They tell Traefik:
- `traefik.enable=true`: Manage this container
- `traefik.http.routers.api.rule`: Route requests for `yourdomain.com` to this container
- `traefik.http.routers.api.entrypoints=websecure`: Use HTTPS (port 443)
- `traefik.http.routers.api.tls=true`: Enable TLS
- `traefik.http.routers.api.tls.certresolver=letsencrypt`: Use Let's Encrypt for certificates
- `traefik.http.services.api.loadbalancer.server.port=8000`: Forward to port 8000 inside the container
Note that `api` is just an identifier for this router and service; you can use any name as long as it doesn't clash with the router and service names defined by your other containers. The port configuration is often optional since Traefik can auto-detect exposed ports, but we include it here for clarity.
#### How Automatic SSL Works
When you start the container with these labels:
1. Traefik detects the new container through the Docker socket
2. Reads the labels and creates routing rules
3. Sees that TLS is enabled with the Let's Encrypt resolver
4. Automatically requests a certificate from Let's Encrypt for `yourdomain.com`
5. Completes the TLS challenge (similar to Certbot's HTTP challenge)
6. Installs the certificate
7. Begins routing HTTPS traffic to your container
All of this happens automatically within seconds of starting your container. No manual Certbot commands, no Nginx configuration editing.
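If the certificate doesn't appear, Traefik's own logs are the first place to look; certificate requests and any ACME errors show up there:
```bash
# Watch Traefik's logs while it negotiates the certificate
docker logs -f traefik
```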
**Automatic Renewal**: Traefik monitors certificate expiration dates and automatically renews them before they expire. You don't set up cron jobs or systemd timers; Traefik handles it internally.
**HTTP to HTTPS Redirect**: To automatically redirect HTTP to HTTPS, add these additional labels when starting your container:
```bash
--label "traefik.http.routers.api-http.rule=Host(\`yourdomain.com\`)" \
--label "traefik.http.routers.api-http.entrypoints=web" \
--label "traefik.http.routers.api-http.middlewares=redirect-to-https" \
--label "traefik.http.middlewares.redirect-to-https.redirectscheme.scheme=https"
```
#### Trade-offs
Traefik offers significant advantages for container-based deployments. Fully automatic SSL management means you never manually request or renew certificates. The label-based configuration eliminates separate configuration files, and adding new services is as simple as starting a container with the right labels. Native Docker integration makes it scale effortlessly as you add more containers. This approach represents modern cloud-native practices and works particularly well when managing multiple services.
However, these benefits come with trade-offs. The initial setup is more complex than Nginx, requiring understanding of Docker networks, labels, and how Traefik discovers services. The automation also means less transparency; configuration happens "magically" based on labels, which can be harder to debug when things go wrong. For a single container deployment like our API server, Traefik might be overkill. The Nginx and Certbot approach teaches fundamental concepts and provides clear visibility into each step, making it better for learning. Traefik's value becomes apparent when managing multiple services where its automation significantly reduces maintenance overhead.
> **Extended Reading:**
> The [Traefik v3 Docker Compose Guide](https://www.simplehomelab.com/traefik-v3-docker-compose-guide-2024/) provides comprehensive setup instructions including dashboard configuration, middleware usage, and advanced routing patterns. The [official Traefik documentation](https://doc.traefik.io/traefik/user-guides/docker-compose/acme-tls/) covers TLS configuration in depth.
>
> **DNS-Based Challenges for Fully Automatic SSL**: Beyond the TLS challenge we used above, Traefik supports DNS-based challenges that can be even more powerful, especially with providers like Cloudflare and DuckDNS. With a [DNS challenge](https://doc.traefik.io/traefik/https/acme/#dnschallenge), you provide Traefik with your domain provider's API token, and it automatically creates DNS records to prove domain ownership. This approach works even if port 80 isn't publicly accessible and can issue wildcard certificates (like `*.yourdomain.com`).
>
> For [Cloudflare](https://doc.traefik.io/traefik/https/acme/#providers), you'd add your API token as an environment variable and change the certificate resolver configuration to use `dnsChallenge` with `provider=cloudflare`. For [DuckDNS](https://github.com/traefik/traefik/issues/4728), you'd use your DuckDNS token similarly. Once configured, the entire SSL setup becomes fully automated—Traefik handles domain validation, certificate issuance, and renewal without any manual intervention or port requirements. This represents the cutting edge of automated infrastructure management.
### Testing Your HTTPS Setup
Now that everything is configured, verify that HTTPS works correctly.
**Access Your API via HTTPS**:
Open your browser and navigate to `https://yourdomain.com`. Notice you don't need to specify port 443 because it's the default HTTPS port, just like port 80 is the default for HTTP. You should see:
1. A padlock icon in your browser's address bar
2. Your API's response (likely the root endpoint message)
3. No security warnings
Click the padlock icon to view certificate details. You should see the certificate is issued by "Let's Encrypt" and is valid for your domain.
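You can run the same check from the command line; `openssl` can print the served certificate's issuer and validity dates (output formatting varies slightly between versions):
```bash
# Show who issued the certificate and when it expires
echo | openssl s_client -connect yourdomain.com:443 -servername yourdomain.com 2>/dev/null \
  | openssl x509 -noout -issuer -dates
```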
**Test HTTP to HTTPS Redirect**:
Try accessing `http://yourdomain.com` (explicitly using HTTP). You should be automatically redirected to `https://yourdomain.com`. This ensures users always use the encrypted connection even if they type or bookmark the HTTP version.
```bash
# Test with curl, following redirects
curl -L http://yourdomain.com
```
The `-L` flag tells curl to follow redirects. You should see your API's response.
Your API is now production-ready with HTTPS. Users can access it at `https://yourdomain.com` with full encryption, and certificates will renew automatically. Your container continues running unchanged on port 8000, completely unaware of the HTTPS complexity happening in front of it.
> **Videos:**
> - [HTTPS explained](https://www.youtube.com/watch?v=hExRDVZHhig)
> - [Let's encrypt tutorial](https://www.youtube.com/watch?v=WPPBO-QpiJ0)
> - [Traefik vs. Nginx](https://www.youtube.com/watch?v=scrtJ1U4wJU)
> - [Obtaining SSL certificate for local machines](https://www.youtube.com/watch?v=qlcVx-k-02E&t=43s)
## Exercise
**Deploy Your API Server Container on the Cloud**
Deploy the container we built in [the previous module](@/ai-system/packaging-containerization/index.md) on a cloud machine to free up the computational resources of your personal computer.
- Connect to a cloud machine of your choice and install the necessary container runtime
- Pull the container image you pushed to a container image registry, or upload the source code and build the image on the cloud machine directly
- Run the container and test its functionality
**Advanced Challenges (Optional):**
- If your cloud machine has a publicly accessible IP address, set up a domain pointing to your API server and enable HTTPS
- Interact with your API server using a client program running on another machine (e.g., your personal computer) through the HTTPS-enabled API endpoints

+++
title = "B.7-Edge & Self-hosted Deployment"
date = 2025-10-16
description = ""
+++
> **TL;DR:** Cloud deployment isn't the only option. Learn about edge computing and self-hosted deployment: running AI systems closer to where data is generated or on your own hardware. Discover when these approaches make sense and how to implement them on devices like Raspberry Pi and home servers.
In [Cloud Deployment](@/ai-system/cloud-deployment/index.md), we explored how to deploy AI systems to remote data centers. For many applications, the cloud is indeed the most practical choice. But there are compelling scenarios where bringing computation closer to users or running on your own hardware makes more sense.
Consider a smart security camera that detects suspicious activity. Sending every video frame to the cloud for analysis creates privacy concerns, wastes bandwidth, and introduces latency. What if the internet connection drops? A better approach runs the AI model directly on the camera or a nearby device, processing video locally and only sending alerts when something important is detected. This is edge computing.
Or imagine you're a researcher with sensitive medical data, or a small business wanting to avoid recurring cloud bills, or simply someone who values control over your infrastructure. In these cases, self-hosted deployment (running services on hardware you own and control) becomes attractive. For example, I have a home server where I self-host most of my entertainment, day-to-day, and storage needs: [Plex media server](https://plex.yanlincs.com), [Immich photo server](https://photo.yanlincs.com), [Nextcloud file sharing](https://cloud.yanlincs.com), and more.
## Understanding Edge & Self-hosting
Before diving into implementation, we need to understand what these terms mean and how they differ from the cloud deployment we've already covered.
### What is Edge Computing?
[Edge computing](https://en.wikipedia.org/wiki/Edge_computing) brings computation and data storage closer to where data is being generated and where it's needed, rather than relying on a central cloud location that might be hundreds or thousands of miles away.
The term "edge" simply refers to devices at the boundary of a network, where data is actually created or where people interact with systems. Think smartphones, smart home devices, sensors in factories, self-driving cars, or a server sitting in an office closet.
![Edge computing diagram](edge-computing.png)
You've probably seen edge computing in action without realizing it. A [Raspberry Pi](https://www.raspberrypi.com/), that tiny €50 computer the size of a credit card, can run AI models for home projects. The [Raspberry Pi AI Camera](https://www.raspberrypi.com/documentation/accessories/ai-camera.html) runs object detection directly on the camera itself, spotting people, cars, or pets in real-time without ever sending video to the cloud. Tech YouTuber Jeff Geerling has built some impressive setups, like [a Raspberry Pi AI PC with multiple neural processors](https://www.jeffgeerling.com/blog/2024/55-tops-raspberry-pi-ai-pc-4-tpus-2-npus) for local AI processing. For more demanding applications, [Nvidia Jetson](https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/) boards pack serious GPU power into a small package. The [Jetson community](https://developer.nvidia.com/embedded/community/jetson-projects) has built everything from bird identification systems that recognize 80 species by sound, to indoor robots that map your home and remember where you left things.
Why does edge computing matter for AI? Processing data locally means faster responses since information doesn't need to travel across the internet. It also saves bandwidth by only sending relevant results instead of raw data. Privacy improves because sensitive information stays on your device. And perhaps most importantly, edge systems work even when the internet goes down.
### What is Self-Hosted Deployment?
Here's something you might not have realized: you've been doing self-hosted deployment throughout this entire course. Every time you ran your AI API server on your own computer, that was self-hosting. Self-hosted deployment simply means running your applications on hardware you own and control, rather than renting resources from cloud providers.
The beauty of self-hosting is that it works at any scale. At the simplest level, you're repurposing hardware you already have. That old laptop collecting dust in a drawer? Install Linux on it and you have a perfectly capable home server. An old desktop that would otherwise go to the landfill can run your AI models, host your files, or serve your applications. Even a Raspberry Pi or a NAS (Network Attached Storage) device can run containerized services.
![Home server setup](home-server.png)
But self-hosting isn't just about recycling old hardware. Building a new system from scratch can make economic sense too. Consider storage: major cloud providers charge around €18-24 per terabyte per month (budget providers like Backblaze start around €5/TB). If you need 10TB of storage from a major provider, that's €180-240 monthly, adding up to €2,160-2,880 per year. You could build a dedicated storage server with multiple hard drives for €900-1,400, breaking even in under a year. After that, it's essentially free (minus electricity). Plus, transferring files over your home network is dramatically faster than uploading or downloading from the cloud. Gigabit ethernet gives you around 100MB/s transfer speeds, while most home internet uploads max out at 10-50MB/s.
![Storage cost comparison](storage-cost.png)
Beyond economics, self-hosting gives you complete control. Your data stays on your hardware, in your home or office. There are no monthly bills that can suddenly increase, no vendor lock-in forcing you to use proprietary APIs, and no worrying about whether a cloud provider will shut down your account. For learners, self-hosting offers hands-on experience with real infrastructure that you can't get from managed cloud services. And if you need specialized hardware like GPUs for AI work, owning the equipment often makes more sense than paying cloud providers' premium hourly rates, especially if you're using it regularly.
### The Relationship Between Edge and Self-Hosted
Edge computing and self-hosted deployment are distinct ideas, but we cover them together in this module because they share practical challenges. Both involve working with hardware you have physical access to, whether that's a Raspberry Pi on your desk or a server in your office. Both require you to manage limited resources compared to the cloud's seemingly infinite capacity. When something breaks, you can't just open a support ticket; you need to troubleshoot and fix it yourself. The deployment techniques are also similar: you're installing operating systems, configuring networks, running containers, and ensuring services stay up, whether on edge devices or self-hosted servers. Most importantly, the skills you learn deploying to a Raspberry Pi at the edge transfer directly to managing a self-hosted server at home, and vice versa.
> **Videos:**
> - [Edge computing explained](https://www.youtube.com/watch?v=cEOUeItHDdo)
> - [Introduction to self-hosting](https://www.youtube.com/watch?v=4tiNUaeAM1w)
> - [Replacing cloud services with self-hosted ones](https://www.youtube.com/watch?v=vpiiqbpdkNk)
## Edge & Self-Hosted Deployment in Practice
Now that you understand the concepts, let's get practical. We'll walk through deploying your containerized AI API server to edge and self-hosted hardware. Since we already covered Docker installation and running containers in [Cloud Deployment](@/ai-system/cloud-deployment/index.md), we'll focus on what's different when working with physical hardware you control.
### Choosing Your Hardware
The hardware you choose depends on your use case, budget, and what you might already have available.
For learning and light workloads, a **[Raspberry Pi](https://www.raspberrypi.com/products/raspberry-pi-5/)** (around €50-95 for the Pi 5 with 4-8GB RAM) is hard to beat. It's tiny, power-efficient (using about 3-5 watts), and runs a full Linux operating system. Perfect for running lightweight AI models, home automation, or small API servers. The Pi 5 with 8GB RAM can comfortably handle our image classification API from earlier modules.
![Raspberry Pi](raspberry-pi.png)
If you need more power for AI workloads, **[Nvidia Jetson](https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/jetson-orin/)** boards (around €230-240 for the [Jetson Orin Nano Super Developer Kit](https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/jetson-orin/nano-super-developer-kit/)) come with integrated GPUs designed specifically for AI inference. They're overkill for simple projects but shine when running larger models or processing video streams in real-time.
![Nvidia Jetson](jetson.png)
Don't overlook that **old laptop or desktop** sitting unused. An x86 machine from the last decade probably has more RAM and storage than a Raspberry Pi, runs cooler than a gaming desktop, and costs nothing if you already own it. Laptops are particularly attractive because they're power-efficient and come with a built-in battery (basically a free UPS). [Repurposing an old laptop as a Linux server](https://dev.to/jayesh_w/this-is-how-i-turned-my-old-laptop-into-a-server-1elf) is a popular project that teaches you server management without any upfront cost. Old workstations with dedicated GPUs can even handle serious AI workloads.
For more demanding self-hosted setups, you might build a **purpose-built server** using standard PC components. This gives you flexibility to choose exactly the CPU, RAM, storage, and GPU you need. Popular projects include [DIY NAS builds](https://blog.briancmoses.com/2024/11/diy-nas-2025-edition.html) for storage or [general-purpose home servers](https://www.wundertech.net/complete-home-server-build-guide/) for running multiple services. Budget builds can start around €180-450, while more capable systems run €450-1400 depending on requirements.
### Installing the Operating System
Once you have your hardware, you need to install an operating system. The process varies depending on what hardware you're using, but the goal is the same: get a Linux system up and running that you can access remotely.
#### Raspberry Pi
The Raspberry Pi makes OS installation remarkably easy with the [Raspberry Pi Imager](https://www.raspberrypi.com/software/). This official tool handles everything: downloading the OS, writing it to your SD card, and even preconfiguring settings like WiFi and SSH access. The process is straightforward: select your Pi model, choose "Raspberry Pi OS Lite (64-bit)" for a headless server, configure your settings (hostname, SSH, WiFi), and write to an SD card. The [official getting started guide](https://www.raspberrypi.com/documentation/computers/getting-started.html) walks through each step with screenshots.
#### Nvidia Jetson
Nvidia Jetson boards come with [JetPack SDK](https://developer.nvidia.com/embedded/jetpack), which includes the operating system (based on Ubuntu) plus all the NVIDIA AI libraries and tools. The [official getting started guide](https://developer.nvidia.com/embedded/learn/get-started-jetson-orin-nano-devkit) provides an SD card image you can download and write to a microSD card, similar to the Raspberry Pi process. After first boot, you'll run through an initial setup wizard to configure your username, password, and network settings. For more advanced setups, NVIDIA's SDK Manager lets you install different JetPack versions or flash the built-in storage directly.
#### x86 PC or Laptop
For standard x86 computers (Intel or AMD processors), [Ubuntu Server](https://ubuntu.com/download/server) is an excellent choice. Download the ISO file, create a bootable USB drive using tools like [Rufus](https://rufus.ie/) (Windows) or [balenaEtcher](https://etcher.balena.io/) (cross-platform), boot from the USB, and follow the text-based installer. The [official installation tutorial](https://ubuntu.com/tutorials/install-ubuntu-server) covers the entire process, including partitioning, network configuration, and SSH setup. Ubuntu Server is lightweight, well-documented, and receives long-term support.
### Architecture Considerations
In [Packaging & containerization](@/ai-system/packaging-containerization/index.md), we learned how containers package applications to "run consistently anywhere." However, there's an important caveat we didn't discuss: CPU architecture. The Raspberry Pi and Nvidia Jetson use ARM processors, while most PCs and cloud servers use x86 processors. This matters because container images are built for specific architectures.
If you try to run an x86 container image on a Raspberry Pi, it simply won't work. The ARM processor can't execute x86 instructions. It's like trying to play a Blu-ray disc in a DVD player; the physical format is similar, but the underlying technology is incompatible. Fortunately, many popular images on Docker Hub are [multi-architecture images](https://docs.docker.com/build/building/multi-platform/) that include versions for both ARM and x86. When you run `docker pull python:3.11`, Docker automatically detects your system's architecture and pulls the appropriate version.
For your custom images, you have two options. The simple approach is building directly on your target hardware. If you're deploying to a Raspberry Pi, build your Docker image on the Pi itself (or another ARM machine). The image will naturally be ARM-compatible. The more sophisticated approach uses Docker's `buildx` feature to create multi-architecture images that work on both ARM and x86. This is what professional projects do, but it requires a bit more setup. The [Docker documentation on multi-platform builds](https://docs.docker.com/build/building/multi-platform/) explains the process.
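As a rough sketch of the multi-architecture route (flags and platform names follow Docker's buildx documentation; the image name is the same placeholder used throughout this post):
```bash
# Check which architecture the current machine uses
uname -m   # x86_64 for Intel/AMD, aarch64 for 64-bit ARM
# Create and select a buildx builder (one-time setup)
docker buildx create --name multiarch --use
# Build for both x86 and ARM and push the result to your registry
# (cross-building ARM images on an x86 machine may additionally need QEMU emulation)
docker buildx build \
  --platform linux/amd64,linux/arm64 \
  -t yourusername/my-ai-classifier:v1.0 \
  --push .
```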
A quick way to check if an image supports your architecture: look at the image's Docker Hub page. For example, the [official Python image](https://hub.docker.com/_/python) shows supported platforms including `linux/amd64` (x86), `linux/arm64` (64-bit ARM like Raspberry Pi 4/5), and `linux/arm/v7` (32-bit ARM like older Pis). If your architecture isn't listed, you'll need to build the image yourself or find an alternative.
![Docker Hub platform support](docker-hub-platforms.png)
### Deploying Your Container
Once you have your OS installed and understand architecture considerations, the actual deployment process is identical to what we covered in [Cloud Deployment](@/ai-system/cloud-deployment/index.md). Install Docker using the [official installation guide](https://docs.docker.com/engine/install/), pull or build your container image (making sure it matches your architecture), and run it with `docker run`.
Resource considerations depend more on your specific hardware than whether it's "edge" or "cloud." A Raspberry Pi 5 with its quad-core CPU and 8GB RAM is actually more powerful than many low-end cloud VMs. Budget cloud instances often give you 1-2 virtual CPU cores with heavily shared resources, while your Pi's dedicated cores can outperform them for many workloads. On the other hand, a 10-year-old laptop you're repurposing might struggle compared to even basic cloud offerings. The key is understanding your hardware's capabilities and choosing appropriate workloads. Our image classification API from earlier modules runs perfectly fine on a Raspberry Pi 5, and likely faster than on a €3-5/month cloud VM.
## Remote Access to Edge & Self-Hosted Devices
Getting your edge or self-hosted device online is different from cloud deployment. Cloud VMs come with public IP addresses that anyone on the internet can reach. Your home server or Raspberry Pi sits behind your router on a private network, invisible to the outside world by default. Let's explore how to access your devices both locally and from anywhere.
### Accessing Within Your Local Network
If you just want to use your services at home or within your organization's network, local access is straightforward and secure.
Every device on your network gets a local IP address, usually something like `192.168.1.100` or `10.0.0.50`. To find your device's IP, SSH into it and run `ip addr show` (or `ip a` for short), which shows all network interfaces and their addresses. Look for the interface connected to your network (often `eth0` for ethernet or `wlan0` for WiFi) and find the line starting with `inet`. Alternatively, check your router's admin interface, which usually lists all connected devices with their IPs and hostnames.
![IP address output](ip-addr.png)
Once you have the IP, access your service just like you would a cloud server, but using the local address. If your API runs on port 8000, visit `http://192.168.1.100:8000` from any device on the same network. SSH works the same way: `ssh username@192.168.1.100`. This is the same remote access concept we covered in [Cloud Deployment](@/ai-system/cloud-deployment/index.md), just with a local IP instead of a public one.
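If you just want the address without reading the full `ip addr` output, `hostname -I` prints it on one line; from another machine on the same network you can then confirm the API is reachable (the address below is a placeholder):
```bash
# On the device: print its IP address(es)
hostname -I
# From your laptop on the same network
curl http://192.168.1.100:8000
```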
For convenience, configure your router to assign a static local IP to your device so the address doesn't change when the device reboots. Look for "DHCP reservation" or "static IP assignment" in your router settings. This way, you always know where to find your server.
> **Extended Reading:**
> You can actually get SSL certificates for local services even if they're not accessible from the internet. Remember the DNS challenge method we mentioned in [Cloud Deployment](@/ai-system/cloud-deployment/index.md)? With DNS-based validation, certificate authorities like Let's Encrypt verify domain ownership through DNS records rather than HTTP requests. This means you can obtain valid SSL certificates for services running purely on your local network.
>
> Using tools like [Traefik with DNS challenge](https://www.youtube.com/watch?v=qlcVx-k-02E) or [cert-manager](https://cert-manager.io/docs/configuration/acme/dns01/), you can automatically request and renew certificates for domains like `homeserver.yourdomain.com` that resolve to local IPs like `192.168.1.100`. Your devices will trust these certificates just like they trust any public website's certificate, eliminating browser security warnings for your local services.
### Making Services Publicly Accessible
What if you want to access your home server from anywhere, or share your service with others? This is trickier because incoming connections to home networks are blocked by default. Your router uses [NAT (Network Address Translation)](https://en.wikipedia.org/wiki/Network_address_translation) to share one public IP among all your devices, and without special configuration, external requests can't reach specific devices on your private network.
You have several options, each with different tradeoffs.
#### Option 1: Public IP Address (Static or Dynamic)
The most straightforward approach is using a public IP address from your ISP. This comes in two flavors:
**Static Public IP**: Some ISPs offer static public IPs as an add-on service for €5-18/month. The IP never changes, making it the simplest option. You point your domain directly to this IP, configure port forwarding on your router, and you're done. The downside is the extra cost and limited availability (not all ISPs offer this, especially for residential connections).
**Dynamic Public IP**: Many home internet connections already come with a public IP, it just changes periodically (every few days, weeks, or when your router reboots). This is where Dynamic DNS (DDNS) becomes essential. Services like [DuckDNS](https://www.duckdns.org/) give you a domain name (like `yourname.duckdns.org`) that automatically updates to point to your current IP. You run a small script on your server that periodically checks your public IP and updates the DNS record whenever it changes. This solution is free and works for most people.
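The "small script" is usually nothing more than a curl call to DuckDNS's update URL, run from cron; a minimal sketch, where `yourname` and `your-duckdns-token` stand in for your subdomain and the token shown on the DuckDNS dashboard:
```bash
# Refresh the DuckDNS record every 5 minutes; leaving ip= empty tells
# DuckDNS to use whatever public IP the request arrives from
(crontab -l 2>/dev/null; echo '*/5 * * * * curl -s "https://www.duckdns.org/update?domains=yourname&token=your-duckdns-token&ip="') | crontab -
```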
With either approach, you configure port forwarding on your router to direct incoming traffic on specific ports (like 80 and 443 for HTTPS) to your server's local IP. The benefit is complete control and direct access. The downside is security responsibility: your home network is exposed to the internet, requiring proper firewall configuration and ongoing maintenance.
Important caveat: Some ISPs use [CGNAT (Carrier-grade NAT)](https://en.wikipedia.org/wiki/Carrier-grade_NAT), where multiple customers share a single public IP. In this case, you don't have a truly public IP address, and this option won't work. You'll need to either request a public IP from your ISP (sometimes available for a fee) or use one of the tunneling solutions below.
> **Extended Reading:**
> Many ISPs now provide [IPv6](https://en.wikipedia.org/wiki/IPv6) connectivity alongside IPv4. Unlike IPv4, IPv6 was designed with enough addresses that every device can have its own globally routable public address, so no NAT is needed. If your ISP supports IPv6, each device on your network gets a public IPv6 address, making them directly accessible from the internet (subject to firewall rules). This bypasses the entire NAT problem. The main challenges are that not all internet users have IPv6 yet (so you might need both IPv4 and IPv6), and many home routers still block incoming IPv6 by default for security. Check if your ISP provides IPv6 and configure your router's IPv6 firewall rules accordingly.
#### Option 2: WireGuard VPN Tunnel + Cloud Proxy
If you can't get a usable public IP (due to CGNAT or ISP restrictions), you can work around this by using a cheap cloud VM as an intermediary. This solution works for any edge or self-hosted device as long as it has outgoing internet access.
The setup: rent a small VPS with a public IP (often €3-5/month), have your device establish a [WireGuard](https://www.wireguard.com/) VPN tunnel to the cloud VM (an outgoing connection that bypasses NAT), and run a reverse proxy (like Nginx or Traefik from [Cloud Deployment](@/ai-system/cloud-deployment/index.md)) on the cloud VM to forward traffic through the tunnel to your device.
From the internet's perspective, people connect to your cloud VM's public IP. The cloud VM proxies requests through the encrypted WireGuard tunnel to your device, which processes them and sends responses back through the same tunnel. Your device never needs to accept incoming connections; it only maintains an outgoing VPN connection.
This approach is secure, flexible, and works from anywhere your device can reach the internet. An additional benefit: if you're running resource-hungry services (like AI models) or need lots of storage, you can use powerful hardware for your edge/self-hosted device while keeping the cloud VM minimal and cheap. Since the cloud VM only handles traffic proxying, even a €3/month VPS with 1 CPU core and 1GB RAM works fine. You get the best of both worlds: cheap public accessibility and powerful local compute/storage.
The downside is managing both a cloud VM and the VPN tunnel, plus a small latency increase from the extra hop. Resources like [this tutorial](https://blog.fuzzymistborn.com/vps-reverse-proxy-tunnel/) walk through the complete setup.
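To give a flavour of the device side, here is a heavily simplified sketch (key generation follows WireGuard's quick start; all addresses, including the 10.8.0.x tunnel subnet, are placeholders you would choose yourself):
```bash
# On the edge/self-hosted device: install WireGuard and generate a key pair
sudo apt install wireguard -y
wg genkey | tee privatekey | wg pubkey > publickey
# Describe the tunnel in /etc/wireguard/wg0.conf, roughly:
#   [Interface]
#   PrivateKey = <contents of privatekey>
#   Address    = 10.8.0.2/24
#
#   [Peer]                        # the cloud VM
#   PublicKey  = <cloud VM's public key>
#   Endpoint   = <cloud-vm-public-ip>:51820
#   AllowedIPs = 10.8.0.1/32
#   PersistentKeepalive = 25      # keeps the outgoing connection alive through NAT
# Bring the tunnel up and start it automatically on boot
sudo wg-quick up wg0
sudo systemctl enable wg-quick@wg0
```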
> **Extended Reading:**
> If manual WireGuard configuration sounds intimidating, [Tailscale](https://tailscale.com/) offers a simpler alternative. Tailscale is built on top of WireGuard but handles all the configuration complexity for you. Instead of manually generating keys and editing config files, you sign in with your Google or GitHub account, install the Tailscale client on your devices, and it automatically creates a secure mesh network.
>
> The key difference: Tailscale creates peer-to-peer connections. If you're okay with the requirement that all client devices connecting to your edge/self-hosted server also need to install Tailscale, then you don't need a cloud VM at all. Your laptop, phone, and server all join the same Tailscale network and can talk to each other directly (or through Tailscale's relay servers if direct connection fails). This is perfect for personal use where you control all the client devices.
>
> If you need to expose services to the public internet where users don't have Tailscale installed, you can still use the WireGuard + cloud VM approach described above, or combine Tailscale with a cloud VM to get the best of both worlds: easy VPN setup between your devices and the cloud proxy.
>
> Tailscale is [free for personal use](https://tailscale.com/pricing) (up to 3 users and 100 devices). For those who want Tailscale's ease without depending on their coordination servers, [Headscale](https://github.com/juanfont/headscale) is a self-hosted, open-source alternative you can run on your own infrastructure.
#### Option 3: Cloudflare Tunnel
If you want the simplest solution and don't mind depending on a third-party service, [Cloudflare Tunnel](https://developers.cloudflare.com/cloudflare-one/connections/connect-networks/) is hard to beat. It's free for personal use and handles all the complexity for you.
You install a lightweight daemon called `cloudflared` on your device, which creates a secure outbound tunnel to Cloudflare's network. Cloudflare then routes traffic from your domain to your device through this tunnel. No VPN setup, no cloud VM to manage, no port forwarding. You configure everything through Cloudflare's dashboard, point your domain's DNS to Cloudflare (which is free), and you're done. The [official getting started guide](https://developers.cloudflare.com/cloudflare-one/connections/connect-networks/get-started/) walks you through the entire process.
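As a rough command-line sketch of that flow (tunnels can also be managed entirely from the Cloudflare dashboard; the tunnel name, hostname, local port 8000, and file paths here are all placeholders):
```bash
# Authenticate against your Cloudflare account and create a named tunnel
cloudflared tunnel login
cloudflared tunnel create my-edge-api

# Point a hostname on your Cloudflare-managed domain at the tunnel
cloudflared tunnel route dns my-edge-api api.example.com

# Minimal config: send traffic for that hostname to the local service
mkdir -p ~/.cloudflared
cat > ~/.cloudflared/config.yml <<'EOF'
tunnel: my-edge-api
credentials-file: /home/pi/.cloudflared/<tunnel-id>.json
ingress:
  - hostname: api.example.com
    service: http://localhost:8000
  - service: http_status:404
EOF

# Run the tunnel in the foreground (or install it as a service: cloudflared service install)
cloudflared tunnel run my-edge-api
```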
The benefits are significant: extremely easy setup, free, managed security and DDoS protection, and fast global performance thanks to Cloudflare's CDN. Your home IP address stays hidden, adding privacy. The downside is vendor lock-in. You're dependent on Cloudflare's service, and if they change their terms or have an outage, your services go down. Some users also prefer not to route all their traffic through a third party, even one as reputable as Cloudflare.
> **Extended Reading:**
> Cloudflare Tunnel isn't the only managed tunneling service. [Ngrok](https://ngrok.com/) is the most popular alternative, known for its developer-friendly features like request inspection and webhook testing. It has a free tier limited to development, with paid plans for production use. [Pinggy](https://pinggy.io/) offers similar functionality with competitive pricing and collaborative features. [LocalTunnel](https://theboroer.github.io/localtunnel-www/) is a free, open-source option that's simpler but less feature-rich.
>
> For a comprehensive list of tunneling solutions including self-hostable options, check out [awesome-tunneling](https://github.com/anderspitman/awesome-tunneling) on GitHub. This curated list includes everything from commercial services to open-source projects you can run on your own infrastructure, giving you alternatives if you want to avoid depending on third-party services entirely.
### Choosing the Right Approach
Which option makes sense depends on your situation and priorities.
For learning and local-only use, stick with LAN access. No need to expose services to the internet while you're experimenting. If you have a public IP (static or dynamic) and want full control, Option 1 is the simplest. Just remember you're responsible for security. If you're behind CGNAT or want to keep your home IP hidden, Option 2 (WireGuard/Tailscale) gives you maximum flexibility and control, though with added complexity. If you want the easiest solution for public access and don't mind trusting a third party, Option 3 (Cloudflare Tunnel) is perfect for personal projects.
Whatever you choose, remember that exposing services to the internet comes with security responsibilities. Keep your software updated, use strong authentication, monitor your logs, and only expose services you actually need to be public.
> **Videos:**
> - [VPNs and secure tunnels explained](https://www.youtube.com/watch?v=32KKwgF67Ho)
> - [Cloudflare tunnel review](https://www.youtube.com/watch?v=oqy3krzmSMA)
## Exercise
**Self-host Your API Server**
Deploy your containerized API server from [Packaging & containerization](@/ai-system/packaging-containerization/index.md) on local hardware instead of the cloud. This exercise teaches you the fundamentals of edge and self-hosted deployment.
- Choose your hardware: Raspberry Pi, an old laptop, a desktop PC, or even your daily computer for testing
- Install a Linux operating system if not already running one (Ubuntu Server recommended for x86 machines, Raspberry Pi OS for Pi)
- Install Docker and deploy your containerized API server on the local hardware
- Verify the service works on your local network by accessing it from another device (phone, laptop) connected to the same WiFi (see the sketch after this list)
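A minimal sketch of the last two steps, assuming the image from the containerization module is published on Docker Hub, listens on port 8000, and exposes a `/health` route (all of which are assumptions about your setup):
```bash
# On the edge device: pull and run the containerized API server
docker run -d -p 8000:8000 --restart unless-stopped --name ai-api \
  <your-dockerhub-user>/my-ai-classifier:v1.0

# Find the device's address on your local network
hostname -I

# From another device on the same WiFi (phone browser or laptop terminal)
curl http://<device-lan-ip>:8000/health
```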
**Advanced Challenges (Optional):**
- Make your locally-hosted service accessible from the internet using one of the three approaches covered in this module, then test by accessing from a different network (mobile data, coffee shop WiFi, etc.)
- Set up HTTPS for your local service using one of the approaches covered in this module and the previous one
- If using Raspberry Pi or other ARM hardware, build a multi-architecture Docker image that works on both x86 and ARM platforms (see the Buildx sketch below)
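For the multi-architecture challenge, here's a sketch using [Docker Buildx](https://docs.docker.com/build/building/multi-platform/); the image name and Docker Hub account are placeholders:
```bash
# One-time setup: create and select a builder that can target multiple platforms
docker buildx create --name multiarch --use

# Build for both x86 and ARM and push the combined image to a registry
docker buildx build \
  --platform linux/amd64,linux/arm64 \
  -t <your-dockerhub-user>/my-ai-classifier:v1.0 \
  --push .
```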
View file
@ -0,0 +1,102 @@
+++
title = "Exam Format"
date = 2025-09-02
description = ""
+++
## Modality and duration
Individual oral exam based on the submitted project. The duration will be 15 minutes, followed by 5 minutes of deliberation.
### Agenda
- Students give a (roughly) 5-minute presentation of their completed mini-project
- Students randomly draw a topic (from the 6 topics listed below) and explain the basic concepts within that topic
- Examiner and censor ask follow-up questions, which may relate to other course topics and include questions about practical applications and implementation; students can optionally refer back to their mini-project
## Assessment
Students are not required to write code during the exam, nor to remember any specific commands or code syntax, but may be asked to draw diagrams or solve small tasks manually. The grade will be based on an overall assessment of both the mini-project and oral performance, and in accordance with the 7-point grading scale.
## Pre-approved aids
Notes (related to literature, slides, notes from the module, and project documentation) are permitted. Do note that if you read directly from notes or copy them verbatim, you may be asked to put the notes away. Answers based solely on reading from notes will result in failure.
## Prerequisites for participation
Timely hand-in of project documentation.
## Topics
Note that the questions listed below are examples and may be formulated differently during the exam. An exhaustive explanation of each topic is not expected. While ideally students should be able to discuss each topic covered in the course, this is not required to pass the exam. The number and formulation of questions can serve as an indicator of the importance and expected level of knowledge for each topic.
### 1. Interacting with APIs
- What is the primary purpose of APIs in software development? How do they enable standardized communication between applications?
- Explain the three pillars of APIs: network fundamentals (IP addresses, domains, ports), HTTP protocol & methods, and standards & design principles
- Walk through the components of an HTTP request (request line, headers, body) and response (status line, headers, body).
- What are the key request headers required for API authentication and content specification?
- Explain the difference between GET and POST methods. When would you use each for AI API interactions?
- What are the key components of interacting with APIs using Python's requests library? Explain the principles of proper error handling and API key management
- What is rate limiting and why is it important for AI APIs? Compare different rate limiting strategies
- Compare traditional request-response APIs and streaming APIs. Why is streaming preferred for conversational AI applications?
### 2. Building APIs
- Explain the concept of routes in FastAPI and how GET and POST methods are used.
- What are URL parameters and how do they enable dynamic request handling?
- Explain the role of Pydantic data models in FastAPI for request/response validation. Why is this important for API reliability?
- Explain how API versioning can be implemented in FastAPI. Why is maintaining backwards compatibility important?
- What are the key considerations when integrating AI models (like image classification) into FastAPI servers? Why might asynchronous operations be important?
- What are the principles of implementing API key authentication in FastAPI? What are the security considerations?
- Explain why database integration (using SQLAlchemy) is important for user management and API usage tracking.
### 3. Computing architecture & hardware
- Computer architecture fundamentals
- Explain the Von Neumann architecture and its main components. How do modern computers (including "AI computers") relate to this 80-year-old architecture?
- What is the difference between instructions and data in a computer system? Why does the Von Neumann architecture store both in the same memory?
- Explain the roles of the Control Unit (CU) and Arithmetic Logic Unit (ALU) in a CPU. Use an analogy to illustrate their relationship
- Describe the role of bus systems in computer architecture. What are the three main types of buses and what does each carry?
- AI computing hardware
- Why are CPUs designed for sequential processing? What makes this approach less suitable for AI workloads?
- Explain the difference between sequential and parallel processing using an analogy. Why do AI models (especially neural networks) benefit from parallel processing?
- What is the memory bus bottleneck for AI workloads? Why is memory bandwidth more critical than latency for AI computing?
- Why are GPUs particularly well-suited for AI computing? Explain in terms of core architecture and memory design
- Compare GPUs, TPUs, and NPUs in terms of their design goals, strengths, and typical use cases
- How does specialized AI hardware (GPU, TPU, NPU) relate to the Von Neumann architecture at the system level? Does it fundamentally replace the Von Neumann architecture?
- What factors should you consider when choosing hardware for different AI applications (training vs inference, data center vs edge device)?
### 4. Containerization
- Container fundamentals
- What deployment problem do containers solve? Explain the "it works on my machine" syndrome and how containerization addresses it
- What is a container and how does it achieve isolation? Explain the benefits of containers over traditional deployment approaches
- Describe the layered structure of container images using an analogy. How does this contribute to efficiency and reusability?
- What is a Dockerfile and why is it the preferred approach for building container images? How does it relate to the layered structure and reproducibility?
- What are the main components of the Docker ecosystem? Explain the roles of Docker Engine, CLI, Dockerfile, and registries like Docker Hub
- What is the purpose of container registries, and how do they enable image distribution? Compare different registry options
- Practical implementation
- Explain how containers handle port mapping. Why is this important for deploying web applications and API servers?
- How do you manage persistent storage in containerized applications? Why is this important for AI applications with databases or model files?
- How should you handle configuration in containerized applications to avoid hardcoding them in the code? What approaches are available?
- How does the order of instructions in a Dockerfile relate to the layered structure of images? What are the important layers you will consider when containerizing an AI API server?
- What is the difference between data stored in container image layers versus data stored in volumes? Why does this distinction matter for large files like AI models or databases?
### 5. Cloud deployment
- What is cloud computing and how does virtualization enable cloud infrastructure?
- Compare virtual machines, container services, GPU instances, and managed AI services. What are the trade-offs?
- What are the main steps and considerations when deploying a containerized AI service on a cloud virtual machine?
- How would you choose between different cloud providers for your AI deployment? Consider factors like pricing models, service offerings, and vendor lock-in
- What are the advantages and pitfalls of usage-based pricing vs fixed monthly pricing for cloud services?
- Why is HTTPS necessary for production AI APIs? Explain the user experience and security implications
- How do you obtain a domain name for your service? Compare free and paid options
### 6. Edge and self-hosted deployment
- What is edge computing and what motivates edge deployment for AI applications? Discuss latency, privacy, bandwidth, and offline operation
- What is self-hosted deployment and how does it differ from cloud deployment? Discuss the economics, control, and practical considerations
- Compare different hardware options for edge/self-hosted deployment: Raspberry Pi, NVIDIA Jetson, repurposed laptops, and purpose-built servers
- What are CPU architecture considerations when deploying containers to edge devices? Explain the ARM vs x86 challenge
- How do you access services on your local network vs making them publicly accessible? What is NAT and why does it matter?
- What are the different approaches for making self-hosted services publicly accessible? What are the trade-offs to consider when choosing an approach?
View file
@ -0,0 +1,316 @@
+++
title = "C.9-High Availability & Reliability"
date = 2025-10-26
description = ""
+++
> **TL;DR:**
> When your service goes down, users leave. Learn how to measure availability with industry-standard metrics (MTBF, MTTR), and implement practical strategies like redundancy and backups to keep your AI API running reliably.
In October 2025, millions of people worldwide woke up to find ChatGPT unresponsive. Snapchat wouldn't load. Fortnite servers were down. Even some banking apps stopped working. All thanks to [a single issue in an AWS data center](https://9to5mac.com/2025/10/20/alexa-snapchat-fortnite-chatgpt-and-more-taken-down-by-major-aws-outage/) that cascaded across hundreds of services. For over half a day, these services were unavailable, and there was nothing users could do except wait, or find alternatives.
![](aws-outage.png)
Now imagine this happens to your AI API server. You've successfully deployed it to the cloud following [Cloud Deployment](@/ai-system/cloud-deployment/index.md), users are accessing it, and everything seems great. Then at 2 AM on a Saturday, something breaks. How long until users give up and try a competitor's service? How many will come back? In today's world where alternatives are just a Google search away, reliability is essential for survival.
## Understanding High Availability
This is where availability comes in. It's the proportion of time your system is operational and ready when users need it. But "my system works most of the time" isn't a useful metric when you're trying to run a professional service. How do you measure availability objectively, and what targets should you aim for?
### Measuring Availability
When you tell someone "my service is reliable," what does that actually mean? Does it fail once a day? Once a month? Once a year? And when it does fail, does it come back in 30 seconds or 3 hours? Without objective measurements, "reliable" is just a feeling. You can't promise it to users or improve it in an organized way.
The industry uses standard metrics to measure and talk about availability. Understanding these metrics helps you answer key questions: Is my system good enough for users? Where should I focus my efforts? How do I compare to competitors?
#### Mean Time Between Failures ([MTBF](https://www.ibm.com/think/topics/mtbf))
The first metric tells you how long your system typically runs before something breaks.
Think of MTBF like a car's reliability rating. One car runs 50,000 miles between breakdowns, while another only makes it 5,000 miles. The first car has a much higher MTBF, which means it fails less frequently. The same concept applies to your AI service. Does it run for days, weeks, or months between failures?
MTBF is calculated by observing your system over time:
```
MTBF = Total Operating Time / Number of Failures
```
For example, if your AI API server runs for 720 hours (30 days) and experiences 3 failures during that period:
```
MTBF = 720 hours / 3 failures = 240 hours
```
This means on average, your system runs for 240 hours (10 days) between failures. The higher this number, the more reliable your system.
For AI systems specifically, failures might include model crashes, server running out of memory, dependency issues, network problems, or database corruption. Each of these counts as a failure event that reduces your MTBF.
#### Mean Time To Repair ([MTTR](https://www.ibm.com/think/topics/mttr))
MTBF tells you how often things break, but MTTR tells you how quickly you can fix them when they do.
MTTR measures your "time to recover" - in other words, from the moment users can't access your service until the moment it's working again. This includes detecting the problem, diagnosing what went wrong, applying a fix, and checking that everything works.
```
MTTR = Total Repair Time / Number of Failures
```
Using our previous example, suppose those 3 failures took 2 hours, 1 hour, and 3 hours to fix:
```
MTTR = (2 + 1 + 3) hours / 3 failures = 2 hours
```
Why does MTTR matter so much? Because modern research shows that downtime is very expensive. [ITIC's 2024 study](https://itic-corp.com/itic-2024-hourly-cost-of-downtime-report/) found that 90% of medium and large businesses lose over $300,000 for every hour their systems are down. Even for smaller operations, every minute of downtime means frustrated users, lost revenue, and damaged reputation.
For your AI API server, MTTR includes several steps. First, you notice something is wrong (through monitoring alerts or user complaints). Then you remote into your server and check logs. Next, you identify the root cause. Then you add the fix and check that it works. Finally, you confirm that users can access the service again. The faster you can complete this cycle, the lower your MTTR and the better your availability.
![](mttr-process.png)
#### The Availability Formula
Here's where these metrics come together. Availability combines both how often failures happen (MTBF) and how quickly you recover (MTTR):
```
Availability = MTBF / (MTBF + MTTR) × 100%
```
For example, suppose your AI API server has:
- MTBF = 240 hours (fails every 10 days on average)
- MTTR = 2 hours (takes 2 hours to fix on average)
```
Availability = 240 / (240 + 2) × 100%
= 240 / 242 × 100%
= 99.17%
```
Your service has 99.17% availability, meaning it's operational 99.17% of the time.
This formula reveals a crucial insight about improving availability. You can either make failures rarer (increase MTBF) by writing better code, using more reliable hardware, or adding redundancy. Or you can recover faster (decrease MTTR) by implementing better monitoring, automating recovery processes, or having clear runbooks for common problems.
In fact, for many systems, improving MTTR gives you more bang for your buck. It's often easier to detect and fix problems faster than to prevent every possible failure.
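To get a feel for this trade-off, here's a small sketch you can run in any shell with `bc` installed, using the hypothetical numbers from above:
```bash
# Availability = MTBF / (MTBF + MTTR) × 100
avail() { echo "scale=3; $1 * 100 / ($1 + $2)" | bc; }

avail 240 2   # baseline:                        99.173%
avail 480 2   # twice the MTBF (rarer failures): 99.585%
avail 240 1   # half the MTTR (faster recovery): 99.585%
```
Doubling MTBF and halving MTTR land on exactly the same availability: the formula weighs them equally, so the real question is which one is cheaper for you to improve, and for many teams that's MTTR.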
> **Videos:**
> - [MTBF and MTTR visualized](https://www.youtube.com/watch?v=qlegqBZor4A)
> - [System reliability metrics](https://www.youtube.com/watch?v=BQXnKpP2lrI)
> **Extended Reading:**
> To learn more about reliability metrics:
> - [Atlassian's Guide to Incident Metrics](https://www.atlassian.com/incident-management/kpis/common-metrics) explains MTBF, MTTR, and related metrics used by modern software teams
> - [AWS: Distributed System Availability](https://docs.aws.amazon.com/whitepapers/latest/availability-and-beyond-improving-resilience/distributed-system-availability.html) explores how availability metrics apply to distributed systems
### Uptime Targets and SLAs
Now that you understand how to measure availability, you need to know what counts as "good." Is 99% availability impressive? Or unacceptably low? The industry has developed a standard vocabulary for talking about availability targets, centered around the concept of "nines."
#### The "Nines" System
Availability is usually expressed as a percentage with a certain number of 9s. More 9s mean better availability, but each additional nine becomes much harder and more expensive to achieve.
Here's what each level actually means in practice:
| Availability | Common Name | Downtime per Year | Downtime per Month | Downtime per Day |
|--------------|-------------|-------------------|--------------------|--------------------|
| 99% | "Two nines" | 3.65 days | 7.2 hours | ~14.4 minutes |
| 99.9% | "Three nines" | 8.76 hours | 43.8 minutes | ~1.4 minutes |
| 99.99% | "Four nines" | 52.6 minutes | 4.4 minutes | ~8.6 seconds |
| 99.999% | "Five nines" | 5.3 minutes | 26 seconds | ~0.9 seconds |
In the industry, "five nines" (99.999% availability) is often called the gold standard. It sounds impressive to promise your users less than 6 minutes of downtime per year. Some critical systems, like emergency services, air traffic control, or financial trading platforms, really need this level of reliability.
But even [Google's senior vice president for operations has publicly stated](https://iamondemand.com/blog/high-availability-of-your-cloud-expectations/), "We don't believe Five 9s is attainable in a commercial service, if measured correctly." Why? Because achieving five nines requires several things. You need redundant systems at every level with no single points of failure. You need automatic failover mechanisms that work flawlessly. You need 24/7 monitoring and on-call engineering teams. You need geographic distribution to survive data center outages. And you need extensive testing and disaster recovery procedures.
The cost grows very quickly with each additional nine. Going from 99.9% to 99.99% might double your infrastructure costs. Going from 99.99% to 99.999% might triple them again. For most services, especially AI systems that aren't mission-critical, this investment doesn't make business sense.
The sweet spot for many professional services is 99.9% to 99.99%. This provides good reliability that users trust, without requiring the very high costs of five nines.
#### Service Level Agreements (SLAs)
Once you've decided on an availability target, how do you communicate this commitment to users? Enter the [Service Level Agreement (SLA)](https://aws.amazon.com/what-is/service-level-agreement/), a formal promise about what level of service users can expect.
An SLA typically specifies the availability target (like "99.9% uptime per month"), the measurement period for how and when availability is calculated, remedies for missing the target (refunds, service credits), and exclusions like planned maintenance windows or user-caused issues.
For example, AWS's SLA for EC2 promises 99.99% availability. If they fail to meet this in a given month, customers receive service credits: 10% credit for 99.0%-99.99% availability, 30% credit for below 99.0%. This financial penalty motivates AWS to maintain high availability while providing compensation when things go wrong.
For your AI service, an SLA serves several purposes. It builds trust. Users need to know what to expect, and "we guarantee 99.9% uptime" is more reassuring than "we try to keep things running." It sets expectations. Users understand that some downtime is normal. If you promise 99.9%, users know that occasional brief outages are part of the deal. In a crowded AI market, a strong SLA can set you apart from competitors who make no promises. SLAs also give your team clear targets to design and operate toward.
Choosing the right SLA involves balancing user expectations with costs. A student project or internal tool might not need any formal SLA. A business productivity tool should promise at least 99.9%. Critical healthcare or financial AI applications might need 99.99% or higher. Also, it's better to promise 99.9% and consistently exceed it than to promise 99.99% and frequently fall short.
### What Downtime Actually Costs
#### The Obvious Cost
When your service is down, you can't process requests. No requests means no revenue. For a simple AI API charging $0.01 per request and serving 1,000 requests per hour:
```
1 hour down = 1,000 requests lost × $0.01 = $10 lost
8.76 hours/year (99.9% uptime) = ~$88 lost per year
```
That doesn't sound too bad, right? But this is just direct revenue, and for most services, it's actually the smallest part of the cost. Recent research reveals the true scale of downtime costs. [Fortune 500 companies collectively lose $1.4 trillion per year to unscheduled downtime, which represents 11% of their revenue](https://www.theaemt.com/resource/the-true-cost-of-downtime-2024-a-comprehensive-analysis.html). For [41% of large enterprises, one hour of downtime costs between $1 million and $5 million](https://itic-corp.com/itic-2024-hourly-cost-of-downtime-part-2/). Industry-specific costs are even more dramatic. The [automotive industry loses $2.3 million per hour (that's $600 per second)](https://www.theaemt.com/resource/the-true-cost-of-downtime-2024-a-comprehensive-analysis.html). [Manufacturing loses $260,000 per hour on average](https://www.pingdom.com/outages/average-cost-of-downtime-per-industry/). [Financial services and banking often see losses exceeding $5 million per hour](https://www.erwoodgroup.com/blog/the-true-costs-of-downtime-in-2025-a-deep-dive-by-business-size-and-industry/). For smaller businesses, the numbers are smaller but still significant. A [small retail or service business might lose $50,000 to $100,000 per hour](https://www.erwoodgroup.com/blog/the-true-costs-of-downtime-in-2025-a-deep-dive-by-business-size-and-industry/), while even [micro businesses can face losses around $1,600 per minute](https://www.encomputers.com/2024/03/small-business-cost-of-downtime/).
#### The Hidden Costs
Beyond immediate lost revenue, downtime creates ongoing costs that persist long after your service comes back online.
When users can't access your service, they don't just wait patiently. They Google for alternatives, sign up for competitor services, and might never come back. Acquiring new users is expensive. Losing existing ones because of reliability issues is especially painful because they've already shown they need what you offer. Word spreads fast. "That AI service that's always down" is a label that's hard to shake. In online communities, forums, and social media, reliability complaints get louder. Even after you fix underlying issues, the reputation lingers.
During and after outages, you'll face a flood of support tickets, refund requests, and angry emails. Your team spends time on damage control instead of building new features. These labor costs add up quickly. Happy users recommend services to colleagues and friends. Frustrated users don't. Every outage represents lost word-of-mouth growth and missed opportunities for positive reviews.
> **Extended Reading:**
> To learn more about downtime impact:
> - [Siemens 2024 True Cost of Downtime Report](https://blog.siemens.com/2024/07/the-true-cost-of-an-hours-downtime-an-industry-analysis/) analyzes how unscheduled downtime affects global companies
> - [The Impact of ChatGPT's 2024 Outages](https://opentools.ai/news/openai-faces-major-outage-how-chatgpt-users-coped) examines real-world effects when a major AI service goes down
## Improving System Availability
Now you understand what availability means, how to measure it, and why it matters for your AI service. The natural next question is how you actually improve it.
The availability formula tells us there are two basic ways to improve availability. You can make failures less frequent (increase MTBF) or you can recover from failures faster (decrease MTTR). In practice, most high-availability strategies do both.
The core principle behind all availability improvements is simple: don't put all your eggs in one basket. If you have only one server and it crashes, everything stops; but if you have two servers and one crashes, the other keeps working. If you have only one copy of your database and it corrupts, your data is gone; but if you have backups, you can restore it and get back online.
In this section, we'll look at practical ways to make your AI system more reliable. We'll start by finding weak points in your architecture, places where a single failure brings everything down. Then we'll look at different types of redundancy you can add, from running multiple servers to keeping backups of your data. We'll also see how each strategy helps you recover faster when things go wrong.
### Finding and Fixing Weak Points
Imagine you have a room lit by a single light bulb. If that bulb burns out, the entire room goes dark. Now imagine the same room with five light bulbs. If one burns out, you still have light from the other four. The room stays usable while you replace the broken bulb. The single-bulb setup has what engineers call a single point of failure ([SPOF](https://en.wikipedia.org/wiki/Single_point_of_failure)), which is one component whose failure brings down the entire system.
#### What is a Single Point of Failure?
A SPOF is any component in your system that, if it fails, causes everything to stop working. SPOFs are dangerous because they're often invisible until they actually fail. Your system runs fine for months, everything seems great, and then one day that critical component breaks and suddenly users can't access your service.
![](spof-diagram.png)
We can use the AI API server deployed in [Cloud Deployment](@/ai-system/cloud-deployment/index.md) as an example to identify the potential SPOFs. If you're running everything on one virtual machine and it crashes (out of memory, hardware failure, data center issue), your entire service goes down. Users get connection errors and can't make any requests. If the database file gets corrupted (disk failure, power outage during write, software bug), you lose all your request history and any user data. The API might crash or return errors because it can't access the database. If the model file is deleted or corrupted, your API can still accept requests but can't make predictions. Every classification request fails. If the internet connection to your VM fails (ISP issue, data center network problem), users can't reach your service even though it's running perfectly. If your API calls another service (maybe for extra features) and that service goes down, your API might become unusable even though your own code is working fine.
The problem is that you might not even realize these are SPOFs until something goes wrong at 3 AM on a Saturday.
#### How to Identify SPOFs
The simplest way to find SPOFs is to walk through your system architecture and ask "what if this fails?" for every component.
Let's do this for your AI API server. What if my VM crashes? Entire service goes down. Users get connection timeouts. This is a SPOF. What if my database file corrupts? All user data lost, API probably crashes or errors. This is a SPOF. What if I delete my model file accidentally? API runs but can't make predictions. This is a SPOF. What if my Docker container crashes? If you configured `--restart unless-stopped`, it automatically restarts in seconds. Users might see brief errors during restart, but service comes back. Partial SPOF, but with quick recovery. What if the cloud provider's entire region goes offline? Everything in that region goes down, including your VM. This is a SPOF.
Drawing your architecture can make this easier. Sketch out the components (VM, container, database, model, load balancer if you have one) and the connections between them. Look for any component that doesn't have a backup or alternative path.
#### Two Ways to Handle SPOFs
Once you've identified a SPOF, you have two options. You can eliminate it or plan to recover from it quickly. The right choice depends on how critical the component is and how much you're willing to invest.
One option is to eliminate the SPOF through prevention. This means adding redundancy so that failure of one component doesn't matter. If you have two servers instead of one, the failure of either server doesn't bring down your service. The other one keeps working. This is the "increase MTBF" approach where you haven't made individual servers less likely to fail, but you've made your overall system less likely to fail. For example, instead of one VM, deploy your AI API on two VMs with a load balancer in front. When one VM crashes, the load balancer automatically sends all traffic to the other VM. Users might not even notice the failure. This makes sense when the component is critical, failures are relatively common, you can afford the extra cost, and you need high availability.
Another option is to plan for quick recovery and accept a faster MTTR. This means accepting that the SPOF exists, but preparing to fix it as fast as possible when it fails. You keep backups, write clear recovery procedures, and maybe practice restoring to make sure you can do it quickly. This is the "decrease MTTR" approach where failures still happen, but you minimize how long they last. For example, your database file is a SPOF. Instead of setting up complex database replication, you run automated daily backups to cloud storage. When the database corrupts, you have a clear procedure. Download the latest backup, replace the corrupted file, restart the container. Total recovery time is maybe 30 minutes. This makes sense when the component is expensive or complex to duplicate, failures are rare, you can tolerate some downtime, and quick recovery is feasible.
For a student project or class assignment aiming for 99% uptime, don't worry about eliminating every SPOF. Focus on quick recovery plans, keep good backups of your database, and document how to redeploy if your VM dies. Cost is nearly free, and acceptable downtime is measured in hours. For a business tool or production service targeting 99.9% uptime, eliminate critical SPOFs by running on multiple servers. Have quick recovery plans for expensive components, set up automated backups every few hours, and consider database replication for critical data. Cost is moderate and acceptable downtime is minutes to an hour. For a critical system requiring 99.99%+ uptime, you will have to eliminate SPOFs at all levels. Deploy multiple servers in different geographic regions, implement real-time database replication, and set up automated failover mechanisms. Cost is high and acceptable downtime is seconds to minutes.
> **Videos:**
> - [Single point of failure explained](https://www.youtube.com/watch?v=Iy2YqgjXtRM&pp=ygUhU2luZ2xlIHBvaW50IG9mIGZhaWx1cmUgZXhwbGFpbmVk)
> - [How to avoid SPOFs](https://www.youtube.com/watch?v=-BOysyYErLY&pp=ygUhU2luZ2xlIHBvaW50IG9mIGZhaWx1cmUgZXhwbGFpbmVk)
> **Extended Reading:**
> To learn more about SPOF identification and elimination:
> - [What is a Single Point of Failure?](https://www.techtarget.com/searchdatacenter/definition/Single-point-of-failure-SPOF) from TechTarget provides full coverage
> - [System Design: How to Avoid Single Points of Failure](https://blog.algomaster.io/p/system-design-how-to-avoid-single-point-of-failures) offers technical strategies with practical examples and diagrams
> - [How to Avoid Single Points of Failure](https://clickup.com/blog/how-to-avoid-a-single-point-of-failure/) provides practical strategies and tools
### Redundancy and Backups
We've identified where your system is at risk. Now let's talk about how to protect it. The solution comes in two related forms: running backups (redundancy) prevents downtime, and saved backups (snapshots) enable quick recovery.
Think of redundancy like having a spare key to your house. If you lose your main key, you don't have to break down the door. You just use the spare and life continues normally. Backups, on the other hand, are like having photos of everything in your house. If there's a fire, the photos don't prevent the disaster, but they help you rebuild afterward.
Both are valuable. Redundancy keeps your service running when components fail. Backups help you recover when disasters strike. Let's explore how to set up both for different parts of your AI system.
#### Hardware-Level: Multiple Servers
Instead of running your AI API on a single cloud VM, you run it on two or more VMs simultaneously. A [load balancer](https://aws.amazon.com/what-is/load-balancing/) sits in front, distributing incoming requests across all healthy servers. When one server crashes, the load balancer stops sending traffic to it and routes everything to the remaining servers, and your API keeps responding to requests. Users might not even notice the problem. That's the beauty of redundancy, that your service keeps running and you can fix the failed server later.
![](load-balancer.png)
Suppose you currently run your containerized API on one cloud VM. Here's how to add hardware redundancy. Deploy the same Docker container on a second VM, maybe in a different availability zone or even region. Set up a load balancer using [Nginx](https://nginx.org/en/docs/http/load_balancing.html), a cloud load balancer (like [AWS ELB](https://aws.amazon.com/elasticloadbalancing/)), or simple [DNS round-robin](https://en.wikipedia.org/wiki/Round-robin_DNS). Configure health checks so the load balancer pings each server periodically (like `GET /health`). If a server doesn't respond, traffic stops going to it. If your API is stateless (each request independent), this just works. If you store state, you'll need shared storage or session replication.
Running two servers costs roughly twice as much as one. But for 99.9% availability targets, this investment often makes sense. Use this approach when you need 99.9%+ availability, can afford 2x compute costs, individual server failures are your biggest risk, and traffic volume justifies multiple servers.
#### Software-Level: Multiple Containers
Instead of running one Docker container with your AI API, you run multiple containers simultaneously, possibly all on the same VM. When one container crashes, the others keep serving requests.
Container crashes are common; typical causes include memory leaks, unhandled exceptions, and resource exhaustion. Running multiple containers means one crashing doesn't take down your whole service. Docker's restart policies automatically bring crashed containers back online. While one is restarting, the other containers can handle the traffic.
You learned in [Packaging & Containerization](@/ai-system/packaging-containerization/index.md) to run your API with `docker run`. Here's how to run three instances for redundancy:
```bash
# Start three containers of your AI API
docker run -d -p 8001:8000 --restart unless-stopped --name ai-api-1 my-ai-classifier:v1.0
docker run -d -p 8002:8000 --restart unless-stopped --name ai-api-2 my-ai-classifier:v1.0
docker run -d -p 8003:8000 --restart unless-stopped --name ai-api-3 my-ai-classifier:v1.0
# Set up Nginx to load balance across them
# (Nginx config distributes traffic to localhost:8001, :8002, :8003)
```
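The Nginx side hinted at in the comments above could look roughly like this (a sketch, assuming Nginx runs on the same VM and serves plain HTTP on port 80):
```bash
# Round-robin load balancing across the three containers; a container that
# stops answering is temporarily taken out of rotation by the passive checks
sudo tee /etc/nginx/conf.d/ai-api.conf > /dev/null <<'EOF'
upstream ai_api {
    server 127.0.0.1:8001 max_fails=3 fail_timeout=30s;
    server 127.0.0.1:8002 max_fails=3 fail_timeout=30s;
    server 127.0.0.1:8003 max_fails=3 fail_timeout=30s;
}
server {
    listen 80;
    location / {
        proxy_pass http://ai_api;
    }
}
EOF
sudo nginx -t && sudo systemctl reload nginx
```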
Now if `ai-api-2` crashes, `ai-api-1` and `ai-api-3` continue serving requests. Docker will also automatically restart `ai-api-2`. The `--restart unless-stopped` flag is also important here. It tells Docker to automatically restart the container if it crashes, but not if you manually stopped it.
Running multiple containers on one VM is relatively cheap. You just need enough memory and CPU to handle all containers, which makes it much more affordable than multiple servers. Use this approach even for moderate availability targets (99%+), especially when application-level failures are common.
#### Data-Level: Backups and Replication
Data is special. When hardware fails, you buy new hardware. When software crashes, you restart it. But when data is lost, corrupted, deleted, or destroyed, it might be gone forever. Your users' data, request history, and system state represent irreplaceable information. Protecting data requires different strategies than protecting hardware or software, and usually involves backups and replication.
Backups are periodic snapshots of your data saved to a safe location. They're like save points in a video game. If something goes wrong, you can reload from the last save. Backups don't prevent failures, but they enable you to recover from them.
For your AI API with a SQLite database, below is an example of how you can automatically back up the database:
```bash
#!/bin/bash
# Simple backup script that runs daily via cron
BACKUP_DIR="/backups"
mkdir -p "$BACKUP_DIR"

# Create timestamped backup of the SQLite database
BACKUP_FILE="$BACKUP_DIR/backup-$(date +%Y%m%d-%H%M%S).tar.gz"
tar -czf "$BACKUP_FILE" /app/data/ai_api.db

# Upload to cloud storage (AWS S3 example)
aws s3 cp "$BACKUP_FILE" s3://my-backups/ai-api/

# Keep only the last 7 days locally to save space
find "$BACKUP_DIR" -name "backup-*.tar.gz" -mtime +7 -delete
```
Set this to run automatically at 2 AM every day, and now if your database corrupts at 3 PM, you have a recent backup from 2 AM. When you need to recover, download the latest backup from S3 (`aws s3 cp s3://my-backups/ai-api/backup-20250126-020000.tar.gz .`), extract it (`tar -xzf backup-20250126-020000.tar.gz`), replace the corrupted database (`mv app/data/ai_api.db /app/data/ai_api.db`), restart your container (`docker restart ai-api`), and verify the service is working. Total recovery time is about 15-30 minutes, depending on backup size and download speed. This is your MTTR for database corruption. Also, you will lose any data created between 2 AM (the last backup) and the moment the corruption happened. More frequent backups reduce data loss but consume more storage and resources.
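As a sketch, scheduling the script and the recovery steps might look like this, assuming the script is saved as `/opt/scripts/backup-ai-api.sh`, the container is named `ai-api`, and the API exposes a `/health` route (all assumptions about your setup):
```bash
# Schedule the backup daily at 02:00 by appending an entry to the crontab
( crontab -l 2>/dev/null; echo '0 2 * * * /opt/scripts/backup-ai-api.sh >> /var/log/ai-api-backup.log 2>&1' ) | crontab -

# Recovery after a corruption: fetch the latest backup and swap the database back in
aws s3 cp s3://my-backups/ai-api/backup-20250126-020000.tar.gz .
tar -xzf backup-20250126-020000.tar.gz     # extracts to ./app/data/ai_api.db
mv app/data/ai_api.db /app/data/ai_api.db
docker restart ai-api
curl http://localhost:8000/health          # verify the service responds again
```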
Security experts recommend the 3-2-1 rule for critical data: keep 3 copies of your data (the original plus two backups), on 2 different storage types (like local disk plus cloud storage), with 1 backup off-site (so it survives a building fire, flood, or other local disaster). For your AI API, this might look like keeping the original SQLite database on your cloud VM (`/app/data/ai_api.db`), a daily snapshot on the same VM but on a different disk or partition (backup 1), and another daily snapshot uploaded to cloud storage like AWS S3 or Google Cloud Storage (backup 2).

This protects against several scenarios. If you accidentally delete something, restore from backup 1 on the same VM (very fast). If the disk holding your database fails, restore from backup 1 on the other disk, or from backup 2 in cloud storage (a bit slower). If your VM is terminated, restore from backup 2 and rebuild the VM. If an entire data center fails, backup 2 is in a different region and remains accessible. The cloud storage backup is particularly important: if your entire VM is deleted (you accidentally terminate it, the cloud provider has issues, your account is compromised), your local backups disappear too. Cloud storage in a different region survives these disasters.
![](backup-321.png)
Backups enable recovery (they reduce MTTR). But [replication](https://www.geeksforgeeks.org/system-design/database-replication-and-their-types-in-system-design/) prevents downtime in the first place (it increases MTBF). With replication, you maintain two or more copies of your database that stay continuously synchronized. How does it work? The primary database handles all write operations (create, update, delete). Replica databases continuously receive updates from the primary and stay in sync. Replicas can handle read operations, spreading the load. If the primary fails, you promote a replica to become the new primary.
For your AI API, you might upgrade from SQLite (single-file database) to PostgreSQL with replication:
```
Primary PostgreSQL (VM 1) ←→ Replica PostgreSQL (VM 2)
↓ ↓
Handles writes Handles reads + standby
```
When the primary fails, your application detects the failure (connection timeout), switches to the replica (either manually or automatically), promotes the replica to primary, and service continues with minimal disruption. For this setup, recovery time is seconds to minutes with automatic [failover](https://learn.microsoft.com/en-us/azure/reliability/concept-failover-failback), instead of the 15-30 minutes needed to restore from backups. Data loss is minimal: only transactions from the last few seconds before the failure. As you can tell, compared to backups, replication offers much better MTTR, but it is more complex to set up and maintain, costs more (you need to run multiple database servers), and requires application changes (connection pooling, failover logic).
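Setting this up properly is beyond the scope of this module, but to give you a feel for it, here's a rough sketch of PostgreSQL streaming replication. The paths assume a Debian/Ubuntu PostgreSQL 16 install, and plenty of details (firewall rules, monitoring, automatic failover tooling) are omitted:
```bash
# On the primary: create a replication role and allow the replica to connect
sudo -u postgres psql -c "CREATE ROLE replicator WITH REPLICATION LOGIN PASSWORD 'change-me';"
echo "host replication replicator <replica-ip>/32 scram-sha-256" | sudo tee -a /etc/postgresql/16/main/pg_hba.conf
sudo systemctl reload postgresql

# On the replica: clone the primary and start it as a standby
# (-R writes standby.signal and the connection settings automatically)
sudo systemctl stop postgresql
sudo mv /var/lib/postgresql/16/main /var/lib/postgresql/16/main.old   # pg_basebackup needs an empty data directory
sudo -u postgres pg_basebackup -h <primary-ip> -U replicator \
  -D /var/lib/postgresql/16/main -R -P
sudo systemctl start postgresql
```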
| Approach | MTTR | Data Loss | Complexity | Cost | Best For |
| -------------- | --------------- | --------- | ---------- | ----------- | ------------- |
| Daily backups | Hours | Up to 24h | Low | Very low | 99% uptime |
| Hourly backups | 30-60 min | Up to 1h | Low | Low | 99% uptime |
| Replication | Seconds-minutes | Minimal | High | Medium-high | 99.9%+ uptime |
> **Videos:**
> - [What is a load balancer?](https://www.youtube.com/watch?v=sCR3SAVdyCc)
> - [Database replication explained](https://www.youtube.com/watch?v=bI8Ry6GhMSE)
> - [3-2-1 backup strategy](https://www.youtube.com/watch?v=rFO6NyLIP7M)
> **Extended Reading:**
> To learn more about redundancy and backup strategies:
> - [High Availability System Design](https://www.cisco.com/site/us/en/learn/topics/networking/what-is-high-availability.html) from Cisco provides full coverage of redundancy concepts
> - [Redundancy and Replication Strategies](https://www.scoredetect.com/blog/posts/redundancy-and-replication-strategies-for-high-availability) explores different approaches with practical examples
> - [Backup and Disaster Recovery Best Practices](https://solutionsreview.com/backup-disaster-recovery/backup-and-disaster-recovery-best-practices-to-consider/) offers 15 essential practices for protecting your data
>
> Popular backup tools to implement the strategies discussed:
> - [Litestream](https://litestream.io/) for SQLite and [pgBackRest](https://pgbackrest.org/) for PostgreSQL offer database-specific backup with cloud storage support
> - [Restic](https://restic.net/) and [BorgBackup](https://borgbackup.readthedocs.io/) provide general-purpose backup with deduplication and encryption
View file
@ -0,0 +1,12 @@
+++
title = "B-Infrastructure & Deployment of AI"
date = 2025-09-25
description = ""
+++
> **TL;DR:**
> In this phase, we will dive into the backend infrastructure—both software and hardware—of AI systems, glance into how they work, and learn how to deploy our AI service on infrastructure other than our own PC.
In the previous phase, [Interact with AI Systems](@/ai-system/interact-with-ai-systems/index.md), we learned how to interact with existing AI services, and even how to build your own AI service running on your laptop. But you've probably noticed that your service runs slower than most of the paid AI services you've interacted with, and makes your laptop sound like a jet engine.
To understand the reason behind the speed difference, we need to dive into the hardware infrastructure supporting the running of AI models and services ([AI compute hardware](@/ai-system/ai-compute-hardware/index.md)). We will then explore the wide world of different types of hardware infrastructure for us to deploy our AI service on, so that we can free our laptop from heavy-duty work ([Cloud deployment](@/ai-system/cloud-deployment/index.md) and [Edge & self-hosted deployment](@/ai-system/edge-self-hosted-deployment/index.md)). And before we proceed to run our AI service on such infrastructure, we will also learn about the software infrastructure that is hardware-agnostic and will make deployment much easier ([Packaging & containerization](@/ai-system/packaging-containerization/index.md)).
View file
@ -0,0 +1,18 @@
+++
title = "A-Interact with AI Systems"
date = 2025-09-03
description = ""
+++
> **TL;DR:**
> Learn why standardized interactions between applications are essential for making AI models practical in real-world scenarios, moving beyond simple function calls to robust communication methods that work across different programming languages and distributed systems.
Interactions are common in both human society and the digital world. We as humans interact with each other through language so that our thoughts and purposes are communicated. In the digital world, a food delivery application interacts with restaurants to process your order and interacts with banks to ensure payment for the meal. Interactions between applications (AI or not) are what build the digital world we cannot live without. But how do applications interact with each other?
From the previous course [DAKI2 - Design og udvikling af AI-systemer](https://www.moodle.aau.dk/course/view.php?id=50253) you already know how an AI model is designed and implemented. You will even dive deeper with the parallel course [DAKI3 - Deep learning](https://www.moodle.aau.dk/course/view.php?id=57017). To some extent, you have already witnessed interactions between applications: when you feed data into the `forward` function of your AI model and retrieve its output data to perform visualization, different components of your software are interacting with each other.
> **Videos:**
> - [How Python function calls work internally](https://www.youtube.com/watch?v=nXLhpuKI4Mw)
But this form of interaction is not suitable in many real-world scenarios, as soon as our applications grow beyond a single piece of software. For example, AI models nowadays are largely written in Python, but not all software is. Needless to say, it is quite impractical to make software written in Rust call a Python function. Even if Python were the one and only programming language in the universe, we could not expect our digital world to be one giant Python project where applications are functions that can call each other.
In conclusion, to make our AI models practical in real-world use cases, we need a more standardized and streamlined means of letting them interact with the outside world, which is the subject of this phase of the course. Note that for now we will be focusing on existing AI models/systems and locally run AI models. We will leave the deployment of AI models for later phases of the course.
View file
@ -0,0 +1,83 @@
+++
title = "B.8-Mini Project"
date = 2025-10-21
description = ""
+++
Leveraging the knowledge from [phases A and B](@/ai-system/_index.md) of this course (and optionally the advanced techniques from phase C), we will develop and deploy a multi-functional AI API server in our mini project.
## Outcome
The outcome of the project is an AI API server (similar to the one we implemented in [module 3](@/ai-system/wrap-ai-with-api/index.md)), deployed on a machine other than your personal PC. A client program that can interact with the server should also be included.
### Necessary Requirements
#### Implementation of the API Server
The API server should have more than one route (a.k.a. API endpoint). You have the flexibility to plan the functionality of these routes. Note that at least one of them should incorporate "AI functionality", either image or language-related. Examples include:
- One route `<domain>/v1/image_classify` for image classification and another route `<domain>/v1/conversation` for LLM-powered natural language conversation
- Multiple versions `<domain>/v1/image_classify` and `<domain>/v2/image_classify` providing different sets of functionality
- Utility routes like `<domain>/v1/model` for listing available AI models
For API framework, although we have only learned FastAPI in this course, you have the freedom to use other frameworks (e.g., Flask, Django, or even programming languages other than Python) if you want.
For the AI model powering the AI functionality, you can totally use off-the-shelf models from HuggingFace, the model you prepared for other courses, or elsewhere; this is not the focus of this mini project.
You are free to use libraries, ask AI for help, or even reference the code I included in the blog posts. But you shouldn't directly copy and paste any code (implemented by others or by AI) without any modification, especially if you have no idea what the code means. In other words, I am not against reusing existing tools and code, since that is common practice in software development, but you have to ensure that you understand your implementation.
#### Deployment of the API Server
In general, you can deploy the API server on any machine as long as it is not the same machine the client program is going to run on. The actual purpose of this requirement is that the client program and the server should live in different host environments, so that knowledge of interaction over the network is required. In practice you can interpret this requirement quite flexibly; below is a list of examples that are all accepted:
- You run the server on your personal computer, and run the client on your colleague/friend's personal computer
- You have two personal computers, with one of them running the server and another one running the client
- You run the server on a Raspberry Pi/NVIDIA Jetson, and run the client on your personal computer, or the other way around
- You run the server on a cloud computer, and run the client on your personal computer, or the other way around
- You run both server and client on the same physical machine, but they are in different host environments: one of them is in a virtual machine or each is in a different virtual machine
The server should be deployed using containerization technique we learned in [module 5](@/ai-system/packaging-containerization/index.md). In other words, you shouldn't run the server program directly on the host environment of the machine. You have the freedom to use container frameworks other than Docker (e.g., Podman).
#### Client Program and Interaction with the Server
You also should prepare a simple client program for validating that the server is functioning correctly. There is no strict requirement for any aspect of this client program, as long as it can demonstrate the functionality of the API server you deployed.
### Tips
These are not strict requirements, but are aspects that you probably should consider to demonstrate your understanding of the knowledge covered in this course:
- API endpoint design that adheres to REST principles
- API versioning considerations, even if you plan to only have `v1` endpoints
- Integration with databases in the API server for API key management
- Leveraging dedicated AI computing hardware if the machine has any
- Building the server container image with a Dockerfile and proper layering
### Optional Achievements
For folks who feel the above requirements are too easy, here are some examples of how you can incorporate advanced techniques into your mini project to amaze your classmates. You probably have even crazier ideas if you are really considering implementing these: the sky is the limit here.
- Record per user usage and implement advanced rate limit algorithm for your API endpoints with AI functionalities
- Leverage the production-ready techniques we are going to learn in the last phase to make your server highly available (e.g., run it on a cluster) or implement advanced deployment strategies
- Make your server publicly available, by giving it a publicly accessible IP address, a domain, and proper SSL certificate
- Make your API server a drop-in replacement for OpenAI/Anthropic's API families by implementing multi-modal conversational APIs that can receive both natural language and image input and generate language output
I do want to note that, to keep it fair, incorporating these optional achievements in the project does not directly grant you a higher score than those who don't include them. The purpose of the project is to reflect your understanding of the knowledge covered in this course, and the course is graded based on the oral exam, not directly on the project outcome.
## Report
As you probably already know, you need to submit a report for this mini project. The format of the report is as follows.
3-8 pages, excluding references, containing:
- Title and all authors
- Introduction: a short problem analysis
- Implementation: Explain important design and implementation choices of the API server and the client program
- Deployment: Demonstrate the important steps of deploying the API server
- Results: Evaluation of the API server's functionality and your reflections
- Conclusion
> **Tip:** The report can provide a concise explanation of your key design choices, the rationale behind them, and the deployment process you followed. Focus on demonstrating your understanding of the system you built and showing that it achieves its intended objectives. Think of it as walking the reader through your thought process and the outcomes of your work, rather than providing exhaustive code documentation.
## Submission
In the end, you should submit one `.zip` or `.tar.gz` (or other open file bundle formats) file containing:
- The report in PDF format
- All source code necessary to build the API server container, including the source code of the API server, the `Dockerfile`, among others if needed (e.g., `requirements.txt`)
- Source code of the client program
**One submission per group** should be uploaded in DigitalExam no later than: December 4th 23:59, 2025 (Copenhagen time).


View file

@ -0,0 +1,526 @@
+++
title = "B.5-Packaging & Containerization"
date = 2025-10-02
description = ""
+++
> **TL;DR:**
> Ever struggled with "it works on my machine" syndrome? Learn how containers solve deployment headaches by packaging your applications with everything they need to run consistently anywhere. We'll explore Docker fundamentals and build a containerized AI API server.
Now you understand (from [AI compute hardware](@/ai-system/ai-compute-hardware/index.md)) that most computers, from your smartphone and laptop, to those overpriced "AI computers" sold in electronics stores, to cloud servers you can rent for the price of a car per year, adhere to the same computer architecture. It should come as no surprise that in most cases, you can install a general-purpose operating system (Linux, Microsoft Windows, or others) on them and use them as you would use your PC.
But before you go ahead, deploy your AI API server on all kinds of machines, and start dreaming about earning passive income by charging other developers for using your server, we will learn techniques that make large-scale deployment much easier.
Recall the AI API server we implemented in [Wrap AI Models with APIs](@/ai-system/wrap-ai-with-api/index.md). You probably ran it directly on your machine, installed the required Python packages locally, and hoped everything would work a few months later. But what happens when you update your machine's operating system, or you want to deploy it on a different machine with a different operating system, or when your group members try to run it but have conflicting Python versions? Saying *it (used to) work on my machine* certainly doesn't help.
![Works on my machine](works-on-my-machine.png)
Are there techniques that can ensure that the runtime environment for our programs is consistent regardless of the operating system and OS-level runtime, so we can deploy our programs to any computer with the confidence that they just work? Yes, and you guessed it: packaging and containerization techniques.
The idea is, instead of trying to replicate the runtime every time we deploy our programs on a new machine (that is, installing and configuring everything needed depending on the machine and operating system), we first *pack* the software along with its runtime into a portable *container* that will always run with a single command regardless of which machine it runs on. This process can take extra effort at first, but will save a lot of headaches in later large-scale deployment.
## Basics of Containers
Before we learn how to pack our software into a container, we first need to get familiar with a few essential concepts related to containers.
### What are Containers?
A [container](https://www.docker.com/resources/what-container/) is a lightweight package that includes everything needed to run a piece of software: the code, runtime environment, libraries, and configurations. Containers are self-contained units, meaning that they can run consistently anywhere.
Traditionally, deploying software on a machine meant installing the software itself and all its dependencies directly on the machine. Not only can this be tedious when deploying complex software at scale, it can also lead to lots of problems. For example, two pieces of software on the same machine might require different versions of the same library, and an update to the operating system can break the carefully configured environment.
Containers solve this by creating isolated environments that package your application with everything it needs to run. Each container acts like a sealed box with its own filesystem, libraries, and configurations, which are completely separate from other containers and the native environment of the machine. Yet they're incredibly efficient, starting in seconds and using minimal resources.
Think of containers like this: at a traditional Chinese dinner, everyone shares dishes from the center of the table. But, what if one person needs gluten-free soy sauce while another needs regular? What if someone accidentally adds peanuts to a shared dish when another guest has allergies? Containers are like giving each person their own Western-style plated meal with exactly the seasonings and portions they need. No sharing conflicts, no contamination between dishes, and everyone gets precisely what works for them, while still sitting at the same table.
![Container analogy](container-analogy.png)
The benefits of containers quickly made containerization the industry standard for large-scale software deployment. Today, there is a very high chance that one of the applications you use every day is running in containers. It is [reported](https://www.docker.com/blog/2025-docker-state-of-app-dev/) that by 2025, container usage in the IT industry has reached 92%. With the help of containers, companies can deploy updates without downtime, handle more users by scaling automatically, and run the same software reliably across different hardware infrastructures.
> **Videos:**
> - [Containerization explained](https://www.youtube.com/watch?v=0qotVMX-J5s)
> **Extended Reading:**
> For those curious about the differences between containers and [virtual machines (VMs)](https://en.wikipedia.org/wiki/Virtual_machine): virtual machines create complete simulated computers, each running its own full operating system. It's like building separate restaurants for each type of cuisine, each with its own kitchen, dining room, storage, and utility systems. Containers, on the other hand, share the host's operating system kernel while maintaining isolation. This makes containers much lighter and faster to start than VMs. Nevertheless, VMs provide stronger isolation for certain security-critical applications, just like separate restaurants offer more complete separation for health code or dietary law compliance, so they still have their use cases.
### How Containers Work?
The secret sauce of containers' efficiency and flexibility is a clever [layering system](https://docs.docker.com/get-started/docker-concepts/building-images/understanding-image-layers/).
Imagine you're building different types of hamburgers. You start with a basic bottom bun. Then you add a beef patty as the next layer. For a cheeseburger, you add a cheese layer on top. For a deluxe burger, you might add lettuce, tomato, and special sauce as additional layers. Instead of preparing everything from scratch each time, you can reuse the same foundation (bun and patty) and just add the unique toppings that make each burger special.
Similarly, each container image is a stack of layers. Each layer represents a set of changes on top of the previous one. For example, you might have a container image with 4 layers:
```
1. Add Python runtime
2. Install libraries
3. Copy your application code
4. Configure the startup commands
```
Since container images on one machine usually have layers in common, especially base layers such as the Python runtime, containers share those common layers so that only one copy of each layer exists. This means duplicate layers do not have to be stored, which saves storage space. Also, updating a container doesn't involve rebuilding the whole image, just the layers that have been modified.
![Container layers](container-layers.png)
> **Extended Reading:**
> When a container runs, it obviously needs to modify files in the layers, for example to store temporary data. It might seem that this would break the reusability of layers. Thus, there is actually a [temporary writable layer](https://medium.com/@princetiwari97940/understanding-docker-storage-image-layers-copy-on-write-and-how-data-changes-work-caf38c2a3477) on top of the read-only layers when a container is running. All changes happen in this writable layer while the container runs, and the underlying layers of the image itself are untouched.
>
> Interested in other working mechanisms of containers? When a read-only file in the image layers is modified, the container will use a "copy-on-write" strategy: copying the file to the writable layer before making changes. This is made possible with union filesystems (like [OverlayFS](https://jvns.ca/blog/2019/11/18/how-containers-work--overlayfs/)) that merge multiple directories into a single view.
### Container Frameworks
While containers as a concept have existed since the early days of Linux, it was [Docker](https://www.docker.com/) that made them accessible to most developers. Docker provides a comprehensive toolkit for working with containers, which includes:
- [**Docker Engine**](https://docs.docker.com/engine/), the core runtime that manages containers running on a machine
- [**Docker CLI**](https://docs.docker.com/reference/cli/docker/), providing commands for managing containers like `docker run`
- [**Dockerfile**](https://docs.docker.com/reference/dockerfile/), a plain-text recipe for building container images that corresponds to the layered system of images
- [**Docker Hub**](https://hub.docker.com/), a cloud registry for sharing container images
With the toolkit provided by Docker, in many cases you can quickly start using containers without building a container image yourself or even understanding the inner workings of containers. There is a high chance that the software you need is already available on Docker Hub, say the PostgreSQL database, and you can spin up a container with the Docker CLI command `docker run postgres`. Docker will pull the already-built `postgres` image from Docker Hub and run a container instance of the image.
If you need to build a custom image, a Dockerfile lets you easily use one of the existing images as the base image, so you only have to define the customized part of your image. We will dive into how to write a Dockerfile below.
Under the hood, Docker uses a client-server architecture that separates what you interact with from what actually does the work. When you use the Docker CLI, you are interacting with the Docker Client, a program that takes your commands and sends the corresponding requests to the Docker Daemon. The Docker Daemon is a background service that does the actual work of managing containers, like a server running in the background. As you would imagine, the Docker Client and the Docker Daemon don't necessarily have to run on the same machine, a common relationship between clients and servers.
> **Videos:**
> - [Container images explained](https://www.youtube.com/watch?v=wr4gpKBO3ug)
> - [Docker introduction](https://www.youtube.com/watch?v=Gjnup-PuquQ)
> **Extended Reading:**
> There are alternative container frameworks to Docker, such as:
>
> - [**Podman**](https://podman.io/), which runs without a background daemon. This can provide better security. It also provides nearly identical CLI commands to Docker, so it is a drop-in replacement in most cases
> - [**containerd**](https://containerd.io/), which is what Docker actually uses under the hood. It is a minimal runtime that's the default for Kubernetes. Perfect when you just need to run containers without extras
>
> As containers became popular, the industry recognized the need for standards. The [Open Container Initiative (OCI)](https://opencontainers.org/) created universal standards for container image format and runtime behavior. This means containers built with any OCI-compliant tool will run on any OCI-compliant runtime. For example, a container built by Docker can run flawlessly in Podman's runtime, and vice versa.
## Use Containers
Now that we understand what containers are and how they work, let's get hands-on experience using them. We'll use a Python container as our running example.
Before we begin, you'll need to install Docker on your machine (if you haven't already). Recall that Docker uses a client-server architecture we discussed earlier. When you install Docker, you're getting both the Docker Client (the command-line interface you'll use) and the Docker Daemon (the background service that actually manages containers). Docker Desktop provides both components along with a user-friendly interface for Windows and macOS, while Linux users typically install Docker Engine directly. The installation process varies by platform, so follow the [official Docker installation guide](https://docs.docker.com/get-docker/) for your operating system.
### Images
#### Pulling Images from Registries
Before we can run a container, we need to get a container image. The easiest way is to pull a pre-built image from a public registry like Docker Hub.
```bash
docker pull python:3.11
```
This command downloads the official Python 3.11 image to your local machine. The format is `repository:tag`, where `python` is the repository name and `3.11` is the tag that specifies the version. If you omit the tag, Docker defaults to `latest`.
You can also pull specific variants of images:
```bash
docker pull python:3.11-slim # Smaller image with minimal packages
docker pull python:3.11-alpine # Even smaller, based on Alpine Linux
```
> **Extended Reading:**
> Docker Hub is just one of many container registries. Other popular options include [GitHub Container Registry](https://docs.github.com/en/packages/working-with-a-github-packages-registry/working-with-the-container-registry) (ghcr.io) and [Google Container Registry](https://cloud.google.com/artifact-registry/docs) (gcr.io). To pull from these registries, simply include the registry domain in your image path:
>
> ```bash
> docker pull ghcr.io/joeferner/redis-commander:latest
> docker pull gcr.io/kaniko-project/executor:latest
> ```
#### Managing Images
Once you start working with containers, you'll accumulate various images on your machine. Here are the essential commands to manage them:
```bash
# List all images on your machine
docker images
# Get detailed information about a specific image
docker inspect python:3.11
# Remove an image (only if no containers are using it)
docker rmi python:3.11
# Remove unused images to free up space
docker image prune
```
The `docker images` command shows useful information like image size, creation date, and unique image IDs. You'll notice that similar images often share layers, which is why the total size of multiple Python images might be less than you'd expect.
### Running Containers
#### Basic Operations
To run a container, the basic command is simple:
```bash
docker run python:3.11
```
This creates and starts a new container from the Python image. However, this container will start and immediately exit because there's no long-running process to keep it alive. We can try something more useful:
```bash
# Run a simple Python command
docker run python:3.11 python -c "print('Hello from container!')"
# Run a container in the background that will run for one hour if not stopped manually
docker run -d python:3.11 python -c "import time; time.sleep(3600)"
```
The `-d` flag runs the container in "detached" mode, which means it runs in the background without blocking your terminal. This can be useful for running long-lasting programs.
#### Interactive Mode
Often you'll want to interact with a container directly, like getting a shell inside it to explore or debug:
```bash
# Get an interactive Python shell inside the container
docker run -it python:3.11 python
# Get a bash shell to explore the container
docker run -it python:3.11 bash
```
The `-it` combination gives you an **i**nteractive **t**erminal. Once inside, you can install packages, run scripts, or explore the filesystem just like you would on any Linux machine.
#### Port Mapping
For web applications or APIs, you'll need to expose ports so you can access them from your host machine:
```bash
# Run a simple HTTP server and map port 8000
docker run -p 8000:8000 python:3.11 python -m http.server 8000
```
The `-p 8000:8000` maps port 8000 inside the container to port 8000 on your host machine. The format is `host_port:container_port`. Now you can visit `http://localhost:8000` in your browser to access the server running inside the container.
You can also map to different ports:
```bash
# Map container port 8000 to host port 3000
docker run -p 3000:8000 python:3.11 python -m http.server 8000
```
#### Sharing Files and Configuration
Containers are isolated by default, which means files created inside them are lost once the container is removed. To persist data or share files between your host machine and containers, use volume mounting:
```bash
# Mount the current directory to /app inside the container
docker run -v $(pwd):/app python:3.11 ls /app
# Mount a specific file
docker run -v $(pwd)/script.py:/script.py python:3.11 python /script.py
```
The `-v` flag creates a **v**olume mount with the format `host_path:container_path`. Now files you create or modify in `/app` inside the container will actually be stored in your current directory on the host machine.
For configuration, you'll often need to pass environment variables to containers:
```bash
# Set environment variables
docker run -e DEBUG=true -e API_KEY=your_key python:3.11 python -c "import os; print(os.environ.get('DEBUG'))"
# Load environment variables from a file
docker run --env-file .env python:3.11 python -c "import os; print(os.environ)"
```
This is particularly useful for configuring database connections, API keys, or feature flags without hardcoding them into your application.
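On the application side, picking these values up is just a matter of reading the environment. A small sketch (the variable names match the `docker run -e` example above):

```python
import os

# Values passed with `docker run -e DEBUG=true -e API_KEY=...` show up here
DEBUG = os.environ.get("DEBUG", "false").lower() == "true"
API_KEY = os.environ.get("API_KEY")

print(f"Debug mode: {DEBUG}, API key configured: {API_KEY is not None}")
```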
### Managing Containers
#### Container Lifecycle
Containers have a lifecycle just like any other process. Here are the essential commands for managing running containers:
```bash
# List running containers
docker ps
# List all containers (including stopped ones)
docker ps -a
# Stop a running container
docker stop <container_id_or_name>
# Start a stopped container
docker start <container_id_or_name>
# Restart a container
docker restart <container_id_or_name>
# View container logs
docker logs <container_id_or_name>
# Follow logs in real-time
docker logs -f <container_id_or_name>
```
Fun fact: you don't need to type the full container ID. Just the first few characters are enough, as long as they're unique.
You can also give your containers meaningful names:
```bash
# Run a container with a custom name
docker run --name my-python-app -d python:3.11 python -c "import time; time.sleep(300)"
# Now you can reference it by name
docker logs my-python-app
docker stop my-python-app
```
#### Executing Commands
Sometimes you need to run additional commands in a container that's already running. This is where `docker exec` comes in handy:
```bash
# Execute a single command in a running container
docker exec my-python-app python -c "print('Hello from exec!')"
# Get an interactive shell in a running container
docker exec -it my-python-app bash
# Install additional packages in a running container
docker exec my-python-app pip install requests
```
This is incredibly useful for debugging, installing additional tools, or making quick changes without recreating the entire container.
#### Cleaning Up
As you experiment with containers, you'll accumulate stopped containers and unused images. Here's how to clean up:
```bash
# Remove a specific stopped container
docker rm <container_id_or_name>
# Remove all stopped containers
docker container prune
# Remove unused images
docker image prune
# Remove everything unused (containers, images, networks, build cache)
docker system prune
```
Regular cleanup keeps your system tidy and frees up disk space.
> **Extended Reading:**
> When you need to run multiple related containers (like a web application with a database), managing them with individual `docker run` commands becomes cumbersome. [Docker Compose](https://docs.docker.com/compose/) solves this by letting you define your entire multi-container application in a single YAML file. It can also replace complex `docker run` commands even for single containers, making it easier to manage containers with lots of configuration options:
>
> ```yaml
> services:
> web:
> image: nginx
> ports:
> - "8080:80"
> database:
> image: postgres
> environment:
> POSTGRES_PASSWORD: secret
> ```
>
> With `docker compose up`, you can start all services at once. This becomes essential for complex applications where containers need to communicate with each other. Check out the [Docker Compose quickstart](https://docs.docker.com/compose/gettingstarted/) and [sample applications](https://github.com/docker/awesome-compose) to see practical examples.
## Build Containers
So far we've been using pre-built images from registries like Docker Hub. But what happens when you want to package your own application? The Python images we've used are great starting points, but they don't include your specific code, dependencies, or configuration. To deploy the image classification API server we built in [Wrap AI Models with APIs](@/ai-system/wrap-ai-with-api/index.md), we need to create our own container image that bundles everything together.
Building custom container images transforms your application from something that requires manual setup on each machine into a portable package that runs consistently anywhere. Instead of asking users to install Python, download dependencies, configure environment variables, and run multiple commands, they can simply execute `docker run your-app` and everything works.
### Interactive Approach (Not Recommended)
Before diving into the proper way to build images, let's briefly look at the manual approach to understand why it's not ideal. You could theoretically create a custom image by starting a container, installing everything manually, and then saving it using the [docker commit](https://docs.docker.com/reference/cli/docker/container/commit/) command:
```bash
# Start an interactive Python container
docker run -it python:3.11 bash
# Inside the container, manually install dependencies
pip install fastapi uvicorn transformers torch pillow sqlalchemy
# Copy your application files (you'd need to mount or copy them somehow)
# Configure everything manually...
# Exit the container, then commit it as a new image
docker commit container_id my-app:latest
```
While this technically works, it has serious drawbacks: the process isn't reproducible, there's no documentation of what was installed, it's error-prone, and you can't easily version or modify your setup. This approach is like cooking without a recipe - it might work once, but you'll struggle to recreate it consistently.
### Dockerfile: The Recipe for Container Images
The proper way to build container images is with a [**Dockerfile**](https://docs.docker.com/reference/dockerfile/): a text file containing step-by-step instructions for creating your image. Remember the layered system we discussed earlier? A Dockerfile defines exactly what goes into each layer, making the build process completely reproducible and documented.
Think of a Dockerfile as a recipe that tells Docker: "Start with this base ingredient (base image), add these components (dependencies), mix in this code (your application), and serve it this way (startup command)." Anyone with your Dockerfile can recreate the exact same image, just like anyone can follow a recipe to make the same dish.
Each instruction in a Dockerfile creates a new layer in your image. This connects directly to the efficiency benefits we discussed: if you only change your application code, Docker will reuse all the cached layers for the base image and dependencies, rebuilding only what's necessary.
#### Dockerfile Instructions
To write a Dockerfile, you need to understand the different instructions available. Each instruction tells Docker what to do during the build process, as listed below.
**Foundation Instructions** set up the basic environment:
- `FROM` specifies which base image to start from (always the first instruction)
- `WORKDIR` sets the working directory for subsequent commands
**File Operations** handle getting your code and files into the container:
- `COPY` transfers files from your host machine to the container
- `ADD` similar to COPY but with additional features like extracting archives
**Build-time Instructions** execute during image creation:
- `RUN` executes commands during the build process, like installing packages
- `ARG` defines build-time variables that can be passed during the build
**Runtime Configuration** defines how the container behaves when it runs:
- `ENV` sets environment variables that persist when the container runs
- `EXPOSE` documents which ports the application uses (for documentation only)
- `VOLUME` defines mount points for persistent or shared data
**Execution Instructions** control what happens when the container starts:
- `CMD` provides default command and arguments (can be overridden)
- `ENTRYPOINT` sets the main command that always runs (harder to override)
Instructions in a Dockerfile help you structure your image logically: start with foundation, add your files, configure the build environment, set runtime properties, and finally define execution behavior.
#### Building the Image Classification Server
Let's containerize our image classification API server. First, we need to organize our project files:
```
my-ai-api/
├── Dockerfile
├── requirements.txt
├── main.py
└── ai_api.db (will be created)
```
Create a `requirements.txt` file listing all Python dependencies:
```txt
fastapi==0.104.1
uvicorn==0.24.0
transformers==4.35.2
torch==2.1.1
pillow==10.1.0
sqlalchemy==2.0.23
numpy<2
```
Now, here's our Dockerfile with step-by-step explanations:
```dockerfile
# Start with official Python 3.11 image (creates base layer)
FROM python:3.11-slim
# Set working directory inside container
WORKDIR /app
# Copy requirements first (for better layer caching)
COPY requirements.txt .
# Install Python dependencies (creates dependency layer)
RUN pip install --no-cache-dir -r requirements.txt
# Copy application code (creates application layer)
COPY main.py .
# Create directory for database
RUN mkdir -p ./data
# Expose port 8000 for the API
EXPOSE 8000
# Command to run when container starts
CMD ["python", "server.py"]
```
Let's break down each instruction:
- **FROM**: Specifies the base image. We use `python:3.11-slim` for a smaller footprint
- **WORKDIR**: Sets `/app` as the working directory for subsequent commands
- **COPY requirements.txt**: Copies only requirements first to leverage Docker's layer caching
- **RUN pip install**: Installs dependencies in a separate layer
- **COPY main.py**: Copies application code in its own layer
- **EXPOSE**: Documents that the container uses port 8000 (doesn't actually publish it)
- **CMD**: Defines the default command when the container starts; binding uvicorn to `0.0.0.0` makes the server reachable through the mapped port
#### Building and Running Your Container
Now let's build the image from our Dockerfile:
```bash
# Navigate to your project directory
cd my-ai-api
# Build the image with a tag
docker build -t my-ai-classifier:v1.0 .
```
Docker will execute each instruction in your Dockerfile, creating layers as it goes. Once built, run your containerized API server:
```bash
# Run the container with port mapping
docker run -p 8000:8000 my-ai-classifier:v1.0
# Or run in detached mode with a volume for persistent data (assuming the app keeps its SQLite file under /app/data)
docker run -d -p 8000:8000 -v $(pwd)/data:/app/data --name ai-server my-ai-classifier:v1.0
```
Your API server is now running in a container! You can access it at `http://localhost:8000` just like before, but now everything runs in a completely isolated, reproducible environment.
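As a quick sanity check from the host, you can hit one of the server's routes with Python's `requests` library (a sketch; it assumes the `/model/info` route from the API server we built earlier, so adjust it to whatever routes your server exposes):

```python
import requests

# The container's port 8000 is mapped to localhost:8000 on the host
response = requests.get("http://localhost:8000/model/info")
print(response.status_code, response.json())
```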
> **Extended Reading:**
> To build more efficient and maintainable container images, consider these advanced practices:
>
> - **[.dockerignore files](https://docs.docker.com/build/concepts/context/#dockerignore-files)** to exclude unnecessary files from build context
> - **[Multi-stage builds](https://docs.docker.com/build/building/multi-stage/)** for smaller production images
> - **[Dockerfile best practices](https://docs.docker.com/develop/dev-best-practices/)** for security and performance
>
> These techniques become increasingly important as your applications grow in complexity and you move toward production deployments.
#### Distributing Your Images
Now that you've built a working container image, you might want to share it with others or deploy it to production servers. Container registries serve as distribution hubs where you can publish your images for others to download and use.
To share your image, you need to push it to a registry. Let's use Docker Hub as an example:
```bash
# First, login to Docker Hub
docker login
# Tag your image with your Docker Hub username
docker tag my-ai-classifier:v1.0 yourusername/my-ai-classifier:v1.0
# Push the image to Docker Hub
docker push yourusername/my-ai-classifier:v1.0
```
The tagging step is crucial: it follows the format `registry/username/repository:tag`. For Docker Hub, you only need `username/repository:tag` since it's the default registry.
Once pushed, anyone can run your containerized API server with a single command:
```bash
docker run -p 8000:8000 yourusername/my-ai-classifier:v1.0
```
> **Videos:**
> - [Docker tutorial](https://www.youtube.com/watch?v=DQdB7wFEygo)
> **Extended Reading:**
> The same push process works for other registries like [GitHub Container Registry](https://docs.github.com/en/packages/working-with-a-github-packages-registry/working-with-the-container-registry) (`ghcr.io/username/repository:tag`) or [Google Container Registry](https://cloud.google.com/artifact-registry/docs) (`gcr.io/project/repository:tag`). Many registries also offer **automated building**: instead of building images locally, you can push your Dockerfile and source code to the registry, and it will build the image for you. This is particularly useful for [CI/CD pipelines](https://docs.docker.com/build/ci/) where you want automated builds triggered by code changes. Services like Docker Hub's [Automated Builds](https://docs.docker.com/docker-hub/builds/), [GitHub Actions with Container Registry](https://docs.github.com/en/actions/publishing-packages/publishing-docker-images), and [cloud provider build services](https://cloud.google.com/build/docs) handle the entire build process in the cloud.
## Exercise
**Containerize Your AI API Server**
Transform your image classification API server from [Wrap AI Models with APIs](@/ai-system/wrap-ai-with-api/index.md) into a portable, reproducible container that can run anywhere:
- **Write a Dockerfile**: Create a comprehensive Dockerfile using the instructions covered in [Dockerfile Instructions](#dockerfile-instructions)
- **Build and Run**: Follow the process demonstrated in [Building and Running Your Container](#building-and-running-your-container) to create your container image and run it with appropriate port mapping and volume mounting
- **Test Functionality**: Verify that your containerized API server works identically to the original version, with all endpoints accessible and functioning correctly
**Advanced Challenges (Optional):**
- **Optimization**: Implement techniques from the extended reading sections, such as creating a .dockerignore file and exploring multi-stage builds for smaller image sizes
- **Distribution**: Practice the workflow from [Distributing Your Images](#distributing-your-images) by pushing your image to Docker Hub or GitHub Container Registry, making it accessible to others
The goal is to transform your API from a manual setup requiring multiple installation steps into a single-command deployment that works consistently across different environments.


View file

@ -0,0 +1,14 @@
+++
title = "C-Production-ready AI Systems"
date = 2025-10-25
description = ""
+++
Now you can deploy your AI system on different hardware infrastructures with ease, and also let everyone in the world access (and hopefully pay for) your AI services. You won't run into situations where your friends are calling you to play CS2 but your laptop is busy running an AI service so you can't join.
However, in real-world production environments, we might encounter challenges that our existing knowledge cannot effectively handle. For one, machines can break, and so can the AI systems running on them. You probably do not want a phone call from your boss telling you the company's whole service is down to ruin your weekend. On the other hand, you might want to introduce an update to a running AI system at some point, but your boss tells you that you cannot just stop the current system and start the new one, since those 5 seconds of service interruption would cost the company millions.
> **Videos:**
>
> - [How service interruptions can take down a government service, or break a country's network](https://www.youtube.com/@kevinfaang/videos)
To address these challenges, in this final phase of the course, we will explore some advanced techniques that make your AI system more robust to unforeseen interruptions and able to handle challenging needs in real-world deployment.


View file

@ -0,0 +1,703 @@
+++
title = "A.3-Wrap AI Models with APIs"
date = 2025-09-18
description = ""
+++
> **TL;DR:**
> Build your own APIs for serving AI models, covering everything from basic server setup and AI model integration to authentication, database-backed user management, and rate limiting—transforming you from an API consumer to an API producer.
In the previous two modules we've seen many industry-standard API techniques and practices. Through the power of APIs we've also played with AI systems run by someone else. The problem is that you are always spending money by doing so. It's time to serve your own APIs so that you turn from a consumer into a producer (and maybe earn some money by letting other people use your APIs).
> **Example:**
> This blog site is a fully self-hosted website with basic HTTP-based APIs for handling `GET` requests from browsers. When you visit this post, your browser essentially sends a `GET` request to my server and the server responds with the HTML body for the browser to render. Knowing how to implement your own APIs enables you to do lots of cool stuff that you can control however you want!
![API server kitchen analogy](api-server-kitchen.png)
APIs are served by API servers—a type of application that listens for API requests sent to them and produces the corresponding responses. They are like kitchens that maintain order and delivery windows for accepting and fulfilling orders, but usually keep the process of how an order is handled behind closed doors. The publicly accessible APIs you've been playing with in previous modules are nothing magical: they are served by API servers run by providers on one or more machines identified by the APIs' corresponding domains. We will compare a few choices of Python frameworks for implementing API servers, and focus on one of them to demonstrate how to implement the API fundamentals you learned in previous modules in practice.
## Python API Servers
Nowadays Python is the de facto language for implementing AI models. When we wrap our AI models with APIs, it would be straightforward if API servers are also implemented with Python, so that we can implement models and servers in one Python program. Thus, we will take a look at three popular Python frameworks commonly used to implement API servers: FastAPI, Django, and Flask.
[**FastAPI**](https://www.geeksforgeeks.org/python/fastapi-introduction/) is a modern and high-performance framework used to build APIs quickly and efficiently. It is a relatively new player in Python API frameworks, but has quickly become one of the fastest-growing frameworks in Python. It has built-in support for essential components of APIs such as authentication and input validation. FastAPI is also suitable for implementing high-performing API servers thanks to its asynchronous support—think of a kitchen that won't be occupied by a few orders under processing and can always take and process new requests. Below is a barebone API server implemented in FastAPI:
```python
from fastapi import FastAPI
app = FastAPI()
@app.get("/")
def read_root():
return {"message": "Hello, World!"}
```
[**Django**](https://www.w3schools.com/django/django_intro.php) is a comprehensive web framework that is designed to implement complex web applications (websites) instead of focusing on APIs. Django follows the classic MVT (Model-View-Template) design pattern of web apps, where model represents the data you want to display, typically sourced from a database; view handles incoming requests and returns the appropriate template and content based on the user's request; and template is an HTML file that defines the structure of the web page and includes logic for displaying the data. It also comes with lots of built-in modules for building web apps, such as database connectors and authentication. Below is a minimal Django implementation.
```python
# urls.py
from django.urls import path
from . import views
urlpatterns = [
path('', views.hello_world, name='hello_world'),
]
# views.py
from django.http import JsonResponse
def hello_world(request):
return JsonResponse({"message": "Hello, World!"})
```
[**Flask**](https://dev.to/atifwattoo/flask-a-comprehensive-guide-19mm) is a web framework similar to Django, but it is designed to be lightweight and modular. Its built-in functionalities are basic, but it can be extended through additional packages and is suitable for implementing smaller-scale applications or prototyping. It is also usually considered the least performant among the three frameworks, due to its lack of asynchronous support. Below is a barebone implementation with Flask.
```python
from flask import Flask, jsonify
app = Flask(__name__)
@app.route('/')
def hello_world():
return jsonify({"message": "Hello, World!"})
```
The three examples above can all achieve similar results—implement an API server that takes `GET` requests and returns a hello-world message. You can tell that comparatively the implementation with FastAPI and Flask is simpler than that with Django. We will use FastAPI as the primary example for demonstrating how to build your own API servers in the following content.
> **Videos:**
> - [Comparison between FastAPI, Flask, and Django](https://www.youtube.com/watch?v=cNlJCQHSmbE)
## FastAPI Fundamentals
We will start with implementing an API server with essential functionalities: accept requests to specific routes (specified by the URL) with `GET` and `POST` methods.
### Basic Setup
We will need both `fastapi` and `uvicorn` packages, where `uvicorn` is what we call a server worker. Essentially `fastapi` primarily handles the definition of the server, and `uvicorn` actually does the API serving heavy-lifting. Extending the above example, we can start from a minimal implementation, but this time with some customization so it feels more like our own:
```python
# main.py
from fastapi import FastAPI
app = FastAPI(title="My AI API Server", version="1.0.0")
@app.get("/")
def read_root():
return {"message": "Welcome to my AI API server!"}
```
And to start our server, run:
```bash
uvicorn main:app --reload --host 127.0.0.1 --port 8000
```
Where `main:app` points to the `app` object we implemented in the `main` program. `--reload` tells the server to automatically restart itself after we modify `main.py` for ease of development. `127.0.0.1` is the IP of "localhost"—the computer we run the server on, and `--host 127.0.0.1` means the server will only accept requests sent from the same computer. `8000` is the port our server listens on, in other words, the port used to identify our server application. You can now try to send a `GET` request to `http://127.0.0.1:8000` with another Python application and the `requests` library, or by accessing the URL in your browser, and you should be able to see the message.
![FastAPI browser response](fastapi-browser.png)
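If you prefer checking this from Python instead of the browser, a small sketch using the `requests` library (install it with `pip install requests` if you haven't):

```python
import requests

response = requests.get("http://127.0.0.1:8000")
print(response.status_code)  # 200
print(response.json())       # {'message': 'Welcome to my AI API server!'}
```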
You will also be able to see the log messages from your server:
```
INFO: Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)
INFO: Started reloader process [52851] using StatReload
INFO: Started server process [52853]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: 127.0.0.1:56835 - "GET / HTTP/1.1" 200 OK
```
This shows that a `GET` request from `127.0.0.1:56835` (the address of the client application you used to send the request; the port you see might be different) to the route `/` was answered with `200 OK`. Now try editing `main.py` and you will see the reload functionality working:
```
WARNING: StatReload detected changes in 'main.py'. Reloading...
INFO: Shutting down
INFO: Waiting for application shutdown.
INFO: Application shutdown complete.
INFO: Finished server process [54804]
INFO: Started server process [54811]
INFO: Waiting for application startup.
INFO: Application startup complete.
```
### Routes and URL Variables
Remember the OpenAI and Anthropic APIs you've played with? They have lots of routes under the same domain, for example `api.openai.com/v1/responses` and `api.openai.com/v1/chat/completions`. You can also easily define specific routes in FastAPI using the `@app.get` decorator's parameter. For example, in `main.py` we add another route:
```python
@app.get("/secret")
def get_secret():
return {"message": "You find my secret!"}
```
And you will get the corresponding responses from `/` and `/secret` routes by accessing `http://127.0.0.1:8000` and `http://127.0.0.1:8000/secret`, respectively.
There are also occasions where you want users to pass variables through URLs. For example, YouTube channel URLs are given as `https://www.youtube.com/@DigitalFoundry` or `https://www.youtube.com/channel/UCm22FAXZMw1BaWeFszZxUKw`. Implementing a separate route for each unique variable is clearly not practical. Luckily, you have two ways to pass and parse variables in FastAPI's routes. One is through [URL templates](https://en.wikipedia.org/wiki/URI_Template):
```python
@app.get("/parrot/{message}")
def repeat_message(message: str):
return {"message": message}
```
Try accessing `http://127.0.0.1:8000/parrot/` + any message. You can mix multiple fixed paths and variables in one route, for example:
```python
@app.get("/parrot/{message}-{date}/secret/{user}")
def repeat_message(message: str, user: int, date: str):
return {"message": f"A secret message {message} sent by user {user} on {date}."}
# Try access http://localhost:8000/parrot/random-July26/secret/21
```
Another way is through [URL parameters](https://www.semrush.com/blog/url-parameters/). These are variables specified following `?` at the end of URLs with format `<key>=<value>` for each variable, and can be `&`-separated for specifying multiple variables. For example, `https://www.youtube.com/watch?v=5tdsZwlWXAc&t=2s`. In FastAPI, URL parameters are caught by function parameters that are not covered by URL templates:
```python
@app.get("/secret")
def get_secret(user: int = 0):
return {"message": f"User {user} find my secret!"}
```
Try accessing `http://127.0.0.1:8000/secret?user=2`. Needless to say, you can mix the above two approaches in one route.
> **Note:**
> Strictly speaking, URL parameters are part of URL templates, and URL templates can get quite complicated. But in practice, following REST principles, you should keep your served URLs intuitive and straightforward.
### Handle POST Requests
You've noticed that the routes we implemented above can only handle `GET` requests. To handle `POST` requests we will have to additionally read the incoming data (request body). Since incoming and response data are often more complicated and structured for `POST` routes, it's also worth bringing up `pydantic` for building data models and integrating them into our FastAPI server.
Let's say you expect users to send a request body with the following format:
```json
{
"message": "This is a very secret message!",
"date": "2025-07-30",
"user": 20
}
```
You can, technically, simply use an `@app.post` decorated function and catch the request body as a plain Python `dict`:
```python
@app.post("/receiver")
def receiver(data: dict):
user = data['user']
message = data['message']
date = data['date']
return {"message": f"User {user} send a secret message '{message}' on {date}."}
```
The problem is, you cannot ensure that the data sent by users actually complies with the data types you expect. For example, a request body with `user` as a string will also be accepted by the above route, but it can cause problems in later processing. Manually implementing type checks can be tedious.
Fortunately, FastAPI can incorporate [`pydantic`](https://docs.pydantic.dev/latest/), a data validation library, to abstract data models and perform automatic type checking. Extending the above example, we first define a data class and require the request body to follow the definition:
```python
from pydantic import BaseModel
class ReceivedData(BaseModel):
user: int
message: str
date: str
@app.post("/receiver")
def receiver(data: ReceivedData):
user = data.user
message = data.message
date = data.date
return {"message": f"User {user} sent a secret message '{message}' on {date}."}
```
Now if the request body contains invalid data types, FastAPI will reject the request and return `422 Unprocessable Content`.
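You can verify this behavior from the client side. A quick sketch with `requests`, assuming the server from above is running locally:

```python
import requests

valid = {"user": 20, "message": "This is a very secret message!", "date": "2025-07-30"}
invalid = {"user": "not-a-number", "message": "hi", "date": "2025-07-30"}

print(requests.post("http://127.0.0.1:8000/receiver", json=valid).status_code)    # 200
print(requests.post("http://127.0.0.1:8000/receiver", json=invalid).status_code)  # 422
```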
> **Videos:**
> - [FastAPI tutorial](https://www.youtube.com/watch?v=iWS9ogMPOI0)
> - [Pydantic tutorial](https://www.youtube.com/watch?v=XIdQ6gO3Anc)
> **Extended Reading:**
> There are many more benefits and functionalities of integrating `pydantic` to define and abstract data models in FastAPI, including reusability and automatic documentation generation. Generally speaking, when implementing API servers, data model abstraction is a preferred practice. Take a look at more things you can do with both libraries combined:
> - https://data-ai.theodo.com/en/technical-blog/fastapi-pydantic-powerful-duo
> - https://www.geeksforgeeks.org/python/fastapi-pydantic/
### API Versioning
As we covered in [Advanced APIs in the Era of AI](@/ai-system/advanced-apis/index.md#api-versioning), API versioning allows you to introduce changes without breaking existing integrations. This is particularly important for AI APIs where models and features are constantly evolving. FastAPI makes implementing URL path versioning straightforward using `APIRouter` with prefixes.
```python
from fastapi import APIRouter
from datetime import datetime
v1_router = APIRouter(prefix="/v1")
v2_router = APIRouter(prefix="/v2")
@v1_router.post("/receiver")
def receiver_v1(data: ReceivedData):
return {"message": f"User {data.user} sent '{data.message}' on {data.date}"}
@v2_router.post("/receiver")
def receiver_v2(data: ReceivedData):
return {
"message": f"User {data.user} sent '{data.message}' on {data.date}",
"version": "2.0",
"timestamp": datetime.now().isoformat()
}
app.include_router(v1_router)
app.include_router(v2_router)
```
Now your API supports both versions simultaneously: users can access `/v1/receiver` for the original functionality while `/v2/receiver` provides enhanced features.
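A quick way to compare the two versions side by side from a client (a sketch, assuming the server is running locally):

```python
import requests

payload = {"user": 20, "message": "hello", "date": "2025-07-30"}

# Same request body, different response shapes depending on the version
print(requests.post("http://127.0.0.1:8000/v1/receiver", json=payload).json())
print(requests.post("http://127.0.0.1:8000/v2/receiver", json=payload).json())
```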
> **Extended Reading:**
> Examples of implementing advanced API techniques introduced in [Advanced APIs in the Era of AI](@/ai-system/advanced-apis/index.md) with FastAPI:
> - [Streaming Protocols](https://apidog.com/blog/fastapi-streaming-response/)
> - [WebSockets](https://www.geeksforgeeks.org/python/how-to-use-websocket-with-fastapi/)
> - [MQTT](https://sabuhish.github.io/fastapi-mqtt/getting-started/)
> - [Model Context Protocol](https://github.com/tadata-org/fastapi_mcp)
## Build APIs for AI Models
With the foundations of FastAPI servers in place, we proceed to integrate AI models and implement AI API servers. We will build an API server with image classification APIs as an example.
### Barebone Implementation
First and foremost we need an image classification model to support the AI pipeline of our API server. You might already have a model implemented in previous or parallel courses lying around. For demonstration purposes we will use an off-the-shelf model from [HuggingFace](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.ImageClassificationPipeline).
```python main.py
import asyncio
from PIL import Image
import torch
from transformers import pipeline, AutoImageProcessor, AutoModelForImageClassification
class ImageClassifier:
def __init__(self):
self.model = None
self.processor = None
self.model_name = "microsoft/resnet-18"
async def load_model(self):
"""Load the image classification model asynchronously"""
if self.model is None:
print(f"Loading model: {self.model_name}")
self.model = AutoModelForImageClassification.from_pretrained(self.model_name)
self.processor = AutoImageProcessor.from_pretrained(self.model_name)
print("Model loaded successfully")
async def classify_image(self, image: Image.Image) -> dict:
"""Classify a single image"""
if self.model is None:
await self.load_model()
# Process image
inputs = self.processor(image, return_tensors="pt")
# Run inference
with torch.no_grad():
outputs = self.model(**inputs)
predictions = torch.nn.functional.softmax(outputs.logits[0], dim=0)
# Get top 5 predictions
top_predictions = torch.topk(predictions, 5)
results = []
for score, idx in zip(top_predictions.values, top_predictions.indices):
label = self.model.config.id2label[idx.item()]
confidence = score.item()
results.append({
"label": label,
"confidence": round(confidence, 4)
})
return {
"predictions": results,
"model": self.model_name
}
```
We are using a very lightweight image classification model, `microsoft/resnet-18`, that should be able to run on most PCs. Notice the `async` declaration on the model loading and inference functions. It is there to make sure that when your server is loading the model or processing an incoming image, it is still able to serve other requests. Think of the server as always assigning a dedicated person to handle an incoming request, so that if another request arrives in the meantime, it can assign another person rather than waiting for the previous one to finish their job.
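One caveat worth knowing: `async def` by itself does not make CPU-heavy work non-blocking. If you ever find a long inference call holding up other requests, a common pattern is to push the blocking part onto a worker thread. A rough sketch (the `_classify_sync` helper is hypothetical and would contain the synchronous inference code from `classify_image` above):

```python
import asyncio

class ThreadedImageClassifier(ImageClassifier):
    async def classify_image(self, image: Image.Image) -> dict:
        if self.model is None:
            await self.load_model()
        # Run the blocking model call in a worker thread so the event loop
        # stays free to accept other requests while inference is running
        return await asyncio.to_thread(self._classify_sync, image)
```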
Define the server app so that it loads the model at startup.
```python main.py
from fastapi import FastAPI, HTTPException
from contextlib import asynccontextmanager
# Initialize classifier
classifier = ImageClassifier()
@asynccontextmanager
async def lifespan(app: FastAPI):
# Startup
await classifier.load_model()
yield
app = FastAPI(title="AI Image Classification API", version="1.0.0", lifespan=lifespan)
```
We also define a few data models for the incoming requests and responses, and a utility function for reading image data. Note that we use [`base64`-encoded images](https://www.base64-image.de/) so that the request body stays in the JSON format we are familiar with.
```python main.py
import base64
import io
from typing import List, Optional
from pydantic import BaseModel
class ImageRequest(BaseModel):
image: str # base64 encoded image
filename: Optional[str] = None
class ClassificationResponse(BaseModel):
predictions: List[dict]
model: str
class ModelInfo(BaseModel):
name: str
status: str
num_labels: Optional[int] = None
def decode_base64_image(base64_string: str) -> Image.Image:
"""Decode base64 string to PIL Image"""
try:
# Remove data URL prefix if present
if base64_string.startswith('data:image'):
base64_string = base64_string.split(',')[1]
# Decode base64
image_data = base64.b64decode(base64_string)
image = Image.open(io.BytesIO(image_data)).convert("RGB")
return image
except Exception as e:
raise HTTPException(status_code=400, detail=f"Invalid base64 image: {str(e)}")
```
Finally, we implement two routes for users to fetch information about the model, and to perform image classification.
```python
@app.get("/model/info", response_model=ModelInfo)
async def model_info():
"""Get model information"""
if classifier.model is None:
return ModelInfo(
name=classifier.model_name,
status="not_loaded"
)
return ModelInfo(
name=classifier.model_name,
status="loaded",
num_labels=len(classifier.model.config.id2label)
)
@app.post("/classify", response_model=ClassificationResponse)
async def classify_image(request: ImageRequest):
"""Classify a single base64 encoded image"""
try:
# Decode base64 image
image = decode_base64_image(request.image)
# Classify image
result = await classifier.classify_image(image)
return ClassificationResponse(**result)
except HTTPException:
raise
except Exception as e:
raise HTTPException(status_code=500, detail=f"Error processing image: {str(e)}")
```
Now you have a little image classification API server! I sent it a picture of a Spanish-style seafood casserole I made yesterday (it's delicious, by the way) by encoding the image in `base64` format.
![Seafood casserole](seafood-casserole.png)
And I got the classification result from the server:
```json
{
"model": "microsoft/resnet-18",
"predictions": [
{
"confidence": 0.5749,
"label": "soup bowl"
},
{
"confidence": 0.2213,
"label": "consomme"
},
{
"confidence": 0.1637,
"label": "hot pot, hotpot"
},
{
"confidence": 0.0107,
"label": "mortar"
},
{
"confidence": 0.0097,
"label": "potpie"
}
]
}
```
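In case you want to reproduce this from the client side, the request can be produced roughly like this (a sketch using the `requests` library; the image path is a placeholder and the server is assumed to be running locally on port 8000):

```python
import base64
import requests

# Encode a local image file as base64 (the path is just a placeholder)
with open("casserole.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = requests.post(
    "http://127.0.0.1:8000/classify",
    json={"image": image_b64, "filename": "casserole.png"},
)
print(response.json())
```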
### API Key Authentication
Right now our API server is unprotected: anyone who can reach your PC can send requests and overload it, and you have no idea who is doing so. That's why most APIs are protected with authentication (typically API keys), and we should implement a similar system.
FastAPI has built-in authentication support, and to implement a basic API key authentication, we can use a verification function:
```python
from fastapi import Depends, HTTPException, status
from fastapi.security import HTTPBearer, HTTPAuthorizationCredentials
security = HTTPBearer()
async def verify_api_key(credentials: HTTPAuthorizationCredentials = Depends(security)):
# API key validation logic
if credentials.credentials != "your-secret-api-key":
raise HTTPException(
status_code=status.HTTP_401_UNAUTHORIZED,
detail="Invalid API key"
)
return credentials.credentials
```
And for routes that need to be protected, we add the requirement for API keys:
```python
@app.post("/classify", response_model=ClassificationResponse)
async def classify_image(request: ImageRequest, api_key: str = Depends(verify_api_key)):
# Remaining code
```
Now only requests with the authentication header `Authorization: Bearer your-secret-api-key` will be accepted by the `/classify` route; otherwise it will return `401 Unauthorized`.
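From the client side, this just means attaching the header to each request. A quick check (the payload here is only a placeholder to exercise authentication):

```python
import requests

body = {"image": ""}  # placeholder payload, just to test authentication

# Wrong key: the verification function rejects the request with 401
bad = requests.post("http://127.0.0.1:8000/classify", json=body,
                    headers={"Authorization": "Bearer wrong-key"})
print(bad.status_code)

# Correct key: authentication passes (the empty placeholder image is still
# rejected by the route itself with a 400, but no longer with a 401)
good = requests.post("http://127.0.0.1:8000/classify", json=body,
                     headers={"Authorization": "Bearer your-secret-api-key"})
print(good.status_code)
```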
The limitation of the above implementation is that your API key is hardcoded. In practice you will want to have a dynamic list of API keys, one for each user, which also enables you to identify each request and keep track of the usage of users.
> **Extended Reading:**
> Comprehensive overview of authentication and authorization in FastAPI:
> - https://www.geeksforgeeks.org/python/authentication-and-authorization-with-fastapi/
> - https://betterstack.com/community/guides/scaling-python/authentication-fastapi/
### Database Integration
Continuing on the above topic, one common practice for recording the list of API keys, their respective users, and other information is to use a database. In the previous [AI og data](https://www.moodle.aau.dk/course/view.php?id=50254) course we already got hands-on experience with database concepts and interacting with databases through SQL queries. You can directly use database connectors and integrate SQL queries into your API server, but similar to the `pydantic` library for managing data models, we also have `sqlalchemy` for managing data models for databases.
[`sqlalchemy`](https://www.datacamp.com/tutorial/sqlalchemy-tutorial-examples) provides a high-level interface for interacting with databases, so you do not have to write SQL queries yourself and can focus on the abstract definition and manipulation of data models. Similar to `pydantic` providing automatic type validation, `sqlalchemy` also provides automatic database initialization and SQL injection protection. For dynamic API key and user management and usage tracking, we define the following two data models, one for users and one for processed API requests:
```python
from sqlalchemy import create_engine, Column, Integer, String, DateTime, Float, ForeignKey
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker, Session, relationship
from datetime import datetime
import time
Base = declarative_base()
class User(Base):
__tablename__ = "users"
id = Column(Integer, primary_key=True)
api_key = Column(String, unique=True, nullable=False)
email = Column(String)
created_at = Column(DateTime, default=datetime.utcnow)
requests = relationship("APIRequest", back_populates="user")
class APIRequest(Base):
__tablename__ = "api_requests"
id = Column(Integer, primary_key=True)
user_id = Column(Integer, ForeignKey("users.id"))
endpoint = Column(String)
timestamp = Column(DateTime, default=datetime.utcnow)
response_time_ms = Column(Float)
status_code = Column(Integer)
user = relationship("User", back_populates="requests")
```
Here we use SQLite as a simple database for demonstration, and the following code creates the database and tables when you start the server for the first time. This shows one of the benefits of `sqlalchemy`: if you later want to move to a more performant database, most of the time you only have to replace the database URL and reuse the data models, and `sqlalchemy` will handle the differences between databases for you.
```python
engine = create_engine("sqlite:///ai_api.db")
SessionLocal = sessionmaker(autocommit=False, autoflush=False, bind=engine)
Base.metadata.create_all(bind=engine)
```
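For example, moving the same data models to a PostgreSQL instance would mostly come down to swapping the connection URL. The host, credentials, and database name below are placeholders, not part of the original setup:
```python
# Hypothetical PostgreSQL connection; requires a running server and the psycopg2 driver installed
engine = create_engine("postgresql+psycopg2://api_user:password@localhost:5432/ai_api")
```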
To upgrade our API key authentication, we fetch the user matching the given API key from the database:
```python
security = HTTPBearer()
def get_db():
db = SessionLocal()
try:
yield db
finally:
db.close()
async def get_current_user(
credentials: HTTPAuthorizationCredentials = Depends(security),
db: Session = Depends(get_db)
):
user = db.query(User).filter(User.api_key == credentials.credentials).first()
if not user:
raise HTTPException(status_code=401, detail="Invalid API key")
return user
@app.post("/classify", response_model=ClassificationResponse)
async def classify_image(
request: ImageRequest,
user: User = Depends(get_current_user),
db: Session = Depends(get_db)
):
# Remaining code
```
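Note that this assumes the `users` table already contains at least one row. For local testing, you could seed one manually; the key and email below are made-up values:
```python
# One-off snippet to create a test user if it does not exist yet
db = SessionLocal()
if not db.query(User).filter(User.api_key == "your-secret-api-key").first():
    db.add(User(api_key="your-secret-api-key", email="test@example.com"))
    db.commit()
db.close()
```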
> **Videos:**
> - [SQLAlchemy introduction](https://www.youtube.com/watch?v=aAy-B6KPld8)
### API Key Identification & Tracking
Now that we can identify specific users by the API keys they use, we can also track their usage of our API server. Utilizing the `APIRequest` data model we defined earlier, we update our `/classify` route to record each request:
```python
@app.post("/classify", response_model=ClassificationResponse)
async def classify_image(
request: ImageRequest,
user: User = Depends(get_current_user),
db: Session = Depends(get_db)
):
"""Classify a single base64 encoded image"""
start_time = time.time()
try:
# Classify image (your existing logic)
image = decode_base64_image(request.image)
result = await classifier.classify_image(image)
# Log request
api_request = APIRequest(
user_id=user.id,
endpoint="/classify",
response_time_ms=(time.time() - start_time) * 1000,
status_code=200
)
db.add(api_request)
db.commit()
return ClassificationResponse(**result)
except Exception as e:
# Log failed request
api_request = APIRequest(
user_id=user.id,
endpoint="/classify",
response_time_ms=(time.time() - start_time) * 1000,
status_code=500
)
db.add(api_request)
db.commit()
raise HTTPException(status_code=500, detail=str(e))
```
Now whenever users send requests to our server, a record (id, user id, endpoint, timestamp, response time in milliseconds, status code) is stored in the `api_requests` table of our database:
```
1|1|/classify|2025-07-27 12:16:27.610650|21.8780040740967|200
2|1|/classify|2025-07-27 12:24:43.704042|22.1047401428223|200
3|1|/classify|2025-07-27 12:24:46.572790|16.6518688201904|200
4|1|/classify|2025-07-27 12:24:48.011679|16.9012546539307|200
5|1|/classify|2025-07-27 12:24:48.978239|16.8101787567139|200
```
We can also create a route for users to check their own usage status.
```python
@app.get("/usage")
async def get_usage(
user: User = Depends(get_current_user),
db: Session = Depends(get_db)
):
requests = db.query(APIRequest).filter(APIRequest.user_id == user.id).all()
total = len(requests)
successful = len([r for r in requests if r.status_code == 200])
avg_time = sum(r.response_time_ms for r in requests) / total if total > 0 else 0
return {
"total_requests": total,
"successful_requests": successful,
"success_rate": round(successful / total * 100, 2) if total > 0 else 0,
"avg_response_time_ms": round(avg_time, 2)
}
```
Users can send a `GET` request to this route with their API key and get a report of their usage.
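For example, reusing the local server and API key from the earlier sketches:
```python
import requests

response = requests.get(
    "http://localhost:8000/usage",
    headers={"Authorization": "Bearer your-secret-api-key"},
)
print(response.json())
```
which prints a summary like: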
```json
{
"avg_response_time_ms": 18.87,
"success_rate": 100.0,
"successful_requests": 5,
"total_requests": 5
}
```
### Rate Limiting
With API key-based user tracking in place, we can now implement per-user rate limiting to prevent bad actors from overloading our API server. Below is a simple DIY implementation that limits each user to 5 requests per minute, following the "sliding window" approach we introduced in [Advanced APIs in the Era of AI](@/ai-system/advanced-apis/index.md#rate-limiting).
```python
from datetime import datetime, timedelta
async def check_rate_limit(user: User, db: Session):
"""Check if user has exceeded their rate limits"""
now = datetime.utcnow()
# Check requests in the last minute
minute_ago = now - timedelta(minutes=1)
recent_requests = db.query(APIRequest).filter(
APIRequest.user_id == user.id,
APIRequest.timestamp >= minute_ago
).count()
if recent_requests >= 5:
raise HTTPException(
status_code=429,
detail=f"Rate limit exceeded: 5 requests per minute"
)
```
And in our `/classify` route, add one line of code at the start of the function:
```python
await check_rate_limit(user, db)
```
Now if a user sends more than 5 requests within one minute, further requests are rejected with a `429 Too Many Requests`. In practice you might also want to record each user's rate limit threshold in the `User` data model instead of hardcoding it, as sketched below.
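A minimal sketch of that extension, replacing the earlier `User` model (the column name is my own choice, and you would need to recreate or migrate the existing SQLite table so the new column exists):
```python
class User(Base):
    __tablename__ = "users"
    id = Column(Integer, primary_key=True)
    api_key = Column(String, unique=True, nullable=False)
    email = Column(String)
    created_at = Column(DateTime, default=datetime.utcnow)
    # Per-user quota instead of a hardcoded limit (hypothetical column)
    requests_per_minute = Column(Integer, default=5)
    requests = relationship("APIRequest", back_populates="user")
```
`check_rate_limit` would then compare `recent_requests` against `user.requests_per_minute` instead of the literal 5.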
> **Extended Reading:**
> A few libraries for easier implementation of rate limiting:
> - https://github.com/laurentS/slowapi
> - https://github.com/long2ice/fastapi-limiter
## Exercise
Build an image classification API server that demonstrates the knowledge covered in this module, reversing your role from API consumer to API producer.
**Exercise: Image Classification API Server**
Develop an API server that integrates the concepts covered throughout this module:
- **FastAPI Implementation**: Use the FastAPI fundamentals covered in [FastAPI Fundamentals](#fastapi-fundamentals), including proper route definition, request handling, and Pydantic data models
- **AI Model Integration**: Integrate an image classification model following the patterns shown in [Build APIs for AI Models](#build-apis-for-ai-models), using an appropriate open-source model that can run on your system
- **API Versioning**: Support API versioning using the approach shown in [API Versioning](#api-versioning)
- **Authentication System**: Implement API key authentication as demonstrated in [API Key Authentication](#api-key-authentication) to protect your endpoints
**Client Integration:**
Modify your image analysis program from [API Fundamentals](@/ai-system/api-fundamentals/index.md) to connect to your server instead of third-party APIs.
**Additional Functionalities (Optional):**
- **Database Integration**: Use database integration techniques from [Database Integration](#database-integration) for user (API key) management and usage tracking.
- **Rate Limiting**: Apply rate limiting concepts from [Rate Limiting](#rate-limiting) to prevent server overload

(binary image file added: 1.9 MiB, contents not shown)
View file

@@ -81,13 +81,14 @@ img {
 }
 h1, h2, h3, h4 {
-  margin: 1.5rem 0 0.75rem;
+  margin: 2rem 0 0.75rem;
   line-height: 1.3;
 }
 h1 { font-size: 1.5rem; }
 h2 { font-size: 1.25rem; }
 h3 { font-size: 1.1rem; }
+h4 { font-size: 1rem; }
 h2::after {
   content: " ##";
@@ -101,8 +102,6 @@ h3::after {
   font-weight: normal;
 }
-h4 { font-size: 1rem; }
 h4::after {
   content: " ####";
   color: var(--muted);
@@ -184,6 +183,29 @@ blockquote {
   color: var(--muted);
 }
+table {
+  border-collapse: collapse;
+  margin: 1rem 0;
+  width: 100%;
+  overflow-x: auto;
+  display: block;
+  th, td {
+    border: 1px solid var(--border);
+    padding: 0.5rem 0.75rem;
+    text-align: left;
+  }
+  th {
+    background: var(--code-bg);
+    font-weight: 600;
+  }
+  tr:hover {
+    background: var(--code-bg);
+  }
+}
 .pagination {
   display: flex;
   justify-content: space-between;

View file

@@ -22,7 +22,7 @@
 {% set all_pages = all_pages | concat(with=s2.pages) %}
 {% endif %}
 <ul class="post-list">
-  {% for page in all_pages | sort(attribute="date") | reverse | slice(end=5) %}
+  {% for page in all_pages | sort(attribute="date") | reverse | slice(end=7) %}
   <li>
     <time datetime="{{ page.date }}">{{ page.date | date(format="%Y-%m-%d") }}</time>
     <a href="{{ page.permalink }}">{{ page.title }}</a>