Bare Metal AI: A Tabletop Data Center to Text Myself

Introduction

In Part 1, we built the inference foundation and briefly enjoyed the dangerous illusion that everything was under control. Two HAT Pis were running hailo-ollama, Open WebUI was sitting in front of them like a well-dressed receptionist, and SearXNG was ready to fetch web results without sending every question straight into the cloud. The models responded, the services stayed up, and for a short, beautiful moment, the whole thing looked suspiciously functional.

Then I tried to connect it to Hermes Agent and send myself a WhatsApp message. The first problem appeared within minutes. Then another. Then a third that turned out to be caused by the solution to the second. In short, the polite architecture diagram turned into a hooligan! Order was restored (sort of) by a custom Python HTTP server sitting between Hermes and Open WebUI, doing things no reasonable person would have anticipated needing to do. It's called hailo-bridge, and this article is largely its origin story.

This part covers the first thing that lives on Pi #4: the custom Python bridge that gets Hermes and hailo-ollama speaking the same language, including streaming conversion, system prompt limits, and web search injection. The Hermes Agent installation and configuration, the WhatsApp integration, and the many small discoveries that only appear once you ask a tiny local model to behave like a proper autonomous agent follow in Part 3, along with an honest look at what 1.5B models can and cannot do, because there is a big difference between “runs locally” and “should be trusted with tools, search, and your afternoon.” Building all of this taught me that when you bring AI fully local, the privacy improves, but the hallucinations come along for the ride.

The Compatibility Problem: Why `hailo-bridge` Exists

Before getting into the code, it is worth explaining why this bridge exists in the first place. When I started, everything looked compatible on paper. Hermes Agent talks to an OpenAI-style API, Open WebUI exposes an OpenAI-style API, and hailo-ollama behaves enough like Ollama that you briefly start believing this might be easy. That belief lasted about as long as the first WhatsApp test message.

The short version is that Hermes and hailo-ollama speak slightly different dialects of the same API-shaped language. Not wildly different. Not “wrong continent” different. Just different enough that everything breaks in annoying, specific ways.

Problem 1: Streaming responses

The first problem was streaming. Hermes expects streaming responses with stream: true, both for the typing experience and for handling longer responses cleanly. hailo-ollama, at least in this setup, was not interested in that arrangement.

Ask it to stream and it returns HTTP 500. Not a polite warning. Not a partial response. Just a firm little door slam. The Hailo inference path returns the completed generation as a whole, rather than delivering tokens incrementally like standard Ollama streaming.

So the bridge cheats, in the most useful possible way. It accepts the streaming request from Hermes, calls hailo-ollama internally with streaming disabled, waits for the full response, and then replays that response as Server-Sent Events. Hermes gets the stream-shaped response it expects. hailo-ollama never has to stream anything. Everyone leaves the meeting feeling heard.
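
To make the replay trick concrete, here is a minimal sketch of turning an already-finished chat completion into OpenAI-style Server-Sent Events. It is an illustration, not the actual hailo-bridge code: the function name replay_as_sse, the chunk_chars parameter, and the fallback id are my own assumptions.

```python
import json
from typing import Iterator, Optional


def replay_as_sse(completion: dict, chunk_chars: int = 40) -> Iterator[str]:
    """Yield OpenAI-style SSE events for a completion that is already finished.

    Hypothetical helper: the name and chunk size are illustrative only.
    """
    text = completion["choices"][0]["message"]["content"]
    base = {
        "id": completion.get("id", "chatcmpl-bridge"),  # assumed fallback id
        "object": "chat.completion.chunk",
        "model": completion.get("model", ""),
    }

    def event(delta: dict, finish_reason: Optional[str] = None) -> str:
        choice = {"index": 0, "delta": delta, "finish_reason": finish_reason}
        return f"data: {json.dumps(dict(base, choices=[choice]))}\n\n"

    # The first event announces the assistant role, as OpenAI-style streams do.
    yield event({"role": "assistant"})
    # Slice the finished text into small pieces so the client sees a stream.
    for start in range(0, len(text), chunk_chars):
        yield event({"content": text[start:start + chunk_chars]})
    # The closing event carries finish_reason, then [DONE] ends the stream.
    yield event({}, finish_reason="stop")
    yield "data: [DONE]\n\n"
```

Each yielded string can be written straight to the HTTP response once the Content-Type is set to text/event-stream. The client cannot tell the difference between this and genuine token-by-token streaming, apart from the pause before the first chunk.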

Problem 2: Oversized system prompts

The second problem was prompt size. Hermes builds a fairly chunky system prompt once you include the agent persona, platform instructions, memory context, tool definitions, and everything else it wants the model to know before answering.

With a moderate toolset loaded, that prompt can easily grow past 14,000 characters. The 1.5B models running on the HAT Pis do not have much room to spare, and once the combined prompt gets too large, hailo-ollama starts failing with HTTP 500 instead of gracefully explaining that the context window has left the building.

The bridge handles this with a configurable max_system_chars limit before forwarding the request. It is not a perfect token counter, because characters and tokens are not the same thing, but it is a practical guardrail. The goal is simple: keep the prompt small enough that the model actually responds, instead of feeding it an entire instruction manual and watching it faint.
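
As a rough sketch of what such a guardrail can look like, the function below trims system messages to a shared character budget. I am assuming the simplest policy, which is to keep the start of the prompt and drop the tail; the function name clamp_system_prompts is hypothetical, and the real bridge may trim differently.

```python
def clamp_system_prompts(messages: list[dict], max_system_chars: int) -> list[dict]:
    """Return a copy of messages with system content cut to a shared budget.

    Assumption: the beginning of the system prompt matters most, so the
    tail is what gets dropped. Non-system messages pass through untouched.
    """
    remaining = max_system_chars
    clamped = []
    for message in messages:
        content = message.get("content")
        if message.get("role") == "system" and isinstance(content, str):
            kept = content[:remaining]
            remaining -= len(kept)
            # Build a new dict so the caller's original message is not mutated.
            message = {**message, "content": kept}
        clamped.append(message)
    return clamped
```

Counting characters instead of tokens keeps the bridge free of tokenizer dependencies, at the cost of precision. That trade is usually fine for a limit that only needs to be roughly right.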

Problem 3: Unreliable tool calling and web search

The third problem was web search. Hermes has a proper web search toolset, which is the elegant version of this story. The model decides when search is needed, emits a correctly formatted tool call, receives the results, and uses them in the answer.

That works well when the model is big enough and reliable enough to use tools properly. The 1.5B models were enthusiastic, but not exactly dependable employees.

Sometimes they called tools. Sometimes they ignored them. Sometimes they produced something tool-shaped but not tool-valid, which is worse because now you have both a wrong answer and paperwork. Under a truncated system prompt, reliable tool calling became even less realistic.

So I moved search out of the model’s hands entirely. hailo-bridge detects search-like queries before they reach the model, calls SearXNG directly, and injects the results into the user message.

By the time the model sees the request, the search has already happened. Its job is no longer “decide whether to search, call the tool correctly, parse the response, and answer.” Its job is “summarize these results without inventing a parallel universe.”

That is much more realistic for a tiny local model.
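
Here is a minimal sketch of that detect-and-inject flow. The trigger words, the function names, and the wording of the injected prompt are all assumptions for illustration; the article does not show the bridge's real detection rules. The fetcher assumes SearXNG has its JSON output format enabled.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable

# Hypothetical trigger words; the real bridge's rules may differ.
SEARCH_HINTS = ("latest", "news", "today", "current", "search for", "look up")


def looks_like_search(text: str) -> bool:
    """Crude keyword check for queries that probably need fresh data."""
    lowered = text.lower()
    return any(hint in lowered for hint in SEARCH_HINTS)


def searxng_fetch(query: str, search_url: str) -> list[dict]:
    """Query SearXNG's JSON API (requires the json format to be enabled)."""
    params = urllib.parse.urlencode({"q": query, "format": "json"})
    with urllib.request.urlopen(f"{search_url}?{params}", timeout=10) as resp:
        return json.load(resp).get("results", [])


def inject_search(messages: list[dict],
                  fetch_results: Callable[[str], list[dict]],
                  max_results: int = 3) -> list[dict]:
    """Prepend search results to the last user message when it looks like a search."""
    if not messages or messages[-1].get("role") != "user":
        return messages
    query = messages[-1].get("content", "")
    if not isinstance(query, str) or not looks_like_search(query):
        return messages
    results = fetch_results(query)[:max_results]
    if not results:
        return messages  # nothing found: let the model answer unaided
    lines = [
        f"- {r.get('title', '')}: {r.get('content', '')} ({r.get('url', '')})"
        for r in results
    ]
    injected = (
        "Web search results:\n" + "\n".join(lines)
        + "\n\nUsing only these results, answer: " + query
    )
    return messages[:-1] + [{**messages[-1], "content": injected}]
```

In use, the fetcher is passed in as a small closure, for example inject_search(messages, lambda q: searxng_fetch(q, search_url)). Keeping the network call behind a parameter also makes the detection logic easy to test without a running SearXNG instance.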

So that is why hailo-bridge exists. It is part translator, part adapter, part search proxy, and part tiny bouncer standing between Hermes and the inference stack saying, “I know what you meant, let me make that survivable.” It is not the most elegant architecture in the world, but it turns three incompatible assumptions into a working local agent pipeline. For roughly 200 lines of Python, I will take that deal.

Pi #4 — Setting Up hailo-bridge

Pi #4 is where the polite architecture diagram starts doing the weird but necessary work. This machine runs hailo-bridge, the small Python service that makes Hermes think it is talking to a normal Ollama-compatible endpoint, while quietly translating requests for the Pi cluster behind the curtain.

Prerequisites

Pi #4 needs a few system packages before it can do anything useful, so go ahead and run the following commands:

sudo apt install -y python3-pip python3-yaml git curl build-essential ripgrep pipx
pipx ensurepath
source ~/.bashrc

These packages provide the basic components Hermes and hailo-bridge need: python3-pip installs Python dependencies, python3-yaml lets hailo-bridge read YAML config files, git and curl support downloading code and resources, build-essential provides the compiler toolchain needed by Python packages with native extensions, ripgrep gives Hermes fast file and session search, and pipx installs Hermes as an isolated Python CLI tool without polluting the system Python installation.

After pipx ensurepath, run source ~/.bashrc so the updated PATH takes effect immediately instead of requiring a logout.

Get the Open WebUI API Key

hailo-bridge authenticates to Open WebUI using an API key; without this, requests to Open WebUI will fail with an authentication error, and the bridge will mostly sit there looking innocent while nothing works. So, before configuring the bridge, we need to generate a key in Open WebUI.

First, ensure API keys are enabled in Open WebUI:

1. Open http://192.168.x.PI3:3000 and log in as admin.
2. Go to Admin Panel → Settings → General.
3. Enable the API Keys toggle. If this is off, the option won't appear for any user, and you will be left wondering how to press a non-existent button to get your key.

Then generate the key:

1. Go to Settings → Account → API Keys.
2. Click Create new secret key and give it a name (e.g. hailo-bridge).
3. Copy the key immediately; Open WebUI does not show it again after this screen.

This key goes into hailo-bridge-config.yaml as the api_key value.
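
For context, this is roughly how a bridge can attach that key when calling Open WebUI, which accepts it as a Bearer token in the Authorization header. The helper name build_backend_request is hypothetical, and appending /chat/completions to the backend base URL is my assumption about how the configured path is used.

```python
import json
import urllib.request


def build_backend_request(base_url: str, api_key: str,
                          payload: dict) -> urllib.request.Request:
    """Prepare an authenticated chat completion request for Open WebUI.

    Hypothetical helper: only builds the request, it does not send it.
    """
    return urllib.request.Request(
        # Assumed endpoint layout: <backend base URL>/chat/completions
        base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )
```

Passing the result to urllib.request.urlopen sends it. If the key is missing or wrong, expect an HTTP authentication error from Open WebUI rather than anything from hailo-ollama.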

Deploy the bridge

First, create a home for the bridge:

mkdir -p ~/hailo-proxy

Copy hailo-bridge.py to ~/hailo-proxy/hailo-bridge.py (provided as a companion file to this article).

Create ~/hailo-proxy/hailo-bridge-config.yaml:

port: 11434
default_model: llama3.2:1b
max_system_chars: 10000
search_url: http://192.168.x.PI3:8080/search
backends:
  llama3.2:1b: http://192.168.x.PI3:3000/ollama/v1
  qwen2.5:1.5b: http://192.168.x.PI3:3000/ollama/v1
api_key: your-openwebui-api-key-here

The port setting tells hailo-bridge which port to listen on. This example uses 11434, the standard Ollama port, so Hermes can talk to hailo-bridge without special configuration.

default_model sets the fallback model when a request does not specify one.

max_system_chars limits the size of system prompts before they are forwarded. I use 10000 as a tested sweet spot: large enough to preserve useful context, but small enough to avoid 500 errors from hailo-ollama.

search_url points to SearXNG on Pi #3.

The backends section defines model-to-backend routing. In this setup, both models route through Open WebUI on Pi #3, which then forwards each request to the appropriate HAT Pi based on the selected model.
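
A small sketch of how this config could be loaded and used for routing, assuming the key names shown above. The function names are hypothetical, and falling back to default_model for an unknown model name is my assumption, not confirmed behavior of the bridge.

```python
from typing import Optional


def load_config(path: str) -> dict:
    """Read the YAML config file into a plain dict."""
    # Imported here so resolve_backend stays usable without PyYAML installed.
    import yaml  # provided by the python3-yaml package

    with open(path, encoding="utf-8") as handle:
        return yaml.safe_load(handle) or {}


def resolve_backend(config: dict, requested_model: Optional[str]) -> tuple[str, str]:
    """Return (model, backend_url) for a request.

    Assumption: a missing or unknown model falls back to default_model.
    """
    backends = config.get("backends", {})
    model = requested_model or config["default_model"]
    if model not in backends:
        model = config["default_model"]
    return model, backends[model]
```

With the example config, a request for qwen2.5:1.5b resolves to the Open WebUI URL on Pi #3, and a request without a model resolves to llama3.2:1b at the same URL.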

Install hailo-bridge as a systemd service

Create a systemd service so hailo-bridge starts automatically on boot and restarts if it crashes. To do so, create a new service file:

sudo nano /etc/systemd/system/hailo-bridge.service

Paste the following configuration into the file:

[Unit]
Description=Hailo Bridge
After=network.target

[Service]
User=pi
ExecStart=/usr/bin/python3 /home/pi/hailo-proxy/hailo-bridge.py
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target

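This tells systemd to run hailo-bridge.py as the pi user, start it once network.target is reached (the network stack is set up, though not necessarily fully online), and restart it automatically after five seconds whenever the process exits. If your username is not pi, adjust User= and the path in ExecStart to match.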

Now, enable and start the service:

sudo systemctl daemon-reload
sudo systemctl enable hailo-bridge
sudo systemctl start hailo-bridge
sudo systemctl status hailo-bridge

daemon-reload tells systemd to pick up the new service file. enable makes the service start on boot, start launches it immediately, and status confirms whether it is running successfully.

Test the bridge

Now for the important part: make sure the bridge is not just installed, but actually doing bridge things.

These three tests confirm the full chain: hailo-bridge, Open WebUI, hailo-ollama, and search injection through SearXNG.

# Test 1: Basic chat, non-streaming
# Confirms bridge → Open WebUI → hailo-ollama is working end-to-end.
curl -s http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3.2:1b","messages":[{"role":"user","content":"hello"}],"stream":false}'

If this returns a normal chat completion, the basic backend path is alive.

# Test 2: Streaming response
# Confirms SSE conversion is working.
# The response should arrive as data: chunks.
curl -s http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3.2:1b","messages":[{"role":"user","content":"hello"}],"stream":true}'

A working streaming response looks like a sequence of data: {...} lines ending with data: [DONE]. This is the bridge doing its little magic trick: hailo-ollama does not have to stream, but Hermes still gets a stream-shaped response.
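
If you would rather check the stream from Python than eyeball curl output, here is a minimal sketch that reassembles the text from those data: lines. The function name collect_stream_text is my own; it assumes the OpenAI-style chunk layout described above.

```python
import json
from typing import Iterable


def collect_stream_text(lines: Iterable[str]) -> str:
    """Rebuild the assistant's reply from OpenAI-style SSE lines."""
    parts = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything unexpected
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        delta = json.loads(data)["choices"][0].get("delta", {})
        # Role-only and closing chunks carry no content.
        parts.append(delta.get("content") or "")
    return "".join(parts)
```

Feed it the decoded lines of the HTTP response body. If the joined result matches what the non-streaming test returns for the same prompt, the SSE conversion is behaving.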

# Test 3: Search injection
# Confirms hailo-bridge can detect a search-like request and inject SearXNG results.
curl -s http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3.2:1b","messages":[{"role":"user","content":"latest Raspberry Pi news"}],"stream":false}'

A working Test 3 response should include search-backed results instead of a hallucinated summary. To confirm the bridge triggered search injection, check the logs:

sudo journalctl -u hailo-bridge -n 20 --no-pager

Look for a line similar to:

SEARCH injected for: ...

If you see that, the bridge is doing exactly what it was built to do: politely intercepting chaos before it reaches the tiny model.

What's Next — Part 3

The bridge is running. Requests go in, responses come out, search results are real. The infrastructure layer is done.

Part 3 is where we connect this to the outside world — Hermes Agent, WhatsApp, scheduled tasks, and an honest conversation about what the 1.5B models can actually do when you ask them to run your life autonomously.

Spoiler: it's complicated.

Next: The Brain, Hermes Agent, WhatsApp, and the Tiny Model Reality Check

Cover image generated by AI. The Pis may be fake, but the over-engineering is very real.

Part 2: Building the hailo-bridge and Solving the Streaming Problem
