Part 2: Building the hailo-bridge and Solving the Streaming Problem

Where I discovered that "just connect the AI to WhatsApp" involves writing a custom HTTP streaming bridge, a self-hosted search engine integration, and a concerning amount of systemd configuration.


Introduction

In Part 1, we built the inference foundation and briefly enjoyed the dangerous illusion that everything was under control. Two HAT Pis were running hailo-ollama, Open WebUI was sitting in front of them like a well-dressed receptionist, and SearXNG was ready to fetch web results without sending every question straight into the cloud. The models responded, the services stayed up, and for a short, beautiful moment, the whole thing looked suspiciously functional.

Then I tried to connect it to Hermes Agent and send myself a WhatsApp message. The first problem appeared within minutes. Then another. Then a third that turned out to be caused by the solution to the second. In short, the polite architecture diagram turned into a hooligan! Order was restored (sort of) by a custom Python HTTP server sitting between Hermes and Open WebUI, doing things no reasonable person would have anticipated needing to do. It's called hailo-bridge, and this article is largely its origin story.

This part covers the piece of Pi #4 that everything else depends on: the custom Python bridge that gets Hermes and hailo-ollama speaking the same language, including streaming conversion, system prompt limits, and web search injection. Installing and configuring Hermes Agent, the WhatsApp integration, and an honest look at what 1.5B models can and cannot do are left for Part 3, because there is a big difference between “runs locally” and “should be trusted with tools, search, and your afternoon.” Building the bridge also taught me that when you bring AI fully local, the privacy improves, but the hallucinations come along for the ride.


The Compatibility Problem: Why hailo-bridge Exists

Before getting into the code, it is worth explaining why this bridge exists in the first place. When I started, everything looked compatible on paper. Hermes Agent talks to an OpenAI-style API, Open WebUI exposes an OpenAI-style API, and hailo-ollama behaves enough like Ollama that you briefly start believing this might be easy. That belief lasted about as long as the first WhatsApp test message.

The short version is that Hermes and hailo-ollama speak slightly different dialects of the same API-shaped language. Not wildly different. Not “wrong continent” different. Just different enough that everything breaks in annoying, specific ways.

Problem 1: Streaming responses

The first problem was streaming. Hermes sends its requests with stream: true and expects a streamed reply, both for the typing experience and for handling longer responses cleanly. hailo-ollama, at least in this setup, was not interested in that arrangement.

Ask it to stream and it returns HTTP 500. Not a polite warning. Not a partial response. Just a firm little door slam. The Hailo inference path returns the completed generation as a whole, rather than delivering tokens incrementally like standard Ollama streaming.

So the bridge cheats, in the most useful possible way. It accepts the streaming request from Hermes, calls hailo-ollama internally with streaming disabled, waits for the full response, and then replays that response as Server-Sent Events. Hermes gets the stream-shaped response it expects. hailo-ollama never has to stream anything. Everyone leaves the meeting feeling heard.
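
The replay step might look roughly like the sketch below. It is not the code from hailo-bridge.py: the function name replay_as_sse, the chunk_size parameter, and the fixed-size slicing are my assumptions for illustration. Only the overall shape (OpenAI-style chat.completion.chunk events followed by data: [DONE]) comes from the behavior described above.

```python
import json
import time
import uuid


def replay_as_sse(full_response: dict, chunk_size: int = 40):
    """Yield OpenAI-style SSE events for an already-complete chat completion.

    Hypothetical helper: the name and chunking strategy are illustrative only.
    """
    text = full_response["choices"][0]["message"].get("content") or ""
    base = {
        "id": full_response.get("id", f"chatcmpl-{uuid.uuid4().hex}"),
        "object": "chat.completion.chunk",
        "created": full_response.get("created", int(time.time())),
        "model": full_response.get("model", "unknown"),
    }

    def event(delta, finish_reason=None):
        choice = {"index": 0, "delta": delta, "finish_reason": finish_reason}
        return f"data: {json.dumps(dict(base, choices=[choice]))}\n\n"

    # First event announces the assistant role, as streaming clients expect.
    yield event({"role": "assistant"})
    # Slice the finished text into pieces so it looks like incremental output.
    for start in range(0, len(text), chunk_size):
        yield event({"content": text[start:start + chunk_size]})
    # Final chunk carries the finish reason, then the stream terminator.
    yield event({}, finish_reason="stop")
    yield "data: [DONE]\n\n"
```

Each yielded string can be written straight to the HTTP response with a Content-Type of text/event-stream. The client still waits for the full generation before the first byte arrives, so this fixes compatibility, not latency.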

Problem 2: Oversized system prompts

The second problem was prompt size. Hermes builds a fairly chunky system prompt once you include the agent persona, platform instructions, memory context, tool definitions, and everything else it wants the model to know before answering.

With a moderate toolset loaded, that prompt can easily grow past 14,000 characters. The 1.5B models running on the HAT Pis do not have much room to spare, and once the combined prompt gets too large, hailo-ollama starts failing with HTTP 500 instead of gracefully explaining that the context window has left the building.

The bridge handles this with a configurable max_system_chars limit before forwarding the request. It is not a perfect token counter, because characters and tokens are not the same thing, but it is a practical guardrail. The goal is simple: keep the prompt small enough that the model actually responds, instead of feeding it an entire instruction manual and watching it faint.
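
As a minimal sketch of that guardrail, assuming the bridge simply keeps the first max_system_chars characters of each system message (the article does not say which end gets cut, and clamp_system_prompt is a hypothetical name):

```python
import copy


def clamp_system_prompt(messages, max_system_chars=10000):
    """Return a copy of messages with every system message cut to max_system_chars.

    Assumption: truncation keeps the start of the prompt and drops the tail.
    """
    clamped = copy.deepcopy(messages)
    for msg in clamped:
        content = msg.get("content")
        # Only plain-string system prompts are touched; user turns pass through.
        if msg.get("role") == "system" and isinstance(content, str):
            msg["content"] = content[:max_system_chars]
    return clamped
```

Keeping the head of the prompt preserves the persona and platform instructions, which usually come first, at the cost of whatever tool definitions sit at the end.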

Problem 3: Web search

The third problem was web search. Hermes has a proper web search toolset, which is the elegant version of this story. The model decides when search is needed, emits a correctly formatted tool call, receives the results, and uses them in the answer.

That works well when the model is big enough and reliable enough to use tools properly. The 1.5B models were enthusiastic, but not exactly dependable employees.

Sometimes they called tools. Sometimes they ignored them. Sometimes they produced something tool-shaped but not tool-valid, which is worse because now you have both a wrong answer and paperwork. Under a truncated system prompt, reliable tool calling became even less realistic.

So I moved search out of the model’s hands entirely. hailo-bridge detects search-like queries before they reach the model, calls SearXNG directly, and injects the results into the user message.

By the time the model sees the request, the search has already happened. Its job is no longer “decide whether to search, call the tool correctly, parse the response, and answer.” Its job is “summarize these results without inventing a parallel universe.”

That is much more realistic for a tiny local model.
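
The detect-search-inject flow can be sketched as follows. The trigger words, function names, and prompt wording are all hypothetical, since the real heuristic lives in hailo-bridge.py. The one external detail I am relying on is that SearXNG returns JSON with a results list when called with format=json, which has to be enabled in its settings.

```python
import json
import re
import urllib.parse
import urllib.request

# Hypothetical trigger words; the real bridge's heuristic may differ.
SEARCH_HINTS = {"latest", "news", "today", "current", "weather"}


def looks_like_search(text: str) -> bool:
    """Cheap keyword check deciding whether a message needs fresh web data."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    return bool(words & SEARCH_HINTS)


def fetch_results(search_url: str, query: str, limit: int = 3) -> list:
    """Query SearXNG's JSON output and return at most `limit` results."""
    params = urllib.parse.urlencode({"q": query, "format": "json"})
    with urllib.request.urlopen(f"{search_url}?{params}", timeout=10) as resp:
        return json.load(resp).get("results", [])[:limit]


def inject_results(messages: list, results: list) -> list:
    """Return a copy of messages with results placed before the last user text."""
    lines = [
        f"- {r.get('title', '')}: {r.get('content', '')} ({r.get('url', '')})"
        for r in results
    ]
    out = [dict(m) for m in messages]
    for msg in reversed(out):
        if msg.get("role") == "user":
            msg["content"] = (
                "Web search results:\n" + "\n".join(lines)
                + "\n\nUsing only the results above, answer: " + msg["content"]
            )
            break
    return out
```

In the bridge, the three pieces would chain together: if looks_like_search matches the last user message, call fetch_results with the configured search_url and pass the output to inject_results before forwarding.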

So that is why hailo-bridge exists. It is part translator, part adapter, part search proxy, and part tiny bouncer standing between Hermes and the inference stack saying, “I know what you meant, let me make that survivable.” It is not the most elegant architecture in the world, but it turns three incompatible assumptions into a working local agent pipeline. For roughly 200 lines of Python, I will take that deal.


Pi #4 — Setting Up hailo-bridge

Pi #4 is where the polite architecture diagram starts doing the weird but necessary work. This machine runs hailo-bridge, the small Python service that makes Hermes think it is talking to a normal Ollama-compatible endpoint, while quietly translating requests for the Pi cluster behind the curtain.

Prerequisites

Pi #4 needs a few system packages before it can do anything useful, so go ahead and run the following commands:

sudo apt update
sudo apt install -y python3-pip python3-yaml git curl build-essential ripgrep pipx
pipx ensurepath
source ~/.bashrc

These packages provide the basic components Hermes and hailo-bridge need: python3-pip installs Python dependencies, python3-yaml lets hailo-bridge read YAML config files, git and curl support downloading code and resources, build-essential provides the compiler toolchain needed by Python packages with native extensions, ripgrep gives Hermes fast file and session search, and pipx installs Hermes as an isolated Python CLI tool without polluting the system Python installation.

After pipx ensurepath, run source ~/.bashrc so the updated PATH takes effect immediately instead of requiring a logout.

Get the Open WebUI API Key

hailo-bridge authenticates to Open WebUI using an API key; without this, requests to Open WebUI will fail with an authentication error, and the bridge will mostly sit there looking innocent while nothing works. So, before configuring the bridge, we need to generate a key in Open WebUI.

First, ensure API keys are enabled in Open WebUI:

  • Open http://192.168.x.PI3:3000 and log in as admin

  • Go to Admin Panel → Settings → General

  • Enable the API Keys toggle; if this is off, the option won't appear for any user and you will be wondering how to press a non-existent button to get your key.

Then generate the key:

  • Go to Settings → Account → API Keys

  • Click Create new secret key, give it a name (e.g. hailo-bridge)

  • Copy the key immediately — Open WebUI does not show it again after this screen

This key goes into hailo-bridge-config.yaml as the api_key value.

Deploy the bridge

First, create a home for the bridge:

mkdir -p ~/hailo-proxy

Copy hailo-bridge.py to ~/hailo-proxy/hailo-bridge.py (provided as a companion file to this article).

Create ~/hailo-proxy/hailo-bridge-config.yaml:

port: 11434
default_model: llama3.2:1b
max_system_chars: 10000
search_url: http://192.168.x.PI3:8080/search
backends:
  llama3.2:1b: http://192.168.x.PI3:3000/ollama/v1
  qwen2.5:1.5b: http://192.168.x.PI3:3000/ollama/v1
api_key: your-openwebui-api-key-here

The port setting tells hailo-bridge which port to listen on. This example uses 11434, the standard Ollama port, so Hermes can talk to hailo-bridge without special configuration.

default_model sets the fallback model when a request does not specify one.

max_system_chars limits the size of system prompts before they are forwarded. I use 10000 as a tested sweet spot: large enough to preserve useful context, but small enough to avoid 500 errors from hailo-ollama.

search_url points to SearXNG on Pi #3.

The backends section defines model-to-backend routing. In this setup, both models route through Open WebUI on Pi #3, which then forwards each request to the appropriate HAT Pi based on the selected model.
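
To make the routing concrete, here is a small sketch of how a request's model name could be resolved against that config. The CONFIG dict mirrors what yaml.safe_load would return for the file above. resolve_backend is a hypothetical name, and both the fallback for unknown models and the appended /chat/completions path are my assumptions about how the bridge uses these values.

```python
# Mirrors the YAML config above as a plain dict (what yaml.safe_load would produce).
CONFIG = {
    "port": 11434,
    "default_model": "llama3.2:1b",
    "backends": {
        "llama3.2:1b": "http://192.168.x.PI3:3000/ollama/v1",
        "qwen2.5:1.5b": "http://192.168.x.PI3:3000/ollama/v1",
    },
}


def resolve_backend(config: dict, requested_model):
    """Pick the model and upstream URL for a request.

    Assumption: a missing or unknown model falls back to default_model.
    """
    backends = config["backends"]
    model = requested_model if requested_model in backends else config["default_model"]
    return model, backends[model].rstrip("/") + "/chat/completions"


model, url = resolve_backend(CONFIG, "qwen2.5:1.5b")
print(model, url)
```

Because both entries currently point at the same Open WebUI URL, the lookup looks redundant, but it leaves room to send a model straight to a different backend later without touching the code.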

Install hailo-bridge as a systemd service

Create a systemd service so hailo-bridge starts automatically on boot and restarts if it crashes. To do so, create a new service file:

sudo nano /etc/systemd/system/hailo-bridge.service

Paste the following configuration into the file:

[Unit]
Description=Hailo Bridge
After=network.target

[Service]
User=pi
ExecStart=/usr/bin/python3 /home/pi/hailo-proxy/hailo-bridge.py
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target

This tells systemd to run hailo-bridge.py as the pi user, start it once network.target has been reached, and restart it after a 5-second delay whenever the process exits. If your username is not pi, adjust User and the path in ExecStart to match.

Now, enable and start the service:

sudo systemctl daemon-reload
sudo systemctl enable hailo-bridge
sudo systemctl start hailo-bridge
sudo systemctl status hailo-bridge

daemon-reload tells systemd to pick up the new service file. enable makes the service start on boot, start launches it immediately, and status confirms whether it is running successfully.

Test the bridge

Now for the important part: make sure the bridge is not just installed, but actually doing bridge things.

These three tests confirm the full chain: hailo-bridge, Open WebUI, hailo-ollama, and search injection through SearXNG.

# Test 1: Basic chat, non-streaming
# Confirms bridge → Open WebUI → hailo-ollama is working end-to-end.
curl -s http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3.2:1b","messages":[{"role":"user","content":"hello"}],"stream":false}'

If this returns a normal chat completion, the basic backend path is alive.

# Test 2: Streaming response
# Confirms SSE conversion is working.
# The response should arrive as data: chunks.
curl -sN http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3.2:1b","messages":[{"role":"user","content":"hello"}],"stream":true}'

A working streaming response looks like a sequence of data: {...} lines ending with data: [DONE]. This is the bridge doing its little magic trick: hailo-ollama does not have to stream, but Hermes still gets a stream-shaped response.
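
If you would rather check the stream from Python than by eye, a small parser along these lines reassembles the text from the data: lines. It assumes the chunks follow the usual OpenAI streaming shape with choices[].delta.content, and collect_stream_text is a name I made up for the sketch.

```python
import json


def collect_stream_text(lines):
    """Join delta.content from an iterable of SSE lines, stopping at [DONE]."""
    parts = []
    for raw in lines:
        line = raw.strip()
        # Skip blank separators and anything that is not an SSE data field.
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        for choice in json.loads(payload).get("choices", []):
            parts.append(choice.get("delta", {}).get("content") or "")
    return "".join(parts)


sample = [
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    "",
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    'data: {"choices":[{"delta":{"content":"lo"}}]}',
    "data: [DONE]",
]
print(collect_stream_text(sample))
```

Feed it the lines of the curl output from Test 2 (for example, saved to a file and read with splitlines) and the result should be the same text Test 1 returns in a single message.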

# Test 3: Search injection
# Confirms hailo-bridge can detect a search-like request and inject SearXNG results.
curl -s http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3.2:1b","messages":[{"role":"user","content":"latest Raspberry Pi news"}],"stream":false}'

A working Test 3 response should include search-backed results instead of a hallucinated summary. To confirm the bridge triggered search injection, check the logs:

sudo journalctl -u hailo-bridge -n 20 --no-pager

Look for a line similar to:

SEARCH injected for: ...

If you see that, the bridge is doing exactly what it was built to do: politely intercepting chaos before it reaches the tiny model.


What's Next — Part 3

The bridge is running. Requests go in, responses come out, search results are real. The infrastructure layer is done.

Part 3 is where we connect this to the outside world — Hermes Agent, WhatsApp, scheduled tasks, and an honest conversation about what the 1.5B models can actually do when you ask them to run your life autonomously.

Spoiler: it's complicated.


Next: The Brain, Hermes Agent, WhatsApp, and the Tiny Model Reality Check


Cover image generated by AI. The Pis may be fake, but the over-engineering is very real.


Bare Metal AI — Herding Pis to Think Locally

Part 2 of 3

Some people see a Raspberry Pi and think neat little computer. Other people start wondering how many they can stack before the desk gives out. Bare Metal AI is for that second kind of person. It follows the messy, rewarding ride of building self hosted AI infrastructure from scratch, from inference boxes and custom bridges to local search, messaging hacks, and the kind of debugging that only seems to happen at 3am. No cloud. No mystery layers. Just hardware, code, stubbornness, and a dangerous amount of coffee.
