Part 3: The Brain, Hermes Agent, WhatsApp, and the Tiny Model Reality Check
Where we wire everything together, send our first WhatsApp message, and have a long honest conversation about parameter counts.

Introduction
Parts 1 and 2 built the infrastructure: two HAT Pis running hailo-ollama on Hailo-10H silicon, Open WebUI sitting in front of them behind a clean API, SearXNG handling private web search, and hailo-bridge tying the pieces together while solving the streaming problem that would have stopped the whole thing cold.
Now comes the part that makes it feel alive: the agent.
Hermes Agent turns a model endpoint into something that can receive a WhatsApp message, decide what to do, call tools, schedule reminders, and reply without you babysitting every step. It is the difference between a cluster of services that respond to curl commands and a system that can actually do something useful from a chat message.
It is also where the 1.5B model problem stops being theoretical and starts knocking over furniture.
Pi #4 - Hermes Agent
With hailo-bridge running, Pi #4 now presents itself like a standard Ollama-compatible server on port 11434. Hermes does not need to know that the actual model traffic is being routed through a Pi cluster behind the scenes. From Hermes’s point of view, it is simply talking to a local model endpoint.
Now let’s put the agent on the board.
What is Hermes Agent?
Hermes Agent is an open-source AI agent framework by Nous Research. It handles the layer above raw model inference: the conversation loop, tool execution, persistent memory, scheduled jobs, and messaging gateways.
In this build, Hermes is the agent layer. It receives messages, runs the conversation loop, calls tools when needed, manages scheduled jobs, and sends replies back through WhatsApp.
Hermes supports multiple messaging platforms, custom tools, background gateway operation, and scheduled jobs. For this setup, the important pieces are the WhatsApp gateway, the cron scheduler, and the agent loop itself. Web search is handled separately through the hailo-bridge injection approach from Part 2, because asking a tiny local model to reliably decide when and how to call search tools turned out to be optimistic (may be a bit too optimistic).
Install
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash -s -- --skip-browser
The --skip-browser flag skips the Playwright/Chromium browser automation install. This build does not use Hermes’s browser automation tools, so there is no reason to install the Chromium/Playwright stack on the Pi.
The installer launches an interactive setup wizard. Enter the following values:
After the installer finishes, reload your shell so the hermes command is available in the current session:
source ~/.bashrc
Then install and start the Hermes gateway as a system service:
sudo /home/pi/.local/bin/hermes gateway install --system --run-as-user pi
sudo /home/pi/.local/bin/hermes gateway start --system
The first command installs the gateway service so it can run under the pi user. The second command starts the service immediately.
If your Hermes installer prints a slightly different system-service command, use the command it gives you. Hermes is moving quickly, and the installer output is the best source of truth for your installed version.
Configure Hermes
The main Hermes configuration file is:
~/.hermes/config.yaml
The installer will already have created the full file. You do not need to replace everything; update the key sections below.
Note on Hermes versions: Hermes is actively developed and config structure can change between releases. The blocks below reflect what worked on this build. If your generated
config.yamllooks different, treat these as reference examples rather than exact templates — the structure your installer created is the authoritative one for your version.
The default model should point to hailo-bridge on Pi #4. Because hailo-bridge exposes an OpenAI-compatible endpoint, the base URL includes /v1, and Hermes uses the chat_completions API mode.
model:
default: llama3.2:1b
provider: custom
base_url: http://localhost:11434/v1
api_mode: chat_completions
context_length: 65536
api_key: dummy
hailo-ollama can be slow under load, especially when requests are routed through the Pi cluster. Increase the provider timeouts so Hermes does not give up too quickly.
providers:
custom:
request_timeout_seconds: 300
stale_timeout_seconds: 300
Toolsets define which tools Hermes makes available on each platform. Keep this list small: every enabled tool adds to the system prompt, and tiny local models have very little context room to spare.
The web toolset is left enabled as a fallback, but normal search in this build is handled by hailo-bridge before the request reaches the model.
platform_toolsets:
cli:
- cronjob
- todo
- web
whatsapp:
- cronjob
- todo
- web
Both local models route through hailo-bridge, which then sends the request through Open WebUI to the appropriate HAT Pi.
custom_providers:
- name: llama3.2:1b
base_url: http://localhost:11434/v1
model: llama3.2:1b
api_mode: chat_completions
context_length: 65536
api_key: dummy
- name: qwen2.5:1.5b
base_url: http://localhost:11434/v1
model: qwen2.5:1.5b
api_mode: chat_completions
context_length: 65536
api_key: dummy
The agent settings below are the main runtime knobs for retries, gateway timing, tool-use behavior, and long-running sessions. These values are intentionally conservative for slower local inference.
agent:
max_turns: 200
gateway_timeout: 1800
restart_drain_timeout: 180
api_max_retries: 3
service_tier: ''
tool_use_enforcement: auto
gateway_timeout_warning: 900
clarify_timeout: 600
gateway_notify_interval: 180
gateway_auto_continue_freshness: 3600
image_input_mode: auto
disabled_toolsets: []
verbose: false
reasoning_effort: medium
The session reset is important for WhatsApp stability. WhatsApp sessions can accumulate conversation history over time, which increases the payload sent through hailo-bridge. Eventually that context can exceed what the local model can handle reliably.
Resetting idle sessions after 60 minutes keeps the active context small enough for the 1.5B models to remain usable.
session_reset:
mode: idle
idle_minutes: 60
Finally, enable memory if you want Hermes to retain user profile and memory information between sessions.
memory:
memory_enabled: true
user_profile_enabled: true
A note on context_length: 65536 is the value advertised to Hermes so it does not prematurely reject longer sessions, but it is not a promise that the Hailo-backed 1.5B models can actually use that much context. In practice, hailo-bridge and the session reset policy are what keep requests small enough to survive.
WhatsApp setup
During the Hermes install you will have been prompted to set up WhatsApp. If you need to reconnect or reconfigure it:
hermes gateway setup
Select WhatsApp → Personal number (self-chat). This is the important part: Hermes operates in self-chat mode, meaning it reads and responds to messages you send to your own WhatsApp number. You're not setting up a separate bot account — you're wiring Hermes into your existing WhatsApp, accessible only to you.
Scan the QR code that appears using WhatsApp on your phone: Settings → Linked Devices → Link a Device.
Once connected, verify the gateway is running:
cat ~/.hermes/logs/gateway.log | tail -20
You should see ✓ whatsapp connected and Gateway running with 1 platform(s). If the gateway hangs or fails to start, see the Troubleshooting section below.
The Model Quality Problem
Let’s be honest about what happens next, because it took a while to accept.
The expectation was that real hardware, a real inference stack, and a real agent framework would add up to a real AI assistant. In one sense, they do. The compact models running on the Hailo hardware are fast for what they are, and in the Open WebUI chat interface from Part 1, they are genuinely useful. They respond quickly, they can hold a basic conversation, and they are perfectly capable of answering short questions. But as autonomous agents, they are a different story.
In this build, the 1.5B models worked reasonably well for:
simple chat and Q&A
summarizing pre-fetched content
short, single-step responses
They struggled consistently with:
reliable tool calling
creating reminders and scheduled jobs
following multi-step instructions
maintaining coherent context across a complex task
high-trust actions involving tools or external systems
A lot of time went into trying to engineer around these limits: prompt tuning, system prompt truncation, search injection, and toolset minimization. Some of it helped. None of it changed the core reality: the bottleneck was model capability, not the bridge configuration.
Testing with a stronger model via API confirmed that the rest of the infrastructure was sound. The same hailo-bridge endpoint, WhatsApp gateway, search injection, and cron scheduler all behaved correctly once a more capable model was behind them. The pieces were working. The small local model simply was not strong enough to be the agent brain.
That is the practical lesson from this build. The HAT Pis are excellent for local chat, lightweight inference, and Open WebUI access. They are not yet where I would place the reasoning layer for a reliable autonomous agent.
If you want the full Hermes experience without adding new local hardware, point the agent at a capable model API from Anthropic, OpenAI, or another OpenAI-compatible provider. If you want to keep the system fully local, the better path is to keep the HAT Pis serving Open WebUI and move the agent brain to a stronger local machine, such as an Apple Silicon Mac running a 14B-class model at interactive speeds.
The tough realization is that this architecture wants to grow in a specific direction: the Pi cluster remains the local inference lab, and the stronger machine becomes the agent brain.
The bridge solved the plumbing problem. Unfortunately, it did not turn a tiny model into a senior operations engineer with 23 certifications.
Using Hermes
CLI
Start Hermes from the terminal:
hermes
The TUI shows the active model, token usage, and response time in the status bar.
Type naturally; it is a conversation, not a REST API with feelings. Here are some useful commands:
/model llama3.2:1b # switch to the local Llama model
/model qwen2.5:1.5b # switch to the local Qwen model
/tools # list loaded tools
/cron list # list scheduled cron jobs
/reset # clear conversation history
/exit # quit
Message yourself on WhatsApp. The same general command style works there too. Look at the image below:
Did you notice that the header "🤖 Jim Says ..." in the response? That was me having fun! It's a simple configuration change: edit the ~/.hermes/config.yaml configuration file and add the following block (\\n instead of \n to guarantee clean line break inside double quotes):
whatsapp:
reply_prefix: "🤖 *Jim Says...*\\n\\n"
A few things worth knowing:
Local 1.5B model responses can take 15 to 45 seconds. This is normal.
If responses stop or become erratic, send /reset to clear the conversation history.
The
session_reset.idle_minutes: 60config helps clean up old sessions automatically after inactivity.Model switching with
/modelmay not work reliably from WhatsApp in this setup. If needed, change the default model inconfig.yamlinstead.
Some things to try once it is working:
hello
latest news on AI
latest news on Raspberry Pi
what's the weather like in New York today
Scheduled Tasks
Cron jobs are the one place where the local model limitations do not bite as hard, because the schedule itself is explicit. You provide the cron expression and the prompt directly. The model only runs when the job fires; it does not need to invent the schedule from scratch while juggling tools and context like a tiny overworked intern.
hermes /cron add "0 8 \* \* \*" "search for top technology news today and summarise"
/cron list
/cron delete <job-id>
Model Capabilities Reference
The table below is a quick reality check on what tiny local models can do, what bigger models do better, and where the 1.5B agent dream starts making dial-up noises.
Troubleshooting
Most problems in this build fall into a few predictable buckets: the Hailo backend is overloaded, the prompt is too large, WhatsApp history has grown too much, or one of the background services is stuck. The commands below are the fastest way to figure out which part is misbehaving.
Note*: These are the issues that caused me lose a bit of hair. Your setup might break in entirely new, creative, and beautiful ways—so grab some coffee and prepare to hunt your own demons!*
hailo-ollama returning 500
A 500 from hailo-ollama usually means the Hailo chip is overloaded, the service crashed, or the request payload was too large for the model to handle. Start by checking whether the HAT Pis are still responding.
curl -s http://192.168.x.PI1:8000/api/tags
curl -s http://192.168.x.PI2:8000/api/tags
If either Pi fails to respond and you configured a watchdog in Part 1, the watchdog should bring the service back automatically. If not, restart the affected service on that Pi and then check the bridge logs for more context.
sudo journalctl -u hailo-bridge -n 20 --no-pager
System Prompt Truncation
hailo-bridge logs when it trims an oversized system prompt before forwarding the request. This is useful when Hermes is loading too many tools or carrying too much context.
sudo journalctl -u hailo-bridge -f | grep TRUNCATED
If you see something like TRUNCATED system: 14000 → 10000 chars on every request, the active toolset is too large for the model’s usable context window. Reduce the toolsets in config.yaml, shorten the system prompt, or accept the tradeoff: responses may still arrive, but with less instructional context.
WhatsApp Slow or Failing
If WhatsApp responses get slow, erratic, or stop entirely, the conversation history may have grown too large. That history becomes part of the context payload sent through hailo-bridge, and eventually the tiny local model starts carrying more luggage than it was designed for.
Send this from WhatsApp to clear the current session immediately:
/reset
The session_reset config helps prevent this from recurring after inactivity by resetting idle sessions automatically, but it will not shrink an active conversation while you are still using it.
Another way to keep your sessions unclogged is to periodically prune them. Use the following command to manually prune your sessions (run hermes sessions prune --help for options):
# Deletes sessions older than 90 days (default)
hermes sessions prune
If you want automatic pruning, add the following configuration block to your ~/.hermes/config.yaml file:
sessions:
auto_prune: true
retention_days: 90
vacuum_after_prune: true
min_interval_hours: 24
Hermes Gateway Hanging on Restart
The gateway may wait for in-flight requests to drain before shutting down. If a request is stuck while hailo-ollama is failing, the restart can appear to hang.
Try the clean restart first:
sudo systemctl restart hermes-gateway
If the service is still stuck, force-kill the gateway and WhatsApp bridge processes, then start the service again.
sudo kill -9 $(pgrep -f "hermes_cli.main gateway")
sudo kill -9 $(pgrep -f "whatsapp-bridge")
sudo systemctl start hermes-gateway
Use the force option sparingly. It is the software equivalent of pulling the plug, which is sometimes exactly what you need, but it should not be the first button you press.
Search Not Triggering
If search results are not being injected, check whether hailo-bridge detected the query as search-like.
sudo journalctl -u hailo-bridge -n 10 --no-pager | grep SEARCH
If nothing appears, the query probably did not contain one of the configured trigger phrases. Some of the default triggers are: search, look up, find, what is, who is, latest, news, current, today, and recent.
Rephrase the query, or add more trigger phrases inside the is_search_query() function in hailo-bridge.py.
Log Management
Logs accumulate quickly, especially when hailo-ollama is throwing repeated 500 errors. Clearing the application logs occasionally keeps the Pi tidy and makes new failures easier to spot.
For a one-time cleanup, you can truncate the Hermes logs:
truncate -s 0 ~/.hermes/logs/agent.log
truncate -s 0 ~/.hermes/logs/gateway.log
truncate -s 0 ~/.hermes/logs/errors.log
truncate -s 0 ~/.hermes/logs/gateway-exit-diag.log
truncate -s 0 ~/.hermes/logs/gateway-shutdown-diag.log
truncate -s 0 ~/.hermes/whatsapp/bridge.log
You can also reduce systemd journal usage.
sudo journalctl --vacuum-time=1d
If disk space is tight, use a size limit instead:
sudo journalctl --vacuum-size=100M
For ongoing cleanup, add a logrotate rule for the Hermes logs:
sudo nano /etc/logrotate.d/hermes
Paste the following configuration into the file:
/home/pi/.hermes/logs/*.log
/home/pi/.hermes/whatsapp/*.log {
weekly
rotate 4
compress
missingok
notifempty
copytruncate
}
This keeps four weekly compressed log archives and uses copytruncate, which is useful for long-running processes that keep log files open.
You can test the rule without waiting for the next scheduled rotation:
sudo logrotate -d /etc/logrotate.d/hermes
If the dry run looks good, force one rotation to confirm it works:
sudo logrotate -f /etc/logrotate.d/hermes
Service Management
These are the service commands I ended up using most often while testing. They are worth keeping nearby, because eventually something will need a restart. This is still a Raspberry Pi cluster, not a nuclear submarine.
On Pi #4, manage hailo-bridge and the Hermes gateway:
sudo systemctl status hailo-bridge
sudo systemctl restart hailo-bridge
sudo systemctl status hermes-gateway
sudo systemctl restart hermes-gateway
If you need to stop or start them manually, use the same pattern:
sudo systemctl stop hailo-bridge
sudo systemctl start hailo-bridge
sudo systemctl stop hermes-gateway
sudo systemctl start hermes-gateway
On Pi #3, restart the containerized services:
# SearXNG
docker compose -f ~/dockerstore/searxng/docker-compose.yml restart core
# Open WebUI
docker compose -f ~/dockerstore/open-webui/compose.yaml restart open-webui
Key Lessons Learned
This build worked, but it also made the limits very clear. The hardware is useful, the services are stable once configured, and the architecture is sound. The weak link is not the bridge or the messaging gateway. The weak link is asking a tiny model to behave like a reliable autonomous operator.
Infrastructure is the hard part
The models are the last mile. Getting streaming conversion, prompt truncation, search injection, and session management working consumed most of the real engineering effort. None of those pieces is complicated in isolation, but they all have to cooperate.
1.5B models are not agent-capable
They are fine for chat where a human is in the loop and can tolerate some inconsistency. They are not reliable enough for autonomous tasks that require tool use, multi-step reasoning, or precise instruction following. In this setup, 7B-class models feel like the practical minimum.
Inject, do not tool-call
For web search with small models, bypassing tool calling and injecting pre-fetched results into the prompt is far more reliable. The model’s job becomes summarization, not orchestration, which is a much easier assignment for a 1.5B model.
WhatsApp and Session Management
Session Management is non-negotiable for WhatsApp. Conversation history grows quietly and breaks things loudly. Configure session_reset, and keep /reset handy.
Pin your packages.
Both the Hailo packages and the kernel version matter. A routine system update at the wrong time can silently break the inference stack.
What’s Next
The stack works. Messages go in, responses come out, search results are real, and scheduled jobs fire on time. The limitation is now clear and well-defined: the models are too small for full agentic capability.
The next part (or parts) will cover what happens when you swap in a capable model, either through an API or on dedicated local hardware, and what this infrastructure looks like once the model bottleneck is removed.
Follow the series if you want to see where this goes. The hardware is good. The stack and architecture is solid. It just needs a bigger brain.
Stack: Raspberry Pi 5 - 16GB (×4) · Hailo AI HAT+ 2 (×2) · HailoRT 5.2.0 · hailo-ollama · Open WebUI · SearXNG · Valkey · Hermes Agent · custom Python hailo-bridge
Cover image generated by AI. The Pis may be fake, but the over-engineering is very real.





