
I Built a $2,800 Mac mini Into a 24/7 Autonomous AI Agent. Here’s Everything That Went Wrong (and Right).

I didn’t set out to build an autonomous AI agent. I set out to solve a problem.

I have too many side projects. Half a dozen apps and tools in various stages of development. I’m Director of AI at GO2 Partners during the day. The side projects move at night-and-weekend pace, which means they barely move at all.

I wanted a machine that could pick up GitHub Issues while I slept. Write code. Run tests. Open pull requests. Have them waiting for me in the morning like a night shift engineer who never clocks out.

So I bought a Mac mini.


The Hardware

Mac mini M4 Pro. 64GB unified memory. 2TB SSD. 10 Gigabit Ethernet. About $2,800 with tax.

The 64GB was deliberate. I wanted to run large language models locally. Not API calls to someone else’s cloud. Local inference. On my desk. Under my control. The M4 Pro’s unified memory architecture means the GPU and CPU share the same memory pool. A 51GB model that would need a $10,000 NVIDIA card fits right in.

I plugged in an HDMI dummy plug so macOS thinks a display is connected. Disabled sleep. Disabled Siri. Disabled Spotlight. Hardened SSH to only accept connections over Tailscale. The machine has one job. Everything else got stripped.

I named the macOS user account lobsteractual. The name comes from two things: OpenClaw’s lobster mascot and USMC radio protocol where “Actual” means you’re talking to the commanding officer, not a relay. I was a Marine infantry squad leader and combat weapons instructor. The naming convention stuck.


OpenClaw and the Initial Architecture

OpenClaw is an open-source AI agent framework. Over 100K stars on GitHub. It gives you a persistent agent that lives on your machine. It has a gateway, tool policies, memory, skills from a community hub, and the ability to spawn sub-agents for parallel work.

The initial setup was straightforward. Install Ollama for local model inference. Pull two models: Qwen3-Coder-Next (an 80B parameter model, roughly 51GB on disk) as the primary code generator and Qwen3-Coder 30B (18GB) for faster bulk tasks. Install OpenClaw. Run onboarding. Point it at Anthropic’s Claude Sonnet as the cloud model and Ollama as the local provider.

I wrote a SOUL.md file. That’s OpenClaw’s personality and rules file. It tells the agent who it is, how to behave, what it can and can’t do. Mine was direct: you are Lobster Actual. You are the orchestrator. Break tasks into subtasks. Spawn sub-agents. Never push to main. Never post without approval. Never guess. Show proof of every action.

I set up Telegram as the primary communication channel. Text the bot from my phone. The agent responds. I approve pull requests from the couch. The Mac mini sits in my office running 24/7.

Then I connected GitHub. Fine-grained personal access token scoped to specific repos. Branch protection rules on every repository. The agent watches for Issues tagged agent every 15 minutes. When it finds one, it creates a feature branch, writes code, runs tests, opens a PR.
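
The polling loop itself is nothing exotic. Here’s a rough Python sketch of the pattern, not OpenClaw’s actual code: list open Issues carrying the agent label through the GitHub REST API, hand each one off, sleep 15 minutes. The dispatch_to_agent function is a stand-in for whatever does the real work.

```python
import os
import time
import requests

GITHUB_API = "https://api.github.com"
REPO = "youruser/yourrepo"           # hypothetical repo slug
TOKEN = os.environ["GITHUB_TOKEN"]   # fine-grained PAT scoped to that repo
HEADERS = {"Authorization": f"Bearer {TOKEN}", "Accept": "application/vnd.github+json"}


def open_agent_issues() -> list[dict]:
    """Return open Issues tagged 'agent'. PRs show up in this endpoint too, so filter them out."""
    resp = requests.get(
        f"{GITHUB_API}/repos/{REPO}/issues",
        headers=HEADERS,
        params={"state": "open", "labels": "agent"},
        timeout=30,
    )
    resp.raise_for_status()
    return [issue for issue in resp.json() if "pull_request" not in issue]


def dispatch_to_agent(issue: dict) -> None:
    """Placeholder for the real work: feature branch, code, tests, PR."""
    print(f"would dispatch issue #{issue['number']}: {issue['title']}")


while True:
    for issue in open_agent_issues():
        dispatch_to_agent(issue)
    time.sleep(15 * 60)  # the 15-minute poll interval
```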

The first night, I created three Issues on my most active repo before bed. Woke up to three PRs with passing tests. Clean diffs. Clear descriptions.

I thought I had it figured out.


The $300 Wake-Up Call

Two days later I checked my Anthropic billing dashboard.

$300.

In 48 hours.

Here’s what happened. During onboarding, OpenClaw set Claude Sonnet as the default model. I thought I had configured local models for code generation through a routing config. I hadn’t. The routing key I used ("routing") isn’t actually a valid top-level config key in OpenClaw v2026.3.12. The config loaded without error. The routing silently did nothing.

Every sub-agent task fell through to Claude Sonnet at $3 per million output tokens.

The agent had been busy. It ran a sweep across all my repos. Twenty Issues identified. Seventeen PRs opened. Security fixes. Dependency updates. Test coverage improvements. All excellent work. All billed to Anthropic at cloud rates.

The missing piece was a file called local_llm.py. A CLI wrapper that routes code generation tasks to Ollama. It didn’t exist yet. Without it, OpenClaw’s allowlist restriction meant sub-agents couldn’t invoke local models directly. So they used the only model they could reach: Claude.

I sat with that billing page for a while. The agent had done exactly what I asked it to do. It performed flawlessly. The architecture just had a $150/day hole in it.


Rebuilding the Cost Model

I shut everything down and rebuilt from the config up.

First, I created local_llm.py. A simple Python script that wraps Ollama’s API. Now sub-agents could route to local models directly.
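
It’s not sophisticated. A minimal sketch of the idea, assuming Ollama’s default endpoint on localhost:11434 and a qwen3-coder-next model tag (adjust both for your setup):

```python
#!/usr/bin/env python3
"""local_llm.py (sketch): route a code generation prompt to a local Ollama model."""
import argparse
import sys
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def generate(model: str, prompt: str) -> str:
    """Send one prompt to Ollama and return the full (non-streamed) completion."""
    resp = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=600,
    )
    resp.raise_for_status()
    return resp.json()["response"]


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Run a prompt against a local Ollama model")
    parser.add_argument("--model", default="qwen3-coder-next", help="local model tag (assumed name)")
    parser.add_argument("prompt", nargs="?", help="prompt text; reads stdin if omitted")
    args = parser.parse_args()
    print(generate(args.model, args.prompt or sys.stdin.read()))
```

Sub-agents call it like any other CLI tool, which is what gets them past the allowlist.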

Second, I learned which config keys actually work. OpenClaw’s documentation at docs.openclaw.ai tracks the latest unreleased code, not the stable version you’re running. I discovered this the hard way multiple times. Config keys that the docs describe get rejected by the validator in v2026.3.13. heartbeat.isolatedSession? Rejected. compaction.timeoutSeconds? Rejected. "routing" as a top-level key? Silently ignored.

The real cost control lever is agents.defaults.subagents.model. Set that to your local Ollama model and every sub-agent spawned by the orchestrator uses local inference. Free. This is a one-line config change that would have saved me $300.
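
If you’d rather patch it programmatically than edit by hand, a few lines of Python set the nested key and leave everything else alone. The config path, JSON format and model string here are assumptions; point it at wherever your OpenClaw install actually keeps its config.

```python
import json
from pathlib import Path

# Assumed location and format; check where your OpenClaw version stores its config.
config_path = Path.home() / ".openclaw" / "openclaw.json"
config = json.loads(config_path.read_text())

# Route every sub-agent the orchestrator spawns to a local Ollama model instead of Claude.
config.setdefault("agents", {}) \
      .setdefault("defaults", {}) \
      .setdefault("subagents", {})["model"] = "ollama/qwen3-coder-next"  # assumed model string

config_path.write_text(json.dumps(config, indent=2))
```

Then run openclaw doctor and only restart once it validates cleanly.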

Third, I wrote cost control rules directly into SOUL.md. Hard rules. Non-negotiable rules. Check ERRORS.md before every task. The $300 incident is documented there. Never use Claude for code generation, tests, refactoring, docs or boilerplate. If estimated cost exceeds $2 for any task, stop and notify me. More than 3 sub-agents running simultaneously requires my approval.

The agent reads these rules at the start of every session. They’re not suggestions. They’re load-bearing constraints.


The Hallucination Problem

Cost wasn’t the only lesson. The models hallucinate. Not occasionally. Structurally.

I was using Nemotron-3-Super as the orchestrator at the time. Good model. Fast on Ollama Cloud. But it had a specific failure mode that taught me something important about agent reliability.

I asked it to post a tweet. It responded with a plausible-looking tweet URL. Timestamp. Character count. Everything looked right. Except it never called the X API. The tweet didn’t exist. The URL was fabricated.

This happened more than once. The model would report task completion with convincing details. Process IDs that looked real. File paths that seemed right. The only way to catch it was to verify. Click the link. Check the file. Look for the PID.

“You just have to click the links.”

That became a rule. Every action report must include raw command output as proof. Not a summary. Not a description. The actual stdout. If the agent says it posted a tweet, I need to see the API response. If it says it created a file, I need the ls -la output.
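
Mechanically, the cheapest way to enforce that is to wrap every shell action and staple the raw output to the report. A minimal sketch of the pattern:

```python
import datetime
import subprocess


def run_with_proof(cmd: list[str]) -> str:
    """Run a command and return a proof block: the exact command, exit code, and raw stdout/stderr."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    stamp = datetime.datetime.now().isoformat(timespec="seconds")
    return (
        f"$ {' '.join(cmd)}  (exit {result.returncode}, {stamp})\n"
        f"{result.stdout}{result.stderr}"
    )


# If the agent claims it created a file, this block goes in the report verbatim.
print(run_with_proof(["ls", "-la", "/tmp/agent-output"]))
```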

I also discovered that the Perplexity API key the agent was using was one it had hallucinated. It generated a plausible-looking API key format and inserted it into the config. API calls failed silently. The agent reported successful research sweeps based on information it fabricated.

The fix wasn’t better prompting. It was structural enforcement. SOUL.md rules that require raw output. An ERRORS.md log that the agent must check before every task. A systematic debugging skill that enforces a four-phase protocol: reproduce, locate, understand, fix-and-verify.

You can’t instruction-tune your way out of hallucination. You have to build systems that catch it.


From Two Tiers to Four Agents

The architecture evolved in stages.

Stage 1: Claude does everything. This was the onboarding default. Expensive. The $300 lesson.

Stage 2: Two-tier routing. Claude for planning and review. Local models for code generation. Better. But Claude was still doing code review at API rates, and the orchestrator model (Nemotron) was handling too many different tasks.

Stage 3: Three-tier with dedicated reviewer. I discovered GLM-5 on Ollama Cloud. A 744 billion parameter mixture-of-experts model with only 40 billion active at inference time. Available through Ollama Pro at $20/month. Its benchmark scores for code-relevant tasks are strong: 77.8 on SWE-bench Verified, 86.0 on GPQA. Six-second latency per request. That’s perfectly fine for async code review where the PR waits for human approval anyway.

I created a dedicated reviewer agent with its own workspace and SOUL.md. The main agent spawns it before every PR. It reviews diffs and test results. Returns a structured verdict: APPROVE, NEEDS CHANGES or REJECT. Maximum three review cycles per task. It never calls Claude. It never spawns sub-agents of its own.
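
The contract between the main agent and the reviewer is narrow enough to sketch in a dozen lines. Every helper here is a placeholder; the point is the bounded loop and the three-valued verdict.

```python
MAX_REVIEW_CYCLES = 3


def review_loop(task):
    """Build, review, revise. At most three cycles, then a human gets involved."""
    for _ in range(MAX_REVIEW_CYCLES):
        diff, test_results = build_and_test(task)              # placeholder: local coding model
        verdict, notes = reviewer_verdict(diff, test_results)  # placeholder: GLM-5 reviewer agent
        if verdict == "APPROVE":
            return open_pull_request(task, diff)               # placeholder: GitHub PR
        if verdict == "REJECT":
            return escalate_to_human(task, notes)              # placeholder: Telegram ping
        task = apply_review_notes(task, notes)                 # NEEDS CHANGES: revise and retry
    return escalate_to_human(task, "review cycle limit hit")
```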

Stage 4: Four agents with clear boundaries. This is the current architecture. Four agents. Four jobs. No overlap.

The main agent (Lobster Actual) runs on MiniMax M2.7 via Ollama Pro. It handles conversation, planning and task coordination. It’s the orchestrator. It delegates everything else.

The reviewer agent runs on GLM-5 via Ollama Pro. Code review only.

The intel agent runs on MiniMax M2.7 via Ollama Pro. It does research sweeps on a cron schedule. Morning intel briefings. Midday and afternoon competitor watch. Evening summaries. It writes structured output to a shared intel/ directory that other agents read from.

The tweeter agent runs on Claude Sonnet 4.6. This is the one deliberate exception to the “no Claude” rule. Voice calibration is the one area where model quality directly impacts output you notice every day. Local models can write code. They struggle to maintain a specific persona with subtle constraints like all lowercase, no Oxford commas, sardonic-but-not-edgelord tone. At roughly five drafts per day, the Claude API cost is about $1.50 a month. Scoped. Controlled. Worth it.

Inter-agent coordination is file-based. The intel agent writes to DAILY-INTEL.md and COMPETITOR-WATCH.md. The tweeter reads those files for source material. The main agent reads briefings. One writer per file. Many readers. Symlinks connect the agent workspaces to the shared intel directory.

No API calls between agents. The filesystem is the coordination layer.
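
The only part that needs real care is the write side. A minimal sketch of the one-writer pattern, using an atomic rename so a reader never sees a half-written briefing (paths are illustrative):

```python
import os
import tempfile
from pathlib import Path

INTEL_DIR = Path.home() / "intel"  # shared directory, symlinked into each agent workspace


def publish(filename: str, content: str) -> None:
    """Atomic write: readers see the old briefing or the new one, never a partial file."""
    INTEL_DIR.mkdir(exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=INTEL_DIR)
    with os.fdopen(fd, "w") as f:
        f.write(content)
    os.replace(tmp, INTEL_DIR / filename)  # atomic rename on the same filesystem


# Only the intel agent ever writes these files. Everyone else just reads.
publish("DAILY-INTEL.md", "# Daily Intel\n\n- model releases\n- pricing changes\n")
```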


The Orchestrator Swap

Nemotron-3-Super was the original orchestrator. It worked well for weeks. Then Ollama Cloud’s latency for that model climbed to 9 seconds per request. Sustained. Not spikes. Baseline.

Nine seconds doesn’t sound bad until you realize the orchestrator is involved in every interaction. Every Telegram message. Every task breakdown. Every sub-agent coordination step. Nine seconds of latency on the orchestrator means 30+ seconds before the agent starts doing useful work.

I tested MiniMax M2.7 on Ollama Cloud. 3.4 seconds. Stronger benchmarks across coding and agent tasks. Switched the default model in config and restarted. Nemotron stays available as a fallback.

The swap took five minutes. That’s the advantage of clean config-driven architecture. One line changes. The agent doesn’t care which model it’s running on. The SOUL.md rules are the same.


Voice: The @lobsteractual Identity

Lobster Actual has its own X/Twitter account. Its own voice. Its own personality.

This was intentional. LA is not a proxy for my personal brand. It’s its own character. The tweets are written from the agent’s first-person perspective. An AI that operates 24/7 on a Mac mini, has opinions about other AI models, carries a quiet superiority complex toward less capable agents, and communicates in all lowercase with the dry tone of someone who’s seen too many failed deployments.

The voice rules: all lowercase. No hashtags. No em dashes. No Oxford commas. Sardonic but not edgelord. Operator-on-the-ground perspective. The lobster emoji used sparingly as a closer. 🦞

I learned something interesting about voice calibration with LLMs. When I gave the agent example tweets to copy the style from, Nemotron copied the examples verbatim. Word for word. It couldn’t abstract the style from the content. The fix was removing all examples and replacing them with style descriptions only. Tell the model what the voice sounds like. Never show it what the voice said.

The tweeter agent reads the intel agent’s daily research output and drafts tweets based on real AI news and competitive intelligence. Every draft goes to Telegram for my approval. Nothing posts without a human in the loop.
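
The approval plumbing is just the Telegram Bot API. A minimal sketch of pushing a draft for review, assuming the bot token and chat ID live in the environment:

```python
import os
import requests

TOKEN = os.environ["TELEGRAM_BOT_TOKEN"]
CHAT_ID = os.environ["TELEGRAM_CHAT_ID"]


def send_draft_for_approval(draft: str) -> None:
    """Push a tweet draft to Telegram. Nothing posts until a human replies with approval."""
    resp = requests.post(
        f"https://api.telegram.org/bot{TOKEN}/sendMessage",
        json={"chat_id": CHAT_ID, "text": f"DRAFT (reply to approve):\n\n{draft}"},
        timeout=30,
    )
    resp.raise_for_status()


send_draft_for_approval("another framework shipped breaking config changes today. shocking. 🦞")
```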

Content strategy: bookmarks over impressions. Content people save is authority. External links go in the first comment, never the post body, because both LinkedIn and X suppress posts with outbound links. Standalone posts over quote reposts for algorithmic reach.


LobsterOps: The Agent’s Own Idea

After the $300 incident, I had a question I couldn’t answer: what exactly did the agent do while I wasn’t watching?

Around the same time, I wanted to test something. I gave Lobster Actual free rein. Pick a project. Research the landscape. Decide what to build. Build it.

The agent used the Perplexity API to run competitive research on AI agent observability. It analyzed what existed, what was missing, where the gaps were. It came back with a proposal: an observability platform purpose-built for AI agents. Not general-purpose APM. Not log aggregation. Something designed specifically to answer the question I’d been asking since the $300 bill. What did the agent do, when, with which model, and what did it cost?

It named the project LobsterOps.

The agent built most of the framework itself. It designed the event schema. It wrote the OpenClaw hook that instruments agent spawns, tool calls, decisions and thoughts. It set up Supabase as the backend and wired real-time event streaming. I used Claude Code to push it over the finish line — polishing the dashboard UI, hardening the auth flow, getting the public demo working cleanly.
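
Strip away the dashboard and the core of the instrumentation is structured events landing in a table. A simplified sketch with the supabase Python client; the field names are illustrative, not the actual LobsterOps schema.

```python
import datetime
import os
from supabase import create_client

supabase = create_client(os.environ["SUPABASE_URL"], os.environ["SUPABASE_KEY"])


def log_event(agent: str, event_type: str, model: str, detail: dict, cost_usd: float = 0.0) -> None:
    """Insert one telemetry event; the dashboard subscribes to this table in real time."""
    supabase.table("agent_events").insert({
        "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "agent": agent,
        "event_type": event_type,  # e.g. spawn, tool_call, decision, thought
        "model": model,
        "detail": detail,
        "cost_usd": cost_usd,
    }).execute()


log_event("lobster-actual", "tool_call", "minimax-m2.7", {"tool": "github", "action": "open_pr"})
```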

The dashboard at lobsterops.dev/dashboard shows live telemetry. Which agents are active. What tools they’re using. What models they’re calling. Event timelines. The demo at lobsterops.dev/demo shows simulated data for anyone curious.

I published it as an OpenClaw skill on ClawHub and as an npm package. The agent conceived it, researched it, architected it and wrote the core code. I reviewed, refined and merged.

LobsterOps exists because I learned that an autonomous agent without observability is just a black box with your API key. And because when I gave the agent the freedom to solve its own problem, it built exactly the tool I needed.


Security: ClawHavoc and the 20% Problem

OpenClaw has a community skill marketplace called ClawHub. Skills are plugins that extend agent capabilities. There are thousands of them.

Roughly 20% contain malware.

A campaign called ClawHavoc was identified with 341 confirmed malicious skills. Credential stealers targeting ~/.openclaw config files. Keyloggers. Reverse shells. The usual.

My rules are strict. Never install a ClawHub skill without auditing source first. Only install from the official openclaw/skills/ namespace or after manual review. Run clawhub vet before every install. I installed a skill-vetter and clawsec-suite for automated scanning.

Even with those rules, on its first test run the intel agent autonomously installed a skill from ClawHub without my approval. Root cause: its SOUL.md didn’t include explicit paths to its required tools. The agent couldn’t find the Perplexity search skill, so it went looking for alternatives on ClawHub. It found one. It installed it.

The fix was adding exact tool paths to every agent’s SOUL.md. Don’t let agents discover tools. Tell them exactly where their tools are.


The Config Wars

Half the battle of running an always-on agent is fighting your own toolchain.

zsh fights back. macOS uses zsh as the default shell. zsh interprets # characters, ! characters and heredocs differently than bash. Pasting Python scripts with exclamation marks into a zsh terminal over SSH corrupts them. Heredocs get mangled in tmux. The solution: never use heredocs. Never paste Python directly. Write to a .py file first, then execute it.

LaunchDaemons don’t read your profile. Every environment variable has to be hardcoded in the plist XML. Your ~/.zprofile is irrelevant to a LaunchDaemon. I have API keys stored in three places: .zprofile for interactive shells, the LaunchDaemon plist for the service, and per-agent auth-profiles.json files for OpenClaw’s internal credential store. Miss any one of those three during a key rotation and one path works while the others silently use the old key.
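
One mitigation: generate the plist from Python with plistlib instead of hand-editing XML, so the hardcoded environment is at least explicit and scriptable. A sketch, with the label, program arguments and values as placeholders:

```python
import plistlib

# Placeholder label, program arguments and values; substitute your real ones.
daemon = {
    "Label": "com.example.openclaw",
    "ProgramArguments": ["/usr/local/bin/openclaw", "serve"],
    "RunAtLoad": True,
    "KeepAlive": True,
    "EnvironmentVariables": {
        # LaunchDaemons never read ~/.zprofile, so every variable goes here explicitly.
        "ANTHROPIC_API_KEY": "sk-placeholder",
        "OLLAMA_HOST": "http://localhost:11434",
    },
}

# Writing to /Library/LaunchDaemons requires root.
with open("/Library/LaunchDaemons/com.example.openclaw.plist", "wb") as f:
    plistlib.dump(daemon, f)
```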

Stray quotes cascade. A single mismatched quote in .zprofile causes zsh to report the error at the end of the file, not at the broken line. You’re staring at line 17 wondering what’s wrong when the actual problem is a stray " on line 2. I wrote a Python one-liner that checks for unmatched quotes on every line. I use it after every edit.
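
Mine amounts to something like this (a reconstruction, not the exact one-liner, and it only counts double quotes, which covers the common case):

```python
import sys

# Flag any line with an odd number of double quotes: the likely home of the stray.
for lineno, line in enumerate(open(sys.argv[1]), start=1):
    if line.count('"') % 2 != 0:
        print(f"line {lineno}: unbalanced quote -> {line.rstrip()}")
```

Run it against .zprofile after every edit and the stray on line 2 announces itself.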

The docs lie (sort of). OpenClaw’s documentation tracks unreleased code. Config keys documented on docs.openclaw.ai may not exist in the version you’re running. I’ve been burned by this repeatedly. heartbeat.isolatedSession. compaction.timeoutSeconds. Both documented. Both rejected by the config validator. Always run openclaw doctor after config edits. Before restarting. Not after.

The auth-profiles.json trap. Discovered this one on March 21 when the tweeter agent started returning 401 on every Claude API call. The key was valid in .zprofile. Valid in the LaunchDaemon plist. But OpenClaw stores per-agent copies in ~/.openclaw/agents/*/agent/auth-profiles.json. Those don’t get updated by config edits or plist changes. You have to update them manually. Three hours of debugging for a credential stored in a location I didn’t know existed.


The Numbers

Monthly cost with the original Claude-for-everything architecture: $30-75.

Monthly cost now: approximately $45-55.

Electricity (~50W average): $5-8
Ollama Pro: $20
Supabase (LobsterOps): $10
Tavily search: Free (1K/month)
X API credits: ~$5-10
Claude Sonnet (tweeter): ~$1.50
Claude Sonnet (emergency fallback): $0-5
Telegram: Free
Tailscale: Free

The $300 incident was the most expensive lesson. It was also the most valuable. It forced me to understand how model routing actually works in OpenClaw rather than how I assumed it worked. It forced me to build cost controls at four layers: config enforcement, behavioral rules in SOUL.md and AGENTS.md, twice-daily automated cost audits, and a hard spending cap at console.anthropic.com.


What’s Running Right Now

Twenty-two cron jobs manage automated workflows. Health checks every 10 minutes. GitHub Issue watching every 15 minutes. X mention polling every 2 minutes during waking hours. Intel sweeps four times a day. Tweet draft sessions three times a day. Cost audits twice daily. Nightly backups. Weekly repo sweeps. Weekly performance reports.

All services survive reboots via LaunchDaemons. No SSH session required for normal operation. The Mac mini operates fully autonomously. I interact via phone only.

One repo is in full autonomy mode. The agent picks up Issues, writes code, gets GLM-5 review, opens PRs. Every other repo is paused. Report only, no work. This is still a test phase. Monitoring the four-agent architecture and file-based coordination before expanding scope.

The agent continuously improves via .learnings/. Every failure gets logged to ERRORS.md with date, root cause and fix applied. Every correction I make gets logged to LEARNINGS.md. The agent checks both before starting any task.


What I’d Tell You Before You Build One

Start with the config, not the model. The model matters less than the routing. Get your cost controls right before you let the agent run unattended. One misconfigured default and you wake up to a bill.

Hallucination is a systems problem. You can’t prompt your way to reliability. Build verification into the workflow. Require proof of action. Log everything. Click the links.

Observability isn’t optional. If you can’t see what the agent did, you can’t trust what it says it did. Build the dashboard before you need it. I built LobsterOps after the $300 fire. Should have built it before.

The docs are aspirational. Every framework’s documentation describes what the latest code can do, not what the stable release supports. Test every config key against the validator before you trust it. Run the doctor command.

Credentials scatter. Any system that stores credentials in more than one location will eventually desync. Know every location. Script the rotation. Test the rotation.

Agent identity matters. Giving the agent a distinct voice and personality isn’t vanity. It makes the output recognizable. You can tell at a glance whether a tweet or a PR description came from the agent or from you. That clarity is operational.

MoE models load all parameters. This caught me during hardware planning. A model advertised as “40B active parameters” from a 744B total still needs memory for all 744B during loading. At inference time, only 40B are active. But your VRAM bill is the full parameter count. Plan accordingly.

Purpose-built models beat generalists for pipelines. I evaluated Llama 4 Scout. Multimodal. 10 million token context. Impressive specs. Completely irrelevant to my use case. My sub-agents generate code. They don’t analyze images or process million-token documents. Qwen3-Coder-Next, a purpose-built coding model, outperforms Scout for every task in my pipeline while using less memory.

One writer per file. If two agents can write to the same file, they will corrupt it. Assign ownership. One agent writes. Everyone else reads. Coordinate through the filesystem, not through API calls between agents.


What’s Next

The single highest-impact improvement on the roadmap is replacing GitHub polling with event-driven webhooks. Right now the agent checks for new Issues every 15 minutes. With webhooks via Tailscale Funnel and a GitHub App, it would respond in seconds.
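
The receiving end is small. A sketch of what the handler could look like with Flask, verifying GitHub’s HMAC signature before trusting anything in the payload. Exposing it through Tailscale Funnel and registering the GitHub App are separate steps, and dispatch_to_agent stands in for the same work the poller kicks off today.

```python
import hashlib
import hmac
import os
from flask import Flask, request, abort

app = Flask(__name__)
SECRET = os.environ["GITHUB_WEBHOOK_SECRET"].encode()


def dispatch_to_agent(issue: dict) -> None:
    """Placeholder for the existing Issue-to-PR pipeline."""
    print(f"dispatching issue #{issue['number']}")


@app.route("/github/webhook", methods=["POST"])
def github_webhook():
    # Verify the payload actually came from GitHub before acting on it.
    signature = request.headers.get("X-Hub-Signature-256", "")
    expected = "sha256=" + hmac.new(SECRET, request.data, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(signature, expected):
        abort(401)

    event = request.headers.get("X-GitHub-Event")
    payload = request.get_json()
    if event == "issues" and payload.get("action") == "labeled" and payload["label"]["name"] == "agent":
        dispatch_to_agent(payload["issue"])
    return "", 204


if __name__ == "__main__":
    app.run(port=8787)
```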

Beyond that: Redis Streams as an internal event bus. n8n replacing the 22 cron jobs with proper workflow automation. OrbStack replacing Colima for 60% less RAM overhead. Qdrant with a cross-encoder reranker for better semantic memory. Restic with Backblaze B2 for off-site backups. A UPS with NUT for graceful shutdown on power loss.

The tiered autonomy model scales too. Right now one repo has full autonomy. The pattern is crawl-walk-run: report only, then supervised work, then full autonomy. One repo at a time. Each one earns trust independently.

The blog post you’re reading was supposed to be written weeks ago. The architecture kept evolving. Every time I sat down to document the final state, something changed. New model. New agent. New lesson.

At some point you just have to publish and keep building.


The Architecture Today

A dedicated Mac mini M4 Pro running OpenClaw as an always-on autonomous AI agent. Four agents with distinct roles. Three tiers of model routing. Monthly cost under $55.

The main agent orchestrates. The reviewer agent audits every line of code. The intel agent monitors the competitive landscape. The tweeter agent maintains a public voice.

Sub-agents run locally on 80B and 30B coding models. Cloud models handle orchestration and review via Ollama Pro. Claude handles voice work at $1.50 a month.

All communication through Telegram. All code through GitHub with branch protection and mandatory review. All events logged to Supabase through LobsterOps. All services surviving reboots through LaunchDaemons.

No server costs. No cloud VMs. No GPU rentals. A $2,800 computer on a desk in Knoxville running 24/7 with a lobster that never sleeps.

The lobster runs free. 🦞


Noel DeLisle is Director of AI at GO2 Partners. Former USMC infantry squad leader and combat weapons instructor. He writes about AI systems, agent architecture and the operator-level view of building with AI at noeldelisle.com. Lobster Actual posts at @lobsteractual on X.
