TC Lab Blog

While the Lab Waits — Five Days Building an AI Image Editor


TL;DR:
>be me, still haven’t poured media
>spend 5 days building AI image editor for the Bucharest office instead
>write 600 lines of plan pipeline code
>none of it works reliably
>delete everything, replace with 179-line task prompt
>mfw thin tools + AI reasoning works better than code every time
>come back to lab, humidity crashed to 28%
>inner agent wrote 264 warnings nobody saw
>Claude proposes 400 lines of bash analysis scripts to fix reporting
>tell it to read its own blog first
>rewrites plan as 80-line data pull script
>ask Claude to draft this blog update
>it writes “my first instinct was to build four bash scripts”
>that was Claude’s plan, not mine
>Claude literally trying to frame its own mistakes as mine


No cultures in yet — I still haven’t poured media. While the lab waits, I spent the last five days building something else EGC needed.

This post is a departure from the usual tissue culture content. Same company, different problem.

What the Bucharest office needs

The design team in Bucharest edits garden photos for client presentations. Swap gravel for brick. Add lavender along a border. Show what a design will look like before anyone picks up a shovel. They’ve been using ChatGPT and Google’s image tools, but these have problems: ChatGPT changes image dimensions (a 4032×3024 site photo comes back as 1024×1024), there’s no way to mask specific areas, and results are inconsistent.

They needed a tool where they could upload a site photo, paint over the area they want to change, describe the edit (in Romanian — that’s important), and get back a result at the original resolution. I had five days while the lab was blocked. Here’s what happened.

Four hours from zero (March 10)

The first version was straightforward. A Cloudflare Worker handles the API — accepts an image, a mask, and a prompt, routes to whichever AI provider is configured, returns the result. A single-page frontend with a canvas-based mask editor: upload a photo, paint white over the area you want to edit, type a prompt, submit. The mask tells the model where to edit (white) and where to leave alone (black).
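
In Python terms, the round trip looks roughly like this. The endpoint path and field names are illustrative, not the Worker's exact contract:

```python
def build_edit_request(image_bytes, mask_bytes, prompt, provider="imagen-3"):
    """Assemble the edit request. White mask pixels mark the editable
    region, black pixels are preserved. Field names are assumptions."""
    return {
        "files": {
            "image": ("photo.jpg", image_bytes, "image/jpeg"),
            "mask": ("mask.png", mask_bytes, "image/png"),
        },
        "data": {"prompt": prompt, "provider": provider},
    }

# Sending it needs an HTTP client, e.g. with the requests package:
#   requests.post("https://edit.egc.land/api/edit", **build_edit_request(...))
```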

I tested every available provider:

| Provider | Result |
| --- | --- |
| OpenAI GPT Image 1.5 | Ignores mask boundaries, edits wherever it wants |
| FLUX Fill Pro | Hallucinates structures — ask for "lawn" and get a patio with furniture |
| Nano Banana 2 (Google) | Good semantic understanding, but no mask API — it decides what to edit |
| Imagen 3 (Google Vertex) | Respects mask boundaries precisely, preserves original dimensions |

Imagen 3 won. It’s the only model that reliably edits only the masked area and returns the image at its original resolution. About $0.04 per edit.

The whole thing — Worker, frontend, mask editor, auth, deployed to edit.egc.land — took about four hours.

The prompt enrichment trick

The Bucharest team writes prompts in Romanian. The models expect English, and not just any English — they expect specific, visually detailed English. “Adaugă lavandă” (add lavender) produces generic purple blobs. “Dense Lavandula angustifolia ‘Hidcote’ border with silvery sage-green foliage and deep violet-purple flower spikes, planted in natural drifts along the existing stone edge” produces something that looks like the actual plant.

Nobody should have to write prompts like that. So I added a Gemini Flash layer between the user’s prompt and the inpainting model. Gemini sees the source image (with the mask overlaid in green), understands the scene — the lighting, the existing materials, the perspective, what’s adjacent to the edit area — and rewrites the prompt with all that context. The user types three words in Romanian, Gemini produces a paragraph of precise English.
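
The enrichment layer is just one extra model call. Roughly, in Python (the payload shape here is an assumption; the real call runs inside the Worker):

```python
ENRICH_INSTRUCTIONS = (
    "You see a garden photo with the editable region overlaid in green. "
    "Rewrite the user's request as one paragraph of precise, visually "
    "detailed English for an inpainting model: name species and materials, "
    "and match the scene's lighting, perspective, and adjacent surfaces."
)

def build_enrichment_request(user_prompt, overlay_png):
    """Payload for the Gemini Flash call. The dict shape is a sketch,
    not the actual SDK signature."""
    return {
        "system": ENRICH_INSTRUCTIONS,
        "user": f"Request (may be in Romanian): {user_prompt}",
        "image": overlay_png,  # source photo with the mask shown in green
    }
```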

This is essentially what Rendair AI charges $19/month for. The Gemini call costs $0.001.

The hard lesson about prompts

The enrichment pipeline had a bug that took two sessions to find. Gemini was naming the existing materials in the enhanced prompt. If the user wanted to replace terracotta tiles with cream gravel, Gemini would write “replace the terracotta tiles with cream gravel” — and the inpainting model would see the word “terracotta” and faithfully reproduce terracotta tiles. The result looked plausible until someone noticed that nothing had actually changed.

The fix was counterintuitive: Gemini puts existing materials in a negative prompt (what the model should avoid generating) and describes only the desired result in the positive prompt. This was the hardest-won insight of the entire project. It’s not in the model documentation, and getting it wrong produces results that pass a casual glance but fail the client review.
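
The rule is mechanical once stated: describe only the desired result in the positive prompt, and push the existing materials into the negative prompt. A sketch:

```python
def split_prompts(desired, existing_materials):
    """Positive prompt describes only the target; existing materials go
    into the negative prompt so the model never sees their names."""
    positive = desired
    negative = ", ".join(existing_materials)
    return positive, negative

pos, neg = split_prompts(
    "cream limestone gravel path, fine 10mm aggregate, soft even texture",
    ["terracotta tiles", "grout lines"],
)
# `pos` goes to the model as the prompt, `neg` as the negative prompt,
# so the word "terracotta" never appears where the model could latch onto it.
```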

The over-engineering spiral (March 11)

The first version worked. I should have stopped there and let the team use it.

Instead, I spent the next day — seven sessions — trying to make it smarter.

Session 1: Built a plan pipeline. Upload an architectural plan alongside the site photo, and Gemini decomposes the customer’s brief into discrete edit steps. Step 1: add brick steps. Step 2: add a bin enclosure. Step 3: add planters. Each step chains automatically — the result becomes the source for the next step.

Session 2: The plans had wrong spatial positions (bin enclosure on the wrong side of the garden), so I added a two-pass interpretation flow. Gemini first describes what it sees in the plan, the user corrects the mistakes in interactive cards, then the confirmed interpretation drives the plan.

Session 3: The interpretations had ambiguous materials (a green polygon on the plan — is that timber? brick? composite?), so I added clarification questions. Gemini asks “what material should the bin enclosure be?” and the answer feeds into plan generation.

Session 4: The material swap prompts kept regenerating geometry instead of just changing surfaces (prompt says “change the steps to brick” and the model builds entirely new steps with a different count), so I added edit type classification — material_swap vs structural — with different prompt strategies for each.

Each session was fixing failures from the previous session. This is a pattern I should have recognised sooner: the system keeps getting more complex to handle edge cases that only exist because the system is complex. By the end of session seven, I had a plan pipeline, an interpretation engine, a clarification flow, edit type classification, mask review pauses, per-step accept/retry, and about 600 lines of plan-specific Worker code.

None of it worked reliably.

41%

The turning point was a research finding buried in a web search during session seven: vision-language models achieve approximately 41% accuracy on architectural plan elements. That’s not a tuning problem. That’s a capability ceiling.

Gemini couldn’t reliably tell which side of the garden the steps were on. It couldn’t match material labels to the correct plan elements — “RUNNING BOND” got assigned to the bin enclosure instead of the steps. It copied raw measurements (“180mm rise”) into inpainting prompts despite explicit instructions not to. When I placed the conversion warning right next to the data, Gemini still copied the raw number. Three different reinforcement strategies, same result.

I’d spent seven sessions building increasingly sophisticated prompt engineering around a model that was wrong about the plan layout more often than it was right. No amount of interpretation review, clarification questions, or prompt refinement fixes 41%.

And then there was the counting problem. Tell any current generative model “7 steps” and you’ll get 5 or 9. Across four retries with different prompt strategies — visual proportion language, explicit counts, proportional references — one model produced 10-15 steps every time. This is true of everything I tested: Gemini, Imagen, DALL-E, FLUX. Generative models cannot count.

The rebuild (March 12)

On the 12th, I recognised the pattern. It was the same architectural mistake I’d made early in the tc-lab project: trying to encode intelligence in application code instead of letting an AI reason.

The tc-lab inner agent monitors temperature and humidity every 15 minutes. It doesn’t follow a hardcoded flowchart — it reads sensor data, reasons about what it sees, and decides what to do. The intelligence lives in a task prompt, not in if-else branches.

The image editor needed the same pattern.

I deleted everything from the previous day. All 600 lines of plan pipeline, interpretation engine, clarification flow. The Worker went from 2,025 lines to 1,425.

The new architecture:

Browser (chat UI + mask editor)
    ↕ WebSocket
VPS session manager (Node.js + Claude Agent SDK)
    ↕ spawns
Claude Code process (one per conversation)
    ↕ calls
Thin Python tools (inpaint, segment, crop, merge, resize)
    ↕ HTTP
Cloudflare Worker API (unchanged)
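
The session manager's core job is small: keep exactly one agent process per conversation. A Python sketch of the shape (the real manager is Node.js with the Claude Agent SDK, and the `claude-code` command line here is a stand-in, not the actual spawn invocation):

```python
class SessionManager:
    """One agent process per conversation. Sketch only: the production
    manager is Node.js + Claude Agent SDK."""

    def __init__(self):
        self.sessions = {}

    def start(self, conversation_id, workdir):
        # Reuse the existing session for a known conversation id.
        if conversation_id not in self.sessions:
            self.sessions[conversation_id] = {
                "cmd": ["claude-code", "--cwd", workdir],  # spawn target
                "proc": None,  # subprocess.Popen(...) would go here
            }
        return self.sessions[conversation_id]
```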

The Bucharest team talks to Claude through a browser chat window. Claude sees their uploaded images, reasons about what to do, calls the inpaint API with the right parameters, shows results inline, and iterates conversationally. If the user says “nu, schimbă materialul” (no, change the material), Claude adjusts its approach — a different model, a different prompt, a cropped reference image — without anyone touching a configuration panel.

All the business logic I’d spent seven sessions encoding in application code now lives in a 179-line task prompt. Model selection rules. When to use a mask vs prompt-only. The “don’t name existing materials” rule. When to crop a reference from the architectural plan. When to merge between chained edits. Everything the system needs to know, stated once in natural language, interpreted by Claude at runtime.

The monolithic 2,935-line frontend — model selector, prompt field, plan panel, interpretation cards, clarification forms — became a 766-line chat UI with a collapsible mask editor.

Real users, real bugs (March 13–14)

Deployed to a Hetzner VPS. The first real session with the Bucharest team: 19 minutes, $0.69. Claude placed a hybrid BBQ on a deck, replaced gravel borders with evergreen shrubs across three camera angles (in parallel — three API calls simultaneously), and added a wooden shelter. It worked.

Then the bugs started.

The WebSocket timeout. Apache’s reverse proxy kills idle connections after about two minutes of silence. The team would be looking at a result, thinking about what to ask next, and the connection would die. Fix: WebSocket pings every 30 seconds.
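
The keepalive is as simple as fixes get: ping before the proxy's idle timer fires. Sketched with Python's asyncio (the real server is Node, but the shape is identical):

```python
import asyncio

PING_INTERVAL_S = 30  # well under Apache's ~2-minute idle timeout

async def keepalive(send_ping, interval=PING_INTERVAL_S, stop=None):
    """Ping the socket on a fixed interval so the reverse proxy never
    sees an idle connection. `stop` is an optional asyncio.Event."""
    while stop is None or not stop.is_set():
        await asyncio.sleep(interval)
        await send_ping()
```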

The 1px tool cards. Chat messages from Claude include collapsible “action cards” showing what the AI is doing (generating an edit, viewing an image, searching the web). These cards were collapsing to 1 pixel tall, effectively invisible. It took three sessions to find the root cause. Session one blamed scroll behaviour. Session two blamed emoji rendering at small font sizes. Session three found the real problem: CSS flex-shrink. The chat container uses flex-direction: column with overflow-y: auto. When content overflows, the flex algorithm shrinks children before scrolling — every card collapsed to fit the container. One CSS rule fixed it: flex-shrink: 0 on all children.

The invisible images. This one took six sessions to diagnose properly. Claude would call the inpainting tool, save the result to disk, and the user saw nothing. The image display depended on parsing tool output text for file paths. But Claude sometimes used curl (no stdout), sometimes described the result in prose without mentioning the filename, sometimes used markdown image syntax that doesn’t render in the chat UI.

The fix was architectural: stop depending on what the LLM says and watch the filesystem instead. A directory watcher monitors the results folder. When a new image appears, the session manager pushes it to the browser via WebSocket. The LLM can format its output however it wants — the images show up regardless.
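
A minimal version of that watcher, using plain polling. `on_new` stands in for the WebSocket push to the browser:

```python
import os
import time

def watch_new_images(results_dir, on_new, poll_s=0.5, max_polls=None):
    """Report every file that appears in results_dir after the watch
    starts, regardless of what the LLM printed to stdout."""
    seen = set(os.listdir(results_dir))
    polls = 0
    while max_polls is None or polls < max_polls:
        time.sleep(poll_s)
        polls += 1
        current = set(os.listdir(results_dir))
        for name in sorted(current - seen):
            on_new(os.path.join(results_dir, name))  # push to browser here
        seen = current
```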

The reference image problem. When a user picked a product photo (say, a specific BBQ shelter from a supplier website) and said “use this,” Claude would write a long text prompt describing the photo — “a compact black anthracite flat-roof shelter with slatted cedar cladding and an open front” — instead of passing the image itself as a reference. Text descriptions are always interpreted loosely by the model. A reference photo is unambiguous. The task prompt now includes explicit correct-vs-wrong examples for reference usage.

The $8 search. One session where the user asked Claude to find BBQ shelters from UK suppliers cost $8. Claude made 30+ tool calls in a single turn — six suppliers, eight product pages, thirteen images. The task prompt now says “stop at 4–5 good matches.”

The honest evaluator. The most interesting production bug: Claude couldn’t judge its own work. In one session, Claude generated a result and said “solid left wall, open front, three lights — all correct.” The user replied: the wall isn’t solid, there’s an extra wall section on the front, and the right fence changed from timber to brick. Claude had looked at the image and declared it perfect. It wasn’t.

The fix: every inpainting result now gets an independent quality check from a different model — Gemini 3 Flash. It compares the source and result images, identifies what changed, flags unintended changes, and returns a structured score. The evaluation is built into the tool itself — it runs automatically after every edit and the agent can’t skip it. If Gemini flags that the right fence changed from timber to brick, Claude reports that to the user instead of claiming everything looks great.

I tested the evaluator on two results from the session where Claude got it wrong. Gemini correctly identified both problems — “brick wall behind BBQ replaced with dark grey cladding” and “right side wooden slatted fence replaced with brick wall” — the exact issues the user had flagged.

Lesson: Don’t make quality checks optional for an AI agent. If it’s a flag the agent can choose not to pass, it won’t pass it. If it’s built into the tool output, it can’t be ignored. Same principle applied to Gemini prompt enrichment — it was opt-in, the agent never used it, every session produced worse results than necessary. Flipped to opt-out, enrichment runs by default.
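
In tool terms, "built in" means the evaluation happens inside the tool function, not beside it as an optional flag. A sketch, with `run_edit` and `run_eval` standing in for the Imagen and Gemini calls:

```python
def inpaint_tool(source, mask, prompt, run_edit, run_eval):
    """The tool's return value always embeds the independent quality
    check: the agent cannot opt out of seeing it."""
    result = run_edit(source, mask, prompt)
    evaluation = run_eval(source, result)  # a different model judges the work
    return {"result": result, "evaluation": evaluation}
```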

The pattern

This project reinforced something I learned building the tc-lab system: build thin tools and let the AI reason.

The tc-lab inner agent doesn’t have a hardcoded temperature control algorithm. It has a Python script that reads a sensor, a script that controls a relay, and a task prompt that explains the physics. The image editor doesn’t have a hardcoded edit pipeline. It has a Python script that calls an API, a script that generates masks, and a task prompt that explains inpainting.
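
A thin tool really is this thin: read, print, exit. A sketch of the sensor script's shape (the placeholder values and function name are mine, not the real hardware interface):

```python
"""Thin tool: read one sensor, print JSON, make no decisions.
The reasoning about what the numbers mean lives in the task prompt."""
import json
import sys

def read_sensor():
    # Placeholder values; the real script talks to the lab hardware.
    return {"temperature_c": 24.1, "humidity_pct": 28.0}

if __name__ == "__main__":
    json.dump(read_sensor(), sys.stdout)
```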

The seven sessions I spent building the plan pipeline were me trying to encode reasoning in code — deciding when to use masks, how to decompose edits, which model fits which scenario. The rebuild replaced 600 lines of that code with 179 lines of natural language.

Both projects use the same architecture: Claude Code as an orchestrator, thin tools for I/O, all intelligence in the task prompt. Same pattern, different domain. One controls a heater in a bedroom lab in London. The other edits garden photos for a design office in Bucharest.

By the numbers

| Metric | Value |
| --- | --- |
| Total build time | 5 days (17 sessions) |
| Cost per edit | ~$0.04–0.08 (Imagen 3 + Gemini enrichment) |
| First version | 4 hours from zero to deployed |
| Worker code | 2,025 → 1,425 lines (plan pipeline deleted) |
| Frontend | 2,935 lines → 766 lines |
| Plan pipeline | 7 sessions of prompt engineering. Deleted. |
| Task prompt | 179 lines of natural language |
| First real session | 19 minutes, $0.69 |
| Most expensive session | $8 (over-searching — now prevented by rules) |

Where things stand

| Item | Status |
| --- | --- |
| Lab | Waiting — media not poured yet |
| edit.egc.land | Live, Bucharest team using it |
| Architecture | Chat UI → Claude Code → thin tools → Worker API |
| Image quality | Auto-evaluation via Gemini 3 Flash on every edit |
| Prompt enrichment | On by default, Gemini rewrites Romanian → detailed English |
| Smart merge | Pixel-diff compositing preserves quality across chained edits |
| Segmentation | Broken — API integration needs fixing |
| Cost control | 4–5 supplier matches max, no more $8 sessions |

The lab humidity problem hasn’t gone anywhere. But the five-day detour produced something the Bucharest team can actually use — and taught me the same lesson twice: stop writing code where natural language will do.

Back to the lab

Four days after the last update, I came back to the lab and ran a full review. The data told a clear story: humidity had collapsed from 44% to 28% over five days, the Shelly 1250W fix was holding perfectly (zero crashes in four days), and the inner agent had logged 264 humidity warnings that nobody saw.

That last point was the problem. The agent detected the humidity crisis on day one. It wrote WARNING in the log every 15 minutes for four days straight. But the only way to see those warnings was to SSH into the Pi and read the log file. The agent was a smoke detector that writes “fire” in a notebook instead of sounding an alarm.

Claude also missed something in its review: the agent had hit usage limits on March 13 and 14 — nine failed cycles on the first day and ten on the second, dropping the daily entry counts from 91 to 82 and 81. Claude reported the numbers without investigating why. The errors were right there in the raw output logs, but it only checked the structured log. I had to ask.

Two process failures: no alerting, and an incomplete review checklist.

Applying the image editor lesson

Claude’s initial plan was four bash scripts: a daily digest with trend detection, a review checklist that runs through ten automated checks, alert escalation logic, and a context doc template. About 400 lines of bash with embedded Python. Analysis logic encoded in shell — the same mistake I’d just spent seven sessions making with the image editor.

I pointed Claude at the blog and the egc-inpaint CLAUDE.md. Read the whole corpus. Then redesign.

After reading the project’s own architecture principles, the plan changed. The review Claude had just done was the right process — pulling data, reasoning about it, writing a report. It was just slow because it ran seven SSH commands manually, each taking 10-15 seconds through the Cloudflare tunnel.

The fix wasn’t a bash script that analyses data. It was a bash script that pulls all the data in one call so Claude can reason about it faster. Same pattern as the image editor: thin tools for I/O, intelligence stays in the conversation.

What I actually built

One data pull script (scripts/lab-pull.sh). A single SSH call with a heredoc that gathers everything into a markdown dump: current readings, Shelly diagnostics, thermostat status, service health, sensor stats with daily breakdowns, agent log analysis, raw output errors, thermostat summaries, alerts, and photo status. Ten sections, ~80 lines, no analysis logic. Claude reads the output and does the thinking.

./scripts/lab-pull.sh          # Last 24h
./scripts/lab-pull.sh 120      # Last 5 days

What took seven SSH calls and ten minutes now takes one command and fifteen seconds.

Alert detection in agent-loop.sh. This is the one piece that does need to be automated — a threshold check that runs when nobody’s watching. After each agent cycle, twelve lines of bash count consecutive WARNING entries in the log. Six or more (≥90 minutes sustained) triggers a one-line alert to alerts.log, deduped to one per hour. The daily backup cron pulls this file and surfaces new alerts in the backup log.
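
The same check, sketched in Python for readability (the production version is the twelve lines of bash described above; the thresholds match):

```python
def should_alert(log_lines, threshold=6):
    """True when the log currently ends in `threshold` consecutive
    WARNING cycles: at one cycle per 15 minutes, 6 means >= 90 minutes."""
    streak = 0
    for line in reversed(log_lines):
        if "WARNING" in line:
            streak += 1
        else:
            break  # streak interrupted by a normal cycle
    return streak >= threshold
```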

No daily digest script. No trend detection. No automated review checklist. Claude IS the digest, the trend detector, and the checklist. I just needed to give it the data faster.

The pattern, stated three times now

| Project | Before | After |
| --- | --- | --- |
| tc-lab inner agent | Hardcoded flowchart | Task prompt + thin API calls |
| Image editor | 600-line plan pipeline | 179-line task prompt + thin Python tools |
| Lab reporting | 400 lines of bash analysis | 80-line data pull + Claude reasons |

Build thin tools. Let the AI reason. Stop writing code where natural language will do.

A smaller thing worth noting

When I asked Claude to draft this blog update, it wrote: “My first instinct was to build four bash scripts.” That’s Claude framing its own mistake as mine. The four-script plan was Claude’s, not my instinct — I’m the one who caught it and made it re-read the corpus.

It’s a subtle thing but it matters. Claude is a tool I’m directing, not a ghostwriter speaking in my voice. When it drafts text for me, it needs to keep the attribution straight: I’m the narrator, Claude is a character in the story. I had to correct it, and now it’s in the CLAUDE.md as a rule.