The Shelly Problem — Diagnosed
The thermostat’s first full day. Two mysteries emerged from the data — and both were solved by reading the logs carefully.
The thermostat works
Twenty-four hours of bang-bang control, and the numbers are clear:
- Temperature: min 20.1, max 23.3, mean 22.0°C
- 100% of readings within the 20–25°C target range
- 64% of readings within the tight 22.6–23.0°C thermostat band
- 27 relay actions — heater cycles roughly every 30 minutes
The pattern is consistent: heater turns ON at 22.5°C (one reading below the 22.6 lower band), runs for ~15 minutes until temp hits 23.1°C (one reading above the 23.0 upper band), turns OFF, room cools for ~15 minutes, repeat. The 0.4°C hysteresis band is just right — wide enough to prevent rapid cycling, narrow enough to keep the temperature tight.
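The switching rule described above can be sketched in a few lines. This is a minimal illustration, not the daemon's actual code: the function names are hypothetical, and the only values taken from the post are the 22.6 and 23.0°C band edges.

```python
from typing import Optional

LOWER_BAND = 22.6  # °C, from the post: heater turns ON below this
UPPER_BAND = 23.0  # °C, from the post: heater turns OFF above this


def next_action(temp_c: float, heater_on: bool) -> Optional[str]:
    """Return "ON", "OFF", or None (no change) for one sensor reading."""
    if not heater_on and temp_c < LOWER_BAND:
        return "ON"
    if heater_on and temp_c > UPPER_BAND:
        return "OFF"
    return None  # inside the band: keep the current relay state


def simulate(readings, heater_on=False):
    """Replay a list of readings and collect (temp, action) for each switch."""
    actions = []
    for temp_c in readings:
        action = next_action(temp_c, heater_on)
        if action is not None:
            heater_on = action == "ON"
            actions.append((temp_c, action))
    return actions
```

Replaying `[22.8, 22.6, 22.5, 22.7, 23.0, 23.1, 22.9, 22.6, 22.5]` yields ON at 22.5, OFF at 23.1, then ON at 22.5 again. Readings inside the band never switch the relay, which is what prevents rapid cycling.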
Combined suitability: 22.9% → 96.7%
Analysing the full day’s sensor data (2,879 readings) against the target ranges:
- Temperature: 100% of readings within 20–25°C
- Humidity: 96.7% of readings within 40–60%
- Combined suitability: 96.7% — up from 22.9% before the thermostat
The 3.3% that missed was 95 readings where humidity dipped to 39.3% during the afternoon. Temperature never left the target range — even during a 10-hour period where the radiator was physically switched off (see below), it bottomed at 20.1°C, still above the 20°C floor.
Before active temperature control, combined suitability was 22.9% — temperature swung too widely with the diurnal cycle. The thermostat eliminated temperature as a variable. The remaining gap is humidity, which the heater makes worse by drying the air.
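The suitability figures are simple counts over the day's readings. A minimal sketch of that calculation follows, assuming each reading is a `(temp_c, humidity_pct)` pair and that the range endpoints are inclusive; both are assumptions, since the post does not show the analysis code.

```python
TEMP_RANGE = (20.0, 25.0)      # °C target range from the post
HUMIDITY_RANGE = (40.0, 60.0)  # % RH target range from the post


def in_range(value, bounds):
    low, high = bounds
    return low <= value <= high


def suitability(readings):
    """Percentages of (temp_c, humidity_pct) readings that are in range.

    Returns (temp_pct, humidity_pct, combined_pct), each rounded to 0.1.
    """
    total = len(readings)
    if total == 0:
        raise ValueError("no readings")
    temp_ok = sum(in_range(t, TEMP_RANGE) for t, _ in readings)
    hum_ok = sum(in_range(h, HUMIDITY_RANGE) for _, h in readings)
    both_ok = sum(
        in_range(t, TEMP_RANGE) and in_range(h, HUMIDITY_RANGE)
        for t, h in readings
    )
    return tuple(round(100.0 * n / total, 1) for n in (temp_ok, hum_ok, both_ok))


# Synthetic data built only to match the counts in the post:
# 2,879 readings, of which 95 have humidity at 39.3%.
sample = [(22.0, 45.0)] * 2784 + [(22.0, 39.3)] * 95
```

`suitability(sample)` returns `(100.0, 96.7, 96.7)`: with temperature always in range, the combined figure is set entirely by the 95 low-humidity readings.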
Daily summary fixed
The agent didn’t produce a daily summary for March 10 at midnight. The CLAUDE.md on the Pi has detailed instructions about daily summaries — format, trigger condition, data source — but the agent ignored them.
Root cause: the task prompt in agent-loop.sh said “Append a single timestamped log entry” and listed 7 steps, none of which mentioned daily summaries. The agent follows the task prompt literally. Having the summary instructions in CLAUDE.md wasn’t enough — the task prompt has to explicitly say “check if the date changed and generate a summary.”
Fixed by adding a step 8 to the task prompt:
DAILY SUMMARY CHECK: Look at the last entry in lab-log.md. If the date of the last entry is DIFFERENT from today, this is the first cycle of a new day. You MUST prepend a daily summary block BEFORE your regular entry.
The first summary after the fix fired correctly at midnight March 11:
## Daily Summary — 2026-03-10
- Temp: min 20.1 / max 23.3 / mean 22.0 C
- Humidity: min 39.3 / max 50.4 / mean 44.6%
- Warnings: ~3 cycles with humidity below 40%
- Diurnal pattern: 3.2 C swing, low 20.1 at ~20:00, peak 23.3 at ~11:10
Lesson: An autonomous agent running with --print treats the task prompt as its primary instruction set. CLAUDE.md is context, not commands. If a behaviour is critical (like daily summaries), it must be in the task prompt, not just in the reference document.
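The condition in step 8 amounts to a date comparison. As an illustration only, here is the same check in code, assuming every log entry carries an ISO `YYYY-MM-DD` date; the real lab-log.md entry format is not shown in this post, and the function names are hypothetical.

```python
import re
from datetime import date
from typing import Optional

ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


def last_entry_date(log_text: str) -> Optional[date]:
    """Date of the most recent ISO date found in the log, or None."""
    matches = ISO_DATE.findall(log_text)
    if not matches:
        return None
    return date.fromisoformat(matches[-1])


def summary_due(log_text: str, today: date) -> bool:
    """True when the last logged entry is from a different day than today."""
    last = last_entry_date(log_text)
    return last is not None and last != today
```

With a log whose last entry is dated 2026-03-10, `summary_due(log, date(2026, 3, 11))` is true and the same call with `date(2026, 3, 10)` is false.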
The afternoon mystery: a housemate
The data showed a puzzling 10-hour decline: temperature fell from 22.8°C at 10:00 to 20.1°C by 20:00, even though the thermostat had the heater “ON” from 11:19 onwards.
The ACTION log told the real story:
10:14 ACTION=OFF temp=23.1 (last normal cycle)
11:19 ACTION=ON temp=22.5 (heater "on" — but temp never reaches 23.1 again)
...
21:29 ACTION=ON temp=21.6 (next action — 10 hours later)
Between the 10:14 OFF and the 21:29 ON, the thermostat logged no OFF actions at all. In normal operation the relay switches state about every 15 minutes, so ten hours without an OFF means the temperature never reached 23.1°C. The reason: the radiator’s hardware switch had been turned off by a housemate. The Shelly relay was ON and the thermostat thought it was heating, but the radiator was producing no heat.
The temperature decline was steady — about 0.3°C per hour of passive cooling:
| Time | Temp | Note |
|---|---|---|
| 10:14 | 23.1 | Last normal cycle |
| 12:50 | 22.4 | Slow decline |
| 14:50 | 21.7 | Continuing |
| 16:50 | 21.0 | |
| 18:50 | 20.4 | |
| 19:50 | 20.2 | Bottom |
| 20:20 | 20.3 | Hardware switch turned back on |
| 20:50 | 21.3 | Rapid recovery |
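
The 0.3°C per hour figure can be checked from the first and last rows of the decline. A small sketch, using only values copied from the table; the helper names are illustrative.

```python
from datetime import datetime

# (time, temp °C) rows copied from the decline portion of the table above.
DECLINE = [
    ("10:14", 23.1),
    ("12:50", 22.4),
    ("14:50", 21.7),
    ("16:50", 21.0),
    ("18:50", 20.4),
    ("19:50", 20.2),
]


def hours_between(start: str, end: str) -> float:
    """Elapsed hours between two same-day HH:MM timestamps."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds() / 3600.0


def cooling_rate(rows) -> float:
    """Average temperature drop in °C per hour from first to last row."""
    (t0, temp0), (t1, temp1) = rows[0], rows[-1]
    return (temp0 - temp1) / hours_between(t0, t1)
```

`cooling_rate(DECLINE)` comes out at roughly 0.30°C per hour: a 2.9°C drop over 9.6 hours.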
The thermostat has no way to detect this — the Shelly relay responds normally, it just isn’t powering anything. A detection rule could flag it: if the heater has been ON for 60+ minutes without temperature rising, something is physically wrong.
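One way such a rule might look, as a sketch only: the 60-minute window comes from the text above, while the 0.1°C minimum rise, the sample layout and the function name are assumptions made for illustration.

```python
def heater_stalled(samples, window_min=60, min_rise_c=0.1):
    """Flag a relay that has been ON for a full window without warming the room.

    samples: chronological list of (minute, temp_c, relay_on) tuples.
    """
    if not samples or not samples[-1][2]:
        return False
    latest_min, latest_temp, _ = samples[-1]
    for minute, temp_c, relay_on in reversed(samples):
        if not relay_on:
            return False  # the current ON run is shorter than the window
        if latest_min - minute >= window_min:
            # Relay has been ON for the whole window: did the room warm up?
            return latest_temp - temp_c < min_rise_c
    return False  # not enough history yet
```

Fed the housemate incident (ON at 22.5°C, still ON 91 minutes later at 22.4°C) the function returns true. A normal 15-minute heating run, or a long run in which the room is actually warming, returns false.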
The Shelly root cause: thermal overload
The Shelly crashes weren’t WiFi problems. Querying the device’s internal diagnostics via its RPC API revealed:
Internal temperature: 85.7°C (and rising)
Power draw: 2014.5W / 8.355A / 239.6V
Reset reason: 4 (software restart — thermal protection)
Uptime since reboot: ~25 minutes
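For reference, a readout like this can be assembled from two RPC replies. The sketch below is an assumption-heavy illustration: the post does not say which RPC methods were called, so `Switch.GetStatus`, `Sys.GetStatus` and the field names (`temperature.tC`, `apower`, `current`, `voltage`, `reset_reason`, `uptime`) are my best understanding of the Gen2 API and should be checked against the device. The sample replies are hand-built from the numbers above, not captured output, and fetching (with authentication) is left out.

```python
import json
from urllib.parse import urlencode


def rpc_url(host: str, method: str, **params) -> str:
    """Build an RPC URL such as http://HOST/rpc/Switch.GetStatus?id=0."""
    query = f"?{urlencode(params)}" if params else ""
    return f"http://{host}/rpc/{method}{query}"


def summarise(switch_status: dict, sys_status: dict) -> dict:
    """Pick out the fields quoted in the post from two decoded RPC replies."""
    return {
        "internal_temp_c": switch_status["temperature"]["tC"],
        "power_w": switch_status["apower"],
        "current_a": switch_status["current"],
        "voltage_v": switch_status["voltage"],
        "reset_reason": sys_status["reset_reason"],
        "uptime_min": round(sys_status["uptime"] / 60),
    }


# Hand-built sample replies using the numbers quoted above (not captured output).
switch_json = '{"temperature": {"tC": 85.7}, "apower": 2014.5, "current": 8.355, "voltage": 239.6}'
sys_json = '{"reset_reason": 4, "uptime": 1500}'
report = summarise(json.loads(switch_json), json.loads(sys_json))
```

Keeping the parsing separate from the HTTP call means the same function works on replies saved from `curl`.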
The radiator draws 2kW. The Shelly Plus Plug UK is rated for 2500W / 10A, so 2kW is technically within spec — but at 80% of rated capacity, the tiny plug body can’t dissipate heat fast enough. The internal temperature climbs to 85°C+, hits the thermal protection threshold, and the device shuts down. The 18-minute “outage” is cooling time.
The cycle repeats: boot, heat up over ~40 minutes, hit thermal limit, crash, cool for 18 minutes, boot again. That is roughly one crash per hour, which matches the overnight pattern.
Outage timeline:
| Window | Duration | Phase |
|---|---|---|
| 00:59–01:18 | 19 min | Pre-firmware-update |
| 12:12–12:31 | 19 min | Pre-firmware-update |
| 21:13–21:29 | 16 min | Post-firmware-update |
| 22:22–22:40 | 18 min | Post-firmware-update |
| 23:27–23:45 | 18 min | Post-firmware-update |
| 00:24–00:42 | 18 min | Post-firmware-update (Mar 11) |
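
The durations and the uptime between crashes can be recomputed from the windows in the table. A short sketch using only those timestamps; the helper names are illustrative.

```python
# Outage windows copied from the table above, in chronological order.
WINDOWS = [
    ("00:59", "01:18"),
    ("12:12", "12:31"),
    ("21:13", "21:29"),
    ("22:22", "22:40"),
    ("23:27", "23:45"),
    ("00:24", "00:42"),  # Mar 11
]


def to_minutes(hhmm: str) -> int:
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def elapsed(start: str, end: str) -> int:
    """Minutes from start to end, wrapping past midnight when end < start."""
    return (to_minutes(end) - to_minutes(start)) % (24 * 60)


durations = [elapsed(start, end) for start, end in WINDOWS]
# Uptime runs from the end of one outage to the start of the next.
uptimes = [elapsed(WINDOWS[i][1], WINDOWS[i + 1][0]) for i in range(len(WINDOWS) - 1)]
```

`durations` is `[19, 19, 16, 18, 18, 18]`, matching the table, and `uptimes` is `[654, 522, 53, 47, 39]`. The last three values are the post-update runs: 39 to 53 minutes of uptime plus a 16 to 18 minute outage gives a full cycle of about an hour.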
Why the firmware update made it worse: version 1.7.4 likely lowered the thermal protection threshold. Before the update (v1.4.4), the device could run longer before tripping — 2 outages per day. After, it trips every hour.
Three things confirmed this wasn’t a WiFi issue:
- reset_reason: 4 = deliberate software restart, not connection loss
- The Shelly’s soft AP was still running alongside the station connection: extra radio work, extra heat
- WiFi RSSI was -65 dBm every time the device came back — consistent signal, not a dropout pattern
The fix: reduce the load
The radiator has three power settings. At 2kW the Shelly’s power LED shows yellow-orange. At 1250W it shows yellow-green. At 750W, green. The colour indicates how hard the plug is working.
At 1250W, the current drops from 8.4A to ~5.2A. Internal resistive heating scales with I² — so 1250W produces roughly 39% of the internal heat that 2kW does. That should keep the Shelly well below its thermal protection threshold.
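The 39% figure follows from squaring the power ratio. A worked check, assuming a purely resistive load at constant mains voltage and the nominal 2000W and 1250W settings rather than the measured values:

```python
def relative_heating(new_power_w: float, old_power_w: float) -> float:
    """Ratio of I²R heating at two loads, assuming constant mains voltage.

    At constant voltage, current is proportional to power, so the I² ratio
    equals the square of the power ratio.
    """
    return (new_power_w / old_power_w) ** 2


def current_a(power_w: float, voltage_v: float) -> float:
    """Current drawn by a resistive load."""
    return power_w / voltage_v
```

`relative_heating(1250, 2000)` is 0.390625, about 39%, and `current_a(1250, 239.6)` is about 5.2A at the mains voltage the Shelly reported.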
The tradeoff is heating capacity. At 2kW, the thermostat cycled 15 minutes ON / 15 minutes OFF — about 50% duty cycle. At 1250W, it’ll need longer ON periods. But the thermostat adapts automatically — it just holds the bang-bang bands regardless of how long each cycle takes.
Before switching power settings, the Shelly needed to cool down from 85.7°C — loading it at 1250W while already near the thermal limit could trigger another crash before it stabilised.
The first attempt at a cooldown script failed. It disabled the thermostat by writing "enabled": false to the config file and turned the relay off, but the thermostat daemon, a systemd service that runs its loop every 30 seconds, re-read the config and turned the relay back on before the script’s next poll. Toggling a config flag doesn’t work when a service is actively reading the same file on a faster loop.
The fix was obvious in hindsight: stop the systemd service entirely, not just toggle a config flag.
sudo systemctl stop lab-thermostat
curl --digest -u "admin:$SHELLY_PASS" "http://192.168.1.200/relay/0?turn=off"
# Poll internal temp every 30s until <50°C...
sudo systemctl start lab-thermostat
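The polling step elided above could be sketched as follows. This is not the script that was actually run: `read_temp_c` and `relay_off` are hypothetical callables the caller would supply (for example, thin wrappers around the `curl` calls), and the `max_polls` limit is an added safeguard. The `try`/`finally` makes sure the thermostat service is restarted even if polling fails.

```python
import subprocess
import time


def wait_until_cool(read_temp_c, threshold_c=50.0, interval_s=30,
                    sleep=time.sleep, max_polls=120):
    """Poll read_temp_c() until it drops below threshold_c; return the readings."""
    readings = []
    for _ in range(max_polls):
        temp_c = read_temp_c()
        readings.append(temp_c)
        if temp_c < threshold_c:
            return readings
        sleep(interval_s)
    raise TimeoutError(f"still at {readings[-1]}°C after {max_polls} polls")


def cooldown(read_temp_c, relay_off, service="lab-thermostat"):
    """Stop the control loop, cut the load, wait, then always restart the loop."""
    subprocess.run(["sudo", "systemctl", "stop", service], check=True)
    try:
        relay_off()
        return wait_until_cool(read_temp_c)
    finally:
        subprocess.run(["sudo", "systemctl", "start", service], check=True)
```

Passing `sleep` as a parameter keeps the polling loop testable without waiting 30 seconds per iteration.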
The Shelly cooled from 85.7°C to 50.0°C in about 12 minutes with no load. After switching the radiator to 1250W and restarting the thermostat, the first reading: 50.3°C at 1278W / 5.25A. A clean baseline — 35°C below where it was crashing.
Twenty minutes of thermal monitoring confirmed the fix:
| Minutes | Internal temp | Note |
|---|---|---|
| 0 | 54.7°C | Clean start at 1250W |
| 4 | 56.8°C | Rising fast |
| 9 | 58.0°C | Slowing |
| 14 | 58.8°C | Flattening |
| 20 | 59.6°C | Plateau |
Equilibrium: ~60°C at 1250W. That’s about 25°C below the 85°C+ readings at which it was crashing. At 2kW, the Shelly reached 85°C within 25 minutes of booting and crashed not long after. At 1250W, it levels off around 60°C and stays there. The thermal reboots should be gone.
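The flattening is visible in the rate of rise between samples. A quick check using only the table values, with 85.7°C (the highest reading quoted earlier) standing in for the crash level since the exact threshold is not known:

```python
# (minutes since restart, internal temp °C) copied from the table above.
SAMPLES = [(0, 54.7), (4, 56.8), (9, 58.0), (14, 58.8), (20, 59.6)]
CRASH_TEMP_C = 85.7  # highest reading quoted in the post before a reboot


def slopes(samples):
    """Average °C per minute between each pair of consecutive samples."""
    return [
        round((t1 - t0) / (m1 - m0), 2)
        for (m0, t0), (m1, t1) in zip(samples, samples[1:])
    ]


rise_rates = slopes(SAMPLES)
headroom_c = round(CRASH_TEMP_C - SAMPLES[-1][1], 1)
```

The rate falls from roughly 0.5°C per minute in the first interval to about 0.13°C per minute in the last, and `headroom_c` is 26.1.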
Where things stand
| Item | Status |
|---|---|
| Thermostat | Working, holding the 22.6–23.0°C band (0.4°C wide) |
| Shelly | Fixed — 60°C at 1250W (was 85°C+ at 2kW) |
| Combined suitability | 96.7% (up from 22.9%) |
| Daily summary | Fixed — task prompt updated |
| Agent | Running 24/7, reviewing thermostat each cycle |
| Humidity | Marginal — 39–45%, needs humidifier |
The system diagnosed itself. The thermostat log contained the housemate incident (10 hours of ON with no OFF) and the Shelly thermal pattern (consistent 18-minute reboots). The Shelly’s own RPC API reported the internal temperature and reset reason. No external tools needed — just reading the data that was already being collected.
Lesson: When you need to temporarily override a control loop, stop the service — don’t just flip a config flag. A config flag is a request; stopping the service is an order.