TC Lab Blog

The Shelly Problem — Diagnosed


The thermostat’s first full day. Two mysteries emerged from the data — and both were solved by reading the logs carefully.

The thermostat works

Twenty-four hours of bang-bang control, and the numbers are clear:

The pattern is consistent: heater turns ON at 22.5°C (one reading below the 22.6 lower band), runs for ~15 minutes until temp hits 23.1°C (one reading above the 23.0 upper band), turns OFF, room cools for ~15 minutes, repeat. The 0.4°C hysteresis band is just right — wide enough to prevent rapid cycling, narrow enough to keep the temperature tight.
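
The switching rule described above can be sketched in a few lines. This is a minimal illustration, not the code running on the Pi: the function name `next_heater_state` is hypothetical, and only the 22.6°C and 23.0°C band edges come from the post.

```python
def next_heater_state(temp_c: float, heater_on: bool,
                      lower: float = 22.6, upper: float = 23.0) -> bool:
    """Return the heater state after one reading (bang-bang with hysteresis)."""
    if temp_c < lower:
        return True       # below the lower band: switch (or stay) ON
    if temp_c > upper:
        return False      # above the upper band: switch (or stay) OFF
    return heater_on      # inside the band: keep the current state


# Replay a few readings through one heating/cooling cycle.
state = False
history = []
for reading in [22.7, 22.5, 22.8, 23.0, 23.1, 22.9]:
    state = next_heater_state(reading, state)
    history.append(state)
print(history)
```

Because the state only changes outside the band, a reading of 22.8°C leaves the heater ON while warming and OFF while cooling, which is what prevents rapid cycling.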

Combined suitability: 22.9% → 96.7%

Analysing the full day’s sensor data (2,879 readings) against the target ranges:

The 3.3% that missed was 95 readings where humidity dipped to 39.3% during the afternoon. Temperature never left the target range — even during a 10-hour period where the radiator was physically switched off (see below), it bottomed at 20.1°C, still above the 20°C floor.
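
As a rough sketch of how a combined suitability figure like this can be computed: a reading counts only if temperature and humidity are both in range. The function name and the upper bounds (25°C, 60%) are placeholders for illustration; only the 20°C floor, the 40% humidity warning level, and the 95-of-2,879 count come from the post.

```python
def suitability(readings, temp_range, humidity_range):
    """Percentage of (temp_c, humidity_pct) readings inside both ranges."""
    t_lo, t_hi = temp_range
    h_lo, h_hi = humidity_range
    ok = sum(1 for t, h in readings
             if t_lo <= t <= t_hi and h_lo <= h <= h_hi)
    return 100.0 * ok / len(readings)


# Headline figure recomputed from the counts quoted in the post.
total, missed = 2879, 95
headline = round(100.0 * (total - missed) / total, 1)
print(headline)

# Tiny illustrative sample; the upper bounds are assumed, not from the post.
sample = [(22.0, 44.6), (20.1, 45.0), (22.8, 39.3), (23.3, 50.4)]
sample_pct = suitability(sample, temp_range=(20.0, 25.0),
                         humidity_range=(40.0, 60.0))
print(sample_pct)
```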

Before active temperature control, combined suitability was 22.9% — temperature swung too widely with the diurnal cycle. The thermostat eliminated temperature as a variable. The remaining gap is humidity, which the heater makes worse by drying the air.

Daily summary fixed

The agent didn’t produce a daily summary for March 10 at midnight. The CLAUDE.md on the Pi has detailed instructions about daily summaries — format, trigger condition, data source — but the agent ignored them.

Root cause: the task prompt in agent-loop.sh said “Append a single timestamped log entry” and listed 7 steps, none of which mentioned daily summaries. The agent follows the task prompt literally. Having the summary instructions in CLAUDE.md wasn’t enough — the task prompt has to explicitly say “check if the date changed and generate a summary.”

Fixed by adding a step 8 to the task prompt:

DAILY SUMMARY CHECK: Look at the last entry in lab-log.md. If the date of the last entry is DIFFERENT from today, this is the first cycle of a new day. You MUST prepend a daily summary block BEFORE your regular entry.

The first summary after the fix fired correctly at midnight March 11:

## Daily Summary — 2026-03-10

- Temp: min 20.1 / max 23.3 / mean 22.0 C
- Humidity: min 39.3 / max 50.4 / mean 44.6%
- Warnings: ~3 cycles with humidity below 40%
- Diurnal pattern: 3.2 C swing, low 20.1 at ~20:00, peak 23.3 at ~11:10

Lesson: An autonomous agent running with --print treats the task prompt as its primary instruction set. CLAUDE.md is context, not commands. If a behaviour is critical (like daily summaries), it must be in the task prompt, not just in the reference document.
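
The trigger condition in step 8 is a plain date comparison. As a hedged sketch of that check in code, assuming each log entry carries an ISO date such as 2026-03-10 somewhere in its text (the actual lab-log.md entry format is not shown in this post, and both function names are hypothetical):

```python
import re
from datetime import date

# Assumption: entries in the log contain ISO dates (YYYY-MM-DD).
DATE_RE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")


def last_entry_date(log_text: str):
    """Return the last ISO date found in the log text, or None."""
    matches = DATE_RE.findall(log_text)
    if not matches:
        return None
    year, month, day = map(int, matches[-1])
    return date(year, month, day)


def needs_daily_summary(log_text: str, today: date) -> bool:
    """True when the most recent entry is from a different day than today."""
    last = last_entry_date(log_text)
    return last is not None and last != today


log = "2026-03-10 23:30 temp=22.9 ACTION=NONE\n"
print(needs_daily_summary(log, date(2026, 3, 11)))
```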

The afternoon mystery: a housemate

The data showed a puzzling 10-hour decline: temperature fell from 22.8°C at 10:00 to 20.1°C by 20:00, even though the thermostat had the heater “ON” from 11:19 onward.

The ACTION log told the real story:

10:14  ACTION=OFF  temp=23.1  (last normal cycle)
11:19  ACTION=ON   temp=22.5  (heater "on" — but temp never reaches 23.1 again)
...
21:29  ACTION=ON   temp=21.6  (next action — 10 hours later)

Between 10:14 and 21:29, the thermostat logged zero OFF actions. In normal operation, the heater cycles ON/OFF every 15 minutes. Ten hours with no OFF action means the temperature never reached 23.1°C — because the radiator’s hardware switch had been turned off by a housemate. The Shelly relay was ON, the thermostat thought it was heating, but the radiator was producing no heat.

The temperature decline was steady — about 0.3°C per hour of passive cooling:

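| Time  | Temp (°C) | Note                          |
|-------|-----------|-------------------------------|
| 10:14 | 23.1      | Last normal cycle             |
| 12:50 | 22.4      | Slow decline                  |
| 14:50 | 21.7      | Continuing                    |
| 16:50 | 21.0      |                               |
| 18:50 | 20.4      |                               |
| 19:50 | 20.2      | Bottom                        |
| 20:20 | 20.3      | Hardware switch turned back on |
| 20:50 | 21.3      | Rapid recovery                |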
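
The cooling rate can be checked directly from the readings between 12:50 and 18:50. A small sketch (the helper name `hours` is just for illustration):

```python
from datetime import datetime

# Readings from the steady part of the decline.
points = [("12:50", 22.4), ("14:50", 21.7), ("16:50", 21.0), ("18:50", 20.4)]


def hours(hhmm: str) -> float:
    """Convert an HH:MM string to fractional hours since midnight."""
    t = datetime.strptime(hhmm, "%H:%M")
    return t.hour + t.minute / 60


(t0, temp0), (t1, temp1) = points[0], points[-1]
rate = (temp0 - temp1) / (hours(t1) - hours(t0))   # °C lost per hour
print(f"{rate:.2f} °C/h")
```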

The thermostat has no way to detect this — the Shelly relay responds normally, it just isn’t powering anything. A detection rule could flag it: if the heater has been ON for 60+ minutes without temperature rising, something is physically wrong.
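
A minimal sketch of that detection rule, under stated assumptions: samples are (timestamp, temperature, heater state) tuples, and "not rising" means a net rise below 0.1°C over the current ON run. The function name, sample layout, and 0.1°C tolerance are hypothetical; only the 60-minute window comes from the text.

```python
from datetime import datetime, timedelta


def heater_stalled(samples, window=timedelta(minutes=60), min_rise_c=0.1):
    """samples: list of (timestamp, temp_c, heater_on), oldest first.

    True if the heater has been continuously ON for at least `window`
    and the temperature rose by less than `min_rise_c` over that run.
    """
    if not samples or not samples[-1][2]:
        return False
    # Walk back to the first sample of the current continuous ON run.
    start = len(samples) - 1
    while start > 0 and samples[start - 1][2]:
        start -= 1
    run = samples[start:]
    on_for = run[-1][0] - run[0][0]
    rise = run[-1][1] - run[0][1]
    return on_for >= window and rise < min_rise_c


# The incident: ON at 11:19 (22.5°C), still ON at 12:50 (22.4°C).
incident = [
    (datetime(2026, 3, 10, 11, 19), 22.5, True),
    (datetime(2026, 3, 10, 12, 50), 22.4, True),
]
print(heater_stalled(incident))
```

With these inputs the rule would have fired about an hour into the incident rather than ten hours later.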

The Shelly root cause: thermal overload

The Shelly crashes weren’t WiFi problems. Querying the device’s internal diagnostics via its RPC API revealed:

Internal temperature:  85.7°C (and rising)
Power draw:            2014.5W / 8.355A / 239.6V
Reset reason:          4 (software restart — thermal protection)
Uptime since reboot:   ~25 minutes
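
For reference, a hedged sketch of pulling these values over the RPC API. The field names (`temperature.tC`, `apower`, `current`, `voltage`, `reset_reason`, `uptime`) are as I understand the Gen2 `Switch.GetStatus` and `Sys.GetStatus` responses, so verify them against your firmware. The helper names are hypothetical, and the sketch assumes no authentication.

```python
import json
from urllib.request import urlopen


def rpc(host: str, method: str, timeout: float = 5.0) -> dict:
    """GET http://<host>/rpc/<method> and decode the JSON reply (no auth)."""
    with urlopen(f"http://{host}/rpc/{method}", timeout=timeout) as resp:
        return json.load(resp)


def summarize(switch: dict, sys_status: dict) -> dict:
    """Pick out the diagnostics of interest from the two status replies."""
    return {
        "internal_temp_c": switch.get("temperature", {}).get("tC"),
        "power_w": switch.get("apower"),
        "current_a": switch.get("current"),
        "voltage_v": switch.get("voltage"),
        "reset_reason": sys_status.get("reset_reason"),
        "uptime_min": round(sys_status.get("uptime", 0) / 60),
    }


# Literal values mirroring the readings above (not a live query);
# the 1500 s uptime stands in for "~25 minutes".
sample = summarize(
    {"temperature": {"tC": 85.7}, "apower": 2014.5,
     "current": 8.355, "voltage": 239.6},
    {"reset_reason": 4, "uptime": 1500},
)
print(sample)
```

Against a live device this might look like `summarize(rpc(host, "Switch.GetStatus?id=0"), rpc(host, "Sys.GetStatus"))`. A device with a password set needs digest authentication, which this sketch does not handle.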

The radiator draws 2kW. The Shelly Plus Plug UK is rated for 2500W / 10A, so 2kW is technically within spec — but at 80% of rated capacity, the tiny plug body can’t dissipate heat fast enough. The internal temperature climbs to 85°C+, hits the thermal protection threshold, and the device shuts down. The 18-minute “outage” is cooling time.

The cycle repeats: boot, heat up over roughly 40 to 50 minutes, hit the thermal limit, crash, cool for about 18 minutes, boot again. That is roughly one crash per hour, which matches the overnight pattern.

Outage timeline:

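| Window      | Duration | Phase                         |
|-------------|----------|-------------------------------|
| 00:59–01:18 | 19 min   | Pre-firmware-update           |
| 12:12–12:31 | 19 min   | Pre-firmware-update           |
| 21:13–21:29 | 16 min   | Post-firmware-update          |
| 22:22–22:40 | 18 min   | Post-firmware-update          |
| 23:27–23:45 | 18 min   | Post-firmware-update          |
| 00:24–00:42 | 18 min   | Post-firmware-update (Mar 11) |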
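
The timing of the post-update cycle can be read straight off these windows. A small sketch that computes outage length, uptime between outages, and the crash-to-crash period (the helper names are illustrative):

```python
# Post-firmware-update outage windows from the timeline.
windows = [("21:13", "21:29"), ("22:22", "22:40"),
           ("23:27", "23:45"), ("00:24", "00:42")]


def minutes(hhmm: str) -> int:
    """Minutes since midnight for an HH:MM string."""
    h, m = map(int, hhmm.split(":"))
    return h * 60 + m


def gap(start: str, end: str) -> int:
    """Minutes from start to end, wrapping past midnight if needed."""
    return (minutes(end) - minutes(start)) % (24 * 60)


durations = [gap(s, e) for s, e in windows]
uptimes = [gap(windows[i][1], windows[i + 1][0])
           for i in range(len(windows) - 1)]
periods = [gap(windows[i][0], windows[i + 1][0])
           for i in range(len(windows) - 1)]
print(durations, uptimes, periods)
```

The periods come out between 57 and 69 minutes, consistent with roughly one crash per hour.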

Why the firmware update made it worse: version 1.7.4 likely lowered the thermal protection threshold. Before the update (v1.4.4), the device could run longer before tripping, and there were only 2 outages in a day. After the update, it trips about once an hour.

Three things confirmed this wasn’t a WiFi issue:

  1. reset_reason: 4 = deliberate software restart, not connection loss
  2. The Shelly’s soft AP was still running alongside the station connection — extra radio work, extra heat
  3. WiFi RSSI was -65 dBm every time the device came back — consistent signal, not a dropout pattern

The fix: reduce the load

The radiator has three power settings. At 2kW the Shelly’s power LED shows yellow-orange. At 1250W it shows yellow-green. At 750W, green. The colour indicates how hard the plug is working.

At 1250W, the current drops from 8.4A to ~5.2A. Internal resistive heating scales with I² — so 1250W produces roughly 39% of the internal heat that 2kW does. That should keep the Shelly well below its thermal protection threshold.
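
The 39% figure follows from squaring the current ratio. A quick check, assuming the plug's internal resistance is the same at both settings (so heat scales purely with I²) and that mains voltage is unchanged (so current scales with power):

```python
# Nominal powers: at constant voltage, current is proportional to power.
ratio_nominal = (1250 / 2000) ** 2

# Rounded currents quoted in the text.
ratio_currents = (5.2 / 8.4) ** 2

print(f"{ratio_nominal:.1%} (from power), {ratio_currents:.1%} (from currents)")
```

Both routes land within a point of each other; the small difference comes from rounding the currents.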

The tradeoff is heating capacity. At 2kW, the thermostat cycled 15 minutes ON / 15 minutes OFF — about 50% duty cycle. At 1250W, it’ll need longer ON periods. But the thermostat adapts automatically — it just holds the bang-bang bands regardless of how long each cycle takes.
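
A back-of-envelope estimate of the new duty cycle, as a sketch only. It assumes the room's heat loss is unchanged and that the radiator delivers its nominal power at each setting; neither was measured here, and `duty_cycle` is a hypothetical helper.

```python
def duty_cycle(avg_heat_w: float, heater_w: float) -> float:
    """Fraction of time the heater must be ON to average avg_heat_w."""
    return min(1.0, avg_heat_w / heater_w)


# 50% duty at 2 kW implies the room needs about 1 kW on average.
avg_heat_w = 0.5 * 2000
duty_1250 = duty_cycle(avg_heat_w, 1250)
print(f"{duty_1250:.0%}")
```

Under those assumptions the heater would be ON about 80% of the time at 1250W, which still leaves some headroom.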

Before switching power settings, the Shelly needed to cool down from 85.7°C — loading it at 1250W while already near the thermal limit could trigger another crash before it stabilised.

The first attempt at a cooldown script failed. It disabled the thermostat by writing "enabled": false to the config file and turned the relay off, but the thermostat daemon, a systemd service on a 30-second loop, re-read the config and turned the relay back on before the script’s next poll. Toggling a config flag doesn’t work when a service is actively reading the same file on a faster loop.

The fix was obvious in hindsight: stop the systemd service entirely, not just toggle a config flag.

sudo systemctl stop lab-thermostat
curl --digest -u "admin:$SHELLY_PASS" "http://192.168.1.200/relay/0?turn=off"
# Poll internal temp every 30s until <50°C...
sudo systemctl start lab-thermostat

The Shelly cooled from 85.7°C to 50.0°C in about 12 minutes with no load. After switching the radiator to 1250W and restarting the thermostat, the first reading: 50.3°C at 1278W / 5.25A. A clean baseline — 35°C below where it was crashing.
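
The polling step in the cooldown script is elided above. As a hedged sketch of what such a wait loop can look like (not the script that was actually run): `read_temp_c` is a hypothetical callable that returns the plug's internal temperature, and the `max_polls` guard is an addition so the loop cannot hang forever.

```python
import time


def wait_until_cool(read_temp_c, threshold_c=50.0, interval_s=30,
                    max_polls=120, sleep=time.sleep):
    """Poll read_temp_c() until it returns a value below threshold_c.

    Returns the list of readings taken. Raises TimeoutError if the
    threshold is not reached within max_polls polls.
    """
    readings = []
    for _ in range(max_polls):
        temp = read_temp_c()
        readings.append(temp)
        if temp < threshold_c:
            return readings
        sleep(interval_s)
    raise TimeoutError(f"still at {readings[-1]}°C after {max_polls} polls")


# Dry run with canned readings and a no-op sleep (no device needed).
fake = iter([85.7, 70.0, 55.0, 49.9])
seen = wait_until_cool(lambda: next(fake), sleep=lambda s: None)
print(seen)
```

Passing `sleep` as a parameter keeps the loop testable without waiting 30 seconds per poll.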

Twenty minutes of thermal monitoring confirmed the fix:

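| Minutes | Internal temp | Note                 |
|---------|---------------|----------------------|
| 0       | 54.7°C        | Clean start at 1250W |
| 4       | 56.8°C        | Rising fast          |
| 9       | 58.0°C        | Slowing              |
| 14      | 58.8°C        | Flattening           |
| 20      | 59.6°C        | Plateau              |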
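
One way to see the flattening is to compute the rate of rise between consecutive readings. A short sketch using the five readings from the monitoring run:

```python
# (minutes since start, internal temp in °C) from the monitoring run.
readings = [(0, 54.7), (4, 56.8), (9, 58.0), (14, 58.8), (20, 59.6)]

# Rate of rise (°C per minute) between each pair of consecutive readings.
rates = [
    (temp_b - temp_a) / (min_b - min_a)
    for (min_a, temp_a), (min_b, temp_b) in zip(readings, readings[1:])
]
print([round(r, 2) for r in rates])
```

The rate falls with every interval, from about half a degree per minute at the start to a little over a tenth by the end, which is the shape expected when a device approaches thermal equilibrium.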

Equilibrium: ~60°C at 1250W. That’s 25°C below the crash threshold. At 2kW, the Shelly reached 85°C within about 25 minutes of booting and then crashed. At 1250W, it stabilises at 60°C and stays there. The thermal reboots should be gone.

Where things stand

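| Item                 | Status                                           |
|----------------------|--------------------------------------------------|
| Thermostat           | Working, 22.6–23.0°C band (0.4°C hysteresis)     |
| Shelly               | Fixed: 60°C at 1250W (was 85°C+ at 2kW)          |
| Combined suitability | 96.7% (up from 22.9%)                            |
| Daily summary        | Fixed: task prompt updated                       |
| Agent                | Running 24/7, reviewing thermostat each cycle    |
| Humidity             | Marginal: 39–45%, needs humidifier               |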

The system diagnosed itself. The thermostat log contained the housemate incident (10 hours of ON with no OFF) and the Shelly thermal pattern (consistent 18-minute reboots). The Shelly’s own RPC API reported the internal temperature and reset reason. No external tools needed — just reading the data that was already being collected.

Lesson: When you need to temporarily override a control loop, stop the service — don’t just flip a config flag. A config flag is a request; stopping the service is an order.