OpenClaw Upgrade Plan — claw001: 2026.5.22 → 2026.6.x

Generated 2026-06-20 by the plan-openclaw-upgrade dynamic workflow (5 agents: target-pin, internal runbook survey, risk-surface survey, changelog-delta research, synthesis). Current installed: 2026.5.22. Latest stable found: 2026.6.8 (2026-06-16). Recommendation: HOLD on 5.22. See §1.

Target version landscape (5.22 → latest)

Latest stable: 2026.6.8 (2026-06-16). Versions released after 5.22: 2026.5.26, 5.27, 5.28, 6.1, 6.5, 6.6, 6.8.

Headline-relevant changes for claw001:

5.28: Codex app-server/helper failures no longer tear down shared state; Opus 4.8 added; auth-profile canonical rewrite.
6.1: plugin SDK baseline refresh; auth-profile failover; externalized Tokenjuice + Copilot to npm plugins; Google provider resolves to google-generative-ai.
6.5: release cadence change to YYYY.M.PATCH (3rd component is now a monthly patch counter, not calendar day; June floor pinned at 6.5). Auth profiles moved to SQLite. MCP tool-result block coercion (prevents Anthropic 400s). Anthropic extended-thinking recovery after prompt-cache expiry/restart.
6.6: tighter security boundaries (sandbox binds, host env inheritance, MCP stdio, Codex HTTP, elevated-sender checks); exec approvals fail-closed on timeout; safer Telegram delivery (streamed text survives tool calls); Fable 5 adaptive thinking. Externalized Llama.cpp.
6.8: richer Telegram/WhatsApp delivery; OAuth image-default routing through Codex; managed SecretRef auth; OpenRouter/Vertex prefix normalization; CLI exit-code semantics change (usage errors now classified as failures; touches our fail-closed host scripts).
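
Because 6.5 turns the third component into a patch counter rather than a calendar day, release ordering has to be compared component by component, not read as a date. A minimal sketch, assuming GNU `sort -V` is available on the host; `version_lt` is a hypothetical helper, not part of OpenClaw:

```shell
# version_lt A B: succeeds when A sorts strictly before B, comparing each
# dot-separated component numerically (assumes GNU sort with -V).
version_lt() {
  [ "$1" != "$2" ] &&
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Example: is the installed release older than a candidate?
if version_lt 2026.5.22 2026.6.6; then
  echo "2026.5.22 precedes 2026.6.6"
fi
```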

Sources: local ~/openclaw git clone (tags + CHANGELOG.md), GitHub releases, docs.openclaw.ai.

1. GO / NO-GO

NO-GO on 2026.6.8. Conditional, caveated GO on 2026.6.6 only if a non-image driver forces a move.

Single most important reason: The image-gen routing nondeterminism (#90074, OPEN, "needs maintainer review") that pinned us to 5.22 is NOT fixed anywhere in 5.22→6.8. Upgrading buys zero on the original pin reason while adding three SQLite migrations (cron, sessions, auth-profiles) and a CLI exit-code semantics change that touch our fail-closed host scripts. Separately, 2026.6.8 is community-flagged "wait for next release" with a CRITICAL gateway-startup regression (#94570 ERR_MODULE_NOT_FOUND) that can brick the container, plus #90361 (memory_search reindex race → hits Coach) and #94033 (isolated-cron timeout → hits our compose-and-deliver crons).

Recommendation: stay on 5.22. The earlier #88312 gate (multi-tool Codex turn death) must also be confirmed fixed or reverted in the target changelog before ANY move, because codex is the farm's primary chat path. If you must move (and only after confirming #88312 is fixed), target 2026.6.6, never 6.8, and treat everything below as the 6.6 procedure (substitute 2026.6.6 for <TARGET>, so that v<TARGET> reads v2026.6.6). The plan below is written for the GO case so it is executable when the gate clears.

Hard pre-req for any GO: re-auth codex OAuth in the window opener (token expires ~2026-06-29, ~9 days out; single-use refresh can be spent during chaotic warmup).
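
The "~9 days out" figure can be recomputed on the day of the window instead of trusted from this plan. A small sketch, assuming GNU `date` (the `-d` flag); `days_between` is a hypothetical helper, and the expiry date is the approximate one quoted above:

```shell
# days_between FROM TO: whole days from FROM to TO (YYYY-MM-DD, evaluated in UTC).
# Assumes GNU date; BSD/macOS date needs different flags.
days_between() {
  local from to
  from=$(date -u -d "$1" +%s) || return 1
  to=$(date -u -d "$2" +%s) || return 1
  echo $(( (to - from) / 86400 ))
}

days_between 2026-06-20 2026-06-29   # plan date to the expected token expiry
```

On the day itself, replace the first argument with `"$(date -u +%F)"` to get the remaining margin.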

<TARGET> = 2026.6.6 (or the first post-#88312-fix release), so v<TARGET> is the tag v2026.6.6. <CURRENT> = 2026.5.22 (tag v2026.5.22).

2. Pre-flight (run ahead of window, low-risk)

P0 — Gate checks (abort if any fail):

cd ~/openclaw && git fetch --tags
git tag -l 'v<TARGET>' | grep -Fqx 'v<TARGET>'   # tag must exist (exact match, checked before it is used)
# Confirm #88312 (multi-tool Codex turn death) reverted/fixed between CURRENT and TARGET:
git log --oneline v2026.5.22..v<TARGET> | grep -iE '88312|codex.*turn|multi.?tool'
git show v<TARGET>:CHANGELOG.md | grep -n '88312'   # and in the TARGET changelog itself
# Confirm TARGET is NOT a flagged release (re-check clawstat.us before window).
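
The gate commands only abort the window if someone reads their exit status. One way to make "abort if any fail" mechanical is a tiny wrapper that is chained with `&&`. This is a sketch; `gate` is a hypothetical helper, not an existing script in openclaw-ops:

```shell
# gate DESCRIPTION CMD [ARGS...]: run CMD quietly; print PASS on success, or
# print ABORT on stderr and return non-zero so a chained "&&" sequence stops.
gate() {
  local desc=$1
  shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS: $desc"
  else
    echo "ABORT: $desc" >&2
    return 1
  fi
}

# Usage sketch (run from ~/openclaw; 2026.6.6 stands in for <TARGET>):
#   gate "tag exists" git rev-parse -q --verify refs/tags/v2026.6.6 &&
#   gate "#88312 mentioned in range" \
#     sh -c "git log --oneline v2026.5.22..v2026.6.6 | grep -qiE '88312|codex.*turn|multi.?tool'"
```

Pipelines have to be wrapped in `sh -c` (or a function) because `gate` runs a single command.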

P1 — Codex OAuth re-auth (THE SPOF — do this first, before snapshot):

docker exec -it openclaw-openclaw-gateway-1 openclaw models auth login --provider openai-codex
# Device code, ~2min, Daniel's ChatGPT-Pro account. Gets a fresh long-lived token,
# dodges the 06-29 cliff AND the single-use-refresh race during warmup.
docker exec openclaw-openclaw-gateway-1 openclaw config get agents.defaults.model   # sanity

P2 — Retired-model audit (doctor no-op insurance):

grep -E '"model"|"primary"|"fallbacks"' ~/.openclaw/openclaw.json
# Cross-ref against 6.6 changelog retired list; replace any retired ref via
# openclaw config set BEFORE building. Note 6.1 resolves Google → google-generative-ai;
# audit any Google/OpenRouter/Vertex provider IDs for canonical form.
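
The cross-reference against the retired list can be done with a fixed-string match rather than by eye. A sketch, assuming you first compile the retired model IDs by hand into a file, one per line; both `find_retired` and `retired.txt` are hypothetical names:

```shell
# find_retired CONFIG RETIRED_LIST: print CONFIG lines (with line numbers) that
# mention any ID listed in RETIRED_LIST. Matching is literal (grep -F), so dots
# and slashes in model IDs are not treated as regex. Exit status is 0 when at
# least one retired ID is still referenced.
# Caution: a blank line in RETIRED_LIST matches every line of CONFIG.
find_retired() {
  grep -nFf "$2" "$1"
}

# Usage sketch:
#   find_retired ~/.openclaw/openclaw.json retired.txt && echo "replace these before building"
```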

P3 — Confirm patches still apply against TARGET (dry-run, no commit):

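# Dry-run in a throwaway worktree so the live tree is never touched
# (git stash -u would also sweep up untracked files and nothing here pops it back):
cd ~/openclaw && git worktree add --detach /tmp/openclaw-patchcheck v<TARGET>
for p in ~/openclaw-ops/host-overrides/*.patch; do
  (cd /tmp/openclaw-patchcheck && patch -p1 --dry-run --forward < "$p") && echo "OK: $p" || echo "FAIL, regenerate: $p"
done
git worktree remove --force /tmp/openclaw-patchcheck   # discard the scratch checkout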

If any FAIL: regenerate via git diff <file> > host-overrides/<name>.patch against the new tag. cloud-build.sh will bail loudly post-checkout otherwise, leaving a partial tree.

P4 — Headroom: df -h / → confirmed 26G free (want >20G). Good. Confirm .env pin is an explicit tag (:v2026.5.22, confirmed — not :latest).
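
Both P4 checks can be turned into pass/fail tests. A sketch under two assumptions: the pin lives on a line of the form `OPENCLAW_IMAGE=<registry path>:vX.Y.Z` with no quotes or trailing whitespace, and `df -P` output is available. `free_gib` and `pin_is_explicit` are hypothetical helper names:

```shell
# free_gib [PATH]: whole GiB available on the filesystem holding PATH (default /).
free_gib() {
  df -Pk "${1:-/}" | awk 'NR == 2 { print int($4 / 1048576) }'
}

# pin_is_explicit ENV_FILE: succeeds only when OPENCLAW_IMAGE ends in a
# :vN.N.N style tag, so :latest or an untagged image fails.
pin_is_explicit() {
  grep -Eq '^OPENCLAW_IMAGE=.*:v[0-9]+\.[0-9]+\.[0-9]+$' "$1"
}

# Usage sketch:
#   [ "$(free_gib /)" -gt 20 ] && pin_is_explicit ~/openclaw/.env && echo "P4 ok"
```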

3. Snapshot + capture rollback identifiers

# Prune Docker space first (disk insurance):
docker builder prune -af && docker image prune -f && df -h /

# Snapshot (writes state.tar.gz = full restorable + frozen agents/models/cron/plugins/devices txt):
~/openclaw-ops/scripts/snapshot-openclaw.sh "pre-v<TARGET>"

# Pre-upgrade extras:
mkdir -p ~/openclaw-ops/backups/pre-upgrade-extras
# (a) all 4 agents' auth-profiles.json — the codex-OAuth SPOF store (rewritten by 5.28/6.1 migration):
tar czf ~/openclaw-ops/backups/pre-upgrade-extras/auth-profiles-pre-v<TARGET>.tgz \
  ~/.openclaw/agents/*/agent/auth-profiles.json
# (b) baseline extension dir on CURRENT image for KEEP-list diff:
docker exec openclaw-openclaw-gateway-1 ls /app/dist/extensions/ \
  > ~/openclaw-ops/backups/pre-upgrade-extras/extensions-v2026.5.22.txt
# (c) pre-migration cron/session JSON state (NEW: the SQLite migrations are one-way, so capture what 5.22 still reads):
tar czf ~/openclaw-ops/backups/pre-upgrade-extras/runtime-state-pre-v<TARGET>.tgz \
  ~/.openclaw/jobs.json ~/.openclaw/agents/*/sessions/sessions.json 2>/dev/null
# (d) crontab explicit copy (clean restore one-liner):
crontab -l > ~/openclaw-ops/backups/crontab.bak.upgrade-v<TARGET>

The tarball is the reliable rollback artifact; CLI text captures are best-effort under load.
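
Since the tarballs are the rollback artifact, it is worth confirming each one is readable and contains what you expect before the window opens. The `2>/dev/null` on capture (c) hides missing inputs, so an archive can exist without `jobs.json` in it. A sketch; `verify_tarball` is a hypothetical helper:

```shell
# verify_tarball FILE [MEMBER_PATTERN]: succeeds when FILE is a non-empty, fully
# listable gzip tarball and, if MEMBER_PATTERN (an extended regex) is given,
# at least one member path matches it.
verify_tarball() {
  local listing
  [ -s "$1" ] || return 1
  listing=$(tar -tzf "$1" 2>/dev/null) || return 1
  [ -z "${2:-}" ] || printf '%s\n' "$listing" | grep -Eq -- "$2"
}

# Usage sketch (2026.6.6 stands in for <TARGET>):
#   verify_tarball ~/openclaw-ops/backups/pre-upgrade-extras/runtime-state-pre-v2026.6.6.tgz 'jobs\.json$' \
#     || echo "STOP: runtime-state capture is missing jobs.json"
```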

4. Build (Cloud Build, from claw001)

~/openclaw-ops/scripts/cloud-build.sh v<TARGET>

This automates: git fetch → reset Dockerfile/compose mods → git checkout v<TARGET> → re-apply host-overrides/cloudbuild.yaml + all 3 patches (patch -p1 --forward, bails with exit 1 if any fails) → gcloud builds submit.

Watch for the 3 known Cloud Build quirks (all pre-fixed, confirm they hold):

The `# syntax=docker/dockerfile:1.6` directive must be applied; otherwise the build fails with "/${OPENCLAW_BUNDLED_PLUGIN_DIR}" not found (BuildKit can't expand ${VAR} in --mount=source=). Do NOT "fix" via --build-arg or by dropping --cache-from.
logging: GCS_ONLY + logsBucket in cloudbuild.yaml → claw-backup-writer SA lacks Cloud Logging read.
NO --cache-from → cache-key breaks on the ARG-mount syntax.

Monitor (sequential, never stack docker exec):

gcloud builds list --region=northamerica-northeast2 --limit=3
# On failure, read logs from GCS (SA can't read Cloud Logging):
gcloud storage cat gs://clawsorg-claw001-backups/cloudbuild-logs/log-<build_id>.txt

Wall time ~9-12 min. Gate before deploy:

gcloud artifacts docker tags list \
  northamerica-northeast2-docker.pkg.dev/clawsorg/openclaw/gateway
# Expect: build status SUCCESS (from builds list above), :v<TARGET> + :latest on the new digest, AND :v2026.5.22 STILL TAGGED (rollback target).
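
The tag half of this gate can be checked mechanically by piping the listing through a filter. A sketch that only assumes each tag name appears somewhere in the listing as a whole word; it does not parse gcloud's column layout or compare digests, and `has_tags` is a hypothetical helper:

```shell
# has_tags TAG...: read a tag listing on stdin and succeed only if every TAG
# appears in it as a whole word; names each missing tag on stderr.
has_tags() {
  local listing tag missing=0
  listing=$(cat)
  for tag in "$@"; do
    if ! printf '%s\n' "$listing" | grep -Fqw -- "$tag"; then
      echo "missing tag: $tag" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage sketch (2026.6.6 stands in for <TARGET>):
#   gcloud artifacts docker tags list \
#     northamerica-northeast2-docker.pkg.dev/clawsorg/openclaw/gateway \
#     | has_tags v2026.6.6 latest v2026.5.22 || echo "STOP: do not deploy"
```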

5. Deploy + canary

NEW_IMAGE=northamerica-northeast2-docker.pkg.dev/clawsorg/openclaw/gateway:v<TARGET>

# (11) Swap .env:
sed -i.bak.pre-upgrade 's|gateway:v2026.5.22|gateway:v<TARGET>|' ~/openclaw/.env
grep OPENCLAW_IMAGE ~/openclaw/.env   # verify

# (12) Pause host crontab (maintenance window). Refuse unless the §3(d) backup exists and is non-empty:
[ -s ~/openclaw-ops/backups/crontab.bak.upgrade-v<TARGET> ] && crontab -r

# (13) Pull:
cd ~/openclaw && docker compose pull openclaw-gateway

# (14) KEEP-list / extension-dir audit BEFORE recreating production:
docker run --rm --entrypoint='' "$NEW_IMAGE" ls /app/dist/extensions/ \
  > ~/openclaw-ops/backups/pre-upgrade-extras/extensions-v<TARGET>.txt
diff ~/openclaw-ops/backups/pre-upgrade-extras/extensions-v2026.5.22.txt \
     ~/openclaw-ops/backups/pre-upgrade-extras/extensions-v<TARGET>.txt
# NOTE: 6.1 externalized Tokenjuice + Copilot to npm plugins; 6.6 externalized Llama.cpp.
# If a dir you depend on (KEEP="openai exa browser openrouter telegram memory-core
# image-generation-core") vanished or a new depended-on dir appeared → STOP,
# edit ~/.openclaw/gateway-start.sh KEEP= first (bind-mounted, no rebuild needed).

# (15) Recreate — up -d NOT restart (restart won't re-read .env):
docker compose up -d openclaw-gateway
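
The diff in step (14) shows what changed, but the STOP condition is narrower: did any directory on the KEEP list vanish? A sketch that checks exactly that against the captured listing, assuming the file holds one directory name per line as `ls` writes it when redirected; `check_keep` is a hypothetical helper:

```shell
# check_keep LISTING NAME...: LISTING holds one extension directory per line.
# Succeeds only if every NAME is present as an exact line; reports the rest.
check_keep() {
  local listing=$1 name missing=0
  shift
  for name in "$@"; do
    if ! grep -Fqx -- "$name" "$listing"; then
      echo "MISSING from target image: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage sketch (2026.6.6 stands in for <TARGET>):
#   check_keep ~/openclaw-ops/backups/pre-upgrade-extras/extensions-v2026.6.6.txt \
#     openai exa browser openrouter telegram memory-core image-generation-core \
#     || echo "STOP: edit KEEP= in ~/.openclaw/gateway-start.sh first"
```

This catches vanished directories only; a newly required directory still needs the manual diff.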

Doctor + version verify (STOP gate):

# Wait healthy:
until docker ps --filter name=gateway-1 --format '{{.Status}}' | grep -q '(healthy)'; do sleep 5; done   # match '(healthy)' so "(unhealthy)" cannot pass
docker exec openclaw-openclaw-gateway-1 openclaw --version   # expect TARGET

# CRITICAL timing: do NOT run doctor concurrent with cold warmup (starves event loop ~10min
# on e2-medium → unhealthy flapping). Wait for the 'provider auth state pre-warmed' log line:
docker logs --tail 200 openclaw-openclaw-gateway-1 2>&1 | grep -i 'pre-warmed'   # 2>&1: docker logs replays container stderr on stderr

docker exec openclaw-openclaw-gateway-1 openclaw doctor --fix 2>&1 \
  | tee ~/openclaw-ops/backups/pre-upgrade-extras/doctor-fix-v<TARGET>.log
grep -iE 'error|fail|migrat' ~/openclaw-ops/backups/pre-upgrade-extras/doctor-fix-v<TARGET>.log
# Doctor performs the SQLite migrations (cron jobs.json→SQLite, session metadata, agent
# registry, auth-profile canonical rewrite). Any 'error'/'migration failed' → halt + read,
# do NOT force-restart. Confirm ~/.openclaw/openclaw.json.bak written (restore = cp bak json + restart).
# If wedged: docker exec openclaw-openclaw-gateway-1 pkill -9 -f openclaw-doctor

# After doctor's config writes, clean restart on migrated config:
docker compose restart openclaw-gateway   # ~130s
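
The wait loops in this section have no upper bound, so a container that never turns healthy blocks the window silently. A sketch of a bounded wait, where the timeout values are illustrative and `wait_for` and `gateway_healthy` are hypothetical helpers:

```shell
# wait_for TIMEOUT INTERVAL CMD [ARGS...]: retry CMD every INTERVAL seconds until
# it succeeds; give up with non-zero status once TIMEOUT seconds have been slept.
wait_for() {
  local timeout=$1 interval=$2 waited=0
  shift 2
  until "$@"; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "timed out after ${waited}s: $*" >&2
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
}

# gateway_healthy: true only for a status containing "(healthy)";
# "(unhealthy)" and "(health: starting)" do not match.
gateway_healthy() {
  docker ps --filter name=gateway-1 --format '{{.Status}}' | grep -q '(healthy)'
}

# Usage sketch: wait_for 900 5 gateway_healthy || echo "STOP: gateway never became healthy"
```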

Canary window: 60-min active watch, then 7-day passive observation.

Smoke tests (sequential, sleep 2 between — never parallel; e2-medium thrashes to 400% otherwise):

# (A) Codex auth FIRST (the SPOF — verify chat didn't fall to OpenRouter-Haiku):
docker exec openclaw-openclaw-gateway-1 openclaw config get agents.defaults.model
grep -c openai-codex ~/.openclaw/agents/main/agent/auth-profiles.json   # token still present

# (B) Channels probe (30s timeout — 10s default too tight on e2-medium):
docker exec openclaw-openclaw-gateway-1 openclaw channels status --probe --timeout 30000
# Expect all 4 Telegram bots: connected, mode:polling, works. 1008 pairing required → see §6 wipe.

# (C) Heartbeat cron canary (most reliable liveness):
docker exec openclaw-openclaw-gateway-1 openclaw cron run 4febe374-a480-46e6-9ea7-bc87be107e57
sleep 30
docker exec openclaw-openclaw-gateway-1 openclaw cron runs --id 4febe374-a480-46e6-9ea7-bc87be107e57 --limit 1
# Expect duration <60s, ok:true, fresh ~/.openclaw/workspace-kit/HEARTBEAT.md

# (D) Per-cron OUTPUT canary (heartbeat does NOT catch degraded content):
~/openclaw-ops/scripts/main-farm-health.sh    # G1 freshness + content-degradation sentinel
# Then eyeball ONE real composed artifact — trigger coach-morning-brief, wait ~60s, confirm
# the delivered summary has no 'unavailable'/'couldn't'/'did not run'/'jq…failed'/⚠️/🛠️.
# If a cron's tools collapsed: check payload.toolsAllow (stale names) + lightContext;
# fix = openclaw-cron-edit.sh ... --clear-tools --no-light-context (no restart).
# NOTE 6.6: cron list shape is now SQLite-backed — confirm cron list --json parses unchanged
# and openclaw-cron-edit.sh still writes correctly (re-validate the #31425 no-op behavior).

# (E) image_generate (the pinned-failure path — DO NOT skip, fails silently):
# Test via the REAL Telegram agent path, NOT CLI infer (different path, masks failures):
#   Telegram → Coach: "send the quad reset cards"  (reusable-visuals skill)
# Expect: image lands. If image_generate hard-fails → codex auth degraded OR routing
# flapped to codex-responses bridge (#90074, UNFIXED — known, tolerated, message-first).
# DANGER: do NOT apply apiKey SecretRef {source:env} to force routing — crash-loops gateway.

# (F) Diff against snapshot (SEQUENTIAL):
SNAP=$(ls -1dt ~/openclaw-ops/backups/pre-upgrade/*pre-v<TARGET>/ | head -1)
docker exec openclaw-openclaw-gateway-1 openclaw agents list | diff "$SNAP/agents.txt" -
sleep 2; docker exec openclaw-openclaw-gateway-1 openclaw cron list | diff "$SNAP/cron.txt" -
sleep 2; docker exec openclaw-openclaw-gateway-1 openclaw config get plugins | diff "$SNAP/plugins.txt" -
# Expected: version bumps, ID churn, time shifts, new bundled plugins.
# UNEXPECTED (investigate): missing agents, lost bindings, lost crons, plugin loaded→disabled.
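
For the manual eyeball in step (D), the degradation markers listed there can be grepped for if the delivered summary is saved to a file first. This is a sketch of a manual aid, not the logic inside main-farm-health.sh; `is_degraded` is a hypothetical helper and the pattern is only the marker list from step (D):

```shell
# is_degraded FILE: succeeds when FILE contains any of the degradation markers
# named in step (D). The two symbols are matched without their emoji
# variation selector so both plain and emoji renderings are caught.
is_degraded() {
  grep -Eq "unavailable|couldn't|did not run|jq.*failed|⚠|🛠" "$1"
}

# Usage sketch (brief.txt is a saved copy of the delivered message):
#   is_degraded brief.txt && echo "STOP: composed output looks degraded"
```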

6. Verification checklist (tied to actual risk surface)

Risk	Check	Pass criteria
Codex OAuth SPOF	`auth-profiles.json` has token; chat model not Haiku	`agents.defaults.model` = codex; token present in main's store
Image-gen routing (#90074, UNFIXED)	Telegram→Coach "send quad reset cards"	Image lands. Flapping tolerated; never block. NO SecretRef apiKey
Cron capability semantics (UNCHANGED in range)	`audit_cron_prompts.py --contracts` clean	No G4 scan_capability violations; read/write/exec names intact
Cron SQLite migration (NEW 6.1/6.5)	`cron list --json` parses; `openclaw-cron-edit.sh` writes	Shape unchanged; all 18 crons present, status:ok; #31425 no-op re-verified
Session SQLite migration (NEW 6.5)	session-reset-monitor + persona paths resolve	`sessions.json` path still readable OR tooling updated; persona-delete trick re-tested
CLI exit-code change (6.8 only — N/A on 6.6)	host scripts' `$?` handling	If on 6.8: re-validate main-farm-health/main-telegram-watch fail-closed logic
Persona bootstrap (12000-char cap)	session-start log	No unexpected "Bootstrap truncation warning"; sessions re-injected cleanly
Spool wedge (structural, guarded)	`spool-oom-sweep.sh` in crontab post-restore	Cron line present; no stranded `.processing` post-recreate
Dreaming event-loop block (guarded)	`gateway-cpu-watchdog.sh` in crontab	Line present; CPU not pinned post-warmup
coach-checkin SPOF	plugin enabled + linked	`~/.openclaw/plugins-src/coach-checkin/` intact; enabled (6.6 #93886 plugin-load boundary watch)
bonjour guard	`plugins.deny` contains bonjour	Persists (load-bearing anti-crash-loop)
Channels	4 Telegram bots	connected, polling, works
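
The two crontab rows above (spool wedge, dreaming block) reduce to "is the guard script on an active crontab line". A sketch that ignores commented-out entries; `guards_present` is a hypothetical helper:

```shell
# guards_present GUARD...: read a crontab listing on stdin and succeed only if
# every GUARD appears on an active (non-comment) line.
guards_present() {
  local active guard missing=0
  active=$(grep -v '^[[:space:]]*#' || true)
  for guard in "$@"; do
    if ! printf '%s\n' "$active" | grep -Fq -- "$guard"; then
      echo "guard not scheduled: $guard" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage sketch, after the crontab has been restored:
#   crontab -l | guards_present spool-oom-sweep.sh gateway-cpu-watchdog.sh
```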

7. Rollback (to 2026.5.22 — reversible, this is why rigor is right-sized to canary-and-watch)

Trigger: any STOP gate fails, or something is broken and not fixable in <15 min.

SNAP=$(ls -1dt ~/openclaw-ops/backups/pre-upgrade/*pre-v<TARGET>/ | head -1)
sed -i 's|gateway:v<TARGET>|gateway:v2026.5.22|' ~/openclaw/.env
cd ~/openclaw && docker compose down
# tar only overlays files, so set the migrated tree aside first; otherwise the new SQLite DBs survive the restore:
mv ~/.openclaw ~/.openclaw.migrated-v<TARGET>
tar -xzf "$SNAP/state.tar.gz" -C ~       # restores ~/.openclaw/, ~/openclaw-ops/, .env, override
# CRITICAL for this upgrade: the SQLite migrations are one-way. state.tar.gz restores the
# PRE-migration ~/.openclaw (jobs.json + sessions.json + auth-profiles.json all pre-rewrite),
# which is exactly what v2026.5.22 expects. Do NOT keep the migrated SQLite DBs.
docker compose pull openclaw-gateway     # re-pull old image (works ∵ :v2026.5.22 stayed tagged, §4 gate)
docker compose up -d openclaw-gateway
until docker ps --filter name=gateway-1 --format '{{.Status}}' | grep -q '(healthy)'; do sleep 5; done   # match '(healthy)' so "(unhealthy)" cannot pass
docker exec openclaw-openclaw-gateway-1 openclaw --version   # expect 2026.5.22
crontab ~/openclaw-ops/backups/crontab.bak.upgrade-v<TARGET>   # restore paused host crons
# If codex token was spent during the window: re-auth (device code).
docker exec -it openclaw-openclaw-gateway-1 openclaw models auth login --provider openai-codex

Rollback hinges on: (1) :v2026.5.22 AR tag preserved (§4 gate), (2) state.tar.gz source-of-truth (NOT the migrated SQLite), (3) git-head.txt for source revert. File the failure in BACKLOG.md.
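
The `SNAP=$(ls ... | head -1)` line yields an empty string when no snapshot matches, and the later `tar -xzf` then fails only after the container is already down. A sketch of a guard to run before `docker compose down`; `latest_snapshot` is a hypothetical helper and assumes the snapshot directory layout used above:

```shell
# latest_snapshot DIR SUFFIX: print the newest directory under DIR whose name
# ends in SUFFIX, but only if it holds a non-empty state.tar.gz.
latest_snapshot() {
  local snap
  snap=$(ls -1dt "$1"/*"$2"/ 2>/dev/null | head -n 1)
  if [ -z "$snap" ] || [ ! -s "${snap}state.tar.gz" ]; then
    echo "no usable snapshot matching *$2 under $1" >&2
    return 1
  fi
  printf '%s\n' "$snap"
}

# Usage sketch (2026.6.6 stands in for <TARGET>):
#   SNAP=$(latest_snapshot ~/openclaw-ops/backups/pre-upgrade pre-v2026.6.6) \
#     || echo "STOP: fix the snapshot before taking the gateway down"
```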

8. Breaking-change migration steps (NEW for this jump vs the v5.7→5.22 template)

These are the deltas the v5.22 plan template does NOT cover — all driven by openclaw doctor post-upgrade, captured pre-upgrade in §3(c):

Cron jobs.json → SQLite (6.1/6.5/6.6) — auto-migrated in doctor preflight, one-way. Before: snapshot cron state (done §3). After: verify cron list --json shape, re-validate openclaw-cron-edit.sh + audit_cron_prompts.py + snapshot-openclaw.sh frozen-diff against SQLite; add the new cron SQLite DB to backup-claws.sh tarball.
Session metadata + agent registry → SQLite (6.5, verified-before-cleanup 6.6) — via doctor. After: confirm sessions/sessions.json paths still resolve for session-reset-monitor / persona-rotation, or update tooling; re-test the persona-delete-from-sessions immediate-refresh trick.
Auth-profile canonical rewrite (5.28 + 6.1) — rewrites the codex-OAuth SPOF store. Before: backed up §3(a) + re-auth §P1. After: verify main-farm-health codex sentinel still fires.
Google provider → google-generative-ai (6.1) + OpenRouter/Vertex prefix normalization (6.8) — audit openclaw.json provider IDs (§P2).
Externalized runtimes (6.1: Tokenjuice/Copilot npm; 6.6: Llama.cpp) — verify the build didn't silently drop a runtime (KEEP-diff §14). Not believed in use on claw001.
CLI exit-code semantics (6.8 ONLY) — usage errors now classified as failures. Only relevant if you ignore the 6.6 recommendation. Re-validate all fail-closed host cron scripts' $? handling.
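
One way to re-validate `$?` handling is to record the exit code of each CLI invocation the host scripts rely on, once on the current image and once on the target, and diff the two lists. A sketch; `record_rc` is a hypothetical helper, and which invocations to record is left to whoever owns the scripts:

```shell
# record_rc CMD [ARGS...]: run CMD, discard its output, and print
# "<exit code> <command line>" on one line.
record_rc() {
  local rc=0
  "$@" >/dev/null 2>&1 || rc=$?
  printf '%s %s\n' "$rc" "$*"
}

# Usage sketch: run the same list before and after the upgrade, then diff.
#   {
#     record_rc docker exec openclaw-openclaw-gateway-1 openclaw cron list
#     record_rc docker exec openclaw-openclaw-gateway-1 openclaw channels status --probe --timeout 30000
#   } > rc-before.txt
```

Any line whose leading number changes between the two runs marks a script whose fail-closed branch needs a second look.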

Cron capability semantics (lightContext/toolsAllow→promptMode:minimal, read/write/exec names) are UNCHANGED in 5.22→6.8 — the CLAUDE.md placement mandate and audit_cron_prompts.py G4 assumptions hold; do not rewrite them.

Bottom line: Hold on 5.22. The pin reason (image-gen #90074) is unresolved through 6.8, 6.8 is flagged-skip, and the move adds three one-way SQLite migrations. If a non-image driver forces it and #88312 is confirmed fixed in the changelog, land on 6.6 (not 6.8), re-auth codex first, run the tight canary above, and keep state.tar.gz + the :v2026.5.22 AR tag as the one-command rollback.

⚠️ Verification caveat

The GitHub issue numbers cited (#90074, #94570, #90361, #94033, #88312, #93886, #31425) and the "6.8 is community-flagged skip" claim come from the workflow's web-research agent and have not been independently re-fetched. Before acting on any GO, confirm each issue exists and matches the described symptom (per the verify-upstream-issue-refs lesson). The version landscape, the SQLite-migration deltas, and the internal runbook steps are grounded in local files + changelog and are higher-confidence.
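
To make sure no cited issue is skipped during that confirmation, the references can be pulled out of the plan text as a checklist. A sketch; `issue_refs` is a hypothetical helper and `plan.txt` stands for wherever this document is saved:

```shell
# issue_refs FILE: print each distinct "#<digits>" reference found in FILE,
# one per line, sorted.
issue_refs() {
  grep -oE '#[0-9]+' "$1" | sort -u
}

# Usage sketch:
#   issue_refs plan.txt    # then look each one up by hand before any GO
```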