Longhorn Topology Mismatch

Context

This document captures troubleshooting performed for Harbor workloads stuck in Init / ContainerCreating due to Longhorn volume issues in a 2-node cluster.

Affected Harbor pods during investigation:

  • harbor/harbor-database-0
  • harbor/harbor-trivy-0

Primary Symptoms

  • Harbor pods stuck in Init / ContainerCreating with repeated FailedMount / MountVolume warnings.
  • Warning events showing mkfs/fsck failures and Input/output error against the Longhorn block device.
  • Longhorn volumes reporting degraded or unknown robustness.
  • Engine replicaModeMap entries such as UNKNOWN-tcp://...: ERR that never clear.

Root Cause Pattern

There were two concurrent problems:

  1. Topology mismatch for defaults (numberOfReplicas=3 in a 2-node cluster) causing persistent under-replication pressure (see the check below).
  2. Longhorn engine/replica reconciliation instability (stale unknown replica endpoints + backend I/O errors), causing mount failures.
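
To confirm problem (1) quickly, compare the Longhorn replica defaults with the number of schedulable nodes. A minimal check, assuming the original default StorageClass is named longhorn as described in Step 1:

# Longhorn-level default replica count (the Setting value is a string)
kubectl -n longhorn-system get settings.longhorn.io default-replica-count -o jsonpath='{.value}{"\n"}'

# Replica count baked into the StorageClass (parameters are immutable after creation)
kubectl get storageclass longhorn -o jsonpath='{.parameters.numberOfReplicas}{"\n"}'

# Nodes Longhorn can actually schedule replicas on
kubectl -n longhorn-system get nodes.longhorn.io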

After Adding Node 2

With a single node, Longhorn behaves much like a local-only data path (few distributed replica operations).
After the second node was added, Longhorn began running cross-node attach/rebuild/reconcile workflows for many existing volumes.

This increases control-plane and data-path churn:

  • volumes may attach on node A while replica data is on node B
  • Longhorn must continuously reconcile replica endpoints and rebuild state
  • old volumes still requesting 3 replicas cannot be fully satisfied on 2 nodes

The net effect is more frequent transitions and retries.
If any replica endpoint becomes stale or backend I/O is unstable, mount failures become visible to workloads.
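
One way to see this churn directly is to compare where each volume is attached with where its replica data lives. A rough sketch (column paths follow the Longhorn v1.x CRDs; verify against your version):

# Where each volume is currently attached
kubectl -n longhorn-system get volumes.longhorn.io \
  -o custom-columns=VOLUME:.metadata.name,NODE:.status.currentNodeID,STATE:.status.state,ROBUSTNESS:.status.robustness

# Where the replica data actually sits
kubectl -n longhorn-system get replicas.longhorn.io \
  -o custom-columns=REPLICA:.metadata.name,VOLUME:.spec.volumeName,NODE:.spec.nodeID,STATE:.status.currentState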

Terms Used In This Incident

Reconcile / Reconciliation

Reconciliation means Longhorn continuously compares:

  • desired state (what should exist), vs
  • actual state (what currently exists),

and then applies actions until they match.

For a single volume, the desired state typically includes:

  • attached to the intended node,
  • configured replica count met,
  • healthy replica endpoints in the engine,
  • no stale/error replicas.

If the actual state differs (for example, a stale UNKNOWN ... ERR endpoint, a missing replica, or a failed rebuild), Longhorn loops:

  1. detect mismatch,
  2. try corrective action (attach/detach, remove stale endpoint, rebuild replica),
  3. re-check state,
  4. repeat until stable.

When this loop never converges, workloads can remain stuck in Init/ContainerCreating.
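
One way to watch whether the loop converges for a specific volume is to follow its state and robustness over time (a sketch; <volume-name> is a placeholder):

kubectl -n longhorn-system get volumes.longhorn.io <volume-name> -w \
  -o custom-columns=NAME:.metadata.name,STATE:.status.state,ROBUSTNESS:.status.robustness,NODE:.status.currentNodeID

Healthy convergence settles on attached/healthy; constant flapping between attaching, attached, and detached means the loop is not converging.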

Attach/Rebuild State Transition

This means a volume is moving between lifecycle phases while Longhorn tries to make it usable:

  • Attach transition: volume goes from detached to attaching to attached on a target node.
  • Rebuild transition: Longhorn tries to create/sync healthy replicas so the engine has valid backends.

If transitions keep repeating or never settle, workloads can stay stuck in Init/ContainerCreating.
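
Rebuild progress for a specific engine can be sampled from the engine CR; the field below matches the Longhorn v1.x engine status, so verify it against your version:

kubectl -n longhorn-system get engines.longhorn.io <engine-name> \
  -o jsonpath='{.status.rebuildStatus}{"\n"}'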

Stale Remote Replica Endpoint

A stale remote replica endpoint is an old network address for a replica process that no longer exists or is no longer reachable.

Example pattern from this incident:

  • engine replicaModeMap contains UNKNOWN-tcp://10.42.1.67:<port>: ERR

What this implies:

  • the engine still references a replica endpoint from prior state
  • Longhorn keeps trying to remove or rebuild around it
  • reconciliation loops can fail (FailedDeleting, snapshot purge/rebuild errors)
  • block device mount/format operations can fail with I/O errors

Under-Replication Pressure

Under-replication pressure means a volume is configured for more replicas than the cluster can currently keep healthy/schedulable.

Example:

  • desired replicas: 3
  • healthy/schedulable replicas available: 1 or 2

What happens:

  • Longhorn continuously tries to rebuild/place missing replicas.
  • Volumes may remain degraded for long periods.
  • Attach/rebuild transitions occur more often.
  • Any extra instability (stale endpoint, network flap, disk I/O issue) is more likely to push a volume into unknown and cause mount failures.
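
A quick way to spot volumes under this pressure is to list requested replica counts next to robustness (a sketch using custom columns):

kubectl -n longhorn-system get volumes.longhorn.io \
  -o custom-columns=VOLUME:.metadata.name,WANT:.spec.numberOfReplicas,ROBUSTNESS:.status.robustness,STATE:.status.state

Any volume whose WANT value exceeds the number of schedulable Longhorn nodes will stay degraded indefinitely.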

Important Operational Guardrail

If Argo CD manages Harbor manifests, pause Argo reconciliation before manual recovery to avoid immediate rollbacks during detach/attach operations.
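
For example, if Harbor is deployed through an Argo CD Application, disabling automated sync is usually enough. The Application name and namespace below are assumptions; adjust to your setup:

# Assumes the Application is named "harbor" and Argo CD runs in the "argocd" namespace
kubectl -n argocd patch applications.argoproj.io harbor --type=merge \
  -p '{"spec":{"syncPolicy":{"automated":null}}}'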

Then run Step 2: Per-Volume Non-Destructive Recovery Procedure.

For copy/paste command flow, use Exact Recovery Commands (Generic).

Step 1: Topology Alignment (Completed)

Set defaults to match 2 nodes (command sketch below):

  • Longhorn setting default-replica-count set to 2.
  • Existing StorageClass parameters are immutable, so a new default class was created:
    • longhorn-r2 with numberOfReplicas: "2".
    • old longhorn default annotation flipped to false.
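
The change above can be reproduced with roughly the following commands (a sketch; the staleReplicaTimeout value is an assumption, so copy the remaining parameters from the existing longhorn class):

# Lower the Longhorn-level default for newly created volumes
kubectl -n longhorn-system patch settings.longhorn.io default-replica-count \
  --type=merge -p '{"value":"2"}'

# New default StorageClass with 2 replicas (parameters are immutable, so a new class is required)
cat <<'EOF' | kubectl apply -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-r2
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: driver.longhorn.io
parameters:
  numberOfReplicas: "2"
  staleReplicaTimeout: "30"
EOF

# Demote the old default
kubectl annotate storageclass longhorn \
  storageclass.kubernetes.io/is-default-class=false --overwrite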

Verification:

  • longhorn-r2 is the default StorageClass with numberOfReplicas: "2".
  • The Longhorn default-replica-count setting reports 2.

Recovery Results

harbor-database-0

  • Volume detached cleanly after clearing attachment context and setting replicas to 2.
  • Workload scaled up and database pod reached Running.

harbor-trivy-0

  • Same recovery flow applied.
  • Pod reached Running and ready=true.
  • Volume attached with replicas=2.
  • Engine replica map shows healthy active replica (RW) instead of stale unknown entry.

Symptom Detection Commands

Use these commands to quickly detect this failure pattern.
Tip: after collecting symptoms, apply Step 2: Per-Volume Non-Destructive Recovery Procedure.

1) Find non-healthy pods cluster-wide

kubectl get pods -A --no-headers | rg -v ' (Running|Completed) '

2) Check recent warning events

kubectl get events -A --field-selector type=Warning --sort-by=.lastTimestamp

3) Focus on mount/fsck/mkfs failures

kubectl get events -A --field-selector type=Warning --sort-by=.lastTimestamp | \
  rg 'FailedMount|MountVolume|mkfs|fsck|Input/output error'

4) Inspect a failing pod directly (example: Harbor DB/Trivy)

kubectl -n harbor describe pod harbor-database-0
kubectl -n harbor describe pod harbor-trivy-0

5) List Longhorn volumes and spot degraded/unknown

kubectl -n longhorn-system get volumes.longhorn.io

6) Inspect one Longhorn volume in detail

kubectl -n longhorn-system describe volumes.longhorn.io <volume-name>

7) Detect stale unknown replica endpoints in engines

kubectl -n longhorn-system get engines.longhorn.io -o yaml | rg 'UNKNOWN-tcp://|replicaModeMap|ERR'

8) Inspect a specific engine replica map

kubectl -n longhorn-system get engines.longhorn.io <engine-name> -o jsonpath='engineState={.status.currentState}{"\n"}replicaMap={.status.replicaModeMap}{"\n"}'

9) Watch longhorn-manager loop for one volume

kubectl -n longhorn-system logs -l app=longhorn-manager --since=30m | \
  rg '<volume-name>|FailedDeleting|FailedStartingSnapshotPurge|failed to start rebuild|Found unknown replica'

10) Check CSI logs for NodeStageVolume mount errors

kubectl -n longhorn-system logs -l app=longhorn-csi-plugin -c longhorn-csi-plugin --since=30m | \
  rg '<volume-name>|NodeStageVolume|MountDevice|mkfs|Input/output error|DeadlineExceeded'

11) Check Longhorn attachment tickets

kubectl -n longhorn-system get volumeattachments.longhorn.io <volume-name> -o yaml

If attachmentTickets stays non-empty after workload scale-down, detach can remain stuck.

Step 2: Per-Volume Non-Destructive Recovery Procedure

Use this for each affected Harbor stateful volume.

Before running it, review the Important Operational Guardrail above, then work through these steps:

  1. Scale workload down to release the volume.
  2. Confirm pod termination.
  3. Patch Longhorn volume:
    • clear attachment intent (spec.nodeID="")
    • set spec.numberOfReplicas=2
  4. Ensure VolumeAttachment tickets are empty (or reconcile to empty).
  5. Wait for volume to reach state=detached.
  6. Scale workload back to 1.
  7. Verify pod transitions to Running and mount errors stop.
  8. Verify Longhorn engine replica map no longer shows stale UNKNOWN ... ERR entry for that volume.

Exact Recovery Commands (Generic)

Use these commands for any affected StatefulSet volume in a 2-node cluster.

Set variables:

NS="<workload-namespace>"
STS="<statefulset-name>"
VOL="<longhorn-volume-name>"   # usually PVC id like pvc-xxxxxxxx-...
REPLICAS_TARGET="2"

A) Scale workload down and confirm pod release

kubectl -n "$NS" scale statefulset "$STS" --replicas=0
kubectl -n "$NS" get pods

B) Request detach + align replica target

kubectl -n longhorn-system patch volumes.longhorn.io "$VOL" --type=merge \
  -p "{\"spec\":{\"nodeID\":\"\",\"numberOfReplicas\":${REPLICAS_TARGET}}}"

C) Check/clear attachment tickets (if detach appears stuck)

kubectl -n longhorn-system get volumeattachments.longhorn.io "$VOL" -o yaml
 
# Optional, only if tickets are stuck and the workload is already scaled down
# (a JSON patch is needed here: a merge patch with {} would leave existing tickets in place):
kubectl -n longhorn-system patch volumeattachments.longhorn.io "$VOL" --type=json \
  -p '[{"op":"replace","path":"/spec/attachmentTickets","value":{}}]'

D) Wait for clean detach

kubectl -n longhorn-system get volumes.longhorn.io "$VOL" \
  -o jsonpath='state={.status.state} robustness={.status.robustness} currentNode={.status.currentNodeID}{"\n"}'

Expected before continuing:

  • state=detached

E) Scale workload back up

kubectl -n "$NS" scale statefulset "$STS" --replicas=1
kubectl -n "$NS" get pods -w

F) Verify recovery

kubectl -n "$NS" get pods
kubectl -n "$NS" describe pod <pod-name>
 
kubectl -n longhorn-system get volumes.longhorn.io "$VOL" \
  -o jsonpath='state={.status.state} robustness={.status.robustness} replicas={.spec.numberOfReplicas}{"\n"}'
 
kubectl -n longhorn-system get engines.longhorn.io "${VOL}-e-0" \
  -o jsonpath='engineState={.status.currentState}{"\n"}replicaMap={.status.replicaModeMap}{"\n"}'

Healthy signals:

  • workload pod reaches Running (and ready)
  • no new FailedMount/Input/output error events
  • engine replica map does not keep looping with stale UNKNOWN ... ERR endpoints

Why robustness=degraded May Still Appear

In a 2-node cluster, robustness=degraded can still occur during transitions/rebuilds and is not always fatal by itself.

Treat as urgent if combined with:

  • robustness=unknown
  • repeated unknown replica deletion loops
  • repeated mount/fsck/mkfs I/O errors

How to detect quickly: Symptom Detection Commands.

Post-Recovery Validation Checklist

  • kubectl get pods -A
  • kubectl get events -A --field-selector type=Warning --sort-by=.lastTimestamp
  • kubectl -n longhorn-system get volumes.longhorn.io
  • Spot-check engine states for previously affected volumes:
    • no stale UNKNOWN ... ERR replica map loops
    • no repeating FailedMount / Input/output error in pod events

If checks fail, repeat Step 2: Per-Volume Non-Destructive Recovery Procedure per affected volume. If you need copy/paste flow, use Exact Recovery Commands (Generic).

Follow-Up Actions

  • Continue volume-by-volume remediation for any remaining affected workloads.
  • Check node-level disk and network health for both Longhorn nodes.
  • Upgrade Longhorn to a patched stable release once the cluster has stabilized.
  • Re-enable Argo CD only after storage-backed workloads remain stable (example below).
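
Re-enabling can mirror the earlier pause (again assuming an Application named harbor in the argocd namespace; restore whatever automated policy was configured before):

# prune/selfHeal values are assumptions; match your previous sync policy
kubectl -n argocd patch applications.argoproj.io harbor --type=merge \
  -p '{"spec":{"syncPolicy":{"automated":{"prune":true,"selfHeal":true}}}}'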
