Longhorn Topology Mismatch
Quick Navigation
- Context
- Primary Symptoms
- Root Cause Pattern
- Why This Appeared After Adding Node 2
- Terms Used In This Incident
- Important Operational Guardrail
- Step 1: Topology Alignment (Completed)
- Recovery Results
- Symptom Detection Commands
- Step 2: Per-Volume Non-Destructive Recovery Procedure
- Exact Recovery Commands (Generic)
- Why `degraded` May Still Appear
- Post-Recovery Validation Checklist
- Remaining Recommended Follow-Up
Context
This document captures troubleshooting performed for Harbor workloads stuck in Init / ContainerCreating due to Longhorn volume issues in a 2-node cluster.
Affected Harbor pods during investigation:
- `harbor/harbor-database-0`
- `harbor/harbor-trivy-0`
Primary Symptoms
- Pods blocked on mount with events like:
- Longhorn volume states: `attached` with `robustness=unknown` (or `degraded`)
- engine `replicaModeMap` contains stale endpoint entries such as `UNKNOWN-tcp://<ip>:<port>: ERR` (inspect volume details and the engine replica map)
- Event loop repeats:
Root Cause Pattern
There were two concurrent problems:
- Topology mismatch for defaults (`numberOfReplicas=3` in a 2-node cluster), causing persistent under-replication pressure.
- Longhorn engine/replica reconciliation instability (stale unknown replica endpoints + backend I/O errors), causing mount failures.
See also:
Why This Appeared After Adding Node 2
With one node, Longhorn behaves closer to local-only data paths (fewer distributed replica operations).
After adding a second node, Longhorn began doing cross-node attach/rebuild/reconcile workflows for many existing volumes.
This increases control-plane and data-path churn:
- volumes may attach on node A while replica data is on node B
- Longhorn must continuously reconcile replica endpoints and rebuild state
- old volumes still requesting `3` replicas cannot be fully satisfied on 2 nodes
The net effect is more frequent transitions and retries.
If any replica endpoint becomes stale or backend I/O is unstable, mount failures become visible to workloads.
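To see this split directly, compare where each volume is attached with where its replica data lives. A minimal sketch; the `replicas.longhorn.io` field names (`spec.volumeName`, `spec.nodeID`) are assumed here and may differ slightly between Longhorn versions:
```sh
# Where each volume is currently attached
kubectl -n longhorn-system get volumes.longhorn.io \
  -o custom-columns=VOLUME:.metadata.name,STATE:.status.state,NODE:.status.currentNodeID

# Where the replica data actually lives (field names assumed; adjust per Longhorn version)
kubectl -n longhorn-system get replicas.longhorn.io \
  -o custom-columns=REPLICA:.metadata.name,VOLUME:.spec.volumeName,NODE:.spec.nodeID
```
A volume attached on one node with all of its replicas on the other node is exactly the cross-node case described above.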
Related definitions:
Terms Used In This Incident
Reconcile / Reconciliation
Reconciliation means Longhorn continuously compares:
- desired state (what should exist), vs
- actual state (what currently exists),
and then applies actions until they match.
For one volume, desired state may be:
- attached to the intended node,
- configured replica count met,
- healthy replica endpoints in the engine,
- no stale/error replicas.
If actual state differs (for example stale UNKNOWN ... ERR endpoint, missing replica, failed rebuild), Longhorn loops:
- detect mismatch,
- try corrective action (attach/detach, remove stale endpoint, rebuild replica),
- re-check state,
- repeat until stable.
When this loop never converges, workloads can remain stuck in Init/ContainerCreating.
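A quick way to see both sides of that comparison for one volume is to print the desired spec next to the observed status; this sketch only uses fields already referenced elsewhere in this document (`<volume-name>` is a placeholder):
```sh
VOL="<volume-name>"

# Desired state: intended node and replica count
kubectl -n longhorn-system get volumes.longhorn.io "$VOL" \
  -o jsonpath='desiredNode={.spec.nodeID} desiredReplicas={.spec.numberOfReplicas}{"\n"}'

# Actual state: current node, attach state, robustness
kubectl -n longhorn-system get volumes.longhorn.io "$VOL" \
  -o jsonpath='currentNode={.status.currentNodeID} state={.status.state} robustness={.status.robustness}{"\n"}'
```
If the two outputs keep disagreeing over time, the reconciliation loop is not converging.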
Attach/Rebuild State Transition
This means a volume is moving between lifecycle phases while Longhorn tries to make it usable:
- Attach transition: volume goes from `detached` to `attaching` to `attached` on a target node.
- Rebuild transition: Longhorn tries to create/sync healthy replicas so the engine has valid backends.
If transitions keep repeating or never settle, workloads can stay stuck in Init/ContainerCreating.
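To watch those transitions live, follow the volume object with `-w`; a minimal sketch (`<volume-name>` is whichever volume is flapping):
```sh
kubectl -n longhorn-system get volumes.longhorn.io <volume-name> -w \
  -o custom-columns=NAME:.metadata.name,STATE:.status.state,ROBUSTNESS:.status.robustness,NODE:.status.currentNodeID
```
A healthy attach settles at `attached`; repeated flips between `attaching`, `attached`, and `detached` indicate the loop is not settling.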
Stale Remote Replica Endpoint
A stale remote replica endpoint is an old network address for a replica process that no longer exists or is no longer reachable.
Example pattern from this incident:
- engine `replicaModeMap` contains `UNKNOWN-tcp://10.42.1.67:<port>: ERR`
What this implies:
- the engine still references a replica endpoint from prior state
- Longhorn keeps trying to remove or rebuild around it
- reconciliation loops can fail (`FailedDeleting`, snapshot purge/rebuild errors)
- block device mount/format operations can fail with I/O errors (see the detection sketch below)
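One way to enumerate every engine still carrying such an endpoint, rather than grepping raw YAML, is to filter `replicaModeMap` for `ERR` entries. A minimal sketch, assuming `jq` is available and that the engine CR exposes `spec.volumeName`:
```sh
kubectl -n longhorn-system get engines.longhorn.io -o json | jq -r '
  .items[]
  | select((.status.replicaModeMap // {}) | to_entries | any(.value == "ERR"))
  | "\(.spec.volumeName)  \(.metadata.name)  \(.status.replicaModeMap)"'
```
Each line names a volume whose engine still references a failed or unknown replica endpoint.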
Under-Replication Pressure
Under-replication pressure means a volume is configured for more replicas than the cluster can currently keep healthy/schedulable.
Example:
- desired replicas: `3`
- healthy/schedulable replicas available: `1` or `2`
What happens:
- Longhorn continuously tries to rebuild/place missing replicas.
- Volumes may remain `degraded` for long periods.
- Attach/rebuild transitions occur more often.
- Any extra instability (stale endpoint, network flap, disk I/O issue) is more likely to push a volume into `unknown` and cause mount failures (a sketch for spotting such volumes follows below).
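Volumes under this pressure can be listed by comparing each volume's configured replica count with the number of nodes. A minimal sketch, assuming `jq` is available and every node is schedulable for Longhorn:
```sh
NODES=$(kubectl get nodes --no-headers | wc -l | tr -d ' ')

kubectl -n longhorn-system get volumes.longhorn.io -o json | jq -r --argjson nodes "$NODES" '
  .items[]
  | select(.spec.numberOfReplicas > $nodes)
  | "\(.metadata.name) wants \(.spec.numberOfReplicas) replicas, cluster has \($nodes) nodes"'
```
Any volume printed here will keep rebuilding and stay `degraded` until its replica count is lowered or nodes are added.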
Important Operational Guardrail
If Argo CD manages Harbor manifests, pause Argo reconciliation before manual recovery to avoid immediate rollbacks during detach/attach operations.
Then run the Step 2: Per-Volume Non-Destructive Recovery Procedure.
For copy/paste command flow, use Exact Recovery Commands (Generic).
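How the pause is done depends on how the Application is configured; the names below (`argocd` namespace, `harbor` Application) are assumptions for illustration, not values from this cluster:
```sh
# Option 1: Argo CD CLI - disable automated sync on the app
argocd app set harbor --sync-policy none

# Option 2: kubectl patch - remove the automated sync policy from the Application
kubectl -n argocd patch application harbor --type=json \
  -p '[{"op":"remove","path":"/spec/syncPolicy/automated"}]'
```
Re-enable automated sync only after the recovery steps complete and the pods stay healthy (see Remaining Recommended Follow-Up).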
Step 1: Topology Alignment (Completed)
Set defaults to match 2 nodes:
- Longhorn setting `default-replica-count` set to `2`.
- Existing `StorageClass` parameters are immutable, so a new default class was created: `longhorn-r2` with `numberOfReplicas: "2"`.
- old `longhorn` default annotation flipped to `false`.
Verification:
- `longhorn-r2` is the default class with replicas `2`.
- Longhorn default replica count reports `2` (equivalent commands are sketched below).
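The change above was already applied; for reproducibility, a sketch of equivalent commands follows. The setting can also be changed from the Longhorn UI, and the StorageClass parameters here (reclaim policy, `staleReplicaTimeout`, etc.) are assumptions that should mirror the original `longhorn` class rather than be copied verbatim:
```sh
# Align the Longhorn default replica count with the 2-node topology
kubectl -n longhorn-system patch settings.longhorn.io default-replica-count \
  --type=merge -p '{"value":"2"}'

# Create the new default class with 2 replicas (parameters are a sketch)
kubectl apply -f - <<'EOF'
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-r2
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: driver.longhorn.io
allowVolumeExpansion: true
reclaimPolicy: Delete
volumeBindingMode: Immediate
parameters:
  numberOfReplicas: "2"
  staleReplicaTimeout: "30"
EOF

# Drop the default flag from the old class
kubectl patch storageclass longhorn -p \
  '{"metadata":{"annotations":{"storageclass.kubernetes.io/is-default-class":"false"}}}'
```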
Recovery Results
harbor-database-0
- Volume detached cleanly after clearing attachment context and setting replicas to `2`.
- Workload scaled up and database pod reached `Running`.
harbor-trivy-0
- Same recovery flow applied.
- Pod reached `Running` and `ready=true`.
- Volume attached with `replicas=2`.
- Engine replica map shows a healthy active replica (`RW`) instead of the stale unknown entry.
Symptom Detection Commands
Use these commands to quickly detect this failure pattern.
Tip: after collecting symptoms, apply Step 2: Per-Volume Non-Destructive Recovery Procedure.
1) Find non-healthy pods cluster-wide
```sh
kubectl get pods -A --no-headers | rg -v ' (Running|Completed) '
```
2) Check recent warning events
```sh
kubectl get events -A --field-selector type=Warning --sort-by=.lastTimestamp
```
3) Focus on mount/fsck/mkfs failures
```sh
kubectl get events -A --field-selector type=Warning --sort-by=.lastTimestamp | \
  rg 'FailedMount|MountVolume|mkfs|fsck|Input/output error'
```
4) Inspect a failing pod directly (example: Harbor DB/Trivy)
```sh
kubectl -n harbor describe pod harbor-database-0
kubectl -n harbor describe pod harbor-trivy-0
```
5) List Longhorn volumes and spot degraded/unknown
```sh
kubectl -n longhorn-system get volumes.longhorn.io
```
6) Inspect one Longhorn volume in detail
```sh
kubectl -n longhorn-system describe volumes.longhorn.io <volume-name>
```
7) Detect stale unknown replica endpoints in engines
```sh
kubectl -n longhorn-system get engines.longhorn.io -o yaml | rg 'UNKNOWN-tcp://|replicaModeMap|ERR'
```
8) Inspect a specific engine replica map
```sh
kubectl -n longhorn-system get engines.longhorn.io <engine-name> -o jsonpath='engineState={.status.currentState}{"\n"}replicaMap={.status.replicaModeMap}{"\n"}'
```
9) Watch longhorn-manager loop for one volume
```sh
kubectl -n longhorn-system logs -l app=longhorn-manager --since=30m | \
  rg '<volume-name>|FailedDeleting|FailedStartingSnapshotPurge|failed to start rebuild|Found unknown replica'
```
10) Check CSI logs for NodeStageVolume mount errors
```sh
kubectl -n longhorn-system logs -l app=longhorn-csi-plugin -c longhorn-csi-plugin --since=30m | \
  rg '<volume-name>|NodeStageVolume|MountDevice|mkfs|Input/output error|DeadlineExceeded'
```
11) Check Longhorn attachment tickets
```sh
kubectl -n longhorn-system get volumeattachments.longhorn.io <volume-name> -o yaml
```
If `attachmentTickets` stays non-empty after workload scale-down, detach can remain stuck.
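A script-friendly check for leftover tickets (empty output or `{}` means nothing is still requesting attachment; `<volume-name>` is a placeholder):
```sh
kubectl -n longhorn-system get volumeattachments.longhorn.io <volume-name> \
  -o jsonpath='{.spec.attachmentTickets}{"\n"}'
```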
Step 2: Per-Volume Non-Destructive Recovery Procedure
Use this for each affected Harbor stateful volume.
Before running it, review:
- Scale workload down to release the volume.
- Confirm pod termination.
- Patch Longhorn volume:
  - clear attachment intent (`spec.nodeID=""`)
  - set `spec.numberOfReplicas=2`
- Ensure `VolumeAttachment` tickets are empty (or reconcile to empty).
- Wait for volume to reach `state=detached`.
- Scale workload back to `1`.
- Verify pod transitions to `Running` and mount errors stop.
- Verify Longhorn engine replica map no longer shows a stale `UNKNOWN ... ERR` entry for that volume.
Exact Recovery Commands (Generic)
Use these commands for any affected StatefulSet volume in a 2-node cluster.
Set variables:
```sh
NS="<workload-namespace>"
STS="<statefulset-name>"
VOL="<longhorn-volume-name>"   # usually PVC id like pvc-xxxxxxxx-...
REPLICAS_TARGET="2"
```
A) Scale workload down and confirm pod release
```sh
kubectl -n "$NS" scale statefulset "$STS" --replicas=0
kubectl -n "$NS" get pods
```
B) Request detach + align replica target
```sh
kubectl -n longhorn-system patch volumes.longhorn.io "$VOL" --type=merge \
  -p "{\"spec\":{\"nodeID\":\"\",\"numberOfReplicas\":${REPLICAS_TARGET}}}"
```
C) Check/clear attachment tickets (if detach appears stuck)
```sh
kubectl -n longhorn-system get volumeattachments.longhorn.io "$VOL" -o yaml
# Optional only if tickets are stuck and workload is already scaled down:
kubectl -n longhorn-system patch volumeattachments.longhorn.io "$VOL" --type=merge \
  -p '{"spec":{"attachmentTickets":{}}}'
```
D) Wait for clean detach
```sh
kubectl -n longhorn-system get volumes.longhorn.io "$VOL" \
  -o jsonpath='state={.status.state} robustness={.status.robustness} currentNode={.status.currentNodeID}{"\n"}'
```
Expected before continuing: `state=detached`
E) Scale workload back up
```sh
kubectl -n "$NS" scale statefulset "$STS" --replicas=1
kubectl -n "$NS" get pods -w
```
F) Verify recovery
```sh
kubectl -n "$NS" get pods
kubectl -n "$NS" describe pod <pod-name>
kubectl -n longhorn-system get volumes.longhorn.io "$VOL" \
  -o jsonpath='state={.status.state} robustness={.status.robustness} replicas={.spec.numberOfReplicas}{"\n"}'
kubectl -n longhorn-system get engines.longhorn.io "${VOL}-e-0" \
  -o jsonpath='engineState={.status.currentState}{"\n"}replicaMap={.status.replicaModeMap}{"\n"}'
```
Healthy signals:
- workload pod reaches `Running` (and ready)
- no new `FailedMount` / `Input/output error` events
- engine replica map does not keep looping with stale `UNKNOWN ... ERR` endpoints
Why `degraded` May Still Appear
In a 2-node cluster, `degraded` can still occur during transitions/rebuilds and is not always fatal by itself.
Treat as urgent if combined with:
- `robustness=unknown`
- repeated unknown replica deletion loops
- repeated mount/fsck/mkfs I/O errors
How to detect quickly: Symptom Detection Commands.
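A one-line overview of robustness across all volumes helps separate a transient `degraded` from the urgent combination above:
```sh
kubectl -n longhorn-system get volumes.longhorn.io \
  -o custom-columns=NAME:.metadata.name,STATE:.status.state,ROBUSTNESS:.status.robustness,REPLICAS:.spec.numberOfReplicas
```
Volumes showing `unknown` robustness, or `degraded` that never clears, are candidates for Step 2.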
Post-Recovery Validation Checklist
- `kubectl get pods -A`
- `kubectl get events -A --field-selector type=Warning --sort-by=.lastTimestamp`
- `kubectl -n longhorn-system get volumes.longhorn.io`
- Spot-check engine states for previously affected volumes (sketch below):
  - no stale `UNKNOWN ... ERR` replica map loops
  - no repeating `FailedMount` / `Input/output error` in pod events
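The engine spot-check can be scripted for the previously affected volumes; the volume names below are placeholders, and the `-e-0` engine-name suffix follows the convention already used in section F of the recovery commands:
```sh
for VOL in <volume-name-1> <volume-name-2>; do
  echo "== $VOL =="
  kubectl -n longhorn-system get engines.longhorn.io "${VOL}-e-0" \
    -o jsonpath='engineState={.status.currentState}{"\n"}replicaMap={.status.replicaModeMap}{"\n"}'
done
```
Every replica in the map should report `RW`; any lingering `ERR` entry means that volume needs another pass through Step 2.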
If checks fail, repeat Step 2: Per-Volume Non-Destructive Recovery Procedure per affected volume. If you need copy/paste flow, use Exact Recovery Commands (Generic).
Remaining Recommended Follow-Up
- Continue volume-by-volume remediation for any remaining affected workloads.
- Check node-level disk and network health for both Longhorn nodes.
- Upgrade Longhorn to a patched stable release after the cluster stabilizes.
- Re-enable Argo CD only after storage-backed workloads remain stable.
Related: