Change LH default settings to fix maintenance mode issues #788

w13915984028 · 2024-08-06T05:50:18Z

Problem:

The Harvester maintenance mode is blocked by manually attached volumes and single-replica volumes.

Solution:

Change LH default setting to allow node drain on such scenarios.

Refer:

https://github.com/longhorn/charts/blob/v1.6.x/charts/longhorn/values.yaml#L248

https://github.com/longhorn/charts/blob/ad73dc01239b7eeb25ff510ce8358578433d85a5/charts/longhorn/values.yaml#L250

Related Issue:
harvester/harvester#6264
harvester/harvester#6266

Test plan:

Per issue steps.

After a new installation, get such LH default settings:

default-data-path                                                 /var/lib/harvester/defaultdisk                    true      7m19s
...
detach-manually-attached-volumes-when-cordoned                    true                                              true      7m19s
...
node-drain-policy                                                 allow-if-replica-is-stopped                       true      7m19s

Following PRs:

Signed-off-by: Jian Wang <[email protected]>

ejweber

No major concern from me on the Longhorn side.

We default to block-if-contains-last-replica because there are cluster upgrade strategies that remove nodes from a cluster after they have been drained and replace them with new nodes. allow-if-last-replica-is-stopped can not protect data for single replica volumes in these scenarios. Since Harvester controls the upgrade process, it is fine to use allow-if-last-replica-is-stopped assuming Harvester team is sure the situation I described will not occur.

detach-manually-attached-volumes-when-cordoned comes down to a balance of priorities. We chose to default it to false so users would not be surprised by a volume they purposely attached becoming detached during a cluster maintenance operation. We're not sure (in general) why users are manually attaching volumes, so it's hard to gauge the importance of this. If the Harvester support team is repeatedly running into issues in customer environments in which manual attachments are disrupting upgrades, this is probably a good change.

bk201 · 2024-08-12T02:02:03Z

@WebberHuang1118 @FrankYang0529 @Yu-Jack
The PR is to set the setting detach-manually-attached-volumes-when-cordoned to true by default: Longhorn will automatically detach "manually" attached volumes if a volume's workload pod is on the cordoning/draining node.

Can you check if we have any remaining manually attached paths or other concerns? Thanks.

Vicente-Cheng · 2024-08-12T12:51:46Z

Hi @ejweber,
Thanks for the explanation. Just some questions would like to get more clear.

For allow-if-last-replica-is-stopped

there are cluster upgrade strategies that remove nodes from a cluster after they have been drained and replace them with new nodes.

What is this situation? I am sure we did not remove the Longhorn nodes from the cluster when upgrading. I just want to confirm that this behavior is triggered by the user who owns the cluster instead of the Longhorn itself.

For detach-manually-attached-volumes-when-cordoned

We chose to default it to false so users would not be surprised by a volume they purposely attached becoming detached during a cluster maintenance operation.

I prefer this configuration to be false. As you mentioned, the user should know what happened with their volume if the node cordon. Actually, I checked the harvester project, and it looks like we did not use the AttacherTypeLonghornAPI for manual attachment. So far, this setting would not affect us. But I would like to make this false.

w13915984028 · 2024-08-12T13:24:27Z

@Vicente-Cheng I try to answer from the context of Harvester:

From Harvester's practice, the production deployment are mostly based on bare-metal servers, customer has such scenarios which require a node maintenance:
(1) Replace/add hardware
(2) Re-adapt network
(3) Trouble-shooting, migrate workloads and reboot node
(4) Replace a whole bare-metal node, or offline it for some days
(5) Others
LH is passive in this context.

In those cases, customer expects clicking the Node Maintenance is effective and smooth, but currently due to those limitations, we encountered some bug reporting and it took time to troubleshooting.

LH provides many options for various of scenarios and for compatibility. Back to Harvester, we need to recheck each option and select the best match for customer convenience.

We need to make a decision to balance between:

(1) Keep current option, if user accidently/actively attache a volume, he needs to detach it to get the success of node maintenance, normally he will ask for help about failure of node maintenance and after troubleshooting he knows it. We made the implicit decision for user. And tell him to change on embedded LH.

(2) Use detach manually attached volume and highlight it on document, tell user we take precedence of node maintenance and will detach it. But user can actively enable it from the embedded LH UI. Then user knows what and makes the decision.

BTW, please also check the document PR: harvester/docs#619, it has more details about the issue and the updated Harvester document, thanks.

Vicente-Cheng · 2024-08-12T14:32:24Z

Thanks @w13915984028,

For the allow-if-last-replica-is-stopped, I thought you were right. Just curious about the situation Eric mentioned.

For the detach-manually-attached-volumes-when-cordoned,
I do not have any strong opinion of that value. The reason, as I mentioned above, we do not have any feature depends on that (manually attached).
The following is just my personal thought. I thought manually attaching the Longhorn volume was a temporary operation. So, if the maintenance mode is stuck because of that, it makes sense to me. We should give some tips and documents, and users should know what they need to do (detach the volume after they complete any temporary operation).

Just want to make the content of these configs clear. The changes in this PR are ok with me.

FrankYang0529

LGTM. Thanks.

tserong

LGTM

Regarding Eric's comment:

We default to block-if-contains-last-replica because there are cluster upgrade strategies that remove nodes from a cluster after they have been drained and replace them with new nodes. allow-if-last-replica-is-stopped can not protect data for single replica volumes in these scenarios. Since Harvester controls the upgrade process, it is fine to use allow-if-last-replica-is-stopped assuming Harvester team is sure the situation I described will not occur.

AFAIK we upgrade all nodes inplace rather than removing and adding new nodes, so the data should remain intact throughout.

WebberHuang1118 · 2024-08-14T08:33:38Z

@WebberHuang1118 @FrankYang0529 @Yu-Jack The PR is to set the setting detach-manually-attached-volumes-when-cordoned to true by default: Longhorn will automatically detach "manually" attached volumes if a volume's workload pod is on the cordoning/draining node.

Can you check if we have any remaining manually attached paths or other concerns? Thanks.

Harvester doesn't manually attach volume for VM snapshot/backup instead we make use of LH Volume Attach/Detach Implementation since Harvester v1.3.0 harvester/harvester#4910

WebberHuang1118

LGTM, thanks.

w13915984028 · 2024-08-14T08:47:43Z

@mergify backport v1.4

mergify · 2024-08-14T08:47:47Z

backport v1.4

✅ Backports have been created

#812 Change LH default settings to fix maintenance mode issues (backport #788) has been created for branch v1.4

w13915984028 mentioned this pull request Aug 6, 2024

Add troubleshooting of node maintenance mode harvester/docs#619

Open

Change LH default settings to fix maintenance mode issues

540d74d

Signed-off-by: Jian Wang <[email protected]>

w13915984028 force-pushed the fix6264 branch from 6ba14d5 to 540d74d Compare August 6, 2024 11:59

This was referenced Aug 6, 2024

[BUG] The default settings to Longhorn seems not updated to Longhorn harvester/harvester#6273

Closed

fix(manager): add detailed message when deleting pdb is not allowed longhorn/longhorn-manager#3045

Merged

w13915984028 requested review from tserong, bk201 and ibrokethecloud August 6, 2024 13:17

ejweber approved these changes Aug 7, 2024

View reviewed changes

bk201 requested a review from Vicente-Cheng August 7, 2024 14:20

ibrokethecloud approved these changes Aug 12, 2024

View reviewed changes

WebberHuang1118 self-requested a review August 13, 2024 01:24

FrankYang0529 approved these changes Aug 13, 2024

View reviewed changes

Vicente-Cheng approved these changes Aug 14, 2024

View reviewed changes

tserong approved these changes Aug 14, 2024

View reviewed changes

WebberHuang1118 approved these changes Aug 14, 2024

View reviewed changes

bk201 merged commit 54caa47 into harvester:master Aug 14, 2024
6 checks passed

mergify bot mentioned this pull request Aug 14, 2024

Change LH default settings to fix maintenance mode issues (backport #788) #812

Merged

2 tasks

mergify bot mentioned this pull request Sep 2, 2024

Upgrade: change LH default settings to fix maintenance mode issues (backport #6276) harvester/harvester#6474

Merged

1 task

WebberHuang1118 mentioned this pull request Jan 7, 2025

[FEATURE] Rebuild last healthy replica during node drain harvester/harvester#3378

Closed

WebberHuang1118 mentioned this pull request Jan 8, 2025

[ENHANCEMENT] Improve Single-Replica Volume Detection for Upgrade harvester/harvester#7323

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Change LH default settings to fix maintenance mode issues #788

Change LH default settings to fix maintenance mode issues #788

w13915984028 commented Aug 6, 2024 •

edited

Loading

ejweber left a comment

bk201 commented Aug 12, 2024

Vicente-Cheng commented Aug 12, 2024

w13915984028 commented Aug 12, 2024 •

edited

Loading

Vicente-Cheng commented Aug 12, 2024

FrankYang0529 left a comment

tserong left a comment

WebberHuang1118 commented Aug 14, 2024

WebberHuang1118 left a comment

w13915984028 commented Aug 14, 2024

mergify bot commented Aug 14, 2024 •

edited

Loading

Change LH default settings to fix maintenance mode issues #788

Change LH default settings to fix maintenance mode issues #788

Conversation

w13915984028 commented Aug 6, 2024 • edited Loading

ejweber left a comment

Choose a reason for hiding this comment

bk201 commented Aug 12, 2024

Vicente-Cheng commented Aug 12, 2024

w13915984028 commented Aug 12, 2024 • edited Loading

Vicente-Cheng commented Aug 12, 2024

FrankYang0529 left a comment

Choose a reason for hiding this comment

tserong left a comment

Choose a reason for hiding this comment

WebberHuang1118 commented Aug 14, 2024

WebberHuang1118 left a comment

Choose a reason for hiding this comment

w13915984028 commented Aug 14, 2024

mergify bot commented Aug 14, 2024 • edited Loading

✅ Backports have been created

w13915984028 commented Aug 6, 2024 •

edited

Loading

w13915984028 commented Aug 12, 2024 •

edited

Loading

mergify bot commented Aug 14, 2024 •

edited

Loading