Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change LH default settings to fix maintenance mode issues #788

Merged
merged 1 commit into from
Aug 14, 2024

Conversation

w13915984028
Copy link
Member

@w13915984028 w13915984028 commented Aug 6, 2024

Problem:

The Harvester maintenance mode is blocked by manually attached volumes and single-replica volumes.

Solution:

Change LH default setting to allow node drain on such scenarios.

Refer:

https://github.com/longhorn/charts/blob/v1.6.x/charts/longhorn/values.yaml#L248

https://github.com/longhorn/charts/blob/ad73dc01239b7eeb25ff510ce8358578433d85a5/charts/longhorn/values.yaml#L250

Related Issue:
harvester/harvester#6264
harvester/harvester#6266

Test plan:

Per issue steps.

After a new installation, get such LH default settings:

default-data-path                                                 /var/lib/harvester/defaultdisk                    true      7m19s
...
detach-manually-attached-volumes-when-cordoned                    true                                              true      7m19s
...
node-drain-policy                                                 allow-if-replica-is-stopped                       true      7m19s

Following PRs:

Copy link

@ejweber ejweber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No major concern from me on the Longhorn side.

We default to block-if-contains-last-replica because there are cluster upgrade strategies that remove nodes from a cluster after they have been drained and replace them with new nodes. allow-if-last-replica-is-stopped can not protect data for single replica volumes in these scenarios. Since Harvester controls the upgrade process, it is fine to use allow-if-last-replica-is-stopped assuming Harvester team is sure the situation I described will not occur.

detach-manually-attached-volumes-when-cordoned comes down to a balance of priorities. We chose to default it to false so users would not be surprised by a volume they purposely attached becoming detached during a cluster maintenance operation. We're not sure (in general) why users are manually attaching volumes, so it's hard to gauge the importance of this. If the Harvester support team is repeatedly running into issues in customer environments in which manual attachments are disrupting upgrades, this is probably a good change.

@bk201 bk201 requested a review from Vicente-Cheng August 7, 2024 14:20
@bk201
Copy link
Member

bk201 commented Aug 12, 2024

@WebberHuang1118 @FrankYang0529 @Yu-Jack
The PR is to set the setting detach-manually-attached-volumes-when-cordoned to true by default: Longhorn will automatically detach "manually" attached volumes if a volume's workload pod is on the cordoning/draining node.

Can you check if we have any remaining manually attached paths or other concerns? Thanks.

@Vicente-Cheng
Copy link
Contributor

Hi @ejweber,
Thanks for the explanation. Just some questions would like to get more clear.

For allow-if-last-replica-is-stopped

there are cluster upgrade strategies that remove nodes from a cluster after they have been drained and replace them with new nodes.

What is this situation? I am sure we did not remove the Longhorn nodes from the cluster when upgrading. I just want to confirm that this behavior is triggered by the user who owns the cluster instead of the Longhorn itself.

For detach-manually-attached-volumes-when-cordoned

We chose to default it to false so users would not be surprised by a volume they purposely attached becoming detached during a cluster maintenance operation.

I prefer this configuration to be false. As you mentioned, the user should know what happened with their volume if the node cordon. Actually, I checked the harvester project, and it looks like we did not use the AttacherTypeLonghornAPI for manual attachment. So far, this setting would not affect us. But I would like to make this false.

@w13915984028
Copy link
Member Author

w13915984028 commented Aug 12, 2024

@Vicente-Cheng I try to answer from the context of Harvester:

From Harvester's practice, the production deployment are mostly based on bare-metal servers, customer has such scenarios which require a node maintenance:
(1) Replace/add hardware
(2) Re-adapt network
(3) Trouble-shooting, migrate workloads and reboot node
(4) Replace a whole bare-metal node, or offline it for some days
(5) Others
LH is passive in this context.

In those cases, customer expects clicking the Node Maintenance is effective and smooth, but currently due to those limitations, we encountered some bug reporting and it took time to troubleshooting.

LH provides many options for various of scenarios and for compatibility. Back to Harvester, we need to recheck each option and select the best match for customer convenience.

We need to make a decision to balance between:

(1) Keep current option, if user accidently/actively attache a volume, he needs to detach it to get the success of node maintenance, normally he will ask for help about failure of node maintenance and after troubleshooting he knows it. We made the implicit decision for user. And tell him to change on embedded LH.

(2) Use detach manually attached volume and highlight it on document, tell user we take precedence of node maintenance and will detach it. But user can actively enable it from the embedded LH UI. Then user knows what and makes the decision.

BTW, please also check the document PR: harvester/docs#619, it has more details about the issue and the updated Harvester document, thanks.

@Vicente-Cheng
Copy link
Contributor

Thanks @w13915984028,

For the allow-if-last-replica-is-stopped, I thought you were right. Just curious about the situation Eric mentioned.

For the detach-manually-attached-volumes-when-cordoned,
I do not have any strong opinion of that value. The reason, as I mentioned above, we do not have any feature depends on that (manually attached).
The following is just my personal thought. I thought manually attaching the Longhorn volume was a temporary operation. So, if the maintenance mode is stuck because of that, it makes sense to me. We should give some tips and documents, and users should know what they need to do (detach the volume after they complete any temporary operation).

Just want to make the content of these configs clear. The changes in this PR are ok with me.

@WebberHuang1118 WebberHuang1118 self-requested a review August 13, 2024 01:24
Copy link
Member

@FrankYang0529 FrankYang0529 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks.

Copy link
Contributor

@tserong tserong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Regarding Eric's comment:

We default to block-if-contains-last-replica because there are cluster upgrade strategies that remove nodes from a cluster after they have been drained and replace them with new nodes. allow-if-last-replica-is-stopped can not protect data for single replica volumes in these scenarios. Since Harvester controls the upgrade process, it is fine to use allow-if-last-replica-is-stopped assuming Harvester team is sure the situation I described will not occur.

AFAIK we upgrade all nodes inplace rather than removing and adding new nodes, so the data should remain intact throughout.

@WebberHuang1118
Copy link
Member

@WebberHuang1118 @FrankYang0529 @Yu-Jack The PR is to set the setting detach-manually-attached-volumes-when-cordoned to true by default: Longhorn will automatically detach "manually" attached volumes if a volume's workload pod is on the cordoning/draining node.

Can you check if we have any remaining manually attached paths or other concerns? Thanks.

Harvester doesn't manually attach volume for VM snapshot/backup instead we make use of LH Volume Attach/Detach Implementation since Harvester v1.3.0 harvester/harvester#4910

Copy link
Member

@WebberHuang1118 WebberHuang1118 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks.

@bk201 bk201 merged commit 54caa47 into harvester:master Aug 14, 2024
6 checks passed
@w13915984028
Copy link
Member Author

@mergify backport v1.4

Copy link

mergify bot commented Aug 14, 2024

backport v1.4

✅ Backports have been created

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants