docs: explain panic mode #4990

Merged
merged 3 commits into from
Jan 8, 2025


zhaohuabing
Member

Explain the "panic mode", which is triggered when failed endpoints exceed 50%, as users were asking why requests were sent to unhealthy endpoints.

Signed-off-by: Huabing Zhao <[email protected]>
@zhaohuabing zhaohuabing requested a review from a team as a code owner January 2, 2025 01:12
zirain
zirain previously approved these changes Jan 2, 2025

codecov bot commented Jan 2, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 66.71%. Comparing base (f71fa99) to head (283a63e).
Report is 18 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4990      +/-   ##
==========================================
- Coverage   66.77%   66.71%   -0.06%     
==========================================
  Files         209      209              
  Lines       32052    32055       +3     
==========================================
- Hits        21404    21387      -17     
- Misses       9374     9387      +13     
- Partials     1274     1281       +7     


@@ -9,6 +9,11 @@ import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// HealthCheck configuration to decide which endpoints
// are healthy and can be used for routing.
//
// Please note that Envoy load balancer may behave differently when lots of endpoints are unhealthy because of the "panic mode".
Contributor


// Note: Once the overall health of the backendRef drops below 50% (e.g. a backendRef having 10 endpoints
// with more than 5 unhealthy endpoints), the health check is ignored for the remaining endpoints i.e. they are not
// removed from the load balancing pool. This prevents cascading failures and retry storms in the distributed system

Member Author

@zhaohuabing zhaohuabing Jan 3, 2025


the health check is ignored for the remaining endpoints i.e. they are not
removed from the load balancing pool.

This description may not be accurate. I've verified the following behavior in my test cluster:

When the percentage of unhealthy endpoints reaches 50%, the health check is not ignored: the unhealthy endpoints are still marked as unhealthy. However, the load balancer distributes requests across all endpoints, both healthy and unhealthy.

I guess the reason is to prevent a large number of requests from being sent to the small number of remaining healthy endpoints, which could overwhelm them and tear down the whole cluster.
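
To make the behavior described above concrete, here is a minimal Go sketch of the selection rule. It is not Envoy's or Envoy Gateway's code: the `Endpoint` type and the `inPanicMode` and `eligibleEndpoints` functions are hypothetical names invented for illustration. It assumes a single priority level, ignores overprovisioning, and assumes panic mode starts when the healthy share drops strictly below the threshold.

```go
package main

import "fmt"

// Endpoint is a hypothetical, simplified view of a backend endpoint.
type Endpoint struct {
	Address string
	Healthy bool // health check result; still tracked while in panic mode
}

// defaultPanicThreshold mirrors the 50% figure discussed in this PR.
const defaultPanicThreshold = 50.0

// inPanicMode reports whether the healthy share of endpoints is below
// thresholdPercent. The strict "<" comparison is an assumption of this sketch.
func inPanicMode(endpoints []Endpoint, thresholdPercent float64) bool {
	if len(endpoints) == 0 {
		return false
	}
	healthy := 0
	for _, ep := range endpoints {
		if ep.Healthy {
			healthy++
		}
	}
	healthyPercent := 100.0 * float64(healthy) / float64(len(endpoints))
	return healthyPercent < thresholdPercent
}

// eligibleEndpoints returns the endpoints the load balancer may choose from.
// In panic mode every endpoint is eligible, although the unhealthy ones keep
// their unhealthy status. Otherwise only healthy endpoints are eligible.
func eligibleEndpoints(endpoints []Endpoint, thresholdPercent float64) []Endpoint {
	if inPanicMode(endpoints, thresholdPercent) {
		return endpoints
	}
	var out []Endpoint
	for _, ep := range endpoints {
		if ep.Healthy {
			out = append(out, ep)
		}
	}
	return out
}

func main() {
	// Ten endpoints, of which only the last four are healthy (40% healthy).
	endpoints := make([]Endpoint, 10)
	for i := range endpoints {
		endpoints[i] = Endpoint{
			Address: fmt.Sprintf("10.0.0.%d:8080", i+1),
			Healthy: i >= 6,
		}
	}
	fmt.Println("panic mode:", inPanicMode(endpoints, defaultPanicThreshold))
	fmt.Println("eligible endpoints:", len(eligibleEndpoints(endpoints, defaultPanicThreshold)))
}
```

The sketch keeps the health status on each endpoint and changes only the eligible set, which matches the observation that unhealthy endpoints stay marked as unhealthy while traffic is spread across the whole pool.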

Signed-off-by: Huabing Zhao <[email protected]>
Signed-off-by: Huabing Zhao <[email protected]>
@zhaohuabing zhaohuabing requested review from zirain and arkodg January 7, 2025 06:57
@arkodg arkodg requested review from a team January 8, 2025 01:06
@zirain zirain merged commit 00ecd08 into envoyproxy:main Jan 8, 2025
25 checks passed
@zhaohuabing zhaohuabing deleted the chore-panic-mode branch January 8, 2025 07:36

3 participants