docs: explain panic mode #4990

zhaohuabing · 2025-01-02T01:12:17Z

Explain the "panic mode" that is triggered when failed endpoints exceed 50%, as users were asking why requests were sent to unhealthy endpoints.

Signed-off-by: Huabing Zhao <[email protected]>

codecov · 2025-01-02T01:19:06Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 66.71%. Comparing base (f71fa99) to head (283a63e).
Report is 18 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #4990      +/-   ##
==========================================
- Coverage   66.77%   66.71%   -0.06%     
==========================================
  Files         209      209              
  Lines       32052    32055       +3     
==========================================
- Hits        21404    21387      -17     
- Misses       9374     9387      +13     
- Partials     1274     1281       +7


arkodg · 2025-01-02T21:37:33Z

api/v1alpha1/healthcheck_types.go

@@ -9,6 +9,11 @@ import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

 // HealthCheck configuration to decide which endpoints
 // are healthy and can be used for routing.
+//
+// Please note that Envoy load balancer may behave differently when lots of endpoints are unhealthy because of the "panic mode".


// Note: Once the overall health of the backendRef drops below 50% (e.g. a backendRef having 10 endpoints
// with more than 5 unhealthy endpoints), the health check is ignored for the remaining endpoints i.e. they are not
// removed from the load balancing pool. This prevents cascading failures and retry storms in the distributed system

the health check is ignored for the remaining endpoints i.e. they are not
removed from the load balancing pool.

This description may not be accurate. I've verified the following behavior in my test cluster:

When the percentage of unhealthy endpoints reaches 50%, the health check is not ignored: the unhealthy endpoints are still marked as unhealthy. However, the load balancer distributes requests across all endpoints, both healthy and unhealthy.

I guess the reason is to prevent too many requests from being sent to the small number of remaining healthy endpoints, which could overwhelm them and tear down the whole cluster.
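For illustration, the behavior described above can be sketched in Go. This is a minimal model, not Envoy's implementation: `Endpoint`, `makeEndpoints`, and `eligibleEndpoints` are hypothetical names, and the 50% constant with a strict less-than comparison on the healthy percentage is an assumption based on Envoy's default panic threshold.

```go
package main

import "fmt"

// Endpoint is a hypothetical stand-in for an upstream host and its
// health check status.
type Endpoint struct {
	Addr    string
	Healthy bool
}

// panicThreshold is assumed to match Envoy's default healthy panic
// threshold of 50%.
const panicThreshold = 50.0

// makeEndpoints builds `total` endpoints, marking the first `unhealthy`
// of them as failed. It exists only to drive the example.
func makeEndpoints(total, unhealthy int) []Endpoint {
	eps := make([]Endpoint, 0, total)
	for i := 0; i < total; i++ {
		eps = append(eps, Endpoint{
			Addr:    fmt.Sprintf("10.0.0.%d:8080", i+1),
			Healthy: i >= unhealthy,
		})
	}
	return eps
}

// eligibleEndpoints returns the endpoints the load balancer may pick
// from and whether panic mode is active. Health status is never
// rewritten: in panic mode the unhealthy endpoints stay marked
// unhealthy but are still eligible for traffic.
func eligibleEndpoints(all []Endpoint) ([]Endpoint, bool) {
	if len(all) == 0 {
		return nil, false
	}
	healthy := make([]Endpoint, 0, len(all))
	for _, ep := range all {
		if ep.Healthy {
			healthy = append(healthy, ep)
		}
	}
	healthyPct := 100 * float64(len(healthy)) / float64(len(all))
	if healthyPct < panicThreshold {
		// Panic mode: spread load over every endpoint so the few
		// healthy ones are not overwhelmed.
		return all, true
	}
	return healthy, false
}

func main() {
	for _, unhealthy := range []int{3, 5, 6} {
		eps, panicMode := eligibleEndpoints(makeEndpoints(10, unhealthy))
		fmt.Printf("unhealthy=%d panic=%v eligible=%d\n", unhealthy, panicMode, len(eps))
	}
}
```

In this model, health checking keeps running in both branches; only the set of endpoints eligible for load balancing changes.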

Signed-off-by: Huabing Zhao <[email protected]>

api/v1alpha1/healthcheck_types.go

explain panic mode

0082053

Signed-off-by: Huabing Zhao <[email protected]>

zhaohuabing requested a review from a team as a code owner January 2, 2025 01:12

zirain previously approved these changes Jan 2, 2025

arkodg reviewed Jan 2, 2025

small wording

183a334

Signed-off-by: Huabing Zhao <[email protected]>

zhaohuabing dismissed zirain’s stale review via 183a334 January 3, 2025 07:13

fix lint

283a63e

Signed-off-by: Huabing Zhao <[email protected]>

zhaohuabing requested review from zirain and arkodg January 7, 2025 06:57

arkodg reviewed Jan 7, 2025

api/v1alpha1/healthcheck_types.go

arkodg approved these changes Jan 8, 2025

arkodg requested review from a team January 8, 2025 01:06

zirain approved these changes Jan 8, 2025

zirain merged commit 00ecd08 into envoyproxy:main Jan 8, 2025
25 checks passed

zhaohuabing deleted the chore-panic-mode branch January 8, 2025 07:36
