Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(translator): extension server should fail close #4936

Open
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

liorokman
Copy link
Contributor

This PR changes the default behavior for extension servers to fail-close instead of fail open.

Which issue(s) this PR fixes:
Fixes #4155

Release Notes: Yes

Copy link

codecov bot commented Dec 16, 2024

Codecov Report

Attention: Patch coverage is 65.60000% with 43 lines in your changes missing coverage. Please review.

Project coverage is 66.75%. Comparing base (00ecd08) to head (52a41c2).

Files with missing lines Patch % Lines
internal/extension/registry/extension_manager.go 0.00% 16 Missing ⚠️
internal/xds/translator/translator.go 83.87% 10 Missing and 5 partials ⚠️
internal/xds/translator/extension.go 25.00% 8 Missing and 4 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4936      +/-   ##
==========================================
- Coverage   66.80%   66.75%   -0.06%     
==========================================
  Files         209      209              
  Lines       32264    32367     +103     
==========================================
+ Hits        21554    21606      +52     
- Misses       9419     9455      +36     
- Partials     1291     1306      +15     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@liorokman
Copy link
Contributor Author

/retest

1 similar comment
@liorokman
Copy link
Contributor Author

/retest

@liorokman liorokman marked this pull request as ready for review December 18, 2024 12:35
@liorokman liorokman requested a review from a team as a code owner December 18, 2024 12:35
for _, filter := range listener.FilterChains {
hcm, err := findHCMinFilterChain(filter)
if err != nil {
// no HCM found, skip
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also cover use cases where tcp_proxy is found instead of HCM? e.g. for filter chains generated for TCP/TLSRoute. Is it currently possible to mutate these filter chains in the extension manager using one of the hooks?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The listener hook will be called for non-HTTP(s) listeners.

But for non-HTTP(s) listeners, it's not possible to return an HTTP 500 response. What would "failing-close" look like here?

@@ -7,6 +7,7 @@ breaking changes: |
An empty TLS ALPNProtocols list is now treated as user-defined disablement of the TLS ALPN extension.
Outlier detection (passive health check) is now disabled by default.
refer to https://gateway.envoyproxy.io/docs/api/extension_types/#backendtrafficpolicy for working with passive health checks.
Envoy Gateway treats errors in calls to an extension service as fail-closed by default. The previous behavior can be enabled by setting the `failOpen` field to `true` in the extension service configuration.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can elaborate a bit here on what fail closed means. Basically, translation/connectivity errors would lead to replacement of proxy routing config with immediate responses.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -202,6 +202,17 @@ func (t *testingExtensionServer) PostHTTPListenerModify(_ context.Context, req *

// PostTranslateModifyHook inserts and overrides some clusters/secrets
func (t *testingExtensionServer) PostTranslateModify(_ context.Context, req *pb.PostTranslateModifyRequest) (*pb.PostTranslateModifyResponse, error) {
for _, cluster := range req.Clusters {
if edsConfig := cluster.GetEdsClusterConfig(); edsConfig != nil {
if strings.Contains(edsConfig.ServiceName, "fail-close-error") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would be good to have some comments in here explaining what this is simulating

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -500,6 +572,21 @@ func (t *Translator) addRouteToRouteConfig(
// If no extension exists (or it doesn't subscribe to this hook) then this is a quick no-op.
if err = processExtensionPostVHostHook(vHost, t.ExtensionManager); err != nil {
errs = errors.Join(errs, err)
if t.ExtensionManager != nil && !(*t.ExtensionManager).FailOpen() {
vHost.Routes = []*routev3.Route{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would be good to have some comments in here explaining what this setting means, and why its added here, similar comment would be great for other snippets where xds is being modified to return a 500

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@arkodg
Copy link
Contributor

arkodg commented Jan 7, 2025

thanks @liorokman , overall LGTM, added some non blocking comments around adding comments

@liorokman liorokman force-pushed the ext-server-fail-close branch from 2357622 to 0ffd03c Compare January 8, 2025 08:53
@liorokman
Copy link
Contributor Author

/retest

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

"Fail Safe" control knob for Extension Server
3 participants