-
Notifications
You must be signed in to change notification settings - Fork 365
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
"Fail Safe" control knob for Extension Server #4155
Comments
we've fail closed by default for all Policy APIs, should we do the same for extension manager (if unable to connect or unable to get a response in time ) ptal @envoyproxy/gateway-maintainers . If unable to connect during startup, it would mean no config would ever be programmed in envoy proxy, and if the gRPC request times out, we could skip that xDS update, and the data plane would have the last good config |
Maybe add a configuration flag in the extension manager configuration section to specify if the extension manager should fail-open or fail-close? |
yeah thats a good home to add the mode config if user expectations vary, we still need to figure out what the default mode should be - fail open or fail close 😄 I vote for fail close by default, to ensure we fail fast and early like we do for Filters and Policies |
if i understand correctly, since the extension has several hooks (HTTPListener, VirtualHost, Route, and Translation), it would be make sense if 500 behavior is used when Route hook failing. But what about on other hooks? |
I'm not sure about the others, but in the case of HTTPListener, in "failsafe mode" the listener should not be added to list of active listener resources if the hook fails. |
Here's a proposal around how Extension Server failures should be handled by Envoy Gateway: Failure to connect to the Extension ServerIf an Extension Server was configured but was not reachable, then the safest action is to fail-close. An Extension Server might be used to add security critical features into the XDS. If the Extension Server is not reachable then Envoy Gateway should generate a default XDS configuration for each configured Listener that returns a 500 error code for all requests. The generated configuration should override the XDS generated by Envoy Gateway - basically creating XDS that listens on all the ports that are configured as Listeners and override all of the configuration in them. Envoy Gateway contacts the Extension Server multiple times during the translation process from IR to XDS. The Extension Server might become unreachable at any time during the translation process. If an Extension Server is defined and becomes unreachable at any point during the translation process, all the already translated XDS should be discarded and the translation process should be stopped and restarted. Error returned from one of the hooks in the extension serverThe current Extension Server interface has four different hooks. The following table describes the semantics for each of the hooks:
The following sequence diagram shows the sequence of calls to the Extension Server during the translation process: sequenceDiagram
participant EG as Envoy Gateway
participant ES as Extension Server
EG->>EG: Generate IR from CRD resources
loop For each Listener
EG->>EG: Generate XDS for Listener
loop For each Route
EG->>EG: Generate XDS for Route<br>attached to Listener
EG->>EG: Generate XDS for VirtualHost<br>attached to Route
activate ES
EG->>ES: Call PostRouteModifyHook
deactivate ES
end
loop For each VirtualHost attached to Routes of the Listener
activate ES
EG->>ES: Call PostVirtualHostModifyHook
deactivate ES
end
activate ES
EG->>ES: Call PostHTTPListenerModifyHook
deactivate ES
end
activate ES
EG->>ES: Call PostTranslateModifyHook
deactivate ES
Each hook in the Extension Server implementation could potentially handle its internal errors itself. If the Extension Server handles its errors internally then Envoy Gateway will not be aware that an error even occurred. However, if the Extension Server returns an error then it basically delegates responsibility for the error to Envoy Gateway. In this case the safest action is to fail-close. The Extension Server could return a default XDS configuration that returns a 500 error code for all requests. The following are some examples of how the Extension Server could handle its internal errors:
Requiring the Extension Server to handle its internal errors represents overhead for the Extension Server developer. In most cases, an Extension Server developer might prefer that if the Extension Server returns an error, then Envoy Gateway should fail-close automatically using the logic described above. Defaulting to "fail-close" in this way would not require any changes to the Extension Server configuration in Envoy Gateway. Some middle ground might also be reasonable. For example, when an Extension Server is registered in Envoy Gateway, the fail-open or fail-close behavior could be configured per enabled hook. The default should be fail-close, but the current (fail-open) behavior could be adopted instead. The configuration could look like this: extensionManager:
hooks:
# The type of hooks that should be invoked
xdsTranslator:
post:
- type: HTTPListener
failOpen: false # This is the default, if the failOpen attribute is not specified
service:
# The service that is hosting the extension server
fqdn:
hostname: extension-server.envoy-gateway-system.svc.cluster.local
port: 5005
|
Currently, the processing of Extension Server logic (ie
Translate()
step) is "best effort". If the Extension Server fails in some way (ie is not available, crashes during request handling), than the envoy configuration is not impacted.In some situations, this could be a "bad thing". For example, if the Extension Server is being used to add a default Authz filter to all Listeners, if the grpc call to the Extension Server fails, than the Listener will still be activated but will not have the Authz filter (thus incorrectly exposing the resource without the desired protection).
Ideally, similar to #3873, there would be an option added to
ExtensionManager
which would allow either "fail open" (current behavior of best effort) or "fail closed" (alternate behavior of disabling the resource associated with the failed hook).The text was updated successfully, but these errors were encountered: