
bug: envoy initial fetch time out when CDS updated. #1035

Closed
haorenfsa opened this issue Oct 25, 2024 · 11 comments
@haorenfsa

haorenfsa commented Oct 25, 2024

The server should send one more EDS response when any CDS update's ACK is received.

refs: https://www.envoyproxy.io/docs/envoy/latest/api-docs/xds_protocol#xds-ack-nack

[screenshots of the referenced xDS protocol documentation]

In my test, after the CDS update in the controller, no EDS response is sent afterwards, hence the cluster warming times out.
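For concreteness, here is a minimal go-control-plane sketch of the kind of update I mean (illustrative only, not envoy-gateway's actual code; the node ID, cluster name, and connect-timeout tweak are made up): only the Cluster changes between snapshot versions while the ClusterLoadAssignment stays identical, so the updated cluster sits in warming until an EDS response is applied after the CDS ACK.

```go
package main

import (
	"context"
	"time"

	clusterv3 "github.com/envoyproxy/go-control-plane/envoy/config/cluster/v3"
	corev3 "github.com/envoyproxy/go-control-plane/envoy/config/core/v3"
	endpointv3 "github.com/envoyproxy/go-control-plane/envoy/config/endpoint/v3"
	"github.com/envoyproxy/go-control-plane/pkg/cache/types"
	cache "github.com/envoyproxy/go-control-plane/pkg/cache/v3"
	resource "github.com/envoyproxy/go-control-plane/pkg/resource/v3"
	"google.golang.org/protobuf/types/known/durationpb"
)

// edsCluster builds a minimal EDS cluster whose endpoints arrive over ADS.
// The connect timeout is the only field varied between the two snapshots below.
func edsCluster(name string, connectTimeout time.Duration) *clusterv3.Cluster {
	return &clusterv3.Cluster{
		Name:                 name,
		ConnectTimeout:       durationpb.New(connectTimeout),
		ClusterDiscoveryType: &clusterv3.Cluster_Type{Type: clusterv3.Cluster_EDS},
		EdsClusterConfig: &clusterv3.Cluster_EdsClusterConfig{
			EdsConfig: &corev3.ConfigSource{
				ConfigSourceSpecifier: &corev3.ConfigSource_Ads{Ads: &corev3.AggregatedConfigSource{}},
			},
		},
	}
}

func main() {
	ctx := context.Background()
	sc := cache.NewSnapshotCache(true /* ads */, cache.IDHash{}, nil)
	const nodeID = "example-node" // placeholder node hash

	// The endpoint assignment is identical in both versions.
	cla := &endpointv3.ClusterLoadAssignment{ClusterName: "backend"}

	// Version 1: initial cluster and endpoints (errors elided for brevity).
	snap1, _ := cache.NewSnapshot("1", map[resource.Type][]types.Resource{
		resource.ClusterType:  {edsCluster("backend", 1 * time.Second)},
		resource.EndpointType: {cla},
	})
	_ = sc.SetSnapshot(ctx, nodeID, snap1)

	// Version 2: only the Cluster definition changes; the ClusterLoadAssignment is
	// byte-for-byte the same. The updated cluster re-enters warming on the envoy
	// side and only finishes warming once an EDS response is applied after the CDS ACK.
	snap2, _ := cache.NewSnapshot("2", map[resource.Type][]types.Resource{
		resource.ClusterType:  {edsCluster("backend", 2 * time.Second)},
		resource.EndpointType: {cla},
	})
	_ = sc.SetSnapshot(ctx, nodeID, snap2)
}
```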

As highlighted in my logs below:
The EDS update responses are sent between 08:17:55 and 08:18:46, before the CDS update (nonce=10) is ACKed by envoy.
[screenshot of the control-plane logs]

At 08:21:46 (I set my envoy initial fetch timeout to 3 minutes), the client side printed an initial fetch timeout, because no EDS response arrived after the last CDS update.
[screenshot of the envoy logs showing the initial fetch timeout]
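For reference, the three-minute timeout mentioned above corresponds to the initial_fetch_timeout field of the EDS ConfigSource. A minimal sketch with the go-control-plane protos, with illustrative values only:

```go
package main

import (
	"fmt"
	"time"

	corev3 "github.com/envoyproxy/go-control-plane/envoy/config/core/v3"
	"google.golang.org/protobuf/types/known/durationpb"
)

func main() {
	// Sketch only: an ADS-backed EDS ConfigSource carrying a three-minute
	// initial_fetch_timeout, mirroring the setup described above. When this timer
	// fires while a cluster is still warming, envoy reports the initial fetch timeout.
	edsConfig := &corev3.ConfigSource{
		InitialFetchTimeout:   durationpb.New(3 * time.Minute),
		ConfigSourceSpecifier: &corev3.ConfigSource_Ads{Ads: &corev3.AggregatedConfigSource{}},
	}
	fmt.Println(edsConfig.GetInitialFetchTimeout().AsDuration()) // 3m0s
}
```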

PS:
I'm using envoy gateway in my environment.

Some of the logs shown above were added by my debug build:
[screenshot of the debug logs]

@haorenfsa changed the title from "Race in rewatch when receive delta watch request" to "bug: envoy initial fetch time out when CDS updated." Oct 28, 2024
@valerian-roche
Contributor

Hey, thanks for opening the issue. I believe this is the same underlying issue as #1001, which has now been addressed in envoy.
The runtime flag mentioned in the documentation capture now defaults to true as of envoy 1.32, which may not be the version used in your case.
I do not believe the control-plane library will ever address this issue. It would require deep inspection of user resources, which is the opposite of the direction we should take in my opinion. It would also tie the control-plane more tightly to envoy, whereas we have been striving to support gRPC clients in recent times.

Can you confirm whether you are still encountering this issue when using envoy 1.32.x, or when activating the runtime flag mentioned in the documentation?

@haorenfsa
Author

Thank you so much, @valerian-roche, for the background catch-up. I'm using v1.30.6. I'll try the solution you mentioned.

> I do not believe the control-plane library will ever address this issue. It would require deep inspection of user resources, which is the opposite of the direction we should take in my opinion. It would also tie the control-plane more tightly to envoy, whereas we have been striving to support gRPC clients in recent times.

By the way, I fully agree with you that we should not make any changes that would tie the control-plane to envoy.

I think the patch would introduce only a few changes to make the control-plane work with old envoy versions without affecting other xDS clients. This would enable users to adopt control-plane solutions like envoy-gateway on Kubernetes without changing the data plane version. It would be great if you would reconsider it.

Anyway, thank you again for taking the time 😊

@valerian-roche
Contributor

Given that envoy has now defaulted the fix, I feel quite strongly that this issue does not justify this abstraction leakage. Users of envoy-gateway do not have to update the data-plane version, as they can simply activate the runtime flag to address the issue in envoy < 1.32. All supported versions of envoy have a functional implementation of the EDS cache.
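For readers stuck on envoy < 1.32, a rough sketch of what such a runtime override could look like as a static runtime layer in the envoy bootstrap, expressed with the go-control-plane bootstrap protos purely for illustration. The flag name `envoy.restart_features.use_eds_cache_for_ads` is an assumption (the guard is not named in this thread), so verify the exact key against the envoy documentation for your version; envoy-gateway users would typically apply the equivalent YAML through their bootstrap customization rather than Go code.

```go
package main

import (
	"fmt"

	bootstrapv3 "github.com/envoyproxy/go-control-plane/envoy/config/bootstrap/v3"
	"google.golang.org/protobuf/encoding/protojson"
	"google.golang.org/protobuf/types/known/structpb"
)

func main() {
	// Assumed flag name for the EDS-cache runtime guard; confirm it against the
	// envoy docs for the exact version in use before relying on it.
	layer, err := structpb.NewStruct(map[string]interface{}{
		"envoy.restart_features.use_eds_cache_for_ads": true,
	})
	if err != nil {
		panic(err)
	}

	// Equivalent to a layered_runtime.layers[].static_layer entry in the
	// envoy bootstrap YAML that the data plane is started with.
	b := &bootstrapv3.Bootstrap{
		LayeredRuntime: &bootstrapv3.LayeredRuntime{
			Layers: []*bootstrapv3.RuntimeLayer{{
				Name: "eds-cache-override",
				LayerSpecifier: &bootstrapv3.RuntimeLayer_StaticLayer{
					StaticLayer: layer,
				},
			}},
		},
	}
	fmt.Println(protojson.Format(b)) // prints the JSON form of the bootstrap fragment
}
```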

@valerian-roche
Contributor

@alecholmez FYI, I discussed PR #1034 here. IMO, since envoy fixed it by default in 1.32, we can keep the control-plane simpler.

@lukidzi
Contributor

lukidzi commented Nov 4, 2024

I think this could still be an issue. If a user sets initial_fetch_timeout to 0, it disables the timeout, causing Envoy to wait indefinitely.

> Envoy waits for an EDS assignment until initial_fetch_timeout times out, and will then apply the cached assignment and finish updating the warmed cluster.

@valerian-roche
Contributor

I am unclear why a user would set this value, as it breaks envoy's EDS behavior in that case. Can you clarify what use-case this is expected to support?
Can you also open a new issue on envoy to get this behavior addressed? I do not believe there is a reason for envoy not to use the cache in this context (I am actually unclear why envoy does not use its cached value immediately, as there is no real reason why the EDS resource in the cache would be invalid, and if it has changed it will still be received right after).

@haorenfsa
Author

@valerian-roche When initial_fetch_timeout is set to 0, there is no timeout. If CDS is updated without any further EDS changes, envoy hangs in the cluster-initializing state and the cache is not used.

@haorenfsa
Author

I know that setting initial_fetch_timeout to a short duration may solve this case, but I think that is a workaround. As engineers, we should try to solve the problem at the root.

@valerian-roche
Contributor

The case of initial_fetch_timeout set to 0 is unclear to me, but it should likely be addressed in envoy itself by directly using the cached value. Feel free to open an issue there.
In the general case I do not know why the envoy maintainers chose to wait for the end of the timeout before using the endpoints, but in the end the core issue lies in envoy's implementation of its xDS clients, which is being addressed by the maintainers progressively.


github-actions bot commented Jan 1, 2025

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale label Jan 1, 2025

github-actions bot commented Jan 8, 2025

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.

@github-actions github-actions bot closed this as not planned Jan 8, 2025