
bug: envoy initial fetch time out when CDS updated. #1035

Closed
haorenfsa opened this issue Oct 25, 2024 · 11 comments
@haorenfsa

haorenfsa commented Oct 25, 2024

The server should send one more EDS response when any CDS update's ACK is received.

refs: https://www.envoyproxy.io/docs/envoy/latest/api-docs/xds_protocol#xds-ack-nack

[screenshots of the referenced xDS protocol documentation]

In my test, after the CDS update in the controller, no EDS response is sent afterwards, hence the cluster warming times out.
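For concreteness, here is a minimal go-control-plane sketch of the kind of update I mean (illustrative only, not envoy-gateway's actual code; the node ID, cluster name, and connect-timeout tweak are made up): only the Cluster changes between snapshot versions while the ClusterLoadAssignment stays identical, so the updated cluster sits in warming until an EDS response is applied after the CDS ACK.

```go
package main

import (
	"context"
	"time"

	clusterv3 "github.com/envoyproxy/go-control-plane/envoy/config/cluster/v3"
	corev3 "github.com/envoyproxy/go-control-plane/envoy/config/core/v3"
	endpointv3 "github.com/envoyproxy/go-control-plane/envoy/config/endpoint/v3"
	"github.com/envoyproxy/go-control-plane/pkg/cache/types"
	cache "github.com/envoyproxy/go-control-plane/pkg/cache/v3"
	resource "github.com/envoyproxy/go-control-plane/pkg/resource/v3"
	"google.golang.org/protobuf/types/known/durationpb"
)

// edsCluster builds a minimal EDS cluster whose endpoints arrive over ADS.
// The connect timeout is the only field varied between the two snapshots below.
func edsCluster(name string, connectTimeout time.Duration) *clusterv3.Cluster {
	return &clusterv3.Cluster{
		Name:                 name,
		ConnectTimeout:       durationpb.New(connectTimeout),
		ClusterDiscoveryType: &clusterv3.Cluster_Type{Type: clusterv3.Cluster_EDS},
		EdsClusterConfig: &clusterv3.Cluster_EdsClusterConfig{
			EdsConfig: &corev3.ConfigSource{
				ConfigSourceSpecifier: &corev3.ConfigSource_Ads{Ads: &corev3.AggregatedConfigSource{}},
			},
		},
	}
}

func main() {
	ctx := context.Background()
	sc := cache.NewSnapshotCache(true /* ads */, cache.IDHash{}, nil)
	const nodeID = "example-node" // placeholder node hash

	// The endpoint assignment is identical in both versions.
	cla := &endpointv3.ClusterLoadAssignment{ClusterName: "backend"}

	// Version 1: initial cluster and endpoints (errors elided for brevity).
	snap1, _ := cache.NewSnapshot("1", map[resource.Type][]types.Resource{
		resource.ClusterType:  {edsCluster("backend", 1 * time.Second)},
		resource.EndpointType: {cla},
	})
	_ = sc.SetSnapshot(ctx, nodeID, snap1)

	// Version 2: only the Cluster definition changes; the ClusterLoadAssignment is
	// byte-for-byte the same. The updated cluster re-enters warming on the envoy
	// side and only finishes warming once an EDS response is applied after the CDS ACK.
	snap2, _ := cache.NewSnapshot("2", map[resource.Type][]types.Resource{
		resource.ClusterType:  {edsCluster("backend", 2 * time.Second)},
		resource.EndpointType: {cla},
	})
	_ = sc.SetSnapshot(ctx, nodeID, snap2)
}
```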

As highlighted in my logs below:
The EDS update responses are sent between 08:17:55 and 08:18:46, before the CDS update (nonce=10) is ACKed by envoy.
[screenshot of the control-plane logs]

At 08:21:46 (I set my envoy initial fetch timeout to 3 minutes), the client side printed an initial fetch timeout, because no EDS response arrived after the last CDS update.
[screenshot of the envoy logs showing the initial fetch timeout]
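For reference, the three-minute timeout mentioned above corresponds to the initial_fetch_timeout field of the EDS ConfigSource. A minimal sketch with the go-control-plane protos, with illustrative values only:

```go
package main

import (
	"fmt"
	"time"

	corev3 "github.com/envoyproxy/go-control-plane/envoy/config/core/v3"
	"google.golang.org/protobuf/types/known/durationpb"
)

func main() {
	// Sketch only: an ADS-backed EDS ConfigSource carrying a three-minute
	// initial_fetch_timeout, mirroring the setup described above. When this timer
	// fires while a cluster is still warming, envoy reports the initial fetch timeout.
	edsConfig := &corev3.ConfigSource{
		InitialFetchTimeout:   durationpb.New(3 * time.Minute),
		ConfigSourceSpecifier: &corev3.ConfigSource_Ads{Ads: &corev3.AggregatedConfigSource{}},
	}
	fmt.Println(edsConfig.GetInitialFetchTimeout().AsDuration()) // 3m0s
}
```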

PS:
I'm using envoy gateway in my environment.

Some of the logs shown above were added by my debug build:
[screenshot of the debug logs]

@haorenfsa changed the title from "Race in rewatch when receive delta watch request" to "bug: envoy initial fetch time out when CDS updated." Oct 28, 2024
@valerian-roche
Contributor

Hey, thanks for opening the issue. I believe this is the same underlying issue as #1001, which has now been addressed in envoy.
The runtime flag mentioned in the documentation capture now defaults to true as of envoy 1.32, which may not be the version used in your case.
I do not believe the control-plane library will ever address this issue. It would require deep inspection of user resources, which is the opposite of the direction we should take in my opinion. It would also tie the control-plane more tightly to envoy, whereas we have been striving to support gRPC clients in recent times.

Can you confirm whether you are still encountering this issue when using envoy 1.32.x, or when activating the runtime flag mentioned in the documentation?

@haorenfsa
Author

Thank you so much, @valerian-roche, for the background catch-up. I'm using v1.30.6. I'll try the solution you mentioned.

> I do not believe the control-plane library will ever address this issue. It would require deep inspection of user resources, which is the opposite of the direction we should take in my opinion. It would also tie the control-plane more tightly to envoy, whereas we have been striving to support gRPC clients in recent times.

By the way, I fully agree with you that we should not make any changes that would tie the control-plane to envoy.

I think the patch would introduce only a few changes to make the control-plane work with old envoy versions without affecting other xDS clients. This would enable users to adopt control-plane solutions like envoy-gateway on Kubernetes without changing the data plane version. It would be great if you would reconsider it.

Anyway, thank you again for taking the time 😊

@valerian-roche
Contributor

Given that envoy has now defaulted the fix, I feel quite strongly that this issue does not justify this abstraction leakage. Users of envoy-gateway do not have to update the data-plane version, as they can simply activate the runtime flag to address the issue in envoy < 1.32. All supported versions of envoy have a functional implementation of the EDS cache.
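For readers stuck on envoy < 1.32, a rough sketch of what such a runtime override could look like as a static runtime layer in the envoy bootstrap, expressed with the go-control-plane bootstrap protos purely for illustration. The flag name `envoy.restart_features.use_eds_cache_for_ads` is an assumption (the guard is not named in this thread), so verify the exact key against the envoy documentation for your version; envoy-gateway users would typically apply the equivalent YAML through their bootstrap customization rather than Go code.

```go
package main

import (
	"fmt"

	bootstrapv3 "github.com/envoyproxy/go-control-plane/envoy/config/bootstrap/v3"
	"google.golang.org/protobuf/encoding/protojson"
	"google.golang.org/protobuf/types/known/structpb"
)

func main() {
	// Assumed flag name for the EDS-cache runtime guard; confirm it against the
	// envoy docs for the exact version in use before relying on it.
	layer, err := structpb.NewStruct(map[string]interface{}{
		"envoy.restart_features.use_eds_cache_for_ads": true,
	})
	if err != nil {
		panic(err)
	}

	// Equivalent to a layered_runtime.layers[].static_layer entry in the
	// envoy bootstrap YAML that the data plane is started with.
	b := &bootstrapv3.Bootstrap{
		LayeredRuntime: &bootstrapv3.LayeredRuntime{
			Layers: []*bootstrapv3.RuntimeLayer{{
				Name: "eds-cache-override",
				LayerSpecifier: &bootstrapv3.RuntimeLayer_StaticLayer{
					StaticLayer: layer,
				},
			}},
		},
	}
	fmt.Println(protojson.Format(b)) // prints the JSON form of the bootstrap fragment
}
```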

@valerian-roche
Contributor

@alecholmez FYI, I discussed PR #1034 here. IMO, since envoy fixed it by default in 1.32, we can keep the control-plane simpler.

@lukidzi
Contributor

lukidzi commented Nov 4, 2024

I think this could still be an issue. If a user sets initial_fetch_timeout to 0, it disables the timeout, causing Envoy to wait indefinitely.

> Envoy waits for an EDS assignment until initial_fetch_timeout times out, and will then apply the cached assignment and finish updating the warmed cluster.

@valerian-roche
Contributor

I am unclear why a user would set this value, as it breaks envoy's EDS behavior in that case. Can you clarify what use-case this is expected to support?
Can you also open a new issue on envoy to get this behavior addressed? I do not believe there is a reason for envoy not to use the cache in this context (I am actually unclear why envoy does not use its cached value immediately, as there is no real reason why the EDS resource in the cache would be invalid, and if it has changed it will still be received right after).

@haorenfsa
Author

@valerian-roche When initial_fetch_timeout is set to 0, there is no timeout. If CDS is updated without any further EDS changes, envoy hangs in the cluster-initializing state and the cache is not used.

@haorenfsa
Author

I know that setting initial_fetch_timeout to a short duration may solve this case, but I think that is a workaround. As engineers, we should try to solve the problem at the root.

@valerian-roche
Contributor

The case of initial_fetch_timeout set to 0 is unclear to me, but it should likely be addressed in envoy itself by directly using the cached value. Feel free to open an issue there.
In the general case I do not know why the envoy maintainers chose to wait for the end of the timeout before using the endpoints, but in the end the core issue lies in envoy's implementation of its xDS clients, which is being addressed by the maintainers progressively.


github-actions bot commented Jan 1, 2025

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale label Jan 1, 2025

github-actions bot commented Jan 8, 2025

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.

@github-actions github-actions bot closed this as not planned Jan 8, 2025