Checkpoint-restore with WebFlux and Undertow does not work when graceful shutdown is enabled #43655

Open
sdeleuze opened this issue Jan 3, 2025 · 3 comments
Labels
type: bug A general bug

@sdeleuze
Contributor

sdeleuze commented Jan 3, 2025

After updating https://github.com/spring-projects/spring-lifecycle-smoke-tests to run tests against Spring Boot 3.4.x, I noticed that framework:webflux-undertow:checkpointRestoreAppTest fails with Boot 3.4.x while it is still green with Boot 3.3.x, even though both use the same Undertow version. The failure shows the following error:

Error (criu/libnetlink.c:54): -95 reported by netlink: Operation not supported
Error (criu/net.c:3744): Unable to create a veth pair: -95

While discussing what could have caused that with @snicoll, he mentioned that Spring Boot 3.4.x enables graceful shutdown by default, so I tried server.shutdown=immediate and found that it fixes the test.

Could the Spring Boot team see if we could avoid this regression and keep WebFlux + Undertow CRaC support working out of the box? I suspect that when graceful shutdown is enabled, it has not finished when the JVM checkpoint is invoked, leaving the socket in a bad state, hence the error above.

@spring-projects-issues spring-projects-issues added the status: waiting-for-triage An issue we've not yet triaged label Jan 3, 2025
@wilkinsona
Member

This doesn't look like a regression to me as it also fails (although perhaps differently) with Boot 3.3.x when graceful shutdown is enabled:

> Task :framework:webflux-undertow:checkpointRestoreAppTest FAILED

WebfluxApplicationTests > stringResponseBody(WebTestClient) STANDARD_OUT
    09:43:18.371 [Test worker] ERROR org.springframework.test.web.reactive.server.ExchangeResult -- Request details for assertion failure:

    > GET http://localhost:38021
    > accept-encoding: [gzip]
    > user-agent: [ReactorNetty/1.1.25]
    > host: [localhost:38021]
    > accept: [*/*]
    > WebTestClient-Request-Id: [1]

    No content

    < 503 SERVICE_UNAVAILABLE Service Unavailable
    < Connection: [keep-alive]
    < Content-Length: [0]
    < Date: [Fri, 03 Jan 2025 09:43:18 GMT]

    0 bytes of content (unknown content-type).


WebfluxApplicationTests > stringResponseBody(WebTestClient) FAILED
    java.lang.AssertionError at WebfluxApplicationTests.java:18

WebfluxApplicationTests > resourceInStatic(WebTestClient) STANDARD_OUT
    09:43:18.401 [Test worker] ERROR org.springframework.test.web.reactive.server.ExchangeResult -- Request details for assertion failure:

    > GET http://localhost:38021/foo.html
    > accept-encoding: [gzip]
    > user-agent: [ReactorNetty/1.1.25]
    > host: [localhost:38021]
    > accept: [*/*]
    > WebTestClient-Request-Id: [2]

    No content

    < 503 SERVICE_UNAVAILABLE Service Unavailable
    < Connection: [keep-alive]
    < Content-Length: [0]
    < Date: [Fri, 03 Jan 2025 09:43:18 GMT]

    0 bytes of content (unknown content-type).

@wilkinsona wilkinsona changed the title Regression on WebFlux + Undertow with Project CRaC Checkpoint-restore with WebFlux and Undertow does not work when graceful shutdown is enabled Jan 3, 2025
@wilkinsona wilkinsona added this to the 3.3.x milestone Jan 3, 2025
@wilkinsona wilkinsona added type: bug A general bug and removed status: waiting-for-triage An issue we've not yet triaged labels Jan 3, 2025
@wilkinsona
Member

With Boot 3.4.1, I'm seeing the same behavior as Boot 3.3.x when graceful shutdown is enabled. The checkpoint works, the app starts successfully upon restore, and then rejects requests with a 503. This happens because Undertow's GracefulShutdownHandler is only single-use. Once it has been shut down (as happens when taking the checkpoint) the shutdown bit is set in its state field. The bit isn't cleared upon restore so the handler still believes that Undertow has been shut down. There's no API to clear it so we may have to resort to reflection if this is something that we want to support. Alternatively, it might be possible to ignore the handler somehow when taking a checkpoint so that it isn't shut down.
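To make the single-use behavior concrete, here is a minimal sketch in plain Java. It is a simplified, hypothetical model of what is described above, not Undertow's actual GracefulShutdownHandler code: the class names, the position of the shutdown bit, and the `handle()` return values are all assumptions made for illustration.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for a single-use graceful shutdown handler.
// It only models "a shutdown bit that can be set but never cleared".
final class SingleUseShutdownGate {

    // Assumed bit layout for this sketch: the top bit marks "shut down".
    private static final long SHUTDOWN_BIT = 1L << 63;

    private final AtomicLong state = new AtomicLong();

    // Sets the shutdown bit. There is deliberately no method that clears it.
    void shutdown() {
        state.getAndUpdate(s -> s | SHUTDOWN_BIT);
    }

    // Returns the HTTP status a request would get: 503 once shut down, 200 otherwise.
    int handle() {
        return (state.get() & SHUTDOWN_BIT) != 0 ? 503 : 200;
    }
}

final class CheckpointRestoreDemo {

    // Simulates: serve a request, take a checkpoint (which shuts the gate down),
    // restore (which resets nothing), then serve another request.
    static int[] run() {
        SingleUseShutdownGate gate = new SingleUseShutdownGate();
        int beforeCheckpoint = gate.handle();
        gate.shutdown(); // happens as part of taking the checkpoint
        // restore: the same object comes back with the bit still set
        int afterRestore = gate.handle();
        return new int[] {beforeCheckpoint, afterRestore};
    }
}
```

In this model the request served before the checkpoint succeeds and the one served after restore is rejected, which mirrors the 503 responses in the test output above. Fixing it would need either a way to clear the bit or a way to avoid setting it during the checkpoint.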

@sdeleuze
Contributor Author

sdeleuze commented Jan 6, 2025

For the automatic checkpoint/restore at startup use case where -Dspring.context.checkpoint=onRefresh is set, graceful shutdown is IMO not needed (for any webserver) since no request is expected to have been received, so if you can disable it (for Undertow or all servers) for that use case specifically, that would make sense. Spring Boot can leverage DefaultLifecycleProcessor#CHECKPOINT_PROPERTY_NAME and DefaultLifecycleProcessor#ON_REFRESH_VALUE.
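As a rough illustration of that idea, the decision could look like the sketch below. It is plain Java with no Spring dependency: the `ShutdownModeChooser` class and its methods are hypothetical, the property name and value are copied from the `-Dspring.context.checkpoint=onRefresh` flag mentioned above rather than read from the `DefaultLifecycleProcessor` constants, and the case-insensitive comparison is a choice made for this sketch only.

```java
// Hypothetical helper: picks the shutdown mode to use for the web server.
final class ShutdownModeChooser {

    // Literal values taken from the -D flag quoted in this comment.
    static final String CHECKPOINT_PROPERTY = "spring.context.checkpoint";
    static final String ON_REFRESH = "onRefresh";

    // Falls back to immediate shutdown when an automatic checkpoint at
    // startup is requested, otherwise keeps the configured mode.
    static String choose(String checkpointValue, String configuredMode) {
        if (ON_REFRESH.equalsIgnoreCase(checkpointValue)) {
            return "immediate";
        }
        return configuredMode;
    }

    // Convenience variant that reads the JVM system property only.
    static String chooseFromSystemProperties(String configuredMode) {
        return choose(System.getProperty(CHECKPOINT_PROPERTY), configuredMode);
    }
}
```

With this sketch, `ShutdownModeChooser.choose("onRefresh", "graceful")` yields `immediate`, while a missing property leaves the configured mode untouched, so on-demand checkpoints of a running application would keep graceful shutdown.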

For the on-demand checkpoint/restore of a running application, I think graceful shutdown makes more sense, so maybe I could create a related GracefulShutdownHandler feature request on the Undertow bug tracker, and for now we just document in https://github.com/spring-projects/spring-lifecycle-smoke-tests that people using Undertow + CRaC + on-demand checkpoint/restore should disable graceful shutdown?
