storage_controller: fix node flap detach race #10298

Merged: 3 commits, Jan 8, 2025

@VladLazar VladLazar commented Jan 7, 2025

Problem

Removing an entry from the observed state may race with the inline
observed state updates done from `Service::node_activate_reconcile`.

This was intended to work as follows:

  1. Detaches while the node is unavailable remove the entry from the
    observed state.
  2. `Service::node_activate_reconcile` diffs the locations returned
    by the pageserver with the observed state and detaches in-line
    when required.
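
For readers unfamiliar with the flow, the diff in step (2) can be sketched roughly as below. This is only an illustration of the description above, not the storage controller code: `ShardId` and `shards_to_detach_inline` are hypothetical names, and both the pageserver response and the observed state are reduced to plain sets of shard ids.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for a tenant shard identifier.
type ShardId = u32;

/// Sketch of step (2): any location the pageserver reports that has no
/// entry in the observed state is treated as stale and detached in-line.
/// Assumption: both inputs are reduced to sets of shard ids.
fn shards_to_detach_inline(
    reported_by_pageserver: &BTreeSet<ShardId>,
    observed: &BTreeSet<ShardId>,
) -> Vec<ShardId> {
    reported_by_pageserver
        .difference(observed) // reported, but not known to the controller
        .copied()
        .collect()
}
```

Under this model, step (1) matters because removing a shard's entry from `observed` is what makes the shard show up in the returned list; if that removal interleaves with the in-line updates, the diff is computed against a moving target.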

Summary of changes

This PR removes step (1) and lets background reconciliations
deal with the mismatch between the intent and observed state.
A follow-up will attempt to remove `Service::node_activate_reconcile` altogether.
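
As a rough sketch of what "deal with the mismatch" means here: with the observed entry left in place, a background pass can compare intent against observed state and derive the work to do. The names below (`NodeId`, `Action`, `plan_reconcile`) are hypothetical and the model is deliberately simplified to a single shard with at most one intended attachment; it is not the actual reconciler.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for a pageserver node identifier.
type NodeId = u64;

#[derive(Debug, PartialEq, Eq)]
enum Action {
    Attach(NodeId),
    Detach(NodeId),
}

/// Sketch of a background reconciliation for one shard.
/// `intent` is the node the shard should be attached to (None = nowhere);
/// `observed` is the set of nodes where a location is believed to exist.
fn plan_reconcile(intent: Option<NodeId>, observed: &BTreeSet<NodeId>) -> Vec<Action> {
    let mut actions = Vec::new();
    // Every observed location that the intent does not want is detached.
    for &node in observed {
        if Some(node) != intent {
            actions.push(Action::Detach(node));
        }
    }
    // An intended location that is not observed yet is attached.
    if let Some(node) = intent {
        if !observed.contains(&node) {
            actions.push(Action::Attach(node));
        }
    }
    actions
}
```

The point of the sketch is that a stale observed entry is harmless: it simply produces a `Detach` on the next background pass, so there is no need to delete it eagerly while the node is unavailable.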

Closes #10253

The same failpoint is used for a new test by a follow-up commit,
and that test needs a pausable failpoint.
@VladLazar VladLazar changed the title from "Vlad/issue 10253" to "storage_controller: fix node flap detach race" Jan 7, 2025

github-actions bot commented Jan 7, 2025

7264 tests run: 6906 passed, 0 failed, 358 skipped (full report)


Flaky tests (2)

Postgres 17

  • test_physical_replication_config_mismatch_max_locks_per_transaction: release-arm64

Postgres 15

Code coverage* (full report)

  • functions: 31.2% (8409 of 26962 functions)
  • lines: 48.0% (66772 of 139227 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results:
8e77e10 at 2025-01-07T18:03:08.044Z

@VladLazar VladLazar marked this pull request as ready for review January 7, 2025 17:33
@VladLazar VladLazar requested a review from a team as a code owner January 7, 2025 17:33
@VladLazar VladLazar requested review from arpad-m and jcsp and removed request for arpad-m January 7, 2025 17:33
@VladLazar VladLazar added this pull request to the merge queue Jan 8, 2025
Merged via the queue into main with commit dc28424 Jan 8, 2025
85 checks passed
@VladLazar VladLazar deleted the vlad/issue-10253 branch January 8, 2025 10:28
Successfully merging this pull request may close these issues.

storage controller: race between reconcilation and node_activate_reconcile on multi-node availability flap