Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

support_bundle_fail_expunged should not look at expunged datasets, zones #7319

Open
smklein opened this issue Jan 7, 2025 · 2 comments · May be fixed by #7325
Open

support_bundle_fail_expunged should not look at expunged datasets, zones #7319

smklein opened this issue Jan 7, 2025 · 2 comments · May be fixed by #7325
Labels
bug Something that isn't working.

Comments

@smklein
Copy link
Collaborator

smklein commented Jan 7, 2025

          Sorry for the drive-by comment - I don't have much context here, but I skimmed over `support_bundle_fail_expunged` and had a question. It looks like that's checking the current blueprint for _expunged_ zones/datasets specifically. How will that interact with dropping expunged entities from the blueprint altogether?

Originally posted by @jgallagher in #7063 (comment)

support_bundle_fail_expunged could be implemented without looking at explicitly expunged datasets and zones -- it could consider the set of "all zones and datasets that are in-service", and consider anything outside of that (whether expunged, or pruned, or whatever) as "gone".

Related: I should take a look at other checks for BlueprintZoneFilter::Expunged, and see if anyone else has fallen into this expunged vs pruned issue.

@smklein smklein added the bug Something that isn't working. label Jan 7, 2025
@smklein
Copy link
Collaborator Author

smklein commented Jan 8, 2025

FYI @davepacheco , I was looking for other users of BlueprintZoneFilter::Expunged, and came across this code:

// Identify any Nexus zones that have been expunged and need to have sagas
// re-assigned.
//
// TODO: Currently, we take any expunged Nexus instances and attempt to
// assign all their sagas to ourselves. Per RFD 289, we can only re-assign
// sagas between two instances of Nexus that are at the same version. Right
// now this can't happen so there's nothing to do here to ensure that
// constraint. However, once we support allowing the control plane to be
// online _during_ an upgrade, there may be multiple different Nexus
// instances running at the same time. At that point, we will need to make
// sure that we only ever try to assign ourselves sagas from other Nexus
// instances that we know are running the same version as ourselves.
let nexus_zone_ids: Vec<_> = blueprint
.all_omicron_zones(BlueprintZoneFilter::Expunged)
.filter_map(|(_, z)| {
z.zone_type
.is_nexus()
.then(|| nexus_db_model::SecId(z.id.into_untyped_uuid()))
})
.collect();

Is this problematic, for the same reasons @jgallagher mentioned? Namely, would the following be possible:

  • A Nexus zone with sagas is expunged
  • The expunged nexus zone is pruned (fully deleted from the blueprint) before the saga reassignment work executes successfully
  • The reassign_sagas_from_expunged code never sees this "expunged" Nexus zone anymore because it no longer exists in the blueprint

If this is a real issue, we could possibly resolve this similarly to how I resolved #7325 :

  • Collect the set of "in-service" Nexus zones from the blueprint
  • Find all sagas that are not part of the "in-service" set, and re-assign them

(This would treat "expunged" and "pruned" Nexus zones identically)

@davepacheco
Copy link
Collaborator

FYI @davepacheco , I was looking for other users of BlueprintZoneFilter::Expunged, and came across this code:
...
Is this problematic, for the same reasons @jgallagher mentioned? Namely, would the following be possible:

* A Nexus zone with sagas is expunged

* The expunged nexus zone is pruned (fully deleted from the blueprint) before the saga reassignment work executes successfully

* The `reassign_sagas_from_expunged` code never sees this "expunged" Nexus zone anymore because it no longer exists in the blueprint

I believe that when we talked about this in the context of #5136, the plan was that pruning an expunged Nexus zone from the blueprint would be gated on having verified that the zone was actually gone and we'd finished reassigning its sagas. I think the idea was to use inventory for the first part and the put the saga counts per-nexus into the planning input, like we do for networking and maybe disk resources? I'm not sure -- it wasn't fully fleshed out because we don't do pruning yet.

If this is a real issue, we could possibly resolve this similarly to how I resolved #7325 :

* Collect the set of "in-service" Nexus zones from the blueprint

* Find all sagas that are not part of the "in-service" set, and re-assign them

(This would treat "expunged" and "pruned" Nexus zones identically)

I think this would work provided that the query that attempts to re-assign sagas in this way is conditional on the target blueprint not having changed. I don't feel as confident about it and would want to think through that more but that's a plausible path, too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something that isn't working.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants