You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I think fixing this might break zone expungement for these zones. Zone expungement removes both the dataset and the zone (in one blueprint step). During execution, this would attempt to delete the dataset before attempting to delete the zone. I assume that would fail, since the dataset is still in use. And that currently causes the rest of execution to fail (#6999).
The idea that's been floated of combining PUT /datasets with PUT /omicron-zones could address this.
Dataset {addition,removal} happens before zone {addition,removal}.
For zone addition: This works fine, it effectively requests the dependencies of a zone (all datasets) before requesting the zone.
For zone removal: This is problematic, as it would request the removal of a dataset before removing the removal of the zone. This works today only because #6177 is not implemented, and the "removed" dataset is ignored, instead of being deleted.
The following statements about our system are currently true:
Dataset deployment (PUT /datasets) and Zone deployment (PUT /omicron-zones) are distinct endpoints that accept the set of all {datasets,zones} on a sled.
The call to PUT /omicron-zones will only succeed when creating zones if their datasets already exist (have been created by PUT /datasets).
During blueprint execution, datasets are deployed before zones.
With this current API structure, there is no ordering that the blueprint executor can use to make this correct: if it's "datasets before zones", then "zone add" works, but "zone removal" breaks because the dataset would be removed too early. If it's "zones before datasets", then "zone add" breaks because the dataset wouldn't exist when the zone is requested.
Ways out?
As proposed on the original comment, PUT /datasets and PUT /omicron-zones could be merged. This would let Nexus supply datasets and zones simultaneously. The Sled Agent would then be responsible for handling the addition/removal requests in a sensible order (maybe remove old zones -> remove unused datasets -> add new datasets -> add new zones?)
Alternatively, the sled agent could make the PUT /datasets endpoint a little smarter for the case where a dataset is still in-use.
Option 1: The sled agent could avoid deleting the dataset until it is no longer used - basically treating the usage by a zone as a strong reference
Option 2: The sled agent could cause the deletion of a dataset to automatically trigger zone removal. Namely: if Zone Z uses dataset D, and dataset D is removed, then zone Z would automatically also be removed.
The text was updated successfully, but these errors were encountered:
As proposed on the original comment, PUT /datasets and PUT /omicron-zones could be merged. This would let Nexus supply datasets and zones simultaneously. The Sled Agent would then be responsible for handling the addition/removal requests in a sensible order (maybe remove old zones -> remove unused datasets -> add new datasets -> add new zones?)
I'm in favor of this one (and would go a step further and say we should merge in PUT /omicron-physical-disks while we're at it). I think @andrewjstone suggested it originally, although I can't find the source of that at the moment. It also seems in line with what I've advocated we do for blueprints (#7078): combining the multiple maps all keyed by SledUuid into a single map of SledUuid -> SledConfig.
I think the ordering here has to be blueprint first, then sled-agent API. If the blueprint has a SledConfig, it can nearly-trivially break that out into separate zones / datasets / disks PUTs. But the other way around doesn't work: if the blueprint has the three separate zones / datasets / disk configs, it can't combine them into a SledConfig because it wouldn't know what generation to use for that combined struct.
Originally posted by @davepacheco in #6177
Filing a new issue here, because this is distinct enough from #6177 to merit a separate discussion.
https://github.com/oxidecomputer/omicron/blob/main/nexus/reconfigurator/execution/src/lib.rs#L121-L133
Dataset {addition,removal} happens before zone {addition,removal}.
For zone addition: This works fine, it effectively requests the dependencies of a zone (all datasets) before requesting the zone.
For zone removal: This is problematic, as it would request the removal of a dataset before removing the removal of the zone. This works today only because #6177 is not implemented, and the "removed" dataset is ignored, instead of being deleted.
The following statements about our system are currently true:
PUT /datasets
) and Zone deployment (PUT /omicron-zones
) are distinct endpoints that accept the set of all {datasets,zones} on a sled.PUT /omicron-zones
will only succeed when creating zones if their datasets already exist (have been created byPUT /datasets
).With this current API structure, there is no ordering that the blueprint executor can use to make this correct: if it's "datasets before zones", then "zone add" works, but "zone removal" breaks because the dataset would be removed too early. If it's "zones before datasets", then "zone add" breaks because the dataset wouldn't exist when the zone is requested.
Ways out?
PUT /datasets
andPUT /omicron-zones
could be merged. This would let Nexus supply datasets and zones simultaneously. The Sled Agent would then be responsible for handling the addition/removal requests in a sensible order (maybe remove old zones -> remove unused datasets -> add new datasets -> add new zones?)PUT /datasets
endpoint a little smarter for the case where a dataset is still in-use.The text was updated successfully, but these errors were encountered: