-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[reconfigurator-cli] Fix blueprint-edit failing if there are multiple expunged zones with the same external IP #7307
base: main
Are you sure you want to change the base?
Conversation
The last commit has the fix for #7305, and it's small enough that I appended it to this PR instead of opening a followup. The crux of it is this change: - for (_, zone) in
- parent_blueprint.all_omicron_zones(BlueprintZoneFilter::All)
+ for (_, zone) in parent_blueprint
+ .all_omicron_zones(BlueprintZoneFilter::ShouldBeRunning) when reconfigurator-cli is finding external networking resources to construct the omicron/nexus/db-queries/src/db/datastore/external_ip.rs Lines 363 to 366 in 0afbd6e
so I believe this is a reconfigurator-cli-only bug. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great! I just had a minor question about some changed test output.
crucible 4f2eb088-7d28-4c4e-a27c-746400ec65ba in service fd00:1122:3344:102::2f | ||
crucible 53dd7fa4-899e-49ed-9fc2-48222db3e20d in service fd00:1122:3344:102::2a | ||
crucible 7db307d4-a6ed-4c47-bddf-6759161bf64a in service fd00:1122:3344:102::2c | ||
crucible 95ad9a1d-4063-4874-974c-2fc92830be27 in service fd00:1122:3344:102::29 | ||
crucible bc095417-e2f0-4e95-b390-9cc3fc6e3c6d in service fd00:1122:3344:102::28 | ||
crucible d90401f1-fbc2-42cb-bf17-309ee0f922fe in service fd00:1122:3344:102::2b | ||
crucible e8f994c0-0a1b-40e6-8db1-40a8ca89e503 in service fd00:1122:3344:102::2d | ||
crucible e9bf481e-323e-466e-842f-8107078c7137 in service fd00:1122:3344:102::2e | ||
crucible f97aa057-6485-45d0-9cb4-4af5b0831d48 in service fd00:1122:3344:102::25 | ||
crucible_pantry eaec16c0-0d44-4847-b2d6-31a5151bae52 in service fd00:1122:3344:102::24 | ||
crucible_pantry f97aa057-6485-45d0-9cb4-4af5b0831d48 in service fd00:1122:3344:102::25 | ||
external_dns eaec16c0-0d44-4847-b2d6-31a5151bae52 in service fd00:1122:3344:102::24 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like these changes were just due to adding external DNS shifting the generated UUIDs around?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes (and this is strongly related to your other comment, will follow up there)
crucible 09937ebb-bb6a-495b-bc97-b58076b70a78 in service fd00:1122:3344:103::2b | ||
crucible a999e5fa-3edc-4dac-919a-d7b554cdae58 in service fd00:1122:3344:103::25 | ||
crucible b416f299-c23c-46c8-9820-be2b66ffea0a in service fd00:1122:3344:103::26 | ||
crucible b43ce109-90d6-46f9-9df0-8c68bfe6d4a0 in service fd00:1122:3344:103::2e | ||
crucible b5d5491d-b3aa-4727-8b55-f66e0581ea4f in service fd00:1122:3344:103::2a | ||
crucible cbe91cdc-cbb6-4760-aece-6ce08b67e85a in service fd00:1122:3344:103::2d | ||
crucible cd3bb540-e605-465f-8c62-177ac482d850 in service fd00:1122:3344:103::28 | ||
crucible f3628f0a-2301-4fc8-bcbf-961199771731 in service fd00:1122:3344:103::2c | ||
crucible f52aa245-7e1b-46c0-8a31-e09725f02caf in service fd00:1122:3344:103::29 | ||
crucible fae49024-6cec-444d-a6c4-83658ab015a4 in service fd00:1122:3344:103::27 | ||
crucible_pantry e8971ab3-fb7d-4ad8-aae3-7f2fe87c51f3 in service fd00:1122:3344:103::24 | ||
external_dns cc1dc86d-bd6f-4929-aa4a-9619012e9393 in service fd00:1122:3344:103::23 | ||
internal_dns 728db429-8621-4e1e-9915-282aadfa27d1 in service fd00:1122:3344:2::1 | ||
internal_ntp c8aa84a5-a802-46c9-adcd-d61e9c8393c9 in service fd00:1122:3344:103::21 | ||
nexus e7dd3e98-7fe7-4827-be7f-395ff9a5f542 in service fd00:1122:3344:103::22 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not really clear why this block and the next one changed. It looks like the IPs shifted around? Is that just because of the new external DNS zone too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's more that the UUIDs continued shifting around. The zone UUIDs are all coming out of a deterministic RNG, but that's for the entire blueprint. So the initial sled got a new zone (external DNS), which means it now grabbed one more zone UUID than it did previously. Going up, the "new" UUID it got went to one of the crucible zones (4f2eb088-7d28-4c4e-a27c-746400ec65ba
) - that UUID used to belong to this sled's internal NTP zone (the first one ExampleSystemBuilder
allocates for a sled). Following that, the UUIDs for every zone on this sled (and all subsequent sleds) shifted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, part of the goal of RNG trees was to minimize this shifting. Can it be restructured to achieve that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at the output, I think each sled should definitely get a separate RNG.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great call - after pulling in #7321, all three blocks have just a few lines shifted around: https://github.com/oxidecomputer/omicron/pull/7307/files#diff-18cea7c50c6bb09d57e37c8bf51e112a953dc2827d2fe4f300f9b6aac2f5b0fb
if builder | ||
.current_sled_zones(sled_id, BlueprintZoneFilter::All) | ||
.any(|z| z.id == zone_id) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this match all zones or some subset?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think All
is right - in this case the user passed in an explicit zone ID, and we're just trying to find it. This would allow someone to try to use reconfigurator-cli to expunge an already-expunged zones, but that seems fine, I think?
crucible 09937ebb-bb6a-495b-bc97-b58076b70a78 in service fd00:1122:3344:103::2b | ||
crucible a999e5fa-3edc-4dac-919a-d7b554cdae58 in service fd00:1122:3344:103::25 | ||
crucible b416f299-c23c-46c8-9820-be2b66ffea0a in service fd00:1122:3344:103::26 | ||
crucible b43ce109-90d6-46f9-9df0-8c68bfe6d4a0 in service fd00:1122:3344:103::2e | ||
crucible b5d5491d-b3aa-4727-8b55-f66e0581ea4f in service fd00:1122:3344:103::2a | ||
crucible cbe91cdc-cbb6-4760-aece-6ce08b67e85a in service fd00:1122:3344:103::2d | ||
crucible cd3bb540-e605-465f-8c62-177ac482d850 in service fd00:1122:3344:103::28 | ||
crucible f3628f0a-2301-4fc8-bcbf-961199771731 in service fd00:1122:3344:103::2c | ||
crucible f52aa245-7e1b-46c0-8a31-e09725f02caf in service fd00:1122:3344:103::29 | ||
crucible fae49024-6cec-444d-a6c4-83658ab015a4 in service fd00:1122:3344:103::27 | ||
crucible_pantry e8971ab3-fb7d-4ad8-aae3-7f2fe87c51f3 in service fd00:1122:3344:103::24 | ||
external_dns cc1dc86d-bd6f-4929-aa4a-9619012e9393 in service fd00:1122:3344:103::23 | ||
internal_dns 728db429-8621-4e1e-9915-282aadfa27d1 in service fd00:1122:3344:2::1 | ||
internal_ntp c8aa84a5-a802-46c9-adcd-d61e9c8393c9 in service fd00:1122:3344:103::21 | ||
nexus e7dd3e98-7fe7-4827-be7f-395ff9a5f542 in service fd00:1122:3344:103::22 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at the output, I think each sled should definitely get a separate RNG.
Suggested by @sunshowers (#7307 (comment)). This should prevent changes to one sled (e.g., adding a new zone) from perturbing IDs on all subsequent sleds in otherwise-stable tests.
* Dump them to stdout instead of stderr * Omit timestamps (which should fix the test failure on #7307, where our expectorate tests don't know how to redact slog timestamps)
My main goal with this PR was to fix the failing test on #7307 by emitting slog logs without timestamps (to avoid having to redact them in expectorate tests, and because they're not actually useful in a CLI). As a nice side effect, the output in a terminal is much nicer. On main, I see this when planning a blueprint: ![reconfig-main](https://github.com/user-attachments/assets/8a9ed6f0-418c-40ed-8488-45e44afc0067) I assume this is some bad interaction between slog-async / logs going to stderr / reedline. On this branch, I now get: ![reconfig-with-changes](https://github.com/user-attachments/assets/2495880a-4715-4d39-8a08-59f5edb7fb58)
Suggested by @sunshowers (#7307 (comment)). This should prevent changes to one sled (e.g., adding a new zone) from perturbing IDs on all subsequent sleds in otherwise-stable tests. (I took a stab at combining this with #7307 and confirmed the diff output is much more reasonable; will update #7307 after this lands.) The actual code changes here are small and mostly-mechanical; the changes in `rng.rs` are the driver (moving the sled-specific RNGs into a per-sled RNG container and dealing with handing those out keyed by sled ID), and everything else fell out from that. Unfortunately this changes _all_ zone/dataset/NIC/IP UUIDs in expectorate tests, so the diff volume is... quite large.
I tried to reproduce #7305 using just reconfigurator-cli's simulated example system. This required one change (adding external DNS zones, db90b4a), and while I was here I smoothed over a couple of minor ergonomic annoyances (275e24e is uncontroversial, I hope, and a147954 which is a little less clearly good - happy to back this one out if folks don't like it).
With these changes, I can get to a reproduction:
Fixes #7305