-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
some datasets in RSS blueprint erroneously include addresses #7299
Comments
The way I actually found this is that I:
Digging into the unexpected diffs revealed two things:
The combination of these caused the spurious-looking diffs in step 3 (and not step 5), since the addresses were changing. In more detail: there are three blueprints:
I expected to see no changes among any of the blueprints. Instead I saw this:
There are no changes at all between blueprints 2e24acb1-21c8-469a-ace5-bec12f89a539 and f55291a6-4265-4178-a269-7df22eb0a2b7. So I dug into the diff between 831679c9-26f8-4e3b-9873-e2522cfdc087 and 2e24acb1-21c8-469a-ace5-bec12f89a539. I misread the diff output and thought the asterisks were telling me that those datasets were new. So I picked one, '801a8141-9e83-4cc0-9428-fb1db210657d', and looked for it in both the old and new ledger. It was in both! In fact, the ledgered dataset configuration on sled
So I looked at the database rows for that dataset:
As expected, we've got three rows. They're mostly the same -- except for the IP/port. It's non-NULL for the initial blueprint and NULL for the other ones. At this point, I probably should have figured out what was going on with the diff output (it was saying these rows had changed, but wasn't printing the specific fields that had changed), but I didn't. I dug into the diff impl and found that
and that we're determining whether it's modified using an equality check: omicron/nexus/types/src/deployment/blueprint_diff.rs Lines 667 to 672 in c140817
which will include all the fields, including those not printed. At this point I think all the behavior is explained. |
I also looked briefly into the history to understand why this check looks the way it did. The check is unchanged since #6229. I thought maybe this code was correct in #6229? Maybe in that PR the only datasets in the blueprint were persistent zone datasets? It looks like the other datasets were added to the blueprint in #7000? Though it seems like you'd still hit this in #6229 if you had, say, a CockroachDB zone on the same pool as a Crucible zone -- the CockroachDB zone's persistent dataset would have the wrong value filled in for the IP/port fields. But based on this comment, this behavior wasn't observed in #7000. I'm not sure why that is, unless the behavior that causes the planner to fix this in the blueprint was only added more recently? |
Raw ledgers from sled g0: a4x2datasets - generations 1 and 2.zip |
) This should prevent #7299 from occurring on new invocations of RSS; I'll add a followup PR (with more testing) that adds the `CHECK` constraint mentioned in that issue and a migration that does cleanup of any bad dataset address rows. While I was in here, fixed #7115. It was smaller than I thought; should've done that a while ago.
After a fresh deployment of a4x2, I found that many datasets in the system's initial blueprint have non-NULL (and misleading) address and port fields in the database. (This is very far removed from the initial symptoms so I'll jump to the root cause here and put the consequences / debugging process into a separate comment.) I think the problem is here:
omicron/sled-agent/src/rack_setup/service.rs
Lines 1542 to 1569 in c140817
This code is taking the
DatasetsConfig
that was generated during RSS and converting it into aBlueprintDatasetsConfig
that will become the rack's initial blueprint. The blueprint struct has space for a socket address (IP addr and TCP port), which is only used for one kind of dataset: the persistent dataset of a Crucible zone. That's not inDatasetsConfig
. This code has to fill that in from the zone information. For each dataset inDatasetsConfig
, it does this by looking for any zone of type "Crucible" on the same pool. If it finds one, then it populates the newBlueprintDatasetConfig
for this dataset with the socket address (IP address and TCP port) of that Crucible zone. I think this is just wrong. As an example, my system has these datasets on this pool:That's one Crucible zone's persistent dataset, a debug dataset, and a couple of transient zone root filesystems. In the initial blueprint, all of these have the same IP address and port (the one from the Crucible zone):
I believe this is wrong because the IP/port fields are supposed to be NULL for datasets other than a Crucible zone's persistent dataset. It's also misleading because if you didn't know that, you might reasonably think that the value for the NTP zone's dataset there is the IP of the NTP zone (for example), but it's not.
The text was updated successfully, but these errors were encountered: