Expose instance auto-restart status in the console #2469
Comments
Tagged this for v13 as this is not gaining the visibility it deserves.
We can probably slip this into the properties table state and perhaps the instance list table too if we figure out an elegant popover. I need to wrap my head around the state flow of that a little – thank you for your documentation on this! Regarding the policy itself – we should also be adding the ability to manage a. Instance create form (we can probably tuck into the advanced accordion)
Few initial questions @hawkw
Accidentally closed!
The control plane internally tracks why an instance is being started in the I'd definitely like to get that into the console (and CLI etc) eventually, but I'd file it under "future work" for now.
The cooldown period starts when an instance is automatically restarted. It does not reset if the instance is automatically restarted and then fails again: the intention behind the cooldown is primarily to reduce the impact on the rest of the system when an instance crashes every time it's restarted.

To avoid an instance restarting and then immediately crashing again in a hot loop, potentially impacting other instances, we restart the instance immediately the first time it crashes. Then, we start tracking the cooldown period once we restart the instance. If the instance fails again before the cooldown has elapsed, it will be restarted once the cooldown has elapsed. If it fails after the cooldown has elapsed, it will be restarted immediately. Each automatic restart starts a new cooldown period.

This way, we will not immediately restart the instance multiple times in short succession, but if it fails today and then fails again a few weeks later, it will be automatically restarted immediately both times.
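To make the timing concrete, here is a rough sketch of the rule as described in this comment. It is not the control plane's actual code: `earliestAutoRestart`, its parameters, and the one-hour cooldown are all made up for illustration, and it only computes when an instance becomes eligible for restart, ignoring the delay before the restart task next runs.

```typescript
// Hypothetical helper, NOT taken from Nexus or the console.
// All times are milliseconds since the Unix epoch.
function earliestAutoRestart(
  failedAt: number,
  lastAutoRestartAt: number | null,
  cooldownMs: number,
): number {
  // First failure: no previous auto-restart, so no cooldown applies.
  if (lastAutoRestartAt === null) return failedAt;
  // The cooldown runs from the previous auto-restart, not from the failure.
  const cooldownEnds = lastAutoRestartAt + cooldownMs;
  // Restart right away if the cooldown is over, otherwise wait for it.
  return Math.max(failedAt, cooldownEnds);
}

const HOUR = 60 * 60 * 1000;
const cooldown = 1 * HOUR; // assumed value, for illustration only

// First crash at t = 0: eligible immediately.
const first = earliestAutoRestart(0, null, cooldown);
// Crashes again 10 minutes after that restart: waits until the cooldown ends.
const second = earliestAutoRestart(10 * 60 * 1000, first, cooldown);
// Crashes again long after the cooldown elapsed: eligible immediately.
const third = earliestAutoRestart(second + 500 * HOUR, second, cooldown);
```

Under these assumptions, `first` equals the failure time, `second` equals the end of the first cooldown, and `third` again equals the failure time.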
It shouldn't take too long. I believe the task responsible for automatically restarting failed instances will run about once every minute, and there's some internal bookkeeping that must be done before an instance can be restarted in order to clean up any resources left behind by its past incarnation. So, there's some delay. I would generally expect a
I think this falls under the purview of the ongoing fault management work --- we'll definitely want to generate more detailed reports of why an instance has failed within FMA. At present, the control plane doesn't really know anything about why an instance has failed.
Overall, looks good to me. I think we might want to tweak some details of the wording to be closer to the API language to make it easier to match up to the API docs, for example using the word "cooldown" somewhere when we're in the waiting state. For state 3, when the instance is in the starting state after an auto-restart, I'm not sure we have a way to distinguish that from a regular start. So we may want to cut that one, i.e., starting is just starting.
PR oxidecomputer/omicron#6503 implemented automatic restarts of instances in the Failed state. This change introduced some additional instance state that should be exposed to users. In particular:

When a Failed instance is automatically restarted, a cooldown timer is started for that instance. If that instance fails again while the cooldown period is still active, it will not be automatically restarted again until the cooldown period has elapsed.

[...] Failed.

New fields were added to the external-API instance message to report state related to automatic restarts. Instances now have an auto_restart_enabled: boolean field that indicates if their auto-restart policy permits restarting the instance, and an auto_restart_cooldown_expiration: string representing the date and time at which the cooldown period will have completed (allowing the instance to be restarted again). See: https://github.com/oxidecomputer/omicron/blob/45813be40b62167eff75333c410515e8bee24211/openapi/nexus.json#L15094-L15104

This data should probably be exposed to users: if an instance is in the Failed state, the user will want to know why it has not yet been automatically restarted, whether it will ever be automatically restarted, and if it will, when that will happen. We probably only need to display this information for instances which are Failed. If a Failed instance has auto_restart_enabled set to false, we should tell the user that auto-restart is disabled for that instance. Otherwise, if there is an auto_restart_cooldown_expiration timestamp, we should tell the user that the instance will be restarted only after that time. If auto_restart_enabled is not false and there is no auto_restart_cooldown_expiration timestamp, then the instance will be automatically restarted --- we might want to indicate that as well.
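The three cases above could be sketched roughly as follows. This is only an illustration, not the console's implementation: the `InstanceAutoRestartInfo` type, the `run_state` field name, the lowercase `"failed"` comparison, and the message wording are all assumptions. Only `auto_restart_enabled` and `auto_restart_cooldown_expiration` come from the API fields named in this issue.

```typescript
// Only the fields discussed in this issue; a real instance has many more.
// `run_state` is an assumed name for the instance's state field.
interface InstanceAutoRestartInfo {
  run_state: string;
  auto_restart_enabled: boolean;
  auto_restart_cooldown_expiration?: string | null;
}

// Hypothetical helper: returns the text to show the user, or null when
// there is nothing to show (the instance is not Failed).
function autoRestartMessage(instance: InstanceAutoRestartInfo): string | null {
  // Only Failed instances need this information.
  if (instance.run_state.toLowerCase() !== "failed") return null;

  // Case 1: the auto-restart policy does not permit restarting.
  if (!instance.auto_restart_enabled) {
    return "Auto-restart is disabled for this instance.";
  }

  // Case 2: a cooldown timestamp is present, so the restart waits for it.
  const expiration = instance.auto_restart_cooldown_expiration;
  if (expiration) {
    return `This instance will be automatically restarted after ${expiration}.`;
  }

  // Case 3: enabled and no cooldown timestamp, so a restart is pending.
  return "This instance will be automatically restarted.";
}

const example = autoRestartMessage({
  run_state: "failed",
  auto_restart_enabled: true,
  auto_restart_cooldown_expiration: "2024-10-01T12:00:00Z",
});
```

The sketch follows the issue text literally and does not compare the timestamp against the current time. Whether an expiration in the past should be treated as "restart pending" is left open here.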