-
Notifications
You must be signed in to change notification settings - Fork 79
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Document ReconcileResult type (fixes #1419) #1463
base: main
Are you sure you want to change the base?
Conversation
pkg/reconciliation/generic.go
Outdated
@@ -53,5 +54,5 @@ func ReconcileObject[U any, T Reconcileable[U]](ctx context.Context, kClient cli | |||
} | |||
return result.RequeueSoon(requeueDelay) | |||
} | |||
return result.Done() | |||
return result.Continue() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know this might be controversial, but given my current understanding of the API that's the right thing to do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Issue: I don't think it has to be controversial, but I'd actually like to see all of this in a design doc so that I can understand what you're trying to achieve with these changes. To me, they have a very wide blast radius, and there are some questions and themes in here which suggest that we probably need to align on certain topics (e.g. how caches work, how optimistic locking works, how the API server is asynchronous) before moving forward.
One of the main changes I think we need here is more descriptive naming for these implementations; and therefore alignment on what each of them is intended to signify. Once we have that it should be easy to figure out what the answer should be to this return type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As said in the previous comment, if we all agree that writes are synchronous, there's no need to force a requeue at this level. Creating a configmap for example won't require a requeue.
@@ -52,7 +52,7 @@ func RefreshSecrets(dc *cassdcapi.CassandraDatacenter, ctx context.Context, clie | |||
case recRes.IsRequeue(): | |||
requeues++ | |||
continue | |||
case recRes.IsDone(): | |||
case !recRes.Completed(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed this to keep the exact same behavior, but the overall function has issues: the block if requeues > 0 { ... }
at line 58 is never reached (either with this version or the original ones), so retries are not behaving as expected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you're right, we should drop the continue statements and move the check of requeues outside the loop for this to make sense.
@@ -18,7 +18,8 @@ type Reconcileable[T any] interface { | |||
*T | |||
} | |||
|
|||
// Try with U, a type of any whose POINTER still fulfils Reoncilable... | |||
// ReconcileObject ensures that desiredObject exists in the given state, either by creating it, or updating it if it | |||
// already exists. | |||
func ReconcileObject[U any, T Reconcileable[U]](ctx context.Context, kClient client.Client, requeueDelay time.Duration, desiredObject U) result.ReconcileResult { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[thought] This is off-topic, but I don't understand why this function schedules a requeue after a successful creation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because that's the pattern followed throughout the codebase. Object creation is asynchronous, acceptance by the API server does not necessarily mean that something won't go wrong downstream of that. As a result, we check again for the success of the creation operation before continuing, which indeed does mean a requeue.
There is also stuff to do with optimistic locking and the way the caches work on the controller which make this necessary - try removing it if you're curious as to what breaks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Object creation is asynchronous, acceptance by the API server does not necessarily mean that something won't go wrong downstream of that.
Is that documented anywhere? The client
package docs do not go into many details.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's just how asynchronous HTTP APIs work, they return a status code but sometimes object creation might fail out elsewhere.
But I think the more critical thing here is that (from memory) due to some of the caching behaviour of these clients, the object will not appear if you try to go do a Get in some circumstances. I encourage you to try to remove the requeues and see if things break - I believe they do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know, this example in the controller-runtime codebase doesn't requeue. They reference pod.CreationTimestamp
, a field that is set by the server during the creation operation, immediately after a successful Create()
call.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By "extra" requeues I meant the one after a successful Create, and the one after a successful Update. I pushed the change so you can check the diff.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That means that (while I'm wrong on the consistency level of writes), I'm right on reads, and some of our logic will probably fail due to failure to refresh the cache.
This depends on what is read after the object was written. If it's something that the Kubernetes API returns, then it's consistent right away. For example, Olivier's example of using creationTimestamp, probably resourceVersion and uid etc are directly accessible in the object you just posted.
However, for many use cases, what happens is that there's other controllers doing some extra processing and then modifying the object on the Kubernetes side (as a lot of Kubernetes processing itself is async with controllers). Can't access or Update .Status for example for some objects, like Pods. Or in our runtimes, if k8ssandra-operator creates CassandraDatacenter, it probably can't Update the Status field of CassandraDatacenter even instantly after, because cass-operator has modified it and the resourceVersions no longer match.
The other thing that should be understood is that Read after Write is cached, but of course Write itself updates the object that was sent. So doing:
obj = Create(obj)
a := obj.ObjectMeta.uid
// and if nothing else is touching our object, we can patch it afterwards also.
Is valid. However, doing:
obj = Create(obj)
obj = Get(obj)
a := obj.ObjectMeta.uid
Is not valid. The Get() call is coming from cache and would probably not return an object at all. Using Reader interface from manager will allow doing this as it isn't cached, but we don't really use that interface a lot (I think only ClientConfigs does that).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What we're saying here is that a requeue may or may not be needed, depending on the case.
If the requeue itself is harmless, just adding a little delay, but is protecting the code from potential bugs we might as well keep it, don't we?
@olim7t, for the sake of moving this PR forward and keeping its scope to (mostly) the original one, I'd suggest to rollback this change and have a separate discussion for it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What we're saying here is that a requeue may or may not be needed, depending on the case.
This thread has turned into a generic conversation about requeues, but we are in fact looking at a very specific case. I've only made two changes related to requeues in this PR:
- ReconcileObject tries to create the object and succeeds
- ReconcileObject tries to update the object and succeeds
(it's all in Don't requeue successful writes in ReconcileObject if anyone wants to look at the diff)
If we accept the assertion that writes are synchronous then there's absolutely no need to requeue. ReconcileObject wanted to apply changes, those changes are applied, we're done.
Other controllers and cached reads are valid concerns that we encounter in other places, but not here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Writes being synchronous, there doesn't seem to be a valid reason to force a requeue for each object creation.
That logic can be offloaded to the specific code (as opposed to this generic reconciler) that will deal with what happens after the creation/update of the object.
Let's move forward and not requeue as part of the generic reconciler. @Miles-Garnsey , sounds good?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've only offered a very preliminary review here. There are a few overarching considerations I'd like to bring to your attention:
- Issue: whatever you decide
Completed()
,Done()
, and the rest mean needs to align with the usage in ReconcileObject(). Based on that, I have concerns around the blast radius of this change. Have you done an analysis of howReconcileObject
is called throughout the codebase to ensure that changing it's return type in the case of success doesn't break any of the calls? As the core of the reconciliation process we need to treat this delicately. - Issue: across the board, the issue here is that we have never been explicit about WHAT particular reconciliation is
Done()
Completed()
or needs requeue or is in an error state. We should be more explicit about this in the naming of these implementations to avoid problems in future.
// The idiomatic way to handle a ReconcileResult in an intermediary sub-function is: | ||
// | ||
// if recResult := callStep1(); recResult.Completed() { | ||
// // Global success, requeue or error: stop what we're doing and propagate up |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Issue: based on the logic you've described here it sounded like a requeue (which is not a success) will also be Completed()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, a requeue is Completed()
:
func (c callBackSoon) Completed() bool {
return true
}
But I don't see any issue here: if a step wants a requeue, we stop what we're doing and propagate up. The top-level Reconcile() will eventually return recResult.Output()
(which is ctrl.Result{Requeue: true, RequeueAfter: c.after}, nil
) to the controller runtime.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me put my concern in another way. Right now, we have a struct implementation of this interface instantiated with Done()
which is also Completed()
. Requeue()
and Error()
are also Completed()
. And yet, the English language meaning of the word Completed()
is exactly the same as the meaning of the word Done()
.
To me, having a getter function Completed()
for a field across three different struct implementations ( Done()
, Requeue()
and Error()
- some of which do not mean "Done" in the linguistic sense) is completely untenable.
If you want to continue having something which means "this reconciliation run has progressed as far as possible, and needs to jump back to the top level of the reconciliation to be handled appropriately", we need a different word to Completed()
. At that point, I think we can have a more productive conversation about what behaviours we are trying to model. But until we can express that intention in a function name, I would rather retain the explicit switch statements we have for Done()
, Requeue()
and `Error(), since this makes it explicit what behaviours we are handling.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is kind of why I think this needs a design doc too - because at the moment it isn't clear to me what behaviour or states we are trying to model, and we need to discuss and agree a target state rather than just say that we're "documenting" existing behaviour but then making quite impactful changes to the core of the reconciliation logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we need to discuss and agree a target state rather than just say that we're "documenting" existing behaviour
Yes, the goal of this PR is to have that discussion. I proposed an initial version, let's iterate on it with reviews until we reach a common understanding.
this needs a design doc too
Are the go-docs not self-sufficient? Not that I mind having a separate document, but I wouldn't know what else to put in there.
And yet, the English language meaning of the word
Completed()
is exactly the same as the meaning of the wordDone()
.
I've deliberately avoided going down the naming rabbithole so far... May I suggest that we keep that for a second step? I feel like it will get too messy if we rename as we go.
(but yes I agree with most of your comments)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't about documentation, it is about the development process. We do deep discussions in design docs, not on PRs. If we choose to do it differently here, I'm concerned that we will end up in the same situation that we ended up in when we started making changes to the name overrides in labels and downstream resources, where the blast radius of the change was not adequately considered and customers suffered issues as a result.
I strongly advocate for the creation of a design doc for this piece of work to avoid customer impacts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And yet, the English language meaning of the word Completed() is exactly the same as the meaning of the word Done().
What is the English language definition difference of while and while do then? Yet, semantics for those two are different and in every programming language. One has while executed after first execution, while one has while executed before execution.
This isn't English.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with Miles, and let's not forget many of the people participating in these projects are not necessarily english native speakers. This API brings a level of confusion that is high enough to require having that conversation. There's overlap between Completed()
(which should really be IsCompleted()
), Done()
and Continue()
among other things which require to take a break and think of what this API is solving and how it should do it.
The fact that it's been used successfully (which is debatable) for the past few years isn't a valid reason to push back on the discussion.
I also think it's fair to have a separate doc, in the form of a markdown file, contributed as a PR so that the community can see it, to describe our view on the reconcile process and how ReconcileResult
should be used.
"Design doc" might be an overstatement here and "doc" would probably be enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
which should really be IsCompleted()
Ugly Java POJO, but this isn't a Java project. We don't need to indicate it's a boolean when a function has return value of boolean. Same as with "Get" when fetching an object.
I also think it's fair to have a separate doc, in the form of a markdown file, contributed as a PR so that the community can see it
Why can't the users view what's defined in the godoc? It would be publicly available and generated already, like currently:
https://pkg.go.dev/github.com/k8ssandra/[email protected]/pkg/reconciliation
What Olivier wrote here would appear there. We can add examples, descriptions etc. Adding markdown doc somewhere assumes users know to search for it (and hope it's not outdated) instead of the "standard" place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In an effort to move forward again, let's just use these in code comments as docs on how to use this interface.
The comments will be available in the IDE and in the godocs, without requiring to look at an external doc.
We can add more information here if we feel there's not enough for anyone to build a good understanding of how to use it.
We can extract this into a doc if that makes sense to anyone, but let's not make that mandatory.
func Continue() ReconcileResult { | ||
return continueReconcile{} | ||
} | ||
|
||
// Done indicates that the entire reconciliation loop was successful. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Issue: from a design perspective, this seems tricky, since only the last step should ever call Done()
based on this description. No earlier step would ever know that the entire reconcilation was Done()
from what I can see?
And what is the effect of calling Done()
? What does the caller do that is special as a result of seeing that result type?
I think if we're getting into this topic, we should make Done()
more descriptive - perhaps it needs a rename to AllReconciliationDone()
or something similar to that, which indicates that all reconciliation steps to reconcile a given object (from the req
passed to the controller) is done.
We need to be explicit about WHAT is done - that's the issue we're currently struggling with. Naming these implementations correctly might be a good starting point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No earlier step would ever know that the entire reconcilation was Done() from what I can see?
This is exactly what it is, the reconciling process is done. It's not even really a necessary process didn't end in a pkg, but in the Reconcile() function. Just return reconcile.Result{}, nil.
We need to be explicit about WHAT is done - that's the issue we're currently struggling with. Naming these implementations correctly might be a good starting point.
Are we having issues? This API didn't even begin here and the original usage had no issues. I'm not convinced we need this ReconcileResult API at all, but then that's not what we're discussing here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we having issues? This API didn't even begin here and the original usage had no issues. I'm not convinced we need this ReconcileResult API at all, but then that's not what we're discussing here.
Maybe this should be something we're discussing. We are definitely having issues because there is no consistency in the usage of this API in the codebase and there are regular disagreements on what Done, Continue, and Completed mean, as there have been for several years.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are definitely having issues because there is no consistency in the usage of this API in the codebase and there are regular disagreements on what Done, Continue, and Completed mean, as there have been for several years.
No, there hasn't been (and even in this discussion, it doesn't seem like there's that much issue for rest of the team). This API was copied over from cass-operator and it had a clear definition there how things worked. This isn't new API that needs design documents (which are utterly pointless at this point of using this API), just couple of fixes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, there hasn't been
At the very least there's been some ambiguity, as evidenced by the changes we are doing in this PR to align different parts of the code.
it had a clear definition there how things worked
This might be your longer experience of cass-operator speaking, but personally I wouldn't call this API clear, it took way more effort to understand than it should.
Anyway, whether it is clear or not, I don't see any downside to documenting the intent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's focus on whether or not we agree with the comments that @olim7t added on each function.
@burmanm @Miles-Garnsey, do these comments look good?
Let's not change names for now and just build a common understanding.
pkg/reconciliation/generic.go
Outdated
@@ -53,5 +54,5 @@ func ReconcileObject[U any, T Reconcileable[U]](ctx context.Context, kClient cli | |||
} | |||
return result.RequeueSoon(requeueDelay) | |||
} | |||
return result.Done() | |||
return result.Continue() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Issue: I don't think it has to be controversial, but I'd actually like to see all of this in a design doc so that I can understand what you're trying to achieve with these changes. To me, they have a very wide blast radius, and there are some questions and themes in here which suggest that we probably need to align on certain topics (e.g. how caches work, how optimistic locking works, how the API server is asynchronous) before moving forward.
One of the main changes I think we need here is more descriptive naming for these implementations; and therefore alignment on what each of them is intended to signify. Once we have that it should be easy to figure out what the answer should be to this return type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Issue: this is potentially a big issue - when changing from Done()
to Continue()
in the ReconcileObject I think you're going to see bugs depending on the caching behaviour of the client.
These errors might manifest in two ways:
- In a caching client, where the return from this function is checked using
Completed()
, this will probably cause downstream failures, since (from memory) the cache refreshes only when the reconcile is restarting, and not while reconciliation is taking place. If you reference that object in downstream logic it may not exist (or be up to date). This is becauseContinue()
is notCompleted()
whileDone()
isCompleted()
. - In all clients (caching and non-caching) this may cause failures when a switch statement is used to check the return result and expects to see a
Done()
. Mostly the switch statements should haveIsRequeue()
,IsError()
withCompleted()
andDone()
then falling through to the default case, but have we checked every call site to ensure this is the case?
You have a lot of unit/integration test failures here, which might be due to either of these issues - I'm not 100% sure.
The switch statement in case recRes.IsRequeue():
return ctrl.Result{RequeueAfter: r.DefaultDelay}, nil I changed it to return return ctrl.Result{Requeue: true, RequeueAfter: c.after}, nil Setting However in if stargateConfigResult.Requeue {
return ctrl.Result{RequeueAfter: r.DefaultDelay}, nil
} Which is wrong, because it doesn't consider that if stargateConfigResult.Requeue || stargateConfigResult.RequeueAfter > 0 { If I fix that condition on -- EDIT: this is just a timeout issue, the test is not waiting long enough for the requeue |
@@ -842,7 +842,7 @@ func createSingleDatacenterCluster(t *testing.T, ctx context.Context, namespace | |||
require.NoError(err, "failed to patch K8ssandraCluster in namespace %s", namespace) | |||
checkStargateReady(t, f, ctx, stargateKey) | |||
checkStargateK8cStatusReady(t, f, ctx, kcKey, dcKey) | |||
checkContainerPresence(t, ctx, f, stargateDeploymentKey, k8ssandra, getPodTemplateSpec, stargate.VectorContainerName) | |||
checkContainerPresenceEventually(t, ctx, f, stargateDeploymentKey, k8ssandra, getPodTemplateSpec, stargate.VectorContainerName) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same issue as in unit tests: because the requeue is not ignored anymore, this is taking longer than before and subject to timing issues.
Quality Gate passedIssues Measures |
// it already exists. It returns the current state of the object on the server after the reconciliation. | ||
func ReconcileAndGetObject[U any, T Reconcileable[U]]( | ||
ctx context.Context, kClient client.Client, requeueDelay time.Duration, desiredObject U, | ||
) (result.ReconcileResult, *U) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I propose we add this variant, it's almost the same but returns the state of the object on the server after the reconciliation.
This is useful in two cases:
- the hash annotation matched so the object was not updated, but we're interested in a field that was updated independently by another controller. For example the IP of a LoadBalancer service in its status.
- we're interested in a field that is set by the write operation itself, for example
CreationTimestamp
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this solves some problems but might be overoptimising - it introduces more complexity than I'd like and relies on folks understanding the caching behaviour. Based on the conversation here I think that is tricky for even experienced users and I'd rather avoid callers having to make more decisions about those subtleties.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
relies on folks understanding the caching behaviour
How so? We're either returning the result of a read that observed matching hash annotations, or the result of a create or update (not subject to caching).
If we don't have this variant, we'll need to resort to ReconcileObject
followed by a separate Get
. If anything, it sounds like it could produce more caching issues (Get not seeing the result of a write in ReconcileObject).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we don't have this variant, we'll need to resort to ReconcileObject followed by a separate Get. If anything, it sounds like it could produce more caching issues (Get not seeing the result of a write in ReconcileObject).
In that behavior, we would need to reconcile and wait for the cache to have that object (and potentially reconcile again, fail etc). Otherwise, we need to use the Create/Update one, but never Create+Get or Update+Get combo.
So it makes sense to get the object back if user knows what to do with it. Actually, why would we even have any non-returning version? Create/Update always return the object also.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm with @burmanm on this. The Get call you're making here won't actually result in a cache update so this variant isn't going to solve the problem.
When I say that folks would need to understand the caching behaviour, what I mean is that they need to know that the object returned by this new variant is not necessarily up to date. They also need to reason about what fields are up to date, which is non-trivial in my view.
I am in favour of just always re-queuing so that we elide these complexities.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After chatting with @burmanm, it looks like all 3 of us (Olivier, Micke and myself) agree that a requeue shouldn't be done systematically. Let's then move forward with what @olim7t did here, and follow up with a refactoring that merges this new ReconcileAndGetObject()
function into just ReconcileObject()
, changing the signature to also return the object. That'll require a few changes in the code that calls this function to ignore the object when it's not needed.
If a requeue is needed, one should know and add it after the call to ReconcileObject()
.
If this causes bugs, they should surface in our tests, otherwise it means we're missing some tests.
Seems like we're mostly on the same page, so dismissing the review to get the PR mergeable.
What this PR does:
Which issue(s) this PR fixes:
Fixes #1419
Checklist