-
Notifications
You must be signed in to change notification settings - Fork 46
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enable per job class max_job_runtime
#240
Conversation
CI should be fixed by #241 |
CI appears to only fail on Rails 5.2. I'm not sure why, and given #241 drops support, I probably won't bother investigating. |
7098006
to
d5a9a3e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really like the idea! Should we namespace the setting on the job, to avoid conflicts? Something like self.job_iteration_max_runtime
?
This had crossed my mind, and I'm open to it. On the one hand, it makes sense to avoid conflicts. On the other hand, we don't namespace anything else (e.g. |
Hedwig already has its own |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really like this as a feature! ❤️ A couple small comments, and I'll leave it to the Job Platform experts to make the call on naming, but this looks good from my perspective!
7ad0c57
to
5e59a1c
Compare
class MyJob < ApplicationJob | ||
include JobIteration::Iteration | ||
|
||
self.job_iteration_max_job_runtime = 3.minutes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have renamed the class attribute 👍
5e59a1c
to
e503b59
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❤️
e503b59
to
642cd89
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Implementation LGTM.
If you can share a bit more context on why this feature is being added now, it could help answer some of the questions I left on the PR.
guides/best-practices.md
Outdated
# ... | ||
``` | ||
|
||
This setting will be inherited by any child classes, although it can be further overridden. Note that no class can **increase** the `max_job_runtime` it has inherited; it can only be **decreased**. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious on the reasoning behind this design. One counter-example I can think of:
"All jobs should run for a maximum of 30s per run; except this one job which could run for up to a minute." -> this is not supported by design, why?
The implementation would be simpler if the "only decrease" constraint isn't enforced at the lib level.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question. I considered this scenario, and you're right that enforcing this constraint adds complexity (needing to prepend
the module that verifies it).
The idea was that there should be friction when it comes to increasing the max_job_runtime
, because it ideally wouldn't be done. An owner of a large application may tune the max_job_runtime
, and another developer may think that it is safe for their job to run without limits, and end up impacting the stability of the job infrastructure if they're able to bypass the limit easily.
By requiring the developer to go through a change like this, it would be clearer that they're doing something exceptional:
-JobIteration.max_job_runtime = 5.minutes
class ApplicationIterativeJob < ActiveJob::Base
include JobIteration::Iteration
+ self.max_job_runtime = 5.minutes
end
+
+class ExceptionallyLongRunningIterativeJob < ActiveJob::Base
+ include JobIteration::Iteration
+ self.max_job_runtime = 10.minutes
+end
-class MyLongRunningJob < ApplicationIterativeJob
+class MyLongRunningJob < ExceptionallyLongRunningIterativeJob
end
To me, it feels like JobIteration.max_job_runtime
should be the maximum globally, and feels weird that a single job could violate that invariant. Allowing decreases only means that the invariant continues to hold.
That said, if we have legitimate use cases, then we could certainly change this behaviour:
- We could split the API into one method for decreasing the maximum, and another for increasing it, whose name makes it clear that it's exceptional. For example:
JobIteration.max_job_runtime = 5.minutes class LowPriorityIterativeJob < ApplicationIterativeJob decrease_job_iteration_max_job_runtime(to: 3.minutes) # forbids increasing end class LongRunningIterativeJob < ApplicationIterativeJob increase_job_iteration_max_job_runtime!(to: 10.minutes) # forbids decreasing end
- We could remove the constraint all together, and go with the out-of-the-box behaviour of
class_attribute
(overridable inheritance)
@@ -262,7 +289,7 @@ def output_interrupt_summary | |||
end | |||
|
|||
def job_should_exit? | |||
if ::JobIteration.max_job_runtime && start_time && (Time.now.utc - start_time) > ::JobIteration.max_job_runtime | |||
if job_iteration_max_job_runtime && start_time && (Time.now.utc - start_time) > job_iteration_max_job_runtime |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a user of the library, it would be helpful if there was an easy way to get notified (as in ActiveSupport::Notifications.instrument
) when a job is interrupted due to elapsing the max run time setting. For example, a user could track how many times this is happening per some application-level metric to tune the time out setting accordingly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do instrument that a job has been interrupted, although we do not note the reason why.
I think that could be something worth considering, but I would consider it outside the scope of this PR. I've opened #247 to capture the discussion.
I've wanted to push for adopting For example, some of our long jobs run on schedules that mean that they rarely end up interrupted by deploys (middle of the night). We had one such job that was exceptionally interrupted once, only to turn out to improperly deserialize its cursor and blow up every time it tried to resume (the motivation for #73 & #81). |
642cd89
to
a84977f
Compare
03c5306
to
8236f63
Compare
8236f63
to
a1ec6f7
Compare
This allows incremental adoption of the setting, without applying the setting globally. Alternatively, it allows applications to set a conservative global setting, and a more aggressive setting per jobs. In order to prevent rogue jobs from causing trouble, the per-job override can only be set to a value less than the inherited value.
a1ec6f7
to
64655bd
Compare
@@ -262,7 +289,7 @@ def output_interrupt_summary | |||
end | |||
|
|||
def job_should_exit? | |||
if ::JobIteration.max_job_runtime && start_time && (Time.now.utc - start_time) > ::JobIteration.max_job_runtime | |||
if job_iteration_max_job_runtime && start_time && (Time.now.utc - start_time) > job_iteration_max_job_runtime |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry to comment on such an old PR, please let me know if you'd rather I convert this to an issue, or move the conversation somewhere else.
This change actually creates a subtle behaviour change. Previously, we read the global configuration at job runtime. Now, we read the global configuration when JobIteration::Iteration
is loaded, subsequent changes are ignored.
Today, we noticed that maintenance tasks were never pausing because that gem were setting a default job runtime in an after_initialze
block that was run after this module was included.
I'm not actually sure which option is better. Just wanted to point out the difference in behaviours. In the meantime, I've proposed Shopify/maintenance_tasks#918 to fix maintenance tasks under the assumption that the current behaviour of job-iteration is correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oof, nice catch! I'd consider that a bug - changes to the global value should be picked up by the jobs as they run. I'm not entirely sure what the best fix is, especially in a way that enforces the "individual jobs must not increase this value".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can try to have a look on Monday, but I think the fix would be to override the method defined by class_attribute
to delegate to the top level default, as opposed to setting it to the default value at the time the method was defined.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#436 should fix this.
This allows incremental adoption of the setting, without applying the setting globally.
Alternatively, it allows applications to set a conservative global setting, and a more aggressive setting per jobs.
In order to prevent rogue jobs from causing trouble, the per-job override can only be set to a value less than the inherited value.