Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable per job class max_job_runtime #240

Merged
merged 2 commits into from
Jul 22, 2022
Merged

Conversation

sambostock
Copy link
Contributor

@sambostock sambostock commented Jun 28, 2022

This allows incremental adoption of the setting, without applying the setting globally.

JobIteration.max_job_runtime = nil

class ParentJob < ApplicationJob
  include JobIteration::Iteration
  self.max_job_runtime = 5.minutes
  # ...
end

Alternatively, it allows applications to set a conservative global setting, and a more aggressive setting per jobs.

class ChildJob < ParentJob
  self.max_job_runtime # 5 minutes, inherited
end
class CarefulChildJob < ParentJob
  self.max_job_runtime = 3.minutes # override
end

In order to prevent rogue jobs from causing trouble, the per-job override can only be set to a value less than the inherited value.

class RogueJob < ParentJob
  self.max_job_runtime = 10.minutes # 💥 ArgumentError
end
class UncleJob < ApplicationJob
  self.max_job_runtime = 10.minutes # ✅ No global default or inherited value, so okay
end

⚠️ Before Merging

  • Clean up commits

@sambostock
Copy link
Contributor Author

CI should be fixed by #241

@sambostock sambostock marked this pull request as ready for review June 29, 2022 15:02
@sambostock sambostock requested a review from Mangara June 29, 2022 15:02
@sambostock
Copy link
Contributor Author

CI appears to only fail on Rails 5.2. I'm not sure why, and given #241 drops support, I probably won't bother investigating.

@sambostock sambostock force-pushed the configurable-max-job-runtime branch 2 times, most recently from 7098006 to d5a9a3e Compare June 29, 2022 19:53
Copy link
Contributor

@Mangara Mangara left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like the idea! Should we namespace the setting on the job, to avoid conflicts? Something like self.job_iteration_max_runtime?

@sambostock
Copy link
Contributor Author

Should we namespace the setting on the job, to avoid conflicts?

This had crossed my mind, and I'm open to it. On the one hand, it makes sense to avoid conflicts. On the other hand, we don't namespace anything else (e.g. on_start, on_shutdown, & on_complete callback DSL methods). I guess it would be possible for Active Job, or some adapter to introduce a similar idea of a time limit... 🤔

@Mangara
Copy link
Contributor

Mangara commented Jun 29, 2022

Hedwig already has its own max_job_runtime with subtly different semantics 🙈 It's only a global config for now, so no collision yet, but just like this one, it may make sense to have a job-level override.

Copy link
Contributor

@adrianna-chang-shopify adrianna-chang-shopify left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like this as a feature! ❤️ A couple small comments, and I'll leave it to the Job Platform experts to make the call on naming, but this looks good from my perspective!

lib/job-iteration/iteration.rb Show resolved Hide resolved
test/unit/iteration_test.rb Show resolved Hide resolved
@sambostock sambostock force-pushed the configurable-max-job-runtime branch from 7ad0c57 to 5e59a1c Compare July 4, 2022 15:14
class MyJob < ApplicationJob
include JobIteration::Iteration

self.job_iteration_max_job_runtime = 3.minutes
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have renamed the class attribute 👍

@sambostock sambostock force-pushed the configurable-max-job-runtime branch from 5e59a1c to e503b59 Compare July 4, 2022 15:18
@sambostock sambostock mentioned this pull request Jul 4, 2022
2 tasks
Copy link
Contributor

@adrianna-chang-shopify adrianna-chang-shopify left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️

@sambostock sambostock force-pushed the configurable-max-job-runtime branch from e503b59 to 642cd89 Compare July 4, 2022 21:11
Copy link
Contributor

@pedropb pedropb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implementation LGTM.

If you can share a bit more context on why this feature is being added now, it could help answer some of the questions I left on the PR.

# ...
```

This setting will be inherited by any child classes, although it can be further overridden. Note that no class can **increase** the `max_job_runtime` it has inherited; it can only be **decreased**.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious on the reasoning behind this design. One counter-example I can think of:
"All jobs should run for a maximum of 30s per run; except this one job which could run for up to a minute." -> this is not supported by design, why?

The implementation would be simpler if the "only decrease" constraint isn't enforced at the lib level.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. I considered this scenario, and you're right that enforcing this constraint adds complexity (needing to prepend the module that verifies it).

The idea was that there should be friction when it comes to increasing the max_job_runtime, because it ideally wouldn't be done. An owner of a large application may tune the max_job_runtime, and another developer may think that it is safe for their job to run without limits, and end up impacting the stability of the job infrastructure if they're able to bypass the limit easily.

By requiring the developer to go through a change like this, it would be clearer that they're doing something exceptional:

-JobIteration.max_job_runtime = 5.minutes

 class ApplicationIterativeJob < ActiveJob::Base
   include JobIteration::Iteration
+  self.max_job_runtime = 5.minutes
 end
+
+class ExceptionallyLongRunningIterativeJob < ActiveJob::Base
+  include JobIteration::Iteration
+  self.max_job_runtime = 10.minutes
+end

-class MyLongRunningJob < ApplicationIterativeJob
+class MyLongRunningJob < ExceptionallyLongRunningIterativeJob
 end

To me, it feels like JobIteration.max_job_runtime should be the maximum globally, and feels weird that a single job could violate that invariant. Allowing decreases only means that the invariant continues to hold.


That said, if we have legitimate use cases, then we could certainly change this behaviour:

  • We could split the API into one method for decreasing the maximum, and another for increasing it, whose name makes it clear that it's exceptional. For example:
    JobIteration.max_job_runtime = 5.minutes
    
    class LowPriorityIterativeJob < ApplicationIterativeJob
      decrease_job_iteration_max_job_runtime(to: 3.minutes) # forbids increasing
    end
    
    class LongRunningIterativeJob < ApplicationIterativeJob
      increase_job_iteration_max_job_runtime!(to: 10.minutes) # forbids decreasing
    end
  • We could remove the constraint all together, and go with the out-of-the-box behaviour of class_attribute (overridable inheritance)

@@ -262,7 +289,7 @@ def output_interrupt_summary
end

def job_should_exit?
if ::JobIteration.max_job_runtime && start_time && (Time.now.utc - start_time) > ::JobIteration.max_job_runtime
if job_iteration_max_job_runtime && start_time && (Time.now.utc - start_time) > job_iteration_max_job_runtime
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a user of the library, it would be helpful if there was an easy way to get notified (as in ActiveSupport::Notifications.instrument) when a job is interrupted due to elapsing the max run time setting. For example, a user could track how many times this is happening per some application-level metric to tune the time out setting accordingly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do instrument that a job has been interrupted, although we do not note the reason why.

I think that could be something worth considering, but I would consider it outside the scope of this PR. I've opened #247 to capture the discussion.

@sambostock
Copy link
Contributor Author

why this feature is being added now

I've wanted to push for adopting max_job_runtime in a number of apps owned by my team for some time now, but we already have a number of existing jobs. Being able to adopt the setting gradually would make that less risky, so I figured I'd implement it.

For example, some of our long jobs run on schedules that mean that they rarely end up interrupted by deploys (middle of the night). We had one such job that was exceptionally interrupted once, only to turn out to improperly deserialize its cursor and blow up every time it tried to resume (the motivation for #73 & #81).

@sambostock sambostock force-pushed the configurable-max-job-runtime branch from 642cd89 to a84977f Compare July 5, 2022 17:22
guides/best-practices.md Outdated Show resolved Hide resolved
lib/job-iteration.rb Outdated Show resolved Hide resolved
@sambostock sambostock force-pushed the configurable-max-job-runtime branch from 03c5306 to 8236f63 Compare July 5, 2022 18:49
@sambostock sambostock force-pushed the configurable-max-job-runtime branch from 8236f63 to a1ec6f7 Compare July 19, 2022 14:40
This allows incremental adoption of the setting, without applying the setting
globally. Alternatively, it allows applications to set a conservative global
setting, and a more aggressive setting per jobs.

In order to prevent rogue jobs from causing trouble, the per-job override can
only be set to a value less than the inherited value.
@sambostock sambostock force-pushed the configurable-max-job-runtime branch from a1ec6f7 to 64655bd Compare July 19, 2022 14:43
@sambostock sambostock merged commit dd93717 into main Jul 22, 2022
@sambostock sambostock deleted the configurable-max-job-runtime branch July 22, 2022 13:07
@shopify-shipit shopify-shipit bot temporarily deployed to rubygems July 22, 2022 13:09 Inactive
@lavoiesl lavoiesl mentioned this pull request Aug 15, 2023
@@ -262,7 +289,7 @@ def output_interrupt_summary
end

def job_should_exit?
if ::JobIteration.max_job_runtime && start_time && (Time.now.utc - start_time) > ::JobIteration.max_job_runtime
if job_iteration_max_job_runtime && start_time && (Time.now.utc - start_time) > job_iteration_max_job_runtime

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry to comment on such an old PR, please let me know if you'd rather I convert this to an issue, or move the conversation somewhere else.

This change actually creates a subtle behaviour change. Previously, we read the global configuration at job runtime. Now, we read the global configuration when JobIteration::Iteration is loaded, subsequent changes are ignored.

Today, we noticed that maintenance tasks were never pausing because that gem were setting a default job runtime in an after_initialze block that was run after this module was included.

I'm not actually sure which option is better. Just wanted to point out the difference in behaviours. In the meantime, I've proposed Shopify/maintenance_tasks#918 to fix maintenance tasks under the assumption that the current behaviour of job-iteration is correct.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oof, nice catch! I'd consider that a bug - changes to the global value should be picked up by the jobs as they run. I'm not entirely sure what the best fix is, especially in a way that enforces the "individual jobs must not increase this value".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can try to have a look on Monday, but I think the fix would be to override the method defined by class_attribute to delegate to the top level default, as opposed to setting it to the default value at the time the method was defined.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#436 should fix this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants