Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Wait to reenqueue job post-interrupt until after callbacks are finished #66

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

adrianna-chang-shopify
Copy link
Contributor

@adrianna-chang-shopify adrianna-chang-shopify commented Feb 18, 2021

[WIP] Test suite on Core is failing, need to investigate

Problem

When Sidekiq is being used as the queue adapter, it is possible for job-iteration to reenqueue the job after an interruption prior to the callbacks for the original job completing. This can create race conditions if on_shutdown or {after|around}_perform callbacks touch data that the new job will also touch.

This can be demonstrated by keeping the new test "allows callbacks to finish before reenqueuing job after interrupt" but removing the rest of the changes. The callback order becomes [["before_enqueue"], ["before_enqueue"], ["on_shutdown"]], which is not what we want.

Solution

This can be solved by waiting until after all the callbacks on the original job are complete to reenqueue the iteration job. Rather than reenqueuing the job from within #iterate_with_enumerator, we can prepend an after_perform callback (which ensures the callback runs last) when JobIteration::Iteration is included and reenqueue the job there.

More Context

We're leveraging the job-iteration gem in the new Maintenance Tasks Framework, and we use a Run object to keep track of a maintenance task's progress under the hood. We encountered a race condition which lead to an invalid state and raised an exception. This happened because the on_shutdown callback:
https://github.com/Shopify/maintenance_tasks/blob/fd096a5d2bcbd8340370b950c927076296ece245/app/jobs/maintenance_tasks/task_job.rb#L88-L98
occurred after the new job had already gone through the before_perform hook:
https://github.com/Shopify/maintenance_tasks/blob/fd096a5d2bcbd8340370b950c927076296ece245/app/jobs/maintenance_tasks/task_job.rb#L64-L77
and started performing, which led the @run object to be in an invalid state.

Considerations

I don't think there are any unintended consequences to delaying the reenqueue - the only incompatibility would be if users were depending on reenqueue_iteration_job being called prior to their callbacks finishing and making use of one of the variables (ie. self.already_in_queue or @retried). I've double checked in Shopify/shopify and it doesn't seem that this is the case.

Suggestions for a better way to test the order of the callbacks are also welcome - matching on output sent to $STDOUT in the callbacks seemed like a reasonable and simple way to do it, rather than fiddling around with class variables or something similar.

Any other things I'm missing?

@adrianna-chang-shopify adrianna-chang-shopify changed the title Wait to reenqueue job post-interrupt until after callbacks are finished [WIP] Wait to reenqueue job post-interrupt until after callbacks are finished Feb 22, 2021
@adrianna-chang-shopify adrianna-chang-shopify force-pushed the reenqueue-after-callbacks branch 2 times, most recently from 4897d42 to 5340d0f Compare February 23, 2021 20:51
When Sidekiq is being used as the queue adapter, it is possible for job-iteration to
reenqueue the job after an interruption prior to the callbacks for the original job
completing. This can create race conditions if shutdown or after_perform callbacks
touch data that the new job will also touch.

This can be solved by waiting until after all the callbacks on the original job
are complete to reenqueue the iteration job. Rather than reenqueuing the job
from within #iterate_with_enumerator, we can prepend an after_perform callback (which
ensures the callback runs last) when JobIteration::Iteration is included and reenqueue the job there.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant