[WIP] Wait to reenqueue job post-interrupt until after callbacks are finished #66

adrianna-chang-shopify · 2021-02-18T21:12:13Z

[WIP] Test suite on Core is failing, need to investigate

Problem

When Sidekiq is being used as the queue adapter, it is possible for job-iteration to reenqueue the job after an interruption prior to the callbacks for the original job completing. This can create race conditions if on_shutdown or {after|around}_perform callbacks touch data that the new job will also touch.

This can be demonstrated by keeping the new test "allows callbacks to finish before reenqueuing job after interrupt" but removing the rest of the changes. The callback order becomes [["before_enqueue"], ["before_enqueue"], ["on_shutdown"]], which is not what we want.

Solution

This can be solved by waiting until after all the callbacks on the original job are complete to reenqueue the iteration job. Rather than reenqueuing the job from within #iterate_with_enumerator, we can prepend an after_perform callback (which ensures the callback runs last) when JobIteration::Iteration is included and reenqueue the job there.

More Context

We're leveraging the job-iteration gem in the new Maintenance Tasks Framework, and we use a Run object to keep track of a maintenance task's progress under the hood. We encountered a race condition which lead to an invalid state and raised an exception. This happened because the on_shutdown callback:
https://github.com/Shopify/maintenance_tasks/blob/fd096a5d2bcbd8340370b950c927076296ece245/app/jobs/maintenance_tasks/task_job.rb#L88-L98
occurred after the new job had already gone through the before_perform hook:
https://github.com/Shopify/maintenance_tasks/blob/fd096a5d2bcbd8340370b950c927076296ece245/app/jobs/maintenance_tasks/task_job.rb#L64-L77
and started performing, which led the @run object to be in an invalid state.

Considerations

I don't think there are any unintended consequences to delaying the reenqueue - the only incompatibility would be if users were depending on reenqueue_iteration_job being called prior to their callbacks finishing and making use of one of the variables (ie. self.already_in_queue or @retried). I've double checked in Shopify/shopify and it doesn't seem that this is the case.

Suggestions for a better way to test the order of the callbacks are also welcome - matching on output sent to $STDOUT in the callbacks seemed like a reasonable and simple way to do it, rather than fiddling around with class variables or something similar.

Any other things I'm missing?

When Sidekiq is being used as the queue adapter, it is possible for job-iteration to reenqueue the job after an interruption prior to the callbacks for the original job completing. This can create race conditions if shutdown or after_perform callbacks touch data that the new job will also touch. This can be solved by waiting until after all the callbacks on the original job are complete to reenqueue the iteration job. Rather than reenqueuing the job from within #iterate_with_enumerator, we can prepend an after_perform callback (which ensures the callback runs last) when JobIteration::Iteration is included and reenqueue the job there.

adrianna-chang-shopify requested review from etiennebarrie and kirs February 18, 2021 21:12

adrianna-chang-shopify changed the title ~~Wait to reenqueue job post-interrupt until after callbacks are finished~~ [WIP] Wait to reenqueue job post-interrupt until after callbacks are finished Feb 22, 2021

adrianna-chang-shopify force-pushed the reenqueue-after-callbacks branch 2 times, most recently from 4897d42 to 5340d0f Compare February 23, 2021 20:51

adrianna-chang-shopify added 2 commits February 25, 2021 09:54

Try rescuing exceptions and reenqueuing if necessary before re-raising

bf83338

adrianna-chang-shopify force-pushed the reenqueue-after-callbacks branch from 5340d0f to bf83338 Compare February 25, 2021 14:54

adrianna-chang-shopify removed the request for review from kirs February 25, 2021 14:55

adrianna-chang-shopify marked this pull request as draft February 25, 2021 14:55

adrianna-chang-shopify mentioned this pull request Feb 25, 2021

Postpone reenqueuing the iteration job until after callbacks are done Shopify/maintenance_tasks#345

Merged

etiennebarrie mentioned this pull request Nov 22, 2021

Reenqueue the job after shutdown hooks have run #153

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[WIP] Wait to reenqueue job post-interrupt until after callbacks are finished #66

[WIP] Wait to reenqueue job post-interrupt until after callbacks are finished #66

adrianna-chang-shopify commented Feb 18, 2021 •

edited

Loading

[WIP] Wait to reenqueue job post-interrupt until after callbacks are finished #66

Are you sure you want to change the base?

[WIP] Wait to reenqueue job post-interrupt until after callbacks are finished #66

Conversation

adrianna-chang-shopify commented Feb 18, 2021 • edited Loading

Problem

Solution

More Context

Considerations

adrianna-chang-shopify commented Feb 18, 2021 •

edited

Loading