
Max iters per instance + enqueue after shutdown #99

Draft
wants to merge 2 commits into main

Conversation

GabrielAlacchi

I'm opening this as a draft PR to get some feedback, since these changes may not be aligned with the mainline goals of this gem; if they aren't, I'll maintain this fork myself for my own purposes. I don't believe these changes are ready to be merged without unit test coverage, but I'd like to gauge interest before I write any.

Summary of the change

I wanted to achieve two goals:

  1. Don't enqueue the next instance of the job until the on_shutdown hook has completed.
  2. Allow setting a hard cap on the number of iterations that can run before interrupting the job. This is optional and is set by calling max_iters_per_run in the class body (see the sketch below).
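
To give a concrete picture of goal 2, here's a minimal sketch of how the proposed setting would be used. The job class, the model, and the cap of 500 are placeholders I've made up for illustration; the only new piece is the `max_iters_per_run` macro from this PR.

```ruby
require "active_job"
require "job-iteration"

class ExportRowsJob < ActiveJob::Base
  include JobIteration::Iteration

  # Proposed in this PR: interrupt the run after at most 500 iterations and
  # re-enqueue the job so it resumes from the last cursor position.
  max_iters_per_run 500

  def build_enumerator(_params, cursor:)
    enumerator_builder.active_record_on_records(Product.all, cursor: cursor)
  end

  def each_iteration(product, _params)
    # One unit of work per record.
  end
end
```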

Reasons Why I Made this Change

I need these changes for my use case: exporting data to a CSV file from a Heroku application. Since Heroku's file storage is ephemeral, my strategy is to buffer CSV lines in memory and flush the bytes to an Azure Blob Storage file, appending to it as the job progresses. At a high level my job does the following:

  1. Creates a cursor that iterates over the models and preloads the necessary data.
  2. Iterates and builds a row for each model in each_iteration.
  3. In the on_shutdown callback, flushes the buffered bytes to the storage blob with an API call (sketched below).
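
To make that shape concrete, here's a rough sketch of the job. The model, the preload, and the helpers `row_for` and `append_to_blob` are simplified stand-ins for my real code, not part of the gem.

```ruby
require "active_job"
require "job-iteration"
require "csv"
require "stringio"

class CsvExportJob < ActiveJob::Base
  include JobIteration::Iteration

  def build_enumerator(export_id, cursor:)
    @buffer = StringIO.new
    enumerator_builder.active_record_on_records(
      Model.includes(:row_data), # hypothetical model and preload
      cursor: cursor,
    )
  end

  def each_iteration(record, _export_id)
    # Build one CSV row per record and append it to the in-memory buffer.
    @buffer << CSV.generate_line(row_for(record))
  end

  def on_shutdown
    # Commit this run's unit of work by appending the buffered bytes to the
    # Azure blob. With this PR, the next run is enqueued only after this hook
    # has finished.
    append_to_blob(@buffer.string) unless @buffer.nil? || @buffer.string.empty?
  end
end
```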

My reasoning for flushing bytes in on_shutdown is that I'd like each run of the job to be a single unit of work; think of flushing the bytes as committing that unit of work to the file. If an error occurs in one of the iterations, I don't want the job's view of how many rows have been flushed to the file to fall out of sync with where the cursor is in the enumerator.

Therefore, if there's an error while generating rows for the CSV, or a network error while flushing the bytes, the job can safely restart at the same cursor value and regenerate and flush those same bytes again. This isn't a perfect guarantee of correctness, but I check for a content length mismatch in build_enumerator and raise an error if one is detected.
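
For reference, that guard looks roughly like this, expanding the build_enumerator above. `current_blob_size` and `bytes_flushed_so_far` are hypothetical helpers standing in for the Azure blob-properties call and my own bookkeeping.

```ruby
def build_enumerator(export_id, cursor:)
  # If the blob's actual size doesn't match the number of bytes this job
  # believes it has already committed, a flush was partially applied; abort
  # instead of appending duplicate or misaligned rows.
  if current_blob_size(export_id) != bytes_flushed_so_far(export_id)
    raise "Content length mismatch for export #{export_id}"
  end

  @buffer = StringIO.new
  enumerator_builder.active_record_on_records(Model.all, cursor: cursor)
end
```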
