GCS/S3 Enumerator #50
Comments
Hey @kirs. I've implemented a similar use case with Azure Blob Storage. My goal was to avoid downloading the whole file (imagine we're processing a 1GB dataset or something along those lines). If each worker node that touches the long-running task during its many interrupt and resume cycles had to download the whole file, this would be a nightmare. The Blob Storage API allows loading a range of bytes from a file, which I imagine GCS and S3 also support. I wanted to stream the file as needed during the long-running task, so that every byte of the file only has to be downloaded roughly once. In practice it is a small percentage more: I buffer 256KB at a time, so a job is interrupted with an average of 128KB of unconsumed bytes left in the buffer, and those have to be downloaded again the next time. The necessary ingress still scales linearly, which is acceptable. So here was my approach: First of all I noticed that the
This doesn't work for my purposes and assumes a relatively small file size, so I wrote my own Now this enumerator can accept any underlying Let me know what you think!
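The idea described in this comment (ranged reads, a fixed-size buffer, and a byte offset as the resume cursor) could be sketched roughly as below. This is not the commenter's code, which did not survive in this thread. `InMemoryBlob`, `read_range`, and `ranged_line_enumerator` are hypothetical names invented for illustration, and the in-memory class only stands in for a real Azure, GCS, or S3 client that issues byte-range requests.

```ruby
# Hypothetical stand-in for a blob store client that supports ranged reads.
# A real implementation would issue a byte-range request instead.
class InMemoryBlob
  attr_reader :bytes_fetched

  def initialize(data)
    @data = data.b
    @bytes_fetched = 0
  end

  def size
    @data.bytesize
  end

  # Returns up to `length` bytes starting at byte `offset`.
  def read_range(offset, length)
    chunk = @data.byteslice(offset, length) || "".b
    @bytes_fetched += chunk.bytesize
    chunk
  end
end

# Yields [line, cursor] pairs. `cursor` is the byte offset just past the
# yielded line, so passing it back in resumes at the next line.
def ranged_line_enumerator(blob, cursor: 0, buffer_size: 256 * 1024)
  Enumerator.new do |yielder|
    line_start = cursor   # offset of the first byte not yet yielded
    fetch_offset = cursor # offset of the next range request
    buffer = "".b

    loop do
      # Emit every complete line currently held in the buffer.
      while (newline_at = buffer.index("\n"))
        line = buffer.slice!(0, newline_at + 1)
        line_start += line.bytesize
        yielder.yield([line.chomp.force_encoding(Encoding::UTF_8), line_start])
      end

      break if fetch_offset >= blob.size

      # Fetch only the next window of bytes, never the whole file.
      chunk = blob.read_range(fetch_offset, buffer_size)
      fetch_offset += chunk.bytesize
      buffer << chunk
    end

    # A final line without a trailing newline.
    unless buffer.empty?
      line_start += buffer.bytesize
      yielder.yield([buffer.force_encoding(Encoding::UTF_8), line_start])
    end
  end
end
```

On resume, whatever was sitting unconsumed in the buffer is fetched again, which is the small overhead the comment mentions. Because the cursor is a plain integer, it can be stored by whatever checkpointing mechanism the job uses.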
This is great! I agree it's a much better approach. We would love to see this as a PR if you have some extra time.
Hi @kirs!
I've heard some people use the CSV Enumerator combined with a file on GCS or S3. If that's a common pattern, we may consider exposing an enumerator that would support it out of the box.
cc @GustavoCaso @djmortonShopify @cardy31