Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

S3AsyncClient#getObject could complete successfully with no file written to disk #5732

Open
1 task
aymkhalil opened this issue Nov 27, 2024 · 5 comments
Open
1 task
Assignees
Labels
bug This issue is a bug. p2 This is a standard priority issue

Comments

@aymkhalil
Copy link

Describe the bug

A call to client.getObject(request, destinationPath) could complete successfully with no file written to disk. We've seen this in production and was able to write a minimal repro by starting a download, and then manually deleting the file while its being downloaded (for reproduction purposes only, but think a race condition in the business logic) and we could see the future completes successfully.

Regression Issue

  • Select this option if this issue appears to be a regression.

Expected Behavior

If no file was written to disk, I'd expect the download future to complete exceptionally - either when the external event that deleted the file occurred, or at least when the download operation completes - at this point it should be able to verify if a file was written to disk or not.

Current Behavior

The future will complete successfully although no file is written to disk. An application would typically assume if the future completed without exceptions, then the download was successful and a file was written to disk.

Reproduction Steps

Here is a self contained repo that basically
1\ uploads a 1 GB to
2\ downloads the uploaded file to /tmp/download/1gb.bin
3\ 5 seconds after the download starts (at this point, the file is being written to disk) delete the file
4\ few seconds later, the future would complete without and exception

async-s3-get-object.zip

To run:

unzip async-s3-get-object.zip
cd async-s3-get-object.zip
mvn package
java -jar target/async-s3-get-object-1.0-SNAPSHOT.jar <ACCESS_KEY> <SECRET_KEY> <ENDPOINT> <REGION> <BUCKET> 

(for example: java -jar target/async-s3-get-object-1.0-SNAPSHOT.jar XXX XXX "https://s3.amazonaws.com" "us-west-1" "my-test-bucket"

You should see the follow log

File with 1GB contentLength upload successfully: aym-000-test/1gb.bin
Starting first download attempt: aym-000-test/1gb.bin to /tmp/download/1gb.bin
enforcing a delete after 5 seconds from starting the download
Future completed with no errors, but file not found at: /tmp/download/1gb.bin

Possible Solution

No response

Additional Information/Context

No response

AWS Java SDK version used

2.29.21

JDK version used

java version "17.0.3.1" 2022-04-22 LTS

Operating System and version

macOS 14.0

@aymkhalil aymkhalil added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Nov 27, 2024
@debora-ito
Copy link
Member

and then manually deleting the file while its being downloaded

Just to clarify: when you say "manually deleting the file" you mean deleting the file from the destination path?

@aymkhalil
Copy link
Author

Just to clarify: when you say "manually deleting the file" you mean deleting the file from the destination path?

That is correct.

@debora-ito debora-ito added needs-review This issue or PR needs review from the team. and removed needs-triage This issue or PR still needs to be triaged. labels Dec 3, 2024
@debora-ito debora-ito self-assigned this Dec 3, 2024
@debora-ito debora-ito added the p2 This is a standard priority issue label Dec 3, 2024
@debora-ito
Copy link
Member

@aymkhalil we are going to look into why no exception was thrown in this case.

Do you have control over the separate process that is deleting the file? Can you tell us more about why the delete is happening?

@debora-ito debora-ito removed the needs-review This issue or PR needs review from the team. label Dec 16, 2024
@debora-ito debora-ito removed their assignment Dec 16, 2024
@aymkhalil
Copy link
Author

aymkhalil commented Dec 16, 2024

Do you have control over the separate process that is deleting the file? Can you tell us more about why the delete is happening?

We do, that was a race condition that is now fixed: as part of our download failure recovery, we perform a file delete ourselves. In one scenario, we identified a race condition that would trigger the delete from a previous failed download attempt while the newer attempt is progressing.

But this raised question about the reliability of success/failure handling we have today: we wanted guarantees that when a download future completes, either a file is on disk, or the future completes exceptionally so we can "reliably" access the file or recover from failure (assuming all bugs are fixed, like the race condition mentioned above).

Having said that, I noticed the same behavior (that is, successfully completed future, but no file on disk) reproduced differently:
If we cancel the download future and immediately start a new download. I didn't want to use this one as a repro example originally as it is more nuanced because when we cancel the download future, there is no guarantee that cancellation on the sdk has completed, but we still found the final outcome not desirable.

On that last note, I'm curious what would you recommend to reliably cancel a long running download and retry it (in hope that cancel + retry would finish faster that an unexpectedly long running download), I could think of two ways:

  • Delay the new download only after getObject.whenComplete is invoked (we have a chain of futures on top of the sdk one so the final solution would look cumbersome but in principle it is doable)
  • Use a combo of FailureBehavior.LEAVE & FileWriteOption.CREATE_NEW and handle the failed download cleanup ourselves - I tried this one and it works fine although not sure if this completely avoids race conditions between the old and new downloads.

Sorry to piggyback on this issue, but although the two scenarios are different, the undesired final state looks exactly the same so they are "related" in that sense.

@aymkhalil
Copy link
Author

@debora-ito let me know if I should open a new issue regarding:

what would you recommend to reliably cancel a long running download and retry it

@debora-ito debora-ito self-assigned this Jan 8, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug This issue is a bug. p2 This is a standard priority issue
Projects
None yet
Development

No branches or pull requests

2 participants