Add Dataset::maybe_batch for faster database write #591

base: master

Conversation
@lnicola Sorry for the new pull request. This should be ready now.

Come to think of it, the API I'd really want looks like:

```rust
// this will keep track of how many written/updated features there were
let mut tx = ds.batching_transaction(BATCH_SIZE)?;
for row in df {
    write_feature(&tx, row)?;
}
tx.commit()?;
```

But I guess that's probably too opinionated.
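For context, one rough shape that API could take (purely hypothetical; `batching_transaction` and everything below does not exist in the crate):

```rust
use gdal::errors::Result;
use gdal::Dataset;

// Hypothetical guard behind `ds.batching_transaction(BATCH_SIZE)`: it counts
// writes, and commits + reopens a transaction every `batch_size` features,
// so the caller never manages commits mid-loop.
struct BatchingTransaction<'a> {
    ds: &'a mut Dataset,
    batch_size: usize,
    pending: usize, // writes since the last commit
    written: usize, // total features written/updated, as imagined above
}

impl<'a> BatchingTransaction<'a> {
    fn write_feature(&mut self /* , row: ... */) -> Result<()> {
        // ... write one feature through `self.ds` ...
        self.pending += 1;
        self.written += 1;
        if self.pending == self.batch_size {
            // ... commit the current transaction and start the next one ...
            self.pending = 0;
        }
        Ok(())
    }

    fn commit(self) -> Result<usize> {
        // ... commit whatever is still pending ...
        Ok(self.written)
    }
}
```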
@lnicola Let me put the response to #584 (comment) here: the main reason I prefer the closure-based API over the guard-object-based API is that the former makes it much clearer to handle error cases. With the guard object, you can easily end up in a situation where the transaction is neither committed nor rolled back.
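To illustrate that pitfall (a hypothetical sketch, not from the thread; `write_feature` is a stand-in, while `start_transaction`/`commit` are the crate's existing guard API):

```rust
use gdal::errors::Result;
use gdal::Dataset;

fn write_feature(_ds: &Dataset) -> Result<()> {
    // ... create and write one feature ...
    Ok(())
}

// With a guard object, an early `?` return drops `tx` with the transaction
// neither explicitly committed nor rolled back; a closure-based API instead
// funnels every outcome through one place that commits or rolls back.
fn guard_pitfall(ds: &mut Dataset) -> Result<()> {
    let tx = ds.start_transaction()?;
    write_feature(&tx)?; // an Err here skips `tx.commit()` below
    tx.commit()
}
```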
src/vector/transaction.rs
Outdated
```rust
let force = 0; // since this is for speed
let rv = unsafe { gdal_sys::GDALDatasetStartTransaction(self.c_dataset(), force) };
let res = func(self);
if rv == OGRErr::OGRERR_NONE {
    let rv = unsafe { gdal_sys::GDALDatasetCommitTransaction(self.c_dataset()) };
    if rv != OGRErr::OGRERR_NONE {
        return Err(GdalError::OgrError {
            err: rv,
            method_name: "GDALDatasetCommitTransaction",
        });
    }
}
```
I don't think the implementation is correct yet.
Something like this is required to handle at least basic errors.
Suggested change:

```rust
let force = 0; // since this is for speed
let rv = unsafe { gdal_sys::GDALDatasetStartTransaction(self.c_dataset(), force) };
let res = func(self);
if rv == OGRErr::OGRERR_NONE {
    // check the result of the user's write process:
    // if it's an error we want to roll back, otherwise commit
    match &res {
        Ok(_) => {
            let rv = unsafe { gdal_sys::GDALDatasetCommitTransaction(self.c_dataset()) };
            if rv != OGRErr::OGRERR_NONE {
                // if commit returns an error, at least try to roll back
                // to keep the dataset in a usable state
                let rv2 = unsafe { gdal_sys::GDALDatasetRollbackTransaction(self.c_dataset()) };
                if rv2 != OGRErr::OGRERR_NONE {
                    // if rollback also returns an error we
                    // want to return the rollback error
                    return Err(GdalError::OgrError {
                        err: rv2,
                        method_name: "GDALDatasetRollbackTransaction",
                    });
                } else {
                    // otherwise return the commit error
                    return Err(GdalError::OgrError {
                        err: rv,
                        method_name: "GDALDatasetCommitTransaction",
                    });
                }
            }
        }
        // if the user returned an error, roll back the changes
        Err(_) => {
            let rv = unsafe { gdal_sys::GDALDatasetRollbackTransaction(self.c_dataset()) };
            if rv != OGRErr::OGRERR_NONE {
                return Err(GdalError::OgrError {
                    err: rv,
                    method_name: "GDALDatasetRollbackTransaction",
                });
            }
        }
    }
}
res
```
There shouldn't be a rollback in this transaction. The reason I'm making this is to have a uniform API for drivers that support transactions and those that do not. So for GPKG it'll use transactions to speed up writing, but for json or shp it will not. The name `maybe_transaction` might be a bit misleading, but it means it may be using a transaction to speed things up.

And since we might get any sort of driver, changes should not be rolled back, as that can't be done in all of them. If there's a commit problem, it should error out, and the dev should consider exiting with a partially written database.
@weiznich I think there might have been a misunderstanding about the purpose of this function. I want to clarify that the added function is for speeding up the db writing when it can (using a transaction). The reason we are not forcing a transaction is that we don't want the rollback feature, for uniformity between databases where the transaction process succeeds and those where it does not; either way we should write to the database. So the algorithm is:

1. Try to start a transaction, ignoring the failure if transactions are unsupported.
2. Run the user's write function on the dataset.
3. Commit the transaction if one was started, regardless of the write result.
4. Return the result of the write function.

Step 3 is there so that there wouldn't be a dangling transaction even if the database write fails; since we don't want rollback, we don't need to check for error/success of the database write. And there is no rollback on step 3 or 4 because, if the database didn't support transactions at all, we couldn't do it anyway. So we can only commit, for uniformity. Commit failing is a critical error, so maybe we should add more documentation explaining how that will break the uniformity.

@lnicola Yeah, the batch size part seems like something more advanced than needed for now. It might be nice if the drivers not supporting transactions could also use the batch size somehow. But without that, I think the closure itself is sufficient... unless there is a limit (that people are likely to hit) on the number of uncommitted operations a transaction can hold.
On the subject of naming, I think I like something with "batch" in it. Regarding the implementation: …
I agree on the implementation points. About the name, do you think it's ok to name it `maybe_batch`?
Force-pushed from 1994f72 to b3e28f5.
Changed the title: "Dataset::maybe_transaction for faster database write" → "Dataset::maybe_batch for faster database write".
Same issue with the memory leak here. I'm away from keyboard for the next 2 hrs so I can't squash the commits. Please squash them if you merge.
Okay, it looks like we should roll back. Even the docs say that …
Hmm... We could roll back when a transaction commit fails, but that would break the uniformity. I think we might still have to send an error; maybe we need to make it respond with something like: …

That way the library user can know when a transaction was attempted and failed, vs. when it wasn't and failed.

EDIT: If we roll back, it has to be just before the …
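One possible reading of the elided suggestion (a hypothetical sketch; no such variants exist in `GdalError` today):

```rust
use gdal::errors::GdalError;

// Hypothetical error shape distinguishing "a transaction was attempted and
// its commit failed" from "the write itself failed", so the caller can tell
// the two partial-write situations apart.
pub enum BatchWriteError {
    /// The user closure failed; without a transaction the dataset may
    /// already hold partial writes.
    Write(GdalError),
    /// The closure succeeded, but committing the transaction failed.
    Commit(GdalError),
}
```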
Sorry if the previous comment was confusing. What I mean is, suppose we have something like this:

```rust
ds.maybe_batch(|d| {
    d.do_sth(....) // succeeds
    d.do_sth_else(...) // fails
})
```

Now, if we have a JSON dataset, it'll already have the changes from the first call but not the second, and we can't roll them back; not rolling back anywhere keeps both kinds of datasets uniform. And if starting a transaction fails in a database that supports it, we are fine: we'll just assume it doesn't support transactions and continue, so I don't think there is a need to propagate that error. But if we roll back when the function fails, then the two cases will leave different things in the database. That's why I was talking about not rolling back. When the closure runs successfully but the commit fails, we will once again have different results on the two sides; we can't help that, but wanting to distinguish it from the closure failing, I was talking about maybe returning …
I would expect it to roll back here. I see these as a "best-effort transaction": it gets committed at the end because it doesn't make sense otherwise, and if one operation fails, we try to roll back if possible, otherwise tough luck. Generally, you can't …

See above: even though I ran …
Moving this from #585:

I don't know if there is a case where GPKG will fail to start a transaction while you can still write to the dataset normally. But if there is, the current implementation will fall back to writing normally, whereas if we propagate that error, we'll error out when we don't need to. Again, I don't know if that can ever happen, so if you think it's not something we should worry about, I can change this function to propagate all the transaction-related errors as well as the error from the closure. We can use the …
Ooo, thank you for that example. I'll update the code to include a rollback in case the closure fails, and propagate that. Should I wait until #585 is merged so that I don't have to rebase again?
Really, if starting a transaction is supported but fails, that's a really bad sign; we should give up.

I guess it's not what you're saying, but we should check like this instead: …
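Judging by the implementation that follows, the elided snippet is presumably the unsupported-operation check, along these lines:

```rust
// If the driver reports that transactions are unsupported, run `func`
// directly instead of treating the failed start as an error:
if rv == OGRErr::OGRERR_UNSUPPORTED_OPERATION {
    return func(self);
}
```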
This will be what I'll commit once it's merged:

```rust
/// Optionally start a transaction before running `func`, for performance
///
/// This uses a transaction if the dataset supports it; otherwise it
/// runs `func` on the dataset as it is. If the closure results in an
/// error and a transaction is active, the transaction is rolled back.
pub fn maybe_batch(&mut self, func: impl Fn(&Dataset) -> Result<()>) -> Result<()> {
    let force = 0; // since this is for speed
    let rv = unsafe { gdal_sys::GDALDatasetStartTransaction(self.c_dataset(), force) };
    if rv == OGRErr::OGRERR_UNSUPPORTED_OPERATION {
        // driver has no transaction support: run the closure directly
        func(self)
    } else if rv == OGRErr::OGRERR_NONE {
        let res = func(self);
        if res.is_ok() {
            let rv = unsafe { gdal_sys::GDALDatasetCommitTransaction(self.c_dataset()) };
            if rv != OGRErr::OGRERR_NONE {
                Err(GdalError::OgrError {
                    err: rv,
                    method_name: "GDALDatasetCommitTransaction",
                })
            } else {
                res
            }
        } else {
            let rv = unsafe { gdal_sys::GDALDatasetRollbackTransaction(self.c_dataset()) };
            if rv != OGRErr::OGRERR_NONE {
                Err(GdalError::OgrError {
                    err: rv,
                    method_name: "GDALDatasetRollbackTransaction",
                })
            } else {
                res
            }
        }
    } else {
        // transactions supported, but starting one failed
        Err(GdalError::OgrError {
            err: rv,
            method_name: "GDALDatasetStartTransaction",
        })
    }
}
```
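For reference, a usage sketch of the function above (`Row` and `write_feature` are stand-ins, not part of the crate):

```rust
use gdal::errors::Result;
use gdal::Dataset;

struct Row; // placeholder for whatever the application iterates over

fn write_feature(_ds: &Dataset, _row: &Row) -> Result<()> {
    // ... create and write one feature ...
    Ok(())
}

// All rows go through one call: inside a transaction on drivers that
// support it (e.g. GPKG), as plain sequential writes otherwise (e.g. GeoJSON).
fn write_all(ds: &mut Dataset, rows: &[Row]) -> Result<()> {
    ds.maybe_batch(|d| {
        for row in rows {
            write_feature(d, row)?;
        }
        Ok(())
    })
}
```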
Force-pushed from 473a01d to b233c3f.
If …
I made 3 branches in the conditional: transaction not supported; supported and succeeded; and supported but failed. It propagates the closure's error in all cases except "supported but failed", because we don't call the closure at all in that case. See the …
src/vector/transaction.rs
Outdated
```rust
} else {
    let rv = unsafe { gdal_sys::GDALDatasetRollbackTransaction(self.c_dataset()) };
    if rv != OGRErr::OGRERR_NONE {
        Err(GdalError::OgrError {
```
Here `res` is `Err`, but we return the rollback error.
Should we print something to warn about the rollback then? I think it's important to show we couldn't roll back, especially if it means they can't operate on the dataset anymore.
Not sure. If you're using this API, I think you know you're eschewing most transactional guarantees, and I'm inclined to say that the first error is the most important one. If your disk is failing or something, that's when it happened.

I'm not sure if I mentioned this, but my use case for this feature is a batch process that writes a new dataset at the end. If it crashes, I'll just restart it. Rolling back after an error is nice, but I'm never going to look at the output.
Ok, ignored the rollback error. Left a comment there just in case.
Force-pushed from b233c3f to b27014d.
Thanks for your work on this. I'll be going now, but I'll try to finish #589 tomorrow, then look some more into this.
src/test_utils.rs
Outdated
```rust
    .tempfile()
    .unwrap()
    .into_parts();
file.write_all(&input_data).unwrap();
```
Couldn't we just use `std::fs::copy(path, path_of_the_tempfile)`? That would save us from copying the data first into memory and then back to disk. (Given that this is test code, this is likely irrelevant to fix on its own.)
I just copied the function `open_gpkg_for_update`, but I have modified it based on what you said.

Irrelevant here, but I was thinking of adding a `copy_file` option to `tempfile::Builder`, so that we can just provide a file and it'll make a temporary file that is a copy of it using `std::fs::copy`. That would save a lot of lines of code, but it's an external package; maybe a pull-request idea.
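A sketch of the helper as adjusted per the suggestion (the name `copy_fixture` and the `.gpkg` suffix are illustrative):

```rust
use std::fs;
use std::path::Path;
use tempfile::TempPath;

// Copy a checked-in fixture to a temporary path the test may mutate,
// using fs::copy instead of reading into memory and writing back.
fn copy_fixture(src: &Path) -> TempPath {
    let temp_path = tempfile::Builder::new()
        .suffix(".gpkg")
        .tempfile()
        .unwrap()
        .into_temp_path();
    fs::copy(src, &temp_path).unwrap();
    temp_path
}
```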
Rollback on drop is what I would expect, yes.
Indeed, that's a little unfortunate. I think it's similar to how files get closed on drop and you can't be sure if the cached data reached the disk. Even …

On the other hand, we're looking at an API that's "best-effort transactional". There's no guarantee you get a transaction (you can't even check), so there's no guarantee you can roll back anything. That's why I prefer a name like …
I don't follow. I know why dangling transactions are bad, and how keeping a transaction alive for a long time can lead to that. But a failed rollback is not something you'll see very often; it's usually a network error, which the database will detect and handle correctly. If the …

But I can see this as a good argument for introducing one or two dedicated error variants.

Which touches on another potential limitation of this API: the callback must return our error type, unless we choose to box it.
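On that last point, a boxed-error variant of the signature might look like this (hypothetical; shown as a free function, with the transaction logic elided):

```rust
use std::error::Error;
use gdal::Dataset;

// Hypothetical: accept any error type from the callback by boxing it.
// The tradeoff is that callers lose the concrete GdalError type.
pub fn maybe_batch_boxed<F>(ds: &mut Dataset, mut func: F) -> Result<(), Box<dyn Error>>
where
    F: FnMut(&mut Dataset) -> Result<(), Box<dyn Error>>,
{
    // ... same start/commit/rollback logic, with GDAL errors boxed too ...
    func(ds)
}
```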
Force-pushed from 59d2a7a to 1b0f0b6.
@lnicola First of all, I missed that this is a "best-effort transactional" API and not something that needs to guarantee it. That relaxes the requirements a bit.

That's mostly correct for synchronous Rust; for async Rust things become more complex… It's just that I've spent too much time on designing APIs for that… In the end, the relevant difference in this case is just that you can communicate the rollback error more easily.
Instead of variants, do you think returning … could work?
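If the elided idea was a nested result (one possible reading, given the "another data structure" remark below), the signature would be roughly:

```rust
use gdal::errors::GdalError;
use gdal::Dataset;

// Hypothetical: the outer Result reports transaction-machinery failures,
// the inner one the closure's own failure -- no new variants, but callers
// must unwrap two layers.
fn run_in_batch_nested(
    ds: &mut Dataset,
    mut func: impl FnMut(&mut Dataset) -> Result<(), GdalError>,
) -> Result<Result<(), GdalError>, GdalError> {
    // ... start/commit elided ...
    Ok(func(ds))
}
```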
The second commit is for the test utility, so I think it's fine not to squash it, right?
That just makes my head hurt, sorry 🙃.
No problem. I was just suggesting it because then we don't have to make another data structure, but yeah, the readability is not nice. :D
Sorry, I was away the past week. I guess we can postpone the separate rollback error variant, but how do you feel about renaming this to `run_in_batch` or something similar?
I renamed it to `run_in_batch`.
CHANGES.md
Outdated
```md
@@ -24,6 +24,7 @@
- Add `Defn::geometry_type` ([#562](https://github.com/georust/gdal/pull/562))
- Add `Defn::field_index` and `Feature::field_index` ([#581](https://github.com/georust/gdal/pull/581))
- Add `Dataset::has_capability` for dataset capability check ([#581](https://github.com/georust/gdal/pull/585))
- Add `Dataset::maybe_batch` which always succeeds to use transaction when possible for speed reasons -- rollbacks if operation fails under transaction ([#584](https://github.com/georust/gdal/pull/584))
```
Nit, suggested change:

```md
- Add `Dataset::maybe_batch` which runs a user function inside a transaction if the dataset supports it, as an optimization for I/O; if the function fails, the transaction may or may might be rolled back ([#584](https://github.com/georust/gdal/pull/584))
```
may or may not*
src/vector/transaction.rs
Outdated
```rust
/// This uses transaction if the dataset supports it, otherwise it
/// runs the `func` function as it is on the dataset. If the
/// closure results in error, and it is a transaction, it is
/// rolled back. If there is error in rollback it is ignored.
```
Suggested change:

```rust
/// rolled back. If the rollback fails, it is not reported.
///
/// <div class="warning">This function offers no ACID guarantees and
/// no way to tell if the rollback succeeded or not. Do not use it if you
/// care about your destination dataset.</div>
```
I didn't check if this renders fine; please try `cargo doc --open` before amending.
src/vector/transaction.rs
Outdated
```rust
    }
} else {
    let _ = unsafe { gdal_sys::GDALDatasetRollbackTransaction(self.c_dataset()) };
    // ignoring rollback error
```
Suggested change:

```rust
// ignore rollback error because it's not guaranteed to be
// supported, and the `func` failure is more important.
```
Force-pushed from e387814 to a77e308.
r? @jdroenner since this isn't a very clean API.

Ok. Renamed, and added the suggestions. I also redid the commits to make it easier to merge, just in case.
Not really sure what I can add to the discussion. I guess it is a good thing to make it easier to use transactions.
Use case: any user-facing application/API. Think of a simple application that takes a vector file, does something, and writes a vector file. We want to give the user the freedom to write to any vector file format, while using a transaction to speed up the writing process when the format supports it. That way the user can choose GeoJSON or similar if required, but can also choose GPKG and get the benefit of faster writes.
Imagine you're writing an …
I'm not really a fan of the design with the closure. In my opinion we could add a type like …
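A guard type along those lines might be shaped like this (hypothetical sketch; `MaybeTransaction` does not exist in the crate):

```rust
use gdal::errors::Result;
use gdal::Dataset;

// Hypothetical guard that is a transaction when the driver supports one
// and a no-op otherwise; Drop rolls back only a real, uncommitted transaction.
struct MaybeTransaction<'a> {
    ds: &'a mut Dataset,
    in_transaction: bool, // false for drivers without transaction support
}

impl<'a> MaybeTransaction<'a> {
    fn commit(mut self) -> Result<()> {
        // ... commit the underlying transaction if one was started ...
        self.in_transaction = false; // prevent the rollback in Drop
        Ok(())
    }
}

impl<'a> Drop for MaybeTransaction<'a> {
    fn drop(&mut self) {
        if self.in_transaction {
            // ... best-effort rollback; errors ignored in Drop ...
        }
    }
}
```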
It should. And it probably should be …

I'm not either, but it's not a bad design (certainly not as bad as …).
I think we have that today, but you still need to put your code in a closure (untested):

```rust
let write = move |ds: &mut Dataset| -> Result<()> {
    let layer = ds.layer(0)?;
    let feature = ...;
    feature.create(&layer)
};

match dataset.start_transaction() {
    Ok(mut tx) => {
        write(&mut tx)?;
        tx.commit()?;
    }
    Err(GdalError::OgrNotSupported) => write(&mut dataset)?,
    Err(e) => return Err(e),
}
```

which is what I proposed in #582 (comment). I don't remember if it actually works or not, but it's not very nice.
That pattern doesn't work because `dataset` is borrowed as mut on `start_transaction`. But yeah, we do have …
Yeah, I suspected this was the case. It works with …
Force-pushed from a77e308 to 39d79bf.
Made the closure `FnMut`.
@Atreyagaurav given the lack of consensus here, do you think #585 would be enough for you in the short/medium term? I'd very much prefer to have a helper like this one, but I'd also like to get a release out "soon". (Also, sorry for the delay.)
No problem. We can leave it unmerged for now and revisit if someone needs this feature, or when there is more interest in this PR later. I already have a solution for my use case with a similar function, and using #585 will make this easier to do with a conditional for others as well.
I added an entry to CHANGES.md if knowledge of this change could be valuable to users.

Supersedes #584
if knowledge of this change could be valuable to users.Superseeds #584