Safe dataset updates with package_revise #4618
Conversation
More concurrent-safe package_update usage examples:
Note that our current API has no way to update just the dataset metadata without passing a context parameter.
Tossing out some naming ideas: `package_sfu` action.
Others' suggestions: …
@boykoc I like …
@wardi good point. What about …? I'll add more if I come up with anything after letting it stew for a while.
One problem with these names is that they don't capture the idea that this is a different kind of update: it includes checking values first and can modify fields in a much more fine-grained way than the existing `package_update`.
Following the spirit of the "safe lock" update by @boykoc, how about … vs. …? Then it'd be easy to explain in the docs that one should use the "safe protocol" when updating stuff in out-of-band, potentially concurrent jobs. (I admit this is thinking from the documentation user's standpoint.)
@amercader Just wondering how things are going with reviewing this PR? |
@mcarans would you be able to help review the PR - that would probably help 😄
@rufuspollock, how could we help you with this? @mcarans made me aware of this; we're working together on the HDX project. Were you referring to a code review of the changes, or for us to test on a CKAN instance that has the new `package_revise` action?
@alexandru-m-g great to see you here 🙏 😄 - yes, it would be reviewing the code in this PR here. A bonus would be pulling this branch and trying it out, but that could be secondary 😄
I got around to looking through the code changes a bit, but then I decided it's easier to just take the code and get it running locally in order to get a clearer picture of the new stuff. So I'm taking it for a spin now and will come back with feedback.
This is a BIG improvement to CKAN and I realize that it was a significant amount of work, so just wanted to say that it really is appreciated. I've played with this new functionality a bit and it worked quite well. I just have a few questions that I'll detail below:
@alexandru-m-g thank you for the thorough review!
I think it would be great to support the use case of adding a new resource with an included upload, as @alexandru-m-g raises in point 2, but let's improve this once the main PR is merged. For reference, the Internal Error raised is caused because we hit the …
Context: I bumped into a problem when uploading several files by calling ….
Not good enough solution: I was advised to give a try to ….
Is there a better way? Is there …?
Any help or advice is highly appreciated!
resource_create, resource_delete and resource_update can be made concurrency-safe by having them use the new …. resource_revise not triggering view creation isn't good; that would also be great to see fixed.
Could you please elaborate more regarding this …?
I would be happy to create a PR if provided with enough details, as I am a beginner with CKAN :-) Is adding …
I opened a new ticket for some follow-on work: #6420
What's all this about?
This PR adds a new `package_revise` action for safe concurrent multi-user dataset updates. This action defines a new pattern for action parameters and return values, and supports multiple file uploads in a single call.

Why should I care?
CKAN's current selection of API actions used for updating datasets (`package_update`, `package_patch`, `resource_create`, `resource_delete`, `resource_update` and `resource_patch`) are not safe to use concurrently and can easily lose data. With multiple users editing, CKAN's dataset and resource editing forms silently revert changes based on when the form was rendered (not fixed in this PR, but now possible to fix).

Losing data is bad and it makes users unhappy.
Technical Details
This PR implements the SELECT FOR UPDATE approach to package update from #4617.
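For context, a minimal sketch of that locking pattern (illustrative only, not the PR's actual code; `model.Session` and `model.Package` are used as in CKAN core):

```python
from ckan import model

# Lock the package row so concurrent revisions wait on this transaction
pkg = (
    model.Session.query(model.Package)
    .filter_by(name="frontier-data")
    .with_for_update()  # SQLAlchemy emits SELECT ... FOR UPDATE
    .one()
)
# ... check "match" values, apply "filter" and "update", validate ...
model.Session.commit()  # committing releases the row lock
```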
e.g. change the "frontier-data" dataset title from "West" to "East", with the call failing if the title is not currently "West":
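For illustration, a hedged sketch of that call over the action API (the URL and API key below are placeholders):

```python
import requests

response = requests.post(
    "http://localhost:5000/api/3/action/package_revise",  # placeholder URL
    headers={"Authorization": "MY-API-KEY"},  # placeholder API key
    json={
        # abort with ValidationError unless these fields match the
        # current dataset metadata
        "match": {"name": "frontier-data", "title": "West"},
        # values applied only when the match succeeds
        "update": {"title": "East"},
    },
)
response.raise_for_status()
print(response.json()["result"]["package"]["title"])  # "East"
```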
match, filter, update
The "match" dict parameter is used to find the "id" or "name" field to get the original metadata with a SELECT FOR UPDATE. At least one of "id" and "name" must be present. Passing "name" values in the "id" field like other actions doesn't work.
The "match" dict is compared against the existing metadata for a field-by-field match. If any field is not equal or missing in the existing metadata the operation will abort with a ValidationError.
If provided, a "filter" list of string patterns is glob-matched against the existing metadata to remove fields. Fields matching patterns starting with "-" will be removed, and fields matching patterns starting with "+" will be protected from removal (useful before a broad delete pattern).
The "update" dict parameter updates the existing/filtered metadata and will be validated with the normal dataset validators then used to update the dataset in the DB, releasing the SELECT FOR UPDATE locks.
Flattened dict key parameters are supported so that multipart file uploads for all resources, along with updates to all metadata fields, can be made in a single call. This is not possible with any current API.
Flattened Keys
e.g. the same call as above with flattened keys:
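A sketch of the flattened form, posted as ordinary form fields (placeholders as before):

```python
import requests

response = requests.post(
    "http://localhost:5000/api/3/action/package_revise",  # placeholder URL
    headers={"Authorization": "MY-API-KEY"},  # placeholder API key
    data={  # form-encoded flattened keys instead of a JSON body
        "match__name": "frontier-data",
        "match__title": "West",
        "update__title": "East",
    },
)
```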
Flattened keys are similar to the keys accepted by our views for handling posted data with the common pattern `dict_fns.unflatten(tuplize_dict(parse_params(request.form)))`, but are more generic and powerful: arbitrary nested dicts and nested lists are fully supported (`tuplize_dict` only supports lists of dicts) and we allow the use of partial ids to identify resources as well as integer indexes (see the example below).

Flattened keys are required for file uploads. Any time one or more files is included in a web API request, the request is sent in multipart format. Multipart format has all parameters at the top level as string values, instead of a single combined JSON blob.
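For example, flattened string parameters can address resources by integer index, negative index, or partial id (all values below are placeholders):

```python
data = {
    "match__id": "0c9a1e1b-0000-0000-0000-000000000000",  # placeholder
    "update__resources__0__description": "first resource",  # integer index
    "update__resources__-1__format": "CSV",  # negative index: last resource
    "update__resources__1f9ab__name": "guidebook",  # partial resource id
}
```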
e.g. remove all existing resources from a dataset and create two new resources with files uploaded:
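A sketch of that multipart call with requests (URL, key, and file names are placeholders; the `__extend` suffix and the `url_type: upload` fields are assumptions about how new uploaded resources are declared):

```python
import requests

with open("file1.csv", "rb") as f1, open("file2.csv", "rb") as f2:
    response = requests.post(
        "http://localhost:5000/api/3/action/package_revise",  # placeholder
        headers={"Authorization": "MY-API-KEY"},  # placeholder API key
        data={
            "match__name": "frontier-data",
            "filter": "-resources",  # drop all existing resources
            # append two new resource records to the (now empty) list
            "update__resources__extend": (
                '[{"name": "file one", "url_type": "upload"},'
                ' {"name": "file two", "url_type": "upload"}]'
            ),
        },
        # attaching files makes this a multipart request,
        # which is why flattened keys are required
        files={
            "update__resources__0__upload": f1,
            "update__resources__1__upload": f2,
        },
    )
```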
include
The optional "include" parameter follows the same format as "filter" but determines which values will be returned to the caller. This can be used to reduce the payload and skip work on the server-side (not in this PR, future optimization). Unlike
package_show
the updated dataset values are returned under "package" instead of at the top level. This allows us to add additional information in the future such as the corresponding activity record created and flags to indicate which parts of the dataset metadata were actually modified.e.g. only return the new metadata modified date after updating title:
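A sketch, assuming "include" accepts the same "-"/"+" glob patterns as "filter" (the exact patterns here are a guess):

```python
payload = {
    "match__name": "frontier-data",
    "update__title": "East",
    # keep only the new modification timestamp in the response
    "include": ["+metadata_modified", "-*"],
}
# expected result shape, under "package" rather than at the top level:
# {"package": {"metadata_modified": "2019-01-01T00:00:00"}}
```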
Reusable Functions
- `collect_prefix_validate`: collect and validate `update__*` and `match__*` parameters
- `json_list_or_string`: accept `a,b,c` and `["a", "b", "c"]`-style lists of strings
- `json_or_string`: accept JSON-encoded or plain string values
- `dict_only`: require a dict value (for use with `json_or_string`)
- `check_dict`: compare values against existing metadata like the `match` parameter, recursive
- `check_list`: the same comparison for lists
- `resolve_string_key`: look up the value referenced by a flattened key, supporting partial ids and integer indexes
- `check_string_key`: flattened-key version of `check_dict`/`check_list`
- `filter_glob_match`: apply `filter` or `include` string patterns
- `update_merge_dict`: merge update values into existing metadata like the `update` parameter
- `update_merge_list`: the same merge for lists
- `update_merge_string_key`: flattened-key version of `update_merge_dict`/`update_merge_list`
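To illustrate the kind of recursive field-by-field comparison described for `check_dict`, a hypothetical mismatch finder (not CKAN's actual implementation):

```python
def find_mismatches(existing, expected, path=()):
    """Return key paths in `expected` whose values differ from, or are
    missing in, `existing`. Hypothetical sketch, not CKAN's check_dict."""
    mismatches = []
    for key, value in expected.items():
        current = existing.get(key) if isinstance(existing, dict) else None
        if isinstance(value, dict):
            mismatches.extend(
                find_mismatches(current or {}, value, path + (key,)))
        elif current != value:
            mismatches.append(path + (key,))
    return mismatches

# find_mismatches({"title": "West"}, {"title": "East"}) -> [("title",)]
```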
Credits
This work was sponsored by OpenGov
This work was sponsored by the Government of Canada