Skip to content

Safe dataset updates with package_revise #4618

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 35 commits into from
Jun 26, 2020

Conversation

wardi
Copy link
Contributor

@wardi wardi commented Jun 5, 2019

What's all this about?

This PR adds a new package_revise action for safe concurrent multi-user dataset updates. This action defines a new pattern for action parameters and return values, and supports multiple file uploads in a single call.

Why should I care?

CKAN's current selection of API actions used for updating datasets package_update, package_patch, resource_create, resource_delete, resource_update and resource_patch are not safe to use concurrently and can easily lose data. CKAN's dataset and resource editing forms silently revert changes with multiple users based on when the form was rendered (not fixed in this PR, but now possible to fix).

Losing data is bad and it makes users unhappy.

Technical Details

This PR implements the SELECT FOR UPDATE approach to package update from #4617.

e.g. change the "frontier-data" dataset title from "West" to "East" with the call failing if the title is not currently "West"

ckanapi action package_revise \
    match='{"title":"West", "name":"frontier-data"}' \
    update='{"title":"East"}'

match, filter, update

  1. The "match" dict parameter is used to find the "id" or "name" field to get the original metadata with a SELECT FOR UPDATE. At least one of "id" and "name" must be present. Passing "name" values in the "id" field like other actions doesn't work.

  2. The "match" dict is compared against the existing metadata for a field-by-field match. If any field is not equal or missing in the existing metadata the operation will abort with a ValidationError.

  3. If provided a "filter" list of string patterns is glob-matched against the existing metadata to remove fields. Fields matching patterns starting with "-" will be removed and fields matching patterns starting with "+" will be protected from being removed (useful before a broad delete pattern).

  4. The "update" dict parameter updates the existing/filtered metadata and will be validated with the normal dataset validators then used to update the dataset in the DB, releasing the SELECT FOR UPDATE locks.

Flattened dict key parameters are supported so that multipart file uploads for all resources along with updating all metadata fields can be supported with a single call. This is not possible with any current API.

Flattened Keys

e.g. the same call as above with flattened keys:

ckanapi action package_revise \
    match__title=West \
    match__name=frontier-data \
    update__title=East

Flattened keys are similar to the keys accepted by our views for handling posted data with the common pattern dict_fns.unflatten(tuplize_dict(parse_params(request.form)))) but are more generic and powerful because arbitrary nested dicts and nested lists are fully supported (tuplize_dict only supports lists of dicts) and we allow the use of partial ids to identify resources as well as integer indexes.

Flattened keys are required for file uploads. Any time one or more files is included in a web API request, the request is sent in multipart format. Multipart format has all parameters at the top level as string values, instead of a single combined JSON blob.

e.g. remove all existing resources from dataset and create two new resources with files uploaded

ckanapi action package_revise \
    match='{"name":"needs-files"}' \
    filter='-resources__*' \
    update='{"resources":[{"name":"one"},{"name":"two"}]}' \
    [email protected] \
    [email protected]

include

The optional "include" parameter follows the same format as "filter" but determines which values will be returned to the caller. This can be used to reduce the payload and skip work on the server-side (not in this PR, future optimization). Unlike package_show the updated dataset values are returned under "package" instead of at the top level. This allows us to add additional information in the future such as the corresponding activity record created and flags to indicate which parts of the dataset metadata were actually modified.

e.g. only return the new metadata modified date after updating title:

ckanapi action package_revise \
    match='{"name":"frontier-data"}' \
    update='{"title":"East"}' \
    include='+package__metadata_modified,-*'

Reusable Functions

New Validator Description
collect_prefix_validate used to collect update__* and match__* parameters
json_list_or_string accepts a,b,c and ["a", "b", "c"]-style lists of strings
json_or_string parse strings as json, if that fails return string
dict_only only dicts accepted (typically used after json_or_string)
New Dictization Function Description
check_dict used to find non-matching keys/values for match parameter, recursive
check_list used to find non-matching items in lists, recursive
resolve_string_key find an object in nested dicts/lists with a flattened key
check_string_key flattened key version of check_dict/check_list
filter_glob_match remove keys/values/items based on list of filter or include string patterns
update_merge_dict update dict in-place from another dict, recursive
update_merge_list update list in-place from another list, recursive
update_merge_string_key flattened key version of update_merge_dict/update_merge_list

Credits

This work was sponsored by OpenGov
This work was sponsored by the Government of Canada

@wardi
Copy link
Contributor Author

wardi commented Jan 18, 2019

More concurrent-safe package_update usage examples:

1 ________________________ change description in dataset, checking for old description
match {"notes": "old notes", "name": "xyz"}
update {"notes": "new notes"}
2 ________________________ identical to 1, but using flattened keys
match__name "xyz"
match__notes "old notes"
update__notes "new notes"
3 ________________________ replace all fields at dataset level only, keep resources
match {"id": "1234abc-1420-cbad-1922"}
filter ["+resources", "-*"] (note: package id and type fields can't be deleted)
update {"name": "fresh-start", "title": "Fresh Start"}

Note that our current API has no way to update just the dataset metadata without passing a context parameter.

4 ________________________ add a new resource (__extend on flattened key)
match {"id": "abc0123-1420-cbad-1922"}
update__resources__extend [{"name": "new resource", "url": "http://example.com"}]
5 ________________________ update a resource by its index
match {"name": "my-data"}
update__resources__0 {"name": "new name, first resource"}
6 ________________________ update a resource by its id (prefixes allowed >4 chars)
match {"name": "their-data"}
update__resources__19cfad {"description": "right one for sure"}
7 ________________________ replace all fields of a resource
match {"id": "34a12bc-1420-cbad-1922"}
filter ["+resources__1492a__id", "-resources__1492a__*"]
update__resources__1492a {"name": "edits here", "url": "http://example.com"}

@wardi wardi changed the title package_update SELECT FOR UPDATE package_sfu action (was SELECT FOR UPDATE) May 15, 2019
@wardi wardi added WIP and removed To Discuss labels Jun 5, 2019
@wardi wardi changed the title package_sfu action (was SELECT FOR UPDATE) Safe dataset updates with package_sfu action Jun 6, 2019
@wardi wardi changed the title Safe dataset updates with package_sfu action Safe dataset updates with package_sfu Jun 6, 2019
@boykoc
Copy link
Contributor

boykoc commented Jun 13, 2019

Tossing out some naming ideas:

  • package_lock_update (I think my favorite. I think we're kind of doing pessimistic locking like in ActiveRecord. Also Postgres docs talk about SELECT FOR UPDATE locking the row and preventing others from locking, modifying or deleting it until after it is unlocked.)
  • package_locked_update
  • package_update_with_lock
  • package_concurrent_update
  • package_concurrent_safe_update

Other's suggestions:

  • package_frobnicate definition (or package_frob - "find, replace, omit, blocking")
  • package_blocking_update
  • package_locking_update

@wardi
Copy link
Contributor Author

wardi commented Jun 13, 2019

@boykoc I like package_lock_update too, except for the possible "we're updating a package lock" interpretation.

@boykoc
Copy link
Contributor

boykoc commented Jun 13, 2019

@wardi good point. What about package_lock_for_update? I kind of wish the endpoints were a bit more human readable e.g. lock_package_for_update. But I guess this might then make it sound like a function that locks a package, allowing you to then update. package_lock_and_update doesn't feel right.

I'll add more if I come up with anything after letting it stew for awhile.

@wardi
Copy link
Contributor Author

wardi commented Jun 13, 2019

One problem with these names is they don't capture the idea that this is a different kind of update: it includes checking values first and can modify fields in a much more fine-grained way than the existing *_update actions. Thoughts on just using another word for update to highlight this difference?

  • package_revise (this is my favourite)
  • package_amend
  • package_alter
  • package_modify

@sivang
Copy link
Member

sivang commented Jun 14, 2019

Following the spirit of the 'safe lock' update by @boykoc , How about package_update_safe , then we could follow with the rest in the same manner and we'll have a set of methods to do a 'regular' update (e.g. the original).

get_action('package_update') 

vs.

get_action('package_update_safe')

Then it'd be easy to explain in the docs that one should use the "safe protocol" when updating stuff in out-of-band , potentially concurrent jobs. (I admit this is thinking out of the documentation user stand point).

@mcarans
Copy link

mcarans commented Mar 18, 2020

@amercader Just wondering how things are going with reviewing this PR?

@rufuspollock
Copy link
Member

@mcarans would you be able to help review the PR - that would probably help 😄

@alexandru-m-g
Copy link
Contributor

@rufuspollock, how could we help you with this ? @mcarans made me aware of this, we're working together on the HDX project.

Were you referring to a code review of the changes or for us to test on a CKAN instance that has the new package_revise() action enabled ?

@rufuspollock
Copy link
Member

@alexandru-m-g great to see you here 🙏 😄 - yes it would be reviewing the code in this PR here. A bonus would be pulling with this branch and trying it out but that could be secondary 😄

@alexandru-m-g
Copy link
Contributor

I got around to looking through the code changes a bit but then I decided it's easier to just take the code and get it running locally in order to get a clearer picture of the new stuff. So I'm taking it for a spin now and will come back with a feedback.

@alexandru-m-g
Copy link
Contributor

alexandru-m-g commented Jun 1, 2020

This is a BIG improvement to CKAN and I realize that it was a significant amount of work, so just wanted to say that it really is appreciated. I've played with this new functionality a bit and it worked quite well. I just have a few questions that I'll detail below:

  1. I saw that the logic related to upload & defer_commit was moved from resource_update() to package_update().
    https://github.com/wardi/ckan/blame/d85e60cc62e5eb5cfb5a35e95ee4aaa11d29e115/ckan/logic/action/update.py#L282
    Since resource_create() is using package_update() in a similar way, would it be possible to remove the upload & defer_commit logic from resource_create() as well ? Basically for package_update() to remain the only place in the code that deals with uploads.

  2. I've played with adding new "uploaded" resources to an existing dataset. If the dataset has no resources I could do something like (this works):

    curl --location --request POST 'http://ckan:5000/api/action/package_revise' \
    --form 'match__name=dataset-for-testing-package-revise' \
    --form 'match__title=Dataset for testing package revise' \
    --form 'update={"resources":[{"name": "test1.csv"}]}' \
    --form 'update__resources__0__upload=@/path/to/test1.csv'
    

    But when I tried to add a 2nd "uploaded" resource to an existing list of resources with extend, I did something wrong (this doesn't work):

    curl --location --request POST 'http://ckan:5000/api/action/package_revise' \
    --form 'match__name=dataset-for-testing-package-revise' \
    --form 'match__title=Dataset for testing package revise' \
    --form 'update__resources__extend=[{"name": "test2.csv"}]' \
    --form 'update__resources__1__upload=@/path/to/test2.csv'
    

    I get an error because it first tries to look for resource with index 1 (which doesn't exist yet)
    So I need some help here on the correct way to add a new uploaded resource to a dataset with existing resources

  3. Related to (2), i think there's a small issue when reporting the error back to the caller. There seems to be a problem when serializing to json here: https://github.com/wardi/ckan/blob/b719e631984a61907cd737571513b009607c8343/ckan/views/api.py#L67 . The error in the response_data dict contains a reference to the FileStorage object which can't be transformed to json. So an exception is thrown there and the end result is that I'm getting back an HTML page

  4. In the future, is the plan for resource_create() and resource_update() to also use select for update ? Would be good to ensure that the package is not modified between the first package show and the actual update. As I mentioned here use "select for update" when retrieving package information for resource_create() / resource_update() #3748, there are scenarios where this could help. Probably also the _*patch() actions would benefit from a similar change.

  5. Maybe I'm jumping the gun here but will the UI, more specifically the dataset form, switch at some point in the future, to using package_revise() behind the scenes ? In which case, do you plan to match on each field in the form (match__name, match__notes, match_author, etc) or just on one field that somehow identifies the previous version (like a hashcode, revision_id or metadata_modified) ?

@wardi
Copy link
Contributor Author

wardi commented Jun 1, 2020

@alexandru-m-g thank you for the thorough review!

  1. great idea I'm not sure why I left some parts of the upload logic in resource_update

  2. good catch, maybe we can change the order that the transformations are applied so that files can be uploaded to newly-created resources in a single call and add some tests for this.

  3. thanks yes, we should filter out any file objects in the dict before trying to render json

  4. absolutely, after this PR is merged

  5. that's the plan. I think we might include hidden fields in the form that indicate the original values of form fields then the view can use package_revise to match and update only the fields that were changed by the user. When field(s) fail the match check the view will render a validation error with the current value and ask the user to confirm whether changes should still be applied. All this is for after this PR is merged, of course

Copy link
Member

@amercader amercader left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update on this: I'm making good progress on the code review and still have to try out the API but hopefully will be finished soon!

Screenshot_2020-06-23 Safe dataset updates with package_revise by wardi · Pull Request #4618 · ckan ckan

Just leaving some minor comments for now

@amercader
Copy link
Member

I think it would be great to support the use case of adding a new resource with an included upload as @alexandru-m-g raises on point 2, but let's improve this once the main PR is merged.

For reference the Internal Error raised is caused because we hit the if_empty_guess_format validator without a url field, because the resource does not include an upload field so the URL is not generated from the file name:

  File "/home/adria/dev/pyenvs/ckan/src/ckan/ckan/logic/action/update.py", line 470, in package_revise
    orig)}
  File "/home/adria/dev/pyenvs/ckan/src/ckan/ckan/logic/__init__.py", line 472, in wrapped
    result = _action(context, data_dict, **kw)
  File "/home/adria/dev/pyenvs/ckan/src/ckan/ckan/logic/action/update.py", line 296, in package_update
    package_plugin, context, data_dict, schema, 'package_update')
  File "/home/adria/dev/pyenvs/ckan/src/ckan/ckan/lib/plugins.py", line 306, in plugin_validate
    return toolkit.navl_validate(data_dict, schema, context)
  File "/home/adria/dev/pyenvs/ckan/src/ckan/ckan/lib/navl/dictization_functions.py", line 273, in validate
    converted_data, errors = _validate(flattened, schema, validators_context)
  File "/home/adria/dev/pyenvs/ckan/src/ckan/ckan/lib/navl/dictization_functions.py", line 314, in _validate
    convert(converter, key, converted_data, errors, context)
  File "/home/adria/dev/pyenvs/ckan/src/ckan/ckan/lib/navl/dictization_functions.py", line 237, in convert
    converter(key, converted_data, errors, context)
  File "/home/adria/dev/pyenvs/ckan/src/ckan/ckan/logic/validators.py", line 769, in if_empty_guess_format
    mimetype, encoding = mimetypes.guess_type(url)
  File "/usr/lib/python2.7/mimetypes.py", line 294, in guess_type
    return _db.guess_type(url, strict)
  File "/usr/lib/python2.7/mimetypes.py", line 114, in guess_type
    scheme, url = urllib.splittype(url)
  File "/usr/lib/python2.7/urllib.py", line 1087, in splittype
    match = _typeprog.match(url)
TypeError: expected string or buffer

@amercader amercader merged commit 9b47735 into ckan:master Jun 26, 2020
@amercader
Copy link
Member

@wardi had a good play with this, great work! This makes the CKAN API even more awesome 🚀

I took the liberty of adding your examples to the action docstring so they appear in the API docs (a51af0b)

@ghost
Copy link

ghost commented Mar 16, 2021

Context: I bumped into a problem when uploading several files by calling resource_create for each file (I have a custom UI written in React which enables this). Deceitfully it works fine for development environment when there is a single thread. When setting up a production like environment with several threads (defined in uWSGI) resource_create-calls start to overwrite each other. I reported it in #5959

Not good enough solution: I was advised to give a try to package_revise, but it turned out that it doesn't call plugins.IResourceController-plugins #L346 neither resource_create_default_resource_views #L337. As a result the resource is uploaded correctly but cannot be previewed in UI.

Is there a better way?: Is there package_revise-type of action which mimics resource_create i.e. does aforementioned steps and also a concurrent safe? If not, should there be a feature request for that as this is quite of core functionality. Or is there some easy workaround for such problems?

Any help or advice is highly appreciated!

@wardi
Copy link
Contributor Author

wardi commented Mar 16, 2021

resource_create, resource_delete and resource_update can be made concurrent safe with by having them use the new for_update context parameter. This would be a pretty easy contribution if you're interested in submitting a PR.

resource_revise not triggering view creation isn't good, that would also be great to see fixed.

@ghost
Copy link

ghost commented Mar 17, 2021

resource_create, resource_delete and resource_update can be made concurrent safe with by having them use the new for_update context parameter.

Could you, please, elaborate more regarding this for_update? I can see how you set it to true in the original context inside update.py. But I don't quite get how it does all the magic of making the whole package_revise method call concurrency safe.

@wardi
Copy link
Contributor Author

wardi commented Mar 17, 2021

for_update relies on the database to lock the package for concurrent updates, blocking if there are any existing locks. The lock is released at the end of the transaction when the package and resource changes are committed.

@ghost
Copy link

ghost commented Mar 18, 2021

I would be happy to create a PR if I will be provided with enough details as I am a beginner with CKAN :-)

Is adding for_update parameter in the beginning of the resource_create-method the only change required? I tried it and tested quickly, but it seems to behave the same old way.

@wardi
Copy link
Contributor Author

wardi commented Sep 22, 2021

I opened a new ticket for some follow-on work #6420

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants