Safe dataset updates with package_revise #4618

wardi · 2019-06-05T17:09:16Z

What's all this about?

This PR adds a new package_revise action for safe concurrent multi-user dataset updates. This action defines a new pattern for action parameters and return values, and supports multiple file uploads in a single call.

Why should I care?

CKAN's current API actions for updating datasets (package_update, package_patch, resource_create, resource_delete, resource_update and resource_patch) are not safe to use concurrently and can easily lose data. When multiple users edit at once, CKAN's dataset and resource editing forms silently revert changes based on when each form was rendered (not fixed in this PR, but now possible to fix).

Losing data is bad and it makes users unhappy.

Technical Details

This PR implements the SELECT FOR UPDATE approach to package update from #4617.

e.g. change the "frontier-data" dataset title from "West" to "East" with the call failing if the title is not currently "West"

ckanapi action package_revise \
    match='{"title":"West", "name":"frontier-data"}' \
    update='{"title":"East"}'

match, filter, update

The "match" dict parameter is used to find the "id" or "name" field to get the original metadata with a SELECT FOR UPDATE. At least one of "id" and "name" must be present. Passing "name" values in the "id" field like other actions doesn't work.
The "match" dict is compared against the existing metadata for a field-by-field match. If any field is not equal or missing in the existing metadata the operation will abort with a ValidationError.
If provided, a "filter" list of string patterns is glob-matched against the existing metadata to remove fields. Fields matching patterns starting with "-" will be removed and fields matching patterns starting with "+" will be protected from being removed (useful before a broad delete pattern).
The "update" dict parameter updates the existing/filtered metadata and will be validated with the normal dataset validators then used to update the dataset in the DB, releasing the SELECT FOR UPDATE locks.
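The match, filter and update steps described above can be sketched in plain Python. This is only an illustration on top-level keys, not CKAN's implementation: the `revise` function and `MatchError` class are hypothetical names, there is no database locking or validation, and the "first matching pattern wins" rule is an assumption inferred from the note that "+" patterns are useful before a broad delete pattern.

```python
from fnmatch import fnmatchcase


class MatchError(Exception):
    """Stand-in for CKAN's ValidationError in this sketch."""


def revise(existing, match, filters=(), update=None):
    """Return a revised copy of `existing` (top-level keys only)."""
    # 1. match: every field must be present and equal, or we abort.
    for key, expected in match.items():
        if key not in existing or existing[key] != expected:
            raise MatchError("field %r does not match" % key)

    # 2. filter: patterns are assumed to start with "+" or "-";
    #    the first pattern matching a key decides whether it is removed.
    kept = {}
    for key, value in existing.items():
        remove = False
        for pattern in filters:
            if fnmatchcase(key, pattern[1:]):
                remove = pattern.startswith("-")
                break
        if not remove:
            kept[key] = value

    # 3. update: new values overwrite whatever survived the filter.
    kept.update(update or {})
    return kept


dataset = {"name": "frontier-data", "title": "West", "notes": "old"}
revised = revise(
    dataset,
    match={"name": "frontier-data", "title": "West"},
    filters=["+name", "-*"],
    update={"title": "East"},
)
print(revised)  # {'name': 'frontier-data', 'title': 'East'}
```

A stale caller that still believes the title is "North" would get a `MatchError` here instead of silently overwriting the newer value.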

Flattened dict key parameters are supported so that multipart file uploads for all resources along with updating all metadata fields can be supported with a single call. This is not possible with any current API.

Flattened Keys

e.g. the same call as above with flattened keys:

ckanapi action package_revise \
    match__title=West \
    match__name=frontier-data \
    update__title=East

Flattened keys are similar to the keys accepted by our views for handling posted data with the common pattern dict_fns.unflatten(tuplize_dict(parse_params(request.form))). They are more generic and powerful, though: arbitrary nested dicts and nested lists are fully supported (tuplize_dict only supports lists of dicts), and resources can be identified by partial ids as well as integer indexes.
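To make the key format concrete, here is a minimal sketch of expanding "__"-separated keys into nested dicts. The `unflatten` function below is a hypothetical helper written for this illustration; it handles dicts only, whereas the PR's own functions also deal with lists, integer indexes and partial resource ids.

```python
def unflatten(params):
    """Expand {'a__b': 1} into {'a': {'b': 1}} (dicts only in this sketch)."""
    result = {}
    for flat_key, value in params.items():
        parts = flat_key.split("__")
        node = result
        # Walk (and create) the intermediate dicts for every part but the last.
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = value
    return result


params = {
    "match__title": "West",
    "match__name": "frontier-data",
    "update__title": "East",
}
nested = unflatten(params)
print(nested)
```

The result has the same shape as the `match` and `update` dicts in the non-flattened call.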

Flattened keys are required for file uploads. Any time one or more files is included in a web API request, the request is sent in multipart format. Multipart format has all parameters at the top level as string values, instead of a single combined JSON blob.

e.g. remove all existing resources from dataset and create two new resources with files uploaded

ckanapi action package_revise \
    match='{"name":"needs-files"}' \
    filter='-resources__*' \
    update='{"resources":[{"name":"one"},{"name":"two"}]}' \
    [email protected] \
    [email protected]
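For callers building the multipart request themselves, the split between ordinary form fields and file parts might look like the sketch below. `build_revise_form` is a hypothetical helper, the file names are made up, and the `update__resources__<index>__upload` key shape is taken from the curl example later in this thread. Nothing is sent over the network; an HTTP client would post `data` as form fields and open each path in `files` as a file part.

```python
import json


def build_revise_form(match, update, uploads, filters=()):
    """Split a package_revise call into text fields and file fields.

    `uploads` maps a resource index to a local file path.
    """
    # JSON-encode the dict parameters; multipart values are plain strings.
    data = {"match": json.dumps(match), "update": json.dumps(update)}
    if filters:
        # Comma-separated patterns, as in the include example in this PR.
        data["filter"] = ",".join(filters)
    # One flattened key per uploaded file.
    files = {
        "update__resources__%d__upload" % index: path
        for index, path in sorted(uploads.items())
    }
    return data, files


data, files = build_revise_form(
    match={"name": "needs-files"},
    update={"resources": [{"name": "one"}, {"name": "two"}]},
    uploads={0: "one.csv", 1: "two.csv"},  # hypothetical local files
    filters=["-resources__*"],
)
print(sorted(files))
```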

include

The optional "include" parameter follows the same format as "filter" but determines which values will be returned to the caller. This can be used to reduce the payload and skip work on the server-side (not in this PR, future optimization). Unlike package_show the updated dataset values are returned under "package" instead of at the top level. This allows us to add additional information in the future such as the corresponding activity record created and flags to indicate which parts of the dataset metadata were actually modified.

e.g. only return the new metadata modified date after updating title:

ckanapi action package_revise \
    match='{"name":"frontier-data"}' \
    update='{"title":"East"}' \
    include='+package__metadata_modified,-*'

Reusable Functions

New Validator	Description
`collect_prefix_validate`	used to collect `update__` and `match__` parameters
`json_list_or_string`	accepts `a,b,c` and `["a", "b", "c"]`-style lists of strings
`json_or_string`	parse strings as json, if that fails return string
`dict_only`	only dicts accepted (typically used after `json_or_string`)

New Dictization Function	Description
`check_dict`	used to find non-matching keys/values for `match` parameter, recursive
`check_list`	used to find non-matching items in lists, recursive
`resolve_string_key`	find an object in nested dicts/lists with a flattened key
`check_string_key`	flattened key version of `check_dict`/`check_list`
`filter_glob_match`	remove keys/values/items based on list of `filter` or `include` string patterns
`update_merge_dict`	update dict in-place from another dict, recursive
`update_merge_list`	update list in-place from another list, recursive
`update_merge_string_key`	flattened key version of `update_merge_dict`/`update_merge_list`
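As a rough illustration of what a recursive in-place merge of this kind does, consider the sketch below. `merge_dict` and `merge_list` are hypothetical names, not the functions added by this PR. How the real functions treat surplus list items is not stated in this thread, so appending extra source items and leaving extra target items untouched is an assumption.

```python
def merge_dict(target, source):
    """Recursively update `target` in place from `source`."""
    for key, value in source.items():
        if isinstance(value, dict) and isinstance(target.get(key), dict):
            merge_dict(target[key], value)
        elif isinstance(value, list) and isinstance(target.get(key), list):
            merge_list(target[key], value)
        else:
            target[key] = value


def merge_list(target, source):
    """Merge item by item; extra source items are appended (assumption)."""
    for index, value in enumerate(source):
        if index >= len(target):
            target.append(value)
        elif isinstance(value, dict) and isinstance(target[index], dict):
            merge_dict(target[index], value)
        else:
            target[index] = value


dataset = {
    "title": "West",
    "resources": [{"name": "one", "url": "http://example.com"}],
}
merge_dict(dataset, {"title": "East", "resources": [{"name": "renamed"}]})
print(dataset)
```

The resource keeps its `url` because only the keys present in the update are overwritten.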

Credits

This work was sponsored by OpenGov
This work was sponsored by the Government of Canada

wardi · 2019-01-18T21:04:15Z

More concurrent-safe package_update usage examples:

1 ________________________	change description in dataset, checking for old description
`match`	`{"notes": "old notes", "name": "xyz"}`
`update`	`{"notes": "new notes"}`

2 ________________________	identical to 1, but using flattened keys
`match__name`	`"xyz"`
`match__notes`	`"old notes"`
`update__notes`	`"new notes"`

3 ________________________	replace all fields at dataset level only, keep resources
`match`	`{"id": "1234abc-1420-cbad-1922"}`
`filter`	`["+resources", "-*"]` (note: package `id` and `type` fields can't be deleted)
`update`	`{"name": "fresh-start", "title": "Fresh Start"}`

Note that our current API has no way to update just the dataset metadata without passing a context parameter.

4 ________________________	add a new resource (`__extend` on flattened key)
`match`	`{"id": "abc0123-1420-cbad-1922"}`
`update__resources__extend`	`[{"name": "new resource", "url": "http://example.com"}]`

5 ________________________	update a resource by its index
`match`	`{"name": "my-data"}`
`update__resources__0`	`{"name": "new name, first resource"}`

6 ________________________	update a resource by its id (prefixes allowed >4 chars)
`match`	`{"name": "their-data"}`
`update__resources__19cfad`	`{"description": "right one for sure"}`

7 ________________________	replace all fields of a resource
`match`	`{"id": "34a12bc-1420-cbad-1922"}`
`filter`	`["+resources__1492a__id", "-resources__1492a__*"]`
`update__resources__1492a`	`{"name": "edits here", "url": "http://example.com"}`
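Examples 4 to 6 address resources in three ways: `extend`, an integer index, and an id prefix. A small sketch of that lookup is shown below. `find_resource` and `apply_resource_update` are hypothetical helpers and the resource ids are invented. Requiring the prefix to match exactly one resource is an assumption, and an all-digit key is always treated as an index here.

```python
def find_resource(resources, key):
    """Locate a resource by integer index or by id prefix (sketch)."""
    if key.isdigit():
        return resources[int(key)]
    if len(key) <= 4:  # the examples say prefixes must be >4 chars
        raise KeyError("id prefix too short: %r" % key)
    matches = [r for r in resources if r["id"].startswith(key)]
    if len(matches) != 1:
        raise KeyError("prefix %r matched %d resources" % (key, len(matches)))
    return matches[0]


def apply_resource_update(resources, key, value):
    """Apply one update__resources__<key> parameter in place."""
    if key == "extend":
        resources.extend(value)  # value is a list of new resources
    else:
        find_resource(resources, key).update(value)


resources = [
    {"id": "19cfad00-aaaa", "name": "first"},   # invented ids
    {"id": "1492a111-bbbb", "name": "second"},
]
apply_resource_update(resources, "0", {"name": "new name, first resource"})
apply_resource_update(resources, "1492a", {"description": "right one for sure"})
apply_resource_update(
    resources, "extend", [{"name": "new resource", "url": "http://example.com"}]
)
print(len(resources))
```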


boykoc · 2019-06-13T13:47:49Z

Tossing out some naming ideas:

package_lock_update (I think my favorite. I think we're kind of doing pessimistic locking like in ActiveRecord. Also Postgres docs talk about SELECT FOR UPDATE locking the row and preventing others from locking, modifying or deleting it until after it is unlocked.)
package_locked_update
package_update_with_lock
package_concurrent_update
package_concurrent_safe_update

Others' suggestions:

package_frobnicate definition (or package_frob - "find, replace, omit, blocking")
package_blocking_update
package_locking_update

wardi · 2019-06-13T14:00:39Z

@boykoc I like package_lock_update too, except for the possible "we're updating a package lock" interpretation.

boykoc · 2019-06-13T14:05:18Z

@wardi good point. What about package_lock_for_update? I kind of wish the endpoints were a bit more human readable e.g. lock_package_for_update. But I guess this might then make it sound like a function that locks a package, allowing you to then update. package_lock_and_update doesn't feel right.

I'll add more if I come up with anything after letting it stew for a while.

wardi · 2019-06-13T14:05:21Z

One problem with these names is they don't capture the idea that this is a different kind of update: it includes checking values first and can modify fields in a much more fine-grained way than the existing *_update actions. Thoughts on just using another word for update to highlight this difference?

package_revise (this is my favourite)
package_amend
package_alter
package_modify

sivang · 2019-06-14T04:42:27Z

Following the spirit of the 'safe lock' update by @boykoc, how about package_update_safe? Then we could name the rest in the same manner and we'd have a set of safe methods alongside the 'regular' updates (i.e. the originals).

get_action('package_update')

vs.

get_action('package_update_safe')

Then it'd be easy to explain in the docs that one should use the "safe protocol" when updating stuff in out-of-band, potentially concurrent jobs. (I admit this is thinking from the documentation user's standpoint.)

mcarans · 2020-03-18T09:59:17Z

@amercader Just wondering how things are going with reviewing this PR?

rufuspollock · 2020-05-14T21:07:22Z

@mcarans would you be able to help review the PR - that would probably help 😄

alexandru-m-g · 2020-05-15T12:18:46Z

@rufuspollock, how could we help you with this? @mcarans made me aware of this; we're working together on the HDX project.

Were you referring to a code review of the changes, or to us testing on a CKAN instance that has the new package_revise() action enabled?

rufuspollock · 2020-05-15T20:08:57Z

@alexandru-m-g great to see you here 🙏 😄 - yes, it would be reviewing the code in this PR. A bonus would be pulling this branch and trying it out, but that could be secondary 😄

alexandru-m-g · 2020-05-29T14:35:00Z

I got around to looking through the code changes a bit, but then I decided it's easier to just take the code and get it running locally in order to get a clearer picture of the new stuff. So I'm taking it for a spin now and will come back with feedback.

alexandru-m-g · 2020-06-01T18:24:11Z

This is a BIG improvement to CKAN and I realize that it was a significant amount of work, so just wanted to say that it really is appreciated. I've played with this new functionality a bit and it worked quite well. I just have a few questions that I'll detail below:

1. I saw that the logic related to upload & defer_commit was moved from resource_update() to package_update().
https://github.com/wardi/ckan/blame/d85e60cc62e5eb5cfb5a35e95ee4aaa11d29e115/ckan/logic/action/update.py#L282
Since resource_create() is using package_update() in a similar way, would it be possible to remove the upload & defer_commit logic from resource_create() as well? Basically, package_update() would remain the only place in the code that deals with uploads.

2. I've played with adding new "uploaded" resources to an existing dataset. If the dataset has no resources I could do something like this (this works):

curl --location --request POST 'http://ckan:5000/api/action/package_revise' \
--form 'match__name=dataset-for-testing-package-revise' \
--form 'match__title=Dataset for testing package revise' \
--form 'update={"resources":[{"name": "test1.csv"}]}' \
--form 'update__resources__0__upload=@/path/to/test1.csv'

But when I tried to add a 2nd "uploaded" resource to an existing list of resources with extend, I did something wrong (this doesn't work):

curl --location --request POST 'http://ckan:5000/api/action/package_revise' \
--form 'match__name=dataset-for-testing-package-revise' \
--form 'match__title=Dataset for testing package revise' \
--form 'update__resources__extend=[{"name": "test2.csv"}]' \
--form 'update__resources__1__upload=@/path/to/test2.csv'

I get an error because it first tries to look for a resource with index 1 (which doesn't exist yet).
So I need some help here on the correct way to add a new uploaded resource to a dataset with existing resources.

3. Related to (2), I think there's a small issue when reporting the error back to the caller. There seems to be a problem when serializing to JSON in ckan/views/api.py (https://github.com/wardi/ckan/blob/b719e631984a61907cd737571513b009607c8343/ckan/views/api.py#L67). The error in the response_data dict contains a reference to the FileStorage object, which can't be converted to JSON. So an exception is thrown there and the end result is that I get back an HTML page.
4. In the future, is the plan for resource_create() and resource_update() to also use select for update? It would be good to ensure that the package is not modified between the first package show and the actual update. As I mentioned in #3748 (use "select for update" when retrieving package information for resource_create() / resource_update()), there are scenarios where this could help. The *_patch() actions would probably also benefit from a similar change.
5. Maybe I'm jumping the gun here, but will the UI, more specifically the dataset form, switch at some point in the future to using package_revise() behind the scenes? In that case, do you plan to match on each field in the form (match__name, match__notes, match__author, etc.) or just on one field that somehow identifies the previous version (like a hashcode, revision_id or metadata_modified)?

wardi · 2020-06-01T19:17:40Z

@alexandru-m-g thank you for the thorough review!

1. Great idea, I'm not sure why I left some parts of the upload logic in resource_update.
2. Good catch. Maybe we can change the order in which the transformations are applied so that files can be uploaded to newly created resources in a single call, and add some tests for this.
3. Thanks, yes, we should filter out any file objects in the dict before trying to render JSON.
4. Absolutely, after this PR is merged.
5. That's the plan. I think we might include hidden fields in the form that indicate the original values of form fields; then the view can use package_revise to match and update only the fields that were changed by the user. When one or more fields fail the match check, the view will render a validation error with the current value and ask the user to confirm whether changes should still be applied. All this is for after this PR is merged, of course.

amercader

Update on this: I'm making good progress on the code review and still have to try out the API but hopefully will be finished soon!

Just leaving some minor comments for now


amercader · 2020-06-26T15:00:55Z

I think it would be great to support the use case of adding a new resource with an included upload as @alexandru-m-g raises on point 2, but let's improve this once the main PR is merged.

For reference, the Internal Error is raised because we hit the if_empty_guess_format validator without a url field: the resource does not include an upload field, so the URL is not generated from the file name:

  File "/home/adria/dev/pyenvs/ckan/src/ckan/ckan/logic/action/update.py", line 470, in package_revise
    orig)}
  File "/home/adria/dev/pyenvs/ckan/src/ckan/ckan/logic/__init__.py", line 472, in wrapped
    result = _action(context, data_dict, **kw)
  File "/home/adria/dev/pyenvs/ckan/src/ckan/ckan/logic/action/update.py", line 296, in package_update
    package_plugin, context, data_dict, schema, 'package_update')
  File "/home/adria/dev/pyenvs/ckan/src/ckan/ckan/lib/plugins.py", line 306, in plugin_validate
    return toolkit.navl_validate(data_dict, schema, context)
  File "/home/adria/dev/pyenvs/ckan/src/ckan/ckan/lib/navl/dictization_functions.py", line 273, in validate
    converted_data, errors = _validate(flattened, schema, validators_context)
  File "/home/adria/dev/pyenvs/ckan/src/ckan/ckan/lib/navl/dictization_functions.py", line 314, in _validate
    convert(converter, key, converted_data, errors, context)
  File "/home/adria/dev/pyenvs/ckan/src/ckan/ckan/lib/navl/dictization_functions.py", line 237, in convert
    converter(key, converted_data, errors, context)
  File "/home/adria/dev/pyenvs/ckan/src/ckan/ckan/logic/validators.py", line 769, in if_empty_guess_format
    mimetype, encoding = mimetypes.guess_type(url)
  File "/usr/lib/python2.7/mimetypes.py", line 294, in guess_type
    return _db.guess_type(url, strict)
  File "/usr/lib/python2.7/mimetypes.py", line 114, in guess_type
    scheme, url = urllib.splittype(url)
  File "/usr/lib/python2.7/urllib.py", line 1087, in splittype
    match = _typeprog.match(url)
TypeError: expected string or buffer

amercader · 2020-06-26T15:27:27Z

@wardi had a good play with this, great work! This makes the CKAN API even more awesome 🚀

I took the liberty of adding your examples to the action docstring so they appear in the API docs (a51af0b)

ghost · 2021-03-16T13:59:28Z

Context: I bumped into a problem when uploading several files by calling resource_create for each file (I have a custom UI written in React which enables this). Deceptively, it works fine in a development environment where there is a single thread. In a production-like environment with several threads (defined in uWSGI), resource_create calls start to overwrite each other. I reported it in #5959.

Not a good enough solution: I was advised to give package_revise a try, but it turned out that it calls neither the plugins.IResourceController plugins #L346 nor resource_create_default_resource_views #L337. As a result the resource is uploaded correctly but cannot be previewed in the UI.

Is there a better way?: Is there a package_revise-type action which mimics resource_create, i.e. performs the aforementioned steps and is also concurrency-safe? If not, should there be a feature request for one, as this is quite core functionality? Or is there some easy workaround for such problems?

Any help or advice is highly appreciated!

wardi · 2021-03-16T17:00:42Z

resource_create, resource_delete and resource_update can be made concurrent safe by having them use the new for_update context parameter. This would be a pretty easy contribution if you're interested in submitting a PR.

package_revise not triggering view creation isn't good; that would also be great to see fixed.

ghost · 2021-03-17T07:23:15Z

resource_create, resource_delete and resource_update can be made concurrent safe by having them use the new for_update context parameter.

Could you, please, elaborate more regarding this for_update? I can see how you set it to true in the original context inside update.py. But I don't quite get how it does all the magic of making the whole package_revise method call concurrency safe.

wardi · 2021-03-17T18:21:55Z

for_update relies on the database to lock the package for concurrent updates, blocking if there are any existing locks. The lock is released at the end of the transaction when the package and resource changes are committed.
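The effect of holding a lock across the whole read-modify-write can be shown with an in-memory analogy. This is not CKAN code and involves no database: the `Row` class and `add_resource` function are hypothetical, and a `threading.Lock` merely plays the role of the row lock that SELECT FOR UPDATE holds until the transaction commits.

```python
import threading


class Row:
    """In-memory stand-in for a dataset row guarded by a row lock."""

    def __init__(self):
        self.resources = []
        self.lock = threading.Lock()


def add_resource(row, name):
    # Other writers block here until the current holder is done,
    # much like a second SELECT FOR UPDATE on the same row.
    with row.lock:
        current = list(row.resources)  # read the existing resources
        current.append(name)           # modify a private copy
        row.resources = current        # write back, then release the lock


row = Row()
threads = [
    threading.Thread(target=add_resource, args=(row, "file-%d" % i))
    for i in range(20)
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
print(len(row.resources))  # 20
```

Without the lock, two writers could both read the same list and the later write would discard the earlier one, which is the overwrite behaviour reported above for concurrent resource_create calls.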

ghost · 2021-03-18T16:33:40Z

I would be happy to create a PR if I am given enough details, as I am a beginner with CKAN :-)

Is adding the for_update parameter at the beginning of the resource_create method the only change required? I tried it and tested quickly, but it seems to behave the same old way.

wardi · 2021-09-22T12:52:10Z

I opened a new ticket for some follow-on work #6420

amercader added the To Discuss label Jan 15, 2019

wardi mentioned this pull request Apr 26, 2019

Create concurrent resources for a dataset #4217

Closed

check, merge and delete dict/list functions

ce3109a

wardi changed the title ~~package_update SELECT FOR UPDATE~~ package_sfu action (was SELECT FOR UPDATE) May 15, 2019

wardi added 13 commits May 15, 2019 14:16

TestCheckDict

d5f6a6a

check_dict: no exception on wrong type

ef001a2

test resolve_string_key

ac2c334

test: check_string_key

964d968

test: filter_glob_match

ed3f353

test: update_merge_dict

81d0815

test: update_merge_string_key

dfa360a

update_sfu initial implementation

5bcf6d8

package_sfu_schema and collect_prefix_validate

71c58ca

fixes for package_sfu

2b590f1

package_show: FOR UPDATE context parameter

0fa66b2

package_update: handle uploads instead of only in resource_update

75927b6

package_update: fix tests

10036c5

wardi added WIP and removed To Discuss labels Jun 5, 2019

fix tests

8174295

wardi changed the title ~~package_sfu action (was SELECT FOR UPDATE)~~ Safe dataset updates with package_sfu action Jun 6, 2019

wardi changed the title ~~Safe dataset updates with package_sfu action~~ Safe dataset updates with package_sfu Jun 6, 2019

amercader reviewed Jun 7, 2019


[ckan#4618] use get_action("package_update") per amercader

5a99fbb

smotornyuk mentioned this pull request Mar 24, 2020

Uploaded Resource files randomly disappeared from ckan. #5298

Closed

amercader added this to the CKAN 2.9 milestone Mar 31, 2020

amercader reviewed Jun 22, 2020


wardi and others added 4 commits June 23, 2020 10:38

[ckan#4618] improve revise update description

85b1b69

Co-authored-by: Adrià Mercader <[email protected]>

[ckan#4618] better flag name for columns: remove_if_not_provided

8379243

Merge remote-tracking branch 'ckan/master' into 4618-select-for-update

920b2db

[ckan#4618] make resource.name sticky because changed tests rely on it

95634ec

amercader merged commit 9b47735 into ckan:master Jun 26, 2020

This was referenced Jun 26, 2020

Improvements to package_revise #5472

Closed

revise improvements #5475

Merged

wardi added a commit that referenced this pull request Jun 30, 2020

[#5472] docs for #4618

ede9195

jqnatividad mentioned this pull request Dec 9, 2020

package_revise needs more elaboration in the documentation #5787

Open

Zharktas mentioned this pull request Mar 12, 2021

Cannot upload a file to Datastore through uWSGI + nginx and multiple uWSGI workers #5959

Open

paulmueller mentioned this pull request Sep 6, 2021

Not possible to upload resource via package_revise using requests #6360

Closed

mavocado4 mentioned this pull request Oct 12, 2022

Missing files fix RTIInternational/ckanext-searchterms#10

Merged
