Description
CKAN version
Tested in 2.9.2 but probably also happening in master
Describe the bug
<lib/helpers.py:2567(new_activities)>
spikes to ~15 seconds if the user is following 2 or 3 organizations (with around 200 datasets each) in a ~1000+ dataset system. This makes the system unusable since this helper is called in almost every request to the UI.
Without following organizations (50 ms):
Calls | Total Time (ms) | Per Call (ms) | Cumulative Time (ms) | Per Call (ms) | Function |
---|---|---|---|---|---|
1 | 0.005 | 0.0050 | 52.439 | 52.4390 | <lib/helpers.py:2567(new_activities)> |
1 | 0.006 | 0.0060 | 52.374 | 52.3740 | <logic/action/get.py:3216(dashboard_new_activities_count)> |
1 | 0.016 | 0.0160 | 50.164 | 50.1640 | <ckanext/gdx/logic/action.py:418(dashboard_activity_list)> |
Following 3 organizations with 100 datasets each (1500 ms):
Calls | Total Time (ms) | Per Call (ms) | Cumulative Time (ms) | Per Call (ms) | Function |
---|---|---|---|---|---|
1 | 0.01 | 0.0100 | 1610.417 | 1610.4170 | <lib/helpers.py:2567(new_activities)> |
1 | 0.018 | 0.0180 | 1610.26 | 1610.2600 | <logic/action/get.py:3216(dashboard_new_activities_count)> |
1 | 0.031 | 0.0310 | 1605.896 | 1605.8960 | <ckanext/gdx/logic/action.py:418(dashboard_activity_list)> |
Steps to reproduce
Steps to reproduce the behavior:
- Create 3 organizations
- Add ~100 datasets to each one (enough to test)
- Login with a user
- Open the home page
- Using flask debug toolbar's profiler measure the response time of the activity related methods:
<lib/helpers.py:2567(new_activities)>
- It should be around
50 ms
- Now follow the 3 organizations
- Refresh the web page
- The call to
<lib/helpers.py:2567(new_activities)>
it's now in 1500 ms (it will scale to ~20 seconds with a bigger database)
Expected behavior
The response time of the new activities feature will not impact on the UX of the site. (A delay time of ~100 ms?)
Additional details
The issue seems to be related to two main factors:
- How the overall logic to get the new activities count is being made
- How the internal queries to the activity models are being made
1. New Activities logic:
The logic queries everything from the database and then uses some business logic in the action layer to filter results. Even when this approach may be fine in most cases, it is creating a bottleneck in the database for this feature.
Example: Even when activities from the users are not counted, it queries them from the database, and then adds some logic to the returned dictionary to filter the count.
2. Internal queries
The logic to get activities from the database is a little bit complicated and not optimized. For each type of entity it:
- Get every followee object (user following X object)
- Creates and executes a select query for each of the previous objects
- Creates a union of all the previous queries
Lines 346 to 358 in abfda8b
I played a little bit with queries. Instead of doing a union of select results, if I do a select filtering by id and the time of the query is reduced to a half (although It's not enough to reduce the time for a good UX).
def _activities_from_datasets_followed_by_user_query(user_id, limit):
'''Return a query for all activities from datasets that user_id follows.'''
import ckan.model as model
# Get a list of the datasets that the user is following.
follower_objects = model.UserFollowingDataset.followee_list(user_id)
if not follower_objects:
# Return a query with no results.
return model.Session.query(model.Activity).filter(text('0=1'))
packages_id = set([f.object_id for f in follower_objects])
return model.Session.query(model.Activity).\
filter(model.Activity.object_id.in_(packages_id)).\
limit(limit)
Metadata
Assignees
Labels
Type
Projects
Status
Done