Skip to content

Activity queries getting slow (~15 seconds) when following entities. #6028

Closed
@pdelboca

Description

CKAN version

Tested in 2.9.2 but probably also happening in master

Describe the bug

<lib/helpers.py:2567(new_activities)> spikes to ~15 seconds if the user is following 2 or 3 organizations (with around 200 datasets each) in a ~1000+ dataset system. This makes the system unusable since this helper is called in almost every request to the UI.

Without following organizations (50 ms):

Calls Total Time (ms) Per Call (ms) Cumulative Time (ms) Per Call (ms) Function
1 0.005 0.0050 52.439 52.4390 <lib/helpers.py:2567(new_activities)>
1 0.006 0.0060 52.374 52.3740 <logic/action/get.py:3216(dashboard_new_activities_count)>
1 0.016 0.0160 50.164 50.1640 <ckanext/gdx/logic/action.py:418(dashboard_activity_list)>

Following 3 organizations with 100 datasets each (1500 ms):

Calls Total Time (ms) Per Call (ms) Cumulative Time (ms) Per Call (ms) Function
1 0.01 0.0100 1610.417 1610.4170 <lib/helpers.py:2567(new_activities)>
1 0.018 0.0180 1610.26 1610.2600 <logic/action/get.py:3216(dashboard_new_activities_count)>
1 0.031 0.0310 1605.896 1605.8960 <ckanext/gdx/logic/action.py:418(dashboard_activity_list)>

Steps to reproduce

Steps to reproduce the behavior:

  • Create 3 organizations
  • Add ~100 datasets to each one (enough to test)
  • Login with a user
  • Open the home page
  • Using flask debug toolbar's profiler measure the response time of the activity related methods: <lib/helpers.py:2567(new_activities)>
  • It should be around 50 ms
  • Now follow the 3 organizations
  • Refresh the web page
  • The call to <lib/helpers.py:2567(new_activities)> it's now in 1500 ms (it will scale to ~20 seconds with a bigger database)

Expected behavior

The response time of the new activities feature will not impact on the UX of the site. (A delay time of ~100 ms?)

Additional details

The issue seems to be related to two main factors:

  1. How the overall logic to get the new activities count is being made
  2. How the internal queries to the activity models are being made

1. New Activities logic:

The logic queries everything from the database and then uses some business logic in the action layer to filter results. Even when this approach may be fine in most cases, it is creating a bottleneck in the database for this feature.

Example: Even when activities from the users are not counted, it queries them from the database, and then adds some logic to the returned dictionary to filter the count.

2. Internal queries

The logic to get activities from the database is a little bit complicated and not optimized. For each type of entity it:

  • Get every followee object (user following X object)
  • Creates and executes a select query for each of the previous objects
  • Creates a union of all the previous queries

ckan/ckan/model/activity.py

Lines 346 to 358 in abfda8b

def _activities_from_datasets_followed_by_user_query(user_id, limit):
'''Return a query for all activities from datasets that user_id follows.'''
import ckan.model as model
# Get a list of the datasets that the user is following.
follower_objects = model.UserFollowingDataset.followee_list(user_id)
if not follower_objects:
# Return a query with no results.
return model.Session.query(model.Activity).filter(text('0=1'))
return _activities_union_all(*[
_activities_limit(_package_activity_query(follower.object_id), limit)
for follower in follower_objects])

I played a little bit with queries. Instead of doing a union of select results, if I do a select filtering by id and the time of the query is reduced to a half (although It's not enough to reduce the time for a good UX).

def _activities_from_datasets_followed_by_user_query(user_id, limit):
    '''Return a query for all activities from datasets that user_id follows.'''
    import ckan.model as model

    # Get a list of the datasets that the user is following.
    follower_objects = model.UserFollowingDataset.followee_list(user_id)
    if not follower_objects:
        # Return a query with no results.
        return model.Session.query(model.Activity).filter(text('0=1'))

    packages_id = set([f.object_id for f in follower_objects])

    return model.Session.query(model.Activity).\
        filter(model.Activity.object_id.in_(packages_id)).\
            limit(limit)

Metadata

Assignees

Labels

No labels
No labels

Type

No type

Projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions