Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(github): simplify identity caching #162

Merged
merged 5 commits into from
May 15, 2024
Merged

feat(github): simplify identity caching #162

merged 5 commits into from
May 15, 2024

Conversation

vit-zikmund
Copy link
Collaborator

This is an update on github auth plugin caching, which I eventually found problematic (then I made it ugly, sorted my thoughts, and finally made it better than it was). I'd squash it, but then you'd lose the commit genesis, which might be interesting, so I'll keep it for the PR (which I'd still rather recommend squash-merging 😄)

The change is mostly simplifying it from the user perspective. I also learned a new thing - the weakref library that allows one to do stuff when an object's refcount goes to 0. I used that to manage a dictionary of unique users I need to keep around to track one user's multiple tokens. Because cachetools caches don't have any native mechanism to notify anyone that an entry got evicted, I originally had size-capped caches for both the tokens and users, but that created a kind of a split-brain problem, where the user referenced from both caches could get evicted independently, and that was bad.

Now I'm keeping track of the users in a size-unlimited cache which - thanks to weak references - can clean itself up after all its references in the primary cache get evicted. This keeps the user cache fresh and not growing forever, which would otherwise likely require some periodic cleanup.

While working on providing documentation for the GitHub auth provider, I ran into a pickle trying to explain the token and user caches, specifically consequences of entries being evicted from one or the other.

Because of how they were implemented, this explanation ended up too convoluted to even try writing it down and it all suddenly felt wrong.

The thing is that each GithubIdentity object holds an authorization cache for all repos ever encountered for the particular user. Formerly these identities were cached in both the "raw-user-identifiaction-data" (aka "user") and the "token" cache, which brought mind-bending questions what will happen if either of those cached entries get evicted and a new request from that user is made. I realized that there's a case where an evicted "user" cache entry (that's still held in the "token" cache) will cause a dual-same-identity state when a new token of the same user will get in. Sheesh!

The fix is to decouple the caches, so only the "user" cache will actually hold the GithubIdentities (w/auth cache), while the "token" cache will hold the raw static data (newly encapsulated in the _RawGithubIdentity dataclass later becoming part of the GithubIdentity, w/o any copying, yay).

This allows me to keep the caches agnostic of each other, the code is easier to read and manage, with only a little drawback, that's two cache searches instead of the previous one. It's not the bestest, really, but it made the code much more predictable and manageable.

On the way I also did a little refactoring:
- Stripped the GithubIdentity of some comparison dunders that were found not to be helping anything. Now any change in the _RawGithubIdentity attributes will be a cache miss, which really doesn't hurt anything, as the cache is limited by size and the stale identity will get eventually evicted.
- GithubIdentity.login usage has been switched to .id and old .id moved to .github_id for better alignment with the common auth code.
- Improved return value typing for the cache decorators to avoid one dumb cast in GithubAuthenticator._get_user_cached only to hit a mypy bug that eventually needs another cast, but at least that one stays inside the decorator logic.
The schismatic user and token cache is finally over! The need for a set of currently cached users (to prevent tracking duplicate identities) is now resolved using a weakref.WeakValueDictionary, which automatically clears up with all the referenced identites evicted from the token cache. This allows me to hide this user cache away from the operator, because it should never grow past the size of the token cache itself (and thus its max size is loosely bound to the token cache).
Caching the token -> identity pairs again also gets us back to one search operation for an incoming token.
The default Identity class had a couple class variables (name, id, email) that seemed more like a protocol kind of thing (stating an _instance_ o the class has those attributes), but being it an ABC that's inherited from made the inherited classes never lose those dud class vars, although the instances got their own (as seen in DefaultIdentity).

This might have so far been a little clutter and nothing else, but sice the `GithubIdentity(Identity)` uses an auxiliary dataclass to hold part of these variables, I had to resort to a rather complicated `__getattribute__` override to make access of these aux attributes transparent (because a much simpler `__getattr__` override wasn't possible due to it accessing the dud class vars first).

Providing a simple `__init__` implementation for the base `Identity` works perfectly fine with static checkers and lets me avoid the recursive hell of `__getattribute__` override.
@@ -24,6 +25,10 @@


# THREAD SAFE CACHING UTILS
# original type preserving "return type" for the decorators below
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mypy is getting pretty impressive.

self._token_cache: MutableMapping[
Any, GithubIdentity
] = cachetools.LRUCache(maxsize=cfg.cache.token_max_size)
# unique user identities, to get the same identity that's
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very clever.

@@ -372,3 +368,68 @@ def test_github_auth_request_cached(app: flask.Flask) -> None:
assert identity.is_authorized(ORG, REPO, Permission.WRITE)
assert user_resp.call_count == 1
assert perm_resp.call_count == 1

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like the use of call_count here to prove what happened.

Copy link
Collaborator

@athornton athornton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is excellent.

@athornton
Copy link
Collaborator

If you want to squash, go ahead. I am not bothered by messy commit history the way some people are. I'll merge this tomorrow (15 May 2024) if no one objects.

@vit-zikmund
Copy link
Collaborator Author

Then I'm not bothered either. I'd only break something in an attempt to do so 🙈

@athornton athornton merged commit f105957 into datopian:main May 15, 2024
7 checks passed
@vit-zikmund vit-zikmund deleted the fix-cache branch May 16, 2024 10:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants