
[Feature Request]: Revamped comments handling - separate documents in ES and pagination #678

dot-mike opened this issue Mar 9, 2024 · 3 comments


dot-mike commented Mar 9, 2024

Feature request

This feature request pertains to slowness observed in large environments (think >100k comments per video). I was working on automatically detecting and handling timestamps in comments (#677), but it seems the application is not yet optimized for this scale, as all the comments for a video are stored in a single JSON document.

Upon investigation, I've identified that parsing a single large document is causing slowness during data transformation.

Ideally, each comment thread should be stored as an individual document in Elasticsearch with a unique ID and tagged with the youtube_id as a key.

This would enable pagination/scrolling while viewing comments, and would also let us index and search individual comments efficiently.

Furthermore, this should not introduce any major slowness, as only a limited set of comments would be fetched for a particular video instead of one whole large document.
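
To illustrate the pagination idea, here is a minimal sketch of a per-video page query once threads are separate documents, assuming the official elasticsearch Python client; the index and field names (ta_comment_v2, youtube_id, comment_timestamp) are assumptions:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def get_comment_page(youtube_id, page=0, page_size=50):
    """Fetch one page of comment threads for a single video."""
    response = es.search(
        index="ta_comment_v2",  # hypothetical new index
        query={"term": {"youtube_id": {"value": youtube_id}}},
        sort=[{"comment_timestamp": {"order": "desc"}}],  # assumed field
        from_=page * page_size,
        size=page_size,
    )
    return [hit["_source"] for hit in response["hits"]["hits"]]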


@lamusmaser and I had a great discussion about this on Discord on Sunday the 10th. Having a separate document per parent comment, with the parent comment and its child comments in the same document, is the way to go first. Ideally, every comment would be its own document using a parent/child relationship, but that can be a later goal.

A new index, ta_comment_v2, will store the documents going forward. This implementation will require a significant rewrite of how comments are handled, so it makes the most sense to use a new index.
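
To illustrate, a thread document in ta_comment_v2 could look roughly like this; the field names are assumptions, loosely following what yt-dlp returns per comment:

comment_thread = {
    "youtube_id": "UWb5Qc-fBvk",  # video the thread belongs to
    "comment_id": "...",  # id of the root comment
    "comment_text": "...",
    "comment_author": "...",
    "comment_timestamp": 0,
    "comment_replies": [  # child comments kept in the same document
        {"comment_id": "...", "comment_text": "...", "comment_author": "..."},
    ],
}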

For migrating existing data, we briefly discussed that a new Celery task should handle this as a background job after the application has started, so that it does not block startup. Comments are not a critical feature for the application to function.
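
A rough sketch of what such a migration task could look like, assuming Celery and the elasticsearch Python client; the index names, the comment_comments field, and the thread-grouping logic are assumptions, and a real implementation would need error handling and progress reporting:

from celery import shared_task
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

def build_threads(comments):
    """Group a flat yt-dlp comment list into parent threads (hypothetical shape)."""
    threads = {}
    for comment in comments:
        parent = comment.get("parent", "root")
        if parent == "root":
            entry = threads.setdefault(comment["id"], {"replies": []})
            entry.update(comment)
        else:
            threads.setdefault(parent, {"replies": []})["replies"].append(comment)
    return threads.values()

@shared_task
def migrate_comments():
    """Split old per-video comment documents into per-thread documents."""
    for old_doc in helpers.scan(es, index="ta_comment"):
        youtube_id = old_doc["_id"]  # the old index uses the video id as document id
        comments = old_doc["_source"].get("comment_comments", [])  # assumed field
        actions = (
            {
                "_index": "ta_comment_v2",
                # no "_id" key: let ES auto-generate the document ids
                "_source": {"youtube_id": youtube_id, **thread},
            }
            for thread in build_threads(comments)
        )
        helpers.bulk(es, actions)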

We also figured out that letting ES generate the document ID is the best option, citing the official ES docs:

Use auto-generated ids
When indexing a document that has an explicit id, Elasticsearch needs to check whether a document with the same id already exists within the same shard, which is a costly operation and gets even more costly as the index grows. By using auto-generated ids, Elasticsearch can skip this check, which makes indexing faster.
(Today we use an explicit ID, which is the video ID.)
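
In client terms, the difference is just whether we pass an id when indexing; a minimal sketch with the elasticsearch Python client:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
comment_doc = {"youtube_id": "UWb5Qc-fBvk"}  # placeholder document

# current approach: an explicit id forces a per-shard uniqueness check
es.index(index="ta_comment", id="UWb5Qc-fBvk", document=comment_doc)

# proposed approach: omit the id and let ES auto-generate one
es.index(index="ta_comment_v2", document=comment_doc)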

Problems to solve

  • Migration of current comment documents to new thread-based documents.
  • Comment metadata refresh should update corresponding documents for the comment thread.
  • yt-dlp should yield comments in batches so we can bulk upload; if the process is somehow interrupted, not all data is lost. See the bulk-upload sketch after this list.
  • Comments should be treated as a thread with a parent (root) comment and child comments. Each thread is a separate document containing the root comment and its child comments.
  • Automatic pagination of comments in the API.
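
A minimal sketch of the batched bulk upload mentioned above, assuming the elasticsearch Python client; the batch size and document shape are assumptions:

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

def index_comment_batches(youtube_id, comments, batch_size=500):
    """Bulk-index comment threads in chunks so an interruption loses at most one batch."""
    for start in range(0, len(comments), batch_size):
        batch = comments[start : start + batch_size]
        actions = (
            {
                "_index": "ta_comment_v2",
                # no explicit "_id": ES auto-generates the document ids
                "_source": {"youtube_id": youtube_id, **comment},
            }
            for comment in batch
        )
        helpers.bulk(es, actions)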

Your help is needed!

  • Yes, I will work on this in the next few days or weeks.

bbilly1 commented Mar 10, 2024

Give me some time to look into this. Do you have a video example with 100k+ comments?

bbilly1 added the enhancement and question labels Mar 10, 2024
@dot-mike
Contributor Author

@bbilly1 Hello there! Thanks for checking in. Yes, here is an example: https://www.youtube.com/watch?v=UWb5Qc-fBvk


bbilly1 commented Mar 11, 2024

OK, did some investigating. First documenting the comment count problem/bug:

Not specifying any extractor args will return all comments:

import yt_dlp

yt_obs = {
    "skip_download": True,
    "getcomments": True,
    "ignoreerrors": True,
    "socket_timeout": 10,
    "extractor_retries": 3,
    "retries": 10,
}
response = yt_dlp.YoutubeDL(yt_obs).extract_info("UWb5Qc-fBvk")

That logs:

[debug] Encodings: locale UTF-8, fs utf-8, pref UTF-8, out utf-8, error utf-8, screen utf-8
[debug] yt-dlp version [email protected] from yt-dlp/yt-dlp [615a84447] (pip) API
[debug] params: {'skip_download': True, 'getcomments': True, 'ignoreerrors': True, 'socket_timeout': 10, 'extractor_retries': 3, 'retries': 10, 'verbose': True, 'compat_opts': set(), 'http_headers': {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.93 Safari/537.36', 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 'Accept-Language': 'en-us,en;q=0.5', 'Sec-Fetch-Mode': 'navigate'}, 'forceprint': {}, 'print_to_file': {}, 'outtmpl': {'default': '%(title)s [%(id)s].%(ext)s', 'chapter': '%(title)s - %(section_number)03d %(section_title)s [%(id)s].%(ext)s'}}
[debug] Python 3.11.3 (CPython x86_64 64bit) - Linux-5.15.0-100-generic-x86_64-with-glibc2.31 (OpenSSL 1.1.1n  15 Mar 2022, glibc 2.31)
[debug] exe versions: ffmpeg N-114103-gda39a19aad-20240311 (setts), ffprobe N-114103-gda39a19aad-20240311
[debug] Optional libraries: Cryptodome-3.20.0, brotli-1.1.0, certifi-2024.02.02, mutagen-1.47.0, requests-2.31.0, sqlite3-3.34.1, urllib3-2.2.1, websockets-12.0
[debug] Proxy map: {}
[debug] Request Handlers: urllib, requests, websockets
[debug] Loaded 1803 extractors
[youtube] Extracting URL: UWb5Qc-fBvk
[youtube] UWb5Qc-fBvk: Downloading webpage
[youtube] UWb5Qc-fBvk: Downloading ios player API JSON
[youtube] UWb5Qc-fBvk: Downloading android player API JSON
[youtube] UWb5Qc-fBvk: Downloading m3u8 information
[debug] Sort order given by extractor: quality, res, fps, hdr:12, source, vcodec:vp9.2, channels, acodec, lang, proto
[debug] Formats sorted by: hasvid, ie_pref, quality, res, fps, hdr:12(7), source, vcodec:vp9.2(10), channels, acodec, lang, proto, size, br, asr, vext, aext, hasaud, id
[youtube] Downloading comment section API JSON
[youtube] Downloading ~116416 comments
[youtube] Sorting comments by newest first
[youtube] Downloading comment API JSON page 1 (0/~116416)
[youtube] Downloading comment API JSON page 2 (20/~116416)
[youtube] Downloading comment API JSON page 3 (40/~116416)
[youtube] Downloading comment API JSON page 4 (60/~116416)
[youtube]     Downloading comment API JSON reply thread 1 (77/~116416)
[youtube] Downloading comment API JSON page 5 (81/~116416)
[youtube] Downloading comment API JSON page 6 (101/~116416)
[youtube] Downloading comment API JSON page 7 (121/~116416)
[youtube]     Downloading comment API JSON reply thread 1 (129/~116416)
[youtube]     Downloading comment API JSON reply thread 2 (134/~116416)
[youtube]     Downloading comment API JSON reply thread 3 (139/~116416)
[youtube] Downloading comment API JSON page 8 (147/~116416)
...trimmed

And it will keep going for around 30 minutes and extract all of them:

len(response["comments"])
116357

But specifying "all" in the extractor args won't, e.g.:

yt_obs = {
    "skip_download": True,
    "getcomments": True,
    "ignoreerrors": True,
    "socket_timeout": 10,
    "extractor_retries": 3,
    "retries": 10,
    "extractor_args": {
        "youtube": {
            "max_comments": ["all", "all", "all", "all"],
            "comment_sort": ["top"],
        }
    },
    "verbose": True,
}
response = yt_dlp.YoutubeDL(yt_obs).extract_info("UWb5Qc-fBvk")

That will only extract ~4k comments:

[debug] Encodings: locale UTF-8, fs utf-8, pref UTF-8, out utf-8, error utf-8, screen utf-8
[debug] yt-dlp version [email protected] from yt-dlp/yt-dlp [615a84447] (pip) API
[debug] params: {'skip_download': True, 'getcomments': True, 'ignoreerrors': True, 'socket_timeout': 10, 'extractor_retries': 3, 'retries': 10, 'extractor_args': {'youtube': {'max_comments': ['all', 'all', 'all', 'all'], 'comment_sort': ['top']}}, 'verbose': True, 'compat_opts': set(), 'http_headers': {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.93 Safari/537.36', 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 'Accept-Language': 'en-us,en;q=0.5', 'Sec-Fetch-Mode': 'navigate'}}
[debug] Python 3.11.3 (CPython x86_64 64bit) - Linux-5.15.0-100-generic-x86_64-with-glibc2.31 (OpenSSL 1.1.1n  15 Mar 2022, glibc 2.31)
[debug] exe versions: ffmpeg N-114103-gda39a19aad-20240311 (setts), ffprobe N-114103-gda39a19aad-20240311
[debug] Optional libraries: Cryptodome-3.20.0, brotli-1.1.0, certifi-2024.02.02, mutagen-1.47.0, requests-2.31.0, sqlite3-3.34.1, urllib3-2.2.1, websockets-12.0
[debug] Proxy map: {}
[debug] Request Handlers: urllib, requests, websockets
[debug] Loaded 1803 extractors
[youtube] Extracting URL: UWb5Qc-fBvk
[youtube] UWb5Qc-fBvk: Downloading webpage
[youtube] UWb5Qc-fBvk: Downloading ios player API JSON
[youtube] UWb5Qc-fBvk: Downloading android player API JSON
[youtube] UWb5Qc-fBvk: Downloading m3u8 information
[debug] Sort order given by extractor: quality, res, fps, hdr:12, source, vcodec:vp9.2, channels, acodec, lang, proto
[debug] Formats sorted by: hasvid, ie_pref, quality, res, fps, hdr:12(7), source, vcodec:vp9.2(10), channels, acodec, lang, proto, size, br, asr, vext, aext, hasaud, id
[youtube] Downloading comment section API JSON
[youtube] Downloading ~116416 comments
[youtube] Sorting comments by top comments
[youtube] Downloading comment API JSON page 1 (0/~116416)
[youtube]     Downloading comment API JSON reply thread 1 (1/~116416)
[youtube]     Downloading comment API JSON reply thread 2 (4/~116416)
[youtube]     Downloading comment API JSON reply thread 3 (9/~116416)
[youtube]     Downloading comment API JSON reply thread 4 (13/~116416)
[youtube]        Downloading comment replies API JSON page 1 (23/~116416)
[youtube]     Downloading comment API JSON reply thread 5 (27/~116416)
[youtube]        Downloading comment replies API JSON page 1 (37/~116416)
[youtube]        Downloading comment replies API JSON page 2 (82/~116416)
...trimmed
[youtube] Downloading comment API JSON page 63 (3980/~116416)
[youtube]     Downloading comment API JSON reply thread 1 (3981/~116416)
[youtube]     Downloading comment API JSON reply thread 2 (4001/~116416)
[youtube] Downloading comment API JSON page 64 (4002/~116416)
[youtube] Extracted 4003 comments
[debug] Default format spec: bestvideo*+bestaudio/best
[info] UWb5Qc-fBvk: Downloading 1 format(s): 616+251

Confirming:

len(response["comments"])
4003

Can you confirm that? If you can, I'll bring it up with the yt-dlp team, as that shouldn't be expected behavior.

BTW, the big 100k+ comment sets can be indexed just fine with a single request; you'll just have to increase the heap size of ES to something like 2G, or not define it at all by removing this line: https://github.com/tubearchivist/tubearchivist/blob/master/docker-compose.yml#L48. Getting these comments back out isn't too bad either, though it will take a few seconds to build the markup in JS...
[Screenshot: the 100k+ comments rendered on the video page]

All glorious comments. :-)

Also, building the comments in the frontend could easily be done asynchronously in JS; that would require minimal code changes.

Some thoughts:

  • Index time is not a concern here; this is not a high-traffic webshop where a few hundred milliseconds matter. Everything already happens asynchronously in tasks anyway. Plus, after spending about 30 minutes getting all these comments, you can wait a few more milliseconds until they are indexed.
  • Pagination, or at least a load-more button, could still be a good idea to implement, although I wouldn't change how it's indexed at the moment in one big happy JSON document. More exact searching in comments could be interesting later, too. I think all of that could be implemented using nested fields (see the sketch after this list), although I haven't tested it in detail...
  • This would avoid any complicated migrations, but still needs an index rebuild with the new mapping...
  • This would fix the main problem where the user is waiting for comments to load and the interface is blocked (although that could be made async).
  • This would allow for future searching of comments.
  • This adds just minimal complexity.
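
On the nested-fields idea above, an untested sketch of what the mapping and a comment search could look like; the index name and field names are assumptions:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# map comments as nested so each comment is matched as its own sub-document
es.indices.create(
    index="ta_video_nested",  # hypothetical index name
    mappings={
        "properties": {
            "comment_comments": {
                "type": "nested",
                "properties": {"comment_text": {"type": "text"}},
            }
        }
    },
)

# search inside individual comments instead of the whole blob
es.search(
    index="ta_video_nested",
    query={
        "nested": {
            "path": "comment_comments",
            "query": {"match": {"comment_comments.comment_text": "some phrase"}},
        }
    },
)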

Sorry for the wall of text, but does that make sense?
