[Feature Request]: Revamped comments handling - separate documents in ES and pagination #678
Comments
Give me some time to look into this. Do you have a video example with 100k+ comments?
@bbilly1 Hello there! Thanks for checking in. Yes, here is an example: https://www.youtube.com/watch?v=UWb5Qc-fBvk
OK, did some investigating. First, documenting the comment count problem/bug. This will return all comments when not specifying any extractor args:

```python
import yt_dlp

# no extractor_args specified: comments are fetched without a limit
yt_obs = {
    "skip_download": True,
    "getcomments": True,
    "ignoreerrors": True,
    "socket_timeout": 10,
    "extractor_retries": 3,
    "retries": 10,
}
response = yt_dlp.YoutubeDL(yt_obs).extract_info("UWb5Qc-fBvk")
```

That logs:
And will keep going for like 30 minutes and extract all:
But specifying all in the extractor keys won't, e.g.:

```python
yt_obs = {
    "skip_download": True,
    "getcomments": True,
    "ignoreerrors": True,
    "socket_timeout": 10,
    "extractor_retries": 3,
    "retries": 10,
    # explicitly requesting "all" comments via extractor_args
    "extractor_args": {
        "youtube": {
            "max_comments": ["all", "all", "all", "all"],
            "comment_sort": ["top"],
        }
    },
    "verbose": True,
}
response = yt_dlp.YoutubeDL(yt_obs).extract_info("UWb5Qc-fBvk")
```

That will extract 4k comments:
Confirming:
Can you confirm that? If you can, I'll bring it up with the yt-dlp team, as that shouldn't be expected behavior.

BTW, the big 100k+ comment sets can index just fine with a single request, you'll just have to increase the heap size of ES to something like 2G, or not define it at all by removing this line: https://github.com/tubearchivist/tubearchivist/blob/master/docker-compose.yml#L48. Also getting these comments back out isn't too bad either; it will take a few seconds to build the markup in JS though... All glorious comments. :-) Building the comments in the frontend could also easily be done async in JS, which would require minimal code change.

Some thoughts:
Sorry for the wall of text, but does that make sense?
Feature request
This feature request pertains to slowness observed in large environments (think >100K comments per video). I was working on automatically detecting and handling timestamps in comments (#677), but it seems the application is not yet optimized for scalability, as all the comments are stored in a single JSON file.
Upon investigation, I've identified that parsing a single large document is causing slowness during data transformation.
Ideally, each comment thread should be stored as an individual document in Elasticsearch with a unique ID and tagged with the youtube_id as a key.
This would enable pagination/scrolling while viewing the comments, and would also let us index and search individual comments efficiently.
Furthermore, this should not introduce any major slowness, as only a limited set of comments would be fetched for a particular video instead of one whole large document.
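For illustration, here is a minimal sketch (using the elasticsearch Python client) of how per-thread documents tagged with youtube_id could be queried page by page; the index name, field names, and page size are assumptions, not a finalized design:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def get_comment_page(youtube_id: str, page: int, page_size: int = 20) -> list[dict]:
    """Fetch one page of comment threads for a video.

    Assumes each parent comment (with its replies) is stored as its own
    document tagged with the video's youtube_id; index and field names
    here are hypothetical.
    """
    response = es.search(
        index="ta_comment_v2",
        query={"term": {"youtube_id": youtube_id}},
        sort=[{"comment_likecount": {"order": "desc"}}],
        from_=page * page_size,
        size=page_size,
    )
    return [hit["_source"] for hit in response["hits"]["hits"]]
```

Since from + size is capped at 10,000 hits per index by default, deep pagination on the largest videos would likely need search_after instead, but the idea is the same.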
@lamusmaser and I had a great discussion about this on Discord on Sunday the 10th. Having a separate document for each parent comment, with the parent comment and its child comments in the same document, is the way to go first. Ideally, all comments would be in separate documents using a parent/child relationship, but that can be a later goal.
There will be a new index, ta_comment_v2, to store the documents going forward. This implementation will require a significant rewrite of how comments are handled, so it makes the most sense to use a new index. For the migration of existing data, we briefly discussed that a new Celery task must be implemented, and that it should run as a background job after the application has started so it does not block startup. Comments are not a critical feature for the application to function.
We also figured out that letting ES generate the document ID is the best option, citing ES official docs:
Problems to solve
Your help is needed!