Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Another Matroska Attempt #1149

Draft
wants to merge 4 commits into
base: master
Choose a base branch
from
Draft

Conversation

complexlogic
Copy link
Contributor

@complexlogic complexlogic commented Sep 19, 2023

Since the other PR for Matroska (#967) hasn't had any progress in several years, I decided to give it a try myself. I think that the Matroska container format has a lot of potential for audio and would like to see it supported in TagLib.

Matroska is different from other metadata formats in that it uses nested/hierarchical elements to store data. I designed an API which internally parses the nested EBML structure and generates a flat list of tag objects to present to the user. This allows Matroska files to be parsed in a similar manner to the other formats in TagLib.

The user can add/remove tags to the list, and manipulate the tags' attributes (for example, the target level). When the file is written, the nested EBML structure will be re-rendered from the flat list of tags. None of the EBML-related related code is part of the public API.

I created the programs examples/matroskareader.cpp and examples/matroskawriter.cpp to demonstrate how the API works.

Current Development Status

  • Reading Files
  • Writing Files
  • Chapter support
  • Attachments support (needed for cover art)
  • Audio Properties
  • Complex Properties
  • PropertyMap
  • Documentation
  • Unit tests

@ufleisch
Copy link
Contributor

Thanks for your efforts in moving Matroska support forward. This would really be a useful addition to TagLib. I would be really interested in this feature and want to encourage you to go on. But I think that before merging the code, it should also support writing metadata. Reading and writing should work for the elements which are commonly used in Matroska/Webm files. The problem here is that I do not have any mka files, and the mkv files I have are mostly grabbed from DVD using Handbrake. So probably it is a bit hard for me to find out which elements are commonly used.

Before looking at the code, I wanted to get an overview of what works. I created a tiny mkv-file for test purposes using

ffmpeg -t 0.1 -s qcif -f rawvideo -pix_fmt rgb24 -r 25 -i /dev/zero zero_second.mkv

When I use matroskareader on it, I get

Tag Name: ENCODER
Tag Value: Lavf60.3.100
Target Type Value: None

Tag Name: ENCODER
Tag Value: Lavc60.3.100 libx264
Target Type Value: None

Tag Name: DURATION
Tag Value: 00:00:00.120000000
Target Type Value: None

This is also what I see below Tags when using mkvinfo. Now I wanted to add some tags.

mkvpropedit file.mkv -e info -s title="Title"
mkvpropedit file.mkv -e info -s date=2023-09-23T04:30:00Z
mkvpropedit file.mkv -e info -s segment-filename="Segment Filename"
mkvpropedit file.mkv -e info -s prev-filename="Prev Filename"
mkvpropedit file.mkv -e info -s next-filename="Next Filename"
mkvpropedit file.mkv -e info -s muxing-application="Muxing Application"
mkvpropedit file.mkv -e info -s writing-application="Writing Application"
mkvpropedit file.mkv -e track:1 -s language=en
mkvpropedit file.mkv -e track:1 -s name="Name 1"

I can see this information using mkvinfo -a and mediainfo, but not using matroskareader - it still shows the same three tags.

Now I tried to add some tags using VLC. Unfortunately, this does not seem to work anymore (see bugreport). VLC is using TagLib, maybe this bug will not be fixed until TagLib supports Matroska.

Then I tried Handbrake by converting my tiny mkv file to another mkv file and setting the tags in the Handbrake UI. mediainfo shows the following information:

Movie name                               : handbrake
Description                              : Description
Released date                            : 2023
Encoded date                             : 2023-09-22 15:58:25 UTC
Writing application                      : HandBrake 1.6.1 2023032600
Writing library                          : Lavf59.27.100
ErrorDetectionType                       : Per level 1
ARTIST                                   : Actors
DIRECTOR                                 : Director
GENRE                                    : Genre
SUMMARY                                  : Comment
SYNOPSIS                                 : Plot

mkvinfo also displays the entered information:

|+ Segment information
| + Timestamp scale: 1000000
| + Title: handbrake
| + Multiplexing application: Lavf59.27.100
| + Writing application: HandBrake 1.6.1 2023032600
| + Segment UID: 0x43 0x02 0xc3 0x50 0x41 0xce 0xf3 0x78 0xa1 0xe3 0x1d 0x6e 0xac 0xb7 0x18 0x8b
| + Date: 2023-09-22 15:58:25 UTC
| + Duration: 00:00:00.120000000
(..)
|+ Tags
| + Tag
|  + Targets
|  + Simple
|   + Name: DIRECTOR
|   + String: Director
|  + Simple
|   + Name: ARTIST
|   + String: Actors
|  + Simple
|   + Name: SUMMARY
|   + String: Comment
|  + Simple
|   + Name: DESCRIPTION
|   + String: Description
|  + Simple
|   + Name: GENRE
|   + String: Genre
|  + Simple
|   + Name: SYNOPSIS
|   + String: Plot
|  + Simple
|   + Name: DATE_RELEASED
|   + String: 2023
|  + Simple
|   + Name: ENCODER
|   + String: Lavf59.27.100
| + Tag
|  + Targets
|   + Track UID: 9584013959154292683
|  + Simple
|   + Name: DURATION
|   + String: 00:00:00.120000000

Also matroskareader shows this information correctly.

Tag Name: DIRECTOR
Tag Value: Director
Target Type Value: None

Tag Name: ARTIST
Tag Value: Actors
Target Type Value: None

Tag Name: SUMMARY
Tag Value: Comment
Target Type Value: None

Tag Name: DESCRIPTION
Tag Value: Description
Target Type Value: None

Tag Name: GENRE
Tag Value: Genre
Target Type Value: None

Tag Name: SYNOPSIS
Tag Value: Plot
Target Type Value: None

Tag Name: DATE_RELEASED
Tag Value: 2023
Target Type Value: None

Tag Name: ENCODER
Tag Value: Lavf59.27.100
Target Type Value: None

Tag Name: DURATION
Tag Value: 00:00:00.120000000
Target Type Value: None

This already looks promising! What seems to be missing is the title, in this case "handbrake", which is shown by mkvinfo below "Segment information" and not "Tags". But it seems that not all applications use that title. If I have a look at the example files from matroska-test-files, they use a TITLE tag, which is correctly shown by matroskareader.

Tag Name: TITLE
Tag Value: Big Buck Bunny - test 1
Target Type Value: None

Tag Name: DATE_RELEASED
Tag Value: 2010
Target Type Value: None

Tag Name: COMMENT
Tag Value: Matroska Validation File1, basic MPEG4.2 and MP3 with only SimpleBlock
Target Type Value: None

This is exactly what is contained in the corresponding tag XML file.

tagreader, however, does not show any tags for the mentioned test files. But the mapping of Matroska elements to "standard" tags and TagLib properties should not be a problem once we know what is usually used. Probably, you have more information on that and more realistic example files. But as I said, this looks promising, and I am looking forward to seeing write support.

@complexlogic
Copy link
Contributor Author

Thanks for the feedback @ufleisch. I recommend reading this page from the Matroska specification for proper tags to use for audio.

I'm going to keep developing this feature and will update the PR with my progress.

@complexlogic
Copy link
Contributor Author

Updates:

  • Writing tags is now supported.
  • PropertyMap support has been implemented
  • FileRef has been implemented for .mka, .mkv, and .webm extensions.
  • Matroska/WebM files work with both tagreader and tagwriter

The current code always rewrites the file when the new tag is a different size from the old tag size (which is usually the case). This can be slow with large file sizes. I will eventually optimize the writing to intelligently use void elements as padding to avoid rewrites when possible, similar to what is done for other file types.

For generating sample/test files, I don't recommend using ffmpeg or any program based on it (such as Handbrake). ffmpeg's Matroska metadata implementation is not even close to compliant with the specification. Based on my research, Foobar2000 seems to be the only program with a decent Matroska audio implementation. So I'm using that as the reference implementation where the official Matroska spec is ambiguous.

I plan to work on attachments next, for cover art support.

@ufleisch
Copy link
Contributor

ufleisch commented Oct 1, 2023

Great, thanks! Concerning rewriting the file when the tag changes: Most audio format implementations in TagLib avoid this by using some padding (such as void elements). But I personally think that it is important to be able to bring the file back into a deterministic state without padding when the tag is completely removed, this should be possible with all audio formats supported by TagLib. You could implement it like this:

  • If the tag is empty after save(), make sure that it is completely removed.
  • The default save() method could be optimized to be fast by using padding instead of shrinking the file.
  • There could be an overload save(bool shrink) which can be called with shrink=true to save without padding.

I will have a look at the changes.

@ufleisch
Copy link
Contributor

I had a look at the current state of the Matroska code, and progress looks promissing. I only have a few remarks:

It gives me a good feeling if I can edit the metadata of a file and am then able to revert the changes thereby bringing the file into the exact state it was before. So I used tagwriter to set some values (standard tags and properties) on a Matroska file, then delete them again and look if the file is left as it was before. I was not able to delete properties using the -D option of tagwriter - no surprise if I look into matroskatag.cpp, where there is a return false before removeSimpleTags(). So I just tried to delete standard tags, this worked, but there were some leftovers from -y and -T: DATE_RELEASED/50, PART_NUMBER/30 with values 0, probably these should be removed too.

Regarding the state of the file after the tags have been removed: It was no longer the same, the tags are removed, but the previously existant tags have been changed (Language from "Not set" to "und"). I would prefer if tags which are not edited would be kept as they were, so that the file is brought back into its previous state.

I also had a look at how tags written by other applications are recognized. This looked quite good, only the year was displayed as 0. This is because foobar2000 uses DATE_RECORDED/30, Mp3Tag DATE_RECORDED/50, but tagwriter uses DATE_RELEASED/50. Given that TDRC in ID3v2.4.0 is "Recording time", I would also take DATE_RECORDED. Tags from Handbrake were still not recognized, probably because all tags from Handbrake have a "Target Type Value: None". Maybe TagLib could be a bit more forgiving here.

Concerning the code itself, it is too early to fix minor styling issues, but I have one remark. In matroskatag.h there are a few template functions in the public header file. Probably these "simple tag" template functions could be moved into matroskatag.cpp as members of TagPrivate, then they could directly access tags and would not need a simpleTagsListPrivate() function.

But as I said, it looks promissing and I am looking forward to seeing further progress.

@complexlogic
Copy link
Contributor Author

Thanks for the feedback.

I'm continuing to work on this locally. I have attachments working, but I have discovered some other issues:

  1. The code does not update the Seek Head element. This is necessary when adding a new Tags element, or when the size of an existing Tags element changes (and shifts the offset of other elements in the Seek Head)
  2. The code does not update the cue points. This is necessary when the size of a Tags element changes. Each cue point stores an offset, which needs to be adjusted when the size of any upstream element changes.

If you run mkvalidator on a file written by tagwriter, it will most likely fail due to the issues above. I'm currently working on fixes for this. In the meantime, make sure you only run the code on test files, or files that you have backups for.

@neheb
Copy link
Contributor

neheb commented Dec 25, 2023

Conflicts

@complexlogic
Copy link
Contributor Author

@neheb I'm working on some changes locally to fix the issues I described in #1149 (comment). After I have that done, I will rebase before pushing again.

@ufleisch
Copy link
Contributor

@complexlogic I just want to ask if you are still working on this. If you have local changes, you could push them to the branch, so we could continue working on this feature.

@ufleisch
Copy link
Contributor

I have fixed the conflicts and pushed to a new branch https://github.com/ufleisch/taglib/tree/matroska.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants