@Konfekt
Last active August 23, 2024 10:14
git diff common binary files

To diff common binary files in git with external converters such as unrtf, pandoc, docx2txt.pl, odt2txt, git-xlsx-textconv, git-xlsx-textconv.pl or pptx2md, add the following lines to ~/.config/git/config:

[diff]
  [diff "pdf"]
    binary = true
    # textconv = pdfinfo
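    # \" keeps the double quotes around $0; git's config parser strips unescaped ones,
    # which would break paths containing spaces
    textconv = sh -c 'pdftotext -layout -enc UTF-8 -nopgbrk -q \"$0\" -'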
    cachetextconv = true

  [diff "djvu"]
    binary = true
    textconv = djvutxt
    cachetextconv = true

  [diff "odt"]
    textconv = odt2txt
    # textconv = pandoc --standalone --from=odt --to=plain
    binary = true
    cachetextconv = true

  [diff "doc"]
    # not available under Windows
    # textconv = wvText
    textconv = catdoc
    binary = true
    cachetextconv = true
  [diff "xls"]
    textconv = in2csv
    # textconv = xlscat -a UTF-8
    # textconv = soffice --headless --convert-to csv
    binary = true
    cachetextconv = true
  [diff "ppt"]
    textconv = catppt
    binary = true
    cachetextconv = true

  [diff "docx"]
    textconv = pandoc --standalone --from=docx --to=plain
    # textconv = sh -c 'docx2txt.pl \"$0\" -'
    binary = true
    cachetextconv = true
  [diff "xlsx"]
    # Be it https://metacpan.org/pod/Spreadsheet::Read#xlsx2csv
    # or https://github.com/dilshod/xlsx2csv
    textconv = xlsx2csv --all
    # textconv = xlscat --trim -S all
    # textconv = in2csv
    # textconv = soffice --headless --convert-to csv
    binary = true
    cachetextconv = true
  [diff "pptx"]
    textconv = sh -c 'pptx2md --disable-image --disable-wmf -i \"$0\" -o ~/.cache/git/presentation.md >/dev/null && cat ~/.cache/git/presentation.md'
    binary = true
    cachetextconv = true

  [diff "rtf"]
    textconv = unrtf --text
    binary = true
    cachetextconv = true

  [diff "epub"]
    textconv = pandoc --standalone --from=epub --to=plain
    binary = true
    cachetextconv = true

  [diff "tika"]
    textconv = "tika --text"
    binary = true
    cachetextconv = true
  [diff "libreoffice"]
    textconv = "soffice --cat"
    binary = true
    cachetextconv = true
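
Several drivers above wrap the converter in sh -c '…' and refer to the file as $0. This works because git appends the path of the file to convert as one extra argument to the textconv command, and sh -c passes the first argument after the script to it as $0. A minimal sketch of that mechanism follows; fake_textconv is a hypothetical stand-in for git, and printf stands in for a real converter:

```shell
# fake_textconv is a hypothetical helper that imitates how git calls a
# textconv command: the file path is appended as one extra argument.
fake_textconv() {
  # Inside the sh -c script, the appended path is available as "$0".
  # Quoting "$0" keeps a path with spaces together as a single argument.
  sh -c 'printf "converting: %s\n" "$0"' "$1"
}

fake_textconv "my report.pdf"
```

Run as is, this prints "converting: my report.pdf". Inside a git config file the inner double quotes have to be written as \" so that they survive git's own parsing of the value.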

Then add the following lines to ~/.config/git/attributes:

*.pdf    diff=pdf
*.djvu   diff=djvu

*.odt    diff=odt
*.odp    diff=libreoffice
*.ods    diff=libreoffice

*.doc    diff=doc
*.xls    diff=xls
*.ppt    diff=ppt

*.docx   diff=docx
*.xlsx   diff=xlsx
*.pptx   diff=pptx

*.rtf    diff=rtf

*.epub   diff=epub
*.chm    diff=tika
*.mht    diff=tika
*.mhtml  diff=tika

*.class  diff=tika
*.jar    diff=tika
*.rar    diff=tika
*.7z     diff=tika
*.zip    diff=tika
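
To check which diff driver git resolves for a given path, git check-attr can be asked for the diff attribute. A small sketch follows; show_diff_driver is a hypothetical helper name, and it is assumed to be run inside a git repository (the files themselves need not exist):

```shell
# show_diff_driver is a hypothetical helper: for each path given, it prints
# the value of the "diff" attribute as git resolves it.
show_diff_driver() {
  # Output has the form "<path>: diff: <driver>"; a path that no pattern
  # matches is reported as "unspecified".
  git check-attr diff -- "$@"
}

# Example call (uncomment inside a repository):
# show_diff_driver report.pdf slides.pptx
```

A path reported as "unspecified" is not matched by any pattern. A driver name that has no matching [diff "…"] section in the config means the converter will not be used for that path.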

LibreOffice is an office suite that (together with a common text browser such as lynx) can handle all the formats listed above except PDF. (To use it on Microsoft Windows, make sure after installation that its folder is added to the %PATH% environment variable, for example with Rapidee.)

Tika is a content extractor that can handle all the formats listed above and many more. To use it:

  1. Download the latest runnable tika-app-...jar from Tika to ~/bin/tika.jar (on Linux) or %USERPROFILE%\bin\tika.jar (on Microsoft Windows).

  2. Create

    • on Linux, a shell script ~/bin/tika that reads
        #!/bin/sh
        exec java -Dfile.encoding=UTF-8 -jar "$HOME/bin/tika.jar" "$@" 2>/dev/null

    and mark it executable (by chmod a+x ~/bin/tika).

    • on Microsoft Windows, a batch script %USERPROFILE%\bin\tika.bat that reads
        @echo off
        java -Dfile.encoding=UTF-8 -jar "%USERPROFILE%\bin\tika.jar" %*

  3. Add the folder containing the newly created tika executable to the environment variable $PATH (on Linux) or %PATH% (on Microsoft Windows):

    • on Linux, if you use bash or zsh, by adding to ~/.profile or ~/.zshenv, respectively, the line
        PATH=$PATH:~/bin
    • on Microsoft Windows, a convenient program to update %PATH% is Rapidee.
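
Once $PATH is set up, it can help to confirm that the converters named in the config are actually found, since a missing converter only shows up as an error when a diff is requested. A minimal sketch for a POSIX shell follows; check_converters is a hypothetical helper name, and the tool list simply mirrors the drivers configured above:

```shell
# check_converters is a hypothetical helper: for each command name given,
# it reports whether the shell can find it on $PATH.
check_converters() {
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      printf 'found:   %s\n' "$tool"
    else
      printf 'missing: %s\n' "$tool"
    fi
  done
}

# Tools referenced by the textconv lines above; trim the list to the drivers you use.
check_converters pdftotext djvutxt odt2txt catdoc in2csv catppt pandoc xlsx2csv pptx2md unrtf tika soffice
```

Any tool reported as missing either needs to be installed or its driver switched to one of the commented-out alternatives.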