Last active August 23, 2024 10:14
git diff common binary files

To diff common binary files in git using appropriate external converters such as unrtf, pandoc, odt2txt, git-xlsx-textconv, or pptx2md, add to ~/.config/git/config the lines

  [diff "pdf"]
    binary = true
    # textconv = pdfinfo
    textconv = sh -c 'pdftotext -layout "$0" -enc UTF-8 -nopgbrk -q -'
    cachetextconv = true

  [diff "djvu"]
    binary = true
    textconv = djvutxt
    cachetextconv = true

  [diff "odt"]
    textconv = odt2txt
    # textconv = pandoc --standalone --from=odt --to=plain
    binary = true
    cachetextconv = true

  [diff "doc"]
    # not available under Windows
    # textconv = wvText
    textconv = catdoc
    binary = true
    cachetextconv = true
  [diff "xls"]
    textconv = in2csv
    # textconv = xlscat -a UTF-8
    # textconv = soffice --headless --convert-to csv
    binary = true
    cachetextconv = true
  [diff "ppt"]
    textconv = catppt
    binary = true
    cachetextconv = true

  [diff "docx"]
    textconv = pandoc --standalone --from=docx --to=plain
    # textconv = sh -c ' "$0" -'
    binary = true
    cachetextconv = true
  [diff "xlsx"]
    # Be it
    # or
    textconv = xlsx2csv --all
    # textconv = xlscat --trim -S all
    # textconv = in2csv
    # textconv = soffice --headless --convert-to csv
    binary = true
    cachetextconv = true
  [diff "pptx"]
    textconv = sh -c 'pptx2md --disable-image --disable-wmf -i "$0" -o ~/.cache/git/ >/dev/null && cat ~/.cache/git/'
    binary = true
    cachetextconv = true

  [diff "rtf"]
    textconv = unrtf --text
    binary = true
    cachetextconv = true

  [diff "epub"]
    textconv = pandoc --standalone --from=epub --to=plain
    binary = true
    cachetextconv = true

  [diff "tika"]
    textconv = "tika --text"
    binary = true
    cachetextconv = true
  [diff "libreoffice"]
    textconv = "soffice --cat"
    binary = true
    cachetextconv = true

and add to ~/.config/git/attributes the lines

*.pdf    diff=pdf
*.djvu   diff=djvu

*.odt    diff=odt
*.odp    diff=libreoffice
*.ods    diff=libreoffice

*.doc    diff=doc
*.xls    diff=xls
*.ppt    diff=ppt

*.docx   diff=docx
*.xlsx   diff=xlsx
*.pptx   diff=pptx

*.rtf    diff=rtf

*.epub   diff=epub
*.chm    diff=tika
*.mht    diff=tika
*.mhtml  diff=tika

*.class  diff=tika
*.jar    diff=tika
*.rar    diff=tika
*.7z     diff=tika
*.zip    diff=tika
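
To see how the two files work together, the following is a minimal sketch that can be run in a throwaway repository. The driver name "demo", the *.bin pattern and the tr command are hypothetical stand-ins for a real converter such as pdftotext; the attributes are written to a repository-local .gitattributes (an assumption made here so that the files under ~/.config/git stay untouched). It also shows why the sh -c '...' entries above refer to "$0": git appends the file path to the textconv command, so inside sh -c the path arrives as $0.

```shell
# Throwaway repository, so the real ~/.config/git files stay untouched.
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"

# Hypothetical driver "demo": tr stands in for pdftotext and friends.
# Git appends the file path to the textconv command, so inside
# sh -c '...' the path shows up as "$0".
git config diff.demo.textconv "sh -c 'tr a-z A-Z < \"\$0\"'"
printf '*.bin diff=demo\n' > .gitattributes

# 1. Which diff driver does git pick for a given path?
driver=$(git check-attr diff -- sample.bin)
echo "$driver"

# 2. Does the converter output show up in the diff?
printf 'hello\n' > sample.bin
git add sample.bin .gitattributes
git -c user.name=demo -c user.email=demo@example.invalid commit -q -m init
printf 'goodbye\n' > sample.bin
changes=$(git diff -- sample.bin | grep '^[-+][A-Z]')
echo "$changes"
```

In a real repository, git check-attr diff -- path/to/file.pdf is a quick way to confirm that a pattern in the attributes file matches and which driver it selects.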

LibreOffice is an office suite that (together with a common text browser such as lynx) can handle all the formats listed above except PDF. (To use it on Microsoft Windows, make sure after installation that its path is added to the %PATH% environment variable, for example with Rapidee.)

Tika is a content extractor that can handle all the formats listed above and many more. To use it:

  1. Download the latest runnable tika-app-...jar from Tika to ~/bin/tika.jar (on Linux) or %USERPROFILE%\bin\tika.jar (on Microsoft Windows).

  2. Create

    • on Linux, a shell script ~/bin/tika that reads
        #!/bin/sh
        exec java -Dfile.encoding=UTF-8 -jar "$HOME/bin/tika.jar" "$@" 2>/dev/null

    and mark it executable (by chmod a+x ~/bin/tika).

    • on Microsoft Windows, a batch script %USERPROFILE%\bin\tika.bat that reads
        @echo off
        java -Dfile.encoding=UTF-8 -jar "%USERPROFILE%\bin\tika.jar" %*

  3. Add the folder containing the newly created tika executable to your $PATH (on Linux) or %PATH% (on Microsoft Windows) environment variable:

    • on Linux, if you use bash or zsh by adding to ~/.profile or ~/.zshenv the line
    • on Microsoft Windows, a convenient program to update %PATH% is Rapidee.
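
For the Linux side, steps 2 and 3 can be sketched as the shell session below. It assumes that tika.jar has already been saved to ~/bin/tika.jar as in step 1. The exact line meant for ~/.profile or ~/.zshenv is not shown above, so the PATH snippet here is only one plausible form and not necessarily the original one.

```shell
# Assumption: tika.jar was already saved to ~/bin/tika.jar (step 1).
mkdir -p "$HOME/bin"

# Step 2: write the wrapper script and mark it executable.
# The quoted 'EOF' keeps $HOME and $@ literal inside the script.
cat > "$HOME/bin/tika" <<'EOF'
#!/bin/sh
exec java -Dfile.encoding=UTF-8 -jar "$HOME/bin/tika.jar" "$@" 2>/dev/null
EOF
chmod a+x "$HOME/bin/tika"

# Step 3: put ~/bin on PATH unless it is already there.
case ":$PATH:" in
  *":$HOME/bin:"*) ;;
  *) PATH="$HOME/bin:$PATH" ;;
esac
export PATH
```

To make the PATH change permanent, the case statement and the export line would go into ~/.profile or ~/.zshenv; afterwards command -v tika should print the path of the wrapper.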
