To diff common binary files in Git with external converters such as unrtf, pandoc, docx2txt.pl, odt2txt, git-xlsx-textconv, git-xlsx-textconv.pl or pptx2md, add the following lines to ~/.config/git/config:
[diff]
[diff "pdf"]
binary = true
# textconv = pdfinfo
textconv = sh -c 'pdftotext -layout \"$0\" -enc UTF-8 -nopgbrk -q -'
cachetextconv = true
[diff "djvu"]
binary = true
textconv = djvutxt
cachetextconv = true
[diff "odt"]
textconv = odt2txt
# textconv = pandoc --standalone --from=odt --to=plain
binary = true
cachetextconv = true
[diff "doc"]
# not available under Windows
# textconv = wvText
textconv = catdoc
binary = true
cachetextconv = true
[diff "xls"]
textconv = in2csv
# textconv = xlscat -a UTF-8
# textconv = soffice --headless --convert-to csv
binary = true
cachetextconv = true
[diff "ppt"]
textconv = catppt
binary = true
cachetextconv = true
[diff "docx"]
textconv = pandoc --standalone --from=docx --to=plain
# textconv = sh -c 'docx2txt.pl \"$0\" -'
binary = true
cachetextconv = true
[diff "xlsx"]
# xlsx2csv, either from https://metacpan.org/pod/Spreadsheet::Read#xlsx2csv
# or from https://github.com/dilshod/xlsx2csv
textconv = xlsx2csv --all
# textconv = xlscat --trim -S all
# textconv = in2csv
# textconv = soffice --headless --convert-to csv
binary = true
cachetextconv = true
[diff "pptx"]
textconv = sh -c 'pptx2md --disable-image --disable-wmf -i \"$0\" -o ~/.cache/git/presentation.md >/dev/null && cat ~/.cache/git/presentation.md'
binary = true
cachetextconv = true
[diff "rtf"]
textconv = unrtf --text
binary = true
cachetextconv = true
[diff "epub"]
textconv = pandoc --standalone --from=epub --to=plain
binary = true
cachetextconv = true
[diff "tika"]
textconv = "tika --text"
binary = true
cachetextconv = true
[diff "libreoffice"]
textconv = "soffice --cat"
binary = true
cachetextconv = true
and add the following lines to ~/.config/git/attributes:
*.pdf diff=pdf
*.djvu diff=djvu
*.odt diff=odt
*.odp diff=libreoffice
*.ods diff=libreoffice
*.doc diff=doc
*.xls diff=xls
*.ppt diff=ppt
*.docx diff=docx
*.xlsx diff=xlsx
*.pptx diff=pptx
*.rtf diff=rtf
*.epub diff=epub
*.chm diff=tika
*.mht diff=tika
*.mhtml diff=tika
*.class diff=tika
*.jar diff=tika
*.rar diff=tika
*.7z diff=tika
*.zip diff=tika
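To see how a driver entry and an attribute line work together, the following sketch runs the whole chain in a throwaway HOME and repository, so it does not touch your real configuration. The driver name `demo`, the `*.demo` pattern and the upper-casing converter are made up for illustration; they stand in for a real pair such as `pdf` and `pdftotext`. Note that a double quote inside a value in the config file has to be written as `\"`, because git's config parser otherwise removes it.

```shell
#!/bin/sh
# Sketch: exercise a textconv driver end to end in a temporary HOME and repo.
set -eu
tmp=$(mktemp -d)
export HOME="$tmp/home"
export XDG_CONFIG_HOME="$HOME/.config"
mkdir -p "$XDG_CONFIG_HOME/git" "$tmp/repo"

# Hypothetical driver "demo": upper-cases the file, standing in for a real converter.
# \" keeps the double quotes around $0 when git parses the config file.
cat > "$XDG_CONFIG_HOME/git/config" <<'EOF'
[diff "demo"]
	textconv = sh -c 'tr a-z A-Z < \"$0\"'
	binary = true
EOF
printf '%s\n' '*.demo diff=demo' > "$XDG_CONFIG_HOME/git/attributes"

cd "$tmp/repo"
git init -q
printf 'first draft\n' > 'my notes.demo'
git add .
git -c user.name=demo -c user.email=demo@example.invalid commit -q -m init
printf 'second draft\n' > 'my notes.demo'

# Which diff driver does git pick for this path?
attr=$(git check-attr diff -- 'my notes.demo')
echo "$attr"

# The patch is computed on the converter's output, not on the raw bytes.
patch=$(git diff)
echo "$patch"
```

On your real setup, `git check-attr diff -- somefile.pdf` is a quick way to confirm that a pattern in ~/.config/git/attributes actually selects the driver you expect.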
LibreOffice is an office suite that (together with a common text browser such as lynx) can handle all the formats listed above, except PDFs.
(To use it on Microsoft Windows, ensure after installation that its path is added to the %PATH% environment variable, for example with Rapidee.)
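Every driver depends on its converter being found on the search path, so it can help to check which ones are missing before relying on the diffs. A minimal sketch for a POSIX shell follows; `missing_tools` is a hypothetical helper name, and the tool list simply repeats the active converters from the configuration above.

```shell
#!/bin/sh
# Sketch: print the name of every given command that is not on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Converters referenced by the diff drivers; trim the list to the ones you use.
missing_tools pdftotext djvutxt odt2txt catdoc in2csv catppt pandoc \
    xlsx2csv pptx2md unrtf tika soffice
```

An empty output means all listed converters were found.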
Tika is a content extractor that can handle all the formats listed above and many more. To use it:
- Download the latest runnable tika-app-...jar from Tika to ~/bin/tika.jar (on Linux) or %USERPROFILE%\bin\tika.jar (on Microsoft Windows).
- Create
  - on Linux, a shell script ~/bin/tika that reads
    #!/bin/sh
    exec java -Dfile.encoding=UTF-8 -jar "$HOME/bin/tika.jar" "$@" 2>/dev/null
    and mark it executable (by chmod a+x ~/bin/tika).
  - on Microsoft Windows, a batch script %USERPROFILE%\bin\tika.bat that reads
    @echo off
    java -Dfile.encoding=UTF-8 -jar "%USERPROFILE%\bin\tika.jar" %*
- Add the folder of the newly created tika executable to your environment variable $PATH (on Linux) or %PATH% (on Microsoft Windows):
  - on Linux, if you use bash or zsh, by adding to ~/.profile or ~/.zshenv the line
    PATH=$PATH:~/bin
  - on Microsoft Windows, a convenient program to update %PATH% is Rapidee.
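If the startup file is sourced more than once (for example by nested shells), the plain PATH line above appends ~/bin each time. One possible variant that adds the folder only when it is not yet present is sketched below; `add_to_path` is a hypothetical helper name, not something provided by bash or zsh.

```shell
#!/bin/sh
# Sketch for ~/.profile or ~/.zshenv: append a folder to PATH at most once.
add_to_path() {
    case ":$PATH:" in
        *":$1:"*) ;;            # already present, nothing to do
        *) PATH="$PATH:$1" ;;   # otherwise append it
    esac
}

add_to_path "$HOME/bin"
export PATH
```

Afterwards, `command -v tika` in a new shell should print the path of the wrapper script.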