@Chubek
Created February 25, 2024 09:41
Witty.rb -> Parse .git/index
#!/usr/bin/env ruby
# === Witty.rb ===
# A very simple Ruby Script
# Author: Chubak Bidpaa (github.com/Chubek)
#
# ** What does this do? **
# This script demonstrates how to parse a Git index file (.git/index)
# using nothing but the language's IO facilities. This is perhaps best
# done in a systems language, or a strongly-typed language where there
# is a good distinction between integers, characters and bytes. However,
# since Ruby is a 'sweet' language, and I mean that both figuratively and
# literally (syntactic diabetes?), I wrote the demonstration in Ruby.
# One could do this in any language though. Even AWK! But I digress.
# Enough language talk. Let's talk about .git/index, hereafter referred to as
# `index`.
#
# ** The Structure of `index` **
# The structure of this file is plainly explained at this page:
# https://git-scm.com/docs/index-format
# It's nothing out of the ordinary. It is your regular binary file.
# It is not a 'database'. A database file must have a structural form.
# `index` is very structurally loose. It's just a linear list of items,
# preceded by a 12-byte header: a 4-byte magic ('DIRC'), followed by a
# 32-bit version number, followed by the 32-bit number of items in the 'list'.
#
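# The header described above can be sketched with String#unpack. This is only an
# illustration: `parse_index_header` is a hypothetical helper, not part of this
# script, and it works on an in-memory string rather than on STDIN.
=begin
```ruby
# Hypothetical helper: decode the 12-byte header of an index file.
def parse_index_header(bytes)
  # a4 = 4 raw bytes, N = 32-bit unsigned integer in big-endian order
  signature, version, entry_count = bytes.unpack('a4NN')
  raise ArgumentError, 'not a Git index' unless signature == 'DIRC'

  { version: version, entry_count: entry_count }
end

# Build a fake header (version 2, three entries) and decode it again.
header = parse_index_header(['DIRC', 2, 3].pack('a4NN'))
puts header.inspect
```
=end
#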
# ___NOTE___: Besides the list, there's an 'extensions' section, which really
# gives `index` no shot at being a genuine database! In this script, we do NOT
# parse the extensions, because they may or may not occur, and besides, they are
# beside the point of gaining info on the files in our repo.
#
# The 'list' is a hodgepodge of unsigned 32-bit integers, a 20-byte object name, 16-bit flags,
# padding, and one null-terminated string, which is the path to the file FROM THE ROOT. That means
# absolute paths, and by that I mean 'POSIX absolute paths', are forbidden in this pathname.
# Everything is given relative to the root of the repository, that is, the directory
# where '.git' is located.
#
# From this description, Git seems rather hostile to non-POSIX systems. Windows
# paths use drive letters and backslashes, not a single '/' root! But the index only ever
# stores '/'-separated paths relative to the repository root, so I guess the people who ported
# this goddamn git of a software to that goddamn git of an operating system only had to
# translate paths at the edges (I have not used PipDooze in several years; IIRC Git for
# Windows ships its own Bash, though it is not limited to it).
#
# ** The Format of Each Index Entry **
# This table explains the format of each index entry:
#
#  No.  Field description
# -------------------------------------------------------------------
#   1.  32-bit ctime seconds, the last time a file's metadata changed
#       (this is stat(2) data)
#   2.  32-bit ctime nanosecond fractions
#       (this is stat(2) data)
#   3.  32-bit mtime seconds, the last time a file's data changed
#       (this is stat(2) data)
#   4.  32-bit mtime nanosecond fractions
#       (this is stat(2) data)
#   5.  32-bit dev
#       (this is stat(2) data)
#   6.  32-bit ino
#       (this is stat(2) data)
#   7.  32-bit mode, split into (high to low bits)
#       | 4-bit object type
#       |   valid values in binary are 1000 (regular file), 1010 (symbolic link)
#       |   and 1110 (gitlink)
#       | 3-bit unused
#       | 9-bit unix permission. Only 0755 and 0644 are valid for regular files.
#       |   Symbolic links and gitlinks have value 0 in this field.
#   8.  32-bit uid
#       (this is stat(2) data)
#   9.  32-bit gid
#       (this is stat(2) data)
#  10.  32-bit file size
#       This is the on-disk size from stat(2), truncated to 32-bit.
#  11.  Object name for the represented object
#  12.  A 16-bit 'flags' field split into (high to low bits)
#       | 1-bit assume-valid flag
#       | 1-bit extended flag (must be zero in version 2)
#       | 2-bit stage (during merge)
#       | 12-bit name length if the length is less than 0xFFF; otherwise 0xFFF
#       |   is stored in this field.
#  13.  (Version 3 or later) A 16-bit field, only applicable if the
#       "extended flag" above is 1, split into (high to low bits).
#       | 1-bit reserved for future
#       | 1-bit skip-worktree flag (used by sparse checkout)
#       | 1-bit intent-to-add flag (used by "git add -N")
#       | 13-bit unused, must be zero
#  14.  Entry path name (variable length) relative to top level directory
#       (without leading slash). '/' is used as path separator. The special
#       path components ".", ".." and ".git" (without quotes) are disallowed.
#       Trailing slash is also disallowed.
#  15.  (Version 4) In version 4, the entry path name is prefix-compressed
#       relative to the path name for the previous entry (the very first
#       entry is encoded as if the path name for the previous entry is an
#       empty string). At the beginning of an entry, an integer N in the
#       variable width encoding (the same encoding as the offset is encoded
#       for OFS_DELTA pack entries; see pack-format.txt) is stored, followed
#       by a NUL-terminated string S. Removing N bytes from the end of the
#       path name for the previous entry, and replacing it with the string S
#       yields the path name for this entry.
#  16.  1-8 nul bytes as necessary to pad the entry to a multiple of eight bytes
#       while keeping the name NUL-terminated.
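#
# Fields 7 and 12 of the table are bit-packed, so here is a minimal sketch of pulling
# them apart. `OBJECT_TYPES`, `decode_mode` and `decode_flags` are hypothetical names
# used for illustration only; the bit layout is the one given in the table.
=begin
```ruby
# Hypothetical lookup table for the 4-bit object type in field 7.
OBJECT_TYPES = {
  0b1000 => :regular_file,
  0b1010 => :symlink,
  0b1110 => :gitlink
}.freeze

# Splits the 32-bit mode into the object type and the 9 permission bits.
def decode_mode(mode)
  {
    type: OBJECT_TYPES.fetch((mode >> 12) & 0b1111, :unknown),
    permissions: format('%04o', mode & 0o777)
  }
end

# Splits the 16-bit flags field (high to low bits).
def decode_flags(flags)
  {
    assume_valid: flags[15] == 1,     # Integer#[] reads a single bit
    extended: flags[14] == 1,
    stage: (flags >> 12) & 0b11,
    name_length: flags & 0xFFF
  }
end

puts decode_mode(0o100644).inspect   # a regular file with permissions 0644
puts decode_flags(0x900A).inspect    # assume-valid set, stage 1, 10-byte name
```
=end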
#
# I think this is clear enough, but let's address the pathname. The pathname is
# a null-terminated string; this means the authors of Git did not bank on paths
# fitting in some fixed-size buffer such as FILENAME_MAX. Keep in mind that such limits
# vary: on Linux NAME_MAX is 255 bytes per path component and PATH_MAX is 4096, whilst
# on Windows MAX_PATH is 260. And MAYBE some pesky person is using a different file system,
# on the same OS that defines the limit for its native filesystem? I mean, WHO KNOWS?
#
# But there's another reason a fixed FILENAME_MAX-sized field is not used and a null-terminated
# string is chosen instead, and that has to do with encoding and multibyte strings. Git does not
# really care what the encoding of your pathname is, or if it's multibyte or ASCII or
# Extended ASCII. It just puts a null at the end of the byte sequence that represents the path.
#
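# To make the 'bytes first, encoding later' point concrete, here is a minimal sketch.
# `path_from_bytes` is a hypothetical helper, and treating the bytes as UTF-8 when they
# happen to be valid UTF-8 is my assumption, not something Git promises.
=begin
```ruby
# Hypothetical helper: take the bytes up to the first NUL, then guess an encoding.
def path_from_bytes(bytes)
  raw = bytes.unpack1('Z*')                        # 'Z*' stops at the first NUL byte
  utf8 = raw.dup.force_encoding(Encoding::UTF_8)   # relabel the bytes, do not convert
  utf8.valid_encoding? ? utf8 : raw                # keep raw bytes if not valid UTF-8
end

utf8_path = path_from_bytes("caf\xC3\xA9.txt\x00\x00\x00".b)
latin1_path = path_from_bytes("caf\xE9.txt\x00".b)

puts utf8_path.encoding     # the 9 path bytes were valid UTF-8
puts latin1_path.encoding   # the 8 path bytes are left as binary
```
=end
#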
# In a way, null-term strings are cancer. But this is a good place for their use.
#
# Now, as you can clearly read in the table given above, the 'flags' field also gives the exact
# length of the string (as long as it is shorter than 0xFFF bytes). And in this script, we read the
# name by that length. Mainly because, again, null-term strings are CANCER. I mainly code in C and
# I use them a lot, but see, Ruby strings are not null-terminated, nor are Python's, Scheme's, or
# Java's (?); my mom and your mom do not support them; only systems languages really deal with them.
#
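# A tiny sketch of what lands in that 12-bit length, assuming a hypothetical helper
# `name_length_field` (not part of this script). The field counts BYTES, not
# characters, and saturates at 0xFFF.
=begin
```ruby
# Hypothetical helper: the value stored in the low 12 bits of 'flags' for a path.
def name_length_field(path)
  [path.bytesize, 0xFFF].min
end

short_field = name_length_field('lib/witty.rb')   # 12 bytes, stored as-is
long_field = name_length_field('a' * 5000)        # capped, reader must scan for the NUL
puts short_field
puts long_field
```
=end
#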
# So anyways, here I present to you Witty.rb, a Ruby script that reads `index`, aka .git/index,
# from standard input. Do with the info as you wish!
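require 'pathname'

# The index is binary data, so read STDIN without newline or encoding translation.
STDIN.binmode

# Reads exactly n bytes from STDIN; raises if the file ends early.
def read_n_bytes(n)
  bytes = STDIN.read(n)
  raise EOFError, "Unexpected end of index" if bytes.nil? || bytes.bytesize < n
  bytes
end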
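
# All multi-byte integers in the index are stored in network (big-endian) byte order.
def read_uint16
  read_n_bytes(2).unpack('n').first
end

def read_uint32
  read_n_bytes(4).unpack('N').first
end

# 'Q' alone would use the machine's native byte order; 'Q>' forces big-endian.
def read_uint64
  read_n_bytes(8).unpack('Q>').first
end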
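
# Why the pack directives above matter, as a small sketch: 'n', 'N' and 'Q>' are
# big-endian, while 'V' is little-endian. The variable names below are only illustrative.
=begin
```ruby
one = [1].pack('N')                    # four bytes: 00 00 00 01
as_big_endian = one.unpack1('N')       # decoded in the order the index uses
as_little_endian = one.unpack1('V')    # the same bytes misread as little-endian
wide = [1].pack('Q>')                  # eight bytes, most significant byte first

puts as_big_endian
puts as_little_endian
puts wide.bytes.inspect
```
=end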
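
# Reads bytes up to (and consuming) the next NUL; returns them without the NUL.
def read_string
  str = String.new
  loop do
    char = read_n_bytes(1)
    break if char.nil? || char == "\x00"
    str << char
  end
  str
end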

# The file must start with the 4-byte signature "DIRC" (short for "dircache").
def match_signature
  raise "Not a valid Git Index file" if read_uint32 != "DIRC".unpack('N').first
end
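
# Version 4 is rejected: its path names are prefix-compressed and its entries
# are unpadded, and read_index_entry handles neither.
def read_version_number
  version = read_uint32
  raise "Unsupported Git Index version: #{version}" unless [2, 3].include?(version)
  version
end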
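
# Version 4 path names are prefix-compressed (field 15 in the table), which this script
# does not decode. For the curious, a minimal sketch of that decoding follows. It assumes
# the OFS_DELTA-style variable-width integer mentioned in the table; `read_varint` and
# `read_v4_path` are hypothetical helpers that read from any IO-like object.
=begin
```ruby
require 'stringio'

# Hypothetical helper: decode the variable-width integer N.
# Each byte carries 7 bits; a set high bit means another byte follows.
def read_varint(io)
  byte = io.readbyte
  value = byte & 0x7f
  while (byte & 0x80) != 0
    byte = io.readbyte
    value = ((value + 1) << 7) | (byte & 0x7f)
  end
  value
end

# Hypothetical helper: drop N bytes from the previous path, then append S.
def read_v4_path(io, previous_path)
  strip = read_varint(io)
  suffix = io.gets("\0").chomp("\0")   # the NUL-terminated string S
  previous_path.byteslice(0, previous_path.bytesize - strip) + suffix
end

# Previous entry was "lib/foo.rb"; this entry strips 6 bytes and appends "bar.rb".
io = StringIO.new("\x06bar.rb\x00".b)
path = read_v4_path(io, 'lib/foo.rb'.b)
puts path
```
=end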

def read_number_of_entries
  read_uint32
end
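
# Reads one version 2 or 3 entry and returns its fields as a Hash.
def read_index_entry
  entry = {}
  entry[:ctime_seconds] = read_uint32
  entry[:ctime_nanoseconds] = read_uint32
  entry[:mtime_seconds] = read_uint32
  entry[:mtime_nanoseconds] = read_uint32
  entry[:dev] = read_uint32
  entry[:inode] = read_uint32
  mode = read_uint32
  entry[:type] = (mode >> 12) & 0b1111
  entry[:permissions] = mode & 0o777
  entry[:uid] = read_uint32
  entry[:gid] = read_uint32
  entry[:size] = read_uint32
  # 20 bytes assumes a SHA-1 repository.
  entry[:object_name] = read_n_bytes(20).unpack('H*').first
  flags = read_uint16
  entry[:assume_valid] = (flags >> 15) & 0b1
  entry[:extended] = (flags >> 14) & 0b1
  entry[:stage] = (flags >> 12) & 0b11
  name_length = flags & 0xFFF
  fixed_length = (4 * 10) + 20 + 2
  if entry[:extended] == 1
    # Version 3 or later: a second 16-bit flags field follows.
    extended_flags = read_uint16
    entry[:skip_worktree] = (extended_flags >> 14) & 0b1
    entry[:intent_to_add] = (extended_flags >> 13) & 0b1
    fixed_length += 2
  end
  if name_length < 0xFFF
    name = read_n_bytes(name_length)
    nul_consumed = 0
  else
    # The length field is saturated, so scan for the terminating NUL instead.
    name = read_string
    nul_consumed = 1
  end
  entry[:name] = name.force_encoding('UTF-8')
  # 1 to 8 NUL bytes pad the entry to a multiple of eight bytes.
  padding_length = 8 - (fixed_length + name.bytesize) % 8
  read_n_bytes(padding_length - nul_consumed)
  entry
end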
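
# The padding arithmetic is easy to get wrong, so here is a worked sketch. `entry_size`
# is a hypothetical helper, and it assumes a version 2 entry with a SHA-1 object name
# and no extended flags (62 fixed bytes before the path).
=begin
```ruby
# Hypothetical helper: total on-disk size of one entry for a given path length.
def entry_size(name_length)
  unpadded = (4 * 10) + 20 + 2 + name_length   # 62 fixed bytes plus the path
  unpadded + (8 - unpadded % 8)                # add 1 to 8 NUL bytes of padding
end

puts entry_size(9)   # 62 + 9 = 71 bytes, plus 1 NUL
puts entry_size(2)   # 62 + 2 = 64 bytes, plus a full 8 NULs
```
=end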
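
# Parses the header and every entry, returning the entry paths in index order.
def read_all_paths
  match_signature
  read_version_number
  no_of_entries = read_number_of_entries
  Array.new(no_of_entries) { Pathname.new(read_index_entry[:name]) }
end

# Call `read_all_paths` and do as you wish. The index is read from STDIN, so
# after adding e.g. `puts read_all_paths` below, run:
#   ruby Witty.rb < .git/index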