Created
February 25, 2024 09:41
Witty.rb -> Parse .git/index
#!/usr/bin/env ruby
# === Witty.rb ===
# A very simple Ruby Script
# Author: Chubak Bidpaa (github.com/Chubek)
#
# ** What does this do? **
# This script demonstrates how to parse a Git index file (.git/index)
# using nothing but the language's IO facilities. This is perhaps best
# done in a systems language, or a strongly-typed language where there
# is a good distinction between integers, characters and bytes. However,
# since Ruby is a 'sweet' language, and I mean that both figuratively and
# literally (syntactic diabetes?), I wrote the demonstration in Ruby.
# One could do this in any language though. Even AWK! But I digress.
# Enough language talk. Let's talk about .git/index, hereafter referred to as
# `index`.
#
# ** The Structure of `index` **
# The structure of this file is plainly explained at this page:
# https://git-scm.com/docs/index-format
# It's nothing out of the ordinary. It is your regular binary file.
# It is not a 'database'. A database file must have a structural form.
# `index` is very structurally loose. It's just a linear list of items,
# preceded by a 12-byte header: a 4-byte magic signature ('DIRC'), then a
# 4-byte version number, then a 32-bit count of the items in the 'list'.
#
# ___NOTE___: Besides the list, there's an 'extensions' section, which really
# gives `index` no shot at being a genuine database! In this script, we do NOT
# parse the extensions, because they may or may not be present, and in any case
# they are beside the point of gaining info on the files in our repo.
#
# The 'list' is a hodgepodge of unsigned 32-bit integers, an object name, 16-bit flags,
# padding, and one null-terminated string, which is the path to the file FROM THE ROOT.
# That means absolute paths, and by that I mean 'POSIX absolute paths', are forbidden in
# this pathname. Everything is given relative to the root of the repository, which is
# where the '.git' directory is located.
#
# From this description, Git seems rather hostile to non-POSIX systems. Windows spells
# its paths with drive letters and backslashes, not a single '/' root! But I guess the
# people who ported this goddamn git of a software to that goddamn git of an operating
# system knew how to deal with it (I have not used PipDooze in several years, but IIRC,
# Git for Windows ships with its own Bash. Dunno if that is the only way to run it).
#
# ** The Format of Each Index Entry **
# This table explains the format of each index entry:
#
# Field  Description
# -------------------------------------------------------------------
#  1.    32-bit ctime seconds, the last time a file's metadata changed
#        (this is stat(2) data)
#  2.    32-bit ctime nanosecond fractions
#        (this is stat(2) data)
#  3.    32-bit mtime seconds, the last time a file's data changed
#        (this is stat(2) data)
#  4.    32-bit mtime nanosecond fractions
#        (this is stat(2) data)
#  5.    32-bit dev
#        (this is stat(2) data)
#  6.    32-bit ino
#        (this is stat(2) data)
#  7.    32-bit mode, split into (high to low bits)
#        | 4-bit object type
#        |   valid values in binary are 1000 (regular file), 1010 (symbolic link)
#        |   and 1110 (gitlink)
#        | 3-bit unused
#        | 9-bit unix permission. Only 0755 and 0644 are valid for regular files.
#        |   Symbolic links and gitlinks have value 0 in this field.
#  8.    32-bit uid
#        (this is stat(2) data)
#  9.    32-bit gid
#        (this is stat(2) data)
# 10.    32-bit file size
#        This is the on-disk size from stat(2), truncated to 32-bit.
# 11.    Object name for the represented object (20 bytes when the hash is SHA-1)
# 12.    A 16-bit 'flags' field split into (high to low bits)
#        | 1-bit assume-valid flag
#        | 1-bit extended flag (must be zero in version 2)
#        | 2-bit stage (during merge)
#        | 12-bit name length if the length is less than 0xFFF; otherwise 0xFFF
#        |   is stored in this field.
# 13.    (Version 3 or later) A 16-bit field, only applicable if the
#        "extended flag" above is 1, split into (high to low bits).
#        | 1-bit reserved for future
#        | 1-bit skip-worktree flag (used by sparse checkout)
#        | 1-bit intent-to-add flag (used by "git add -N")
#        | 13-bit unused, must be zero
# 14.    Entry path name (variable length) relative to top level directory
#        (without leading slash). '/' is used as path separator. The special
#        path components ".", ".." and ".git" (without quotes) are disallowed.
#        Trailing slash is also disallowed.
# 15.    (Version 4) In version 4, the entry path name is prefix-compressed
#        relative to the path name for the previous entry (the very first
#        entry is encoded as if the path name for the previous entry is an
#        empty string). At the beginning of an entry, an integer N in the
#        variable width encoding (the same encoding as the offset is encoded
#        for OFS_DELTA pack entries; see pack-format.txt) is stored, followed
#        by a NUL-terminated string S. Removing N bytes from the end of the
#        path name for the previous entry, and replacing it with the string S
#        yields the path name for this entry.
# 16.    1-8 nul bytes as necessary to pad the entry to a multiple of eight bytes
#        while keeping the name NUL-terminated (version 4 has no such padding).
#
# I think this is clear enough, but let's address the pathname. The pathname is
# a null-terminated string, which means the authors of Git did not bank on paths
# fitting in a fixed-size field such as FILENAME_MAX. Keep in mind that such limits vary:
# on Linux, NAME_MAX (one path component) is 255 bytes and PATH_MAX is 4096, whilst
# Windows traditionally caps a whole path at MAX_PATH, which is 260 characters. But well,
# MAYBE some pesky person is using a different file system, on the same OS that defines
# these limits for its native filesystem? I mean, WHO KNOWS?
#
# But there's another reason a fixed-size field is not used and a null-term string is
# chosen instead. And that's got to do with encoding, and multibyte strings. Git does not
# really care about what the encoding of your pathname is, or if it's multibyte or ASCII or
# Extended ASCII. It just puts a null at the end of the byte sequence that represents the path.
#
# In a way, null-term strings are cancer. But this is a good place for their use.
#
# Now, as you can clearly read in the table given above, the 'flags' field also stores the
# length of the name (capped at 0xFFF). This script prefers that length over scanning for
# the null. Mainly because, again, null-term strings are CANCER. I mainly code in C and I
# use them a lot, but see, Ruby does not use them natively, Python does not, Scheme does
# not, Java doesn't, my mom and your mom do not; they are mostly a C-family thing.
#
# So anyways, here I present to you Witty.rb, a Ruby script that reads `index`, aka .git/index.
# Do with the info as you wish!
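require 'pathname'

# The index is binary data, so make sure no newline or encoding translation happens.
STDIN.binmode

# Read exactly n bytes from standard input, failing loudly on a truncated file.
def read_n_bytes(n)
  bytes = STDIN.read(n)
  raise "Unexpected end of Git Index file" if bytes.nil? || bytes.bytesize < n
  bytes
end

# All multi-byte integers in the index are stored in network (big-endian) byte order.
def read_uint16
  read_n_bytes(2).unpack('n').first
end

def read_uint32
  read_n_bytes(4).unpack('N').first
end

def read_uint64
  read_n_bytes(8).unpack('Q>').first
end

# Read bytes up to the terminating NUL, which is consumed but not returned.
def read_string
  str = String.new
  loop do
    char = read_n_bytes(1)
    break if char == "\x00"
    str << char
  end
  str
end

# Decode the variable-width integer that version 4 stores before each path
# (the same encoding as OFS_DELTA offsets in pack files).
def read_varint
  byte = read_n_bytes(1).getbyte(0)
  value = byte & 0x7f
  while (byte & 0x80) != 0
    byte = read_n_bytes(1).getbyte(0)
    value = ((value + 1) << 7) | (byte & 0x7f)
  end
  value
end

def match_signature
  raise "Not a valid Git Index file" if read_uint32 != "DIRC".unpack('N').first
end

def read_version_number
  version = read_uint32
  raise "Unsupported Git Index version: #{version}" unless [2, 3, 4].include?(version)
  version
end

def read_number_of_entries
  read_uint32
end

# Read one entry. `previous_name` is only used by version 4 prefix compression.
def read_index_entry(version, previous_name)
  entry = {}
  entry[:ctime_seconds] = read_uint32
  entry[:ctime_nanoseconds] = read_uint32
  entry[:mtime_seconds] = read_uint32
  entry[:mtime_nanoseconds] = read_uint32
  entry[:dev] = read_uint32
  entry[:inode] = read_uint32
  mode = read_uint32
  entry[:type] = (mode >> 12) & 0b1111
  entry[:permissions] = mode & 0b111111111
  entry[:uid] = read_uint32
  entry[:gid] = read_uint32
  entry[:size] = read_uint32
  entry[:object_name] = read_n_bytes(20).unpack('H*').first
  flags = read_uint16
  entry[:assume_valid] = (flags >> 15) & 0b1
  entry[:extended] = (flags >> 14) & 0b1
  entry[:stage] = (flags >> 12) & 0b11
  name_length = flags & 0xFFF
  entry_length = (4 * 10) + 20 + 2 # bytes consumed so far
  if entry[:extended] == 1
    # Version 3 or later: a second 16-bit flags field follows the first.
    extended_flags = read_uint16
    entry[:skip_worktree] = (extended_flags >> 14) & 0b1
    entry[:intent_to_add] = (extended_flags >> 13) & 0b1
    entry_length += 2
  end
  if version == 4
    # Prefix compression: drop N bytes from the previous path, append S. No padding.
    strip_length = read_varint
    prefix = previous_name.b
    name = prefix.byteslice(0, prefix.bytesize - strip_length) + read_string
  elsif name_length < 0xFFF
    name = read_n_bytes(name_length)
    read_n_bytes(8 - (entry_length + name_length) % 8) # 1-8 NUL bytes
  else
    # The length did not fit in 12 bits, so scan for the NUL instead.
    name = read_string
    read_n_bytes(8 - (entry_length + name.bytesize) % 8 - 1) # one NUL already consumed
  end
  entry[:name] = name.force_encoding('UTF-8')
  entry
end

def read_all_paths
  match_signature
  version_no = read_version_number
  no_of_entries = read_number_of_entries
  paths = []
  previous_name = ''
  no_of_entries.times do
    entry = read_index_entry(version_no, previous_name)
    previous_name = entry[:name]
    paths << Pathname.new(entry[:name])
  end
  paths
end

# Call `read_all_paths` and do as you wish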
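
To try the layout described in the header comment without a real repository, here is a minimal sketch that packs a one-entry, version 2 index into a string and parses it back from a `StringIO`. The helper names `build_sample_index` and `parse_paths`, the sample path, and the zeroed stat fields and object name are all made up for illustration. The sketch ignores extensions, the trailing checksum, extended flags, and versions 3 and 4.

```ruby
require 'stringio'

# Hypothetical helper: pack a version 2 index that holds a single entry.
def build_sample_index(path)
  header = ['DIRC', 2, 1].pack('a4NN')             # signature, version, entry count
  # ctime, mtime, dev, ino, mode, uid, gid, size: ten 32-bit big-endian fields
  stat_fields = [0, 0, 0, 0, 0, 0, 0o100644, 0, 0, 0].pack('N10')
  object_name = "\x00".b * 20                      # placeholder 20-byte object name
  flags = [path.bytesize & 0xFFF].pack('n')        # stage 0, no assume-valid/extended bits
  entry = stat_fields + object_name + flags + path.b
  padding = 8 - entry.bytesize % 8                 # 1-8 NUL bytes
  header + entry + ("\x00".b * padding)
end

# Hypothetical helper: read the entry paths back from any IO-like object.
def parse_paths(io)
  signature, version, count = io.read(12).unpack('a4NN')
  raise 'Not a Git index' unless signature == 'DIRC'
  raise "This sketch only handles version 2, got #{version}" unless version == 2

  Array.new(count) do
    io.read(60)                                    # skip stat fields (40) and object name (20)
    name_length = io.read(2).unpack1('n') & 0xFFF  # low 12 bits of the flags field
    name = io.read(name_length).force_encoding('UTF-8')
    io.read(8 - (62 + name_length) % 8)            # skip the NUL padding
    name
  end
end

index_bytes = build_sample_index('lib/witty.rb')
puts parse_paths(StringIO.new(index_bytes)).inspect  # prints ["lib/witty.rb"]
```

Since the script itself reads from standard input, a real index would be fed to it with something like `ruby Witty.rb < .git/index`, once a call to `read_all_paths` has been added at the bottom.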