Git - Distributed Version Control System | Augment Code

Overview

Relevant Files

README.md
Documentation/gitdatamodel.adoc
Documentation/gitrepository-layout.adoc
git.c
repository.h and repository.c

Git is a fast, scalable, distributed revision control system with an unusually rich command set. It provides both high-level operations for everyday use and full access to internal mechanisms for advanced workflows. Originally written by Linus Torvalds, Git is an open-source project licensed under GPLv2.

Core Purpose

Git enables developers to track changes to code and collaborate efficiently across distributed teams. It stores complete repository history locally, allowing offline work and fast operations. Unlike centralized version control systems, Git gives every developer a full copy of the project history.

Data Model

Git's architecture is built on four fundamental data structures:

Objects - Immutable data units representing commits, trees, blobs, and tags. Each object is identified by a hash of its contents (SHA-1 by default, SHA-256 optionally).
References - Pointers to objects (branches, tags, remote-tracking branches) that enable human-readable navigation.
Index - The staging area that tracks which changes will be included in the next commit.
Reflogs - Historical logs of reference changes for recovery and debugging.

Repository Structure

A Git repository consists of:

.git directory - Contains all repository metadata and object storage
objects/ directory - Stores all Git objects (commits, trees, blobs, tags) in compressed pack files or loose format
refs/ directory - Stores branch and tag references
HEAD - Points to the current branch, or directly to a commit when HEAD is detached
Working tree - The actual files you edit (absent in bare repositories)

Command Architecture

Git's command system is organized hierarchically:

git <command> [<subcommand>] [<options>] [<arguments>]

The main entry point (git.c) dispatches commands to built-in implementations in the builtin/ directory. Commands can be simple (like git add) or have subcommands (like git remote add, git maintenance run).

Key Components

Repository Management - Handles initialization, configuration, and state tracking via repository.h/c
Object Database - Manages storage and retrieval of Git objects with compression and indexing
Reference System - Maintains branches, tags, and tracking references
Index Management - Tracks staged changes and working tree state
Diff Engine - Computes differences between commits, trees, and working tree
Merge System - Handles multi-way merges with conflict detection

Performance Features

Git includes advanced optimizations for large repositories:

Commit graphs - Accelerate history traversal
Multi-pack index - Speed up object lookups across multiple pack files
Sparse checkout - Reduce working tree size for monorepos
Partial clone - Download only needed objects on demand
Background maintenance - Optimize repository structure automatically

Architecture & Core Data Structures

Relevant Files

object.h & object.c
odb.h & odb.c
repository.h & repository.c

Git's architecture centers on three core data structures that work together to manage repository state and object storage.

The Repository Structure

The struct repository is the top-level container for all repository-related state. It holds:

Path information: gitdir (the .git directory) and commondir (shared directory for worktrees)
Object database: struct object_database *objects for accessing stored objects
Parsed object pool: struct parsed_object_pool *parsed_objects for in-memory object caching
Reference storage: struct ref_store *refs_private for managing branches and tags
Configuration and state: Hash algorithm, index state, submodule caches, and remote configuration

The Object Database (ODB)

The struct object_database manages access to Git objects through a multi-source architecture:

struct object_database {
  struct odb_source *sources;      /* Primary + alternates */
  struct oidmap replace_map;       /* Object replacements */
  struct packfile_store *packfiles;/* Packed object access */
  struct cached_object_entry *cached_objects;
};

Key features:

Multiple sources: Primary source (.git/objects) plus alternates for shared object pools
Lazy loading: Alternates loaded on-demand from .git/objects/info/alternates
Transactions: Supports atomic object writes via odb_transaction_begin()
Object replacement: Maps objects to replacements (see git-replace)

The Parsed Object Pool

The struct parsed_object_pool caches parsed objects in memory:

struct parsed_object_pool {
  struct object **obj_hash;        /* Hash table of objects */
  int nr_objs, obj_hash_size;      /* Count and capacity */
  struct alloc_state *blob_state;  /* Per-type allocators */
  struct commit_graft **grafts;    /* Parent substitutions */
};

This uses an open-addressing hash table with linear probing, giving expected O(1) object lookup by OID. Objects are allocated from type-specific memory pools for efficiency.

Core Object Types

Git defines four core object types, plus a placeholder value; the type is stored in 3 bits:

OBJ_COMMIT (1): Snapshot of repository state
OBJ_TREE (2): Directory listing
OBJ_BLOB (3): File content
OBJ_TAG (4): Annotated tag
OBJ_NONE (0): Placeholder for an object whose type is not yet known

Delta objects (OBJ_OFS_DELTA, OBJ_REF_DELTA) are used only in pack files.

Object Lookup and Creation

Lookup (lookup_object): Uses hash table with linear probing. On collision, moves found object to initial position for faster future lookups.

Creation (create_object): Allocates from type-specific pool, inserts into hash table, grows table when 50% full.

Parsing (parse_object): Reads raw object from ODB, validates hash, deserializes into typed structure (commit, tree, blob, or tag).
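
The lookup and creation behavior described above can be sketched with a small open-addressing table. This is a minimal illustration, not Git's code: the toy_* names, the 20-byte ID, the 32-slot minimum, and the exact growth test are all assumptions made for the example.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified stand-ins for Git's object and pool types. */
struct toy_object {
    unsigned char oid[20];      /* raw object ID */
};

struct toy_pool {
    struct toy_object **slots;  /* open-addressing table, NULL = empty */
    unsigned int nr, size;      /* entries in use, capacity (power of two) */
};

/* Object IDs are already uniformly distributed, so four ID bytes
 * serve as the hash value. */
static unsigned int toy_slot(const unsigned char *oid, unsigned int size)
{
    uint32_t h;
    memcpy(&h, oid, sizeof(h));
    return h & (size - 1);
}

/* Place obj in the first free slot at or after its home slot. */
static void toy_insert(struct toy_object **slots, unsigned int size,
                       struct toy_object *obj)
{
    unsigned int i = toy_slot(obj->oid, size);
    while (slots[i])
        i = (i + 1) & (size - 1);
    slots[i] = obj;
}

/* Double the table (minimum 32 slots) and rehash every entry. */
static void toy_grow(struct toy_pool *p)
{
    unsigned int new_size = p->size < 32 ? 32 : 2 * p->size;
    struct toy_object **fresh = calloc(new_size, sizeof(*fresh));
    if (!fresh)
        abort();
    for (unsigned int i = 0; i < p->size; i++)
        if (p->slots[i])
            toy_insert(fresh, new_size, p->slots[i]);
    free(p->slots);
    p->slots = fresh;
    p->size = new_size;
}

struct toy_object *toy_lookup(struct toy_pool *p, const unsigned char *oid)
{
    unsigned int first, i;
    if (!p->size)
        return NULL;
    first = i = toy_slot(oid, p->size);
    while (p->slots[i]) {
        if (!memcmp(p->slots[i]->oid, oid, 20)) {
            /* Swap the hit into its home slot so the next lookup
             * for the same ID finds it on the first probe. */
            struct toy_object *hit = p->slots[i];
            p->slots[i] = p->slots[first];
            p->slots[first] = hit;
            return hit;
        }
        i = (i + 1) & (p->size - 1);
    }
    return NULL;
}

struct toy_object *toy_create(struct toy_pool *p, const unsigned char *oid)
{
    struct toy_object *obj = calloc(1, sizeof(*obj));
    if (!obj)
        abort();
    memcpy(obj->oid, oid, 20);
    /* Grow before the table becomes more than about half full. */
    if (p->size <= 2 * (p->nr + 1))
        toy_grow(p);
    toy_insert(p->slots, p->size, obj);
    p->nr++;
    return obj;
}
```

The swap in toy_lookup() is safe because every slot between the home slot and the hit is occupied, so the displaced entry stays reachable by probing.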

Memory Management

Objects are allocated in slabs (contiguous blocks) per type, not individually. This reduces fragmentation and improves cache locality. The pool tracks allocation state per type and frees entire slabs on cleanup, not individual objects.
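
A slab allocator of this kind might look roughly like the sketch below. The names (toy_alloc_state, toy_alloc_node, toy_alloc_clear) and the slab size of 1024 nodes are illustrative assumptions; each allocator state is assumed to serve a single node size.

```c
#include <stdlib.h>

#define TOY_SLAB_NODES 1024  /* nodes per slab; an illustrative value */

struct toy_alloc_state {
    char *next;              /* next free node in the current slab */
    size_t left;             /* nodes remaining in the current slab */
    void **slabs;            /* every slab handed out, for bulk freeing */
    size_t nr_slabs, alloc_slabs;
};

/* Hand out one node; every call on a given state must use the same size. */
void *toy_alloc_node(struct toy_alloc_state *s, size_t node_size)
{
    void *ret;
    if (!s->left) {
        /* Current slab exhausted: record and allocate a new one. */
        if (s->nr_slabs == s->alloc_slabs) {
            size_t n = s->alloc_slabs ? 2 * s->alloc_slabs : 4;
            void **grown = realloc(s->slabs, n * sizeof(*grown));
            if (!grown)
                abort();
            s->slabs = grown;
            s->alloc_slabs = n;
        }
        s->next = calloc(TOY_SLAB_NODES, node_size);
        if (!s->next)
            abort();
        s->slabs[s->nr_slabs++] = s->next;
        s->left = TOY_SLAB_NODES;
    }
    ret = s->next;
    s->next += node_size;
    s->left--;
    return ret;
}

/* Free whole slabs at once; individual nodes are never freed. */
void toy_alloc_clear(struct toy_alloc_state *s)
{
    while (s->nr_slabs)
        free(s->slabs[--s->nr_slabs]);
    free(s->slabs);
    s->slabs = NULL;
    s->alloc_slabs = 0;
    s->next = NULL;
    s->left = 0;
}
```

Consecutive nodes come from the same contiguous block, which is where the locality benefit comes from.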

Object Storage & Retrieval

Relevant Files

object-file.c - Core object storage and retrieval logic
object-file.h - Object file API and structures
loose.c - Loose object mapping and management
loose.h - Loose object map interface
hash.c - Hash algorithm implementations
hash.h - Hash algorithm definitions and context

Overview

Git stores objects (commits, trees, blobs, tags) in two primary formats: loose objects and packed objects. This section focuses on loose object storage and retrieval, which uses a content-addressable filesystem layout where object paths are derived from their SHA-1 or SHA-256 hashes.

Loose Object Storage Layout

Loose objects are stored in .git/objects/ using a two-level directory structure:

.git/objects/
├── ab/
│   ├── cdef1234567890...
│   └── 1234567890abcdef...
├── cd/
│   └── ef1234567890abcd...
└── ...

The first two hex characters of the hash form the directory name, and the remaining characters form the filename. This structure is generated by fill_loose_path() and odb_loose_path(), which convert an object ID into its filesystem path.
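
As a rough illustration of the path layout, the helper below builds a loose object path from a hex object ID. toy_loose_path() is a hypothetical function written for this example; it is not the signature of fill_loose_path() or odb_loose_path().

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: write
 * "<objdir>/<first two hex digits>/<remaining digits>" into buf.
 * Returns 0 on success, -1 if the ID is too short or buf is too small.
 */
int toy_loose_path(char *buf, size_t buflen,
                   const char *objdir, const char *hex_oid)
{
    int n;
    if (strlen(hex_oid) < 3)
        return -1;
    n = snprintf(buf, buflen, "%s/%.2s/%s", objdir, hex_oid, hex_oid + 2);
    return (n < 0 || (size_t)n >= buflen) ? -1 : 0;
}
```

Because the helper only splits the string, the same code handles a 40-character SHA-1 ID and a 64-character SHA-256 ID.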

Hash Algorithms

Git supports multiple hash algorithms (SHA-1 and SHA-256) through the git_hash_algo structure:

struct git_hash_algo {
    const char *name;           // "sha1" or "sha256"
    uint32_t format_id;         // Pack file identifier
    size_t rawsz;               // Binary hash size (20 or 32 bytes)
    size_t hexsz;               // Hex representation size (40 or 64 chars)
    git_hash_init_fn init_fn;   // Initialize hash context
    git_hash_update_fn update_fn; // Update with data
    git_hash_final_oid_fn final_oid_fn; // Finalize to OID
    // ... other fields
};

The git_hash_ctx structure maintains state during hashing operations and is used throughout object creation and verification.

Writing Objects

Object writing follows this flow:

Prepare header: Format object type and size (e.g., "blob 42\0")
Compute hash: Use hash_object_file() to generate the OID
Freshen: If the object already exists, update its mtime so it is not garbage-collected, and skip the write
Write loose object: write_loose_object() compresses data with zlib and writes to a temporary file
Atomic rename: Move temp file to final location (e.g., .git/objects/ab/cdef...)

Streaming writes are supported via odb_source_loose_write_stream() for large objects.
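
The header from the first step of the write flow is simply the type name, a space, the decimal size, and a NUL byte. The sketch below formats it; toy_object_header() is a hypothetical helper, and the hashing and zlib steps are deliberately left out.

```c
#include <stdio.h>
#include <stddef.h>

/*
 * Hypothetical helper: format "<type> <size>" plus the terminating NUL
 * into buf. Returns the header length including that NUL, or 0 if buf
 * is too small.
 */
size_t toy_object_header(char *buf, size_t buflen,
                         const char *type, unsigned long size)
{
    int n = snprintf(buf, buflen, "%s %lu", type, size);
    if (n < 0 || (size_t)n >= buflen)
        return 0;
    return (size_t)n + 1;
}
```

The object ID is the hash of this header followed by the object contents, which is why the NUL is counted in the returned length.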

Reading Objects

Object retrieval uses read_loose_object():

Locate file: Construct path from OID
Memory map: Open and mmap the file for efficient access
Decompress header: Extract object type and size
Verify hash: Recompute hash to detect corruption
Return contents: Provide decompressed data or stream

Loose Object Mapping

For hash algorithm transitions, Git maintains a loose_object_map that tracks correspondences between SHA-1 and SHA-256 representations of the same object. This map is persisted in .git/objects/loose-object-idx and enables seamless migration between hash algorithms.

Caching and Performance

The odb_source_loose structure includes:

oidtree cache: Fast lookup of loose objects by OID prefix
subdir_seen bitmap: Tracks which object directories have been scanned
loose_object_map: Maps between compatible hash representations

These optimizations reduce filesystem operations during object lookups and abbreviated hash resolution.

Transactions and Fsync

For durability, object writes can participate in batch fsync transactions:

prepare_loose_object_transaction() creates a temporary object directory
fsync_loose_object_transaction() batches writeout requests
flush_loose_object_transaction() issues a final hardware flush before renaming

This approach improves performance on systems with expensive fsync operations.

Pack Files & Compression

Relevant Files

packfile.c & packfile.h - Pack file loading and management
pack-objects.c & pack-objects.h - Pack creation and object handling
pack-write.c - Index and metadata file writing
midx.c - Multi-pack index support

Pack files are Git's primary storage format for repository objects. They consolidate loose objects and smaller packs into larger, compressed archives to optimize storage and performance.

Pack File Format

A pack file (.pack) contains a header, object entries, and a trailing checksum:

Header (12 bytes):
  - 4-byte signature: 'PACK'
  - 4-byte version (2 or 3)
  - 4-byte object count

Object Entries:
  - Variable-length type & size header
  - Compressed object data (or delta data)

Trailer:
  - SHA-1/SHA-256 checksum of all above
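
A minimal sketch of reading the 12-byte header follows. It assumes the two integer fields are stored big-endian (network byte order); the toy_* names are hypothetical and are not Git's own parsing functions.

```c
#include <stdint.h>
#include <string.h>

struct toy_pack_header {
    uint32_t version;
    uint32_t nr_objects;
};

/* Read a 32-bit big-endian (network byte order) integer. */
static uint32_t toy_get_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Returns 0 on success, -1 on a short buffer, bad signature,
 * or a version other than 2 or 3. */
int toy_parse_pack_header(const unsigned char *buf, size_t len,
                          struct toy_pack_header *out)
{
    if (len < 12 || memcmp(buf, "PACK", 4))
        return -1;
    out->version = toy_get_be32(buf + 4);
    out->nr_objects = toy_get_be32(buf + 8);
    if (out->version != 2 && out->version != 3)
        return -1;
    return 0;
}
```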

Each object entry starts with a variable-length header: the first byte carries a 3-bit type field and the low 4 bits of the size, and each continuation byte contributes 7 more size bits. Objects can be stored as deltas (OFS_DELTA or REF_DELTA) to save space.
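
The type-and-size header can be sketched as the encoder and decoder below. In this encoding the high bit of each byte signals that another byte follows; the first byte holds the type and the low 4 size bits, and later bytes hold 7 size bits each. The function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Encode an entry header. Returns the number of bytes written;
 * buf must have room for at least 10 bytes.
 */
size_t toy_encode_entry_header(unsigned char *buf, unsigned type,
                               uint64_t size)
{
    size_t n = 0;
    unsigned char c = (unsigned char)(((type & 7) << 4) | (size & 15));

    size >>= 4;
    while (size) {
        buf[n++] = c | 0x80;        /* more size bits follow */
        c = size & 0x7f;
        size >>= 7;
    }
    buf[n++] = c;                   /* last byte: high bit clear */
    return n;
}

/* Decode an entry header. Returns the number of bytes consumed,
 * or 0 if the header is truncated or implausibly long. */
size_t toy_decode_entry_header(const unsigned char *buf, size_t len,
                               unsigned *type, uint64_t *size)
{
    size_t n = 0;
    unsigned shift = 4;
    unsigned char c;

    if (!len)
        return 0;
    c = buf[n++];
    *type = (c >> 4) & 7;
    *size = c & 15;
    while (c & 0x80) {
        if (n == len || shift > 60)
            return 0;
        c = buf[n++];
        *size |= (uint64_t)(c & 0x7f) << shift;
        shift += 7;
    }
    return n;
}
```

For example, a blob (type 3) of 300 bytes encodes to the two bytes 0xBC 0x12.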

Compression Strategy

Git uses zlib deflate compression for all packed objects. The compression level is configurable via pack.compression (range -1 to 9, default -1 for zlib default):

Level 0: No compression (fastest)
Level 6: What the zlib default (-1) currently maps to; a balance of speed and compression
Level 9: Maximum compression (slowest)

The do_compress() function in pack-objects.c handles object compression, while write_large_blob_data() streams compression for large blobs to manage memory efficiently.

Index Files

Pack index files (.idx) enable fast object lookup without scanning the entire pack:

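Version 1: Original format with a 256-entry fanout table and 32-bit offsets, so it cannot index packs larger than 4 GiB
Version 2: The default; adds a CRC32 checksum for each object and supports packs > 4 GiB

The index stores object IDs and pack offsets, and version 2 also stores CRC32 values. Version 2 is required whenever an offset does not fit in 32 bits; within a version 2 index, offsets of 2^31 bytes or more are stored in a separate table of 64-bit entries.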
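
The fanout table narrows a lookup to the objects that share the first byte of the wanted ID, after which a binary search finishes the job. The sketch below assumes the usual cumulative layout (entry b counts all objects whose first ID byte is at most b); toy_idx_find() and its flat ID array are hypothetical simplifications of the on-disk format.

```c
#include <stdint.h>
#include <string.h>

/*
 * fanout[b] holds the number of objects whose first ID byte is <= b, so
 * candidates for `want` occupy positions [lo, hi) in the sorted ID table.
 * Returns the position of the ID in the table, or -1 if it is absent.
 */
long toy_idx_find(const uint32_t fanout[256],
                  const unsigned char *sorted_ids, size_t id_len,
                  const unsigned char *want)
{
    uint32_t lo = want[0] ? fanout[want[0] - 1] : 0;
    uint32_t hi = fanout[want[0]];

    while (lo < hi) {
        uint32_t mid = lo + (hi - lo) / 2;
        int cmp = memcmp(sorted_ids + (size_t)mid * id_len, want, id_len);
        if (!cmp)
            return (long)mid;
        if (cmp < 0)
            lo = mid + 1;
        else
            hi = mid;
    }
    return -1;
}
```

The returned position would then be used to read the matching entry from the offset table.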

Supporting Metadata

.rev files: Reverse index mapping pack offsets to object positions for efficient iteration
.mtimes files: Object modification times for cruft pack identification
Multi-pack index (MIDX): Indexes multiple packs simultaneously for faster lookups across pack boundaries

Pack Windows & Memory Management

The struct packed_git maintains a sliding window system (pack_window) for memory-mapped access to pack data. This allows efficient reading without loading entire packs into memory. The use_pack() function manages window allocation and the unuse_pack() function releases references.

Delta Compression

Deltas store differences between objects rather than full copies:

OFS_DELTA: Encodes offset to base object in same pack (space-efficient)
REF_DELTA: Encodes full object ID of base (supports cross-pack references)

Delta chains can be nested, but must eventually resolve to a non-delta base object. The delta data itself is also zlib-compressed.

References & Branches

Relevant Files

refs.c - Backend-independent reference module
refs.h - Public reference API
refs/files-backend.c - Loose file-based reference storage
refs/reftable-backend.c - Reftable reference storage backend
branch.c - Branch creation and management
branch.h - Branch API

Git's reference system is the foundation for tracking commits, branches, and tags. References are names that point to object IDs (typically commits or tags) or to other references (symbolic refs). The system uses a pluggable backend architecture to support different storage formats.

Reference Storage Backends

Git supports multiple reference storage backends, selected at repository initialization:

Files Backend (refs_be_files) - Traditional loose files in .git/refs/ and packed refs in .git/packed-refs. Each reference is a file containing an object ID or symbolic reference target.
Reftable Backend (refs_be_reftable) - Modern format using binary tables for efficient storage and querying. Supports atomic transactions and better performance for large repositories.

The backend interface is defined by struct ref_storage_be, which provides function pointers for operations like reading, writing, iterating, and transaction management.

Reference Transactions

Reference updates are atomic through the transaction system. A ref_transaction groups multiple reference updates and ensures they succeed or fail together:

struct ref_transaction {
    struct ref_store *ref_store;
    struct ref_update **updates;
    size_t nr;
    enum ref_transaction_state state;
};

A transaction is first prepared (its updates are validated) and then either finished (committed) or aborted (rolled back). This prevents partial updates and maintains repository consistency.
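
The all-or-nothing behavior can be illustrated with a toy in-memory store. Everything here is hypothetical (the toy_* types, the fixed-size table, string values standing in for object IDs); the real API works on struct ref_transaction and a storage backend. The sketch also does not reject duplicate updates to the same ref.

```c
#include <string.h>

#define TOY_MAX_REFS 8

/* Hypothetical in-memory ref store; values are short strings, not OIDs. */
struct toy_ref { const char *name; const char *value; };
struct toy_ref_store { struct toy_ref refs[TOY_MAX_REFS]; size_t nr; };

/* One queued update: set `name` to new_value if it currently equals
 * old_value (old_value == NULL means "must not exist yet"). */
struct toy_update { const char *name, *old_value, *new_value; };

static struct toy_ref *toy_find(struct toy_ref_store *s, const char *name)
{
    for (size_t i = 0; i < s->nr; i++)
        if (!strcmp(s->refs[i].name, name))
            return &s->refs[i];
    return NULL;
}

/* Prepare: check every precondition without touching the store. */
static int toy_prepare(struct toy_ref_store *s,
                       const struct toy_update *u, size_t nr)
{
    size_t created = 0;
    for (size_t i = 0; i < nr; i++) {
        struct toy_ref *r = toy_find(s, u[i].name);
        if (!u[i].old_value) {
            if (r)
                return -1;          /* expected a new ref */
            created++;
        } else if (!r || strcmp(r->value, u[i].old_value)) {
            return -1;              /* ref missing or changed underneath us */
        }
    }
    return s->nr + created <= TOY_MAX_REFS ? 0 : -1;
}

/* Commit: apply all updates, or none if preparation fails. */
int toy_transaction_commit(struct toy_ref_store *s,
                           const struct toy_update *u, size_t nr)
{
    if (toy_prepare(s, u, nr))
        return -1;                  /* abort: the store is unchanged */
    for (size_t i = 0; i < nr; i++) {
        struct toy_ref *r = toy_find(s, u[i].name);
        if (!r) {
            r = &s->refs[s->nr++];
            r->name = u[i].name;
        }
        r->value = u[i].new_value;
    }
    return 0;
}
```

Validating everything before writing anything is what keeps a failed update from leaving some refs moved and others not.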

Reference Namespaces

Git organizes references into logical namespaces:

HEAD - Current branch pointer
refs/heads/ - Local branches
refs/tags/ - Annotated and lightweight tags
refs/remotes/ - Remote-tracking branches
refs/stash - Stash storage
refs/notes/ - Metadata annotations

Branch Management

The branch module provides high-level operations for creating and managing branches. Key functions:

create_branch() - Creates a new branch from a starting point with optional tracking setup
create_branches_recursively() - Creates branches in superproject and submodules atomically
dwim_and_setup_tracking() - Configures upstream tracking relationships

Branches are implemented as references in refs/heads/ that point to commits. Tracking branches link local branches to remote branches via configuration.

Reference Resolution

References are resolved recursively through refs_resolve_ref_unsafe(), which follows symbolic references to find the ultimate object ID. Resolution flags control behavior:

RESOLVE_REF_READING - Fail if reference doesn't exist
RESOLVE_REF_NO_RECURSE - Stop after one level of indirection
RESOLVE_REF_ALLOW_BAD_NAME - Allow malformed reference names
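
Symbolic reference resolution can be sketched over a toy table as follows. The "ref: " prefix matches how the files backend marks a symbolic ref; everything else (toy_resolve(), the TOY_NO_RECURSE flag, the depth limit of 5) is an assumption for illustration and not the behavior of refs_resolve_ref_unsafe() in detail.

```c
#include <string.h>

#define TOY_MAX_DEPTH 5           /* illustrative limit on symref chains */
#define TOY_NO_RECURSE (1u << 0)  /* hypothetical flag: follow one level only */

/* A ref whose value starts with "ref: " is symbolic; any other value
 * stands in for an object ID. */
struct toy_refent { const char *name; const char *value; };

static const char *toy_read(const struct toy_refent *refs, size_t nr,
                            const char *name)
{
    for (size_t i = 0; i < nr; i++)
        if (!strcmp(refs[i].name, name))
            return refs[i].value;
    return NULL;
}

/* Follow symbolic refs until a direct value is found. Returns NULL for
 * a missing ref or a chain deeper than TOY_MAX_DEPTH. */
const char *toy_resolve(const struct toy_refent *refs, size_t nr,
                        const char *name, unsigned flags)
{
    for (int depth = 0; depth <= TOY_MAX_DEPTH; depth++) {
        const char *value = toy_read(refs, nr, name);
        if (!value)
            return NULL;
        if (strncmp(value, "ref: ", 5))
            return value;           /* direct ref: done */
        name = value + 5;           /* symbolic ref: follow its target */
        if (flags & TOY_NO_RECURSE)
            return name;            /* report the target name instead */
    }
    return NULL;                    /* too many levels, likely a loop */
}
```

The depth limit is what stops a ref that points at itself from looping forever.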

Reflog

The reflog records all reference updates, enabling recovery of lost commits and branch history inspection. Each reference maintains a log of previous values with timestamps and operation descriptions.

Revision Walking & History Traversal

Relevant Files

revision.c & revision.h
commit-reach.c & commit-reach.h
commit-graph.c & commit-graph.h
path-walk.c & path-walk.h

Core Revision Walking API

The revision walking system provides a structured way to traverse commit history. The main entry point is the rev_info structure, which holds configuration for what commits to walk and how to traverse them.

Key workflow:

Initialize rev_info with repo_init_revisions()
Configure traversal options (filters, sort order, etc.)
Call prepare_revision_walk() to build the commit list
Iterate using get_revision() to fetch commits one by one

Traversal Modes

Git supports multiple traversal strategies controlled by flags in rev_info:

Date-ordered: Default mode, commits sorted by commit date
Topological order (topo_order): Respects parent-child relationships, ensuring no parent is shown before all of its children
Reflog walking (reflog_info): Traverses reference history instead of commit ancestry
Limited traversal (limited): Applies filters like date ranges or path restrictions

Topological Walking

For topological ordering, Git uses a sophisticated three-phase algorithm implemented in topo_walk_info:

Explore phase: Discovers all reachable commits and their generation numbers
Indegree phase: Computes each commit's in-degree, the number of explored children that point to it
Output phase: Emits commits in topological order using priority queues

This approach leverages commit-graph generation numbers to optimize traversal, skipping unnecessary exploration of older commits.
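
The in-degree and output phases can be illustrated with a plain Kahn-style topological sort over a toy commit array. This sketch swaps in a simple FIFO queue for Git's priority queues and ignores generation numbers entirely; the types and names are hypothetical.

```c
#include <stddef.h>

#define TOY_MAX_PARENTS 2

/* Hypothetical commit node: parents are indices into the same array,
 * -1 marks an unused parent slot. */
struct toy_commit { int parents[TOY_MAX_PARENTS]; };

/*
 * Write all n commits to out[] so that every commit appears before any
 * of its parents. Returns the number of commits emitted (n for a valid
 * DAG). indegree[] is caller-provided scratch space of n ints.
 */
size_t toy_topo_order(const struct toy_commit *c, size_t n,
                      int *indegree, int *out)
{
    size_t emitted = 0, head = 0;

    /* In-degree phase: count the children that point at each commit. */
    for (size_t i = 0; i < n; i++)
        indegree[i] = 0;
    for (size_t i = 0; i < n; i++)
        for (int p = 0; p < TOY_MAX_PARENTS; p++)
            if (c[i].parents[p] >= 0)
                indegree[c[i].parents[p]]++;

    /* Output phase: out[] doubles as the work queue; start from tips. */
    for (size_t i = 0; i < n; i++)
        if (!indegree[i])
            out[emitted++] = (int)i;
    while (head < emitted) {
        const struct toy_commit *cur = &c[out[head++]];
        for (int p = 0; p < TOY_MAX_PARENTS; p++) {
            int parent = cur->parents[p];
            /* A parent is ready once all of its children are emitted. */
            if (parent >= 0 && --indegree[parent] == 0)
                out[emitted++] = parent;
        }
    }
    return emitted;
}
```

A commit is only emitted when its in-degree drops to zero, which is exactly the "children before parents" guarantee.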

Commit Reachability Analysis

The commit-reach.c module provides high-level reachability queries:

repo_is_descendant_of(): Check if one commit descends from another
repo_get_merge_bases(): Find common ancestors
can_all_from_reach_with_flag(): Batch reachability checks with generation cutoffs
ahead_behind(): Count commits ahead/behind between branches

These functions use commit-graph generation numbers as optimization hints to prune search spaces.
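
How a generation number prunes a reachability search can be shown on a toy graph. The sketch assumes topological-level generation numbers (1 for roots, otherwise one more than the largest parent generation); the toy_* names and the flag value are hypothetical, and the recursion is kept simple rather than efficient.

```c
#include <stddef.h>

#define TOY_MAX_PARENTS 2
#define TOY_SEEN (1u << 0)          /* illustrative flag bit */

struct toy_gcommit {
    int parents[TOY_MAX_PARENTS];   /* indices into the array, -1 = none */
    unsigned generation;            /* 1 for roots, else 1 + max over parents */
    unsigned flags;
};

/*
 * Return 1 if `target` is reachable from `from` by following parents.
 * An ancestor always has a strictly smaller generation than its
 * descendant, so any commit at or below the target's generation can be
 * skipped without looking at its parents.
 */
int toy_can_reach(struct toy_gcommit *c, int from, int target)
{
    if (from == target)
        return 1;
    if (c[from].generation <= c[target].generation)
        return 0;                   /* pruned by generation number */
    if (c[from].flags & TOY_SEEN)
        return 0;                   /* already explored without success */
    c[from].flags |= TOY_SEEN;
    for (int p = 0; p < TOY_MAX_PARENTS; p++) {
        int parent = c[from].parents[p];
        if (parent >= 0 && toy_can_reach(c, parent, target))
            return 1;
    }
    return 0;
}

/* Reset the marks between queries. */
void toy_clear_flags(struct toy_gcommit *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i].flags = 0;
}
```

Without the generation check, a negative answer would require walking all the way down to the root commits.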

Commit Graph Integration

The commit-graph file accelerates history traversal by pre-computing:

Parent relationships
Generation numbers (such as a commit's topological level: one more than the highest generation among its parents)
Bloom filters for path-based filtering

Functions like parse_commit_in_graph() and commit_graph_generation() provide fast lookups without parsing raw commit objects.

Path-Based Walking

The path-walk API enables efficient traversal of specific file paths across history:

int walk_objects_by_path(struct path_walk_info *info)

This batches objects by path: the caller is handed all objects found at one path together, rather than in commit order, which suits operations that process the objects at a path as a group.

Object Flags and Marking

Revision walking uses bit flags to mark commit state during traversal:

SEEN: Commit already processed
UNINTERESTING: Commit excluded from results (e.g., via --not)
TREESAME: Commit has identical tree to parent (used for merge simplification)
TOPO_WALK_EXPLORED: Commit visited during exploration phase
TOPO_WALK_INDEGREE: Commit processed during the in-degree phase

These flags enable efficient single-pass algorithms without maintaining separate data structures.

Merge Algorithms & Diff Engine

Relevant Files

merge-ort.c & merge-ort.h - ORT merge strategy (default)
merge-ort-wrappers.c & merge-ort-wrappers.h - Wrapper functions
merge-ll.c & merge-ll.h - Low-level three-way merge
diff.c & diff.h - Diff engine and options
diffcore.h - Diff core data structures
xdiff-interface.c - XDiff library integration
xdiff/xmerge.c - XDiff merge implementation

Overview

Git's merge and diff systems work together to reconcile competing changes. The ORT (Ostensibly Recursive's Twin) strategy is the default merge algorithm, replacing the older recursive strategy. It performs three-way merges with rename detection and handles complex scenarios like directory renames and content conflicts.

Merge Architecture

The merge process follows this pipeline:

Collect merge info - Traverse all three trees (base, side1, side2) and build a map of all paths
Detect renames - Use diffcore to identify file renames and copies
Process entries - Resolve conflicts for each path, applying merge drivers
Output result - Generate merged tree with conflict markers where needed
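
For a single path, the simplest part of the "process entries" step is the classic three-way rule, sketched below. Strings stand in for object IDs and the toy_* names are hypothetical; the real code also has to deal with modes, renames, and directory/file conflicts.

```c
#include <string.h>

enum toy_merge_result { TOY_TAKE_SIDE1, TOY_TAKE_SIDE2, TOY_CONTENT_MERGE };

/* Strings stand in for object IDs; NULL means the path is absent. */
static int toy_same(const char *a, const char *b)
{
    return (!a && !b) || (a && b && !strcmp(a, b));
}

/*
 * Decide how to resolve one path from its three versions:
 * - both sides agree: take either one;
 * - only one side changed relative to the base: take that side;
 * - both sides changed differently: a content-level merge is needed,
 *   which may in turn produce a conflict.
 */
enum toy_merge_result toy_resolve_path(const char *base,
                                       const char *side1,
                                       const char *side2)
{
    if (toy_same(side1, side2))
        return TOY_TAKE_SIDE1;
    if (toy_same(base, side1))
        return TOY_TAKE_SIDE2;
    if (toy_same(base, side2))
        return TOY_TAKE_SIDE1;
    return TOY_CONTENT_MERGE;
}
```

Only the last case needs file contents at all, which is why most paths in a merge can be resolved from object IDs alone.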

Key Data Structures

merge_options - Configuration for merge behavior:

detect_renames - Enable rename detection
xdl_opts - XDiff options (patience, histogram, ignore whitespace)
conflict_style - Marker format (merge, diff3, zdiff3)
recursive_variant - Resolution strategy (normal, ours, theirs)

conflict_info - Per-path conflict metadata:

stages[3] - OID & mode for base, side1, side2
pathnames[3] - Paths after rename detection
df_conflict - Directory/file conflict flag
path_conflict - Rename/delete or other path conflicts

Diff Engine

The diff system processes file pairs through a pipeline:

XDiff Integration - Low-level diff algorithm:

Compares files line-by-line using Myers or histogram algorithm
Produces hunks with context lines
Supports ignore patterns (whitespace, regex)

Three-way Merge (ll_merge) - Merges individual files:

Calls xdl_merge() for text files
Handles binary files (no merge)
Applies merge drivers from .gitattributes
Generates conflict markers on failure

Conflict Resolution

Conflicts are marked with <<<<<<<, =======, >>>>>>> markers. The merge process handles:

Content conflicts - Text merge failed; markers inserted
Rename/rename - Same file renamed differently on both sides
Rename/delete - File renamed on one side, deleted on other
Directory renames - Inferred from file movements
Mode conflicts - File mode differs (executable, symlink)

Performance Optimizations

Rename caching - Reuses rename detection across sequential merges (cherry-pick, rebase)
Histogram diff - Faster than Myers for large files with repetitive content
Lazy loading - Blobs fetched only when needed for merge drivers
Sparse checkout - Merges only relevant paths in partial clones

Command Implementation & Porcelain

Relevant Files

builtin.h
git.c
builtin/commit.c
builtin/checkout.c
builtin/merge.c
builtin/rebase.c
builtin/fetch.c
builtin/push.c
builtin/log.c
command-list.txt

Git distinguishes between porcelain commands (user-facing, high-level) and plumbing commands (low-level, internal). This section covers how builtin commands are implemented and registered in the Git codebase.

Command Registration

All builtin commands are registered in the commands[] array in git.c. Each entry contains:

{ "command-name", cmd_function, FLAGS }

The flags control command behavior:

RUN_SETUP – Requires a Git repository; changes to repo root if in subdirectory
RUN_SETUP_GENTLY – Accepts missing repository gracefully
NEED_WORK_TREE – Requires a working tree (not bare repository)
DELAY_PAGER_CONFIG – Defers pager configuration to the command itself
NO_PARSEOPT – Command handles option parsing manually
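
A table-driven dispatcher in the spirit of commands[] might look like the sketch below. It is a simplified stand-in: the toy_* names, the two-argument command signature, the flag bit values, and the have_repo/have_work_tree parameters are assumptions for illustration, not the definitions in git.c.

```c
#include <stddef.h>
#include <string.h>

#define TOY_RUN_SETUP      (1u << 0)  /* illustrative bit values */
#define TOY_NEED_WORK_TREE (1u << 1)

struct toy_cmd {
    const char *name;
    int (*fn)(int argc, const char **argv);
    unsigned flags;
};

static int toy_cmd_version(int argc, const char **argv)
{
    (void)argc; (void)argv;
    return 0;
}

static int toy_cmd_status(int argc, const char **argv)
{
    (void)argc; (void)argv;
    return 0;
}

static const struct toy_cmd toy_commands[] = {
    { "status",  toy_cmd_status,  TOY_RUN_SETUP | TOY_NEED_WORK_TREE },
    { "version", toy_cmd_version, 0 },
};

const struct toy_cmd *toy_find_cmd(const char *name)
{
    size_t n = sizeof(toy_commands) / sizeof(toy_commands[0]);
    for (size_t i = 0; i < n; i++)
        if (!strcmp(toy_commands[i].name, name))
            return &toy_commands[i];
    return NULL;
}

/* Returns the command's exit status, or -1 if the command is unknown
 * or its flag requirements are not met. */
int toy_run(const char *name, int have_repo, int have_work_tree,
            int argc, const char **argv)
{
    const struct toy_cmd *cmd = toy_find_cmd(name);
    if (!cmd)
        return -1;
    if ((cmd->flags & TOY_RUN_SETUP) && !have_repo)
        return -1;
    if ((cmd->flags & TOY_NEED_WORK_TREE) && !have_work_tree)
        return -1;
    return cmd->fn(argc, argv);
}
```

Keeping the requirements in the table means the dispatcher, not each command, decides whether a repository or working tree must be present before the command runs.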

Command Implementation Pattern

Every builtin command follows a standard signature defined in builtin.h:

int cmd_foo(int argc, const char **argv, 
            const char *prefix, struct repository *repo)

The prefix parameter contains the relative path from the repository root to the directory where the command was invoked. This enables commands to resolve user-supplied paths correctly.

Porcelain vs. Plumbing

Commands are classified in command-list.txt by type:

mainporcelain – Primary user-facing commands (commit, checkout, merge)
ancillarymanipulators/interrogators – Secondary user commands (branch, tag)
plumbingmanipulators/interrogators – Low-level internal commands (cat-file, hash-object)
purehelpers – Utility commands (check-attr, credential)

Complex Command Example: Commit

The git commit command (builtin/commit.c) demonstrates typical patterns:

Option parsing using parse_options() with a struct option array
State management via struct wt_status for working tree status
Index manipulation to stage changes before creating the commit object
Hook execution for pre-commit and post-commit workflows
Ref updates to move HEAD to the new commit

Subcommands like git bisect use OPT_SUBCOMMAND() to delegate to specialized handlers, enabling modular command hierarchies.

Execution Flow

When a user runs git foo:

git.c:cmd_main() parses global options
handle_builtin() looks up the command in the commands[] array
Repository setup occurs based on flags (RUN_SETUP, etc.)
The command function executes with prepared context
Exit status is returned and validated