git/git

Git - Distributed Version Control System

Last updated on Dec 17, 2025 (Commit: c4a0c88)

Overview

Relevant Files
  • README.md
  • Documentation/gitdatamodel.adoc
  • Documentation/gitrepository-layout.adoc
  • git.c
  • repository.h and repository.c

Git is a fast, scalable, distributed revision control system with an unusually rich command set. It provides both high-level operations for everyday use and full access to internal mechanisms for advanced workflows. Originally written by Linus Torvalds, Git is an open-source project licensed under GPLv2.

Core Purpose

Git enables developers to track changes to code and collaborate efficiently across distributed teams. It stores complete repository history locally, allowing offline work and fast operations. Unlike centralized version control systems, Git gives every developer a full copy of the project history.

Data Model

Git's architecture is built on four fundamental data structures:

  1. Objects - Immutable data units representing commits, trees, blobs, and tags. Each object is identified by the hash of its contents (SHA-1 by default, SHA-256 in repositories that use that object format).
  2. References - Pointers to objects (branches, tags, remote-tracking branches) that enable human-readable navigation.
  3. Index - The staging area that tracks which changes will be included in the next commit.
  4. Reflogs - Historical logs of reference changes for recovery and debugging.

Repository Structure

A Git repository consists of:

  • .git directory - Contains all repository metadata and object storage
  • objects/ directory - Stores all Git objects (commits, trees, blobs, tags) in compressed pack files or loose format
  • refs/ directory - Stores branch and tag references
  • HEAD - Points to the current branch, or directly to a commit when detached
  • Working tree - The actual files you edit (absent in bare repositories)

Command Architecture

Git's command system is organized hierarchically:

git <command> [<subcommand>] [<options>] [<arguments>]

The main entry point (git.c) dispatches commands to built-in implementations in the builtin/ directory. Commands can be simple (like git add) or have subcommands (like git remote add, git maintenance run).

Key Components

  • Repository Management - Handles initialization, configuration, and state tracking via repository.h/c
  • Object Database - Manages storage and retrieval of Git objects with compression and indexing
  • Reference System - Maintains branches, tags, and tracking references
  • Index Management - Tracks staged changes and working tree state
  • Diff Engine - Computes differences between commits, trees, and working tree
  • Merge System - Handles multi-way merges with conflict detection

Performance Features

Git includes advanced optimizations for large repositories:

  • Commit graphs - Accelerate history traversal
  • Multi-pack index - Speed up object lookups across multiple pack files
  • Sparse checkout - Reduce working tree size for monorepos
  • Partial clone - Download only needed objects on demand
  • Background maintenance - Optimize repository structure automatically

Architecture & Core Data Structures

Relevant Files
  • object.h & object.c
  • odb.h & odb.c
  • repository.h & repository.c

Git's architecture centers on three core data structures that work together to manage repository state and object storage.

The Repository Structure

The struct repository is the top-level container for all repository-related state. It holds:

  • Path information: gitdir (the .git directory) and commondir (shared directory for worktrees)
  • Object database: struct object_database *objects for accessing stored objects
  • Parsed object pool: struct parsed_object_pool *parsed_objects for in-memory object caching
  • Reference storage: struct ref_store *refs_private for managing branches and tags
  • Configuration and state: Hash algorithm, index state, submodule caches, and remote configuration

The Object Database (ODB)

The struct object_database manages access to Git objects through a multi-source architecture:

struct object_database {
  struct odb_source *sources;      /* Primary + alternates */
  struct oidmap replace_map;       /* Object replacements */
  struct packfile_store *packfiles;/* Packed object access */
  struct cached_object_entry *cached_objects;
};

Key features:

  • Multiple sources: Primary source (.git/objects) plus alternates for shared object pools
  • Lazy loading: Alternates loaded on-demand from .git/objects/info/alternates
  • Transactions: Supports atomic object writes via odb_transaction_begin()
  • Object replacement: Maps objects to replacements (see git-replace)

The Parsed Object Pool

The struct parsed_object_pool caches parsed objects in memory:

struct parsed_object_pool {
  struct object **obj_hash;        /* Hash table of objects */
  int nr_objs, obj_hash_size;      /* Count and capacity */
  struct alloc_state *blob_state;  /* Per-type allocators */
  struct commit_graft **grafts;    /* Parent substitutions */
};

This is an open-addressing hash table with linear probing, giving expected O(1) lookup by OID. Objects are allocated from type-specific memory pools for efficiency.

Core Object Types

Git defines four core object types, plus a sentinel value for objects whose type is not yet known. The type code fits in 3 bits:

  • OBJ_COMMIT (1): Snapshot of repository state
  • OBJ_TREE (2): Directory listing
  • OBJ_BLOB (3): File content
  • OBJ_TAG (4): Annotated tag
  • OBJ_NONE (0): Type not yet known (sentinel, not a stored type)

Delta objects (OBJ_OFS_DELTA, OBJ_REF_DELTA) are used only in pack files.

Object Lookup and Creation

Lookup (lookup_object): Uses hash table with linear probing. On collision, moves found object to initial position for faster future lookups.

Creation (create_object): Allocates from type-specific pool, inserts into hash table, grows table when 50% full.

Parsing (parse_object): Reads raw object from ODB, validates hash, deserializes into typed structure (commit, tree, blob, or tag).
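
The lookup and creation behavior described above can be sketched with a small open-addressing table. This is a minimal illustration, not Git's code: the toy_ names, the 20-byte OID length, and the initial capacity of 32 are assumptions made for the example. It shows linear probing, moving a found object back to its first probe slot, and growing the table once it is half full.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define OID_LEN 20                       /* assumed SHA-1 sized IDs */

struct toy_object { unsigned char oid[OID_LEN]; int type; };
struct toy_pool {
    struct toy_object **slots;
    unsigned int size;                   /* 0 or a power of two */
    unsigned int count;
};

/* OIDs are already uniformly distributed, so their leading bytes
 * serve as the hash value. */
unsigned int toy_slot(const unsigned char *oid, unsigned int size)
{
    unsigned int h;
    memcpy(&h, oid, sizeof(h));
    return h & (size - 1);
}

void toy_insert(struct toy_object **slots, unsigned int size,
                struct toy_object *obj)
{
    unsigned int i = toy_slot(obj->oid, size);
    while (slots[i])                     /* linear probing */
        i = (i + 1) & (size - 1);
    slots[i] = obj;
}

void toy_grow(struct toy_pool *p)
{
    unsigned int i, new_size = p->size < 32 ? 32 : 2 * p->size;
    struct toy_object **fresh = calloc(new_size, sizeof(*fresh));
    assert(fresh);
    for (i = 0; i < p->size; i++)        /* rehash every live entry */
        if (p->slots[i])
            toy_insert(fresh, new_size, p->slots[i]);
    free(p->slots);
    p->slots = fresh;
    p->size = new_size;
}

struct toy_object *toy_lookup(struct toy_pool *p, const unsigned char *oid)
{
    unsigned int first, i;
    if (!p->size)
        return NULL;
    first = i = toy_slot(oid, p->size);
    while (p->slots[i]) {
        if (!memcmp(p->slots[i]->oid, oid, OID_LEN)) {
            /* Swap the hit into the first probe slot so the next
             * lookup of the same OID finds it immediately. */
            struct toy_object *found = p->slots[i];
            p->slots[i] = p->slots[first];
            p->slots[first] = found;
            return found;
        }
        i = (i + 1) & (p->size - 1);
    }
    return NULL;
}

struct toy_object *toy_create(struct toy_pool *p, const unsigned char *oid,
                              int type)
{
    struct toy_object *obj = calloc(1, sizeof(*obj));
    assert(obj);
    memcpy(obj->oid, oid, OID_LEN);
    obj->type = type;
    if (p->count * 2 >= p->size)         /* grow at 50% load */
        toy_grow(p);
    toy_insert(p->slots, p->size, obj);
    p->count++;
    return obj;
}
```

The swap is safe under linear probing because every slot between the first probe position and the hit is occupied, so the displaced object stays reachable from its own home slot.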

Memory Management

Objects are allocated in slabs (contiguous blocks) per type, not individually. This reduces fragmentation and improves cache locality. The pool tracks allocation state per type and frees entire slabs on cleanup, not individual objects.
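
A slab allocator of this kind can be sketched in a few lines. The names (toy_alloc_state, toy_alloc_node, toy_alloc_release) and the slab size of 1024 nodes are assumptions for illustration and are not taken from Git's source. The sketch hands out fixed-size nodes from a contiguous block and frees whole blocks at once.

```c
#include <assert.h>
#include <stdlib.h>

#define SLAB_NODES 1024                  /* assumed nodes per slab */

struct toy_alloc_state {
    size_t node_size;                    /* size of one object, set by caller */
    char *current;                       /* next free node in newest slab */
    size_t remaining;                    /* free nodes left in newest slab */
    void **slabs;                        /* every slab, for bulk release */
    size_t nr_slabs, slabs_alloc;
};

void *toy_alloc_node(struct toy_alloc_state *s)
{
    void *node;

    if (!s->remaining) {
        if (s->nr_slabs == s->slabs_alloc) {
            size_t n = s->slabs_alloc ? 2 * s->slabs_alloc : 4;
            void **grown = realloc(s->slabs, n * sizeof(*grown));
            assert(grown);
            s->slabs = grown;
            s->slabs_alloc = n;
        }
        /* One zeroed, contiguous block serves the next SLAB_NODES requests. */
        s->current = calloc(SLAB_NODES, s->node_size);
        assert(s->current);
        s->slabs[s->nr_slabs++] = s->current;
        s->remaining = SLAB_NODES;
    }
    node = s->current;
    s->current += s->node_size;
    s->remaining--;
    return node;
}

/* Individual nodes are never freed; cleanup drops whole slabs. */
void toy_alloc_release(struct toy_alloc_state *s)
{
    size_t i;
    for (i = 0; i < s->nr_slabs; i++)
        free(s->slabs[i]);
    free(s->slabs);
    s->slabs = NULL;
    s->current = NULL;
    s->nr_slabs = s->slabs_alloc = s->remaining = 0;
}
```

One state per object type keeps nodes of the same size next to each other, which is where the locality benefit comes from. The caller is expected to set node_size to the sizeof of the struct being allocated so that nodes stay correctly aligned.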

Object Storage & Retrieval

Relevant Files
  • object-file.c - Core object storage and retrieval logic
  • object-file.h - Object file API and structures
  • loose.c - Loose object mapping and management
  • loose.h - Loose object map interface
  • hash.c - Hash algorithm implementations
  • hash.h - Hash algorithm definitions and context

Overview

Git stores objects (commits, trees, blobs, tags) in two primary formats: loose objects and packed objects. This section focuses on loose object storage and retrieval, which uses a content-addressable filesystem layout where object paths are derived from their SHA-1 or SHA-256 hashes.

Loose Object Storage Layout

Loose objects are stored in .git/objects/ using a two-level directory structure:

.git/objects/
├── ab/
│   ├── cdef1234567890...
│   └── 1234567890abcdef...
├── cd/
│   └── ef1234567890abcd...
└── ...

The first two hex characters of the hash form the directory name, and the remaining characters form the filename. This structure is generated by fill_loose_path() and odb_loose_path(), which convert an object ID into its filesystem path.
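
As a rough sketch of that mapping, the helper below builds a loose object path from a hex object ID. The name toy_loose_path and its signature are hypothetical; Git's own fill_loose_path() works on binary object IDs and its own buffer type.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build "<objdir>/<first 2 hex digits>/<remaining hex digits>".
 * Returns 0 on success, -1 if the ID is too short or the buffer too small.
 * Hypothetical helper for illustration; not Git's fill_loose_path().
 */
int toy_loose_path(const char *objdir, const char *hex,
                   char *out, size_t outsz)
{
    int n;

    assert(objdir && hex && out);
    if (strlen(hex) < 3)
        return -1;
    /* "%.2s" prints only the two-character directory prefix. */
    n = snprintf(out, outsz, "%s/%.2s/%s", objdir, hex, hex + 2);
    if (n < 0 || (size_t)n >= outsz)
        return -1;
    return 0;
}
```

Splitting on the first byte caps the fan-out at 256 subdirectories, which keeps any single directory from growing too large on filesystems that handle big directories poorly.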

Hash Algorithms

Git supports multiple hash algorithms (SHA-1 and SHA-256) through the git_hash_algo structure:

struct git_hash_algo {
    const char *name;           // "sha1" or "sha256"
    uint32_t format_id;         // Pack file identifier
    size_t rawsz;               // Binary hash size (20 or 32 bytes)
    size_t hexsz;               // Hex representation size (40 or 64 chars)
    git_hash_init_fn init_fn;   // Initialize hash context
    git_hash_update_fn update_fn; // Update with data
    git_hash_final_oid_fn final_oid_fn; // Finalize to OID
    // ... other fields
};

The git_hash_ctx structure maintains state during hashing operations and is used throughout object creation and verification.

Writing Objects

Object writing follows this flow:

  1. Prepare header: Format object type and size (e.g., "blob 42\0")
  2. Compute hash: Use hash_object_file() to generate the OID
  3. Freshen: If the object already exists, update its mtime so that a concurrent prune does not remove it, and skip the write
  4. Write loose object: Otherwise, write_loose_object() compresses the data with zlib and writes it to a temporary file
  5. Atomic rename: Move the temp file to its final location (e.g., .git/objects/ab/cdef...)
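
The header from the first step is small enough to sketch directly. The function name toy_object_header is hypothetical; the "<type> <size>" text followed by a NUL byte is the format described above, and the hash is computed over this header plus the object contents.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Write "<type> <size>" plus the terminating NUL into buf.
 * Returns the header length including the NUL, or -1 if buf is too small.
 * Hypothetical helper for illustration only.
 */
int toy_object_header(char *buf, size_t bufsz,
                      const char *type, unsigned long size)
{
    int n;

    assert(buf && type);
    n = snprintf(buf, bufsz, "%s %lu", type, size);
    if (n < 0 || (size_t)n >= bufsz)
        return -1;
    return n + 1;   /* the NUL is part of the header that gets hashed */
}
```

For a 42-byte blob this yields the eight bytes "blob 42" followed by NUL.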

Streaming writes are supported via odb_source_loose_write_stream() for large objects.

Reading Objects

Object retrieval uses read_loose_object():

  1. Locate file: Construct path from OID
  2. Memory map: Open and mmap the file for efficient access
  3. Decompress header: Extract object type and size
  4. Verify hash: Recompute hash to detect corruption
  5. Return contents: Provide decompressed data or stream

Loose Object Mapping

For hash algorithm transitions, Git maintains a loose_object_map that tracks correspondences between SHA-1 and SHA-256 representations of the same object. This map is persisted in .git/objects/loose-object-idx and enables seamless migration between hash algorithms.

Caching and Performance

The odb_source_loose structure includes:

  • oidtree cache: Fast lookup of loose objects by OID prefix
  • subdir_seen bitmap: Tracks which object directories have been scanned
  • loose_object_map: Maps between compatible hash representations

These optimizations reduce filesystem operations during object lookups and abbreviated hash resolution.

Transactions and Fsync

For durability, object writes can participate in batch fsync transactions:

  • prepare_loose_object_transaction() creates a temporary object directory
  • fsync_loose_object_transaction() batches writeout requests
  • flush_loose_object_transaction() issues a final hardware flush before renaming

This approach improves performance on systems with expensive fsync operations.

Pack Files & Compression

Relevant Files
  • packfile.c & packfile.h - Pack file loading and management
  • pack-objects.c & pack-objects.h - Pack creation and object handling
  • pack-write.c - Index and metadata file writing
  • midx.c - Multi-pack index support

Pack files are Git's primary storage format for repository objects. They consolidate loose objects and smaller packs into larger, compressed archives to optimize storage and performance.

Pack File Format

A pack file (.pack) contains a header, object entries, and a trailing checksum:

Header (12 bytes):
  - 4-byte signature: 'PACK'
  - 4-byte version (2 or 3)
  - 4-byte object count

Object Entries:
  - Variable-length type & size header
  - Compressed object data (or delta data)

Trailer:
  - SHA-1/SHA-256 checksum of all above
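
A minimal sketch of reading the 12-byte header follows. The struct and function names are hypothetical; the field layout is the one listed above, with the integers stored in network (big-endian) byte order.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

struct toy_pack_header {
    uint32_t version;
    uint32_t nr_objects;
};

/* Decode a 4-byte big-endian integer. */
uint32_t toy_read_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/*
 * Returns 0 on success, -1 on a short buffer, a bad signature,
 * or a version other than 2 or 3.
 */
int toy_parse_pack_header(const unsigned char *buf, size_t len,
                          struct toy_pack_header *out)
{
    assert(buf && out);
    if (len < 12 || memcmp(buf, "PACK", 4))
        return -1;
    out->version = toy_read_be32(buf + 4);
    if (out->version != 2 && out->version != 3)
        return -1;
    out->nr_objects = toy_read_be32(buf + 8);
    return 0;
}
```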

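Each entry header packs a 3-bit type and a variable-length size: the first byte holds the type and the low 4 bits of the size, and each following byte contributes 7 more size bits, with the high bit of every byte marking continuation. Objects can be stored as deltas (OFS_DELTA or REF_DELTA) to save space.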
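
The entry header encoding can be sketched as an encoder and decoder pair. The function names are hypothetical. The bit layout assumed here is the documented pack format: the first byte carries a continuation bit, three type bits, and the low four size bits, and later bytes carry seven size bits each, least significant group first.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns the number of bytes written, or 0 if out is too small. */
size_t toy_encode_entry_header(unsigned type, uint64_t size,
                               unsigned char *out, size_t outsz)
{
    size_t n = 0;
    unsigned char c;

    assert(out);
    if (type > 7 || !outsz)
        return 0;
    c = (unsigned char)((type << 4) | (size & 0x0f));
    size >>= 4;
    while (size) {
        if (n + 1 >= outsz)              /* need room for this and a final byte */
            return 0;
        out[n++] = c | 0x80;             /* more size bytes follow */
        c = size & 0x7f;
        size >>= 7;
    }
    out[n++] = c;
    return n;
}

/* Returns the number of bytes consumed, or 0 on truncated or oversized input. */
size_t toy_decode_entry_header(const unsigned char *buf, size_t len,
                               unsigned *type, uint64_t *size)
{
    size_t n = 0;
    unsigned shift = 4;
    unsigned char c;

    assert(buf && type && size);
    if (!len)
        return 0;
    c = buf[n++];
    *type = (c >> 4) & 7;
    *size = c & 0x0f;
    while (c & 0x80) {
        if (n >= len || shift > 57)      /* truncated, or would overflow 64 bits */
            return 0;
        c = buf[n++];
        *size |= (uint64_t)(c & 0x7f) << shift;
        shift += 7;
    }
    return n;
}
```

A blob (type 3) of 42 bytes encodes as the two bytes 0xBA 0x02.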

Compression Strategy

Git uses zlib deflate compression for all packed objects. The compression level is configurable via pack.compression (range -1 to 9, default -1 for zlib default):

  • Level 0: No compression (fastest)
  • Level 6: Default balance of speed and compression
  • Level 9: Maximum compression (slowest)

The do_compress() function in builtin/pack-objects.c handles object compression, while write_large_blob_data() streams compression for large blobs to keep memory use bounded.

Index Files

Pack index files (.idx) enable fast object lookup without scanning the entire pack:

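  • Version 1: Legacy format with a 256-entry fanout table and 32-bit offsets; no per-object checksums
  • Version 2: Also starts with a 256-entry fanout table, adds a CRC32 for each object's packed data, and supports packs larger than 4 GiB

A version 2 index stores object IDs, CRC32 values, and offsets in separate tables. Version 2 is the default for newly written indexes; an offset that does not fit in 31 bits is stored in an additional table of 64-bit entries.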
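
The role of the fanout table can be sketched as follows. The in-memory layout here (a flat, sorted array of 20-byte IDs) and the toy_ function names are assumptions for illustration rather than the on-disk .idx layout. The point is that fanout[b] holds the number of objects whose first byte is at most b, which narrows the binary search to one bucket.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define OID_LEN 20                       /* assumed SHA-1 sized IDs */

/* oids: nr sorted object IDs stored back to back, OID_LEN bytes each. */
void toy_build_fanout(const unsigned char *oids, uint32_t nr,
                      uint32_t fanout[256])
{
    uint32_t i;
    int b;

    assert(oids || !nr);
    memset(fanout, 0, 256 * sizeof(uint32_t));
    for (i = 0; i < nr; i++)
        fanout[oids[(size_t)i * OID_LEN]]++;
    for (b = 1; b < 256; b++)            /* turn counts into running totals */
        fanout[b] += fanout[b - 1];
}

/* Returns the position of oid in the sorted table, or -1 if absent. */
long toy_find_oid(const unsigned char *oids, const uint32_t fanout[256],
                  const unsigned char *oid)
{
    uint32_t lo = oid[0] ? fanout[oid[0] - 1] : 0;
    uint32_t hi = fanout[oid[0]];

    while (lo < hi) {
        uint32_t mid = lo + (hi - lo) / 2;
        int cmp = memcmp(oids + (size_t)mid * OID_LEN, oid, OID_LEN);
        if (!cmp)
            return (long)mid;
        if (cmp < 0)
            lo = mid + 1;
        else
            hi = mid;
    }
    return -1;
}
```

The last fanout entry doubles as the total object count, so a reader needs no separate count field.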

Supporting Metadata

  • .rev files: Reverse index mapping pack offsets to object positions for efficient iteration
  • .mtimes files: Object modification times for cruft pack identification
  • Multi-pack index (MIDX): Indexes multiple packs simultaneously for faster lookups across pack boundaries

Pack Windows & Memory Management

The struct packed_git maintains a sliding window system (pack_window) for memory-mapped access to pack data. This allows efficient reading without loading entire packs into memory. The use_pack() function manages window allocation and the unuse_pack() function releases references.

Delta Compression

Deltas store differences between objects rather than full copies:

  • OFS_DELTA: Encodes offset to base object in same pack (space-efficient)
  • REF_DELTA: Encodes full object ID of base (supports cross-pack references)

Delta chains can be nested, but every chain must end at a non-delta base object. The delta data itself is also zlib-compressed.
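
To make the idea of a delta concrete, here is a sketch that applies already-inflated delta data to a base object. The function names are hypothetical. The instruction encoding assumed is the one in Git's pack format documentation: two size varints, then copy instructions (high bit set, optional offset and size bytes selected by the low bits) and insert instructions (a length from 1 to 127 followed by literal bytes).

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Read a little-endian base-128 varint; returns -1 on truncated input. */
int toy_read_varint(const unsigned char **p, const unsigned char *end,
                    size_t *out)
{
    size_t v = 0;
    unsigned shift = 0;
    unsigned char c;

    do {
        if (*p == end || shift >= sizeof(size_t) * 8)
            return -1;
        c = *(*p)++;
        v |= (size_t)(c & 0x7f) << shift;
        shift += 7;
    } while (c & 0x80);
    *out = v;
    return 0;
}

/* Returns a malloc'd result (caller frees), or NULL on a malformed delta. */
unsigned char *toy_apply_delta(const unsigned char *base, size_t base_len,
                               const unsigned char *delta, size_t delta_len,
                               size_t *result_len)
{
    const unsigned char *p = delta, *end = delta + delta_len;
    size_t expect_base, size, pos = 0;
    unsigned char *out;

    assert(base && delta && result_len);
    if (toy_read_varint(&p, end, &expect_base) || expect_base != base_len ||
        toy_read_varint(&p, end, &size))
        return NULL;
    out = malloc(size ? size : 1);
    if (!out)
        return NULL;
    while (p < end) {
        unsigned char cmd = *p++;
        if (cmd & 0x80) {                /* copy a range from the base */
            size_t off = 0, len = 0;
            int i;
            for (i = 0; i < 4; i++)
                if (cmd & (1 << i)) {
                    if (p == end) goto bad;
                    off |= (size_t)*p++ << (8 * i);
                }
            for (i = 0; i < 3; i++)
                if (cmd & (0x10 << i)) {
                    if (p == end) goto bad;
                    len |= (size_t)*p++ << (8 * i);
                }
            if (!len)
                len = 0x10000;           /* size 0 means 64 KiB */
            if (off > base_len || len > base_len - off || len > size - pos)
                goto bad;
            memcpy(out + pos, base + off, len);
            pos += len;
        } else if (cmd) {                /* insert cmd literal bytes */
            if ((size_t)(end - p) < cmd || cmd > size - pos)
                goto bad;
            memcpy(out + pos, p, cmd);
            p += cmd;
            pos += cmd;
        } else {
            goto bad;                    /* opcode 0 is reserved */
        }
    }
    if (pos != size)
        goto bad;
    *result_len = size;
    return out;
bad:
    free(out);
    return NULL;
}
```

Resolving a chain means applying this repeatedly, starting from the non-delta base at the end of the chain and working back toward the requested object.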

References & Branches

Relevant Files
  • refs.c - Backend-independent reference module
  • refs.h - Public reference API
  • refs/files-backend.c - Loose file-based reference storage
  • refs/reftable-backend.c - Reftable reference storage backend
  • branch.c - Branch creation and management
  • branch.h - Branch API

Git's reference system is the foundation for tracking commits, branches, and tags. References are names that point to object IDs (usually commits, but any object type is allowed) or to other references (symbolic refs). The system uses a pluggable backend architecture to support different storage formats.

Reference Storage Backends

Git supports multiple reference storage backends, selected at repository initialization:

  • Files Backend (refs_be_files) - Traditional loose files in .git/refs/ and packed refs in .git/packed-refs. Each reference is a file containing an object ID or symbolic reference target.
  • Reftable Backend (refs_be_reftable) - Modern format using binary tables for efficient storage and querying. Supports atomic transactions and better performance for large repositories.

The backend interface is defined by struct ref_storage_be, which provides function pointers for operations like reading, writing, iterating, and transaction management.

Reference Transactions

Reference updates are atomic through the transaction system. A ref_transaction groups multiple reference updates and ensures they succeed or fail together:

struct ref_transaction {
    struct ref_store *ref_store;
    struct ref_update **updates;
    size_t nr;
    enum ref_transaction_state state;
};

Once updates are queued, a transaction is prepared (locks are taken and expected old values are checked) and then either finished (committed) or aborted (rolled back). This prevents partial updates and maintains repository consistency.
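
The all-or-nothing behavior can be illustrated with an in-memory toy. Everything here (the toy_ types, string-valued object IDs, the single toy_apply_updates entry point) is hypothetical and much simpler than the real ref_transaction API, which also takes locks and writes reflogs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct toy_ref_entry { const char *name; const char *oid; };
struct toy_ref_update {
    const char *name;
    const char *old_oid;                 /* value the caller expects */
    const char *new_oid;                 /* value to store on success */
};

struct toy_ref_entry *toy_find_ref(struct toy_ref_entry *refs, size_t nr,
                                   const char *name)
{
    size_t i;
    for (i = 0; i < nr; i++)
        if (!strcmp(refs[i].name, name))
            return &refs[i];
    return NULL;
}

/* Returns 0 and applies every update, or -1 and applies none. */
int toy_apply_updates(struct toy_ref_entry *refs, size_t nr_refs,
                      const struct toy_ref_update *updates, size_t nr_updates)
{
    size_t i;

    assert(refs && updates);
    /* Prepare: verify every expected old value before changing anything. */
    for (i = 0; i < nr_updates; i++) {
        struct toy_ref_entry *ref =
            toy_find_ref(refs, nr_refs, updates[i].name);
        if (!ref || strcmp(ref->oid, updates[i].old_oid))
            return -1;
    }
    /* Finish: all checks passed, so apply every update. */
    for (i = 0; i < nr_updates; i++)
        toy_find_ref(refs, nr_refs, updates[i].name)->oid =
            updates[i].new_oid;
    return 0;
}
```

Checking the expected old value is what makes concurrent updates safe: a writer that raced with another process fails in the prepare step instead of silently overwriting the newer value.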

Reference Namespaces

Git organizes references into logical namespaces:

  • HEAD - Current branch pointer
  • refs/heads/ - Local branches
  • refs/tags/ - Annotated and lightweight tags
  • refs/remotes/ - Remote-tracking branches
  • refs/stash - Stash storage
  • refs/notes/ - Metadata annotations

Branch Management

The branch module provides high-level operations for creating and managing branches. Key functions:

  • create_branch() - Creates a new branch from a starting point with optional tracking setup
  • create_branches_recursively() - Creates branches in superproject and submodules atomically
  • dwim_and_setup_tracking() - Configures upstream tracking relationships

Branches are implemented as references in refs/heads/ that point to commits. Tracking branches link local branches to remote branches via configuration.

Reference Resolution

References are resolved recursively through refs_resolve_ref_unsafe(), which follows symbolic references to find the ultimate object ID. Resolution flags control behavior:

  • RESOLVE_REF_READING - Fail if reference doesn't exist
  • RESOLVE_REF_NO_RECURSE - Stop after one level of indirection
  • RESOLVE_REF_ALLOW_BAD_NAME - Allow malformed reference names
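
Symbolic reference resolution can be sketched over an in-memory table. The toy_ names, the "ref: " prefix convention for symbolic values, the TOY_NO_RECURSE flag, and the depth limit of 5 are assumptions chosen for this example; the real function has a richer signature and reports more detail through its flags.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAX_DEPTH  5                 /* assumed limit on symref hops */
#define TOY_NO_RECURSE 1                 /* stop after one level */

struct toy_sym_ref { const char *name; const char *value; };

/*
 * Returns the object ID string the ref resolves to, the target ref name
 * when TOY_NO_RECURSE stops at a symbolic ref, or NULL if the ref is
 * missing or the chain is too deep (for example a cycle).
 */
const char *toy_resolve_ref(const struct toy_sym_ref *refs, size_t nr,
                            const char *name, int flags)
{
    int depth;

    assert(refs && name);
    for (depth = 0; depth <= TOY_MAX_DEPTH; depth++) {
        const char *value = NULL;
        size_t i;

        for (i = 0; i < nr; i++)
            if (!strcmp(refs[i].name, name))
                value = refs[i].value;
        if (!value)
            return NULL;                 /* dangling or missing ref */
        if (strncmp(value, "ref: ", 5))
            return value;                /* direct ref: an object ID */
        name = value + 5;                /* follow the symbolic ref */
        if (flags & TOY_NO_RECURSE)
            return name;
    }
    return NULL;                         /* too many levels */
}
```

The depth limit is what keeps a cycle such as two symbolic refs pointing at each other from looping forever.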

Reflog

The reflog records all reference updates, enabling recovery of lost commits and branch history inspection. Each reference maintains a log of previous values with timestamps and operation descriptions.

Revision Walking & History Traversal

Relevant Files
  • revision.c & revision.h
  • commit-reach.c & commit-reach.h
  • commit-graph.c & commit-graph.h
  • path-walk.c & path-walk.h

Core Revision Walking API

The revision walking system provides a structured way to traverse commit history. The main entry point is the rev_info structure, which holds configuration for what commits to walk and how to traverse them.

Key workflow:

  1. Initialize rev_info with repo_init_revisions()
  2. Configure traversal options (filters, sort order, etc.)
  3. Call prepare_revision_walk() to build the commit list
  4. Iterate using get_revision() to fetch commits one by one

Traversal Modes

Git supports multiple traversal strategies controlled by flags in rev_info:

  • Date-ordered: Default mode, commits sorted by commit date
  • Topological order (topo_order): Respects parent-child relationships, ensuring that no commit is shown before all of its children have been shown
  • Reflog walking (reflog_info): Traverses reference history instead of commit ancestry
  • Limited traversal (limited): Applies filters like date ranges or path restrictions

Topological Walking

For topological ordering, Git uses a sophisticated three-phase algorithm implemented in topo_walk_info:

  1. Explore phase: Discovers reachable commits, walking in generation-number order
  2. Indegree phase: Counts, for each commit, how many children in the walk still reference it
  3. Output phase: Emits a commit once its count drops to zero, using priority queues to pick the next one

This approach leverages commit-graph generation numbers to optimize traversal, skipping unnecessary exploration of older commits.
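
The indegree and output phases amount to Kahn's algorithm run from the branch tips toward the roots. The sketch below works on a toy commit array and is a simplification: it assumes the whole graph is already in memory, and it uses a plain FIFO queue where Git uses priority queues, so ties are broken differently. All names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

#define TOY_MAX_PARENTS 2

struct toy_commit {
    int parents[TOY_MAX_PARENTS];        /* indexes into the commit array */
    int nr_parents;
};

/*
 * Writes commit indexes to out so that every commit appears before its
 * parents. Returns the number of commits emitted.
 */
size_t toy_topo_order(const struct toy_commit *c, size_t nr, int *out)
{
    int *indegree, *queue, p;
    size_t head = 0, tail = 0, emitted = 0, i;

    if (!nr)
        return 0;
    indegree = calloc(nr, sizeof(*indegree));
    queue = malloc(nr * sizeof(*queue));
    assert(indegree && queue);

    /* Indegree phase: count the children that point at each commit. */
    for (i = 0; i < nr; i++)
        for (p = 0; p < c[i].nr_parents; p++)
            indegree[c[i].parents[p]]++;

    /* Commits without children (tips) are ready immediately. */
    for (i = 0; i < nr; i++)
        if (!indegree[i])
            queue[tail++] = (int)i;

    /* Output phase: a parent becomes ready once all children are out. */
    while (head < tail) {
        int cur = queue[head++];
        out[emitted++] = cur;
        for (p = 0; p < c[cur].nr_parents; p++)
            if (--indegree[c[cur].parents[p]] == 0)
                queue[tail++] = c[cur].parents[p];
    }
    free(indegree);
    free(queue);
    return emitted;
}
```

For a diamond history (a root, two commits on top of it, and a merge of those two), the merge comes out first and the root last.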

Commit Reachability Analysis

The commit-reach.c module provides high-level reachability queries:

  • repo_is_descendant_of(): Check if one commit descends from another
  • repo_get_merge_bases(): Find common ancestors
  • can_all_from_reach_with_flag(): Batch reachability checks with generation cutoffs
  • ahead_behind(): Count commits ahead/behind between branches

These functions use commit-graph generation numbers as optimization hints to prune search spaces.
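
The pruning idea can be sketched on a toy commit array. The sketch assumes the simplest definition of a generation number (one more than the largest generation among the parents) and uses hypothetical names throughout. Because an ancestor always has a strictly smaller generation than its descendants, a walk looking for a target can stop at any commit whose generation is not above the target's.

```c
#include <assert.h>
#include <stdlib.h>

struct toy_gen_commit {
    int parents[2];                      /* indexes into the commit array */
    int nr_parents;
    unsigned generation;
};

/* Commits must be listed parents-first so each parent is done already. */
void toy_compute_generations(struct toy_gen_commit *c, size_t nr)
{
    size_t i;
    int p;

    for (i = 0; i < nr; i++) {
        unsigned max = 0;
        for (p = 0; p < c[i].nr_parents; p++)
            if (c[c[i].parents[p]].generation > max)
                max = c[c[i].parents[p]].generation;
        c[i].generation = max + 1;
    }
}

/* Returns 1 if target is reachable from 'from' by following parents. */
int toy_can_reach(const struct toy_gen_commit *c, size_t nr,
                  int from, int target)
{
    unsigned char *seen;
    int *stack, found = 0, p;
    size_t top = 0;

    if (!nr)
        return 0;
    seen = calloc(nr, 1);
    stack = malloc(nr * sizeof(*stack));
    assert(seen && stack);
    stack[top++] = from;
    seen[from] = 1;
    while (top) {
        int cur = stack[--top];
        if (cur == target) {
            found = 1;
            break;
        }
        /* Generation cutoff: nothing at or below the target's
         * generation can have the target as an ancestor. */
        if (c[cur].generation <= c[target].generation)
            continue;
        for (p = 0; p < c[cur].nr_parents; p++) {
            int parent = c[cur].parents[p];
            if (!seen[parent]) {
                seen[parent] = 1;
                stack[top++] = parent;
            }
        }
    }
    free(seen);
    free(stack);
    return found;
}
```

Without the cutoff, a negative answer would require walking all the way to the root commits; with it, the walk stops as soon as it drops to the target's level.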

Commit Graph Integration

The commit-graph file accelerates history traversal by pre-computing:

  • Parent relationships
  • Generation numbers (in the simplest form, one more than the largest generation number among a commit's parents)
  • Bloom filters for path-based filtering

Functions like parse_commit_in_graph() and commit_graph_generation() provide fast lookups without parsing raw commit objects.

Path-Based Walking

The path-walk API walks reachable objects in batches, grouping trees and blobs by the path at which they appear:

int walk_objects_by_path(struct path_walk_info *info)

Batching by path lets commands such as git backfill and git pack-objects --path-walk process all versions of a file or directory together, rather than encountering them scattered across the commit walk.

Object Flags and Marking

Revision walking uses bit flags to mark commit state during traversal:

  • SEEN: Commit already processed
  • UNINTERESTING: Commit excluded from results (e.g., via --not)
  • TREESAME: Commit's tree matches its parent's for the paths of interest (used for history simplification)
  • TOPO_WALK_EXPLORED: Commit visited during exploration phase
  • TOPO_WALK_INDEGREE: Commit processed for parent counting

These flags enable efficient single-pass algorithms without maintaining separate data structures.

Merge Algorithms & Diff Engine

Relevant Files
  • merge-ort.c & merge-ort.h - ORT merge strategy (default)
  • merge-ort-wrappers.c & merge-ort-wrappers.h - Wrapper functions
  • merge-ll.c & merge-ll.h - Low-level three-way merge
  • diff.c & diff.h - Diff engine and options
  • diffcore.h - Diff core data structures
  • xdiff-interface.c - XDiff library integration
  • xdiff/xmerge.c - XDiff merge implementation

Overview

Git's merge and diff systems work together to reconcile competing changes. The ORT (Ostensibly Recursive's Twin) strategy is the default merge algorithm, replacing the older recursive strategy. It performs three-way merges with rename detection and handles complex scenarios like directory renames and content conflicts.

Merge Architecture

The merge process follows this pipeline:

  1. Collect merge info - Traverse all three trees (base, side1, side2) and build a map of all paths
  2. Detect renames - Use diffcore to identify file renames and copies
  3. Process entries - Resolve conflicts for each path, applying merge drivers
  4. Output result - Generate merged tree with conflict markers where needed
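
The per-path decision in the third step starts with a simple rule that needs no content merge at all. The sketch below shows that rule on string-valued IDs, with NULL standing for a path that is absent on that side. The names are hypothetical, and real merges layer rename handling, modes, and merge drivers on top of this.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum toy_merge_result {
    TOY_TAKE_SIDE1,
    TOY_TAKE_SIDE2,
    TOY_NEEDS_CONTENT_MERGE
};

/* Two versions match if both are absent or both have the same ID. */
int toy_same(const char *a, const char *b)
{
    if (!a || !b)
        return a == b;
    return !strcmp(a, b);
}

enum toy_merge_result toy_resolve_entry(const char *base,
                                        const char *side1,
                                        const char *side2)
{
    /* A path that exists on none of the three sides is never visited. */
    assert(base || side1 || side2);

    if (toy_same(side1, side2))
        return TOY_TAKE_SIDE1;           /* both sides agree */
    if (toy_same(base, side1))
        return TOY_TAKE_SIDE2;           /* only side2 changed */
    if (toy_same(base, side2))
        return TOY_TAKE_SIDE1;           /* only side1 changed */
    return TOY_NEEDS_CONTENT_MERGE;      /* both changed, differently */
}
```

Only the last case reaches the file-level three-way merge, which is why most paths in a typical merge are resolved by comparing IDs alone.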

Key Data Structures

merge_options - Configuration for merge behavior:

  • detect_renames - Enable rename detection
  • xdl_opts - XDiff options (patience, histogram, ignore whitespace)
  • conflict_style - Marker format (merge, diff3, zdiff3)
  • recursive_variant - Resolution strategy (normal, ours, theirs)

conflict_info - Per-path conflict metadata:

  • stages[3] - OID & mode for base, side1, side2
  • pathnames[3] - Paths after rename detection
  • df_conflict - Directory/file conflict flag
  • path_conflict - Rename/delete or other path conflicts

Diff Engine

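The diff system collects changed files as file pairs (struct diff_filepair) in a queue, runs the diffcore transformations over that queue (for example rename detection), and then formats each remaining pair for output.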

XDiff Integration - Low-level diff algorithm:

  • Compares files line-by-line using the Myers (default), minimal, patience, or histogram algorithm
  • Produces hunks with context lines
  • Supports ignore patterns (whitespace, regex)

Three-way Merge (ll_merge) - Merges individual files:

  • Calls xdl_merge() for text files
  • Handles binary files (no merge)
  • Applies merge drivers from .gitattributes
  • Generates conflict markers on failure

Conflict Resolution

Conflicts are marked with <<<<<<<, =======, and >>>>>>> markers. The merge process handles:

  • Content conflicts - Text merge failed; markers inserted
  • Rename/rename - Same file renamed differently on both sides
  • Rename/delete - File renamed on one side, deleted on other
  • Directory renames - Inferred from file movements
  • Mode conflicts - File mode differs (executable, symlink)

Performance Optimizations

  • Rename caching - Reuses rename detection across sequential merges (cherry-pick, rebase)
  • Histogram diff - Faster than Myers for large files with repetitive content
  • Lazy loading - Blobs fetched only when needed for merge drivers
  • Sparse checkout - Merges only relevant paths in partial clones

Command Implementation & Porcelain

Relevant Files
  • builtin.h
  • git.c
  • builtin/commit.c
  • builtin/checkout.c
  • builtin/merge.c
  • builtin/rebase.c
  • builtin/fetch.c
  • builtin/push.c
  • builtin/log.c
  • command-list.txt

Git distinguishes between porcelain commands (user-facing, high-level) and plumbing commands (low-level, internal). This section covers how builtin commands are implemented and registered in the Git codebase.

Command Registration

All builtin commands are registered in the commands[] array in git.c. Each entry contains:

{ "command-name", cmd_function, FLAGS }

The flags control command behavior:

  • RUN_SETUP – Requires a Git repository; changes to repo root if in subdirectory
  • RUN_SETUP_GENTLY – Accepts missing repository gracefully
  • NEED_WORK_TREE – Requires a working tree (not bare repository)
  • DELAY_PAGER_CONFIG – Defers pager configuration to the command itself
  • NO_PARSEOPT – Command handles option parsing manually

Command Implementation Pattern

Every builtin command follows a standard signature defined in builtin.h:

int cmd_foo(int argc, const char **argv, 
            const char *prefix, struct repository *repo)

The prefix parameter contains the relative path from the repository root to the directory where the command was invoked. This enables commands to resolve user-supplied paths correctly.

Porcelain vs. Plumbing

Commands are classified in command-list.txt by type:

  • mainporcelain – Primary user-facing commands (commit, checkout, merge)
  • ancillarymanipulators/interrogators – Secondary user commands (branch, tag)
  • plumbingmanipulators/interrogators – Low-level internal commands (cat-file, hash-object)
  • purehelpers – Utility commands (check-attr, credential)

Complex Command Example: Commit

The git commit command (builtin/commit.c) demonstrates typical patterns:

  1. Option parsing using parse_options() with a struct option array
  2. State management via struct wt_status for working tree status
  3. Index manipulation to stage changes before creating the commit object
  4. Hook execution for pre-commit and post-commit workflows
  5. Ref updates to move HEAD to the new commit

Commands with subcommands, such as git bisect, use OPT_SUBCOMMAND() to delegate to specialized handlers, enabling modular command hierarchies.

Execution Flow

When a user runs git foo:

  1. git.c:cmd_main() parses global options
  2. handle_builtin() looks up the command in the commands[] array
  3. Repository setup occurs based on flags (RUN_SETUP, etc.)
  4. The command function executes with prepared context
  5. The command's exit status is returned to the caller, after checking that output written to stdout did not fail
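
The lookup and setup steps can be sketched with a small dispatch table. This is a toy under stated assumptions: the TOY_ flags, the three-argument command signature, the in_repo and has_work_tree parameters, and the numeric exit codes are all invented for the example and do not mirror git.c in detail.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_RUN_SETUP      (1u << 0)     /* needs a repository */
#define TOY_NEED_WORK_TREE (1u << 1)     /* needs a working tree */

typedef int (*toy_cmd_fn)(int argc, const char **argv, const char *prefix);

struct toy_cmd {
    const char *name;
    toy_cmd_fn fn;
    unsigned flags;
};

int toy_cmd_version(int argc, const char **argv, const char *prefix)
{
    (void)argc; (void)argv; (void)prefix;
    return 0;
}

int toy_cmd_status(int argc, const char **argv, const char *prefix)
{
    (void)argv; (void)prefix;
    return argc > 1 ? 129 : 0;           /* toy rule: reject extra arguments */
}

const struct toy_cmd toy_commands[] = {
    { "status",  toy_cmd_status,  TOY_RUN_SETUP | TOY_NEED_WORK_TREE },
    { "version", toy_cmd_version, 0 },
};

const struct toy_cmd *toy_get_builtin(const char *name)
{
    size_t i;
    for (i = 0; i < sizeof(toy_commands) / sizeof(toy_commands[0]); i++)
        if (!strcmp(toy_commands[i].name, name))
            return &toy_commands[i];
    return NULL;
}

/* argv[0] is the command name; returns the command's exit status. */
int toy_run(int argc, const char **argv, int in_repo, int has_work_tree)
{
    const struct toy_cmd *cmd;

    assert(argc >= 1 && argv);
    cmd = toy_get_builtin(argv[0]);
    if (!cmd)
        return 1;                        /* unknown command */
    /* Check the requirements declared in the table before running. */
    if ((cmd->flags & TOY_RUN_SETUP) && !in_repo)
        return 128;
    if ((cmd->flags & TOY_NEED_WORK_TREE) && !has_work_tree)
        return 128;
    return cmd->fn(argc, argv, "");
}
```

Declaring requirements as flags in the table keeps the checks in one place, so individual commands do not each have to verify that they are inside a repository.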