Overview
Relevant Files
- README.md
- Documentation/gitdatamodel.adoc
- Documentation/gitrepository-layout.adoc
- git.c
- repository.h and repository.c
Git is a fast, scalable, distributed revision control system with an unusually rich command set. It provides both high-level operations for everyday use and full access to internal mechanisms for advanced workflows. Originally written by Linus Torvalds, Git is an open-source project licensed under GPLv2.
Core Purpose
Git enables developers to track changes to code and collaborate efficiently across distributed teams. It stores complete repository history locally, allowing offline work and fast operations. Unlike centralized version control systems, Git gives every developer a full copy of the project history.
Data Model
Git's architecture is built on four fundamental data structures:
- Objects - Immutable data units representing commits, trees, blobs, and tags. Each object is identified by a hash of its content (SHA-1 by default, SHA-256 optionally).
- References - Pointers to objects (branches, tags, remote-tracking branches) that enable human-readable navigation.
- Index - The staging area that tracks which changes will be included in the next commit.
- Reflogs - Historical logs of reference changes for recovery and debugging.
Repository Structure
A Git repository consists of:
- .git directory - Contains all repository metadata and object storage
- objects/ directory - Stores all Git objects (commits, trees, blobs, tags) in compressed pack files or loose format
- refs/ directory - Stores branch and tag references
- HEAD - Points to the current branch
- Working tree - The actual files you edit (absent in bare repositories)
Command Architecture
Git's command system is organized hierarchically:
git <command> [<subcommand>] [<options>] [<arguments>]
The main entry point (git.c) dispatches commands to built-in implementations in the builtin/ directory. Commands can be simple (like git add) or have subcommands (like git remote add, git maintenance run).
Key Components
- Repository Management - Handles initialization, configuration, and state tracking via repository.h/c
- Object Database - Manages storage and retrieval of Git objects with compression and indexing
- Reference System - Maintains branches, tags, and tracking references
- Index Management - Tracks staged changes and working tree state
- Diff Engine - Computes differences between commits, trees, and working tree
- Merge System - Handles multi-way merges with conflict detection
Performance Features
Git includes advanced optimizations for large repositories:
- Commit graphs - Accelerate history traversal
- Multi-pack index - Speed up object lookups across multiple pack files
- Sparse checkout - Reduce working tree size for monorepos
- Partial clone - Download only needed objects on demand
- Background maintenance - Optimize repository structure automatically
Architecture & Core Data Structures
Relevant Files
- object.h & object.c
- odb.h & odb.c
- repository.h & repository.c
Git's architecture centers on three core data structures that work together to manage repository state and object storage.
The Repository Structure
The struct repository is the top-level container for all repository-related state. It holds:
- Path information: gitdir (the .git directory) and commondir (shared directory for worktrees)
- Object database: struct object_database *objects for accessing stored objects
- Parsed object pool: struct parsed_object_pool *parsed_objects for in-memory object caching
- Reference storage: struct ref_store *refs_private for managing branches and tags
- Configuration and state: Hash algorithm, index state, submodule caches, and remote configuration
The Object Database (ODB)
The struct object_database manages access to Git objects through a multi-source architecture:
struct object_database {
    struct odb_source *sources;       /* Primary + alternates */
    struct oidmap replace_map;        /* Object replacements */
    struct packfile_store *packfiles; /* Packed object access */
    struct cached_object_entry *cached_objects;
};
Key features:
- Multiple sources: Primary source (.git/objects) plus alternates for shared object pools
- Lazy loading: Alternates loaded on-demand from .git/objects/info/alternates
- Transactions: Supports atomic object writes via odb_transaction_begin()
- Object replacement: Maps objects to replacements (see git-replace)
The Parsed Object Pool
The struct parsed_object_pool caches parsed objects in memory:
struct parsed_object_pool {
    struct object **obj_hash;       /* Hash table of objects */
    int nr_objs, obj_hash_size;     /* Count and capacity */
    struct alloc_state *blob_state; /* Per-type allocators */
    struct commit_graft **grafts;   /* Parent substitutions */
};
This uses an open-addressing hash table with linear probing, which gives expected O(1) object lookup by OID. Objects are allocated from type-specific memory pools for efficiency.
Core Object Types
Git defines four core object types plus a placeholder value, encoded in a 3-bit type field:
- OBJ_COMMIT (1): Snapshot of repository state
- OBJ_TREE (2): Directory listing
- OBJ_BLOB (3): File content
- OBJ_TAG (4): Annotated reference
- OBJ_NONE (0): Uninitialized object
Delta objects (OBJ_OFS_DELTA, OBJ_REF_DELTA) are used only in pack files.
Object Lookup and Creation
Lookup (lookup_object): Uses hash table with linear probing. On collision, moves found object to initial position for faster future lookups.
Creation (create_object): Allocates from type-specific pool, inserts into hash table, grows table when 50% full.
Parsing (parse_object): Reads raw object from ODB, validates hash, deserializes into typed structure (commit, tree, blob, or tag).
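The lookup and creation behavior described above can be illustrated with a toy table. This is a minimal sketch, not Git's code: the toy_ names, the 32-bit id standing in for an OID, and the initial size of 32 are all assumptions made for the example. It shows linear probing, growth once the table is about half full, and swapping a hit into its home slot.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Stand-in for a parsed object; a 32-bit id plays the role of the OID. */
struct toy_obj {
    uint32_t id;
};

struct toy_pool {
    struct toy_obj **slots; /* open-addressing table, NULL means empty */
    unsigned size;          /* zero or a power of two */
    unsigned nr;            /* number of stored objects */
};

static void toy_insert(struct toy_obj **slots, unsigned size,
                       struct toy_obj *obj)
{
    unsigned i = obj->id & (size - 1);

    while (slots[i])
        i = (i + 1) & (size - 1); /* linear probing */
    slots[i] = obj;
}

static void toy_grow(struct toy_pool *p)
{
    unsigned i, new_size = p->size < 32 ? 32 : 2 * p->size;
    struct toy_obj **fresh = calloc(new_size, sizeof(*fresh));

    assert(fresh);
    for (i = 0; i < p->size; i++)
        if (p->slots[i])
            toy_insert(fresh, new_size, p->slots[i]);
    free(p->slots);
    p->slots = fresh;
    p->size = new_size;
}

/* The caller must not add the same id twice. */
void toy_pool_add(struct toy_pool *p, struct toy_obj *obj)
{
    /* Keep the table at most about half full. */
    if (!p->size || p->size - 1 <= p->nr * 2)
        toy_grow(p);
    toy_insert(p->slots, p->size, obj);
    p->nr++;
}

struct toy_obj *toy_pool_lookup(struct toy_pool *p, uint32_t id)
{
    unsigned first, i;
    struct toy_obj *obj;

    if (!p->size)
        return NULL;
    first = i = id & (p->size - 1);
    while ((obj = p->slots[i]) != NULL && obj->id != id)
        i = (i + 1) & (p->size - 1);
    if (obj && i != first) {
        /* Swap the hit into its home slot to shorten later probes. */
        p->slots[i] = p->slots[first];
        p->slots[first] = obj;
    }
    return obj;
}
```

The swap is safe because every slot between the home position and the hit is occupied, so the displaced entry remains reachable by probing.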
Memory Management
Objects are allocated in slabs (contiguous blocks) per type, not individually. This reduces fragmentation and improves cache locality. The pool tracks allocation state per type and frees entire slabs on cleanup, not individual objects.
Object Storage & Retrieval
Relevant Files
- object-file.c - Core object storage and retrieval logic
- object-file.h - Object file API and structures
- loose.c - Loose object mapping and management
- loose.h - Loose object map interface
- hash.c - Hash algorithm implementations
- hash.h - Hash algorithm definitions and context
Overview
Git stores objects (commits, trees, blobs, tags) in two primary formats: loose objects and packed objects. This section focuses on loose object storage and retrieval, which uses a content-addressable filesystem layout where object paths are derived from their SHA-1 or SHA-256 hashes.
Loose Object Storage Layout
Loose objects are stored in .git/objects/ using a two-level directory structure:
.git/objects/
├── ab/
│ ├── cdef1234567890...
│ └── 1234567890abcdef...
├── cd/
│ └── ef1234567890abcd...
└── ...
The first two hex characters of the hash form the directory name, and the remaining characters form the filename. This structure is generated by fill_loose_path() and odb_loose_path(), which convert an object ID into its filesystem path.
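As a rough illustration of that layout, the sketch below builds a loose object path from a hex object ID. The helper name loose_path_from_hex is hypothetical, and it works on hex strings for simplicity, whereas the functions named above operate on Git's own object ID type.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build "<objdir>/xx/yyyy..." from a hex object ID.
 * Hypothetical helper for illustration only.
 * Returns the number of characters written, or -1 if the ID is too
 * short or the output buffer is too small.
 */
int loose_path_from_hex(char *out, size_t outsz,
                        const char *objdir, const char *hex)
{
    int n;

    assert(out && objdir && hex);
    if (strlen(hex) < 3)
        return -1; /* need two directory characters plus a filename */

    /* First two hex characters name the directory, the rest the file. */
    n = snprintf(out, outsz, "%s/%.2s/%s", objdir, hex, hex + 2);
    if (n < 0 || (size_t)n >= outsz)
        return -1; /* output was truncated */
    return n;
}
```

Splitting on the first byte keeps any single directory from accumulating every loose object in the repository.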
Hash Algorithms
Git supports multiple hash algorithms (SHA-1 and SHA-256) through the git_hash_algo structure:
struct git_hash_algo {
    const char *name;                   // "sha1" or "sha256"
    uint32_t format_id;                 // Pack file identifier
    size_t rawsz;                       // Binary hash size (20 or 32 bytes)
    size_t hexsz;                       // Hex representation size (40 or 64 chars)
    git_hash_init_fn init_fn;           // Initialize hash context
    git_hash_update_fn update_fn;       // Update with data
    git_hash_final_oid_fn final_oid_fn; // Finalize to OID
    // ... other fields
};
The git_hash_ctx structure maintains state during hashing operations and is used throughout object creation and verification.
Writing Objects
Object writing follows this flow:
- Prepare header: Format object type and size (e.g., "blob 42\0")
- Compute hash: Use hash_object_file() to generate the OID
- Write loose object: write_loose_object() compresses data with zlib and writes to a temporary file
- Atomic rename: Move temp file to final location (e.g., .git/objects/ab/cdef...)
- Freshen: Update file mtime to prevent garbage collection
Streaming writes are supported via odb_source_loose_write_stream() for large objects.
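The header step can be sketched in isolation. The helper below is hypothetical (format_loose_header is not a Git function); it only shows the "<type> <size>" layout and that the terminating NUL byte is part of the header, which is hashed together with the object content.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Write "<type> <size>" followed by the NUL that separates the header
 * from the payload. Returns the header length including that NUL,
 * or -1 if the buffer is too small. Hypothetical helper name.
 */
int format_loose_header(char *out, size_t outsz,
                        const char *type, unsigned long size)
{
    int n;

    assert(out && type);
    n = snprintf(out, outsz, "%s %lu", type, size);
    if (n < 0 || (size_t)n >= outsz)
        return -1;
    return n + 1; /* count the NUL byte written by snprintf */
}
```

For a 42-byte blob this yields the eight bytes "blob 42\0"; the hashing and zlib steps would then consume this header followed by the content.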
Reading Objects
Object retrieval uses read_loose_object():
- Locate file: Construct path from OID
- Memory map: Open and mmap the file for efficient access
- Decompress header: Extract object type and size
- Verify hash: Recompute hash to detect corruption
- Return contents: Provide decompressed data or stream
Loose Object Mapping
For hash algorithm transitions, Git maintains a loose_object_map that tracks correspondences between SHA-1 and SHA-256 representations of the same object. This map is persisted in .git/objects/loose-object-idx and enables seamless migration between hash algorithms.
Caching and Performance
The odb_source_loose structure includes:
- oidtree cache: Fast lookup of loose objects by OID prefix
- subdir_seen bitmap: Tracks which object directories have been scanned
- loose_object_map: Maps between compatible hash representations
These optimizations reduce filesystem operations during object lookups and abbreviated hash resolution.
Transactions and Fsync
For durability, object writes can participate in batch fsync transactions:
- prepare_loose_object_transaction() creates a temporary object directory
- fsync_loose_object_transaction() batches writeout requests
- flush_loose_object_transaction() issues a final hardware flush before renaming
This approach improves performance on systems with expensive fsync operations.
Pack Files & Compression
Relevant Files
- packfile.c & packfile.h - Pack file loading and management
- pack-objects.c & pack-objects.h - Pack creation and object handling
- pack-write.c - Index and metadata file writing
- midx.c - Multi-pack index support
Pack files are Git's primary storage format for repository objects. They consolidate loose objects and smaller packs into larger, compressed archives to optimize storage and performance.
Pack File Format
A pack file (.pack) contains a header, object entries, and a trailing checksum:
Header (12 bytes):
- 4-byte signature: 'PACK'
- 4-byte version (2 or 3)
- 4-byte object count
Object Entries:
- Variable-length type & size header
- Compressed object data (or delta data)
Trailer:
- SHA-1/SHA-256 checksum of all above
Each object is encoded with a 3-bit type field and variable-length size using 7-bit chunks. Objects can be stored as deltas (OFS_DELTA or REF_DELTA) to save space.
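The entry header encoding can be sketched as follows. The function names are hypothetical; the layout assumed here is the usual one for pack entries: the first byte holds a continuation bit, the 3-bit type, and the low four size bits, and each following byte carries seven more size bits, least significant group first.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Encode type and size into a pack entry header.
 * Returns the number of bytes written, or 0 if the buffer is too small.
 */
size_t encode_entry_header(unsigned char *out, size_t outsz,
                           unsigned type, uint64_t size)
{
    size_t n = 0;
    unsigned char c;

    assert(type <= 7); /* the type field is only 3 bits wide */
    c = (unsigned char)((type << 4) | (size & 0x0f));
    size >>= 4;
    while (size) {
        if (n == outsz)
            return 0;
        out[n++] = c | 0x80; /* set the "more bytes follow" bit */
        c = (unsigned char)(size & 0x7f);
        size >>= 7;
    }
    if (n == outsz)
        return 0;
    out[n++] = c;
    return n;
}

/*
 * Decode a header produced above. Returns the number of bytes consumed,
 * or 0 on truncated input or a size needing more than 60 bits
 * (a limit of this sketch).
 */
size_t decode_entry_header(const unsigned char *in, size_t insz,
                           unsigned *type, uint64_t *size)
{
    size_t n = 0;
    unsigned shift = 4;
    unsigned char c;

    if (!insz)
        return 0;
    c = in[n++];
    *type = (c >> 4) & 7;
    *size = c & 0x0f;
    while (c & 0x80) {
        if (n == insz || shift > 57)
            return 0;
        c = in[n++];
        *size |= (uint64_t)(c & 0x7f) << shift;
        shift += 7;
    }
    return n;
}
```

Small objects fit in a single header byte, which matters because a pack may hold millions of entries.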
Compression Strategy
Git uses zlib deflate compression for all packed objects. The compression level is configurable via pack.compression (range -1 to 9, default -1 for zlib default):
- Level 0: No compression (fastest)
- Level 6: Default balance of speed and compression
- Level 9: Maximum compression (slowest)
The do_compress() function in pack-objects.c handles object compression, while write_large_blob_data() streams compression for large blobs to manage memory efficiently.
Index Files
Pack index files (.idx) enable fast object lookup without scanning the entire pack:
- Version 1: Simple format with 256-entry fanout table
- Version 2: Supports packs > 4 GiB with CRC32 checksums for each object
The index stores object IDs and pack offsets; version 2 adds a CRC32 value per object. Version 2 is the default, and it is required once an object offset no longer fits in 31 bits.
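To show why a fanout table speeds up lookups, here is a minimal sketch over an in-memory sorted ID table. It is not the on-disk .idx parser: the 4-byte IDs, the function names, and the array layout are assumptions for the example. The fanout entry for a byte value gives the number of IDs whose first byte is less than or equal to it, which narrows the binary search to one bucket.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ID_LEN 4 /* shortened IDs for the sketch; real ones are 20 or 32 bytes */

/* fanout[b] = number of IDs whose first byte is <= b (ids must be sorted). */
void build_fanout(uint32_t fanout[256],
                  const unsigned char (*ids)[ID_LEN], uint32_t nr)
{
    uint32_t i = 0;
    int b;

    for (b = 0; b < 256; b++) {
        while (i < nr && ids[i][0] <= b)
            i++;
        fanout[b] = i;
    }
}

/* Return the position of `want` in the sorted table, or -1 if absent. */
long lookup_id(const uint32_t fanout[256],
               const unsigned char (*ids)[ID_LEN],
               const unsigned char *want)
{
    /* Only IDs sharing the first byte can match. */
    uint32_t lo = want[0] ? fanout[want[0] - 1] : 0;
    uint32_t hi = fanout[want[0]];

    assert(lo <= hi);
    while (lo < hi) {
        uint32_t mid = lo + (hi - lo) / 2;
        int cmp = memcmp(ids[mid], want, ID_LEN);

        if (!cmp)
            return (long)mid;
        if (cmp < 0)
            lo = mid + 1;
        else
            hi = mid;
    }
    return -1;
}
```

The position returned would then be used to read the matching offset entry from a parallel table.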
Supporting Metadata
- .rev files: Reverse index mapping pack offsets to object positions for efficient iteration
- .mtimes files: Object modification times for cruft pack identification
- Multi-pack index (MIDX): Indexes multiple packs simultaneously for faster lookups across pack boundaries
Pack Windows & Memory Management
The struct packed_git maintains a sliding window system (pack_window) for memory-mapped access to pack data. This allows efficient reading without loading entire packs into memory. The use_pack() function manages window allocation and the unuse_pack() function releases references.
Delta Compression
Deltas store differences between objects rather than full copies:
- OFS_DELTA: Encodes offset to base object in same pack (space-efficient)
- REF_DELTA: Encodes full object ID of base (supports cross-pack references)
Delta chains can be nested, but every chain must eventually resolve to a non-delta base object. The delta data itself is also zlib-compressed.
References & Branches
Relevant Files
- refs.c - Backend-independent reference module
- refs.h - Public reference API
- refs/files-backend.c - Loose file-based reference storage
- refs/reftable-backend.c - Reftable reference storage backend
- branch.c - Branch creation and management
- branch.h - Branch API
Git's reference system is the foundation for tracking commits, branches, and tags. References are names that point to object IDs (commits, trees, blobs) or other references (symbolic refs). The system uses a pluggable backend architecture to support different storage formats.
Reference Storage Backends
Git supports multiple reference storage backends, selected at repository initialization:
- Files Backend (refs_be_files) - Traditional loose files in .git/refs/ and packed refs in .git/packed-refs. Each reference is a file containing an object ID or symbolic reference target.
- Reftable Backend (refs_be_reftable) - Modern format using binary tables for efficient storage and querying. Supports atomic transactions and better performance for large repositories.
The backend interface is defined by struct ref_storage_be, which provides function pointers for operations like reading, writing, iterating, and transaction management.
Reference Transactions
Reference updates are atomic through the transaction system. A ref_transaction groups multiple reference updates and ensures they succeed or fail together:
struct ref_transaction {
    struct ref_store *ref_store;
    struct ref_update **updates;
    size_t nr;
    enum ref_transaction_state state;
};
A transaction is first prepared (its updates are validated), and is then either committed as a whole or aborted and rolled back. This prevents partial updates and maintains repository consistency.
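The all-or-nothing idea can be shown with a toy in-memory ref table. This is only a sketch under stated assumptions: the toy_ structures and toy_apply_updates() are hypothetical, and real backends also take locks and write to disk. Every update is validated against an expected old value before any ref is changed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct toy_ref {
    const char *name;
    const char *value; /* object ID as a string */
};

struct toy_update {
    const char *name;
    const char *expect_old; /* value the ref must currently hold */
    const char *new_value;
};

static struct toy_ref *toy_find_ref(struct toy_ref *refs, size_t nr,
                                    const char *name)
{
    size_t i;

    for (i = 0; i < nr; i++)
        if (!strcmp(refs[i].name, name))
            return &refs[i];
    return NULL;
}

/* Returns 0 if every update was applied, -1 if none was. */
int toy_apply_updates(struct toy_ref *refs, size_t nr_refs,
                      const struct toy_update *updates, size_t nr_updates)
{
    size_t i;

    assert(refs && updates);
    /* Validate everything first; nothing is modified in this loop. */
    for (i = 0; i < nr_updates; i++) {
        struct toy_ref *ref = toy_find_ref(refs, nr_refs, updates[i].name);

        if (!ref || strcmp(ref->value, updates[i].expect_old))
            return -1;
    }
    /* All checks passed, so apply every update. */
    for (i = 0; i < nr_updates; i++)
        toy_find_ref(refs, nr_refs, updates[i].name)->value =
            updates[i].new_value;
    return 0;
}
```

Separating validation from application is what lets a failed check leave every reference untouched.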
Reference Namespaces
Git organizes references into logical namespaces:
- HEAD - Current branch pointer
- refs/heads/ - Local branches
- refs/tags/ - Annotated and lightweight tags
- refs/remotes/ - Remote-tracking branches
- refs/stash - Stash storage
- refs/notes/ - Metadata annotations
Branch Management
The branch module provides high-level operations for creating and managing branches. Key functions:
- create_branch() - Creates a new branch from a starting point with optional tracking setup
- create_branches_recursively() - Creates branches in superproject and submodules atomically
- dwim_and_setup_tracking() - Configures upstream tracking relationships
Branches are implemented as references in refs/heads/ that point to commits. Tracking branches link local branches to remote branches via configuration.
Reference Resolution
References are resolved recursively through refs_resolve_ref_unsafe(), which follows symbolic references to find the ultimate object ID. Resolution flags control behavior:
- RESOLVE_REF_READING - Fail if reference doesn't exist
- RESOLVE_REF_NO_RECURSE - Stop after one level of indirection
- RESOLVE_REF_ALLOW_BAD_NAME - Allow malformed reference names
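Symbolic reference resolution can be sketched with a toy table. Nothing here is Git's API: toy_resolve(), the depth limit of 5, and the in-memory table are assumptions for illustration. The "ref: <target>" convention mirrors how a symbolic ref such as HEAD is stored by the files backend.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct toy_ref {
    const char *name;
    const char *value; /* "ref: <target>" if symbolic, else an object ID */
};

#define TOY_MAX_DEPTH 5 /* assumed limit, guards against reference loops */

/*
 * Follow symbolic refs starting at `name`. Returns the object ID string,
 * or NULL if a ref is missing or the chain is too deep. With no_recurse
 * set, stops after one level and returns the target ref name instead.
 */
const char *toy_resolve(const struct toy_ref *refs, size_t nr,
                        const char *name, int no_recurse)
{
    int depth;

    assert(refs && name);
    for (depth = 0; depth <= TOY_MAX_DEPTH; depth++) {
        const char *value = NULL;
        size_t i;

        for (i = 0; i < nr; i++) {
            if (!strcmp(refs[i].name, name)) {
                value = refs[i].value;
                break;
            }
        }
        if (!value)
            return NULL;  /* missing or dangling reference */
        if (strncmp(value, "ref: ", 5))
            return value; /* direct reference: an object ID */
        name = value + 5; /* symbolic: continue with the target */
        if (no_recurse)
            return name;
    }
    return NULL; /* too many levels of indirection */
}
```

The depth limit is what keeps a cycle of symbolic refs from looping forever.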
Reflog
The reflog records all reference updates, enabling recovery of lost commits and branch history inspection. Each reference maintains a log of previous values with timestamps and operation descriptions.
Revision Walking & History Traversal
Relevant Files
- revision.c & revision.h
- commit-reach.c & commit-reach.h
- commit-graph.c & commit-graph.h
- path-walk.c & path-walk.h
Core Revision Walking API
The revision walking system provides a structured way to traverse commit history. The main entry point is the rev_info structure, which holds configuration for what commits to walk and how to traverse them.
Key workflow:
- Initialize rev_info with repo_init_revisions()
- Configure traversal options (filters, sort order, etc.)
- Call prepare_revision_walk() to build the commit list
- Iterate using get_revision() to fetch commits one by one
Traversal Modes
Git supports multiple traversal strategies controlled by flags in rev_info:
- Date-ordered: Default mode, commits sorted by commit date (newest first)
- Topological order (topo_order): Respects parent-child relationships, ensuring no parent is shown before all of its children
- Reflog walking (reflog_info): Traverses reference history instead of commit ancestry
- Limited traversal (limited): Applies filters like date ranges or path restrictions
Topological Walking
For topological ordering, Git uses a sophisticated three-phase algorithm implemented in topo_walk_info:
- Explore phase: Discovers all reachable commits and their generation numbers
- Indegree phase: Counts, for each commit, how many children in the walk point at it
- Output phase: Emits commits in topological order using priority queues
This approach leverages commit-graph generation numbers to optimize traversal, skipping unnecessary exploration of older commits.
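The indegree and output phases correspond to Kahn's algorithm for topological sorting. The sketch below is a simplified stand-in, not Git's implementation: it uses a plain FIFO queue where Git uses priority queues, it has no explore phase or generation numbers, and the toy_ names and the 64-commit limit are assumptions. It emits each commit before its parents.

```c
#include <assert.h>

#define TOY_MAX_COMMITS 64
#define TOY_MAX_PARENTS 2

struct toy_commit {
    int parents[TOY_MAX_PARENTS]; /* indices into the array, -1 if unused */
};

/*
 * Fill `order` with commit indices so that every commit appears before
 * its parents. Returns the number of commits emitted.
 */
int toy_topo_order(const struct toy_commit *commits, int nr, int *order)
{
    int indegree[TOY_MAX_COMMITS] = {0};
    int queue[TOY_MAX_COMMITS];
    int head = 0, tail = 0, out = 0, i, k;

    assert(nr >= 0 && nr <= TOY_MAX_COMMITS);

    /* Indegree phase: count the children pointing at each commit. */
    for (i = 0; i < nr; i++)
        for (k = 0; k < TOY_MAX_PARENTS; k++)
            if (commits[i].parents[k] >= 0)
                indegree[commits[i].parents[k]]++;

    /* Commits without children (the tips) are ready immediately. */
    for (i = 0; i < nr; i++)
        if (!indegree[i])
            queue[tail++] = i;

    /* Output phase: a parent becomes ready once all children are out. */
    while (head < tail) {
        int c = queue[head++];

        order[out++] = c;
        for (k = 0; k < TOY_MAX_PARENTS; k++) {
            int p = commits[c].parents[k];

            if (p >= 0 && !--indegree[p])
                queue[tail++] = p;
        }
    }
    return out;
}
```

For a merge commit with two branches that share a root, the root is emitted last because both branches must be output before its count reaches zero.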
Commit Reachability Analysis
The commit-reach.c module provides high-level reachability queries:
- repo_is_descendant_of(): Check if one commit descends from another
- repo_get_merge_bases(): Find common ancestors
- can_all_from_reach_with_flag(): Batch reachability checks with generation cutoffs
- ahead_behind(): Count commits ahead/behind between branches
These functions use commit-graph generation numbers as optimization hints to prune search spaces.
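A small sketch shows how generation numbers prune a reachability walk. The assumptions are: a generation is one more than the largest parent generation (roots get 1), parents are stored at lower array indices than their children, and all toy_ names are hypothetical. Since a descendant always has a strictly larger generation than its ancestors, a walk can stop as soon as it reaches a commit whose generation is not above the target's.

```c
#include <assert.h>

#define TOY_NO_PARENT (-1)

struct toy_commit {
    int parents[2];
    int generation;
};

/* Assumes every parent index is smaller than its child's index. */
void toy_compute_generations(struct toy_commit *c, int nr)
{
    int i, k;

    for (i = 0; i < nr; i++) {
        int max = 0;

        for (k = 0; k < 2; k++) {
            int p = c[i].parents[k];

            if (p != TOY_NO_PARENT) {
                assert(p < i);
                if (c[p].generation > max)
                    max = c[p].generation;
            }
        }
        c[i].generation = max + 1;
    }
}

/*
 * Return 1 if `target` is reachable from `from` by following parents.
 * `walked` counts the commits whose parents were expanded. A real walk
 * would also mark visited commits to avoid repeated work.
 */
int toy_can_reach(const struct toy_commit *c, int from, int target,
                  int *walked)
{
    int k;

    if (from == target)
        return 1;
    if (c[from].generation <= c[target].generation)
        return 0; /* prune: cannot be a descendant of target */
    (*walked)++;
    for (k = 0; k < 2; k++) {
        int p = c[from].parents[k];

        if (p != TOY_NO_PARENT && toy_can_reach(c, p, target, walked))
            return 1;
    }
    return 0;
}
```

When the starting commit is older than the target, the query answers without expanding a single commit.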
Commit Graph Integration
The commit-graph file accelerates history traversal by pre-computing:
- Parent relationships
- Generation numbers (topological distance from root)
- Bloom filters for path-based filtering
Functions like parse_commit_in_graph() and commit_graph_generation() provide fast lookups without parsing raw commit objects.
Path-Based Walking
The path-walk API enables efficient traversal of specific file paths across history:
int walk_objects_by_path(struct path_walk_info *info)
This batches objects by the path at which they appear, so a caller can process all versions of one path together instead of encountering paths interleaved in commit order.
Object Flags and Marking
Revision walking uses bit flags to mark commit state during traversal:
- SEEN: Commit already processed
- UNINTERESTING: Commit excluded from results (e.g., via --not)
- TREESAME: Commit has identical tree to parent (used for merge simplification)
- TOPO_WALK_EXPLORED: Commit visited during exploration phase
- TOPO_WALK_INDEGREE: Commit processed during indegree phase
These flags enable efficient single-pass algorithms without maintaining separate data structures.
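The flag technique can be sketched on a first-parent chain. The TOY_ flag names and bit positions below are assumptions chosen for the example, not the values from revision.h. One bit records that a commit is excluded, another that it was already visited, so no separate visited set is needed.

```c
#include <assert.h>

enum {
    TOY_SEEN          = 1 << 0, /* already visited by the walk */
    TOY_UNINTERESTING = 1 << 1  /* excluded from the output */
};

struct toy_commit {
    unsigned flags;
    int parent; /* index of the first parent, -1 for a root */
};

/* Exclude a commit and its ancestors, stopping at already marked ones. */
void toy_mark_uninteresting(struct toy_commit *c, int i)
{
    while (i >= 0 && !(c[i].flags & TOY_UNINTERESTING)) {
        c[i].flags |= TOY_UNINTERESTING;
        i = c[i].parent;
    }
}

/* Walk from `tip`, counting commits that are new and not excluded. */
int toy_count_shown(struct toy_commit *c, int tip)
{
    int n = 0;

    assert(c);
    while (tip >= 0 && !(c[tip].flags & TOY_SEEN)) {
        c[tip].flags |= TOY_SEEN;
        if (!(c[tip].flags & TOY_UNINTERESTING))
            n++;
        tip = c[tip].parent;
    }
    return n;
}
```

A second walk over the same commits returns zero because the TOY_SEEN bit stops it at the first commit.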
Merge Algorithms & Diff Engine
Relevant Files
- merge-ort.c & merge-ort.h - ORT merge strategy (default)
- merge-ort-wrappers.c & merge-ort-wrappers.h - Wrapper functions
- merge-ll.c & merge-ll.h - Low-level three-way merge
- diff.c & diff.h - Diff engine and options
- diffcore.h - Diff core data structures
- xdiff-interface.c - XDiff library integration
- xdiff/xmerge.c - XDiff merge implementation
Overview
Git's merge and diff systems work together to reconcile competing changes. The ORT (Ostensibly Recursive's Twin) strategy is the default merge algorithm, replacing the older recursive strategy. It performs three-way merges with rename detection and handles complex scenarios like directory renames and content conflicts.
Merge Architecture
The merge process follows this pipeline:
- Collect merge info - Traverse all three trees (base, side1, side2) and build a map of all paths
- Detect renames - Use diffcore to identify file renames and copies
- Process entries - Resolve conflicts for each path, applying merge drivers
- Output result - Generate merged tree with conflict markers where needed
Key Data Structures
merge_options - Configuration for merge behavior:
- detect_renames - Enable rename detection
- xdl_opts - XDiff options (patience, histogram, ignore whitespace)
- conflict_style - Marker format (merge, diff3, zdiff3)
- recursive_variant - Resolution strategy (normal, ours, theirs)
conflict_info - Per-path conflict metadata:
- stages[3] - OID & mode for base, side1, side2
- pathnames[3] - Paths after rename detection
- df_conflict - Directory/file conflict flag
- path_conflict - Rename/delete or other path conflicts
Diff Engine
The diff system processes file pairs through a pipeline:
XDiff Integration - Low-level diff algorithm:
- Compares files line-by-line using Myers or histogram algorithm
- Produces hunks with context lines
- Supports ignore patterns (whitespace, regex)
Three-way Merge (ll_merge) - Merges individual files:
- Calls xdl_merge() for text files
- Handles binary files (no merge)
- Applies merge drivers from .gitattributes
- Generates conflict markers on failure
Conflict Resolution
Conflicts are marked with <<<<<<<, =======, >>>>>>> markers. The merge process handles:
- Content conflicts - Text merge failed; markers inserted
- Rename/rename - Same file renamed differently on both sides
- Rename/delete - File renamed on one side, deleted on other
- Directory renames - Inferred from file movements
- Mode conflicts - File mode differs (executable, symlink)
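The basic three-way decision behind a content conflict can be sketched on whole values. This is a deliberate simplification: real merges work hunk by hunk on lines, and the function name and the "ours"/"theirs" marker labels are assumptions. A side that equals the base made no change, so the other side wins; if both sides changed differently, the result is a conflict.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Merge three versions of one value into `out`.
 * Returns 0 for a clean merge, 1 for a conflict (markers are written),
 * or -1 if the buffer is too small.
 */
int toy_merge3(const char *base, const char *ours, const char *theirs,
               char *out, size_t outsz)
{
    const char *pick = NULL;
    int n;

    assert(base && ours && theirs && out);
    if (!strcmp(ours, theirs))
        pick = ours;   /* both sides agree */
    else if (!strcmp(base, ours))
        pick = theirs; /* only theirs changed */
    else if (!strcmp(base, theirs))
        pick = ours;   /* only ours changed */

    if (pick)
        n = snprintf(out, outsz, "%s", pick);
    else
        n = snprintf(out, outsz,
                     "<<<<<<< ours\n%s\n=======\n%s\n>>>>>>> theirs\n",
                     ours, theirs);
    if (n < 0 || (size_t)n >= outsz)
        return -1;
    return pick ? 0 : 1;
}
```

Path-level conflicts such as rename/delete need extra bookkeeping that this value-level rule does not cover.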
Performance Optimizations
- Rename caching - Reuses rename detection across sequential merges (cherry-pick, rebase)
- Histogram diff - Faster than Myers for large files with repetitive content
- Lazy loading - Blobs fetched only when needed for merge drivers
- Sparse checkout - Merges only relevant paths in partial clones
Command Implementation & Porcelain
Relevant Files
- builtin.h
- git.c
- builtin/commit.c
- builtin/checkout.c
- builtin/merge.c
- builtin/rebase.c
- builtin/fetch.c
- builtin/push.c
- builtin/log.c
- command-list.txt
Git distinguishes between porcelain commands (user-facing, high-level) and plumbing commands (low-level, internal). This section covers how builtin commands are implemented and registered in the Git codebase.
Command Registration
All builtin commands are registered in the commands[] array in git.c. Each entry contains:
{ "command-name", cmd_function, FLAGS }
The flags control command behavior:
- RUN_SETUP – Requires a Git repository; changes to repo root if in subdirectory
- RUN_SETUP_GENTLY – Accepts missing repository gracefully
- NEED_WORK_TREE – Requires a working tree (not bare repository)
- DELAY_PAGER_CONFIG – Defers pager configuration to the command itself
- NO_PARSEOPT – Command handles option parsing manually
Command Implementation Pattern
Every builtin command follows a standard signature defined in builtin.h:
int cmd_foo(int argc, const char **argv,
const char *prefix, struct repository *repo)
The prefix parameter contains the relative path from the repository root to the directory where the command was invoked. This enables commands to resolve user-supplied paths correctly.
Porcelain vs. Plumbing
Commands are classified in command-list.txt by type:
- mainporcelain – Primary user-facing commands (commit, checkout, merge)
- ancillarymanipulators/interrogators – Secondary user commands (branch, tag)
- plumbingmanipulators/interrogators – Low-level internal commands (cat-file, hash-object)
- purehelpers – Utility commands (check-attr, credential)
Complex Command Example: Commit
The git commit command (builtin/commit.c) demonstrates typical patterns:
- Option parsing using parse_options() with a struct option array
- State management via struct wt_status for working tree status
- Index manipulation to stage changes before creating the commit object
- Hook execution for pre-commit and post-commit workflows
- Ref updates to move HEAD to the new commit
Commands with subcommands, such as git bisect, use OPT_SUBCOMMAND() to delegate to specialized handlers, enabling modular command hierarchies.
Execution Flow
When a user runs git foo:
- git.c:cmd_main() parses global options
- handle_builtin() looks up the command in the commands[] array
- Repository setup occurs based on flags (RUN_SETUP, etc.)
- The command function executes with prepared context
- Exit status is returned and validated
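The lookup and dispatch steps can be sketched with a toy command table. Everything prefixed toy_ is hypothetical, the flag values are made up, and the handler signature omits the repository argument for brevity. The status 128 follows the exit code Git uses for fatal errors.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_RUN_SETUP      (1u << 0) /* command needs a repository */
#define TOY_NEED_WORK_TREE (1u << 1) /* command needs a working tree */

typedef int (*toy_cmd_fn)(int argc, const char **argv, const char *prefix);

struct toy_cmd {
    const char *name;
    toy_cmd_fn fn;
    unsigned flags;
};

int toy_cmd_status(int argc, const char **argv, const char *prefix)
{
    (void)argc; (void)argv; (void)prefix;
    return 0;
}

int toy_cmd_version(int argc, const char **argv, const char *prefix)
{
    (void)argc; (void)argv; (void)prefix;
    return 0;
}

const struct toy_cmd toy_commands[] = {
    { "status", toy_cmd_status, TOY_RUN_SETUP | TOY_NEED_WORK_TREE },
    { "version", toy_cmd_version, 0 },
};

/*
 * Look up `name` and run it. Returns -1 for an unknown command, 128 if
 * the command needs a repository and none was found, else the
 * command's own exit status.
 */
int toy_run(const char *name, int in_repo, int argc, const char **argv)
{
    size_t i, nr = sizeof(toy_commands) / sizeof(toy_commands[0]);

    assert(name);
    for (i = 0; i < nr; i++) {
        const struct toy_cmd *cmd = &toy_commands[i];

        if (strcmp(cmd->name, name))
            continue;
        if ((cmd->flags & TOY_RUN_SETUP) && !in_repo)
            return 128;
        /* prefix is empty here: the sketch always runs at the root */
        return cmd->fn(argc, argv, "");
    }
    return -1;
}
```

Keeping requirements as flags in the table lets the dispatcher perform setup once, so individual commands do not repeat it.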