Git 5: Blobs and Trees
Git Is Like A UNIX Filesystem
Git is a lot like UNIX, which is unsurprising given that it was written by Linus Torvalds. This is all covered very well in Git From the Bottom Up, so I will draw heavily from there.
Git represents your file’s contents in blobs, which are also leaf nodes in something awfully close to a directory, called a tree. Just as an i-node is uniquely identified by a system-assigned number, a blob is named by computing the SHA1 hash ID of its size and contents. For all intents and purposes this is just an arbitrary number, like an i-node, except that it has two additional properties: first, it verifies the blob’s contents will never change; and second, the same contents shall always be represented by the same blob, no matter where it appears: across commits, across repositories — even across the whole Internet. If multiple trees reference the same blob, this is just like hard-linking: the blob will not disappear from your repository as long as there is at least one link remaining to it.
The difference between a Git blob and a filesystem’s file is that a blob stores no metadata about its content. All such information is kept elsewhere: names and file modes live in the tree that holds the blob, while dates and authorship are recorded in commits. One tree may know those contents as a file named “foo” in a commit made in August 2004, while another tree may know the same contents as a file named “bar” in a commit made five years later. In a normal filesystem, two files with the same contents but with such different metadata would always be represented as two independent files. Why this difference? Mainly, it’s because a filesystem is designed to support files that change, whereas Git is not. The fact that data is immutable in the Git repository is what makes all of this work, and so a different design was needed. And as it turns out, this design allows for much more compact storage, since all objects having identical content can be shared, no matter where they are.
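To make the naming scheme concrete, here is a minimal Python sketch of how a blob's hash ID can be computed and why identical contents end up stored only once. The header layout (the word blob, the size in bytes, a NUL byte, then the contents) is Git's own; the in-memory store dictionary and the put_blob helper are hypothetical stand-ins for a repository and are not part of Git.

```python
import hashlib

def blob_id(data: bytes) -> str:
    """Return the SHA-1 ID Git assigns to a blob with these contents."""
    header = b"blob %d\x00" % len(data)  # object type, size in bytes, NUL
    return hashlib.sha1(header + data).hexdigest()

# Hypothetical in-memory "repository": maps blob ID -> contents.
store = {}

def put_blob(data: bytes) -> str:
    """Store contents under their ID; identical contents share one entry."""
    oid = blob_id(data)
    store[oid] = data
    return oid

# Two "trees" that know the same contents under different names.
tree_a = {"foo": put_blob(b"Hello, world!\n")}
tree_b = {"bar": put_blob(b"Hello, world!\n")}

print(tree_a["foo"] == tree_b["bar"])  # True: same contents, same ID
print(len(store))                      # 1: only one copy is stored
```

Because the ID depends on nothing but the size and the bytes, the name a tree uses for the contents never enters the calculation.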
Introducing The Blob
Now that the basic picture has been painted, let’s get into some practical examples. I’m going to start by creating a sample Git repository, and showing how Git works from the bottom up in that repository. Feel free to follow along as you read:
mkdir sample
cd sample
echo 'Hello, world!' > greeting
Pro tip: if you’re unfamiliar with what the echo command above is doing, see I/O redirection. In a nutshell, it redirects the output of echo 'Hello, world!' to a file called greeting.
Here I’ve created a new filesystem directory named “sample” which contains a file whose contents are prosaically predictable. I haven’t even created a repository yet, but already I can start using some of Git’s commands to understand what it’s going to do. First of all, I’d like to know which hash ID Git is going to store my greeting text under:
git hash-object greeting
af5626b4a114abcb82d63db7c8082c3c4756e51b
See the git hash-object documentation.
If you run this command on your system, you’ll get the same hash ID. Even though we’re creating two different repositories (possibly a world apart, even) our greeting blob in those two repositories will have the same hash ID. I could even pull commits from your repository into mine, and Git would realize that we’re tracking the same content — and so would only store one copy of it! Pretty cool. The next step is to initialize a new repository and commit the file into it. I’m going to do this all in one step right now, but then come back and do it again in stages so you can see what’s going on underneath:
git init
git add greeting
git commit -m "Added my greeting"
At this point our blob should be in the system exactly as we expected, using the hash ID determined above. As a convenience, Git requires only as many digits of the hash ID as are necessary to uniquely identify it within the repository. Usually just six or seven digits are enough:
git cat-file -t af5626b
blob
git cat-file blob af5626b
Hello, world!
This merits further explanation. The command “git cat-file -t af5626b” tells Git to show the type of the object af5626b. The command “git cat-file blob af5626b” tells Git to show the contents of blob af5626b.
There it is! I haven’t even looked at which commit holds it, or what tree it’s in, but based solely on the contents I was able to assume it’s there, and there it is. It will always have this same identifier, no matter how long the repository lives or where the file within it is stored. These particular contents are now verifiably preserved, forever.
In this way, a Git blob represents the fundamental data unit in Git. Really, the whole system is about blob management.
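The abbreviated hash IDs used above work because Git only needs a prefix that matches exactly one object in the repository. Here is a rough Python sketch of that kind of lookup, assuming the full IDs are available as a plain list. The resolve_prefix helper is illustrative and is not how Git implements it internally; the four-digit minimum mirrors Git's own lower limit for abbreviations. The list holds the blob ID from above plus two other IDs that appear later in this post.

```python
from typing import Sequence

def resolve_prefix(prefix: str, object_ids: Sequence[str]) -> str:
    """Expand an abbreviated hash ID to the single full ID it matches."""
    if len(prefix) < 4:
        raise ValueError("prefix too short to be a valid abbreviation")
    matches = [oid for oid in object_ids if oid.startswith(prefix.lower())]
    if not matches:
        raise KeyError(f"no object matches {prefix!r}")
    if len(matches) > 1:
        raise KeyError(f"{prefix!r} is ambiguous: {len(matches)} candidates")
    return matches[0]

object_ids = [
    "0563f77d884e4f79ce95117e2d686d7d6e282887",
    "588483b99a46342501d99e3f10630cfc1219ea32",
    "af5626b4a114abcb82d63db7c8082c3c4756e51b",
]
print(resolve_prefix("af5626b", object_ids))
# af5626b4a114abcb82d63db7c8082c3c4756e51b
```

The more objects a repository holds, the longer a prefix has to be before it is unique, which is why six or seven digits is a rule of thumb rather than a guarantee.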
Blobs Are Stored In Trees
The contents of your files are stored in blobs, but those blobs are pretty featureless. They have no name, no structure — they’re just “blobs,” after all.
In order for Git to represent the structure and naming of your files, it attaches blobs as leaf nodes within a tree. Now, I can’t discover which tree(s) a blob lives in just by looking at it, since it may have many, many owners. But I know it must live somewhere within the tree held by the commit I just made:
git ls-tree HEAD
100644 blob af5626b4a114abcb82d63db7c8082c3c4756e51b greeting
Again this calls for further explanation: ‘git ls-tree HEAD’ tells Git to list the entries of the tree belonging to the commit that HEAD points to.
There it is! This first commit added my greeting file to the repository. This commit contains one Git tree, which has a single leaf: the greeting content’s blob.
Although I can look at the tree containing my blob by passing HEAD to ls-tree, I haven’t yet seen the underlying tree object referenced by that commit. Here are a few other commands to highlight that difference and thus discover my tree:
git rev-parse HEAD
588483b99a46342501d99e3f10630cfc1219ea32 # different on your system
git cat-file -t HEAD
commit
git cat-file commit HEAD
tree 0563f77d884e4f79ce95117e2d686d7d6e282887
author John Wiegley <johnw@newartisans.com> 1209512110 -0400
committer John Wiegley <johnw@newartisans.com> 1209512110 -0400

Added my greeting
See the git rev-parse documentation.
The first command decodes the HEAD alias into the commit it references, the second verifies its type, while the third command shows the hash ID of the tree held by that commit, as well as the other information stored in the commit object. The hash ID for the commit is unique to my repository — because it includes my name and the date when I made the commit — but the hash ID for the tree should be common between your example and mine, containing as it does the same blob under the same name.
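The text printed by git cat-file commit HEAD has a simple layout: header lines of the form key, space, value, then a blank line, then the commit message. Below is a minimal Python parsing sketch that assumes exactly that layout and uses the commit shown above as sample input. The parse_commit helper is hypothetical, and it ignores multi-line headers (such as signatures) that real commits can carry.

```python
def parse_commit(raw: str) -> dict:
    """Split a raw commit object into its header fields and message."""
    head, _, message = raw.partition("\n\n")
    fields = {"parents": []}
    for line in head.splitlines():
        key, _, value = line.partition(" ")
        if key == "parent":  # a commit may have zero or more parents
            fields["parents"].append(value)
        else:
            fields[key] = value
    fields["message"] = message
    return fields

raw = (
    "tree 0563f77d884e4f79ce95117e2d686d7d6e282887\n"
    "author John Wiegley <johnw@newartisans.com> 1209512110 -0400\n"
    "committer John Wiegley <johnw@newartisans.com> 1209512110 -0400\n"
    "\n"
    "Added my greeting\n"
)
commit = parse_commit(raw)
print(commit["tree"])     # 0563f77d884e4f79ce95117e2d686d7d6e282887
print(commit["parents"])  # [] because this is the first commit
```

Notice that the commit does not mention the greeting file or its blob at all. It only names a tree, and the tree in turn names the blob.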
Another way to see these objects is to look directly inside the .git directory:
find .git/objects -type f | sort
.git/objects/05/63f77d884e4f79ce95117e2d686d7d6e282887
.git/objects/58/8483b99a46342501d99e3f10630cfc1219ea32
.git/objects/af/5626b4a114abcb82d63db7c8082c3c4756e51b
See the find and sort man pages.
From this output I see that the whole of my repo contains three objects, each of whose hash ID has appeared in the preceding examples.
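The paths in that listing follow a pattern: the first two hex digits of the hash ID become a directory name and the remaining 38 become the file name. Each such "loose" object file holds the object's header and contents, compressed with zlib. The Python sketch below writes and reads a blob in that layout, using a scratch directory rather than a real repository. The write_loose_object and read_loose_object helpers are my own names, and the sketch ignores packfiles, which Git also uses for storage.

```python
import hashlib
import os
import tempfile
import zlib

def write_loose_object(objects_dir: str, obj_type: str, body: bytes) -> str:
    """Write an object in Git's loose-object layout and return its ID."""
    raw = b"%s %d\x00" % (obj_type.encode(), len(body)) + body
    oid = hashlib.sha1(raw).hexdigest()
    path = os.path.join(objects_dir, oid[:2], oid[2:])
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(zlib.compress(raw))
    return oid

def read_loose_object(objects_dir: str, oid: str):
    """Read a loose object back and return (type, contents)."""
    path = os.path.join(objects_dir, oid[:2], oid[2:])
    with open(path, "rb") as f:
        raw = zlib.decompress(f.read())
    header, _, body = raw.partition(b"\x00")
    obj_type, size = header.decode().split(" ")
    if int(size) != len(body):
        raise ValueError("size in header does not match contents")
    return obj_type, body

# Scratch directory standing in for .git/objects (not a real repository).
objects_dir = os.path.join(tempfile.mkdtemp(), "objects")
oid = write_loose_object(objects_dir, "blob", b"Hello, world!\n")
print(oid[:2] + "/" + oid[2:])  # af/5626b4a114abcb82d63db7c8082c3c4756e51b
print(read_loose_object(objects_dir, oid))  # ('blob', b'Hello, world!\n')
```

This is also roughly what git cat-file is doing for us when it reports an object's type and contents.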
How Trees Are Made
Every commit holds a single tree, but how are trees made? We know that blobs are created by stuffing the contents of your files into blobs — and that trees own blobs — but we haven’t yet seen how the tree that holds the blob is made, or how that tree gets linked to its parent commit.
Let’s start with a new sample repository again, but this time by doing things manually, so you can get a feeling for exactly what’s happening under the hood:
rm -fr greeting .git
echo 'Hello, world!' > greeting
git init
git add greeting
Pro tip: this example uses the ‘rm -fr’ command. I much prefer to use Ubuntu’s trash because I’ve blown away more than my fair share of filesystems with rm.
It all starts when you first add a file to the index. For now, let’s just say that the index is what you use to initially create blobs out of files. When I added the file greeting, a change occurred in my repository. I can’t see this change as a commit yet, but here is one way I can tell what happened:
git log # this will fail, there are no commits!
fatal: bad default revision 'HEAD'
git ls-files --stage # list blob referenced by the index
100644 af5626b4a114abcb82d63db7c8082c3c4756e51b 0 greeting
See the git ls-files documentation.
What’s this? I haven’t committed anything to the repository yet, but already an object has come into being. It has the same hash ID I started this whole business with, so I know it represents the contents of my greeting file. I could use cat-file -t at this point on the hash ID, and I’d see that it was a blob. It is, in fact, the same blob I got the first time I created this sample repository. The same file will always result in the same blob (just in case I haven’t stressed that enough).
This blob isn’t referenced by a tree yet, nor are there any commits. At the moment it is only referenced from a file named .git/index, which references the blobs and trees that make up the current index. So now let’s make a tree in the repo for our blob to hang off of:
git write-tree # record the contents of the index in a tree
0563f77d884e4f79ce95117e2d686d7d6e282887
See the git write-tree documentation.
This number should look familiar as well: a tree containing the same blobs (and sub-trees) will always have the same hash ID. I don’t have a commit object yet, but now there is a tree object in that repository which holds the blob. The purpose of the low-level write-tree command is to take whatever the contents of the index are and tuck them into a new tree for the purpose of creating a commit.
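To see why the same blobs under the same names always give the same tree ID, it helps to compute one by hand. A tree object is a list of entries, each made of a file mode, a space, a name, a NUL byte, and the 20 raw bytes of the entry's hash ID, and the whole list is hashed behind a tree header. The Python sketch below assumes that layout; tree_id is a hypothetical helper, and its sorting is simplified to plain file names (Git orders sub-tree names as if they ended in a slash).

```python
import hashlib

def tree_id(entries) -> str:
    """Compute a tree's ID from (mode, name, hex_id) entries.

    Simplified: sorts by raw name bytes, which is only right when
    the tree holds plain files and no sub-trees.
    """
    body = b""
    for mode, name, hex_id in sorted(entries, key=lambda e: e[1].encode()):
        body += mode.encode() + b" " + name.encode() + b"\x00"
        body += bytes.fromhex(hex_id)  # 20 raw bytes, not 40 hex digits
    header = b"tree %d\x00" % len(body)
    return hashlib.sha1(header + body).hexdigest()

greeting_blob = "af5626b4a114abcb82d63db7c8082c3c4756e51b"
print(tree_id([("100644", "greeting", greeting_blob)]))
```

If the sketch mirrors Git's format faithfully, the printed ID should be the same one write-tree reported above. Renaming the file or changing its mode produces a different tree ID even though the blob stays the same.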
I can manually make a new commit object by using this tree directly, which is just what the commit-tree command does:
echo "Initial commit" | git commit-tree 0563f77
5f1bc85745dcccce6121494fdd37658cb4ad441f
For the uninitiated, the echo command above uses a pipe, which can be non-intuitive: it feeds the text “Initial commit” to the Git commit-tree command, which reads the commit message from its standard input.
The raw commit-tree command takes a tree’s hash ID and makes a commit object to hold it. If I had wanted the commit to have a parent, I would have had to specify the parent commit’s hash ID explicitly using the -p option. Also, note here that the hash ID differs from what will appear on your system: This is because my commit object refers to both my name, and the date at which I created the commit, and these two details will always be different from yours.
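The reason two people get different commit IDs for the same tree can be sketched in a few lines of Python. A commit object is hashed like any other object, but its text includes the author, the committer, and their timestamps, so any difference there changes the ID. The commit_id helper and the two identities below are hypothetical and used purely for illustration; the optional parents argument plays the role of the -p option.

```python
import hashlib

def commit_id(tree, message, name, email, timestamp, tz="+0000", parents=()):
    """Compute the ID of a commit object built from these fields."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]  # what -p would add
    ident = f"{name} <{email}> {timestamp} {tz}"
    lines += [f"author {ident}", f"committer {ident}", "", message]
    body = ("\n".join(lines) + "\n").encode()
    header = b"commit %d\x00" % len(body)
    return hashlib.sha1(header + body).hexdigest()

tree = "0563f77d884e4f79ce95117e2d686d7d6e282887"

# Same tree, same message, but different (made-up) people and times.
mine = commit_id(tree, "Initial commit", "Alice Example",
                 "alice@example.com", 1209512110)
yours = commit_id(tree, "Initial commit", "Bob Example",
                  "bob@example.com", 1604970504)

print(mine == yours)  # False: the identity and date are part of the hash
```

The tree ID stays identical in both commits, which is exactly the split described above: content is shared, while history is personal.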
Our work is not done yet, though, since I haven’t registered the commit as the new head of a branch:
echo 5f1bc85745dcccce6121494fdd37658cb4ad441f > .git/refs/heads/master
Note that this requires the full commit hash ID, which on your system will certainly be different from 5f1bc….
This command tells Git that the branch name “master” should now refer to our recent commit. Another, much safer way to do this is by using the command update-ref:
git update-ref refs/heads/master 5f1bc857
See the git update-ref documentation.
After creating master, we must associate our working tree with it. Normally this happens for you whenever you check out a branch:
git symbolic-ref HEAD refs/heads/master
See the git symbolic-ref documentation.
This command associates HEAD symbolically with the master branch. This is significant because any future commits from the working tree will now automatically update the value of refs/heads/master.
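Underneath, both a branch and HEAD can be plain text files: a branch file holds a full commit hash ID, and HEAD holds a line of the form ref: followed by the branch it follows. The Python sketch below imitates that layout in a scratch directory to show why moving the branch also moves HEAD. The helper names are mine, and the sketch deliberately ignores packed refs, reflogs, and the locking that the real commands take care of.

```python
import os
import re
import tempfile

def write_ref(git_dir: str, ref: str, commit_id: str) -> None:
    """Store a full commit hash ID in a ref file such as refs/heads/master."""
    if not re.fullmatch(r"[0-9a-f]{40}", commit_id):
        raise ValueError("a ref file must hold a full 40-digit hash ID")
    path = os.path.join(git_dir, *ref.split("/"))
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        f.write(commit_id + "\n")

def write_symbolic_ref(git_dir: str, name: str, target: str) -> None:
    """Make a ref such as HEAD point at another ref by name."""
    with open(os.path.join(git_dir, name), "w") as f:
        f.write(f"ref: {target}\n")

def resolve_ref(git_dir: str, name: str) -> str:
    """Follow symbolic refs until a commit hash ID is reached."""
    with open(os.path.join(git_dir, *name.split("/"))) as f:
        value = f.read().strip()
    if value.startswith("ref: "):
        return resolve_ref(git_dir, value[len("ref: "):])
    return value

git_dir = tempfile.mkdtemp()  # scratch stand-in for a .git directory
write_ref(git_dir, "refs/heads/master", "5f1bc85745dcccce6121494fdd37658cb4ad441f")
write_symbolic_ref(git_dir, "HEAD", "refs/heads/master")
print(resolve_ref(git_dir, "HEAD"))  # 5f1bc85745dcccce6121494fdd37658cb4ad441f
```

Since HEAD stores a name rather than a hash ID, updating refs/heads/master is all it takes for HEAD to resolve to the newer commit.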
It’s hard to believe it’s this simple, but yes, I can now use log to view my newly minted commit (as before, the hash ID, author, and date will differ on your system):
git log
commit f0a0bf17c1f190cce2c201fa4851213576ddf170 (HEAD -> master)
Author: Michael Day <mday299@yahoo.com>
Date: Mon Nov 9 20:08:24 2020 -0500
Initial commit
Press q to exit the pager when done.
A side note: if I hadn’t set refs/heads/master to point to the new commit, it would have been considered “unreachable,” since nothing currently refers to it nor is it the parent of a reachable commit. When this is the case, the commit object will at some point be removed from the repository, along with its tree and any blobs that nothing else references. (This is done automatically by a command called gc, which you rarely need to run manually.) By linking the commit to a name within refs/heads, as we did above, it becomes a reachable commit, which ensures that it’s kept around from now on.
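Reachability is easy to picture as a graph walk: start from every ref, follow each commit to its parents and tree, follow each tree to its blobs, and keep whatever you visit. The toy Python sketch below does this over a hand-written object graph. The labels are illustrative stand-ins for hash IDs, and real garbage collection is more cautious than this (it also honors the reflog and a grace period before deleting anything).

```python
def reachable(objects, refs):
    """Return the IDs of all objects reachable from the given refs."""
    seen = set()
    stack = list(refs.values())
    while stack:
        oid = stack.pop()
        if oid in seen:
            continue
        seen.add(oid)
        stack.extend(objects[oid])  # follow links: parents, tree, blobs
    return seen

# Toy object graph: ID -> IDs it links to (labels are made up).
objects = {
    "commit-1": ["tree-1"],
    "tree-1": ["blob-greeting"],
    "blob-greeting": [],
    "commit-orphan": ["tree-2"],  # no ref points here
    "tree-2": ["blob-greeting", "blob-other"],
    "blob-other": [],
}
refs = {"refs/heads/master": "commit-1"}

keep = reachable(objects, refs)
garbage = sorted(set(objects) - keep)
print(garbage)  # ['blob-other', 'commit-orphan', 'tree-2']
```

The greeting blob survives even though the orphaned tree also links to it, because one reachable tree still refers to it. This is the hard-link behavior described at the start of this post.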
Next
Next time we’ll explore Git’s branching system.
Feedback
As always, do make a comment or write me an email if you have something to say about this post!