Git 1: Why Git?

Introduction

Welcome to my humble attempt to document the Git version control system. I hope you learn a lot from it. I’ll be doing this on Ubuntu 18.04.

Key Git terms (see Git from the Bottom Up):

  • repository: a collection of commits, each of which is an archive of what the project’s working tree looked like at a past date, whether on your machine or someone else’s. It also defines HEAD (see below), which identifies the branch or commit the current working tree stemmed from. Lastly, it contains a set of branches and tags, to identify certain commits by name.

  • the index: Unlike similar tools you may have used, Git does not commit changes directly from the working tree into the repository. Instead, changes are first registered in something called the index. Think of it as a way of “confirming” your changes, one by one, before doing a commit (which records all your approved changes at once). Some find it helpful to call it the “staging area” instead of the index.

  • working tree: any directory on your filesystem which has a repository associated with it (typically indicated by the presence of a sub-directory within it named .git). It includes all the files and sub-directories in that directory.

  • commit: a snapshot of your working tree at some point in time. The state of HEAD (see below) at the time your commit is made becomes that commit’s parent. This is what creates the notion of a “revision history”.

  • branch: just a name for a commit, also called a reference. It’s the parentage of a commit which defines its history, and thus the typical notion of a “branch of development”.

  • tag: a name for a commit, similar to a branch, except that it always names the same commit, and can have its own description text.

  • master: The mainline of development in most repositories is done on a branch called master. Although this is a typical default, you are not required to use it.

  • HEAD: used by your repository to define what is currently checked out.

    • If you check out a branch, HEAD symbolically refers to that branch, indicating that the branch name should be updated after the next commit operation.

    • If you check out a specific commit, HEAD refers to that commit only. This is referred to as a detached HEAD, and occurs, for example, if you check out a tag name.
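To make the two HEAD cases a bit more concrete, here is a minimal sketch. It assumes a throwaway repository created with mktemp; the file name notes.txt, the tag v0.1, and the user identity are placeholders I made up for illustration.

```shell
# Throwaway repository so nothing real is touched.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"   # placeholder identity

echo "first" > notes.txt
git add notes.txt
git commit -q -m "First commit"

# With a branch checked out, HEAD is a symbolic reference to that branch.
attached=$(git symbolic-ref HEAD)
echo "HEAD -> $attached"

# A tag is a name for one specific commit.
git tag v0.1
git checkout -q v0.1

# After checking out the tag, HEAD is no longer a symbolic reference,
# so symbolic-ref exits with a non-zero status.
if git symbolic-ref -q HEAD >/dev/null; then
    head_state="attached"
else
    head_state="detached"
fi
echo "HEAD is $head_state at $(git rev-parse --short HEAD)"
```

The branch name printed in the first case depends on your Git version and configuration, which is why the sketch asks Git for it rather than assuming master.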

The usual flow of events is this: After creating a repository, your work is done in the working tree. Once your work reaches a significant point (the completion of a bug fix, the end of the working day, a moment when everything compiles), you add your changes successively to the index. Once the index contains everything you intend to commit, you record its content in the repository. Here’s a simple diagram that shows a typical project’s life-cycle:

lifecycle.png

With this basic picture in mind, the following sections shall attempt to describe how each of these different entities is important to the operation of Git.

Intro To Version Control

No tutorial on Git would be complete without a section on version control. It is covered in the Git book, but I wanted to get into some historical version control systems, including SCCS, CVS, Mercurial, Subversion, and SourceSafe, among others. We’ll defer my personal history to the end of this series, where I will give my opinions of each of the VCSs I’ve been exposed to.

Features of Git

The following is an excerpt from this section of the git book.

Git Stores Snapshots, Not Differences

Like Mercurial, Git stores snapshots of entire files rather than the differences between them (which is how CVS and Subversion store files). A VCS like CVS stores files like this:

deltas.png

What Git does to the same files is the following:

snapshots.png

This makes Git behave like a mini-filesystem, which makes branching and merging easier to understand.
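One way to see the snapshot model for yourself is to compare what two commits record for each file. This is a minimal sketch in a throwaway repository; the file names a.txt and b.txt and the user identity are placeholders of mine.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"   # placeholder identity

echo "alpha" > a.txt
echo "beta" > b.txt
git add a.txt b.txt
git commit -q -m "Add a and b"

# Change only a.txt and commit again.
echo "alpha v2" > a.txt
git commit -q -am "Change a only"

# Each commit points at a tree listing every file in the snapshot.
git cat-file -p 'HEAD^{tree}'

# Object IDs recorded for each file in the previous and the current commit.
a_before=$(git rev-parse HEAD~1:a.txt)
a_after=$(git rev-parse HEAD:a.txt)
b_before=$(git rev-parse HEAD~1:b.txt)
b_after=$(git rev-parse HEAD:b.txt)

echo "a.txt: $a_before -> $a_after"
echo "b.txt: $b_before -> $b_after"
```

Both snapshots list both files, but the unchanged b.txt points at the same stored object in each commit, so Git does not store it a second time.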

Git Is Crazy Fast

Most operations in Git need only local files and resources. Because the entire history of the project is right there on your local disk, most operations seem almost instantaneous.

This also means that there is very little you can’t do if you’re offline or off VPN. If you get on an airplane or a train and want to do a little work, you can commit happily (to your local copy, remember?) until you get to a network connection to upload. If you go home and can’t get your VPN client working properly, you can still work. In many other systems, doing so is either impossible or painful. In Perforce, for example, you can’t do much when you aren’t connected to the server; in Subversion and CVS, you can edit files, but you can’t commit changes to your database (because your database is offline). This may not seem like a huge deal, but you may be surprised what a big difference it can make.

Git Is (Nearly) Incorruptible

Everything in Git is checksummed before it is stored and is then referred to by that checksum. This means it’s impossible to change the contents of any file or directory without Git knowing about it. This functionality is built into Git at the lowest levels and is integral to its philosophy. You can’t lose information in transit or get file corruption without Git being able to detect it.

The mechanism that Git uses for this checksumming is called a SHA-1 hash. You will see these hash values all over the place in Git because it uses them so much. In fact, Git stores everything in its database not by file name but by the hash value of its contents.
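If you are curious how such a hash is derived, the following sketch asks Git for the ID of a small piece of content and then recomputes it by hand. It assumes the default SHA-1 object format and that the sha1sum utility (part of GNU coreutils, present on Ubuntu) is available; the content "hello" is just an example.

```shell
# Ask Git which object ID it would assign to the six bytes "hello\n".
git_id=$(printf 'hello\n' | git hash-object --stdin)

# Recompute it manually: Git hashes a header "blob <size>" followed by a
# NUL byte, and then the content itself.
manual_id=$(printf 'blob 6\0hello\n' | sha1sum | cut -d' ' -f1)

echo "git hash-object: $git_id"
echo "manual SHA-1:    $manual_id"
```

The two values should match, which shows that the ID depends only on the content (plus a small header) and not on the file name.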

Git Generally Only Adds Data

When you do actions in Git, nearly all of them only add data to the Git database. It is hard to get the system to do anything that is not undoable or to make it erase data in any way. As with any VCS, you can lose or mess up changes you haven’t committed yet, but after you commit a snapshot into Git, it is very difficult to lose, especially if you regularly push your database to another repository. This makes using Git a joy because we know we can experiment without the danger of severely screwing things up.
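As a rough illustration of how hard it is to lose committed work, the sketch below throws a commit away with a hard reset and then gets it back through the reflog. It assumes a throwaway repository with default settings (the reflog is enabled by default in ordinary repositories); the file name and user identity are placeholders.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"   # placeholder identity

echo "one" > file.txt
git add file.txt
git commit -q -m "one"

echo "two" >> file.txt
git commit -q -am "two"
lost=$(git rev-parse HEAD)

# Move the branch back one commit; "two" is no longer on the branch.
git reset -q --hard HEAD~1

# The commit object is still in the database...
git cat-file -t "$lost"

# ...and the reflog remembers where HEAD was before the reset.
git reflog
previous=$(git rev-parse 'HEAD@{1}')

# Point the branch back at the commit we "lost".
git reset -q --hard "$previous"
recovered=$(git rev-parse HEAD)
echo "recovered $recovered"
```

Note that this safety net only covers committed work; the hard reset would still discard any uncommitted edits in the working tree.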

How Git Works

Pay attention now — here is the main thing to remember about Git if you want the rest of your learning process to go smoothly. Git has three main states that your files can reside in: modified, staged, and committed:

  • Modified means that you have changed the file but have not committed it to your database yet

  • Staged means that you have marked a modified file in its current version to go into your next commit snapshot

  • Committed means that the data is safely stored in your local database

This leads us to the three main sections of a Git project: the working tree, the staging area, and the Git directory.

Working tree, staging area, and .git directory


The working tree is a single checkout of one version of the project. These files are pulled out of the compressed database in the Git directory and placed on disk for you to use or modify.

The staging area is a file, generally contained in your Git directory, that stores information about what will go into your next commit. Its technical name in Git parlance is the “index”, but the phrase “staging area” works just as well.

The Git directory is where Git stores the metadata and object database for your project. This is the most important part of Git, and it is what is copied when you clone a repository from another computer.
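To see where these three sections live on disk, here is a small sketch run from the top of a throwaway repository. The file name readme.txt is a placeholder, and the exact contents of the .git directory vary between Git versions.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q

# Working tree: ordinary files in the project directory.
echo "hello" > readme.txt

# Staging the file writes the index, a file inside the Git directory.
git add readme.txt

# Git directory: printed relative to the current directory.
git_dir=$(git rev-parse --git-dir)
echo "Git directory: $git_dir"
ls "$git_dir"

# Index contents: mode, object ID, stage number, and path of each entry.
git ls-files --stage
```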

The basic Git workflow goes something like this:

  1. You modify files in your working tree.

  2. You selectively stage just those changes you want to be part of your next commit, which adds only those changes to the staging area.

  3. You do a commit, which takes the files as they are in the staging area and stores that snapshot permanently to your Git directory.

If a particular version of a file is in the Git directory, it’s considered committed. If it has been modified and was added to the staging area, it is staged. And if it was changed since it was checked out but has not been staged, it is modified.
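The three states can be watched with git status. This is a minimal sketch in a throwaway repository; report.txt and the user identity are placeholders. In the short status format, the first column describes the staging area and the second column describes the working tree.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"   # placeholder identity

echo "draft" > report.txt
git add report.txt
git commit -q -m "Add report"

# 1. Modify the file in the working tree.
echo "more text" >> report.txt
modified=$(git status --short)
echo "modified:  '$modified'"

# 2. Stage the change.
git add report.txt
staged=$(git status --short)
echo "staged:    '$staged'"

# 3. Commit the staged snapshot.
git commit -q -m "Extend report"
committed=$(git status --short)
echo "committed: '$committed'"
```

A modified file shows an M in the second column, a staged file shows it in the first column, and once everything is committed the short status is empty.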

Getting Help

If you ever need help while using Git, there are three equivalent ways to get the comprehensive manual page (manpage) help for any of the Git commands:

git help <verb>
git <verb> --help
man git-<verb>

For example, you can get the manpage help for the git config command by running this:

git help config

These commands are nice because you can access them anywhere, even offline. If the manpages and this book aren’t enough and you need in-person help, you can try the #git or #github channel on the Freenode IRC server, which can be found at https://freenode.net. These channels are regularly filled with hundreds of people who are all very knowledgeable about Git and are often willing to help.

In addition, if you don’t need the full-blown manpage help, but just need a quick refresher on the available options for a Git command, you can ask for the more concise “help” output with the -h option, as in:

git add -h

Gory Details

BitKeeper Debacle

TL;DR version: Linus Torvalds took it upon himself (albeit reluctantly) to write Git. If you want more details then read on!

The following is an excerpt from the BitKeeper Wikipedia page:

BitKeeper was first mentioned as a solution to some of the growing pains that Linux was having in September 1998.[5] Early access betas were available in May 1999[6] and on May 4, 2000, the first public release of BitKeeper was made available.[7][8] BitMover used to provide access to the system for certain open-source or free-software projects, one of which was the source code of the Linux kernel. The license for the "community" version of BitKeeper had allowed for developers to use the tool at no cost for open source or free software projects, provided those developers did not participate in the development of a competing tool (such as Concurrent Versions System, GNU arch, Subversion or ClearCase) for the duration of their usage of BitKeeper plus one year. This restriction applied regardless of whether the competing tool was free or proprietary. This version of BitKeeper also required that certain meta-information about changes be stored on computer servers operated by BitMover, an addition that made it impossible for community version users to run projects of which BitMover was unaware.

The decision made in 2002 to use BitKeeper for Linux kernel development was a controversial one. Some, including GNU Project founder Richard Stallman, expressed concern about proprietary tools being used on a flagship free project. While project leader Linus Torvalds and other core developers adopted BitKeeper, several key developers (including Linux veteran Alan Cox) refused to do so, citing the BitMover license, and voicing concern that the project was ceding some control to a proprietary developer. To mitigate these concerns, BitMover added gateways which allowed limited interoperation between the Linux BitKeeper servers (maintained by BitMover) and developers using CVS and Subversion. Even after this addition, flamewars occasionally broke out on the Linux kernel mailing list, often involving key kernel developers and BitMover CEO Larry McVoy, who is also a Linux developer.[9]

In April 2005, BitMover announced that it would stop providing a version of BitKeeper free of charge to the community, giving as the reason the efforts of Andrew Tridgell, a developer employed by OSDL on an unrelated project, to develop a client which would show the metadata (data about revisions, possibly including differences between versions) instead of only the most recent version. Being able to see metadata and compare past versions is one of the core features of all version-control systems, but was not available to anyone without a commercial BitKeeper license, significantly inconveniencing most Linux kernel developers. Although BitMover decided to provide free commercial BitKeeper licenses to some kernel developers, it refused to give or sell licenses to anyone employed by OSDL, including Linus Torvalds and Andrew Morton, placing OSDL developers in the same position as other kernel developers. The Git project was launched with the intent of becoming the Linux kernel's source code management software, and was eventually adopted by Linux developers.

End of support for the "Free Use" version of BitKeeper was officially July 1, 2005, and users were required to switch to the commercial version or change version control system by then. Commercial users were also required not to produce any competing tools: In October 2005, McVoy contacted a customer using commercially licensed BitKeeper, demanding that an employee of the customer stop contributing to the Mercurial project, a GPL source management tool. Bryan O'Sullivan, the employee, responded, "To avoid any possible perception of conflict, I have volunteered to Larry that as long as I continue to use the commercial version of BitKeeper, I will not contribute to the development of Mercurial."[10]

Next Up: A Concrete Example

Next time we’ll take a look at Git in action with a genuine illustration!
