Git (software)

From Wikipedia, the free encyclopedia

Git

Developer:	Junio C Hamano
Latest release:	1.4.4.2 / December 4, 2006^[1]
OS:	POSIX
Use:	Revision control file system
License:	GPL
Website:	http://git.or.cz

Git is a revision control file system project begun by Linus Torvalds to manage the Linux kernel and now maintained by Junio Hamano. It is free software, released under the GNU General Public License version 2. Originally designed only as a low-level engine^[2] that others could use to write front ends such as Cogito or StGIT, the core Git project has since become a complete revision control system^[3] that is usable directly. It primarily targets Linux, but is fully usable on other Unix-like operating systems (such as BSD, Solaris and Darwin). Git has been made to work under MS Windows using Cygwin^[4], but it is noticeably slower there, because it relies heavily on file system features that are particularly fast on Linux^[5]^[6].

1 Unique characteristics
2 Early history
3 Implementation
4 Using Git
5 Related projects
6 Projects that use Git
7 See also
8 References
9 External links

Unique characteristics

Git's design is a synthesis of Torvalds' intimate knowledge of maintaining a large distributed development project, and of file system performance. Combined with his urgent need to produce a working system in short order, these factors led to the following characteristics:

Strong support for non-linear development. Git supports rapid and convenient branching and merging, and includes powerful tools for visualizing and navigating a non-linear development history. A core assumption in Git is that a change will be merged more often than it is written, as it is passed around various reviewers. Torvalds himself does the most merging and least direct editing, so he has made sure that it works well.
Distributed development. Like BitKeeper, SVK and Monotone, Git gives each developer a local copy of the entire development history, and changes are copied from one such repository to another. These changes are imported as additional development branches, and can be merged in the same way as a locally developed branch. Repositories can be easily published via HTTP, FTP, ssh, rsync, or a special git protocol.
Efficient handling of large projects. Git is very fast, and scales well even when working with large projects or large histories. It is commonly an order of magnitude faster than other revision control systems, and several orders of magnitude faster on some operations.^[citation needed]
Cryptographic authentication of history. The Git history is stored in such a way that the name of a particular revision (a "commit" in Git terms) depends upon the complete development history leading up to that commit. Once it is published, it is not possible to change the old versions without it being noticed. (Monotone also has this property.)
Toolkit design. Following the Unix tradition, Git is a series of primitive programs written in C, and a large number of shell scripts that provide convenient wrappers. It is easy to chain the components together to do other clever things.
Pluggable merge strategies. As part of its toolkit design, Git has a well-defined model of an incomplete merge and multiple algorithms for completing it. If all of them fail, Git tells the user that it is unable to complete the merge automatically and that manual editing is required. It is thus easy to experiment with new merge algorithms.
Garbage accumulates unless collected. Aborting operations or backing out changes will leave useless dangling objects in the database. These are generally a small fraction of the continuously growing history of wanted objects, but reclaiming the space using git-fsck-objects can be slow.
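
The cryptographic authentication property listed above can be illustrated with a minimal Python sketch. It does not reproduce Git's actual object format: the helper names `commit_name` and `build_history` and the text layout being hashed are assumptions made for illustration. The only point is that each commit name hashes the names of its parents, so changing an old revision changes the name of every later one.

```python
from __future__ import annotations

import hashlib


def commit_name(tree: str, parents: list[str], message: str) -> str:
    """Hypothetical helper: name a commit by hashing its content and its parents' names."""
    lines = [f"tree {tree}", *[f"parent {p}" for p in parents], "", message]
    return hashlib.sha1("\n".join(lines).encode("utf-8")).hexdigest()


def build_history(snapshots: list[tuple[str, str]]) -> list[str]:
    """Chain (tree, message) pairs into a linear history and return the commit names."""
    names: list[str] = []
    for tree, message in snapshots:
        parents = names[-1:]  # empty list for the very first commit
        names.append(commit_name(tree, parents, message))
    return names


original = build_history([("tree-a", "first"), ("tree-b", "second"), ("tree-c", "third")])
tampered = build_history([("tree-X", "first"), ("tree-b", "second"), ("tree-c", "third")])

# Compare the two histories commit by commit.
changed = [a != b for a, b in zip(original, tampered)]
```

In this sketch, altering only the first snapshot changes all three names, which is why a published history cannot be rewritten without the change being noticed by anyone who holds the old names.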

One property of Git that has led to considerable controversy is that it snapshots directory trees of files. The earliest systems for tracking versions of source code, SCCS and RCS, worked on individual files and emphasized the space savings to be gained from delta encoding the (mostly similar) versions. Later revision control systems maintained this notion of a file having an identity across multiple revisions of a project.

Git rejects this concept^[7] and does not explicitly record file revision relationships at any level below the source code tree. This has the consequence that:

Renames are handled implicitly rather than explicitly. A major complaint with CVS is that it uses the name of a file to identify its revision history, so moving or renaming a file is not possible without either interrupting its history, or renaming the history and thereby making the history inaccurate. Most post-CVS revision control systems solve this by giving a file a unique long-lived name (a sort of inode number) that survives renaming. Git does not record such an identifier, and this is claimed as an advantage^[8]^[9]. Source code files are sometimes split or merged as well as simply renamed^[10], and recording this as a simple rename would freeze an inaccurate description of what happened in the (immutable) history. Git addresses the issue by detecting renames while browsing the history of snapshots rather than recording them when making the snapshot^[11]. (Briefly, given a file in revision N, a file of the same name in revision N-1 is its default ancestor. When there is no like-named file in revision N-1, Git instead searches for a file that existed only in revision N-1 and is very similar to the new file.) The cost of this approach is more CPU-intensive work every time history is reviewed, and a number of options for adjusting the heuristics.
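
The rename heuristic described above can be sketched in Python as follows. This is an illustration under stated assumptions, not Git's implementation: the function name `detect_renames`, the use of `difflib.SequenceMatcher` as the similarity measure, and the 0.5 threshold are all choices made for this example.

```python
from __future__ import annotations

from difflib import SequenceMatcher


def detect_renames(old: dict[str, str], new: dict[str, str],
                   threshold: float = 0.5) -> dict[str, str]:
    """Map each path that exists only in `new` to the most similar path that exists only in `old`."""
    deleted = {path: text for path, text in old.items() if path not in new}
    added = {path: text for path, text in new.items() if path not in old}
    renames: dict[str, str] = {}
    for new_path, new_text in added.items():
        best_path, best_score = None, threshold
        for old_path, old_text in deleted.items():
            score = SequenceMatcher(None, old_text, new_text).ratio()
            if score >= best_score:
                best_path, best_score = old_path, score
        if best_path is not None:
            renames[new_path] = best_path
    return renames


# Revision N-1 and revision N as {path: content} snapshots (hypothetical files).
rev_prev = {
    "util.c": "int add(int a, int b)\n{\n    return a + b;\n}\n",
    "main.c": "int main(void) { return 0; }\n",
}
rev_next = {
    "math.c": "int add(int a, int b)\n{\n    return a + b;   /* sum */\n}\n",
    "main.c": "int main(void) { return 0; }\n",
}

renames = detect_renames(rev_prev, rev_next)
```

Because `main.c` has a like-named file in the previous revision, it is never considered; only `math.c`, which has no like-named ancestor, is compared against the files that disappeared.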

Additionally, people are sometimes upset by the storage model:

Periodic explicit object packing. Git stores each newly created object as a separate file. Although each object is individually compressed, this layout takes a great deal of space and is inefficient. This is solved by the use of "packs" that store a large number of objects in a single file (or network byte stream), delta-compressed among themselves. Packs are compressed using the heuristic that files with the same name are probably similar, but do not depend on it for correctness. Newly created objects (newly added history) are still stored singly, and periodic repacking is required to maintain space efficiency.

Early history

Git development began after many kernel developers were forced to give up access to the proprietary BitKeeper system (see "Zero-cost BitKeeper for Linux and other open source projects"). The ability to use BitKeeper as freeware had been withdrawn by the copyright holder Larry McVoy after he claimed Andrew Tridgell had reverse engineered the BitKeeper protocols in violation of the BitKeeper license. At Linux.Conf.Au 2005, Tridgell demonstrated during his keynote that the reverse engineering process he had used was simply to telnet to the appropriate port of a BitKeeper server and type "help".

The development of Git began on April 6, 2005,^[12] and proceeded very rapidly. The first merge of multiple branches was done on April 18, 2005,^[13] and two months later (June 16, 2005), the kernel 2.6.12 release^[14] was managed by Git.^[15]

Torvalds wanted a distributed system that he could use like BitKeeper, but none of the available free systems met his needs, particularly his performance needs. In an e-mail written on April 7, 2005, while he was working on the first prototype, he explained:^[16]

However, the SCMs I've looked at make this hard. One of the things (the main thing, in fact) I've been working at is to make that process really efficient. If it takes half a minute to apply a patch and remember the changeset boundary etc. (and quite frankly, that's fast for most SCMs around for a project the size of Linux), then a series of 250 emails (which is not unheard of at all when I sync with Andrew, for example) takes two hours. If one of the patches in the middle doesn't apply, things are bad bad bad.

Now, BK wasn't a speed deamon either (actually, compared to everything else, BK is a speed deamon, often by one or two orders of magnitude), and took about 10-15 seconds per email when I merged with Andrew. HOWEVER, with BK that wasn't as big of an issue, since the BK<->BK merges were so easy, so I never had the slow email merges with any of the other main developers. So a patch-application-based SCM "merger" actually would need to be faster than BK is. Which is really really really hard.

So I'm writing some scripts to try to track things a whole lot faster. Initial indications are that I should be able to do it almost as quickly as I can just apply the patch, but quite frankly, I'm at most half done, and if I hit a snag maybe that's not true at all. Anyway, the reason I can do it quickly is that my scripts will not be an SCM, they'll be a very specific "log Linus' state" kind of thing. That will make the linear patch merge a lot more time-efficient, and thus possible.

(If a patch apply takes three seconds, even a big series of patches is not a problem: if I get notified within a minute or two that it failed half-way, that's fine, I can then just fix it up manually. That's why latency is critical - if I'd have to do things effectively "offline", I'd by definition not be able to fix it up when problems happen).

Torvalds achieved his performance goals; on April 29, 2005, the nascent Git was benchmarked recording patches to the Linux kernel tree at the rate of 6.7 per second.^[17]

While strongly influenced by BitKeeper, Torvalds deliberately attempted to be different where possible, leading to a very novel design.^[18]

He developed the system until it was usable by technical users, then turned over maintenance on July 26, 2005 to Junio Hamano, a major contributor to the project.^[19] Hamano was responsible for the 1.0 release^[20] on December 21, 2005, and remains the maintainer to the present day.

Implementation

Like BitKeeper, Git does not use a centralized server. However, Git's primitives are not inherently an SCM system. Torvalds explains,^[21]

In many ways you can just see git as a filesystem — it's content-addressable, and it has a notion of versioning, but I really really designed it coming at the problem from the viewpoint of a filesystem person (hey, kernels is what I do), and I actually have absolutely zero interest in creating a traditional SCM system.

(Note that his opinion has changed since then.)^[22]

Git has two data structures: a mutable index that caches information about the working directory and the next revision to be committed, and an immutable, append-only object database containing four types of objects:

A blob object is the content of a file. Blob objects have no names, timestamps, or other metadata.
A tree object is the equivalent of a directory: it contains a list of filenames, each with some type bits and the name of a blob or tree object that is that file, symbolic link, or directory's contents. This object describes a snapshot of the source tree.
A commit object links tree objects together into a history. It contains the name of a tree object (of the top-level source directory), a timestamp, a log message, and the names of zero or more parent commit objects.
A tag object is a container that holds a reference to another object and can carry additional metadata related to that object. Most commonly it is used to store a digital signature of a commit object corresponding to a particular release of the data being tracked by Git.

The object database can hold any kind of object. An intermediate layer, the index, serves as the connection point between the object database and the working tree.
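
The relationship between blobs, trees and commits can be sketched with a toy object store in Python. The `ObjectStore` class, the `read_file` helper and the text serialization used here are assumptions made for illustration and are deliberately simpler than Git's real on-disk formats; the sketch only shows how a commit names a tree, a tree names blobs, and nothing is ever overwritten.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass, field


@dataclass
class ObjectStore:
    """Toy append-only store: each object is keyed by the SHA-1 of its type and body."""
    objects: dict[str, tuple[str, bytes]] = field(default_factory=dict)

    def put(self, kind: str, body: bytes) -> str:
        name = hashlib.sha1(kind.encode() + b"\0" + body).hexdigest()
        self.objects.setdefault(name, (kind, body))  # existing objects are never replaced
        return name

    def get(self, name: str) -> tuple[str, bytes]:
        return self.objects[name]


def read_file(store: ObjectStore, commit: str, filename: str) -> bytes:
    """Follow commit -> tree -> blob to recover the content of one file."""
    _, commit_body = store.get(commit)
    tree_name = commit_body.decode().splitlines()[0].split()[1]
    _, tree_body = store.get(tree_name)
    for line in tree_body.decode().splitlines():
        _mode, name, blob_name = line.split()
        if name == filename:
            return store.get(blob_name)[1]
    raise KeyError(filename)


store = ObjectStore()

# A blob is only content: no file name, no timestamp.
blob = store.put("blob", b"hello\n")

# A tree maps file names (with type bits) to the names of blobs or other trees.
tree = store.put("tree", f"100644 hello.txt {blob}\n".encode())

# A commit names a tree and zero or more parent commits, and carries a log message.
first = store.put("commit", f"tree {tree}\n\ninitial import\n".encode())
second = store.put("commit", f"tree {tree}\nparent {first}\n\nsecond commit\n".encode())
```

Calling `read_file(store, second, "hello.txt")` walks from the commit to its tree and then to the blob. Storing the same content twice yields the same name, so the store holds a single copy.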

Each object is identified by a SHA-1 hash of its contents. Git computes the hash, and uses this value for the object's name. The object is put into a directory matching the first two characters of its hash. The rest of the hash is used as the file name for that object.
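
The naming scheme can be reproduced in a few lines of Python. For a blob, Git hashes a short header (the object type and the content length, followed by a NUL byte) together with the content itself. The function names below and the relative `objects` directory are chosen for this sketch.

```python
import hashlib
from pathlib import PurePosixPath


def blob_name(content: bytes) -> str:
    """Compute the 40-character hexadecimal name of a blob with this content."""
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()


def loose_object_path(name: str) -> PurePosixPath:
    """Split a name into a two-character directory and a 38-character file name."""
    return PurePosixPath("objects") / name[:2] / name[2:]


name = blob_name(b"hello world\n")
path = loose_object_path(name)
```

Because the name depends only on the content, two files with identical content anywhere in the history share a single stored object.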

Git stores each revision of a file as a unique blob object. The relationships between the blobs can be found through examining the tree and commit objects. Newly added objects are stored in their entirety using zlib compression. This can consume a large amount of hard disk space quickly, so objects can be combined into packs, which use delta compression to save space, storing blobs as their changes relative to other blobs.
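
The difference between storing every revision whole and storing one revision as changes relative to another can be sketched in Python. This is not Git's pack format: the line-based `make_delta` and `apply_delta` helpers and their instruction tuples are assumptions made for this example, which uses `difflib` to find the shared regions.

```python
from __future__ import annotations

import zlib
from difflib import SequenceMatcher


def make_delta(base: list[bytes], target: list[bytes]) -> list[tuple]:
    """Encode `target` as 'copy lines from base' and 'insert literal lines' instructions."""
    delta: list[tuple] = []
    matcher = SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            delta.append(("copy", i1, i2))
        elif tag in ("replace", "insert"):
            delta.append(("insert", target[j1:j2]))
        # "delete" needs no instruction: those base lines are simply not copied
    return delta


def apply_delta(base: list[bytes], delta: list[tuple]) -> list[bytes]:
    """Rebuild the target lines from the base lines and a delta."""
    out: list[bytes] = []
    for op in delta:
        if op[0] == "copy":
            out.extend(base[op[1]:op[2]])
        else:
            out.extend(op[1])
    return out


# Two revisions of a 200-line file that differ in a single line.
v1 = [b"line %d of the file\n" % i for i in range(200)]
v2 = list(v1)
v2[100] = b"line 100 was edited\n"

# A newly added object is stored in its entirety, compressed with zlib.
stored_whole = zlib.compress(b"".join(v2))

# In a pack, the second revision can instead be described relative to the first.
delta = make_delta(v1, v2)
literal_bytes = sum(len(line) for op in delta if op[0] == "insert" for line in op[1])
```

Here the delta consists of two copy instructions and one inserted line, so only a few literal bytes need to be stored for the second revision instead of the whole file.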

Using Git

Git is quite easy to use. A selection of basic commands is given below (for a complete list, see the Git man pages):

git init-db - creates a new repository.
git add . - adds all files in the current directory to the list of files under Git revision control.
git status - shows which files in a working copy need to be committed.
git commit -a - commits all changes in a working copy into the repository.
git log - shows a listing of changes committed to the repository, starting from the most recent.
git view - provides graphical browsing of the repository history.

As a small but realistic example of Git's distributed development abilities, consider the following sequence of commands, which merges the linuxpps changes with a development version of Linux.

git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git linux

This obtains a copy of the complete development history of Torvalds's standard kernel, placing it in a newly created directory called "linux". As of August 2006, this will download and store about 145 megabytes of packed history. A further 330 megabytes will be required for the unpacked current source and object files.

git checkout -b pps v2.6.18-rc4

This will create and check out a new branch, named "pps", starting from the v2.6.18-rc4 version.

git pull git://git.enneenne.com/linuxpps master

This will communicate with the Git server on git.enneenne.com, where the linuxpps patches are developed, and download a copy of the new development on the "master" branch there that is not yet present in the local repository. It will then merge those changes with the current branch head. The output will be:

Unpacking 102 objects
 100% (102/102) done
Trying really trivial in-index merge...
fatal: Merge requires file-level merging
Nope.
Merging HEAD with 2f470aeb00eccdc18d4c72f7f1790dcc56863913
Merging: 
9f737633e6ee54fc174282d49b2559bd2208391d Linux v2.6.18-rc4 
2f470aeb00eccdc18d4c72f7f1790dcc56863913 Timestamps are now collected at the beginning
of "linuxpps_event()" function. 
found 1 common ancestor(s): 
a8bd60705aa17a998516837d9c1e503ad4cbd7fc Linux 2.6.17-rc5 
Auto-merging include/linux/netlink.h 
Auto-merging drivers/Kconfig 
Auto-merging drivers/Makefile 
Auto-merging drivers/serial/8250.c 

Merge e3c5d7d53a28f61fe6a83de333b46b202357e19e, made by recursive.
 drivers/Kconfig              |    2 
 drivers/Makefile             |    1 
 drivers/pps/Kconfig          |   43 ++++
 drivers/pps/Makefile         |    7 +
 drivers/pps/clients/Kconfig  |   23 ++
 drivers/pps/clients/Makefile |    6 +
 drivers/pps/clients/ktimer.c |  108 +++++++++
 drivers/pps/kapi.c           |  173 +++++++++++++++
 drivers/pps/pps.c            |  380 ++++++++++++++++++++++++++++++++
 drivers/pps/procfs.c         |  180 +++++++++++++++
 drivers/serial/8250.c        |   85 +++++++
 include/linux/netlink.h      |    1 
 include/linux/pps.h          |   93 ++++++++
 include/linux/timepps.h      |  498 ++++++++++++++++++++++++++++++++++++++++++
 14 files changed, 1597 insertions(+), 3 deletions(-)
 create mode 100644 drivers/pps/Kconfig
 create mode 100644 drivers/pps/Makefile
 create mode 100644 drivers/pps/clients/Kconfig
 create mode 100644 drivers/pps/clients/Makefile
 create mode 100644 drivers/pps/clients/ktimer.c
 create mode 100644 drivers/pps/kapi.c
 create mode 100644 drivers/pps/pps.c
 create mode 100644 drivers/pps/procfs.c
 create mode 100644 include/linux/pps.h
 create mode 100644 include/linux/timepps.h

First, the (relatively small) changes are downloaded and unpacked, then there is some chatter about the merge process, including the SHA-1 hashes of the objects being merged, and the files that required non-trivial merging. Finally, there is a summary of the changes that were made, and how large they are. The changes are automatically checked in, but can be easily reverted if they are not wanted. The "common ancestor" version mentioned (v2.6.17-rc5) is where linuxpps development began. After the merge, both Linus's changes between then and v2.6.18-rc4, and the linuxpps changes, are present.

What is noteworthy is that:

After the initial bulk download, communication for the merge is small and efficient.
No permission or assistance from the original Linux tree repository is required.
The merged version now has, not just the total change made in linuxpps development (as might be produced by a patch), but the complete development history of those changes.
It could have been done equivalently in the other order, starting with a copy of the linuxpps tree and then merging Linus's kernel.org tree. The only difference would have been which ancestor was listed as the "first parent" of the merge.
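
The "common ancestor" reported during the merge, and the symmetry between the two merge orders, can be sketched in Python on a tiny history graph. The commit labels other than the three version tags are hypothetical, and the `ancestors` and `common_ancestors` helpers are written for this illustration rather than taken from Git.

```python
from __future__ import annotations


def ancestors(parents: dict[str, list[str]], start: str) -> set[str]:
    """Return `start` and every commit reachable from it by following parent links."""
    seen: set[str] = set()
    stack = [start]
    while stack:
        commit = stack.pop()
        if commit not in seen:
            seen.add(commit)
            stack.extend(parents.get(commit, []))
    return seen


def common_ancestors(parents: dict[str, list[str]], a: str, b: str) -> set[str]:
    """Return the shared ancestors of a and b that are not ancestors of another shared ancestor."""
    shared = ancestors(parents, a) & ancestors(parents, b)
    redundant: set[str] = set()
    for commit in shared:
        for parent in parents.get(commit, []):
            redundant |= ancestors(parents, parent)
    return shared - redundant


# Each commit maps to its list of parents; the first entry is the "first parent".
history = {
    "v2.6.17-rc5": [],
    "linus-work": ["v2.6.17-rc5"],   # hypothetical intermediate commit
    "v2.6.18-rc4": ["linus-work"],
    "pps-work": ["v2.6.17-rc5"],     # hypothetical intermediate commit
    "pps-master": ["pps-work"],
}

base = common_ancestors(history, "v2.6.18-rc4", "pps-master")

# Merging in either order records the same two parents, in a different order.
merged_into_linus = {**history, "merge": ["v2.6.18-rc4", "pps-master"]}
merged_into_pps = {**history, "merge": ["pps-master", "v2.6.18-rc4"]}
```

Both merge orders make exactly the same set of commits reachable from the merge; only the first parent differs.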

Related projects

Projects built on top of Git

Cogito - Petr Baudiš maintains a set of scripts called Cogito (formerly git-pasky), a revision control system that uses Git as its backend.
StGIT - Stacked GIT is a Python application providing functionality similar to Quilt (i.e. pushing/popping patches to/from a stack) on top of Git, to manage patches until they get merged upstream.
pg (Patchy GIT) is a shell script wrapper around Git to help the user manage a set of patches to files. pg is somewhat like Quilt or StGIT, but has a slightly different feature set.
DarcsGit is an enhancement to Darcs enabling it to interact with Git repositories.
bzr git support plugin is a plugin for Bazaar to read Git trees. Though still in alpha stage, it provides enough support for bzrk visualisation.

Web interfaces

gitweb - a Perl implementation maintained by Kay Sievers, used at kernel.org.
wit - a Python implementation maintained by Christian Meder.
git-php - a PHP implementation by Zack Bartel.

History visualization

gitk is a simple Tcl/Tk GUI for browsing the history of Git repositories, distributed with Git.
QGit is a Qt GUI for browsing the history of Git repositories, similar to gitk.