Version Control System – Managing Your Projects

Note This part of the lecture note has been partially extracted and modified from Prof. Randy LeVeque’s class website on HPC.

In this class we will use git for

  • homework submission,
  • code project submission,
  • final coding project submission,
  • electronic file transfers needed for the course work between you and the instructor.

See below for more information on using git and the repositories required for this class. There are many other version control systems in use, such as CVS, Subversion, Mercurial, and Bazaar.

Version control systems were originally developed to aid in the development of large software projects with many authors working on inter-related pieces. The basic idea is that when you want to work on a file (one piece of the code), you check it out of a repository, make changes, and then check it back in when you’re satisfied. The repository keeps track of all changes (and who made them) and can restore any previous version of a single file or of the state of the whole project. It does not keep a full copy of every file ever checked in; instead it keeps track of the differences (diffs) between versions, so if you check in a version that has only one line changed from the previous version, only that change is recorded.
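To make the idea of a "difference between versions" concrete, here is a minimal sketch using the standard Unix diff tool on two versions of a small file. The file names and contents are made up for illustration, and real version control systems use their own internal storage formats rather than plain diff output.

```shell
# Two versions of a small file that differ in a single line
# (file names and contents are hypothetical).
workdir=$(mktemp -d)
printf 'alpha\nbeta\ngamma\n' > "$workdir/v1.txt"
printf 'alpha\nBETA\ngamma\n' > "$workdir/v2.txt"

# diff exits with status 1 when the files differ, so guard it with || true.
changes=$(diff "$workdir/v1.txt" "$workdir/v2.txt" || true)

# Only the changed line is reported, not the whole file.
echo "$changes"
```

The report mentions only line 2, which is the kind of compact change record the paragraph above describes.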

It sounds like a hassle to be checking files in and out, but there are a number of advantages to this system that make version control an extremely useful tool even for your own projects when you are the only one working on them. Once you get comfortable with it you may wonder how you ever lived without it.

Advantages

  • You can revert to a previous version of a file if you decide the changes you made are incorrect. You can also easily compare different versions to see what changes you made, e.g. where a bug was introduced.
  • If you use a computer program and some set of data to produce some results for a publication, you can check in exactly the code and data used. If you later want to modify the code or data to produce new results, as generally happens with computer programs, you still have access to the first version without having to archive a full copy of all files for every experiment you do. Working in this manner is crucial if you want to be able to reproduce earlier results later, as is often necessary if you need to tweak the plots to meet some journal’s specifications or if a reader of your paper wants to know exactly what parameter choices you made to get a certain set of results. This is an important aspect of doing ‘reproducible research’, as should be required in science. If nothing else, you can save yourself hours of headaches down the road trying to figure out how you got your own results.
  • If you work on more than one machine, e.g. a desktop and laptop, version control systems are one way to keep your projects synched up between machines.

Two Types of Version Control Systems: SVN vs. Git

Client-server systems (e.g., CVS, SVN)

The original version control systems all used a client-server model, in which there is one computer that contains “the repository” and everyone else checks code into and out of that repository.

Systems such as CVS and Subversion (svn) have this form. An important feature of these systems is that only the repository has the full history of all changes made.

Please see articles comparing svn and git, which give brief overviews of the two kinds of systems.

Distributed systems (e.g., Git)

Git, and other systems such as Mercurial and Bazaar, use a distributed system in which there is not necessarily a “master repository”. Any working copy contains the full history of changes made to this copy.

The best way to get a feel for how git works is to use it, for example by following the instructions in the next section.
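As a first hands-on check of the claim that every working copy carries the full history, the following sketch builds a tiny repository in a temporary directory, clones it, and compares the number of commits in each. All directory names, file names, and the user identity are hypothetical, and it assumes git is installed.

```shell
# Throwaway sandbox; nothing outside this temporary directory is touched.
sandbox=$(mktemp -d)
cd "$sandbox"

# Build a small repository with two commits.
git init -q original
cd original
git symbolic-ref HEAD refs/heads/master   # name the branch master whatever git's default is
git config user.name "Example User"
git config user.email user@example.com
echo "first line" > notes.txt
git add notes.txt
git commit -q -m "first commit"
echo "second line" >> notes.txt
git commit -q -am "second commit"
cd ..

# Clone it; the clone is a complete repository in its own right.
git clone -q original copy

original_count=$(git -C original rev-list --count HEAD)
copy_count=$(git -C copy rev-list --count HEAD)
echo "commits in original: $original_count"
echo "commits in copy:     $copy_count"
```

Both counts agree because the clone copied every commit, not just the latest files.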

Remark Please also go watch the following Youtube video tutorials and a cheat sheet on git:

Git for the Class using the Git server on SOE servers

Instructions for cloning the class git repository

Note This part of the lecture note has been partially extracted from Prof. Randy LeVeque’s class website on Git and has been modified slightly.

All of the materials for this class, including homework assignments, sample programs, and lecture notes (html and pdf), are controlled in a Git repository hosted on one of the SOE servers, located at riverdance.soe.ucsc.edu. See a short instruction on how to set up your own git repository on one of the SOE servers.

In addition to viewing the class materials and associated files via the link above, you can also view changesets, issues, update histories, etc. To obtain a copy of the class git repo, simply create a directory where you want your copy to reside, say, ams209 in your home directory, move to that directory, and then clone the repository as follows:

$ mkdir ams209
$ cd ams209
$ git clone yourSOEaccount@riverdance.soe.ucsc.edu:/soe/dongwook/GitRepos/teaching/2017-2018/ams209 ./

If the clone fails with the following message:

fatal: Authentication failed

then this means that you haven’t been invited to join as an AMS209 group member with access to the course repo. In this case, please send me your email (preferably your ucsc email, rather than your personal email) so that I can send you an invitation. You should use the same email when you create your own account on either SOE (see instruction) or Bitbucket (see Creating your own Bitbucket repository) for your own repo. You will need an SOE account too. If you don’t have one, please fill out the account request form, listing me as a sponsoring faculty.

There is no (white) space inside the repository address in the above git command line. At this point, it is assumed you have git installed on your OS. Otherwise, go visit download:git. The clone statement will download the entire contents of the class repository into the current directory (the trailing ./), i.e., into the ams209 directory you just created.

Keep your cloned git repo updated/synced with the course repo

The files in the class repository remotely hosted in the SOE git server will continuously get changed and updated as the quarter progresses with new notes, sample programs, and homework sets, etc. In order to bring these changes over to your cloned copy, all you need to do is

$ cd ams209
$ git fetch origin
$ git merge origin/master

The git fetch command instructs git to fetch any changes from origin, which points to the remote repository (e.g., SOE servers, bitbucket, or Github; riverdance SOE server in the current example) that you originally cloned from. In the merge command, origin/master refers to the master branch in this repository (which is the only branch that exists for this particular repository). This merges any changes retrieved into the files in your current working directory.

Remark You need to be online to run the git fetch origin command, which downloads all the up-to-date changes from the remote repository origin and records them in origin/master, without touching your local working branch master. Once you have done git fetch, your computer can be offline when you run git merge origin/master to integrate those downloaded changes into your local master branch.

The last two commands can be combined as:

$ git pull origin master

or simply:

$ git pull

because origin and master are the defaults.

Three terms appear above: origin, master, and origin/master. Let’s now give clear definitions of them:

  • origin: a remote repository that exists over the network (e.g., SOE servers, Bitbucket, Github)
  • master: a local branch (e.g., your local working branch after cloning from origin)
  • origin/master (or equivalently, remotes/origin/master): a remote-tracking branch, that is, a local copy of the branch named master on the remote named origin. It is updated when you fetch from (or push to) origin.
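The three names defined above can be watched in action in a sandbox. The sketch below is only an illustration under stated assumptions: a bare repository on local disk stands in for the remote server, and the directory names (server.git, instructor, student), file names, and identities are all made up. It shows that git fetch moves origin/master but leaves master alone until git merge is run.

```shell
# Throwaway sandbox; every path and name below is made up for illustration.
sandbox=$(mktemp -d)
cd "$sandbox"

# A bare repository on local disk stands in for the remote server.
git init -q --bare server.git
git -C server.git symbolic-ref HEAD refs/heads/master

# The "instructor" creates the first commit and publishes it.
git init -q instructor
cd instructor
git symbolic-ref HEAD refs/heads/master   # name the branch master whatever git's default is
git config user.name "Instructor"
git config user.email instructor@example.com
echo "lecture 1" > notes.txt
git add notes.txt
git commit -q -m "add lecture 1"
git remote add origin "$sandbox/server.git"
git push -q origin master
cd ..

# The "student" clones; in the clone, origin names server.git.
git clone -q server.git student

# The instructor publishes a second commit.
cd instructor
echo "lecture 2" >> notes.txt
git commit -q -am "add lecture 2"
git push -q origin master
cd ../student

before=$(git rev-parse master)
git fetch -q origin                           # updates origin/master only
local_after_fetch=$(git rev-parse master)
remote_after_fetch=$(git rev-parse origin/master)
git merge -q origin/master                    # now master catches up
local_after_merge=$(git rev-parse master)
student_notes=$(cat notes.txt)

echo "master before fetch:       $before"
echo "master after fetch:        $local_after_fetch"
echo "origin/master after fetch: $remote_after_fetch"
echo "master after merge:        $local_after_merge"
```

After the fetch, master still names the old commit while origin/master names the new one; only the merge brings the student's files up to date.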

A couple of frequently used examples are below:

  • The syntax to push (or pull) commits made on your local branch to (or from) a remote repo:

    $ git push <REMOTENAME> <BRANCHNAME>
    $ git pull <REMOTENAME> <BRANCHNAME>
    

    For example, to push (or pull) your local changes in your local master branch to (or from) the remote master branch in the origin repo:

    $ git push origin master (or simply, git push)
    $ git pull origin master (or simply, git pull)
    
  • The syntax to retrieve all the updates made to a remote repository (e.g., origin) without merging those changes into your own branch:

    $ git fetch <REMOTENAME>
    

    A default example is:

    $ git fetch origin (or simply, git fetch)
    
  • The syntax to merge your local changes with changes made by others:

    $ git merge <REMOTENAME>/<BRANCHNAME>
    

    A default example is:

    $ git merge origin/master
    
  • As we’ve seen already, the last two git commands combined together:

    $ git fetch origin
    $ git merge origin/master
    

    are equivalent to:

    $ git pull origin master (or simply git pull)
    

Remark To read more about origin, master, and origin/master, please read the following articles: article 1, article 2.

Creating your own Bitbucket repository

In addition to using the class repository, you can create your own repository either on one of the SOE servers or on Bitbucket. As the first option, if you wish to set up your own repo on the SOE servers, please follow the instructions here.

Let’s take a look at the second option and see how you can set up your git repo on an off-campus remote host such as Github or Bitbucket. In the rest of this section we use Bitbucket as our off-campus host. It is possible to use git for your own work without creating a repository on a hosted site (such as Github, Bitbucket, or SOE servers), but there are several reasons to create a remote repo:

  • You should learn how to use Bitbucket for more than just pulling changes.
  • You will use this repository to “submit” your solutions to homeworks. You will give the instructor and TA permission to clone your repository so that we can grade the homework (others will not be able to clone or view it unless you also give them permission).
  • It is recommended that after the class ends you continue to use your repository as a way to back up your important work on another computer (with all the benefits of version control too!). At that point, of course, you can change the permissions so the instructor and TA no longer have access.

Below are the instructions for creating your own repository. Note that this should be a private repository so nobody can view or clone it unless you grant permission.

Anyone can create a free private repository on Bitbucket. Note that you can also create an unlimited number of public repositories free at Bitbucket, which you might want to do for open source software projects, or for classes like this one.

Remark To make free open access repositories that can be viewed by anyone, Github is recommended, which allows an unlimited number of open repositories and is widely used for open source projects.

Remark Please take a look at an article comparing Bitbucket and GitHub

Remark A good graphical tutorial is available at tutorial 1, and tutorial 2.

Getting used to your own local git repo

We will clone your Bitbucket repository and check that testfile.txt has been created and modified as directed below. If you use one of the SOE servers to host your remote repository, please follow the instructions and jump to Step 9 below.

  1. On the machine you’re working on:

    $ git config --global user.name "Your Name"
    $ git config --global user.email you@example.com
    

    These will be used when you commit changes. If you don’t do this, you might get a warning message the first time you try to commit.

  2. Go to http://bitbucket.org/ and click on “Sign up now” if you don’t already have an account.

  3. Fill in the form, and make sure you remember your username and password.

  4. You should then be taken to your account. Click on “Create” next to “Repositories”.

  5. You should now see a form where you can specify the name of a repository and a description. The repository name need not be the same as your user name (a single user might have several repositories). For example, the class repository is named ams209-fall-2016, owned by user dongwook159. To avoid confusion, you should probably not name your repository ams209-fall-2016.

    You should stick to lower case letters and numbers in your repository name, e.g. ams209-ucsc or ams209-scicomp might be good choices. Upper case and special symbols such as underscore sometimes get modified by bitbucket and the repository name you try to paste into the homework submission form might not agree with what bitbucket expects.

    Don’t name your repository homework1 because you will be using the same repository for other homeworks later in the quarter.

  6. Make sure you click on “Private” at the bottom. Also turn “Issue tracking” and “Wiki” on if you wish to use these features.

  7. Click on “Create repository”.

  8. You should now see a page with instructions on how to clone your (currently empty) repository. In a Unix window, cd to the directory where you want your cloned copy to reside, and perform the clone by typing in the clone command shown. This will create a new directory with the same name as the repository.

  9. You should now be able to cd into the directory this created.

  10. The directory you are now in will appear empty if you simply do:

    $ ls
    

    But it will look slightly different if you try:

    $ ls -a
    ./  ../  .git/
    

    The -a option causes ls to list files starting with a dot, which are normally suppressed. See Basic Unix/Linux Commands for a discussion of ./ and ../. The directory .git is the directory that stores all the information about the contents of this directory and a complete history of every file and every change ever committed. You shouldn’t touch or modify the files in this directory because they are used by git to control versions, commit changes and their history, etc.

  11. Add a new file to your directory:

    $ cat > testfile.txt
    This is a new file
    with only two lines so far.
    ^D
    

    The Unix cat command simply redirects everything you type on the following lines into a file called testfile.txt. This goes on until you type a <ctrl>-d (the 4th line in the example above). After typing <ctrl>-d you should get the Unix prompt back. Alternatively, you could create the file testfile.txt using your favorite text editor (see Items for the Class).

  12. To see the status of your folder, type:

    $ git status -s
    

    The response should be:

    ?? testfile.txt
    

    The ?? means that this file is not under revision control. The -s flag results in this short status list. Leave it off for more information.

    To put the file under revision control, type:

    $ git add testfile.txt
    $ git status -s
    A  testfile.txt
    

    The A means it has been added. However, at this point we have not yet taken a snapshot of this version of the file. To do so, type:

    $ git commit -m "My first commit of a test file."
    

    The string following the -m is a comment about this commit that may help you in general remember why you committed new or changed files.

    You should get a response like:

    [master 31cb6ed] My first commit of a test file.
    1 file changed, 2 insertions(+)
    create mode 100644 testfile.txt
    

    We can now see the status of our directory via:

    $ git status
    # On branch master
    nothing to commit (working directory clean)
    

    Alternatively, you can check the status of a single file with:

    $ git status testfile.txt
    

    You can get a list of all the commits you have made (only one so far) using:

    $ git log
    
    commit 31cb6ed38310eed36f47d3d3aed769e03da540c9
    Author: dongwook159 <dlee79@ucsc.edu>
    Date:   Sun Sep 25 00:04:14 2016 -0700
    
    My first commit of a test file.
    

    The number 31cb6ed38310eed36f47d3d3aed769e03da540c9 above is the “name” of this commit and you can always get back to the state of your files as of this commit by using this number. You don’t have to remember it; you can use commands like git log to find it later.

    Yes, this is a number... it is a 40 digit hexadecimal number, meaning it is in base 16, so in addition to 0, 1, 2, ..., 9, there are 6 more digits a, b, c, d, e, f representing 10 through 15. This number is, for all practical purposes, unique among all commits you will ever do (or anyone has ever done, for that matter). It is computed from the state of all the files in this snapshot, together with commit metadata such as the author, date, and parent commit, using the SHA-1 cryptographic hash function, and is called a SHA-1 hash for short.
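The following sketch checks the description of commit names in a throwaway repository. It assumes git is installed and uses its default SHA-1 object format; the directory name and user identity are hypothetical. git rev-parse is used here to print the full and abbreviated names of the latest commit.

```shell
# Throwaway repository; directory name and identity are made up.
sandbox=$(mktemp -d)
cd "$sandbox"
git init -q demo
cd demo
git config user.name "Example User"
git config user.email user@example.com

printf 'This is a new file\nwith only two lines so far.\n' > testfile.txt
git add testfile.txt
git commit -q -m "My first commit of a test file."

full_id=$(git rev-parse HEAD)            # full name of the latest commit
short_id=$(git rev-parse --short HEAD)   # abbreviated form, as printed by git commit
resolved=$(git rev-parse "$short_id")    # a unique prefix resolves to the same commit

echo "full:  $full_id"
echo "short: $short_id"
echo "number of hex digits in the full name: ${#full_id}"
```

Your hash will differ from the one shown in the notes, since it depends on the author, the time of the commit, and the file contents.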

Modifying a file

Now let’s modify this file:

$ cat >> testfile.txt
Adding a third line
^D

Here the >> tells cat that we want to add on to the end of an existing file rather than creating a new one. (Or you can edit the file with your favorite editor and add this third line.)

Now try the following:

$ git status -s
 M testfile.txt

The M indicates this file has been modified relative to the most recent version that was committed.

To see what changes have been made, try:

$ git diff testfile.txt

This will produce something like:

diff --git a/testfile.txt b/testfile.txt
index d80ef00..fe42584 100644
--- a/testfile.txt
+++ b/testfile.txt
@@ -1,2 +1,3 @@
 This is a new file
 with only two lines so far.
+Adding a third line

The + in front of the last line shows that it was added. The two lines before it are printed to show the context. If the file were longer, git diff would only print a few lines around any change to indicate the context.

Now let’s try to commit this changed file:

$ git commit -m "added a third line to the test file"

This will fail! You should get a response like this:

# On branch master
# Changes not staged for commit:
#   (use "git add <file>..." to update what will be committed)
#   (use "git checkout -- <file>..." to discard changes in working
#   directory)
#
#   modified:   testfile.txt
#
no changes added to commit (use "git add" and/or "git commit -a")

git is saying that the file testfile.txt is modified but that no files have been staged for this commit.

If you are used to Mercurial, git has an extra level of complexity (but also flexibility): you can choose which modified files will be included in the next commit. Since we only have one file, there will not be a commit unless we add this to the index of files staged for the next commit:

$ git add testfile.txt

Note that the status is now:

$ git status -s
M  testfile.txt

This is different in a subtle way from what we saw before: The M is in the first column rather than the second, meaning it has been both modified and staged.

We can get more information if we leave off the -s flag:

$ git status

# On branch master
# Changes to be committed:
#   (use "git reset HEAD <file>..." to unstage)
#
#   modified:   testfile.txt
#

Now testfile.txt is in the index of files staged for the next commit.

Now we can do the commit:

$ git commit -m "added a third line to the test file"

[master 51918d7] added a third line to the test file
 1 file changed, 1 insertion(+)
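The modify, stage, commit cycle above, and the point that you choose which modified files go into the next commit, can be replayed in a throwaway repository. This is a sketch with made-up file names (a.txt, b.txt) and a hypothetical identity: two files are modified, only one is staged and committed, and the short status is captured at each step.

```shell
# Throwaway repository; file names and identity are made up.
sandbox=$(mktemp -d)
cd "$sandbox"
git init -q demo
cd demo
git config user.name "Example User"
git config user.email user@example.com

echo "one" > a.txt
echo "one" > b.txt
git add a.txt b.txt
git commit -q -m "add two files"

# Modify both files; neither change is staged yet.
echo "two" >> a.txt
echo "two" >> b.txt
status_modified=$(git status -s)

# Stage only a.txt: its M moves from the second column to the first.
git add a.txt
status_staged=$(git status -s)

# The commit takes only what was staged; b.txt stays modified.
git commit -q -m "update a.txt only"
status_after=$(git status -s)

echo "after editing:"; echo "$status_modified"
echo "after git add a.txt:"; echo "$status_staged"
echo "after commit:"; echo "$status_after"
```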

Try doing:

$ git log

or:

$ git log --graph

now and you should see something like:

commit 271bd14e5b8d68840e7e6481ad7e99e5708e50e7
Author: dongwook159 <dlee79@ucsc.edu>
Date:   Sun Sep 25 00:02:34 2016 -0700

    added a third line to the test file

commit 0c20925f98b5d76d0b973d25fdc78fd43941792e
Author: dongwook159 <dlee79@ucsc.edu>
Date:   Sun Sep 25 00:01:25 2016 -0700

    My first commit of a test file.

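If you want to revert your working directory back to the first snapshot, you could check out that commit by its hash (31cb6ed383 in the first git log example above; the hashes in your own repository will differ):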

$ git checkout  31cb6ed383
Note: checking out '31cb6ed383'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.

HEAD is now at 31cb6ed383... My first commit of a test file.

Take a look at the file; it should be back to the state with only two lines. You are now in a situation called a detached HEAD state. To learn more about it and how to fix the situation, take a look at the following articles:

Note that you don’t need the full SHA-1 hash code; the first few digits are enough to uniquely identify it.

You can go back to the most recent version with:

$ git checkout master
Switched to branch 'master'

We won’t discuss branches, but unless you create a new branch, the default name for your main branch is master and this checkout command just goes back to the most recent commit.
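The round trip described above (check out the first snapshot, look at the file, return to master) can be rehearsed in a sandbox. This sketch assumes git is installed; the directory name and identity are hypothetical, and the first commit's hash is captured in a variable rather than typed by hand.

```shell
# Throwaway repository; directory name and identity are made up.
sandbox=$(mktemp -d)
cd "$sandbox"
git init -q demo
cd demo
git symbolic-ref HEAD refs/heads/master   # name the branch master whatever git's default is
git config user.name "Example User"
git config user.email user@example.com

printf 'This is a new file\nwith only two lines so far.\n' > testfile.txt
git add testfile.txt
git commit -q -m "My first commit of a test file."
first=$(git rev-parse --short HEAD)       # abbreviated hash of the first snapshot

echo "Adding a third line" >> testfile.txt
git commit -q -am "added a third line to the test file"

# Go back to the first snapshot; HEAD is now detached.
git checkout -q "$first"
lines_old=$(grep -c '' testfile.txt)      # number of lines in the old version
git symbolic-ref -q HEAD > /dev/null || echo "HEAD is detached"

# Return to the tip of master.
git checkout -q master
lines_new=$(grep -c '' testfile.txt)

echo "lines at first snapshot: $lines_old"
echo "lines on master:         $lines_new"
```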

  1. So far you have been using git to keep track of changes in your own directory, on your computer. None of these changes have been seen by Bitbucket, so if someone else cloned your repository from there, they would not see testfile.txt.

    Now let’s push these changes back to the Bitbucket repository. First do:

    $ git status
    

    to make sure there are no changes that have not been committed. This should report that there is nothing to commit.

    Now do:

    $ git push -u origin master
    

    This will prompt for your Bitbucket password and should then print something indicating that it has uploaded these two commits to your bitbucket repository.

    Not only has it copied the one file over, it has added both changesets, so the entire history of your commits is now stored in the repository. If someone else clones the repository, they get the entire commit history and could revert to any previous version, for example.

    To push future commits to bitbucket, you should only need to do:

    $ git push
    

    and by default it will push your master branch (the only branch you have, probably) to origin, which is the shorthand name for the place you originally cloned the repository from. To see where this actually points to:

    $ git remote -v
    

    This lists all remotes. By default there is only one, the place you cloned the repository from. (Or none if you had created a new repository using git init rather than cloning an existing one.)

  2. Check that the file is in your Bitbucket repository: Go back to that web page for your repository and click on the “Source” tab at the top. It should display the files in your repository and show testfile.txt.

    Now click on the “Commits” tab at the top. It should show that you made two commits and display the comments you added with the -m flag with each commit.

    If you click on the hex-string for a commit, it will show the change set for this commit. What you should see is the file in its final state, with three lines. The third line should be highlighted in green, indicating that this line was added in this changeset. A line highlighted in red would indicate a line deleted in this changeset.
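The push step can be rehearsed offline before trying it against Bitbucket. In the sketch below a bare repository on local disk (hosted.git, a made-up name) stands in for the hosted repository, and the remote is added by hand with git remote add because the working repository was created with git init rather than by cloning. All names and the identity are hypothetical.

```shell
# Throwaway sandbox; hosted.git plays the role of the hosted repository.
sandbox=$(mktemp -d)
cd "$sandbox"
git init -q --bare hosted.git
git -C hosted.git symbolic-ref HEAD refs/heads/master

# A working repository with two commits.
git init -q work
cd work
git symbolic-ref HEAD refs/heads/master   # name the branch master whatever git's default is
git config user.name "Example User"
git config user.email user@example.com
printf 'This is a new file\nwith only two lines so far.\n' > testfile.txt
git add testfile.txt
git commit -q -m "My first commit of a test file."
echo "Adding a third line" >> testfile.txt
git commit -q -am "added a third line to the test file"

# Point origin at the stand-in remote and push, recording the upstream with -u.
git remote add origin "$sandbox/hosted.git"
git push -q -u origin master
git remote -v
upstream=$(git rev-parse --abbrev-ref "master@{upstream}")
cd ..

# Anyone cloning the hosted repository receives the whole history.
git clone -q hosted.git second-copy
cloned_commits=$(git -C second-copy rev-list --count HEAD)

echo "master pushes to: $upstream"
echo "commits visible in a fresh clone: $cloned_commits"
```

Because of the -u flag on the first push, later pushes from the work directory need only git push.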

Rolling back to a previous state

Let’s take a look at the case where you do not like the last change you made to your repo, and you want to roll your repo back to a previous state, say,

  • commit 1b82c21688befa80560807247594d73768d64f0a (the current, unsatisfactory revision) -> commit c27d1bdf0098efe59aa25f809a719ce4fa910fef (the previous revision you wish to roll back to)

In this case, there are two ways to roll back your repo to the previous state.

Firstly, if you do:

$ git reset --hard c27d1bdf0098

it will move your branch back to that commit and reset your local files to match, so both the local code and the local history return to the previous state, and any uncommitted changes are discarded. This might look OK, but if commit 1b82c21688befa80560807247594d73768d64f0a has already been pushed, a later push of the rewound branch will be rejected, and forcing it would cause trouble for anyone in your team who already has the newer history.

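By contrast, if you do:

$ git reset --soft c27d1bdf0098

it will move your branch back to the previous commit but leave your local files and the index untouched, so the changes from the later commit show up as staged changes. Your local history is still rewound, so pushing runs into the same problem as with --hard. To undo a commit that has already been pushed without rewriting history, use git revert 1b82c21688be instead: it adds a new commit that reverses the changes, which you can push to the public repo without causing any conflicts in histories among your project team members.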
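The effect of each rollback option is easiest to judge by trying it in a sandbox. The sketch below is an illustration with made-up file names and identity: it creates a good commit and a commit to undo, then applies git reset --soft, git reset --hard, and git revert in turn, restoring the starting point between experiments. Note that git revert is an additional command brought in here for comparison; it undoes a commit by adding a new one on top.

```shell
# Throwaway repository; file names and identity are made up.
sandbox=$(mktemp -d)
cd "$sandbox"
git init -q demo
cd demo
git config user.name "Example User"
git config user.email user@example.com

echo "good line" > notes.txt
git add notes.txt
git commit -q -m "good commit"
good=$(git rev-parse HEAD)
echo "bad line" >> notes.txt
git commit -q -am "commit to undo"
bad=$(git rev-parse HEAD)

# Experiment 1: --soft moves the branch back but keeps files and index.
git reset -q --soft "$good"
soft_head=$(git rev-parse HEAD)
soft_lines=$(grep -c '' notes.txt)      # the bad line is still in the file
soft_status=$(git status -s)            # ...and shows up as a staged change
git reset -q --hard "$bad"              # restore the starting point

# Experiment 2: --hard moves the branch back and resets the files too.
git reset -q --hard "$good"
hard_lines=$(grep -c '' notes.txt)
git reset -q --hard "$bad"              # restore the starting point

# Experiment 3: revert keeps the history and adds a commit that undoes $bad.
git revert --no-edit "$bad" > /dev/null
revert_lines=$(grep -c '' notes.txt)
revert_count=$(git rev-list --count HEAD)

echo "soft:   HEAD at good commit, file has $soft_lines lines, status: $soft_status"
echo "hard:   file has $hard_lines line"
echo "revert: file has $revert_lines line, history has $revert_count commits"
```

Only the revert leaves the undone commit in the branch history, which is why it can be pushed without forcing.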

In case you want to recover files that are deleted locally, you can do:

$ git ls-files -d | xargs git checkout --

Similarly, to recover modified files back to the previous states:

$ git ls-files -m | xargs git checkout --

See more examples at https://git-scm.com/docs/git-ls-files.

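Remark Wait a minute... what is the command xargs above??? It is one of the most powerful Linux commands, especially when combined with other commands: it reads items from its standard input and passes them as arguments to the command that follows it. Please take a look at an article for more.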
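Here is a small sandbox run of the two recovery pipelines. It is a sketch with made-up file names and identity. One detail worth knowing is that git ls-files -m also lists deleted files, so the deleted file is restored first and the modified list is taken afterwards.

```shell
# Throwaway repository; file names and identity are made up.
sandbox=$(mktemp -d)
cd "$sandbox"
git init -q demo
cd demo
git config user.name "Example User"
git config user.email user@example.com

echo "keep me" > a.txt
echo "keep me too" > b.txt
git add a.txt b.txt
git commit -q -m "add two files"

rm a.txt                        # deleted locally, still tracked by git
echo "scratch edit" >> b.txt    # modified locally

# Recover deleted files: ls-files -d prints their names, xargs hands them to checkout.
deleted=$(git ls-files -d)
git ls-files -d | xargs git checkout --

# Recover modified files the same way.
modified=$(git ls-files -m)
git ls-files -m | xargs git checkout --

restored_a=$(cat a.txt)
restored_b=$(cat b.txt)
final_status=$(git status -s)
echo "restored: $deleted $modified"
```

This plain form of the pipeline assumes file names without spaces or other unusual characters.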

In some cases, you may wish to forget about all your local changes and have git overwrite all of your local files. In general, if you have some changes in your local files that git sees as potential conflicts, git pull will not allow you to bring in the most recent updates committed to the repository by others. Git will give you errors such as:

error: Your local changes to the following files would be
overwritten by merge:

or:

error: The following untracked working tree files would be overwritten by merge:

In this case, if you don’t mind overwriting your local changes with whatever is available in the remote repository, you can do the following:

$ git fetch --all
$ git reset --hard origin/master

or you can combine the two in a single line command using &&:

$ git fetch --all && git reset --hard origin/master

Again, with this command all of your uncommitted local changes to tracked files will be lost because of the --hard option, and any local commits that haven’t been pushed will be dropped from your branch. So do this only if you know what you’re doing and trust the recent updates in the remote repo.
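Because this command discards work, it is worth trying once in a sandbox first. In the sketch below a bare repository on local disk stands in for the course server, and the directory names (server.git, author, mine), file names, and identities are made up. A local uncommitted edit makes git pull refuse, and the fetch plus hard reset then replaces the local state with the remote one.

```shell
# Throwaway sandbox; every path and name below is made up for illustration.
sandbox=$(mktemp -d)
cd "$sandbox"
git init -q --bare server.git
git -C server.git symbolic-ref HEAD refs/heads/master

# The "author" publishes a first version.
git init -q author
cd author
git symbolic-ref HEAD refs/heads/master   # name the branch master whatever git's default is
git config user.name "Author"
git config user.email author@example.com
echo "version 1" > notes.txt
git add notes.txt
git commit -q -m "version 1"
git remote add origin "$sandbox/server.git"
git push -q origin master
cd ..

# Your clone, then an official update published after you cloned.
git clone -q server.git mine
cd author
echo "official update" >> notes.txt
git commit -q -am "official update"
git push -q origin master
cd ../mine

# An uncommitted local edit to the same file blocks the pull.
echo "my local scribble" >> notes.txt
pull_result=ok
git pull -q origin master 2> /dev/null || pull_result=refused

# Throw the local edit away and take exactly what the remote has.
git fetch -q --all && git reset -q --hard origin/master
final_notes=$(cat notes.txt)
local_head=$(git rev-parse HEAD)
remote_head=$(git rev-parse origin/master)

echo "pull was: $pull_result"
echo "$final_notes"
```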

Understanding Git Workflows

Please read a very nice tutorial

Summary

The commands we discussed so far will give you a good start with git. As you get used to git you will learn that only a handful of git commands are needed in many cases. This is particularly true unless you work on a project with many other project members over the network. In our class it will primarily be only you who checks changes in to and out of your central repo hosted on Bitbucket. Another frequent usage will be to sync your local repo with the course repo on a regular basis.

In this simple project environment, you will most likely need to use the following commands:

$ git status
$ git add
$ git commit
$ git push
$ git pull

Quick Exercise

Consider that you are working on your git repo, and suppose you just created a new file, roster.txt:

$ touch roster.txt

After editing the file, you check the status of roster.txt:

$ git status -s ./

and you see:

?? roster.txt

As you keep working in this way, you find that there are a bunch of such newly created files with ?? marks, for instance:

?? roster.txt
?? roster1.txt
?? roster2.txt
?? roster3.txt
?? roster4.txt

At this point you can either add them to git by doing:

$ git add roster.txt roster1.txt roster2.txt roster3.txt roster4.txt

or delete them if you don’t need them anymore:

$ rm roster.txt roster1.txt roster2.txt roster3.txt roster4.txt

When doing this, you realize that you need to type (or copy) every file name one by one. Clearly this becomes a very tedious task if there are many such files.

  1. Can you come up with a quick way of doing this by using Linux commands?
  2. Can you make your own alias command by adding it to your .bash_profile or .bashrc?