This lesson aims to introduce you to using Git and GitHub for reproducible research. The first draft of these materials was a little intimidating and unclear about how to use these tools in practice. Git and GitHub are valuable tools; I want to show you how to use the tools, and you can decide whether you want to use them in your work.
Version control software is not strictly necessary for reproducible research. In the short-term it can be a lot of work to learn and to use. But, it has some pretty awesome long-term benefits. I want to ease the short-term pain of learning Git/GitHub and illustrate the long-term gain of adding Git/GitHub to your toolbox.
For the moment, think of Git like a camera. You can use Git to take snapshots of your code over time. This is valuable because it lets you:
At the same time, Git is like Adobe Photoshop. You can use Git to combine snapshots together. This is great for collaborating with other people.
It’s worth mentioning that Git works best with plain text files (code, markdown, csv, …). It’s less helpful when working with binary files like Microsoft Word or Excel documents.
GitHub is a web application that facilitates collaboration.
GitHub is a website that hosts Git repositories.
It provides a nice graphical user interface (GUI) for exploring Git repositories.
Source code on GitHub is real open source: anyone can study it and grab it.
GitHub provides issue tracking.
GitHub makes it easy to suggest changes to anyone (pull requests).
Open source
Open source means everyone can see my stupid mistakes.
Version control means everyone can see every stupid mistake I’ve ever made.
If you store your code on GitHub, everyone can see everything. They can even see everything that ever was.
I think this openness is a good thing. You may be shy about your code, but probably no one is looking. And if they are looking, that is actually a good thing.
A common hurdle when working with new software is installation! To help you clear this hurdle, I’ve copied installation instructions from The Carpentries. If these instructions don’t work for you, please let me know in the comments at the bottom of this page.
Run the installer and follow the steps below:
If your “HOME” environment variable is not set (or you don’t know what this is):
cmd
and press Enter)Type the following line into the command prompt window exactly as shown:
setx HOME "%USERPROFILE%"
Press Enter, you should see SUCCESS: Specified value was saved.
Quit command prompt by typing exit
then pressing Enter.
This will provide you with both Git and Git Bash (a command line interface).
For OS X 10.9 and higher, install Git for Mac
by downloading and running the most recent “mavericks” installer from this list.
Because this installer is not signed by the developer, you may have to
right click (control click) on the .pkg file, click Open, and click
Open on the pop up window.
After installing Git, there will not be anything in your /Applications
folder,
as Git is a command line program.
For older versions of OS X (10.5-10.8) use the
most recent available installer labelled “snow-leopard”
available here.
If Git is not already available on your machine you can try to
install it via your distro’s package manager. For Debian/Ubuntu run
sudo apt-get install git
and for Fedora run
sudo dnf install git
.
This lesson focuses on using Git at the command line. (Ideally, we would have gone through the Unix lesson and you’d be feeling good about using command line tools. In reality, we haven’t done that. I’ll try make things accessible and be responsive to any issues you might run into.)
These materials focus on using Git at the command line (rather than using a GUI) to help you understand how Git works.
You might find that a GUI works better for you than the command line.
I’m focusing on the command line because it’s the same for everyone, and it will help you understand what GUIs are doing under the hood.
Back in the Unix lesson, there should be a section discussing shells.
In computing, a shell is a user interface for access to an operating system’s services. In general, operating system shells use either a command-line interface (CLI) or graphical user interface (GUI), depending on a computer’s role and particular operation. It is named a shell because it is the outermost layer around the operating system kernel.
We are going to use Bash as our CLI shell. Bash is a commonly-used shell that gives you the power to do simple tasks more quickly.
On Windows, I suggest using Git Bash, which was installed when you installed Git.
The default shell in all versions of macOS is Bash, so no need to
install anything. You access Bash from the Terminal (found in
/Applications/Utilities/
).
On Unix-like operating systems, the default shell is usually Bash, but if your machine is set up differently you can run it by opening a terminal and typing bash. There is no need to install anything.
When you first install Git, it’s helpful to configure some global variables. For instance, tell Git your name and email address for it to use when you take snapshots of your code:
$ git config --global user.name "Jane Doe"
$ git config --global user.email "jdoe@hbs.edu"
Tell it to use nice colors when printing to the command line so it’s easier to read:
$ git config --global color.ui true
All of this information gets written to a file in your home directory
~/.gitconfig
.
If you don’t already have a GitHub account…
You will need an account at github.com for parts of this lesson. The free account for individuals should be fine for most users (you can always upgrade later if you want). It’s worth considering what personal information you’d like to reveal. For example, you may want to review these instructions for keeping your email address private.
The plan for this section is to show you how to take snapshots of your project in a linear fashion. Imagine your project is a mountain. As you make your way up the mountain it makes sense to place anchors along the way, that way if you ever lose your footing you’ll only fall to the last anchor not the bottom of the mountain!
Before you can take pictures of your code, you’ll need to initialize
your repository using git init
. Let’s create a new project folder on
your desktop and initialize it to be a git repository.
$ mkdir ~/Desktop/my-project
$ cd ~/Desktop/my-project
$ git init
Initialized empty Git repository in /Users/amarder/Desktop/my-project/.git/
Note that git init
creates a .git/
subdirectory in your current
working directory. This is where Git saves all your snapshots.
Taking a snapshot is a two-step process. We use git add
to stage
changes and git commit
save the snapshot in the hidden .git/
subdirectory.
Let’s do an example. In your project folder create a README.md
file.
# My Example Project
Let's see if Andrew has any cool tricks to show us...
After you’ve saved that file use git add
and git commit
to take a
snapshot.
$ git add README.md
$ git commit -m "Add README"
[master (root-commit) d950cb5] Add README
1 file changed, 3 insertions(+)
create mode 100644 README.md
Use git add
to tell Git that you want to start keeping track of this
file. This is called “staging”, or you say the file is “staged”. Use
git commit
to add the file to the history of the project. The -m
argument allows one to enter a message. Without -m
, git
will spawn
a text editor. Use a meaningful message. Message can have multiple
lines, but make 1st line like the subject of an email.1
Let’s create a Bash script to list all the files in your .git/
subdirectory, call it git_list.sh
:
find .git/ -type f
After you save your new script you can print the results into a new file:
$ ls
README.md git_list.sh
$ bash git_list.sh > files.txt
$ ls
README.md files.txt git_list.sh
You can use git add -A
to stage all changed files in this directory
and then use git commit
to save your second snapshot.
$ git add -A && git commit -m "Wrote a bash script"
[master 7b620ba] Wrote a bash script
2 files changed, 23 insertions(+)
create mode 100644 files.txt
create mode 100644 git_list.sh
As you work with Git, you’ll start to notice files that haven’t
been on your radar before. For instance, when you’re working
interactively in R, R will create an .Rhistory
file to record all
the commands you type at the console. I suggest that you omit files
like these from your snapshots. That way your photos will be easier to
read, there won’t be a bunch of superfluous files. Git offers a
mechanism to handle this. Create a .gitignore
file in your project
directory and Git will ignore any file that matches a pattern in that
file, here’s an example file:
.Rhistory
*.Rout
You can even set up a global ignore file using the git config
command.
$ git config --global core.excludesfile ~/.gitignore_global
The general rule for what files to include is to commit the source files, not the derived files (R code not images).
Like any good rule, this one is often broken. For a manuscript, I might include the pdf at major milestones (at submission, after revision, and upon acceptance), so that I don’t have to work as hard to reconstruct them.
Warning: Ignoring critical files is a common mistake.
So, you’re fully equipped to snapshot your project in a linear fashion. Here’s the workflow:
git add -A && git commit -m "<insert-message>"
There are a whole bunch of additional tools I want to introduce but are not mission-critical.
When you’ve edited files but haven’t taken a snapshot yet,
git status
will show you what files have changed.
git diff
will give you even more information showing what lines
have been added/removed to each file.
To list all your snapshots use git log
.
For files that are being tracked by git:
Use git rm
instead of just rm
. git rm
removes the file and
stages that change.
Use git mv
instead of just mv
. git mv
moves the file and
stages that change.
A general best practice: Use frequent, small commits.
So, your project is in good shape. You’ve been taking linear snapshots and everything’s working fine. You’re thinking about adding this new feature. It’s going to be very complicated, require a bunch of commits, and you’re not sure it’s going to work out. This is where branches can be helpful. So, let’s see how to branch and merge!
Use branches to test out new features without breaking code that
works. Use the git branch
command to create, list, or delete
branches. Use git checkout
to switch between branches. Below we (1)
create a branch, (2) list all branches, and (3) switch to the new
branch.
$ git branch my-experiment
$ git branch
* master
my-experiment
$ git checkout my-experiment
Switched to branch 'my-experiment'
Once you’re on the new branch feel free to experiment freely. Nothing
you do here will affect the master
branch until you merge it
back. Edit some files and create a new commit on the my-experiment
branch.
When you’re happy with the work, merge it back into your master
branch. You do this by (1) switching to the master
branch and (2)
merging the my-experiment
branch into your current branch (master
).
$ git checkout master
Switched to branch 'master'
$ git merge my-experiment
Updating 7b620ba..ead9d4c
Fast-forward
README.md | 2 ++
1 file changed, 2 insertions(+)
If you’re working alone you probably won’t see much value in using branches. When you’re working with others they become mission critical…
I have posted this lesson on GitHub because I want to collaborate with you to make these materials better. In this section, I will walk through the tools you’ll need so we can work together!
There are two different models for contributing to projects on GitHub:
In the fork and pull model, anyone can fork an existing repository and push changes to their personal fork without needing access to the source repository. The changes can be pulled into the source repository by the project maintainer. This model is popular with open source projects as it reduces the amount of friction for new contributors and allows people to work independently without upfront coordination.
In the shared repository model, collaborators are granted push access to a single shared repository and topic branches are created when changes need to be made. Pull requests are useful in this model as they initiate code review and general discussion about a set of changes before the changes are merged into the main development branch. This model is more prevalent with small teams and organizations collaborating on private projects.
Since I want this lesson to be open and allow anyone to suggest changes, I will be focusing on the fork and pull model.
One of the best features of GitHub is the ease with which you can suggest changes to others’ code, either via an Issue, or best of all via a Pull Request.
Here’s the process for submitting a pull request.
Go to the repository
Using your web browser, find the repository on GitHub you want to contribute to. Feel free to use this repository for practice.
Fork the repository
This is done by clicking the “Fork” button. This copies the repository to your GitHub account.
Clone the repository
On the command line use the git clone
command (be sure to
insert your GitHub username and the correct repository name).
$ git clone https://github.com/YOUR-USERNAME/YOUR-REPOSITORY
This copies your repository from GitHub to your local machine.
Create a new branch
To create a new branch and switch to it with one command use
git checkout -b
.
$ cd YOUR-REPOSITORY
$ git checkout -b BRANCH-NAME
Working on a new branch leaves the master
branch
unchanged. This will make it easier for you to make additional
contributions in the future.
Create new commit(s)
Make your desired edits and commit them.
$ git add -A && git commit -m "COMMIT-MESSAGE"
Push your changes to GitHub
Use git push
to push your new branch up to GitHub.
$ git push origin BRANCH-NAME
This updates your copy of the repository on GitHub.
Create pull request
Go to your GitHub repository
https://github.com/YOUR-USERNAME/YOUR-REPOSITORY
and click “New
pull request”. The final steps on GitHub are (hopefully) pretty
clear; let me know in the comments if you run into any
difficulties. After the pull request has been created, it will be
easy for the project maintainer to merge your modifications into
the master
branch of their repository.
Admittedly, this is a lot of work for you. The advantage of this approach is it’s very little work for the project maintainer, so they’ll be more likely to incorporate your work. Over time, you’ll get used to the process and it will be less work for you.
I want you to be a frequent collaborator on these materials. Ideally,
the master
branch will be changing often. The
instructions in the previous section are great for your first pull
request, but I want you to be set up to make many more pull requests
in the future. To do this, we’ll use the git remote
command to tell
Git there’s another repository you want to track.
git remote add rcs https://github.com/hbs-rcs/reproducible-research.git
This will add information to the .git/config
file.2
[remote "origin"]
url = https://github.com/your_username/reproducible-research.git
fetch = +refs/heads/*:refs/remotes/origin/*
[remote "rcs"]
url = https://github.com/hbs-rcs/reproducible-research.git
fetch = +refs/heads/*:refs/remotes/rcs/*
Once you’ve set your remote the process for submitting pull requests looks like this:
Update your master
branch to match mine:
git checkout master
git pull rcs master
Create a new branch based off of master
to work on:
git branch new-feature
git checkout new-feature
Make your changes and commit them (git add
, git commit
).
Push your branch to your GitHub repository:
git push origin new-feature
Create a pull request.
At some point in the future, you may be faced with a merge conflict. Here’s how they happen:
master
branch.master
branch because we made conflicting edits.
Here’s how to handle merge conflicts. Update your master
branch to
match mine.
git checkout master
git pull rcs master
Merge master
into the branch you’ve been working on.
git checkout new-feature
git merge master
You’ll see a message letting you know there’s a merge conflict.
Auto-merging README.md
CONFLICT (content): Merge conflict in README.md
Automatic merge failed; fix conflicts and then commit the result.
Inside the file you’ll see:
<<<<<<< HEAD
A line in my file.
=======
A line in my RCS's file.
>>>>>>> 031389f2cd2acde08e32f0beb084b2f7c3257fff
Edit the file as you see fit (keep your line, keep my line, come up
with an entirely new line). Then you’re ready for the typical process
(git add
, git commit
, git push origin new-feature
).
RStudio has great features for using Git and GitHub from within the IDE.
Check out the RStudio IDE Cheat Sheet.
The key thing is that a Project in RStudio is a directory (with an
RStudio configuration file, blah.Rproj
) your .git
folder will be
stored in this same directory.
Free accounts on github.com now get unlimited private repositories, but they are limited to 3 collaborators.
code.harvard.edu is great for teams at Harvard that want to keep their work private.
Personally, I want my editor to have Git integration. I don’t like having a separate GUI for dealing with Git.
If you are working with Git at the command line a lot, I have enjoyed Bash-it. Their git aliases save a lot of typing.
If people were going to adopt version control only for particular projects (e.g., those that were especially complicated, had many collaborators, etc.) is there any rule of thumb that you’d give people for when a project gets big enough that it really should be version controlled? I know you version control for all projects, but I think people will likely use some combination of version controlling (project dependent) and not.
I don’t have a clear-cut answer here. The advantage of focusing on small projects is that it’s a great opportunity to get familiar with Git. The advantage to focusing on larger projects is they’re likely to have longer histories so the benefits of working with version control will become more apparent.
Can you add commit statements to your code so that it commits as you initialize a session or at various points along the way in a program (not sure if that makes sense as written out)?
Yes, but this is dangerous. You want to be very explicit with Git. Tell it exactly when you want to take a snapshot of your code. If you start automating that process, you’ll likely end up with a less useful history.
Because people were generally very apprehensive about Git, I’m wondering whether it might help to divide up the sections by the very very basic commands and then the more advanced commands that might or might not come up in RCS work so they can focus on at least knowing the basics. People conveyed that they got especially confused when dealing with branching, merging, forking, etc.
Absolutely! I tried to do this by dividing commands between the Git and GitHub sections.
In the Git section, we cover status
, diff
, log
, add
, commit
,
mv
, and rm
. These commands are necessary for keeping a completely
linear history.
In the GitHub section, we cover push
, pull
, fetch
, and
merge
. These are necessary for collaborating with others. We also
cover branch
and checkout
in this section. But, maybe it makes
sense to cover branches separately because they’re freaking people
out.
I really like the idea of doing some hands on work during the session as you’ve added at the end. (confession: I am intimidated by forking, cloning, etc. over command line, but I’m sure we’ll all be more comfortable by the end of the session)
No reason to be intimidated. You’ve got this!
How do I delete a repo?
On your computer, if you delete the .git
subdirectory, the folder
will no longer be a Git repository.
On GitHub, go to the settings for the repository and head down to the “Danger Zone”.
Why is committing a two-step process? Sometimes you’ll be editing a whole bunch of files at once and you want to take a few different photos so the project history is easier to read. It’s a non-intuitive process but the flexibility is a plus for many Git users.
[return]Occasionally, I have had to edit this file to fix unwanted behavior. There are three different constructions for the url:
https://github.com/user/repo
git://github.com/user/repo
git@github.com:user/repo
With https://
, you’ll need to enter your GitHub login and password each
time. With git://
, you’ll have read-only access. With git@
, you
need to set up ssh (more work initially, but you’ll get write access
without having to enter your login and password).