Versioning your Data and Scripts



Setting up for today’s class

For this quick hands-on session we will be using a Graphical User Interface (GUI) to work with Git. Let’s start by:

  1. Downloading and installing GitKraken.
  2. Creating an account for yourself on GitHub. Please select the free/academic account, as this option has more long-term flexibility.
  3. Downloading the zipped folder of workshop sample files and unzipping it.

What is Version Control?

Version control keeps track of versions of a piece of work, whether a single person is working on it or it is a shared document. It is designed to avoid a situation like the one below.

mydocument.txt
mydocument_v2.txt
mydocument_v3_rev-BHP.txt
mydocument_v8_Final?.txt

Some tools let us deal with this a bit better without creating a new file for every “save”, such as Microsoft Word’s “Track Changes” or the “version history” features in Dropbox and Google Docs.

Version control systems start with a base version of the document and then save just the changes you made at each step of the way by taking a so-called “snapshot”. A snapshot records when the document was saved and all the changes between the current document and the previous version. You decide when these snapshots are taken, which allows you to ‘rewind’ your file to an older version.
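The snapshot idea described above can be sketched in a few lines of Python. This is only an illustration of the concept, not how Git or GitKraken store data internally; the names `Snapshot`, `take_snapshot`, `apply_snapshot` and `rebuild` are hypothetical and were made up for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from difflib import SequenceMatcher


@dataclass
class Snapshot:
    """One saved step: when it was taken, a message, and only the changed lines."""
    taken_at: datetime
    message: str
    changes: list  # (start, end, replacement_lines) relative to the previous version


def take_snapshot(previous, current, message):
    """Record only the line ranges that differ between two versions."""
    matcher = SequenceMatcher(a=previous, b=current, autojunk=False)
    changes = [
        (i1, i2, current[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
    return Snapshot(datetime.now(timezone.utc), message, changes)


def apply_snapshot(previous, snapshot):
    """Rebuild the next version from the previous one plus the recorded changes."""
    result = list(previous)
    # Apply from the bottom up so earlier line positions stay valid.
    for start, end, replacement in reversed(snapshot.changes):
        result[start:end] = replacement
    return result


def rebuild(base, history, upto):
    """Replay the first `upto` snapshots on top of the base version."""
    version = list(base)
    for snapshot in history[:upto]:
        version = apply_snapshot(version, snapshot)
    return version


# Example: a base document and two later saves.
base = ["My Document", "First draft of the text."]
v1 = ["My Document", "Second draft of the text.", "A new paragraph."]
v2 = ["My Better Document", "Second draft of the text.", "A new paragraph."]

history = [
    take_snapshot(base, v1, "Revise text and add paragraph"),
    take_snapshot(v1, v2, "Change title"),
]

latest = rebuild(base, history, upto=2)    # the current version
rewound = rebuild(base, history, upto=1)   # 'rewind' to the first snapshot
```

Replaying fewer snapshots is what "rewinding" means here: the base version plus the first snapshot gives back the document as it was at that point.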

For example, two users can make independent sets of changes based on the same document and have two separate snapshots documenting the changes.

If there aren’t conflicts (i.e., updates to the same line), the two sets of changes can be “merged” back into the same base document.
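To make the idea of merging concrete, here is a minimal Python sketch of combining two users' edits to the same base document. It assumes, for simplicity, that both users only edit lines in place (no lines added or removed); real tools such as Git handle far more cases. The function name `merge_lines` is hypothetical.

```python
def merge_lines(base, ours, theirs):
    """Merge two edited copies of `base`, line by line.

    Returns (merged_lines, conflicts), where `conflicts` lists the
    1-based line numbers that both users changed in different ways.
    """
    if not (len(base) == len(ours) == len(theirs)):
        raise ValueError("this sketch only handles in-place line edits")

    merged, conflicts = [], []
    for number, (b, o, t) in enumerate(zip(base, ours, theirs), start=1):
        if o == t:
            merged.append(o)        # both agree (unchanged, or the same edit)
        elif o == b:
            merged.append(t)        # only the second user changed this line
        elif t == b:
            merged.append(o)        # only the first user changed this line
        else:
            merged.append(b)        # both changed it differently: keep base, flag it
            conflicts.append(number)
    return merged, conflicts


base = ["Introduction", "Methods", "Results"]
user_a = ["Introduction (revised)", "Methods", "Results"]
user_b = ["Introduction", "Methods", "Results and Discussion"]

merged, conflicts = merge_lines(base, user_a, user_b)
```

Because the two users touched different lines, `conflicts` is empty and `merged` contains both edits. If both had rewritten line 1 differently, that line number would be reported so a person could decide which version to keep.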

Version Control Systems and Hosts

There are many different version control systems available. These systems enable you to track changes locally or remotely (which makes collaboration easier), and there are hosts available for remote management of your “repositories”.

In this class we will be focusing on Git. Git is usually used for version control on a local computer, and you do not need internet access to use it (although you will need it to download Git). A local setup with Git (or another version control system) can be connected to an online host that stores repositories for sharing and collaboration.

GitHub is currently the most popular host of open source projects by number of projects and number of users. But other hosts exist, including SourceForge, Bitbucket, and GitLab, to name a few.

Why use Version Control?

The two main reasons to use version control are to:

Though version control was originally designed for managing code in large collaborative projects, there are many benefits to using it in other projects with text files too (.txt, .csv, .tsv). Examples of projects that use version control systems like Git include writing manuscripts, books, or dissertations, and collaboratively developing and distributing teaching materials (e.g. the GitHub repository for this class).

Note: Version control systems differ in how they handle non-text files. In most cases Word documents, graphics files, data objects from R or Stata, etc., can be included, but most tools have limited capabilities for saving version information for these.

Why Not use Dropbox or Google Drive?

Dropbox, Google Drive, and other services offer some form of version control in their systems. There are times when this may be sufficient for your needs. However, there are a number of advantages to using a version control system like Git, e.g. facilitating sharing, reproducibility, and collaboration. Benefits of collaborating with version control include:

