Base R - Common Gotchas

Learning Objectives

  • Familiarize participants with R syntax
  • Get comfortable working at the console
  • Understand the concepts of variables and assignment
  • Gain exposure to functions
  • Understand the concepts of vectors and data types

R Syntax: An Example Script

Below is a script to demonstrate some of the syntax of the R language.

# Define a new function.
suggest <- function(n) {
    for (i in 1:n) {
        my_message <- "R is awesome!\n"
        cat(my_message)
    }
}

# Call that function.
suggest(n = 10)

# Create a new list variable.
person <- list(first_name = "John", last_name = "Chambers")

# Print the person's first name.
cat(person$first_name, "\n")

To run this code:

  1. Open RStudio.
  2. Navigate to “File > New File > R Script”.
  3. Copy and paste the code above into your untitled R Script.
  4. Press the “Source” button.

Let’s take a quick walk through the code:

  1. Define a new function named suggest that takes a single argument n. n is the number of times this function will suggest R is awesome!

  2. The line my_message <- "R is awesome!\n" creates a new text variable called my_message. That line uses the assignment operator <- to assign a value to an variable.

  3. Notice how the code is indented, this makes it easier for humans to read and understand the code structure.

  4. The line suggest(n = 5) calls the suggest function with an argument of n = 5. Although one can often use = for assignment, it is best practice to use <- for assignment and = when naming an argument to a function.

  5. Notice that elements in a list can be accessed using the $ operator. In this example, person$first_name equals “John”. Fun fact, John Chambers created R when he was at Bell Labs.

The Console

You can get output from R simply by typing math in the console:

12 / 7

We can also comment on what we’re doing:

# I am adding 3 and 5. R is fun!
3 + 5

What happens if we do that same command without the # sign in the front?

I am adding 3 and 5. R is fun?
3 + 5

Now R is trying to run that sentence as a command, and it doesn’t work.

What if we replace the hash # with a double quote "

" I am adding 3 and 5. R is fun?
3 + 5

Now we’re stuck over in the console. The + sign means that it’s still waiting for input, so we can’t type in a new command. To get out of this type Esc. This will work whenever you’re stuck with that + sign.

It’s great that R is a glorified caluculator, but obviously we want to do more interesting things. To do useful and interesting things, we need to assign values to variables. To create variables, we need to give it a name followed by the assignment operator <- and the value we want to give it…

Challenge

What is 22 divided by 7?

Variables and Assignment

For instance, instead of adding 3 + 5, we can assign those values to variables and then add them.

# Assign 3 to a.
a <- 3
# Assign 5 to b.
b <- 5

# What is a?
a
# What is b?
b

# Add a and b.
a + b

<- is the assignment operator. It assigns a value on the right to a variable name on the left. So, after executing x <- 3, the value of x is 3. The arrow can be read as 3 goes into x. You can also use = for assignments but not in all contexts so it is good practice to use <- for assignments. = should only be used to specify the values of arguments in functions.

In RStudio, typing Alt - (push Alt, the key next to your space bar at the same time as the - key) will write <- in a single keystroke.

When assigning a value to an variable, R does not print anything. You can force R to print the variable’s value by using parentheses:

(s <- "hi")

or by typing the variable’s name:

s <- "hi"
s

Challenge

To get comfortable with variables and assignment, let’s use R to do some unit conversions.

  1. Suppose you’re thinking about running a 5 kilometer race and you want to know how many miles that is. Let’s create a variable distance_km and assign it the value 5.

  2. Now create a new variable miles_per_km and assign it the value 0.621371 (there are 0.621371 miles in a kilometer).

  3. Given these two variables, create a third variable named distance_miles that contains the number of miles in the race.

  4. Inspecting your newest variable, you should see that a 5 kilometer race is a little longer than three miles.

  5. Now suppose you did really well in the 5k and you want to run a 10k. Assign the value 10 to the variable distance_km. What is stored in the distance_miles variable now?

Functions and Function Arguments

The other key feature of R are functions. R has many built-in functions. Some examples of these are mathematical functions, like sqrt and round. You can also get functions from libraries (which we’ll talk about in a bit), or even write your own.

Functions are “canned scripts” that automate something complicated or convenient or both. Many functions are predefined, or become available when using the function library (more on that later). A function usually gets one or more inputs called arguments. Functions often (but not always) return a value. A typical example would be the function sqrt. The argument (the input) must be a number, and the return value (the output) is the square root of that number. Executing a function (running it) is called calling the function. An example of a function call is:

a <- 2
sqrt(a)

Here, the value of a is given to the sqrt function, the sqrt function calculates the square root.

The return value of a function need not be numerical (like that of sqrt), and it also does not need to be a single item: it can be a set of things, or even a data set. We’ll see that when we read data files in to R.

Arguments can be anything, not only numbers or filenames, but also other variables. Exactly what each argument means differs per function, and must be looked up in the documentation. If an argument alters the way the function operates, such as whether to ignore missing values, such an argument is sometimes called an option.

Most functions can take several arguments, but many have so-called defaults. If you don’t specify such an argument when calling the function, the function itself will fall back on using the default. This is a standard value that the author of the function specified as being “good enough in standard cases”. An example would be what symbol to use in a plot. However, if you want something specific, simply change the argument yourself with a value of your choice.

Let’s try a function that can take multiple arguments round:

round(3.14159)

The return value from this function call is 3. That’s because the default is to round to the nearest whole number. If we want more digits we can see how to do that by getting information about the round function. We can use args(round) to see what arguments this function takes:

args(round)

Or we can look at the help file for this function using ?round or help(round):

?round
help(round)

We see that if we want a different number of digits, we can type digits = 2 or however many we want.

round(3.14159, digits = 2)

If you provide the arguments in the exact same order as they are defined you don’t have to name them:

round(3.14159, 2)

However, it’s usually not recommended practice because it’s a lot of remembering to do, and if you share your code with others that includes less known functions it makes your code difficult to read. It’s fine to not include the names of the arguments for basic functions like mean, min, and others.

Another advantage of naming arguments, is that the order doesn’t matter. This is useful when there start to be more arguments.

Challenge

What arguments does the abs() function take? What does it return? What is abs(-1)?

Naming Variables and Functions

Variables can be given any name such as x, current_temperature, or subject_id. You want your variable names to be explicit and not too long. They cannot start with a number (2x is not valid but x2 is). R is case sensitive (length_mb is different from length_MB). There are some names that cannot be used because they represent the names of fundamental functions in R (if, else, for, …). See here for a complete list of reserved keywords. In general, even if it’s allowed, it’s best to not use other function names (c, T, mean, data, df, weights, …). In doubt check the help to see if the name is already in use. It’s also best to avoid dots (.) within a variable name as in my.dataset. There are many functions in R with dots in their names for historical reasons, but because dots have a special meaning in R (for methods) and other programming languages, it’s best to avoid them. It is also recommended to use nouns for variable names, and verbs for function names. It’s important to be consistent in the styling of your code (where you put spaces, how you name variables, …). In R, two popular style guides are Hadley Wickham’s and Google’s.

Vectors and Data Types

A vector is the most common and basic data structure in R, and is pretty much the workhorse of R. It’s basically just a list of values, mainly either numbers or characters. They’re special lists that you can do math with. You can assign this list of values to a variable, just like you would for one item. For example we can create a vector of race lengths:

race_lengths <- c(1, 5, 10)
race_lengths

A vector can also contain characters:

cities <- c("Boston", "Chicago", "San Francisco")
cities

There are many functions that allow you to inspect the content of a vector. length tells you how many elements are in a particular vector:

length(race_lengths)
length(cities)

You can also do math with whole vectors. For instance if we wanted to convert the lengths of all the races from kilometers to miles, we can do:

race_lengths * 0.621371

or we can add the data in the two vectors together:

round_trips <- race_lengths + race_lengths
round_trips

This is very useful if we have data in different vectors that we want to combine or work with.

There are few ways to figure out what’s going on in a vector.

  1. class indicates the class (the type of element) of an variable:
class(race_lengths)
class(cities)
  1. The function str provides an overview of the variable and the elements it contains. It is a really useful function when working with large and complex variables:
str(race_lengths)
str(cities)

You can add elements to your vector by using the c function (c is for combine):

lengths <- c(race_lengths, 1) # adding at the end
lengths <- c(2, lengths)      # adding at the beginning
lengths

What happens here is that we take the original vector race_lengths, and we are adding another item first to the end of the other ones, and then another item at the beginning. We can do this over and over again to build a vector or a dataset.

We just saw 2 of the 6 data types that R uses: "character" and "numeric". The other 4 data types are:

  • "logical" for TRUE and FALSE (the boolean data type)
  • "integer" for integer numbers (e.g., 2L, the L indicates to R that it’s an integer)
  • "complex" to represent complex numbers with real and imaginary parts (e.g., 1+4i) and that’s all we’re going to say about them
  • "raw" that we won’t discuss further

Vectors are one of the many data structures that R uses. Other important ones are lists (list), matrices (matrix), data frames (data.frame), and factors (factor).

Challenge

Create a vector with five numbers.

Use the mean() function to find the mean of those numbers.

What is NA in R? Hint use help(NA).

Create a vector with two numbers and one NA value. What is the mean of this vector?

Sequences

The colon character : is a special operator that creates numeric vectors in increasing or decreasing order. Try it out:

1:10
10:1

The function seq() can be used to create more complicate sequences of numbers:

seq(1, 10, by=2)
seq(5, 10, length.out=3)       # equal breaks of sequence into vector length = length.out
seq(50, by=5, length.out=10)   # sequence 50 by 5 until you hit vector length = length.out
seq(1, 8, by=3)                # sequence by 3 until you hit 8

The : operator and seq() function are useful in many contexts, particularly when indexing vectors and data frames.

Indexing

Although we have created data frames in R, we don’t really know how to interact with these objects. For instance, how do we subset a data frame, selecting specific columns or rows? This section introduces how one selects specific data out of a vector and then how to do the same for a data frame.

Vectors

To illustrate how to extract one or several values from a vector, let’s create a new vector.

x <- c("h", "e", "l", "l", "o")

x is a character vector that has five elements. Like working with vectors in math, to select a specific element of the vector we use square brackets. What is the second element of x?

x[2]

What happens when we select elements at multiple indices?

x[c(3, 2)]
x[2:4]
x[c(3, 2, 2:4)]

R indexes start at 1. Programming languages like Fortran, MATLAB, and R start counting at 1, because that’s what human beings typically do. Languages in the C family (including C++, Java, Perl, and Python) start counting at 0.

Data Frames

A data frame has rows and columns (it has 2 dimensions). If we want to extract some specific data, we need to specify the “coordinates” we want from it. Row numbers come first, followed by column numbers (i.e. [row, column]).

load('continents.RDA', verbose = TRUE)

continents[1, 1]   # first element in the first column of the data frame
continents[1, 3]   # first element in the 3rd column
continents[1:3, 4] # first three elements in the 4th column
continents[3, ]    # the 3rd row
continents[, 4]    # the 4th column

first_three_rows <- continents[1:3, ]

Challenge

Calling the function nrow() on a data.frame returns the number of rows in that data frame. Use it, in conjuction with seq() to create a new data.frame called even_continents that includes every other row of the continents data frame starting at row 2 (2, 4, …).


For larger datasets, it can be tricky to remember the column number that corresponds to a particular variable. (Are population counts in column 3 or 4? oh, right… they are in column 3). In some cases, in which column the variable will be can change if the script you are using adds or removes columns. It’s therefore often better to use column names to refer to a particular variable, and it makes your code easier to read and your intentions clearer.

You can do operations on a particular column, by selecting it using the $ operator. In this case, the entire column is a vector. You can use names(continents) or colnames(continents) to remind yourself of the column names. For instance, to extract all the continent names from our dataset:

continents$continent

In some cases, you may want to select more than one column. You can do this using the square brackets. Suppose we wanted continent and percent_total_pop information:

continents[, c("continent", "percent_total_pop")]

You can even access columns by column name and select specific rows of interest. For example, if we wanted the continent and percent_total_pop of just rows 2 through 4, we could do:

continents[2:4, c("continent", "percent_total_pop")]

Factors

Factors are used to represent categorical data. Factors can be ordered or unordered and are an important class for statistical analysis and for plotting.

Factors are stored as integers, and have labels associated with these unique integers. While factors look (and often behave) like character vectors, they are actually integers under the hood, and you need to be careful when treating them like strings.

Once created, factors can only contain a pre-defined set values, known as levels. By default, R always sorts levels in alphabetical order. For instance, if you have a factor with 2 levels:

continent <- factor(c("Europe", "Americas", "Oceana", "Europe", "Africa", "Africa"))

R will assign 1 to the level Africa and 2 to the level Americas (because Af comes before Am, even though the first element in this vector is Europe). You can check this by using the function levels(), and check the number of levels using nlevels():

levels(continent)
nlevels(continent)

Sometimes, the order of the factors does not matter, other times you might want to specify the order because it is meaningful (e.g., “low”, “medium”, “high”) or it is required by particular type of analysis. Additionally, specifying the order of the levels allows to compare levels:

gdp <- factor(c("low", "high", "medium", "high", "low", "medium", "high"))
levels(gdp)
gdp <- factor(gdp, levels=c("low", "medium", "high"))
levels(gdp)
min(gdp) ## doesn't work
## Error in Summary.factor(structure(c(1L, 3L, 2L, 3L, 1L, 2L, 3L), .Label = c("low", : 'min' not meaningful for factors
gdp <- factor(gdp, levels=c("low", "medium", "high"), ordered=TRUE)
levels(gdp)
min(gdp) ## works!

In R’s memory, these factors are represented by numbers (1, 2, 3). They are better than using simple integer labels because factors are self describing: low, medium, and high is more descriptive than 1, 2, 3. Which is low? You wouldn’t be able to tell with just integer data. Factors have this information built in. It is particularly helpful when there are many levels.

Converting factors

If you need to convert a factor to a character vector, simply use as.character(x).

Converting a factor to a numeric vector is however a little trickier, and you have to go via a character vector. Compare:

f <- factor(c(1, 5, 10, 2))
as.numeric(f)               ## wrong! and there is no warning...
as.numeric(as.character(f)) ## works...
as.numeric(levels(f))[f]    ## The recommended way.

 

Challenge

The function table() tabulates observations and can be used to create bar plots quickly. For instance:

continent <- factor(c("Oceana", "Americas", "Asia", "Africa", "Europe", "Europe",
                   "Asia", "Africa", "Europe"))
table(continent)
barplot(table(continent))

  1. How can you recreate this plot but by having “Europe” and “Americas” being listed as the first two?
  1. We’ve read in the gapminder data by default which includes the categorical data as factors. But this doesn’t make sense for part of the data. Using what you’ve learned in this lesson and the previous one, use your R knowledge to create the gapminder dataframe such that only continents is represented as a factor, and order it such that Europe and Americas are the first two continents.

Learning Objectives

  • Understand the concept of a data.frame
  • Learn how to create new data.frame objects
  • Use sequences to index data
  • Know how to access any element of a data.frame

What is a Data Frame?

The data.frame class of objects is the de facto data structure for most tabular data, we use data frames for statistics and plotting.

A data.frame object is a collection of vectors of identical lengths. Each vector represents a column, and each vector can be of a different data type (characters, integers, factors, …). The str() function is useful to inspect the data types of the columns.

Creating a Data Frame

In this section, we create a data.frame object by hand to see what goes into making a data.frame and to get an idea of what data frames look like. Then we create another data frame by reading data from a file.