Learning Objectives

Upon completion, participants will be able to:
  1. Describe R and RStudio.
  2. Load data frames in R.
  3. Utilize the building blocks of the R language.
  4. Extract parts of a dataset.
  5. Reshape wide data into tidy data in R.
  6. Separate columns using delimiters in R.
  7. Write, save, and run an R script.

About R

R is a free programming language and software environment for statistical computing and graphics. RStudio is an IDE (Integrated Development Environment) that works on top of R to provide further functionality. You can use R without RStudio, but RStudio makes working in R more effective.

Why R?

  • Statistics, graphics, general purpose programming
  • Thousands of packages available - these are collections of functions that implement all kinds of analyses of data sets from various disciplines
  • Open source software - use for free, modify as you wish
  • Active community of developers - new packages are being written as we speak

About RStudio

RStudio is an interface for R, and we will use it to work with R throughout this workshop. It is a piece of software known as an integrated development environment (IDE) that makes working in R much easier. When you open RStudio, you should see 4 windows:

The 4 windows of RStudio

  1. Top left: text editor. Write your commands/code here, and you can save your work as a script.
  2. Bottom left: R console, run (execute) your commands here.
  3. Top right: consists of things R keeps track of in the History and Environment tabs.
    • The History tab records all the commands you type in the R console.
    • The Environment tab keeps track of all the objects you create in the current session.
    Both records can be saved for later as .Rhistory and .RData files.
  4. Bottom right: Consists of several helpful tabs. For now, notice that the Files tab allows you to navigate between folders.

For this workshop, we need to navigate to the data folder within the UWI_Mona folder that we created on the Desktop.

Click the three dots icon at the top right corner of the Files tab and navigate to Desktop/UWI_Mona/data. When there, click More and select Set as working directory.

Notice the change in your console window: your work with R is now done from the data folder.

Executing Commands

When you type commands in the console window (bottom left) and press Enter on your keyboard, they are executed immediately and the output is displayed.

3 + 5
12/7

We can also comment on what we’re doing:

# I'm adding 3 and 5. R is fun!
3+5

What happens if we type that same command without the # sign in the front?

I'm adding 3 and 5. R is fun!
3+5

Now R tries to run that sentence as a command, and it doesn’t work. Worse, we’re stuck in the console: the apostrophe in I'm opens a quote that is never closed, so R shows a + sign, which means it is still waiting for the rest of the command and won’t accept a new one. To get out of this, press the Esc key. This works whenever you’re stuck with the + sign. The > symbol means that R is ready for the next command.

Now, let's try something other than mathematical commands.

print("How are you?")
print("Good morning")
print(64)

NOTE: Use quotes around characters, but not around numbers, unless you want them to be seen as characters.
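
A small sketch of the difference between quoted and unquoted values. It uses the base R function class(), which is not covered elsewhere in this lesson, only to show how R sees each value:

```r
# quoted values are stored as characters, unquoted numbers as numerics
class("64")   # "character"
class(64)     # "numeric"

# arithmetic works on the number...
64 + 1

# ...but not on the quoted version; uncommenting the next line gives an error
# "64" + 1
```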

Notice the commands that are followed by parentheses. These are called functions. Functions typically take an argument inside the parentheses.

Using R

Assigning to Variables

One of the main concepts of any programming language is the notion of a variable. Variables are created to store values for future use. For example, if you want the value of 3+5, you have to run the command every time you need it. Instead, you can store the value in a variable for future use. A variable with a value becomes an object.

To create a variable in R, use the <- assignment operator (Alt + dash in RStudio):


# variable name that stores the character value "Jane"
name <- "Jane"
print(name)

# variable price that stores the value 3.99
price <- 3.99
print(price)

# working with the environment
# Remember what the Environment tab does? It stores the objects you created. How do you see that list on screen?

# list all objects in your environment
ls()

# how do you remove an object?
#rm(objectName)
rm(price)

#remove all objects, clear environment
rm(list=ls()) 
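
To see why storing a value is useful, here is a minimal sketch. The variable name total is only an illustration and is not used elsewhere in the lesson:

```r
# store the result once...
total <- 3 + 5

# ...then reuse it as often as needed without retyping 3 + 5
total * 2
total - 1

# the new object is now part of the environment listing
"total" %in% ls()
```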

Scripting

It is often the case that we would like to reuse our commands. If so, you can type commands in the text editor window.

Let's open a new R Script file by clicking the + sign on the top right corner of the scripting panel and write our command there, for example print("Good morning!").

To check whether this command works, send it to the console for execution by pressing Ctrl+Enter (Command+Enter on a Mac). If you want to keep writing without executing the command, press Enter to move to the next line in your script.

We have a very simple example here, but you can imagine writing hundreds of commands in the order you want them to be executed to accomplish a certain task. This is what an R program or R script is.

It is a good practice to comment your code. The comments (statements that are helpful to the user, but are not seen by R as commands to be executed) in R start with #.

Let's add a comment to our simple script.

#my first R command 
print("Good morning")

We can save the file with our commands and reuse it later. For now, let's save our example as the R_commands.R file in the data folder.

Functions

In general, a function takes an input and transforms it according to the function's definition (rules). You can recognize functions in R by the presence of parentheses. The objects inside the parentheses are called the function's arguments.

#applying the square root function
mass <- 64         #mass is a variable
sqrt(mass)         #function call with mass as its argument
res <- sqrt(mass)  #variable that stores the function's result

In the example, the square root function takes the mass object as input and finds the square root of its value. The result is then assigned to the res variable. As a second example, getwd() is a function that outputs your current location within the file system. Although it takes no input (many functions do not require arguments), parentheses are still required for proper syntax in R.
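
The sketch below puts functions with zero, one, and two arguments side by side. The round() call is an extra illustration that does not appear elsewhere in this lesson:

```r
# no arguments, but the parentheses are still required
getwd()

# one argument
sqrt(64)

# two arguments; the second one is passed by name
round(3.14159, digits = 2)
```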

There are thousands of built-in functions in R. There are also help functions that you can use to find out what other functions do and how to use them. The help appears in the bottom-right window of RStudio.

# help functions
?plot
help(mean)

Reshaping Our Data in R

Let's work in scripting mode from now on so that you will have the record of all commands we used in this lesson.

Packages

The functionality of R is expanded by R packages that include functions not present in the default installation of R. When you need to use another package, do these 2 things:

# download the package to your machine - once per computer
install.packages("packagename")

#load package into your active session of R - once per session
library(packagename)

We will use the reshape2 and tidyr packages for this exercise because they were created for tidying data, which is what we need.

Question: How do we install and load the functions available from these packages?
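
One possible answer, as a sketch: it applies the two-step pattern shown above to both packages. It assumes an internet connection for the installation step.

```r
# once per computer (needs an internet connection)
install.packages(c("reshape2", "tidyr"))

# once per R session
library(reshape2)  # provides melt()
library(tidyr)     # provides separate()
```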

Data types and Data structures

Data Types

Single-element data structures are the smallest units in R. For example, if we assigned the value of 45 to a variable age by entering age<-45, we have just created the smallest object in R.

Variables can also hold values of various types. The most common data types are
  • numeric (double and integer)
  • character
  • logical
  • complex
Some useful functions to know about an object are the following:
length(age)
str(age)
is.integer(age)
typeof(age)
typeof(is.integer(age))
Question: What data type is stored in the score variable after entering score<-c(1,4,3)?

The last expression in the list above, typeof(is.integer(age)), is an example of a nested function call. Nested functions are very common in R, but can be difficult to read at first. You can always split a nested function into a series of single function calls. Remember that the variable inside the innermost parentheses is an argument (input) for the function that will be evaluated first.
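
As a sketch of that splitting, here is the nested call next to its step-by-step version. The intermediate name age_is_integer is made up for this illustration:

```r
age <- 45

# nested form: the inner call, is.integer(age), runs first
typeof(is.integer(age))

# the same steps, one function call at a time
age_is_integer <- is.integer(age)  # FALSE: 45 is stored as a double
typeof(age_is_integer)             # "logical"
```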

Sometimes you will need to convert between data types. There are functions that do that: as.integer(), as.character(), and so on. NOTE: The conversion between data types is not always possible. Think about converting between character and integer.
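
A few conversions to try, as a sketch. The last one shows a case where the conversion is not possible: R returns NA and prints a warning.

```r
as.character(45)          # "45"
as.integer("45")          # 45
as.integer(3.9)           # 3 (the decimal part is dropped)
as.integer("forty-five")  # NA, with a warning
```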

 

Data Structures

The small objects can be combined to build larger objects. Let's load the countries-BMI-1.0.csv dataset to explore it.

#check where you are
getwd()
    
#move to the correct folder, if needed
setwd("~/Desktop/UWI_Mona/data")
    
#load data
cdata <- read.csv("countries-BMI-1.0.csv", skip = 2, header = F)
    
#check the first few rows
head(cdata)
    
Our smallest objects can be used to represent a single element in the dataset, like individual year, or individual country, but what happens when you combine them? Here are common data structures:
  • Vectors: collection of elements of the same data type
  • Factors: special vectors used to represent categorical data
  • Lists: generic vectors - collection of elements with different data types
  • Matrices: 2-dimensional vectors - i.e., same data type
  • Data frames: 2-dimensional lists - i.e., collection of elements with different data types
Type str(cdata) to see info about your variable.
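
To make the list above concrete, here is a minimal sketch that builds one small object of each kind. All names and values are made up for illustration and are not taken from the workshop dataset:

```r
score  <- c(1, 4, 3)                         # vector: elements of one data type
sex    <- factor(c("Men", "Women", "Men"))   # factor: categorical data
person <- list(name = "Jane", age = 45)      # list: elements of different types
m      <- matrix(1:6, nrow = 2)              # matrix: 2 rows, 3 columns, one type
df     <- data.frame(country = c("Jamaica", "Cuba"),
                     bmi = c(24.5, 25.1))    # data frame: columns of different types

# inspect each object
str(score)
levels(sex)
str(person)
dim(m)
str(df)
```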

Data frames

Question: For our cdata variable, could you make an informed guess about what data structure this is in R?

Yes! It is a list of vectors of equal length (one per column), or a data frame.

Notice that a data frame is a list of vectors. So, using typeof(cdata) will return list. Data frames are extremely useful data structures as they represent table-like datasets.
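
The same point on a small made-up data frame (hypothetical values, not the workshop data), as a sketch:

```r
df <- data.frame(country = c("Jamaica", "Cuba", "Haiti"),
                 bmi = c(24.5, 25.1, 23.8))

typeof(df)          # "list": how R stores it
class(df)           # "data.frame": how R treats it
sapply(df, length)  # every column has the same length, 3
dim(df)             # 3 rows, 2 columns
```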

 

Tip:

dim(cdata)  # tells you the number of rows and columns in your data frame

Subsetting

Now let's talk about how to take your dataset apart, or subsetting. Subsetting is a very common practice in programming. In general, you can access every element of your data set. You must be able to do that to manipulate and analyze your data. There are many ways to subset the data, but we will use two below.

 

First Way to Subset
#indicate which row and column
#Structure: cdata[row, column] - the row number comes before the comma; the column number comes after it.

#some examples
cdata[5, 3] #Retrieves the value in the 5th row of the 3rd column of the data frame
cdata[c(5,6), 3] #Retrieves the values from the 5th and 6th rows in the 3rd column. You can also use a colon to represent a range - i.e., cdata[5:6, 3]
cdata[, 1] #Retrieves the entire first column of data

#use head to show the first 6 rows in the output
head(cdata[, 1])
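
If you want to check what each pattern returns without loading the workshop file, here is a sketch on a tiny made-up data frame. The name toy and its values are hypothetical:

```r
toy <- data.frame(
  country = c("Jamaica", "Cuba", "Haiti"),
  men     = c(24.1, 25.0, 23.2),
  women   = c(27.3, 26.4, 24.9),
  stringsAsFactors = FALSE
)

toy[2, 3]        # one cell: row 2, column 3 -> 26.4
toy[c(1, 3), 2]  # rows 1 and 3 of column 2 -> 24.1 23.2
toy[2:3, 2]      # rows 2 to 3 of column 2 -> 25.0 23.2
toy[, 1]         # the entire first column
toy[2, ]         # the entire second row (still a data frame)
```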
    

Let's add a header to the data in order to see the other methods of subsetting. We need to do this by combining the first two rows which contain information about the year and sex in order to make unique headers. We will select the rows into a variable we will call headers.

Use nrows to create "headers" using the first two rows of the dataset:
headers <- read.csv("countries-BMI-1.0.csv", nrows = 2, header = F)

Next, use sapply and paste to combine the two header values in each column into one header, separated by an underscore. Then use the names() function to add that created header to your dataset.

headers_names <- sapply(headers, paste, collapse = "_")
names(cdata) <- headers_names

#see the result
head(cdata)
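
To see the header-building step in isolation, the sketch below runs it on a tiny CSV typed directly into R. The layout (years in row 1, sex labels in row 2) and all values are assumptions made for illustration, and the names csv_text and toy are hypothetical:

```r
# hypothetical miniature of the file layout: row 1 = years, row 2 = sex labels
csv_text <- ",1980,1980\nCountry,Men,Women\nJamaica,23.1,26.0\nCuba,23.5,24.9"

# first two rows become the raw headers, the remaining rows become the data
headers <- read.csv(text = csv_text, nrows = 2, header = FALSE)
toy     <- read.csv(text = csv_text, skip = 2, header = FALSE)

# paste the two header values of each column together with an underscore
header_names <- sapply(headers, paste, collapse = "_")
names(toy) <- header_names

names(toy)  # "_Country" "1980_Men" "1980_Women"
```

The first column has an empty value in row 1, which is why its combined name starts with an underscore.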

Reshaping/Transposing Data

Now, we are going to change the dataset to a more tidy format by transposing the data, also known as reshaping or melting. We want to go from having BMIs going across columns to being all in one column. To do this, our Year/Sex headers will need to become Year and Sex columns.

#tidy/melt the data after the Country column. Save that action in a variable called longdata.
longdata <- melt(cdata, id.vars = c("_Country"))
    
#separate the year and sex into their own columns, using the underscore as the separator, and save that to a variable called countriesBMI2.
countriesBMI2 <- separate(data = longdata, col = variable,  into = c("Year", "Sex"), sep = "_")

#check your work
head(countriesBMI2)
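
To picture what the two commands above do to the shape of the data, the sketch below reproduces the same steps by hand on a tiny made-up table. It deliberately uses only base R instead of melt() and separate(), so it shows the idea rather than the method used in this lesson; the names wide and long and all values are hypothetical.

```r
# wide layout: one column per Year_Sex combination
wide <- data.frame(
  Country      = c("Jamaica", "Cuba"),
  `1980_Men`   = c(23.1, 23.5),
  `1980_Women` = c(26.0, 24.9),
  check.names = FALSE, stringsAsFactors = FALSE
)

# "melting": stack the value columns into one column,
# keeping track of which column each value came from
long <- data.frame(
  Country  = rep(wide$Country, times = ncol(wide) - 1),
  variable = rep(names(wide)[-1], each = nrow(wide)),
  value    = unlist(wide[-1], use.names = FALSE),
  stringsAsFactors = FALSE
)

# "separating": split Year_Sex at the underscore into two columns
parts <- strsplit(long$variable, "_", fixed = TRUE)
long$Year <- sapply(parts, `[`, 1)
long$Sex  <- sapply(parts, `[`, 2)

long
```
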

Second Way to Subset

Let's use the second way to subset to clean the data.

Instead of being blank, cells that have no data say "No data". This is not useful in most programs. So, we will turn these into missing values. In R, missing values are represented as NA.

#subset by column using a dollar sign and column name after the variable name
countriesBMI2$value[countriesBMI2$value == "No data"] <- NA
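
The same replacement on a short made-up vector, as a sketch, so you can see each step. The values are hypothetical:

```r
value <- c("23.1", "No data", "26.0")

# the comparison returns TRUE wherever the text matches
value == "No data"

# use that TRUE/FALSE result to pick the cells to overwrite
value[value == "No data"] <- NA

# count how many missing values there are now
sum(is.na(value))  # 1
```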

Right now, the BMI averages are provided together with the upper and lower limits of their error ranges. Because of this, those cells are string values instead of numerical ones. So, let's also separate the BMIs from the error ranges so we can work with them as numerical values, which can be added, subtracted, averaged, etc.

#we will update the countriesBMI2 variable; so, be sure to run the previous command before running this one.
countriesBMI2 <- separate(data = countriesBMI2, col=value, into = c("BMI", "Error"), sep = " ")

#check your work
head(countriesBMI2)
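
Note that separate() leaves the new columns as text unless told otherwise, so a conversion step is still needed before doing arithmetic. A minimal sketch with made-up values, using as.numeric() (this step is an addition and not part of the original lesson script):

```r
bmi_text <- c("23.1", "26.0", NA)  # what a BMI column looks like as text

bmi <- as.numeric(bmi_text)        # convert text to numbers; NA stays NA
typeof(bmi)                        # "double"

# na.rm = TRUE skips the missing values when averaging
mean(bmi, na.rm = TRUE)

# on the workshop data, the same idea would be:
# countriesBMI2$BMI <- as.numeric(countriesBMI2$BMI)
```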

Saving Files

Saving Your Dataset

Use write.csv to save your updated dataset to a new CSV file. Be sure to remember your data management rules for naming.

write.csv(countriesBMI2, file = "countries-BMI-2.0.csv")
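
To check that a file was written as expected, you can read it back in. The sketch below does this with a small made-up data frame in a temporary folder; the file name and values are hypothetical. It also uses the optional argument row.names = FALSE, which stops write.csv from adding a column of row numbers to the file.

```r
toy <- data.frame(Country = c("Jamaica", "Cuba"), BMI = c(24.5, 25.1))

# write to a temporary folder so nothing in your data folder is touched
out_file <- file.path(tempdir(), "toy-BMI-1.0.csv")
write.csv(toy, file = out_file, row.names = FALSE)

# read the file back in and look at it
check <- read.csv(out_file)
check
```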

 

Saving Your Script

Remember that you have been writing a simple R script? An R script (or any other script) is a series of commands that are executed in the order they are written. The commands that we have executed one by one in RStudio can be written to a text file and then executed all at once by running the file (which is now an R script). R scripts usually have .R extensions. Let's save ours as reshape_countries-BMI.R.

Running Scripts

Let's run a script to melt/reshape the jamaica-worldbank-data.csv.

#view the first 5 columns and 6 rows of the dataset
head(read.csv("jamaica-worldbank-data.csv", header = T)[, 1:5])

#to run the script from RStudio (or R), use the source() function
source("reshape_jamaica_data.R")

Now you have run the script which created the file jamaica-worldbank-data-1.0.csv in your data folder. Visit the folder to see the new dataset.
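
If you would like to see source() at work on something smaller, the sketch below writes a short script to a temporary folder and runs it. The script name say_hello.R and its contents are made up for illustration:

```r
# create a tiny script file in a temporary folder
script_path <- file.path(tempdir(), "say_hello.R")
writeLines(c(
  "# a two-line script",
  "greeting <- \"Good morning\"",
  "print(greeting)"
), script_path)

# run every command in the file, in order
source(script_path)

# objects created by the script are now available in the session
greeting
```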

Summary

This lesson introduced you to the main ideas of programming: variables, functions, data structures, and scripts. Now you can write your own simple programs in R and begin understanding R code written by others. Now that we have worked with the data in R, we will visualize our data in Tableau.

SEEKING HELP

I know the name of the function I want to use, but I’m not sure how to use it

If you need help with a specific function, let’s say barplot(), you can type:

?barplot

If you just need to remind yourself of the names of the arguments, you can use:

args(lm)

If the function is part of a package that is installed on your computer but you don’t remember which one, you can type:

??geom_point

 

I want to use a function that does X, there must be a function for it but I don’t know which one…

If you are looking for a function to do a particular task, you can use help.search() (but it only looks through the installed packages):

help.search("kruskal")

If you can’t find what you are looking for, you can use the rdocumentation.org website, which searches through the help files across all available packages.

 

I am stuck… I get an error message that I don’t understand

Start by googling the error message. However, this doesn’t always work very well because often, package developers rely on the error catching provided by R. You end up with general error messages that might not be very helpful to diagnose a problem (e.g. “subscript out of bounds”).

However, you should check stackoverflow.com. Search using the [r] tag. Most questions have already been answered, but the challenge is to use the right words in the search to find the answers: http://stackoverflow.com/questions/tagged/r

The Introduction to R can also be dense for people with little programming experience but it is a good place to understand the underpinnings of the R language.

The R FAQ is dense and technical but it is full of useful information.

How to Ask for Help

The key to getting help from someone is for them to grasp your problem rapidly. You should make it as easy as possible to pinpoint where the issue might be.

Try to use the correct words to describe your problem. For instance, a package is not the same thing as a library. Most people will understand what you meant, but others have really strong feelings about the difference in meaning. The key point is that it can make things confusing for people trying to help you. Be as precise as possible when describing your problem.

If possible, try to reduce what doesn’t work to a simple reproducible example. If you can reproduce the problem using a very small data.frame instead of your 50,000 rows and 10,000 columns one, provide the small one with the description of your problem. When appropriate, try to generalize what you are doing so even people who are not in your field can understand the question.

To share an object with someone else, if it’s relatively small, you can use the function dput(). It will output R code that can be used to recreate the exact same object as the one in memory:

dput(head(iris)) # iris is an example data.frame that comes with R
## structure(list(Sepal.Length = c(5.1, 4.9, 4.7, 4.6, 5, 5.4), 
##     Sepal.Width = c(3.5, 3, 3.2, 3.1, 3.6, 3.9), Petal.Length = c(1.4, 
##     1.4, 1.3, 1.5, 1.4, 1.7), Petal.Width = c(0.2, 0.2, 0.2, 
##     0.2, 0.2, 0.4), Species = structure(c(1L, 1L, 1L, 1L, 1L, 
##     1L), .Label = c("setosa", "versicolor", "virginica"), class = "factor")), .Names = c("Sepal.Length", 
## "Sepal.Width", "Petal.Length", "Petal.Width", "Species"), row.names = c(NA, 
## 6L), class = "data.frame")

If the object is larger, provide the raw file (i.e., your CSV file) along with your script up to the point of the error (after removing everything that is not relevant to your issue). Alternatively, in particular if your question is not related to a data.frame, you can save any R object to a file. Note: for this example, the folder “/tmp” needs to already exist.

saveRDS(iris, file="/tmp/iris.rds")

The content of this file is not human readable and cannot be posted directly on stackoverflow. It can, however, be sent by email to someone who can read it with this command:

some_data <- readRDS(file="~/Downloads/iris.rds")
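
To convince yourself that nothing is lost along the way, here is a sketch of the full round trip. It writes to R's temporary folder (an assumption made so the example runs on any computer) instead of “/tmp” or Downloads:

```r
# save the object to a file in a temporary folder
rds_file <- file.path(tempdir(), "iris.rds")
saveRDS(iris, file = rds_file)

# read it back in and compare it with the original
some_data <- readRDS(file = rds_file)
identical(some_data, iris)  # TRUE
```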

Last, but certainly not least, always include the output of sessionInfo() as it provides critical information about your platform, the versions of R and the packages that you are using, and other information that can be very helpful to understand your problem.

sessionInfo()
## R version 3.4.2 (2017-09-28)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.5 LTS
## 
## Matrix products: default
## BLAS: /usr/lib/libblas/libblas.so.3.0
## LAPACK: /usr/lib/lapack/liblapack.so.3.0
## 
## locale:
##  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
##  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
##  [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] compiler_3.4.2  backports_1.1.0 magrittr_1.5    rprojroot_1.2  
##  [5] tools_3.4.2     htmltools_0.3.6 yaml_2.1.14     Rcpp_0.12.12   
##  [9] stringi_1.1.5   rmarkdown_1.6   knitr_1.17      stringr_1.2.0  
## [13] digest_0.6.12   evaluate_0.10.1

Where to Ask for Help

  • Your friendly colleagues: if you know someone with more experience than you, they might be able and willing to help you.
  • Stackoverflow: if your question hasn’t been answered before and is well crafted, chances are you will get an answer in less than 5 min.
  • The R-help: it is read by a lot of people (including most of the R core team), a lot of people post to it, but the tone can be pretty dry, and it is not always very welcoming to new users. If your question is valid, you are likely to get an answer very fast but don’t expect that it will come with smiley faces. Also, here more than everywhere else, be sure to use correct vocabulary (otherwise you might get an answer pointing to the misuse of your words rather than answering your question). You will also have more success if your question is about a base function rather than a specific package.
  • If your question is about a specific package, see if there is a mailing list for it. Usually it’s included in the DESCRIPTION file of the package that can be accessed using packageDescription("name-of-package"). You may also want to try to email the author of the package directly.
  • There are also some topic-specific mailing lists (GIS, phylogenetics, etc…), the complete list is here.

The Data 'Shop, 2017. License. Contributing.