Learning Objectives
Upon completion, participants will be able to:
- Describe R and RStudio.
- Load data frames in R.
- Utilize the building blocks of the R language.
- Extract parts of a dataset.
- Reshape wide data into tidy data in R.
- Separate columns using delimiters in R.
- Write, save, and run an R script.
R is a free programming language and software environment for statistical computing and graphics. RStudio integrates with R as an IDE (Integrated Development Environment) to provide further functionality. You can work in R without RStudio, but RStudio makes working in R more effective.
RStudio is an interface for R, and we will use it to work with R throughout this workshop. It is a piece of software (also known as an integrated development environment, or IDE) that makes working in R much easier. When you open RStudio, you should see 4 windows.
The Files tab allows you to navigate between folders. For this workshop, we need to navigate to the data folder within the UWI_Mona folder that we created on the Desktop.
Click the three dots icon at the top right corner of the Files tab and navigate to Desktop/UWI_Mona/data. When there, click More and select Set as working directory.
Notice the change in your console window - your work with R is now done from the data folder.
When you type commands in the console window (bottom left) and press Enter on your keyboard, they are executed immediately and the output is displayed.
3 + 5
12/7
We can also comment on what it is that we’re doing
# I'm adding 3 and 5. R is fun!
3+5
What happens if we type that same command without the # sign in the front?
I'm adding 3 and 5. R is fun!
3+5
Now R is trying to run that sentence as a command, and it doesn’t work. Now we’re stuck in the console. The + sign means that R is still waiting for input, so we can’t type in a new command. To get out of this, press the Esc key. This will work whenever you’re stuck with that + sign.
The > symbol means that R is ready for the next command. If you enter an incomplete command, you will see +, which means that R is waiting for you to complete the command.
Now, let's try something other than mathematical commands.
print("How are you?")
print("Good morning")
print(64)
NOTE: Use quotes around characters, but not around numbers, unless you want them to be seen as characters.
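As a small illustration of the note above, the sketch below compares a number with the same digits wrapped in quotes. The variable names and values are made up for this example; class() is a base R function that reports how R sees a value.

```r
# A number without quotes is stored as a numeric value
num_value <- 64
class(num_value)

# The same digits in quotes are stored as text (character)
text_value <- "64"
class(text_value)

# Arithmetic works on the number ...
num_plus_one <- num_value + 1
print(num_plus_one)

# ... but text_value + 1 would stop with an error, because "64" is text
```
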
Notice that some commands are followed by parentheses. These are called functions. Functions typically need an argument inside the parentheses.
One of the main concepts of any programming language is the notion of a variable. Variables are created to store values for future use. For example, if you want the value of 3+5, you have to run the command every time you need it. Instead, you can store the value in a variable for future use. A variable with a value becomes an object.
To create a variable in R, use the <- assignment operator (Alt + dash in RStudio):
# variable name that stores the character value "Jane"
name <- "Jane"
print(name)
# variable price that stores the value 3.99
price <- 3.99
print(price)
# working with the environment
# Remember what the environment does? It stores the objects you created. Look at what the Environment tab shows. How do you see the same list in the console?
# list all objects in your environment
ls()
# how do you remove an object?
#rm(objectName)
rm(price)
#remove all objects, clear environment
rm(list=ls())
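To make the effect of rm() visible, the following sketch checks whether an object name appears in the output of ls() before and after removing it. It reuses the price variable from above; the names present_before and present_after are only illustrative.

```r
# Create an object, then confirm it is listed in the environment
price <- 3.99
present_before <- "price" %in% ls()
print(present_before)

# Remove the object and check again
rm(price)
present_after <- "price" %in% ls()
print(present_after)
```
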
It is often the case that we would like to reuse our commands. If so, you can type commands in the text editor window.
Let's open a new R Script file by clicking the + sign on the top right corner of the scripting panel and write our command there, for example print("Good morning!").
To check that this command works, send it to the console for execution by pressing Ctrl+Enter. (On a Mac, press Command+Enter.) Otherwise, if you want to keep going without executing the command, press Enter to move to the next line in your script.
We have a very simple example here, but you can imagine writing hundreds of commands in the order you want them to be executed to accomplish a certain task. This is what an R program or R script is.
It is good practice to comment your code. Comments (statements that are helpful to the user, but are not seen by R as commands to be executed) in R start with #.
Let's add a comment to our simple script.
#my first R command
print("Good morning")
We can save the file with our commands and reuse it later. For now, let's save our example as the R_commands.R file in the data folder.
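The same save-and-reuse cycle can also be sketched entirely in code. The example below is only an illustration: it writes the two-line script to a temporary folder (an assumption made so that it runs anywhere, rather than the data folder used in the workshop) and then runs it with source().

```r
# Build a path for the script inside R's temporary folder
script_path <- file.path(tempdir(), "R_commands.R")

# Write the comment and the command into the script file
writeLines(c("#my first R command", 'print("Good morning")'), script_path)

# Run every command stored in the script
source(script_path)
```

In RStudio you would normally save the script with the Save button; writeLines() is used here only to keep the sketch self-contained.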
In general, a function takes an input and transforms it according to the function's definition (rules). You can recognize functions in R by the presence of parentheses. Objects in parentheses are called the function's arguments.
#applying the square root function
mass <- 64 #a variable
sqrt(mass) #function call with an argument provided
res <- sqrt(mass) #variable that stores the function's result
In the example, the square root function takes the mass object as input and finds the square root of its value. The result is then assigned to the res variable. As another example, getwd() is a function that outputs your current location within the file system. Although it takes no input (some functions do not require arguments), parentheses are still required for proper syntax in R.
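A short sketch of these ideas follows. It shows one function with a single argument, one with a named argument, and one with no arguments. The round() call and the variable names are additions for illustration only.

```r
# One argument: the value to take the square root of
mass <- 64
root <- sqrt(mass)
print(root)

# Two arguments: the value and, by name, the number of decimal places
rounded <- round(3.14159, digits = 2)
print(rounded)

# No arguments, but the parentheses are still required
current_folder <- getwd()
print(current_folder)
```
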
There are thousands of built-in functions in R. There are also help functions that you can use to find out what other functions do and how to use them. The help appears in the bottom-right window of RStudio.
# help functions
?plot
help(mean)
Let's work in scripting mode from now on so that you will have the record of all commands we used in this lesson.
The functionality of R is expanded by R packages that include functions not present in the default installation of R. When you need to use another package, do these 2 things:
# download the package to your machine - once per computer
install.packages("packagename")
#load package into your active session of R - once per session
library(packagename)
We will use the reshape2 and tidyr packages for this exercise because they were created for tidying data, which is what we need.
Question: How do we install and load the functions available from these packages?
Single-element data structures are the smallest units in R. For example, if we assign the value 45 to a variable age by entering age <- 45, we have just created the smallest object in R.
length(age)
str(age)
is.integer(age)
typeof(age)
typeof(is.integer(age))
Question: What data type is stored in the `score` variable after entering score <- c(1,4,3)?
The expression typeof(is.integer(age)) is an example of a nested function call. Nested functions are very common in R, but can be difficult to understand at first. You can always split a nested function call into a series of single function calls. Remember that the variable inside the innermost parentheses is the argument (input) for the function that will be evaluated first.
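The sketch below shows the nested call next to its split version, so you can see that both give the same answer. The intermediate variable names are invented for this example.

```r
age <- 45

# Nested form: is.integer(age) runs first, then typeof() runs on its result
nested_result <- typeof(is.integer(age))

# Split form: the same two steps, one per line
age_is_integer <- is.integer(age)       # inner function first
split_result <- typeof(age_is_integer)  # outer function second

print(nested_result)
print(split_result)
print(identical(nested_result, split_result))
```
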
Sometimes you will need to convert between data types. There are functions that do that: as.integer(), as.character(), and so on. NOTE: The conversion between data types is not always possible. Think about converting between character and integer.
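Here is a minimal sketch of conversions that work and one that does not. The example values are arbitrary. When text cannot be read as a number, R returns NA and issues a warning, which is silenced here with suppressWarnings() only to keep the output tidy.

```r
# Text that looks like a number converts cleanly
from_text <- as.integer("12")
print(from_text)

# A number can always be turned into text
to_text <- as.character(12)
print(to_text)

# Text that is not a number cannot be converted: the result is NA
failed <- suppressWarnings(as.integer("twelve"))
print(failed)
print(is.na(failed))
```
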
Let's load the countries-BMI-1.0.csv dataset to explore it.
#check where you are
getwd()
#move to the correct folder, if needed
setwd("~/Desktop/UWI_Mona/data")
#load data
cdata <- read.csv("countries-BMI-1.0.csv", skip = 2, header = FALSE)
#check the first few rows
head(cdata)
Our smallest objects can be used to represent a single element in the dataset, like an individual year or an individual country, but what happens when you combine them? Combining them produces larger data structures, such as vectors, lists, matrices, and data frames.
Type str(cdata) to see information about your variable.
Question: For our cdata variable, could you make an informed guess about what data structure this is in R?
Yes! It is a list of columns (vectors or factors) of equal length, or a data frame.
Notice that a data frame is a list of vectors. So, using typeof(cdata) will return list. Data frames are extremely useful data structures as they represent table-like datasets.
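To see this for yourself without the workshop file, the sketch below builds a tiny data frame by hand and inspects it. The data frame toy and its values are made up for illustration.

```r
# A small, hand-made data frame with two columns of equal length
toy <- data.frame(
  country = c("Jamaica", "Cuba"),
  year = c(1980, 1980),
  stringsAsFactors = FALSE
)

toy_type <- typeof(toy)   # how R stores it internally
toy_class <- class(toy)   # how R treats it
toy_dims <- dim(toy)      # number of rows, then number of columns

print(toy_type)
print(toy_class)
print(toy_dims)
str(toy)                  # compact summary of every column
```
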
dim(cdata) # tells you the number of rows and columns in your data frame
First Way to Subset
#indicate which row and column
#cdata[x, y] #Structure: Row number (x) comes before the comma; column number (y) comes after the comma.
#some examples
cdata[5, 3] #Retrieves the data from the cell in the 5th row and 3rd column of the data frame
cdata[c(5,6), 3] #Retrieves the data from the 5th and 6th rows in the 3rd column. You can also use a colon to represent a range - i.e., cdata[5:6, 3]
cdata[, 1] #Retrieves the entire first column of data
#use head to show the first 6 rows in the output
head(cdata[, 1])
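The same row-and-column pattern can be tried on a small data frame whose contents are easy to check by eye. The data frame toy below is hypothetical and stands in for cdata.

```r
# Six rows and three columns of made-up values
toy <- data.frame(
  country = c("A", "B", "C", "D", "E", "F"),
  bmi_1980 = c(21.5, 22.0, 23.1, 24.2, 25.3, 26.4),
  bmi_1981 = c(21.6, 22.1, 23.2, 24.3, 25.4, 26.5),
  stringsAsFactors = FALSE
)

one_cell <- toy[5, 3]              # 5th row, 3rd column
two_cells <- toy[5:6, 3]           # 5th and 6th rows, 3rd column
first_column <- toy[, 1]           # every row of the 1st column
first_three <- head(toy[, 1], 3)   # only the first 3 values of that column

print(one_cell)
print(two_cells)
print(first_column)
print(first_three)
```
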
Let's add a header to the data in order to see the other methods of subsetting. We need to do this by combining the first two rows, which contain information about the year and sex, in order to make unique headers. We will read those rows into a variable we will call headers.
Use the nrows argument of read.csv to create headers from the first two rows of the dataset:
headers <- read.csv("countries-BMI-1.0.csv", nrows = 2, header = FALSE)
Next, use sapply and paste to combine the two rows in each column into one header, separated by an underscore. Then use the names() function to add the created headers to your dataset.
headers_names <- sapply(headers, paste, collapse = "_")
names(cdata) <- headers_names
#see the result
head(cdata)
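To see what sapply and paste are doing, here is a sketch on a hand-made stand-in for the two header rows. The contents of toy_headers and toy_data are assumptions chosen to resemble a year row followed by a sex row. They are not taken from the real file.

```r
# Two header rows: the first holds years, the second holds labels
toy_headers <- data.frame(
  V1 = c("", "Country"),
  V2 = c("1980", "Male"),
  V3 = c("1980", "Female"),
  stringsAsFactors = FALSE
)

# For each column, glue its two values together with an underscore
toy_names <- sapply(toy_headers, paste, collapse = "_")
print(toy_names)

# Apply the combined names to a small data frame
toy_data <- data.frame(
  V1 = c("A", "B"),
  V2 = c(21.1, 22.3),
  V3 = c(20.5, 23.0),
  stringsAsFactors = FALSE
)
names(toy_data) <- toy_names
print(toy_data)
```

Because the first header cell is empty, the first combined name starts with an underscore, which is why the country column is referred to as "_Country".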
Now, we are going to change the dataset to a more tidy format by reshaping it from wide to long, a step also known as melting. We want to go from having BMIs spread across columns to having them all in one column. To do this, our Year/Sex headers will need to become Year and Sex columns.
#tidy/melt the data after the Country column. Save that action in a variable called longdata.
longdata <- melt(cdata, id.vars = c("_Country"))
#separate the year and sex into their own columns, using the underscore as the separator, and save that to a variable called countriesBMI2.
countriesBMI2 <- separate(data = longdata, col = variable, into = c("Year", "Sex"), sep = "_")
#check your work
head(countriesBMI2)
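If it helps to see what melting and separating do step by step, the sketch below reproduces the same idea on a tiny made-up table using only base R. This is not how melt() or separate() are implemented. It is a hand-written illustration, and the data frame wide and its values are hypothetical.

```r
# A small wide table: one column per Year_Sex combination
wide <- data.frame(
  Country = c("A", "B"),
  `1980_Male` = c(21.1, 22.3),
  `1980_Female` = c(20.5, 23.0),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# "Melt": stack the value columns into one column,
# repeating the country and recording the source column name
value_cols <- setdiff(names(wide), "Country")
long <- data.frame(
  Country = rep(wide$Country, times = length(value_cols)),
  variable = rep(value_cols, each = nrow(wide)),
  value = unlist(wide[value_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)

# "Separate": split the old column names at the underscore
parts <- strsplit(long$variable, "_", fixed = TRUE)
long$Year <- sapply(parts, `[`, 1)
long$Sex <- sapply(parts, `[`, 2)

print(long)
```
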
Second Way to Subset
Let's use the second way to subset to clean the data.
Instead of being blank, cells that have no data say "No data". This is not useful in most programs. So, we will turn these into missing values. In R, missing values are represented as NA.
#subset by column using a dollar sign and column name after the variable name
countriesBMI2$value[countriesBMI2$value == "No data"] <- NA
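The replacement pattern can be tried on a plain vector first. In this sketch the vector value is invented. The comparison inside the square brackets picks out the matching elements, and the assignment overwrites only those.

```r
# A made-up column of text values with two placeholders
value <- c("25.4", "No data", "22.0", "No data")

# Overwrite only the elements equal to "No data"
value[value == "No data"] <- NA
print(value)

# Check which elements are now missing, and how many
missing_flags <- is.na(value)
missing_count <- sum(missing_flags)
print(missing_flags)
print(missing_count)
```
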
Right now, each BMI average is stored in the same cell as its error range (upper and lower limits). Because of this, those cells are string values instead of numerical. So, let's separate the BMIs from the error ranges so we can work with them as numerical values, which can be added, subtracted, averaged, etc.
#we will update the countriesBMI2 variable; so, be sure to run the previous command before running this one.
countriesBMI2 <- separate(data = countriesBMI2, col=value, into = c("BMI", "Error"), sep = " ")
#check your work
head(countriesBMI2)
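Splitting the column does not by itself turn text into numbers, so a conversion step is likely still needed before doing arithmetic. The sketch below shows one way to do that in base R. The format of the sample values (a number, a space, then a bracketed range) is an assumption made for illustration and may differ from the real file.

```r
# Made-up cells: a BMI, a space, then a range (assumed format)
value <- c("25.4 [24.1-26.7]", NA, "22.0 [21.5-22.6]")

# Keep only the text before the first space
bmi_text <- sub(" .*$", "", value)

# Convert the text to numbers; missing values stay NA
bmi <- as.numeric(bmi_text)
print(bmi)

# Numeric values can now be averaged, skipping the missing ones
bmi_mean <- mean(bmi, na.rm = TRUE)
print(bmi_mean)
```

On the workshop data, the equivalent step would be applying as.numeric() to the BMI column created by separate().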
Use write.csv to save your updated dataset to a new CSV file. Be sure to remember your data management rules for naming.
write.csv(countriesBMI2, file = "countries-BMI-2.0.csv")
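A quick way to confirm that a file was written as expected is to read it back in. The sketch below does this with a made-up data frame and a temporary folder. The file name toy-BMI-2.0.csv is hypothetical. It also passes row.names = FALSE, an optional argument that the command above does not use, to leave out the extra column of row numbers that write.csv adds by default.

```r
# A small made-up dataset
toy <- data.frame(
  Country = c("A", "B"),
  BMI = c(21.5, 22.0),
  stringsAsFactors = FALSE
)

# Write it to a temporary location without row numbers
out_path <- file.path(tempdir(), "toy-BMI-2.0.csv")
write.csv(toy, file = out_path, row.names = FALSE)

# Read it back and compare with the original
reloaded <- read.csv(out_path, stringsAsFactors = FALSE)
print(reloaded)
print(all.equal(toy, reloaded))
```
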
Save your script as reshape_countries-BMI.R.
Let's run a script to melt/reshape the jamaica-worldbank-data.csv dataset.
#view the first 5 columns and 6 rows of the dataset
head(read.csv("jamaica-worldbank-data.csv", header = TRUE)[, 1:5])
#to run the script from RStudio (or R), use the source() function
source("reshape_jamaica_data.R")
Now you have run the script, which created the file jamaica-worldbank-data-1.0.csv in your data folder. Visit the folder to see the new dataset.
Summary
This lesson introduced you to the main ideas of programming: variables, functions, data structures and scripts. Now you can write your own simple programs in R and begin understanding R code written by others. Now that we have worked with the data in R, we will visualize our data in Tableau.
If you need help with a specific function, let’s say barplot(), you can type:
?barplot
If you just need to remind yourself of the names of the arguments, you can use:
args(lm)
If the function is part of a package that is installed on your computer but you don’t remember which one, you can type:
??geom_point
If you are looking for a function to do a particular task, you can use help.search() (but it only looks through the installed packages):
help.search("kruskal")
If you can’t find what you are looking for, you can use the rdocumentation.org website, which searches through the help files across all available packages.
Start by googling the error message. However, this doesn’t always work very well because often, package developers rely on the error catching provided by R. You end up with general error messages that might not be very helpful to diagnose a problem (e.g. “subscript out of bounds”).
However, you should check stackoverflow.com. Search using the [r] tag. Most questions have already been answered, but the challenge is to use the right words in the search to find the answers: http://stackoverflow.com/questions/tagged/r
The Introduction to R can also be dense for people with little programming experience but it is a good place to understand the underpinnings of the R language.
The R FAQ is dense and technical but it is full of useful information.
The key to getting help from someone is for them to grasp your problem rapidly. You should make it as easy as possible to pinpoint where the issue might be.
Try to use the correct words to describe your problem. For instance, a package is not the same thing as a library. Most people will understand what you meant, but others have really strong feelings about the difference in meaning. The key point is that it can make things confusing for people trying to help you. Be as precise as possible when describing your problem.
If possible, try to reduce what doesn’t work to a simple reproducible example. If you can reproduce the problem using a very small data.frame instead of your 50,000 rows and 10,000 columns one, provide the small one with the description of your problem. When appropriate, try to generalize what you are doing so even people who are not in your field can understand the question.
To share an object with someone else, if it’s relatively small, you can use the function dput(). It will output R code that can be used to recreate the exact same object as the one in memory:
dput(head(iris)) # iris is an example data.frame that comes with R
## structure(list(Sepal.Length = c(5.1, 4.9, 4.7, 4.6, 5, 5.4),
## Sepal.Width = c(3.5, 3, 3.2, 3.1, 3.6, 3.9), Petal.Length = c(1.4,
## 1.4, 1.3, 1.5, 1.4, 1.7), Petal.Width = c(0.2, 0.2, 0.2,
## 0.2, 0.2, 0.4), Species = structure(c(1L, 1L, 1L, 1L, 1L,
## 1L), .Label = c("setosa", "versicolor", "virginica"), class = "factor")), .Names = c("Sepal.Length",
## "Sepal.Width", "Petal.Length", "Petal.Width", "Species"), row.names = c(NA,
## 6L), class = "data.frame")
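As a sketch of the round trip, the example below captures the text that dput() prints for a small made-up data frame and evaluates that text to rebuild the object, which is roughly what a helper would do after pasting your dput() output into their console. The object small and the other variable names are hypothetical.

```r
# A small made-up data frame to share
small <- data.frame(id = 1:3, score = c(1, 4, 3))

# Capture the R code that dput() prints
code_text <- capture.output(dput(small))
print(code_text)

# Evaluating that code recreates the object
rebuilt <- eval(parse(text = code_text))
print(rebuilt)
print(all.equal(small, rebuilt))
```
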
If the object is larger, provide the raw file (i.e., your CSV file) with your script up to the point of the error (and after removing everything that is not relevant to your issue). Alternatively, in particular if your question is not related to a data.frame, you can save any R object to a file. Note: for this example, the folder “/tmp” needs to already exist.
saveRDS(iris, file="/tmp/iris.rds")
The content of this file is however not human readable and cannot be posted directly on stackoverflow. It can however be sent to someone by email who can read it with this command:
some_data <- readRDS(file="~/Downloads/iris.rds")
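The two commands can be tested together on one machine. The sketch below saves the built-in iris data to a temporary folder (used here as an assumption so that the example does not depend on "/tmp" or a Downloads folder) and reads it back.

```r
# Save an R object to a file in the temporary folder
rds_path <- file.path(tempdir(), "iris.rds")
saveRDS(iris, file = rds_path)

# Read it back into a new variable
some_data <- readRDS(file = rds_path)

# The restored object matches the original
same_object <- identical(iris, some_data)
print(same_object)
```
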
Last, but certainly not least, always include the output of sessionInfo() as it provides critical information about your platform, the versions of R and the packages that you are using, and other information that can be very helpful to understand your problem.
sessionInfo()
## R version 3.4.2 (2017-09-28)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.5 LTS
##
## Matrix products: default
## BLAS: /usr/lib/libblas/libblas.so.3.0
## LAPACK: /usr/lib/lapack/liblapack.so.3.0
##
## locale:
## [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
## [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
## [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] compiler_3.4.2 backports_1.1.0 magrittr_1.5 rprojroot_1.2
## [5] tools_3.4.2 htmltools_0.3.6 yaml_2.1.14 Rcpp_0.12.12
## [9] stringi_1.1.5 rmarkdown_1.6 knitr_1.17 stringr_1.2.0
## [13] digest_0.6.12 evaluate_0.10.1
To find contact details for a specific package, you can inspect its description with packageDescription("name-of-package"). You may also want to try to email the author of the package directly.
The Data 'Shop, 2017.