Data Cleaning & Analytics

The Data 'Shop

3 + 5
12/7
# I'm adding 3 and 5. R is fun!
3+5
# Without the leading #, R tries to run the sentence as code and stops with an error:
# I'm adding 3 and 5. R is fun!
3+5
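# The next line prints the text and then stops with an error, because text cannot be added to a number
print("How are you?") + 5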
12/7
print("Good morning")
print(64)

# variable name that stores the character value "Jane"
name <- "Jane"
print(name)

# variable price that stores the value 3.99
price <- 3.99
print(price)

# working with environment
# Remember what the environment does? It stores the objects you have created; the Environment tab in RStudio shows them. How do you list them in the console?

# list all objects in your environment
ls()

# how do you remove an object?
#rm(objectName)
rm(price)

#remove all objects, clear environment
rm(list=ls()) 

#my first R command 
print("Good morning")

#applying square root function
mass <- 64          # assign the value 64 to the variable mass
sqrt(mass)          # call the function sqrt() with mass as its argument
res <- sqrt(mass)   # store the value returned by sqrt() in the variable res

# help functions
?plot
help(mean)

# download the package to your machine - once per computer
install.packages("packagename")

#load package into your active session of R - once per session
library(packagename)

length(age)
str(age)
is.integer(age)
typeof(age)
typeof(is.integer(age))
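
The calls above assume an object named age already exists; the source does not show how it was created. A minimal sketch, assuming a small integer vector of made-up ages (all values hypothetical):

```r
# Hypothetical example values; the original 'age' object is not shown in the source.
age <- c(23L, 35L, 41L, 29L)   # the L suffix makes these integers rather than doubles

length(age)              # number of elements in the vector
str(age)                 # compact summary of the structure
is.integer(age)          # TRUE, because of the L suffix
typeof(age)              # "integer"
typeof(is.integer(age))  # "logical": is.integer() itself returns TRUE or FALSE

age_dbl <- c(23, 35, 41, 29)   # without the L suffix, R stores numbers as doubles
typeof(age_dbl)                # "double"
```

Comparing typeof(age) with typeof(age_dbl) shows why is.integer() can return FALSE for numbers that look like whole numbers.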
#check where you are
getwd()

#move to the correct folder, if needed
setwd("~/Desktop/UWI-Mona/data")

#load data
cdata <- read.csv("countries-BMI-1.0.csv", skip = 2, header = F)

#check the first few rows
head(cdata)

dim(cdata)  # tells you the number of rows and columns in your data frame
#indicate which row and column
#cdata[x, y]  #Structure: row number comes before the comma; column number comes after the comma.

#some examples
cdata[5, 3] #Retrieves the data from the cell in the 5th row in the 3rd column of the data frame
cdata[c(5,6), 3] #Retrieves the data from the 5th and 6th rows in the 3rd column. You can also use a colon to represent a range - i.e., cdata[5:6, 3]
cdata[, 1] #Retrieves the entire first column of data

#use head to show the first 6 rows in the output
head(cdata[, 1])
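
The same row-and-column indexing can be tried without the workshop file. This is a minimal sketch on a toy data frame; the name toy and all of its values are hypothetical stand-ins for cdata:

```r
# Hypothetical toy data frame standing in for cdata.
toy <- data.frame(
  country  = c("A", "B", "C", "D"),
  bmi_1980 = c(21.5, 22.1, 23.4, 24.0),
  bmi_2008 = c(22.3, 23.0, 24.8, 25.1),
  stringsAsFactors = FALSE
)

dim(toy)          # number of rows, then number of columns
toy[2, 3]         # a single cell: row 2, column 3
toy[c(2, 3), 3]   # rows 2 and 3 of column 3
toy[2:3, 3]       # the same rows, written as a range
toy[, 1]          # the entire first column
toy[1, ]          # the entire first row
toy$bmi_1980      # a column selected by name
```

Leaving the position before or after the comma empty means "all rows" or "all columns" respectively.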

headers <- read.csv("countries-BMI-1.0.csv", nrows = 2, header = F)
headers_names <- sapply(headers, paste, collapse = "_")
names(cdata) <- headers_names

#see the result
head(cdata)
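
To see what pasting the two header rows together produces, here is a minimal sketch that reads a few lines of text instead of the real file. The layout (a blank first cell above "Country", years above sexes) and all values are assumptions made for illustration; colClasses = "character" is added here only to keep the header cells as text.

```r
# Hypothetical file contents: two header rows followed by two data rows.
raw <- paste(
  ",1980,1980,2008,2008",
  "Country,Male,Female,Male,Female",
  "A,21.5,22.0,23.1,24.2",
  "B,20.9,21.7,22.8,23.5",
  sep = "\n"
)

# Read the data rows only, skipping the two header rows.
toy <- read.csv(text = raw, skip = 2, header = FALSE)

# Read just the two header rows, then glue each column's two cells together.
hdr <- read.csv(text = raw, nrows = 2, header = FALSE, colClasses = "character")
names(toy) <- sapply(hdr, paste, collapse = "_")

# The first name is "_Country" because the cell above "Country" is blank.
names(toy)
```

Under these assumptions, a blank top-left header cell is what gives rise to a column name with a leading underscore.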
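#load the packages that provide melt() and separate() - once per session
#(assumed here: reshape2 for melt(), tidyr for separate(); install them first if needed)
library(reshape2)
library(tidyr)

#tidy/melt the data after the Country column. Save that action in a variable called longdata.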
longdata <- melt(cdata, id.vars = c("_Country"))

#separate the year and sex into their own columns, using the underscore as the separator, and save that to a variable called countriesBMI2.
countriesBMI2 <- separate(data = longdata, col = variable,  into = c("Year", "Sex"), sep = "_")

#check your work
head(countriesBMI2)
#subset by column using a dollar sign and column name after the variable name
countriesBMI2$value[countriesBMI2$value == "No data"] <- NA
#we will update the countriesBMI2 variable; so, be sure to run the previous command before running this one.
countriesBMI2 <- separate(data = countriesBMI2, col=value, into = c("BMI", "Error"), sep = " ")

#check your work
head(countriesBMI2)
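
The two steps above (marking "No data" as missing, then splitting the value at the space) can also be sketched in base R, without separate(). This is a substitute technique shown for illustration, not the workshop's method; the "estimate [interval]" format and all values are assumptions.

```r
# Hypothetical values in the style "estimate [interval]"; the real file's format is assumed.
toy <- data.frame(
  Country = c("A", "B", "C"),
  value   = c("21.5 [20.1-22.9]", "No data", "23.4 [22.0-24.8]"),
  stringsAsFactors = FALSE
)

# Mark the placeholder text as missing.
toy$value[toy$value == "No data"] <- NA

# Split each value at the first space: the estimate before it, the interval after it.
toy$BMI   <- as.numeric(sub(" .*$", "", toy$value))  # drop everything from the first space on
toy$Error <- sub("^[^ ]* ", "", toy$value)           # drop everything up to the first space
toy$value <- NULL                                    # remove the original combined column

toy
```

Replacing "No data" with NA first matters: missing values pass through both steps as NA instead of producing a stray text fragment in the new columns.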
write.csv(countriesBMI2, file = "countries-BMI-2.0.csv")
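
By default write.csv() also writes the row names as an unnamed first column. A minimal sketch of the difference, using a hypothetical toy data frame and temporary files so that nothing real is overwritten:

```r
# Hypothetical toy data frame; values are made up for illustration.
toy <- data.frame(Country = c("A", "B"), BMI = c(21.5, 23.4))

# tempfile() gives throwaway paths so this sketch does not touch real files.
path_default     <- tempfile(fileext = ".csv")
path_norownames  <- tempfile(fileext = ".csv")

write.csv(toy, file = path_default)                        # row names become a first column
write.csv(toy, file = path_norownames, row.names = FALSE)  # only the data columns are written

names(read.csv(path_default))      # includes an extra column "X" holding the row names
names(read.csv(path_norownames))   # just "Country" and "BMI"
```

Whether to pass row.names = FALSE is a design choice: it keeps the saved file free of an extra index column when the file is read back in later.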
#view the first 5 columns and 6 rows of the dataset
head(read.csv("jamaica-worldbank-data.csv", header = T)[, 1:5])

#to run the script from RStudio (or R), use the source() function
source("reshape_jamaica_data.R")

?barplot
args(lm)
??geom_point
help.search("kruskal")
dput(head(iris)) # iris is an example data.frame that comes with R
## structure(list(Sepal.Length = c(5.1, 4.9, 4.7, 4.6, 5, 5.4), 
##     Sepal.Width = c(3.5, 3, 3.2, 3.1, 3.6, 3.9), Petal.Length = c(1.4, 
##     1.4, 1.3, 1.5, 1.4, 1.7), Petal.Width = c(0.2, 0.2, 0.2, 
##     0.2, 0.2, 0.4), Species = structure(c(1L, 1L, 1L, 1L, 1L, 
##     1L), .Label = c("setosa", "versicolor", "virginica"), class = "factor")), .Names = c("Sepal.Length", 
## "Sepal.Width", "Petal.Length", "Petal.Width", "Species"), row.names = c(NA, 
## 6L), class = "data.frame")
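# save an R object to a file so that it can be shared
saveRDS(iris, file="/tmp/iris.rds")
# whoever receives the file reads it back from wherever they saved it
some_data <- readRDS(file="~/Downloads/iris.rds")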
sessionInfo()
## R version 3.4.2 (2017-09-28)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.5 LTS
## 
## Matrix products: default
## BLAS: /usr/lib/libblas/libblas.so.3.0
## LAPACK: /usr/lib/lapack/liblapack.so.3.0
## 
## locale:
##  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
##  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
##  [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] compiler_3.4.2  backports_1.1.0 magrittr_1.5    rprojroot_1.2  
##  [5] tools_3.4.2     htmltools_0.3.6 yaml_2.1.14     Rcpp_0.12.12   
##  [9] stringi_1.1.5   rmarkdown_1.6   knitr_1.17      stringr_1.2.0  
## [13] digest_0.6.12   evaluate_0.10.1

Learning Objectives

About R

Why R?

About RStudio

The 4 windows of RStudio

Executing Commands

Using R

Assigning to Variables

Scripting

Functions

Reshaping Our Data in R

Packages

Data types and Data structures

Data Types

Data Structures

Data frames

Tip:

Subsetting

Reshaping/Transposing Data

Saving Files

Saving Your Dataset

Saving Your Script

Running Scripts

SEEKING HELP

I know the name of the function I want to use, but I’m not sure how to use it

I want to use a function that does X, there must be a function for it but I don’t know which one…

I am stuck… I get an error message that I don’t understand

How to Ask for Help

Where to Ask for Help

More Resources