1 First Impressions

Type values and mathematical formulas into R’s command prompt

1 + 1

## [1] 2

Assign values to symbols (variables)

x = 1
x + x

## [1] 2

Invoke functions such as c(), which takes any number of values and returns a single vector

x = c(1, 2, 3)
x

## [1] 1 2 3

R functions, such as sqrt(), often operate efficiently on vectors

y = sqrt(x)
y

## [1] 1.000000 1.414214 1.732051
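
To illustrate what operating on a whole vector means, here is a minimal sketch comparing a single vectorized call with an explicit loop. The names roots and roots_loop are made up for this example.

```r
x <- c(1, 2, 3)

# Vectorized: one call handles every element
roots <- sqrt(x)

# Equivalent explicit loop, shown only for comparison
roots_loop <- numeric(length(x))
for (i in seq_along(x)) {
    roots_loop[i] <- sqrt(x[i])
}

all.equal(roots, roots_loop)   # TRUE

# Arithmetic operators are vectorized too
x^2                            # 1 4 9
```

The vectorized form is shorter and is usually the more idiomatic choice in R.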

There are often several ways to accomplish a task in R

x = c(1, 2, 3)
x

## [1] 1 2 3

x <- c(4, 5, 6)
x

## [1] 4 5 6

x <- 7:9
x

## [1] 7 8 9

10:12 -> x
x

## [1] 10 11 12

Sometimes R does ‘surprising’ things that can be fun to figure out

x <- c(1, 2, 3) -> y
x

## [1] 1 2 3

y

## [1] 1 2 3
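
One way to investigate the chained assignment is to check both symbols afterwards and to ask R how it parsed the expression. This is only a sketch. The operator precedence behind it is described in ?Syntax, where rightward assignment is listed as binding more tightly than leftward assignment.

```r
x <- c(1, 2, 3) -> y

# Both symbols end up bound to the same values
identical(x, y)   # TRUE

# quote() returns the parsed but unevaluated expression for inspection
quote(x <- c(1, 2, 3) -> y)
```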

2 R Data types: vector and list

‘Atomic’ vectors

Types include integer, numeric (floating-point; real), complex, logical, character, raw (bytes)

people <- c("Brian", "Jim", "Herve", "Dan", "Val", "Martin")
people

## [1] "Brian"  "Jim"    "Herve"  "Dan"    "Val"    "Martin"
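
The atomic types listed above can be checked with typeof(). The short sketch below uses arbitrary literal values chosen for illustration.

```r
typeof(1L)            # "integer"
typeof(1)             # "double" (R's numeric type)
typeof(1 + 2i)        # "complex"
typeof(TRUE)          # "logical"
typeof("a")           # "character"
typeof(as.raw(255))   # "raw"

# An atomic vector holds a single type, so c() coerces mixed input
# to the most general type present
typeof(c(1L, 2.5))        # "double"
typeof(c(1, "a", TRUE))   # "character"
```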

Atomic vectors can be named

population <- c(Buffalo=259000, Rochester=210000, `New York`=8400000)
population

##   Buffalo Rochester  New York 
##    259000    210000   8400000

log10(population)

##   Buffalo Rochester  New York 
##  5.413300  5.322219  6.924279
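
Names also give a way to look up elements. A small sketch, reusing the population vector from above:

```r
population <- c(Buffalo=259000, Rochester=210000, `New York`=8400000)

names(population)          # "Buffalo" "Rochester" "New York"
population["Buffalo"]      # subset by name; the name is kept
population[["New York"]]   # 8400000; [[ returns the bare value
unname(population)         # same values with the names removed
```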

Statistical concepts like NA (not available)

truthiness <- c(TRUE, FALSE, NA)
truthiness

## [1]  TRUE FALSE    NA

Logical concepts like ‘and’ (&), ‘or’ (|), and ‘not’ (!)

!truthiness

## [1] FALSE  TRUE    NA

truthiness | !truthiness

## [1] TRUE TRUE   NA

truthiness & !truthiness

## [1] FALSE FALSE    NA
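
The results above follow from treating NA as an unknown value: an expression is NA only when its answer depends on that unknown. A minimal sketch (is.na() and the na.rm argument are standard ways of handling missing values):

```r
TRUE | NA    # TRUE: true whatever the missing value is
FALSE & NA   # FALSE: false whatever the missing value is
TRUE & NA    # NA: the answer depends on the missing value

truthiness <- c(TRUE, FALSE, NA)
is.na(truthiness)               # FALSE FALSE TRUE
sum(truthiness, na.rm = TRUE)   # 1: counts the TRUE values, ignoring NA
```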

Numerical concepts like infinity (Inf) or not-a-number (NaN, e.g., 0 / 0)

undefined_numeric_values <- c(NA, 0/0, NaN, Inf, -Inf)
undefined_numeric_values

## [1]   NA  NaN  NaN  Inf -Inf

sqrt(undefined_numeric_values)

## Warning in sqrt(undefined_numeric_values): NaNs produced

## [1]  NA NaN NaN Inf NaN
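
These special values can be told apart with predicate functions. The vector v below is a made-up example.

```r
v <- c(NA, NaN, Inf, -Inf, 1)

is.na(v)         # TRUE TRUE FALSE FALSE FALSE (NaN also counts as missing)
is.nan(v)        # FALSE TRUE FALSE FALSE FALSE
is.infinite(v)   # FALSE FALSE TRUE TRUE FALSE
is.finite(v)     # FALSE FALSE FALSE FALSE TRUE

1 / 0            # Inf
```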

Common string manipulations

toupper(people)

## [1] "BRIAN"  "JIM"    "HERVE"  "DAN"    "VAL"    "MARTIN"

substr(people, 1, 3)

## [1] "Bri" "Jim" "Her" "Dan" "Val" "Mar"

R is a green consumer: it recycles short vectors to align with long vectors

x <- 1:3
x * 2            # '2' (vector of length 1) recycled to c(2, 2, 2)

## [1] 2 4 6

truthiness | NA

## [1] TRUE   NA   NA

truthiness & NA

## [1]    NA FALSE    NA
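
Recycling also applies when the shorter vector has more than one element. The sketch below shows that case, along with the warning R gives when the lengths do not divide evenly. The name res is made up for this example.

```r
1:6 + 1:2    # 2 4 4 6 6 8: 1:2 is repeated three times

# If the longer length is not a multiple of the shorter one,
# R still recycles but issues a warning (suppressed here)
res <- suppressWarnings(1:3 + 1:2)
res          # 2 4 4
```

Silent recycling is convenient for scalars, but it can hide length mismatches, so the warning is worth taking seriously.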

It’s very common to nest operations, which can be simultaneously compact, confusing, and expressive ([: subset; <: less than)

substr(tolower(people), 1, 3)

## [1] "bri" "jim" "her" "dan" "val" "mar"

population[population < 1000000]

##   Buffalo Rochester 
##    259000    210000
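
When a nested expression is hard to read, it can help to evaluate it from the inside out. The sketch below does this for the two examples above. The intermediate names (lower, prefix, is_small, small) are made up for illustration.

```r
people <- c("Brian", "Jim", "Herve", "Dan", "Val", "Martin")
population <- c(Buffalo=259000, Rochester=210000, `New York`=8400000)

# substr(tolower(people), 1, 3), one step at a time
lower <- tolower(people)
prefix <- substr(lower, 1, 3)

# population[population < 1000000], one step at a time
is_small <- population < 1000000   # named logical vector
small <- population[is_small]      # keep elements where is_small is TRUE
```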

Lists

The list type can contain other vectors, including other lists

frenemies = list(
    friends=c("Larry", "Richard", "Vivian"),
    enemies=c("Dick", "Mik")
)
frenemies

## $friends
## [1] "Larry"   "Richard" "Vivian" 
## 
## $enemies
## [1] "Dick" "Mik"

[ subsets one list to create another list, [[ extracts a list element

frenemies[1]

## $friends
## [1] "Larry"   "Richard" "Vivian"

frenemies[c("enemies", "friends")]

## $enemies
## [1] "Dick" "Mik" 
## 
## $friends
## [1] "Larry"   "Richard" "Vivian"

frenemies[["enemies"]]

## [1] "Dick" "Mik"
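
The difference between [ and [[ shows up in the class of the result. A short sketch using the same list:

```r
frenemies <- list(
    friends=c("Larry", "Richard", "Vivian"),
    enemies=c("Dick", "Mik")
)

class(frenemies["enemies"])     # "list": still a list, of length 1
class(frenemies[["enemies"]])   # "character": the element itself
frenemies$enemies               # $ extracts by name, like [[

frenemies[["friends"]][2]       # "Richard": extract, then subset
```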

Factors

Character-like vectors, but with values restricted to specific levels

sex = factor(c("Male", "Male", "Female"),
             levels=c("Female", "Male", "Hermaphrodite"))
sex

## [1] Male   Male   Female
## Levels: Female Male Hermaphrodite

sex == "Female"

## [1] FALSE FALSE  TRUE

table(sex)

## sex
##        Female          Male Hermaphrodite 
##             1             2             0

sex[sex == "Female"]

## [1] Female
## Levels: Female Male Hermaphrodite
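
Under the hood, a factor stores integer codes that index into its levels. The sketch below inspects these, and uses droplevels() to discard levels that are not used:

```r
sex <- factor(c("Male", "Male", "Female"),
              levels=c("Female", "Male", "Hermaphrodite"))

levels(sex)         # "Female" "Male" "Hermaphrodite"
as.integer(sex)     # 2 2 1: positions within levels(sex)
as.character(sex)   # "Male" "Male" "Female"

# Unused levels persist after subsetting; droplevels() removes them
levels(droplevels(sex))   # "Female" "Male"
```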

3 Classes: data.frame and beyond

Variables are often related to one another in a highly structured way, e.g., two ‘columns’ of data in a spreadsheet

x = rnorm(1000)       # 1000 random normal deviates
y = x + rnorm(1000)   # another 1000 deviates, as a function of x
plot(y ~ x)           # relationship between x and y

Convenient to manipulate them together

data.frame(): like columns in a spreadsheet

df = data.frame(X=x, Y=y)
head(df)           # first 6 rows

##            X           Y
## 1 -1.7569371 -0.70884344
## 2 -1.6527157 -1.97487316
## 3 -0.5161684 -1.36055768
## 4  0.2218860  0.09724608
## 5 -0.6661832 -1.82587026
## 6 -0.5512824  0.71819197

plot(Y ~ X, df)    # same as above

See all data with View(df). Summarize data with summary(df)

summary(df)

##        X                  Y           
##  Min.   :-3.27963   Min.   :-5.20065  
##  1st Qu.:-0.71917   1st Qu.:-1.02837  
##  Median :-0.06830   Median :-0.08605  
##  Mean   :-0.06072   Mean   :-0.09962  
##  3rd Qu.: 0.64606   3rd Qu.: 0.90735  
##  Max.   : 2.77080   Max.   : 4.37988

Easy to manipulate data in a coordinated way, e.g., access column X with $ and subset for just those values greater than 0

positiveX = df[df$X > 0,]
head(positiveX)

##            X           Y
## 4  0.2218860  0.09724608
## 9  0.6701959  0.82361589
## 10 1.1216619  1.49955242
## 14 0.6156470  0.11297448
## 15 0.2805778 -1.84736727
## 16 0.7633320 -1.63962235

plot(Y ~ X, positiveX)
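
The row subset above can be broken into steps, and subset() offers a compact alternative. This is a sketch: the seed is arbitrary and only makes the example repeatable, and the names keep and positiveX2 are made up.

```r
set.seed(123)
x <- rnorm(1000)
y <- x + rnorm(1000)
df <- data.frame(X=x, Y=y)

keep <- df$X > 0          # logical vector, one value per row
positiveX <- df[keep, ]   # rows where keep is TRUE, all columns

# subset() evaluates the condition inside df, so X needs no df$ prefix
positiveX2 <- subset(df, X > 0)
all.equal(positiveX, positiveX2)   # TRUE

# Select rows and a single column at once
head(df[keep, "Y"])
```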

R is introspective – ask it about itself

class(df)

## [1] "data.frame"

dim(df)

## [1] 1000    2

colnames(df)

## [1] "X" "Y"

matrix() is a related class, where all elements have the same type (a data.frame() requires elements within a column to be the same type, but different columns can have different types).
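
The contrast between the two classes can be seen by mixing numbers and strings. A minimal sketch with made-up values follows. Here stringsAsFactors=FALSE is set explicitly because its default differs between R versions.

```r
m <- matrix(1:6, nrow=2)   # filled column by column
dim(m)                     # 2 3
m[2, 3]                    # 6: row 2, column 3

# A matrix holds one type, so mixed input is coerced to character
mixed <- cbind(1:2, c("a", "b"))
typeof(mixed)              # "character"

# A data.frame keeps a separate type per column
d <- data.frame(n=1:2, s=c("a", "b"), stringsAsFactors=FALSE)
sapply(d, class)           # n: "integer", s: "character"
```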

A scatterplot makes one want to fit a linear model (do a regression analysis)

Use a formula to describe the relationship between variables
Variables named in the formula are looked up in the second argument (here, df)
```
fit <- lm(Y ~ X, df)
```
Visualize the points, and add the regression line
```
plot(Y ~ X, df)
abline(fit, col="red", lwd=3)
```
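
To see what the fit estimated, the coefficients can be extracted with coef(). The sketch below regenerates comparable data with an arbitrary seed, so the exact numbers will differ from those elsewhere in this document.

```r
set.seed(123)
df <- data.frame(X=rnorm(1000))
df$Y <- df$X + rnorm(1000)

fit <- lm(Y ~ X, df)   # Y and X are looked up in df
coef(fit)              # named vector: "(Intercept)" and "X" (the slope)

# Y was simulated as X plus noise, so the slope should be close to 1
# and the intercept close to 0
```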

Summarize the fit as an ANOVA table

anova(fit)

## Analysis of Variance Table
## 
## Response: Y
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## X           1 1040.0 1039.96  1022.2 < 2.2e-16 ***
## Residuals 998 1015.4    1.02                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Introspection – what class is fit? What methods can I apply to an object of that class?

class(fit)

## [1] "lm"

methods(class=class(fit))

##  [1] add1           alias          anova          case.names     coerce         confint       
##  [7] cooks.distance deviance       dfbeta         dfbetas        drop1          dummy.coef    
## [13] effects        extractAIC     family         formula        hatvalues      influence     
## [19] initialize     kappa          labels         logLik         model.frame    model.matrix  
## [25] nobs           plot           predict        print          proj           qr            
## [31] residuals      rstandard      rstudent       show           simulate       slotsFromS3   
## [37] summary        variable.names vcov          
## see '?methods' for accessing help and source code
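
Several of the methods listed above can be called directly on the fitted object. A brief sketch follows: the data are regenerated with an arbitrary seed, and new_points is a hypothetical set of inputs for prediction.

```r
set.seed(123)
df <- data.frame(X=rnorm(1000))
df$Y <- df$X + rnorm(1000)
fit <- lm(Y ~ X, df)

head(residuals(fit))   # observed minus fitted values
confint(fit)           # confidence intervals (95% by default)

new_points <- data.frame(X=c(-1, 0, 1))
predict(fit, newdata=new_points)   # fitted Y at the new X values
```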

A.1 – Introduction to R

Martin Morgan

16 - 17 May, 2016
