[1062BPY12001] Data analysis with R / week 2

The R Language and Data Analysis
Objects & numeric processes

Entering Input
Type commands to the right of the R prompt (">"). The symbol "<-" is the assignment operator.
R's grammar determines whether an expression is complete.
The # character indicates a comment. Anything to the right of the # (including the # itself) is ignored.
> x <- 1
> print(x)
[1] 1
> x
[1] 1
> msg <- "hello"
> x <- ## Incomplete expression
> x <- "5 ## Incomplete expression (the string is never closed)
> x <- c(5,) ## Error: argument 2 is empty

Naming rules for R objects
Case sensitive
• A and a are different
All alphanumeric symbols are allowed (A-Z, a-z, 0-9)
• as well as "." and "_"
A name must start with a letter, or with "." not followed by a digit.
Do not use reserved keywords
❑ Invalid names
■ 3x
■ 3_x
■ 3-x
■ 3.x
■ .3variable
❑ Valid names
■ x_3
■ x3
■ x.3
■ taiwan.taipei.x3
■ .variable
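The naming rules above can be checked programmatically. The following is a minimal sketch, assuming that a name is valid exactly when base R's `make.names()` leaves it unchanged; the helper name `is_valid_name` is hypothetical and not part of the slides.

```r
# Candidate names taken from the invalid and valid lists above.
candidates <- c("3x", "3_x", "3-x", "3.x", ".3variable",
                "x_3", "x3", "x.3", "taiwan.taipei.x3", ".variable")

# Hypothetical helper: make.names() rewrites anything that is not a
# syntactically valid name, so an unchanged string is a valid name.
is_valid_name <- function(x) make.names(x) == x

result <- is_valid_name(candidates)
names(result) <- candidates
print(result)

# Reserved keywords are rejected as well.
print(is_valid_name("if"))
```

The same helper can be used to screen column names before building a data set.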

Evaluation
After a command is evaluated, its result is returned; it may be printed directly in the console window or stored in a variable.
> x <- 5 ## nothing printed
> x ## auto-printing occurs
[1] 5
> print(x) ## explicit printing
[1] 5
The [1] indicates that x is a vector and 5 is the first element.

Printing
The : operator is used to create integer sequences.
> x <- 1:20
> x
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
[16] 16 17 18 19 20
The [16] shows that the second output line starts at the 16th element.

Using R as a Calculator
> 3 - 4
[1] -1
> 5 * 6
[1] 30
> 7 / 8
[1] 0.875
> 1 + 2 * 3
[1] 7
> (1 + 2) * 3
[1] 9
> 15 / 4
[1] 3.75
> 15 %% 4 # remainder (modulo)
[1] 3
> 2^2
[1] 4
> 2^0.5
[1] 1.414214
> 2^4.3
[1] 19.69831
> 2^-0.5
[1] 0.7071068
> log(4) # natural log
> log10(4) # log in base 10
> log(4,10) # same as above
> sqrt(9) # square root
> abs(3-4) # absolute value
> exp(1) # exponential
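As a small complement to the calculator examples, the sketch below pairs the remainder operator `%%` with integer division `%/%` (a base R operator not shown on the slide) and checks that `log(4, 10)` agrees with `log10(4)`. The variable names are illustrative only.

```r
# Integer division and remainder are complementary operators.
quotient  <- 15 %/% 4   # whole number of times 4 fits into 15
remainder <- 15 %% 4    # what is left over
rebuilt   <- 4 * quotient + remainder

# log(x, base) generalises log10(); all.equal() compares with a small tolerance.
same_log <- isTRUE(all.equal(log(4, 10), log10(4)))

print(c(quotient = quotient, remainder = remainder, rebuilt = rebuilt))
print(same_log)
```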

Warnings and Errors
> squareroot(2)
Error: could not find function "squareroot"
> sqrt 2
Error: unexpected numeric constant in "sqrt 2"
> sqrt(-2)
[1] NaN
Warning message:
In sqrt(-2) : NaNs produced
> sqrt(2 ## incomplete: R shows the + continuation prompt
+ )
[1] 1.414214

Objects
R has five basic or “atomic” classes of objects:
· character
· numeric (real numbers)
· integer
· complex
· logical (TRUE/FALSE)
The most basic object is a vector
· A vector can only contain objects of the same class
· BUT: The one exception is a list, which is represented as a vector but can contain objects of
different classes (indeed, that’s usually why we use them)
Empty vectors can be created with the vector() function.
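The classes listed above can be inspected with `class()`. This is a minimal sketch with made-up example values; it also shows `vector()` creating an empty vector and a list holding mixed classes.

```r
# One value of each atomic class named in the list above.
values  <- list("a", 1.5, 1L, 1 + 2i, TRUE)
classes <- sapply(values, class)
print(classes)

# vector() creates an "empty" vector of a given mode and length.
empty_chr <- vector("character", length = 3)
print(empty_chr)

# A list may hold elements of different classes side by side.
mixed <- list(1.5, "a", TRUE)
str(mixed)
```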

Numbers
· Numbers in R are generally treated as numeric objects (i.e. double precision real numbers)
· If you explicitly want an integer, you need to specify the L suffix
· Ex: Entering 1 gives you a numeric object; entering 1L explicitly gives you an integer.
· There is also a special number Inf which represents infinity; e.g. 1 / 0; Inf can be used in
ordinary calculations; e.g. 1 / Inf is 0
· The value NaN represents an undefined value (“not a number”); e.g. 0 / 0; NaN can also be
thought of as a missing value (more on that later)
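The points above can be verified directly at the console. The following sketch only uses base R; the variable names are illustrative.

```r
# A bare number is numeric (double); the L suffix requests an integer.
class_plain  <- class(1)
class_suffix <- class(1L)

# Inf behaves like an ordinary number in arithmetic.
pos_inf  <- 1 / 0
shrunken <- 1 / Inf

# 0 / 0 is undefined, so R returns NaN, which also counts as missing.
undefined <- 0 / 0

print(c(class_plain, class_suffix))
print(c(pos_inf, shrunken, undefined))
print(c(is.nan(undefined), is.na(undefined)))
```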

Creating Vectors
The c() function can be used to create vectors of objects.
> x <- c(0.5, 0.6) ## numeric
> x <- c(TRUE, FALSE) ## logical
> x <- c(T, F) ## logical
> x <- c("a", "b", "c") ## character
> x <- 9:29 ## integer
> x <- c(1+0i, 2+4i) ## complex
Using the vector() function
> x <- vector("numeric", length = 10)
> x
[1] 0 0 0 0 0 0 0 0 0 0
The c() function can also combine data vectors. For example:
> x <- c(74, 122, 235, 111, 292, 111, 211, 133, 156, 79)
> x1 <- c(74, 122, 235, 111, 292)
> x2 <- c(111, 211, 133, 156, 79)
> x_all <- c(x1, x2)

Mixing Objects
What about the following?
When different objects are mixed in a vector, coercion occurs so that every element in the vector is of
the same class.
> y <- c(1.7, "a") ## character
> y <- c(TRUE, 2) ## numeric
> y <- c("a", TRUE) ## character
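To see what coercion actually produced, the values and classes of the three vectors can be printed. This is a small sketch; the names `y1`, `y2`, `y3` are illustrative.

```r
y1 <- c(1.7, "a")    # the number becomes the string "1.7"
y2 <- c(TRUE, 2)     # TRUE becomes the number 1
y3 <- c("a", TRUE)   # TRUE becomes the string "TRUE"

print(y1); print(class(y1))
print(y2); print(class(y2))
print(y3); print(class(y3))
```

In each case R picks the most general class present, roughly in the order logical, numeric, character.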

Explicit Coercion
Objects can be explicitly coerced from one class to another using the as.* functions, if available.
> x <- 0:6
> class(x)
[1] "integer"
> as.numeric(x)
[1] 0 1 2 3 4 5 6
> as.logical(x)
[1] FALSE TRUE TRUE TRUE TRUE TRUE TRUE
> as.character(x)
[1] "0" "1" "2" "3" "4" "5" "6"

Explicit Coercion
Nonsensical coercion results in NAs.
> x <- c("a", "b", "c")
> as.numeric(x)
[1] NA NA NA
Warning message:
NAs introduced by coercion
> as.logical(x)
[1] NA NA NA
> as.complex(0:6) ## a sensible coercion, for comparison
[1] 0+0i 1+0i 2+0i 3+0i 4+0i 5+0i 6+0i

Attributes
R objects can have attributes
· names, dimnames
· dimensions (e.g. matrices, arrays)
· class
· length
· other user-defined attributes/metadata
Attributes of an object can be accessed using the attributes() function.
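A short sketch of `attributes()` in use, assuming a small named vector and a small matrix as examples; the "units" attribute is a made-up piece of user-defined metadata.

```r
# A named vector carries a "names" attribute.
x <- c(a = 1, b = 2, c = 3)
print(attributes(x))

# A matrix carries a "dim" attribute.
m <- matrix(1:6, nrow = 2)
print(attributes(m))

# attr() sets a user-defined attribute (hypothetical "units" label).
attr(x, "units") <- "cm"
print(names(attributes(x)))
```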

Matrices
Matrices are vectors with a dimension attribute. The dimension attribute is itself an integer vector of length 2
(nrow, ncol).
Matrices are constructed column-wise, so entries can be thought of starting in the “upper left” corner and
running down the columns.
> m <- 1:10
> m
[1] 1 2 3 4 5 6 7 8 9 10
> dim(m)
NULL
> attributes(m)
NULL
> dim(m) <- c(2, 5)
> m
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    3    5    7    9
[2,]    2    4    6    8   10
> dim(m)
[1] 2 5
> attributes(m)
$dim
[1] 2 5

Matrices (cont’d)
Matrices are constructed column-wise, so entries can be thought of starting in the “upper left” corner and
running down the columns.
> m <- matrix(1:6, nrow = 2, ncol = 3)
> m
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
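Column-wise filling is the default. As a sketch of the alternative, `matrix()` also accepts a `byrow` argument (not shown on the slide) that fills along rows instead.

```r
# Default: values run down the columns.
by_col <- matrix(1:6, nrow = 2, ncol = 3)

# byrow = TRUE: values run across the rows.
by_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)

print(by_col)
print(by_row)
```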

cbind-ing and rbind-ing
Matrices can be created by column-binding or row-binding with cbind() and rbind().
> x <- 1:3
> y <- 10:12
> cbind(x, y)
     x  y
[1,] 1 10
[2,] 2 11
[3,] 3 12
> rbind(x, y)
  [,1] [,2] [,3]
x    1    2    3
y   10   11   12

Exercises
The number of whale beachings per year in Texas during the 1990s was
74 122 235 111 292 111 211 133 156 79
Store this data in R
> whales <- scan()
1: 74 122 235 111 292 111 211 133 156 79
11:
Read 10 items
or
> whales <- c(74, 122, 235, 111, 292, 111, 211, 133, 156, 79)

Exercises (cont’d)
Using functions on a data vector
> sum(whales) # total number of beachings
[1] 1524
> length(whales) # length of data vector
[1] 10
> sum(whales)/length(whales) # average number of beachings
[1] 152.4
> mean(whales) # `mean` function finds average
[1] 152.4
> sort(whales) # the sorted values
[1] 74 79 111 111 122 133 156 211 235 292
> min(whales) # the minimum value
[1] 74
> max(whales) # the maximum value
[1] 292
> range(whales) # `range` returns both `min` and `max`
[1] 74 292
> diff(whales) # `diff` returns differences
[1] 48 113 -124 181 -181 100 -78 23 -77
> cumsum(whales) # cumulative sum
[1] 74 196 431 542 834 945 1156 1289 1445 1524

The variance: the average squared distance from the mean
> mean(whales)
[1] 152.4
> xbar = mean(whales)
> whales - xbar # vectorized operation
[1] -78.4 -30.4 82.6 -41.4 139.6 -41.4 58.6 -19.4 3.6 -73.4
> (whales-xbar)^2
[1] 6146.56 924.16 6822.76 1713.96 19488.16 1713.96
[7] 3433.96 376.36 12.96 5387.56
> sum((whales-xbar)^2)
[1] 46020.4
> n = length(whales)
> n
[1] 10
> sum((whales-xbar)^2) / (n-1) # variance
[1] 5113.378
> var(whales) # variance
[1] 5113.378
> sqrt(sum((whales-xbar)^2) / (n-1)) # standard deviation
[1] 71.50789
> sqrt(var(whales))
[1] 71.50789
> sd(whales)
[1] 71.50789
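Note that the calculation divides by n - 1 rather than n, so it is the sample variance and not a plain average of squared distances. The sketch below wraps both versions in hypothetical helper functions (`sample_var`, `population_var`) to make the difference visible.

```r
whales <- c(74, 122, 235, 111, 292, 111, 211, 133, 156, 79)

# Hypothetical helper: sample variance, dividing by n - 1 (what var() does).
sample_var <- function(x) sum((x - mean(x))^2) / (length(x) - 1)

# Hypothetical helper: population variance, dividing by n.
population_var <- function(x) sum((x - mean(x))^2) / length(x)

print(sample_var(whales))       # matches var(whales)
print(population_var(whales))   # slightly smaller
print(sqrt(sample_var(whales))) # matches sd(whales)
```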

The number of whale beachings per year in Florida during the 1990s was
89 254 306 292 274 233 294 204 204 90
> whales.fla <- scan()
1: 89 254 306 292 274 233 294 204 204 90
11:
Read 10 items
> whales + whales.fla
[1] 163 376 541 403 566 344 505 337 360 169
> whales - whales.fla
[1] -15 -132 -71 -181 18 -122 -83 -71 -48 -11
> t.test(whales, whales.fla, var.equal = TRUE)
        Two Sample t-test
data:  whales and whales.fla
t = -2.1193, df = 18, p-value = 0.04823
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -142.5803558   -0.6196442
sample estimates:
mean of x mean of y
    152.4     224.0
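To see where the t statistic comes from, the equal-variance two-sample test can be reproduced by hand. This is a sketch using the standard pooled-variance formula; the variable names are illustrative, and `t.test()` remains the practical way to run the test.

```r
whales     <- c(74, 122, 235, 111, 292, 111, 211, 133, 156, 79)
whales.fla <- c(89, 254, 306, 292, 274, 233, 294, 204, 204, 90)

n1 <- length(whales)
n2 <- length(whales.fla)
df <- n1 + n2 - 2

# Pooled variance: weighted average of the two sample variances.
pooled_var <- ((n1 - 1) * var(whales) + (n2 - 1) * var(whales.fla)) / df

# Standard error of the difference in means, then the t statistic.
std_err  <- sqrt(pooled_var * (1 / n1 + 1 / n2))
t_manual <- (mean(whales) - mean(whales.fla)) / std_err

# Two-sided p-value from the t distribution.
p_manual <- 2 * pt(-abs(t_manual), df)

print(round(c(t = t_manual, df = df, p = p_manual), 5))
```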

Creating structured data: `seq` & `rep`
simple sequences
> 1:10
[1] 1 2 3 4 5 6 7 8 9 10
> 10:1
[1] 10 9 8 7 6 5 4 3 2 1
> x <- 1:10
> x
[1] 1 2 3 4 5 6 7 8 9 10
> x <- rev(1:10)
> x
[1] 10 9 8 7 6 5 4 3 2 1
arithmetic sequences: a, a+h, a+2h, a+3h, …, a+(n-1)h
> a = 1; h = 4; n = 5;
> a + h*(0:(n-1))
[1] 1 5 9 13 17
the `seq()` function
> seq(1, 9, by = 2) # odd numbers
[1] 1 3 5 7 9
> seq(1, 10, by = 2) # as 11 > 10, 11 is not included
[1] 1 3 5 7 9
> seq(2, 10, by = 2) # even numbers
[1] 2 4 6 8 10
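The arithmetic sequence a, a+h, ..., a+(n-1)h can also be produced by `seq()` itself. The sketch below uses its `length.out` argument, which is part of base R but not shown on the slide.

```r
a <- 1; h <- 4; n <- 5

# Built by hand, as on the slide.
by_hand <- a + h * (0:(n - 1))

# Same sequence: start at a, step by h, stop after n terms.
by_seq <- seq(a, by = h, length.out = n)

# length.out can also split a fixed range into equally spaced points.
grid <- seq(0, 1, length.out = 5)

print(by_hand)
print(by_seq)
print(grid)
```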

Creating structured data: `seq` & `rep`
the `rep()` function
> rep(1, 5) # repeat the number `1` five times
[1] 1 1 1 1 1
> rep(1:5, 3) # repeat the sequence of values 1-5, three times
[1] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
> days=c("mon","tue","wed","thu","fri","sat","sun")
> rep(days, 2) # repeat the days of the week twice
[1] "mon" "tue" "wed" "thu" "fri" "sat" "sun" "mon" "tue" "wed"
[11] "thu" "fri" "sat" "sun"
Specifying pairs of equal-sized vectors: each term of the first is repeated the corresponding number of times
in the second
> rep(c("long", "short"), c(1,2)) # 1 long and 2 short
[1] "long" "short" "short"
> rep(1:4,c(2,2,2,2))
[1] 1 1 2 2 3 3 4 4
> rep(1:4,c(2,1,2,1))
[1] 1 1 2 3 3 4
> rep(1:8, 1:8)
[1] 1 2 2 3 3 3 4 4 4 4 5 5 5 5 5 6 6 6 6 6 6 7 7 7 7 7 7 7 8 8 8
[32] 8 8 8 8 8
> rep(rep(1:8, 1:8), 3)
[1] 1 2 2 3 3 3 4 4 4 4 5 5 5 5 5 6 6 6 6 6 6 7 7 7 7 7 7 7 8 8 8
[32] 8 8 8 8 8 1 2 2 3 3 3 4 4 4 4 5 5 5 5 5 6 6 6 6 6 6 7 7 7 7 7
[63] 7 7 8 8 8 8 8 8 8 8 1 2 2 3 3 3 4 4 4 4 5 5 5 5 5 6 6 6 6 6 6
[94] 7 7 7 7 7 7 7 8 8 8 8 8 8 8 8
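The positional calls above can be written with the named arguments `times` and `each`, which may read more clearly. This is a sketch using only base R's `rep()`; the variable names are illustrative.

```r
# each = 2 repeats every element twice, like rep(1:4, c(2, 2, 2, 2)).
with_each  <- rep(1:4, each = 2)
with_count <- rep(1:4, c(2, 2, 2, 2))

# times = 2 repeats the whole vector twice.
whole_twice <- rep(1:3, times = 2)

# A vector for times gives one repeat count per element.
long_short <- rep(c("long", "short"), times = c(1, 2))

print(with_each)
print(whole_twice)
print(long_short)
```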

Accessing Data: vector
eBay's Friday stock price in two months:
88.8 88.3 90.2 93.5 95.2 94.7 99.2 99.4 101.6
> ebay = scan()
1: 88.8 88.3 90.2 93.5 95.2 94.7 99.2 99.4 101.6
10:
Read 9 items
> length(ebay) # length of vector
[1] 9
> ebay[1] # get the first value
[1] 88.8
> ebay[9] # get the last value
[1] 101.6
> ebay[length(ebay)] # get the last value when the length is not known
[1] 101.6

Accessing Data: vector (cont'd)
eBay's Friday stock price in two months:
88.8 88.3 90.2 93.5 95.2 94.7 99.2 99.4 101.6
> ebay[1:4] # get the first four values
[1] 88.8 88.3 90.2 93.5
> ebay[c(1,5,9)] # get the first, fifth, and ninth values
[1] 88.8 95.2 101.6
> ebay[-1] # all but the first
[1] 88.3 90.2 93.5 95.2 94.7 99.2 99.4 101.6
> ebay[-(1:4)] # all but the 1st to 4th
[1] 95.2 94.7 99.2 99.4 101.6
if `i` is between 1 and n (the length of `x`), `x[i]` returns the i-th value of `x`
if `i` is bigger than n, a value of NA is returned
if `i` is negative and no less than -n, `x[i]` returns all but the |i|-th value of `x`
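The three indexing rules can be tried on a tiny vector. This is a minimal sketch; `x` and its values are made up for illustration.

```r
x <- c(10, 20, 30)     # hypothetical vector of length n = 3

in_range  <- x[2]      # 1 <= i <= n: the i-th value
too_big   <- x[5]      # i > n: NA
all_but_2 <- x[-2]     # negative i: everything except the 2nd value

print(in_range)
print(too_big)
print(all_but_2)
```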
assigning values to data vector
`x[i] = a` : assign a value of `a` to the i-th element of x (i is positive)
if i is larger than the length of x, then x is enlarged
> ebay[1] = 88.0
> ebay
[1] 88.0 88.3 90.2 93.5 95.2 94.7 99.2 99.4 101.6
> ebay[10:13]=c(97.0, 99.3,102.0,101.8)
> ebay
[1] 88.0 88.3 90.2 93.5 95.2 94.7 99.2 99.4 101.6 97.0
[11] 99.3 102.0 101.8
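One detail worth knowing when a vector is enlarged this way: if the assignment skips positions, R fills the gap with NA. A small sketch with made-up values:

```r
x <- c(1, 2, 3)

# Assigning to position 6 enlarges x; positions 4 and 5 become NA.
x[6] <- 10

print(x)
print(length(x))
```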

Accessing Data: vector (cont'd)
which values of ebay are more than 100?
> ebay > 100
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
[11] FALSE TRUE TRUE
> ebay[ebay>100]
[1] 101.6 102.0 101.8
> which(ebay > 100)
[1] 9 12 13
> ebay[c(9,12,13)]
[1] 101.6 102.0 101.8
> ebay[ebay>1000]
numeric(0)
> sum(ebay > 100) # number bigger than 100
[1] 3
> sum(ebay > 100)/length(ebay) # proportion bigger
[1] 0.2308
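Because TRUE counts as 1 and FALSE as 0, the proportion can also be computed with `mean()` on the logical vector. The sketch below re-enters the 13 eBay prices so that it runs on its own.

```r
ebay <- c(88.0, 88.3, 90.2, 93.5, 95.2, 94.7, 99.2, 99.4, 101.6,
          97.0, 99.3, 102.0, 101.8)

above <- ebay > 100          # logical vector, one entry per price

count_above <- sum(above)    # TRUE counts as 1
prop_above  <- mean(above)   # same as sum(above) / length(above)

print(count_above)
print(prop_above)
print(which(above))          # positions of the TRUE entries
```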

Accessing Data: matrix
> USPersonalExpenditure
                      1940   1945  1950 1955  1960
Food and Tobacco    22.200 44.500 59.60 73.2 86.80
Household Operation 10.500 15.500 29.00 36.5 46.20
Medical and Health   3.530  5.760  9.71 14.0 21.10
Personal Care        1.040  1.980  2.45  3.4  5.40
Private Education    0.341  0.974  1.80  2.6  3.64
> dim(USPersonalExpenditure)
[1] 5 5
> ncol(USPersonalExpenditure)
[1] 5
> nrow(USPersonalExpenditure)
[1] 5
> colnames(USPersonalExpenditure)
[1] "1940" "1945" "1950" "1955" "1960"
> rownames(USPersonalExpenditure)
[1] "Food and Tobacco" "Household Operation" "Medical and Health"
[4] "Personal Care" "Private Education"

Accessing Data: matrix (cont'd)
USPersonalExpenditure[1,] looks at row 1
USPersonalExpenditure[4:5,] looks at rows 4 through 5
USPersonalExpenditure[3,1] looks at row 3, column 1
USPersonalExpenditure[,2] looks at column 2
> USPersonalExpenditure[1,]
1940 1945 1950 1955 1960
22.2 44.5 59.6 73.2 86.8
> USPersonalExpenditure[,2]
   Food and Tobacco Household Operation  Medical and Health       Personal Care   Private Education
             44.500              15.500               5.760               1.980               0.974
> USPersonalExpenditure[3,1]
[1] 3.53
> USPersonalExpenditure[4:5,]
                   1940  1945 1950 1955 1960
Personal Care     1.040 1.980 2.45  3.4 5.40
Private Education 0.341 0.974 1.80  2.6 3.64
> USPersonalExpenditure[1,3]
[1] 59.6
> USPersonalExpenditure["Food and Tobacco", "1950"]
[1] 59.6
> USPersonalExpenditure[1, "1950"]
[1] 59.6

> USPersonalExpenditure[1, c(5, 3, 1)]
1960 1950 1940
86.8 59.6 22.2
> USPersonalExpenditure["Food and Tobacco", c("1960", "1950", "1940")]
1960 1950 1940
86.8 59.6 22.2

> sum(USPersonalExpenditure[,1])
[1] 37.611
> sum(USPersonalExpenditure[,2])
[1] 68.714
> sum(USPersonalExpenditure[,3])
[1] 102.56
> sum(USPersonalExpenditure[,4])
[1] 129.7
> sum(USPersonalExpenditure[,5])
[1] 163.14
> sum(USPersonalExpenditure[,6])
Error in USPersonalExpenditure[, 6] : subscript out of bounds
> colSums(USPersonalExpenditure)
   1940    1945    1950    1955    1960
 37.611  68.714 102.560 129.700 163.140
> rowSums(USPersonalExpenditure)
   Food and Tobacco Household Operation  Medical and Health       Personal Care   Private Education
            286.300             137.700              54.100              14.270               9.355
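`colSums()` and `rowSums()` are special cases of applying a function over one margin of a matrix. As a sketch, the same totals can be obtained with base R's `apply()` (not shown on the slide), where margin 1 means rows and margin 2 means columns.

```r
# USPersonalExpenditure ships with R in the datasets package.
col_totals <- apply(USPersonalExpenditure, 2, sum)   # one sum per column
row_totals <- apply(USPersonalExpenditure, 1, sum)   # one sum per row

# apply() accepts any function, for example the largest value per row.
row_max <- apply(USPersonalExpenditure, 1, max)

print(col_totals)
print(row_totals)
print(row_max)
```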

random numbers drawn from a standard normal distribution
set.seed(689)
x = rnorm(1000)
head(x)
[1] 0.9684062 -2.1456719 0.5330228 -0.1597912 0.6806083 -0.7543219
hist(x)
random numbers drawn from an exponential distribution
set.seed(689)
x = rexp(1000)
head(x)
[1] 0.6671585 1.3973498 0.4059822 1.1404633 1.2143525 0.2488164
hist(x)
random numbers drawn from a uniform distribution
set.seed(689)
x = runif(1000)
head(x)
[1] 0.83357924 0.23074833 0.01594958 0.69043398 0.70299110 0.36182904
hist(x)

Central Limit Theorem
set.seed(9.2)
x.samples <- matrix(rnorm(10000*30), nrow = 10000)
par(mfrow=c(1,2))
hist(x.samples) # random numbers drawn from a standard normal distribution
hist(rowMeans(x.samples)) # sampling distribution of the mean, drawn from a standard normal distribution
set.seed(9.2)
x.samples <- matrix(rexp(10000*30), nrow = 10000)
par(mfrow=c(1,2))
hist(x.samples) # random numbers drawn from an exponential distribution
hist(rowMeans(x.samples)) # sampling distribution of the mean, drawn from an exponential distribution
set.seed(9.2)
x.samples <- matrix(runif(10000*30), nrow = 10000)
par(mfrow=c(1,2))
hist(x.samples) # random numbers drawn from a uniform distribution
hist(rowMeans(x.samples)) # sampling distribution of the mean, drawn from a uniform distribution
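The histograms show the shape of the sampling distribution; its centre and spread can also be checked numerically. The sketch below assumes the exponential case with rate 1 (mean 1, standard deviation 1), so the mean of 30 draws should have mean close to 1 and standard deviation close to 1/sqrt(30). The seed and variable names are arbitrary choices for illustration.

```r
set.seed(1)        # arbitrary seed, chosen only for reproducibility
n_rep  <- 10000    # number of simulated samples
n_size <- 30       # observations per sample

# Each row is one sample of 30 exponential draws.
x.samples    <- matrix(rexp(n_rep * n_size), nrow = n_rep)
sample_means <- rowMeans(x.samples)

centre      <- mean(sample_means)    # expected to be near 1
spread      <- sd(sample_means)      # expected to be near 1 / sqrt(30)
theoretical <- 1 / sqrt(n_size)

print(c(centre = centre, spread = spread, theoretical = theoretical))
```

The raw draws are strongly skewed, yet the sample means cluster symmetrically around 1, which is what the theorem predicts.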
