[R for Data Science]: R Basics - 灵析社区

1. R Basics

1.1 Data Types

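R has many different object types. For example, we need to distinguish numbers from character strings, and tables from simple lists of numbers. The function class helps us determine the type of an object: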

> a <- 2
> class(a)
[1] "numeric"
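To make the distinction between types concrete, here is a minimal sketch that calls class on a few objects. The object names (s, b, tab) are made up for this illustration and do not come from the original text.

```r
# Hypothetical example objects; the names are illustrative only
a <- 2                      # a number
s <- "hello"                # a character string
b <- TRUE                   # a logical value
tab <- data.frame(x = 1:3)  # a small table

class(a)    # "numeric"
class(s)    # "character"
class(b)    # "logical"
class(tab)  # "data.frame"
```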

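1.1.1 Data Frames

The most common way to store a dataset in R is in a data frame. We can think of a data frame as a table in which rows represent observations and columns represent different variables. A data frame can combine columns of different data types. Most data analysis starts with data stored in a data frame, for example the murders dataset in the dslabs package.

# Load the dslabs package and the murders dataset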
library(dslabs)
data(murders)
class(murders)

'data.frame'

1.1.2 Examining an Object

The str function returns more detail about the structure of an object:

str(murders)

'data.frame':	51 obs. of  5 variables:
 $ state     : chr  "Alabama" "Alaska" "Arizona" "Arkansas" ...
 $ abb       : chr  "AL" "AK" "AZ" "AR" ...
 $ region    : Factor w/ 4 levels "Northeast","South",..: 2 4 4 2 4 4 1 2 2 2 ...
 $ population: num  4779736 710231 6392017 2915918 37253956 ...
 $ total     : num  135 19 232 93 1257 ...

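This tells us that the data frame has 51 observations (rows; the header is not counted) and five variables (columns). We can show the first six rows with head():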

head(murders)

	state	abb	region	population	total
	<chr>	<chr>	<fct>	<dbl>	<dbl>
1	Alabama	AL	South	4779736	135
2	Alaska	AK	West	710231	19
3	Arizona	AZ	West	6392017	232
4	Arkansas	AR	South	2915918	93
5	California	CA	West	37253956	1257
6	Colorado	CO	West	5029196	65
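The same inspection functions work on any data frame. The sketch below builds a small made-up data frame (the columns city, region and population are assumptions for illustration, not the murders data) so it runs without the dslabs package.

```r
# A small made-up data frame (not the murders data)
df <- data.frame(
  city = c("A", "B", "C"),
  region = factor(c("North", "South", "North")),
  population = c(1200, 850, 4300)
)

str(df)      # compact overview: 3 obs. of 3 variables
dim(df)      # 3 3  (rows, columns)
nrow(df)     # 3; the header row is not counted
head(df, 2)  # only the first two rows
```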

1.1.3 The Accessor: $

To access a specific variable (column) of a data frame, we use the accessor $:

murders$population

4779736
710231
6392017
2915918
37253956
5029196
3574097
897934
601723
19687653
9920000
1360301
1567582
12830632
6483802
3046355
2853118
4339367
4533372
1328361
5773552
6547629
9883640
5303925
2967297
5988927
989415
1826341
2700551
1316470
8791894
2059179
19378102
9535483
672591
11536504
3751351
3831074
12702379
1052567
4625364
814180
6346105
25145561
2763885
625741
8001024
6724540
1852994
5686986
563626

To get the names of the variables (columns), use names():

names(murders)

'state'
'abb'
'region'
'population'
'total'
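As a small aside, a column can also be extracted with double square brackets, which is handy when the column name is stored in a variable. The data frame below is a made-up stand-in, and the variable name col is hypothetical.

```r
# Made-up two-row data frame for illustration
df <- data.frame(state = c("X", "Y"), total = c(10, 20))

df$total        # 10 20
df[["total"]]   # same vector, column name given as a string

col <- "total"  # column name held in a variable
df[[col]]       # 10 20

names(df)       # "state" "total"
```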

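1.1.4 Vectors: Numeric, Character, and Logical

murders$population returns a series of values, that is, a vector. The length function gives the number of elements in a vector: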

pop <- murders$population
length(pop)
class(pop)

'numeric'

The vector above is numeric. Next we look at character and logical vectors:

class(murders$state)

'character'

class(pop>50)

'logical'

1.1.5 Factors

In R, factors are used to store categorical variables:

class(murders$region)

'factor'

We can see that region is a factor. The levels function shows its categories:

levels(murders$region)

'Northeast'
'South'
'North Central'
'West'
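To show how a factor is built, the following sketch creates one from a made-up character vector (sizes is a hypothetical name). The levels argument is used here to fix the order of the categories.

```r
# Hypothetical categorical data
sizes <- c("small", "large", "medium", "small", "large")
f <- factor(sizes, levels = c("small", "medium", "large"))

class(f)       # "factor"
levels(f)      # "small" "medium" "large"
as.integer(f)  # 1 3 2 1 3  (integer codes stored behind the labels)
table(f)       # number of observations per level
```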

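1.1.6 Lists

A data frame can be seen as a special kind of list. Lists are created with list(). Note that an R list is not the same as a Python list: for example, its components can be named.

# Method 1: named components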
record <- list(name = "John Doe",
             student_id = 1234,
             grades = c(95, 82, 91, 97, 93),
             final_grade = "A")

# Method 2: unnamed components
record2 <- list("John Doe", 1234)
record2

'John Doe'
1234
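Once a list exists, its components can be pulled out in several ways. The sketch below reuses the record and record2 lists defined above and shows access by name and by position.

```r
record <- list(name = "John Doe",
               student_id = 1234,
               grades = c(95, 82, 91, 97, 93),
               final_grade = "A")

record$student_id    # 1234
record[["grades"]]   # 95 82 91 97 93
mean(record$grades)  # 91.6

record2 <- list("John Doe", 1234)
record2[[1]]         # "John Doe" (unnamed components are accessed by position)
class(record2[1])    # "list": single brackets return a sub-list
```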

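1.1.7 Matrices

Matrices are similar to data frames in that they have rows and columns, but all entries must be of the same type. For this reason data frames are more commonly used to store data.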

mat <- matrix(1:12, 4,3)

mat

# Use [] to index a specific entry (row 1, column 2)
mat[1,2]
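A few more indexing patterns on the same matrix, as a minimal sketch. Note that matrix() fills column by column by default, which determines the values shown in the comments.

```r
mat <- matrix(1:12, 4, 3)  # 4 rows, 3 columns, filled column by column

dim(mat)    # 4 3
mat[1, 2]   # 5  (row 1, column 2)
mat[2, ]    # 2 6 10  (entire second row)
mat[, 3]    # 9 10 11 12  (entire third column)

# Convert to a data frame; columns get the default names V1, V2, V3
as.data.frame(mat)
```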

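1.2 Vectors

In R, the most basic object for storing data is the vector. As we have seen, complex datasets can usually be broken down into vectors. For example, in a data frame each column is a vector.

1.2.1 Creating Vectors

We use c() to create a vector: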

x <- c(1,2,3)
x

1
2
3

1.2.2 Names

Sometimes it is useful to name the elements of a vector, for example when defining country codes:

codes <- c(italy = 380, canada = 124, egypt = 818)
codes

italy: 380
canada: 124
egypt: 818

class(codes)

'numeric'

names(codes)

'italy'
'canada'
'egypt'
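Names can also be attached after a vector has been created, by assigning to names(). This is a small sketch; the helper vector country is a name introduced only for this example.

```r
codes <- c(380, 124, 818)
country <- c("italy", "canada", "egypt")  # hypothetical helper vector

names(codes) <- country  # attach the names to the existing vector

names(codes)  # "italy" "canada" "egypt"
class(codes)  # still "numeric": names do not change the type
```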

1.2.3 Sequences

Vectors can also be created with the sequence function seq():

x <- seq(1,10)
x

1
2
3
4
5
6
7
8
9
10

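The first argument of seq() is the start value, the second is the end value, and the third is the step size, which defaults to 1. The sequence includes both the start value and the end value (when the step lands on it). This differs from Python's range function, which excludes the end value.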
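The behaviour of the arguments can be checked with a few calls. This is a minimal sketch; length.out is a standard seq() argument that the text above does not mention.

```r
seq(1, 10)                 # 1 2 3 4 5 6 7 8 9 10 (both ends included)
1:10                       # shorthand for the same consecutive integers
seq(1, 10, 2)              # 1 3 5 7 9 (10 is not reached by the step)
seq(0, 1, length.out = 5)  # 0.00 0.25 0.50 0.75 1.00
```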

1.2.4 Indexing

We can access specific elements of a vector by their index:

x <-  codes[1]
x

italy: 380

codes[1:2]

italy: 380
canada: 124
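Besides a single position or a range, vectors accept several other kinds of index. The following sketch, using the same codes vector, shows a few common ones.

```r
codes <- c(italy = 380, canada = 124, egypt = 818)

codes[2]                    # canada: 124 (R indexes from 1, not 0)
codes[c(1, 3)]              # italy and egypt
codes["canada"]             # access by name
codes[c("egypt", "italy")]  # several names, returned in the requested order
codes[-1]                   # everything except the first element
```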

1.3 Sorting

Taking the murders dataset as an example, suppose we want to sort the gun murder totals in increasing order:

sort(murders$total)

2
4
5
5
7
8
11
12
12
16
19
21
22
27
32
36
38
53
63
65
67
84
93
93
97
97
99
111
116
118
120
135
142
207
219
232
246
250
286
293
310
321
351
364
376
413
457
517
669
805
1257

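However, this does not tell us which states have the most gun murders.

1.3.1 order

The order function returns the indices that sort its argument. For example: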

x <- c(31,4,15,92,65)
sort(x)

4
15
31
65
92

# Using order
idx <- order(x)
idx

2
3
1
5
4

x[idx]

4
15
31
65
92

Indexing x with these indices gives the same result as sort(x). Now we apply order to the murders dataset:

ind <- order(murders$total)
murders$abb[ind]

'VT'
'ND'
'NH'
'WY'
'HI'
'SD'
'ME'
'ID'
'MT'
'RI'
'AK'
'IA'
'UT'
'WV'
'NE'
'OR'
'DE'
'MN'
'KS'
'CO'
'NM'
'NV'
'AR'
'WA'
'CT'
'WI'
'DC'
'OK'
'KY'
'MA'
'MS'
'AL'
'IN'
'SC'
'TN'
'AZ'
'NJ'
'VA'
'NC'
'MD'
'OH'
'MO'
'LA'
'IL'
'GA'
'MI'
'PA'
'NY'
'FL'
'TX'
'CA'

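Because the abbreviations are ordered by increasing total, the states with the most gun murders appear at the end of the list.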
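The same index vector can reorder whole rows of a data frame, and order() accepts decreasing = TRUE to put the largest values first. The sketch below uses a small made-up table (abbreviations and totals are invented) so it runs without dslabs.

```r
# Made-up data standing in for a state-level table
df <- data.frame(abb = c("AA", "BB", "CC", "DD"),
                 total = c(40, 5, 120, 17))

ind <- order(df$total)  # 2 4 1 3
df$abb[ind]             # "BB" "DD" "AA" "CC"
df[ind, ]               # whole rows, by increasing total

# Largest totals first
top <- df[order(df$total, decreasing = TRUE), ]
head(top, 2)            # rows for "CC" and "AA"
```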

1.3.2 max and which.max

If we are only interested in the largest value, we can use max:

max(murders$total)

1257

Like order, which.max() returns an index, in this case the index of the largest value:

i_max <- which.max(murders$total)
murders[i_max, ]

min and which.min work the same way for the smallest value.
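A minimal sketch of both pairs of functions on a made-up table (the data are invented for illustration):

```r
# Made-up data standing in for a state-level table
df <- data.frame(abb = c("AA", "BB", "CC", "DD"),
                 total = c(40, 5, 120, 17))

max(df$total)                 # 120
i_max <- which.max(df$total)  # 3
df$abb[i_max]                 # "CC"

min(df$total)                 # 5
i_min <- which.min(df$total)  # 2
df$abb[i_min]                 # "BB"
```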

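1.3.3 rank

Unlike sort and order, rank does not sort the data. Instead, it returns the rank of each value, that is, the position the value would have in the sorted vector: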

x <- c(31,4,15,92,64)
rank(x)

3
1
2
5
4
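Putting the three functions side by side can help keep them apart. The sketch below uses the vector from the order example and also shows one way to rank from largest to smallest by negating the values.

```r
x <- c(31, 4, 15, 92, 65)

sort(x)   # 4 15 31 65 92  (the values, sorted)
order(x)  # 2 3 1 5 4      (indices that would sort x)
rank(x)   # 3 1 2 5 4      (position of each value in the sorted vector)

# Side-by-side view
data.frame(original = x, sort = sort(x), order = order(x), rank = rank(x))

rank(-x)  # 3 5 4 1 2      (rank 1 is now the largest value)
```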

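1.3.4 Beware of Recycling

Sometimes vectors of different lengths end up in the same arithmetic operation. R then silently recycles the shorter vector, which can cause problems that are hard to notice: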

x <- c(1,2,3)
y <- c(10,20,30,40,50,60)
x+y

11
22
33
41
52
63

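The two vectors have different lengths, yet R returns a result without an error: x was recycled twice. This is loosely similar to broadcasting in Python. R only issues a warning when the longer length is not a multiple of the shorter one.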
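The difference between the silent case and the warning case can be seen in a short sketch. The vector z and the use of tryCatch to capture the warning are additions for this illustration, as is the explicit length check at the end.

```r
x <- c(1, 2, 3)
y <- c(10, 20, 30, 40, 50, 60)
x + y  # 11 22 33 41 52 63, no warning because 6 is a multiple of 3

# Length 4 is not a multiple of 3, so R raises a warning
z <- c(10, 20, 30, 40)
res <- tryCatch(x + z, warning = function(w) conditionMessage(w))
res    # the text of the warning instead of a numeric result

# A simple guard: compare lengths before combining vectors
same_length <- length(x) == length(y)
same_length  # FALSE
```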