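Reference: Data Analysis and Prediction Algorithms with R
The data.table package is used for data wrangling and analysis. Chapter 3 introduced the dplyr package for data processing; this chapter shows how to achieve the same functionality with data.table.
data.table is a separate package that must be installed and loaded on its own. This chapter covers the data.table counterparts of the methods from Chapter 3 (data processing in R): mutate, filter, select, group_by, and so on.
First, we use the setDT function to convert the data frame into a data.table; otherwise the operations that follow may not work.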
library(tidyverse)
library(data.table)
library(dslabs)
murders <- copy(murders)
murders <- setDT(murders)

To select specific columns with dplyr, we would write:

select(murders, state, region) %>% head()

| state | region |
|---|---|
| <chr> | <fct> |
| Alabama | South |
| Alaska | West |
| Arizona | West |
| Arkansas | South |
| California | West |
| Colorado | West |
Here is how the same selection is written in data.table:

murders[, c('state', 'region')] %>% head()

| state | region |
|---|---|
| <chr> | <fct> |
| Alabama | South |
| Alaska | West |
| Arizona | West |
| Arkansas | South |
| California | West |
| Colorado | West |
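As a small illustration of the column selection above, here are three equivalent ways to pick columns by name. This is only a sketch on a toy table: `dt`, `cols`, and the values in them are made up for the example and are not the real murders data.

```r
library(data.table)

# Toy table (hypothetical values, not the real murders data)
dt <- data.table(state  = c("A", "B", "C"),
                 region = c("South", "West", "West"),
                 total  = c(10, 20, 30))

# 1. Character vector of column names written inline
sel1 <- dt[, c("state", "region")]

# 2. Column names stored in a variable: the `..` prefix tells data.table
#    to look up `cols` outside the table instead of treating it as a column
cols <- c("state", "region")
sel2 <- dt[, ..cols]

# 3. The same selection through .SD and .SDcols
sel3 <- dt[, .SD, .SDcols = cols]

names(sel1)
names(sel2)
names(sel3)
```

The `..cols` form is handy when the list of columns is built programmatically.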
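We can also use .() to refer to the variables directly. (The rate column selected here is the one created with := later in this chapter, so it must already exist when this line is run.)

murders[, .(state, rate)] %>% head()

| state | rate |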
|---|---|
| <chr> | <dbl> |
| Alabama | 0.2824424 |
| Alaska | 0.2675186 |
| Arizona | 0.3629527 |
| Arkansas | 0.3189390 |
| California | 0.3374138 |
| Colorado | 0.1292453 |
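To make the role of `.()` a little more concrete: in data.table it is an alias for `list()`, and it also decides whether a single selected column comes back as a vector or as a table. The sketch below uses a toy table; `dt` and the new column name `total_k` are hypothetical.

```r
library(data.table)

# Toy table (hypothetical values)
dt <- data.table(state = c("A", "B", "C"),
                 total = c(1000, 2000, 3000))

v <- dt[, total]          # bare column name: returns a plain vector
d <- dt[, .(total)]       # wrapped in .(): returns a one-column data.table
l <- dt[, list(total)]    # list() is the long form of .()

# .() can also compute and rename columns on the fly
r <- dt[, .(state, total_k = total / 1000)]
r
```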
In dplyr, we add a column with the mutate function:

murders %>% mutate(rate = total / population * 10^5) %>% head()

| state | abb | region | population | total | rate |
|---|---|---|---|---|---|
| <chr> | <chr> | <fct> | <dbl> | <dbl> | <dbl> |
| Alabama | AL | South | 4779736 | 135 | 2.824424 |
| Alaska | AK | West | 710231 | 19 | 2.675186 |
| Arizona | AZ | West | 6392017 | 232 | 3.629527 |
| Arkansas | AR | South | 2915918 | 93 | 3.189390 |
| California | CA | West | 37253956 | 1257 | 3.374138 |
| Colorado | CO | West | 5029196 | 65 | 1.292453 |
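In data.table, we use := to define a new column. This updates the table by reference instead of creating a new copy, which saves memory:

murders[, rate := total / population * 10^5] %>% head()

| state | abb | region | population | total | rate |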
|---|---|---|---|---|---|
| <chr> | <chr> | <fct> | <dbl> | <dbl> | <dbl> |
| Alabama | AL | South | 4779736 | 135 | 2.824424 |
| Alaska | AK | West | 710231 | 19 | 2.675186 |
| Arizona | AZ | West | 6392017 | 232 | 3.629527 |
| Arkansas | AR | South | 2915918 | 93 | 3.189390 |
| California | CA | West | 37253956 | 1257 | 3.374138 |
| Colorado | CO | West | 5029196 | 65 | 1.292453 |
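One consequence of updating by reference is that a function which uses `:=` on its argument changes the caller's table, with no need to return or reassign it. The following is a minimal sketch; the helper `add_rate` and the toy values are made up for illustration.

```r
library(data.table)

# Toy table (hypothetical values)
dt <- data.table(population = c(1e6, 2e6), total = c(10, 50))

# Hypothetical helper: adds a rate column to whatever table it receives
add_rate <- function(d) {
  d[, rate := total / population * 10^5]  # modifies d in place
  invisible(NULL)                         # nothing is returned on purpose
}

add_rate(dt)   # no assignment needed
names(dt)      # the caller's table now has a rate column
dt$rate
```

With dplyr's mutate, by contrast, the result has to be assigned back to a name for the new column to be kept.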
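Similarly, we can define several columns at once with :=. Note that this call overwrites rate using a multiplier of 10000, so from here on rate is expressed per 10,000 people:

murders[, `:=`(rate = total / population * 10000, rank = rank(population))] %>% head()

| state | abb | region | population | total | rate | rank |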
|---|---|---|---|---|---|---|
| <chr> | <chr> | <fct> | <dbl> | <dbl> | <dbl> | <dbl> |
| Alabama | AL | South | 4779736 | 135 | 0.2824424 | 29 |
| Alaska | AK | West | 710231 | 19 | 0.2675186 | 5 |
| Arizona | AZ | West | 6392017 | 232 | 0.3629527 | 36 |
| Arkansas | AR | South | 2915918 | 93 | 0.3189390 | 20 |
| California | CA | West | 37253956 | 1257 | 0.3374138 | 51 |
| Colorado | CO | West | 5029196 | 65 | 0.1292453 | 30 |
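Besides the functional form shown above, `:=` also accepts a character vector of new column names on the left and a list of values on the right, and assigning `NULL` drops a column. A short sketch on a toy table follows; `dt` and its values are hypothetical.

```r
library(data.table)

# Toy table (hypothetical values)
dt <- data.table(state      = c("A", "B", "C"),
                 population = c(3e6, 1e6, 2e6),
                 total      = c(30, 5, 40))

# Left-hand side: names of the new columns; right-hand side: one value per name
dt[, c("rate", "rank") := .(total / population * 10^5, rank(population))]
cols_after_add <- names(dt)

# Assigning NULL removes a column, again by reference
dt[, rank := NULL]
cols_after_drop <- names(dt)

dt
```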
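The data.table package is designed to avoid wasting memory. As a result, assigning a table to a new name does not create a copy:

x <- data.table(a = 1)
y <- x

Here y is really a reference to x rather than a new object; it is just another name for x. Ordinarily a new object is only created once y is changed. However, := modifies by reference, so changing x with := does not create a new object, and y remains another name for x. When we do not want the original object to be changed, we need to use the copy() function.

x[, a := 2]
y

| a |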
|---|
| <dbl> |
| 2 |
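A copy made with copy() is a separate object, so z keeps the value that x had at the moment it was copied:

z <- copy(x)
x[, a := 3]
z

| a |
|---|
| <dbl> |
| 2 |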
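The difference between plain assignment and `copy()` can be seen side by side, together with the usual copy-on-modify behaviour of a base data frame. This is a small self-contained sketch; all object names here are chosen only for the example.

```r
library(data.table)

# Base data frame: modifying df2 silently creates a copy, df1 is untouched
df1 <- data.frame(a = 1)
df2 <- df1
df2$a <- 2
df1$a          # still 1

# data.table: y is another name for x, z is an independent copy
x <- data.table(a = 1)
y <- x
z <- copy(x)
x[, a := 2]    # by-reference update

y$a            # follows x
z$a            # keeps the value from when it was copied
```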
In dplyr, we filter rows with the following code:

filter(murders, rate <= 0.7) %>% head()

| state | abb | region | population | total | rate | rank |
|---|---|---|---|---|---|---|
| <chr> | <chr> | <fct> | <dbl> | <dbl> | <dbl> | <dbl> |
| Alabama | AL | South | 4779736 | 135 | 0.2824424 | 29 |
| Alaska | AK | West | 710231 | 19 | 0.2675186 | 5 |
| Arizona | AZ | West | 6392017 | 232 | 0.3629527 | 36 |
| Arkansas | AR | South | 2915918 | 93 | 0.3189390 | 20 |
| California | CA | West | 37253956 | 1257 | 0.3374138 | 51 |
| Colorado | CO | West | 5029196 | 65 | 0.1292453 | 30 |
In data.table, we can filter directly inside the square brackets by placing the condition in the row position:

murders[rate <= 0.7, .(state, rate)] %>% head()

| state | rate |
|---|---|
| <chr> | <dbl> |
| Alabama | 0.2824424 |
| Alaska | 0.2675186 |
| Arizona | 0.3629527 |
| Arkansas | 0.3189390 |
| California | 0.3374138 |
| Colorado | 0.1292453 |
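The row position accepts any logical expression built from the columns, so conditions can be combined in the usual R way. Here is a brief sketch on a toy table; `dt` and its values are hypothetical.

```r
library(data.table)

# Toy table (hypothetical values)
dt <- data.table(state  = c("A", "B", "C", "D"),
                 region = c("South", "West", "West", "Northeast"),
                 rate   = c(0.3, 0.9, 0.1, 1.6))

# Two conditions joined with &
low_west <- dt[rate <= 0.7 & region == "West"]

# Membership test with %in%, combined with column selection
south_ne <- dt[region %in% c("South", "Northeast"), .(state, rate)]

# Negation
not_west <- dt[region != "West"]

low_west
south_ne
```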
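As in Chapter 3, we use the heights data set as an example:

data(heights)
# Convert the data to a data.table object
heights <- setDT(heights)

In data.table, we can use .() to access variables directly, so the dplyr code from Chapter 3 simplifies to the following: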
s <- heights[, .(average = mean(height), standard_deviation = sd(height))]
s

| average | standard_deviation |
|---|---|
| <dbl> | <dbl> |
| 68.32301 | 4.078617 |
Now suppose we want the mean and standard deviation of height for women only:

s <- heights[sex == 'Female', .(avg = mean(height), standard_deviation = sd(height))]
s

| avg | standard_deviation |
|---|---|
| <dbl> | <dbl> |
| 64.93942 | 3.760656 |
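When summarizing a filtered subset it is often useful to report how many rows went into the summary. data.table provides the special symbol `.N` for this. The sketch below uses a toy table standing in for heights; the values are made up.

```r
library(data.table)

# Toy stand-in for the heights data (hypothetical values)
h <- data.table(sex    = c("Female", "Female", "Male", "Male", "Male"),
                height = c(60, 64, 68, 70, 72))

# Filter in the row position, summarize in the column position;
# .N is the number of rows left after filtering
s_f <- h[sex == "Female", .(n = .N, avg = mean(height), standard_deviation = sd(height))]
s_f
```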
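Recall that in Chapter 3 we defined the following function:

median_min_max <- function(x){
  qs <- quantile(x, c(0.5, 0, 1))
  data.frame(median = qs[1], min = qs[2], max = qs[3])
}

As with dplyr, we can call this function inside .() to get all three summary values at once:

heights[, .(median_min_max(height))]

| median | min | max |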
|---|---|---|
| <dbl> | <dbl> | <dbl> |
| 68.5 | 50 | 82.67717 |
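A variation on the function above is to return a named list instead of a data frame; data.table turns each list element into a column, and the same call then works per group. This is only a sketch: `median_min_max_list` and the toy table `h` are hypothetical names introduced for the example.

```r
library(data.table)

# Toy stand-in for the heights data (hypothetical values)
h <- data.table(sex    = c("Female", "Female", "Female", "Male", "Male", "Male"),
                height = c(60, 62, 64, 70, 72, 74))

# Hypothetical variant of median_min_max that returns a named list
median_min_max_list <- function(x) {
  list(median = median(x), min = min(x), max = max(x))
}

overall <- h[, median_min_max_list(height)]             # one row
per_sex <- h[, median_min_max_list(height), by = sex]   # one row per group

overall
per_sex
```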
In dplyr we group with group_by; in data.table we group with the by argument:

heights[, .(avg = mean(height), standard_deviation = sd(height)), by = sex]

| sex | avg | standard_deviation |
|---|---|---|
| <fct> | <dbl> | <dbl> |
| Male | 69.31475 | 3.611024 |
| Female | 64.93942 | 3.760656 |
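In the output above, Male is listed before Female because `by` keeps groups in order of first appearance. If sorted groups are preferred, `keyby` can be used instead. A minimal sketch on a toy table follows; `h` and its values are hypothetical.

```r
library(data.table)

# Toy stand-in for the heights data (hypothetical values)
h <- data.table(sex    = c("Male", "Female", "Male", "Female"),
                height = c(70, 60, 72, 64))

# by: groups appear in the order they are first seen in the data
res_by <- h[, .(n = .N, avg = mean(height)), by = sex]

# keyby: same result, but sorted by the grouping column and keyed on it
res_key <- h[, .(n = .N, avg = mean(height)), keyby = sex]

res_by
res_key
```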
We can sort rows with the same bracket approach that we used for filtering. Here are the states ordered by population:

murders[order(population)] %>% head()

| state | abb | region | population | total | rate | rank |
|---|---|---|---|---|---|---|
| <chr> | <chr> | <fct> | <dbl> | <dbl> | <dbl> | <dbl> |
| Wyoming | WY | West | 563626 | 5 | 0.08871131 | 1 |
| District of Columbia | DC | South | 601723 | 99 | 1.64527532 | 2 |
| Vermont | VT | Northeast | 625741 | 2 | 0.03196211 | 3 |
| North Dakota | ND | North Central | 672591 | 4 | 0.05947151 | 4 |
| Alaska | AK | West | 710231 | 19 | 0.26751860 | 5 |
| South Dakota | SD | North Central | 814180 | 8 | 0.09825837 | 6 |
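Sorting in descending order, or sorting the table in place, follows the same pattern. The sketch below shows `order()` with a minus sign and data.table's `setorder()`, which reorders by reference; the toy table `dt` and its values are hypothetical.

```r
library(data.table)

# Toy table (hypothetical values)
dt <- data.table(state      = c("A", "B", "C"),
                 population = c(3, 1, 2),
                 rate       = c(1.0, 0.5, 2.0))

# Descending order: prefix the column with a minus sign
by_pop_desc <- dt[order(-population)]

# order() returns a new table; dt itself is unchanged so far
unchanged <- dt$state

# setorder() sorts dt itself, by reference
setorder(dt, rate)

by_pop_desc$state
dt$state
```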