I'm relatively familiar with data.table, not so much with dplyr. I've read through some dplyr vignettes and examples that have popped up on SO, and so far my conclusions are:
1. data.table and dplyr are comparable in speed, except when there are many (i.e. >10-100K) groups, and in some other circumstances (see benchmarks below)
2. dplyr has more accessible syntax
3. dplyr abstracts (or will) potential database interactions
In my mind 2. doesn't bear much weight because I am fairly familiar with data.table, though I understand that for users new to both it will be a big factor. I would like to avoid an argument about which is more intuitive, because that is irrelevant to my specific question, asked from the perspective of someone already familiar with data.table. I also would like to avoid a discussion about how "more intuitive" leads to faster analysis (certainly true, but again, not what I'm most interested in here).
What I want to know is:
1. Are there analytical tasks that are a lot easier to code with one package or the other for people familiar with the packages (i.e., some combination of keystrokes required vs. required level of esotericism, where less of each is a good thing)?
2. Are there analytical tasks that are performed substantially (i.e., more than 2x) more efficiently in one package vs. the other?
One recent SO question got me thinking about this some more, because up until that point I didn't think dplyr would offer much beyond what I can already do in data.table. Here is the dplyr solution (data at end of Q):
dat %.%
group_by(name, job) %.%
filter(job != "Boss" | year == min(year)) %.%
mutate(cumu_job2 = cumsum(job2))
Which was much better than my hack attempt at a data.table solution. That said, good data.table solutions are also pretty good (thanks Jean-Robert and Arun, and note that here I favored a single statement over the strictly most optimal solution):
setDT(dat)[,
.SD[job != "Boss" | year == min(year)][, cumjob := cumsum(job2)],
by=list(id, job)
]
The syntax for the latter may seem esoteric, but it actually is pretty straightforward if you're used to data.table (i.e., it doesn't use any of the more esoteric tricks).
Ideally, what I'd like to see are some good examples where the dplyr or data.table approach is substantially more concise or performs substantially better.
Usage:
dplyr does not allow grouped operations that return an arbitrary number of rows (from eddi's question; note: this looks like it will be implemented in dplyr 0.5, and @beginneR shows a potential workaround using do() in the answer to @eddi's question).
data.table supports rolling joins (thanks @dholstius) as well as overlap joins.
data.table internally optimises expressions of the form DT[col == value] or DT[col %in% values] for speed through automatic indexing, which uses binary search while keeping the same base R syntax (see the sketch after this list). See here for more details and a tiny benchmark.
dplyr offers standard-evaluation versions of its functions (e.g. regroup, summarize_each_) that can simplify programmatic use of dplyr (note that programmatic use of data.table is definitely possible, it just requires some careful thought, substitution/quoting, etc., at least to my knowledge).
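As a minimal sketch of the automatic-indexing point above (assumes a recent data.table; the data and column names here are made up):
library(data.table)
DT <- data.table(col = sample(1e6L), val = runif(1e6))
DT[col == 42L]   # first call silently builds an index on 'col'
indices(DT)      # "col" -- the index is stored on the table itself
DT[col == 42L]   # subsequent calls use binary search, not a vector scan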
Benchmarks:
I ran my own benchmarks and found the two packages comparable in "split-apply-combine" style analysis, except when there are very large numbers of groups (>100K), at which point data.table becomes substantially faster.
@Arun ran some benchmarks on joins, showing that data.table scales better than dplyr as the number of groups increases (updated with recent enhancements in both packages and a recent version of R). Also, a benchmark when trying to get unique values has data.table ~6x faster.
(Unverified) data.table is 75% faster on larger versions of a grouping task, while dplyr was 40% faster on the smaller ones (another SO question from the comments, thanks danas).
Matt, the main author of data.table, has benchmarked grouping operations on data.table, dplyr and python pandas on up to 2 billion rows (~100GB in RAM).
An older benchmark has data.table ~8x faster.
Data:
This is for the first example I showed in the question section.
dat <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L), name = c("Jane", "Jane", "Jane", "Jane",
"Jane", "Jane", "Jane", "Jane", "Bob", "Bob", "Bob", "Bob", "Bob",
"Bob", "Bob", "Bob"), year = c(1980L, 1981L, 1982L, 1983L, 1984L,
1985L, 1986L, 1987L, 1985L, 1986L, 1987L, 1988L, 1989L, 1990L,
1991L, 1992L), job = c("Manager", "Manager", "Manager", "Manager",
"Manager", "Manager", "Boss", "Boss", "Manager", "Manager", "Manager",
"Boss", "Boss", "Boss", "Boss", "Boss"), job2 = c(1L, 1L, 1L,
1L, 1L, 1L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L)), .Names = c("id",
"name", "year", "job", "job2"), class = "data.frame", row.names = c(NA,
-16L))
We need to cover at least these aspects to provide a comprehensive answer/comparison (in no particular order of importance): Speed, Memory usage, Syntax and Features.
My intent is to cover each of these as clearly as possible from the data.table perspective.
Note: unless explicitly mentioned otherwise, by referring to dplyr we refer to dplyr's data.frame interface, whose internals are in C++ using Rcpp.
data.table syntax is consistent in its form: DT[i, j, by]. Keeping i, j and by together is by design. By keeping related operations together, it allows data.table to easily optimise operations for speed and, more importantly, memory usage, and also to provide some powerful features, all while maintaining consistency in syntax.
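For a concrete reading of the general form (a toy illustration, not from the original answer):
library(data.table)
DT <- data.table(x = 1:10, y = 11:20, z = rep(1:2, each = 5))
# i: subset rows where x > 2; j: compute sum(y); by: for each group of z
DT[x > 2, .(sum_y = sum(y)), by = z]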
Quite a few benchmarks (mostly on grouping operations) have already been added to the question, showing that data.table gets faster than dplyr as the number of groups and/or rows to group by increases, including benchmarks by Matt on grouping from 10 million to 2 billion rows (100GB in RAM) on 100 to 10 million groups and varying grouping columns, which also compare pandas. See also the updated benchmarks, which include Spark and pydatatable as well.
On benchmarks, it would be great to cover these remaining aspects as well:
Grouping operations involving a subset of rows - i.e., DT[x > val, sum(y), by = z]
type operations.
Benchmark other operations such as update and joins.
Also benchmark memory footprint for each operation in addition to runtime.
Operations involving filter()
or slice()
in dplyr can be memory inefficient (on both data.frames and data.tables). See this post.
Note that Hadley's comment talks about speed (that dplyr is plenty fast for him), whereas the major concern here is memory.
The data.table interface at the moment allows one to modify/update columns by reference (note that we don't need to re-assign the result back to a variable).
# sub-assign by reference, updates 'y' in-place
DT[x >= 1L, y := NA]
But dplyr will never update by reference. The dplyr equivalent would be (note that the result needs to be re-assigned):
# copies the entire 'y' column
ans <- DF %>% mutate(y = replace(y, which(x >= 1L), NA))
A concern for this is referential transparency. Updating a data.table object by reference, especially within a function, may not always be desirable. But this is an incredibly useful feature: see this and this post for interesting cases. And we want to keep it.
Therefore we are working towards exporting a shallow() function in data.table that will provide the user with both possibilities. For example, if it is desirable not to modify the input data.table within a function, one can then do:
foo <- function(DT) {
DT = shallow(DT) ## shallow copy DT
DT[, newcol := 1L] ## does not affect the original DT
DT[x > 2L, newcol := 2L] ## no need to copy (internally), as this column exists only in shallow copied DT
DT[x > 2L, x := 3L] ## have to copy (like base R / dplyr does always); otherwise original DT will
## also get modified.
}
By not using shallow()
, the old functionality is retained:
bar <- function(DT) {
DT[, newcol := 1L] ## old behaviour, original DT gets updated by reference
DT[x > 2L, x := 3L] ## old behaviour, update column x in original DT.
}
By creating a shallow copy using shallow(), we understand that you don't want to modify the original object. We take care of everything internally to ensure that, while also making sure that columns you modify are copied only when absolutely necessary. When implemented, this should settle the referential transparency issue altogether, while providing the user with both possibilities.
Also, once shallow() is exported, dplyr's data.table interface should avoid almost all copies. So those who prefer dplyr's syntax can use it with data.tables. But it will still lack many features that data.table provides, including (sub-)assignment by reference.
Aggregate while joining:
Suppose you have two data.tables as follows:
DT1 = data.table(x=c(1,1,1,1,2,2,2,2), y=c("a", "a", "b", "b"), z=1:8, key=c("x", "y"))
# x y z
# 1: 1 a 1
# 2: 1 a 2
# 3: 1 b 3
# 4: 1 b 4
# 5: 2 a 5
# 6: 2 a 6
# 7: 2 b 7
# 8: 2 b 8
DT2 = data.table(x=1:2, y=c("a", "b"), mul=4:3, key=c("x", "y"))
# x y mul
# 1: 1 a 4
# 2: 2 b 3
And you would like to get sum(z) * mul
for each row in DT2
while joining by columns x,y
. We can either:
1) aggregate DT1
to get sum(z)
, 2) perform a join and 3) multiply (or)
# data.table way
DT1[, .(z = sum(z)), keyby = .(x,y)][DT2][, z := z*mul][]
# dplyr equivalent
DF1 %>% group_by(x, y) %>% summarise(z = sum(z)) %>%
right_join(DF2) %>% mutate(z = z * mul)
2) do it all in one go (using by = .EACHI
feature):
DT1[DT2, list(z=sum(z) * mul), by = .EACHI]
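For the example data above, approach (2) returns one row per row of DT2 (checking by hand: the grouped sums of z are 3, 7, 11 and 15, and the two matching rows of DT2 multiply them by 4 and 3):
#    x y  z
# 1: 1 a 12
# 2: 2 b 45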
What is the advantage?
We don't have to allocate memory for the intermediate result.
We don't have to group/hash twice (once for aggregation and once for joining).
And more importantly, the operation we want to perform is clear by looking at j in (2).
Check this post for a detailed explanation of by = .EACHI
. No intermediate results are materialised, and the join+aggregate is performed all in one go.
Have a look at this, this and this post for real usage scenarios.
In dplyr, you would have to join and aggregate or aggregate first and then join, neither of which is as efficient in terms of memory (which in turn translates to speed).
Update and joins:
Consider the data.table code shown below:
DT1[DT2, col := i.mul]
adds/updates DT1's column col with mul from DT2 on those rows where DT2's key column matches DT1. I don't think there is an exact equivalent of this operation in dplyr, i.e., without resorting to a *_join operation, which would have to copy the entire DT1 just to add a new column to it, which is unnecessary.
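For comparison, a hedged dplyr sketch of the closest workaround (names as in the example above; this materialises a copy of DF1 rather than updating in place):
library(dplyr)
# left_join copies DF1; the new 'col' is NA where x,y have no match in DF2,
# matching the data.table result
ans <- DF1 %>%
  left_join(DF2, by = c("x", "y")) %>%
  rename(col = mul)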
Check this post for a real usage scenario.
To summarise, it is important to realise that every bit of optimisation matters. As Grace Hopper would say, Mind your nanoseconds!
Let's now look at syntax. Hadley commented here:
Data tables are extremely fast but I think their concision makes it harder to learn and code that uses it is harder to read after you have written it ...
I find this remark pointless because it is very subjective. What we can perhaps try is to contrast consistency in syntax. We will compare data.table and dplyr syntax side-by-side.
We will work with the dummy data shown below:
DT = data.table(x=1:10, y=11:20, z=rep(1:2, each=5))
DF = as.data.frame(DT)
Basic aggregation/update operations.
# case (a)
DT[, sum(y), by = z] ## data.table syntax
DF %>% group_by(z) %>% summarise(sum(y)) ## dplyr syntax
DT[, y := cumsum(y), by = z]
ans <- DF %>% group_by(z) %>% mutate(y = cumsum(y))
# case (b)
DT[x > 2, sum(y), by = z]
DF %>% filter(x>2) %>% group_by(z) %>% summarise(sum(y))
DT[x > 2, y := cumsum(y), by = z]
ans <- DF %>% group_by(z) %>% mutate(y = replace(y, which(x > 2), cumsum(y)))
# case (c)
DT[, if(any(x > 5L)) y[1L]-y[2L] else y[2L], by = z]
DF %>% group_by(z) %>% summarise(if (any(x > 5L)) y[1L] - y[2L] else y[2L])
DT[, if(any(x > 5L)) y[1L] - y[2L], by = z]
DF %>% group_by(z) %>% filter(any(x > 5L)) %>% summarise(y[1L] - y[2L])
data.table syntax is compact and dplyr's quite verbose. Things are more or less equivalent in case (a).
In case (b), we had to use filter() in dplyr while summarising. But while updating, we had to move the logic inside mutate(). In data.table, however, we express both operations with the same logic: operate on rows where x > 2, but in the first case get sum(y), whereas in the second case update those rows of y with its cumulative sum.
This is what we mean when we say the DT[i, j, by]
form is consistent.
Similarly in case (c), when we have an if-else condition, we are able to express the logic "as-is" in both data.table and dplyr. However, if we would like to return just those rows where the if condition is satisfied, and skip the rest, we cannot use summarise() directly (AFAICT). We have to filter() first and then summarise, because summarise() always expects a single value.
While it returns the same result, using filter()
here makes the actual operation less obvious.
It might very well be possible to use filter()
in the first case as well (does not seem obvious to me), but my point is that we should not have to.
Aggregation / update on multiple columns
# case (a)
DT[, lapply(.SD, sum), by = z] ## data.table syntax
DF %>% group_by(z) %>% summarise_each(funs(sum)) ## dplyr syntax
DT[, (cols) := lapply(.SD, sum), by = z]
ans <- DF %>% group_by(z) %>% mutate_each(funs(sum))
# case (b)
DT[, c(lapply(.SD, sum), lapply(.SD, mean)), by = z]
DF %>% group_by(z) %>% summarise_each(funs(sum, mean))
# case (c)
DT[, c(.N, lapply(.SD, sum)), by = z]
DF %>% group_by(z) %>% summarise_each(funs(n(), mean))
In case (a), the code is more or less equivalent. data.table uses the familiar base function lapply(), whereas dplyr introduces *_each() along with a bunch of helper functions for funs().
data.table's := requires column names to be provided, whereas dplyr generates them automatically.
In case (b), dplyr's syntax is relatively straightforward. Improving aggregations/updates on multiple functions is on data.table's list.
In case (c) though, dplyr would return n() as many times as there are columns, instead of just once. In data.table, all we need to do is return a list in j. Each element of the list will become a column in the result. So we can use, once again, the familiar base function c() to concatenate .N to the list returned by lapply(), which gives back a list.
Note: once again, in data.table, all we need to do is return a list in j. Each element of the list will become a column in the result. You can use the base functions c(), as.list(), lapply(), list() etc. to accomplish this, without having to learn any new functions. You will need to learn just the special variables, .N and .SD, at least. The equivalents in dplyr are n() and .
Joins
dplyr provides separate functions for each type of join, whereas data.table allows joins using the same DT[i, j, by] syntax (and with reason). It also provides an equivalent merge.data.table() function as an alternative.
setkey(DT1, x, y)
# 1. normal join
DT1[DT2] ## data.table syntax
left_join(DT2, DT1) ## dplyr syntax
# 2. select columns while join
DT1[DT2, .(z, i.mul)]
left_join(select(DT2, x, y, mul), select(DT1, x, y, z))
# 3. aggregate while join
DT1[DT2, .(sum(z) * i.mul), by = .EACHI]
DF1 %>% group_by(x, y) %>% summarise(z = sum(z)) %>%
inner_join(DF2) %>% mutate(z = z*mul) %>% select(-mul)
# 4. update while join
DT1[DT2, z := cumsum(z) * i.mul, by = .EACHI]
??
# 5. rolling join
DT1[DT2, roll = -Inf]
??
# 6. other arguments to control output
DT1[DT2, mult = "first"]
??
Some might find a separate function for each join much nicer (left, right, inner, anti, semi, etc.), whereas others might like data.table's DT[i, j, by], or merge(), which is similar to base R.
However, dplyr joins do just that. Nothing more. Nothing less.
data.tables can select columns while joining (2); in dplyr you need to select() on both data.frames first, before joining, as shown above. Otherwise you would materialise the join with unnecessary columns, only to remove them later, and that is inefficient.
data.tables can aggregate while joining (3) and also update while joining (4), using the by = .EACHI feature. Why materialise the entire join result just to add/update a few columns?
data.table is capable of rolling joins (5): roll forward (LOCF), roll backward (NOCB), and nearest.
data.table also has the mult = argument, which selects the first, last or all matches (6).
data.table has the allow.cartesian = TRUE argument to protect against accidental invalid joins.
Once again, the syntax is consistent with
DT[i, j, by]
with additional arguments allowing for controlling the output further.
do()
...
dplyr's summarise() is specially designed for functions that return a single value. If your function returns multiple or unequal-length values, you have to resort to do(). And you have to know beforehand what all your functions will return.
DT[, list(x[1], y[1]), by = z] ## data.table syntax
DF %>% group_by(z) %>% summarise(x[1], y[1]) ## dplyr syntax
DT[, list(x[1:2], y[1]), by = z]
DF %>% group_by(z) %>% do(data.frame(.$x[1:2], .$y[1]))
DT[, quantile(x, 0.25), by = z]
DF %>% group_by(z) %>% summarise(quantile(x, 0.25))
DT[, quantile(x, c(0.25, 0.75)), by = z]
DF %>% group_by(z) %>% do(data.frame(quantile(.$x, c(0.25, 0.75))))
DT[, as.list(summary(x)), by = z]
DF %>% group_by(z) %>% do(data.frame(as.list(summary(.$x))))
.SD's equivalent in dplyr is .
In data.table, you can throw pretty much anything in j - the only thing to remember is that it should return a list, so that each element of the list gets converted to a column.
In dplyr, you cannot do that. You have to resort to do(), depending on how sure you are about whether your function will always return a single value. And it is quite slow.
Once again, data.table's syntax is consistent with DT[i, j, by]. We can just keep throwing expressions in j without having to worry about these things.
Have a look at this SO question and this one. I wonder if it would be possible to express the answer as straightforwardly using dplyr's syntax...
To summarise, I have particularly highlighted several instances where dplyr's syntax is either inefficient, limited, or fails to make operations straightforward. This is particularly because data.table gets quite a bit of backlash about "harder to read/learn" syntax (like the one pasted/linked above). Most posts that cover dplyr talk about the most straightforward operations. And that is great. But it is important to realise its syntax and feature limitations as well, and I am yet to see a post on them.
data.table has its quirks as well (some of which I have pointed out that we are attempting to fix). We are also attempting to improve data.table's joins as I have highlighted here.
But one should also consider the number of features that dplyr lacks in comparison to data.table.
I have pointed out most of the features here and also in this post. In addition:
fread, a fast file reader, has been available for a long time now.
fwrite, a parallelised fast file writer, is now available. See this post for a detailed explanation of the implementation and #1664 for keeping track of further developments.
Automatic indexing, another handy feature that optimises base R syntax as-is, internally.
Ad-hoc grouping: dplyr automatically sorts the results by grouping variables during summarise(), which may not always be desirable.
The numerous advantages of data.table joins (for speed / memory efficiency and syntax) mentioned above.
Non-equi joins: allow joins using other operators (<=, <, >, >=), along with all the other advantages of data.table joins (see the sketch at the end of this list).
Overlapping range joins were implemented in data.table recently. Check this post for an overview with benchmarks.
The setorder() function in data.table allows really fast reordering of data.tables by reference (also sketched below).
dplyr provides an interface to databases using the same syntax, which data.table does not at the moment.
data.table provides faster equivalents of set operations (written by Jan Gorecki): fsetdiff, fintersect, funion and fsetequal, with an additional all argument (as in SQL).
data.table loads cleanly with no masking warnings, and it has a mechanism described here for [.data.frame compatibility when passed to any R package. dplyr changes the base functions filter, lag and [, which can cause problems; e.g. here and here.
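A few hedged sketches of the features mentioned in this list (toy data; non-equi joins assume data.table >= 1.9.8):
library(data.table)
# non-equi join: find the band each value falls into
bands <- data.table(lo = c(0, 10), hi = c(10, 20), label = c("low", "high"))
vals  <- data.table(v = c(3, 15))
bands[vals, on = .(lo <= v, hi > v)]   # v = 3 -> "low", v = 15 -> "high"
# setorder(): fast reordering by reference, no copy of the table is made
DT <- data.table(x = c(2, 1, 3))
setorder(DT, -x)   # DT itself is now sorted by x, descending
# fast set operations with an 'all' argument, as in SQL
funion(data.table(a = 1:3), data.table(a = 2:4))        # rows 1,2,3,4
fintersect(data.table(a = c(1, 1, 2)),
           data.table(a = c(1, 1, 3)), all = TRUE)      # rows 1,1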
Finally:
On databases - there is no reason why data.table cannot provide a similar interface, but this is not a priority now. It might get bumped up if users would very much like that feature... not sure.
On parallelism - everything is difficult, until someone goes ahead and does it. Of course it will take effort (being thread safe). Progress is being made towards parallelising time-consuming internals using OpenMP.
This is my attempt at a comprehensive answer from the dplyr perspective, following the broad outline of Arun's answer (but somewhat rearranged based on differing priorities).
There is some subjectivity to syntax, but I stand by my statement that the concision of data.table makes it harder to learn and harder to read. This is partly because dplyr is solving a much easier problem!
One really important thing dplyr does for you is that it constrains your options. I claim that most single-table problems can be solved with just five key verbs - filter, select, mutate, arrange and summarise - along with a "by group" adverb. That constraint is a big help when you are learning data manipulation, because it helps order your thinking about the problem. In dplyr, each of these verbs maps to a single function. Each function does one job and is easy to understand in isolation.
You create complexity by piping these simple operations together with %>%. Here is an example from one of the posts Arun linked to:
diamonds %>%
filter(cut != "Fair") %>%
group_by(cut) %>%
summarize(
AvgPrice = mean(price),
MedianPrice = as.numeric(median(price)),
Count = n()
) %>%
arrange(desc(Count))
Even if you have never seen dplyr before (or even R!), you can still get the gist of what is happening, because the functions are all English verbs. The disadvantage of English verbs is that they require more typing than [, but I think that can be largely mitigated by better autocomplete.
Here is the equivalent data.table code:
diamondsDT <- data.table(diamonds)
diamondsDT[
cut != "Fair",
.(AvgPrice = mean(price),
MedianPrice = as.numeric(median(price)),
Count = .N
),
by = cut
][
order(-Count)
]
It is harder to follow this code unless you are already familiar with data.table. (I also could not figure out how to indent the repeated [ in a way that looks good to my eye.) Personally, when I look at code I wrote six months ago, it is like looking at code written by a stranger, so I have come to prefer straightforward, if verbose, code.
Two other minor factors that I think slightly decrease readability:
Since almost every data.table operation uses [, you need additional context to figure out what is happening. For example, does x[y] join two data.tables or extract a column from a data frame? This is only a small issue, because in well-written code the variable names should suggest what is happening.
I like that group_by() is a separate operation in dplyr. It fundamentally changes the computation, so I think it should be obvious when skimming the code, and it is easier to spot group_by() than the by argument to [.data.table.
I also like that the pipe is not limited to just one package. You can start by tidying your data with tidyr and finish up with a plot in ggvis. And you are not limited to the packages that I write: anyone can write a function that forms a seamless part of a data-manipulation pipeline. In fact, I rather prefer the previous data.table code rewritten with %>%:
diamonds %>%
data.table() %>%
.[cut != "Fair",
.(AvgPrice = mean(price),
MedianPrice = as.numeric(median(price)),
Count = .N
),
by = cut
] %>%
.[order(-Count)]
And the idea of piping with %>% is not limited to data frames; it generalises easily to other contexts: interactive web graphics, web scraping, gists, run-time contracts, ...
I have lumped memory and performance together because, to me, they are not that important. Most R users work with well under 1 million rows of data, and dplyr is sufficiently fast for that size of data that you are not aware of processing time. We optimise dplyr for expressiveness on medium-sized data; feel free to use data.table for raw speed on bigger data.
The flexibility of dplyr also means that you can easily tweak performance characteristics using the same syntax. If the performance of dplyr with the data.frame backend is not good enough for you, you can use the data.table backend (albeit with a somewhat restricted set of functionality), as sketched below. If the data you are working with does not fit in memory, you can use a database backend.
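The answer predates it, but the modern incarnation of that data.table backend is the dtplyr package; a hedged sketch of what swapping backends looks like (lazy_dt() defers the work to data.table until you collect the result):
library(dplyr)
library(dtplyr)
library(ggplot2)   # for the diamonds dataset used earlier
diamonds %>%
  lazy_dt() %>%                          # use the data.table backend
  filter(cut != "Fair") %>%
  group_by(cut) %>%
  summarise(AvgPrice = mean(price)) %>%
  as_tibble()                            # computation happens here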
All that said, dplyr's performance will get better in the long term. We will definitely implement some of data.table's great ideas, like radix ordering and using the same index for joins and filters. We are also working on parallelisation, so we can take advantage of multiple cores.
Some things we plan to work on in 2015:
the readr package, to make it easy to get files off disk and into memory, analogous to fread().
More flexible joins, including support for non-equi joins.
More flexible grouping, such as bootstrap samples, rollups and more.
I am also investing time into improving R's database connectors, the ability to talk to web APIs, and making it easier to scrape HTML pages.
dplyr definitely does things that data.table cannot. Your point 3,
dplyr abstracts (or will) potential DB interactions
is a direct answer to your own question, but it is not elevated to a high enough level. dplyr is truly an extendable front-end to multiple data storage mechanisms, whereas data.table is an extension to a single one.
Look at dplyr as a backend-agnostic interface, with all of the targets using the same grammar, where you can extend the targets and handlers at will. data.table is, from the dplyr perspective, one of those targets.
You will never (I hope) see the day that data.table attempts to translate your queries into SQL statements that operate on on-disk or networked data stores.
dplyr can do things data.table will not or might not do as well. Based on its design of working in memory, data.table could have a much harder time extending itself into parallel processing of queries than dplyr.
On whether there are analytical tasks that are a lot easier to code with one package or the other for people familiar with the packages (i.e., some combination of keystrokes required vs. required level of esotericism, where less of each is a good thing):
This may seem like a punt, but the real answer is no. People familiar with tools seem to use either the one most familiar to them or the one that is actually right for the job at hand. With that being said, sometimes you want to present a particular readability and sometimes a level of performance, and when you need a high enough level of both, you may just need another tool to go along with what you already have, to make clearer abstractions.
On whether there are analytical tasks that are performed substantially (i.e., more than 2x) more efficiently in one package vs. the other:
Again, no. data.table excels at being efficient in everything it does, whereas dplyr carries the burden of being limited, in some respects, to the underlying data store and registered handlers.
This means that when you run into a performance issue with data.table, you can be pretty sure it is in your query function, and if it actually is a bottleneck in data.table, then you have won yourself the joy of filing a report. This is also true when dplyr uses data.table as the backend: you may see some overhead from dplyr, but odds are it is your query.
When dplyr has performance issues with its backends, you can get around them by registering a function for hybrid evaluation, or (in the case of databases) by manipulating the generated query prior to execution.
Also see the accepted answer to "When is plyr better than data.table?"