
data.table vs dplyr: can one do something well the other can't or does poorly?

Overview

I'm relatively familiar with data.table, not so much with dplyr. I've read through some dplyr vignettes and examples that have popped up on SO, and so far my conclusions are that:

  1. data.table and dplyr are comparable in speed, except when there are many (i.e. > 10-100K) groups, and in some other circumstances (see benchmarks below)
  2. dplyr has more accessible syntax
  3. dplyr abstracts (or will) potential database interactions
  4. There are some minor functionality differences (see "Examples / Usage" below)

In my mind, 2. doesn't bear much weight because I am fairly familiar with data.table, though I understand that for users new to both it will be a big factor. I would like to avoid an argument about which is more intuitive, as that is irrelevant for my specific question asked from the perspective of someone already familiar with data.table. I also would like to avoid a discussion about how "more intuitive" leads to faster analysis (certainly true, but again, not what I'm most interested in here).

What I want to know is:

  1. Are there analytical tasks that are a lot easier to code with one or the other package for people familiar with the packages (i.e. some combination of keystrokes required vs. required level of esotericism, where less of each is a good thing)?
  2. Are there analytical tasks that are performed substantially (i.e. more than 2x) more efficiently in one package vs. another?

One recent SO question got me thinking about this a bit more, because up until that point I didn't think dplyr would offer much beyond what I can already do in data.table. Here is the dplyr solution (data at end of Q):

dat %.%
  group_by(name, job) %.%
  filter(job != "Boss" | year == min(year)) %.%
  mutate(cumu_job2 = cumsum(job2))

Which was much better than my hack attempts at a data.table solution. That said, good data.table solutions are also pretty good (thanks Jean-Robert and Arun; note that here I favored a single statement over the strictly most optimal solution):

setDT(dat)[,
  .SD[job != "Boss" | year == min(year)][, cumjob := cumsum(job2)], 
  by=list(id, job)
]

The syntax for the latter may seem very esoteric, but it actually is pretty straightforward if you're used to data.table (i.e. it doesn't use some of the more esoteric tricks).

Ideally, what I'd like to see are some good examples where the dplyr or data.table way is substantially more concise or performs substantially better.

Examples

  • dplyr does not allow grouped operations that return arbitrary numbers of rows (from eddi's question; note: this looks like it will be implemented in dplyr 0.5, and @beginneR shows a potential work-around using do in the answer to @eddi's question).
  • data.table supports rolling joins (thanks @dholstius) as well as overlap joins.
  • data.table internally optimises expressions of the form DT[col == value] or DT[col %in% values] for speed through automatic indexing, which uses binary search while keeping the same base R syntax. See here for some more details and a tiny benchmark; a small sketch follows this list.
  • dplyr offers standard-evaluation versions of functions (e.g. regroup, summarize_each_) that can simplify programmatic use of dplyr (note that programmatic use of data.table is definitely possible, it just requires some careful thought, substitution/quoting, etc., at least to my knowledge).
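To make the automatic-indexing point above concrete, here is a minimal sketch (the data is made up; behaviour as described for data.table 1.9.4+):

library(data.table)
DT <- data.table(col = sample(letters, 1e6, TRUE), val = runif(1e6))
DT[col == "a"]            ## first subset builds an index on 'col' automatically
DT[col %in% c("a", "b")]  ## later subsets reuse it via binary search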

Data

This is for the first example I showed in the question section.

dat <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L), name = c("Jane", "Jane", "Jane", "Jane", 
"Jane", "Jane", "Jane", "Jane", "Bob", "Bob", "Bob", "Bob", "Bob", 
"Bob", "Bob", "Bob"), year = c(1980L, 1981L, 1982L, 1983L, 1984L, 
1985L, 1986L, 1987L, 1985L, 1986L, 1987L, 1988L, 1989L, 1990L, 
1991L, 1992L), job = c("Manager", "Manager", "Manager", "Manager", 
"Manager", "Manager", "Boss", "Boss", "Manager", "Manager", "Manager", 
"Boss", "Boss", "Boss", "Boss", "Boss"), job2 = c(1L, 1L, 1L, 
1L, 1L, 1L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L)), .Names = c("id", 
"name", "year", "job", "job2"), class = "data.frame", row.names = c(NA, 
-16L))

Answers

We need to cover at least these aspects to provide a comprehensive answer/comparison (in no particular order of importance): Speed, Memory usage, Syntax and Features.

My intent is to cover each one of these as clearly as possible from a data.table perspective.

Note: unless explicitly mentioned otherwise, by referring to dplyr we refer to dplyr's data.frame interface, whose internals are in C++ using Rcpp.


data.table syntax is consistent in its form: DT[i, j, by]. Keeping i, j and by together is by design. By keeping related operations together, it allows us to easily optimise operations for speed and, more importantly, memory usage, and also provides some powerful features, all while maintaining consistency in syntax.
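As a minimal illustration of that form (toy data, purely for orientation):

library(data.table)
DT <- data.table(x = 1:6, y = 7:12, g = rep(1:2, 3))
DT[x > 2,              ## i: on which rows
   .(total = sum(y)),  ## j: what to compute
   by = g]             ## by: grouped by what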

1. Speed

Quite a few benchmarks (though mostly on grouping operations) have been added to the question already, showing that data.table gets faster than dplyr as the number of groups and/or rows to group by increases, including Matt's benchmarks on grouping from 10 million to 2 billion rows (100GB in RAM) over 100 to 10 million groups and varying grouping columns, which also compare pandas. See also the updated benchmarks, which include Spark and pydatatable as well.
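As a rough illustration of the kind of grouping benchmark being discussed, here is a hedged sketch (sizes are arbitrary and the data is made up; it only shows the shape of such a comparison, not the linked results):

library(data.table)
library(dplyr)
library(microbenchmark)

N <- 1e7; K <- 1e5                                 ## 10 million rows, 100K groups
DT <- data.table(g = sample(K, N, TRUE), v = rnorm(N))
DF <- as.data.frame(DT)

microbenchmark(
  data.table = DT[, sum(v), by = g],
  dplyr      = DF %>% group_by(g) %>% summarise(sum(v)),
  times = 3L
)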

On benchmarks, it would be great to cover these remaining aspects as well:

  • Grouping operations involving a subset of rows - i.e., DT[x > val, sum(y), by = z] type operations.

  • Benchmark other operations such as update and joins.

  • Also benchmark memory footprint for each operation in addition to runtime.

2. Memory usage

  1. Operations involving filter() or slice() in dplyr can be memory inefficient (on both data.frames and data.tables). See this post.

    Note that Hadley's comment talks about speed (that dplyr is plenty fast for him), whereas the major concern here is memory.

  2. data.table interface at the moment allows one to modify/update columns by reference (note that we don't need to re-assign the result back to a variable).

    # sub-assign by reference, updates 'y' in-place
    DT[x >= 1L, y := NA]

    But dplyr will never update by reference. The dplyr equivalent would be (note that the result needs to be re-assigned):

    # copies the entire 'y' column
    ans <- DF %>% mutate(y = replace(y, which(x >= 1L), NA))

    A concern for this is referential transparency. Updating a data.table object by reference, especially within a function, may not always be desirable. But this is an incredibly useful feature: see this and this post for interesting cases. And we want to keep it.

    Therefore we are working towards exporting shallow() function in data.table that will provide the user with both possibilities. For example, if it is desirable to not modify the input data.table within a function, one can then do:

    foo <- function(DT) {
        DT = shallow(DT)          ## shallow copy DT
        DT[, newcol := 1L]        ## does not affect the original DT 
        DT[x > 2L, newcol := 2L]  ## no need to copy (internally), as this column exists only in shallow copied DT
        DT[x > 2L, x := 3L]       ## have to copy (like base R / dplyr does always); otherwise original DT will 
                                  ## also get modified.
    }

    By not using shallow(), the old functionality is retained:

    bar <- function(DT) {
        DT[, newcol := 1L]        ## old behaviour, original DT gets updated by reference
        DT[x > 2L, x := 3L]       ## old behaviour, update column x in original DT.
    }

    By creating a shallow copy using shallow(), we understand that you don't want to modify the original object. We take care of everything internally to ensure that, while also ensuring that the columns you modify are copied only when absolutely necessary. When implemented, this should settle the referential transparency issue altogether while providing the user with both possibilities.

    Also, once shallow() is exported dplyr's data.table interface should avoid almost all copies. So those who prefer dplyr's syntax can use it with data.tables.

    But it will still lack many features that data.table provides, including (sub)-assignment by reference.

  3. Aggregate while joining:

    Suppose you have two data.tables as follows:

    DT1 = data.table(x=c(1,1,1,1,2,2,2,2), y=c("a", "a", "b", "b"), z=1:8, key=c("x", "y"))
    #    x y z
    # 1: 1 a 1
    # 2: 1 a 2
    # 3: 1 b 3
    # 4: 1 b 4
    # 5: 2 a 5
    # 6: 2 a 6
    # 7: 2 b 7
    # 8: 2 b 8
    DT2 = data.table(x=1:2, y=c("a", "b"), mul=4:3, key=c("x", "y"))
    #    x y mul
    # 1: 1 a   4
    # 2: 2 b   3

    And you would like to get sum(z) * mul for each row in DT2 while joining by columns x,y. We can either:

    • 1) aggregate DT1 to get sum(z), 2) perform a join and 3) multiply (or)

      # data.table way
      DT1[, .(z = sum(z)), keyby = .(x,y)][DT2][, z := z*mul][]
      
      # dplyr equivalent
      DF1 %>% group_by(x, y) %>% summarise(z = sum(z)) %>% 
          right_join(DF2) %>% mutate(z = z * mul)
    • 2) do it all in one go (using by = .EACHI feature):

      DT1[DT2, list(z=sum(z) * mul), by = .EACHI]

    What is the advantage?

    • We don't have to allocate memory for the intermediate result.

    • We don't have to group/hash twice (one for aggregation and other for joining).

    • And more importantly, the operation we wanted to perform is clear by looking at j in (2).

    Check this post for a detailed explanation of by = .EACHI. No intermediate results are materialised, and the join+aggregate is performed all in one go.

    Have a look at this, this and this posts for real usage scenarios.

    In dplyr you would have to join and aggregate or aggregate first and then join, neither of which are as efficient, in terms of memory (which in turn translates to speed).

  4. Update and joins:

    Consider the data.table code shown below:

    DT1[DT2, col := i.mul]

    adds/updates DT1's column col with mul from DT2 on those rows where DT2's key column matches DT1. I don't think there is an exact equivalent of this operation in dplyr, i.e., without resorting to a *_join operation, which would have to copy the entire DT1 just to add a new column to it, which is unnecessary.
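    A hedged sketch of the closest dplyr counterpart (column and table names assumed from the example above); note how the *_join materialises a full copy of DF1 just to add one column:

    DF1 <- DF1 %>% left_join(select(DF2, x, y, col = mul), by = c("x", "y"))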

    Check this post for a real usage scenario.

To summarise, it is important to realise that every bit of optimisation matters. As Grace Hopper would say, Mind your nanoseconds!

3. Syntax

Let's now look at syntax. Hadley commented here:

Data tables are extremely fast but I think their concision makes it harder to learn and code that uses it is harder to read after you have written it ...

I find this remark pointless because it is very subjective. What we can perhaps try is to contrast consistency in syntax. We will compare data.table and dplyr syntax side-by-side.

We will work with the dummy data shown below:

DT = data.table(x=1:10, y=11:20, z=rep(1:2, each=5))
DF = as.data.frame(DT)
  1. Basic aggregation/update operations.

    # case (a)
    DT[, sum(y), by = z]                       ## data.table syntax
    DF %>% group_by(z) %>% summarise(sum(y)) ## dplyr syntax
    DT[, y := cumsum(y), by = z]
    ans <- DF %>% group_by(z) %>% mutate(y = cumsum(y))
    
    # case (b)
    DT[x > 2, sum(y), by = z]
    DF %>% filter(x>2) %>% group_by(z) %>% summarise(sum(y))
    DT[x > 2, y := cumsum(y), by = z]
    ans <- DF %>% group_by(z) %>% mutate(y = replace(y, which(x > 2), cumsum(y)))
    
    # case (c)
    DT[, if(any(x > 5L)) y[1L]-y[2L] else y[2L], by = z]
    DF %>% group_by(z) %>% summarise(if (any(x > 5L)) y[1L] - y[2L] else y[2L])
    DT[, if(any(x > 5L)) y[1L] - y[2L], by = z]
    DF %>% group_by(z) %>% filter(any(x > 5L)) %>% summarise(y[1L] - y[2L])
    • data.table syntax is compact and dplyr's quite verbose. Things are more or less equivalent in case (a).

    • In case (b), we had to use filter() in dplyr while summarising. But while updating, we had to move the logic inside mutate(). In data.table however, we express both operations with the same logic - operate on rows where x > 2, but in first case, get sum(y), whereas in the second case update those rows for y with its cumulative sum.

      This is what we mean when we say the DT[i, j, by] form is consistent.

    • Similarly in case (c), when we have an if-else condition, we are able to express the logic "as-is" in both data.table and dplyr. However, if we would like to return just those rows where the if condition is satisfied, and skip otherwise, we cannot use summarise() directly (AFAICT). We have to filter() first and then summarise, because summarise() always expects a single value.

      While it returns the same result, using filter() here makes the actual operation less obvious.

      It might very well be possible to use filter() in the first case as well (does not seem obvious to me), but my point is that we should not have to.

  2. Aggregation / update on multiple columns

    # case (a)
    DT[, lapply(.SD, sum), by = z]                     ## data.table syntax
    DF %>% group_by(z) %>% summarise_each(funs(sum)) ## dplyr syntax
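    ## assume 'cols' below is a character vector of target column names, e.g. cols <- c("x", "y")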
    DT[, (cols) := lapply(.SD, sum), by = z]
    ans <- DF %>% group_by(z) %>% mutate_each(funs(sum))
    
    # case (b)
    DT[, c(lapply(.SD, sum), lapply(.SD, mean)), by = z]
    DF %>% group_by(z) %>% summarise_each(funs(sum, mean))
    
    # case (c)
    DT[, c(.N, lapply(.SD, sum)), by = z]     
    DF %>% group_by(z) %>% summarise_each(funs(n(), sum))
    • In case (a), the codes are more or less equivalent. data.table uses the familiar base function lapply(), whereas dplyr introduces *_each() along with a bunch of functions to funs().

    • data.table's := requires column names to be provided, whereas dplyr generates them automatically.

    • In case (b), dplyr's syntax is relatively straightforward. Improving aggregations/updates on multiple functions is on data.table's list.

    • In case (c) though, dplyr would return n() as many times as there are columns, instead of just once. In data.table, all we need to do is to return a list in j. Each element of the list will become a column in the result. So we can use, once again, the familiar base function c() to concatenate .N to the list that lapply() returns.

    Note: Once again, in data.table, all we need to do is return a list in j. Each element of the list will become a column in the result. You can use the base functions c(), as.list(), lapply(), list() etc. to accomplish this, without having to learn any new functions.

    You will need to learn just the special variables - .N and .SD, at least. The equivalents in dplyr are n() and . (the dot).
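    A minimal side-by-side illustration of .N/.SD and their dplyr counterparts, using the dummy data from above:

    DT[, .(n = .N), by = z]                    ## .N = number of rows in each group
    DF %>% group_by(z) %>% summarise(n = n())
    DT[, head(.SD, 2L), by = z]                ## .SD = the per-group subset of columns
    DF %>% group_by(z) %>% slice(1:2)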

  3. Joins

    dplyr provides separate functions for each type of join, whereas data.table allows joins using the same DT[i, j, by] syntax (and with reason). It also provides an equivalent merge.data.table() function as an alternative.

    setkey(DT1, x, y)
    
    # 1. normal join
    DT1[DT2]            ## data.table syntax
    left_join(DT2, DT1) ## dplyr syntax
    
    # 2. select columns while join    
    DT1[DT2, .(z, i.mul)]
    left_join(select(DT2, x, y, mul), select(DT1, x, y, z))
    
    # 3. aggregate while join
    DT1[DT2, .(sum(z) * i.mul), by = .EACHI]
    DF1 %>% group_by(x, y) %>% summarise(z = sum(z)) %>% 
        inner_join(DF2) %>% mutate(z = z*mul) %>% select(-mul)
    
    # 4. update while join
    DT1[DT2, z := cumsum(z) * i.mul, by = .EACHI]
    ??
    
    # 5. rolling join
    DT1[DT2, roll = -Inf]
    ??
    
    # 6. other arguments to control output
    DT1[DT2, mult = "first"]
    ??
    • Some might find a separate function for each type of join much nicer (left, right, inner, anti, semi, etc.), whereas others might like data.table's DT[i, j, by], or merge(), which is similar to base R.

    • However dplyr joins do just that. Nothing more. Nothing less.

    • data.tables can select columns while joining (2); in dplyr you will need to select() on both data.frames before joining, as shown above. Otherwise you would materialise the join with unnecessary columns only to remove them later, and that is inefficient.

    • data.tables can aggregate while joining (3) and also update while joining (4), using the by = .EACHI feature. Why materialise the entire join result to add/update just a few columns?

    • data.table is capable of rolling joins (5) - roll forward, LOCF, roll backward, NOCB, nearest.

    • data.table also has the mult = argument which selects first, last or all matches (6).

    • data.table has the allow.cartesian = TRUE argument to protect from accidental invalid joins (a sketch follows below).

Once again, the syntax is consistent with DT[i, j, by] with additional arguments allowing for controlling the output further.
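For completeness, a minimal sketch of the allow.cartesian safety valve (toy data; without the explicit opt-in, the join below errors because it returns more rows than either table has):

big <- data.table(x = rep(1:2, each = 3), v = 1:6, key = "x")
idx <- data.table(x = c(1L, 1L, 2L))
big[idx, allow.cartesian = TRUE]  ## 9 result rows > max(nrow(big), nrow(idx))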

  4. do()...

    dplyr's summarise is specially designed for functions that return a single value. If your function returns multiple/unequal values, you will have to resort to do(). You have to know beforehand what all your functions return.

    DT[, list(x[1], y[1]), by = z]                 ## data.table syntax
    DF %>% group_by(z) %>% summarise(x[1], y[1]) ## dplyr syntax
    DT[, list(x[1:2], y[1]), by = z]
    DF %>% group_by(z) %>% do(data.frame(.$x[1:2], .$y[1]))
    
    DT[, quantile(x, 0.25), by = z]
    DF %>% group_by(z) %>% summarise(quantile(x, 0.25))
    DT[, quantile(x, c(0.25, 0.75)), by = z]
    DF %>% group_by(z) %>% do(data.frame(quantile(.$x, c(0.25, 0.75))))
    
    DT[, as.list(summary(x)), by = z]
    DF %>% group_by(z) %>% do(data.frame(as.list(summary(.$x))))
    • .SD's equivalent is .

    • In data.table, you can throw pretty much anything in j - the only thing to remember is for it to return a list so that each element of the list gets converted to a column.

    • In dplyr, you cannot do that. You have to resort to do(), depending on how sure you are as to whether your function will always return a single value. And it is quite slow.

Once again, data.table's syntax is consistent with DT[i, j, by]. We can just keep throwing expressions in j without having to worry about these things.

Have a look at this SO question and this one. I wonder if it would be possible to express the answer as straightforward using dplyr's syntax...

To summarise, I have particularly highlighted several instances where dplyr's syntax is either inefficient, limited or fails to make operations straightforward. This is particularly because data.table gets quite a bit of backlash about its "harder to read/learn" syntax (like the one pasted/linked above). Most posts that cover dplyr talk about the most straightforward operations. And that is great. But it is important to realise its syntax and feature limitations as well, and I am yet to see a post on it.

data.table has its quirks as well (some of which I have pointed out that we are attempting to fix). We are also attempting to improve data.table's joins as I have highlighted here.

But one should also consider the number of features that dplyr lacks in comparison to data.table.

4. Features

I have pointed out most of the features here and also in this post. In addition:

  • fread - fast file reader has been available for a long time now.

  • fwrite - a parallelised fast file writer is now available. See this post for a detailed explanation on the implementation and #1664 for keeping track of further developments.

  • Automatic indexing - another handy feature to optimise base R syntax as is, internally.

  • Ad-hoc grouping: dplyr automatically sorts the results by grouping variables during summarise(), which may not be always desirable.

  • Numerous advantages in data.table joins (for speed / memory efficiency and syntax) mentioned above.

  • Non-equi joins: allows joins using other operators <=, <, >, >=, along with all other advantages of data.table joins (a sketch follows this list).

  • Overlapping range joins were implemented in data.table recently. Check this post for an overview with benchmarks.

  • The setorder() function in data.table allows really fast reordering of data.tables by reference.

  • dplyr provides an interface to databases using the same syntax, which data.table does not at the moment.

  • data.table provides faster equivalents of set operations (written by Jan Gorecki) - fsetdiff, fintersect, funion and fsetequal, with an additional all argument (as in SQL).

  • data.table loads cleanly with no masking warnings and has a mechanism described here for [.data.frame compatibility when passed to any R package. dplyr changes base functions filter, lag and [ which can cause problems; e.g. here and here.
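As promised above, a minimal sketch of a non-equi join (toy data; on= syntax as available in data.table 1.9.8+):

A <- data.table(id = 1:3, lo = c(1, 5, 9), hi = c(4, 8, 12))
B <- data.table(val = c(2, 6, 10))
A[B, on = .(lo <= val, hi >= val)]  ## rows of A whose [lo, hi] range contains each val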


Finally:

  • On databases - there is no reason why data.table cannot provide a similar interface, but this is not a priority now. It might get bumped up if users would very much like that feature... not sure.

  • On parallelism - Everything is difficult, until someone goes ahead and does it. Of course it will take effort (being thread safe).

    • Progress is being made currently (in v1.9.7 devel) towards parallelising known time-consuming parts for incremental performance gains, using OpenMP.
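For reference, a hedged sketch of controlling those OpenMP threads (setDTthreads() and getDTthreads() exist in later data.table releases; treat the exact version as an assumption):

data.table::setDTthreads(4L)  ## cap the threads data.table's parallel internals use
data.table::getDTthreads()    ## check the current setting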

Here's my attempt at a comprehensive answer from the dplyr perspective, following the broad outline of Arun's answer (but somewhat rearranged based on differing priorities).

Syntax

There is some subjectivity to syntax, but I stand by my statement that the concision of data.table makes it harder to learn and harder to read. This is partly because dplyr is solving a much easier problem!

One really important thing that dplyr does for you is that it constrains your options. I claim that most single-table problems can be solved with just five key verbs: filter, select, mutate, arrange and summarise, along with a "by group" adverb. That constraint is a big help when you're learning data manipulation, because it helps structure your thinking about the problem. In dplyr, each of these verbs maps to a single function. Each function does one job, and is easy to understand in isolation.

You create complexity by piping these simple operations together with %>%. Here's an example from one of the posts Arun linked to:

diamonds %>%
  filter(cut != "Fair") %>%
  group_by(cut) %>%
  summarize(
    AvgPrice = mean(price),
    MedianPrice = as.numeric(median(price)),
    Count = n()
  ) %>%
  arrange(desc(Count))

Even if you've never seen dplyr before (or even R!), you can still get the gist of what's happening, because the functions are all English verbs. The disadvantage of English verbs is that they require more typing than [, but I think that can largely be mitigated by better autocomplete.

Here's the equivalent data.table code:

diamondsDT <- data.table(diamonds)
diamondsDT[
  cut != "Fair", 
  .(AvgPrice = mean(price),
    MedianPrice = as.numeric(median(price)),
    Count = .N
  ), 
  by = cut
][ 
  order(-Count) 
]

It's harder to follow this code unless you're already familiar with data.table. (I also couldn't figure out how to indent the repeated [ in a way that looks good to my eye.) Personally, when I look at code I wrote six months ago, it's like looking at code written by a stranger, so I've come to prefer straightforward, if verbose, code.

Two other minor factors that I think decrease readability slightly:

  • Since almost every data table operation uses [, you need additional context to figure out what's happening. For example, is x[y] joining two data tables or extracting columns from a data frame? This is only a small issue, because in well-written code the variable names should suggest what's happening.

  • I like that group_by() is a separate operation in dplyr. It fundamentally changes the computation, so I think it should be obvious when skimming the code, and it's easier to spot group_by() than the by argument to [.data.table.

I also like that the pipe isn't limited to just one package. You can start by tidying your data with tidyr, and finish up with a plot in ggvis. And you're not limited to the packages that I write - anyone can write a function that forms a seamless part of a data manipulation pipe. In fact, I rather prefer the previous data.table code rewritten with %>%:

diamonds %>% 
  data.table() %>% 
  .[cut != "Fair", 
    .(AvgPrice = mean(price),
      MedianPrice = as.numeric(median(price)),
      Count = .N
    ), 
    by = cut
  ] %>% 
  .[order(-Count)]

And the idea of piping with %>% is not limited to just data frames; it is easily generalised to other contexts: interactive web graphics, web scraping, gists, run-time contracts, ...

Memory and performance

I've lumped these together because, to me, they're not that important. Most R users work with well under 1 million rows of data, and dplyr is sufficiently fast for that size of data that you're not aware of processing time. We optimise dplyr for expressiveness on medium data; feel free to use data.table for raw speed on bigger data.

The flexibility of dplyr also means that you can easily tweak performance characteristics using the same syntax. If the performance of dplyr with the data frame backend is not good enough for you, you can use the data.table backend (albeit with a somewhat restricted set of functionality). If the data you're working with doesn't fit in memory, then you can use a database backend.
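A hedged sketch of that backend swap (src_sqlite() and copy_to() as in the dplyr API of that era; treat the exact calls as illustrative):

library(dplyr)
db     <- src_sqlite(tempfile(), create = TRUE)   ## on-disk SQLite backend
tbl_db <- copy_to(db, mtcars, name = "mtcars")
tbl_db %>% group_by(cyl) %>% summarise(avg_mpg = mean(mpg))  ## same verbs, SQL underneath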

All that said, dplyr performance will get better in the long term. We'll definitely implement some of the great ideas of data.table, like radix ordering and using the same index for joins and filters. We're also working on parallelisation so we can take advantage of multiple cores.

Features

A few things we're planning to work on in 2015:

  • the readr package, to make it easy to get files off disk and into memory, analogous to fread().

  • More flexible joins, including support for non-equi joins.

  • More flexible grouping, like bootstrap samples, rollups and more.

I'm also investing time into improving R's database connectors, the ability to talk to web APIs, and making it easier to scrape HTML pages.

In direct answer to the question title...

dplyr definitely does things that data.table cannot.

Your point #3

dplyr abstracts (or will) potential DB interactions

is a direct answer to your own question, but isn't elevated to a high enough level. dplyr is truly an extendable front-end to multiple data storage mechanisms, whereas data.table is an extension to a single one.

Look at dplyr as a backend-agnostic interface, with all of the targets using the same grammar, where you can extend the targets and handlers at will. data.table is, from the dplyr perspective, one of those targets.

You will never (I hope) see the day that data.table attempts to translate your queries into SQL statements that operate with on-disk or networked data stores.

dplyr can possibly do things data.table will not or might not do as well.

Based on its design of working in-memory, data.table could have a much more difficult time extending itself into parallel processing of queries than dplyr.


In response to the in-body questions...

Usage

Are there analytical tasks that are a lot easier to code with one or the other package for people familiar with the packages (i.e. some combination of keystrokes required vs. required level of esotericism, where less of each is a good thing)?

This may seem like a punt, but the real answer is no. People familiar with the tools seem to use either the one most familiar to them or the one that is actually right for the job at hand. With that being said, sometimes you want to present a particular readability, sometimes a particular level of performance, and when you need a high enough level of both you may just need another tool to go along with what you already have, to make clearer abstractions.

Performance

Are there analytical tasks that are performed substantially (i.e. more than 2x) more efficiently in one package vs. another?

Again, no. data.table excels at being efficient in everything it does, whereas dplyr takes on the burden of being limited in some respects to the underlying data store and registered handlers.

This means that when you run into a performance issue with data.table, you can be pretty sure it is in your query function, and if it actually is a bottleneck in data.table, then you've won yourself the joy of filing a report. This is also true when dplyr is using data.table as the backend; you may see some overhead from dplyr, but odds are it is your query.

When dplyr has performance issues with a backend, you can get around them by registering a function for hybrid evaluation, or (in the case of databases) by manipulating the generated query prior to execution.
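For instance, a hedged sketch of inspecting the generated query before it runs (translate_sql() and explain() as in the dplyr API of that era; tbl_db is a hypothetical database-backed table):

translate_sql(mean(mpg))                  ## see how an R expression becomes SQL
tbl_db %>% filter(cyl > 4) %>% explain()  ## print the SQL (and plan) before executing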

Also see the accepted answer to "When is plyr better than data.table?"