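Whenever I want to do something "map"py in R, I usually try to use a function in the apply family.
However, I have never quite understood the differences between them: how {sapply, lapply, etc.} apply the function to the input or grouped input, what the output will look like, or even what the input can be. So I often just go through them all until I get what I want.
Can someone explain how to use which one when?
My current (probably incorrect/incomplete) understanding is...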
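sapply(vec, f): input is a vector. Output is a vector/matrix, where element i is f(vec[i]), giving you a matrix if f has a multi-element output.
lapply(vec, f): same as sapply, but output is a list?
apply(matrix, 1/2, f): input is a matrix. Output is a vector, where element i is f(row/col i of the matrix).
tapply(vector, grouping, f): output is a matrix/array, where an element in the matrix/array is the value of f at a grouping g of the vector, and g gets pushed to the row/col names.
by(dataframe, grouping, f): let g be a grouping. Apply f to each column of the group/data frame. Pretty print the grouping and the value of f at each column.
aggregate(matrix, grouping, f): similar to by, but instead of pretty printing the output, aggregate sticks everything into a data frame.
Side question: I still haven't learned plyr or reshape. Would plyr or reshape replace all of these entirely?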
R has many *apply functions which are ably described in the help files (e.g. ?apply). There are enough of them, though, that beginning useRs may have difficulty deciding which one is appropriate for their situation or even remembering them all. They may have a general sense that "I should be using an *apply function here", but it can be tough to keep them all straight at first.
Despite the fact (noted in other answers) that much of the functionality of the *apply family is covered by the extremely popular plyr package, the base functions remain useful and worth knowing.
This answer is intended to act as a sort of signpost for new useRs to help direct them to the correct *apply function for their particular problem. Note, this is not intended to simply regurgitate or replace the R documentation! The hope is that this answer helps you to decide which *apply function suits your situation and then it is up to you to research it further. With one exception, performance differences will not be addressed.
apply - When you want to apply a function to the rows or columns of a matrix (and higher-dimensional analogues); not generally advisable for data frames as it will coerce to a matrix first.
# Two dimensional matrix
M <- matrix(seq(1,16), 4, 4)
# apply min to rows
apply(M, 1, min)
[1] 1 2 3 4
# apply max to columns
apply(M, 2, max)
[1] 4 8 12 16
# 3 dimensional array
M <- array( seq(32), dim = c(4,4,2))
# Apply sum across each M[*, , ] - i.e Sum across 2nd and 3rd dimension
apply(M, 1, sum)
# Result is one-dimensional
[1] 120 128 136 144
# Apply sum across each M[*, *, ] - i.e Sum across 3rd dimension
apply(M, c(1,2), sum)
# Result is two-dimensional
     [,1] [,2] [,3] [,4]
[1,]   18   26   34   42
[2,]   20   28   36   44
[3,]   22   30   38   46
[4,]   24   32   40   48
If you want row/column means or sums for a 2D matrix, be sure to investigate the highly optimized, lightning-quick colMeans, rowMeans, colSums, rowSums.
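As a rough illustration of the two points above (the dedicated row/column functions, and the coercion that apply performs on data frames), here is a minimal sketch. The objects M2 and df are made-up example data, not part of the original answer.

```r
# Hypothetical example data: a 3 x 4 integer matrix
M2 <- matrix(1:12, nrow = 3, ncol = 4)

# The dedicated functions return the same numbers as the apply() versions
row_sums_apply  <- apply(M2, 1, sum)
row_sums_fast   <- rowSums(M2)
col_means_apply <- apply(M2, 2, mean)
col_means_fast  <- colMeans(M2)

sums_agree  <- isTRUE(all.equal(as.numeric(row_sums_apply), row_sums_fast))
means_agree <- isTRUE(all.equal(col_means_apply, col_means_fast))

# Data frame pitfall: apply() converts the data frame with as.matrix() first,
# so a mix of character and numeric columns becomes all character
df <- data.frame(id = c("a", "b"), value = c(1.5, 2.5), stringsAsFactors = FALSE)
row_classes <- apply(df, 1, class)

print(sums_agree)
print(means_agree)
print(row_classes)
```

For column-wise work on a data frame, lapply or sapply over the columns avoids that conversion, since a data frame is a list of columns.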
lapply - When you want to apply a function to each element of a list in turn and get a list back.
This is the workhorse of many of the other *apply functions. Peel back their code and you will often find lapply underneath.
x <- list(a = 1, b = 1:3, c = 10:100)
lapply(x, FUN = length)
$a
[1] 1
$b
[1] 3
$c
[1] 91
lapply(x, FUN = sum)
$a
[1] 1
$b
[1] 6
$c
[1] 5005
sapply - When you want to apply a function to each element of a list in turn, but you want a vector back, rather than a list.
If you find yourself typing unlist(lapply(...)), stop and consider sapply.
x <- list(a = 1, b = 1:3, c = 10:100)
# Compare with above; a named vector, not a list
sapply(x, FUN = length)
a b c
1 3 91
sapply(x, FUN = sum)
a b c
1 6 5005
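To make the unlist(lapply(...)) remark concrete, the following small check compares the two spellings on the same list x used above. It is only a sketch of the simple case where every result has length one.

```r
x <- list(a = 1, b = 1:3, c = 10:100)

# Two ways to get a named vector of lengths
via_unlist <- unlist(lapply(x, length))
via_sapply <- sapply(x, length)

# In this length-one case both give the same named integer vector
same_result <- identical(via_unlist, via_sapply)
print(via_sapply)
print(same_result)
```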
In more advanced uses of sapply it will attempt to coerce the result to a multi-dimensional array, if appropriate. For example, if our function returns vectors of the same length, sapply will use them as columns of a matrix:
sapply(1:5,function(x) rnorm(3,x))
If our function returns a 2 dimensional matrix, sapply will do essentially the same thing, treating each returned matrix as a single long vector:
sapply(1:5,function(x) matrix(x,2,2))
Unless we specify simplify = "array", in which case it will use the individual matrices to build a multi-dimensional array:
sapply(1:5,function(x) matrix(x,2,2), simplify = "array")
Each of these behaviors is of course contingent on our function returning vectors or matrices of the same length or dimension.
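The shapes produced by these three calls can be checked with dim(). The sketch below swaps rnorm for a deterministic rep so the result is reproducible; the names m, flat, arr and ragged are illustrative only.

```r
# Each call returns a length-3 vector, so the results become columns: 3 x 5
m <- sapply(1:5, function(x) rep(x, 3))

# Each 2 x 2 matrix is flattened to a length-4 vector: 4 x 5
flat <- sapply(1:5, function(x) matrix(x, 2, 2))

# simplify = "array" keeps the matrix shape: 2 x 2 x 5
arr <- sapply(1:5, function(x) matrix(x, 2, 2), simplify = "array")

# When the returned lengths differ, no simplification happens
# and sapply() hands back a plain list
ragged <- sapply(1:3, function(x) seq_len(x))

print(dim(m))
print(dim(flat))
print(dim(arr))
print(is.list(ragged))
```

The last case is the one that tends to surprise people: the return type of sapply depends on what the function happens to return.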
vapply - When you want to use sapply but perhaps need to squeeze some more speed out of your code.
For vapply, you basically give R an example of what sort of thing your function will return, which can save some time coercing returned values to fit in a single atomic vector.
x <- list(a = 1, b = 1:3, c = 10:100)
#Note that since the advantage here is mainly speed, this
# example is only for illustration. We're telling R that
# everything returned by length() should be an integer of
# length 1.
vapply(x, FUN = length, FUN.VALUE = 0L)
a b c
1 3 91
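Beyond speed, the template passed as FUN.VALUE is also checked against every result, which is worth seeing once. The following is a small sketch of that behavior; the variable names are made up for illustration.

```r
x <- list(a = 1, b = 1:3, c = 10:100)

# Template says "one integer per element", which length() satisfies
lens <- vapply(x, FUN = length, FUN.VALUE = integer(1))

# range() returns two values, so it does not fit a length-1 template
# and vapply() stops with an error instead of silently changing shape
mismatch <- tryCatch(
  vapply(x, FUN = range, FUN.VALUE = numeric(1)),
  error = function(e) "error"
)

# On empty input vapply() still returns the promised type,
# whereas sapply() returns an empty list
empty_v <- vapply(list(), length, FUN.VALUE = integer(1))
empty_s <- sapply(list(), length)

print(lens)
print(mismatch)
print(class(empty_v))
print(class(empty_s))
```

This predictability is a common reason to prefer vapply inside functions, where the input cannot be inspected by eye.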
mapply - For when you have several data structures (e.g. vectors, lists) and you want to apply a function to the 1st elements of each, and then the 2nd elements of each, etc., coercing the result to a vector/array as in sapply.
This is multivariate in the sense that your function must accept multiple arguments.
#Sums the 1st elements, the 2nd elements, etc.
mapply(sum, 1:5, 1:5, 1:5)
[1] 3 6 9 12 15
#To do rep(1,4), rep(2,3), etc.
mapply(rep, 1:4, 4:1)
[[1]]
[1] 1 1 1 1
[[2]]
[1] 2 2 2
[[3]]
[1] 3 3
[[4]]
[1] 4
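A few more variations may help show how the arguments are paired up. This is a sketch only: it uses named arguments and MoreArgs (for values that stay fixed across calls), and the result names are invented for the example.

```r
# Element-wise pairing: f(1, 4), f(2, 5), f(3, 6)
sums <- mapply(function(x, y) x + y, 1:3, 4:6)

# Named vectors are matched to the function's parameters by name,
# so this computes rep(x = 4, times = 1), rep(x = 3, times = 2), ...
reps <- mapply(rep, times = 1:4, x = 4:1)

# MoreArgs holds arguments that are the same for every call;
# all results have length 2, so they are simplified to a 2 x 3 matrix
same_len <- mapply(rep, x = 1:3, MoreArgs = list(times = 2))

print(sums)
print(reps)
print(same_len)
```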
Map - A wrapper to mapply with SIMPLIFY = FALSE, so it is guaranteed to return a list.
Map(sum, 1:5, 1:5, 1:5)
[[1]]
[1] 3
[[2]]
[1] 6
[[3]]
[1] 9
[[4]]
[1] 12
[[5]]
[1] 15
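The relationship between the two can be verified directly. The sketch below also shows one convenience that follows from it: when the first vector is a character vector, its values are used as names of the result. The names pairs, via_Map and via_mapply are illustrative.

```r
# Map() and mapply(..., SIMPLIFY = FALSE) give the same list
via_Map    <- Map(sum, 1:5, 1:5, 1:5)
via_mapply <- mapply(sum, 1:5, 1:5, 1:5, SIMPLIFY = FALSE)
same_list  <- identical(via_Map, via_mapply)

# A character first argument supplies the names of the result
pairs <- Map(function(k, v) paste0(k, "=", v), c("a", "b"), c(1, 2))

print(same_list)
print(pairs)
```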
rapply - For when you want to apply a function to each element of a nested list structure, recursively.
To give you some idea of how uncommon rapply is, I forgot about it when first posting this answer! Obviously, I'm sure many people use it, but YMMV. rapply is best illustrated with a user-defined function to apply:
# Append ! to string, otherwise increment
myFun <- function(x){
    if(is.character(x)){
        return(paste(x,"!",sep=""))
    }
    else{
        return(x + 1)
    }
}
#A nested list structure
l <- list(a = list(a1 = "Boo", b1 = 2, c1 = "Eeek"),
          b = 3, c = "Yikes",
          d = list(a2 = 1, b2 = list(a3 = "Hey", b3 = 5)))
# Result is named vector, coerced to character
rapply(l, myFun)
# Result is a nested list like l, with values altered
rapply(l, myFun, how="replace")
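Since the two calls above are shown without their output, here is a self-contained sketch that inspects a few of the resulting elements. It also uses the classes argument of rapply to touch only the character leaves; the names flat, replaced and only_chr are made up for the example.

```r
# Same helper and nested list as above, restated so the block runs on its own
myFun <- function(x) {
  if (is.character(x)) paste(x, "!", sep = "") else x + 1
}
l <- list(a = list(a1 = "Boo", b1 = 2, c1 = "Eeek"),
          b = 3, c = "Yikes",
          d = list(a2 = 1, b2 = list(a3 = "Hey", b3 = 5)))

# Default how = "unlist": one flat named vector, everything coerced to character
flat <- rapply(l, myFun)

# how = "replace": same nesting as l, each leaf replaced by myFun(leaf)
replaced <- rapply(l, myFun, how = "replace")

# classes restricts which leaves are visited; with how = "replace"
# the other leaves are left as they were
only_chr <- rapply(l, function(x) paste0(x, "!"),
                   classes = "character", how = "replace")

print(flat)
print(replaced$d$b2$b3)
print(only_chr$b)
```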
tapply - For when you want to apply a function to subsets of a vector and the subsets are defined by some other vector, usually a factor.
The black sheep of the *apply family, of sorts. The help file's use of the phrase "ragged array" can be a bit confusing, but it is actually quite simple.
A vector:
x <- 1:20
A factor (of the same length!) defining groups:
y <- factor(rep(letters[1:5], each = 4))
Add up the values in x within each subgroup defined by y:
tapply(x, y, sum)
a b c d e
10 26 42 58 74
More complex examples can be handled where the subgroups are defined by the unique combinations of a list of several factors. tapply is similar in spirit to the split-apply-combine functions that are common in R (aggregate, by, ave, ddply, etc.). Hence its black sheep status.
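The "more complex" case and the split-apply-combine connection can be sketched as follows. The second factor z is an invented example, and the split/sapply line is just one way to spell out the same grouping by hand.

```r
x <- 1:20
y <- factor(rep(letters[1:5], each = 4))
# Hypothetical second grouping factor, alternating along x
z <- factor(rep(c("odd", "even"), times = 10))

# One factor: same numbers as splitting by hand and summing each piece
by_y     <- tapply(x, y, sum)
by_split <- sapply(split(x, y), sum)

# A list of two factors gives one cell per combination of levels (5 x 2)
two_way <- tapply(x, list(y, z), sum)

print(by_y)
print(by_split)
print(two_way)
```

Here group "a" holds 1, 2, 3, 4, so its "odd" cell is 1 + 3 and its "even" cell is 2 + 4.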
On the side note, here is how the various plyr functions correspond to the base *apply functions (from the intro to plyr document on the plyr webpage, http://had.co.nz/plyr/):
Base function   Input   Output   plyr function
----------------------------------------------
aggregate       d       d        ddply + colwise
apply           a       a/l      aaply / alply
by              d       l        dlply
lapply          l       l        llply
mapply          a       a/l      maply / mlply
replicate       r       a/l      raply / rlply
sapply          l       a        laply
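One of the goals of plyr is to provide consistent naming conventions for each of the functions, encoding the input and output data types in the function name. It also provides consistency in output, in that output from dlply() is easily passable to ldply() to produce useful output, etc.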
Conceptually, learning plyr is no more difficult than understanding the base *apply functions.
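plyr and reshape functions have replaced almost all of these functions in my everyday use. But, also from the Intro to Plyr document:
Related functions tapply and sweep have no corresponding function in plyr, and remain useful. merge is useful for combining summaries with the original data.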
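For readers who have not met sweep or merge, here is a minimal base-R sketch of the two uses mentioned in that quote. The data (M, df) and the column name v_mean are made up for illustration.

```r
# sweep(): remove a per-column statistic from every column
# (the default operation is subtraction)
M <- matrix(1:6, nrow = 3)
centered <- sweep(M, 2, colMeans(M))

# merge(): attach a per-group summary back onto the original rows
df <- data.frame(g = c("a", "a", "b"), v = c(1, 3, 10))
summ <- aggregate(v ~ g, data = df, FUN = mean)
names(summ)[2] <- "v_mean"
merged <- merge(df, summ, by = "g")

print(centered)
print(merged)
```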
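From slide 21 of http://www.slideshare.net/hadley/plyr-one-data-analytic-strategy:
(Hopefully it's clear that apply corresponds to aaply, aggregate corresponds to ddply, and so on. Slide 20 of the same slideshare will clarify if you don't get it from this image.)
(On the left is input, on the top is output.)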