randomForest(randomForest)
randomForest() belongs to R package: randomForest
Classification and Regression with Random Forest
Translator: 生物统计家园网 (biostatistic.net) robot LoveR
----------Description----------
randomForest implements Breiman's random forest algorithm (based on Breiman and Cutler's original Fortran code) for classification and regression. It can also be used in unsupervised mode for assessing proximities among data points.
----------Usage----------
## S3 method for class 'formula'
randomForest(formula, data=NULL, ..., subset, na.action=na.fail)
## Default S3 method:
randomForest(x, y=NULL, xtest=NULL, ytest=NULL, ntree=500,
             mtry=if (!is.null(y) && !is.factor(y))
             max(floor(ncol(x)/3), 1) else floor(sqrt(ncol(x))),
             replace=TRUE, classwt=NULL, cutoff, strata,
             sampsize = if (replace) nrow(x) else ceiling(.632*nrow(x)),
             nodesize = if (!is.null(y) && !is.factor(y)) 5 else 1,
             maxnodes = NULL,
             importance=FALSE, localImp=FALSE, nPerm=1,
             proximity, oob.prox=proximity,
             norm.votes=TRUE, do.trace=FALSE,
             keep.forest=!is.null(y) && is.null(xtest), corr.bias=FALSE,
             keep.inbag=FALSE, ...)
## S3 method for class 'randomForest'
print(x, ...)
----------Arguments----------
Argument: data
An optional data frame containing the variables in the model. By default the variables are taken from the environment from which randomForest is called.
Argument: subset
An index vector indicating which rows should be used. (NOTE: If given, this argument must be named.)
Argument: na.action
A function to specify the action to be taken if NAs are found. (NOTE: If given, this argument must be named.)
Argument: x, formula
A data frame or a matrix of predictors, or a formula describing the model to be fitted (for the print method, a randomForest object).
Argument: y
A response vector. If a factor, classification is assumed; otherwise regression is assumed. If omitted, randomForest will run in unsupervised mode.
Argument: xtest
A data frame or matrix (like x) containing predictors for the test set.
Argument: ytest
Response for the test set.
Argument: ntree
Number of trees to grow. This should not be set too small, so that every input row gets predicted at least a few times.
Argument: mtry
Number of variables randomly sampled as candidates at each split. Note that the default values are different for classification (sqrt(p), where p is the number of variables in x) and regression (p/3); both are rounded down, with a minimum of 1 for regression.
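As a rough illustration of the mtry defaults, the following sketch fits one classification and one regression forest on the built-in iris and airquality data and reads back the mtry value that was used. The object names (rf_cls, rf_reg, aq) and the small ntree are choices made for this example only.

```r
library(randomForest)

# Classification: iris has p = 4 predictors, so the default is floor(sqrt(4)) = 2.
set.seed(1)
rf_cls <- randomForest(Species ~ ., data = iris, ntree = 50)
rf_cls$mtry

# Regression: after dropping rows with NAs, airquality has p = 5 predictors
# for Ozone, so the default is max(floor(5 / 3), 1) = 1.
aq <- na.omit(airquality)
set.seed(1)
rf_reg <- randomForest(Ozone ~ ., data = aq, ntree = 50)
rf_reg$mtry
```

With so few predictors the regression default collapses to a single candidate variable per split, which is one reason mtry is often set explicitly for small regression problems.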
Argument: replace
Should sampling of cases be done with or without replacement?
Argument: classwt
Priors of the classes. Need not add up to one. Ignored for regression.
Argument: cutoff
(Classification only) A vector of length equal to the number of classes. The "winning" class for an observation is the one with the maximum ratio of proportion of votes to cutoff. Default is 1/k where k is the number of classes (i.e., majority vote wins).
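The cutoff rule can be worked through by hand. The sketch below uses made-up vote fractions for a single observation (the values and the class names a, b, c are hypothetical) and applies the "maximum ratio of votes to cutoff" rule in plain R; it does not call the package itself.

```r
# Hypothetical vote fractions for one observation over three classes.
votes <- c(a = 0.50, b = 0.30, c = 0.20)

# Default cutoff 1/k: the ratios are proportional to the votes, so "a" wins.
cutoff_default <- rep(1 / 3, 3)
winner_default <- names(which.max(votes / cutoff_default))

# A custom cutoff that makes class "a" harder to win:
# the ratios are 0.50/0.60, 0.30/0.20 and 0.20/0.20, so "b" wins.
cutoff_custom <- c(0.60, 0.20, 0.20)
winner_custom <- names(which.max(votes / cutoff_custom))

winner_default
winner_custom
```

Raising the cutoff for a class therefore demands a larger share of votes before that class is predicted, which can be useful when classes are unbalanced.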
Argument: strata
A (factor) variable that is used for stratified sampling.
Argument: sampsize
Size(s) of sample to draw. For classification, if sampsize is a vector whose length is the number of strata, then sampling is stratified by strata, and the elements of sampsize indicate the numbers to be drawn from the strata.
Argument: nodesize
Minimum size of terminal nodes. Setting this number larger causes smaller trees to be grown (and thus take less time). Note that the default values are different for classification (1) and regression (5).
Argument: maxnodes
Maximum number of terminal nodes that trees in the forest can have. If not given, trees are grown to the maximum possible size (subject to limits by nodesize). If set larger than the maximum possible, a warning is issued.
Argument: importance
Should importance of predictors be assessed?
Argument: localImp
Should casewise importance measure be computed? (Setting this to TRUE will override importance.)
Argument: nPerm
Number of times the OOB data are permuted per tree for assessing variable importance. A number larger than 1 gives a slightly more stable estimate, but is not very effective. Currently only implemented for regression.
Argument: proximity
Should proximity measure among the rows be calculated?
Argument: oob.prox
Should proximity be calculated only on “out-of-bag” data?
Argument: norm.votes
If TRUE (default), the final result of votes is expressed as fractions. If FALSE, raw vote counts are returned (useful for combining results from different runs). Ignored for regression.
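To make the effect of norm.votes concrete, the sketch below fits the same iris forest twice, once with the default and once with norm.votes=FALSE, and inspects the votes component. The object names and ntree=200 are choices for this example; unclass() is used only to drop the class attribute before summing.

```r
library(randomForest)

set.seed(7)
rf_frac <- randomForest(Species ~ ., data = iris, ntree = 200)
set.seed(7)
rf_raw <- randomForest(Species ~ ., data = iris, ntree = 200,
                       norm.votes = FALSE)

# With norm.votes=TRUE each row of votes holds fractions that sum to 1.
frac_sums <- rowSums(unclass(rf_frac$votes))
range(frac_sums)

# With norm.votes=FALSE the entries are whole-number OOB vote counts.
raw_votes <- unclass(rf_raw$votes)
head(raw_votes, 3)
```

Raw counts are the form to keep when results from separate runs are to be pooled later, since fractions from runs with different numbers of trees cannot simply be averaged.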
Argument: do.trace
If set to TRUE, give a more verbose output as randomForest is run. If set to some integer, then running output is printed for every do.trace trees.
Argument: keep.forest
If set to FALSE, the forest will not be retained in the output object. If xtest is given, defaults to FALSE.
Argument: corr.bias
Perform bias correction for regression? Note: Experimental. Use at your own risk.
Argument: keep.inbag
Should an n by ntree matrix be returned that keeps track of which samples are “in-bag” in which trees (but not how many times, if sampling with replacement)?
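A small sketch of what keep.inbag=TRUE returns, using iris and a deliberately small forest. The inbag component name and the reading of a zero entry as "out-of-bag for that tree" follow the package as commonly distributed; treat the exact encoding of non-zero entries as version dependent.

```r
library(randomForest)

set.seed(3)
rf_bag <- randomForest(Species ~ ., data = iris, ntree = 25,
                       keep.inbag = TRUE)

# One row per case, one column per tree.
dim(rf_bag$inbag)

# A zero entry means the case was out-of-bag for that tree.
oob_per_case <- rowSums(rf_bag$inbag == 0)
head(oob_per_case)
```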
Argument: ...
Optional parameters to be passed to the low level function randomForest.default.
----------Value----------
An object of class randomForest, which is a list with the following components:
Component: call
The original call to randomForest.
Component: type
One of regression, classification, or unsupervised.
Component: predicted
The predicted values of the input data based on out-of-bag samples.
Component: importance
A matrix with nclass + 2 (for classification) or two (for regression) columns. For classification, the first nclass columns are the class-specific measures computed as mean decrease in accuracy. The nclass + 1st column is the mean decrease in accuracy over all classes. The last column is the mean decrease in Gini index. For regression, the first column is the mean decrease in accuracy and the second the mean decrease in MSE. If importance=FALSE, the last measure is still returned as a vector.
Component: importanceSD
The “standard errors” of the permutation-based importance measure. For classification, a p by nclass + 1 matrix corresponding to the first nclass + 1 columns of the importance matrix. For regression, a length p vector.
Component: localImp
A p by n matrix containing the casewise importance measures, the [i,j] element of which is the importance of the i-th variable on the j-th case. NULL if localImp=FALSE.
Component: ntree
Number of trees grown.
Component: mtry
Number of predictors sampled for splitting at each node.
Component: forest
A list that contains the entire forest; NULL if randomForest is run in unsupervised mode or if keep.forest=FALSE.
Component: err.rate
(classification only) vector of error rates of the prediction on the input data, the i-th element being the (OOB) error rate for all trees up to the i-th.
Component: confusion
(classification only) the confusion matrix of the prediction (based on OOB data).
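The sketch below shows one way to read the OOB error and the confusion matrix from a fitted classification forest. It assumes the layout found in commonly distributed versions of the package, where err.rate is stored as a matrix with one row per tree, an "OOB" column and one column per class; the object names are chosen for this example.

```r
library(randomForest)

set.seed(11)
rf <- randomForest(Species ~ ., data = iris, ntree = 100)

# err.rate has one row per tree; the "OOB" column is the cumulative OOB error.
dim(rf$err.rate)
final_oob <- rf$err.rate[rf$ntree, "OOB"]
final_oob

# confusion: rows are observed classes, columns are OOB-predicted classes,
# followed by a class.error column.
rf$confusion
```

Plotting rf$err.rate[, "OOB"] against the tree index is a quick way to judge whether ntree is large enough for the error to have stabilised.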
Component: votes
(classification only) a matrix with one row for each input data point and one column for each class, giving the fraction or number of (OOB) "votes" from the random forest.
Component: oob.times
Number of times cases are "out-of-bag" (and thus used in computing the OOB error estimate).
Component: proximity
If proximity=TRUE when randomForest is called, a matrix of proximity measures among the input (based on the frequency that pairs of data points are in the same terminal nodes).
Component: mse
(regression only) vector of mean square errors: sum of squared residuals divided by n.
Component: rsq
(regression only) “pseudo R-squared”: 1 - mse / Var(y).
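The two regression summaries can be recomputed by hand from the OOB predictions. The sketch below does this for the airquality data; it assumes that every case is out-of-bag at least once (very likely with 300 trees) and that Var(y) in the rsq formula uses the divisor n rather than n - 1, which is an assumption about the implementation rather than something stated on this page.

```r
library(randomForest)

aq <- na.omit(airquality)
set.seed(131)
rf <- randomForest(Ozone ~ ., data = aq, ntree = 300)

# OOB mean squared error after all trees, recomputed from the OOB predictions.
n <- nrow(aq)
mse_manual <- sum((aq$Ozone - rf$predicted)^2) / n
c(stored = rf$mse[rf$ntree], manual = mse_manual)

# Pseudo R-squared; assumption: Var(y) uses the divisor n, not n - 1.
var_y <- mean((aq$Ozone - mean(aq$Ozone))^2)
rsq_manual <- 1 - mse_manual / var_y
c(stored = rf$rsq[rf$ntree], manual = rsq_manual)
```

Because mse and rsq are vectors with one entry per tree, the last element is the value for the full forest.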
Component: test
If a test set is given (through the xtest or additionally ytest arguments), this component is a list which contains the corresponding predicted, err.rate, confusion, votes (for classification) or predicted, mse and rsq (for regression) for the test set. If proximity=TRUE, there is also a component, proximity, which contains the proximity among the test set as well as proximity between test and training data.
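As a minimal sketch of the test component, the code below holds out 50 iris rows, passes them through xtest and ytest, and inspects what is stored. The 100/50 split, the seed and the object names are arbitrary choices for this example.

```r
library(randomForest)

set.seed(42)
train <- sample(nrow(iris), 100)

rf <- randomForest(x = iris[train, 1:4], y = iris$Species[train],
                   xtest = iris[-train, 1:4], ytest = iris$Species[-train],
                   ntree = 100)

# Components stored for the 50 held-out rows.
names(rf$test)
length(rf$test$predicted)
rf$test$confusion

# keep.forest defaults to FALSE when xtest is supplied, so no forest is kept.
is.null(rf$forest)
```

If the fitted forest is also needed for later calls to predict, keep.forest=TRUE has to be set explicitly alongside xtest.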
----------Note----------
The forest structure is slightly different between classification and regression. For details on how the trees are stored, see the help page for getTree.
If xtest is given, prediction of the test set is done “in place” as the trees are grown. If ytest is also given, and do.trace is set to some positive integer, then for every do.trace trees, the test set error is printed. Results for the test set are returned in the test component of the resulting randomForest object. For classification, the votes component (for training or test set data) contains the votes the cases received for the classes. If norm.votes=TRUE, the fraction is given, which can be taken as predicted probabilities for the classes.
For large data sets, especially those with a large number of variables, calling randomForest via the formula interface is not advised: there may be too much overhead in handling the formula.
The “local” (or casewise) variable importance is computed as follows: For classification, it is the increase in percent of times a case is OOB and misclassified when the variable is permuted. For regression, it is the average increase in squared OOB residuals when the variable is permuted.
----------Author(s)----------
Andy Liaw <andy_liaw@merck.com> and Matthew Wiener <matthew_wiener@merck.com>, based on original Fortran code by Leo Breiman and Adele Cutler.
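----------References----------
Breiman, L. (2001), Random Forests, Machine Learning 45(1), 5-32.
Breiman, L. (2002), “Manual On Setting Up, Using, And Understanding Random Forests V3.1”, http://oz.berkeley.edu/users/breiman/Using_random_forests_V3.1.pdf.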
----------See Also----------
predict.randomForest, varImpPlot
----------Examples----------
library(randomForest)
## Classification:
## data(iris)
set.seed(71)
iris.rf <- randomForest(Species ~ ., data=iris, importance=TRUE,
                        proximity=TRUE)
print(iris.rf)
## Look at variable importance:
round(importance(iris.rf), 2)
## Do MDS on 1 - proximity:
iris.mds <- cmdscale(1 - iris.rf$proximity, eig=TRUE)
op <- par(pty="s")
pairs(cbind(iris[,1:4], iris.mds$points), cex=0.6, gap=0,
      col=c("red", "green", "blue")[as.numeric(iris$Species)],
      main="Iris Data: Predictors and MDS of Proximity Based on RandomForest")
par(op)
print(iris.mds$GOF)
## The 'unsupervised' case:
set.seed(17)
iris.urf <- randomForest(iris[, -5])
MDSplot(iris.urf, iris$Species)
## Stratified sampling: draw 20, 30, and 20 of the species to grow each tree.
(iris.rf2 <- randomForest(iris[1:4], iris$Species,
                          sampsize=c(20, 30, 20)))
## Regression:
## data(airquality)
set.seed(131)
ozone.rf <- randomForest(Ozone ~ ., data=airquality, mtry=3,
                         importance=TRUE, na.action=na.omit)
print(ozone.rf)
## Show "importance" of variables: a higher value means more important:
round(importance(ozone.rf), 2)
## "x" can be a matrix instead of a data frame:
set.seed(17)
x <- matrix(runif(5e2), 100)
y <- gl(2, 50)
(myrf <- randomForest(x, y))
(predict(myrf, x))
## "complicated" formula:
(swiss.rf <- randomForest(sqrt(Fertility) ~ . - Catholic + I(Catholic < 50),
                          data=swiss))
(predict(swiss.rf, swiss))
## Test use of 32-level factor as a predictor:
set.seed(1)
x <- data.frame(x1=gl(32, 5), x2=runif(160), y=rnorm(160))
(rf1 <- randomForest(x[-3], x[[3]], ntree=10))
## Grow no more than 4 nodes per tree:
(treesize(randomForest(Species ~ ., data=iris, maxnodes=4, ntree=30)))