varSelRF(varSelRF)
varSelRF() is part of the R package varSelRF
Variable selection from random forests using OOB error
----------Description----------
Using the OOB error as the minimization criterion, carry out variable elimination from a random forest by successively eliminating the least important variables (with importance as returned by the random forest).
----------Usage----------
varSelRF(xdata, Class, c.sd = 1, mtryFactor = 1, ntree = 5000,
         ntreeIterat = 2000, vars.drop.num = NULL, vars.drop.frac = 0.2,
         whole.range = TRUE, recompute.var.imp = FALSE, verbose = FALSE,
         returnFirstForest = TRUE, fitted.rf = NULL, keep.forest = FALSE)
----------Arguments----------
Argument: xdata
A data frame or matrix, with subjects/cases in rows and variables in columns. NAs are not allowed.
Argument: Class
The dependent variable; must be a factor.
Argument: c.sd
The factor that multiplies the standard deviation (sd) of the OOB error when deciding whether to stop the iterations or when choosing the final solution. See the references for details.
Argument: mtryFactor
The multiplication factor of sqrt(number.of.variables) that gives the number of variables to use for the mtry argument of randomForest.
Argument: ntree
The number of trees to use for the first forest; same as ntree for randomForest.
Argument: ntreeIterat
The number of trees to use (ntree of randomForest) for all additional forests.
Argument: vars.drop.num
The number of variables to exclude at each iteration.
Argument: vars.drop.frac
The fraction of variables, from those in the previous forest, to exclude at each iteration.
Argument: whole.range
If TRUE, continue dropping variables until a forest with only two variables is built, and choose the best model from the complete series of models. If FALSE, stop the iterations if the current OOB error becomes larger than the initial OOB error (plus c.sd * OOB standard error) or if the current OOB error becomes larger than the previous OOB error (plus c.sd * OOB standard error).
Argument: recompute.var.imp
If TRUE, recompute variable importances at each new iteration.
Argument: verbose
If TRUE, give more information about what is being done.
Argument: returnFirstForest
If TRUE, the random forest fitted to the complete set of variables is returned.
Argument: fitted.rf
An (optional) previously fitted object of class randomForest. In this case, the ntree and mtryFactor arguments are obtained from the fitted object, not from the arguments to this function.
Argument: keep.forest
Same argument as in the randomForest function. If the forest is kept, it will be returned as part of the "rf.model" component of the output. Beware that setting this to TRUE can lead to very large memory consumption.
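To make mtryFactor and vars.drop.frac more concrete, the following base-R sketch computes the forest sizes that a given drop fraction would produce, and the mtry value implied by a given mtryFactor. The helpers drop_schedule and mtry_from_factor are hypothetical and not part of varSelRF. The rounding choices (round the drop count, drop at least one variable, floor mtry) are assumptions made for illustration; the package's internal rounding may differ.

```r
# Hypothetical helper (not part of varSelRF): number of variables in each
# successive forest when a fraction `frac` of the remaining variables is dropped.
# Assumption: the drop count is rounded and at least one variable is dropped.
drop_schedule <- function(n_vars, frac = 0.2, min_vars = 2) {
  sizes <- n_vars
  while (n_vars > min_vars) {
    n_drop <- max(1, round(n_vars * frac))
    n_vars <- max(min_vars, n_vars - n_drop)
    sizes <- c(sizes, n_vars)
  }
  sizes
}

# Hypothetical helper: mtry implied by mtryFactor.
# Assumption: the product is floored and never falls below 1.
mtry_from_factor <- function(n_vars, mtryFactor = 1) {
  max(1, floor(sqrt(n_vars) * mtryFactor))
}

schedule <- drop_schedule(30, frac = 0.2)
print(schedule)
print(mtry_from_factor(30, mtryFactor = 1))
```

Under these assumptions, 30 variables with frac = 0.2 give forests of 30, 24, 19, 15, 12, 10, 8, 6, 5, 4, 3 and 2 variables: the steps are large at first and shrink as fewer variables remain.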
----------Details----------
With the default parameters, we examine all forests that result from iteratively eliminating a fraction, vars.drop.frac, of the least important variables used in the previous iteration. By default, vars.drop.frac = 0.2, which allows for relatively fast operation, is coherent with the idea of an "aggressive variable selection" approach, and increases the resolution as the number of variables considered becomes smaller. By default, we do not recalculate variable importances at each step (recompute.var.imp = FALSE), as Svetnik et al. (2004) mention severe overfitting resulting from recalculating variable importances. After fitting all forests, we examine the OOB error rates from all the fitted random forests. We choose the solution with the smallest number of variables (genes) whose error rate is within c.sd standard errors of the minimum error rate of all forests. (The standard error is calculated using the expression for a binomial error count, sqrt(p * (1 - p) / N), where p is the error rate and N the number of cases.) Setting c.sd = 0 is the same as selecting the set of genes that leads to the smallest error rate. Setting c.sd = 1 is similar to the common "1 s.e. rule" used in the classification trees literature; this strategy can lead to solutions with fewer genes than selecting the solution with the smallest error rate, while achieving an error rate that is not different, within sampling error, from the "best solution".
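The selection rule described above can be sketched in base R. The selection history below uses made-up numbers purely for illustration, and select_nvars is a hypothetical helper rather than the package's own code; it only follows the rule as stated in the text (smallest model whose OOB error is within c.sd standard errors of the minimum).

```r
# Hypothetical selection history (made-up error rates, for illustration only)
sel_hist <- data.frame(
  Number.Variables = c(30, 24, 19, 15, 12, 10, 8, 6, 5, 4, 3, 2),
  OOB = c(0.20, 0.16, 0.16, 0.12, 0.12, 0.08, 0.12, 0.12, 0.16, 0.20, 0.24, 0.28)
)
n_cases <- 25  # assumed number of subjects

# Binomial standard error of each error rate: sqrt(p * (1 - p) / N)
sel_hist$sd.OOB <- sqrt(sel_hist$OOB * (1 - sel_hist$OOB) / n_cases)

# Hypothetical helper: smallest number of variables whose OOB error is
# within c.sd standard errors of the minimum OOB error.
select_nvars <- function(sel_hist, c.sd = 1) {
  best <- which.min(sel_hist$OOB)
  threshold <- sel_hist$OOB[best] + c.sd * sel_hist$sd.OOB[best]
  candidates <- sel_hist[sel_hist$OOB <= threshold, ]
  min(candidates$Number.Variables)
}

print(select_nvars(sel_hist, c.sd = 0))  # model with the minimum error rate
print(select_nvars(sel_hist, c.sd = 1))  # "1 s.e. rule"
```

With these made-up numbers, c.sd = 0 picks the 10-variable forest (the minimum error rate), while c.sd = 1 accepts the smaller 6-variable forest because its error rate lies within one standard error of the minimum.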
The use of ntree = 5000 and ntreeIterat = 2000 is discussed in more detail in the references. Essentially, more iterations rarely seemed to lead to improved solutions (across 9 different microarray data sets).
The measure of variable importance used is based on the decrease in classification accuracy when the values of a variable in a node of a tree are permuted randomly (see references); we use the unscaled version (see our paper and supplementary material).
----------Value----------
An object of class "varSelRF": a list with components:
selec.history: A data frame where the selection history is stored. The components are:
    Number.Variables: The number of variables examined.
    Vars.in.Forest: The actual variables that were in the forest at that stage.
    OOB: Out-of-bag error rate.
    sd.OOB: Standard deviation of the error rate.
rf.model: The final, selected, random forest (only if whole.range = FALSE). (If you set whole.range = TRUE, the final model always contains exactly two variables. This is unlikely to be the forest that interests you.)
selected.vars: The variables finally selected.
selected.model: Same as above, but ordered alphabetically and concatenated with a "+" for easier display.
best.model.nvars: The number of variables in the finally selected model.
initialImportance: The importances of variables, before any variable deletion.
initialOrderedImportances: Same as above, but ordered by decreasing importance.
ntree: The ntree argument.
ntreeIterat: The ntreeIterat argument.
mtryFactor: The mtryFactor argument.
firstForest: The first forest (before any variable selection) fitted.
----------Author(s)----------
Ramon Diaz-Uriarte <rdiaz02@gmail.com>
----------References----------
Breiman, L. (2001) Random forests. Machine Learning, 45, 5–32.
Diaz-Uriarte, R. and Alvarez de Andres, S. (2005) Variable selection from random forests: application to gene expression data. Tech. report. http://ligarto.org/rdiaz/Papers/rfVS/randomForestVarSel.html
Svetnik, V., Liaw, A., Tong, C. and Wang, T. (2004) Application of Breiman's random forest to modeling structure-activity relationships of pharmaceutical molecules. Pp. 334-343 in F. Roli, J. Kittler, and T. Windeatt (eds.). Multiple Classifier Systems, Fifth International Workshop, MCS 2004, Proceedings, 9-11 June 2004, Cagliari, Italy. Lecture Notes in Computer Science, vol. 3077. Berlin: Springer.
----------See Also----------
randomForest, plot.varSelRF
----------Examples----------
library(varSelRF)
set.seed(1)
x <- matrix(rnorm(25 * 30), ncol = 30)
colnames(x) <- paste("v", 1:30, sep = "")
x[1:10, 1:2] <- x[1:10, 1:2] + 1
x[1:4, 5] <- x[1:4, 5] - 1.5
x[5:10, 8] <- x[5:10, 8] + 1.4
cl <- factor(c(rep("A", 10), rep("B", 15)))
rf.vs1 <- varSelRF(x, cl, ntree = 500, ntreeIterat = 300,
                   vars.drop.frac = 0.2)
rf.vs1
plot(rf.vs1)
#### Using the final, fitted model to predict other data
## Simulate new data
set.seed(2)
x.new <- matrix(rnorm(25 * 30), ncol = 30)
colnames(x.new) <- paste("v", 1:30, sep = "")
x.new[1:10, 1:2] <- x.new[1:10, 1:2] + 1
x.new[1:10, 5] <- x.new[1:10, 5] - 0.5
## Fit with whole.range = FALSE and keep.forest = TRUE
set.seed(3)
rf.vs2 <- varSelRF(x, cl, ntree = 3000, ntreeIterat = 2000,
                   vars.drop.frac = 0.3, whole.range = FALSE,
                   keep.forest = TRUE)
## To obtain predictions from a data set, you must specify the
## same variables as those used in the final model
rf.vs2$selected.vars
predict(rf.vs2$rf.model,
        newdata = subset(x.new, select = rf.vs2$selected.vars))
predict(rf.vs2$rf.model,
        newdata = subset(x.new, select = rf.vs2$selected.vars),
        type = "prob")
## If you had not kept the forest (keep.forest) you could also try
randomForest(y = cl, x = subset(x, select = rf.vs2$selected.vars),
             ntree = rf.vs2$ntreeIterat,
             xtest = subset(x, select = rf.vs2$selected.vars))$test
## but here the forest is built new (with only the selected variables)
## so results need not be the same
## CAUTION: You will NOT want this (these are similar to resubstitution
## predictions)
predict(rf.vs2$rf.model, newdata = subset(x, select = rf.vs2$selected.vars))
## nor these (read help of predict.randomForest for why these
## predictions are different from those from previous command)
predict(rf.vs2$rf.model)
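Because predictions require the new data to carry the same variables as the final model, it can help to check the column names before calling predict. The following base-R sketch does this with a hypothetical helper, select_model_columns, which is not part of varSelRF or randomForest. The vector selected stands in for rf.vs2$selected.vars and is made up for illustration.

```r
# Hypothetical helper (not part of varSelRF): keep only the selected variables,
# in the order given, and fail early if any of them is missing from newdata.
select_model_columns <- function(newdata, vars) {
  missing_vars <- setdiff(vars, colnames(newdata))
  if (length(missing_vars) > 0) {
    stop("newdata lacks selected variables: ",
         paste(missing_vars, collapse = ", "))
  }
  newdata[, vars, drop = FALSE]
}

set.seed(2)
x.new <- matrix(rnorm(25 * 30), ncol = 30)
colnames(x.new) <- paste("v", 1:30, sep = "")

selected <- c("v2", "v1")  # stand-in for rf.vs2$selected.vars
x.sel <- select_model_columns(x.new, selected)
print(dim(x.sel))
```

The result could then be passed as newdata to predict in place of the subset(...) call used above; drop = FALSE keeps a matrix even when a single variable is selected.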