RRF(RRF)
RRF() belongs to R package: RRF
Feature Selection with Regularized Random Forest
Translator: 生物统计家园网, robot LoveR
----------Description----------
RRF implements the regularized random forest algorithm. It is based on the randomForest R package by Andy Liaw, Matthew Wiener, Leo Breiman and Adele Cutler.
----------Usage----------
## S3 method for class 'formula'
RRF(formula, data=NULL, ..., subset, na.action=na.fail)
## Default S3 method:
RRF(x, y=NULL, xtest=NULL, ytest=NULL, ntree=500,
    mtry=if (!is.null(y) && !is.factor(y))
    max(floor(ncol(x)/3), 1) else floor(sqrt(ncol(x))),
    replace=TRUE, classwt=NULL, cutoff, strata,
    sampsize = if (replace) nrow(x) else ceiling(.632*nrow(x)),
    nodesize = if (!is.null(y) && !is.factor(y)) 5 else 1,
    maxnodes = NULL,
    importance=FALSE, localImp=FALSE, nPerm=1,
    proximity, oob.prox=proximity,
    norm.votes=TRUE, do.trace=FALSE,
    keep.forest=!is.null(y) && is.null(xtest), corr.bias=FALSE,
    keep.inbag=FALSE, coefReg=0.8, flagReg=1, ...)
## S3 method for class 'RRF'
print(x, ...)
----------Arguments----------
Argument: data
An optional data frame containing the variables in the model. By default the variables are taken from the environment from which RRF is called.
Argument: subset
An index vector indicating which rows should be used. (NOTE: If given, this argument must be named.)
Argument: na.action
A function to specify the action to be taken if NAs are found. (NOTE: If given, this argument must be named.)
Argument: x, formula
A data frame or a matrix of predictors, or a formula describing the model to be fitted (for the print method, an RRF object).
Argument: y
A response vector. If a factor, classification is assumed, otherwise regression is assumed. If omitted, RRF will run in unsupervised mode.
Argument: xtest
A data frame or matrix (like x) containing predictors for the test set.
Argument: ytest
Response for the test set.
Argument: ntree
Number of trees to grow. This should not be set to too small a number, to ensure that every input row gets predicted at least a few times.
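A rough sketch of why a very small ntree is risky: a row only receives an out-of-bag (OOB) prediction from trees that did not sample it. The numbers below assume the default bootstrap (replace=TRUE with sampsize equal to nrow(x)); the variable names and the values of n are illustrative, not part of RRF.

```r
# Assumption: each tree draws n rows with replacement (the default when
# replace=TRUE), so a given row is left out of one tree with probability
# (1 - 1/n)^n, which approaches exp(-1) for large n.
n     <- 200   # number of input rows (illustrative)
ntree <- 500   # default number of trees

p_oob        <- (1 - 1 / n)^n   # chance a row is out-of-bag for one tree
expected_oob <- ntree * p_oob   # expected number of OOB predictions per row

# Probability that a row is never out-of-bag, i.e. gets no OOB prediction
p_never_small <- (1 - p_oob)^5      # with only 5 trees
p_never_large <- (1 - p_oob)^ntree  # with the default 500 trees

print(c(p_oob = p_oob, expected_oob = expected_oob))
print(c(five_trees = p_never_small, default = p_never_large))
```

With only a handful of trees a noticeable share of rows may never be out-of-bag, whereas with the default ntree that probability is negligible.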
Argument: mtry
Number of variables randomly sampled as candidates at each split. Note that the default values are different for classification (sqrt(p), where p is the number of variables in x) and regression (p/3).
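The default can be reproduced by hand. The helper below is hypothetical (it is not exported by RRF); it simply mirrors the default mtry expression shown in the Usage section, where regression means a non-NULL, non-factor y.

```r
# Hypothetical helper (not part of RRF) mirroring the default mtry
# expression from the Usage section.
default_mtry <- function(p, regression = FALSE) {
  if (regression) max(floor(p / 3), 1) else floor(sqrt(p))
}

default_mtry(500)                     # classification: floor(sqrt(500))
default_mtry(500, regression = TRUE)  # regression: floor(500 / 3)
default_mtry(2,   regression = TRUE)  # never drops below 1
```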
Argument: replace
Should sampling of cases be done with or without replacement?
Argument: classwt
Priors of the classes. Need not add up to one. Ignored for regression.
Argument: cutoff
(Classification only) A vector of length equal to the number of classes. The "winning" class for an observation is the one with the maximum ratio of proportion of votes to cutoff. Default is 1/k where k is the number of classes (i.e., majority vote wins).
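The ratio rule can be illustrated in a few lines of base R. This is a minimal sketch with made-up vote fractions; winning_class is a hypothetical helper, not an RRF function, and it does not attempt to reproduce how the package resolves ties.

```r
# Illustrative vote fractions for one observation over three classes
votes <- c(a = 0.40, b = 0.35, c = 0.25)

# Hypothetical helper: pick the class with the largest votes/cutoff ratio.
# The default cutoff is 1/k for each of the k classes.
winning_class <- function(votes,
                          cutoff = rep(1 / length(votes), length(votes))) {
  names(votes)[which.max(votes / cutoff)]
}

winning_class(votes)                             # equal cutoffs: most votes wins
winning_class(votes, cutoff = c(0.5, 0.3, 0.2))  # lower cutoffs favour b and c
```

Lowering the cutoff for a class makes it easier for that class to win, which may help when classes are imbalanced.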
Argument: strata
A (factor) variable that is used for stratified sampling.
Argument: sampsize
Size(s) of sample to draw. For classification, if sampsize is a vector whose length equals the number of strata, then sampling is stratified by strata, and the elements of sampsize indicate the numbers to be drawn from the strata.
Argument: nodesize
Minimum size of terminal nodes. Setting this number larger causes smaller trees to be grown (and thus take less time). Note that the default values are different for classification (1) and regression (5).
Argument: maxnodes
Maximum number of terminal nodes that trees in the forest can have. If not given, trees are grown to the maximum possible (subject to limits by nodesize). If set larger than the maximum possible, a warning is issued.
Argument: importance
Should importance of predictors be assessed?
Argument: localImp
Should casewise importance measure be computed? (Setting this to TRUE will override importance.)
Argument: nPerm
Number of times the OOB data are permuted per tree for assessing variable importance. A number larger than 1 gives a slightly more stable estimate, but is not very effective. Currently only implemented for regression.
Argument: proximity
Should proximity measure among the rows be calculated?
Argument: oob.prox
Should proximity be calculated only on "out-of-bag" data?
Argument: norm.votes
If TRUE (default), the final votes are expressed as fractions. If FALSE, raw vote counts are returned (useful for combining results from different runs). Ignored for regression.
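As a sketch of why raw counts are convenient for combining runs: counts from separate runs can be added and normalized afterwards, whereas fractions cannot simply be averaged when the runs contribute different numbers of votes. The matrices below are made up for illustration and do not come from an actual RRF fit.

```r
# Made-up raw vote counts (rows = observations, columns = classes) from two
# hypothetical runs fitted with norm.votes=FALSE
votes_run1 <- matrix(c(30, 20,
                       10, 40), nrow = 2, byrow = TRUE,
                     dimnames = list(NULL, c("0", "1")))
votes_run2 <- matrix(c(25, 25,
                        5, 45), nrow = 2, byrow = TRUE,
                     dimnames = list(NULL, c("0", "1")))

combined  <- votes_run1 + votes_run2        # raw counts simply add up
fractions <- combined / rowSums(combined)   # normalize each row to sum to 1
print(fractions)
```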
Argument: do.trace
If set to TRUE, give a more verbose output as RRF is run. If set to some integer, then running output is printed for every do.trace trees.
Argument: keep.forest
If set to FALSE, the forest will not be retained in the output object. If xtest is given, defaults to FALSE.
Argument: corr.bias
Perform bias correction for regression? Note: Experimental. Use at your own risk.
Argument: keep.inbag
Should an n by ntree matrix be returned that keeps track of which samples are "in-bag" in which trees (but not how many times, if sampling with replacement)?
Argument: coefReg
The coefficient(s) of regularization. A smaller coefficient may lead to a smaller feature subset, i.e. there are fewer variables with non-zero importance scores. coefReg must be either a single value (all variables have the same coefficient) or a numeric vector of length equal to the number of predictor variables. Default: 0.8.
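When a vector is supplied, one way to build it is the weighted average used in the Examples section: a base coefficient blended with normalized importance scores from an ordinary random forest. The sketch below uses made-up importance scores, and the names base_coef and gamma are illustrative labels, not RRF arguments.

```r
# Made-up importance scores for five predictors (e.g. taken from an ordinary
# random forest fitted with flagReg=0)
imp_rf <- c(V1 = 12.0, V2 = 0.5, V3 = 3.0, V4 = 0.0, V5 = 6.0)

base_coef <- 0.8  # base regularization coefficient (the documented default)
gamma     <- 0.1  # weight given to the normalized importance (illustrative)

imp_norm <- imp_rf / max(imp_rf)                         # scale into [0, 1]
coef_reg <- (1 - gamma) * base_coef + gamma * imp_norm   # one value per predictor

print(coef_reg)
# The vector could then be passed as RRF(x, y, coefReg = coef_reg),
# provided length(coef_reg) equals ncol(x).
```

Predictors that looked important in the ordinary forest receive a coefficient closer to 1 and are therefore penalized less.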
Argument: flagReg
1: with regularization; 0: without regularization. Default: 1.
Argument: ...
Optional parameters to be passed to the low level function RRF.default.
----------Value----------
An object of class RRF, which is a list with the following components:
Component: call
The original call to RRF.
Component: type
One of regression, classification, or unsupervised.
Component: predicted
The predicted values of the input data based on out-of-bag samples.
Component: importance
A matrix with nclass + 2 (for classification) or two (for regression) columns. For classification, the first nclass columns are the class-specific measures computed as mean decrease in accuracy. The nclass + 1st column is the mean decrease in accuracy over all classes. The last column is the mean decrease in Gini index. For regression, the first column is the mean decrease in accuracy and the second the mean decrease in MSE. If importance=FALSE, the last measure is still returned as a vector.
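Since regularization tends to leave unused variables with zero importance, one simple way to read off a feature subset is to keep the predictors whose Gini importance is non-zero. The matrix below is made up to mimic the classification layout with two classes; the column names MeanDecreaseAccuracy and MeanDecreaseGini are the ones used in the Examples section.

```r
# Made-up importance matrix shaped like the classification case with
# importance=TRUE: nclass (= 2) class columns, then overall accuracy and Gini
imp <- matrix(c(0.10, 0.12, 0.11, 4.2,
                0.00, 0.00, 0.00, 0.0,
                0.03, 0.01, 0.02, 1.1),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("V1", "V2", "V3"),
                              c("0", "1",
                                "MeanDecreaseAccuracy", "MeanDecreaseGini")))

gini     <- imp[, "MeanDecreaseGini"]  # last column, as used in the Examples
selected <- names(gini)[gini > 0]      # predictors with non-zero importance
print(selected)
```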
Component: importanceSD
The "standard errors" of the permutation-based importance measure. For classification, a p by nclass + 1 matrix corresponding to the first nclass + 1 columns of the importance matrix. For regression, a length p vector.
Component: localImp
A p by n matrix containing the casewise importance measures, the [i,j] element of which is the importance of the i-th variable on the j-th case. NULL if localImp=FALSE.
Component: ntree
Number of trees grown.
Component: mtry
Number of predictors sampled for splitting at each node.
Component: forest
A list that contains the entire forest; NULL if RRF is run in unsupervised mode or if keep.forest=FALSE.
Component: err.rate
(Classification only) Vector of error rates of the prediction on the input data, the i-th element being the (OOB) error rate for all trees up to the i-th.
Component: confusion
(Classification only) The confusion matrix of the prediction (based on OOB data).
Component: votes
(Classification only) A matrix with one row for each input data point and one column for each class, giving the fraction or number of (OOB) "votes" from the random forest.
Component: oob.times
Number of times cases are "out-of-bag" (and thus used in computing the OOB error estimate).
Component: proximity
If proximity=TRUE when RRF is called, a matrix of proximity measures among the input (based on the frequency with which pairs of data points are in the same terminal nodes).
Component: mse
(Regression only) Vector of mean square errors: sum of squared residuals divided by n.
Component: rsq
(Regression only) "Pseudo R-squared": 1 - mse / Var(y).
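The two regression quantities can be checked by hand on a toy example. This sketch assumes that Var(y) is taken with divisor n, to match the divisor used for mse; R's var() divides by n - 1, so it is rescaled here. The data are illustrative, not output from RRF.

```r
y         <- c(1, 2, 3, 4)      # illustrative responses
predicted <- c(1.5, 2, 2.5, 4)  # illustrative OOB predictions

n   <- length(y)
mse <- sum((y - predicted)^2) / n   # sum of squared residuals divided by n

# Assumption: Var(y) uses divisor n, matching mse; var() uses n - 1.
var_y <- var(y) * (n - 1) / n
rsq   <- 1 - mse / var_y            # "pseudo R-squared"

print(c(mse = mse, rsq = rsq))
```

Because predictions are out-of-bag, this pseudo R-squared can be negative when the forest predicts worse than the mean of y.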
Component: test
If a test set is given (through the xtest and, optionally, ytest arguments), this component is a list which contains the corresponding predicted, err.rate, confusion, votes (for classification) or predicted, mse and rsq (for regression) for the test set. If proximity=TRUE, there is also a component, proximity, which contains the proximity among the test set as well as proximity between test and training data.
----------Note----------
For large data sets, especially those with a large number of variables, calling RRF via the formula interface is not advised: there may be too much overhead in handling the formula.
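The two calling styles can be compared side by side. This is a small sketch on the built-in iris data (an illustrative choice, far too small to show any overhead); the calls are guarded so the snippet still runs when the RRF package is not installed, and ntree = 100 is arbitrary.

```r
# iris ships with base R; using it here is purely illustrative.
x <- iris[, 1:4]   # predictors as a data frame
y <- iris$Species  # factor response, so classification is assumed

if (requireNamespace("RRF", quietly = TRUE)) {
  set.seed(1)
  # Formula interface: convenient, but the formula has to be processed first
  fit_formula <- RRF::RRF(Species ~ ., data = iris, ntree = 100)
  # Default (x, y) interface: passes the predictors and response directly
  fit_xy <- RRF::RRF(x, y, ntree = 100)
  print(fit_xy$type)
}
```

For wide data, passing x and y directly is the style the note recommends.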
----------Author(s)----------
Houtao Deng <softwaredeng@gmail.com>, based on the randomForest R package by Andy Liaw, Matthew Wiener, Leo Breiman and Adele Cutler.
----------Examples----------
#-----Example 1 -----
library(RRF);set.seed(1)
# only the first and last features are needed
X=matrix(runif(200*500), ncol=500)
class = (X[,1])^2 + (X[,500])^2
class[class>median(class)]=1;class[class<=median(class)]=0
# ordinary random forest
rf <- RRF(X,as.factor(class), flagReg = 0)
impRF=rf$importance
impRF=impRF[,"MeanDecreaseGini"]
x11();plot(impRF,xlab="variable index",ylab="importance")
title("RRF(X,as.factor(class), flagReg = 0)")
# mark the first and last variables
text(30, impRF[1], "V1", cex=1)
text(460, impRF[500], "V500", cex=1)
# regularized random forest without using the global importance
rrf <- RRF(X,as.factor(class))
imp=rrf$importance
imp=imp[,"MeanDecreaseGini"]
x11();plot(imp,xlab="variable index",ylab="importance")
title("RRF(X,as.factor(class))")
text(30, imp[1], "V1", cex=1)
text(460, imp[500], "V500", cex=1)
# regularized random forest using the importance scores
# from ordinary random forest
imp=impRF/(max(impRF)) # normalize the importance score
coefReg=0.9*0.8+0.1*imp # weighted average
rrf <- RRF(X,as.factor(class),coefReg=coefReg)
imp=rrf$importance
imp=imp[,"MeanDecreaseGini"]
x11();plot(imp,xlab="variable index",ylab="importance")
title("RRF(X,as.factor(class),coefReg=coefReg)")
text(30, imp[1], "V1", cex=1)
text(460, imp[500], "V500", cex=1)
# smaller regularization coefficients
# eliminate more features
imp=impRF/(max(impRF)) # normalize the importance score
coefReg=0.9*0.5+0.1*imp # weighted average
rrf <- RRF(X,as.factor(class),coefReg=coefReg)
imp=rrf$importance
imp=imp[,"MeanDecreaseGini"]
x11();plot(imp,xlab="variable index",ylab="importance")
title("RRF(X,as.factor(class),coefReg=coefReg)")
text(30, imp[1], "V1", cex=1)
text(460, imp[500], "V500", cex=1)
# setting a large mtry may eliminate more features
imp=impRF/(max(impRF)) # normalize the importance score
coefReg=0.9*0.8+0.1*imp # weighted average
rrf <- RRF(X,as.factor(class),mtry=ncol(X),coefReg=coefReg)
imp=rrf$importance
imp=imp[,"MeanDecreaseGini"]
x11();plot(imp,xlab="variable index",ylab="importance")
title("RRF(X,as.factor(class),mtry=ncol(X),coefReg=coefReg)")
text(30, imp[1], "V1", cex=1)
text(460, imp[500], "V500", cex=1)
#-----Example 2 XOR learning-----
set.seed(1)
# only the first 3 features are needed
# and each individual feature is not useful
bSample = sample(0:1,20000,replace=TRUE)
X=matrix(bSample,ncol=40)
class = xor(xor(X[,1],X[,2]),X[,3])
# ordinary random forest
rf <- RRF(X,as.factor(class), flagReg = 0,importance=TRUE)
impRF=rf$importance
impRF=impRF[,"MeanDecreaseAccuracy"] # can use gini decrease as well
x11();plot(impRF,xlab="variable index",ylab="importance")
title("RRF(X,as.factor(class),flagReg = 0,importance=TRUE)")
# regularized random forest using the global importance
# also setting a large mtry may eliminate more features
imp=impRF/(max(impRF)) # normalize the importance score
coefReg=0.9*0.8+0.1*imp # weighted average
rrf <- RRF(X,as.factor(class),mtry=ncol(X),coefReg=coefReg)
imp=rrf$importance
imp=imp[,"MeanDecreaseGini"]
x11();plot(imp,xlab="variable index",ylab="importance")
title("RRF(X,as.factor(class),mtry=ncol(X),coefReg=coefReg)")
Please credit the source when reposting: 生物统计家园网 (http://www.biostatistic.net).