nearZeroVar(caret)
nearZeroVar() is part of the R package caret
Identification of near zero variance predictors
Description
nearZeroVar diagnoses predictors that have one unique value (i.e., zero variance predictors) or predictors that have both of the following characteristics: they have very few unique values relative to the number of samples, and the ratio of the frequency of the most common value to the frequency of the second most common value is large. checkConditionalX looks at the distribution of the columns of x conditioned on the levels of y and identifies columns of x that are sparse within groups of y.
Usage
nearZeroVar(x, freqCut = 95/5, uniqueCut = 10, saveMetrics = FALSE)
checkConditionalX(x, y)
checkResamples(index, x, y)
Arguments
Argument: x
a numeric vector or matrix, or a data frame with all numeric data
Argument: freqCut
the cutoff for the ratio of the frequency of the most common value to the frequency of the second most common value
Argument: uniqueCut
the cutoff for the percentage of distinct values out of the total number of samples
Argument: saveMetrics
a logical. If FALSE, the positions of the zero- or near-zero-variance predictors are returned. If TRUE, a data frame with predictor information is returned.
Argument: y
a factor vector with at least two levels
Argument: index
a list. Each element corresponds to the training set samples in x for a given resample
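To make the shape of the index argument concrete, here is a minimal base-R sketch that builds such a list by hand. It assumes that "training set samples" means row positions of x, as in the output of createFolds(..., returnTrain = TRUE). The names n, fold_id and the "Fold" labels are illustrative only, and unlike createFolds this sketch does not stratify by class.

```r
set.seed(1)
n <- 90                                        # number of rows in x (assumed)

# Randomly assign each row to one of 3 held-out folds
fold_id <- sample(rep(1:3, length.out = n))

# For resample k, the training rows are all rows outside fold k
index <- lapply(1:3, function(k) which(fold_id != k))
names(index) <- paste0("Fold", 1:3)            # hypothetical labels

str(index)
```

Each element is an integer vector of training rows, so a list like this could be passed as the first argument of checkResamples.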
Details
As an example, a near zero variance predictor is one that, for 1000 samples, has two distinct values, with 999 of the samples taking a single value.
To be flagged, a predictor must meet two conditions. First, the frequency of the most prevalent value divided by the frequency of the second most frequent value (called the “frequency ratio”) must be above freqCut. Second, the “percent of unique values,” the number of unique values divided by the total number of samples (times 100), must be below uniqueCut.
In the above example, the frequency ratio is 999 and the unique value percentage is 0.2 (2 distinct values out of 1000 samples, times 100).
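The two quantities can be checked by hand for the 1000-sample example. The following is a small base-R sketch of the arithmetic described above, not the package's own code; the variable names are illustrative, and the cutoffs are the defaults shown in Usage.

```r
# 1000 samples with two distinct values, 999 of them identical
values <- c(rep(0, 999), 1)

# Frequency ratio: count of most common value over count of second most common
counts <- sort(table(values), decreasing = TRUE)
freq_ratio <- as.numeric(counts[1] / counts[2])                    # 999 / 1

# Percent of unique values: distinct values over samples, times 100
percent_unique <- 100 * length(unique(values)) / length(values)    # 100 * 2 / 1000

# Flagged when both conditions hold (default cutoffs: freqCut = 95/5, uniqueCut = 10)
flagged <- freq_ratio > 95/5 && percent_unique < 10

c(freq_ratio = freq_ratio, percent_unique = percent_unique)
flagged
```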
Checking the conditional distribution of x may be needed for some models, such as naive Bayes, where the conditional distribution of each predictor should have at least one data point within each class.
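A rough idea of what a within-class check looks like can be sketched in base R. The sketch assumes that "sparse within groups" means a column that takes a single distinct value inside at least one class; this is an assumption about the criterion, not a statement of how checkConditionalX is implemented. The data match the toy data in the Examples section.

```r
classes <- factor(rep(letters[1:3], each = 30))
x <- data.frame(x1 = rep(c(0, 1), 45),
                x2 = c(rep(0, 10), rep(1, 80)))

# Split the rows by class, then count distinct values per column in each class
by_class <- split(x, classes)
n_distinct <- sapply(by_class, function(d) {
  sapply(d, function(col) length(unique(col)))
})
n_distinct            # rows are predictors, columns are classes

# Assumed criterion: constant within at least one class
flagged <- rownames(n_distinct)[apply(n_distinct < 2, 1, any)]
flagged
```

Here x2 is all ones within classes "b" and "c", so it is the column this criterion picks out, while x1 alternates between 0 and 1 in every class.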
Value
For nearZeroVar: if saveMetrics = FALSE, a vector of integers corresponding to the column positions of the problematic predictors. If saveMetrics = TRUE, a data frame with columns:
Column: freqRatio
the ratio of frequencies for the most common value over the second most common value
Column: percentUnique
the percentage of unique data points out of the total number of data points
Column: zeroVar
a vector of logicals for whether the predictor has only one distinct value
Column: nzv
a vector of logicals for whether the predictor is a near zero variance predictor
For checkResamples or checkConditionalX, a vector of column indicators for predictors with empty conditional distributions in at least one class of y.
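To show how the four columns relate to each other and to the integer positions returned when saveMetrics = FALSE, here is a base-R sketch built from the rule in Details. The helper nzv_metrics_sketch and the toy data are hypothetical and not part of caret. Reporting a ratio of 0 for a constant column and ignoring missing values are assumptions of this sketch, so its output may differ from nearZeroVar in edge cases.

```r
# Hypothetical helper (not caret's implementation): metrics from the documented rule
nzv_metrics_sketch <- function(x, freqCut = 95/5, uniqueCut = 10) {
  x <- as.data.frame(x)
  freqRatio <- sapply(x, function(col) {
    counts <- sort(table(col), decreasing = TRUE)
    if (length(counts) < 2) return(0)   # assumption: no second value, report 0
    as.numeric(counts[1] / counts[2])
  })
  nUnique <- sapply(x, function(col) length(unique(col)))
  percentUnique <- 100 * nUnique / nrow(x)
  zeroVar <- nUnique == 1
  data.frame(freqRatio, percentUnique, zeroVar,
             nzv = zeroVar | (freqRatio > freqCut & percentUnique < uniqueCut))
}

toy <- data.frame(
  constant = rep(1, 1000),           # one distinct value
  rare     = c(rep(0, 999), 1),      # the 1000-sample example from Details
  balanced = rep(c(0, 1), 500)       # two values, equally common
)

metrics <- nzv_metrics_sketch(toy)
metrics

# Column positions of the flagged predictors, analogous to saveMetrics = FALSE
flagged_positions <- unname(which(metrics$nzv))
flagged_positions
```

In this sketch the constant column is flagged through zeroVar, the rare column through the two cutoffs, and the balanced column is left alone because its frequency ratio is 1.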
Author(s)
Max Kuhn, with speed improvements to nearZeroVar by Allan Engelhardt
Examples
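library(caret)

# Metrics for every predictor in iris (species column dropped)
nearZeroVar(iris[, -5], saveMetrics = TRUE)

# Column positions of flagged descriptors in the BloodBrain data
data(BloodBrain)
nearZeroVar(bbbDescr)

# Toy data: x2 is constant within classes "b" and "c"
set.seed(1)
classes <- factor(rep(letters[1:3], each = 30))
x <- data.frame(x1 = rep(c(0, 1), 45),
                x2 = c(rep(0, 10), rep(1, 80)))

lapply(x, table, y = classes)
checkConditionalX(x, classes)

# x3 is set to 0 on the first resample's training rows, so it is constant there
folds <- createFolds(classes, k = 3, returnTrain = TRUE)
x$x3 <- x$x1
x$x3[folds[[1]]] <- 0

checkResamples(folds, x, classes)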