R pcaMethods package: help documentation for the kEstimate() function

loveR · posted 2012-2-26 10:48:56

kEstimate(pcaMethods)
kEstimate() belongs to the R package: pcaMethods

Estimate best number of Components for missing value estimation

Translator: LoveR, the 生物统计家园网 robot

----------Description----------

Perform cross validation to estimate the optimal number of components for missing value estimation. Cross validation is

----------Usage----------

----------Arguments----------

Argument: Matrix
matrix – numeric matrix containing observations in rows and variables in columns

Argument: method
character – one of the methods found with pcaMethods(). The option llsImputeAll calls llsImpute with the allVariables = TRUE parameter.

Argument: evalPcs
numeric – the principal components to use for cross validation, or the number of neighbour variables if used with llsImpute. Should be an array containing integer values, e.g. evalPcs = 1:10 or evalPcs = c(2,5,8). The NRMSEP or Q2 is calculated for each component.

Argument: segs
numeric – number of segments for cross validation

Argument: nruncv
numeric – number of times the whole cross validation is repeated

Argument: em
character – the error measure. This can be nrmsep or q2.

Argument: allVariables
boolean – if TRUE, the NRMSEP is calculated for all variables; if FALSE, only the incomplete ones are included. You may want to set this to TRUE to compare several methods on a complete data set.

Argument: verbose
boolean – if TRUE, some output, such as the variable indexes, is printed to the console at each iteration.

Argument: ...
Further arguments to pca or nni

----------Details----------

The underlying assumption is that variables that are highly correlated in one region (here, the non-missing observations) are also correlated in another (here, the missing observations). This also implies that the complete subset must be large enough to be representative. For each incomplete variable, the available values are divided into a user-defined number of cv-segments. The segments have equal size, but their members are chosen at random from a uniform distribution. The non-missing values of the variable are covered completely. PPCA, BPCA, SVDimpute, Nipals PCA, llsImpute and NLPCA may be used for imputation.
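The segmentation step described above can be sketched in base R. This is only an illustration of the idea, not the package's internal code: the helper name make_cv_segments and the round-robin assignment are assumptions made for this example.

```r
# Sketch: split the observed (non-missing) entries of one variable into
# `segs` cross-validation segments of near-equal size, assigned at random.
# make_cv_segments is a hypothetical helper, not part of pcaMethods.
make_cv_segments <- function(x, segs = 3) {
  observed <- which(!is.na(x))                         # rows with available values
  shuffled <- observed[sample.int(length(observed))]   # random order
  # Deal the shuffled indexes round-robin so segment sizes differ by at most 1
  seg_id <- rep_len(seq_len(segs), length(shuffled))
  unname(split(shuffled, seg_id))
}

set.seed(1)
x <- c(2.1, NA, 3.4, 1.8, NA, 2.9, 3.1, 2.2, 2.7, NA, 3.0)
segments <- make_cv_segments(x, segs = 3)
print(segments)
```

Every observed value lands in exactly one segment, so the non-missing values of the variable are covered completely, as the text requires.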

The whole cross validation is repeated several times, so, depending on the parameters, the calculations can take a very long time. As error measure, the NRMSEP (see Feten et al., 2005) or the Q2 distance is used. The NRMSEP basically normalises the RMSD between the original data and the estimate by the variable-wise variance. The reason for this is that a higher variance will generally lead to a higher estimation error. If the number of samples is small, the variable-wise variance may become an unstable criterion and the Q2 distance should be used instead. The same applies if variance normalisation was applied previously.

The method proceeds variable-wise: the NRMSEP / Q2 distance is calculated for each incomplete variable and averaged afterwards. This makes it easy to see for which set of variables missing value imputation makes sense, and for which set no imputation or something like mean imputation should be used. Use kEstimateFast or Q2 if you are not interested in variable-wise CV performance estimates.

Run time may be very high on large data sets, especially when used with complex methods like BPCA or Nipals PCA. For PPCA, BPCA, Nipals PCA and NLPCA the estimation method is called (v_miss * segs * nruncv) times, as the error for all numbers of principal components can be calculated at once. For LLSimpute and SVDimpute this is not possible, and the method is called (v_miss * segs * nruncv * length(evalPcs)) times. This should still be fast for LLSimpute, because the method allows the estimation to be done for one particular variable only, which saves a lot of iterations. Here, v_miss is the number of variables showing missing values.
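To get a feeling for the run time, the two call counts given above can be compared directly. The sketch below only does the arithmetic from the text; count_method_calls is a hypothetical helper, and the method labels are used purely as illustration.

```r
# Sketch: number of times the estimation method is called, following the
# two formulas in the text. count_method_calls is a hypothetical helper.
count_method_calls <- function(v_miss, segs, nruncv, evalPcs, method) {
  # Methods for which the error for all numbers of components is
  # obtained in a single call (labels chosen for this example)
  all_at_once <- c("ppca", "bpca", "nipals", "nlpca")
  base_calls <- v_miss * segs * nruncv
  if (method %in% all_at_once) base_calls else base_calls * length(evalPcs)
}

# 10 incomplete variables, 3 segments, 5 repetitions, 10 candidate values
calls_ppca <- count_method_calls(10, 3, 5, 1:10, "ppca")
calls_svd  <- count_method_calls(10, 3, 5, 1:10, "svdImpute")
print(c(ppca = calls_ppca, svdImpute = calls_svd))
```

With these settings the first group needs 150 calls and the second 1500, so narrowing evalPcs is the cheapest way to shorten a run for LLSimpute or SVDimpute.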

As cross validation is done variable-wise, in this function Q2 is defined on single variables, not on the entire data set. That is, Q2 is calculated as sum((x - xe)^2) / sum(x^2), where x is the currently used variable and xe its estimate. The values are then averaged over all variables. The NRMSEP is already defined variable-wise. For a single variable it is sqrt(sum((x - xe)^2) / (n * var(x))), where x is the variable, xe its estimate, and n the length of x.  The variable wise estimation
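The two single-variable error measures can be written out in base R as follows. This is a minimal sketch of the formulas as stated in this help page, not the package's own code; the function names nrmsep_single and q2_single and the sample vectors are made up for the example.

```r
# Sketch of the two variable-wise error measures as defined in the text.
# nrmsep_single and q2_single are hypothetical names used for illustration.
nrmsep_single <- function(x, xe) {
  n <- length(x)
  sqrt(sum((x - xe)^2) / (n * var(x)))
}

q2_single <- function(x, xe) {
  sum((x - xe)^2) / sum(x^2)
}

x  <- c(1.0, 2.0, 3.0, 4.0, 5.0)   # original values of one variable
xe <- c(1.1, 1.9, 3.2, 3.8, 5.0)   # made-up estimates

err_nrmsep <- nrmsep_single(x, xe)
err_q2     <- q2_single(x, xe)

# Baseline: replacing every value with the mean gives sqrt((n - 1) / n),
# because sum((x - mean(x))^2) equals (n - 1) * var(x).
err_mean <- nrmsep_single(x, rep(mean(x), length(x)))
print(c(nrmsep = err_nrmsep, q2 = err_q2, nrmsep_mean = err_mean))
```

The baseline is a handy yardstick: an NRMSEP close to 1 for a variable suggests that the chosen method does little better than mean imputation for it.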

----------Value----------

A list with:

Element: bestNPcs
number of PCs or k for which the minimal average NRMSEP or the maximal Q2 was obtained.

Element: eError
an array of size length(evalPcs). Contains the average error of the cross validation runs for each number of components.

Element: variableWiseError
Matrix of size incomplete_variables x length(evalPcs). Contains the NRMSEP or Q2 distance for each variable and each number of PCs. This makes it easy to see for which variables imputation makes sense, and for which ones it should not be done or mean imputation should be used.

Element: evalPcs
The evaluated numbers of components or number of neighbours (the same as the evalPcs input parameter).

Element: variableIx
Index of the incomplete variables. This can be used to map the variable-wise error to the original data.
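As an illustration of how variableIx might be used, the sketch below pairs each row of variableWiseError with its column in the original data. The list esti is a hand-built stand-in with made-up numbers, not real kEstimate() output, and it assumes that row i of the error matrix belongs to variableIx[i].

```r
# Hypothetical stand-in for a kEstimate() result; all numbers are made up.
esti <- list(
  bestNPcs = 3,
  evalPcs = 2:4,
  variableIx = c(2, 5, 7),              # columns of the data with missing values
  variableWiseError = matrix(
    c(0.40, 0.35, 0.38,
      0.95, 1.02, 1.10,
      0.60, 0.52, 0.55),
    nrow = 3, byrow = TRUE)             # rows: variables, columns: evalPcs
)

# Column of the error matrix that belongs to the best number of components
best_col <- which(esti$evalPcs == esti$bestNPcs)

# Assumption: row i of variableWiseError corresponds to variableIx[i]
per_variable <- data.frame(
  column = esti$variableIx,
  error  = esti$variableWiseError[, best_col]
)
print(per_variable)
```

In this made-up result the variable in column 5 has an error above 1, which is the kind of case where the text recommends no imputation or mean imputation instead.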

----------Author(s)----------

Wolfram Stacklies

----------See Also----------

kEstimateFast, Q2, pca, nni.

----------Examples----------

library(pcaMethods)
data(metaboliteData)
# Do cross validation with ppca for components 2:4
esti <- kEstimate(metaboliteData, method = "ppca", evalPcs = 2:4, nruncv = 1, em = "nrmsep")
# Plot the average NRMSEP
barplot(drop(esti$eError), xlab = "Components", ylab = "NRMSEP (1 iteration)")
# The best result was obtained for this number of PCs:
print(esti$bestNPcs)
# Now have a look at the variable-wise estimation error
barplot(drop(esti$variableWiseError[, which(esti$evalPcs == esti$bestNPcs)]))

When reposting, please credit the source: 生物统计家园网 (http://www.biostatistic.net).


lhq213 · posted 2016-9-11 14:21:15

Thanks

