knncatimputeLarge(scrime)
knncatimputeLarge() belongs to R package: scrime
Missing Value Imputation with kNN for High-Dimensional Data
----------Description----------
Imputes missing values in a high-dimensional matrix composed of categorical variables using k Nearest Neighbors.
----------Usage----------
knncatimputeLarge(data, mat.na = NULL, fac = NULL, fac.na = NULL,
    nn = 3, distance = c("smc", "cohen", "snp1norm", "pcc"),
    n.num = 100, use.weights = TRUE, verbose = FALSE)
----------Arguments----------
Argument: data
a numeric matrix consisting of integers between 1 and n.cat, where n.cat is the maximum number of levels the categorical variables can take. If mat.na is specified, data is assumed to contain only non-missing data, and the rows of data are used to impute the missing values in mat.na. Otherwise, data may also contain missing values, and the missing values in the rows of data are imputed using the rows of data that have no missing values.
Each row of data represents one of the objects that should be used to identify the k nearest neighbors, i.e., if the k nearest variables should be used to replace the missing values, then each row must represent one of the variables. If the k nearest observations should be used to impute the missing values, then each row must correspond to one of the observations.
Argument: mat.na
a numeric matrix containing missing values. Must have the same number of columns as data. All non-missing values must be integers between 1 and n.cat. If NULL, data is assumed to also contain the rows with missing values.
Argument: fac
a numeric or character vector of length nrow(data) specifying the values of a factor used to split data into subsets. If, e.g., the values of fac are given by the chromosomes to which the SNPs represented by the rows of data belong, then k nearest neighbors is applied chromosomewise to the missing values in mat.na (or data). If NULL, no such splitting is done. Must be specified if fac.na is specified.
Argument: fac.na
a numeric or character vector of length nrow(mat.na) specifying the values of a factor by which mat.na is split into subsets. Each possible value of fac.na must occur at least nn times in fac. Must be specified if fac and mat.na are specified. If both fac and fac.na are NULL, then no splitting is done.
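The requirement that every value of fac.na occurs at least nn times in fac can be checked before calling the function. The following is a minimal base R sketch; the grouping vectors are made up for illustration, and the check is not part of scrime itself.

```r
# Hypothetical grouping vectors, for illustration only.
nn <- 3
fac <- c("chr1", "chr1", "chr1", "chr2", "chr2", "chr2", "chr2")
fac.na <- c("chr1", "chr2", "chr2")

# Count how often each value of fac.na occurs in fac.
counts <- table(factor(fac, levels = unique(fac.na)))

# TRUE if every group offers at least nn candidate neighbors.
ok <- all(counts >= nn)
ok
```

If ok is FALSE, some subset of mat.na would have fewer than nn candidate rows to draw neighbors from.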
Argument: nn
an integer specifying k, i.e., the number of nearest neighbors used to impute the missing values.
Argument: distance
character string naming the distance measure used in k Nearest Neighbors. Must be either "smc" (default), "cohen", "snp1norm" (which denotes the Manhattan distance for SNPs), or "pcc".
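To give a feel for the default measure, here is a rough base R sketch of a distance based on the simple matching coefficient. It assumes the distance is one minus the proportion of matching entries, computed over the positions observed in both vectors; smc_dist is a hypothetical helper and not the scrime implementation.

```r
# Hypothetical helper: proportion of mismatching entries between two
# categorical vectors, ignoring positions where either value is missing.
smc_dist <- function(x, y) {
  keep <- !is.na(x) & !is.na(y)
  mean(x[keep] != y[keep])
}

a <- c(1, 2, 3, 1, 2)
b <- c(1, 2, 1, 1, 3)
smc_dist(a, b)  # 2 of the 5 positions differ
```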
Argument: n.num
an integer giving the number of rows of mat.na considered simultaneously when replacing the missing values in mat.na.
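Processing n.num rows at a time amounts to working through mat.na in batches. The base R sketch below only illustrates how row indices could be grouped into such batches; n.rows is a made-up value, and this is not a description of the package internals.

```r
n.num <- 100
n.rows <- 250  # hypothetical number of rows in mat.na

# Assign consecutive row indices to batches of at most n.num rows.
chunks <- split(seq_len(n.rows), ceiling(seq_len(n.rows) / n.num))
lengths(chunks)
```

A larger n.num means fewer batches but more rows held in memory at once.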
Argument: use.weights
should weighted k nearest neighbors be used to impute the missing values? If TRUE, the votes of the nearest neighbors are weighted by the reciprocal of their distances to the variable (or observation) whose missing values are imputed.
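The effect of weighting votes by reciprocal distances can be seen in a small base R sketch. The helper weighted_vote and the neighbor values are hypothetical, and the sketch assumes strictly positive distances; how scrime treats zero distances is not covered here.

```r
# Hypothetical helper: pick the category with the largest total weight,
# where each neighbor votes with weight 1 / distance.
weighted_vote <- function(values, dists) {
  w <- 1 / dists
  scores <- tapply(w, values, sum)
  as.integer(names(scores)[which.max(scores)])
}

neighbor_values <- c(1, 2, 2)       # categories of the 3 nearest neighbors
neighbor_dists <- c(0.1, 0.4, 0.5)  # their distances to the target row

# Unweighted majority vote picks the most frequent category.
unweighted <- as.integer(names(which.max(table(neighbor_values))))

# Weighted vote lets the much closer neighbor dominate.
weighted <- weighted_vote(neighbor_values, neighbor_dists)
c(unweighted = unweighted, weighted = weighted)
```

Here the single very close neighbor outweighs the two more distant ones, so the two schemes disagree.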
Argument: verbose
should more information about the progress of the imputation be printed?
----------Value----------
If mat.na = NULL, then a matrix of the same size as data in which the missing values have been replaced. If mat.na has been specified, then a matrix of the same size as mat.na in which the missing values have been replaced.
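The two calling conventions can be illustrated by splitting a matrix into its complete rows and its rows with missing values. This is a sketch under stated assumptions: the small matrix is simulated, the variable names are illustrative, and the scrime call runs only if the package is installed.

```r
set.seed(1)
mat <- matrix(sample(3, 2000, TRUE), 200)
mat[sample(2000, 10)] <- NA

# Separate the rows without missing values from the rows containing NAs.
has.na <- !complete.cases(mat)
data <- mat[!has.na, , drop = FALSE]
mat.na <- mat[has.na, , drop = FALSE]

if (requireNamespace("scrime", quietly = TRUE)) {
  # With mat.na specified, the result should have the size of mat.na.
  imputed <- scrime::knncatimputeLarge(data, mat.na = mat.na)
  print(dim(imputed))
}
```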
----------Note----------
While knncatimpute considers all variables/rows when replacing missing values, knncatimputeLarge only considers the rows with no missing values when searching for the k nearest neighbors.
----------Author(s)----------
Holger Schwender, holger.schwender@udo.edu
----------References----------
Schwender, H. and Ickstadt, K. (2008). Imputing Missing Genotypes with k Nearest Neighbors. Technical Report, SFB 475, Department of Statistics, University of Dortmund. Appears soon.
----------See Also----------
knncatimpute, gknn, smc, pcc
----------Examples----------
# Generate a data set consisting of 100 columns and 2000 rows (actually,
# knncatimputeLarge is made for much larger data sets), where the values
# are randomly drawn from the integers 1, 2, and 3.
# Afterwards, set 20 randomly chosen entries to NA.
mat <- matrix(sample(3, 200000, TRUE), 2000)
mat[sample(200000, 20)] <- NA
# Apply knncatimputeLarge to mat to impute the missing values, then
# compare the number of missing values before and after.
mat2 <- knncatimputeLarge(mat)
sum(is.na(mat))
sum(is.na(mat2))
# Now assume that the first 100 rows belong to SNPs from chromosome 1,
# the second 100 rows to SNPs from chromosome 2, and so on.
chromosome <- rep(1:20, each = 100)
# Apply knncatimputeLarge to mat chromosomewise, i.e. only consider
# the SNPs that belong to the same chromosome when replacing missing
# genotypes.
mat4 <- knncatimputeLarge(mat, fac = chromosome)