gsMMD2.default(GeneSelectMMD)
gsMMD2.default() belongs to the R package GeneSelectMMD
Gene selection based on a mixture of marginal distributions
Translator: 生物统计家园网 robot LoveR
----------Description----------
Gene selection based on the marginal distributions of gene profiles, which are characterized by a mixture of three-component multivariate distributions. The input is a data matrix. The user needs to provide the initial gene cluster membership.
----------Usage----------
gsMMD2.default(X,
memSubjects,
memIni,
maxFlag = TRUE,
thrshPostProb = 0.5,
geneNames = NULL,
alpha = 0.05,
transformFlag = FALSE,
transformMethod = "boxcox",
scaleFlag = TRUE,
criterion = c("cor", "skewness", "kurtosis"),
minL = -10,
maxL = 10,
stepL = 0.1,
eps = 0.001,
ITMAX = 100,
plotFlag = FALSE,
quiet=TRUE)
----------Arguments----------
Argument: X
a data matrix. The rows of the matrix are genes. The columns of the matrix are subjects.
Argument: memSubjects
a vector of membership of subjects. memSubjects[i]=1 means that the i-th subject belongs to the diseased group; 0 otherwise.
Argument: memIni
a vector of user-provided gene cluster membership.
Argument: maxFlag
logical. Indicates how to assign gene class membership. maxFlag=TRUE means that a gene will be assigned to the class for which its posterior probability is largest. maxFlag=FALSE means that a gene will be assigned to class 1 if the posterior probability that the gene belongs to class 1 is greater than thrshPostProb. Similarly, a gene will be assigned to class 3 if the posterior probability that the gene belongs to class 3 is greater than thrshPostProb. If neither of these posterior probabilities is greater than thrshPostProb, the gene will be assigned to class 2 (non-differentially expressed gene group).
Argument: thrshPostProb
threshold for posterior probabilities. For example, if the posterior probability that a gene belongs to cluster 1, given its gene expression levels, is larger than thrshPostProb, then this gene will be assigned to cluster 1.
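The two assignment rules controlled by maxFlag and thrshPostProb can be sketched as follows. This is only an illustration: assign_classes is a hypothetical helper, not a function of GeneSelectMMD, and post stands for a matrix of posterior probabilities with one row per gene and one column per class. How the package resolves the case where both class 1 and class 3 exceed a threshold below 0.5 is not stated in this page; the sketch lets class 3 win, which is an assumption.

```r
# Hypothetical helper (not part of GeneSelectMMD): assign gene classes
# from posterior probabilities (columns: 1 = up, 2 = non-DE, 3 = down).
assign_classes <- function(post, maxFlag = TRUE, thrshPostProb = 0.5) {
  if (maxFlag) {
    # class with the largest posterior probability in each row
    return(max.col(post, ties.method = "first"))
  }
  mem <- rep(2L, nrow(post))               # default: non-differentially expressed
  mem[post[, 1] > thrshPostProb] <- 1L     # up-regulated
  mem[post[, 3] > thrshPostProb] <- 3L     # down-regulated (assumed to win ties)
  mem
}

post <- rbind(c(0.45, 0.35, 0.20),
              c(0.10, 0.30, 0.60),
              c(0.20, 0.70, 0.10))
mem_max <- assign_classes(post, maxFlag = TRUE)
mem_thr <- assign_classes(post, maxFlag = FALSE, thrshPostProb = 0.5)
```

The first gene shows the difference between the rules: its largest posterior probability (0.45) belongs to class 1, but it does not exceed the threshold, so the threshold rule leaves it in class 2.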
Argument: geneNames
an optional character vector of gene names.
Argument: alpha
significance level, equal to 1-conf.level, where conf.level is the argument of the function t.test.
Argument: transformFlag
logical. Indicates whether data transformation is needed.
Argument: transformMethod
method for transforming data. Available methods include "boxcox", "log2", "log10", "log", "none".
Argument: scaleFlag
logical. Indicates whether gene profiles are to be scaled. If transformFlag=TRUE and scaleFlag=TRUE, then scaling is performed after transformation. To avoid linear dependence of tissue samples after scaling gene profiles, we delete one tissue sample after scaling (cf. Details).
Argument: criterion
if transformFlag=TRUE, criterion indicates which criterion is used to judge whether the data look normal. "cor" means using Pearson's correlation. The idea is that the observed quantiles after transformation should be close to the theoretical normal quantiles, so we can use Pearson's correlation to check whether the scatter plot of theoretical normal quantiles versus observed quantiles is a straight line. "skewness" means using a skewness measure to check whether the distribution of the transformed data is close to a normal distribution; "kurtosis" means using a kurtosis measure to check normality.
Argument: minL
lower limit for the lambda parameter used in the Box-Cox transformation.
Argument: maxL
upper limit for the lambda parameter used in the Box-Cox transformation.
Argument: stepL
step size used when searching for the optimal lambda parameter of the Box-Cox transformation.
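The roles of minL, maxL, stepL and the "cor" criterion can be illustrated with a small grid search. This is a minimal sketch, not the package's internal code: boxcox_cor_search is a hypothetical helper, and treating a lambda whose absolute value is below eps as zero (where the Box-Cox transformation becomes the logarithm) is an assumption about how such a search would typically be written.

```r
# Hypothetical helper (not the package's internal code): grid search for
# the Box-Cox lambda that makes the normal quantile plot most linear.
boxcox_cor_search <- function(x, minL = -10, maxL = 10, stepL = 0.1,
                              eps = 0.001) {
  stopifnot(all(x > 0))                    # Box-Cox needs positive data
  lambdas <- seq(minL, maxL, by = stepL)
  q_theo <- qnorm(ppoints(length(x)))      # theoretical normal quantiles
  score <- vapply(lambdas, function(l) {
    # assumption: |lambda| < eps is treated as lambda = 0 (log transform)
    y <- if (abs(l) < eps) log(x) else (x^l - 1) / l
    cor(sort(y), q_theo)                   # Pearson correlation of the Q-Q plot
  }, numeric(1))
  lambdas[which.max(score)]
}

# Data whose logarithm is exactly a set of normal quantiles
x <- exp(qnorm(ppoints(200)))
lambda_hat <- boxcox_cor_search(x, minL = -2, maxL = 2, stepL = 0.1)
```

For this input the search should pick the grid point at (numerically) zero, since the log transform reproduces the theoretical quantiles.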
Argument: eps
a small positive value. If the absolute value of a number is smaller than eps, the number is regarded as zero.
Argument: ITMAX
maximum number of iterations allowed in the EM algorithm.
Argument: plotFlag
logical. Indicates whether the Box-Cox normality plot should be output.
Argument: quiet
logical. Controls whether intermediate results are printed.
----------Details----------
We assume that the distribution of gene expression profiles is a mixture of 3-component multivariate normal distributions ∑_{k=1}^{3} π_k f_k(x|θ). Each component distribution f_k corresponds to a gene cluster. The 3 components correspond to 3 gene clusters: (1) up-regulated gene cluster, (2) non-differentially expressed gene cluster, and (3) down-regulated gene cluster.
The model parameter vector is θ=(π_1, π_2, π_3, μ_{c1}, σ^2_{c1}, ρ_{c1}, μ_{n1}, σ^2_{n1}, ρ_{n1}, μ_2, σ^2_2, ρ_2, μ_{c3}, σ^2_{c3}, ρ_{c3}, μ_{n3}, σ^2_{n3}, ρ_{n3}), where
π_1, π_2, and π_3 are the mixing proportions;
μ_{c1}, σ^2_{c1}, and ρ_{c1} are the marginal mean, variance, and correlation of gene expression levels of cluster 1 (up-regulated genes) for diseased subjects;
μ_{n1}, σ^2_{n1}, and ρ_{n1} are the marginal mean, variance, and correlation of gene expression levels of cluster 1 (up-regulated genes) for non-diseased subjects;
μ_2, σ^2_2, and ρ_2 are the marginal mean, variance, and correlation of gene expression levels of cluster 2 (non-differentially expressed genes);
μ_{c3}, σ^2_{c3}, and ρ_{c3} are the marginal mean, variance, and correlation of gene expression levels of cluster 3 (down-regulated genes) for diseased subjects;
μ_{n3}, σ^2_{n3}, and ρ_{n3} are the marginal mean, variance, and correlation of gene expression levels of cluster 3 (down-regulated genes) for non-diseased subjects.
Note that genes in cluster 2 are non-differentially expressed across abnormal and normal tissue samples. Hence there are only 3 parameters for cluster 2.
To ensure identifiability, we set the following constraints: μ_{c1}>μ_{n1} and μ_{c3}<μ_{n3}.
To ensure that the marginal covariance matrices are positive definite, we set the following constraints: -1/(n_c-1)<ρ_{c1}<1, -1/(n_n-1)<ρ_{n1}<1, -1/(n-1)<ρ_{2}<1, -1/(n_c-1)<ρ_{c3}<1, -1/(n_n-1)<ρ_{n3}<1. Here n_c and n_n are the numbers of diseased and non-diseased subjects, and n=n_c+n_n.
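The lower bound -1/(n-1) on the correlations can be checked numerically. The sketch below assumes an exchangeable (compound-symmetry) covariance matrix, that is, a common variance on the diagonal and a common correlation off the diagonal; this structure is inferred from the single variance and correlation parameter per cluster and is not spelled out in this page. exch_cov and is_pd are hypothetical helpers.

```r
# Assumption: exchangeable covariance sigma2 * ((1 - rho) * I + rho * J)
# for n subjects; its eigenvalues are sigma2 * (1 - rho) and
# sigma2 * (1 + (n - 1) * rho).
exch_cov <- function(n, sigma2, rho) {
  sigma2 * ((1 - rho) * diag(n) + rho * matrix(1, n, n))
}
# positive definite if and only if all eigenvalues are positive
is_pd <- function(S) {
  all(eigen(S, symmetric = TRUE, only.values = TRUE)$values > 0)
}

n <- 6
lower <- -1 / (n - 1)
pd_inside  <- is_pd(exch_cov(n, 2, lower + 0.01))  # just inside the bound
pd_outside <- is_pd(exch_cov(n, 2, lower - 0.01))  # just outside the bound
pd_high    <- is_pd(exch_cov(n, 2, 0.99))          # close to the upper bound
```

Under this assumption the matrix stops being positive definite as soon as the correlation drops below -1/(n-1), which matches the constraints listed above.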
We also have the following constraints for the mixing proportions: π_3=1-π_1-π_2, π_k>0, k=1,2,3.
We apply the EM algorithm to estimate the model parameters. We regard the cluster membership of genes as missing values.
To facilitate the estimation of the parameters, we reparametrize the parameter vector as θ^*=(π_1, π_2, μ_{c1}, σ^2_{c1}, r_{c1}, δ_{n1}, σ^2_{n1}, r_{n1}, μ_2, σ^2_2, r_2, μ_{c3}, σ^2_{c3}, r_{c3}, δ_{n3}, σ^2_{n3}, r_{n3}), where
μ_{n1}=μ_{c1}-\exp(δ_{n1}),
μ_{n3}=μ_{c3}+\exp(δ_{n3}),
ρ_{c1}=(\exp(r_{c1})-1/(n_c-1))/(1+\exp(r_{c1})),
ρ_{n1}=(\exp(r_{n1})-1/(n_n-1))/(1+\exp(r_{n1})),
ρ_{2}=(\exp(r_{2})-1/(n-1))/(1+\exp(r_{2})),
ρ_{c3}=(\exp(r_{c3})-1/(n_c-1))/(1+\exp(r_{c3})),
ρ_{n3}=(\exp(r_{n3})-1/(n_n-1))/(1+\exp(r_{n3})).
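The effect of the reparametrization is that any real value of r or δ maps to a parameter that satisfies the constraints automatically, which presumably lets the estimation step work on unconstrained quantities. A minimal sketch of the mappings, with hypothetical helper names that are not part of the package:

```r
# Hypothetical helpers illustrating the reparametrization formulas above.
# Map an unconstrained r to a correlation in (-1/(n - 1), 1).
rho_from_r <- function(r, n) (exp(r) - 1 / (n - 1)) / (1 + exp(r))
# Map an unconstrained delta to a mean that respects the ordering constraint.
mu_n1_from_delta <- function(mu_c1, delta_n1) mu_c1 - exp(delta_n1)  # mu_n1 < mu_c1
mu_n3_from_delta <- function(mu_c3, delta_n3) mu_c3 + exp(delta_n3)  # mu_n3 > mu_c3

n_c <- 6
r <- seq(-10, 10, by = 0.5)
rho <- rho_from_r(r, n_c)
range(rho)   # stays strictly between -1/(n_c - 1) and 1
```

As r decreases the correlation approaches -1/(n_c-1), and as r increases it approaches 1, without ever reaching either bound in exact arithmetic.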
Given a gene, the expression levels of the gene are assumed to be independent. However, after scaling, the scaled expression levels of the gene are no longer independent, and the rank r^*=r-1 of the covariance matrix for the scaled gene profile is one less than the rank r for the unscaled gene profile. Hence the covariance matrix of the scaled gene profile is no longer positive definite. To avoid this problem, we delete a tissue sample after scaling, since its information has been incorporated into the other scaled tissue samples. We arbitrarily select the tissue sample with the largest label number from the larger of the two tissue sample groups. For example, if there are 6 cancer tissue samples and 10 normal tissue samples, we delete the 10-th normal tissue sample after scaling.
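The rank loss caused by scaling can be seen on simulated data. The sketch below assumes that scaling a gene profile means centering each row to mean 0 and dividing by its standard deviation; the page does not state the exact scaling used by the package. For simplicity the sketch drops the last column rather than applying the group-size rule described above.

```r
set.seed(1)
n_genes <- 50
n_subj <- 6
X <- matrix(rnorm(n_genes * n_subj), nrow = n_genes)

# Assumed scaling: each gene profile (row) gets mean 0 and sd 1
Xs <- t(scale(t(X)))

# After scaling every row sums to zero, so the subject columns are
# linearly dependent and the covariance across subjects loses one rank.
max_row_sum  <- max(abs(rowSums(Xs)))
rank_raw     <- qr(cov(X))$rank              # full rank: n_subj
rank_scaled  <- qr(cov(Xs))$rank             # n_subj - 1
rank_dropped <- qr(cov(Xs[, -n_subj]))$rank  # full rank again for n_subj - 1 columns
```

Dropping one subject restores a full-rank covariance matrix for the remaining columns, which is the motivation given above for deleting one tissue sample.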
----------Value----------
A list containing 13 elements.
Element: dat
the (transformed) microarray data matrix. If a transformation is performed, then dat will differ from the input microarray data matrix.
Element: memSubjects
the same as the input memSubjects.
Element: memGenes
a vector of cluster membership of genes. 1 means up-regulated gene; 2 means non-differentially expressed gene; 3 means down-regulated gene.
Element: memGenes2
a variant of the vector of cluster membership of genes. 1 means differentially expressed gene; 0 means non-differentially expressed gene.
Element: para
parameter estimates (cf. Details).
Element: llkh
value of the log-likelihood function.
Element: wiMat
posterior probability that a gene belongs to a cluster given the expression levels of this gene. Column i is for cluster i.
Element: memIni
the initial cluster membership of genes.
Element: paraIni
the parameter estimates based on initial gene cluster membership.
Element: llkhIni
the value of the log-likelihood function based on initial gene cluster membership.
Element: lambda
the parameter used in the Box-Cox transformation.
Element: paraRP
parameter estimates for the reparametrized parameter vector (cf. Details).
Element: paraIniRP
the parameter estimates for the reparametrized parameter vector based on initial gene cluster membership.
----------Note----------
The program is slow for large data sets.
----------Author(s)----------
Weiliang Qiu <stwxq@channing.harvard.edu>,
Wenqing He <whe@stats.uwo.ca>,
Xiaogang Wang <stevenw@mathstat.yorku.ca>,
Ross Lazarus <ross.lazarus@channing.harvard.edu>
----------References----------
A Marginal Mixture Model for Selecting Differentially Expressed Genes across Two Types of Tissue Samples. The International Journal of Biostatistics. 4(1):Article 20. http://www.bepress.com/ijb/vol4/iss1/20
----------See Also----------
gsMMD, gsMMD.default, gsMMD2
----------Examples----------
## Not run:
library(GeneSelectMMD)
library(Biobase)   # provides exprs()
library(ALL)
data(ALL)
# first 100 genes; keep only subjects of type B3 or T2
eSet1 <- ALL[1:100, ALL$BT == "B3" | ALL$BT == "T2"]
mat <- exprs(eSet1)
mem.str <- as.character(eSet1$BT)
nSubjects <- length(mem.str)
memSubjects <- rep(0, nSubjects)
# B3 coded as 0, T2 coded as 1
memSubjects[mem.str == "T2"] <- 1

# Per-gene Wilcoxon rank-sum test: returns the p-value and a centered
# statistic whose sign gives the direction of the difference.
myWilcox <- function(x, memSubjects, alpha = 0.05)
{
  xc <- x[memSubjects == 1]
  xn <- x[memSubjects == 0]
  m <- length(xc)
  n0 <- length(xn)
  res <- wilcox.test(x = xc, y = xn, conf.level = 1 - alpha)
  # wilcox.test() already reports W = (rank sum of xc) - m * (m + 1) / 2,
  # so center W at its null mean m * n0 / 2 to obtain a signed statistic
  res2 <- c(res$p.value, res$statistic - m * n0 / 2)
  names(res2) <- c("p.value", "statistic")
  return(res2)
}

tmp <- t(apply(mat, 1, myWilcox, memSubjects = memSubjects))
colnames(tmp) <- c("p.value", "statistic")
# initial gene cluster membership: 1 = up, 2 = non-DE, 3 = down
memIni <- rep(2, nrow(mat))
memIni[tmp[, 1] < 0.05 & tmp[, 2] > 0] <- 1
memIni[tmp[, 1] < 0.05 & tmp[, 2] < 0] <- 3
cat("initial gene cluster size>>\n"); print(table(memIni)); cat("\n")

obj.gsMMD <- gsMMD2.default(mat, memSubjects, memIni = memIni,
  transformFlag = TRUE, transformMethod = "boxcox", scaleFlag = TRUE)
round(obj.gsMMD$para, 3)
## End(Not run)
When reposting, please credit the source: 生物统计家园网 (http://www.biostatistic.net).