estimateDispersions(DESeq)
estimateDispersions() is part of the R package DESeq
Estimate and fit dispersions for a CountDataSet.
Translated by: 生物统计家园网, robot LoveR
Description
This function obtains dispersion estimates for a count data set. For each condition (or collectively for all conditions, see the 'method' argument below), it first computes an empirical dispersion value (a.k.a. a raw SCV value) for each gene, then fits a dispersion-mean relationship by regression, and finally chooses for each gene the dispersion parameter to be used in subsequent tests, taking either the empirical or the fitted value according to the 'sharingMode' argument.
Usage
## S4 method for signature 'CountDataSet'
estimateDispersions( object,
method = c( "pooled", "per-condition", "blind" ),
sharingMode = c( "maximum", "fit-only", "gene-est-only" ),
fitType = c("parametric", "local"),
locfit_extra_args=list(), lp_extra_args=list(),
modelFrame = NULL, modelFormula = count ~ condition, ... )
Arguments
Argument: object
a CountDataSet with size factors.
Argument: method
There are three ways to compute the empirical dispersion:
pooled - Use the samples from all conditions with replicates to estimate a single pooled empirical dispersion value, called "pooled", and assign it to all samples.
per-condition - For each condition with replicates, compute a gene's empirical dispersion value by considering the data from samples for this condition. For samples of unreplicated conditions, the maximum of empirical dispersion values from the other conditions is used. If object has a multivariate design (i.e., if a data frame was passed instead of a factor for the condition argument in newCountDataSet), this method is not available. (Note: This method was called “normal” in previous versions.)
blind - Ignore the sample labels and compute a gene's empirical dispersion value as if all samples were replicates of a single condition. This can be done even if there are no biological replicates. This method can lead to loss of power; see the vignette for details. The single estimated dispersion condition is called "blind" and used for all samples.
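The difference between "blind" and "per-condition" can be illustrated with a simplified method-of-moments estimate for a single gene. This is only a sketch in base R, not DESeq's implementation: the helper raw_dispersion and the counts are hypothetical, and the counts are assumed to be already normalized (all size factors equal to 1).

```r
# Simplified method-of-moments dispersion for one gene (illustrative only).
# Assumes normalized counts, i.e. all size factors equal to 1.
raw_dispersion <- function(x) {
  m <- mean(x)
  v <- var(x)
  (v - m) / m^2   # variance in excess of Poisson noise, relative to mean^2
}

# Hypothetical normalized counts for one gene in two conditions
counts    <- c(80, 120, 100, 160, 240, 200)
condition <- factor(c("A", "A", "A", "B", "B", "B"))

# "blind": ignore the labels and treat all six samples as replicates
disp_blind <- raw_dispersion(counts)

# "per-condition": one estimate per replicated condition
disp_per_condition <- tapply(counts, condition, raw_dispersion)

print(disp_blind)
print(disp_per_condition)
```

Because the gene differs between the two conditions in this toy data, the blind estimate absorbs the between-condition difference and comes out larger than either per-condition estimate, which is the mechanism behind the loss of power mentioned above.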
Argument: sharingMode
After the empirical dispersion values have been computed for each gene, a dispersion-mean relationship is fitted for sharing information across genes in order to reduce variability of the dispersion estimates. After that, for each gene, we have two values: the empirical value (derived only from this gene's data), and the fitted value (i.e., the dispersion value typical for genes with an average expression similar to those of this gene). The sharingMode argument specifies which of these two values will be written to the featureData's disp_ columns and hence will be used by the functions nbinomTest and fitNbinomGLMs.
fit-only - use only the fitted value, i.e., the empirical value is used only as input to the fitting, and then ignored. Use this only with very few replicates, and when you are not too concerned about false positives from dispersion outliers, i.e. genes with an unusually high variability.
maximum - take the maximum of the two values. This is the conservative or prudent choice, recommended once you have at least three or four replicates and maybe even with only two replicates.
gene-est-only - No fitting or sharing, use only the empirical value. This method is preferable when the number of replicates is large and the empirical dispersion values are sufficiently reliable. If the number of replicates is small, this option may lead to many cases where the dispersion of a gene is accidentally underestimated and a false positive arises in the subsequent testing.
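The three sharing modes amount to a per-gene choice between two numbers. A minimal base R sketch of that choice follows; the function choose_dispersion and the example values are hypothetical and only mirror the description above, they are not taken from the DESeq source.

```r
# Hypothetical per-gene values: empirical estimates and values read off the fitted curve
empirical <- c(g1 = 0.02, g2 = 0.50, g3 = 0.10)
fitted    <- c(g1 = 0.08, g2 = 0.09, g3 = 0.10)

# Pick the dispersion used for testing, following the sharingMode description
choose_dispersion <- function(empirical, fitted,
                              sharingMode = c("maximum", "fit-only", "gene-est-only")) {
  sharingMode <- match.arg(sharingMode)
  switch(sharingMode,
         "maximum"       = pmax(empirical, fitted),  # conservative choice
         "fit-only"      = fitted,                   # ignore per-gene estimates
         "gene-est-only" = empirical)                # no sharing across genes
}

disp_max  <- choose_dispersion(empirical, fitted)              # default: "maximum"
disp_fit  <- choose_dispersion(empirical, fitted, "fit-only")
disp_gene <- choose_dispersion(empirical, fitted, "gene-est-only")
print(disp_max)
```

In this toy data, g2 is a dispersion outlier: "maximum" keeps its high empirical value, whereas "fit-only" would replace it with the much smaller fitted value.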
Argument: fitType
parametric - Fit a dispersion-mean relation of the form dispersion = asymptDisp + extraPois / mean via a robust gamma-family GLM. The coefficients asymptDisp and extraPois are given in the attribute coefficients of the dispFunc in the fitInfo (see below).
local - Use the locfit package to fit a dispersion-mean relation, as described in the DESeq paper.
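The parametric form is simple enough to evaluate by hand. The sketch below defines it as a plain R function; the name disp_func and the coefficient values are made up for illustration and are not fitted from any data.

```r
# Parametric dispersion-mean relation: dispersion = asymptDisp + extraPois / mean
disp_func <- function(mean, asymptDisp, extraPois) {
  asymptDisp + extraPois / mean
}

# Hypothetical coefficients, chosen only to show the shape of the curve
means <- c(1, 10, 100, 1000)
disp  <- disp_func(means, asymptDisp = 0.05, extraPois = 2)
print(data.frame(mean = means, dispersion = disp))
```

For weakly expressed genes the extraPois / mean term dominates, while for strongly expressed genes the fitted dispersion approaches asymptDisp.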
Argument: locfit_extra_args, lp_extra_args
(only for fitType=local) Options to be passed to the locfit and to the lp function of the locfit package. Use this to adjust the local fitting. For example, you may pass a value for nn different from the default (0.7) if the fit seems too smooth or too rough by setting lp_extra_args=list(nn=0.9). As another example, you can set locfit_extra_args=list(maxk=200) if you get the error that locfit ran out of nodes. See the documentation of the locfit package for details. In most cases, you will not need to provide these parameters, as the defaults seem to work quite well.
Argument: modelFrame
By default, the information in conditions(object) or pData(object) is used to determine which samples are replicates (see newCountDataSet). For method="pooled", a data frame can be passed here, and all rows that are identical in this data frame are considered to indicate replicate samples in object. For method="pooled-CR", the data frame is used in the fits. For the other methods, this argument is ignored.
Argument: modelFormula
For method="pooled-CR", this is the formula used for the dispersion fits. For all other methods, this argument is ignored.
Argument: ...
extra arguments are ignored
Details
Behaviour for method="per-condition": For each replicated condition, a list, named with the condition's name, is placed in the environment object@fitInfo. This list has five named elements: The vector perGeneDispEsts contains the empirical dispersions. The function dispFunc is the fitted function, i.e., it takes as its argument a normalized mean expression value and returns the corresponding fitted dispersion. The values fitted according to this function are in the third element fittedDispEst, a vector of the same length as perGeneDispEsts. The fourth element, df, is an integer, indicating the number of degrees of freedom of the per-gene estimation. The fifth element, sharingMode, stores the value of the sharingMode argument to estimateDispersions.
Behaviour for method="blind" and method="pooled": Only one list is produced, named "blind" or "pooled" and placed in object@fitInfo.
For each list in the fitInfo environment, the dispersion values that are intended to be used in subsequent testing are computed according to the value of sharingMode and are placed in the featureData data frame, in a column named with the same name, prefixed with "disp_".
Then, the dispTable (see there) is filled to assign to each condition the appropriate dispersion column in the phenoData frame.
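The layout described above can be mimicked in base R to make the naming convention concrete. This is a hand-built mock, not output from DESeq: the gene names, values, and the variable feature_data are invented, and only the element names and the "disp_" prefix follow the text.

```r
# Mock of the fitInfo environment for method = "pooled" (illustrative only)
fit_info <- new.env()

base_mean <- c(g1 = 50, g2 = 400)            # hypothetical normalized mean expression
per_gene  <- c(g1 = 0.02, g2 = 0.50)         # hypothetical empirical dispersions
disp_fun  <- function(q) 0.05 + 2 / q        # hypothetical fitted function

assign("pooled",
       list(perGeneDispEsts = per_gene,
            dispFunc        = disp_fun,
            fittedDispEst   = disp_fun(base_mean),
            df              = 4L,
            sharingMode     = "maximum"),
       envir = fit_info)

# The column used for testing is named after the list, prefixed with "disp_"
info <- get("pooled", envir = fit_info)
feature_data <- data.frame(
  disp_pooled = pmax(info$perGeneDispEsts, info$fittedDispEst),
  row.names   = names(per_gene)
)
print(feature_data)
```

With a real CountDataSet, the corresponding objects are the ones returned by fitInfo( cds ) and fData( cds ), as in the Examples section.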
Note: Up to DESeq version 1.4.x (Bioconductor release 2.8), this function was called estimateVarianceFunctions, stored its result differently, and did not yet have the arguments sharingMode and fitType. The behaviour of estimateVarianceFunctions corresponded to the settings sharingMode="fit-only" and fitType="local". Note that these are not the defaults, because the new settings sharingMode="maximum" and fitType="parametric" tend to give better results in standard cases as they are more robust.
Value
The CountDataSet cds, with the slots fitInfo and featureData updated as described in Details.
Author(s)
Simon Anders, sanders@fs.tum.de
Examples
library( DESeq )
cds <- makeExampleCountDataSet()
cds <- estimateSizeFactors( cds )
cds <- estimateDispersions( cds )
str( fitInfo( cds ) )
head( fData( cds ) )
When reposting, please credit the source: 生物统计家园网 (http://www.biostatistic.net).