
R seqinr package: help documentation for G+C Content()

Posted on 2012-9-30 01:17:50
G+C Content(seqinr)
G+C Content() belongs to the R package: seqinr

Calculates the fractional G+C content of nucleic acid sequences.

Translator: 生物统计家园网 robot LoveR

----------Description----------

Calculates the fraction of G+C bases of the input nucleic acid sequence(s). It reads in nucleic acid sequences, sums the number of 'g' and 'c' bases, and returns the result as a fraction (in the interval 0.0 to 1.0) of the total number of 'a', 'c', 'g' and 't' bases. The global G+C content (GC) can be computed, as well as the G+C content in the first (GC1), second (GC2) and third (GC3) codon positions. All functions can take ambiguous bases into account when requested.
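
The fraction described above can be sketched in base R. This is only an illustration of the definition, not seqinr's implementation; the helper name gc_fraction is hypothetical, and the sketch ignores ambiguous bases entirely.

```r
# Hypothetical helper (not part of seqinr): fraction of g + c bases
# among the unambiguous bases a, c, g and t.
gc_fraction <- function(bases) {
  bases <- tolower(bases)                          # accept upper-case input
  n_gc <- sum(bases %in% c("g", "c"))              # count of g and c
  n_acgt <- sum(bases %in% c("a", "c", "g", "t"))  # count of a, c, g and t
  if (n_acgt == 0) return(NA_real_)                # nothing to compute from
  n_gc / n_acgt
}

# Split a string into a vector of single characters, then compute the fraction.
bases <- strsplit("agtctggggggccccttttaagtagatagatagctagtcgta", "")[[1]]
gc_fraction(bases)  # 20 g/c bases out of 42, about 0.476
```
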


----------Usage----------


GC(seq, forceToLower = TRUE, exact = FALSE, NA.GC = NA, oldGC = FALSE)
GC1(seq, frame = 0, ...)
GC2(seq, frame = 0, ...)
GC3(seq, frame = 0, ...)
GCpos(seq, pos, frame = 0, ...)



----------Arguments----------

Argument: seq
a nucleic acid sequence as a vector of single characters


Argument: frame
for coding sequences, an integer (0, 1, 2) giving the frame


Argument: forceToLower
logical. If TRUE, sequence characters are forced to lower-case. Set this to FALSE to save time if your sequence is already in lower-case (CPU time is approximately divided by 3 when turned off).


Argument: exact
logical. If TRUE, ambiguous bases are taken into account when computing the G+C content (see Details). Set this to FALSE to save time if you can neglect ambiguous bases in your sequence (CPU time is approximately divided by 3 when turned off).


Argument: NA.GC
the value to return when the G+C content is impossible to compute from the data, for instance with NNNNNNN. The behaviour can differ when the argument exact is TRUE: for instance, the G+C content of WWSS is NA by default, but is 0.5 when exact is set to TRUE.


Argument: ...
arguments passed to the function GC


Argument: pos
for coding sequences, the codon position (1, 2, 3) that should be taken into account to compute the G+C content


Argument: oldGC
logical, defaulting to FALSE: should the G+C content be computed as in seqinR <= 1.0-6, that is, as the sum of 'g' and 'c' bases divided by the length of the sequence? As of seqinR >= 1.1-3, this argument is deprecated and a warning is issued.
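
The difference between the old and the current definition only shows up when the sequence contains characters other than a, c, g and t. A minimal base-R sketch of the two denominators follows; the variable names are illustrative and this is not seqinr code.

```r
# A sequence with four unambiguous bases and four 'n' characters.
bases <- c("a", "c", "g", "t", "n", "n", "n", "n")

n_gc <- sum(bases %in% c("g", "c"))

# Old definition (as described for seqinR <= 1.0-6): divide by sequence length.
old_style <- n_gc / length(bases)

# Current definition: divide by the number of a, c, g and t bases only.
new_style <- n_gc / sum(bases %in% c("a", "c", "g", "t"))

c(old = old_style, new = new_style)  # 0.25 versus 0.5
```
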


----------Details----------

When exact is set to TRUE, the G+C content is estimated with ambiguous bases taken into account. Note that this is time-consuming. A first pass is made on non-ambiguous bases to estimate the probabilities of the four bases in the sequence. These are then used to weight the contributions of ambiguous bases to the G+C content. Let nx denote the total number of bases 'x' in the sequence. For instance, suppose that there are nb bases 'b'. 'b' stands for "not a", that is, for 'c', 'g' or 't'. The contribution of 'b' bases to the GC base count will be:

nb*(nc + ng)/(nc + ng + nt)

The contribution of 'b' bases to the AT base count will be:

nb*nt/(nc + ng + nt)

The contributions of all ambiguous bases to the AT and GC counts are weighted in a similar way, and the G+C content is then computed as ngc/(nat + ngc).
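
As a worked illustration of the two formulas above, the following base-R sketch handles only the ambiguous base 'b'. The function name weighted_gc_with_b and the counts are hypothetical, and the real GC() handles the other ambiguity codes as well.

```r
# Hypothetical helper: G+C content from base counts when the only
# ambiguous base present is 'b' ("not a", i.e. c, g or t).
weighted_gc_with_b <- function(na, nc, ng, nt, nb) {
  denom <- nc + ng + nt                 # bases that 'b' can stand for
  gc_from_b <- nb * (nc + ng) / denom   # contribution of 'b' to the GC count
  at_from_b <- nb * nt / denom          # contribution of 'b' to the AT count
  ngc <- nc + ng + gc_from_b
  nat <- na + nt + at_from_b
  ngc / (nat + ngc)
}

# Example counts: 2 a, 1 c, 1 g, 2 t and 4 b.
# 'b' adds 4 * 2/4 = 2 to the GC count and 4 * 2/4 = 2 to the AT count,
# so ngc = 4, nat = 6 and the G+C content is 4 / 10.
weighted_gc_with_b(na = 2, nc = 1, ng = 1, nt = 2, nb = 4)  # 0.4
```

If the sequence contains no c, g or t at all, the denominator is zero and this sketch returns NaN; it makes no attempt to handle that case.
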


----------Value----------

GC returns the fraction of G+C (in [0,1]) as a numeric vector of length one. GCpos returns the G+C content at codon position pos. GC1, GC2 and GC3 are wrappers for GCpos with the argument pos set to 1, 2 and 3, respectively. NA is returned when seq is NA. NA.GC, which defaults to NA, is returned when the G+C content cannot be computed from the data.
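
The relation between codon position and G+C content can be sketched in base R as follows. This is not seqinr's GCpos code: the helper names are hypothetical, and the sketch assumes that frame is the number of leading bases to skip before the first codon.

```r
# Hypothetical helper: keep the bases at codon position pos (1, 2 or 3),
# after skipping `frame` leading bases (an assumption about frame's meaning).
codon_position_bases <- function(bases, pos, frame = 0) {
  if (frame > 0) bases <- bases[-seq_len(frame)]
  if (length(bases) < pos) return(character(0))
  bases[seq(from = pos, to = length(bases), by = 3)]
}

# Hypothetical helper: G+C fraction among a, c, g, t at one codon position.
gc_at_position <- function(bases, pos, frame = 0) {
  sel <- tolower(codon_position_bases(bases, pos, frame))
  sum(sel %in% c("g", "c")) / sum(sel %in% c("a", "c", "g", "t"))
}

cds <- strsplit("ATGATG", "")[[1]]
gc_at_position(cds, pos = 1)  # positions 1 and 4 are 'A': 0
gc_at_position(cds, pos = 2)  # positions 2 and 5 are 'T': 0
gc_at_position(cds, pos = 3)  # positions 3 and 6 are 'G': 1
```
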


----------Author(s)----------


D. Charif and L. Palmeira and J.R. Lobry



----------References----------


http://codonw.sourceforge.net/.

----------See Also----------

You can use s2c to convert a string into a vector of single characters, and tolower to convert upper-case characters into lower-case characters. Do not confuse GC with gc, which is for garbage collection.
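
For readers without seqinr at hand, the kind of conversion described above can be approximated in base R. This is an assumption about equivalent behaviour for plain ASCII strings, not a description of how s2c is implemented.

```r
# Split a string into single characters (base-R approximation of s2c),
# then convert to lower-case.
chars <- strsplit("GGGGGGGGGA", split = "")[[1]]
lower_chars <- tolower(chars)

length(lower_chars)  # one element per character of the input string
lower_chars
```
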


----------Examples----------


  library(seqinr)

  mysequence <- s2c("agtctggggggccccttttaagtagatagatagctagtcgta")
  GC(mysequence)  # 0.4761905
  GC1(mysequence) # 0.6428571
  GC2(mysequence) # 0.3571429
  GC3(mysequence) # 0.4285714

  # With upper-case characters:
  myUCsequence <- s2c("GGGGGGGGGA")
  GC(myUCsequence) # 0.9

  # With ambiguous bases:
  GC(s2c("acgt")) # 0.5
  GC(s2c("acgtssss")) # 0.5
  GC(s2c("acgtssss"), exact = TRUE) # 0.75

  # Missing data:
  stopifnot(is.na(GC(s2c("NNNN"))))
  stopifnot(is.na(GC(s2c("NNNN"), exact = TRUE)))
  stopifnot(is.na(GC(s2c("WWSS"))))
  stopifnot(GC(s2c("WWSS"), exact = TRUE) == 0.5)

  # Coding sequences tests:
  cdstest <- s2c("ATGATG")
  stopifnot(GC3(cdstest) == 1)
  stopifnot(GC2(cdstest) == 0)
  stopifnot(GC1(cdstest) == 0)

  # How to reproduce the results obtained with the C program codonW
  # version 1.4.4 written by John Peden. We use here the "input.dat"
  # test file from codonW (there are no ambiguous bases in these
  # sequences).
  inputdatfile <- system.file("sequences/input.dat", package = "seqinr")
  input <- read.fasta(file = inputdatfile) # read the FASTA file
  inputoutfile <- system.file("sequences/input.out", package = "seqinr")
  input.res <- read.table(inputoutfile, header = TRUE) # read codonW result file

  # Remove the stop codon before computing G+C content (as in codonW)
  GC.codonW <- function(dnaseq, ...) {
    GC(dnaseq[seq_len(length(dnaseq) - 3)], ...)
  }
  input.gc <- sapply(input, GC.codonW, forceToLower = FALSE)
  max(abs(input.gc - input.res$GC)) # 0.0004946237

  plot(x = input.gc, y = input.res$GC, las = 1,
       xlab = "Results with GC()", ylab = "Results from codonW",
       main = "Comparison of G+C content results")
  abline(c(0, 1), col = "red")
  legend("topleft", inset = 0.01, legend = "y = x", lty = 1, col = "red")
## Not run:
# Too long for routine check
# This is a benchmark to compare the effect of various parameter
# settings on computation time
n <- 10
from <- 10^4
to <- 10^5
size <- seq(from = from, to = to, length.out = n)
res <- data.frame(matrix(NA, nrow = n, ncol = 5))
colnames(res) <- c("size", "FF", "FT", "TF", "TT")
res[, "size"] <- size

# Column names encode (forceToLower, exact), e.g. "FT" is FALSE, TRUE
for (i in seq_len(n)) {
  myseq <- sample(x = s2c("acgtws"), size = size[i], replace = TRUE)
  res[i, "FF"] <- system.time(GC(myseq, forceToLower = FALSE, exact = FALSE))[3]
  res[i, "FT"] <- system.time(GC(myseq, forceToLower = FALSE, exact = TRUE))[3]
  res[i, "TF"] <- system.time(GC(myseq, forceToLower = TRUE, exact = FALSE))[3]
  res[i, "TT"] <- system.time(GC(myseq, forceToLower = TRUE, exact = TRUE))[3]
}

# Top panel: observed time against sequence size, with linear fits
par(oma = c(0, 0, 2.5, 0), mar = c(4, 5, 0, 2) + 0.1, mfrow = c(2, 1))
plot(res$size, res$TT, las = 1,
     xlab = "Sequence size [bp]",
     ylim = c(0, max(res$TT)), xlim = c(0, max(res$size)), ylab = "")
title(ylab = "Observed time [s]", line = 4)
abline(lm(res$TT ~ res$size))
points(res$size, res$FT, col = "red")
abline(lm(res$FT ~ res$size), col = "red", lty = 3)
points(res$size, res$TF, pch = 2)
abline(lm(res$TF ~ res$size))
points(res$size, res$FF, pch = 2, col = "red")
abline(lm(res$FF ~ res$size), lty = 3, col = "red")

legend("topleft", inset = 0.01,
       legend = c("forceToLower = TRUE", "forceToLower = FALSE"),
       col = c("black", "red"), lty = c(1, 3))
legend("bottomright", inset = 0.01,
       legend = c("exact = TRUE", "exact = FALSE"), pch = c(1, 2))

# Bottom panel: slope of each fit relative to the (FALSE, FALSE) setting
mincpu <- coef(lm(res$FF ~ res$size))[2]

barplot(
  c(coef(lm(res$FF ~ res$size))[2] / mincpu,
    coef(lm(res$TF ~ res$size))[2] / mincpu,
    coef(lm(res$FT ~ res$size))[2] / mincpu,
    coef(lm(res$TT ~ res$size))[2] / mincpu),
  horiz = TRUE, xlab = "Increase of CPU time",
  col = c("red", "black", "red", "black"),
  names.arg = c("(F,F)", "(T,F)", "(F,T)", "(T,T)"), las = 1)
title(ylab = "forceToLower,exact", line = 4)

mtext("CPU time as function of options", outer = TRUE, line = 1, cex = 1.5)

## End(Not run)

Please credit the source when reposting: 生物统计家园网 (http://www.biostatistic.net).

