R language, seqinr package: help documentation for the fastacc() function

loveR · posted 2012-9-30 01:17:44

fastacc(seqinr)
fastacc() belongs to the R package: seqinr

                                    Fast Allele in Common Count

                                       Chinese translation by: LoveR, the robot of 生物统计家园网 (biostatistic.net)

----------Description----------

The purpose of this function is to compute, as fast as possible, the number of alleles in common between a target (typically the genetic profile observed at a crime scene, possibly a mixture with dropouts) and a database reference (typically the genetic profiles of individuals). Both are assumed to be pre-encoded at the bit level in a consistent way.

----------Usage----------

fastacc(target, database)

----------Arguments----------

Argument: target
the raw encoding of the target, typically 40 octets for a core-CODIS profile in 2009

Argument: database
the raw encoding of the database. If there are n entries in the database, then the database must be n times longer than the target.
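
The layout described for the database argument implies that each entry occupies one contiguous run of octets in a flat raw vector. As a minimal sketch (the helper get_entry and the toy data are hypothetical, not part of seqinr), one record could be pulled back out like this:

```r
# Hypothetical helper, not part of seqinr: extract the k-th record from a
# flat raw database in which every record has the same length as the target.
get_entry <- function(database, k, reclen) {
  stopifnot(is.raw(database), length(database) %% reclen == 0)
  n <- length(database) %/% reclen          # number of records
  stopifnot(k >= 1, k <= n)
  database[((k - 1) * reclen + 1):(k * reclen)]
}

# Toy database: 3 records of 2 octets each (a core-CODIS record would be 40).
db <- as.raw(c(0xc6, 0x20, 0x06, 0x00, 0xff, 0x3f))
second <- get_entry(db, 2, 2)
second
```

The same length check (database length being a multiple of the target length) is a cheap guard to run before calling fastacc().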

----------Details----------

This function is at the RFC (request for comments) stage. Comments are welcome.

Genetic profiles are encoded at the bit level. One bit represents one allele. The count is based on a logical AND at the bit level. The bit count is implemented at the C level using the precomputed approach: one indirection through an auxiliary table of size 256 called bits_in_char, which is precomputed at the R level and passed to the C level.
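
The counting scheme described above can be sketched in plain R for checking small cases. This is an illustrative reference version, not the package's C code: the function name acc_reference is hypothetical, and the way bits_in_char is built here is an assumption about what such a table contains (the number of set bits for each octet value 0 to 255).

```r
# Lookup table: number of bits set in each possible octet value 0..255.
# (Assumed construction; only the table's name comes from the help page.)
bits_in_char <- vapply(0:255,
                       function(i) sum(as.integer(intToBits(i)[1:8])),
                       integer(1))

# Hypothetical pure-R reference: AND each record with the target, look up
# the bit count of every resulting octet, and sum the counts per record.
acc_reference <- function(target, database, lookup = bits_in_char) {
  stopifnot(is.raw(target), is.raw(database),
            length(database) %% length(target) == 0)
  n <- length(database) %/% length(target)
  shared <- database & rep(target, times = n)   # bitwise AND on raw vectors
  counts <- lookup[as.integer(shared) + 1L]     # one table lookup per octet
  # one column per record, because the flat database is stored record by record
  as.integer(colSums(matrix(counts, nrow = length(target))))
}

target   <- as.raw(c(0xc6, 0x20))
database <- as.raw(c(0xc6, 0x20, 0x06, 0x00, 0x00, 0x00))
acc <- acc_reference(target, database)
acc
```

Such a version is far slower than a compiled loop, but it is convenient for cross-checking results on a handful of records.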

----------Value----------

An integer vector giving, for each entry in the database, the number of alleles that the entry has in common with the target.
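
Because the result holds one count per database entry, ranking candidates comes down to ordering that vector. A small sketch with made-up counts (the values in res below are illustrative only, not output of fastacc()):

```r
# Illustrative counts for a 6-entry database; in practice this would be
# the value returned by fastacc(target, database).
res <- c(12L, 30L, 7L, 30L, 18L, 3L)

best <- which(res == max(res))              # every entry tied for the top count
top3 <- order(res, decreasing = TRUE)[1:3]  # indices of the three best entries
best
top3
```

Note that which.max(res) returns only the first maximum, so which(res == max(res)) is the safer choice when ties are possible.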

----------Warning----------

Experimental; first release scheduled for seqinr 2.0-6 by the end of 2009.

----------Author(s)----------

J.R. Lobry

----------References----------

----------See Also----------

FIXME

----------Examples----------

#
# NOTE:
#
# This example section is a proof of concept. Most of the code should be
# embedded in documented functions to avoid verbosity. But at the RFC stage
# it is perhaps not too bad an idea to show how powerful R is.
#

#
# Let's start from the 16 loci available in the AmpFLSTR kit:
#

path <- system.file("abif/AmpFLSTR_Bins_v1.txt", package = "seqinr")
resbin <- readBins(path)
codis <- resbin[["Identifiler_CODIS_v1"]]
names(codis)

#
# We count how many different alleles are present per locus:
#

na <- unlist(lapply(codis, function(x) length(x[[1]])))
na

#
# The number of octets required to encode a genetic profile at each locus is then:
#

ceiling(na/8)

#
# We then need a total of 40 octets to encode these profiles:
#

sum(ceiling(na/8))

#
# Let's define a function to encode a profile at a given locus, and vice versa:
#

prof2raw <- function(profile, alleles) {
  if (!is.ordered(alleles)) stop("ordered factor expected for alleles")
  if (!is.character(profile)) stop("vector of character expected for profile")
  noctets <- ceiling(length(alleles)/8)
  res.b <- rawToBits(raw(noctets))
  for (i in seq_along(profile)) {
    # set the bit whose position matches this allele's rank among the levels
    res.b[which(profile[i] == alleles)] <- as.raw(1)
  }
  return(packBits(res.b, type = "raw"))
}

raw2prof <- function(rawdata, alleles) {
  if (!is.ordered(alleles)) stop("ordered factor expected for alleles")
  if (!is.raw(rawdata)) stop("vector of raw expected for rawdata")
  res <- as.character(alleles)[as.logical(rawToBits(rawdata))]
  return(paste(res, collapse = ", "))
}

#
# Let's now encode all alleles present in codis as ordered factors:
#

allalleles <- lapply(codis, function(x) factor(x[, 1], levels = x[, 1], ordered = TRUE))

#
# Let's play with our encoding/decoding utilities on the first locus:
#

allalleles[[1]] #  <8 8 9 10 11 12 13 14 15 16 17 18 19 >19
res <- prof2raw(c("8", "9", "13", "14", ">19"), allalleles[[1]])
res # c6 20
rawToBits(res) # 00 01 01 00 00 00 01 01 00 00 00 00 00 01 00 00
raw2prof(res, allalleles[[1]]) #  "8, 9, 13, 14, >19"

#
# Let's define a profile with all possible alleles:
#

ladder <- unlist(lapply(allalleles, function(x) prof2raw(as.character(x),x)))
names(ladder) <- NULL
stopifnot(identical(as.integer(ladder),
  c(255L, 63L, 255L, 255L, 255L, 63L, 255L, 63L, 255L, 31L, 255L,
    63L, 255L, 255L, 7L, 255L, 3L, 255L, 63L, 255L, 255L, 255L, 255L,
    15L, 255L, 127L, 255L, 3L, 255L, 255L, 255L, 255L, 3L, 3L, 255L,
    15L, 255L, 255L, 255L, 7L))) # simple sanity check

#
# Let's make a simulated database. Here we use random sampling
# with a uniform distribution over all possible profiles
# at a given locus. A more realistic sampling for a database of individuals
# would be to sample only two alleles at each locus according to
# observed frequencies in populations.
#

n <- 10^5 # the number of records in the database
DB <- sapply(ladder, function(x) as.raw(sample(0:as.integer(x), size = n, replace = TRUE)))

#
# Now we make sure that the target is in the database:
#

target <- DB[666, ]
DB <- as.vector(t(DB)) # store DB as a flat database (is it useful?)

#
# Now we compute the number of alleles in common between the
# target and all the entries in the DB:
#

system.time(res <- fastacc(target, DB)) # Fast, isn't it?
stopifnot(which.max(res) == 666) # sanity check

#
# Don't run: too tedious for a routine check. Here we check that query time is
# linear in the database size up to 10 x 10^6 (10^7) entries (roughly the
# number of individual profiles at the EU level)
#

## Not run:
maxn <- 10^7
DB <- sapply(ladder, function(x) as.raw(sample(0:as.integer(x),
  size = maxn, replace = TRUE)))
target <- DB[666, ]
DB <- as.vector(t(DB))

np <- 10
nseq <- seq(from = 10^5, to = maxn, length = np)
res <- numeric(np)
i <- 1
for (n in nseq) {
  print(i)
  res[i] <- system.time(tmp <- fastacc(target, DB[1:n]))[1]
  stopifnot(which.max(tmp) == 666)
  i <- i + 1
}
dbse <- data.frame(list(nseq = nseq, res = res))

x <- dbse$nseq
y <- dbse$res
plot(x, y, type = "b", xlab = "Number of entries in DB", ylab = "One query time [s]",
  las = 1, xlim = c(0, maxn), ylim = c(0, max(y)), main = "Data base size effect on query time")
lm1 <- lm(y ~ x - 1)
abline(lm1, col = "red")
legend("topleft", inset = 0.01, legend = paste("y =", formatC(lm1$coef[1],
  digits = 3), "x"), col = "red", lty = 1)

#
# On my laptop the slope is 2.51e-08, that is, about a quarter of a second
# to scan a database with 10 x 10^6 entries.
#

## End(Not run)

## end
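
The database comment in the example notes that a more realistic simulation would draw only two alleles per locus according to population frequencies. A minimal sketch of that idea for a single toy locus follows; the allele labels, the frequencies, and the helper name sample_locus are all hypothetical and chosen purely for illustration.

```r
# Hypothetical toy locus: allele labels and made-up population frequencies
# (illustrative values only; they sum to 1).
alleles <- c("<8", "8", "9", "10", "11", "12", "13", "14")
freqs   <- c(0.01, 0.05, 0.20, 0.25, 0.20, 0.15, 0.10, 0.04)

# Draw one individual's genotype at this locus and encode it one bit per
# allele, using the same rawToBits()/packBits() scheme as prof2raw().
sample_locus <- function(alleles, freqs) {
  noctets <- ceiling(length(alleles) / 8)
  bits <- rawToBits(raw(noctets))
  # two draws with replacement: a homozygote picks the same allele twice
  idx <- sample(seq_along(alleles), size = 2, replace = TRUE, prob = freqs)
  bits[idx] <- as.raw(1)
  packBits(bits, type = "raw")
}

set.seed(1)
profile <- sample_locus(alleles, freqs)
nset <- sum(as.integer(rawToBits(profile)))  # 1 if homozygous, 2 otherwise
```

Repeating this per locus and concatenating the octets would give one database record; real allele frequencies would have to come from population data.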

When reposting, please credit the source: 生物统计家园网 (http://www.biostatistic.net).
