R BioSeqClass package: help documentation for the hr() function

Posted on 2012-2-25 13:42:41
hr(BioSeqClass)
R package containing hr(): BioSeqClass

                                        Homolog Reduction

                                         Translator: 生物统计家园网 bot LoveR

----------Description----------

Filter homologous sequences by sequence similarity.


----------Usage----------


  hr(seq, method, identity, cdhit.path)
  
  cdhitHR(seq, identity=0.3, cdhit.path)
  aligndisHR(seq, identity=0.6)
  distance(seq1,seq2)
  
  getTrain(seqfile, posfile, aa, w, identity, balance=T)  
  getNegSite(posSite, seq, aa)



----------Arguments----------

Argument: seq
A list with one element per protein or gene sequence. Each element has two parts: the description ("desc") and a character string holding the biological sequence ("seq").
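
A minimal sketch of the list structure described for seq, built by hand instead of being read from a FASTA file. The protein names and sequences are made up for illustration.

```r
# Hypothetical records: each element carries a description and a sequence string.
seq <- list(
  list(desc = "protA", seq = "MKVLAAGK"),
  list(desc = "protB", seq = "MKTLAAGR")
)

# Collect the descriptions and the sequence lengths.
descs <- sapply(seq, function(x) x[["desc"]])
lens  <- sapply(seq, function(x) nchar(x[["seq"]]))
```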


Argument: identity
A numeric value between 0 and 1, used as the maximum identity cutoff among the input sequences.


Argument: method
A string naming the homolog reduction method. It must be either "cdhit" or "aligndis".


Argument: cdhit.path
A string giving the path of the cd-hit program directory, e.g. "/people/hongli/cd-hit". It is required when method="cdhit".


Argument: seq1
A string holding a protein or gene sequence.


Argument: seq2
A string holding a protein or gene sequence. seq1 and seq2 must have the same length.


Argument: seqfile
A string giving the name of a FASTA file.


Argument: posfile
A string giving the name of the file that contains the positive site dataset. The file has two columns: the 1st column is the protein name and the 2nd column is the positive site. Protein names should be consistent with the names used in seqfile.
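
A small sketch of the two-column layout described for posfile, read from an in-memory string instead of a real file. The tab separator follows the example at the end of this page; the contents and the column names protein and site are assumptions made for illustration.

```r
# Hypothetical file contents: protein name, then positive site position.
txt <- "protA\t5\nprotA\t12\nprotB\t7"

sites <- read.table(text = txt, sep = "\t", header = FALSE,
                    col.names = c("protein", "site"),
                    stringsAsFactors = FALSE)
```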


Argument: aa
A character giving the amino acid of interest, e.g. "C".


Argument: w
An integer for the window size of the flanking peptide sequence. The window size is 2*w+1, and the central residues are the positive sites in posfile.


Argument: balance
A logical value indicating whether negative sites are randomly selected so that their number equals the number of positive sites.
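
The effect of balance can be sketched as plain random subsampling. This page does not say how getTrain draws the negatives, so the use of sample() here is an assumption, and all site labels are made up.

```r
set.seed(1)  # fixed seed only so the sketch is reproducible

pos <- c("protA:5", "protA:12")                       # hypothetical positive sites
neg <- c("protA:2", "protA:8", "protB:3", "protB:9")  # hypothetical negative sites
all_neg <- neg

balance <- TRUE
if (balance && length(neg) > length(pos)) {
  # Keep as many negatives as there are positives.
  neg <- sample(neg, length(pos))
}
```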


Argument: posSite
A string vector of positive sites. Each entry consists of a protein description and a positive site, e.g. "P278168:952".
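
Entries in the "description:site" form shown above can be split with base R. This is only an illustration of the format; the second entry is made up.

```r
posSite <- c("P278168:952", "P278168:1003")  # second entry is hypothetical

parts   <- strsplit(posSite, ":", fixed = TRUE)
protein <- sapply(parts, `[`, 1)              # protein description
site    <- as.integer(sapply(parts, `[`, 2))  # site position
```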


----------Details----------

hr uses cdhitHR and aligndisHR to filter homologous sequences. It supports the following methods:

"cdhit": Use the cd-hit program to quickly filter sequences by a given identity. It is designed to filter full-length protein or gene sequences. "formatdb" and "blastall" are required to run the cd-hit program. (http://www.bioinformatics.org/download.php/cd-hit/cd-hit-2007-0131.tar.gz or http://www.bioinformatics.org/download.php/cd-hit/cd-hit-2007-0131-win32.tar.gz)

"aligndis": Use the number of differing residues to measure the identity between two sequences. It is designed to filter aligned sequences of equal length.
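
The idea behind "aligndis" can be sketched as a greedy filter over equal-length fragments. This is not the package's implementation: the identity definition (fraction of matching positions), the greedy order, and the helper names pair_identity and reduce_by_identity are all assumptions made for illustration.

```r
# Fraction of positions at which two equal-length strings agree (assumed definition).
pair_identity <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  stopifnot(length(x) == length(y))
  mean(x == y)
}

# Greedy filter: keep a sequence only if its identity to every
# sequence already kept does not exceed the cutoff.
reduce_by_identity <- function(seqs, identity) {
  kept <- character(0)
  for (s in seqs) {
    ids <- vapply(kept, pair_identity, numeric(1), b = s)
    if (all(ids <= identity)) kept <- c(kept, s)
  }
  kept
}

frags   <- c("AAKDE", "AAKDF", "GHKLM")  # made-up aligned fragments
reduced <- reduce_by_identity(frags, identity = 0.6)
```

Here "AAKDF" matches "AAKDE" at 4 of 5 positions (0.8), which exceeds the 0.6 cutoff, so it is dropped; "GHKLM" matches at only 1 of 5 positions and is kept.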

getTrain extracts the 2*w+1 flanking peptides of positive sites and filters homologous sequences. Negative sites are the non-positive sites in the same proteins.
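
The window extraction and negative-site selection described above can be sketched with base R on a single made-up protein. The sequence, the site position, and the variable names are hypothetical, and the sketch does not reproduce getTrain or getNegSite themselves.

```r
protein   <- "MAKCDEFKGHIKLMN"  # made-up sequence
aa        <- "K"                # amino acid of interest
w         <- 2                  # flank size, so the window is 2*w+1 = 5
pos_sites <- c(8)               # hypothetical positive site

# Flanking peptide centered on each positive site.
peptides <- substring(protein, pos_sites - w, pos_sites + w)

# Every occurrence of aa that is not annotated as positive is a negative site.
all_sites <- which(strsplit(protein, "")[[1]] == aa)
neg_sites <- setdiff(all_sites, pos_sites)
```

Sites closer than w residues to either end of the protein would give truncated windows; the example at the end of this page drops such sites before extracting fragments.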

distance calculates the number of positions with different residues between two sequences.
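
A base-R sketch of the same count, applied to the two input pairs used in the examples below. The helper name count_mismatches is hypothetical, and this is not the package's own code.

```r
# Count the positions at which two equal-length strings differ.
count_mismatches <- function(seq1, seq2) {
  a <- strsplit(seq1, "")[[1]]
  b <- strsplit(seq2, "")[[1]]
  stopifnot(length(a) == length(b))
  sum(a != b)
}

d1 <- count_mismatches("AABD", "ACBD")  # differs at position 2 only
d2 <- count_mismatches("AABD", "ECBD")  # differs at positions 1 and 2
```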


----------Value----------

hr returns a list of reduced sequences.


----------Author(s)----------


Hong Li



----------Examples----------


  distance("AABD","ACBD")
  distance("AABD","ECBD")
  if(interactive()){  
    file = file.path(path.package("BioSeqClass"), "example", "acetylation_K.fasta")
    library(Biostrings)
    seq = readFASTA(file)
    ## Homolog reduction of whole-length sequences by cd-hit
    # Needs the cd-hit program.
    reducSeq50 = hr(seq, method="cdhit", identity=0.5, cdhit.path="/people/hongli/cd-hit")
   
    file = file.path(path.package("BioSeqClass"), "example", "acetylation_K.site")
    tmp = as.matrix(read.csv(file, sep="\t", header=FALSE))
    # Keep only sites that have at least 7 residues on each side.
    logical = apply(tmp,1,function(x){ l=length(unlist(strsplit(seq[x[1]],split=""))); (l>=as.numeric(x[2])+7 & as.numeric(x[2])-7>0) })
    fragment = sub.seq(seq[tmp[logical,1]], as.numeric(tmp[logical,2])-7, as.numeric(tmp[logical,2])+7)  
    ## Homolog reduction of short sequence fragments
    # It may be slow.
    reducSeq = hr(fragment, method="aligndis", identity=0.4)
   
    ## Produce a training set from the given positive sites and FASTA sequences.
    file = file.path(path.package("BioSeqClass"), "example", "acetylation_K.fasta")
    posfile = file.path(path.package("BioSeqClass"), "example", "acetylation_K.site")
    ## "getTrain" integrates negative set construction and homolog reduction. It is designed for site-level training data.
    # It may be very slow.
    data = getTrain(file, posfile, aa="K", w=7, identity=0.4)
  }

When reposting, please credit the source: 生物统计家园网 (http://www.biostatistic.net).


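Notes:
Note 1: For the convenience of learners, this document was translated by LoveR, the 生物统计家园网 bot. It is intended only as a personal reference for learning R; 生物统计家园 reserves the copyright.
Note 2: Because the translation is automatic, some inaccuracies are unavoidable; carefully comparing the Chinese and English text can help when learning R.
Note 3: If you find an inaccuracy, please reply to this post and we will gradually revise it.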