R BioSeqClass package: featureEvaluate() function help documentation

loveR · posted 2012-2-25 13:41:51

featureEvaluate(BioSeqClass)
featureEvaluate() belongs to the R package BioSeqClass

                                    Evaluate Different Feature Coding Schemas

                                       Translator: LoveR, the translation robot of 生物统计家园网 (biostatistic.net)

----------Description----------

Feature sets from different feature coding schemas are used as input to classification models, and the performance of each model is reported in the result.

----------Usage----------

  featureEvaluate(seq, classLable, fileName, ele.type, featureMethod,
         cv=10, classifyMethod="libsvm",
         group=c("aaH", "aaV", "aaZ", "aaP", "aaF", "aaS", "aaE"), k, g,
         hydro.methods=c("kpm", "SARAH1"), hydro.indexs=c("hydroE", "hydroF", "hydroC"),
         aaindex.name, n, d, w=0.05, start.pos, stop.pos, psiblast.path,
         database.path, hmmpfam.path, pfam.path, Evalue=10^-5,
         na.type="all", na.strand="all", diprodb.method="all", diprodb.type="all",
         svm.kernel="linear", svm.scale=FALSE, svm.path, svm.options="-t 0",
         knn.k=1, nnet.size=2, nnet.rang=0.7, nnet.decay=0, nnet.maxit=100)

----------Arguments----------

Argument: seq
a character vector of protein, DNA, or RNA sequences.

Argument: classLable
a factor or vector giving the class label of each sequence in seq.

Argument: fileName
a string giving the output file name.

Argument: ele.type
a string giving the type of biological sequence. This must be one of "rnaBase", "dnaBase", "aminoacid", or "aminoacid2".

Argument: featureMethod
a character vector naming the feature coding methods to evaluate. The available names are "Binary", "CTD", "FragmentComposition", "GapPairComposition", "CKSAAP", "Hydro", "ACH", "AAindex", "ACI", "ACF", "PseudoAAComp", "PSSM", "DOMAIN", "BDNAVIDEO", and "DIPRODB".
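
To give a feel for what a feature coding turns a sequence into, here is a minimal sketch of a one-hot ("binary") coding in base R. The helper binary_encode is hypothetical and is not the BioSeqClass implementation; the package's own "Binary" coding may order or lay out the features differently.

```r
# Hypothetical helper (not part of BioSeqClass): one-hot coding of a
# single sequence over a given alphabet.
binary_encode <- function(sequence, alphabet) {
  residues <- strsplit(sequence, "")[[1]]
  # One row per sequence position, one column per alphabet letter.
  onehot <- outer(residues, alphabet, FUN = "==") * 1L
  # Flatten position by position into one feature vector.
  as.vector(t(onehot))
}

dna <- c("A", "C", "G", "T")
binary_encode("AC", dna)
```

Each position contributes length(alphabet) features, so fixed-length input sequences give feature vectors of equal length.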

Argument: classifyMethod
a string giving the classification method. This must be one of "libsvm", "svmlight", "NaiveBayes", "randomForest", "knn", "tree", "nnet", "rpart", "ctree", "ctreelibsvm", or "bagging".

Argument: cv
an integer giving the number of cross-validation folds, or the string "leave_one_out" for the jackknife test.
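
The two forms of cv can be pictured as two ways of assigning samples to folds. The sketch below is an illustration in base R only; make_folds is a hypothetical helper, and the way BioSeqClass actually partitions the data is not shown in this page.

```r
# Hypothetical helper: assign n samples to cross-validation folds.
make_folds <- function(n, cv) {
  if (identical(cv, "leave_one_out")) {
    # Jackknife: every sample forms its own fold.
    return(seq_len(n))
  }
  # k-fold: shuffled fold ids with near-equal fold sizes.
  sample(rep(seq_len(cv), length.out = n))
}

set.seed(1)
folds <- make_folds(80, 5)
table(folds)
```

With cv = "leave_one_out" the number of model fits equals the number of sequences, which is why the jackknife test is the slower option.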

Argument: group
a character vector naming the amino acid groupings. The available groups are: "aaH", "aaV", "aaZ", "aaP", "aaF", "aaS", or "aaE".

Argument: k
an integer giving the length of the sequence fragment (k>=1).
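
As an illustration of what a fragment length k means, the following base R sketch counts every overlapping fragment of length k and reports relative frequencies. The function fragment_composition is hypothetical; whether the package's "FragmentComposition" coding normalizes the counts in this way is an assumption.

```r
# Hypothetical helper: relative frequency of each length-k fragment.
fragment_composition <- function(sequence, k, alphabet) {
  stopifnot(k >= 1, nchar(sequence) >= k)
  n <- nchar(sequence) - k + 1
  starts <- seq_len(n)
  # All overlapping fragments of length k.
  fragments <- substring(sequence, starts, starts + k - 1)
  # Every possible fragment over the alphabet (first letter varies fastest).
  grid <- expand.grid(rep(list(alphabet), k), stringsAsFactors = FALSE)
  all_kmers <- do.call(paste0, grid)
  counts <- table(factor(fragments, levels = all_kmers))
  setNames(as.vector(counts) / n, all_kmers)
}

comp <- fragment_composition("ACGTAC", k = 2, alphabet = c("A", "C", "G", "T"))
comp[comp > 0]
```

The number of features grows as length(alphabet)^k, so large k on the 20-letter amino acid alphabet quickly becomes impractical without grouping residues.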

Argument: g
an integer giving the distance between two amino acids/bases (g>=0).
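
A gapped pair can be sketched in base R as follows. The helper gap_pair_composition is hypothetical, and it assumes that g counts the residues between the two members of a pair, so g = 0 means adjacent residues; the package may define the distance differently.

```r
# Hypothetical helper: relative frequency of residue pairs separated
# by g intervening positions (assumption: g = 0 means adjacent).
gap_pair_composition <- function(sequence, g, alphabet) {
  residues <- strsplit(sequence, "")[[1]]
  L <- length(residues)
  stopifnot(g >= 0, L >= g + 2)
  idx <- seq_len(L - g - 1)
  # Pair position i with position i + g + 1.
  pairs <- paste0(residues[idx], residues[idx + g + 1])
  all_pairs <- as.vector(outer(alphabet, alphabet, paste0))
  counts <- table(factor(pairs, levels = all_pairs))
  setNames(as.vector(counts) / length(pairs), all_pairs)
}

gp <- gap_pair_composition("ACGTAC", g = 1, alphabet = c("A", "C", "G", "T"))
gp[gp > 0]
```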

Argument: hydro.methods
a character vector naming the methods for coding the protein hydrophobic effect. The available methods are: "kpm" or "SARAH1".

Argument: hydro.indexs
a character vector naming the hydrophobicity indexes used to code the protein hydrophobic effect. The available indexes are: "hydroE", "hydroF", or "hydroC".

Argument: aaindex.name
a string giving the name of a physicochemical or biochemical property in AAindex.

Argument: n
an integer used as a parameter of featureACF (1<=n<=L-2, where L is the length of the sequence). featureACF takes the auto-correlation between fragments X(1)...X(L-m) and X(m+1)...X(L) (1<=m<=n) as features.
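
The fragment pairing described for n can be sketched in base R. This is an illustration only: acf_features is a hypothetical helper, the input is a toy numeric vector standing in for per-residue property values, and Pearson correlation is used as a stand-in because this page does not state which auto-correlation formula featureACF applies.

```r
# Hypothetical helper: correlate X(1)...X(L-m) with X(m+1)...X(L)
# for every lag m from 1 to n.
acf_features <- function(values, n) {
  L <- length(values)
  stopifnot(n >= 1, n <= L - 2)
  sapply(seq_len(n), function(m) {
    cor(values[1:(L - m)], values[(m + 1):L])
  })
}

# Toy property values for a sequence of length 6 (not real AAindex data).
toy_values <- c(1, -1, 1, -1, 1, -1)
acf_features(toy_values, n = 2)
```

The bound n<=L-2 keeps at least two values in each fragment, which is the minimum needed to compute a correlation.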

Argument: d
an integer used as a parameter of featurePseudoAAComp (d>=1). Couplings between amino acids X(i) and X(i+d) are considered as features.

Argument: w
a numeric value giving the weight factor of the sequence order effect in featurePseudoAAComp.

Argument: start.pos
an integer vector giving the start position of the fragment window. If it is missing, it defaults to 1.

Argument: stop.pos
an integer vector giving the stop position of the fragment window. If it is missing, it defaults to the length of the sequence.
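
The defaults for the window can be mimicked in base R with substring(). The helper extract_window below is hypothetical and only illustrates the documented defaults (start at 1, stop at the sequence length); it is not how the package slices sequences internally.

```r
# Hypothetical helper: cut a window out of each sequence, defaulting to
# the whole sequence when start.pos / stop.pos are not supplied.
extract_window <- function(seq, start.pos = 1, stop.pos = nchar(seq)) {
  substring(seq, start.pos, stop.pos)
}

peptides <- c("ABCDEFG", "HIJKLMN")
extract_window(peptides)        # whole sequences
extract_window(peptides, 2, 4)  # positions 2 to 4 of each sequence
```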

Argument: psiblast.path
a string giving the path of the PSI-BLAST program blastpgp. blastpgp is used to iteratively search the database and generate position-specific scores for each position in the alignment.

Argument: database.path
a string giving the path of the formatted protein database. The database can be formatted with the formatdb program.

Argument: hmmpfam.path
a string giving the path of the hmmpfam program in HMMER. hmmpfam is used to predict domains using models in the Pfam database.

Argument: pfam.path
a string giving the path of the Pfam domain database.

Argument: Evalue
a numeric value giving the E-value cutoff for predicted Pfam domains.

Argument: na.type
a string giving the nucleic acid type. It must be "DNA", "DNA/RNA", "RNA", or "all".

Argument: na.strand
a string giving the strand information. It must be "double", "single", or "all".

Argument: diprodb.method
a string giving the mode of property determination. It can be "experimental", "calculated", or "all".

Argument: diprodb.type
a string giving the property type. It can be "physicochemical", "conformational", "letter based", or "all".

Argument: svm.kernel
a string giving the kernel function of the SVM.

Argument: svm.scale
a logical vector indicating the variables to be scaled.

Argument: svm.path
a string giving the path to the SVMlight binaries (required if the path is unknown to the OS).

Argument: svm.options
optional parameters passed to SVMlight. For further details see "How to use" on http://svmlight.joachims.org/ (e.g. "-t 2 -g 0.1").

Argument: nnet.size
number of units in the hidden layer. Can be zero if there are skip-layer units.

Argument: nnet.rang
initial random weights are drawn from [-rang, rang]. Use a value of about 0.5 unless the inputs are large, in which case it should be chosen so that rang * max(|x|) is about 1.
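
The rule of thumb for nnet.rang amounts to a one-line calculation. The sketch below assumes a hypothetical helper suggest_rang and an arbitrary cut-off of 2 for what counts as "large" inputs (the point where 0.5 * max(|x|) reaches 1); neither comes from the package.

```r
# Hypothetical helper: pick rang so that rang * max(|x|) is about 1
# when the inputs are large, and fall back to 0.5 otherwise.
suggest_rang <- function(x, default = 0.5, large = 2) {
  biggest <- max(abs(x))
  if (biggest <= large) default else 1 / biggest
}

features <- matrix(c(-50, 10, 25, 5), nrow = 2)  # toy feature matrix
suggest_rang(features)   # inputs are large, so 1 / 50
suggest_rang(c(0, 1))    # small inputs keep the default
```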

Argument: nnet.decay
parameter for weight decay.

Argument: nnet.maxit
maximum number of iterations.

Argument: knn.k
number of neighbours considered in the function classifyModelKNN.

----------Details----------

featureEvaluate can test feature coding methods for short peptide, protein, DNA, or RNA sequences. It returns a list ranked by the accuracy of the classification results. Each element of the list has three components: "data", "model", and "performance". "data" is a data.frame object that stores the feature matrix, with the class label in its last column. "model" is a vector describing the feature coding method and classifier, which contains 6 elements: "Feature_Function", "Feature_Parameter", "Feature_Number", "Model", "Model_Parameter", and "Cross_Validataion". "performance" is a vector holding the performance of the classification model, which contains 10 elements: "tp", "tn", "fp", "fn", "prcc", "sn", "sp", "acc", "mcc", "pc".
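
Several of the performance elements follow from the four confusion-matrix counts. The sketch below uses the standard textbook definitions of sensitivity, specificity, accuracy, and Matthews correlation coefficient, and then ranks a mock result list by "acc". The helper performance_from_counts and the mock list are hypothetical, the package's own formulas are not shown on this page, and "prcc" and "pc" are left out because the page does not define them.

```r
# Hypothetical helper: standard metrics from confusion-matrix counts.
performance_from_counts <- function(tp, tn, fp, fn) {
  sn  <- tp / (tp + fn)                    # sensitivity
  sp  <- tn / (tn + fp)                    # specificity
  acc <- (tp + tn) / (tp + tn + fp + fn)   # accuracy
  mcc <- (tp * tn - fp * fn) /
    sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
  c(tp = tp, tn = tn, fp = fp, fn = fn, sn = sn, sp = sp, acc = acc, mcc = mcc)
}

# Mock results shaped like the documented list elements (toy counts).
results <- list(
  list(model = c(Feature_Function = "codingA"),
       performance = performance_from_counts(25, 30, 10, 15)),
  list(model = c(Feature_Function = "codingB"),
       performance = performance_from_counts(30, 35, 5, 10))
)

# Rank by accuracy, highest first.
acc <- sapply(results, function(r) r$performance[["acc"]])
ranked <- results[order(acc, decreasing = TRUE)]
sapply(ranked, function(r) r$model[["Feature_Function"]])
```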

----------Author(s)----------

Hong Li

----------Examples----------

  ## Read positive/negative sequences from the example files shipped with the package.
  library(BioSeqClass)
  ## system.file() replaces the defunct .path.package() for locating package files.
  tmpfile1 = system.file("example", "acetylation_K.pos40.pep", package="BioSeqClass")
  tmpfile2 = system.file("example", "acetylation_K.neg40.pep", package="BioSeqClass")
  posSeq = as.matrix(read.csv(tmpfile1,header=FALSE,sep="\t",row.names=1))[,1]
  negSeq = as.matrix(read.csv(tmpfile2,header=FALSE,sep="\t",row.names=1))[,1]
  seq=c(posSeq,negSeq)
  classLable=c(rep("+1",length(posSeq)),rep("-1",length(negSeq)) )
  if(interactive()){
    ## Test various feature coding methods.
    ## This may be time consuming.
    fileName = tempfile()
    testFeatureSet = featureEvaluate(seq, classLable, fileName, ele.type="aminoacid",
                featureMethod=c("Binary", "CTD", "FragmentComposition", "GapPairComposition",
                "Hydro"), cv=5, classifyMethod="libsvm",
                group=c("aaH", "aaV", "aaZ", "aaP", "aaF", "aaS", "aaE"), k=3, g=7,
                hydro.methods=c("kpm", "SARAH1"), hydro.indexs=c("hydroE", "hydroF", "hydroC") )
    ## Named resultSummary so that it does not mask base::summary().
    resultSummary = read.csv(fileName,sep="\t",header=TRUE)
    fix(resultSummary)

    ## Combine the features produced by different feature coding functions.
    feature.index = 1:5
    tmp <- testFeatureSet[[feature.index[1]]]$data
    colnames(tmp) <- paste(testFeatureSet[[feature.index[1]]]$model["Feature_Function"],testFeatureSet[[feature.index[1]]]$model["Feature_Parameter"],colnames(tmp),sep=" ; ")
    ## Drop the last column (class label) before combining.
    data <- tmp[,-ncol(tmp)]
    for(i in 2:length(feature.index) ){
       tmp <- testFeatureSet[[feature.index[i]]]$data
       colnames(tmp) <- paste(testFeatureSet[[feature.index[i]]]$model["Feature_Function"],testFeatureSet[[feature.index[i]]]$model["Feature_Parameter"],colnames(tmp),sep=" ; ")
       data <- data.frame(data, tmp[,-ncol(tmp)] )
    }
    name <- colnames(data)
    ## Append the class label as the last column.
    data <- data.frame(data, tmp[,ncol(tmp)] )
    ## Forward feature selection with fsFFS.
    ## This is very time consuming.
    combineFeatureResult = fsFFS(data,stop.n=50,classifyMethod="knn",cv=5)
    tmp = sapply(combineFeatureResult,function(x){c(length(x$features),x$performance["acc"])})
    plot(tmp[1,],tmp[2,],xlab="featureNumber",ylab="Accuracy",main="result of FFS_KNN",pch=19)
    lines(tmp[1,],tmp[2,])

    ## Compare the prediction accuracy of different feature coding methods and different classification models.
    ## This is very time consuming.
    testResult = lapply(c("libsvm", "randomForest", "knn", "tree"),
       function(x){
                tmp = featureEvaluate(seq, classLable, fileName = tempfile(),
                ele.type="aminoacid", featureMethod=c("Binary", "CTD", "FragmentComposition",
                "GapPairComposition", "Hydro"), cv=5, classifyMethod=x,
                group=c("aaH", "aaV", "aaZ", "aaP", "aaF", "aaS", "aaE"), k=3, g=7,
                hydro.methods=c("kpm", "SARAH1"), hydro.indexs=c("hydroE", "hydroF", "hydroC") );
                sapply(tmp,function(y){c(y$model[["Feature_Function"]], y$model[["Feature_Parameter"]], y$model[["Model"]], y$performance[["acc"]])})
    })
    tmpFeature = as.factor(c(sapply(testResult,function(x){apply(x[1:2,],2,function(y){paste(y,collapse="; ")})})))
    tmpModel = as.factor(c(sapply(testResult,function(x){x[3,]})))
    tmp1 = data.frame(as.integer(tmpFeature), as.integer(tmpModel), as.numeric(c(sapply(testResult,function(x){x[4,]}))) )
    library(scatterplot3d)
    s3d=scatterplot3d(tmp1,color=c("red","blue","green","yellow")[tmp1[,2]],pch=19,
          xlab="Feature Coding", ylab="Classification Model",
          zlab="Accuracy under 5-fold cross validation",lab=c(10,6,7),
          y.ticklabs=c("",as.character(sort(unique(tmpModel))),"") )
  }

When reposting, please credit 生物统计家园网 (http://www.biostatistic.net).

