featureEvaluate(BioSeqClass)
featureEvaluate() belongs to the R package BioSeqClass
Evaluate Different Feature Coding Schemas
Description
Feature sets generated by different feature coding schemas are used as input to classification models, and the performance of each model is reported in the result.
Usage
featureEvaluate(seq, classLable, fileName, ele.type, featureMethod,
cv=10, classifyMethod="libsvm",
group=c("aaH", "aaV", "aaZ", "aaP", "aaF", "aaS", "aaE"), k, g,
hydro.methods=c("kpm", "SARAH1"), hydro.indexs=c("hydroE", "hydroF", "hydroC"),
aaindex.name, n, d, w=0.05, start.pos, stop.pos, psiblast.path,
database.path, hmmpfam.path, pfam.path, Evalue=10^-5,
na.type="all", na.strand="all", diprodb.method="all", diprodb.type="all",
svm.kernel="linear", svm.scale=FALSE, svm.path, svm.options="-t 0",
knn.k=1, nnet.size=2, nnet.rang=0.7, nnet.decay=0, nnet.maxit=100)
Arguments
Argument: seq
a string vector of protein, DNA, or RNA sequences.
Argument: classLable
a factor or vector of class labels for the sequences in seq.
Argument: fileName
a string for the output file name.
Argument: ele.type
a string for the type of biological sequence. This must be one of the strings "rnaBase", "dnaBase", "aminoacid" or "aminoacid2".
Argument: featureMethod
a string vector naming the feature coding methods. The available names are "Binary", "CTD", "FragmentComposition", "GapPairComposition", "CKSAAP", "Hydro", "ACH", "AAindex", "ACI", "ACF", "PseudoAAComp", "PSSM", "DOMAIN", "BDNAVIDEO", and "DIPRODB".
Argument: classifyMethod
a string for the classification method. This must be one of the strings "libsvm", "svmlight", "NaiveBayes", "randomForest", "knn", "tree", "nnet", "rpart", "ctree", "ctreelibsvm", "bagging".
Argument: cv
an integer for the number of folds in cross validation, or the string "leave_one_out" for the jackknife test.
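To make the cv argument concrete, the following base R sketch shows one common way of assigning samples to folds. The helper make_folds is hypothetical and is not part of BioSeqClass; how the package partitions the data internally is not stated here, so treat this only as an illustration of k-fold versus leave-one-out splitting.

```r
# Hypothetical helper (not part of BioSeqClass): assign each sample to a fold.
make_folds <- function(n_samples, cv = 10) {
  if (identical(cv, "leave_one_out")) {
    # Jackknife test: every sample forms its own test fold.
    return(seq_len(n_samples))
  }
  stopifnot(is.numeric(cv), cv >= 2, cv <= n_samples)
  # Repeat the fold ids 1..cv until all samples are covered, then shuffle.
  sample(rep(seq_len(cv), length.out = n_samples))
}

set.seed(1)
folds <- make_folds(80, cv = 5)
table(folds)                     # 16 samples in each of the 5 folds
test_idx  <- which(folds == 1)   # held-out samples for the first round
train_idx <- which(folds != 1)   # samples used to fit the model
```

Each round trains on cv - 1 folds and tests on the remaining one, so every sample is predicted exactly once.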
Argument: group
a string vector for the groups of amino acids. The available groups are: "aaH", "aaV", "aaZ", "aaP", "aaF", "aaS" or "aaE".
Argument: k
an integer indicating the length of the sequence fragment (k>=1).
Argument: g
an integer indicating the distance between two amino acids/bases (g>=0).
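The roles of k and g can be illustrated with a small base R sketch. Both helpers below are hypothetical and are not the BioSeqClass implementation; in particular, the sketch assumes that g counts the positions lying between the two paired residues, so g = 0 means adjacent residues.

```r
# Hypothetical helpers for illustration; not the BioSeqClass implementation.

# Count every overlapping fragment of length k in one sequence.
count_fragments <- function(s, k) {
  stopifnot(k >= 1, nchar(s) >= k)
  starts <- seq_len(nchar(s) - k + 1)
  table(substring(s, starts, starts + k - 1))
}

# Count residue pairs separated by g intervening positions (assumed meaning of g).
count_gap_pairs <- function(s, g) {
  stopifnot(g >= 0, nchar(s) >= g + 2)
  chars  <- strsplit(s, "")[[1]]
  first  <- chars[seq_len(length(chars) - g - 1)]
  second <- chars[(g + 2):length(chars)]
  table(paste0(first, second))
}

count_fragments("ACDAC", k = 2)   # AC occurs twice, CD and DA once
count_gap_pairs("ACDAC", g = 1)   # pairs AD, CA and DC, once each
```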
Argument: hydro.methods
a string vector for the methods of coding the protein hydrophobic effect. The available methods are: "kpm" or "SARAH1".
Argument: hydro.indexs
a string vector for the hydrophobic indexes used in coding the protein hydrophobic effect. The available indexes are: "hydroE", "hydroF" or "hydroC".
Argument: aaindex.name
a string for the name of a physicochemical or biochemical property in AAindex.
Argument: n
an integer used as a parameter of featureACF (1<=n<=L-2, where L is the length of the sequence). featureACF takes the auto-correlation between fragments X(1)...X(L-m) and X(m+1)...X(L) (1<=m<=n) as features.
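The lagged fragments described for n can be sketched in base R as follows. This is not featureACF itself: the helper name is hypothetical, the numeric input stands in for a per-residue property value, and plain Pearson correlation is assumed as the auto-correlation measure.

```r
# Hypothetical sketch; not featureACF itself. x holds one numeric property
# value per residue, and Pearson correlation is an assumption.
lagged_autocorrelation <- function(x, n) {
  L <- length(x)
  stopifnot(n >= 1, n <= L - 2)
  sapply(seq_len(n), function(m) {
    # Correlate X(1)...X(L-m) with X(m+1)...X(L).
    cor(x[1:(L - m)], x[(m + 1):L])
  })
}

x <- c(1, 2, 3, 4, 5, 6)           # made-up values, linear on purpose
lagged_autocorrelation(x, n = 3)   # one feature per lag m = 1, 2, 3
```

The bound n<=L-2 keeps at least two positions in each fragment at the largest lag, which is the minimum needed for a correlation to be defined.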
Argument: d
an integer used as a parameter of featurePseudoAAComp (d>=1). Couplings between amino acids X(i) and X(i+d) are considered as features.
Argument: w
a numeric value for the weight factor of the sequence order effect in featurePseudoAAComp.
Argument: start.pos
an integer vector denoting the start position of the fragment window. If it is missing, it is 1 by default.
Argument: stop.pos
an integer vector denoting the stop position of the fragment window. If it is missing, it is the length of the sequence by default.
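The effect of start.pos and stop.pos can be pictured with base R's substring(). The helper extract_window below is hypothetical and only mirrors the documented defaults (start at 1, stop at the sequence length); it is not how BioSeqClass is known to implement the window.

```r
# Hypothetical helper; mirrors the documented defaults only.
extract_window <- function(seqs, start.pos = 1, stop.pos = nchar(seqs)) {
  # substring() is vectorised, so start.pos and stop.pos may hold one value
  # per sequence or a single value that is recycled.
  substring(seqs, start.pos, stop.pos)
}

seqs <- c("MKTAYIAK", "GAVLIPFW")                   # made-up peptides
extract_window(seqs)                                # whole sequences
extract_window(seqs, start.pos = 3, stop.pos = 6)   # "TAYI" "VLIP"
```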
Argument: psiblast.path
a string for the path of the PSI-BLAST program blastpgp. blastpgp is used to iteratively search the database and generate position-specific scores for each position in the alignment.
Argument: database.path
a string for the path of the formatted protein database. The database can be formatted with the formatdb program.
Argument: hmmpfam.path
a string for the path of the hmmpfam program in HMMER. hmmpfam is used to predict domains using models in the Pfam database.
Argument: pfam.path
a string for the path of the Pfam domain database.
Argument: Evalue
a numeric value for the E-value cutoff of predicted Pfam domains.
Argument: na.type
a string for the nucleic acid type. It must be "DNA", "DNA/RNA", "RNA", or "all".
Argument: na.strand
a string for the strand information. It must be "double", "single", or "all".
Argument: diprodb.method
a string for the mode of property determination. It can be "experimental", "calculated", or "all".
Argument: diprodb.type
a string for the property type. It can be "physicochemical", "conformational", "letter based", or "all".
Argument: svm.kernel
a string for the kernel function of the SVM.
Argument: svm.scale
a logical vector indicating the variables to be scaled.
Argument: svm.path
a character string for the path to the SVMlight binaries (required if the path is unknown to the OS).
Argument: svm.options
optional parameters passed to SVMlight. For further details see "How to use" on http://svmlight.joachims.org/ (e.g. "-t 2 -g 0.1").
Argument: nnet.size
number of units in the hidden layer. Can be zero if there are skip-layer units.
Argument: nnet.rang
initial random weights on [-rang, rang]. The value should be about 0.5 unless the inputs are large, in which case it should be chosen so that rang * max(|x|) is about 1.
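The rule of thumb for nnet.rang with large inputs can be worked through in a few lines of base R. The feature matrix below is made up, and the helper name suggest_rang is hypothetical; the sketch simply solves rang * max(|x|) = 1 for rang.

```r
# Hypothetical helper: pick rang so that rang * max(|x|) equals 1.
suggest_rang <- function(x) {
  biggest <- max(abs(x))
  stopifnot(biggest > 0)   # an all-zero matrix gives no usable scale
  1 / biggest
}

x <- matrix(c(0, 50, -200, 10, 120, 80), nrow = 3)   # made-up feature values
rang <- suggest_rang(x)
rang                  # 0.005
rang * max(abs(x))    # 1
```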
Argument: nnet.decay
parameter for weight decay.
Argument: nnet.maxit
maximum number of iterations.
Argument: knn.k
number of neighbours considered in the function classifyModelKNN.
Details
featureEvaluate can test feature coding methods for short peptides, proteins, DNA or RNA. It returns a list ranked by the accuracy of the classification result. Each element in the list has three components: "data", "model", and "performance". "data" is a data.frame object that stores the feature matrix; its last column is the class label. "model" is a vector describing the feature coding method, with 6 elements: "Feature_Function", "Feature_Parameter", "Feature_Number", "Model", "Model_Parameter", and "Cross_Validataion". "performance" is a vector of the performance results of the classification model, with 10 elements: "tp", "tn", "fp", "fn", "prcc", "sn", "sp", "acc", "mcc", "pc".
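As an illustration of the "performance" component and the ranking by accuracy, the base R sketch below derives four of the listed measures from the confusion counts and sorts a mock result list. It assumes that "sn", "sp", "acc" and "mcc" carry their usual meanings (sensitivity, specificity, accuracy, Matthews correlation coefficient); the helper, the mock list and the names "codingA" and "codingB" are hypothetical and were not produced by featureEvaluate.

```r
# Hypothetical helper using the standard definitions of the four measures.
confusion_metrics <- function(tp, tn, fp, fn) {
  sn  <- tp / (tp + fn)                      # sensitivity
  sp  <- tn / (tn + fp)                      # specificity
  acc <- (tp + tn) / (tp + tn + fp + fn)     # accuracy
  mcc <- (tp * tn - fp * fn) /
    sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
  c(sn = sn, sp = sp, acc = acc, mcc = mcc)
}

# Mock results shaped like the list described above (made-up counts).
mock <- list(
  list(model = c(Feature_Function = "codingA"),
       performance = confusion_metrics(tp = 30, tn = 32, fp = 8, fn = 10)),
  list(model = c(Feature_Function = "codingB"),
       performance = confusion_metrics(tp = 35, tn = 34, fp = 6, fn = 5))
)

# Rank the elements by accuracy, best first.
acc    <- sapply(mock, function(x) x$performance[["acc"]])
ranked <- mock[order(acc, decreasing = TRUE)]
ranked[[1]]$model[["Feature_Function"]]   # "codingB"
```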
Author(s)
Hong Li
Examples
## Read the positive/negative sequences from files.
## system.file() is used because .path.package() is defunct in current R.
tmpfile1 = system.file("example", "acetylation_K.pos40.pep", package="BioSeqClass")
tmpfile2 = system.file("example", "acetylation_K.neg40.pep", package="BioSeqClass")
posSeq = as.matrix(read.csv(tmpfile1,header=FALSE,sep="\t",row.names=1))[,1]
negSeq = as.matrix(read.csv(tmpfile2,header=FALSE,sep="\t",row.names=1))[,1]
seq=c(posSeq,negSeq)
classLable=c(rep("+1",length(posSeq)),rep("-1",length(negSeq)) )
if(interactive()){
## Test various feature coding methods.
## This may be time consuming.
fileName = tempfile()
testFeatureSet = featureEvaluate(seq, classLable, fileName, ele.type="aminoacid",
featureMethod=c("Binary", "CTD", "FragmentComposition", "GapPairComposition",
"Hydro"), cv=5, classifyMethod="libsvm",
group=c("aaH", "aaV", "aaZ", "aaP", "aaF", "aaS", "aaE"), k=3, g=7,
hydro.methods=c("kpm", "SARAH1"), hydro.indexs=c("hydroE", "hydroF", "hydroC") )
## Read the summary table written to fileName and open it in the data editor.
summary = read.csv(fileName,sep="\t",header=TRUE)
fix(summary)
## Combine the features from different feature coding functions into one data frame.
feature.index = 1:5
tmp <- testFeatureSet[[1]]$data
colnames(tmp) <- paste(testFeatureSet[[feature.index[1]]]$model["Feature_Function"],testFeatureSet[[feature.index[1]]]$model["Feature_Parameter"],colnames(tmp),sep=" ; ")
data <- tmp[,-ncol(tmp)]
for(i in 2:length(feature.index) ){
tmp <- testFeatureSet[[feature.index[i]]]$data
colnames(tmp) <- paste(testFeatureSet[[feature.index[i]]]$model["Feature_Function"],testFeatureSet[[feature.index[i]]]$model["Feature_Parameter"],colnames(tmp),sep=" ; ")
data <- data.frame(data, tmp[,-ncol(tmp)] )
}
name <- colnames(data)
data <- data.frame(data, tmp[,ncol(tmp)] )
## Feature forward selection with fsFFS.
## This is very time consuming.
combineFeatureResult = fsFFS(data,stop.n=50,classifyMethod="knn",cv=5)
tmp = sapply(combineFeatureResult,function(x){c(length(x$features),x$performance["acc"])})
plot(tmp[1,],tmp[2,],xlab="featureNumber",ylab="Accuracy",main="result of FFS_KNN",pch=19)
lines(tmp[1,],tmp[2,])
## Compare the prediction accuracy of different feature coding methods and different classification models.
## This is very time consuming.
testResult = lapply(c("libsvm", "randomForest", "knn", "tree"),
function(x){
tmp = featureEvaluate(seq, classLable, fileName = tempfile(),
ele.type="aminoacid", featureMethod=c("Binary", "CTD", "FragmentComposition",
"GapPairComposition", "Hydro"), cv=5, classifyMethod=x,
group=c("aaH", "aaV", "aaZ", "aaP", "aaF", "aaS", "aaE"), k=3, g=7,
hydro.methods=c("kpm", "SARAH1"), hydro.indexs=c("hydroE", "hydroF", "hydroC") );
sapply(tmp,function(y){c(y$model[["Feature_Function"]], y$model[["Feature_Parameter"]], y$model[["Model"]], y$performance[["acc"]])})
})
tmpFeature = as.factor(c(sapply(testResult,function(x){apply(x[1:2,],2,function(y){paste(y,collapse="; ")})})))
tmpModel = as.factor(c(sapply(testResult,function(x){x[3,]})))
tmp1 = data.frame(as.integer(tmpFeature), as.integer(tmpModel), as.numeric(c(sapply(testResult,function(x){x[4,]}))) )
library(scatterplot3d)
s3d=scatterplot3d(tmp1,color=c("red","blue","green","yellow")[tmp1[,2]],pch=19,
xlab="Feature Coding", ylab="Classification Model",
zlab="Accuracy under 5-fold cross validation",lab=c(10,6,7),
y.ticklabs=c("",as.character(sort(unique(tmpModel))),"") )
}