R language: help documentation for the create_matrix() function in the RTextTools package

loveR · posted 2012-9-28 22:41:28

create_matrix(RTextTools)
R package that create_matrix() belongs to: RTextTools

Creates a document-term matrix to be passed into create_container().

Translator: LoveR, the 生物统计家园网 robot

----------Description----------

Creates an object of class DocumentTermMatrix (from package tm) that can be used in the create_container function.

----------Usage----------

create_matrix(textColumns, language="english", minDocFreq=1, maxDocFreq=Inf,
              minWordLength=3, maxWordLength=Inf, ngramLength=1, originalMatrix=NULL,
              removeNumbers=FALSE, removePunctuation=TRUE, removeSparseTerms=0,
              removeStopwords=TRUE, stemWords=FALSE, stripWhitespace=TRUE, toLower=TRUE,
              weighting=weightTf)

----------Arguments----------

Argument: textColumns
Either a character vector (e.g. data$Title) or a cbind() of columns to use for training the algorithms (e.g. cbind(data$Title,data$Subject)).

Argument: language
The language to be used for stemming the text data.

Argument: minDocFreq
The minimum number of times a word should appear in a document for it to be included in the matrix. See package tm for more details.

Argument: maxDocFreq
The maximum number of times a word should appear in a document for it to be included in the matrix. See package tm for more details.

Argument: minWordLength
The minimum number of letters a word or n-gram should contain to be included in the matrix. See package tm for more details.

Argument: maxWordLength
The maximum number of letters a word or n-gram should contain to be included in the matrix. See package tm for more details.
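To make the two length bounds concrete, here is a minimal base R sketch of length-based filtering. It does not call RTextTools or tm. The `tokens` vector and the `keep_by_length()` helper are hypothetical names used only for illustration, and the sketch assumes both bounds are inclusive.

```r
# Hypothetical helper: keep tokens whose character count lies in [min_len, max_len].
keep_by_length <- function(tokens, min_len = 3, max_len = Inf) {
  n <- nchar(tokens)
  tokens[n >= min_len & n <= max_len]
}

# Made-up tokens for illustration.
tokens <- c("a", "of", "the", "matrix", "classification")

# Defaults mirror minWordLength=3 and maxWordLength=Inf from the usage above.
short_dropped <- keep_by_length(tokens)

# A finite upper bound also drops very long tokens.
bounded <- keep_by_length(tokens, min_len = 3, max_len = 6)

print(short_dropped)
print(bounded)
```

With the default lower bound of 3, one- and two-letter tokens never reach the matrix, which is why raising or lowering minWordLength changes the number of columns.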

Argument: ngramLength
The number of words to include per n-gram for the document-term matrix.
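As a rough picture of what "words per n-gram" means, the base R sketch below slides a window of n words over one sentence. It is not the tokenizer RTextTools uses internally; `make_ngrams()` and the sample headline are hypothetical and exist only for this illustration.

```r
# Hypothetical helper: slide a window of n words over a sentence.
make_ngrams <- function(text, n = 1) {
  words <- strsplit(text, "\\s+")[[1]]
  words <- words[nzchar(words)]            # drop empty strings from leading whitespace
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

headline <- "senate passes budget bill"    # made-up example text
unigrams <- make_ngrams(headline, n = 1)   # comparable to ngramLength=1
bigrams  <- make_ngrams(headline, n = 2)   # comparable to ngramLength=2

print(unigrams)
print(bigrams)
```

A sentence of k words yields k - n + 1 n-grams, so larger values of ngramLength tend to produce more distinct, and sparser, matrix columns.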

Argument: originalMatrix
The original DocumentTermMatrix used to train the models. If supplied, the new matrix is adjusted to work with the saved models.

Argument: removeNumbers
A logical parameter to specify whether to remove numbers.

Argument: removePunctuation
A logical parameter to specify whether to remove punctuation.

Argument: removeSparseTerms
A sparsity threshold for dropping rarely used terms. See removeSparseTerms() in package tm for more details.

Argument: removeStopwords
A logical parameter to specify whether to remove stopwords, using the language specified in language.

Argument: stemWords
A logical parameter to specify whether to stem words, using the language specified in language.

Argument: stripWhitespace
A logical parameter to specify whether to strip whitespace.

Argument: toLower
A logical parameter to specify whether to make all text lowercase.

Argument: weighting
Either weightTf or weightTfIdf. See package tm for more details.
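The difference between the two weightings can be sketched in base R on a toy matrix. This does not call tm. It assumes the common normalized form of tf-idf (term count divided by document length, multiplied by a base-2 log of the inverse document frequency), which is how tm describes weightTfIdf with its default settings. The counts and term names are made up.

```r
# Toy term-frequency matrix: rows are documents, columns are terms (made-up counts).
tf <- matrix(c(2, 1, 0,
               1, 0, 3,
               1, 0, 0),
             nrow = 3, byrow = TRUE,
             dimnames = list(c("doc1", "doc2", "doc3"),
                             c("tax", "vote", "war")))

# Normalize each row by the document's total term count.
tf_norm <- tf / rowSums(tf)

# Inverse document frequency: log2(number of documents / documents containing the term).
doc_freq <- colSums(tf > 0)
idf <- log2(nrow(tf) / doc_freq)

# Scale every column by its idf value.
tfidf <- sweep(tf_norm, 2, idf, `*`)

print(tf)               # plain counts, comparable to weightTf
print(round(tfidf, 3))  # comparable to weightTfIdf under the stated assumption
```

In this toy matrix "tax" occurs in every document, so its idf is zero and its tf-idf column vanishes, while terms confined to one document are weighted up.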

----------Author(s)----------

Timothy P. Jurka <tpjurka@ucdavis.edu>, Loren Collingwood <lorenc2@uw.edu>

----------Examples----------

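library(RTextTools)
# Load the NYTimes sample data shipped with RTextTools
data(NYTimes)
# Draw 100 rows at random from the first 3100
data <- NYTimes[sample(1:3100,size=100,replace=FALSE),]
# Build a tf-idf weighted document-term matrix from the Title and Subject columns
matrix <- create_matrix(cbind(data["Title"],data["Subject"]), language="english",
                        removeNumbers=TRUE, stemWords=FALSE, weighting=weightTfIdf)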
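Since the description says the result is meant for create_container(), a possible next step is sketched below. It assumes RTextTools and tm are installed, that the Topic.Code column of NYTimes holds the labels, and that a 75/25 train/test split is acceptable. The seed and the `dtm` and `container` names are illustrative choices, not part of the original example.

```r
library(RTextTools)

# Rebuild a small document-term matrix from the bundled NYTimes data.
data(NYTimes)
set.seed(42)  # arbitrary seed, only to make the sample repeatable
data <- NYTimes[sample(1:3100, size = 100, replace = FALSE), ]

# tm::weightTfIdf is written with its namespace so it resolves even if tm is not attached.
dtm <- create_matrix(cbind(data["Title"], data["Subject"]), language = "english",
                     removeNumbers = TRUE, stemWords = FALSE,
                     weighting = tm::weightTfIdf)

# Assumption: Topic.Code is the label column; rows 1-75 train, rows 76-100 test.
container <- create_container(dtm, data$Topic.Code, trainSize = 1:75,
                              testSize = 76:100, virgin = FALSE)
```

Here `dtm` is used instead of `matrix` as the variable name to avoid shadowing the base R matrix() function.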

When reposting, please credit the source: 生物统计家园网 (http://www.biostatistic.net).
