R tm package: help documentation for the termFreq() function

loveR · posted on 2012-10-1 10:54:08

termFreq(tm)
termFreq() belongs to the R package: tm

                                    Term Frequency Vector

                                       Translator: LoveR, the translation robot of 生物统计家园网

----------Description----------

Generate a term frequency vector from a text document.

----------Usage----------

termFreq(doc, control = list())

----------Arguments----------

Argument: doc
An object inheriting from TextDocument.

Argument: control
A list of control options which override the default settings. First, the following two options are processed.

tolower: Either a logical value indicating whether characters should be translated to lower case, or a custom function converting characters to lower case. Defaults to tolower.

tokenize: A function tokenizing documents into single tokens, or a string matching one of the predefined tokenization functions:

scan for scan_tokenizer, or

MC for MC_tokenizer.

Defaults to scan_tokenizer.

Next, a set of options is processed that is sensitive to the order of occurrence in the control list. Options are processed in the order in which they are specified. User-specified options take precedence over the default ordering: first all user-specified options are processed, then all remaining options (with their default settings, in the order listed below).

removePunctuation: A logical value indicating whether punctuation characters should be removed from doc, a custom function which performs punctuation removal, or a list of arguments for removePunctuation. Defaults to FALSE.

removeNumbers: A logical value indicating whether numbers should be removed from doc, or a custom function for number removal. Defaults to FALSE.

stopwords: Either a Boolean value indicating stopword removal using the default language-specific stopword lists shipped with this package, a character vector holding custom stopwords, or a custom function for stopword removal. Defaults to FALSE.

stemming: Either a Boolean value indicating whether tokens should be stemmed, or a custom stemming function. Defaults to FALSE.

Finally, the following options are processed in the given order.

dictionary: A character vector to be tabulated against. No other terms will be listed in the result. Defaults to NULL, which means that all terms in doc are listed.

bounds: A list with a tag local whose value must be an integer vector of length 2. Terms that appear less often in doc than the lower bound bounds$local[1] or more often than the upper bound bounds$local[2] are discarded. Defaults to list(local = c(1, Inf)) (i.e., every token will be used).

wordLengths: An integer vector of length 2. Words shorter than the minimum word length wordLengths[1] or longer than the maximum word length wordLengths[2] are discarded. Defaults to c(3, Inf), i.e., a minimum word length of 3 characters.
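
The three final options (dictionary, bounds, wordLengths) can be pictured as successive filters on a table of token counts. The following is a minimal base-R sketch of that documented behaviour, not tm's actual implementation; the helper name filter_counts and the sample tokens are made up for illustration.

```r
# Hypothetical helper (not part of tm): mimics the documented effect of
# dictionary, bounds$local and wordLengths on a character vector of tokens.
filter_counts <- function(tokens,
                          dictionary = NULL,
                          bounds = list(local = c(1, Inf)),
                          wordLengths = c(3, Inf)) {
  counts <- table(tokens)
  tf <- setNames(as.integer(counts), names(counts))
  # dictionary: keep only the listed terms
  if (!is.null(dictionary)) tf <- tf[names(tf) %in% dictionary]
  # bounds$local: drop terms occurring less often than the lower bound
  # or more often than the upper bound
  tf <- tf[tf >= bounds$local[1] & tf <= bounds$local[2]]
  # wordLengths: drop terms shorter or longer than the limits
  len <- nchar(names(tf))
  tf[len >= wordLengths[1] & len <= wordLengths[2]]
}

# Invented sample tokens
tokens <- c("oil", "oil", "opec", "of", "prices", "oil", "opec")

# Defaults: only "of" is dropped (shorter than 3 characters)
res_default <- filter_counts(tokens)
# Lower bound of 2: terms seen only once are dropped as well
res_bounded <- filter_counts(tokens, bounds = list(local = c(2, Inf)))
print(res_default)
print(res_bounded)
```

With the defaults, only the two-character token is removed; raising the lower bound to 2 additionally removes every term that occurs once.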

----------Value----------

A named integer vector of class term_frequency, with term frequencies as values and tokens as names.
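
Since the result is described as a named integer vector, the usual vector tools should apply to it. The sketch below builds a stand-in vector by hand instead of calling termFreq(); the counts are invented, and the exact class attribute shown is an assumption, so only the general structure matters.

```r
# Stand-in for a termFreq() result: names are tokens, values are counts.
# The numbers are invented and the class vector is an assumption.
tf <- structure(c(oil = 5L, opec = 3L, prices = 2L),
                class = c("term_frequency", "integer"))

tf[["opec"]]                                      # look up one term by name
top <- sort(unclass(tf), decreasing = TRUE)[1:2]  # two most frequent terms
names(top)
sum(tf)                                           # total number of counted tokens
```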

----------See Also----------

getTokenizers

----------Examples----------

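library(tm)  # provides termFreq() and the "crude" example corpus
data("crude")
# Term frequencies of one document with the default settings
termFreq(crude[[14]])
# Custom tokenizer: split the text on runs of whitespace
strsplit_space_tokenizer <- function(x) unlist(strsplit(x, "[[:space:]]+"))
ctrl <- list(tokenize = strsplit_space_tokenizer,
             removePunctuation = list(preserve_intra_word_dashes = TRUE),
             stopwords = c("reuter", "that"),
             stemming = TRUE,  # may require an additional stemmer package
             wordLengths = c(4, Inf))
# Same document, processed with the custom control list
termFreq(crude[[14]], control = ctrl)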
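
The Arguments section states that the middle group of options is applied in the order given in the control list. A rough base-R illustration of why that order can matter follows. It uses made-up helper functions (drop_punct, drop_stop, apply_in_order) rather than tm's own code, and it assumes that each step operates on a vector of tokens.

```r
# Hypothetical stand-ins for two processing steps (not tm functions).
drop_punct <- function(tokens) gsub("[[:punct:]]+", "", tokens)
drop_stop  <- function(tokens) tokens[!tokens %in% c("that")]

# Apply a list of steps in the order in which they are listed.
apply_in_order <- function(tokens, steps) {
  for (step in steps) tokens <- step(tokens)
  tokens
}

# Invented sample tokens, as a whitespace tokenizer might produce them
tokens <- c("said", "that,", "prices", "rose.")

# Punctuation first: "that," becomes "that" and is then removed.
a <- apply_in_order(tokens, list(drop_punct, drop_stop))
# Stopwords first: "that," does not match "that", so it survives.
b <- apply_in_order(tokens, list(drop_stop, drop_punct))
print(a)
print(b)
```

In this sketch the stopword only disappears when punctuation is stripped first, which is the kind of difference the ordering of the control list is meant to let the user decide.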

When reposting, please credit the source: 生物统计家园网 (http://www.biostatistic.net).

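Notes:
Note 1: For the convenience of learners, this document was translated by LoveR, the robot of 生物统计家园网. It is intended only as a personal reference for learning R; 生物统计家园 retains the copyright.
Note 2: Because the translation was produced automatically by a robot, some inaccuracies are unavoidable. Carefully comparing the Chinese and English text while reading can help with learning R.
Note 3: If you find an inaccuracy, please reply at the end of this thread; we will revise the text over time.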
