termFreq(tm)
termFreq() belongs to the R package tm
Term Frequency Vector
Translator: 生物统计家园网 (biostatistic.net), robot LoveR
----------Description----------
Generate a term frequency vector from a text document.
----------Usage----------
termFreq(doc, control = list())
----------Arguments----------
Argument: doc
An object inheriting from TextDocument.
Argument: control
A list of control options which override the default settings. First, the following two options are processed.
tolower: Either a logical value indicating whether characters should be translated to lower case, or a custom function converting characters to lower case. Defaults to tolower.
tokenize: A function tokenizing documents into single tokens, or a string matching one of the predefined tokenization functions: scan for scan_tokenizer, or MC for MC_tokenizer. Defaults to scan_tokenizer.
Next comes a set of options which are sensitive to their order of occurrence in the control list. Options are processed in the order in which they are specified. User-specified options take precedence over the default ordering: first all user-specified options are processed, then all remaining options (with their default settings and in the order listed below).
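Why the order of these options can matter is perhaps easiest to see on a toy example. The following is a base-R sketch only; it does not call tm, and the helpers `strip_punct` and `drop_stop` as well as the token and stopword vectors are hypothetical stand-ins for the punctuation-removal and stopword-removal steps.

```r
# Base-R illustration of order sensitivity; not tm's implementation.
tokens <- c("Oil", "prices", "that,", "rose", "that")
stop_list <- c("that")  # hypothetical stopword list

# Hypothetical helpers standing in for the two processing steps
strip_punct <- function(x) gsub("[[:punct:]]+", "", x)
drop_stop   <- function(x) x[!(x %in% stop_list)]

# Punctuation first, then stopwords:
# "that," becomes "that" and is then dropped as a stopword
punct_first <- drop_stop(strip_punct(tokens))

# Stopwords first, then punctuation:
# "that," does not match the stopword list, so it survives as "that"
stop_first <- strip_punct(drop_stop(tokens))

print(punct_first)
print(stop_first)
```

The two results differ in whether "that" survives, which suggests why a control list that names these options in a different order may yield different term frequencies.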
removePunctuation: A logical value indicating whether punctuation characters should be removed from doc, a custom function which performs punctuation removal, or a list of arguments for removePunctuation. Defaults to FALSE.
removeNumbers: A logical value indicating whether numbers should be removed from doc, or a custom function for number removal. Defaults to FALSE.
stopwords: Either a Boolean value indicating stopword removal using the default language-specific stopword lists shipped with this package, a character vector holding custom stopwords, or a custom function for stopword removal. Defaults to FALSE.
stemming: Either a Boolean value indicating whether tokens should be stemmed, or a custom stemming function. Defaults to FALSE.
Finally, the following options are processed in the given order.
dictionary: A character vector to be tabulated against. No other terms will be listed in the result. Defaults to NULL, which means that all terms in doc are listed.
bounds: A list with a tag local whose value must be an integer vector of length 2. Terms that appear less often in doc than the lower bound bounds$local[1] or more often than the upper bound bounds$local[2] are discarded. Defaults to list(local = c(1, Inf)) (i.e., every token will be used).
wordLengths: An integer vector of length 2. Words shorter than the minimum word length wordLengths[1] or longer than the maximum word length wordLengths[2] are discarded. Defaults to c(3, Inf), i.e., a minimum word length of 3 characters.
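The effect of the bounds and wordLengths filters described above can be sketched in base R. This is an illustration under stated assumptions, not tm's own code: the token vector and the variable names `local_bounds` and `word_lengths` are hypothetical, and the dictionary option is left out.

```r
# Base-R sketch of frequency and length filtering; does not call tm.
tokens <- c("oil", "oil", "oil", "opec", "opec", "of", "price", "production")

# Count occurrences of each token and store them as a named integer vector
counts <- table(tokens)
tf <- setNames(as.integer(counts), names(counts))

# Analogue of bounds$local: keep terms occurring between 1 and 2 times
local_bounds <- c(1, 2)
tf <- tf[tf >= local_bounds[1] & tf <= local_bounds[2]]

# Analogue of wordLengths: keep terms with at least 3 characters
word_lengths <- c(3, Inf)
len <- nchar(names(tf))
tf <- tf[len >= word_lengths[1] & len <= word_lengths[2]]

print(tf)
```

Here "oil" is dropped because it occurs more often than the upper bound, and "of" is dropped because it is shorter than the minimum word length.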
----------Value----------
A named integer vector of class term_frequency with term frequencies as values and tokens as names.
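Because the result is a named integer vector, ordinary vector operations apply to it. The sketch below uses a hand-built vector `tf` as a hypothetical stand-in for a termFreq() result (it lacks the term_frequency class and the values are made up for illustration).

```r
# Hypothetical stand-in for a termFreq() result: a named integer vector
tf <- c(oil = 5L, opec = 3L, prices = 2L, said = 7L)

# Most frequent terms first
top <- sort(tf, decreasing = TRUE)

# Look up the count of a single term by name
opec_count <- tf[["opec"]]

# Convert to a data frame for further processing
df <- data.frame(term = names(top), freq = as.integer(top))
print(df)
```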
----------See Also----------
getTokenizers
----------Examples----------
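library(tm)  # provides termFreq() and the "crude" example corpus
data("crude")
# Term frequencies with the default control settings
termFreq(crude[[14]])
# Custom tokenizer that splits on runs of whitespace
strsplit_space_tokenizer <- function(x) unlist(strsplit(x, "[[:space:]]+"))
ctrl <- list(tokenize = strsplit_space_tokenizer,
             removePunctuation = list(preserve_intra_word_dashes = TRUE),
             stopwords = c("reuter", "that"),
             stemming = TRUE,
             wordLengths = c(4, Inf))
# Term frequencies with the custom control list
termFreq(crude[[14]], control = ctrl)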
When reposting, please credit the source: 生物统计家园网 (http://www.biostatistic.net).