textcnt(tau)
textcnt() belongs to the R package tau
Term or Pattern Counting of Text Documents
Description
This function provides a common interface for typical term or pattern counting tasks on text documents.
Usage
textcnt(x, n = 3L, split = "[[:space:][:punct:][:digit:]]+",
tolower = TRUE, marker = "_", words = NULL, lower = 0L,
method = c("ngram", "string", "prefix", "suffix"),
recursive = FALSE, persistent = FALSE, useBytes = FALSE,
perl = TRUE, verbose = FALSE, decreasing = FALSE)
## S3 method for class 'textcnt'
format(x, ...)
Arguments
Argument: x
a (list of) vector(s) of character representing one (or more) text document(s).
Argument: n
the maximum number of characters considered in ngram, prefix, or suffix counting (for word counting see details).
Argument: split
the regular expression pattern (PCRE) to be used in word splitting (if NULL, do nothing).
Argument: tolower
option to transform the documents to lowercase (after word splitting).
Argument: marker
the string used to mark word boundaries.
Argument: words
the number of words to use from the beginning of a document (if NULL, all words are used).
Argument: lower
the lower bound for a count to be included in the result set(s).
Argument: method
the type of counts to compute.
Argument: recursive
option to compute counts for individual documents (default all documents).
Argument: persistent
option to count documents incrementally.
Argument: useBytes
option to process byte-by-byte instead of character-by-character.
Argument: perl
option to use PCRE in word splitting.
Argument: verbose
option to obtain timing statistics.
Argument: decreasing
option to return the counts in decreasing order.
Argument: ...
further (unused) arguments.
Details
The following counting methods are currently implemented:
ngram: Count all word n-grams of order 1,...,n.
string: Count all word sequence n-grams of order n.
prefix: Count all word prefixes of at most length n.
suffix: Count all word suffixes of at most length n.
For a word, the n-grams of order n are defined to be the substrings of length m = min(length(word), n) starting at positions 1,...,length(word)-m+1. The value of marker is prepended and appended to the word before counting. The empty word, however, is never marked and is therefore not counted. Note that marker = "\1" is reserved for counting an efficient set of n-grams, and marker = "\2" for the set proposed by Cavnar and Trenkle (see References).
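The n-gram definition can be made concrete with a small base R sketch. It does not call tau; word_ngrams is a hypothetical helper written only for illustration, and it assumes the marker is added before the substrings are taken. It enumerates the n-grams of a single order, whereas method = "ngram" counts all orders 1,...,n.

```r
# Hypothetical helper (not part of tau): list the character n-grams of one
# order for a single word, after adding the marker at both ends.
word_ngrams <- function(word, n = 3L, marker = "_") {
  if (!nzchar(word)) return(character(0))   # the empty word is never marked
  marked <- paste0(marker, word, marker)
  len <- nchar(marked)
  m <- min(len, n)                          # cannot exceed the marked word
  starts <- seq_len(len - m + 1L)           # every valid start position
  substring(marked, starts, starts + m - 1L)
}

grams <- word_ngrams("fox", n = 3L)
print(grams)
print(table(word_ngrams("banana", n = 2L)))  # repeated n-grams show up as counts > 1
```

Tabulating the result, as in the last line, gives per-pattern counts for one word; textcnt itself accumulates such counts over all words of all documents.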
If method = "string", only word sequences of exactly length n are counted. Therefore, documents with fewer than n words are omitted.
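As a rough illustration of word-sequence counting, the following base R sketch splits a document with the default split pattern, lowercases the words after splitting, and tabulates every run of exactly n consecutive words. The helper word_sequences is hypothetical, and joining the words with a single space is an assumption made for readability, not a statement about how textcnt names its results.

```r
# Hypothetical helper (not part of tau): count sequences of exactly n words.
word_sequences <- function(text, n = 3L,
                           split = "[[:space:][:punct:][:digit:]]+") {
  words <- strsplit(text, split, perl = TRUE)[[1]]
  words <- tolower(words[nzchar(words)])      # lowercase after splitting
  if (length(words) < n)                      # too short: document is omitted
    return(table(character(0)))
  starts <- seq_len(length(words) - n + 1L)
  seqs <- vapply(starts,
                 function(i) paste(words[i:(i + n - 1L)], collapse = " "),
                 character(1))
  table(seqs)
}

cnt <- word_sequences("The quick brown fox.", n = 3L)
print(cnt)
short <- word_sequences("Too short", n = 3L)  # fewer than n words
```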
By default all documents are preprocessed and counted using a single C function call. For large document collections this may come at the price of considerable memory consumption. If persistent = TRUE and recursive = TRUE, documents are counted incrementally, i.e., into a persistent prefix tree using as many C function calls as there are documents. Further, if persistent = TRUE and recursive = FALSE, the documents are counted using a single call, but no result is returned until the next call with persistent = FALSE. Thus, persistent acts as a switch, with the counts being accumulated until release. Timing statistics have shown that incremental counting can be orders of magnitude faster than the default.
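The switch behaviour of persistent can be sketched as follows. The two documents are made up for illustration, and the block only runs if the tau package is installed; it relies on nothing beyond the arguments listed under Usage.

```r
counts <- NULL
if (requireNamespace("tau", quietly = TRUE)) {
  docs <- c("the quick brown fox", "the lazy dog")   # illustrative documents
  # First call: counts go into the persistent prefix tree, no result yet.
  tau::textcnt(docs[1], method = "string", n = 1L, persistent = TRUE)
  # Final call with the default persistent = FALSE: releases the counts
  # accumulated over both calls.
  counts <- tau::textcnt(docs[2], method = "string", n = 1L)
  print(counts)
}
```

For a large collection, the same pattern extends to a loop that passes one chunk of documents per call with persistent = TRUE and finishes with a single call using persistent = FALSE.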
Be aware that the character strings in the documents are translated to the encoding of the current locale if their encoding is set (see Encoding). In a "UTF-8" locale, strings may have "unknown" encoding, or may be declared to be in "UTF-8" while being invalid; the code therefore checks whether each string is a valid "UTF-8" string and stops if it is not. Otherwise, strings are processed bytewise without any checks. However, embedded nul bytes are always removed from a string. Finally, note that during incremental counting a change of locale is not allowed (and a change in method is not recommended).
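To see which inputs would pass such a validity check, the strings can be inspected in base R before counting. This is only a sketch of a pre-check on the caller's side; it uses validUTF8 (available in R 3.3.0 and later) and enc2utf8 from base R and makes no claim about how the C code performs its own check.

```r
# A string with a declared encoding can be converted reliably.
declared <- "f\xf6x"
Encoding(declared) <- "latin1"
converted <- enc2utf8(declared)        # translate latin1 -> UTF-8
ok <- validUTF8(converted)

# The same bytes without a declared encoding are not valid UTF-8.
undeclared <- "f\xf6x"
bad <- validUTF8(undeclared)

print(c(declared = ok, undeclared = bad))
```

Declaring the encoding (or converting the input up front) avoids handing strings of "unknown" encoding to the counting code.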
Note that the C implementation counts words into a prefix tree. While this is highly efficient for n-gram, prefix, or suffix counting, it may be less efficient for simple word counting. That is, implementations which use hash tables may be more efficient if the dictionary is large.
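For comparison, a hash-table word count can be sketched in base R with a hashed environment. This is not how tau works internally; it only illustrates the alternative mentioned above, using the default split pattern from Usage.

```r
txt <- "The quick brown fox jumps over the lazy dog."
words <- strsplit(txt, "[[:space:][:punct:][:digit:]]+", perl = TRUE)[[1]]
words <- tolower(words[nzchar(words)])

counts <- new.env(hash = TRUE)         # environment used as a hash table
for (w in words) {
  old <- counts[[w]]                   # NULL if the word is not yet present
  counts[[w]] <- (if (is.null(old)) 0L else old) + 1L
}

# Collect the table into a named integer vector, sorted by word.
result <- unlist(mget(sort(ls(counts)), envir = counts))
print(result)
```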
format.textcnt pretty-prints a named vector of counts (see below), including information about the rank and encoding details of the strings.
Value
Either a single vector of counts of mode integer, with names indexing the patterns counted, or a list of such vectors with components corresponding to the individual documents. By default the counts are in prefix tree (byte) order (for method = "suffix" this is the order of the reversed strings). If decreasing = TRUE, the counts are instead sorted in decreasing order, with the (default) order of ties preserved (see sort).
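What "decreasing order with ties preserved" means can be shown on a small named vector. The counts below are made up for illustration, and the base R idiom order(-x) is used here only because it is stable for ties; it is not taken from the package source.

```r
# Made-up counts, listed in their default order.
counts <- c(brown = 1L, the = 2L, dog = 1L, fox = 2L)

# order() leaves ties in their original relative order, so sorting on the
# negated counts gives a decreasing sort that preserves the order of ties.
sorted <- counts[order(-counts)]
print(sorted)
```

Here "the" stays ahead of "fox", and "brown" ahead of "dog", exactly as they appeared in the unsorted vector.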
Note
The C functions can be interrupted by CTRL-C. This is convenient in interactive mode but comes at the price that the C code cannot clean up the internal prefix tree. This is a known problem of the R API, and the workaround is to defer the cleanup to the next function call.
The C code calls translateChar for all input strings, which is documented to release the allocated memory no sooner than when returning from the .Call/.External interface. Therefore, in order to avoid excessive memory consumption, it is recommended either to translate the input data to the current locale beforehand or to process the data incrementally.
useBytes may not be fully functional with R versions where strsplit does not support that argument.
If useBytes = TRUE, the character strings of names will never be declared to be in an encoding.
Author(s)
Christian Buchta
References
Cavnar, W. B. and Trenkle, J. M. (1994). N-Gram Based Text Categorization. In Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval, 161–175.
Examples
library(tau)

## the classic
txt <- "The quick brown fox jumps over the lazy dog."

## character n-grams, prefixes, and suffixes
textcnt(txt, method = "ngram")
textcnt(txt, method = "prefix", n = 5L)
r <- textcnt(txt, method = "suffix", lower = 1L)
data.frame(counts = unclass(r), size = nchar(names(r)))
format(r)

## word sequences
textcnt(txt, method = "string")

## inefficient (splits the text into single characters)
textcnt(txt, split = "", method = "string", n = 1L)

## incremental (the first call accumulates, the second releases the counts)
textcnt(txt, method = "string", persistent = TRUE, n = 1L)
textcnt(txt, method = "string", n = 1L)

## subset (only the first five words)
textcnt(txt, method = "string", words = 5L, n = 1L)

## non-ASCII
txt <- "The quick br\xfcn f\xf6x j\xfbmps \xf5ver the lazy d\xf6\xf8g."
Encoding(txt) <- "latin1"
txt

## implicit translation
r <- textcnt(txt, method = "suffix")
table(Encoding(names(r)))
r

## efficient sets
textcnt("is", n = 3L, marker = "\1")
textcnt("is", n = 4L, marker = "\1")
textcnt("corpus", n = 5L, marker = "\1")

## CT sets (Cavnar and Trenkle)
textcnt("corpus", n = 5L, marker = "\2")