
R: grep() function help documentation

Posted on 2012-2-16 18:59:22
grep(base)
grep() belongs to R package: base

                                        Pattern Matching and Replacement

                                         Translator: 生物统计家园网 robot LoveR

----------Description----------

grep, grepl, regexpr and gregexpr search for matches to argument pattern within each element of a character vector: they differ in the format of and amount of detail in the results.

sub and gsub perform replacement of the first and all matches respectively.
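
The difference in result format is easiest to see by running each function on the same input. The following is a minimal sketch; the vector x and the pattern "an" are made up for illustration and do not come from the help page.

```r
# Illustrative input (an assumption, not part of the original page)
x <- c("apple", "banana", "cherry")

grep("an", x)                # indices of the elements that match
grep("an", x, value = TRUE)  # the matching elements themselves
grepl("an", x)               # one logical value per element
regexpr("an", x)             # start of the first match in each element (-1 if none)
gregexpr("an", x)[[2]]       # starts of all matches in "banana"
sub("an", "AN", x)           # replaces the first match only
gsub("an", "AN", x)          # replaces every match
```
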


----------Usage----------


grep(pattern, x, ignore.case = FALSE, perl = FALSE, value = FALSE,
     fixed = FALSE, useBytes = FALSE, invert = FALSE)

grepl(pattern, x, ignore.case = FALSE, perl = FALSE,
      fixed = FALSE, useBytes = FALSE)

sub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE,
    fixed = FALSE, useBytes = FALSE)

gsub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE,
     fixed = FALSE, useBytes = FALSE)

regexpr(pattern, text, ignore.case = FALSE, perl = FALSE,
        fixed = FALSE, useBytes = FALSE)

gregexpr(pattern, text, ignore.case = FALSE, perl = FALSE,
         fixed = FALSE, useBytes = FALSE)

regexec(pattern, text, ignore.case = FALSE,
        fixed = FALSE, useBytes = FALSE)



----------Arguments----------

Argument: pattern
character string containing a regular expression (or character string for fixed = TRUE) to be matched in the given character vector.  Coerced by as.character to a character string if possible.  If a character vector of length 2 or more is supplied, the first element is used with a warning.  Missing values are allowed except for regexpr and gregexpr.


Argument: x, text
a character vector where matches are sought, or an object which can be coerced by as.character to a character vector.


Argument: ignore.case
if FALSE, the pattern matching is case sensitive and if TRUE, case is ignored during matching.


Argument: perl
logical.  Should perl-compatible regexps be used? Has priority over extended.


Argument: value
if FALSE, a vector containing the (integer) indices of the matches determined by grep is returned, and if TRUE, a vector containing the matching elements themselves is returned.


Argument: fixed
logical.  If TRUE, pattern is a string to be matched as is.  Overrides all conflicting arguments.


Argument: useBytes
logical.  If TRUE the matching is done byte-by-byte rather than character-by-character.  See "Details".


Argument: invert
logical.  If TRUE, return indices or values for elements that do not match.
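
As a rough illustration of how ignore.case, value, fixed and invert interact, consider the sketch below. The vector files is a hypothetical example and is not taken from the help page.

```r
# Hypothetical file names used only for illustration
files <- c("notes.txt", "data.csv", "README", "plot.R")

# Default: integer indices of elements ending in .txt or .csv
idx <- grep("\\.(txt|csv)$", files)

# value = TRUE with invert = TRUE: the elements that do NOT match
others <- grep("\\.(txt|csv)$", files, value = TRUE, invert = TRUE)

# ignore.case = TRUE: "readme" also matches "README"
has_readme <- grepl("readme", files, ignore.case = TRUE)

# fixed = TRUE: "." is a literal dot, not "any character"
with_dot <- grep(".", files, fixed = TRUE, value = TRUE)
```

Without fixed = TRUE the pattern "." would match every non-empty string, so "README" would be returned as well.
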


Argument: replacement
a replacement for matched pattern in sub and gsub.  Coerced to character if possible.  For fixed = FALSE this can include backreferences "\1" to "\9" to parenthesized subexpressions of pattern.  For perl = TRUE only, it can also contain "\U" or "\L" to convert the rest of the replacement to upper or lower case and "\E" to end case conversion.  If a character vector of length 2 or more is supplied, the first element is used with a warning.  If NA, all elements in the result corresponding to matches will be set to NA.
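
The replacement rules can be sketched as follows. The input strings are invented for illustration; note that inside an R string literal each backslash must itself be escaped, so the backreference "\1" is written "\\1".

```r
# Illustrative input (an assumption, not part of the original page)
x <- "hello world"

# Backreferences: swap the two words
swapped <- sub("([a-z]+) ([a-z]+)", "\\2 \\1", x)

# perl = TRUE only: \U upper-cases the rest of the replacement
title <- gsub("(^|\\s)(\\w)", "\\1\\U\\2", x, perl = TRUE)

# \E ends the case conversion, so "!" is appended unchanged
shout <- gsub("(\\w+)", "\\U\\1\\E!", x, perl = TRUE)

# An NA replacement turns matching elements into NA and leaves the rest alone
na_res <- sub("l+", NA_character_, c("hello", "sky"))
```
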


----------Details----------

Arguments which should be character strings or character vectors are coerced to character if possible.

Each of these functions (apart from regexec, which currently does not support Perl-style regular expressions) operates in one of three modes:

fixed = TRUE: use exact matching.

perl = TRUE: use Perl-style regular expressions.

fixed = FALSE, perl = FALSE: use POSIX 1003.2 extended regular expressions.
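
The three modes above can be compared on one small input. This is only a sketch; the vector x and the patterns are illustrative assumptions.

```r
# Illustrative input (an assumption, not part of the original page)
x <- c("a.b", "axb", "a+b")

# Default mode (POSIX extended): "." matches any single character
r_default <- grepl("a.b", x)

# fixed = TRUE: the pattern is matched literally, so only "a.b" matches
r_fixed <- grepl("a.b", x, fixed = TRUE)

# Default mode with the metacharacter escaped gives the same answer as fixed = TRUE
r_escaped <- grepl("a\\.b", x)

# perl = TRUE: PCRE-only syntax such as lookahead becomes available
r_perl <- grepl("a(?=\\+)", x, perl = TRUE)
```
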

See the help pages on regular expression for details of the different types of regular expressions.

The two *sub functions differ only in that sub replaces only the first occurrence of a pattern whereas gsub replaces all occurrences.  If replacement contains backreferences which are not defined in pattern the result is undefined (but most often the backreference is taken to be "").

For regexpr, gregexpr and regexec it is an error for pattern to be NA, otherwise NA is permitted and gives an NA match.
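
A short sketch of how a missing value in the searched text (as opposed to the pattern) propagates; the vector x is an illustrative assumption.

```r
# Illustrative input containing a missing value
x <- c("abc", NA)

hit   <- grepl("a", x)                 # an NA element counts as "no match"
idx   <- grep("a", x)                  # only the index of "abc"
repl  <- sub("a", "A", x)              # the NA element stays NA
first <- as.integer(regexpr("a", x))   # the match position is NA for the NA element
```
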

The main effect of useBytes is to avoid errors/warnings about invalid inputs and spurious matches in multibyte locales, but for regexpr it changes the interpretation of the output. It inhibits the conversion of inputs with marked encodings, and is forced if any input is found which is marked as "bytes".

Caseless matching does not make much sense for bytes in a multibyte locale, and you should expect it only to work for ASCII characters if useBytes = TRUE.

As from R 2.14.0, regexpr and gregexpr with perl = TRUE allow Python-style named captures.


----------Value----------

grep(value = FALSE) returns an integer vector of the indices of the elements of x that yielded a match (or not, for invert = TRUE).

grep(value = TRUE) returns a character vector containing the selected elements of x (after coercion, preserving names but no other attributes).

grepl returns a logical vector (match or not for each element of x).

sub and gsub return a character vector of the same length and with the same attributes as x (after possible coercion to character).  Elements of character vectors x which are not substituted will be returned unchanged (including any declared encoding).  If useBytes = FALSE a non-ASCII substituted result will often be in UTF-8 with a marked encoding (e.g. if there is a UTF-8 input, and in a multibyte locale unless fixed = TRUE). Such strings can be re-encoded by enc2native.

regexpr returns an integer vector of the same length as text giving the starting position of the first match or -1 if there is none, with attribute "match.length", an integer vector giving the length of the matched text (or -1 for no match).  The match positions and lengths are in characters unless useBytes = TRUE is used, when they are in bytes.  If named capture is used there are further attributes "capture.start", "capture.length" and "capture.names".
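
To make the shape of the regexpr result concrete, here is a small sketch. The vector x is an illustrative assumption; regmatches is the companion function listed under See Also.

```r
# Illustrative input (an assumption, not part of the original page)
x <- c("cat 42", "no digits", "7 dwarfs 12")

m <- regexpr("[0-9]+", x)

starts  <- as.integer(m)            # start of the first match, -1 where there is none
lens    <- attr(m, "match.length")  # length of each match, -1 where there is none
matched <- regmatches(x, m)         # matched text; non-matching elements are dropped
```
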

gregexpr returns a list of the same length as text, each element of which is of the same form as the return value for regexpr, except that the starting positions of every (disjoint) match are given.
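
A corresponding sketch for gregexpr, again on made-up input: the result is a list with one element per input string, and regmatches turns it into a list of all matched substrings.

```r
# Illustrative input (an assumption, not part of the original page)
x <- c("cat 42", "no digits", "7 dwarfs 12")

g <- gregexpr("[0-9]+", x)

third_starts <- as.integer(g[[3]])  # starts of both numbers in "7 dwarfs 12"
none         <- as.integer(g[[2]])  # -1 when an element has no match
all_numbers  <- regmatches(x, g)    # list of character vectors, one per element of x
```
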

regexec returns a list of the same length as text, each element of which is either -1 if there is no match, or a sequence of integers with the starting positions of the match and all substrings corresponding to parenthesized subexpressions of pattern, with attribute "match.length" an integer vector giving the lengths of the matches (or -1 for no match).
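
The regexec result differs from regexpr in that it also reports the parenthesized groups. A minimal sketch, assuming a made-up "key=value" input:

```r
# Illustrative input (an assumption, not part of the original page)
x <- c("key=value", "novalue")

m <- regexec("^([a-z]+)=([a-z]+)$", x)

starts <- as.integer(m[[1]])                     # whole match, then group 1, then group 2
lens   <- as.integer(attr(m[[1]], "match.length"))
parts  <- regmatches(x, m)                       # matched text for the whole match and each group
```

For an element with no match, regmatches returns character(0), which is why the second list element below is empty.
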


----------Warning----------

POSIX 1003.2 mode of gsub and gregexpr does not work correctly with repeated word-boundaries (e.g. pattern = "\b").  Use perl = TRUE for such matches (but that may not work as expected with non-ASCII inputs, as the meaning of "word" is system-dependent).


----------Performance considerations----------

If you are doing a lot of regular expression matching, including on very long strings, you will want to consider the options used. Generally PCRE will be faster than the default regular expression engine, and fixed = TRUE faster still (especially when each pattern is matched only a few times).

If you are working in a single-byte locale and have marked UTF-8 strings that are representable in that locale, convert them first, as just one UTF-8 string will force all the matching to be done in Unicode, which attracts a penalty of around 3x for the default POSIX 1003.2 mode.

If you can make use of useBytes = TRUE, the strings will not be checked before matching, and the actual matching will be faster. Often byte-based matching suffices in a UTF-8 locale since byte patterns of one character never match part of another.
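
One way to check these claims on your own data is to time the alternatives with system.time. The sketch below uses an invented ASCII input and an arbitrary repetition count; actual timings depend on the R version, platform and data, so no figures are shown. It first confirms that all four variants agree, since switching mode is only safe when the pattern means the same thing in each.

```r
# Invented ASCII input; size and repetition count are arbitrary assumptions
x <- rep(c("alpha beta", "gamma delta", "beta gamma"), 1e4)

r_default <- grepl("beta", x)
r_perl    <- grepl("beta", x, perl = TRUE)
r_fixed   <- grepl("beta", x, fixed = TRUE)
r_bytes   <- grepl("beta", x, fixed = TRUE, useBytes = TRUE)

# For a plain literal pattern on ASCII text, all variants give the same result
same <- identical(r_default, r_perl) &&
  identical(r_default, r_fixed) &&
  identical(r_default, r_bytes)

# Rough timings for comparison
t_default <- system.time(for (i in 1:20) grepl("beta", x))
t_perl    <- system.time(for (i in 1:20) grepl("beta", x, perl = TRUE))
t_fixed   <- system.time(for (i in 1:20) grepl("beta", x, fixed = TRUE))
t_bytes   <- system.time(for (i in 1:20) grepl("beta", x, fixed = TRUE, useBytes = TRUE))
```
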


----------Note----------

Prior to R 2.11.0 there was an argument extended which could be used to select "basic" regular expressions: this was often used when fixed = TRUE would be preferable.  In the actual implementation (as distinct from the POSIX standard) the only difference was that ?, +, {, |, (, and ) were not interpreted as metacharacters.


----------Source----------

The C code for POSIX-style regular expression matching has changed over the years.  As from R 2.10.0 the TRE library of Ville Laurikari (http://laurikari.net/tre/) is used.  From 2005 to R 2.9.2, code based on glibc was used (and before that, code from GNU grep).  The POSIX standard does give some room for interpretation, especially in the handling of invalid regular expressions and the collation of character ranges, so the results will have changed slightly.

For Perl-style matching PCRE (http://www.pcre.org) is used.


----------References----------

Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) The New S Language. Wadsworth & Brooks/Cole (grep)

----------See Also----------

regular expression (aka regexp) for the details of the pattern specification.

regmatches for extracting matched substrings based on the results of regexpr, gregexpr and regexec.

glob2rx to turn wildcard matches into regular expressions.

agrep for approximate matching.

charmatch, pmatch for partial matching, match for matching to whole strings.

tolower, toupper and chartr for character translations.

apropos uses regexps and has more examples.

grepRaw for matching raw vectors.


----------Examples----------


grep("[a-z]", letters)

txt <- c("arm","foot","lefroo", "bafoobar")
if(length(i <- grep("foo",txt)))
   cat("'foo' appears at least once in\n\t",txt,"\n")
i # 2 and 4
txt[i]

## Double all 'a' or 'b's;  "\" must be escaped, i.e., 'doubled'
gsub("([ab])", "\\1_\\1_", "abc and ABC")

txt <- c("The", "licenses", "for", "most", "software", "are",
  "designed", "to", "take", "away", "your", "freedom",
  "to", "share", "and", "change", "it.",
   "", "By", "contrast,", "the", "GNU", "General", "Public", "License",
   "is", "intended", "to", "guarantee", "your", "freedom", "to",
   "share", "and", "change", "free", "software", "--",
   "to", "make", "sure", "the", "software", "is",
   "free", "for", "all", "its", "users")
( i <- grep("[gu]", txt) ) # indices
stopifnot( txt[i] == grep("[gu]", txt, value = TRUE) )

## Note that in locales such as en_US this includes B as the
## collation order is aAbBcCdDeE ...
(ot <- sub("[b-e]",".", txt))
txt[ot != gsub("[b-e]",".", txt)]#- gsub does "global" substitution

txt[gsub("g","#", txt) !=
    gsub("g","#", txt, ignore.case = TRUE)] # the "G" words

regexpr("en", txt)

gregexpr("e", txt)

## Using grepl() for filtering
## Find functions with argument names matching "warn":
findArgs <- function(env, pattern) {
  nms <- ls(envir = as.environment(env))
  nms <- nms[is.na(match(nms, c("F","T")))] # <-- work around "checking hack"
  aa <- sapply(nms, function(.) { o <- get(.)
               if(is.function(o)) names(formals(o)) })
  iw <- sapply(aa, function(a) any(grepl(pattern, a, ignore.case=TRUE)))
  aa[iw]
}
findArgs("package:base", "warn")

## trim trailing white space
str <- 'Now is the time      '
sub(' +$', '', str)  ## spaces only
sub('[[:space:]]+$', '', str) ## white space, POSIX-style
sub('\\s+$', '', str, perl = TRUE) ## Perl-style white space

## capitalizing
txt <- "a test of capitalizing"
gsub("(\\w)(\\w*)", "\\U\\1\\L\\2", txt, perl=TRUE)
gsub("\\b(\\w)",    "\\U\\1",       txt, perl=TRUE)

txt2 <- "useRs may fly into JFK or laGuardia"
gsub("(\\w)(\\w*)(\\w)", "\\U\\1\\E\\2\\U\\3", txt2, perl=TRUE)
sub("(\\w)(\\w*)(\\w)", "\\U\\1\\E\\2\\U\\3", txt2, perl=TRUE)

## named capture
notables <- c("  Ben Franklin and Jefferson Davis",
              "\tMillard Fillmore")
# name groups 'first' and 'last'
name.rex <- "(?<first>[[:upper:]][[:lower:]]+) (?<last>[[:upper:]][[:lower:]]+)"
(parsed <- regexpr(name.rex, notables, perl = TRUE))
gregexpr(name.rex, notables, perl = TRUE)[[2]]
parse.one <- function(res, result) {
  m <- do.call(rbind, lapply(seq_along(res), function(i) {
    if(result[i] == -1) return("")
    st <- attr(result, "capture.start")[i, ]
    substring(res[i], st, st + attr(result, "capture.length")[i, ] - 1)
  }))
  colnames(m) <- attr(result, "capture.names")
  m
}
parse.one(notables, parsed)

## Decompose a URL into its components.
## Example by LT (http://www.cs.uiowa.edu/~luke/R/regexp.html).
x <- "http://stat.umn.edu:80/xyz"
m <- regexec("^(([^:]+)://)?([^:/]+)(:([0-9]+))?(/.*)", x)
m
regmatches(x, m)
## Element 3 is the protocol, 4 is the host, 6 is the port, and 7
## is the path.  We can use this to make a function for extracting the
## parts of a URL:
URL_parts <- function(x) {
    m <- regexec("^(([^:]+)://)?([^:/]+)(:([0-9]+))?(/.*)", x)
    parts <- do.call(rbind,
                     lapply(regmatches(x, m), `[`, c(3L, 4L, 6L, 7L)))
    colnames(parts) <- c("protocol","host","port","path")
    parts
}
URL_parts(x)

When reposting, please credit the source: 生物统计家园网 (http://www.biostatistic.net).


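Note: This document was machine-translated by the 生物统计家园网 robot LoveR for the convenience of learners. It is intended only as a personal reference for studying R; 生物统计家园 reserves the copyright.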