R language: regex() help documentation

loveR · posted on 2012-2-16 18:38:42

regex(base)
regex() belongs to R package: base

Regular Expressions as used in R

Translator: LoveR, the robot of 生物统计家园网

----------Description----------

This help page documents the regular expression patterns supported by grep and related functions grepl, regexpr, gregexpr, sub and gsub, as well as by strsplit.

----------Details----------

A "regular expression" is a pattern that describes a set of strings.  Two types of regular expressions are used in R: extended regular expressions (the default) and Perl-like regular expressions, selected by perl = TRUE. There is also fixed = TRUE, which can be considered to use a literal regular expression.
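The three modes can be compared on one small input. This is a minimal sketch; the vector x is made up for illustration.

```r
x <- c("a.b", "axb")

# Default (extended) mode: the dot is a metacharacter matching any character.
ere <- grepl("a.b", x)                      # TRUE TRUE

# fixed = TRUE: the pattern is taken literally, so the dot must be a dot.
lit <- grepl("a.b", x, fixed = TRUE)        # TRUE FALSE

# Escaping the dot gives the same result in extended and Perl-like modes.
esc_ere  <- grepl("a\\.b", x)               # TRUE FALSE
esc_perl <- grepl("a\\.b", x, perl = TRUE)  # TRUE FALSE
```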

Other functions which use regular expressions (often via the use of grep) include apropos, browseEnv, help.search, list.files and ls. These will all use extended regular expressions.

Patterns are described here as they would be printed by cat (do remember that backslashes need to be doubled when entering R character strings, e.g. from the keyboard).
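The difference between what is typed and what the regular expression engine sees can be checked directly. A small sketch, with an illustrative variable name:

```r
# Typed with two backslashes in R source; the string itself holds one.
pattern <- "\\."

cat(pattern, "\n")     # shows the pattern as this page writes it: \.
n <- nchar(pattern)    # 2 characters: a backslash and a dot

# The escaped dot matches only a literal dot.
hits <- grepl(pattern, c("a.b", "ab"))   # TRUE FALSE
```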

----------Extended Regular Expressions----------

This section covers the regular expressions allowed in the default mode of grep, regexpr, gregexpr, sub, gsub and strsplit.  They use an implementation of the POSIX 1003.2 standard: that allows some scope for interpretation and the interpretations here are those used as from R 2.10.0.

Regular expressions are constructed analogously to arithmetic expressions, by using various operators to combine smaller expressions.  The whole expression matches zero or more characters (read "character" as "byte" if useBytes = TRUE).

The fundamental building blocks are the regular expressions that match a single character.  Most characters, including all letters and digits, are regular expressions that match themselves.  Any metacharacter with special meaning may be quoted by preceding it with a backslash.  The metacharacters in EREs are . \ | ( ) [ { ^ $ * + ?, but note that whether these have a special meaning depends on the context.

Escaping non-metacharacters with a backslash is implementation-dependent.  The current implementation interprets \a as BEL, \e as ESC, \f as FF, \n as LF, \r as CR and \t as TAB.  (Note that these will be interpreted by R's parser in literal character strings.)

A character class is a list of characters enclosed between [ and ] which matches any single character in that list; unless the first character of the list is the caret ^, when it matches any character not in the list.  For example, the regular expression [0123456789] matches any single digit, and [^abc] matches anything except the characters a, b or c.  A range of characters may be specified by giving the first and last characters, separated by a hyphen.  (Because their interpretation is locale- and implementation-dependent, ranges are best avoided.)  The only portable way to specify all ASCII letters is to list them all as the character class [ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz].  (The current implementation uses numerical order of the encoding: prior to R 2.10.0 locale-specific collation was used, and might be again.)
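The two example classes, and the portable letter class built from R's LETTERS and letters constants, can be tried as follows. The test vector x is an illustrative assumption.

```r
x <- c("7", "a", "d", "-")

# [0123456789] matches any single digit.
is_digit <- grepl("[0123456789]", x)   # TRUE FALSE FALSE FALSE

# [^abc] matches any character other than a, b or c.
not_abc <- grepl("[^abc]", x)          # TRUE FALSE TRUE TRUE

# Build the explicit ASCII letter class instead of relying on a range.
ascii_letters <- paste0("[", paste(c(LETTERS, letters), collapse = ""), "]")
is_letter <- grepl(ascii_letters, c("q", "7"))   # TRUE FALSE
```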

Certain named classes of characters are predefined.  Their interpretation depends on the locale (see locales); the interpretation below is that of the POSIX locale.

[:alnum:] Alphanumeric characters: [:alpha:] and [:digit:].

[:alpha:] Alphabetic characters: [:lower:] and [:upper:].

[:blank:] Blank characters: space and tab, and possibly other locale-dependent characters such as non-breaking space.

[:cntrl:] Control characters.  In ASCII, these characters have octal codes 000 through 037, and 177 (DEL).  In another character set, these are the equivalent characters, if any.

[:digit:] Digits: 0 1 2 3 4 5 6 7 8 9.

[:graph:] Graphical characters: [:alnum:] and [:punct:].

[:lower:] Lower-case letters in the current locale.

[:print:] Printable characters: [:alnum:], [:punct:] and space.

[:punct:] Punctuation characters: ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~.

[:space:] Space characters: tab, newline, vertical tab, form feed, carriage return, space and possibly other locale-dependent characters.

[:upper:] Upper-case letters in the current locale.

[:xdigit:] Hexadecimal digits: 0 1 2 3 4 5 6 7 8 9 A B C D E F a b c d e f.

For example, [[:alnum:]] means [0-9A-Za-z], except the latter depends upon the locale and the character encoding, whereas the former is independent of locale and character set.  (Note that the brackets in these class names are part of the symbolic names, and must be included in addition to the brackets delimiting the bracket list.)  Most metacharacters lose their special meaning inside a character class.  To include a literal ], place it first in the list.  Similarly, to include a literal ^, place it anywhere but first.  Finally, to include a literal -, place it first or last (or, for perl = TRUE only, precede it by a backslash).  (Only ^ - \ ] are special inside character classes.)
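A short sketch of a named class and of the placement rules for literal ], ^ and - inside a bracket list. The input strings are made up for illustration.

```r
x <- c("abc123", "abc_123", "a]b", "a-b", "a^b")

# Named class: note the double brackets.
alnum_only <- grepl("^[[:alnum:]]+$", x)   # TRUE FALSE FALSE FALSE FALSE

# Literal ] placed first in the list.
has_bracket <- grepl("[]]", x)             # FALSE FALSE TRUE FALSE FALSE

# Literal ^ placed anywhere but first.
has_caret <- grepl("[x^]", x)              # FALSE FALSE FALSE FALSE TRUE

# Literal - placed last.
has_hyphen <- grepl("[x-]", x)             # FALSE FALSE FALSE TRUE FALSE
```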

The period . matches any single character.  The symbol \w matches a "word" character (a synonym for [[:alnum:]_]) and \W is its negation.  Symbols \d, \s, \D and \S denote the digit and space classes and their negations.
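The shorthand symbols can be exercised with gsub in the default mode. This is an illustrative sketch; the sample strings are assumptions.

```r
# \d matches a digit.
masked <- gsub("\\d", "#", "room 42")          # "room ##"

# \W matches anything that is not a letter, digit or underscore.
word_chars <- gsub("\\W", "", "a_b-c d")       # "a_bcd"

# \s matches a space character; here runs of them collapse to one space.
squeezed <- gsub("\\s+", " ", "a  \t b")       # "a b"

# The period matches exactly one character.
one_char <- grepl("^.$", c("x", "xy"))         # TRUE FALSE
```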

The caret ^ and the dollar sign $ are metacharacters that respectively match the empty string at the beginning and end of a line.  The symbols \< and \> match the empty string at the beginning and end of a word.  The symbol \b matches the empty string at either edge of a word, and \B matches the empty string provided it is not at an edge of a word.  (The interpretation of "word" depends on the locale and implementation.)
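A sketch of the anchors and word-boundary symbols in the default mode, using a made-up vector:

```r
x <- c("cat", "concat", "cat food", "category")

starts_cat <- grepl("^cat", x)          # TRUE FALSE TRUE TRUE
ends_cat   <- grepl("cat$", x)          # TRUE TRUE FALSE FALSE

# \< and \> require "cat" to stand as a whole word.
whole_word <- grepl("\\<cat\\>", x)     # TRUE FALSE TRUE FALSE

# \b at both edges expresses the same condition.
whole_word_b <- grepl("\\bcat\\b", x)   # TRUE FALSE TRUE FALSE
```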

A regular expression may be followed by one of several repetition quantifiers:

? The preceding item is optional and will be matched at most once.

* The preceding item will be matched zero or more times.

+ The preceding item will be matched one or more times.

{n} The preceding item is matched exactly n times.

{n,} The preceding item is matched n or more times.

{n,m} The preceding item is matched at least n times, but not more than m times.

By default repetition is greedy, so the maximal possible number of repeats is used.  This can be changed to "minimal" by appending ? to the quantifier.  (There are further quantifiers that allow approximate matching: see the TRE documentation.)
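The quantifiers and the greedy versus minimal distinction can be sketched as follows, with illustrative inputs:

```r
# ? makes the preceding item optional.
colour <- grepl("^colou?r$", c("color", "colour"))               # TRUE TRUE

# {n,m} bounds the number of repeats of the preceding item.
b_count <- grepl("^ab{2,3}c$", c("abc", "abbc", "abbbc", "abbbbc"))
# FALSE TRUE TRUE FALSE

tag <- "<b>bold</b>"

# Greedy: .+ runs to the last > in the string, so everything is removed.
greedy <- sub("<.+>", "", tag)       # ""

# Minimal: appending ? stops at the first >.
minimal <- sub("<.+?>", "", tag)     # "bold</b>"
```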

Regular expressions may be concatenated; the resulting regular expression matches any string formed by concatenating the substrings that match the concatenated subexpressions.

Two regular expressions may be joined by the infix operator |; the resulting regular expression matches any string matching either subexpression. For example, abba|cde matches either the string abba or the string cde.  Note that alternation does not work inside character classes, where | has its literal meaning.

Repetition takes precedence over concatenation, which in turn takes precedence over alternation.  A whole subexpression may be enclosed in parentheses to override these precedence rules.
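A brief sketch of alternation and of how parentheses change what a quantifier applies to. The inputs are made up.

```r
# Alternation between two whole strings, grouped so the anchors apply to both.
alt <- grepl("^(abba|cde)$", c("abba", "cde", "abbcde", "ab"))
# TRUE TRUE FALSE FALSE

y <- c("ab", "abb", "abab")

# Repetition binds tighter than concatenation: + applies to b only.
b_repeated <- grepl("^ab+$", y)      # TRUE TRUE FALSE

# Parentheses make + apply to the whole of "ab".
ab_repeated <- grepl("^(ab)+$", y)   # TRUE FALSE TRUE

# Inside a character class | is literal.
literal_bar <- grepl("[a|b]", "|")   # TRUE
```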

The backreference \N, where N = 1 ... 9, matches the substring previously matched by the Nth parenthesized subexpression of the regular expression.  (This is an extension for extended regular expressions: POSIX defines them only for basic ones.)
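Backreferences can appear in the pattern and, for sub and gsub, in the replacement. A minimal sketch with illustrative strings:

```r
# (.)\1 matches any character immediately followed by itself.
doubled <- grepl("(.)\\1", c("hello", "world"))   # TRUE FALSE

# In the replacement, \1 and \2 refer to the captured words.
swapped <- sub("(\\w+) (\\w+)", "\\2 \\1", "hello world")   # "world hello"
```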

----------Perl-like Regular Expressions----------

The perl = TRUE argument to grep, regexpr, gregexpr, sub, gsub and strsplit switches to the PCRE library that implements regular expression pattern matching using the same syntax and semantics as Perl 5.10, with just a few differences.

For complete details please consult the man pages for PCRE (especially man pcrepattern and man pcreapi) on your system or from the sources at http://www.pcre.org. If PCRE support was compiled from the sources within R, the PCRE version is 8.12 as described here.

Perl regular expressions can be computed byte-by-byte or (UTF-8) character-by-character: the latter is used in all multibyte locales and if any of the inputs are marked as UTF-8 (see Encoding).

All the regular expressions described for extended regular expressions are accepted except \< and \>: in Perl all backslashed metacharacters are alphanumeric and backslashed symbols always are interpreted as a literal character. { is not special if it would be the start of an invalid interval specification.  There can be more than 9 backreferences (but the replacement in sub can only refer to the first 9).

Character ranges are interpreted in the numerical order of the characters, either as bytes in a single-byte locale or as Unicode points in UTF-8 mode.  So in either case [A-Za-z] specifies the set of ASCII letters.

In UTF-8 mode the named character classes only match ASCII characters: see \p below for an alternative.

The construct (?...) is used for Perl extensions in a variety of ways depending on what immediately follows the ?.

Perl-like matching can work in several modes, set by the options (?i) (caseless, equivalent to Perl's /i), (?m) (multiline, equivalent to Perl's /m), (?s) (single line, so a dot matches all characters, even new lines: equivalent to Perl's /s) and (?x) (extended, whitespace data characters are ignored unless escaped and comments are allowed: equivalent to Perl's /x).  These can be concatenated, so for example, (?im) sets caseless multiline matching.  It is also possible to unset these options by preceding the letter with a hyphen, and to combine setting and unsetting such as (?im-sx).  These settings can be applied within patterns, and then apply to the remainder of the pattern. Additional options not in Perl include (?U) to set "ungreedy" mode (so matching is minimal unless ? is used as part of the repetition quantifier, when it is greedy).  Initially none of these options are set.
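The inline options can be tried one at a time with perl = TRUE. A sketch with made-up inputs:

```r
# (?i): caseless matching.
caseless <- grepl("(?i)hello", "HELLO", perl = TRUE)   # TRUE
cased    <- grepl("hello", "HELLO", perl = TRUE)       # FALSE

two_lines <- "a\nb"

# (?s): the dot also matches a newline.
dot_plain <- grepl("a.b", two_lines, perl = TRUE)       # FALSE
dot_all   <- grepl("(?s)a.b", two_lines, perl = TRUE)   # TRUE

# (?m): ^ also matches just after a newline.
line_plain <- grepl("^b", two_lines, perl = TRUE)       # FALSE
line_multi <- grepl("(?m)^b", two_lines, perl = TRUE)   # TRUE

# (?U): quantifiers become minimal by default.
ungreedy <- sub("(?U)<.+>", "", "<b>bold</b>", perl = TRUE)   # "bold</b>"
```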

If you want to remove the special meaning from a sequence of characters, you can do so by putting them between \Q and \E. This is different from Perl in that $ and @ are handled as literals in \Q...\E sequences in PCRE, whereas in Perl, $ and @ cause variable interpolation.
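A small sketch of \Q...\E quoting, with illustrative strings:

```r
# Between \Q and \E the dot and the dollar sign are literal.
quoted <- grepl("\\Qa.b$\\E", c("a.b$", "axb"), perl = TRUE)   # TRUE FALSE

# The plus sign is quoted too, so "1+1" is matched as written.
replaced <- sub("\\Q1+1\\E", "2", "1+1=2", perl = TRUE)        # "2=2"
```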

The escape sequences \d, \s and \w represent any decimal digit, space character and "word" character (letter, digit or underscore in the current locale: in UTF-8 mode only ASCII letters and digits are considered) respectively, and their upper-case versions represent their negation.  Unlike POSIX, vertical tab is not regarded as a space character.  Sequences \h, \v, \H and \V match horizontal and vertical space or the negation.  (In UTF-8 mode, these do match non-ASCII Unicode points.)

There are additional escape sequences: \cx is cntrl-x for any x, \ddd is the octal character (for up to three digits unless interpretable as a backreference, as \1 to \7 always are), and \xhh specifies a character by two hex digits. In a UTF-8 locale, \x{h...} specifies a Unicode point by one or more hex digits.  (Note that some of these will be interpreted by R's parser in literal character strings.)

Outside a character class, \A matches at the start of a subject (even in multiline mode, unlike ^), \Z matches at the end of a subject or before a newline at the end, \z matches only at the end of a subject, and \G matches at the first matching position in a subject (which is subtly different from Perl's end of the previous match).  \C matches a single byte, including a newline, but its use is warned against.  In UTF-8 mode, \R matches any Unicode newline character (not just CR), and \X matches any number of Unicode characters that form an extended Unicode sequence.
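The difference between the line anchors and the subject anchors shows up once the subject contains a newline. A minimal sketch:

```r
two_lines <- "a\nb"

# In multiline mode ^ matches after the newline, but \A still does not.
caret_m <- grepl("(?m)^b", two_lines, perl = TRUE)      # TRUE
start_m <- grepl("(?m)\\Ab", two_lines, perl = TRUE)    # FALSE

# \Z tolerates one trailing newline; \z does not.
end_Z <- grepl("a\\Z", "a\n", perl = TRUE)   # TRUE
end_z <- grepl("a\\z", "a\n", perl = TRUE)   # FALSE
```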

In UTF-8 mode, some Unicode properties are supported via \p{xx} and \P{xx} which match characters with and without property xx respectively. For a list of supported properties see the PCRE documentation, but for example Lu is "upper case letter" and Sc is "currency symbol".
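The two properties named above can be tried on ASCII input. This sketch assumes the PCRE library used by R was built with Unicode property support; the strings are made up.

```r
# \p{Lu}: upper case letters.
no_upper <- gsub("\\p{Lu}", "", "Hello World", perl = TRUE)   # "ello orld"

# \p{Sc}: currency symbols; the dollar sign is one of them.
has_currency <- grepl("\\p{Sc}", c("$5", "5"), perl = TRUE)   # TRUE FALSE

# \P{Lu} is the negation: everything that is not an upper case letter.
only_upper <- gsub("\\P{Lu}", "", "Hello World", perl = TRUE) # "HW"
```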

The sequence (?# marks the start of a comment which continues up to the next closing parenthesis.  Nested parentheses are not permitted.  The characters that make up a comment play no part at all in the pattern matching.

If the extended option is set, an unescaped # character outside a character class introduces a comment that continues up to the next newline character in the pattern.
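Both comment styles can be sketched as follows. The pattern for a simple decimal number is an illustrative assumption.

```r
# (?# ... ) comment: ignored by the matcher.
inline <- grepl("a(?# any note here)b", "ab", perl = TRUE)   # TRUE

# Extended mode: unescaped whitespace is ignored and # starts a comment
# that runs to the next newline in the pattern.
decimal <- paste(
  "(?x) [0-9]+   # integer part",
  "     \\.      # literal dot",
  "     [0-9]+   # fractional part",
  sep = "\n"
)
is_decimal <- grepl(decimal, c("3.14", "314"), perl = TRUE)   # TRUE FALSE
```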

The pattern (?:...) groups characters just as parentheses do but does not make a backreference.
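Whether a group captures changes what \1 refers to in a replacement. A minimal sketch:

```r
# Non-capturing group: (c) is the first backreference.
non_capturing <- sub("(?:ab)+(c)", "[\\1]", "ababc", perl = TRUE)   # "[c]"

# Capturing group: \1 now refers to the last repeat of (ab).
capturing <- sub("(ab)+(c)", "[\\1]", "ababc", perl = TRUE)         # "[ab]"
```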

Patterns (?=...) and (?!...) are zero-width positive and negative lookahead assertions: they match if an attempt to match the ... forward from the current position would succeed (or not), but use up no characters in the string being processed. Patterns (?<=...) and (?<!...) are the lookbehind equivalents: they do not allow repetition quantifiers nor \C in ....
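The assertions can be sketched with a few made-up strings. regmatches (available from R 2.14.0) is used only to extract the matched text.

```r
prices <- "apple $3, pear $12"

# Lookbehind: digits preceded by a dollar sign; the sign is not consumed.
m <- gregexpr("(?<=\\$)[0-9]+", prices, perl = TRUE)
amounts <- regmatches(prices, m)[[1]]      # "3" "12"

# Positive lookahead: replace "foo" only when "bar" follows.
ahead <- gsub("foo(?=bar)", "X", "foobar foobaz", perl = TRUE)   # "Xbar foobaz"

# Negative lookahead: strings that do not contain "secret".
no_secret <- grepl("^(?!.*secret)", c("public note", "secret note"), perl = TRUE)
# TRUE FALSE
```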

As from R 2.14.0 regexpr and gregexpr support "named capture".  If groups are named, e.g., "(?<first>[A-Z][a-z]+)" then the positions of the matches are also returned by name.  (Named backreferences are not supported by sub.)
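The positions of named groups are returned as attributes of the regexpr result. The following is a minimal sketch; the input string and the group name last are illustrative assumptions.

```r
x <- "Ada Lovelace"
m <- regexpr("(?<first>[A-Z][a-z]+) (?<last>[A-Z][a-z]+)", x, perl = TRUE)

# One row per input string, one column per capture group.
starts <- attr(m, "capture.start")[1, ]
lens   <- attr(m, "capture.length")[1, ]

# Cut out each group and label it with its name.
parts <- substring(x, starts, starts + lens - 1)
names(parts) <- attr(m, "capture.names")

parts[["first"]]   # "Ada"
parts[["last"]]    # "Lovelace"
```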

Atomic grouping, possessive quantifiers and conditional and recursive patterns are not covered here.

----------Author(s)----------

This help page is based on the documentation of GNU grep 2.4.2, the
TRE documentation and the POSIX standard, and the pcrepattern
man page from PCRE 8.0.

----------See Also----------

grep, apropos, browseEnv, glob2rx, help.search, list.files, ls and strsplit.

The TRE documentation at http://laurikari.net/tre/documentation/regex-syntax/.

The POSIX 1003.2 standard at http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09

The pcrepattern man page can be found as part of http://www.pcre.org/pcre.txt, and details of Perl's own implementation at http://perldoc.perl.org/perlre.html.

When reposting, please credit the source: 生物统计家园网 (http://www.biostatistic.net).

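Notes:
Note 1: For the convenience of learners, this document was translated by LoveR, the robot of 生物统计家园网. It is intended only as a personal reference for learning R; 生物统计家园 retains the copyright.
Note 2: Because the translation was produced automatically by a robot, inaccuracies are unavoidable. Carefully comparing the Chinese and English text while using it can help with learning R.
Note 3: If you find an inaccuracy, please reply below this post; corrections will be made over time.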
