Also see the discussion page for this proposal.
A small group of proposals to make regular expressions easier to use.
RegExps should be extended to take better advantage of full Unicode ranges. The Unicode Consortium has a specific set of suggestions for how to extend RegExp usage for Unicode (http://www.unicode.org/reports/tr18/index.html); this proposal is based loosely on their “Level 1” recommendations.
The \x and \u escapes shall be extended as specified in the update_unicode proposal to allow for arbitrary 21-bit code points, e.g.,
\u{1AFFE}
\x{1AFFE}
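As a rough illustration, the brace form of \u can be tried in present-day ECMAScript engines. This sketch assumes an engine that accepts \u{...} only when the u flag is set, which is an assumption about the host and not something this proposal requires. The \x{...} form is left out because it is not assumed to be supported.

```javascript
// Sketch: \u{...} names a full 21-bit code point.
// Assumption: the host engine accepts this form only under the "u" flag.
const ch = String.fromCodePoint(0x1AFFE); // one code point, two UTF-16 code units

const byCodePoint = /^\u{1AFFE}$/u;

console.log(byCodePoint.test(ch));              // true
console.log(ch.length);                         // 2
console.log(ch.codePointAt(0).toString(16));    // "1affe"
```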
The \p and \P escapes shall be added to specify matching of arbitrary Unicode character properties. (The lowercase p escape is used to indicate a positive match, while the uppercase P is used to indicate a negative match.) The syntax is
\p{property} // matches a single character that has the given property
\P{property} // matches a single character that does not have the given property
Notes:
All implementations are required to implement the following General Category properties, as specified in http://www.unicode.org/Public/UNIDATA/UnicodeData.txt :
property  description
--------  -----------
L         Letter
Lu        Letter, Uppercase
Ll        Letter, Lowercase
Lt        Letter, Titlecase
Lm        Letter, Modifier
Lo        Letter, Other
M         Mark
Mn        Mark, Nonspacing
Mc        Mark, Spacing Combining
Me        Mark, Enclosing
N         Number
Nd        Number, Decimal Digit
Nl        Number, Letter
No        Number, Other
P         Punctuation
Pc        Punctuation, Connector
Pd        Punctuation, Dash
Ps        Punctuation, Open
Pe        Punctuation, Close
Pi        Punctuation, Initial quote (may behave like Ps or Pe depending on usage)
Pf        Punctuation, Final quote (may behave like Ps or Pe depending on usage)
Po        Punctuation, Other
S         Symbol
Sm        Symbol, Math
Sc        Symbol, Currency
Sk        Symbol, Modifier
So        Symbol, Other
Z         Separator
Zs        Separator, Space
Zl        Separator, Line
Zp        Separator, Paragraph
C         Other
Cc        Other, Control
Cf        Other, Format
Cs        Other, Surrogate
Co        Other, Private Use
Cn        Other, Not Assigned (no characters in the file have this property)
It is anticipated that the properties above can be represented efficiently in a small amount (< 16 kbytes) of static data.
Implementations are free to supply additional Unicode properties.
Attempting to create a RegExp (either as a RegExp literal, or via “new RegExp”) that specifies an unimplemented property will throw a ReferenceError.
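The following sketch shows how General Category matching might look in use. It assumes a present-day engine that supports \p and \P under the u flag, which this proposal does not require. In such engines an unknown property name is reported as a SyntaxError, not as the ReferenceError specified above. The property name NoSuchProperty is made up for the demonstration.

```javascript
// Sketch: General Category matching with \p{...} and \P{...}.
// Assumption: the host engine supports property escapes under the "u" flag.
const upper = /^\p{Lu}$/u;     // exactly one uppercase letter
const notUpper = /^\P{Lu}$/u;  // exactly one character that is not an uppercase letter

console.log(upper.test("\u00C9"));    // true  (U+00C9, category Lu)
console.log(upper.test("\u00E9"));    // false (U+00E9, category Ll)
console.log(notUpper.test("\u00E9")); // true

// An unknown property name is rejected when the RegExp is created.
let errorName = null;
try {
  new RegExp("\\p{NoSuchProperty}", "u"); // hypothetical property name
} catch (e) {
  errorName = e.name;
}
console.log(errorName); // "SyntaxError" in the engines assumed here
```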
To make the richness of Unicode character classes even more useful, an operator is added to express the intersection of character sets. Coupled with complementation, this operator can also express character set subtraction.
The syntax is borrowed from Java. The character sequence &&[ inside a character set introduces an embedded character set that is to be intersected with the surrounding set. A ] character terminates the nested set in the expected manner.
[a-z&&[d-f]] // matches d, e, or f
[a-z&&[d-f]&&[f-h]] // matches f
[a-z&&[^d-f5]0-9] // matches any lowercase ASCII letter or 0-9 EXCEPT d, e, f, or 5
[\p{L}&&[^\p{Lu}]] // matches all Letters that are not Uppercase
[\p{L}\p{N}&&[^\u{30}]] // match any Letter or Number, except for U+0030 (DIGIT ZERO)
A set with one or more embedded intersection sets matches a single character that is in the surrounding set (read with the embedded sets removed) and is also in every one of the embedded sets.
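The intersection semantics can be approximated in engines that lack the &&[ syntax by placing one lookahead per embedded set in front of the surrounding set. This is a substitute technique, not the proposed syntax. The helper intersect() is hypothetical, and it takes the bodies of the sets without their enclosing brackets.

```javascript
// Sketch: emulate [outer&&[inner1]&&[inner2]] as (?=[inner1])(?=[inner2])[outer].
// intersect() is a hypothetical helper, not part of the proposal.
function intersect(outer, ...inners) {
  const guards = inners.map((inner) => `(?=[${inner}])`).join("");
  return `${guards}[${outer}]`;
}

const def = new RegExp(`^${intersect("a-z", "d-f")}$`);          // [a-z&&[d-f]]
const onlyF = new RegExp(`^${intersect("a-z", "d-f", "f-h")}$`); // [a-z&&[d-f]&&[f-h]]
const minus = new RegExp(`^${intersect("a-z0-9", "^d-f5")}$`);   // [a-z&&[^d-f5]0-9]

console.log(def.test("e"), def.test("a"));       // true false
console.log(onlyF.test("f"), onlyF.test("e"));   // true false
console.log(minus.test("7"), minus.test("5"));   // true false
```

Each lookahead checks membership in one embedded set without consuming input, so the character is consumed once, by the surrounding set.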
Compatibility with 3rd Ed:
&& is already legal in a character set and denotes the single character &. Thus there is a slight incompatibility with 3rd Ed character sets. This compatibility problem is expected to be absent in practice.
Compatibility with Java:
In Java, [a-z[0-9]] is the same as [a-z0-9]. ECMAScript does not adopt this because it breaks backwards compatibility too much: unlike the &&[ sequence, [ by itself is likely to appear in existing regular expressions.
See the discussion page for a longer discussion of issues pertaining to the ECMAScript lexer when handling nested character sets.
For reasons of backwards compatibility, the word-boundary escapes (\b and \B) and word-character escapes (\w and \W) will NOT be extended to implement the full Unicode Alphabetic range. However, the Unicode Properties above can be used to synthesize more suitable matches; for instance, a Unicode-savvy equivalent of \w could be approximated with
[\p{Lu}\p{Ll}\p{Lt}\p{Lm}\p{Lo}\p{Nl}\u{200C}\u{200D}]
(although this ignores the characters in the “Other_Alphabetic” range).
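A small comparison may make the difference from \w concrete. The sketch assumes an engine with \p and \u{...} support under the u flag, and the sample text is arbitrary.

```javascript
// Sketch: the approximated Unicode-savvy word class next to plain \w.
// Assumption: the host engine supports \p{...} and \u{...} under the "u" flag.
const unicodeWord = /[\p{Lu}\p{Ll}\p{Lt}\p{Lm}\p{Lo}\p{Nl}\u{200C}\u{200D}]+/gu;
const asciiWord = /\w+/g;

// "naïve café, déjà vu" written with escapes for the accented letters
const text = "na\u00EFve caf\u00E9, d\u00E9j\u00E0 vu";

const unicodeWords = text.match(unicodeWord);
const asciiWords = text.match(asciiWord);

console.log(unicodeWords); // four words; accented letters stay inside them
console.log(asciiWords);   // six fragments; \w stops at every accented letter
```

Note that the class as written contains no decimal digits and no underscore, so it is not a superset of \w.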
Implementations are required to implement case-insensitive matching for the same range of code points that are handled by String.toLowerCase() and String.toUpperCase().
Implementations will recognize NEL (U+0085) as an end-of-line character, in addition to the existing Edition 3 end-of-line characters (CR+LF, CR, LF, U+2028, U+2029).
Implementations must perform all searches using only Unicode code points, not surrogates.
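The difference between matching by code point and matching by surrogate can be observed in present-day engines, where (as an assumption about the host, not part of this proposal) code point matching is selected with the u flag.

```javascript
// Sketch: one supplementary code point seen as code units vs. as a code point.
// Assumption: the host engine matches by code point only under the "u" flag.
const ch = String.fromCodePoint(0x1AFFE); // stored as a surrogate pair

const byUnit = ch.match(/./g);   // "." consumes one UTF-16 code unit at a time
const byPoint = ch.match(/./gu); // "." consumes the whole code point

console.log(byUnit.length);       // 2
console.log(byPoint.length);      // 1
console.log(/^.$/.test(ch));      // false
console.log(/^.$/u.test(ch));     // true
```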
Instances of RegExp have two new properties, extended and sticky. Both hold boolean values. The extended property is true iff the “/x” flag was present when the expression was compiled. The sticky property is true iff the “/y” flag was present when the expression was compiled. Both are ReadOnly, DontDelete, DontEnum, like the other flag properties.
A “comment pattern” can be embedded in any regular expression using the syntax (?# <text> ), where the text is arbitrary except it can’t contain ).
The comment turns into the empty pattern (?:).
Comments do not nest.
Rationale: comments make regular expressions more readable.
Compatibility: There are no known problems with this syntax; it is illegal in 3rd Edition.
Prior art: The syntax comes from Perl.
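The rewriting of a comment into the empty pattern can be sketched as a preprocessing step for engines that reject (?#. The helper stripComments() is hypothetical. It is a simplification that leaves escaped characters alone but does not treat character sets specially.

```javascript
// Sketch: replace each (?# text ) with the empty pattern (?:).
// stripComments() is a hypothetical helper, not part of the proposal.
function stripComments(source) {
  // First alternative: an escaped character such as \( is copied unchanged.
  // Second alternative: "(?#", any text without ")", then ")".
  return source.replace(/\\[\s\S]|\(\?#[^)]*\)/g, (m) =>
    m.startsWith("\\") ? m : "(?:)"
  );
}

const source = String.raw`(?# year )(\d{4})-(?# month )(\d{2})`;
const stripped = stripComments(source);
const m = new RegExp(stripped).exec("2007-03");

console.log(stripped);   // (?:)(\d{4})-(?:)(\d{2})
console.log(m[1], m[2]); // 2007 03
```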
The flag “x” shall be permitted.
The flag states that the expression is an “extended regular expression”. If present, the flag affects the parsing of the regular expression: literal whitespace and all Unicode format control characters will be ignored and line comments and line breaks inside the regular expression are allowed.
Literal whitespace, format control characters, line break, and line comments are ignored between tokens and serve as token separators.
The multicharacter tokens of the RegExp grammar are \b, \B, DecimalDigits, \AtomEscape, (?:, (?=, (?!, (?#, (?P=, (?P<, Identifier, and \ClassEscape, where DecimalDigits, AtomEscape, Identifier, and ClassEscape are grammar nonterminals.
Line comments start with the character # and extend to the end of the line or the end of the input. Any terminating line break character is not part of the comment.
Rationale: Regular expressions are notoriously hard to read; extended regular expressions are easier to read.
Lexical syntax: The lexical syntax for the RegularExpressionBody needs to change: it should no longer make use of the NonTerminator nonterminal.
Compatibility: As defined above, both implementations that lex a regular expression according to RegularExpressionLiteral before parsing it and implementations that parse regular expressions as part of program lexing can be accommodated. However, comments in the regexp must not contain unmatched [ characters; see section on “RegExp scanning” below.
Prior art: The flag comes from Perl.
RegExp objects: The state of the flag would show up as a property extended on RegExp objects.
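For engines without the x flag, the effect can be approximated by preprocessing the pattern text. The helper fromExtended() below is hypothetical and simplified: it copies character set contents and escaped characters verbatim, drops whitespace and # comments elsewhere, and does not handle Unicode format control characters.

```javascript
// Sketch: build a RegExp from "extended" pattern text.
// fromExtended() is a hypothetical helper, not part of the proposal.
function fromExtended(source, flags = "") {
  let out = "";
  let inClass = false;
  for (let i = 0; i < source.length; i++) {
    const c = source[i];
    if (c === "\\") {
      out += c + (source[i + 1] || ""); // keep escape pairs such as \# or \s
      i++;
    } else if (inClass) {
      if (c === "]") inClass = false;   // simplification: sets copied verbatim
      out += c;
    } else if (c === "[") {
      inClass = true;
      out += c;
    } else if (c === "#") {
      // line comment: skip up to, but not including, the line break
      while (i + 1 < source.length && !/[\n\r\u2028\u2029]/.test(source[i + 1])) i++;
    } else if (!/\s/.test(c)) {
      out += c;                         // whitespace and line breaks are dropped
    }
  }
  return new RegExp(out, flags);
}

const date = fromExtended(String.raw`
  (\d{4})  # year
  -
  (\d{2})  # month
`);

console.log(date.source);                   // (\d{4})-(\d{2})
console.log(date.exec("2007-03").slice(1)); // [ '2007', '03' ]
```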
The flag “y” shall be permitted.
The flag states that the regular expression performs sticky matching in the target string by attempting to match at lastIndex: if matching at that location fails then null is returned, i.e., no forward searching is performed. If matching succeeds, then lastIndex is updated as for the flag “g”.
Rationale: This flag will make it easier to write simple and efficient lexical analyzers for embedded languages using ECMAScript regular expressions. In the current language such a lexer can take quadratic time, because each match attempt may search forward to the end of the input. (That can be worked around in a couple of ways, but it’s cumbersome.)
Compatibility: The flag is illegal in 3rd Edition, and the matching behavior is a subset of existing behavior, so this flag should not cause trouble for any implementation.
RegExp objects: The state of the flag would show up as a property sticky on RegExp objects.
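The lexer use case from the rationale can be sketched as follows, assuming an engine that implements the y flag with the semantics described above. The token names and the tiny input language are hypothetical.

```javascript
// Sketch: a small lexer driven by sticky matching.
// Token names and the input language are hypothetical.
const tokenTypes = [
  ["number", /\d+/y],
  ["ident", /[A-Za-z_]\w*/y],
  ["punct", /[-+*\/=()]/y],
  ["space", /\s+/y],
];

function tokenize(input) {
  const tokens = [];
  let pos = 0;
  while (pos < input.length) {
    let matched = false;
    for (const [type, re] of tokenTypes) {
      re.lastIndex = pos;       // the match must start exactly here
      const m = re.exec(input); // null if nothing matches at pos; no forward search
      if (m !== null) {
        if (type !== "space") tokens.push({ type, text: m[0] });
        pos = re.lastIndex;     // advanced past the match, as for the g flag
        matched = true;
        break;
      }
    }
    if (!matched) throw new SyntaxError(`Unexpected character at ${pos}`);
  }
  return tokens;
}

const tokens = tokenize("x1 = 42+y");
console.log(tokens.map((t) => `${t.type}:${t.text}`).join(" "));
// ident:x1 punct:= number:42 punct:+ ident:y
```

Because a failed attempt never scans ahead, each token type costs at most one attempt per position, which avoids the repeated scanning described in the rationale.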
A regular expression object can be called as a function on a single string argument.
The value returned is exactly the same value that would be returned if the RegExp.prototype.exec method were invoked on the regular expression object with the string for an argument.
Rationale: Compatibility with existing implementations; handy shorthand.
Compatibility: RegExp instances are not callable in 3rd Edition; it is unlikely to be a hardship to make them callable in 4th Edition.
Prior art: Netscape/Mozilla support this behavior.
typeof: The value of “typeof /x/” is “object” (backwards compatible)?
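In engines where RegExp instances are not callable, the shorthand can be approximated with a wrapper function. The helper callable() is hypothetical. Unlike the proposal, the wrapper is an ordinary function, so typeof reports “function” for it.

```javascript
// Sketch: a function that behaves like calling the regular expression.
// callable() is a hypothetical helper, not part of the proposal.
function callable(re) {
  const fn = (str) => re.exec(str); // same result as RegExp.prototype.exec
  fn.regexp = re;                   // keep the underlying object reachable
  return fn;
}

const findNumber = callable(/(\d+)/);
const viaCall = findNumber("abc 123");
const viaExec = /(\d+)/.exec("abc 123");

console.log(viaCall[1], viaCall.index); // 123 4
console.log(viaExec[1], viaExec.index); // 123 4
console.log(findNumber("no digits"));   // null
```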
Regular expression submatches can be named, and back-references can reference these names.
Creating a named group:
(?P<name>...) // a group whose submatch can be referred to by the name name
The group named id in the example below can also be referenced as the numbered group 1.
Given the pattern (?P<id>[a-zA-Z_]\w*), the group can be referenced by its name in String.prototype.match result objects, such as m.id, and also by name in pattern text (for example, (?P=id)) and replacement text (such as \g<id>).
Referencing a named group:
(?P=name) // matches the text matched by the earlier group named name
Rationale: Regular expressions are hard to read. This will make them easier to read (and therefore use).
Compatibility: The syntax is illegal today and has no compatibility issues. The implementation burden is expected to be moderate.
Prior art: comes from Python
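Named groups can be tried today by translating the Python-style syntax into the forms that present-day engines are assumed to accept: (?<name>...) for a definition, \k<name> for a back-reference, m.groups.name on the result, and $<name> in replacement text. These differ from the m.id and \g<id> forms described above. The helper toNative() is hypothetical.

```javascript
// Sketch: rewrite (?P<name>...) and (?P=name) for an engine that supports
// (?<name>...) and \k<name>. toNative() is a hypothetical helper.
function toNative(source) {
  return source.replace(
    /\\[\s\S]|\(\?P<|\(\?P=([A-Za-z_$][\w$]*)\)/g,
    (m, name) => {
      if (m.startsWith("\\")) return m; // escaped character, copied unchanged
      if (m === "(?P<") return "(?<";   // group definition
      return `\\k<${name}>`;            // back-reference by name
    }
  );
}

const native = toNative(String.raw`(?P<id>[a-zA-Z_]\w*)=(?P=id)`);
const re = new RegExp(native);
const m = re.exec("x = 1; foo=foo");

console.log(native);      // (?<id>[a-zA-Z_]\w*)=\k<id>
console.log(m.groups.id); // foo
console.log(m[1]);        // foo (the named group is also numbered group 1)
console.log("foo=foo".replace(re, "$<id>")); // foo
```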
Following MSIE, we propose that an unescaped / character inside a character set should stand for itself; it should not terminate scanning of the expression.
Compatibility: The consequence of allowing this is that regular expressions that contain line comments with unmatched [ characters will throw the regular expression scanner off track; it will not recognize the terminating / properly. The scanner can’t know whether the /x flag is in effect, so it must scan the expression blindly and will fall prey to this. Therefore comments must not contain unmatched [ characters. This is not an incompatibility, but it may be surprising.