Extended regular expressions

Also see the discussion page for this proposal.

A small group of proposals to make regular expressions easier to use.

Extending RegExps for Unicode Ranges

RegExps should be extended to take better advantage of full Unicode ranges. The Unicode Consortium has a specific set of suggestions for how to extend RegExp usage for Unicode (http://www.unicode.org/reports/tr18/index.html); this proposal is based loosely on their “Level 1” recommendations.

1. Hex notation

The \x and \u escapes shall be extended as specified in the update_unicode proposal to allow for arbitrary 21-bit code points, e.g.,

\u{1AFFE}
\x{1AFFE}
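
For comparison, the brace form of \u later became standard in ECMAScript 2015, but only when the regexp carries a "u" flag, which this proposal does not define; the \x{...} form was not adopted. The sketch below assumes such an engine and is meant only to show what matching a 21-bit code point looks like in practice.

```javascript
// Assumes an ES2015+ engine; the "u" flag is not part of this proposal.
const target = String.fromCodePoint(0x1AFFE); // one code point, two UTF-16 code units
const braceEscape = /^\u{1AFFE}$/u;

const matchesWholeCodePoint = braceEscape.test(target);
const utf16Length = target.length;
```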

2. Unicode Character Properties

The \p and \P escapes shall be added to specify matching of arbitrary Unicode character properties. (The lowercase p escape is used to indicate a positive match, while the uppercase P is used to indicate a negative match.) The syntax is

\p{property}	// matches a single character that has the given property
\P{property}	// matches a single character that does not have the given property

Notes:

  • The curly braces are required, not optional (\p property is not acceptable syntax).
  • property is case-sensitive: \p{N} is not the same as \p{n}.
  • property is exact: \p{Letter} is not the same as \p{L}.

All implementations are required to implement the following General Category properties, as specified in http://www.unicode.org/Public/UNIDATA/UnicodeData.txt :

	property	description
	---------	-----------
	L		Letter
	Lu  		Letter, Uppercase
	Ll 		Letter, Lowercase
	Lt 		Letter, Titlecase
	Lm 		Letter, Modifier
	Lo 		Letter, Other
	M		Mark
	Mn 		Mark, Nonspacing
	Mc 		Mark, Spacing Combining
	Me 		Mark, Enclosing
	N		Number
	Nd 		Number, Decimal Digit
	Nl 		Number, Letter
	No 		Number, Other
	P		Punctuation
	Pc 		Punctuation, Connector
	Pd 		Punctuation, Dash
	Ps 		Punctuation, Open
	Pe 		Punctuation, Close
	Pi 		Punctuation, Initial quote (may behave like Ps or Pe depending on usage)
	Pf 		Punctuation, Final quote (may behave like Ps or Pe depending on usage)
	Po 		Punctuation, Other
	S		Symbol
	Sm 		Symbol, Math
	Sc 		Symbol, Currency
	Sk 		Symbol, Modifier
	So 		Symbol, Other
	Z		Separator
	Zs 		Separator, Space
	Zl 		Separator, Line
	Zp 		Separator, Paragraph
	C		Other
	Cc 		Other, Control
	Cf 		Other, Format
	Cs 		Other, Surrogate
	Co 		Other, Private Use
	Cn 		Other, Not Assigned (no characters in the file have this property)

It is anticipated that the properties above can be represented efficiently using a small amount (< 16 kbytes) of static data.

Implementations are free to supply additional Unicode properties.

Attempting to create a RegExp (either as a RegExp literal, or via “new RegExp”) that specifies an unimplemented property will throw a ReferenceError.
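
As an illustration, engines that implement ECMAScript 2018 accept \p and \P with General Category values. That shipped behavior differs from this proposal in two ways that the sketch assumes: the "u" flag is required, and an unknown property is reported as a SyntaxError rather than a ReferenceError.

```javascript
// Assumes an ES2018+ engine, not an implementation of this proposal.
const upper = /^\p{Lu}$/u;    // positive match: Letter, Uppercase
const notUpper = /^\P{Lu}$/u; // negative match

const results = {
  upperMatchesA: upper.test("A"),
  upperMatchesLowerA: upper.test("a"),
  notUpperMatchesLowerA: notUpper.test("a"),
  digitIsNd: /^\p{Nd}$/u.test("7"),
};

// An unimplemented property is rejected when the RegExp is created.
let unknownPropertyError = null;
try {
  new RegExp("\\p{NoSuchProperty}", "u");
} catch (e) {
  unknownPropertyError = e.name;
}
```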

3. Subtraction and Intersection

To make the richness of Unicode character classes even more useful, an operator is added to express the intersection of character sets. Coupled with complementation, this operator can also express character set subtraction.

The syntax is borrowed from Java. The character sequence &&[ inside a character set introduces an embedded character set that is to be intersected with the surrounding set. A ] character terminates the nested set in the expected manner.

[a-z&&[d-f]]		  	// matches d, e, or f
[a-z&&[d-f]&&[f-h]]             // matches f
[a-z&&[^d-f5]0-9]  		// matches any lowercase ASCII letter or 0-9 EXCEPT d, e, f, or 5 
[\p{L}&&[^\p{Lu}]]		// matches all Letters that are not Uppercase
[\p{L}\p{N}&&[^\u{30}]]		// match any Letter or Number, except for U+0030 (DIGIT ZERO)

A set with one or more embedded intersection sets matches a single character if that character matches the outer set with the embedded sets removed and also matches every one of the embedded sets.
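
The same semantics can be approximated without any new syntax by placing a lookahead in front of an ordinary character set. The sketch below is an emulation under that assumption, not the proposed &&[ operator: a positive lookahead plays the role of an intersection, and a negative lookahead plays the role of a subtraction.

```javascript
// Emulation only: no syntax from this proposal is used here.
const intersect = /^(?=[d-f])[a-z]$/;    // behaves like [a-z&&[d-f]]
const subtract = /^(?![d-f5])[a-z0-9]$/; // behaves like [a-z&&[^d-f5]0-9]

const outcomes = ["a", "e", "5", "7"].map((ch) => ({
  ch,
  intersect: intersect.test(ch),
  subtract: subtract.test(ch),
}));
```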

Compatibility with 3rd Ed:

  • Technically && is already legal in a character set and denotes the single character &. Thus there is a slight incompatibility with 3rd Ed character sets. This compatibility problem is expected to be absent in practice.

Compatibility with Java:

  • Java also provides for embedded character set unions: [a-z[0-9]] is the same as [a-z0-9]. ECMAScript does not adopt this because it breaks backwards compatibility too much: unlike the &&[ sequence, [ by itself is likely to appear in existing regular expressions.

See the discussion page for a longer discussion of issues pertaining to the ECMAScript lexer when handling nested character sets.

4. Simple Word Boundaries

For reasons of backwards compatibility, the word-boundary escapes (\b and \B) and word-character escapes (\w and \W) will NOT be extended to implement the full Unicode Alphabetic range. However, the Unicode Properties above can be used to synthesize more suitable matches; for instance, a Unicode-savvy equivalent of \w could be approximated with

[\p{Lu}\p{Ll}\p{Lt}\p{Lm}\p{Lo}\p{Nl}\u{200C}\u{200D}]

(although this ignores the characters in the “Other_Alphabetic” range).
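
A minimal sketch of the difference, assuming an ES2018+ engine where the set above can be written under the "u" flag (a flag this proposal does not define):

```javascript
// Assumes an ES2018+ engine; \w itself stays ASCII-only, as described above.
const asciiWord = /^\w+$/;
const unicodeWord = /^[\p{Lu}\p{Ll}\p{Lt}\p{Lm}\p{Lo}\p{Nl}\u{200C}\u{200D}]+$/u;

const sample = "na\u00EFve"; // "naive" with U+00EF (LATIN SMALL LETTER I WITH DIAERESIS)
const asciiResult = asciiWord.test(sample);
const unicodeResult = unicodeWord.test(sample);
```

Note that the synthesized set as written contains no digits or underscore, so it is not a superset of \w.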

5. Simple Loose Matches

Implementations are required to implement case-insensitive matching for the same range of code points that are handled by String.toLowerCase() and String.toUpperCase().

6. Line Boundaries

Implementations will recognize NEL (U+0085) as an end-of-line character, in addition to the existing Edition 3 end-of-line characters (CR+LF, CR, LF, U+2028, U+2029).

7. Code Points

Implementations must perform all searches using only Unicode code points, not surrogates.
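
The practical difference shows up with characters outside the Basic Multilingual Plane. The sketch below assumes an ES2015+ engine, where code point matching is opt-in via a "u" flag rather than unconditional as proposed here.

```javascript
// Assumes an ES2015+ engine; the "u" flag is not part of this proposal.
const emoji = "\u{1F600}"; // U+1F600, stored as a surrogate pair

const byCodeUnit = /^.$/.test(emoji);   // "." consumes one UTF-16 code unit
const byCodePoint = /^.$/u.test(emoji); // "." consumes one whole code point
```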

RegExp properties

Instances of RegExp have two new properties, extended and sticky. Both hold boolean values. The extended property is true iff the “/x” flag was present when the expression was compiled. The sticky property is true iff the “/y” flag was present when the expression was compiled. Both are ReadOnly, DontDelete, DontEnum, like the other flag properties.

Comment patterns

A “comment pattern” can be embedded in any regular expression using the syntax (?# <text> ), where the text is arbitrary except it can’t contain ).

The comment turns into the empty pattern (?:).

Comments do not nest.

Rationale: comments make regular expressions more readable.

Compatibility: There are no known problems with this syntax; it is illegal in 3rd Edition.

Prior art: The syntax comes from Perl.
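
The rewriting rule above can be sketched as a preprocessing step on the pattern source. stripCommentPatterns is a hypothetical helper, not part of any engine; it assumes comments are not recognized after a backslash or inside a character set.

```javascript
// Hypothetical helper: rewrites each (?#...) comment to the empty pattern (?:).
function stripCommentPatterns(source) {
  // Alternatives, in order: an escaped character, a character set, a comment.
  const token = /\\[\s\S]|\[(?:\\[\s\S]|[^\]\\])*\]|\(\?#[^)]*\)/g;
  return source.replace(token, (m) => (m.startsWith("(?#") ? "(?:)" : m));
}

const source = String.raw`(?# year )\d{4}-(?# month )\d{2}[(?#)]`;
const cleaned = stripCommentPatterns(source);
const re = new RegExp(cleaned);
```

The trailing [(?#)] is left alone because it is a character set containing the characters (, ?, #, and ).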

"/x" flag

The flag “x” shall be permitted.

The flag states that the expression is an “extended regular expression”. If present, the flag affects the parsing of the regular expression: literal whitespace and all Unicode format control characters will be ignored and line comments and line breaks inside the regular expression are allowed.

Literal whitespace, format control characters, line break, and line comments are ignored between tokens and serve as token separators.

The multicharacter tokens of the RegExp grammar are \b, \B, DecimalDigits, \ AtomEscape, (?:, (?=, (?!, (?#, (?P=, (?P<, Identifier, and \ ClassEscape, where DecimalDigits, AtomEscape, Identifier, and ClassEscape are grammar nonterminals.

Line comments start with the character # and extend to the end of the line or the end of the input. Any terminating line break character is not part of the comment.

Rationale: Regular expressions are notoriously hard to read; extended regular expressions are easier to read.

Lexical syntax: The lexical syntax for the RegularExpressionBody needs to change: it should no longer make use of the NonTerminator nonterminal.

Compatibility: As defined above, both implementations that lex a regular expression according to RegularExpressionLiteral before parsing it and implementations that parse regular expressions as part of program lexing can be accommodated. However, comments in the regexp must not contain unmatched [ characters; see the section on “RegExp scanning” below.

Prior art: The flag comes from Perl.

RegExp objects: The state of the flag would show up as a property extended on RegExp objects.
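
The effect of the flag can be sketched as a preprocessing step. compileExtended is a hypothetical helper, not an engine feature. It assumes that whitespace inside a character set is significant (as in Perl), and it handles only whitespace and line comments, not the other format control characters mentioned above.

```javascript
// Hypothetical helper emulating "/x": drops whitespace and # line comments
// outside character sets, then compiles what is left.
function compileExtended(source, flags = "") {
  // Alternatives, in order: escape, character set, line comment, whitespace.
  const token = /\\[\s\S]|\[(?:\\[\s\S]|[^\]\\])*\]|#[^\n\r\u2028\u2029]*|\s+/g;
  const stripped = source.replace(token, (m) =>
    m[0] === "\\" || m[0] === "[" ? m : ""
  );
  return new RegExp(stripped, flags);
}

const date = compileExtended(String.raw`
  (\d{4})   # year
  -
  (\d{2})   # month
  [ #]      # a literal space or hash, kept because it is inside a set
`);
```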

"/y" flag

The flag “y” shall be permitted.

The flag states that the regular expression performs sticky matching in the target string by attempting to match at lastIndex: if matching at that location fails then null is returned, i.e., no forward searching is performed. If matching succeeds, then lastIndex is updated as for the flag “g”.

Rationale: This flag will make it easier to write simple and efficient lexical analyzers for embedded languages using ECMAScript regular expressions. Lexers written in the current language can exhibit quadratic complexity because each match attempt may search to the end of the input for a match. (That can be worked around in a couple of ways but it’s cumbersome.)

Compatibility: The flag is illegal in 3rd Edition, and the matching behavior is a subset of existing behavior, so this flag should not cause trouble for any implementation.

RegExp objects: The state of the flag would show up as a property sticky on RegExp objects.
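
A minimal lexer sketch along the lines of the rationale, assuming an engine that implements the "y" flag and the sticky property (both later standardized in ECMAScript 2015). The token names and the tokenize function are illustrative only.

```javascript
// Assumes an ES2015+ engine with sticky matching.
const tokenTypes = [
  ["number", /\d+/y],
  ["ident", /[A-Za-z_]\w*/y],
  ["space", /\s+/y],
  ["punct", /[-+*\/=()]/y],
];

function tokenize(input) {
  const tokens = [];
  let pos = 0;
  while (pos < input.length) {
    let matched = false;
    for (const [type, re] of tokenTypes) {
      re.lastIndex = pos; // sticky: match exactly here or fail, no scanning ahead
      const m = re.exec(input);
      if (m !== null) {
        if (type !== "space") tokens.push({ type, text: m[0] });
        pos = re.lastIndex; // advanced past the match, as with "g"
        matched = true;
        break;
      }
    }
    if (!matched) throw new SyntaxError(`Unexpected character at ${pos}`);
  }
  return tokens;
}

const tokens = tokenize("x1 = 42+y");
```

Each failed attempt costs only the length of the attempted match, since no pattern ever scans forward past pos.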

Regular expressions are also functions

A regular expression object can be called as a function on a single string argument.

The value returned is exactly the same value that would be returned if the RegExp.prototype.exec method were invoked on the regular expression object with the string for an argument.
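
The equivalence can be sketched with a wrapper function. callable is a hypothetical helper; in engines where RegExp instances are not callable, it only imitates the proposed behavior.

```javascript
// Hypothetical wrapper: returns a function that forwards to RegExp.prototype.exec.
function callable(re) {
  const fn = (str) => re.exec(str);
  fn.regexp = re; // keep the underlying RegExp reachable
  return fn;
}

const findId = callable(/[a-z]+(\d+)/);
const viaCall = findId("item42");
const viaExec = /[a-z]+(\d+)/.exec("item42");
```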

Rationale: Compatibility with existing implementations; handy shorthand.

Compatibility: RegExp instances are not callable in 3rd Edition; it is unlikely to be a hardship to make them callable in 4th Edition.

Prior art: Netscape/Mozilla support this behavior.

typeof: The value of “typeof /x/” is “object” (backwards compatible)?

Named groups

Regular expression submatches can be named, and back-references can reference these names.

Creating a named group:

  • (?P<name>...)
    • Similar to regular capturing parentheses, but the substring matched by the group is accessible via the symbolic group name name.
    • Group names must be valid lexical identifiers, and each group name must be defined only once within a regular expression.
    • A symbolic group is also a numbered group, just as if the group were not named. So the group named id in the example below can also be referenced as the numbered group 1.
    • For example, if the pattern is (?P<id>[a-zA-Z_]\w*), the group can be referenced by its name in String.prototype.match result objects, such as m.id, and also by name in pattern text (for example, (?P=id)) and replacement text (such as \g<id>).

Referencing a named group:

  • (?P=name)
    • Matches whatever text was matched by the earlier group named name.
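
For comparison, named groups as later standardized in ECMAScript 2018 use a different spelling from the one proposed here: (?<name>...) to define a group, \k<name> to reference it in the pattern, $<name> in replacement text, and a groups object on the match result. The sketch below assumes such an engine and shows the same ideas, including that a named group is still a numbered group.

```javascript
// Assumes an ES2018+ engine; this is not the (?P<name>...) syntax proposed here.
const doubled = /\b(?<word>[a-zA-Z_]\w*) \k<word>\b/;
const text = "say it it again";
const m = text.match(doubled);

const byName = m.groups.word; // access by group name
const byNumber = m[1];        // the same group is also numbered group 1
const fixed = text.replace(doubled, "$<word>");
```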

Rationale: Regular expressions are hard to read. This will make them easier to read (and therefore use).

Compatibility: The syntax is illegal today and has no compatibility issues. The implementation burden is expected to be moderate.

Prior art: The syntax comes from Python.

RegExp scanning

Following MSIE, we propose that an unescaped / character inside a character set should stand for itself; it should not terminate scanning of the expression.
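
A short illustration, assuming an engine that follows ECMAScript 5 or later, where regular expression literals scan this way:

```javascript
// The "/" inside the character set does not end the literal.
const slashInSet = /[/]/;
const parts = "a/b/c".split(slashInSet);
```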

Compatibility: The consequence of allowing this is that regular expressions that contain line comments with unmatched [ characters will throw the regular expression scanner off track; it will not recognize the terminating / properly. The scanner can’t know whether the /x flag is in effect; it must blindly scan the expression and will fall prey to this. Therefore comments must not contain unmatched [ characters. This is not an incompatibility, just a surprise.

 
proposals/extend_regexps.txt · Last modified: 2008/07/14 18:43 by jodyer
 