Channel: AutoIt v3 - Developer Chat

Default regexp options


I'd like to submit two points for your consideration, especially Jon's, since both concern PCRE compile options, or canned option(s) set or prepended to every pattern submitted.

 

First, it seems to me that the current default newline convention is LF only.

#include <Array.au3>

Local $sData = "12 abc." & @CRLF & "13 def;" & @CRLF & "14 ghi."
Local $aRes

; Say we want an array of lines that start with a number and end with a dot.
; Using the default built-in convention, we miss "12 abc."
$aRes = StringRegExp($sData, "(?m)(^\d+.*\.$)", 3)
_ArrayDisplay($aRes, "Valid lines (*LF)")

; Forcing newline convention to be @CRLF works like expected
$aRes = StringRegExp($sData, "(*CRLF)(?m)(^\d+.*\.$)", 3)
_ArrayDisplay($aRes, "Valid lines (*CRLF)")

While this default works well under Unix-like OSes using @LF only, it brings issues with $ under Windows.

 

EDIT: much simplified example follows in subsequent post.

 

In multiline mode (?m), $ is a true assertion at the end of the subject or just before a newline (as defined by the current newline convention). In the first example above, the literal dot (\.) is not the character just before $: there is a @CR between the dot and the @LF, and $ is only true immediately before that @LF.

 

Using the sequence (*CRLF) at start of a pattern, we force the newline convention to be @CRLF as a whole, which is the most common situation in the Windows world. You can see the difference with the second example above.
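
To make the difference concrete, here is a minimal sketch on a single CRLF-terminated line. The subject string and the variable name $sLine are made up for illustration, and the behavior described in the comments assumes a build where LF is the default convention, as observed above.

```autoit
; Hypothetical one-line subject: a dot, then @CR, then @LF.
Local $sLine = "abc." & @CRLF

; Flag 0 (the default) makes StringRegExp return 1 on match, 0 otherwise.
; With an LF-only convention, $ is true just before @LF, i.e. after @CR,
; so "\." cannot sit immediately before $ and the match should fail.
ConsoleWrite("default: " & StringRegExp($sLine, "(?m)\.$") & @CRLF)

; With (*CRLF), the pair @CR@LF is the newline as a whole,
; so $ is true right after the dot.
ConsoleWrite("(*CRLF): " & StringRegExp($sLine, "(*CRLF)(?m)\.$") & @CRLF)
```

If the two calls print different values on your build, the default convention is not CRLF.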

 

That's why I'd recommend using the --enable-newline-is-crlf PCRE library build-time option. This is equivalent to prepending (*CRLF) to every pattern submitted. People can still override this default when they need to process text using another convention. For the record, available conventions (to be used once at the very start of a pattern) are:

  (*CR)        carriage return
  (*LF)        linefeed
  (*CRLF)      carriage return, followed by linefeed
  (*ANYCRLF)   any of the three above
  (*ANY)       all Unicode newline sequences
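
As a sketch of overriding the convention per pattern, the snippet below feeds text with mixed line endings to the same pattern as the first example, prefixed with (*ANYCRLF). The sample data and variable names are invented for illustration; this is not a claim about what the default should be.

```autoit
; Hypothetical subject mixing the three line terminators.
Local $sMixed = "1 a." & @CRLF & "2 b." & @LF & "3 c." & @CR & "4 d."

; (*ANYCRLF) lets ^ and $ recognise @CR, @LF and @CRLF alike,
; so each of the four lines can be matched whatever its terminator.
Local $aLines = StringRegExp($sMixed, "(*ANYCRLF)(?m)(^\d+.*\.$)", 3)

; UBound returns 0 if StringRegExp found nothing and did not return an array.
ConsoleWrite("Lines matched: " & UBound($aLines) & @CRLF)
```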

Note to Jon: equivalently, the PCRE_NEWLINE_CRLF option bit can be passed at pattern compile-time to pcre_compile[2]().

 

_________________________________________________________

 

The second point offers more room for debate. The question is: should we force the UCP option internally or should we leave it to the users to specify it when they actually need it?

 

You all know I've been a strong lobbyist for compiling PCRE with full Unicode support (UCP option). It allows users of non-English scripts (= written languages) to see casing apply to their fancy letters and to use category properties, and this is very important.

The issue is that the UCP option is currently forced ON internally. Not only does this slow down most pattern matching to a great extent, it also prevents users from resetting the option. The consequence is that many common features like \w or \b change their meaning, which is extended to the full Unicode plane 0 (AutoIt charset). That may not be what users want, but they have no way to revert to the non-UCP behavior.

 

If we leave UCP support in (that is, the --enable-unicode-properties library compile-time option) but do not force it at pattern compile-time (by not setting the PCRE_UCP option bit passed to pcre_compile[2]()), I feel we have the best of all worlds. By default, pattern matching will run at the best speed, and if/when people know or suspect they will have to match non-English letters, non-ASCII punctuation and the like, they can always prepend (*UCP) right at the start of their patterns, which is the pattern option to enable that feature.
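
Here is a rough sketch of what the opt-in usage would look like from a script. The sample word and variable name are made up, and the comments describe the behavior I would expect if UCP were no longer forced; on a build where UCP is forced ON, both calls should give the same result.

```autoit
; Hypothetical subject: "été", built with ChrW so that the script's
; file encoding does not matter.
Local $sWord = ChrW(233) & "t" & ChrW(233)

; Without UCP in effect, \w is expected to be limited to ASCII word
; characters, so this call would not match the accented letters.
ConsoleWrite("plain \w:  " & StringRegExp($sWord, "^\w+$") & @CRLF)

; With (*UCP) prepended, \w relies on Unicode properties,
; so the accented letters count as word characters.
ConsoleWrite("(*UCP) \w: " & StringRegExp($sWord, "(*UCP)^\w+$") & @CRLF)
```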

 

I'm sorry if this sounds complicated, but it really isn't. In any case, the outcome affects the regexp summary in the StringRegExp help file.

