Default regexp options

I'd like to submit two points for your consideration, especially Jon's, since both concern PCRE compile options, or canned options that are set internally or prepended to every pattern submitted.

First, it seems to me that the current default newline convention is LF only.

#include <Array.au3>

Local $sData =  "12 abc." & @CRLF & "13 def;" & @CRLF & "14 ghi."
Local $aRes

; Say we want an array of lines that start with a number and end with a dot.

; Using the default built-in convention, we miss "12 abc."
$aRes = StringRegExp($sData, "(?m)(^\d+.*\.$)", 3)
_ArrayDisplay($aRes, "Valid lines (*LF)")

; Forcing the newline convention to be @CRLF works as expected
$aRes = StringRegExp($sData, "(*CRLF)(?m)(^\d+.*\.$)", 3)
_ArrayDisplay($aRes, "Valid lines (*CRLF)")

While this default works well under Unix-like OSes using @LF only, it causes problems with $ under Windows, where lines usually end with @CRLF.

EDIT: much simplified example follows in subsequent post.

In multiline mode (?m), $ is an assertion that is true at the end of the subject or just before a newline (as defined by the current newline convention). In the first example above, the literal dot (\.) is not the character just before $: there is a @CR between the dot and the @LF, and with the LF-only convention the @LF is where $ is true.

Using the sequence (*CRLF) at start of a pattern, we force the newline convention to be @CRLF as a whole, which is the most common situation in the Windows world. You can see the difference with the second example above.

That's why I'd recommend using the --enable-newline-is-crlf PCRE library build-time option. This is equivalent to prepending (*CRLF) to every pattern submitted. People can still override this default setting when they need to process text using another convention. For the record, available conventions (to be used once at the very start of a pattern) are:

  (*CR)        carriage return
  (*LF)        linefeed
  (*CRLF)      carriage return, followed by linefeed
  (*ANYCRLF)   any of the three above
  (*ANY)       all Unicode newline sequences
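
As a rough illustration of overriding the convention per pattern, here is a minimal sketch that uses (*ANYCRLF) on text mixing the three line-ending styles. The sample string and variable names are made up for the example; the intent is that $ is accepted before @CR, @LF or @CRLF alike.

```autoit
#include <Array.au3>

; Hypothetical sample text: each line uses a different line ending
Local $sMixed = "12 abc." & @CRLF & "13 def;" & @LF & "14 ghi." & @CR & "15 jkl."

; (*ANYCRLF) has to be the very first thing in the pattern.
; Same pattern as before: lines that start with a number and end with a dot.
Local $aMixed = StringRegExp($sMixed, "(*ANYCRLF)(?m)(^\d+.*\.$)", 3)

If @error Then
    ; @error = 1: no match, @error = 2: bad pattern
    ConsoleWrite("StringRegExp failed, @error = " & @error & @CRLF)
Else
    _ArrayDisplay($aMixed, "Valid lines (*ANYCRLF)")
EndIf
```

This is only meant for input whose origin is unknown; when the text is known to be Windows-style, (*CRLF) states the intent more precisely.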

Note to Jon: equivalently, the PCRE_NEWLINE_CRLF option bit can be passed at pattern compile-time to pcre_compile[2]().

_________________________________________________________

The second point offers more room for debate. The question is: should we force the UCP option internally or should we leave it to the users to specify it when they actually need it?

You all know I've been a strong lobbyist for compilation of PCRE with full Unicode support (UCP option). It allows users of non-English scripts (= written languages) to see casing apply to their fancy letters and to use category properties, and this is very important.

The issue is that the UCP option is currently forced ON internally. Not only does it slow down most pattern matching to a great extent, it also prevents users from resetting the option. The consequence is that many common features like \w or \b change their meaning and extend to the full Unicode plane 0 (AutoIt charset). That may not be what users want, but they have no way to revert to the non-UCP behavior.

If we leave UCP support in (that is the --enable-unicode-properties library compile-time option) but do not force it at pattern compile-time (by not setting the PCRE_UCP option bit passed to pcre_compile[2]()), I feel we have the best of all worlds. By default, pattern matching will run at the best speed, and if/when people know or suspect they will have to match non-English letters, non-ASCII punctuation and the like, they can always prepend (*UCP) right at the start of their patterns, which is the pattern option to enable that feature.
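
To make the proposal concrete, here is a small sketch comparing \w with and without the (*UCP) prefix. The sample word and variable names are hypothetical, and the first result depends on whether the running build forces PCRE_UCP internally, so no particular output is claimed here.

```autoit
; Hypothetical sample word with non-ASCII letters: E-acute, "t", e-acute.
; ChrW() avoids any dependency on the script file's encoding.
Local $sWord = ChrW(0xC9) & "t" & ChrW(0xE9)

; No prefix: the meaning of \w depends on whether UCP is forced on internally
Local $iPlain = StringRegExp($sWord, "^\w+$")

; (*UCP) at the very start of the pattern asks for Unicode-aware \w
Local $iUcp = StringRegExp($sWord, "(*UCP)^\w+$")

; With the default flag, StringRegExp returns 1 for a match and 0 otherwise
ConsoleWrite("Without prefix: " & $iPlain & @CRLF)
ConsoleWrite("With (*UCP):    " & $iUcp & @CRLF)
```

If the two results differ, UCP is not being forced and the prefix is doing the work; if both report a match, the build is applying UCP by itself.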

I'm sorry if this sounds complicated; in fact it really isn't. Either way, the outcome affects the regexp summary in the StringRegExp help file.
