Discussion:
bash aesthetics question: special characters in reg exp in [[ ... =~ ... ]]
Kenny McCormack
2024-07-22 21:59:11 UTC
Note: this is just a question of aesthetics. Functionally, it all works as
expected.

Sample bash code:

f="$(fortune)" # Get some multi-line output into "f"
# Look for foo followed by bar on the same line
[[ "$f" =~ foo[^$'\n']*bar ]] && echo "foo bar"

The point is you need the "anything other than a newline" or else it might
match foo on one line and bar on a later line. The above is the only way I
could figure out to express a newline in the particular flavor of reg exps
used by the =~ operator.
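To make the failure mode concrete, here is a minimal sketch (the sample string stands in for fortune's multi-line output; bash's ERE "." matches newlines, since =~ doesn't use REG_NEWLINE):

```shell
# A multi-line stand-in for fortune's output: foo and bar on different lines.
f=$'a foo here\nand a bar there'

# Plain .* crosses the line boundary, since "." also matches a newline:
[[ $f =~ foo.*bar ]] && echo "matched, but across lines"

# Excluding the newline restricts the match to a single line:
[[ $f =~ foo[^$'\n']*bar ]] || echo "no same-line foo...bar"
```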

The problem is that if the above is in a function, when you list out the
function with "type funName", the \n has already been digested and
converted to a hard newline. This makes the listing look strange. I'd
rather see "\n".

Is there any way to get this?
--
Shikata ga nai...
Kaz Kylheku
2024-07-22 22:47:59 UTC
Post by Kenny McCormack
The problem is that if the above is in a function, when you list out the
function with "type funName", the \n has already been digested and
converted to a hard newline. This makes the listing look strange. I'd
rather see "\n".
I see what you mean:

$ test() { [[ "$f" =~ foo[^$'\n']*bar ]] && echo "foo bar" ; }
$ set | grep -A 4 '^test'
test ()
{
[[ "$f" =~ foo[^'
']*bar ]] && echo "foo bar"
}
Post by Kenny McCormack
Is there any way to get this?
Patch Bash so that when it's listing code, any items that need '...'
quoting and that contain control characters are printed as $'...'
syntax with escape sequences.

Someone who had their original code as '
' might not want that. It has to be an option.

If Bash stored a bit in the code indicating "this word was produced
using $'...' syntax", then it could be recovered accordingly.
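As a data point, bash already has requoting logic of roughly that kind in its printf builtin: %q re-emits words containing control characters in $'...' form. A sketch:

```shell
# printf %q requotes a string; control characters come back as $'...' escapes.
s=$'foo[^\n]*bar'       # contains a hard newline
printf '%q\n' "$s"      # emits the word in $'...' form, e.g. $'foo[^\n]*bar'
```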
--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @***@mstdn.ca
Janis Papanagnou
2024-07-23 09:48:11 UTC
Post by Kaz Kylheku
Post by Kenny McCormack
The problem is that if the above is in a function, when you list out the
function with "type funName", the \n has already been digested and
converted to a hard newline. This makes the listing look strange. I'd
rather see "\n".
$ test() { [[ "$f" =~ foo[^$'\n']*bar ]] && echo "foo bar" ; }
$ set | grep -A 4 '^test'
test ()
{
[[ "$f" =~ foo[^'
']*bar ]] && echo "foo bar"
}
Post by Kenny McCormack
Is there any way to get this?
Of course (and out of curiosity) I tried that display detail as well
in Kornshell to see how it behaves, and using a different command to
display it...


With my (old?) bash:

$ f() { [[ "$f" =~ foo[^$'\n']*bar ]] && echo "foo bar" ; }
$ typeset -f f
f ()
{
[[ "$f" =~ foo[^'
']*bar ]] && echo "foo bar"
}


The same with ksh:

$ f() { [[ "$f" =~ foo[^$'\n']*bar ]] && echo "foo bar" ; }
$ typeset -f f
f() { [[ "$f" =~ foo[^$'\n']*bar ]] && echo "foo bar" ; }


And for good measure also in zsh:

% f() { [[ "$f" =~ foo[^$'\n']*bar ]] && echo "foo bar" ; }
% typeset -f f
f () {
[[ "$f" =~ foo[^$'\n']*bar ]] && echo "foo bar"
}


Both seem to show "better aesthetics". Too bad it doesn't help for
your bash context.

Janis
Post by Kaz Kylheku
[...]
Kenny McCormack
2024-07-23 11:46:19 UTC
In article <v7nu8t$15bon$***@dont-email.me>,
Janis Papanagnou <janis_papanagnou+***@hotmail.com> wrote:
...
Post by Janis Papanagnou
Both (ksh & zsh) seem to show "better aesthetics".
Indeed, it does. That is how it should work.
Post by Janis Papanagnou
Too bad it doesn't help for your bash context.
Alas, it doesn't.
--
In American politics, there are two things you just don't f*ck with:

1) Goldman Sachs
2) The military/industrial complex
Janis Papanagnou
2024-07-23 14:44:37 UTC
Post by Kenny McCormack
...
Both (ksh & zsh) seem to show "better aesthetics".
Indeed, it does. That is how it should work.
BTW, it's interesting that bash and zsh both reformat (sort
of pretty-print) the code (when using 'typeset -f'), only
that zsh keeps that literal '\n'. This may show a way (by
zsh example) how to follow Kaz' suggestion of patching the
bash. (But, frankly, I'm not sure it was meant seriously.)

But ksh displays it as it had been typed in; a raw format.
If you define your function, say, as multi-line code you
also see it that way, there's no processing at that point
(or the original retained as copy). I didn't expect that.

Janis
Kenny McCormack
2024-07-23 16:13:40 UTC
Post by Janis Papanagnou
Post by Kenny McCormack
...
Both (ksh & zsh) seem to show "better aesthetics".
Indeed, it does. That is how it should work.
BTW, it's interesting that bash and zsh both reformat (sort
of pretty-print) the code (when using 'typeset -f'), only
that zsh keeps that literal '\n'. This may show a way (by
zsh example) how to follow Kaz' suggestion of patching the
bash. (But, frankly, I'm not sure it was meant seriously. (see ** below))
Yes. ksh seems to dump it out literally, as it was typed, but bash
(and, I guess, also zsh - I have zero knowledge or experience of zsh)
pretty-prints it. But it seems zsh does a prettier print than bash.

One thing that bash does that's annoying is that it puts semicolons at
the end of (almost) every line. I have, on occasion, had to recover a
function from the bash pretty print (*), and one of the things that
needs to be done is to remove those extraneous semicolons.

(*) BTW, the command I use is "type". I.e., "type funName" displays the
function definition of function funName. That seems to be the same as your
use of "typeset".
Post by Janis Papanagnou
But ksh displays it as it had been typed in; a raw format.
If you define your function, say, as multi-line code you
also see it that way, there's no processing at that point
(or the original retained as copy). I didn't expect that.
Yep. Note also that bash reformats something like:

cmd1 &&
cmd2 &&
cmd3

to:

cmd1 && cmd2 && cmd3

which is annoying.
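Both effects are easy to reproduce from the re-serialized parse tree (a sketch; observed behaviour of bash 5.x, exact spacing may vary by version):

```shell
# bash stores a function as a parsed tree and re-serializes it for display,
# adding semicolons and joining lines continued with &&.
f() {
  echo one
  echo two
}
g() {
  : cmd1 &&
  : cmd2 &&
  : cmd3
}
declare -f f   # "echo one" comes back as "echo one;"
declare -f g   # the continued lines come back as ": cmd1 && : cmd2 && : cmd3"
```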

(**) I've hacked the bash source code for less. So, yeah, it is possible.
--
The randomly chosen signature file that would have appeared here is more than 4
lines long. As such, it violates one or more Usenet RFCs. In order to remain
in compliance with said RFCs, the actual sig can be found at the following URL:
http://user.xmission.com/~gazelle/Sigs/ThePublicGood
Janis Papanagnou
2024-07-23 16:48:42 UTC
Post by Kenny McCormack
One thing that bash does that's annoying is that it puts semicolons at
the end of (almost) every line.
Ouch!
Post by Kenny McCormack
I have, on occasion, had to recover a function from
the bash pretty print (*), and one of the things that needs to be done is
to remove those extraneous semicolons.
(*) BTW, the command I use is "type". I.e., "type funName" displays the
function definition of function funName. That seems to be the same as your
use of "typeset".
I started tests with 'type' but the result was something undesirable
(I've already forgotten what it was), so I tried 'typeset -f', which
had better results (with ksh and zsh, at least).

Actually I was just playing around, since your post made me curious.
(I almost never inspect function definitions using one method or the
other. The interesting functions are non-trivial and already tested,
so interactively looking them up makes no sense for me. And other
functions are part of shell programs, either monolithic or used as
lib.) But as a side-effect of my tries I noticed another bug in the
ksh93u+m shell that I'm using. :-/ (But I'm digressing.)
Post by Kenny McCormack
Post by Janis Papanagnou
But ksh displays it as it had been typed in; a raw format.
If you define your function, say, as multi-line code you
also see it that way, there's no processing at that point
(or the original retained as copy). I didn't expect that.
cmd1 &&
cmd2 &&
cmd3
cmd1 && cmd2 && cmd3
which is annoying.
Indeed. It reminds me of the philosophy that I often noticed in MS (and
nowadays also in Linux software, sadly) contexts; they seem to think
their auto-changes are better than the intention of the programmer.
Post by Kenny McCormack
(**) I've hacked the bash source code for less. So, yeah, it is possible.
Ah, okay. (Would not be my preferred way. :-)

Janis
Kenny McCormack
2024-07-23 17:15:41 UTC
In article <v7omtd$19ng6$***@dont-email.me>,
Janis Papanagnou <janis_papanagnou+***@hotmail.com> wrote:
...
Post by Janis Papanagnou
Indeed. It reminds me of the philosophy that I often noticed in MS (and
nowadays also in Linux software, sadly) contexts; they seem to think
their auto-changes are better than the intention of the programmer.
The overall plan is to turn programming into a minimum wage job. That's
why they are starting to call it "coding" and make it sound like something
anybody can do.

So, they have to take as much as possible of the choice/initiative out of it.
Make it the modern equivalent of a factory job.
--
After Using Gender Slur Against AOC, GOP Rep. Yoyo Won't Apologize 'For Loving God'.

That's so sweet...
Janis Papanagnou
2024-07-24 09:53:00 UTC
Post by Kenny McCormack
...
Post by Janis Papanagnou
Indeed. It reminds me of the philosophy that I often noticed in MS (and
nowadays also in Linux software, sadly) contexts; they seem to think
their auto-changes are better than the intention of the programmer.
The overall plan is to turn programming into a minimum wage job. That's
why they are starting to call it "coding" and make it sound like something
anybody can do.
And sometimes it doesn't even appear as coding; during a small episode
in Java I could observe that they were just clicking together pieces of
code from drop-down menus in Eclipse.
Post by Kenny McCormack
So, they have to take as much as possible of the choice/initiative out of it.
Make it the modern equivalent of a factory job.
Interesting comparison. But, yes.

Janis
Kaz Kylheku
2024-07-23 18:20:14 UTC
Post by Janis Papanagnou
Post by Kaz Kylheku
Post by Kenny McCormack
The problem is that if the above is in a function, when you list out the
function with "type funName", the \n has already been digested and
converted to a hard newline. This makes the listing look strange. I'd
rather see "\n".
$ test() { [[ "$f" =~ foo[^$'\n']*bar ]] && echo "foo bar" ; }
$ set | grep -A 4 '^test'
test ()
{
[[ "$f" =~ foo[^'
']*bar ]] && echo "foo bar"
}
Post by Kenny McCormack
Is there any way to get this?
Of course (and out of curiosity) I tried that display detail as well
in Kornshell to see how it behaves, and using a different command to
display it...
$ f() { [[ "$f" =~ foo[^$'\n']*bar ]] && echo "foo bar" ; }
$ typeset -f f
f ()
{
[[ "$f" =~ foo[^'
']*bar ]] && echo "foo bar"
}
$ f() { [[ "$f" =~ foo[^$'\n']*bar ]] && echo "foo bar" ; }
$ typeset -f f
f() { [[ "$f" =~ foo[^$'\n']*bar ]] && echo "foo bar" ; }
% f() { [[ "$f" =~ foo[^$'\n']*bar ]] && echo "foo bar" ; }
% typeset -f f
f () {
[[ "$f" =~ foo[^$'\n']*bar ]] && echo "foo bar"
}
It bolsters the argument that Bash could use a fix to be this
way also.

zsh preserves the original syntax. So it is saving information
in the stored code about how the datum was represented in
the source code:

% f() { [[ "$f" =~ foo[^'
']*bar ]] && echo "foo bar" ; }
sun-go% typeset -f f
f () {
[[ "$f" =~ foo[^'
']*bar ]] && echo "foo bar"
}

I can understand why an implementor wouldn't want to save this.

If the code that we see in "typeset" is the actual code that
executes, it means that in the $'...' case, zsh has to process
the escape sequences, whereas bash has expanded them out upfront.

If the code that we see in "typeset" is not the actual code
that executes, then that requires extra storage. The Bash
project might be reluctant to imitate that strategy.

Oh look, zsh preserves comments:

sun-go% f() { # f function
function> :
function> }
sun-go% typeset -f f
f () {
# f function
:
}

I doubt that when f is called it's actually dealing with lexical
details like comments any more; more likely it's storing some compiled
version of the code along with the source.
--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @***@mstdn.ca
Arti F. Idiot
2024-07-23 04:00:00 UTC
Post by Kenny McCormack
Note: this is just a question of aesthetics. Functionally, it all works as
expected.
f="$(fortune)" # Get some multi-line output into "f"
# Look for foo followed by bar on the same line
[[ "$f" =~ foo[^$'\n']*bar ]] && echo "foo bar"
The point is you need the "anything other than a newline" or else it might
match foo on one line and bar on a later line. The above is the only way I
could figure out to express a newline in the particular flavor of reg exps
used by the =~ operator.
The problem is that if the above is in a function, when you list out the
function with "type funName", the \n has already been digested and
converted to a hard newline. This makes the listing look strange. I'd
rather see "\n".
Is there any way to get this?
Not sure this really addresses your 'type funcName' query, but maybe
this gives somewhat better output from 'type funcName'?:

...
regex=$(printf 'foo[^\n]*bar')
[[ "$f" =~ $regex ]] && echo "foo bar"

Kind of wish the regex string could be bracketed by "/" as in awk.
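A variant of the same idea (a sketch; printf -v needs bash 3.1+, and the bracket class here excludes only the newline):

```shell
# Build the regex in a variable; printf -v writes it directly (no subshell),
# and printf turns the \n escape into a hard newline inside the brackets.
printf -v regex 'foo[^\n]*bar'
f=$'a foo then bar\nnext line'
[[ $f =~ $regex ]] && echo "foo bar"
```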
Kenny McCormack
2024-07-23 05:33:31 UTC
In article <v7n9s1$2p39$***@nnrp.usenet.blueworldhosting.com>,
Arti F. Idiot <***@is.invalid> wrote:
...
Post by Arti F. Idiot
Not sure this really addresses your 'type funcName' query but maybe
...
regex=$(printf 'foo[^\n]*bar')
[[ "$f" =~ $regex ]] && echo "foo bar"
Yes. I think there are actually some other situations like this (i.e.,
issues involving the use of =~) where putting the reg exp into a variable
and then using the variable works better. It seems to be a common solution.
Post by Arti F. Idiot
Kind of wish the regex string could be bracketed by "/" as in awk.
Yes. I've often wished the same. It is annoying that you have to escape
any spaces in the regexp (since it isn't delimited, like reg exps are in
most other languages [not just AWK]).
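For instance, a small sketch of the two spellings:

```shell
s='some foo and bar text'
# Unquoted after =~, each space must be escaped or the word splits:
[[ $s =~ foo\ and\ bar ]] && echo "escaped-space form matches"
# Putting the regexp in a variable sidesteps the escaping entirely:
re='foo and bar'
[[ $s =~ $re ]] && echo "variable form matches"
```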

Incidentally, here's another situation - involving using $'' but not
involving =~. In another script, I have:

ctrla=$'\001'

Then I use $ctrla thereafter. But when the function is listed, the above
line comes out as:

ctrla=$''

Unless I pipe it into "less", in which case it displays as:

ctrla=$'^A'

(with the ^A in reverse video). The point being that, as before with \n,
there is a hard 001 character in there, not a graphic representation of it
(as there should, IMHO, be).

Agreeing with what Kaz wrote, I'm not objecting to there being hard
characters in the internal representation of the code, but rather, I am
saying that when it is displayed (e.g., by the "type" command), it should be
rendered in a visible way.
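The raw byte really is in the stored function text; newer bash can at least requote a value on demand via the ${var@Q} expansion (a sketch; @Q needs bash 4.4+, and the listing behaviour is as observed in the bash versions discussed here):

```shell
# The assignment stores a literal 0x01 byte; listing the function emits it raw.
h() { ctrla=$'\001'; : "$ctrla"; }
declare -f h | grep -q $'\001' && echo "raw control byte in the listing"

# ${var@Q} requotes a value in $'...' form - the visible rendering asked for:
ctrla=$'\001'
echo "${ctrla@Q}"
```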

And, yet, changing gears once again, I don't quite understand why you can't
write [\n] with =~. You have to write [$'\n']. It's not like that in most
other languages (e.g., AWK).
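That follows from POSIX ERE itself: inside a bracket expression a backslash is an ordinary character, so [\n] is the two-character class of backslash and 'n', not a newline. A sketch (patterns kept in variables so bash's own quoting stays out of the way):

```shell
s=$'foo\nbar'
re1='^foo[\n]bar$'       # ERE: class of '\' and 'n', not a newline
[[ $s =~ $re1 ]] || echo "[\\n] does not match the newline"
nl=$'\n'
re2="^foo[${nl}]bar\$"   # class holding an actual newline character
[[ $s =~ $re2 ]] && echo "a real newline in the class matches"
```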

Which all kind of echoes back to the other recent thread in this NG about
regular expressions vs. globs. The cold hard fact is that there really is
no such thing as "regular expressions" (*), since every language, every
program, every implementation of them, is quite different.

(*) As an abstract concept, separate from any specific implementation.
--
Trump - the President for the rest of us.


Kaz Kylheku
2024-07-23 18:34:52 UTC
Post by Kenny McCormack
Which all kind of echoes back to the other recent thread in this NG about
regular expressions vs. globs. The cold hard fact is that there really is
no such thing as "regular expressions" (*), since every language, every
program, every implementation of them, is quite different.
(*) As an abstract concept, separate from any specific implementation.
Yes, there are regular expressions as an abstract concept. They are part
of the theory of automata. Much of the research went on up through the
1960's. The * operator is called the "Kleene star".
https://en.wikipedia.org/wiki/Kleene_star

In the old math/CS papers about regular expressions, regular expressions
are typically represented in terms of some input symbol alphabet
(usually just letters a, b, c ...) and only the operators | and *,
and parentheses (other than when advanced operators are being discussed,
like intersection and complement, which are not easily constructed from
these.)

I think character classes might have been a pragmatic invention in
regex implementations. The theory doesn't require [a-c] because
that can be encoded as (a|b|c).
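The equivalence is easy to spot-check against a POSIX ERE engine such as bash's =~ (a sketch):

```shell
# [a-c] and (a|b|c) denote the same regular language.
for s in a b c d; do
  [[ $s =~ ^[a-c]$ ]];   r1=$?
  [[ $s =~ ^(a|b|c)$ ]]; r2=$?
  echo "$s: [a-c] -> $r1, (a|b|c) -> $r2"   # same status for every input
done
```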

The ? operator is not required because (R)? can be written (R)(R)*.

Escaping is not required because the operators and input symbols are
distinct; the idea that ( could be an input symbol is something that
occurs in implementations, not in the theory.

Regex implementors take the theory and adjust it to taste,
and add necessary details such as character escape sequences for
control characters, and escaping to allow the operator characters
themselves to be matched. Plus character classes, with negation
and ranges and all that.

Not all implementations follow solid theory. For instance, the branch
operator | is supposed to be commutative. There is no difference
between R1|R2 and R2|R1. But in many implementations (particularly
backtracking ones like PCRE and similar), there is a difference: these
implementations implement R1|R2|R3 by trying the expressions in left to
right order and stopping at the first match.

This matters when regexes are used for matching a prefix of the input;
if the regex is interpreted according to the theory, it should match
the longest possible prefix; it cannot ignore R3, which matches
thousands of symbols, because R2 matched three symbols.
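bash's =~ is a handy counterexample here: it goes through POSIX regexec(), which is specified leftmost-longest, so swapping the branches doesn't change the reported match (a sketch; a backtracking PCRE-style engine would report just "a" for the second pattern):

```shell
# POSIX leftmost-longest: branch order does not affect the match reported.
s=abbbbbbbc
re1='ab*|a'
re2='a|ab*'
[[ $s =~ $re1 ]] && echo "${BASH_REMATCH[0]}"   # abbbbbbb with a conforming regexec
[[ $s =~ $re2 ]] && echo "${BASH_REMATCH[0]}"   # same longest prefix
```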
--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @***@mstdn.ca
Ben Bacarisse
2024-07-23 23:51:44 UTC
Post by Kaz Kylheku
Post by Kenny McCormack
Which all kind of echoes back to the other recent thread in this NG about
regular expressions vs. globs. The cold hard fact is that there really is
no such thing as "regular expressions" (*), since every language, every
program, every implementation of them, is quite different.
(*) As an abstract concept, separate from any specific implementation.
Yes, there are regular expressions as an abstract concept. They are part
of the theory of automata. Much of the research went on up through the
1960's. The * operator is called the "Kleene star".
https://en.wikipedia.org/wiki/Kleene_star
In the old math/CS papers about regular expressions, regular expressions
are typically represented in terms of some input symbol alphabet
(usually just letters a, b, c ...) and only the operators | and *,
and parentheses (other than when advanced operators are being discussed,
like intersection and complement, which are not easily constructed from
these.)
I think character classes might have been a pragmatic invention in
regex implementations. The theory doesn't require [a-c] because
that can be encoded as (a|b|c).
The ? operator is not required because (R)? can be written (R)(R)*.
(Aside: the choice is arbitrary but + would be a more "Unixy" choice for
that operator.)
Post by Kaz Kylheku
Escaping is not required because the operators and input symbols are
distinct; the idea that ( could be an input symbol is something that
occurs in implementations, not in the theory.
Regex implementors take the theory and adjust it to taste,
and add necessary details such as character escape sequences for
control characters, and escaping to allow the operator characters
themselves to be matched. Plus character classes, with negation
and ranges and all that.
Not all implementations follow solid theory. For instance, the branch
operator | is supposed to be commutative. There is no difference
between R1|R2 and R2|R1. But in many implementations (particularly
backtracking ones like PCRE and similar), there is a difference: these
implementations implement R1|R2|R3 by trying the expressions in left to
right order and stopping at the first match.
This matters when regexes are used for matching a prefix of the input;
if the regex is interpreted according to the theory, it should match
the longest possible prefix; it cannot ignore R3, which matches
thousands of symbols, because R2 matched three symbols.
This is more a consequence of the different views. In the formal
theory there is no notion of "matching". Regular expressions define
languages (i.e. sets of sequences of symbols) according to a recursive
set of rules. The whole idea of an RE matching a string is from their
use in practical applications.
--
Ben.
Kaz Kylheku
2024-07-24 03:25:14 UTC
Post by Ben Bacarisse
Post by Kaz Kylheku
This matters when regexes are used for matching a prefix of the input;
if the regex is interpreted according to the theory, it should match
the longest possible prefix; it cannot ignore R3, which matches
thousands of symbols, because R2 matched three symbols.
This is more a consequence of the different views. In the formal
theory there is no notion of "matching". Regular expressions define
languages (i.e. sets of sequences of symbols) according to a recursive
set of rules. The whole idea of an RE matching a string is from their
use in practical applications.
Under the set view, we can ask, what is the longest prefix of
the input which belongs to the language R1|R2. The answer is the
same for R2|R1, which denote the same set, since | corresponds
to set union.

Broken regular expressions identify the longest prefix, except
when the | operator is used; then they just identify a prefix,
not necessarily longest.
--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @***@mstdn.ca
Ben Bacarisse
2024-07-24 13:17:14 UTC
Post by Kaz Kylheku
Post by Ben Bacarisse
Post by Kaz Kylheku
This matters when regexes are used for matching a prefix of the input;
if the regex is interpreted according to the theory, it should match
the longest possible prefix; it cannot ignore R3, which matches
thousands of symbols, because R2 matched three symbols.
This is more a consequence of the different views. In the formal
theory there is no notion of "matching". Regular expressions define
languages (i.e. sets of sequences of symbols) according to a recursive
set of rules. The whole idea of an RE matching a string is from their
use in practical applications.
Under the set view, we can ask, what is the longest prefix of
the input which belongs to the language R1|R2. The answer is the
same for R2|R1, which denote the same set, since | corresponds
to set union.
What is "the input" in the set view? The set view is simply a recursive
definition of the language.
Post by Kaz Kylheku
Broken regular expressions identify the longest prefix, except
when the | operator is used; then they just identify a prefix,
not necessarily longest.
What is a "broken" RE in the set view?
--
Ben.
Kaz Kylheku
2024-07-24 18:35:51 UTC
Post by Ben Bacarisse
Post by Kaz Kylheku
Post by Ben Bacarisse
Post by Kaz Kylheku
This matters when regexes are used for matching a prefix of the input;
if the regex is interpreted according to the theory, it should match
the longest possible prefix; it cannot ignore R3, which matches
thousands of symbols, because R2 matched three symbols.
This is more a consequence of the different views. In the formal
theory there is no notion of "matching". Regular expressions define
languages (i.e. sets of sequences of symbols) according to a recursive
set of rules. The whole idea of an RE matching a string is from their
use in practical applications.
Under the set view, we can ask, what is the longest prefix of
the input which belongs to the language R1|R2. The answer is the
same for R2|R1, which denote the same set, since | corresponds
to set union.
What is "the input" in the set view? The set view is simply a recursive
definition of the language.
It is a separate string under consideration.

We have a set, and are asking the question "what is the longest prefix
of the given string which is a member of the set".
Post by Ben Bacarisse
Post by Kaz Kylheku
Broken regular expressions identify the longest prefix, except
when the | operator is used; then they just identify a prefix,
not necessarily longest.
What is a "broken" RE in the set view?
Inconsistency in being able to answer the question "what is the longest
prefix of the string which is a member of the set".

Broken regexes contain a pitfall: they deliver the right answer
for expressions like ab*. If the input is "abbbbbbbc",
they identify the entire "abbbbbbb" prefix. But if the branch
operator is used, as in "a|ab*", oops, they short-circuit.
The "a" matches a prefix of the input, and so that's done; no need
to match the "ab*" part of the branch.

The "a" prefix is in the language described by the expression; a
set element has been identified. But it's not the longest one.

It is an inconsistency. If the longest match is not required, why
bother finding one for "ab*"; for that expression, the "a" prefix could
also just be returned.
--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @***@mstdn.ca
Ben Bacarisse
2024-07-24 21:28:32 UTC
Post by Kaz Kylheku
Post by Ben Bacarisse
Post by Kaz Kylheku
Post by Ben Bacarisse
Post by Kaz Kylheku
This matters when regexes are used for matching a prefix of the input;
if the regex is interpreted according to the theory, it should match
the longest possible prefix; it cannot ignore R3, which matches
thousands of symbols, because R2 matched three symbols.
This is more a consequence of the different views. In the formal
theory there is no notion of "matching". Regular expressions define
languages (i.e. sets of sequences of symbols) according to a recursive
set of rules. The whole idea of an RE matching a string is from their
use in practical applications.
Under the set view, we can ask, what is the longest prefix of
the input which belongs to the language R1|R2. The answer is the
same for R2|R1, which denote the same set, since | corresponds
to set union.
What is "the input" in the set view? The set view is simply a recursive
definition of the language.
It is a separate string under consideration.
We have a set, and are asking the question "what is the longest prefix
of the given string which is a member of the set".
It's better, then (as in the latter wording), not to use a term from the
"implementation" view of REs.
Post by Kaz Kylheku
Post by Ben Bacarisse
Post by Kaz Kylheku
Broken regular expressions identify the longest prefix, except
when the | operator is used; then they just identify a prefix,
not necessarily longest.
What is a "broken" RE in the set view?
Inconsistency in being able to answer the question "what is the longest
prefix of the string which is a member of the set".
Broken regexes contain a pitfall: they deliver the right answer
for expressions like ab*. If the input is "abbbbbbbc",
they identify the entire "abbbbbbb" prefix. But if the branch
operator is used, as in "a|ab*", oops, they short-circuit.
The "a" matches a prefix of the input, and so that's done; no need
to match the "ab*" part of the branch.
I don't see any "pitfall". The answer to your question "what is the
longest prefix of the given string which is a member of the set" is not
"a", and nothing in either the formal definition of the language
"a|ab*" or the wording of the question is a pitfall. The longest
prefix of "abbbbbbbc" that is in the language "a|ab*" is, unambiguously,
"abbbbbbb".
Post by Kaz Kylheku
The "a" prefix is in the language described by the expression; a
set element has been identified. But it's not the longest one.
Yes. But there is no "pitfall" and the RE is not "broken" in any formal
sense at all.

An implementation might be broken and there are pitfalls to look out for
when viewing REs as patterns to match, but that's my whole point. This
is all about the "other" view, not the view of REs as defining formal
languages.
Post by Kaz Kylheku
It is an inconsistency. If the longest match is not required, why
bother finding one for "ab*"; for that expression, the "a" prefix could
also just be returned.
We could, of course, ask about other prefixes of "abbbbbbbc" that are in
the language "a|ab*". I don't see anything inconsistent here at all.
--
Ben.
Janis Papanagnou
2024-07-24 09:41:48 UTC
Post by Kaz Kylheku
[...]
In the old math/CS papers about regular expressions, regular expressions
are typically represented in terms of some input symbol alphabet
(usually just letters a, b, c ...) and only the operators | and *,
and parentheses (other than when advanced operators are being discussed,
like intersection and complement, which are not easily constructed from
these.)
I think character classes might have been a pragmatic invention in
regex implementations. The theory doesn't require [a-c] because
that can be encoded as (a|b|c).
While formally we can restrict to some basic elements, it's quite
inconvenient in practice. I recall that in Compiler Construction
and Automata Theory we regularly used the 'all-but' operator for
complementing input symbol sets. And also obvious abbreviations
for larger sets of symbols (like 'digits', etc.). Not only in
practical regexp implementations, also in education certainly no
one wants to waste time.
Post by Kaz Kylheku
The ? operator is not required because (R)? can be written (R)(R)*.
ITYM: (R)+
Post by Kaz Kylheku
Escaping is not required because the operators and input symbols are
distinct; the idea that ( could be an input symbol is something that
occurs in implementations, not in the theory.
Regex implementors take the theory and adjust it to taste,
and add necessary details such as character escape sequences for
control characters, and escaping to allow the oeprator characters
themselves to be matched. Plus character classes, with negation
and ranges and all that.
There are (at least) two different types of such adjustments. One
is the convenience enhancements (like '+', the multiplicity
'{m,n}', Perl's '\d' and '\D', etc.) that, from a complexity
perspective, all stay within the same [theoretical] class of the
Regular Languages. (There are other types of extensions found in
implementations that leave that language class.)
Post by Kaz Kylheku
Not all implementations follow solid theory. For instance, the branch
operator | is supposed to be commutative. There is no difference
between R1|R2 and R2|R1. But in many implementations (particularly
backtracking ones like PCRE and similar), there is a difference: these
implementations implement R1|R2|R3 by trying the expressions in left to
right order and stopping at the first match.
In my book it's not necessary to follow "solid theory"; but only if
it's documented, correctly/sensibly implemented, and the implications
are made clear to the programmer.

There are two common examples that come to mind. I think it's okay if
there's support for, e.g., back-references if it's clearly stated
that you should not expect such code to run in O(N), linear
complexity. But there have been implementations - I don't recall if it
was in Java, Perl, or both - where "generalizations" of regexp
processing were implemented with the runtime effect that even for
*true* Regular Expressions you had no O(N) guarantee any more; and
this is IMO a fatal decision! If the programmer uses only RE elements
from the class of Regular Languages, complexity O(N) should always
be guaranteed.
Post by Kaz Kylheku
This matters when regexes are used for matching a prefix of the input;
if the regex is interpreted according to the theory, it should match
the longest possible prefix; it cannot ignore R3, which matches
thousands of symbols, because R2 matched three symbols.
I think we should differentiate the matching process (implementation-specific
syntax for RE formulas, matching implementation methods) from the
theory of Regular Languages; the complete strings of text that we
match are in general not part of the Regular Language, since the
Regular Expression only specifies the subset of (matching) strings
that belong to the respective Regular Language. And, WRT complexity,
we should also be aware that the O(N) in Regular Languages is the
complexity of the "match", not of the length of the data that is
skimmed for a match. The various algorithms combine these complexities
in supposedly efficient ways. Some RE parsers failed.

Janis