
REGEXP: match everything but specific pattern


Luke


As you know, something like

[^abc]

processes one character at a time. So in this case everything but a, b or c.
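To see the one-character-at-a-time behaviour concretely, here is a minimal sketch. It uses Python's `re` module as a stand-in for WeiDU's regexp engine (an assumption, made only so the snippet can be run anywhere); simple bracket expressions like these are written the same way in both. `first_hit` is a hypothetical helper name.

```python
import re

# A negated class matches exactly ONE character that is not in the set.
negated_class = re.compile(r"[^abc]")

# [^a-c] is range notation for the same set, so it behaves identically.
negated_range = re.compile(r"[^a-c]")

def first_hit(pattern, text):
    """Return the first substring the pattern matches, or None."""
    m = pattern.search(text)
    return m.group(0) if m else None

sample = "abcxabc"
print(first_hit(negated_class, sample))  # the single character x
print(first_hit(negated_range, sample))  # same result
print(first_hit(negated_class, "abc"))   # None: every character is in the set
```

Neither form knows anything about the order of the characters, so neither can express "anything but the sequence abc".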

What if I need to match everything but the sequence abc? Is it possible to achieve that in WeiDU?

EDIT: well, in this case this should be enough

[^a-c]

but what about the following pattern

=>èàè+@@()[][]/>>>|

?

Edited by Luke
Link to comment

It might help to ask this in PPG's WeiDU forum, and to say what exactly you are trying to do. It would probably be easier to filter by file extension, since you probably want to avoid editing, say, .are files: they are essential and easy to break.

You can also skip unwanted matches by adding PATCH_IFs, just like you usually do:

COPY_EXISTING_REGEXP GLOB ~.*\.sto~ ~override~
	READ_LONG 0x14 ~price~
	PATCH_IF !(price = 0) BEGIN
...

 

Edited by Jarno Mikkola
Link to comment

No, WeiDU doesn't have a general way to do that.

 

The closest regexp approximation of "match anything except abc" that is supported in WeiDU is:

([^a].*|a[^b].*|ab[^c].*)

For readability, I left out the backslashes that are required in front of the round brackets and pipes.  This regexp matches a string that either:

  1. Doesn't start with the letter a
  2. Does start with the letter a but continues with a character other than b, or
  3. Does start with the sequence ab but continues with a character other than c

It's a very awkward way to deal with this and it gets worse when you are avoiding a sequence longer than abc. It only disallows the forbidden sequence in one particular place in the string and it doesn't combine well with other regexp constructs, which means there are situations where it won't do the trick.
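The behaviour described above can be checked with a short sketch. This is the same approximation written in Python's `re` syntax (parentheses and pipes are unescaped there), which is an assumption standing in for WeiDU's engine; `accepted` is a hypothetical helper name.

```python
import re

# The approximation from the post above, in Python's re syntax.
# re.match anchors the pattern at the start of the string.
NOT_STARTING_WITH_ABC = re.compile(r"([^a].*|a[^b].*|ab[^c].*)")

def accepted(text):
    """True if the approximation matches from the start of text."""
    return NOT_STARTING_WITH_ABC.match(text) is not None

for candidate in ["xyz", "axc", "abx", "abc", "abcdef", "xabc"]:
    print(candidate, accepted(candidate))
```

In this translation, "abc" and "abcdef" are rejected and the rest are accepted, including "xabc": the forbidden sequence is only excluded at the very start. Very short inputs such as "ab" are rejected as well, because every branch needs at least one more character.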

 

If it's not possible to do what you want in one regular expression, consider your other options.  Depending on which part of WeiDU you're dealing with, you might have the ability to query the data (STRING_CONTAINS_REGEXP) or manipulate it (REPLACE_TEXTUALLY) beforehand.  Sometimes two or three regexps are better than one.
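A minimal sketch of the "two regexps instead of one" idea, again in Python as a stand-in (an assumption; the function name and the sample patterns are hypothetical). The first regexp only asks whether the forbidden sequence is present, which is the role the post gives to STRING_CONTAINS_REGEXP; the second describes what you actually want.

```python
import re

def matches_without(text, wanted, forbidden):
    """Two-step check: reject text containing `forbidden`, then test `wanted`."""
    # Step 1: query the data first.
    if re.search(forbidden, text):
        return False
    # Step 2: only now apply the regexp that describes what we do want.
    return re.search(wanted, text) is not None

print(matches_without("xyzabxyz", r"^[a-z]+$", r"abc"))
print(matches_without("xyzabcxyz", r"^[a-z]+$", r"abc"))
```

The first call succeeds and the second fails, without either pattern needing a lookahead.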

Link to comment
2 hours ago, lynx said:

It sounds like you're looking for "negative lookahead"...

Yep.

1 hour ago, Mike1072 said:

It only disallows the forbidden sequence in one particular place in the string...

I see. That's bad since I'm not interested in "negative lookahead" from the start, but rather "everything but a string containing specific text"...

As a result, I'll consider manipulating it (REPLACE_TEXTUALLY) beforehand...

Link to comment

@Mike1072

OK, I think this is not fully clear to me, and I'd like to have your help...

Case scenario: how would you find all instances (string positions) of a certain sequence of characters (for instance ", " without quotes – comma followed by a space character) without using lookahead constructs?

Test string

OUTER_TEXT_SPRINT "test_string" ~~~~~a, b, ~c~ STRING_EQUAL_CASE ~ , , , , jghg , , ,ty ty , ,6 6y , ," , u.    , ,   ,,,,  , , ~)~~~~~

To sum up: is it possible to write a function that given the string above and a certain sequence of characters (, ) returns

  • a
  • b
  • ~c~ STRING_EQUAL_CASE ~ , , , , jghg , , ,ty ty , ,6 6y , ," , u.    , ,   ,,,,  , , ~)

?

That is, it should return all substrings separated by ", " provided that ", " is not part of a string (that is, enclosed in ""s or ~~s).

Edited by Luke
Link to comment

SPRINT quote ~\("[^"]+"\)~ // anything encased within ""s that isn't a " itself
SPRINT tilda "\(~[^~]+~\)" // anything encased within ~~s that isn't a ~ itself

It's been years since I touched WeiDU, but theoretically this should match anything that is either quoted, tilda'ed or anything non-comma:

\(%quote%\|%tilda%\|[^,]\)+

Link to comment
9 hours ago, Ardanis said:

SPRINT quote ~\("[^"]+"\)~ // anything encased within ""s that isn't a " itself
SPRINT tilda "\(~[^~]+~\)" // anything encased within ~~s that isn't a ~ itself

It's been years since I touched WeiDU, but theoretically this should match anything that is either quoted, tilda'ed or anything non-comma:

\(%quote%\|%tilda%\|[^,]\)+

Sorry, but it's still not clear to me how I should use all of that 😕...

OK, the only issue seems to be that it returns the intended substrings with a leading space character (so for instance " b" instead of "b"), except for the first one, which is returned as it is...

Well, it's not a big deal, I can always remove it afterwards with a

REPLACE_TEXTUALLY "^ " ""

More precisely, I should remove all characters following the very first character in my separator string.

So if my separator string is something like ", , off , , hgj. " (without quotes), I should

REPLACE_TEXTUALLY "^ , off , , hgj. " ""

So the question is: can your regexp be tweaked so as to account for that...?

Edited by Luke
Link to comment
5 hours ago, Luke said:

OK, the only issue seems to be that it returns the intended substrings with a leading space character (so for instance " b" instead of "b"), except for the first one, which is returned as it is..

My bad... In that case, this will probably return MATCH1 without initial spaces:
~[ ]*\(\(%quote%\|%tilda%\|[^,]\)+\)~
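For anyone who wants to experiment with this expression outside WeiDU, here is a rough Python equivalent run against the test string from earlier in the thread. Python's `re` is an assumption standing in for WeiDU's engine, `(?:...)` is Python's non-capturing group (used so that group 1 lines up with MATCH1), and `split_items` is a hypothetical name.

```python
import re

# Python rendering of ~[ ]*\(\(%quote%\|%tilda%\|[^,]\)+\)~ .
# Each step consumes a whole "..." run, a whole ~...~ run, or one non-comma
# character, so commas inside quotes or tildes never end an item.
ITEM = re.compile(r'[ ]*((?:"[^"]+"|~[^~]+~|[^,])+)')

def split_items(text):
    """Return comma-separated items, keeping quoted or tilde-delimited commas."""
    return [m.group(1) for m in ITEM.finditer(text)]

TEST_STRING = 'a, b, ~c~ STRING_EQUAL_CASE ~ , , , , jghg , , ,ty ty , ,6 6y , ," , u.    , ,   ,,,,  , , ~)'

for item in split_items(TEST_STRING):
    print(repr(item))
```

This prints three items: a, b, and the remainder of the string starting at ~c~, with no leading spaces.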

 

Same thing in slightly more human-readable form:
SPRINT quote ~\("[^"]+"\)~ // anything encased within ""s that isn't a " itself
SPRINT tilda "\(~[^~]+~\)" // anything encased within ~~s that isn't a ~ itself
SPRINT nonseparator ~[^,]~

And this for matching expression:
~[ ]*\(\(%quote%\|%tilda%\|%nonseparator%\)+\)~ // MATCH1 is set to what's inside the outer \(\), so the leading space(s) will be omitted

 

5 hours ago, Luke said:

More precisely, I should remove all characters following the very first character in my separator string.

*Scratches head* Not sure if I understand correctly... Is it found somewhere between quote/tilda characters, i.e. after you've already run the initial match you now need to further process its results? Or is it what you actually use instead of comma separator in the above example?

If the former, you can nest multiple REPLACE_EVALUATE and use outer's %MATCH1% as input string for the inner.
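The nesting idea is roughly analogous to a replacement callback in other regexp libraries. A small Python sketch of the pattern (Python is an assumption used for illustration, not WeiDU syntax; the function name and the "; " replacement are hypothetical):

```python
import re

def rewrite_inside_tildes(text):
    """Replace ', ' with '; ' only inside ~...~ regions."""
    def inner(match):
        # match.group(1) plays the role of the outer %MATCH1%:
        # it becomes the input string of the inner replacement.
        return "~" + re.sub(r", ", "; ", match.group(1)) + "~"
    # Outer pass: find each tilde-delimited region and hand it to inner().
    return re.sub(r"~([^~]+)~", inner, text)

print(rewrite_inside_tildes("a, b, ~c, d~"))
```

Only the comma inside the tildes is rewritten; the ones outside are left alone.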

If the latter... I believe WeiDU doesn't actually support negating a specific sequence of symbols 🤔 You might try this
SPRINT separator ~, , off , , hgj. ~
SPRINT nonseparator ~[^%separator%]~

But I'm not certain it's going to fool WeiDU. If it doesn't, then can you simply reduce it to a single comma and then proceed as before?
REPLACE_TEXTUALLY ~, , off , , hgj. ~ ~,~
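A bracket expression negates a set of single characters, not a sequence, which is why collapsing the separator to one character first is the more promising route. The sketch below shows that route in Python (an assumption, used as a stand-in for WeiDU). It deviates from the suggestion above in one respect: instead of a plain comma it uses a sentinel character assumed never to occur in the input, so that commas inside items stay untouched. `split_on_sequence` and `SENTINEL` are hypothetical names.

```python
import re

SENTINEL = "\x1f"  # assumption: this control character never occurs in the input

def split_on_sequence(text, separator):
    """Split on a multi-character separator, ignoring it inside "..." or ~...~."""
    # Step 1: collapse the whole separator into a single character.
    collapsed = text.replace(separator, SENTINEL)
    # Step 2: a one-character separator can be negated with a bracket expression.
    item = re.compile('(?:"[^"]+"|~[^~]+~|[^' + re.escape(SENTINEL) + '])+')
    # Step 3: put the separator back wherever quotes or tildes protected it.
    return [m.group(0).replace(SENTINEL, separator) for m in item.finditer(collapsed)]

sep = ", , off , , hgj. "
print(split_on_sequence("a" + sep + "b" + sep + "~c" + sep + "d~", sep))
```

The result has three items; the third is the tilde-delimited one with its inner separator restored.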

Link to comment

Here's what I'd suggest: append a final comma and space to the end of your list before trying to split it.

Then, you can use this to match one of the list items as MATCH1 (including the comma and space):

\([^,]+\|"[^"]+"\|~[^~]+~, \)

You can easily perform additional formatting after you have the item.

To save a step and grab the item without the comma and space, you could add another capture group and retrieve the item as MATCH2:

\(\([^,]+\|"[^"]+"\|~[^~]+~\), \)

 

EDIT: Ignore the garbage regexps above, they suck.  Try this instead.

\(\("[^"]+"\|~[^~]+~\|[^,]+\), \)
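A quick way to try this last expression is the Python sketch below. Python's `re` is an assumption standing in for WeiDU's engine, and `split_list` is a hypothetical name. Group 2 corresponds to MATCH2 in the post.

```python
import re

# Python rendering of \(\("[^"]+"\|~[^~]+~\|[^,]+\), \):
# group 1 is the item plus ", ", group 2 is the item alone.
ITEM_WITH_SEPARATOR = re.compile(r'(("[^"]+"|~[^~]+~|[^,]+), )')

def split_list(text):
    """Append a final ', ' and collect group 2 of every match."""
    return [m.group(2) for m in ITEM_WITH_SEPARATOR.finditer(text + ", ")]

print(split_list('a, "b, c", ~d, e~'))
```

Each item here is entirely bare, entirely quoted, or entirely tilde-delimited, and the list splits into three items as intended. In this Python translation, an item that mixes bare and tilde-delimited text (for example `x ~a, b~`) is still split at the comma inside the tildes, so for input like the test string earlier in the thread the repeated-alternation form is the safer choice.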
Edited by Mike1072
Link to comment
23 hours ago, Ardanis said:

My bad... In that case, this will probably return MATCH1 without initial spaces:
~[ ]*\(\(%quote%\|%tilda%\|[^,]\)+\)~

This is working as expected now, thanks.

23 hours ago, Ardanis said:

If the latter... I believe WeiDU doesn't actually support negating a specific sequence of symbols 🤔

Yes, it's the latter...

Correct, WeiDU doesn't support negating a specific sequence of symbols. In particular, as Mike stated above, it does not support any kind of lookahead or lookbehind construct, hence my issue...

Link to comment
