REGEXP: match everything but specific pattern


Luke

In that case, the next best bet is to try including the separator string in the expression, outside of the match variable:

SPRINT quote ~\("[^"]+"\)~ // anything encased within ""s that isn't a " itself
SPRINT tilda "\(~[^~]+~\)" // anything encased within ~~s that isn't a ~ itself
SPRINT separator ~
 , off , , hgj. ~ // without initial comma

~[%separator%]?[ ]*\(\(%quote%\|%tilda%\|[^,]\)+\)~

16 hours ago, Luke said:

@Mike1072

Unless I'm missing something, your solution does not work (in particular, it does not skip ", " inside ""s or ~~s)...?

My apologies, I didn't test it out.  There's definitely something wonky in my first regexp, but I see another problem that applies to both.

The cases you mention might be caught by the first capture group ([^,]+) which would absorb everything up to the embedded comma, and then ruin the future matches.  It might be possible to resolve that just by reordering the capture groups in the alternation and placing it after the other two.

I'll update the post with a hopefully-working version.  And I'm not testing it either just yet.
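Why the order of the alternatives can matter is easier to see in a small experiment. The sketch below uses Python's `re` module, not WeiDU; the syntax differs (`(?:...)` and `|` instead of `\(...\)` and `\|`), and whether WeiDU's regexp engine picks alternatives the same way is an assumption that is not verified here. The test string is made up for illustration.

```python
import re

text = '"a, b", c'

# Unquoted-character branch first: [^,] consumes the opening quote itself,
# so the quoted branch never gets a chance and the match stops at the comma.
plain_first = re.match(r'(?:[^,]|"[^"]+")+', text).group(0)

# Quoted branch first: the whole "..." span is consumed as one unit,
# embedded comma included.
quoted_first = re.match(r'(?:"[^"]+"|[^,])+', text).group(0)

print(plain_first)
print(quoted_first)
```

With the quoted branch first, the match covers the whole quoted span; with `[^,]` first, it ends right before the embedded comma.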


@Ardanis, @Mike1072

Sorry for the necro, but

On 12/7/2021 at 6:56 PM, Ardanis said:

In that case, the next best bet is to try including the separator string in the expression, outside of the match variable:

This mostly works; the only issue is that it does not take the order of characters into account. [%separator%] is a character class, so it matches any single character of the separator: if my separator is "ab", then "ba" is accepted as well...
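The difference between a character class and a grouped literal can be sketched as follows. This is Python's `re`, used only for illustration; in WeiDU syntax the group would be written with `\(` and `\)`, and the separator would have to be free of regexp-special characters (an assumption, since nothing like `re.escape` is mentioned in this thread).

```python
import re

separator = "ab"

# Inside [...] the separator is just a set of characters:
# any single 'a' or 'b' matches, in any order.
as_class = re.compile("[" + re.escape(separator) + "]")

# As a group, the separator must appear as a whole and in order.
as_group = re.compile("(?:" + re.escape(separator) + ")")

samples = ("ab", "ba", "b")
class_hits = [bool(as_class.match(s)) for s in samples]
group_hits = [bool(as_group.match(s)) for s in samples]

print(class_hits)
print(group_hits)
```

Only the grouped form rejects "ba" and a lone "b".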

On 12/8/2021 at 10:46 AM, Mike1072 said:

The cases you mention might be caught by the first capture group ([^,]+) which would absorb everything up to the embedded comma, and then ruin the future matches.  It might be possible to resolve that just by reordering the capture groups in the alternation and placing it after the other two.

Unless I'm missing something, nothing changes when reordering the capture groups... Do you have any other ideas?

I mean, it is certainly possible to build a parser (function) that scans the input string character by character (byte by byte) and remembers whether a quotation is open...

Spoiler
DEFINE_DIMORPHIC_FUNCTION "SPLIT_EXPR"
STR_VAR
	"expr" = ""
	"pattern" = ""
RET_ARRAY
	"array"
BEGIN
	// Initialize
	ACTION_CLEAR_ARRAY "array"
	OUTER_SET "count" = 0
	OUTER_SET "expr_length" = STRING_LENGTH "%expr%"
	OUTER_TEXT_SPRINT "temp" ""
	OUTER_SET "tilda_found" = 0
	OUTER_SET "quote_found" = 0
	OUTER_PATCH "%pattern%" BEGIN
		READ_ASCII 0x0 "1st_char" ELSE "" (1)
		READ_ASCII 0x1 "remaining_chars" ELSE "" (BUFFER_LENGTH - 1)
	END
	// Main
	OUTER_PATCH "%expr%" BEGIN
		WHILE ("%expr_length%") BEGIN
			READ_ASCII 0x0 "current_char" (1)
			PATCH_MATCH "%current_char%" WITH
				"~" WHEN !("%quote_found%") BEGIN
					SET "tilda_found" += 1
					PATCH_IF ("%temp%" STRING_COMPARE_CASE "") BEGIN
						TEXT_SPRINT "temp" "%temp%%current_char%"
					END ELSE BEGIN
						TEXT_SPRINT "temp" "%current_char%"
					END
					DELETE_BYTES 0x0 0x1
					SET "expr_length" -= 1
				END
				~"~ WHEN !("%tilda_found%") BEGIN
					SET "quote_found" += 1
					PATCH_IF ("%temp%" STRING_COMPARE_CASE "") BEGIN
						TEXT_SPRINT "temp" "%temp%%current_char%"
					END ELSE BEGIN
						TEXT_SPRINT "temp" "%current_char%"
					END
					DELETE_BYTES 0x0 0x1
					SET "expr_length" -= 1
				END
				"%1st_char%" BEGIN
					READ_ASCII 0x1 "following_chars" ELSE "" (STRING_LENGTH "%pattern%" - 1)
					PATCH_IF ("%remaining_chars%" STRING_EQUAL "%following_chars%") BEGIN
						PATCH_IF ("%quote_found%" == 0 OR "%quote_found%" == 2) AND ("%tilda_found%" == 0 OR "%tilda_found%" == 2 OR "%tilda_found%" == 10) BEGIN
							DEFINE_ASSOCIATIVE_ARRAY "array" BEGIN
								"%count%" => "%temp%"
							END
							SET "count" += 1
							DELETE_BYTES 0x0 STRING_LENGTH "%pattern%"
							SET "expr_length" -= STRING_LENGTH "%pattern%"
							// Reset vars
							SET "tilda_found" = 0
							SET "quote_found" = 0
							TEXT_SPRINT "temp" ""
						END ELSE BEGIN
							PATCH_IF ("%temp%" STRING_COMPARE_CASE "") BEGIN
								TEXT_SPRINT "temp" "%temp%%current_char%"
							END ELSE BEGIN
								TEXT_SPRINT "temp" "%current_char%"
							END
							DELETE_BYTES 0x0 0x1
							SET "expr_length" -= 1
						END
					END ELSE BEGIN
						PATCH_IF ("%temp%" STRING_COMPARE_CASE "") BEGIN
							TEXT_SPRINT "temp" "%temp%%current_char%"
						END ELSE BEGIN
							TEXT_SPRINT "temp" "%current_char%"
						END
						DELETE_BYTES 0x0 0x1
						SET "expr_length" -= 1
					END
				END
				DEFAULT
					PATCH_IF ("%temp%" STRING_COMPARE_CASE "") BEGIN
						TEXT_SPRINT "temp" "%temp%%current_char%"
					END ELSE BEGIN
						TEXT_SPRINT "temp" "%current_char%"
					END
					DELETE_BYTES 0x0 0x1
					SET "expr_length" -= 1
			END
		END
	END
	// If ~%pattern%~ is not found...
	ACTION_IF ("%temp%" STRING_COMPARE_CASE "") BEGIN
		OUTER_SET "count" = "%count%" ? "%count%" + 1 : "%count%"
		ACTION_DEFINE_ASSOCIATIVE_ARRAY "array" BEGIN
			"%count%" => "%temp%"
		END
	END ELSE BEGIN
		FAIL "SPLIT_EXPR: ~temp~ is empty (~expr~=~%expr%~, ~pattern~=~%pattern%~). Wut???"
	END
END
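For readers who do not speak WeiDU, the same idea (scan character by character, remember which delimiter is open, split only outside quotations) can be sketched in Python. This is not a translation of SPLIT_EXPR: the name `split_expr` and its behaviour on edge cases are assumptions made for illustration. In particular, it keeps empty fields instead of failing.

```python
def split_expr(expr: str, separator: str) -> list[str]:
    """Split expr on separator, ignoring separators inside "..." or ~...~."""
    if not separator:
        raise ValueError("separator must not be empty")
    fields = []
    current = []          # characters of the field being built
    open_delim = None     # '"' or '~' while inside a quotation, else None
    i = 0
    while i < len(expr):
        ch = expr[i]
        if open_delim is not None:
            # Inside a quotation: copy everything, close on the same delimiter.
            current.append(ch)
            if ch == open_delim:
                open_delim = None
            i += 1
        elif ch in '"~':
            # A quotation opens here.
            open_delim = ch
            current.append(ch)
            i += 1
        elif expr.startswith(separator, i):
            # Whole separator found, in order, outside any quotation.
            fields.append("".join(current))
            current = []
            i += len(separator)
        else:
            current.append(ch)
            i += 1
    fields.append("".join(current))
    return fields


print(split_expr('a, "b, c", ~d, e~', ", "))
```

Because the separator is compared as a whole with `startswith`, a separator of "ab" does not split on "ba".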

 

However, in case of multiple separators (i.e., if multiple separators are valid), how should I use it?

Guess I should check them one by one, i.e.:

Spoiler
// Suppose separators "ab", "cfb89" and ">><<8677vdf2" are all valid

OUTER_TEXT_SPRINT "mystring" "" // your test string
OUTER_SET "found" = 0 // boolean
ACTION_FOR_EACH "separator" IN "ab" "cfb89" ">><<8677vdf2" BEGIN
	ACTION_IF !("%found%") BEGIN
		LAF "SPLIT_EXPR"
		STR_VAR
			"expr" = "%mystring%"
			"pattern" = "%separator%"
		RET_ARRAY
			"array"
		END
		LAF ~ARRAY_LENGTH~
		STR_VAR
			"array"
		RET
			"length"
		END
		ACTION_IF ("%length%" >= 2) BEGIN
			OUTER_SET "found" = 1
		END
	END
END

// where function ~ARRAY_LENGTH~ is

DEFINE_DIMORPHIC_FUNCTION ~ARRAY_LENGTH~
STR_VAR
	"array" = "" // array name
RET
	"length"
BEGIN
	// Initialize
	OUTER_SET "length" = 0
	// Main
	ACTION_PHP_EACH "%array%" AS "key" => "value" BEGIN
		OUTER_SET "length" += 1
	END
END

 

You see, it is a bit inelegant, but it should work... Having said that, I still do not understand whether it is possible to use a regexp or not 😕...
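For comparison, here is how a regexp-based approach with several multi-character separators could look in Python's `re`. It is a sketch under assumptions: the function name `split_with_regex` is hypothetical, and whether the same pattern can be expressed in WeiDU's regexps has not been checked. The idea is to tokenize rather than match fields directly: at each position, try a quoted span first, then any separator, then a single ordinary character.

```python
import re


def split_with_regex(expr: str, separators: list[str]) -> list[str]:
    if not separators or any(not s for s in separators):
        raise ValueError("need at least one non-empty separator")
    # Longest separators first, so that a longer one wins over its own prefix.
    ordered = sorted(separators, key=len, reverse=True)
    sep_alt = "|".join(re.escape(s) for s in ordered)
    token = re.compile(
        r'"[^"]*"|~[^~]*~|(?P<sep>' + sep_alt + r")|.", re.DOTALL
    )
    fields, current = [], []
    for m in token.finditer(expr):
        if m.group("sep") is not None:
            # A separator outside any quotation ends the current field.
            fields.append("".join(current))
            current = []
        else:
            # Quoted span or single character: part of the current field.
            current.append(m.group(0))
    fields.append("".join(current))
    return fields


print(split_with_regex('1ab"2ab3"cfb89~4cfb89~', ["ab", "cfb89"]))
```

Here all separators are tried in a single pass, so there is no need to loop over them one by one.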

Edited by Luke