
I'm the author of pythonizer, a Perl-to-Python converter. I'm trying to translate a Perl split statement whose pattern is a string that includes a backslash, and I need some help understanding the behavior. Here is an example based on the source code I'm trying to translate:

$s = 'a|b|c';
@a = split '\|', $s;
print scalar(@a) . "\n";
print "@a\n";

The output is:

3
a b c

Now if I just print '\|' it prints \|, so I'm not sure why the backslash is being ignored in the string pattern. The documentation says little about a string being used as a pattern, apart from the ' ' special case. Feeding '\|' to Python's str.split will not split this string.
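For reference, the Python side of that observation can be checked in a few lines. This is only an illustrative sketch using the same sample string; the variable names are made up for the example.

```python
import re

s = "a|b|c"

# str.split treats its argument as a literal separator: backslash + pipe.
literal_parts = s.split("\\|")

# re.split treats its argument as a regex, where \| is an escaped pipe.
regex_parts = re.split(r"\|", s)

print(literal_parts)  # ['a|b|c']
print(regex_parts)    # ['a', 'b', 'c']
```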

Even more strange is what happens if I change the above code to use a double-quoted string:

@a = split "\|", $s;

Then the output is:

5
a | b | c

If I change it to a regex, then it does the same thing as if it was a single-quoted string (splitting into 3 pieces), which makes perfect sense because | is a special char in a regex so it needs to be escaped:

@a = split /\|/, $s;

So my question is: how is a split on a string that contains a backslash (in single and then double quotes) supposed to work, so I can reproduce it in Python? Should I just remove all backslashes, except for \\, from a single-quoted input string if it's used in a split?

Also, why does a split on "\|" (or "|") split the string into 5 pieces? (I'm thinking of punting on this case.)

  • Why did you tag your question with Python? Commented Nov 6, 2022 at 4:43
  • I'm generating python code from perl input, for example s.split('|') splits s into 3 pieces but s.split('\|') doesn't split it at all, and certainly nothing I can think of will split it into 5 pieces! Commented Nov 6, 2022 at 4:46
  • Reference: github.com/snoopyjc/pythonizer/issues/138 Commented Nov 6, 2022 at 4:55
  • I'm thinking the pattern is always being treated as a regex even if it's a simple string, and in the "\|" case, the \ is being removed leaving the pattern as "|" which operates as an empty pattern. Commented Nov 6, 2022 at 5:23
  • Just reproduced the split into 5 case: >>> import perllib >>> perllib.split('|', 'a|b|c') ['a', '|', 'b', '|', 'c'] Commented Nov 6, 2022 at 5:28

2 Answers


There are a few interleaved questions, so let me go step by step.

  • Perl's split takes, as its first argument, a regular expression pattern that identifies the separators at which it splits the string. This is a "normal" regex, compiled and run by the regex engine, though split does apply some special cases.

  • As for delimiters in split's pattern: as in any regex, variables in the pattern are interpolated unless the delimiters are single quotes, which has no bearing on the examples here. Written as '\|' or as /\|/, the pattern is \| either way, so it matches the literal | (and is not alternation).

    But with double quotes there is a difference: in split, the string under the double quotes is first interpolated, apparently including escapes, and only then is the result handed to the regex engine to be compiled into a pattern. So "\|" becomes the pattern | for the regex, which is indeed alternation! (This is not how a regex behaves outside of split.)

  • Which brings us to split-ing with the pattern |, whether written as split "\|" or as split /|/. That works like splitting with an empty string, a specialty of split which returns all characters. A regex doesn't behave that way, neither with /|/ nor with //.

    This behavior of split appears undocumented. I can see a rationale like "split by either empty string or by empty string; well, so split by empty string", which perhaps makes some sense for split.

    In a regex that doesn't make much sense: matching "empty string -or- empty string" matches the first empty string, which merely succeeds, but an actual empty-string pattern has very distinct behavior, which I don't see with /|/. (And which is unrelated to what split does.) So a lone /|/, though a legal pattern, is only confusing, as it does nothing.
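To make the "empty string or empty string" reading concrete, here is a small sketch using Python's re module rather than Perl's engine. It assumes the lone | behaves the same way there, and only shows where such a pattern can match in the sample string.

```python
import re

s = "a|b|c"

# Start offsets and widths of every match of the pattern "|"
# (an alternation of two empty branches).
positions = [m.start() for m in re.finditer("|", s)]
widths = {m.end() - m.start() for m in re.finditer("|", s)}

print(positions)  # [0, 1, 2, 3, 4, 5]
print(widths)     # {0}
```

Every match is zero-width and there is one at each position between characters, which is consistent with a split that yields the individual characters.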

As for what to do about this in Python: the str.split mentioned by the OP in a comment doesn't use regex at all. To reproduce Perl's split one needs split from the re module, re.split(pattern, string, ...). Then go through the details and test the behavior of re with escaped regex patterns.
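A minimal sketch of that approach follows. The helper name perl_like_split is hypothetical (it is not part of re or perllib). It covers only the no-LIMIT form, ignores the ' ' special case and capture groups, and assumes Python 3.7 or later, where re.split also splits on empty matches.

```python
import re

def perl_like_split(pattern, text):
    """Rough approximation of Perl's `split PATTERN, EXPR` (no LIMIT)."""
    if text == "":
        return []  # Perl returns an empty list for an empty string
    fields = re.split(pattern, text)
    # Perl produces no leading empty field for a zero-width match at the start.
    m = re.match(pattern, text)
    if m is not None and m.end() == 0 and fields and fields[0] == "":
        fields = fields[1:]
    # Without a LIMIT, Perl drops trailing empty fields.
    while fields and fields[-1] == "":
        fields.pop()
    return fields

print(perl_like_split(r"\|", "a|b|c"))  # ['a', 'b', 'c']
print(perl_like_split("|", "a|b|c"))    # ['a', '|', 'b', '|', 'c']
```

Plain re.split("|", "a|b|c") returns an extra empty string at each end, which is why the sketch trims them to match the 5 pieces Perl reports.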


4 Comments

A regex doesn't behave that way, with /|/ nor with // Both compile to a NOTHING node in a regex, so nothing special apparently: perl -E ' use re "debug";say "abc" =~/|/'
Thanks - I now check if the pattern contains any regex chars and use re.split (actually perllib.split) if it does. If it's just plain chars, then I use str.split.
@snoopyjc Welcome -- as for "if the pattern contains any regex chars" -- it's always a regex even if the pattern is just a string literal. Might just always pass it to re.split...?
@clamp Thanks for a comment, good point to look at it with re. I didn't want to expand this answer even more. I got all of 1, empty lines, and empty string, printed for various uses of either of these two in a normal regex. I'd say that it shouldn't really compile at all.

Perl treats a single-quoted string as-is. It interpolates double-quoted strings.

Split expects a regex, so '\|' is treated as the regex \|, where \ is the regex escape character, meaning a literal | is the separator matched. Perl interpolates "\|" to just |, which is regex for OR.
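For a translator, the two quoting rules could be sketched roughly as below. The helper perl_string_value is hypothetical, and it handles only a backslash followed by punctuation; letter escapes such as \n or \t and variable interpolation are deliberately left out.

```python
import re

def perl_string_value(body, quote):
    """Hypothetical helper: value of a Perl string literal, given the text
    between the quotes. Covers backslash + punctuation only."""
    if quote == "'":
        # Single quotes: only \\ and \' are unescaped; \| stays as \|.
        return re.sub(r"\\([\\'])", r"\1", body)
    if quote == '"':
        # Double quotes: a backslash before a non-alphanumeric is dropped.
        return re.sub(r"\\([^A-Za-z0-9])", r"\1", body)
    raise ValueError("quote must be ' or \"")

single = perl_string_value(r"\|", "'")
double = perl_string_value(r"\|", '"')

print(single)  # \|
print(double)  # |
```

The resulting string is what then gets used as the regex, so the single-quoted form splits on a literal | while the double-quoted form ends up as the bare alternation.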

3 Comments

"interpolates the "\|" to just |, which is regex for OR" ... and? Why does that split the string into characters? That is the behavior of an empty pattern -- why does it work that way? (A normal regex doesn't, $s =~ /|/ doesn't do that.) That's the question...
But /|/ is equivalent to the empty pattern - the first alternation is the empty pattern, which always succeeds, so the second alternation (also the empty pattern) is never tried
Sorry, didn't expand on the second part ... the OR. A standalone | will match between the chars, meaning each char will be a piece as a result of splitting there.
