JavaScript: avoiding empty strings with String.split, and regular expression precedence

Question

I am creating a syntax highlighter, and I am using String.split to create tokens from an input string. The first issue is that String.split creates a huge amount of empty strings, which causes everything to be quite slower than it could otherwise be.

For example, "***".split(/(\*)/) -> ["", "*", "", "*", "", "*", ""]. Is there a way to avoid this?

Another issue is the expression precedence in the regular expression itself. Let's say I am trying to parse a C style multi-line comment. That is, /* comment */. Now let's assume the input string is "/****/". If I were to use the following regular expression, it would work, but produce a lot of extra tokens (and all those empty strings!).

/(\/\*|\*\/|\*)/

A better way is to read /*'s, */'s and then read all the rest of the *'s in one token. That is, the better result for the above string is ["/*", "**", "*/"]. However, when using the regular expression that should do this, I get bad results. The regular expression is like so: /(\/\*|\*\/|\*+)/.

The result of this expression is however this: ["/*", "***", "/"]. I am guessing this is because the last part is greedy so it steals the match from the other part.

The only solution I found was to make a negated lookahead expression, like this:

/(\/\*|\*\/|\*+(?!\/)/

This gives the expected result, but it is very slow compared to the other one, and this has an effect for big strings.

Is there a solution for either of these problems?

"****".split().join().split(''); I know this probably is not what you need. but seems to work. — james emanon
– james emanon, Commented Nov 12, 2013 at 1:49

anubhava · Accepted Answer · 2013-11-11 23:42:29Z

25

Use lookahed to avoid empty matches:

arr = "***".split(/(?=\*)/);
//=> ["*", "*", "*"]

OR use filter(Boolean) to discard empty matches:

arr = "***".split(/(\*)/).filter(Boolean);
//=> ["*", "*", "*"]

answered Nov 11, 2013 at 23:42

anubhava

790k67 gold badges603 silver badges671 bronze badges

Sign up to request clarification or add additional context in comments.

7 Comments

user2503048 Over a year ago

The lookahead sadly doesn't let me to get multiple things in the same token, e.g. "***" as one token in the above example. I used filter() previously, but the additional array traversal actually made the whole code slower than it is when doing one iteration while skipping the empty strings.

james emanon Over a year ago

Anu -- sorry to bother you with this, but can you explain or point me to a resource that explains the .filter(Boolean) concept. Are the empty strings returned as false etc.. I'm slightly confused by your usage.

user2503048 Over a year ago

Empty strings evaluate as false in boolean expressions.

SridharKritha Over a year ago

arr = "***".split(''); Can also be used to tokenize the individual characters of a string.

user2503048 Over a year ago

@anubhava You are responding to an answer from 7 years ago :) I don't remember what I ended up doing at the time, but at the end of the day tokenizing with regular expressions is just not the best, better to make a specific tokenizer for whatever input you want (e.g. iterate character by character and know when tokens start/end), it is significantly faster and more efficient than regexes

|

georg · Accepted Answer · 2013-11-12 00:38:21Z

4

Generally for tokenizing you use match, not split:

> str = "/****/"
"/****/"
> str.match(/(\/\*)(.*?)(\*\/)/)
["/****/", "/*", "**", "*/"]

Also note how the non-greedy modifier ? solves the second problem.

answered Nov 12, 2013 at 0:38

georg

216k57 gold badges324 silver badges401 bronze badges

2 Comments

user2503048 Over a year ago

I am tokenizing gigantic strings, the C style multiline comment was just one example, how would that regex fit with the rest? (the rest being {}()[]+-/&^=! and so on). In the mean time I just stopped grabbing all *'s with the greedy modifier, while the rest of the tokens do use it. The worst case is when there are many of *'s, but since there are usually not many of them, I guess I can live with that. For the record, this is currently my regex: /[ \t]+|\[+|\]+|\{+|\}+|\(+|\)+|\++|\-+|\*|\/+|<+|>+|&+|\|+|=+|!+|\\/

georg Over a year ago

@user2503048: well, there are quite a few syntax highlighters in javascript, maybe you can start by studying their source code? Your regexp looks sub-optimal.

Collectives™ on Stack Overflow

JavaScript: avoiding empty strings with String.split, and regular expression precedence

2 Answers 2

7 Comments

2 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

7 Comments

2 Comments

Your Answer

Sign up or log in

Post as a guest

Related