20

I am creating a syntax highlighter, and I am using String.split to create tokens from an input string. The first issue is that String.split creates a huge amount of empty strings, which causes everything to be quite slower than it could otherwise be.

For example, "***".split(/(\*)/) -> ["", "*", "", "*", "", "*", ""]. Is there a way to avoid this?

Another issue is the expression precedence in the regular expression itself. Let's say I am trying to parse a C style multi-line comment. That is, /* comment */. Now let's assume the input string is "/****/". If I were to use the following regular expression, it would work, but produce a lot of extra tokens (and all those empty strings!).

/(\/\*|\*\/|\*)/

A better way is to read /*'s, */'s and then read all the rest of the *'s in one token. That is, the better result for the above string is ["/*", "**", "*/"]. However, when using the regular expression that should do this, I get bad results. The regular expression is like so: /(\/\*|\*\/|\*+)/.

The result of this expression is however this: ["/*", "***", "/"]. I am guessing this is because the last part is greedy so it steals the match from the other part.

The only solution I found was to make a negated lookahead expression, like this:

/(\/\*|\*\/|\*+(?!\/)/

This gives the expected result, but it is very slow compared to the other one, and this has an effect for big strings.

Is there a solution for either of these problems?

1
  • "****".split().join().split(''); I know this probably is not what you need. but seems to work. Commented Nov 12, 2013 at 1:49

2 Answers 2

25

Use lookahed to avoid empty matches:

arr = "***".split(/(?=\*)/);
//=> ["*", "*", "*"]

OR use filter(Boolean) to discard empty matches:

arr = "***".split(/(\*)/).filter(Boolean);
//=> ["*", "*", "*"]
Sign up to request clarification or add additional context in comments.

7 Comments

The lookahead sadly doesn't let me to get multiple things in the same token, e.g. "***" as one token in the above example. I used filter() previously, but the additional array traversal actually made the whole code slower than it is when doing one iteration while skipping the empty strings.
Anu -- sorry to bother you with this, but can you explain or point me to a resource that explains the .filter(Boolean) concept. Are the empty strings returned as false etc.. I'm slightly confused by your usage.
Empty strings evaluate as false in boolean expressions.
arr = "***".split(''); Can also be used to tokenize the individual characters of a string.
@anubhava You are responding to an answer from 7 years ago :) I don't remember what I ended up doing at the time, but at the end of the day tokenizing with regular expressions is just not the best, better to make a specific tokenizer for whatever input you want (e.g. iterate character by character and know when tokens start/end), it is significantly faster and more efficient than regexes
|
4

Generally for tokenizing you use match, not split:

> str = "/****/"
"/****/"
> str.match(/(\/\*)(.*?)(\*\/)/)
["/****/", "/*", "**", "*/"]

Also note how the non-greedy modifier ? solves the second problem.

2 Comments

I am tokenizing gigantic strings, the C style multiline comment was just one example, how would that regex fit with the rest? (the rest being {}()[]+-/&^=! and so on). In the mean time I just stopped grabbing all *'s with the greedy modifier, while the rest of the tokens do use it. The worst case is when there are many of *'s, but since there are usually not many of them, I guess I can live with that. For the record, this is currently my regex: /[ \t]+|\[+|\]+|\{+|\}+|\(+|\)+|\++|\-+|\*|\/+|<+|>+|&+|\|+|=+|!+|\\/
@user2503048: well, there are quite a few syntax highlighters in javascript, maybe you can start by studying their source code? Your regexp looks sub-optimal.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.