
I want to replace specific character sequences (taken from user input) with HTML tags such as <b>, <u>, <i>, etc. I am using regular expressions in JavaScript, but I can't work out which one is best. I am currently using

/\[u\](.*?)\[u\]/g // replace with <u>$1</u>
/*
 * if I type [u]underline[][u], this also allows '[]' brackets inside
 */

or should I use

/\[u\]\([^\[u\]]+)\[u\]/g // this doesn't allow square brackets inside the underlined text

I am also using the same regexes in PHP. I am unsure which kind of regex would be safe from XSS attacks.

  • BBCode is not a regular language. You should not try to parse it using regular expressions. Get a BBCode parser from GitHub or write your own. Commented Jan 4, 2014 at 7:49

2 Answers


No regexes should be used. Find a decent BBCode parser (for instance, PHP's BBCode) and use it. Trying to parse HTML or any established markup language with regexes yourself is asking for pain, trouble, and insecurity.

bobince wrote an epic answer about parsing HTML with regexes, which is relevant here as well and always worth a read.


2 Comments

I edited for clarification, but it doesn't really matter. You're trying to write your own parser for an established markup language that has lots of little details and gotchas. Just learn and use something that's been tested and well-used, you'll save time and it will be far more secure and less buggy.
The reason my answer was originally oriented towards HTML, by the way, is that your title (before I edited it) implied you were parsing HTML. I went back and edited it to refer to bbcode, but missed the third reference to HTML in my answer.

You asked whether to use /\[u\](.*?)\[u\]/g or /\[u\]\([^\[u\]]+)\[u\]/g. Neither pattern includes a closing tag, which is important: BBCode is written as [u]underlined text[/u].

One solution with an extended regex flavour is a recursive pattern. I don't think JavaScript supports these yet, but they work fine in PHP, for example, which uses PCRE.

The problem: tags can be nested, which makes it difficult to match the outermost ones.


Consider what the following patterns do in this PHP example:

$str = 
'The [u][u][u]young[/u] quick[/u] brown[/u] fox jumps over the [u]lazy dog[/u]';

1.) Matching any character in [u]...[/u] using the non-greedy dot

$pattern = '~\[u\](.*?)\[/u\]~';
$str = preg_replace($pattern, '<u>\1</u>', $str);
echo htmlspecialchars($str);

outputs:

The <u>[u][u]young</u> quick[/u] brown[/u] fox jumps over the <u>lazy dog</u>

It looks for the first occurrence of [u] and consumes as few characters as possible until it reaches the first [/u], which results in mismatched tags. So this is a bad choice.
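Since the question is about JavaScript, here is a small sketch of the same non-greedy pattern there, just to show that the mismatch is not PHP-specific. The variable names are only for illustration.

```javascript
// Same input as the PHP example.
const input =
  'The [u][u][u]young[/u] quick[/u] brown[/u] fox jumps over the [u]lazy dog[/u]';

// Non-greedy dot: each match ends at the first [/u] it can reach,
// no matter how many [u] tags were opened before it.
const lazyResult = input.replace(/\[u\](.*?)\[\/u\]/g, '<u>$1</u>');

console.log(lazyResult);
// The <u>[u][u]young</u> quick[/u] brown[/u] fox jumps over the <u>lazy dog</u>
```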


2.) Using negation of square brackets [^[\]] for what is inside [u]...[/u]

$pattern = '~\[u\]([^[\]]*)\[/u\]~';
$str = preg_replace($pattern, '<u>\1</u>', $str);
echo htmlspecialchars($str);

outputs:

The [u][u]<u>young</u> quick[/u] brown[/u] fox jumps over the <u>lazy dog</u>

It looks for the first occurrence of [u] followed by any number of characters that are not [ or ], up to the closing [/u]. It is "safer" as it only matches the innermost elements, but it would still require additional effort to resolve the nesting from the inside out.
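One possible way to do that inside-out resolution, sketched in JavaScript: apply the innermost-only pattern repeatedly until the string stops changing. The function name convertUnderlineInsideOut is hypothetical, and the sketch assumes only [u] tags are present.

```javascript
// Hypothetical helper: converts the innermost [u]...[/u] pairs on each pass
// and repeats until a pass changes nothing, so nesting resolves inside out.
function convertUnderlineInsideOut(text) {
  const innermost = /\[u\]([^\[\]]*)\[\/u\]/g;
  let previous;
  do {
    previous = text;
    text = text.replace(innermost, '<u>$1</u>');
  } while (text !== previous);
  return text;
}

console.log(convertUnderlineInsideOut(
  'The [u][u][u]young[/u] quick[/u] brown[/u] fox jumps over the [u]lazy dog[/u]'
));
// The <u><u><u>young</u> quick</u> brown</u> fox jumps over the <u>lazy dog</u>
```

Unlike the recursive approach, this keeps every nesting level as its own <u> element, and any unbalanced tags are simply left in the text as they were.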


3.) Using recursion + negation of square brackets [^[\]] for what is inside [u]...[/u]

$pattern = '~\[u\]((?:[^[\]]+|(?R))*)\[/u\]~';
$str = preg_replace($pattern, '<u>\1</u>', $str);
echo htmlspecialchars($str);

outputs:

The <u>[u][u]young[/u] quick[/u] brown</u> fox jumps over the <u>lazy dog</u>

Similar to the second pattern: look for the first occurrence of [u], but then EITHER match one or more characters that are not [ or ] OR apply the whole pattern again at (?R). Repeat this zero or more times until the closing [/u] is matched.

To get rid of the remaining unresolved BBCode tags inside, we can now simply remove them:

$str = preg_replace('~\[/?u\]~',"",$str);

This gives the desired result:

The <u>young quick brown</u> fox jumps over the <u>lazy dog</u>
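Without recursive patterns in JavaScript, a similar result can be approximated with a depth counter inside a replace() callback. This is only a sketch, not an equivalent of the PCRE pattern: the name convertOutermostUnderline is hypothetical, it handles only [u] tags, and it closes a tag that was left open at the end of the input.

```javascript
// Hypothetical helper: emits <u>...</u> only for the outermost [u]...[/u]
// pairs and drops the nested tags, tracking nesting with a depth counter.
function convertOutermostUnderline(text) {
  let depth = 0;
  const result = text.replace(/\[(\/?)u\]/g, (match, slash) => {
    if (slash === '') {
      depth += 1;
      return depth === 1 ? '<u>' : ''; // only the outermost opener is kept
    }
    if (depth === 0) {
      return match; // stray closing tag: leave it untouched
    }
    depth -= 1;
    return depth === 0 ? '</u>' : ''; // only the outermost closer is kept
  });
  // If the input left a tag open, close it so the HTML stays balanced.
  return depth > 0 ? result + '</u>' : result;
}

console.log(convertOutermostUnderline(
  'The [u][u][u]young[/u] quick[/u] brown[/u] fox jumps over the [u]lazy dog[/u]'
));
// The <u>young quick brown</u> fox jumps over the <u>lazy dog</u>
```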

There are certainly other ways to achieve this, such as preg_replace_callback in PHP or, in JavaScript, the replace() method, which accepts a callback as the replacement.
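On the XSS part of the question: whichever pattern is used, the regex itself does not make the output safe. A common precaution is to escape the HTML-special characters in the user input first and only then turn a fixed whitelist of tags into HTML. A rough sketch with a replace() callback follows; escapeHtml and renderBbcode are hypothetical names, and the whitelist (b, i, u) is an assumption based on the tags mentioned in the question.

```javascript
// Hypothetical helper: escapes HTML-special characters in untrusted input.
function escapeHtml(text) {
  const entities = {
    '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;', "'": '&#39;',
  };
  return text.replace(/[&<>"']/g, (ch) => entities[ch]);
}

// Hypothetical helper: escapes first, then converts only [b], [i] and [u]
// pairs, innermost first, repeating until nothing changes.
function renderBbcode(input) {
  let text = escapeHtml(input);
  const innermost = /\[(b|i|u)\]([^\[\]]*)\[\/\1\]/g;
  let previous;
  do {
    previous = text;
    // The callback builds the HTML from the whitelisted tag name only.
    text = text.replace(innermost, (match, tag, inner) =>
      `<${tag}>${inner}</${tag}>`);
  } while (text !== previous);
  return text;
}

console.log(renderBbcode('[u][b]hi[/b][/u] <script>alert(1)</script>'));
// <u><b>hi</b></u> &lt;script&gt;alert(1)&lt;/script&gt;
```

Because the tag name comes from a fixed alternation and the content is already escaped, user input never ends up as raw markup. Tags that carry attributes (for example URLs) would need their own validation, which this sketch does not attempt.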
