
I am trying to write a regex that matches HTML tags, and I have run into a couple of minor issues with heading elements and with the title element in the head.

To explain it better:

With <h5>Thing</h5>, the whole string is selected, but I only want <h5> and </h5> selected. The same happens with <title>Test</title>: I only want the HTML tags selected, but the regex selects the whole thing.

Here is my regex so far:

/(<\/(\w+)>)|(<(\w+)).+?(?=>)>|(<(\w+))>/ig

  • Don't parse HTML using RegEx; you will fail. Use something like the HTML Agility Pack. Commented Feb 23, 2016 at 11:54
  • Rule 1: don't use RegEx to parse HTML. Rule 2: if you still want to parse HTML with RegEx, see rule 1. RegEx can only match regular languages, and HTML is not a regular language Commented Feb 23, 2016 at 11:56
  • I understand these replies, but this is just a personal project and I would like to finish this regex. Since the only problem I am facing is the one explained above, I don't understand why no one can solve it. Commented Feb 23, 2016 at 12:23

1 Answer

Your problem is here: <(\w+).+?(?=>)>

This says:

  1. open an angle bracket
  2. consume as many word characters as possible (min 1)
  3. consume as few characters as possible (min 1)
  4. make sure a closing angle bracket follows
  5. consume the closing angle bracket

First of all, step 4 is superfluous; you know you will have a closing bracket next, otherwise step 5 will fail to match.

But the bigger problem is step 3. Let's see what happens on <h5>Thing</h5>:

  1. <
  2. h5 (because > is not a word character any more)
  3. >Thing</h5, because this is the shortest run of characters that is followed by a closing angle bracket (remember, matching 0 characters here is not an option)
  4. Make sure next is >
  5. >
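
The walkthrough above can be checked with a short script. This is only a sketch that runs the pattern from the question against the same sample string; the variable names are made up for illustration.

```javascript
// The pattern from the question, unchanged.
const original = /(<\/(\w+)>)|(<(\w+)).+?(?=>)>|(<(\w+))>/ig;

const input = "<h5>Thing</h5>";

// With the g flag, match() returns every full match as an array.
const matches = input.match(original);

// The second alternative wins at position 0 and swallows the whole element.
console.log(matches); // [ '<h5>Thing</h5>' ]
```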

Anyway, in the simple case, what you want can be done with /<\/?.+?>/. This breaks if an attribute value includes a greater-than sign, as in <div title="a>b">. Avoiding this is possible, but it makes the regexp more complex, something like this (though I may have forgotten something):

<\w+(?:\s+\w+(?:=(?:"[^"]*"|'[^']*'|[^'"][^\s>]*)?)?)*\s*>|<\/\w+>
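
To compare the two patterns, here is a minimal sketch that runs both against the samples mentioned above. The variable names and the g flag are my additions, and the test strings are just illustrations, not an exhaustive check.

```javascript
// Simple version: breaks on ">" inside attribute values.
const simple = /<\/?.+?>/g;

// Attribute-aware version from above, with the g flag added.
const attrAware = /<\w+(?:\s+\w+(?:=(?:"[^"]*"|'[^']*'|[^'"][^\s>]*)?)?)*\s*>|<\/\w+>/g;

const heading = "<h5>Thing</h5>";
const tricky = '<div title="a>b">text</div>';

const simpleHeading = heading.match(simple);
const simpleTricky = tricky.match(simple);
const awareTricky = tricky.match(attrAware);

console.log(simpleHeading); // [ '<h5>', '</h5>' ]
console.log(simpleTricky);  // [ '<div title="a>', '</div>' ]  (cut short at the quoted ">")
console.log(awareTricky);   // [ '<div title="a>b">', '</div>' ]
```

Note that the longer pattern still only accepts \w+ for names, so attributes such as data-id and self-closing tags such as <br/> are not matched by it as written.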