PHP preg_replace: find a match in HTML but not if it's an HTML attribute

Question

I have two regexes: one that matches [value] and another that matches HTML attributes. I need to combine them into a single regex.

This is the regex I'm working with to find [value]

    $tagregexp = '[a-zA-Z_\-][0-9a-zA-Z_\-\+]{2,}';

    $pattern = 
          '\\['                              // Opening bracket
        . '(\\[?)'                           // 1: Optional second opening bracket for escaping shortcodes: [[tag]]
        . "($tagregexp)"                     // 2: Shortcode name
        . '(?![\\w-])'                       // Not followed by word character or hyphen
        . '('                                // 3: Unroll the loop: Inside the opening shortcode tag
        .     '[^\\]\\/]*'                   // Not a closing bracket or forward slash
        .     '(?:'
        .         '\\/(?!\\])'               // A forward slash not followed by a closing bracket
        .         '[^\\]\\/]*'               // Not a closing bracket or forward slash
        .     ')*?'
        . ')'
        . '(?:'
        .     '(\\/)'                        // 4: Self closing tag ...
        .     '\\]'                          // ... and closing bracket
        . '|'
        .     '\\]'                          // Closing bracket
        .     '(?:'
        .         '('                        // 5: Unroll the loop: Optionally, anything between the opening and closing shortcode tags
        .             '[^\\[]*+'             // Not an opening bracket
        .             '(?:'
        .                 '\\[(?!\\/\\2\\])' // An opening bracket not followed by the closing shortcode tag
        .                 '[^\\[]*+'         // Not an opening bracket
        .             ')*+'
        .         ')'
        .         '\\[\\/\\2\\]'             // Closing shortcode tag
        .     ')?'
        . ')'
        . '(\\]?)';                          // 6: Optional second closing bracket for escaping shortcodes: [[tag]]


This regex (\S+)=["']?((?:.(?!["']?\s+(?:\S+)=|[>"']))+.)["']? matches an attribute and its value.

I would like the regex to match [value] in the following examples

<div [value] ></div>
<div>[value]</div>

but not find a match in this example

<input attr="attribute[value]"/>

I just need to combine them into a single regex to use in my preg_replace_callback call:

preg_replace_callback("/$pattern/", 'replace_matches', $html);

These are PHP strings, not Java strings, so you don't need to double-escape everything. Instead of concatenating, use the x modifier (and, if you can, a nowdoc string). If you want to deal with HTML (or XML), forget regex and use DOMDocument (and possibly DOMXPath).
– Casimir et Hippolyte, Commented May 17, 2016 at 23:09
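
As a rough sketch of the x-modifier suggestion, here is a deliberately simplified pattern (not the full expression from the question) written as a nowdoc with inline comments. The `~` delimiter and the sample input are assumptions chosen for illustration.

```php
<?php
// Simplified illustration: matches [name] or [name attributes] only.
// In x mode, whitespace outside character classes is ignored and
// "#" starts a comment that runs to the end of the line.
$pattern = <<<'REGEX'
~
\[                              # opening bracket
([a-zA-Z_-][0-9a-zA-Z_+-]{2,})  # 1: shortcode name
(?![\w-])                       # not followed by a word character or hyphen
([^]\/]*)                       # 2: attributes, up to a closing bracket or slash
]                               # closing bracket
~x
REGEX;

preg_match($pattern, 'see [gallery id="3"] here', $m);
echo $m[1], PHP_EOL; // gallery
```

Because a nowdoc performs no escape processing, each backslash is written once, exactly as the regex engine should see it.
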
One more thing: the closing square bracket isn't a special character outside a character class, so you don't need to escape it. An opening square bracket inside a character class isn't special either, so you can write [^[] instead of [^\\[]. (You can even write [^]] and []], because in first position the closing square bracket is treated as a literal character.)
– Casimir et Hippolyte, Commented May 17, 2016 at 23:17
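
A small sketch to check the bracket rules from this comment; the subject strings are made up for the demonstration.

```php
<?php
// Escaped and unescaped spellings of the same character classes.
$subject = 'a[b]c';

// [\[\]] and [][] both mean "an opening or a closing square bracket".
$escaped   = preg_split('~[\[\]]~', $subject);
$unescaped = preg_split('~[][]~', $subject);

// [^]] means "anything except ]"; the final ] outside the class is a literal.
preg_match('~\[([^]]*)]~', 'x[value]y', $m);

var_dump($escaped === $unescaped); // bool(true)
echo $m[1], PHP_EOL;               // value
```
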

Ro Yo Mi · Accepted Answer · 2016-05-18 02:33:21Z

Foreword

On the surface, it looks like you're attempting to parse HTML with a regular expression. I feel obligated to point out that using a regex to parse HTML is inadvisable because of all the obscure edge cases that can crop up. However, it seems you have some control over the HTML, so you should be able to avoid many of the edge cases the regex police cry about.

Description

<\w+\s(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\[(?<DesiredValue>[^\]]*)\])
|
<\w+\s?(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*>
(?:(?!<\/div>)(?!\[).)*\[(?<DesiredValue>[^\]]*)\]


This regular expression will do the following:

- capture the substring inside square brackets, e.g. [some value]
  - where [value] is in the attributes area of a tag
  - where [value] is not inside the attributes area of a tag
  - provided the substring is not nested inside another value, e.g. <input attrib=" [value] ">
- the captured substring will not include the wrapping square brackets
- allow any tag name; to restrict it, replace the \w+ with the desired tag names
- allow the value to be any string of characters other than a closing square bracket
- avoid difficult edge cases

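Note: this regex is best used with the following flags (PHP equivalents in parentheses):

- global (preg_match_all or preg_replace_callback, which process every match)
- dot matches new line (s modifier)
- ignore white space in expression (x modifier)
- allow duplicate named capture groups (J modifier, or (?J) at the start of the pattern)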

Examples

Live Demo

https://regex101.com/r/tT0bN5/1

Sample Text

<div [value 1] ></div>
<div>[value 2]</div>
but not find a match in this example

<div attr="attribute[value 3]"/>
<img [value 4]>
<a href="http://[value 5]">[value 6]</a>

Sample Matches

MATCH 1
DesiredValue    [6-13]  `value 1`
MATCH 2
DesiredValue    [29-36] `value 2`
MATCH 3
DesiredValue    [121-128]   `value 4`
MATCH 4
DesiredValue    [159-166]   `value 6`
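
A minimal sketch of running the expression in PHP against the sample text, assuming PHP 7.2 or later (for PREG_UNMATCHED_AS_NULL). The x and s modifiers and an inline (?J) stand in for the flags listed earlier, and preg_match_all plays the role of "global". Because the name DesiredValue is used twice, the sketch reads the numbered groups 1 and 2 instead of the named key; the variable names are illustrative.

```php
<?php
// The expression from the Description, unchanged apart from the leading (?J),
// which permits the duplicate group name. The ~ delimiter leaves \/ harmless.
$pattern = <<<'REGEX'
~(?J)
<\w+\s(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\[(?<DesiredValue>[^\]]*)\])
|
<\w+\s?(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*>
(?:(?!<\/div>)(?!\[).)*\[(?<DesiredValue>[^\]]*)\]
~xs
REGEX;

$html = <<<'HTML'
<div [value 1] ></div>
<div>[value 2]</div>
but not find a match in this example

<div attr="attribute[value 3]"/>
<img [value 4]>
<a href="http://[value 5]">[value 6]</a>
HTML;

preg_match_all($pattern, $html, $sets, PREG_SET_ORDER | PREG_UNMATCHED_AS_NULL);

$values = [];
foreach ($sets as $set) {
    // Group 1 belongs to the first alternative, group 2 to the second;
    // whichever did not participate in the match is null (or absent).
    $values[] = $set[1] ?? $set[2];
}

print_r($values); // value 1, value 2, value 4, value 6
```

Note that in the first alternative the bracketed value sits inside a lookahead, so the overall match there is only the tag opening (for example `<div `), not the bracketed text itself.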

Explanation

NODE                     EXPLANATION
----------------------------------------------------------------------
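  <                        '<'
----------------------------------------------------------------------
  \w+                      word characters (a-z, A-Z, 0-9, _) (1 or
                           more times (matching the most amount
                           possible))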
----------------------------------------------------------------------
  \s                       whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
  (?=                      look ahead to see if there is:
----------------------------------------------------------------------
    (?:                      group, but do not capture (0 or more
                             times (matching the least amount
                             possible)):
----------------------------------------------------------------------
      [^>=]                    any character except: '>', '='
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      ='                       '=\''
----------------------------------------------------------------------
      [^']*                    any character except: ''' (0 or more
                               times (matching the most amount
                               possible))
----------------------------------------------------------------------
      '                        '\''
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      ="                       '="'
----------------------------------------------------------------------
      [^"]*                    any character except: '"' (0 or more
                               times (matching the most amount
                               possible))
----------------------------------------------------------------------
      "                        '"'
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      =                        '='
----------------------------------------------------------------------
      [^'"]                    any character except: ''', '"'
----------------------------------------------------------------------
      [^\s>]*                  any character except: whitespace (\n,
                               \r, \t, \f, and " "), '>' (0 or more
                               times (matching the most amount
                               possible))
----------------------------------------------------------------------
    )*?                      end of grouping
----------------------------------------------------------------------
    \[                       '['
----------------------------------------------------------------------
    (?<DesiredValue>         group and capture to \1 (named
                             DesiredValue):
----------------------------------------------------------------------
      [^\]]*                   any character except: ']' (0 or more
                               times (matching the most amount
                               possible))
----------------------------------------------------------------------
    )                        end of \1
----------------------------------------------------------------------
    \]                       ']'
----------------------------------------------------------------------
  )                        end of look-ahead
----------------------------------------------------------------------
 |                        OR
----------------------------------------------------------------------
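  <                        '<'
----------------------------------------------------------------------
  \w+                      word characters (a-z, A-Z, 0-9, _) (1 or
                           more times (matching the most amount
                           possible))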
----------------------------------------------------------------------
  \s?                      whitespace (\n, \r, \t, \f, and " ")
                           (optional (matching the most amount
                           possible))
----------------------------------------------------------------------
  (?:                      group, but do not capture (0 or more times
                           (matching the most amount possible)):
----------------------------------------------------------------------
    [^>=]                    any character except: '>', '='
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    ='                       '=\''
----------------------------------------------------------------------
    [^']*                    any character except: ''' (0 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
    '                        '\''
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    ="                       '="'
----------------------------------------------------------------------
    [^"]*                    any character except: '"' (0 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
    "                        '"'
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    =                        '='
----------------------------------------------------------------------
    [^'"]                    any character except: ''', '"'
----------------------------------------------------------------------
    [^\s>]*                  any character except: whitespace (\n,
                             \r, \t, \f, and " "), '>' (0 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
  )*                       end of grouping
----------------------------------------------------------------------
  >                        '>'
----------------------------------------------------------------------
  (?:                      group, but do not capture (0 or more times
                           (matching the most amount possible)):
----------------------------------------------------------------------
    (?!                      look ahead to see if there is not:
----------------------------------------------------------------------
      <                        '<'
----------------------------------------------------------------------
      \/                       '/'
----------------------------------------------------------------------
      div>                     'div>'
----------------------------------------------------------------------
    )                        end of look-ahead
----------------------------------------------------------------------
    (?!                      look ahead to see if there is not:
----------------------------------------------------------------------
      \[                       '['
----------------------------------------------------------------------
    )                        end of look-ahead
----------------------------------------------------------------------
    .                        any character
----------------------------------------------------------------------
  )*                       end of grouping
----------------------------------------------------------------------
  \[                       '['
----------------------------------------------------------------------
  (?<DesiredValue>         group and capture to \2 (named
                           DesiredValue):
----------------------------------------------------------------------
    [^\]]*                   any character except: ']' (0 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
  )                        end of \2
----------------------------------------------------------------------
  \]                       ']'

Incredible answer. I appreciate the amount of time and effort you put into it. I still haven't quite solved it, but this should help considerably.
