
I need to parse a text file and find the PHP classes in it. For example, suppose the file contains this source:

... some text ...

... some other text ...

class Foo{

function Bar($param){ ... do stuff ... }

}

... some other text ...

class Bar{

function Foo(){ ... do something .... }

}

... some else ...

In this case, my regular expression must match both classes, including their contents, and return these results:

first result:

class Foo{

function Bar($param){ ... do stuff ... }

}

second result:

class Bar{

function Foo(){ ... do something .... }

}

I've tried many times without success. My last attempt was

/^[\n\r\t ](?:abstract|class|interface){1}(.)[^(?:class|interface)]*$/im

but it only matches

class Foo{

and

class Bar{

without the content of the class.

Thanks for your help :)

  • Are you asking how to match the contents of a possibly nested { .. } block structure? Commented Nov 11, 2010 at 12:19
  • Hi and welcome to Stack Overflow. For posting code, please don't use > but rather paste the code as it, select it and press Ctrl-K. This is much better. Commented Nov 11, 2010 at 12:23

1 Answer

This cannot be done with "classic" regular expressions because you'd need to handle arbitrarily nested braces, and structures like these are by definition not regular. Some languages and libraries (.NET, PCRE, Perl 5.6 and up) have augmented regular expressions to support recursive matching, but most implementations can't handle recursion yet.

I'd also wager that even if your favorite language's regex engine can handle recursion, it's usually not the best way to go. Most of the time, you'd rather use a parser for this.
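To give a rough idea of what a non-regex approach could look like, here is a minimal sketch in Python (chosen only for illustration; the question doesn't name a host language). It is not a real PHP parser: the helper name `extract_classes` and the sample text are made up for this example, it only counts braces, and it will miscount any { or } that appears inside a string or comment.

```python
import re

# Hypothetical helper, not a full parser: find a class/interface keyword at
# the start of a line, then count braces to locate the matching closing one.
KEYWORD = re.compile(r'^[ \t]*(?:abstract\s+)?(?:class|interface)\b', re.MULTILINE)

def extract_classes(text):
    results = []
    pos = 0
    while True:
        m = KEYWORD.search(text, pos)
        if m is None:
            break
        open_idx = text.find('{', m.end())
        if open_idx == -1:
            break
        depth = 0
        end = None
        for i in range(open_idx, len(text)):
            if text[i] == '{':
                depth += 1
            elif text[i] == '}':
                depth -= 1
                if depth == 0:      # back at the class's own level
                    end = i + 1
                    break
        if end is None:             # unbalanced braces: give up
            break
        results.append(text[m.start():end].strip())
        pos = end
    return results

# Sample input modelled on the text in the question.
SAMPLE = (
    "... some text ...\n\n"
    "class Foo{\n\nfunction Bar($param){ ... do stuff ... }\n\n}\n\n"
    "... some other text ...\n\n"
    "class Bar{\n\nfunction Foo(){ ... do something .... }\n\n}\n\n"
    "... some else ...\n"
)

classes = extract_classes(SAMPLE)
for c in classes:
    print(c)
    print("-----")
```

Because the depth counter tracks nesting explicitly, the indentation of the source doesn't matter here. Handling braces inside strings and comments would need a proper tokenizer on top of this.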

That said, even without recursive regexes you might have a chance if your code is formatted in a consistent manner (start column of the class definition == column of the closing }, no mix of tabs and spaces, and every sub-level structure is indented).

Then you could try

/^([\t ]*)(?:abstract|class|interface).*?^\1\}/sim

But this is sure to fail horribly if your code is not exactly formatted according to those rules.

Explanation:

^                             # start of line
([\t\ ]*)                     # match and remember whitespace
(?:abstract|class|interface)  # match keyword
.*?                           # match as few characters as possible
^\1                           # until the next line that starts with the same amount of whitespace
\}                            # followed by a }
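
As a quick check of the pattern above, here is a small sketch that runs it through Python's `re` module (an assumption made for illustration; the s, i and m modifiers map to `re.DOTALL`, `re.IGNORECASE` and `re.MULTILINE`). Both input strings are invented for this example: one follows the formatting rules, the other puts a nested closing brace in the same column as the class keyword.

```python
import re

# Same pattern as above, with the /sim modifiers expressed as flags.
PATTERN = re.compile(
    r'^([\t ]*)(?:abstract|class|interface).*?^\1\}',
    re.DOTALL | re.IGNORECASE | re.MULTILINE,
)

# Consistently indented: only the class's closing brace is in column 0.
WELL_FORMATTED = (
    "some text\n"
    "class Foo{\n"
    "    function Bar($param){\n"
    "        return 1;\n"
    "    }\n"
    "}\n"
    "more text\n"
)

# Not indented: the method's closing brace is also in column 0.
BADLY_FORMATTED = (
    "class Foo{\n"
    "function Bar(){\n"
    "}\n"
    "}\n"
)

matches = [m.group(0) for m in PATTERN.finditer(WELL_FORMATTED)]
truncated = PATTERN.search(BADLY_FORMATTED).group(0)

print(matches[0])   # the whole class, up to and including its closing brace
print("-----")
print(truncated)    # stops at the method's closing brace, not the class's
```

The second input shows the failure mode: the lazy `.*?` stops at the first line that starts with the remembered whitespace followed by a }, which there is the method's brace rather than the class's.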

5 Comments

Tim Tim Tim, please stop saying this "cannot be done with regexes" stuff. It's not true.
@tchrist: OK, I have clarified my answer. A little :). I still don't think it's a good thing to use recursion in regular expressions even if some modern dialects can. Regexes are hard enough already...
Not perl6. perl5 has had it since at least 5.6 from back last millennium. The cooler buffer recursion thing though is from 5.10 and about three years old.
@TimPietzcker: It depends on what you’re doing. I think a regex can be very maintainable, moreso than a dedicated parser. You just have to use "grammatical" regexes, like here and here.
@tchrist: How about handling } s inside comments or strings? Is it feasible to write a regex that finds the correct matching brace for { foo { bar "baz{" /* {{comment} */ tutu } tata }?
