Not certain if this can be done with regexp in JavaScript, but I thought it would be interesting to see if it is possible. I wanted to clean up a piece of HTML by removing most tags, literally just dropping them, e.g. <H1>, <img>, <a href ....>. That part is relatively simple (I stole the basis from another post, thanks karim79: Remove HTML Tags in Javascript with Regex).

// Named here (stripTags) only so the snippet runs on its own.
function stripTags(inString, maxlength, callback) {
    console.log("String is " + inString);
    console.log("Its length is " + inString.length);

    // Matches anything between < and >, i.e. every tag.
    var regex = /(<([^>]+)>)/ig;
    var outString = inString.replace(regex, "");
    console.log("No HTML string " + outString);
    if (outString.length < maxlength) {
        callback(outString);
    } else {
        console.log("Lets cut first bit");
    }
}

But then I started thinking: is there a way to control what the regex does for each match? Let's say I want to keep certain tags, like b, br, and i, and maybe change H1-6 to b. In pseudocode, something like:

for ( var i in inString.regex.hits ) {
   if ( hits[i] == H1 ) {
         hits[i] = b;
   }
}

The issue is that I want the text that is not HTML tags to stay as it is, and I want tags to be cut out by default. One option would of course be to temporarily change the ones I want to keep: say, change <b> to a placeholder such as [[b]] for all the tags of interest, then put them back to <b> once all the unknown tags have been removed. So like this (only for b, and I am not certain the code below would work):

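// Named here (stripTagsKeepBold) only so the snippet runs on its own.
function stripTagsKeepBold(inString, maxlength, callback) {
    console.log("String is " + inString);
    console.log("Its length is " + inString.length);

    // Hyphens are not valid in identifiers, so the names are camelCased.
    var regexRemHTML = /(<([^>]+)>)/ig;
    var regexHideB = /<(\/?)b>/ig;      // matches both <b> and </b>
    var regexShowB = /\[(\/?)b\]/ig;    // brackets escaped; a bare [b] is a character class
    var outString = inString.replace(regexHideB, "[$1b]");
    outString = outString.replace(regexRemHTML, "");
    outString = outString.replace(regexShowB, "<$1b>");
    console.log("No HTML string " + outString);
    if (outString.length < maxlength) {
        callback(outString);
    } else {
        console.log("Lets cut first bit");
    }
}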

But would it be possible to be smarter, writing code that says: here is an HTML tag, run this code against the match?

  • For any manipulation of HTML other than very simple cases, you might want to consider using a parser, rather than regex. Commented Jul 29, 2016 at 9:38
  • I was thinking about that at first, but are there any "configurable" ones? In this case, the security aspect is only half of it. The reason is that the HTML that goes in is from an article, and the code is expected to take the first "n" characters and make them pretty as an introduction to the article. Commented Jul 29, 2016 at 10:30
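
To make the "first n characters as an introduction" requirement from the comment above concrete, here is a minimal sketch. The name makeIntro and the trailing "..." are assumptions for illustration, not something from the question. It also assumes every tag can be replaced by a space, which will split a word that has an inline tag in the middle of it.

```javascript
// Hypothetical helper: builds a plain-text teaser from an HTML fragment.
// Returns at most maxLength characters of text, plus "..." if it was cut.
function makeIntro(html, maxLength) {
  // Replace every tag with a space, then collapse runs of whitespace.
  var text = html.replace(/<[^>]+>/g, " ").replace(/\s+/g, " ").trim();
  if (text.length <= maxLength) {
    return text;
  }
  var cut = text.slice(0, maxLength);
  // Back up to the last space so the teaser does not end mid-word.
  var lastSpace = cut.lastIndexOf(" ");
  if (lastSpace > 0) {
    cut = cut.slice(0, lastSpace);
  }
  return cut + "...";
}

console.log(makeIntro("<h1>Title</h1><p>Some longer article text here</p>", 20));
```

Cutting the plain text avoids slicing through the middle of a tag, which is what would happen if the raw HTML were truncated at a fixed length.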

1 Answer

As Tim Biegeleisen said in his comment, a better solution might be to use a parser instead of a regex...

By the way, if you want to control what the regex replaces, you can pass a callback to String.prototype.replace:

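var input = "<div><h1>CIAO Bello</h1></div>";

// The callback runs once per matched tag; its return value replaces the match.
var output = input.replace(/(<([^>]+)>)/gi, (val) => {
    // Drop any tag whose text contains "div"; keep every other tag unchanged.
    if (val.indexOf("div") > -1) {
        return "";
    }

    return val;
});

console.log("output", output);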
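
Applied to the rules described in the question (keep b, br and i, turn H1-6 into b, drop everything else), the same callback technique could look roughly like the sketch below. The name filterTags and the keep table are made up for illustration, and attributes on kept tags are simply discarded, which is an assumption about what is wanted.

```javascript
// Hypothetical helper: keeps <b>, <br>, <i>, maps <h1>..<h6> to <b>,
// and removes every other tag. Text outside tags is left untouched.
function filterTags(html) {
  var keep = { b: true, br: true, i: true };
  return html.replace(/<[^>]+>/g, function (tag) {
    // parts[1] is "/" for a closing tag, parts[2] is the tag name.
    var parts = /^<(\/?)([a-z][a-z0-9]*)\b/i.exec(tag);
    if (!parts) {
      return ""; // comments, doctypes and anything unrecognised are dropped
    }
    var slash = parts[1];
    var name = parts[2].toLowerCase();
    if (/^h[1-6]$/.test(name)) {
      return "<" + slash + "b>";
    }
    if (keep[name] === true) {
      return "<" + slash + name + ">"; // attributes are discarded
    }
    return "";
  });
}

var sample = '<div><h1>CIAO</h1> <a href="#">Bello</a><br/><i class="x">ok</i></div>';
console.log(filterTags(sample)); // <b>CIAO</b> Bello<br><i>ok</i>
```

Because unknown tags fall through to the final return "", cutting is the default and only the listed tags survive, so there is no need for the [[b]] placeholder round trip.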

3 Comments

Looks good. Maybe a stupid question: what language is that? (The if statement with val.indexOf does not look like JavaScript to me, but that might be because I am just not hardcore enough.)
Makes sense now. I'm an old ksh scripter myself, so I thought it might be some strange regexp code. But it is clear to me now that you just write code differently to me (and probably better; after all, you answered my question). I would have written War and Peace in else if :)
