Splitting by html elements in Javascript

Question

I have a bunch of strings that typically look something like this:

string 1<div>string 2<br></div>string 3

string 1<div>string 2<br></div><div>string 3<br></div>

<div>string 1<br></div><div>string 2<br></div><div>string 3<br></div>

And I need to extract the text (both inside and outside/between elements, as seen above) into an array like this:

['string 1', 'string 2', 'string 3']

Is there a way to do this in pure Javascript?

I tried something like this:

console.log(text.split(/<div>(.*)<br><\/div>/g))

But it only works for the first one:

[ 'string 1', 'string 2', 'string 3' ]

It fails on the last two variations:

[ 'string 1', 'string 2<br></div><div>string 3', '' ]

[ '', 'string 1<br></div><div>string 2<br></div><div>string 3', '' ]

I'd try being sneaky with .match(/(?<=>|^)[^<]+/g), but if the inputs can vary in some other way, this fails catastrophically.
– ASDFGerte, Commented Mar 20, 2021 at 23:34
There are lots of good approaches, including regex. But I'd STRONGLY encourage you to consider jQuery. "Pure Javascript" is NOT necessarily the best approach all the time. IMHO...
– paulsm4, Commented Mar 20, 2021 at 23:36
@ASDFGerte Thanks, I'll use this approach. It seems to work well for my case, for now. If you'd like, it'd be nice if you made a full answer elaborating on how it works.
– what the, Commented Mar 20, 2021 at 23:41
@paulsm4 I need to use this in Anki, which doesn't have great support for Javascript in the first place, so I want to keep it minimal. Plus I don't want my cards getting too heavy.
– what the, Commented Mar 20, 2021 at 23:43
It exploits the fact that the text is always between > (or the start of the string) and < (or the end of the string), and nothing else is (at least no non-empty strings). It ignores everything else. Note that lookbehind isn't available everywhere yet; if that is a problem, you can match the > as well and remove it afterwards.
– ASDFGerte, Commented Mar 20, 2021 at 23:44
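
The two variants described in the comments above can be sketched roughly as follows. The function names extractWithLookbehind and extractWithoutLookbehind are made up for illustration, and the sketch assumes the text itself never contains < or >; inputs that break that assumption will give wrong results.

```javascript
// Sample inputs copied from the question.
const samples = [
  'string 1<div>string 2<br></div>string 3',
  'string 1<div>string 2<br></div><div>string 3<br></div>',
  '<div>string 1<br></div><div>string 2<br></div><div>string 3<br></div>'
];

// Variant from the comment: match runs of non-"<" characters that start
// right after a ">" or at the start of the string (needs lookbehind support).
const extractWithLookbehind = (text) => text.match(/(?<=>|^)[^<]+/g) || [];

// Fallback without lookbehind: include the leading ">" in the match,
// then strip it from each result afterwards.
const extractWithoutLookbehind = (text) =>
  (text.match(/(?:>|^)[^<>]+/g) || []).map((part) => part.replace(/^>/, ''));

samples.forEach((s) => {
  console.log(extractWithLookbehind(s), extractWithoutLookbehind(s));
});
```

Both functions fall back to an empty array because match returns null when nothing matches.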

Nick · Accepted Answer · 2021-03-20 23:47:10Z

4

A pure JavaScript approach using the DOM is generally more robust than regex for parsing HTML. You can create a template element, load the HTML into it, and then read the textContent of each top-level child node (either a bare text node or a div), dropping any empty results:

const html = [
  'string 1<div>string 2<br></div>string 3',
  'string 1<div>string 2<br></div><div>string 3<br></div>',
  '<div>string 1<br></div><div>string 2<br></div><div>string 3<br></div>'
]

const getTextContent = (html) => {
  // A <template> parses the markup into an inert fragment without rendering it
  const tmp = document.createElement('template');
  tmp.innerHTML = html;
  // Each top-level child is either a bare text node or a <div>;
  // textContent gives the text of both (the <br> contributes nothing)
  return Array.from(tmp.content.childNodes, n => n.textContent)
    .filter(text => text !== '');
}

html.forEach(h => console.log(getTextContent(h)));


Nicolas Bertho · Accepted Answer · 2021-03-20 23:53:34Z

0

It may not be the best solution, but I tried your three examples with this code:

let regex = /(<([^>]+)>)/ig;
let myArray = myString.replace(regex, "-").split("-");

You might want to replace the - character with some other placeholder that cannot appear in your text, and you also need to filter the array to remove empty elements, but it works.
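
As a variation on the same idea, one could split directly on the tag pattern and skip the placeholder character entirely. This is only a sketch: the names TAG and splitOnTags are hypothetical, and it assumes the text never contains a literal < or >. The capturing groups are dropped from the pattern because split would otherwise include the captured tags in the result.

```javascript
// Matches anything that looks like a tag: "<", one or more non-">" characters, ">".
// No capturing groups, so split() returns only the text between tags.
const TAG = /<[^>]+>/;

// Split on tags and drop the empty strings left between adjacent tags.
const splitOnTags = (text) => text.split(TAG).filter((part) => part !== '');

console.log(splitOnTags('string 1<div>string 2<br></div>string 3'));
console.log(splitOnTags('<div>string 1<br></div><div>string 2<br></div><div>string 3<br></div>'));
```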
