I have a bunch of strings that typically look something like this:

string 1<div>string 2<br></div>string 3
string 1<div>string 2<br></div><div>string 3<br></div>
<div>string 1<br></div><div>string 2<br></div><div>string 3<br></div>

And I need to extract the text (both inside and outside/between elements, as seen above) into an array like this:

['string 1', 'string 2', 'string 3']

Is there a way to do this in pure JavaScript?

I tried something like this:

console.log(text.split(/<div>(.*)<br><\/div>/g))

But it only works for the first one:

[ 'string 1', 'string 2', 'string 3' ]

It fails on the last two variations:

[ 'string 1', 'string 2<br></div><div>string 3', '' ]
[ '', 'string 1<br></div><div>string 2<br></div><div>string 3', '' ]
  • I'd try to be sneaky with .match(/(?<=>|^)[^<]+/g), but if the inputs can vary in some other way, this fails catastrophically. Commented Mar 20, 2021 at 23:34
  • There are lots of good approaches, including regex. But I'd STRONGLY encourage you to consider jQuery. "Pure JavaScript" is NOT necessarily the best approach all the time. IMHO... Commented Mar 20, 2021 at 23:36
  • @ASDFGerte Thanks, I'll use this approach. It seems to work well for my case, for now. If you'd like, it'd be nice if you make a full answer elaborating how it works. Commented Mar 20, 2021 at 23:41
  • @paulsm4 I need to use this in Anki, which doesn't have great support for JavaScript in the first place, so I want to keep it minimal. Plus I don't want my cards getting too heavy. Commented Mar 20, 2021 at 23:43
  • It exploits the fact that the text is always between > (or string start) and < (or string end), and nothing else is (at least no non-empty strings). It ignores everything else. Note that look-behind isn't available everywhere yet; if that is a problem, you can match the > as well and remove it afterwards. Commented Mar 20, 2021 at 23:44
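
As a rough illustration of the two regex variants discussed in these comments, here is a minimal sketch. The function names extractWithLookbehind and extractWithoutLookbehind are made up for this example, and it assumes the inputs look like the three samples from the question (flat markup, no attributes containing > or <):

```javascript
// Sample inputs taken from the question.
const samples = [
  'string 1<div>string 2<br></div>string 3',
  'string 1<div>string 2<br></div><div>string 3<br></div>',
  '<div>string 1<br></div><div>string 2<br></div><div>string 3<br></div>',
];

// Variant 1 (hypothetical name): uses look-behind. It matches a run of
// non-"<" characters that starts at the string start or right after a ">".
const extractWithLookbehind = (text) =>
  text.match(/(?<=>|^)[^<]+/g) || [];

// Variant 2 (hypothetical name): no look-behind. It matches the leading ">"
// as part of the result and strips it afterwards.
const extractWithoutLookbehind = (text) =>
  (text.match(/(?:^|>)[^<]+/g) || []).map((s) => s.replace(/^>/, ''));

samples.forEach((s) => {
  console.log(extractWithLookbehind(s));
  console.log(extractWithoutLookbehind(s));
});
```

Both variants rely purely on the position of the angle brackets, so they are likely to break on anything beyond this simple shape, as the first comment warns.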

2 Answers

4

A DOM-based approach is generally more robust than regex for parsing HTML. You can create a template element, load the HTML into it, and then read the textContent of each top-level child node (bare text nodes and elements alike), dropping any that are empty:

const html = [
  'string 1<div>string 2<br></div>string 3',
  'string 1<div>string 2<br></div><div>string 3<br></div>',
  '<div>string 1<br></div><div>string 2<br></div><div>string 3<br></div>'
]

const getTextContent = (html) => {
  const tmp = document.createElement('template');
  tmp.innerHTML = html;
  // Take the text of every top-level node: bare text nodes and <div> elements alike.
  // Keeping only nodes with nodeType === Node.TEXT_NODE would drop the text inside the <div>s.
  return Array.from(tmp.content.childNodes, n => n.textContent)
    .filter(text => text !== '');
}

html.forEach(h => console.log(getTextContent(h)));

0

It may not be the best solution, but I tried your three examples with this code:

let regex = /(<([^>]+)>)/ig;
let myArray = myString.replace(regex, "-").split("-");

You might want to replace the - character with something that cannot occur in your text, and you also need to filter the array to remove empty elements, but it works.
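
One possible way to sidestep the placeholder character altogether is to split on the tags directly and filter out the empty pieces in the same step. This is only a sketch: splitOnTags is a hypothetical name, and it assumes tags never contain a > inside an attribute value.

```javascript
// Hypothetical helper: splits on tags rather than replacing them with a
// placeholder, so text containing "-" is left untouched.
const splitOnTags = (text) =>
  text
    .split(/<[^>]+>/)          // no capturing group, so the tags are not kept in the result
    .map((s) => s.trim())      // drop surrounding whitespace around each piece
    .filter((s) => s !== '');  // remove the empty pieces left between adjacent tags

console.log(splitOnTags('string 1<div>string 2<br></div>string 3'));
console.log(splitOnTags('<div>part-one<br></div><div>part-two<br></div>'));
```

Note that trimming changes pieces that have meaningful leading or trailing spaces; drop the map step if those need to be preserved.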
