22

Can you recommend an open source Java library (preferably under an ASL/BSD/LGPL license) that converts HTML to plain text - strips all the tags, converts entities (&amp;, &nbsp;, etc.), and handles <br> and tables properly?

More Info

I have the HTML as a string, there's no need to fetch it from the web. Also, what I'm looking for is a method like this:

String convertHtmlToPlainText(String html)
5 Comments

  • Also jsoup is mentioned here, which is distributed under the liberal MIT license.
  • By the way, jsoup supports HTML5.
  • At least according to the documentation it does not do what I've asked (convert the page to plain text, NOT HTML manipulation).
  • Here you are: Jsoup.parse(html).text()
  • @cubanacan Thanks, good to know there is another alternative.

5 Answers

21

Try Jericho.

The TextExtractor class sounds like it will do what you want. Sorry, I can't post a second link as I'm a new user, but scroll down the homepage a bit and there's a link to it.


3 Comments

Thanks! I actually used the Renderer at the end
For the lazy: String plainText = new Source(html).getRenderer().toString();
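Putting the two Jericho options from this answer side by side, here is a minimal sketch (the sample HTML string is made up; it assumes the Jericho jar is on the classpath):

```java
import net.htmlparser.jericho.Source;

public class JerichoDemo {
    public static void main(String[] args) {
        String html = "<p>Hello &amp; welcome<br>to <b>Jericho</b></p>";
        Source source = new Source(html);
        // TextExtractor strips tags and decodes entities into flat text
        System.out.println(source.getTextExtractor().toString());
        // Renderer produces layout-aware plain text (line breaks, lists, tables)
        System.out.println(source.getRenderer().toString());
    }
}
```

The Renderer is the better fit for the question, since it tries to preserve the visual layout (line breaks, list bullets, table columns) rather than flattening everything to one line.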
3

Try HtmlUnit; it even shows the page after processing JavaScript / Ajax.

3 Comments

I see how it gives me the response as HTML, but not as plain text
Thanks. I went for Jericho at the end, but I'll keep HtmlUnit in mind
2

The bliki engine can do this in two steps; see info.bliki.wiki/Home:

  1. Convert HTML to Mediawiki text -- Mediawiki text is already a rather plain text format, but you can convert it further.
  2. Convert Mediawiki text to plain text -- your goal.

It will be some 7-8 lines of code, like this:

// html to wiki
import info.bliki.html.HTML2WikiConverter;
import info.bliki.html.wikipedia.ToWikipedia;
// wiki to plain text
import info.bliki.wiki.filter.PlainTextConverter;
import info.bliki.wiki.model.WikiModel;
...
String sbodyhtml = readFile(infilepath); // get the HTML content as a string
HTML2WikiConverter conv = new HTML2WikiConverter();
conv.setInputHTML(sbodyhtml);
String resultwiki = conv.toWiki(new ToWikipedia());
WikiModel wikiModel = new WikiModel("${image}", "${title}");
String plainStr = wikiModel.render(new PlainTextConverter(false), resultwiki);
System.out.println(plainStr);

Jsoup can do this simpler:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
...
Document doc = Jsoup.parse(sbodyhtml);
String plainStr = doc.body().text();

but in the result you lose all paragraph formatting -- there will be no newlines.
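If you want jsoup's simplicity but still need line breaks, one common workaround is to walk the parse tree yourself. A minimal sketch (the toPlainText helper and sample HTML are made up; it assumes jsoup's NodeTraversor/NodeVisitor API):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.NodeTraversor;
import org.jsoup.select.NodeVisitor;

public class JsoupPlainText {
    // Hypothetical helper: emits a newline for <br> and after block-level elements
    static String toPlainText(String html) {
        StringBuilder sb = new StringBuilder();
        NodeTraversor.traverse(new NodeVisitor() {
            @Override
            public void head(Node node, int depth) {
                if (node instanceof TextNode) {
                    sb.append(((TextNode) node).text());   // plain text content
                } else if (node.nodeName().equals("br")) {
                    sb.append('\n');                       // explicit line break
                }
            }
            @Override
            public void tail(Node node, int depth) {
                // the end of a block element (p, div, tr, ...) starts a new line
                if (node instanceof Element && ((Element) node).isBlock()) {
                    sb.append('\n');
                }
            }
        }, Jsoup.parse(html).body());
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(toPlainText("<p>one</p><p>two<br>three</p>"));
    }
}
```

This keeps each paragraph on its own line instead of collapsing everything into a single string, at the cost of a few extra lines of code.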


0

I use TagSoup; it is available for several languages and does a really good job with HTML found "in the wild". It produces either a cleaned-up version of the HTML or XML that you can then process with some DOM/SAX parser.

2 Comments

Thanks, but I need the final result in plain text
Once it is in XML, you can implement a SAX parser that outputs only the text nodes (e.g. a DefaultHandler with no-op implementations of all methods apart from characters())
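The approach described in the comment above can be sketched like this (a minimal example with made-up sample HTML; it assumes the TagSoup jar is on the classpath):

```java
import java.io.StringReader;
import org.ccil.cowan.tagsoup.Parser;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class TagSoupText {
    public static void main(String[] args) throws Exception {
        StringBuilder sb = new StringBuilder();
        XMLReader reader = new Parser();   // TagSoup's HTML-tolerant SAX parser
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                sb.append(ch, start, length);   // collect only the text nodes
            }
        });
        reader.parse(new InputSource(new StringReader("<b>hello</b> world")));
        System.out.println(sb.toString());   // prints "hello world"
    }
}
```

Because TagSoup implements the standard XMLReader interface, the same DefaultHandler works unchanged with any other SAX source.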
-1

I've used Apache Commons Lang to go the other way. But it looks like it can do what you need via StringEscapeUtils.

2 Comments

I can't find any htmlToText() method - there are methods for escaping HTML, so that "<b>hello</b>" will be converted to "&lt;b&gt;hello&lt;/b&gt;" rather than to "hello"
Ahh, yes, I didn't see you wanted plain text. This is true.
