22

Can you recommend an open source Java library (preferably under an ASL/BSD/LGPL license) that converts HTML to plain text - strips all the tags, converts entities (&amp;, &nbsp;, etc.), and handles <br> and tables properly?

More Info

I have the HTML as a string, there's no need to fetch it from the web. Also, what I'm looking for is a method like this:

String convertHtmlToPlainText(String html)
5 Comments

  • Also jsoup is mentioned here, which is distributed under the liberal MIT license.
  • By the way, jsoup supports HTML5.
  • At least according to the documentation it does not do what I've asked (convert the page to plain text, NOT HTML manipulation).
  • Here you are: Jsoup.parse(html).text()
  • @cubanacan Thanks, good to know there is another alternative.

5 Answers

21

Try Jericho.

The TextExtractor class sounds like it will do what you want. Sorry, I can't post a second link as I'm a new user, but scroll down the homepage a bit and there's a link to it.


3 Comments

Thanks! I actually used the Renderer at the end
For the lazy: String plainText = new Source(html).getRenderer().toString();
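Putting the two Jericho options from this answer side by side, here is a minimal sketch (the sample HTML string is made up; it assumes the Jericho jar is on the classpath):

```java
import net.htmlparser.jericho.Source;

public class JerichoDemo {
    public static void main(String[] args) {
        String html = "<p>Hello &amp; welcome<br>to <b>Jericho</b></p>";
        Source source = new Source(html);
        // TextExtractor strips tags and decodes entities into flat text
        System.out.println(source.getTextExtractor().toString());
        // Renderer produces layout-aware plain text (line breaks, lists, tables)
        System.out.println(source.getRenderer().toString());
    }
}
```

The Renderer is the better fit for the question, since it tries to preserve the visual layout (line breaks, list bullets, table columns) rather than flattening everything to one line.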
3

Try HtmlUnit; it even shows the page after processing JavaScript / Ajax.

3 Comments

I see how it gives me the response as HTML, but not as plain text
Thanks. I went for Jericho at the end, but I'll keep HtmlUnit in mind
2

The bliki engine can do this in two steps; see info.bliki.wiki/Home:

  1. Convert HTML to Mediawiki text -- Mediawiki text is already a rather plain text format, but you can convert it further.
  2. Convert Mediawiki text to plain text -- your goal.

It will be some 7-8 lines of code, like this:

// html to wiki
import info.bliki.html.HTML2WikiConverter;
import info.bliki.html.wikipedia.ToWikipedia;
// wiki to plain text
import info.bliki.wiki.filter.PlainTextConverter;
import info.bliki.wiki.model.WikiModel;
...
String sbodyhtml = readFile(infilepath); // get the HTML content as a string
HTML2WikiConverter conv = new HTML2WikiConverter();
conv.setInputHTML(sbodyhtml);
String resultwiki = conv.toWiki(new ToWikipedia());
WikiModel wikiModel = new WikiModel("${image}", "${title}");
String plainStr = wikiModel.render(new PlainTextConverter(false), resultwiki);
System.out.println(plainStr);

Jsoup can do this simpler:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
...
Document doc = Jsoup.parse(sbodyhtml);
String plainStr = doc.body().text();

but in the result you lose all paragraph formatting -- there will be no newlines.
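If you want jsoup's simplicity but still need line breaks, one common workaround is to walk the parse tree yourself. A minimal sketch (the toPlainText helper and sample HTML are made up; it assumes jsoup's NodeTraversor/NodeVisitor API):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.NodeTraversor;
import org.jsoup.select.NodeVisitor;

public class JsoupPlainText {
    // Hypothetical helper: emits a newline for <br> and after block-level elements
    static String toPlainText(String html) {
        StringBuilder sb = new StringBuilder();
        NodeTraversor.traverse(new NodeVisitor() {
            @Override
            public void head(Node node, int depth) {
                if (node instanceof TextNode) {
                    sb.append(((TextNode) node).text());   // plain text content
                } else if (node.nodeName().equals("br")) {
                    sb.append('\n');                       // explicit line break
                }
            }
            @Override
            public void tail(Node node, int depth) {
                // the end of a block element (p, div, tr, ...) starts a new line
                if (node instanceof Element && ((Element) node).isBlock()) {
                    sb.append('\n');
                }
            }
        }, Jsoup.parse(html).body());
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(toPlainText("<p>one</p><p>two<br>three</p>"));
    }
}
```

This keeps each paragraph on its own line instead of collapsing everything into a single string, at the cost of a few extra lines of code.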


0

I use TagSoup; it is available for several languages and does a really good job with HTML found "in the wild". It produces either a cleaned-up version of the HTML or XML that you can then process with some DOM/SAX parser.

2 Comments

Thanks, but I need the final result in plain text
Once it is in XML, you can implement a SAX parser that outputs only the text nodes (e.g. a DefaultHandler with no-op implementations of all methods apart from characters())
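The approach described in the comment above can be sketched like this (a minimal example with made-up sample HTML; it assumes the TagSoup jar is on the classpath):

```java
import java.io.StringReader;
import org.ccil.cowan.tagsoup.Parser;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class TagSoupText {
    public static void main(String[] args) throws Exception {
        StringBuilder sb = new StringBuilder();
        XMLReader reader = new Parser();   // TagSoup's HTML-tolerant SAX parser
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                sb.append(ch, start, length);   // collect only the text nodes
            }
        });
        reader.parse(new InputSource(new StringReader("<b>hello</b> world")));
        System.out.println(sb.toString());   // prints "hello world"
    }
}
```

Because TagSoup implements the standard XMLReader interface, the same DefaultHandler works unchanged with any other SAX source.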
-1

I've used Apache Commons Lang to go the other way. But it looks like it can do what you need via StringEscapeUtils.

2 Comments

I can't find any htmlToText() method - there are methods for escaping HTML, so that "<b>hello</b>" will be converted to "&lt;b&gt;hello&lt;/b&gt;" rather than to "hello"
Ahh, yes, I didn't see you wanted plain text. This is true.
