I recommend using the Jsoup parser to extract your data from the HTML code:
1.) Parse your data into a Document using the Jsoup.parse(string) function.
2.) Get the body tag as an Element using document.body().
3.) Fetch the text of that Element using element.text().
4.) Optionally, use replaceAll("\\s*[,.]\\s*"," ") to remove all commas and dots and normalize the spaces around them.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

String stringText = "<html><head></head><body><p>Aa , aa.</p><p><b>Aa aa, aa.</b></p><p>Aa aa aa, aa.</p><p><i>Aa, aa.</i></p><p><b><i>B, b, b.</i></b></p><b>Aa aa, aa.</b></body></html>";
Document document = Jsoup.parse(stringText); // parse the HTML string
Element element = document.body(); // take the <body> element
// Replace each comma or dot (plus surrounding whitespace) with one space
String plainString = element.text().replaceAll("\\s*[,.]\\s*", " ");
System.out.println(element.text()); // Actual text
System.out.println(plainString); // Formatted text
Output :
Aa , aa. Aa aa, aa. Aa aa aa, aa. Aa, aa. B, b, b.Aa aa, aa.
Aa aa Aa aa aa Aa aa aa aa Aa aa B b b Aa aa aa
To run this, download Jsoup and add it as a dependency of your project.
Regex breakdown for \\s*[,.]\\s* :
\\s* : matches zero or more whitespace characters
[,.] : matches any single character listed inside [], here a comma or a dot
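To see the pattern in isolation, here is a minimal sketch that needs only the standard library. The class name PunctuationCleanup and the helper stripPunctuation are made up for illustration, and the trim() call is an addition of mine to drop the trailing space that the replacement leaves when the text ends with a dot.

```java
class PunctuationCleanup {

    // Hypothetical helper: replaces each comma or dot, together with any
    // whitespace around it, by a single space, then trims both ends.
    static String stripPunctuation(String text) {
        return text.replaceAll("\\s*[,.]\\s*", " ").trim();
    }

    public static void main(String[] args) {
        // prints: Aa aa Aa aa aa
        System.out.println(stripPunctuation("Aa , aa. Aa aa, aa."));
    }
}
```

Because \\s* sits on both sides of [,.], a sequence like " , " collapses into one space instead of leaving two or three behind.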
If you insist on a regex-only solution, then:
1.) First replace all unwanted characters (commas, dots, and the whitespace around them) with a single space using replaceAll("\\s*[.,]\\s*", " ").
2.) Use the regex <p[<>ib]*>([\\w\\s]+)<\\/[\\w]> with Pattern and Matcher to find the text between the tags.
3.) Append each match to a StringBuilder and display the result.
Code
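import java.util.regex.Matcher;
import java.util.regex.Pattern;

String str = "<html><head></head><body><p>Aa , aa.</p><p><b>Aa aa, aa.</b></p><p>Aa aa aa, aa.</p><p><i>Aa, aa.</i></p><p><b><i>B, b, b.</i></b></p><b>Aa aa, aa.</b></body></html>";
// Match text that follows <p>, optionally nested in <b>/<i>, up to the first closing tag
Pattern pattern = Pattern.compile("<p[<>ib]*>([\\w\\s]+)<\\/[\\w]>");
// Remove commas and dots before matching, so the text is only word characters and spaces
Matcher matcher = pattern.matcher(str.replaceAll("\\s*[.,]\\s*", " "));
StringBuilder builder = new StringBuilder();
while (matcher.find()) {
    builder.append(matcher.group(1)); // group 1 is the text between the tags
}
System.out.println(builder);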
Output :
Aa aa Aa aa aa Aa aa aa aa Aa aa B b b
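Note that <b>Aa aa, aa.</b> can appear between <p> and </p>, but the last one in your input does not, which is why the regex output above omits it. Jsoup can also pick the data of all the p tags, but the issue would again be this <b>Aa aa, aa.</b>: it is not inside a p, while elsewhere you also have a b tag inside a p.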
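If the trailing b text should be included in a regex-only approach, one option is to drop every tag instead of matching only p tags. This is a rough sketch of that different technique, not the pattern used above. The class name TagStripper and the helper plainText are hypothetical, and the approach assumes simple, well-formed markup with no > characters inside attribute values, comments, or scripts.

```java
class TagStripper {

    // Hypothetical helper: removes every tag, then commas and dots,
    // then collapses whitespace. Only suitable for simple markup.
    static String plainText(String html) {
        return html
                .replaceAll("<[^>]+>", " ") // replace each tag with a space
                .replaceAll("[,.]", " ")    // replace commas and dots with a space
                .replaceAll("\\s+", " ")    // collapse runs of whitespace
                .trim();
    }

    public static void main(String[] args) {
        String str = "<html><head></head><body><p>Aa , aa.</p><p><b>Aa aa, aa.</b></p><p>Aa aa aa, aa.</p><p><i>Aa, aa.</i></p><p><b><i>B, b, b.</i></b></p><b>Aa aa, aa.</b></body></html>";
        // prints: Aa aa Aa aa aa Aa aa aa aa Aa aa B b b Aa aa aa
        System.out.println(plainText(str));
    }
}
```

For anything beyond markup this simple, the Jsoup version remains the safer choice, since a real parser handles nesting, entities, and malformed tags.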