1

I have this code:

private static final Pattern TAG_REGEX = Pattern.compile("<p>(.+?)</p>");
private static List<String> getTagValues(final String str) {
    final List<String> tagValues = new ArrayList<String>();
    final Matcher matcher = TAG_REGEX.matcher(str);
    while (matcher.find()) {
        tagValues.add(matcher.group(1));
    }
    return tagValues;
}
// Called as:
System.out.println(Arrays.toString(getTagValues(stringText).toArray()));

and I want to extract text from this input:

"<html><head></head><body><p>Aa , aa.</p><p><b>Aa aa, aa.</b></p><p>Aa aa aa, aa.</p><p><i>Aa, aa.</i></p><p><b><i>B, b, b.</i></b></p><b>Aa aa, aa.</b></body></html>" 

I want only the text between <p> and </p>.

I want to get only this:

"Aa aa Aa aa aa Aa aa aa aa Aa aa B b b" 

But I don't know what I have to write in Pattern.compile(""). Can anyone help?

2
  • this <b>Aa aa, aa.</b> can be between <p> and </p> Commented Feb 21, 2017 at 18:13
  • Jsoup can also pick up the data of all the p tags, but the issue will again be this <b>Aa aa, aa.</b>, because it is not inside a p tag, while elsewhere you also have a b tag inside p. Commented Feb 21, 2017 at 18:17

3 Answers

2

I recommend using the Jsoup parser to extract your data from the HTML:

1.) Parse your data as a Document using Jsoup.parse(string).

2.) Get the body tag as an Element.

3.) Fetch the text of the Element using element.text().

4.) Optionally, use replaceAll("\\s*[,.]\\s*"," ") to replace all commas and dots (with any surrounding whitespace) by a single space.

    // Needs: import org.jsoup.Jsoup; import org.jsoup.nodes.Document; import org.jsoup.nodes.Element;
    String stringText = "<html><head></head><body><p>Aa , aa.</p><p><b>Aa aa, aa.</b></p><p>Aa aa aa, aa.</p><p><i>Aa, aa.</i></p><p><b><i>B, b, b.</i></b></p><b>Aa aa, aa.</b></body></html>";
    Document document = Jsoup.parse(stringText);
    Element element = document.body();
    String plainString = element.text().replaceAll("\\s*[,.]\\s*", " ");
    System.out.println(element.text()); // Actual text
    System.out.println(plainString);    // Formatted text

Output :

Aa , aa. Aa aa, aa. Aa aa aa, aa. Aa, aa. B, b, b.Aa aa, aa.
Aa aa Aa aa aa Aa aa aa aa Aa aa B b b Aa aa aa 

Download Jsoup and add it as a dependency

\\s*[,.]\\s* : \\s* matches zero or more whitespace characters

[,.] : matches any single character listed inside [], here a comma or a dot
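
As a quick check of what that pattern does, here is a minimal sketch using only the standard library. The class and method names are made up for illustration and are not part of Jsoup.

```java
class PunctuationCleanup {
    // Replaces every comma or dot, together with any whitespace around it,
    // by a single space.
    static String stripPunctuation(String text) {
        return text.replaceAll("\\s*[,.]\\s*", " ");
    }

    public static void main(String[] args) {
        // Brackets make the trailing space visible: [Aa aa Aa aa aa ]
        System.out.println("[" + stripPunctuation("Aa , aa. Aa aa, aa.") + "]");
    }
}
```

Note that a dot at the very end of the input leaves a trailing space, so a final trim() may be wanted.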


If you insist on a regex solution, then:

1.) First replace all unwanted characters (commas and dots, plus any surrounding whitespace) with a single space using replaceAll("\\s*[.,]\\s*", " ")

2.) Use the regex <p[<>ib]*>([\\w\\s]+)<\\/[\\w]> with Pattern and Matcher to find the text between the tags

3.) Append the found text to a StringBuilder and display the result

Code

    String str = "<html><head></head><body><p>Aa , aa.</p><p><b>Aa aa, aa.</b></p><p>Aa aa aa, aa.</p><p><i>Aa, aa.</i></p><p><b><i>B, b, b.</i></b></p><b>Aa aa, aa.</b></body></html>";
    Pattern pattern = Pattern.compile("<p[<>ib]*>([\\w\\s]+)<\\/[\\w]>");
    Matcher matcher = pattern.matcher(str.replaceAll("\\s*[.,]\\s*", " "));
    StringBuilder builder = new StringBuilder();
    while (matcher.find()) {
        builder.append(matcher.group(1));
    }
    System.out.println(builder);

Output :

Aa aa Aa aa aa Aa aa aa aa Aa aa B b b 
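
The character class [<>ib]* only tolerates b and i tags directly after <p>. A possible variation, sketched here under the assumptions that <p> tags carry no attributes, are never nested, and contain no line breaks, keeps the question's original <p>(.+?)</p> pattern and strips any inner tags afterwards. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ParagraphText {
    // Same pattern as in the question: lazily capture everything between <p> and </p>.
    private static final Pattern P_TAG = Pattern.compile("<p>(.+?)</p>");

    static String extract(String html) {
        List<String> parts = new ArrayList<>();
        Matcher matcher = P_TAG.matcher(html);
        while (matcher.find()) {
            String inner = matcher.group(1)
                    .replaceAll("<[^>]+>", "")  // drop nested tags such as <b> or <i>
                    .replaceAll("[,.]", " ")    // turn commas and dots into spaces
                    .trim()
                    .replaceAll("\\s+", " ");   // collapse runs of whitespace
            if (!inner.isEmpty()) {
                parts.add(inner);
            }
        }
        return String.join(" ", parts);
    }

    public static void main(String[] args) {
        String html = "<html><head></head><body><p>Aa , aa.</p><p><b>Aa aa, aa.</b></p>"
                + "<p>Aa aa aa, aa.</p><p><i>Aa, aa.</i></p><p><b><i>B, b, b.</i></b></p>"
                + "<b>Aa aa, aa.</b></body></html>";
        // Text outside <p>...</p> (the trailing <b> element) is not matched.
        System.out.println(extract(html));
    }
}
```

Like any regex over HTML, this is fragile: it is only meant for small, predictable inputs such as the one in the question.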

2 Comments

Just in case: if you want the data of all p tags, use document.getElementsByTag("p").text(). The resulting string will then not include the <b>Aa aa, aa.</b> text, because it is not inside a p tag.
Yes, I edited my post to say that I want this text without <b>Aa aa, aa.</b>, but I can't use Jsoup because I must send only a Java file, without Jsoup.
0

You don't need Pattern or Matcher for that; you could chain String.replaceAll calls instead:

str.replaceAll(".*?(<p>.*</p>).*", " $1 ").replaceAll(".*?<p>(.*?)</p>.*?", " $1 ").replaceAll("<[/a-z]+>", " ").replaceAll("[,.]", " ").replaceAll(" +", " ")

It doesn't look pretty but it gets the job done :)
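
To make the chain easier to follow, here is the same sequence of calls split over several lines with one comment per step. The wrapper class, the method name and the final trim() are additions for illustration; the five replaceAll calls are the ones from the one-liner.

```java
class ReplaceChain {
    static String paragraphText(String html) {
        return html
                // 1. Keep only the span from the first <p> to the last </p>.
                .replaceAll(".*?(<p>.*</p>).*", " $1 ")
                // 2. Unwrap each paragraph, keeping just its content.
                .replaceAll(".*?<p>(.*?)</p>.*?", " $1 ")
                // 3. Replace the remaining simple tags (<b>, </i>, ...) with a space.
                .replaceAll("<[/a-z]+>", " ")
                // 4. Replace commas and dots with a space.
                .replaceAll("[,.]", " ")
                // 5. Collapse runs of spaces into one.
                .replaceAll(" +", " ")
                // Added here: drop the leading and trailing space left by the steps above.
                .trim();
    }

    public static void main(String[] args) {
        String html = "<html><head></head><body><p>Aa , aa.</p><p><b>Aa aa, aa.</b></p>"
                + "<p>Aa aa aa, aa.</p><p><i>Aa, aa.</i></p><p><b><i>B, b, b.</i></b></p>"
                + "<b>Aa aa, aa.</b></body></html>";
        System.out.println(paragraphText(html));
    }
}
```

Because . does not match line breaks by default, this sketch assumes the HTML is on a single line, as in the question.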

6 Comments

Thanks, it is helpful, but I edited my post because the output was wrong: I don't need <b>Aa aa, aa.</b> in my output. Do you know what I must change in your code to make it work?
I've updated my answer to align with the edit to your question. Please up-vote my answer if it works for you. Thanks.
" \$1 " gives me an error in Eclipse: Invalid escape sequence (valid ones are \b \t \n \f \r \" \' \\ ).
I've fixed my answer. Please try now.
Yes, it works! Thank you. Is there a site where I can enter the text and it returns the pattern I have to use, like replaceAll(".*?(<p>.*</p>).*", " $1 ")? Or do I have to learn what all the characters mean?
0

You can try this:

String str = "<html><head></head><body><p>Aa , aa.</p><p><b>Aa aa, aa.</b></p><p>Aa aa aa, aa.</p><p><i>Aa, aa.</i></p><p><b><i>B, b, b.</i></b></p><b>Aa aa, aa.</b></body></html>";
String start = ">", end = "<";
String regexString = Pattern.quote(start) + "(.*?)" + Pattern.quote(end);
Pattern pattern = Pattern.compile(regexString);
Matcher matcher = pattern.matcher(str.replaceAll("[.,]", ""));
while (matcher.find()) {
    if (!matcher.group(1).replaceAll("\\s{2,}", " ").trim().equals("")) {
        System.out.print(matcher.group(1).replaceAll("\\s{2,}", " ") + " ");
    }
}

This gives you:

Aa aa Aa aa aa Aa aa aa aa Aa aa B b b Aa aa aa 

1 Comment

Thanks, it is helpful, but I edited my post because the output was wrong: I don't need <b>Aa aa, aa.</b> in my output. Do you know what I must change in your code to make it work?
