On a page that I am scraping (https://codeforces.com/contest/1352/problem/D), there is markup like this:

<p>
Alice and Bob play an interesting and tasty game: they eat candy.       Alice will eat candy 
    <span class="tex-font-style-bf">from left to       right</span>
, and Bob — 
    <span class="tex-font-style-bf">from right         to left</span>
. The game ends if all the candies are eaten.
</p>

I want to get the text inside the 'p' tag. For that, I used this Python code:

source_code = raw_html.find('p')  # Find the first 'p' tag in the parsed HTML
text = source_code.get_text('\n')  # Get the text of the 'p' tag, joining its text fragments with '\n'
text = text.replace("    ", " ")  # Replace each run of four spaces with a single space
print(text)

(I am using BeautifulSoup4)

This is the result I was expecting:

Alice and Bob play an interesting and tasty game: they eat candy. Alice will eat candy from left to right, and Bob — from right to left. The game ends if all the candies are eaten.

But my output ended up looking like this:

Alice and Bob play an interesting and tasty game: they eat candy. Alice will eat candy
from left to right
, and Bob —
from right to left
. The game ends if all the candies are eaten.

I believe the problem is caused by the 'span' tags inside the 'p' tag. How can I get the text as a single line, without the newlines that appear around the span tags?

1 Answer

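It's not the most elegant, but you can get there with a list comprehension and some text manipulation: remove the newlines, split on spaces, drop the empty strings, and join what is left with single spaces: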

final_text = ' '.join([item for item in source_code.text.replace('\n','').split(' ') if len(item)>0]) 
print(final_text)
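
A slightly more general variant, as a minimal sketch: calling `str.split()` with no argument splits on any run of whitespace, so tabs are handled as well as spaces. The helper name `normalize_paragraph_text` is hypothetical, and the sketch assumes its input is the string you get from `source_code.get_text()` called without a separator (the sample string below is typed out by hand to mimic that). It also assumes, as in the markup from the question, that there is a space or indentation wherever a line break falls between two words; otherwise those words would be glued together.

```python
def normalize_paragraph_text(raw_text: str) -> str:
    """Collapse text extracted from an indented HTML paragraph into one line.

    Hypothetical helper, not part of BeautifulSoup.
    """
    # Remove the line breaks first so that punctuation starting a new line
    # (", and Bob" or ". The game") attaches to the preceding word.
    without_breaks = raw_text.replace("\n", "")
    # split() with no argument splits on any whitespace run and drops
    # empty strings, so joining with " " leaves single spaces only.
    return " ".join(without_breaks.split())


# Hand-written stand-in for what get_text() returns for the paragraph above.
sample = (
    "\nAlice and Bob play an interesting and tasty game: they eat candy."
    "       Alice will eat candy \n"
    "    from left to       right\n"
    ", and Bob — \n"
    "    from right         to left\n"
    ". The game ends if all the candies are eaten.\n"
)

print(normalize_paragraph_text(sample))
```

Note that passing `'\n'` to `get_text`, as in the question, adds an extra newline between every text fragment, so it is simpler to call `get_text()` without a separator before cleaning up.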