In one part of a web page that I am scraping (https://codeforces.com/contest/1352/problem/D), there is HTML like this:
<p>
Alice and Bob play an interesting and tasty game: they eat candy. Alice will eat candy
<span class="tex-font-style-bf">from left to right</span>
, and Bob —
<span class="tex-font-style-bf">from right to left</span>
. The game ends if all the candies are eaten.
</p>
I want to get the text inside the 'p' tag. For that, I used this Python code:
source_code = raw_html.find('p')  # Find the first 'p' tag in the parsed HTML
text = source_code.get_text('\n')  # Get all the text from the 'p' tag, using '\n' as the separator
text = text.replace("\t", " ")  # Replace tabs with a single space
print(text)
(I am using BeautifulSoup4)
This is the result I was expecting:
Alice and Bob play an interesting and tasty game: they eat candy. Alice will eat candy from left to right, and Bob — from right to left. The game ends if all the candies are eaten.
But my output ended up looking like this:
Alice and Bob play an interesting and tasty game: they eat candy. Alice will eat candy
from left to right
, and Bob —
from right to left
. The game ends if all the candies are eaten.
As far as I can tell, the problem is caused by the 'span' tags inside the 'p' tag. How can I format the output properly? More precisely, how can I get rid of the newlines that appear around the 'span' tags?
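For reference, here is a minimal sketch of one possible post-processing approach, using only the standard library `re` module on the string that `get_text('\n')` returned. The helper name `normalize_text` is hypothetical (it is not part of BeautifulSoup), and the sketch assumes that collapsing every whitespace run and removing the space left before punctuation is acceptable for this text.

```python
import re


def normalize_text(text: str) -> str:
    """Collapse whitespace runs and drop spaces left before punctuation.

    Hypothetical helper for illustration; not a BeautifulSoup function.
    """
    # Any run of spaces, tabs, or newlines becomes a single space.
    collapsed = re.sub(r"\s+", " ", text).strip()
    # Remove the space that ends up before ',' or '.' after a span's text.
    return re.sub(r"\s+([,.;:!?])", r"\1", collapsed)


# The string printed in the question, as produced by get_text('\n').
raw = (
    "Alice and Bob play an interesting and tasty game: they eat candy. "
    "Alice will eat candy\n"
    "from left to right\n"
    ", and Bob —\n"
    "from right to left\n"
    ". The game ends if all the candies are eaten."
)

print(normalize_text(raw))
```

The newlines themselves come from the `'\n'` separator passed to `get_text`, which is placed between every text node, including the ones before and after each 'span'. Calling `get_text()` with no separator may already avoid most of them, depending on how the original HTML is laid out.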