Properly format web-scraped text in Python

Question

In one part of a page I am scraping (https://codeforces.com/contest/1352/problem/D), there is HTML like this:

<p>
Alice and Bob play an interesting and tasty game: they eat candy.       Alice will eat candy 
    <span class="tex-font-style-bf">from left to       right</span>
, and Bob — 
    <span class="tex-font-style-bf">from right         to left</span>
. The game ends if all the candies are eaten.
</p>

I want to get the text inside the 'p' tag. For that, I used this Python code:

source_code = raw_html.find('p')  # Find the first 'p' tag in the parsed HTML
text = source_code.get_text('\n')  # Get all the text inside it, joining the pieces with newlines
text = text.replace("    ", " ")  # Replace each run of four spaces with a single space
print(text)

(I am using BeautifulSoup4)

This is the result I was expecting:

Alice and Bob play an interesting and tasty game: they eat candy. Alice will eat candy from left to right, and Bob — from right to left. The game ends if all the candies are eaten.

But my output ended up looking like this:

Alice and Bob play an interesting and tasty game: they eat candy. Alice will eat candy
from left to right
, and Bob —
from right to left
. The game ends if all the candies are eaten.

As far as I can tell, the problem is caused by the 'span' tags inside the 'p' tag. How can I format the text properly? More precisely, how can I get rid of the newlines that appear around the span tags?

Saykat · Accepted Answer · 2020-05-22 05:11:30Z


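It's not the most elegant solution, but you can get there with a list comprehension and some string manipulation: remove the newlines, split on spaces, drop the empty pieces, and join what is left with single spaces: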

final_text = ' '.join([item for item in source_code.text.replace('\n', '').split(' ') if item])
print(final_text)
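
For anyone who wants to try the idea without fetching the page, here is a minimal sketch of the same cleanup using only the standard library. The `sample` string is my assumption of roughly what `source_code.get_text()` (called with no separator) returns for the paragraph in the question, and `collapse_whitespace` is a hypothetical helper name, not part of BeautifulSoup:

```python
import re


def collapse_whitespace(text: str) -> str:
    """Remove line breaks, then squeeze runs of whitespace to one space."""
    # Drop the newlines first so that "right\n, and" becomes "right, and"
    # instead of "right , and".
    without_newlines = text.replace("\n", "")
    # Collapse any remaining run of spaces or tabs and trim both ends.
    return re.sub(r"\s+", " ", without_newlines).strip()


# Assumed stand-in for the text extracted from the 'p' tag in the question.
sample = (
    "\nAlice and Bob play an interesting and tasty game: they eat candy."
    "       Alice will eat candy \n    from left to       right"
    "\n, and Bob — \n    from right         to left"
    "\n. The game ends if all the candies are eaten.\n"
)

cleaned = collapse_whitespace(sample)
print(cleaned)
```

One caveat: deleting newlines outright would glue two words together if a line break were the only thing between them. That does not happen in this markup, because every line break here sits next to a space or a punctuation mark, but for other pages it may be safer to replace newlines with a space and tidy the punctuation separately.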

Answered May 22, 2020 at 2:31 by Jack Fleeting
Edited May 22, 2020 at 5:11 by Saykat
