Web scraping, Python and BeautifulSoup

Question

I want to get a single paragraph from a site. So far I fetch the page and strip out all the HTML tags to get its text, and I'd like to know whether it's possible to get one particular paragraph out of all the text that is returned.

Here's my code:

import requests
from bs4 import BeautifulSoup

response = requests.get("https://en.wikipedia.org/wiki/Aras_(river)")
txt = response.content

soup = BeautifulSoup(txt,'lxml')
filtered = soup.get_text()
print(filtered)

Here's part of the text it printed out:

>>>>Basin


    Main source
    Erzurum Province, Turkey


    River mouth
    Kura river


    Physical characteristics


    Length
    1,072 km (666 mi)


    The Aras or Araxes is a river in and along the countries of Turkey,     
    Armenia, Azerbaijan, and Iran. It drains the south side of the Lesser 
    Caucasus Mountains and then joins the Kura River which drains the north 
    side of those mountains. Its total length is 1,072 kilometres (666 mi). 
    Given its length and a basin that covers an area of 102,000 square 
    kilometres (39,000 sq mi), it is one of the largest rivers of the 
    Caucasus.



    Contents


    1 Names
    2 Description
    3 Etymology and history
    4 Iğdır Aras Valley Bird Paradise
    5 Gallery
    6 See also
    7 Footnotes

And I only want to get this paragraph:

    The Aras or Araxes is a river in and along the countries of Turkey,     
    Armenia, Azerbaijan, and Iran. It drains the south side of the Lesser 
    Caucasus Mountains and then joins the Kura River which drains the north 
    side of those mountains. Its total length is 1,072 kilometres (666 mi). 
    Given its length and a basin that covers an area of 102,000 square 
    kilometres (39,000 sq mi), it is one of the largest rivers of the 
    Caucasus.

Is it possible to extract just this paragraph?

You should read up on the BeautifulSoup documentation a bit more. You can supply tag names, class names, and CSS selectors to specify exactly which element you want to retrieve data from.
– JosephGarrone, Commented Jan 5, 2017 at 3:28

宏杰李 · Accepted Answer · 2017-01-05 03:33:15Z

soup = BeautifulSoup(txt,'lxml')
filtered = soup.p.get_text() # get the first p tag.
print(filtered)

Output:

The Aras or Araxes is a river in and along the countries of Turkey, Armenia, Azerbaijan, and Iran. It drains the south side of the Lesser Caucasus Mountains and then joins the Kura River which drains the north side of those mountains. Its total length is 1,072 kilometres (666 mi). Given its length and a basin that covers an area of 102,000 square kilometres (39,000 sq mi), it is one of the largest rivers of the Caucasus.
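
Note that soup.p returns the first p tag in the whole document, so the result depends on the page layout. A minimal sketch of a more defensive variant follows, assuming the paragraph sits inside the element with id="mw-content-text" (the id used in the XPath answer). The sample HTML and the helper name first_paragraph are made up for illustration and are not part of the original answer.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the downloaded page.
SAMPLE_HTML = """
<html><body>
  <table class="infobox"><tr><td>Length</td><td>1,072 km (666 mi)</td></tr></table>
  <div id="mw-content-text">
    <p class="mw-empty-elt"></p>
    <p>The Aras or <b>Araxes</b> is a river.</p>
    <p>Second paragraph.</p>
  </div>
</body></html>
"""


def first_paragraph(html, container_id="mw-content-text"):
    """Return the text of the first non-empty <p> inside the container, or None."""
    soup = BeautifulSoup(html, "html.parser")
    # CSS selector: every <p> that is a descendant of the element with this id.
    for p in soup.select(f"#{container_id} p"):
        text = p.get_text().strip()
        if text:  # skip empty placeholder paragraphs
            return text
    return None


print(first_paragraph(SAMPLE_HTML))
```

With the real page, you would pass response.content instead of SAMPLE_HTML. Skipping empty paragraphs is a guess at what a live page may contain, so check the actual markup before relying on it.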

bman · Accepted Answer · 2017-01-05 03:39:04Z

Use XPath instead! It is much easier and more accurate, and it was designed specifically for these use cases. Unfortunately, BeautifulSoup does not support XPath directly; you need to use the lxml package instead:

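# Python 3: urllib2 was merged into urllib.request
from urllib.request import urlopen
from lxml import etree

response = urlopen("https://en.wikipedia.org/wiki/Aras_(river)")
parser = etree.HTMLParser()
tree = etree.parse(response, parser)
# string(...) returns the text content of the first matching element
print(tree.xpath('string(//*[@id="mw-content-text"]/p[1])'))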

Explanation on XPath:

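// searches the whole document, matching nodes at any depth (not just the root element).

* matches an element with any tag name.

[@id="mw-content-text"] is a condition (predicate): it keeps only elements whose id attribute equals mw-content-text.

/p[1] selects the first p element that is a direct child of that container.

string(...) is an XPath function that returns the text content of the first matching element, including the text of its nested tags.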
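
The pieces of the expression can be tried one at a time on a small inline document, with no network access. This is only a sketch: SAMPLE_HTML is a made-up stand-in for the real page, and etree.fromstring is used here in place of etree.parse because the input is a string.

```python
from lxml import etree

# Hypothetical, simplified stand-in for the downloaded page.
SAMPLE_HTML = """
<html><body>
  <p>Paragraph outside the container.</p>
  <div id="mw-content-text">
    <p>The Aras or <b>Araxes</b> is a river.</p>
    <p>Second paragraph.</p>
  </div>
</body></html>
"""

tree = etree.fromstring(SAMPLE_HTML, etree.HTMLParser())

# //*[@id="..."] finds the element with that id anywhere in the document.
container = tree.xpath('//*[@id="mw-content-text"]')[0]
print(container.tag)

# /p[1] narrows the match to the first <p> that is a direct child of the container.
first_p = tree.xpath('//*[@id="mw-content-text"]/p[1]')[0]
print(first_p.tag)

# string(...) turns that element, nested tags included, into plain text.
text = tree.xpath('string(//*[@id="mw-content-text"]/p[1])')
print(text)
```

The paragraph outside the container is ignored because the path only looks at direct children of the element with the matching id.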

By the way, if you use Google Chrome or Firefox, you can test the XPath expression in DevTools using the $x function:

$x('string(//*[@id="mw-content-text"]/p[1])')
