I am having a hard time figuring out the correct XPath for my web scraping code.

I am trying to scrape several pieces of information from http://financials.morningstar.com/company-profile/c.action?t=AAPL. I have tried several paths; some work and some do not. I am interested in the CIK field under Operation Details.

import requests
from lxml import html

page = requests.get('http://financials.morningstar.com/company-profile/c.action?t=AAPL')
tree = html.fromstring(page.text)


#desc = tree.xpath('//div[@class="r_title"]/span[@class="gry"]/text()')  #works

#desc = tree.xpath('//div[@class="wrapper"]//div[@class="headerwrap"]//div[@class="h_Logo"]//div[@class="h_Logo_row1"]//div[@class="greeter"]/text()')    #works

#desc = tree.xpath('//div[@id="OAS_TopLeft"]//script[@type="text/javascript"]/text()')   #works

desc = tree.xpath('//div[@class="col2"]//div[@id="OperationDetails"]//table[@class="r_table1 r_txt2"]//tbody//tr//th[@class="row_lbl"]/text()')

I can't figure out the last path. I seem to be following the element hierarchy correctly, but I get an empty list.

  • The last element, th, is a table header cell in HTML, so you probably need to change it to td, which is for table data. Commented Oct 14, 2015 at 18:33
  • stackoverflow.com/questions/24163745/… This might be a similar problem to yours; take a look. Commented Oct 14, 2015 at 18:49
  • stackoverflow.com/questions/33110734/… In that question, an error in the HTML such as <a href="#"/></a> causes an empty parse. Commented Oct 14, 2015 at 19:00

1 Answer

The problem is that the Operation Details section is not part of the initial page; it is loaded separately with an additional GET request. Simulate that request in your code while maintaining a web-scraping session:

import requests
from lxml import html


with requests.Session() as session:
    page = session.get('http://financials.morningstar.com/company-profile/c.action?t=AAPL')
    tree = html.fromstring(page.text)

    # get the operational details
    response = session.get("http://financials.morningstar.com/company-profile/component.action", params={
        "component": "OperationDetails",
        "t": "XNAS:AAPL",
        "region": "usa",
        "culture": "en-US",
        "cur": "",
        "_": "1444848178406"
    })

    tree_details = html.fromstring(response.content)
    print(tree_details.xpath('.//th[@class="row_lbl"]//text()'))
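
To reuse the request for other tickers, the parameters can be built by a small helper. This is only a sketch: `build_component_params` is a hypothetical name, and it assumes that `t` has the form `<exchange code>:<ticker>` (as in `XNAS:AAPL` above) and that `_` is a millisecond timestamp added only to defeat caching. Neither assumption is confirmed by the site's documentation, so compare the result with what your browser's developer tools show.

```python
import time


def build_component_params(ticker, exchange="XNAS", component="OperationDetails"):
    """Build query parameters for component.action (hypothetical helper)."""
    return {
        "component": component,
        # Assumption: "t" is "<exchange code>:<ticker>", e.g. "XNAS:AAPL".
        "t": "%s:%s" % (exchange, ticker),
        "region": "usa",
        "culture": "en-US",
        "cur": "",
        # Assumption: "_" is a millisecond timestamp used as a cache buster.
        "_": str(int(time.time() * 1000)),
    }


params = build_component_params("AAPL")
print(params["t"])  # XNAS:AAPL
```

The returned dictionary can be passed as `params=` to `session.get(...)` in place of the literal dictionary.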

Old answer:

Remove tbody from the expression:

//div[@class="col2"]//div[@id="OperationDetails"]//table[@class="r_table1 r_txt2"]//tr//th[@class="row_lbl"]/text()

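tbody is an element that browsers insert into the DOM when they build a table, even when the page source does not contain it. lxml parses the raw source, so an XPath copied from the browser's developer tools may reference a tbody that is not there.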
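
The effect of a tbody step can be checked offline on a small snippet. The markup below is made up for illustration (it is not the real Morningstar response) and is well-formed, so the sketch falls back to the standard library's `xml.etree.ElementTree` when lxml is not installed; both accept the same `findall` path syntax.

```python
try:
    from lxml import etree  # same library family as in the answer
except ImportError:
    import xml.etree.ElementTree as etree  # stdlib fallback, same findall API

# Made-up markup: the source contains no <tbody> element.
SNIPPET = """
<div id="OperationDetails">
  <table class="r_table1 r_txt2">
    <tr><th class="row_lbl">CIK</th><td>value-1</td></tr>
    <tr><th class="row_lbl">Fiscal Year Ends</th><td>value-2</td></tr>
  </table>
</div>
"""

root = etree.fromstring(SNIPPET)

# A path that requires tbody matches nothing in the parsed source.
with_tbody = root.findall(".//table/tbody/tr/th")

# Without the tbody step, the header cells are found.
labels = [th.text for th in root.findall(".//table//tr/th[@class='row_lbl']")]

print(with_tbody)  # []
print(labels)      # ['CIK', 'Fiscal Year Ends']
```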


6 Comments

I still get an empty list. I believe my problem is that there are several tr elements in the table, so I should give tr an index, like //table[@class="r_table1 r_txt2"]//tr[3]//th[@class="row_lbl"]/text(). But I still get an empty list.
@AK9309 the problem is that the operational details are loaded dynamically with an additional GET request to http://financials.morningstar.com/company-profile/component.action.
I understand now. Thank you for taking the time to explain it.
It works for AAPL (Apple) and GOOGL (Google). When I try AA or AB, I get an XMLSyntaxError, even though I know the page exists.
@AK9309 there might be something wrong in the GET parameters. Open the browser developer tools and mimic exactly the parameters you see the browser send. Also, try printing the response or response.status_code to see what comes back.
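
The check suggested in the last comment can be made explicit before the body reaches the parser. This is a minimal sketch: `ensure_parseable` is a hypothetical helper, not part of requests or lxml. One possible cause of the error for AA and AB, not confirmed in this thread, is that the `t` parameter in the answer hardcodes the `XNAS` prefix, which may not match the exchange of those tickers.

```python
def ensure_parseable(status_code, body):
    """Return body unchanged, or raise ValueError if it should not be parsed.

    Hypothetical helper: call it with response.status_code and
    response.content before handing the body to html.fromstring.
    """
    if status_code != 200:
        raise ValueError("unexpected HTTP status: %d" % status_code)
    if not body or not body.strip():
        raise ValueError("empty response body; check the GET parameters")
    return body


# Works with bytes (response.content) and str (response.text) alike.
checked = ensure_parseable(200, b"<table><tr><th>CIK</th></tr></table>")
```

Inside the session block, this would read `tree_details = html.fromstring(ensure_parseable(response.status_code, response.content))`, so a bad request fails with a readable message instead of a parser error.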