
Hi, I am trying to scrape this website. I was originally using bs4, and that was fine for getting certain elements (sector, name, etc.), but I am not able to use it to get the financial data. Below I have copied some of the page source; the "—" should in this case be 0.0663. I believe the content I am trying to scrape is rendered by JavaScript. I have looked around, and none of the solutions I have seen have worked for me. I was wondering if someone could help me crack this.

Although I will be grateful if someone can post some working code, I would also really appreciate it if you could point me in the right direction: what to look for in the HTML that shows me what I need to do and how to get it, kind of thing.

URL: https://www.tradingview.com/symbols/LSE-TSCO/

HTML:

<span class="tv-widget-fundamentals__label apply-overflow-tooltip">
    Return on Equity (TTM)
</span>
<span class="tv-widget-fundamentals__value apply-overflow-tooltip">
    —
</span>

Python Code:

from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager

url = "https://www.tradingview.com/symbols/LSE-TSCO/"
options = webdriver.ChromeOptions()
options.add_argument('headless')
driver = webdriver.Chrome(ChromeDriverManager().install(), options=options)
driver.get(url)
html = driver.page_source
2 Comments

I'm sorry, but we do not have enough information to help you. Please post a minimal reproducible example. Commented Jun 2, 2020 at 19:21
I've edited it, but it's more that I have no clue what to do than the code not working. All I can do at the moment is scrape the page source, which doesn't show me the numbers I need; they come up as "—", so if I scrape, nothing is returned. The data on the site must be loaded some other way, but I don't know how to access that, as I am new to this. Commented Jun 2, 2020 at 19:35

3 Answers


To get the equity value, induce WebDriverWait(), wait for visibility_of_element_located(), and use the XPath below.

driver.get(url)
element = WebDriverWait(driver, 10).until(EC.visibility_of_element_located(
    (By.XPATH, "//span[contains(.,'Return on Equity (TTM)')]/following-sibling::span[1]")))
print(element.text)

You need to import the libraries below.

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
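To see what this relative XPath actually selects, you can try it against a static fragment with lxml, no browser needed. The snippet below is a hypothetical excerpt modelled on the question's HTML, with the rendered value filled in:

```python
from lxml import html as lxml_html

# Hypothetical static snippet modelled on the question's HTML,
# with the rendered value in place of the "—" default
snippet = """
<div>
  <span class="tv-widget-fundamentals__label apply-overflow-tooltip">
      Return on Equity (TTM)
  </span>
  <span class="tv-widget-fundamentals__value apply-overflow-tooltip">
      0.0663
  </span>
</div>
"""

tree = lxml_html.fromstring(snippet)
# Same relative XPath as above: find the label span, then take the
# first following sibling <span>, which holds the value
node = tree.xpath(
    "//span[contains(.,'Return on Equity (TTM)')]/following-sibling::span[1]")[0]
print(node.text.strip())  # 0.0663
```

Because the XPath anchors on the label text rather than the page layout, it keeps working even if the surrounding markup changes.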

5 Comments

@JPWilson: This should be your answer. The answer proposed by 'sleep' uses a very brittle XPath that is prone to errors if the site owner changes the page for any reason.
Thanks for the help. I've implemented your script, but for whatever reason it doesn't seem to wait 10 seconds if it fails to get the data point, and gives "—". So if I run the script 5 times, either I get everything, or the first 3/4 items I'm looking at will be "—". How would you suggest I fix this issue?
I tested the code before posting and it gives the proper value every time. I don't know why you are getting "—" instead of the value. Did you copy my entire code and run it?
I added a time.sleep(2), which seems to fix it. One question I do have is how to make this viable for scraping large amounts of data. This is one page, but if I want to do this for all tickers, Selenium takes ages compared to bs4. Is there a way to speed this up, or a good workaround?
@JPWilson: Since the data is rendered by JavaScript, you need Selenium to get the page source, and then you can use bs4. Bs4 only works on static content, not dynamic content. If you post a new question with your expected output, I can help answer your queries with Selenium and bs4.

You can get the return on equity using XPath:

equity = driver.find_element_by_xpath('/html/body/div[2]/div[4]/div/div/div/div/div/div[2]/div[2]/div[2]/div/div[2]/div[1]/div/div/div[1]/div[3]/div[3]/span[2]').text
print(equity)

4 Comments

Hi, thanks. Would you suggest one method over the other?
Cheers mate. Is XPath literally just following the tags all the way through? Is it just something that comes with experience, or is there a quick way to get through it? Finally, do you have to use Selenium for this, or can you use requests and bs4?
If you're using Google Chrome you can use inspect element, then right-click and hit "Copy Full XPath" and it will do it for you. And you can do this using requests and bs4; it does not need to be Selenium.
Downvote. This is the worst way of working with XPath. Any change to the page and your XPath is guaranteed to break. If you use XPath, use a relative path, not an absolute one like this answer did. For the selector, I would recommend an id over a css_selector, with XPath as a last resort.

The issue here is not whether the element is present, but the time the page takes to load. The page looks very heavy with all those dynamic graphs. Even before the page has fully loaded, the DOM starts to be created and default values take their place.

WebDriverWait with find_element_* works when the element is not yet present but will appear after some time. In your context, the element is present from the start, so adding the wait won't do much. This is also why you get "—": the element is present, holding its default value.

To fix, or at least reduce, the issue, you can add code that waits until the document readyState is 'complete'.

Something like this can be used:

def wait_for_page_ready_state(driver):
    wait = WebDriverWait(driver, 20)

    def _ready_state_script(driver):
        return driver.execute_async_script(
                """
                var callback = arguments[arguments.length - 1]; 
                callback(document.readyState);
                """) == 'complete'
    wait.until(_ready_state_script)

wait_for_page_ready_state(driver)

Then, since you brought bs4 into play, this is where I would use it:

import re
from bs4 import BeautifulSoup

financials = {}
for el in BeautifulSoup(driver.page_source, "lxml").find_all('div', {"class": "tv-widget-fundamentals__row"}):
    try:
        key = re.sub(r'\s+', ' ', el.find('span', {"class": "tv-widget-fundamentals__label "
                                                            "apply-overflow-tooltip"}).text.strip())
        value = re.sub(r'\s+', ' ', el.find('span', {"class": "tv-widget-fundamentals__value"}).text.strip())
        financials[key] = value
    except AttributeError:
        pass

This will give you every value you need from the financial card.
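To see the parse in isolation, the same loop can be run against a static fragment. The snippet below is a hypothetical one-row excerpt of the card, and "html.parser" stands in for "lxml" so only bs4 is needed:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical one-row excerpt of the financial card, for illustration
snippet = """
<div class="tv-widget-fundamentals__row">
    <span class="tv-widget-fundamentals__label apply-overflow-tooltip">
        Return on Equity (TTM)
    </span>
    <span class="tv-widget-fundamentals__value apply-overflow-tooltip">
        0.0663
    </span>
</div>
"""

financials = {}
for el in BeautifulSoup(snippet, "html.parser").find_all(
        'div', {"class": "tv-widget-fundamentals__row"}):
    # Collapse the whitespace the page puts around each label/value
    key = re.sub(r'\s+', ' ', el.find(
        'span', {"class": "tv-widget-fundamentals__label"}).text.strip())
    value = re.sub(r'\s+', ' ', el.find(
        'span', {"class": "tv-widget-fundamentals__value"}).text.strip())
    financials[key] = value

print(financials)  # {'Return on Equity (TTM)': '0.0663'}
```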

You can now print the value you need:

print(financials['Return on Equity (TTM)'])

Output:

'0.0663'

Of course you can do the above with Selenium as well, but I wanted to stick with what you started working with.

Note that this does not guarantee the proper value is always returned. It might, and did in my case, but since you know the default value, you could add a while loop until the default changes.

[EDIT] After running my code in a loop, I was hitting the default value about 1 in 5 times. One way to work around it is to create a method and loop until a threshold is reached. In my findings, ~90% of the values were updated with digits; when it failed with the default value, all the other values were also '—'. One approach is to use a threshold (i.e. 50%) and only return the values once it is reached.

def get_financial_card_values(default_value='—', threshold=.5):
    financials = {}
    while True:
        for el in BeautifulSoup(driver.page_source, "lxml").find_all('div', {"class": "tv-widget-fundamentals__row"}):
            try:
                key = re.sub(r'\s+', ' ', el.find('span', {"class": "tv-widget-fundamentals__label "
                                                                    "apply-overflow-tooltip"}).text.strip())
                value = re.sub(r'\s+', ' ', el.find('span', {"class": "tv-widget-fundamentals__value"}).text.strip())
                financials[key] = value
            except AttributeError:
                pass
        updated_values = [value for value in financials.values() if value != default_value]
        if financials and len(updated_values) / len(financials) > threshold:
            return financials

With this method, I was always able to retrieve the value you are expecting. Note that if the values never change (a site issue), you will loop forever; you might want to use a timer instead of while True. I just want to point this out, but I don't think it will happen.
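The threshold check itself is plain Python and can be sketched in isolation (the function name is my own):

```python
def enough_values_loaded(financials, default_value='—', threshold=0.5):
    # True once more than `threshold` of the card's values differ from
    # the page's default placeholder
    if not financials:
        return False
    updated = [v for v in financials.values() if v != default_value]
    return len(updated) / len(financials) > threshold

print(enough_values_loaded({'ROE': '0.0663', 'EPS': '—'}))                  # False (1/2 is not > 0.5)
print(enough_values_loaded({'ROE': '0.0663', 'EPS': '17.82', 'P/E': '—'}))  # True  (2/3 > 0.5)
```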

2 Comments

Thanks for that detailed response. I have a feeling that a threshold may not work, as TradingView seems a bit random in terms of which companies it has complete data for. For example, a logistics company I saw called Eddie Stobart had barely any data, which would mess up the threshold. So I may have to use yfinance, for which I already have a decent scraper, and I guess I can use the API too if needed.
This is one example of how to solve it; you can instead key on a value that must always change or be a digit. In your case you would use Return on Equity (TTM) and ensure that one changes.
