I'm trying to extract some data using Selenium for a pet project of mine. I've already loaded a few pages successfully and got their data, but this one site stops loading every time I test it. Things I have tried:

  • Using geckodriver with Firefox, both headless and non-headless (headed)
  • Using chromedriver with Chrome, both headless and non-headless
  • Checking that pip3 and Selenium are both on the latest stable versions
  • Opening Chrome with a custom user agent
  • Opening Chrome with a random user agent (from the random_user_agent library)
  • Hardcoding waits of up to 30 seconds (time.sleep)
  • Loading the page with requests (didn't work; in hindsight this was silly, since the content I want is rendered by JavaScript)

The URL

My theory is that they're blocking Selenium somehow, maybe this? But I have no way to test it. The issue does not occur when I load the page in a regular browser rather than a Selenium browser instance. My code is below:

from selenium import webdriver

# requirements to wait until specific part of page is open
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException, NoSuchElementException
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--lang=en_US")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option("useAutomationExtension", False)
options.add_argument("disable-infobars")

browser = webdriver.Chrome(options=options)        
delay = 5
browser.get("https://shop.coles.com.au/a/alexandria/product/nutella-spread-chocolate-hazelnut-2620684p")
# this is where the page is not loading & therefore throwing ElementNotFound exception

try:
    price_dollars = WebDriverWait(browser, delay).until(EC.presence_of_element_located((By.CLASS_NAME, "price-dollars")))
    price_cents = browser.find_element(By.CLASS_NAME, "price-cents")

    # combine the dollars and cents strings into a single float price
    fl_price_dollars = float(price_dollars.text)
    fl_price_cents = float(price_cents.text)
    fl_price_concat = fl_price_dollars + fl_price_cents / 100
    print(type(fl_price_concat)) # check this is a float type not string
    print(fl_price_concat)
except TimeoutException:
    print("Timeout1")
except NoSuchElementException:
    print("Element not found")
finally:
    # always quit, even on an unexpected exception, or browser processes will continue to run
    browser.quit()

Page source code that loads up when I open the page using Selenium browser instance:


<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <link rel="shortcut icon" href="about:blank">
</head>
<body>
<script src="/149e9513-01fa-4fb0-aad4-566afd725d1b/2d206a39-8ed7-437e-a3be-862e0f06eea3/j.js"></script>
<script src="/149e9513-01fa-4fb0-aad4-566afd725d1b/2d206a39-8ed7-437e-a3be-862e0f06eea3/f.js"></script>
<script src="/149e9513-01fa-4fb0-aad4-566afd725d1b/2d206a39-8ed7-437e-a3be-862e0f06eea3/fingerprint/script/kpf.js?url=/149e9513-01fa-4fb0-aad4-566afd725d1b/2d206a39-8ed7-437e-a3be-862e0f06eea3/fingerprint&token=c2e6cd9a-e76e-cd51-288d-f604aea52023"></script>
</body>
</html>
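
One way to confirm from code that this blocked page was served, rather than the real product page, is to look for the fingerprinting script in the page source. This is only a minimal sketch: it assumes the blocked page always references a path containing "fingerprint/script/kpf.js", as in the source above, and the function name and sample strings are made up for illustration.

```python
def looks_like_challenge_page(page_source):
    """Heuristic: return True if the HTML looks like the fingerprinting page."""
    # Assumption: this marker comes from the blocked page source shown above
    # and may change if the site updates its protection.
    marker = "fingerprint/script/kpf.js"
    return marker in page_source


# Made-up sample inputs for illustration only.
sample_blocked = '<script src="/x/fingerprint/script/kpf.js?token=abc"></script>'
sample_normal = '<span class="price-dollars">4</span>'

print(looks_like_challenge_page(sample_blocked))  # True
print(looks_like_challenge_page(sample_normal))   # False
```

With Selenium, the same check could be run on browser.page_source after browser.get(...) to decide whether it is worth waiting for the price elements at all.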

EDIT: the following answer worked for me in June 2020.

4 Comments

  • Looks like they're using FingerPrint2, which is blocking you. Even with JS disabled, there seem to be other WAF mechanisms in place. They're actively blocking you from scraping their website. Commented Jun 26, 2020 at 9:38
  • @Lucan thanks for that, how did you determine it was FingerPrint2? Also, do you know how to circumvent this? Commented Jun 26, 2020 at 9:48
  • It's in their sources; it's easier to spot when you're looking at the blocked page. I tried the basics to get around it, much like yourself (UA, Proxy, Options), but I had no success. Commented Jun 26, 2020 at 10:06
  • Looks like this will be a longer road than anticipated... thank you for the heads up, much appreciated! Commented Jun 26, 2020 at 10:13

1 Answer

Try adding this argument: options.add_argument("--disable-blink-features=AutomationControlled")

The key is to stop navigator.webdriver from reporting that the browser is automated. It returns true when Chrome is controlled by WebDriver (which Selenium uses), and a page's scripts can read that value.

With this argument, evaluating navigator.webdriver in JavaScript (you can test it in the dev tools console) no longer returns true. It returns undefined, the same as in a regular Chrome; newer Chrome versions may report false instead.
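
As a rough way to check the effect of the flag from Python, the sketch below launches Chrome with the argument and returns whatever navigator.webdriver evaluates to. It assumes Selenium 4 and a matching chromedriver are installed. The helper names build_stealth_arguments and report_webdriver_flag are hypothetical, and the flag alone may not get past every site's checks.

```python
def build_stealth_arguments():
    """Return the Chrome arguments used in this sketch (hypothetical helper)."""
    return [
        "--lang=en_US",
        # Disables the Blink feature that makes navigator.webdriver report true.
        "--disable-blink-features=AutomationControlled",
    ]


def report_webdriver_flag(url):
    """Open `url` in Chrome and return the value of navigator.webdriver."""
    # Imported here so the helper above can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    for argument in build_stealth_arguments():
        options.add_argument(argument)
    options.add_experimental_option("excludeSwitches", ["enable-automation"])

    browser = webdriver.Chrome(options=options)
    try:
        browser.get(url)
        # A JavaScript undefined comes back to Python as None.
        return browser.execute_script("return navigator.webdriver")
    finally:
        # Always quit so no browser process is left running.
        browser.quit()
```

Calling report_webdriver_flag with the product URL from the question and printing the result should show something other than True if the flag took effect.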

1 Comment

This works wonderfully in my case. Thanks much @matteo84 !
