I'm trying to scrape an e-commerce website, but the page is dynamic. The HTML source contains a script that holds the product data as JSON.

My code is:

from bs4 import BeautifulSoup, SoupStrainer
import requests
import json

url = "https://www.lazada.com.ph/chuwi-pilipinas/?q=All-Products&langFlag=en&from=wangpu&lang=en&pageTypeId=2"

page = requests.get(url)    
data = page.text
soup = BeautifulSoup(data,'html.parser')


scripts = soup.find_all('script')

jsonObj = None
for script in scripts:
    if 'window.pageData = ' in script.text:
        jsonStr = script.text
        jsonStr = jsonStr.split('window.pageData = ')[1]
        jsonObj = json.loads(jsonStr)
        
products = jsonObj['mods']['listItems']

for item in products:
    print (item['productUrl'])

The result is:

PS C:\Users\nate\Documents\Python\LazadaScapper> & "C:/Program Files/Python39/python.exe" c:/Users/nate/Documents/Python/LazadaScapper/LazadaScraper3.py
Traceback (most recent call last):
  File "c:\Users\nate\Documents\Python\LazadaScapper\LazadaScraper3.py", line 21, in <module>
    products = jsonObj['mods']['listItems']
TypeError: 'NoneType' object is not subscriptable
PS C:\Users\nate\Documents\Python\LazadaScapper> 

I did some research, and it seems the check inside the for loop never matches, so jsonObj is still None when I try to read the products from it.

This is related to a thread that was posted 2 years ago, but the solution there no longer works.

I'm new to Python and still learning, so I hope you can help me.

1 Answer

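The issue is that BeautifulSoup doesn't return the content of a <script> tag through .text here, so you have to read it from .contents instead (the string inside is of type bs4.element.Script). The trailing semicolon after the JSON also has to be stripped before the string is passed to json.loads: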

from bs4 import BeautifulSoup
import requests
import json

url = "https://www.lazada.com.ph/chuwi-pilipinas/?q=All-Products&langFlag=en&from=wangpu&lang=en&pageTypeId=2"

page = requests.get(url)
data = page.text
soup = BeautifulSoup(data, "html.parser")

scripts = soup.find_all("script")

jsonObj = None
for script in scripts:
    # .contents holds the script body; it is empty for tags such as <script src=...>
    if script.contents and "window.pageData = " in script.contents[0]:
        jsonStr = script.contents[0]
        # keep what follows the assignment and drop the trailing ";"
        jsonStr = jsonStr.split("window.pageData = ")[1].strip().strip(";")
        jsonObj = json.loads(jsonStr)

products = jsonObj["mods"]["listItems"]
for item in products:
    print(item["productUrl"])

Prints:

//www.lazada.com.ph/products/chuwi-hi10x-2-in-1-tablet-with-detachable-keyboard-and-stylus-i2197648497-s9878152829.html?mp=1
//www.lazada.com.ph/products/chuwi-herobook-pro-intel-celeron-windows-10-home-i2194930372-s9864035095.html?mp=1
//www.lazada.com.ph/products/chuwi-mijabook-intel-celeron-n3450-3k-display-i2194877054-s9863142699.html?mp=1
//www.lazada.com.ph/products/chuwi-aerobook-pro-intel-core-m3-windows-10-home-i2189380140-s9832528924.html?mp=1
//www.lazada.com.ph/products/chuwi-gemibook-intel-celeron-windows-10-home-i2189593108-s9833799252.html?mp=1
//www.lazada.com.ph/products/chuwi-corebook-pro-intel-core-i3-windows-10-home-i2189120736-s9831912160.html?mp=1
//www.lazada.com.ph/products/chuwi-corebox-pro-intel-core-i3-i2206581951-s9920301744.html?mp=1
//www.lazada.com.ph/products/chuwi-hi-dock-4-ports-usb-charger-i2234845803-s10064267033.html?mp=1
//www.lazada.com.ph/products/chuwi-herobox-mini-pc-intel-celeron-n4100-i2206416268-s9919983007.html?mp=1
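As an alternative to stripping the semicolon by hand, the JSON can be read straight out of the raw page text with the standard library's json.JSONDecoder.raw_decode, which parses one JSON value and ignores whatever follows it. This is only a sketch and not part of the code above: the helper name extract_page_data and the sample HTML string are made up for illustration, and it assumes the text after the marker is valid JSON rather than a looser JavaScript object literal.

```python
import json

MARKER = "window.pageData = "  # marker string taken from the question


def extract_page_data(html, marker=MARKER):
    """Return the JSON value assigned after `marker`, or None if it is absent."""
    start = html.find(marker)
    if start == -1:
        return None
    idx = start + len(marker)
    # raw_decode does not skip leading whitespace, so do it here
    while idx < len(html) and html[idx].isspace():
        idx += 1
    # Parse exactly one JSON value; anything after it (";", "</script>") is ignored
    obj, _end = json.JSONDecoder().raw_decode(html, idx)
    return obj


# Hypothetical stand-in for page.text, shaped like the data in the question
sample = (
    '<script>window.pageData = {"mods": {"listItems": '
    '[{"productUrl": "//example.com/a.html"}]}};</script>'
)
data = extract_page_data(sample)
urls = [item["productUrl"] for item in data["mods"]["listItems"]]
print(urls)
```

With the real page, page.text would be passed in place of sample. A None result means the marker was not found, which is worth checking before indexing into the result.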
2 Comments

I was just about to post the exact same code. One optimization you can do is add a break after creating jsonObj. Once it has been found, there's no need to keep searching. It would probably be best to put that in a function that returns at that point.
Thank you, Andrej, it worked! Thank you, Tim, for the optimization.
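
The suggestion in the first comment could be sketched roughly as follows: the search moves into a function that returns as soon as the first matching script is parsed. The name find_page_data and the sample strings are hypothetical. To keep the sketch runnable without a network request, the function takes plain script bodies, which in the answer's code would come from script.contents[0] for each script that has contents.

```python
import json


def find_page_data(script_bodies, marker="window.pageData = "):
    """Parse and return the JSON from the first body containing `marker`."""
    for body in script_bodies:
        if marker in body:
            # Keep what follows the assignment and drop the trailing ";"
            json_str = body.split(marker, 1)[1].strip().rstrip(";")
            return json.loads(json_str)  # stop at the first match
    return None  # no script contained the marker


# Hypothetical script bodies standing in for the real page
bodies = [
    "var a = 1;",
    'window.pageData = {"mods": {"listItems": []}};',
    'window.pageData = {"mods": null};',
]
data = find_page_data(bodies)
print(data)
```

Returning None when nothing matches lets the caller test for that case explicitly instead of hitting the TypeError from the question.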
