Web scraping: parsing nested JSON and creating a list

Question

I'm trying to scrape an e-commerce website, but the page is dynamic. The HTML source contains a script that holds the product data as JSON.

My code is:

from bs4 import BeautifulSoup, SoupStrainer
import requests
import json

url = "https://www.lazada.com.ph/chuwi-pilipinas/?q=All-Products&langFlag=en&from=wangpu&lang=en&pageTypeId=2"

page = requests.get(url)    
data = page.text
soup = BeautifulSoup(data,'html.parser')


scripts = soup.find_all('script')

jsonObj = None
for script in scripts:
    if 'window.pageData = ' in script.text:
        jsonStr = script.text
        jsonStr = jsonStr.split('window.pageData = ')[1]
        jsonObj = json.loads(jsonStr)
        
products = jsonObj['mods']['listItems']

for item in products:
    print (item['productUrl'])

The result is:

PS C:\Users\nate\Documents\Python\LazadaScapper> & "C:/Program Files/Python39/python.exe" c:/Users/nate/Documents/Python/LazadaScapper/LazadaScraper3.py
Traceback (most recent call last):
  File "c:\Users\nate\Documents\Python\LazadaScapper\LazadaScraper3.py", line 21, in <module>
    products = jsonObj['mods']['listItems']
TypeError: 'NoneType' object is not subscriptable
PS C:\Users\nate\Documents\Python\LazadaScapper>

I did some research, and it seems the for loop never finds a match, so jsonObj is still None when products is assigned.

This is related to a thread posted two years ago, but the solution there no longer works.

I'm new to Python and still learning. I hope you can help me.

Andrej Kesely · Accepted Answer · 2021-08-23 19:03:27Z


The issue is that BeautifulSoup doesn't expose the contents of a <script> tag through .text; you have to use .contents instead (the element's type is bs4.element.Script):

from bs4 import BeautifulSoup, SoupStrainer
import requests
import json

url = "https://www.lazada.com.ph/chuwi-pilipinas/?q=All-Products&langFlag=en&from=wangpu&lang=en&pageTypeId=2"

page = requests.get(url)
data = page.text
soup = BeautifulSoup(data, "html.parser")


scripts = soup.find_all("script")

jsonObj = None
for script in scripts:
    if script.contents and "window.pageData = " in script.contents[0]:
        jsonStr = script.contents[0]
        jsonStr = jsonStr.split("window.pageData = ")[1].strip().strip(";")
        jsonObj = json.loads(jsonStr)

products = jsonObj["mods"]["listItems"]
for item in products:
    print(item["productUrl"])

Prints:

//www.lazada.com.ph/products/chuwi-hi10x-2-in-1-tablet-with-detachable-keyboard-and-stylus-i2197648497-s9878152829.html?mp=1
//www.lazada.com.ph/products/chuwi-herobook-pro-intel-celeron-windows-10-home-i2194930372-s9864035095.html?mp=1
//www.lazada.com.ph/products/chuwi-mijabook-intel-celeron-n3450-3k-display-i2194877054-s9863142699.html?mp=1
//www.lazada.com.ph/products/chuwi-aerobook-pro-intel-core-m3-windows-10-home-i2189380140-s9832528924.html?mp=1
//www.lazada.com.ph/products/chuwi-gemibook-intel-celeron-windows-10-home-i2189593108-s9833799252.html?mp=1
//www.lazada.com.ph/products/chuwi-corebook-pro-intel-core-i3-windows-10-home-i2189120736-s9831912160.html?mp=1
//www.lazada.com.ph/products/chuwi-corebox-pro-intel-core-i3-i2206581951-s9920301744.html?mp=1
//www.lazada.com.ph/products/chuwi-hi-dock-4-ports-usb-charger-i2234845803-s10064267033.html?mp=1
//www.lazada.com.ph/products/chuwi-herobox-mini-pc-intel-celeron-n4100-i2206416268-s9919983007.html?mp=1
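The printed URLs are protocol-relative (they start with `//`), so they may need a scheme before they can be requested. Here is a minimal sketch of collecting them into a list of absolute URLs with the standard library's `urljoin`. `BASE_URL` and the two-item `products` list are assumptions for illustration; in the real script, `products` would be `jsonObj["mods"]["listItems"]`.

```python
from urllib.parse import urljoin

# Assumed base URL; urljoin takes only the scheme from it for "//host/path" inputs.
BASE_URL = "https://www.lazada.com.ph/"

# Stand-in for jsonObj["mods"]["listItems"], using two URLs from the output above.
products = [
    {"productUrl": "//www.lazada.com.ph/products/chuwi-hi10x-2-in-1-tablet-with-detachable-keyboard-and-stylus-i2197648497-s9878152829.html?mp=1"},
    {"productUrl": "//www.lazada.com.ph/products/chuwi-herobook-pro-intel-celeron-windows-10-home-i2194930372-s9864035095.html?mp=1"},
]

# Build the list of absolute product URLs.
product_urls = [urljoin(BASE_URL, item["productUrl"]) for item in products]

for url in product_urls:
    print(url)
```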


2 Comments

Tim Roberts Over a year ago

I was just about to post the exact same code. One optimization you can do is add a break after creating jsonObj. Once he's found it, he doesn't need to search any more. Probably that should be in a function that returns at that point.

Nate Over a year ago

Thank you Andrej it worked! Thank you Tim for the optimization
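Tim Roberts's suggestion (stop at the first match, ideally inside a function) could look roughly like the sketch below. The function name `extract_page_data`, the `MARKER` constant, and the sample script bodies are hypothetical. The function takes plain strings so that it runs without a network request; with BeautifulSoup, one would presumably pass something like `(s.contents[0] for s in soup.find_all("script") if s.contents)`.

```python
import json
from typing import Iterable, Optional

MARKER = "window.pageData = "  # same marker the answer searches for


def extract_page_data(script_bodies: Iterable[str]) -> Optional[dict]:
    """Return the parsed JSON from the first script body containing MARKER, or None."""
    for body in script_bodies:
        if MARKER in body:
            # Keep what follows the marker, then drop whitespace and the trailing ';'.
            json_str = body.split(MARKER, 1)[1].strip().strip(";")
            return json.loads(json_str)  # returning here ends the search early
    return None


# Made-up script bodies standing in for the contents of the page's <script> tags.
sample_bodies = [
    "var unrelated = 1;",
    'window.pageData = {"mods": {"listItems": [{"productUrl": "//example.com/a.html"}]}};',
]

page_data = extract_page_data(sample_bodies)
if page_data is None:
    # Avoids the "'NoneType' object is not subscriptable" error from the question.
    print("window.pageData not found")
else:
    for item in page_data["mods"]["listItems"]:
        print(item["productUrl"])
```

Returning `None` when nothing matches lets the caller check for a missing marker explicitly instead of failing on the subscript.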
