
I'm trying to use this code to collect reviews from the Consumer Affairs review site, but I keep getting errors, specifically in the dateElements and jsonData section. Could someone help me fix this code so it works with the site I'm trying to scrape?

from bs4 import BeautifulSoup
import requests
import pandas as pd
import json
print('all imported successfully')

# Initialize an empty dataframe
df = pd.DataFrame()
for x in range(1, 5):
    names = []
    headers = []
    bodies = []
    ratings = []
    published = []
    updated = []
    reported = []

    link = (f'https://www.consumeraffairs.com/online/allure-beauty-box.html?page={x}')
    print (link)
    req = requests.get(link)
    content = req.content
    soup = BeautifulSoup(content, "lxml")
    articles = soup.find_all('div', {'class':'rvw js-rvw'})
    for article in articles:
        names.append(article.find('strong', attrs={'class': 'rvw-aut__inf-nm'}).text.strip())
        try:
            bodies.append(article.find('p', attrs={'class':'rvw-bd'}).text.strip())
        except:
            bodies.append('')

        try:
            ratings.append(article.find('div', attrs={'class':'stars-rtg stars-rtg--sm'}).text.strip())
        except:
            ratings.append('')
        dateElements = article.find('span', attrs={'class':'ca-txt-cpt'}).text.strip()

        jsonData = json.loads(dateElements)
        published.append(jsonData['publishedDate'])
        updated.append(jsonData['updatedDate'])
        reported.append(jsonData['reportedDate'])


    # Create your temporary dataframe of the first iteration, then append that into your "final" dataframe
    temp_df = pd.DataFrame({'User Name': names, 'Body': bodies, 'Rating': ratings, 'Published Date': published, 'Updated Date':updated, 'Reported Date':reported})
    df = df.append(temp_df, sort=False).reset_index(drop=True)

print ('pass1')


df.to_csv('AllureReviews.csv', index=False, encoding='utf-8')
print ('excel done')

This is the error I'm getting

Traceback (most recent call last):
  File "C:/Users/Sara Jitkresorn/PycharmProjects/untitled/venv/Caffairs.py", line 37, in <module>
    jsonData = json.loads(dateElements)
  File "C:\Users\Sara Jitkresorn\AppData\Local\Programs\Python\Python37\lib\json\__init__.py", line 348, in loads
    return _default_decoder.decode(s)
  File "C:\Users\Sara Jitkresorn\AppData\Local\Programs\Python\Python37\lib\json\decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "C:\Users\Sara Jitkresorn\AppData\Local\Programs\Python\Python37\lib\json\decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

2 Answers


In addition to the code above, we can get the ratings and non-duplicated data as below:

from bs4 import BeautifulSoup
import requests
import pandas as pd
print('all imported successfully')

# Initialize an empty dataframe
df = pd.DataFrame()
for x in range(1, 5):
    names = []
    headers = []
    bodies = []
    ratings = []
    published = []
    updated = []
    reported = []
    dateElements = []

    link = (f'https://www.consumeraffairs.com/online/allure-beauty-box.html?page={x}')
    print (link)
    req = requests.get(link)
    content = req.content
    soup = BeautifulSoup(content, "lxml")
    articles = soup.find_all('div', {'class':'rvw js-rvw'})
    for article in articles:
        names.append(article.find('strong', attrs={'class': 'rvw-aut__inf-nm'}).text.strip())
        try:
            bodies.append(article.find('div', attrs={'class':'rvw-bd'}).text.strip())
        except:
            bodies.append('NA')

        try:
            ratings.append(article.find('meta', attrs={'itemprop': 'ratingValue'})['content'])
        except:
            ratings.append('NA')
        dateElements.append(article.find('span', attrs={'class':'ca-txt-cpt'}).text.strip())
    # Create your temporary dataframe of the first iteration, then append that into your "final" dataframe
    temp_df = pd.DataFrame({'User Name': names, 'Body': bodies, 'Rating': ratings, 'Published Date': dateElements})
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    df = pd.concat([df, temp_df], sort=False).reset_index(drop=True)

print(df)

2 Comments

Can you add a quick explanation what you changed?
Sure @petezurich. The except condition previously didn't append anything to the list, so whenever an element was missing the remaining values fell out of alignment; appending 'NA' keeps all the lists the same length. Also, the rating class was picking up the wrong element, so I had to find it using a different attribute. Together these build the right dataframe.

dateElements doesn't contain a string that can be parsed by json.loads() because it is simply plain text, e.g. Original review: Feb. 15, 2020

Change these lines to circumvent this:

try:
    ratings.append(article.find('div', attrs={'class':'stars-rtg stars-rtg--sm'}).text.strip())
except:
    ratings.append('')
dateElements = article.find('span', attrs={'class':'ca-txt-cpt'}).text.strip()

published.append(dateElements)

temp_df = pd.DataFrame({'User Name': names, 'Body': bodies, 'Rating': ratings, 'Published Date': published})
df = pd.concat([df, temp_df], sort=False).reset_index(drop=True)  # DataFrame.append was removed in pandas 2.0

You also have to comment out these two lines:

# updated = []
# reported = []

Then your code runs without errors, although you still don't get data for Body and Rating.

df prints out like this:

    User Name   Body    Rating  Published Date
0   M. M. of Dallas, GA             Original review: Feb. 15, 2020
1   Malinda of Aston, PA            Original review: Sept. 21, 2019
2   Ping of Tarzana, CA             Original review: July 18, 2019
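Since the caption is plain text rather than JSON, you can turn it into a real date instead of storing the whole string. A minimal sketch (not part of the original answer, and assuming the site's abbreviated month style such as "Sept." and "Feb.", which strptime does not handle directly):

```python
from datetime import datetime

# Month lookup for ConsumerAffairs-style abbreviations ("Sept." is not a
# valid %b token, so strptime alone won't work).
MONTHS = {'Jan.': 1, 'Feb.': 2, 'March': 3, 'April': 4, 'May': 5,
          'June': 6, 'July': 7, 'Aug.': 8, 'Sept.': 9, 'Oct.': 10,
          'Nov.': 11, 'Dec.': 12}

def parse_review_date(caption):
    """Turn 'Original review: Feb. 15, 2020' into a datetime."""
    text = caption.split(':', 1)[1].strip()      # 'Feb. 15, 2020'
    month_str, day_str, year_str = text.split()  # ['Feb.', '15,', '2020']
    return datetime(int(year_str), MONTHS[month_str], int(day_str.rstrip(',')))

print(parse_review_date('Original review: Feb. 15, 2020'))  # 2020-02-15 00:00:00
```

The same function handles "Updated review: …" captions, since it only splits on the first colon.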

6 Comments

Hmm. Is it because I got the class wrong for the body and rating that I don't get the data for those?
Body you can fix by changing the p tag to a div in the respective code line. Use this: bodies.append(article.find('div', attrs={'class':'rvw-bd'}).text.strip()). Then you get the body data.
If you scrape with requests you should look in the raw HTML of your site to properly identify the tags, ids and classes and not in the rendered page.
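A quick way to do that check (a sketch, not from the original thread) is to search the raw response text for the class names you plan to select, before writing any BeautifulSoup code:

```python
def classes_present(html, css_classes):
    """Report which CSS class names occur literally in the raw HTML."""
    return {c: c in html for c in css_classes}

# Against the live page this would be (same URL as the question):
#   import requests
#   html = requests.get('https://www.consumeraffairs.com/online/allure-beauty-box.html?page=1').text
#   print(classes_present(html, ['rvw-bd', 'rvw-aut__inf-nm', 'ca-txt-cpt']))
sample = '<div class="rvw-bd"><p>Great box!</p></div>'
print(classes_present(sample, ['rvw-bd', 'stars-rtg']))  # {'rvw-bd': True, 'stars-rtg': False}
```

If a class is missing from the raw HTML, it is probably injected by JavaScript and requests alone will never see it.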
So I was able to get the data for the body now. But the data collected in the CSV contains a lot of duplicates. The review site has 86 reviews, yet I got 630 rows of data. Which part of the code is causing the reviews to be collected more than once?
I suggest you post this as a new question with the revised code you have so far. If you just want to get rid of the duplicates you can use Pandas' drop_duplicates().
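The drop_duplicates() suggestion in a minimal sketch (the rows here are made up to illustrate):

```python
import pandas as pd

# Toy frame where the first and last rows are the same review scraped twice.
df = pd.DataFrame({
    'User Name': ['M. M. of Dallas, GA', 'Malinda of Aston, PA', 'M. M. of Dallas, GA'],
    'Published Date': ['Feb. 15, 2020', 'Sept. 21, 2019', 'Feb. 15, 2020'],
})

deduped = df.drop_duplicates().reset_index(drop=True)
print(len(df), '->', len(deduped))  # 3 -> 2
```

By default drop_duplicates() compares entire rows; pass subset=['User Name', 'Published Date'] to deduplicate on specific columns only.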