
I'm trying to use this code to collect reviews from the Consumer Affairs review site, but I keep getting errors, specifically in the dateElements and jsonData section. Could someone help me fix this code so it works with the site I'm trying to scrape?

from bs4 import BeautifulSoup
import requests
import pandas as pd
import json
print('all imported successfully')

# Initialize an empty dataframe
df = pd.DataFrame()
for x in range(1, 5):
    names = []
    headers = []
    bodies = []
    ratings = []
    published = []
    updated = []
    reported = []

    link = (f'https://www.consumeraffairs.com/online/allure-beauty-box.html?page={x}')
    print (link)
    req = requests.get(link)
    content = req.content
    soup = BeautifulSoup(content, "lxml")
    articles = soup.find_all('div', {'class':'rvw js-rvw'})
    for article in articles:
        names.append(article.find('strong', attrs={'class': 'rvw-aut__inf-nm'}).text.strip())
        try:
            bodies.append(article.find('p', attrs={'class':'rvw-bd'}).text.strip())
        except:
            bodies.append('')

        try:
            ratings.append(article.find('div', attrs={'class':'stars-rtg stars-rtg--sm'}).text.strip())
        except:
            ratings.append('')
        dateElements = article.find('span', attrs={'class':'ca-txt-cpt'}).text.strip()

        jsonData = json.loads(dateElements)
        published.append(jsonData['publishedDate'])
        updated.append(jsonData['updatedDate'])
        reported.append(jsonData['reportedDate'])


    # Create your temporary dataframe of the first iteration, then append that into your "final" dataframe
    temp_df = pd.DataFrame({'User Name': names, 'Body': bodies, 'Rating': ratings, 'Published Date': published, 'Updated Date':updated, 'Reported Date':reported})
    df = df.append(temp_df, sort=False).reset_index(drop=True)

print ('pass1')


df.to_csv('AllureReviews.csv', index=False, encoding='utf-8')
print ('excel done')

This is the error I'm getting

Traceback (most recent call last):
  File "C:/Users/Sara Jitkresorn/PycharmProjects/untitled/venv/Caffairs.py", line 37, in <module>
    jsonData = json.loads(dateElements)
  File "C:\Users\Sara Jitkresorn\AppData\Local\Programs\Python\Python37\lib\json\__init__.py", line 348, in loads
    return _default_decoder.decode(s)
  File "C:\Users\Sara Jitkresorn\AppData\Local\Programs\Python\Python37\lib\json\decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "C:\Users\Sara Jitkresorn\AppData\Local\Programs\Python\Python37\lib\json\decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

2 Answers


In addition to the code above, we can get the ratings and non-duplicated data as below:

from bs4 import BeautifulSoup
import requests
import pandas as pd
print('all imported successfully')

# Initialize an empty dataframe
df = pd.DataFrame()
for x in range(1, 5):
    names = []
    headers = []
    bodies = []
    ratings = []
    published = []
    updated = []
    reported = []
    dateElements = []

    link = (f'https://www.consumeraffairs.com/online/allure-beauty-box.html?page={x}')
    print (link)
    req = requests.get(link)
    content = req.content
    soup = BeautifulSoup(content, "lxml")
    articles = soup.find_all('div', {'class':'rvw js-rvw'})
    for article in articles:
        names.append(article.find('strong', attrs={'class': 'rvw-aut__inf-nm'}).text.strip())
        try:
            bodies.append(article.find('div', attrs={'class':'rvw-bd'}).text.strip())
        except:
            bodies.append('NA')

        try:
            ratings.append(article.find('meta', attrs={'itemprop': 'ratingValue'})['content'])
        except:
            ratings.append('NA')
        dateElements.append(article.find('span', attrs={'class':'ca-txt-cpt'}).text.strip())
    # Create your temporary dataframe of the first iteration, then append that into your "final" dataframe
    temp_df = pd.DataFrame({'User Name': names, 'Body': bodies, 'Rating': ratings, 'Published Date': dateElements})
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    df = pd.concat([df, temp_df], sort=False).reset_index(drop=True)

print(df)

2 Comments

Can you add a quick explanation what you changed?
Sure @petezurich. The except condition previously didn't append anything to the list, so whenever an element was missing the remaining values fell out of alignment; appending 'NA' keeps all the lists the same length. Also, the rating class was picking up the wrong element, so I had to find it using a different attribute. Together these build the right dataframe.

dateElements doesn't contain a string that can be parsed by json.loads() because it is simply plain text, e.g. Original review: Feb. 15, 2020

Change these lines to circumvent this:

try:
    ratings.append(article.find('div', attrs={'class':'stars-rtg stars-rtg--sm'}).text.strip())
except:
    ratings.append('')
dateElements = article.find('span', attrs={'class':'ca-txt-cpt'}).text.strip()

published.append(dateElements)

temp_df = pd.DataFrame({'User Name': names, 'Body': bodies, 'Rating': ratings, 'Published Date': published})
df = pd.concat([df, temp_df], sort=False).reset_index(drop=True)  # DataFrame.append was removed in pandas 2.0

You also have to comment out these two lines:

# updated = []
# reported = []

Then your code runs without errors, although you still don't get data for Body and Rating.

df prints out like this:

    User Name   Body    Rating  Published Date
0   M. M. of Dallas, GA             Original review: Feb. 15, 2020
1   Malinda of Aston, PA            Original review: Sept. 21, 2019
2   Ping of Tarzana, CA             Original review: July 18, 2019
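Since the caption is plain text rather than JSON, you can turn it into a real date instead of storing the whole string. A minimal sketch (not part of the original answer, and assuming the site's abbreviated month style such as "Sept." and "Feb.", which strptime does not handle directly):

```python
from datetime import datetime

# Month lookup for ConsumerAffairs-style abbreviations ("Sept." is not a
# valid %b token, so strptime alone won't work).
MONTHS = {'Jan.': 1, 'Feb.': 2, 'March': 3, 'April': 4, 'May': 5,
          'June': 6, 'July': 7, 'Aug.': 8, 'Sept.': 9, 'Oct.': 10,
          'Nov.': 11, 'Dec.': 12}

def parse_review_date(caption):
    """Turn 'Original review: Feb. 15, 2020' into a datetime."""
    text = caption.split(':', 1)[1].strip()      # 'Feb. 15, 2020'
    month_str, day_str, year_str = text.split()  # ['Feb.', '15,', '2020']
    return datetime(int(year_str), MONTHS[month_str], int(day_str.rstrip(',')))

print(parse_review_date('Original review: Feb. 15, 2020'))  # 2020-02-15 00:00:00
```

The same function handles "Updated review: …" captions, since it only splits on the first colon.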

6 Comments

Hmm. Is it because I got the class wrong for the body and rating that I don't get the data for those?
Body you can fix by changing the p tag to a div in the respective code line. Use this: bodies.append(article.find('div', attrs={'class':'rvw-bd'}).text.strip()). Then you get the body data.
If you scrape with requests you should look in the raw HTML of your site to properly identify the tags, ids and classes and not in the rendered page.
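A quick way to do that check (a sketch, not from the original thread) is to search the raw response text for the class names you plan to select, before writing any BeautifulSoup code:

```python
def classes_present(html, css_classes):
    """Report which CSS class names occur literally in the raw HTML."""
    return {c: c in html for c in css_classes}

# Against the live page this would be (same URL as the question):
#   import requests
#   html = requests.get('https://www.consumeraffairs.com/online/allure-beauty-box.html?page=1').text
#   print(classes_present(html, ['rvw-bd', 'rvw-aut__inf-nm', 'ca-txt-cpt']))
sample = '<div class="rvw-bd"><p>Great box!</p></div>'
print(classes_present(sample, ['rvw-bd', 'stars-rtg']))  # {'rvw-bd': True, 'stars-rtg': False}
```

If a class is missing from the raw HTML, it is probably injected by JavaScript and requests alone will never see it.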
So I was able to get the data for the body now. But the data collected in the CSV contains a lot of duplicates. The review site has 86 reviews, yet I got 630 rows of data. Which part of the code is causing the reviews to be collected more than once?
I suggest you post this as a new question with the revised code you have so far. If you just want to get rid of the duplicates you can use Pandas' drop_duplicates().
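The drop_duplicates() suggestion in a minimal sketch (the rows here are made up to illustrate):

```python
import pandas as pd

# Toy frame where the first and last rows are the same review scraped twice.
df = pd.DataFrame({
    'User Name': ['M. M. of Dallas, GA', 'Malinda of Aston, PA', 'M. M. of Dallas, GA'],
    'Published Date': ['Feb. 15, 2020', 'Sept. 21, 2019', 'Feb. 15, 2020'],
})

deduped = df.drop_duplicates().reset_index(drop=True)
print(len(df), '->', len(deduped))  # 3 -> 2
```

By default drop_duplicates() compares entire rows; pass subset=['User Name', 'Published Date'] to deduplicate on specific columns only.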