How to scrape data off a page loaded via JavaScript

Question

I want to scrape the comments from this page using BeautifulSoup: https://www.x....s.com/video_id/the-suburl

The comments are loaded on click via JavaScript. They are paginated, and each page of comments also loads on click. I want to fetch all comments and, for each one, get the poster's profile URL, the comment text, the number of likes, the number of dislikes, and the time posted (as stated on the page).

The result can be a list of dictionaries.

How do I go about this?

shekwo · Accepted Answer · 2020-07-23 05:28:59Z

This script will print all comments found on the page:

import requests
from bs4 import BeautifulSoup


url = 'https://www.x......com/video_id/gggjggjj/'
# The video id is the last path segment of the page URL.
video_id = url.rsplit('/', maxsplit=2)[-2].replace('video', '')

# The comments come from a separate endpoint that returns JSON.
u = 'https://www.x......com/threads/video/ggggjggl/{video_id}/0/0'.format(video_id=video_id)
# Post load_all=1 so that the response contains all comments.
comments = requests.post(u, data={'load_all': 1}).json()

for id_ in comments['posts']['ids']:
    post = comments['posts']['posts'][id_]
    print(post['date'])
    print(post['name'])
    print(post['url'])
    # The message is an HTML fragment; keep only its text.
    print(BeautifulSoup(post['message'], 'html.parser').get_text())
    # ...etc.
    print('-' * 80)
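
Since the question asks for a list of dictionaries, here is a minimal sketch of collecting the same fields instead of printing them. It assumes the payload has the shape the script above reads (`posts.ids` plus `posts.posts[id]` with `date`, `name`, `url` and `message`) and that `url` is the poster's profile URL. The sample data is made up, and the keys for likes and dislikes are left out because they are not shown in the answer. To stay runnable without third-party packages, it strips tags with the standard library's `HTMLParser` rather than BeautifulSoup.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML fragment and drops the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(fragment):
    """Return the plain text of an HTML fragment."""
    parser = _TextExtractor()
    parser.feed(fragment)
    parser.close()  # flush any text that is still buffered
    return ''.join(parser.parts).strip()


def comments_to_dicts(payload):
    """Turn the JSON payload into a list of dictionaries, one per comment."""
    posts = payload['posts']['posts']
    rows = []
    for id_ in payload['posts']['ids']:
        post = posts[id_]
        rows.append({
            'profile_url': post['url'],   # assumed to be the poster's profile
            'name': post['name'],
            'comment': strip_tags(post['message']),
            'time_posted': post['date'],
        })
    return rows


# Made-up sample with the same shape as the payload read by the script above.
sample = {
    'posts': {
        'ids': ['1', '2'],
        'posts': {
            '1': {'date': '2 days ago', 'name': 'alice',
                  'url': '/profiles/alice', 'message': 'Nice <b>video</b>!'},
            '2': {'date': '1 hour ago', 'name': 'bob',
                  'url': '/profiles/bob', 'message': '<p>Thanks</p>'},
        },
    },
}

print(comments_to_dicts(sample))
```

With the real response, `comments_to_dicts(comments)` would take the place of the printing loop; any extra fields can be added to the dictionary once their key names are known from the JSON.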

edited Jul 23, 2020 at 5:28

shekwo

answered Jul 22, 2020 at 19:10

Andrej Kesely

1 Comment

shekwo Over a year ago

Works! How did you get this URL: 'x......com/threads/video-comments/get-posts/top{video_id}/0/0'.format(video_id=video_id)? Also, please mask the URL to this: x......com/video_id/gggjggjj

Timaayy · Accepted Answer · 2020-07-22 17:21:05Z

This can be done with Selenium. Selenium drives a real browser, so the page's JavaScript runs as it would for a normal visitor. Depending on your preference, you can use ChromeDriver for Chrome or geckodriver for Firefox.

Here is a guide to installing ChromeDriver on Windows: http://jonathansoma.com/lede/foundations-2018/classes/selenium/selenium-windows-install/

Then set it up in your code like this:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# This part may change depending on where you installed the webdriver.
# You may have to define the path to the driver.
# My driver is in C:/bin, so I do not need to define the path.
chrome_options = Options()

# Run without a visible window; use '--start-maximized' instead
# if you want the browser window to open.
chrome_options.add_argument('--headless')

driver = webdriver.Chrome(options=chrome_options)

your_url = 'https://www.x....s.com/video_id/the-suburl'  # the page from the question
driver.get(your_url)
html = driver.page_source  # the page's HTML as currently rendered by the browser
driver.quit()  # close the browser when you are done

Selenium provides methods for interacting with the page, such as clicking elements. Once you have located an element, call its .click() method to click it. Here that means clicking the control that loads the comments and then reading driver.page_source again. Let me know if this helps.
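
As a rough sketch of that click step, the helper below keeps clicking the first visible element that matches a selector until none is left, then returns the rendered HTML. The function names and the selector passed in are hypothetical: the real selector has to be read from the page's markup in the browser's developer tools. The fixed pause is a crude stand-in for a proper wait.

```python
import time


def click_until_gone(driver, by, selector, max_clicks=50, pause=1.0):
    """Click the first visible element matching `selector` until none is left.

    Returns the number of clicks performed. `max_clicks` guards against a
    button that never disappears.
    """
    clicks = 0
    while clicks < max_clicks:
        # Look the elements up again on every pass, since the page changes.
        visible = [el for el in driver.find_elements(by, selector)
                   if el.is_displayed()]
        if not visible:
            break
        visible[0].click()
        clicks += 1
        time.sleep(pause)  # crude wait for the newly loaded comments
    return clicks


def rendered_html(url, selector):
    """Open `url` in headless Chrome, expand the comments, return the HTML."""
    # Imported here so the helper above can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By

    options = Options()
    options.add_argument('--headless')
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        click_until_gone(driver, By.CSS_SELECTOR, selector)
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even on errors
```

A call such as `rendered_html(your_url, '.load-more-comments')` (the selector is an assumption) would return HTML that can then be parsed with BeautifulSoup.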
