I want to scrape the comments from this page using BeautifulSoup: https://www.x....s.com/video_id/the-suburl

The comments are loaded on click via JavaScript. They are paginated, and each page of comments is also loaded on click. I want to fetch all the comments and, for each one, get the poster's profile URL, the comment text, the number of likes, the number of dislikes, and the time posted (as stated on the page).

The result can be a list of dictionaries, one per comment.

How do I go about this?

2 Answers

This script will print all comments found on the page:

import requests
from bs4 import BeautifulSoup


url = 'https://www.x......com/video_id/gggjggjj/'
# the video id is the last path segment of the page URL
video_id = url.rsplit('/', maxsplit=2)[-2].replace('video', '')

# the comments are fetched from this endpoint as JSON
u = 'https://www.x......com/threads/video/ggggjggl/{video_id}/0/0'.format(video_id=video_id)
comments = requests.post(u, data={'load_all': 1}).json()

for id_ in comments['posts']['ids']:
    post = comments['posts']['posts'][id_]
    print(post['date'])
    print(post['name'])
    print(post['url'])
    # the message is an HTML fragment, so strip the tags
    print(BeautifulSoup(post['message'], 'html.parser').get_text())
    # ...etc.
    print('-'*80)
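
Since the question asks for a list of dictionaries, here is a minimal sketch of reshaping that JSON into one, assuming the response has the same shape the script above relies on. The `votes_up` and `votes_down` keys are hypothetical placeholders for the like and dislike counts (inspect the real JSON for the actual field names), and `sample` is made-up data. The sketch strips tags with the standard library's `HTMLParser` instead of BeautifulSoup only so that it runs without third-party packages.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(fragment):
    """Return the plain text of an HTML fragment."""
    parser = _TextExtractor()
    parser.feed(fragment)
    parser.close()
    return ''.join(parser.parts).strip()


def posts_to_records(comments):
    """Turn the JSON payload into a list of dictionaries, one per comment."""
    posts = comments['posts']['posts']
    records = []
    for id_ in comments['posts']['ids']:
        post = posts[id_]
        records.append({
            'profile_url': post.get('url'),
            'name': post.get('name'),
            'comment': strip_tags(post.get('message', '')),
            'date': post.get('date'),
            # Hypothetical keys: check the real JSON for the vote fields.
            'likes': post.get('votes_up'),
            'dislikes': post.get('votes_down'),
        })
    return records


# Made-up sample shaped like the response used in the script above.
sample = {
    'posts': {
        'ids': ['1'],
        'posts': {
            '1': {
                'date': '2 days ago',
                'name': 'someone',
                'url': '/profiles/someone',
                'message': 'Nice <b>video</b>!',
            },
        },
    },
}
records = posts_to_records(sample)
```

With real data you would pass the parsed response instead of `sample`. Fields missing from a post come back as `None` rather than raising `KeyError`.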

1 Comment

Works! How did you get this URL: 'x......com/threads/video-comments/get-posts/top{video_id}/0/0'.format(video_id=video_id)? Also, please mask the URL as x......com/video_id/gggjggjj

This can be done with Selenium, which automates a real browser so that the page's JavaScript actually runs. Depending on your preference, you can use ChromeDriver for Chrome or geckodriver for Firefox.

Here is a guide to installing ChromeDriver: http://jonathansoma.com/lede/foundations-2018/classes/selenium/selenium-windows-install/

Then set it up in your code like this:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# This part may change depending on where you installed the webdriver.
# You may have to define the path to the driver.
# My driver is in C:/bin, so I do not need to define the path.
chrome_options = Options()

# use '--start-maximized' instead if you want a visible browser window
chrome_options.add_argument('--headless')

driver = webdriver.Chrome(options=chrome_options)

driver.get(your_url)  # your_url is a placeholder for the page to scrape
html = driver.page_source  # HTML of the page as currently rendered by the browser

Selenium provides methods for interacting with the page, such as clicking elements. Once you have located an element with Selenium, call its .click() method to click it. Let me know if this helps.
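
As a rough sketch of that find-then-click idea, the helper below keeps clicking a "load comments" style control until no matching element is left, assuming the control disappears once everything is loaded. The function name `click_until_gone`, its parameters, and the selector in the usage note are hypothetical; the real selector has to come from inspecting the page. It passes the locator strategy as the plain string "css selector", which is the value of Selenium's `By.CSS_SELECTOR` constant.

```python
import time


def click_until_gone(driver, css_selector, max_clicks=50, pause=1.0):
    """Click the first element matching css_selector until none is left.

    Returns the number of clicks performed. max_clicks guards against
    looping forever if the element never disappears.
    """
    clicks = 0
    while clicks < max_clicks:
        # "css selector" is the value of selenium's By.CSS_SELECTOR constant.
        matches = driver.find_elements("css selector", css_selector)
        if not matches:
            break
        matches[0].click()
        clicks += 1
        time.sleep(pause)  # crude fixed wait for the next batch to render
    return clicks
```

Usage would look something like `click_until_gone(driver, 'a.load-more-comments')` (a made-up selector), followed by reading `driver.page_source` and parsing it with BeautifulSoup. A fixed sleep is the simplest thing that can work; Selenium's explicit waits are usually more reliable on slow pages.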
