
I'm trying to scrape NGO data (name, mobile number, city, etc.) from https://ngodarpan.gov.in/index.php/search/. The site lists NGO names in a table, and clicking a name opens a pop-up with the details. In my code below, I extract the onclick attribute for each NGO, then make a GET request followed by a POST request to fetch the data. I've also tried accessing it with Selenium, but the JSON data never comes back.

list_of_cells = []
for cell in row.find_all('td'):
    text = cell.text.replace(" ", "")  # strip spaces from the cell text
    list_of_cells.append(text)
list_of_rows.append(list_of_cells)
writer = csv.writer(f)
writer.writerow(list_of_cells)

By running the portion above we can extract the full table from every page. The site has 7,721 pages; we can simply change the number_of_pages variable.

But our real goal is the NGO's phone number and email id, which only appear after clicking the NGO name. The name is not a plain href link: clicking it fires an API GET request (for a token) followed by a POST request that fetches the data; you can see both in the Network section of the browser's inspector.

driver.get("https://ngodarpan.gov.in/index.php/search/") # load the web page
sleep(2)
....
....
driver.find_element(By.NAME,"commit").submit()
for page in range(number_of_pages - 1):
    list_of_rows = []
    src = driver.page_source # gets the html source of the page
    parser = BeautifulSoup(src,'html.parser') 
    sleep(1)
    table = parser.find("table",{ "class" : "table table-bordered table-striped" })
    sleep(1)
    for row in table.find_all('tr')[:]:
        list_of_cells = []
        for cell in row.find_all('td'):
                x = requests.get("https://ngodarpan.gov.in/index.php/ajaxcontroller/get_csrf")
                dat=x.json()
                z=dat["csrf_token"]
                print(z) # prints csrf token
                r = requests.post("https://ngodarpan.gov.in/index.php/ajaxcontroller/show_ngo_info", data={'id': '', 'csrf_test_name': 'z'})
                json_data = r.text  # something is wrong here: this prints html text, but we need the data from the post request (mobile, email, etc.)
                with open('data1.json', 'a') as outfile:
                    json.dump(json_data, outfile)
    driver.find_element_by_xpath("//a[contains(text(),'»')]").click()

There is no error message as such; the code runs, but instead of JSON it prints an HTML error page (apparently the site's standard response when its CSRF check fails):

<html>
...
...
<body>
        <div id="container">
                <h1>An Error Was Encountered</h1>
                <p>The action you have requested is not allowed.</p>    </div>
</body>
</html>
  • Could you edit your question to show a sample search, and the start of the results you are trying to extract. Commented Jul 8, 2019 at 19:37
  • My main motive is to extract the mobile number or email id that appears after clicking the NGO name. Commented Jul 10, 2019 at 7:49
  • Your token is passed as the string 'z', not the variable; try data = {'id':'','csrf_test_name':z}. You would also need to pass a suitable id. Commented Jul 10, 2019 at 12:01
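
Putting that comment's fix together, a minimal sketch of the corrected token-then-details request pair (the id value here is a placeholder for one scraped from an onclick attribute, not a real NGO id):

import requests

sess = requests.Session()

# fetch a fresh CSRF token, then pass it back as the variable, not the string 'z'
token = sess.get("https://ngodarpan.gov.in/index.php/ajaxcontroller/get_csrf").json()["csrf_token"]
r = sess.post(
    "https://ngodarpan.gov.in/index.php/ajaxcontroller/show_ngo_info",
    data={'id': '12345', 'csrf_test_name': token},  # '12345' is a placeholder id
    headers={'X-Requested-With': 'XMLHttpRequest'},
)
print(r.json())  # should now be JSON containing the mobile number and email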

3 Answers


This could be done much faster by avoiding the use of Selenium. The site appears to require a fresh token before each request, so the code below fetches one every time; you might find it is possible to skip this.

The following shows how to get the JSON containing the mobile number and email address:

from bs4 import BeautifulSoup
import requests
import time

def get_token(sess):
    req_csrf = sess.get('https://ngodarpan.gov.in/index.php/ajaxcontroller/get_csrf')
    return req_csrf.json()['csrf_token']


search_url = "https://ngodarpan.gov.in/index.php/ajaxcontroller/search_index_new/{}"
details_url = "https://ngodarpan.gov.in/index.php/ajaxcontroller/show_ngo_info"

sess = requests.Session()

for page in range(0, 10000, 10):    # Advance 10 at a time
    print(f"Getting results from {page}")

    for retry in range(1, 10):

        data = {
            'state_search' : 7, 
            'district_search' : '',
            'sector_search' : 'null',
            'ngo_type_search' : 'null',
            'ngo_name_search' : '',
            'unique_id_search' : '',
            'view_type' : 'detail_view',
            'csrf_test_name' : get_token(sess), 
        }

        req_search = sess.post(search_url.format(page), data=data, headers={'X-Requested-With' : 'XMLHttpRequest'})
        soup = BeautifulSoup(req_search.content, "html.parser")
        table = soup.find('table', id='example')

        if table:
            for tr in table.find_all('tr'):
                row = [td.text for td in tr.find_all('td')]
                link = tr.find('a', onclick=True)

                if link:
                    link_number = link['onclick'].strip("show_ngif(')")  # peel the show_ngo_info('...') wrapper off, leaving the numeric id
                    req_details = sess.post(details_url, headers={'X-Requested-With' : 'XMLHttpRequest'}, data={'id' : link_number, 'csrf_test_name' : get_token(sess)})
                    json = req_details.json()
                    details = json['infor']['0']

                    print([details['Mobile'], details['Email'], row[1], row[2]])
            break
        else:
            print(f'No data returned - retry {retry}')
            time.sleep(3)

This would give you the following kind of output for the first page:

['9871249262', '[email protected]', 'Pragya Network Educational Society', 'S-52559, Narela, DELHI']
['9810042046', '[email protected]', 'HelpAge India', '9270, New Delhi, DELHI']
['9811897589', '[email protected]', 'All India Parivartan Sewa Samiti', 's-43282, New Delhi, DELHI']
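
One fragile spot worth noting: str.strip() removes any of the listed characters from both ends, so link['onclick'].strip("show_ngif(')") only works because the id is purely numeric. A regex is a more defensive way to pull the id out (a sketch, not part of the original answer):

import re

onclick = "show_ngo_info('12345');"  # illustrative onclick value
match = re.search(r"\d+", onclick)   # take the first run of digits
if match:
    link_number = match.group()      # -> '12345'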

16 Comments

For iterating over all pages it is showing an error at for tr in table.find_all('tr'): AttributeError: 'NoneType' object has no attribute 'find_all'
I have noticed that the site sometimes does not return the correct page; they might have rate limiting enabled. You might have to test table for None and retry with a delay.
The code is not iterating all the pages; it is only printing the first page
If your code is running properly, please send it to me by mail; it would be a great help, sir
Correct, it was only designed to show you how to get the first page. You would need to add another loop to work your way through all pages.

Switch to an iframe through Selenium and Python.

You can use an XPath to locate the <iframe>:

iframe = driver.find_element_by_xpath("//iframe[@name='Dialogue Window']")

Then switch_to the <iframe>:

driver.switch_to.frame(iframe)

Here's how to switch back to the default content (out of the <iframe>):

driver.switch_to.default_content()

In your instance, I believe the 'Dialogue Window' name would be CalendarControlIFrame

Once you switch to that frame, you will be able to use Beautiful Soup to get the frame's html.
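
Tying those three steps together: a minimal sketch, assuming driver is the WebDriver instance from the question and that the pop-up really is rendered in an iframe named 'Dialogue Window' (both assumptions, not confirmed for this site):

from bs4 import BeautifulSoup

iframe = driver.find_element_by_xpath("//iframe[@name='Dialogue Window']")
driver.switch_to.frame(iframe)                             # enter the frame
soup = BeautifulSoup(driver.page_source, 'html.parser')    # parse the frame's html
driver.switch_to.default_content()                         # back to the main document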

2 Comments

Where do I use this iframe?
Should I use it where I am making requests.get in my code?

I am trying to iterate over all the pages and extract the data in one run, but after extracting data from one page it does not move on to the other pages:

....
....

    ['9829059202', '[email protected]', 'CECOEDECON', '206, Jaipur, RAJASTHAN']
    ['9443382475', '[email protected]', 'ODAM', '43/1995, TIRUCHULI, TAMIL NADU']
    ['9816510096', '[email protected]', 'OPEN EDUCATIONAL DEVELOPMENT RESEARCH AND WELFARE', '126/2004, SUNDERNAGAR, HIMACHAL PRADESH']
    ['9425013029', '[email protected]', 'Centre for Advanced Research and Development', '25634, Bhopal, MADHYA PRADESH']
    ['9204645161', '[email protected]', 'Srijan Mahila Vikas Manch', '833, Chakradharpur, JHARKHAND']
    ['9419107550', '[email protected]', 'J and K Sai Star Society', '4680-S, Jammu, JAMMU & KASHMIR']
    No data returned - retry 2
    No data returned - retry 2
    No data returned - retry 2
    No data returned - retry 2
    No data returned - retry 2
...
...
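
In line with the rate-limiting comment under the first answer, one possible mitigation for the repeated 'No data returned' retries is to wait longer after each failed attempt instead of a fixed 3 seconds; a sketch (the helper name and parameters are illustrative, not from the original answer):

import time

def fetch_with_backoff(fetch, max_retries=9):
    # call fetch() until it returns something truthy, waiting longer each time
    for retry in range(1, max_retries + 1):
        result = fetch()
        if result:
            return result
        print(f'No data returned - retry {retry}')
        time.sleep(3 * retry)   # back off: 3s, 6s, 9s, ...
    return None

Here fetch() would wrap the POST to search_url and return the parsed table (or None).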

