
I'm trying to scrape NGO data (name, mobile number, city, etc.) from https://ngodarpan.gov.in/index.php/search/. The site lists NGO names in a table, and clicking a name opens a pop-up with the details. In my code below, I extract the onclick attribute for each NGO, then make a GET request followed by a POST request to fetch the data. I've also tried accessing it with Selenium, but the JSON data never comes back.

list_of_cells = []
for cell in row.find_all('td'):
    text = cell.text.replace(" ", "")  # strip spaces from the cell text
    list_of_cells.append(text)
list_of_rows.append(list_of_cells)
writer = csv.writer(f)
writer.writerow(list_of_cells)

By running the portion above we can extract the full table from every page. The site has 7,721 pages; we can simply change the number_of_pages variable.

But our real goal is the NGO's phone number and email id, which only appear after clicking the NGO name. The name is not a plain href link: clicking it fires an API GET request (for a token) followed by a POST request that fetches the data; you can see both in the Network section of the browser's inspector.

driver.get("https://ngodarpan.gov.in/index.php/search/") # load the web page
sleep(2)
....
....
driver.find_element(By.NAME,"commit").submit()
for page in range(number_of_pages - 1):
    list_of_rows = []
    src = driver.page_source # gets the html source of the page
    parser = BeautifulSoup(src,'html.parser') 
    sleep(1)
    table = parser.find("table",{ "class" : "table table-bordered table-striped" })
    sleep(1)
    for row in table.find_all('tr')[:]:
        list_of_cells = []
        for cell in row.find_all('td'):
                x = requests.get("https://ngodarpan.gov.in/index.php/ajaxcontroller/get_csrf")
                dat=x.json()
                z=dat["csrf_token"]
                print(z) # prints csrf token
                r = requests.post("https://ngodarpan.gov.in/index.php/ajaxcontroller/show_ngo_info", data={'id': '', 'csrf_test_name': 'z'})
                json_data = r.text  # something is wrong here: this prints html text, but we need the data from the post request (mobile, email, etc.)
                with open('data1.json', 'a') as outfile:
                    json.dump(json_data, outfile)
    driver.find_element_by_xpath("//a[contains(text(),'»')]").click()

There is no error message as such; the code runs, but instead of JSON it prints an HTML error page (apparently the site's standard response when its CSRF check fails):

<html>
...
...
<body>
        <div id="container">
                <h1>An Error Was Encountered</h1>
                <p>The action you have requested is not allowed.</p>    </div>
</body>
</html>
  • Could you edit your question to show a sample search, and the start of the results you are trying to extract. Commented Jul 8, 2019 at 19:37
  • My main motive is to extract the mobile number or email id that appears after clicking the NGO name. Commented Jul 10, 2019 at 7:49
  • Your token is passed as the string 'z', not the variable; try data = {'id':'','csrf_test_name':z}. You would also need to pass a suitable id. Commented Jul 10, 2019 at 12:01
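
Putting that comment's fix together, a minimal sketch of the corrected token-then-details request pair (the id value here is a placeholder for one scraped from an onclick attribute, not a real NGO id):

import requests

sess = requests.Session()

# fetch a fresh CSRF token, then pass it back as the variable, not the string 'z'
token = sess.get("https://ngodarpan.gov.in/index.php/ajaxcontroller/get_csrf").json()["csrf_token"]
r = sess.post(
    "https://ngodarpan.gov.in/index.php/ajaxcontroller/show_ngo_info",
    data={'id': '12345', 'csrf_test_name': token},  # '12345' is a placeholder id
    headers={'X-Requested-With': 'XMLHttpRequest'},
)
print(r.json())  # should now be JSON containing the mobile number and email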

3 Answers


This could be done much faster by avoiding the use of Selenium. The site appears to require a fresh token before each request, so the code below fetches one every time; you might find it is possible to skip this.

The following shows how to get the JSON containing the mobile number and email address:

from bs4 import BeautifulSoup
import requests
import time

def get_token(sess):
    req_csrf = sess.get('https://ngodarpan.gov.in/index.php/ajaxcontroller/get_csrf')
    return req_csrf.json()['csrf_token']


search_url = "https://ngodarpan.gov.in/index.php/ajaxcontroller/search_index_new/{}"
details_url = "https://ngodarpan.gov.in/index.php/ajaxcontroller/show_ngo_info"

sess = requests.Session()

for page in range(0, 10000, 10):    # Advance 10 at a time
    print(f"Getting results from {page}")

    for retry in range(1, 10):

        data = {
            'state_search' : 7, 
            'district_search' : '',
            'sector_search' : 'null',
            'ngo_type_search' : 'null',
            'ngo_name_search' : '',
            'unique_id_search' : '',
            'view_type' : 'detail_view',
            'csrf_test_name' : get_token(sess), 
        }

        req_search = sess.post(search_url.format(page), data=data, headers={'X-Requested-With' : 'XMLHttpRequest'})
        soup = BeautifulSoup(req_search.content, "html.parser")
        table = soup.find('table', id='example')

        if table:
            for tr in table.find_all('tr'):
                row = [td.text for td in tr.find_all('td')]
                link = tr.find('a', onclick=True)

                if link:
                    link_number = link['onclick'].strip("show_ngif(')")  # peel the show_ngo_info('...') wrapper off, leaving the numeric id
                    req_details = sess.post(details_url, headers={'X-Requested-With' : 'XMLHttpRequest'}, data={'id' : link_number, 'csrf_test_name' : get_token(sess)})
                    json = req_details.json()
                    details = json['infor']['0']

                    print([details['Mobile'], details['Email'], row[1], row[2]])
            break
        else:
            print(f'No data returned - retry {retry}')
            time.sleep(3)

This would give you the following kind of output for the first page:

['9871249262', '[email protected]', 'Pragya Network Educational Society', 'S-52559, Narela, DELHI']
['9810042046', '[email protected]', 'HelpAge India', '9270, New Delhi, DELHI']
['9811897589', '[email protected]', 'All India Parivartan Sewa Samiti', 's-43282, New Delhi, DELHI']
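
One fragile spot worth noting: str.strip() removes any of the listed characters from both ends, so link['onclick'].strip("show_ngif(')") only works because the id is purely numeric. A regex is a more defensive way to pull the id out (a sketch, not part of the original answer):

import re

onclick = "show_ngo_info('12345');"  # illustrative onclick value
match = re.search(r"\d+", onclick)   # take the first run of digits
if match:
    link_number = match.group()      # -> '12345'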

16 Comments

For iterating over all pages it is showing an error at for tr in table.find_all('tr'): AttributeError: 'NoneType' object has no attribute 'find_all'
I have noticed that the site sometimes does not return the correct page; they might have rate limiting enabled. You might have to test table for None and retry with a delay.
The code is not iterating all the pages; it is only printing the first page
If your code is running properly, please send it to me by mail; it would be a great help, sir
Correct, it was only designed to show you how to get the first page. You would need to add another loop to work your way through all pages.

Switch to an iframe through Selenium and Python.

You can use an XPath to locate the <iframe>:

iframe = driver.find_element_by_xpath("//iframe[@name='Dialogue Window']")

Then switch_to the <iframe>:

driver.switch_to.frame(iframe)

Here's how to switch back to the default content (out of the <iframe>):

driver.switch_to.default_content()

In your instance, I believe the 'Dialogue Window' name would be CalendarControlIFrame

Once you switch to that frame, you will be able to use Beautiful Soup to get the frame's html.
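
Tying those three steps together: a minimal sketch, assuming driver is the WebDriver instance from the question and that the pop-up really is rendered in an iframe named 'Dialogue Window' (both assumptions, not confirmed for this site):

from bs4 import BeautifulSoup

iframe = driver.find_element_by_xpath("//iframe[@name='Dialogue Window']")
driver.switch_to.frame(iframe)                             # enter the frame
soup = BeautifulSoup(driver.page_source, 'html.parser')    # parse the frame's html
driver.switch_to.default_content()                         # back to the main document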

2 Comments

Where do I use this iframe?
Should I use it where I am making requests.get in my code?

I am trying to iterate over all the pages and extract the data in one run, but after extracting data from one page it does not move on to the other pages:

....
....

    ['9829059202', '[email protected]', 'CECOEDECON', '206, Jaipur, RAJASTHAN']
    ['9443382475', '[email protected]', 'ODAM', '43/1995, TIRUCHULI, TAMIL NADU']
    ['9816510096', '[email protected]', 'OPEN EDUCATIONAL DEVELOPMENT RESEARCH AND WELFARE', '126/2004, SUNDERNAGAR, HIMACHAL PRADESH']
    ['9425013029', '[email protected]', 'Centre for Advanced Research and Development', '25634, Bhopal, MADHYA PRADESH']
    ['9204645161', '[email protected]', 'Srijan Mahila Vikas Manch', '833, Chakradharpur, JHARKHAND']
    ['9419107550', '[email protected]', 'J and K Sai Star Society', '4680-S, Jammu, JAMMU & KASHMIR']
    No data returned - retry 2
    No data returned - retry 2
    No data returned - retry 2
    No data returned - retry 2
    No data returned - retry 2
...
...
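
In line with the rate-limiting comment under the first answer, one possible mitigation for the repeated 'No data returned' retries is to wait longer after each failed attempt instead of a fixed 3 seconds; a sketch (the helper name and parameters are illustrative, not from the original answer):

import time

def fetch_with_backoff(fetch, max_retries=9):
    # call fetch() until it returns something truthy, waiting longer each time
    for retry in range(1, max_retries + 1):
        result = fetch()
        if result:
            return result
        print(f'No data returned - retry {retry}')
        time.sleep(3 * retry)   # back off: 3s, 6s, 9s, ...
    return None

Here fetch() would wrap the POST to search_url and return the parsed table (or None).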

