I am trying to scrape a JavaScript-rendered web page, but I cannot access the HTML that I see in my web browser. Logging in and navigating to the relevant URL works; however, retrieving the inner HTML does not.

from selenium import webdriver

browser = webdriver.Chrome("path-to-webdriver")

page = browser.get(url)
inner_html = browser.execute_script("return document.body.innerHTML")
print(inner_html)

Below is the part of the HTML I want to access; the content belongs inside the first <div></div> pair, and the JavaScript that generates it follows. The output of my Python script contains nothing beyond the HTML shown below.

So, how can I get the inner HTML of this page?

<div class="divmyTrReport" id="divmyTrReport">
    </div>
    <script>
     function loadForm()
    {
                $('#divmyTrReport').html('<img src="/jottonia/gfx/ajaxbar.gif">' );
                $.get( "/jottonia/news/jottoniantimes/frontpageo.jsp", function( data ) {
                    $('#divmyTrReport').html(data );
                });       
    }
            $(document).ready(function(){
                loadForm('');
            });
    </script>

Edit:


The part of the HTML I want is printed below, particularly the "Last update:" part.

<html><head>
 <div id="divContent1" class="clearfix">
<div id="divmyTrReport" class="divmyTrReport">
<title>Jottonian Times</title>
<p>
</p>
<p>&nbsp;</p>
<table width="610" border="0" cellspacing="0" cellpadding="0">
  <tbody><tr>
    <td colspan="2"><img src="img/logo.jpg" alt="The Jottonian Times"></td>
  </tr>
  <tr>
    <td colspan="2"><img src="img/invisible.gif" width="10" height="5"></td>
  </tr>
  <tr>
    <td colspan="2"><table width="100%" border="0" cellpadding="0" cellspacing="0" bgcolor="#9A9A9A">
        <tbody><tr>
          <td><table width="100%" border="0" cellspacing="1" cellpadding="0">
              <tbody><tr>
                <td bgcolor="#EBEBEB"> <div align="center">
                    <table width="600" border="0" cellpadding="0" cellspacing="0">
                      <tbody><tr>
                        <td><font size="-2" face="Verdana, Arial, Helvetica, sans-serif">
                          &nbsp;Jottonian time: 2018-02-26 09:24 </font></td>
                        <td> <div align="center"><font size="-2" face="Verdana, Arial, Helvetica, sans-serif">

                            Last update:
                             166:24 hours ago</font></div></td>
                        <td> <div align="right"><font size="-2" face="Verdana, Arial, Helvetica, sans-serif">Issues: Quite some&nbsp; </font></div></td>
                      </tr>
</body></html>

Running this

news_page = browser.get(news_url)
inner_html = wait(browser, 20).until(lambda browser: browser.find_element_by_id("divContent1").get_attribute("innerHTML").strip())
print(inner_html)

results in

<div id="divmyTrReport" class="divmyTrReport"><img src="/jottonia/gfx/ajaxbar.gif"></div>

 <script>
    function loadForm()
    {
                $('#divmyTrReport').html('<img src="/jottonia/gfx/ajaxbar.gif">' );
                $.get( "/jottonia/news/jottoniantimes/frontpageo.jsp", function( data ) {
                    $('#divmyTrReport').html(data );
                });       
    }
            $(document).ready(function(){
                loadForm('');
            });
      </script>





 <script type="text/javascript">
<!--
    $( document ).ready(function() {
        newMail(1);
    });
//-->
</script>

2 Answers

If you want to get innerHTML that is generated dynamically, you can try the code below:

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait as wait

browser = webdriver.Chrome("path-to-webdriver")

browser.get(url)  # get() returns None, so there is nothing to assign
# 1) Wait for the div to contain anything at all (the loading image).
inner_html = wait(browser, 10).until(lambda browser: browser.find_element_by_id("divmyTrReport").get_attribute("innerHTML").strip())
# 2) Wait for that content to change, i.e. for the AJAX response to replace the image.
wait(browser, 10).until(lambda browser: browser.find_element_by_id("divmyTrReport").get_attribute("innerHTML").strip() != inner_html)
# 3) Read the final content.
inner_html = browser.find_element_by_id("divmyTrReport").get_attribute("innerHTML")
print(inner_html)

This waits up to 10 seconds (increase the timeout if needed) for the innerHTML of the target div to become non-empty, then up to another 10 seconds for it to change from that initial value, and finally reads the updated content.
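
One possible refinement, sketched under assumptions: rather than waiting for the content to change, wait until the div is non-empty and no longer contains the loading image. That avoids a timeout in the case where the table has already replaced the image before the first wait runs. The helper names loaded_html and fetch_loaded_html are hypothetical, and the "ajaxbar.gif" marker is taken from the markup in the question. find_element("id", ...) is the generic locator form (equivalent to By.ID), used here because recent Selenium releases no longer provide find_element_by_id.

```python
# Marker for the loading image the page shows while the AJAX request is pending
# (taken from the question's markup; adjust if the page uses something else).
PLACEHOLDER = "ajaxbar.gif"


def loaded_html(driver, element_id="divmyTrReport", placeholder=PLACEHOLDER):
    """Return the element's innerHTML once the real content has arrived.

    Returns False while the element is empty or still shows the placeholder,
    which makes this function usable as a WebDriverWait condition.
    """
    element = driver.find_element("id", element_id)  # "id" is the value of By.ID
    html = element.get_attribute("innerHTML").strip()
    if html and placeholder not in html:
        return html
    return False


def fetch_loaded_html(driver, timeout=20):
    """Poll loaded_html until it returns content or the timeout expires."""
    # Imported here so loaded_html can be exercised without Selenium installed.
    from selenium.webdriver.support.ui import WebDriverWait

    # until() calls loaded_html(driver) repeatedly and returns its first truthy result.
    return WebDriverWait(driver, timeout).until(loaded_html)
```

With a logged-in browser already on the news page, print(fetch_loaded_html(browser)) should print the table markup, assuming the AJAX request completes within the timeout.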

8 Comments

This results in something, but not the same code I can inspect in my browser. This gives "<img src="/jottonia/gfx/ajaxbar.gif">".
And what is your desired output?
So the content of the target div changes twice: 1) an interim image (a loading indicator or similar), 2) the table with data. And you want to get the table, right?
Yes, the table.
No. The first time, we wait for non-empty content (the image) to appear and assign it to the inner_html variable. The second time, we wait for the div's content to differ from inner_html (the second wait returns a boolean True/False). The third line assigns the new value to inner_html.

If you want to set the inner HTML the way the page's JavaScript does, you need to do it through JavaScript, for example:

browser.execute_script('''document.getElementById("divmyTrReport").innerHTML = '<img src="/jottonia/gfx/ajaxbar.gif">';''')
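
A related point, as a minimal sketch: a script run through execute_script only hands a value back to Python if it contains an explicit return, so a script that merely assigns innerHTML yields None. The helper name read_inner_html is hypothetical; arguments[0] is how Selenium exposes extra positional arguments to the script.

```python
def read_inner_html(driver, element_id="divmyTrReport"):
    """Return the current innerHTML of the element with the given id."""
    # The leading `return` is what makes execute_script pass the value back to Python.
    script = "return document.getElementById(arguments[0]).innerHTML;"
    return driver.execute_script(script, element_id)
```

This reads whatever is in the div at that moment, so it still needs to be combined with a wait if the AJAX content has not arrived yet.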

1 Comment

You are missing a "(", I think. Not sure where to place it. If I place it such that I get innerHTML =( '<img src="/jottonia/gfx/ajaxbar.gif">'), then I get "None" returned.
