
I am trying to scrape this link. Here is the code I wrote:

import requests
from bs4 import BeautifulSoup
rlink = requests.get('http://videohost.site/play/A11QStEaNdVZfvV/')
print(rlink.content)

When I open the link in a browser, I get well-formed HTML from which I can select the tag. Example:

<video class="jw-video jw-reset" x-webkit-airplay="allow" webkit-playsinline="" playsinline="" jw-loaded="data" src="https://redirector.googlevideo.com/videoplayback?requiressl=yes&amp;id=99e7c0d36ff950d2&amp;itag=22&amp;source=webdrive&amp;ttl=transient&amp;app=explorer&amp;ip=2001:67c:2db8:7::3e0&amp;ipbits=32&amp;expire=1483730468&amp;sparams=requiressl%2Cid%2Citag%2Csource%2Cttl%2Cip%2Cipbits%2Cexpire%2Cmm%2Cmn%2Cms%2Cmv%2Cpl&amp;signature=7EFB542F7CE372D5DAD8376254F577926AF8CBEA.857A11ACEB6C65D5D075759B557CE1E114F94F03&amp;key=ck2&amp;mm=31&amp;mn=sn-bungvh5op5-vu2e&amp;ms=au&amp;mt=1483715949&amp;mv=u&amp;pl=48"></video>

But the requests module returns a page containing a script, which only gets executed in the browser:

<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
   <head>
      <meta charset="UTF-8" />
      <title>Banjo HD</title>
      <meta property="og:image" content="https://lh6.googleusercontent.com/Eo6aYbkMPiltQ1HE8QXK-2RvCOB8wCgzvqiJqIYEu9DJMSodJwd24g=w1200-h630-p" />
      <link rel="stylesheet" type="text/css" href="http://videohost.site/player/jwplayer/assets/style.css">
      <script src="http://videohost.site/player/jwplayer/assets/jwplayer.js"></script> <script>jwplayer.key = "qCeaX98IpNerwNN2Vlz69NLXFAyMM5a4dyK7Pw==";</script>
   </head>
   <body>
      <div id="player"></div>
      <script type="text/javascript"> eval(function(p,a,c,k,e,d){e=function(c){return(c<a?'':e(parseInt(c/a)))+((c=c%a)>35?String.fromCharCode(c+29):c.toString(36))};if(!''.replace(/^/,String)){while(c--){d[e(c)]=k[c]||e(c)}k=[function(e){return d[e]}];e=function(){return'\\w+'};c=1};while(c--){if(k[c]){p=p.replace(new RegExp('\\b'+e(c)+'\\b','g'),k[c])}}return p}('1k 5=v("5");5.1l({1m:"14%",1i:"14%",1h:"1n",1q:"w",1p:17,1o:w,1r:"O://15.19/5/v/1a/v.1b.1g",1f:"16:9",1c:"17",1e:"1d",1j:"O",1w:w,1G:[{"3":"t:\\/\\/s.p.q\\/r?0=x&y=D&E=1F&C=B&o=A&F=m&c=d:e:G:7::b&a=6&8=f&g=0%h%i%n%j%k%l%z%P%10%X%U%V&W=1I.1K&Y=Z&13=12&11=S-L-K&J=H&I=M&T=u&N=R","Q":"1J","2":"1\\/4"},{"3":"t:\\/\\/s.p.q\\/r?0=x&y=D&E=1E&C=B&o=A&F=m&c=d:e:G:7::b&a=6&8=f&g=0%h%i%n%j%k%l%z%P%10%X%U%V&W=1C.1D&Y=Z&13=12&11=S-L-K&J=H&I=M&T=u&N=R","Q":"1s","2":"1\\/4"},{"3":"t:\\/\\/s.p.q\\/r?0=x&y=D&E=18&C=B&o=A&F=m&c=d:e:G:7::b&a=6&8=f&g=0%h%i%n%j%k%l%z%P%10%X%U%V&W=1z.1A&Y=Z&13=12&11=S-L-K&J=H&I=M&T=u&N=R","Q":"1B","2":"1\\/4"}],2:"1/4",1y:{3:"",1x:"",},1t:"1u 1v",1H:"O://15.19"});',62,109,'requiressl|video|type|file|mp4|player|32||expire||ipbits|3e0|ip|2001|67c|1483730468|sparams|2Cid|2Citag|2Cttl|2Cip|2Cipbits|explorer|2Csource|ttl|googlevideo|com|videoplayback|redirector|https||jwplayer|false|yes|id|2Cexpire|transient|webdrive|source|99e7c0d36ff950d2|itag|app|2db8|au|mt|ms|vu2e|bungvh5op5|1483715949|pl|http|2Cmm|label|48|sn|mv|2Cmv|2Cpl|signature|2Cms|key|ck2|2Cmn|mn|31|mm|100|videohost||true||site|assets|flash|fullscreen|html5|primary|aspectratio|swf|skin|height|provider|var|setup|width|seven|displaytitle|controls|preload|flashplayer|480P|abouttext|Video|Host|autostart|link|logo|3648867A489010D7BFA1A2E6C64F4035FDEB3814|6617735E622564ACA4793459986706DA936E58DE|360P|9FBCFB9752833B2DD83BFD6547551604AA6A340D|A55D1440195C2AF6945EE4A20DB8147CDC50F337|59|22|sources|aboutlink|7EFB542F7CE372D5DAD8376254F577926AF8CBEA|720P|857A11ACEB6C65D5D075759B557CE1E114F94F03'.split('|'),0,{})) </script><!-- Code --><script 
type="text/javascript" data-cfasync="false"> var _pop = _pop || []; _pop.push(['siteId', 1630926]); _pop.push(['minBid', 0]); _pop.push(['popundersPerIP', 0]); _pop.push(['delayBetween', 0]); _pop.push(['default', false]); _pop.push(['defaultPerDay', 0]); _pop.push(['topmostLayer', false]); (function() { var pa = document.createElement('script'); pa.type = 'text/javascript'; pa.async = true; var s = document.getElementsByTagName('script')[0]; pa.src = '//c1.popads.net/pop.js'; pa.onerror = function() { var sa = document.createElement('script'); sa.type = 'text/javascript'; sa.async = true; sa.src = '//c2.popads.net/pop.js'; s.parentNode.insertBefore(sa, s); }; s.parentNode.insertBefore(pa, s); })();</script><!-- Code End --><script> (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){ (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o), m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m) })(window,document,'script','https://www.google-analytics.com/analytics.js','ga'); ga('create', 'UA-88363984-1', 'auto'); ga('send', 'pageview');</script>
   </body>
</html>

Any pointers on how to get the final HTML would be highly appreciated.

Any ideas on PhantomJS? I am running the same code as suggested below, but with the PhantomJS driver, and the wait for the video tag times out. I think the script does not get executed the way it does with Firefox.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.PhantomJS()
driver.get('http://videohost.site/play/A11QStEaNdVZfvV/')
# driver.execute_script('')

# wait for "video" to be present
video = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.TAG_NAME, "video")))

# get the src value
print(video.get_attribute("src"))

driver.close()
Requests and web scraping won't render the JavaScript. You need to run something like Selenium. Commented Jan 6, 2017 at 16:09

2 Answers

To extend Emett's answer, here is a working example using Selenium that opens Firefox (you don't have to use Firefox; several browsers are supported, including the headless PhantomJS), waits for the video element to be present, and gets the src value:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get('http://videohost.site/play/A11QStEaNdVZfvV/')

# wait for "video" to be present
video = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.TAG_NAME, "video")))

# get the src value
print(video.get_attribute("src"))

driver.close()

2 Comments

Thanks, guys, for your helpful responses. Much appreciated.
Any ideas how to get it working with PhantomJS? It times out and doesn't find the video tag.

Requests and web scraping won't render the JavaScript. You need to run something like Selenium. The drawback is that it opens a browser window and can be rather slow. To work around that, you can use a headless browser such as ghost.py.
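If running a browser at all is too slow, another option may be to decode the packed script from the requests response directly. The script in the question looks like the common eval(function(p,a,c,k,e,d){...}) packer, which replaces each word in the payload with a base-62 index into a word list. The following is only a minimal sketch under that assumption: the names PACKED_CALL, decode_token, unpack and unpack_html are hypothetical helpers, the regular expression is a guess at the call's shape, the sample string is made up, and backslash escapes inside the JavaScript string literal are not handled.

```python
import re

# Digits used by this style of packer for bases up to 62: 0-9, a-z, then A-Z.
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

# Assumed shape of the trailing call: }('payload',radix,count,'w0|w1|...'.split('|')
PACKED_CALL = re.compile(
    r"\}\('(.*)',(\d+),(\d+),'(.*?)'\.split\('\|'\)", re.DOTALL
)


def decode_token(token, radix):
    """Convert a packed token such as '1a' into its integer index."""
    value = 0
    for char in token:
        digit = ALPHABET.index(char)  # raises ValueError for e.g. '_'
        if digit >= radix:
            raise ValueError("digit out of range for radix")
        value = value * radix + digit
    return value


def unpack(payload, radix, words):
    """Replace every word token in payload with its entry in words."""
    def substitute(match):
        token = match.group(0)
        try:
            index = decode_token(token, radix)
        except ValueError:
            return token
        # An empty entry means the token stands for itself.
        if index < len(words) and words[index]:
            return words[index]
        return token

    return re.sub(r"\b\w+\b", substitute, payload, flags=re.ASCII)


def unpack_html(html):
    """Find the first packed call in html and return the decoded script."""
    match = PACKED_CALL.search(html)
    if match is None:
        return None
    payload, radix, _count, words = match.groups()
    return unpack(payload, int(radix), words.split("|"))


# Made-up sample in the same shape as the script in the question.
sample = (
    "eval(function(p,a,c,k,e,d){return p}"
    "('3=\"x?0=y\";2=\"1/4\"',62,5,'requiressl|video|type|file|mp4'.split('|'),0,{}))"
)
print(unpack_html(sample))  # file="x?requiressl=y";type="video/mp4"
```

With the real page, unpack_html(rlink.text) would, if the assumptions hold, return the jwplayer setup call, from which the "file" URLs could then be extracted. Note that the URLs in the question carry an expire parameter, so they are presumably only valid for a limited time.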
