
The source code of the HTML page is shown below:

<html>
<head>
<meta http-equiv=Content-Type content="text/html; charset=gb2312">
<script>
    document.domain = "xxxx.com";
    var jsonObj = {
        list: [
            {ip: "166.255.255.25", port: 1080, path: "/data/pps.jpeg"}
        ]
    }
    var jsParObj = {param1: 25532, param2: 54463}
</script>
</head>
<body>
</body>
</html>

I am trying to extract the data from that HTML page and store it in JSON format.

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
script_text = soup.find('script')  # a Tag object, not a plain string

Using the Python library BeautifulSoup4, I get this:

<script>
    document.domain = "xxxx.com";
    var jsonObj = {
        list: [
            {ip: "166.255.255.25", port: 1080, path: "/data/pps.jpeg"}
        ]
    }
    var jsParObj = {param1: 25532, param2: 54463}
</script>

How can I remove the <script> tag and convert that data to JSON? I am using Python.

  • I'd just search for the substring that follows a match of the regexp jsonObj[ \t]+= and parse the largest JSON value out of it. This approach is not really reliable, though. Commented Oct 15, 2016 at 9:58
  • It is not JSON. It is just JavaScript, @SuperSaiyan. Commented Oct 16, 2016 at 6:39
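
The idea from the comments could be sketched roughly as follows. This is only a minimal sketch, not a general JavaScript parser: it assumes each variable is assigned a plain object literal with `var name = {...}`, that keys are bare identifiers, that string values use double quotes, and that no string value contains text resembling `, key:`. The helper names `extract_object_literal` and `js_literal_to_python` are made up for illustration, and `script_text` here is a hard-coded copy of the script body from the question.

```python
import json
import re

# The body of the <script> tag from the question, as a plain string.
script_text = '''
    document.domain = "xxxx.com";
    var jsonObj = {
        list: [
            {ip: "166.255.255.25", port: 1080, path: "/data/pps.jpeg"}
        ]
    }
    var jsParObj = {param1: 25532, param2: 54463}
'''


def extract_object_literal(source, var_name):
    """Return the text of the {...} literal assigned to `var var_name`."""
    match = re.search(r'var\s+' + re.escape(var_name) + r'\s*=\s*\{', source)
    if match is None:
        raise ValueError('no assignment found for ' + var_name)
    start = match.end() - 1  # index of the opening brace
    depth = 0
    in_string = None  # holds the active quote character while inside a string
    i = start
    while i < len(source):
        ch = source[i]
        if in_string:
            if ch == '\\':
                i += 1  # skip the escaped character
            elif ch == in_string:
                in_string = None
        elif ch in '"\'':
            in_string = ch
        elif ch == '{':
            depth += 1
        elif ch == '}':
            depth -= 1
            if depth == 0:
                return source[start:i + 1]
        i += 1
    raise ValueError('unbalanced braces for ' + var_name)


def js_literal_to_python(literal):
    """Quote bare keys so the literal becomes valid JSON, then parse it."""
    # Assumption: a key is an identifier that directly follows '{' or ','.
    quoted = re.sub(r'([{,]\s*)([A-Za-z_$][\w$]*)\s*:', r'\1"\2":', literal)
    return json.loads(quoted)


json_obj = js_literal_to_python(extract_object_literal(script_text, 'jsonObj'))
js_par_obj = js_literal_to_python(extract_object_literal(script_text, 'jsParObj'))

# Serialize both objects back out as real JSON.
print(json.dumps({'jsonObj': json_obj, 'jsParObj': js_par_obj}, indent=2))
```

With BeautifulSoup, the string passed in would presumably come from `soup.find('script').string` rather than a hard-coded literal. As the comments point out, the input is JavaScript rather than JSON, so this kind of key-quoting is fragile; a real JavaScript parser would be more robust for anything beyond simple literals like these.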
