I'm trying to parse the testObj defined in the HTML below into a Python object, but the extracted string contains a lot of formatting: whitespace, unquoted keys, and a trailing comma.

I already tried to remove the non-ascii characters in the object, but json.loads() and yaml still can't parse the string into an object.

How can I parse the string into an object?

html

<!DOCTYPE html>
<html lang="en">
    <head>
        <title>Sample Document</title>
    </head>
    <body></body>
    <script>
        const testObj = {
            a: 1,
            b: 2,
            c: 3,
        };
    </script>
</html>

Python Script

import lxml.html
import urllib.request
import os
import json
import yaml

def removeNonAscii(text):
    # Keep only printable ASCII characters (codes 32-125); this drops tabs and newlines
    return ''.join(ch for ch in text if 31 < ord(ch) < 126)

with urllib.request.urlopen('file:///'+os.path.abspath('./test.html')) as url:
    page = url.read()
    tree = lxml.html.fromstring(page)
    x = tree.xpath("//script")[0].text_content()
    json_str = x.strip().split('testObj = ')[1][:-1]
    cleaned = removeNonAscii(json_str)  # avoid shadowing the built-in str
    print(cleaned)
    # >>> {a: 1,b: 2,c: 3,}

    # Attempt 1 - This doesn't work because the keys are not wrapped in double quotes
    # data = json.loads(cleaned)
    # >>> json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes

    # Attempt 2 - Not sure how to detect or get rid of the formatting
    # data = yaml.load(cleaned, yaml.SafeLoader)
    # >>> ScannerError: While scanning for the next token found character '\t' that cannot start any token

    # Goal (data is only defined once one of the attempts above works):
    # print(data.a)
    # >>> Should return 1

Edit: In my actual use case, the object is very large and I cannot recreate the string by hand. I need to remove the formatting and/or add double quotes so that it becomes proper JSON and can be parsed, but I'm not sure how to do that. I've gotten it as far as {a: 1,b: 2,c: 3,}, but it still won't parse.

  • Does this answer your question? Convert html source code to json object Commented Apr 14, 2021 at 1:38
  • @PacketLoss Thanks for your response. In that question, the answer was to rebuild the whole JSON object by hand. In my actual use case, the object is far too large to recreate by hand, so I still need to figure out how to parse the string. I've gotten it as far as {a: 1,b: 2,c: 3,}, but the parser still detects odd formatting and I'm not sure how to add proper double quotes to the keys. Commented Apr 14, 2021 at 1:46
  • You're asking us to help you write a parser for a syntax with no specification, not even an example of what it looks like? Seriously? Commented Apr 14, 2021 at 7:42
  • @MichaelKay I'm not clear what you mean when you say no example. If an answer can handle the basic case of turning a simple unformatted dictionary/object into JSON, then it will have solved the question. The example is the testObj in the script tag. In my attempt, I've gotten it down to {a: 1,b: 2,c: 3,}, but it still doesn't parse. I'm not sure where the sentiment in your comment is coming from, nor what you want me to do better, my friend. Commented Apr 14, 2021 at 15:21

1 Answer

If the script is as shown (not minified), you can use the following regex to extract the object literal as a string, then hjson to parse it, since hjson accepts unquoted keys:

import hjson, re

html = '''
<!DOCTYPE html>
    <html lang="en">
        <head>
            <title>Sample Document</title>
        </head>
        <body></body>
        <script>
            const testObj = {
                a: 1,
                b: 2,
                c: 3,
            };
        </script>
    </html>'''

s = re.search(r'const testObj = ([\s\S]+?);', html).group(1)
res = hjson.loads(s)
print(res)
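
If installing hjson is not an option, a rough standard-library sketch of the same idea is to quote the bare keys and strip trailing commas with regexes, then hand the result to json.loads. The helper name js_literal_to_json is made up for illustration, and the approach assumes a simple literal like the one in the question: identifier keys, no string values containing colons, commas, or semicolons, and no comments. It is not a real JavaScript parser.

```python
import json
import re

# Stand-in for the page source; in practice this would be the fetched HTML.
html = '''
<script>
    const testObj = {
        a: 1,
        b: 2,
        c: 3,
    };
</script>'''


def js_literal_to_json(text):
    """Naively convert a simple JS object literal to JSON text (hypothetical helper)."""
    # Wrap bare identifier keys that follow '{' or ',' in double quotes.
    text = re.sub(r'([{,]\s*)([A-Za-z_$][\w$]*)\s*:', r'\1"\2":', text)
    # Drop trailing commas that sit directly before a closing brace or bracket.
    return re.sub(r',\s*([}\]])', r'\1', text)


# Same extraction regex as above: lazily capture everything up to the first ';'.
literal = re.search(r'const testObj = ([\s\S]+?);', html).group(1)
data = json.loads(js_literal_to_json(literal))
print(data['a'])
```

Because the capture group is lazy and stops at the first semicolon, both this sketch and the hjson version would cut the object short if any value contained a ';'.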


3 Comments

Thanks! hjson was able to parse the string! But it still becomes odict_items([('a', 1), ('b', 2), ('c', 3)]). The ultimate goal is to be able to call testObj.a for example and get 1. Is the only way to parse it into a JSON object to manually replace the single quotes at some point?
Did you try print(res['a'])? It will return 1. res.a is attribute access, not a key lookup. I thought you wanted a dictionary object?
You're right. My apologies. I'm coming from JavaScript and was trying res.a. Thank you, and marked as correct.
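
For readers who do want JavaScript-style res.a access, one possible sketch is to round-trip the parsed dictionary through json with an object_hook that builds types.SimpleNamespace objects. The res dictionary below is a plain stand-in for the parsed result, and the nested key is invented for illustration. This only works when every key is a valid Python identifier.

```python
import json
from types import SimpleNamespace

# Stand-in for the parsed result; 'nested' is a hypothetical extra key.
res = {'a': 1, 'b': 2, 'c': 3, 'nested': {'d': 4}}

# object_hook is called for every JSON object, so nested dicts are converted too.
obj = json.loads(json.dumps(res), object_hook=lambda d: SimpleNamespace(**d))

print(obj.a)
print(obj.nested.d)
```

Plain dictionary access (res['a']) remains the more conventional choice in Python; the namespace version is mainly a convenience.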
