I'm trying to parse the testObj in the html into JSON, but it includes so much formatting.
I already tried to remove the non-ascii characters in the object, but json.loads() and yaml still can't parse the string into an object.
How can I parse the string into an object?
html
<!DOCTYPE html>
<html lang="en">
<head>
<title>Sample Document</title>
</head>
<body></body>
<script>
const testObj = {
a: 1,
b: 2,
c: 3,
};
</script>
</html>
Python Script
import lxml.html
import urllib.request
import os
import json
import yaml
def removeNonAscii(str):
return ''.join(i for i in str if ord(i)>31 and ord(i)<126)
with urllib.request.urlopen('file:///'+os.path.abspath('./test.html')) as url:
page = url.read()
tree = lxml.html.fromstring(page)
x = tree.xpath("//script")[0].text_content()
json_str = x.strip().split('testObj = ')[1][:-1]
str = removeNonAscii(json_str)
print(str)
# >>> {a: 1,b: 2,c: 3,}
# Attempt 1 - This doesn't work as object doesn't originally have double quotes
# data = json.loads(str)
# >>> json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes
# Attempt 2 - Not sure how to detect or get rid of formatting
# data = yaml.load(str, yaml.SafeLoader)
# >>> ScannerError: While scanning for the next token found character '\t' that cannot start any token
print(data.a)
# >>> Should return 1
Edit: In my actual use case, the JSON object is very large and I cannot recreate the string. I need to remove the formatting and/or add double quotes to make it proper JSON so it can parse, but not sure how to do it. I'm close getting it to {a: 1,b: 2,c: 3,} but it still doesn't want to parse.

{a: 1,b: 2,c: 3,}but it's still detecting odd formatting and I'm not sure how to add proper double quotes to the keys.testObjin the script tag. In my attempted answer, I've gotten it down to{a: 1,b: 2,c: 3,}but it still doesn't parse. Not sure where the sentiment from your comment is coming from, nor what you want me to do better my friend.