
I have a large JSON file. It's log data, and I have compressed it to bz2 format (myfile.json.bz2). The size of the bz2 file is 90MB. I searched for a good solution or a blog post that explains parsing a compressed bz2 JSON file efficiently, but was not able to find any.

Since the file is large, doing something like this is impossible:

import json

with open('data.json') as data_file:
    data = json.load(data_file)

What is the best approach?

After some digging around, I found that Python has a built-in bz2 module for reading bz2 files:

import bz2

input_file = bz2.BZ2File(filename, 'r')
  • You want an incremental JSON parser, e.g. see this answer: (link) Another possibility is this: (link) Commented Jan 21, 2015 at 19:24
  • Since BZ2File has a read method that returns an arbitrary number of bytes, I would probably consider trying to read the JSON as a stream, with something like pypi.python.org/pypi/ijson Commented Jan 21, 2015 at 19:24
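
Following up on the ijson suggestion in the comment above, here is a minimal sketch of feeding the decompressed stream straight into an incremental parser. It assumes the file holds a single top-level JSON array (the "item" prefix iterates its elements) and uses a hypothetical process() handler:

import bz2
import ijson  # third-party package: pip install ijson

# Stream objects out of a top-level JSON array without loading the whole file.
with bz2.open("myfile.json.bz2", "rb") as f:
    for record in ijson.items(f, "item"):  # "item" = each element of the array
        process(record)                     # hypothetical handler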

3 Answers


If someone is looking for a way to parse the Wikidata JSON dump compressed in bz2, here is a snippet of code:

import bz2
import json

f = bz2.BZ2File("latest-all.json.bz2", "r")
next(f)  # skip the first line (the opening "[")
for line in f:
    print(json.loads(line[:-2]))  # drop the trailing ",\n" before parsing

2 Comments

And the magic -2 stands for the end-of-line characters \n\r?
The -2 strips the trailing comma , and the end of the line. The fastest option, performance-wise, would be to parse with a JSON streaming parser.
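
For reference, here is a sketch (not part of the original answer) of a slightly more defensive variant of the line-by-line approach above; it strips whatever trailing comma and newline are present instead of assuming exactly two characters, and skips the opening and closing bracket lines of the dump:

import bz2
import json

with bz2.open("latest-all.json.bz2", "rb") as f:
    for line in f:
        line = line.rstrip(b",\r\n ")     # strip trailing comma, newline, stray whitespace
        if line in (b"[", b"]", b""):     # skip the array delimiter / blank lines
            continue
        print(json.loads(line))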

In the absence of any other suggestions or existing code, I would recommend opening a stream and manually parsing the braces and brackets ({ and [ respectively) until you have a complete object { ... }, then running deserialization on that. This will allow you to chunk the JSON while still leveraging existing JSON libraries.

This is not a solution I would typically recommend, but it's the quickest and most reliable one I can think of if existing libraries don't suit your needs.
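
If it helps, here is a rough sketch of that brace-matching idea, assuming the decompressed payload is essentially a sequence of top-level { ... } objects (the file name and chunk size are placeholders); it tracks string literals so braces inside values don't throw off the count:

import bz2
import json

def iter_objects(path, chunk_size=65536):
    # Yield each complete top-level {...} object found in a bz2-compressed
    # JSON stream by counting braces outside of string literals.
    depth = 0            # current {...} nesting level
    in_string = False    # inside a "..." literal?
    escaped = False      # was the previous character a backslash?
    buf = []             # characters of the object currently being collected

    with bz2.open(path, "rt", encoding="utf-8") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            for ch in chunk:
                if depth:
                    buf.append(ch)
                if in_string:
                    if escaped:
                        escaped = False
                    elif ch == "\\":
                        escaped = True
                    elif ch == '"':
                        in_string = False
                    continue
                if ch == '"':
                    in_string = True
                elif ch == "{":
                    if depth == 0:
                        buf = ["{"]   # start of a new top-level object
                    depth += 1
                elif ch == "}":
                    depth -= 1
                    if depth == 0:
                        yield json.loads("".join(buf))

# usage:
# for obj in iter_objects("myfile.json.bz2"):
#     do_something(obj)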

2 Comments

It would be easier and safer to use an incremental JSON parser like pykler.github.io/yajl-py, rather than figuring it out yourself.
Ah, I was unaware of that library.

Although you have compressed the file itself, once you load it into Python using the json package, you end up loading the entire thing into memory. Due to how Python works, if the file is, say, 100MB, you typically end up using quite a bit more; I recently observed that loading a 324MB JSON file used up 1.5GB of memory.

Now, if the issue is storage, then compression is the way to go. However, if you need to feed it into a program, you'd probably want to think about how to read the JSON one object at a time, as opposed to loading the entire thing into memory.

What @amirouche has suggested should work if you're happy to do it "by hand". For something already available, https://pypi.org/project/json-lineage/ might be a possible solution. Disclaimer: I wrote the code for this.

I'm sure there are other tools out there that do the same - read JSON one object at a time.

If you do end up using json-lineage, here is a small snippet that could do the trick for you:

from json_lineage import load

jsonl_iter = load("path/to/file.json")

for obj in jsonl_iter:
    do_something(obj)


1 Comment

See /help/promotion: you need to very clearly and very explicitly state any affiliation you have with what you're promoting.
