
I have a large JSON file. It's log data, and I have compressed it to bz2 format (myfile.json.bz2). The size of the bz2 file is 90MB. I searched for a good solution or a blog post that explains parsing a compressed bz2 JSON file efficiently, but was not able to find any.

Since the file is large, doing something like this is impossible:

import json

with open('data.json') as data_file:
    data = json.load(data_file)

What is the best approach?

After some digging around, I found that Python has a built-in bz2 module for reading bz2 files:

import bz2

input_file = bz2.BZ2File(filename, 'r')
  • You want an incremental JSON parser, e.g. see this answer: (link) Another possibility is this: (link) Commented Jan 21, 2015 at 19:24
  • Since BZ2File has a read method that returns an arbitrary number of bytes, I would probably consider trying to read the JSON as a stream, with something like pypi.python.org/pypi/ijson Commented Jan 21, 2015 at 19:24
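
Following up on the ijson suggestion in the comment above, here is a minimal sketch of feeding the decompressed stream straight into an incremental parser. It assumes the file holds a single top-level JSON array (the "item" prefix iterates its elements) and uses a hypothetical process() handler:

import bz2
import ijson  # third-party package: pip install ijson

# Stream objects out of a top-level JSON array without loading the whole file.
with bz2.open("myfile.json.bz2", "rb") as f:
    for record in ijson.items(f, "item"):  # "item" = each element of the array
        process(record)                     # hypothetical handler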

3 Answers


If someone is looking for a way to parse the Wikidata JSON dump compressed in bz2, here is a snippet of code:

import bz2
import json

f = bz2.BZ2File("latest-all.json.bz2", "r")
next(f)  # skip the first line (the opening "[")
for line in f:
    print(json.loads(line[:-2]))  # drop the trailing ",\n" before parsing

2 Comments

And the magic -2 stands for the end-of-line characters \n\r?
The -2 strips the trailing comma , and the end of the line. The fastest option, performance-wise, would be to parse with a JSON streaming parser.
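
For reference, here is a sketch (not part of the original answer) of a slightly more defensive variant of the line-by-line approach above; it strips whatever trailing comma and newline are present instead of assuming exactly two characters, and skips the opening and closing bracket lines of the dump:

import bz2
import json

with bz2.open("latest-all.json.bz2", "rb") as f:
    for line in f:
        line = line.rstrip(b",\r\n ")     # strip trailing comma, newline, stray whitespace
        if line in (b"[", b"]", b""):     # skip the array delimiter / blank lines
            continue
        print(json.loads(line))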

In the absence of any other suggestions or existing code, I would recommend opening a stream and manually parsing the braces and brackets ({ and [ respectively) until you have a complete object { ... }, then running deserialization on that. This will allow you to chunk the JSON while still leveraging existing JSON libraries.

This is not a solution I would typically recommend, but it's the quickest and most reliable one I can think of if existing libraries don't suit your needs.
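
If it helps, here is a rough sketch of that brace-matching idea, assuming the decompressed payload is essentially a sequence of top-level { ... } objects (the file name and chunk size are placeholders); it tracks string literals so braces inside values don't throw off the count:

import bz2
import json

def iter_objects(path, chunk_size=65536):
    # Yield each complete top-level {...} object found in a bz2-compressed
    # JSON stream by counting braces outside of string literals.
    depth = 0            # current {...} nesting level
    in_string = False    # inside a "..." literal?
    escaped = False      # was the previous character a backslash?
    buf = []             # characters of the object currently being collected

    with bz2.open(path, "rt", encoding="utf-8") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            for ch in chunk:
                if depth:
                    buf.append(ch)
                if in_string:
                    if escaped:
                        escaped = False
                    elif ch == "\\":
                        escaped = True
                    elif ch == '"':
                        in_string = False
                    continue
                if ch == '"':
                    in_string = True
                elif ch == "{":
                    if depth == 0:
                        buf = ["{"]   # start of a new top-level object
                    depth += 1
                elif ch == "}":
                    depth -= 1
                    if depth == 0:
                        yield json.loads("".join(buf))

# usage:
# for obj in iter_objects("myfile.json.bz2"):
#     do_something(obj)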

2 Comments

It would be easier and safer to use an incremental JSON parser like pykler.github.io/yajl-py, rather than figuring it out yourself.
Ah, I was unaware of that library.

Although you have compressed the file itself, once you load it into Python using the json package, you end up loading the entire thing into memory. Due to how Python works, if the file is, say, 100MB, you typically end up using quite a bit more; I recently observed that loading a 324MB JSON file used up 1.5GB of memory.

Now, if the issue is storage, then compression is the way to go. However, if you need to feed it into a program, you'd probably want to think about how to read the JSON one object at a time, as opposed to loading the entire thing into memory.

What @amirouche has suggested should work if you're happy to do it "by hand". For something already available, https://pypi.org/project/json-lineage/ might be a possible solution. Disclaimer: I wrote the code for this.

I'm sure there are other tools out there that do the same - read JSON one object at a time.

If you do end up using json-lineage, here is a small snippet that could do the trick for you:

from json_lineage import load

jsonl_iter = load("path/to/file.json")

for obj in jsonl_iter:
    do_something(obj)


1 Comment

See /help/promotion: you need to very clearly and very explicitly state any affiliation you have with what you're promoting.
