
I am working with a very large dataset and ran into a problem I could not find an answer for. I am trying to parse the data from JSON. Here is what I did for a small piece of the whole dataset, and it works:

import json

s = set()

with open("data.raw", "r") as f:
    for line in f:
        d = json.loads(line)

The confusing part is that when I apply this code to my main data (about 200 GB in size), it raises the following error (without running out of memory):

    d = json.loads(line)
  File "C:\Users\Sathyanarayanan\AppData\Local\Programs\Python\Python35-32\lib\json\__init__.py", line 319, in loads
    return _default_decoder.decode(s)
  File "C:\Users\Sathyanarayanan\AppData\Local\Programs\Python\Python35-32\lib\json\decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "C:\Users\Sathyanarayanan\AppData\Local\Programs\Python\Python35-32\lib\json\decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 2 column 1 (char 1)

type(f) is TextIOWrapper, if that helps... but it is the same type for the small dataset as well...

Here are a few lines of my data so you can see the format:

{"MessageType": "SALES.CONTRACTS.SALESTATUSCHANGED", "Event": {"Id": {"Source": 1, "SourceId": "ZGA=-3-1-002-0801743-2330650"}, "RefInfo": {"TenantId": {"Id": "ZGA="}, "UserId": {"Id": "ZMKj"}, "SentUtc": "2013-01-14T20:17:57.9681547", "Source": 1}, "OldStatus": {"Status": 3, "AutoRemoveInfo": null}, "NewStatus": {"Status": 4, "AutoRemoveInfo": null}, "Items": {"Items": [{"Id": {"Id": 1193}, "Sku": {"Sku": "Con BM20"}, "Quantity": 1, "UnitPrice": {"amount": 11.92, "currency": 840}}], "FulfilledItems": []}, "ShippingInfo": {"Carrier": "", "Class": "", "Region": null, "Country": 0, "PostalCode": null, "Costs": null, "Charges": null}, "SaleDate": "2013-01-13T13:39:57", "PendingItems": null, "Kits": null, "Products": null, "OldSaleDate": "0001-01-01T00:00:00", "AdditionalSaleInfo": null}}
{"MessageType": "SALES.CONTRACTS.SALESHIPPINGINFOCHANGED", "Event": {"Id": {"Source": 1, "SourceId": "ZGA=-3-1-002-0801743-2330650"}, "RefInfo": {"TenantId": {"Id": "ZGA="}, "UserId": {"Id": "ZMKj"}, "SentUtc": "2013-01-14T20:17:57.9681547", "Source": 1}, "Status": {"Status": 4, "AutoRemoveInfo": null}, "Items": {"Items": [{"Id": {"Id": 1193}, "Sku": {"Sku": "Con BM20"}, "Quantity": 1, "UnitPrice": {"amount": 11.92, "currency": 840}}], "FulfilledItems": []}, "OldShippingInfo": {"Carrier": "", "Class": "", "Region": null, "Country": 0, "PostalCode": null, "Costs": null, "Charges": null}, "NewShippingInfo": {"Carrier": "USPS", "Class": "FIRST/RECTPARCEL", "Region": null, "Country": 0, "PostalCode": null, "Costs": null, "Charges": null}, "SaleDate": "0001-01-01T00:00:00", "PendingItems": null, "Kits": null, "Products": null, "OldSaleDate": "0001-01-01T00:00:00", "AdditionalSaleInfo": null}}
{"MessageType": "SALES.CONTRACTS.SALECREATED", "Event": {"Id": {"Source": 1, "SourceId": "ZGA=-3-1-002-4851828-6514632"}, "RefInfo": {"TenantId": {"Id": "ZGA="}, "UserId": {"Id": "ZMKj"}, "SentUtc": "2013-01-14T20:17:58.1402505", "Source": 1}, "Status": {"Status": 4, "AutoRemoveInfo": null}, "Items": {"Items": [{"Id": {"Id": 9223372036854775807}, "Sku": {"Sku": "NFL Blanket Seahawks"}, "Quantity": 1, "UnitPrice": {"amount": 22.99, "currency": 840}}], "FulfilledItems": []}, "ShippingInfo": {"Carrier": "USPS", "Class": "FIRST/RECTPARCEL", "Region": null, "Country": 0, "PostalCode": null, "Costs": null, "Charges": null}, "SaleDate": "2013-01-13T15:51:12", "Kits": null, "Products": null, "AdditionalSaleInfo": null}}
{"MessageType": "SALES.CONTRACTS.SALECREATED", "Event": {"Id": {"Source": 1, "SourceId": "ZGA=-3-1-102-3824485-2270645"}, "RefInfo": {"TenantId": {"Id": "ZGA="}, "UserId": {"Id": "ZMKj"}, "SentUtc": "2013-01-14T20:17:58.3436109", "Source": 1}, "Status": {"Status": 4, "AutoRemoveInfo": null}, "Items": {"Items": [{"Id": {"Id": 9223372036854775807}, "Sku": {"Sku": "NFL CD Wallet Chargers"}, "Quantity": 1, "UnitPrice": {"amount": 12.99, "currency": 840}}], "FulfilledItems": []}, "ShippingInfo": {"Carrier": "USPS", "Class": "FIRST/RECTPARCEL", "Region": null, "Country": 0, "PostalCode": null, "Costs": null, "Charges": null}, "SaleDate": "2013-01-12T02:49:58", "Kits": null, "Products": null, "AdditionalSaleInfo": null}}

It is JSON: I already parse the first 2000 lines and it works perfectly. But when I apply the same procedure to the big file, it raises the error on the very first lines of the data.

  • What changes should be done to that JSON data? Commented Jul 3, 2017 at 20:30
  • Is data.raw a JSON file or a file with a JSON object on each line? If the former, use json.load. Commented Jul 3, 2017 at 20:36
  • Your file is not valid JSON. It seems to contain valid JSON text on each line, though. My advice: fix whatever is generating this "JSON" (it is not actually JSON). Other than that, I suppose you could go line by line and accumulate the deserialized objects into a list or something. Commented Jul 3, 2017 at 20:39
  • Is this a .raw file from MATLAB? Commented Jul 3, 2017 at 20:55
  • Can you run head data.raw to see the format of your file? Commented Jul 3, 2017 at 21:04

3 Answers


Here's some simple code to see what data isn't valid JSON and where it is:

import json

with open("data.raw", "r") as f:
    for i, line in enumerate(f):
        try:
            d = json.loads(line)
        except json.decoder.JSONDecodeError:
            print('Error on line', i + 1, ':\n', repr(line))

3 Comments

Thank you @alex. I used this code and the result is weird! According to it, I get an error on every even line! But when I use only the first 2000 lines of my big file, it doesn't show any error... That is so confusing...
@Mina can you show us one of the error messages? In particular I want to see a line that failed.
You won't believe it, but that was exactly the point: the main big file contained extra blank lines between the records, and that was the reason for the error message! By the way, your suggestion was very helpful for finding the source of the error. Thank you.
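
Given that diagnosis, a minimal fix is to skip blank lines before decoding. This is a sketch assuming every non-empty line is a complete JSON document:

import json

with open("data.raw", "r") as f:
    for line in f:
        if not line.strip():  # skip the blank separator lines
            continue
        d = json.loads(line)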

A good way to read a big JSON dataset is to use a generator (yield in Python): 200 GB is too big for your RAM if the JSON parser stores the whole file in memory, whereas an iterator processes the data step by step and keeps memory use low.

You can use ijson, an iterative JSON parser with a Pythonic interface: http://pypi.python.org/pypi/ijson/.
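
For example, here is a minimal sketch of ijson usage, assuming a recent ijson release that supports the multiple_values option (needed because the file holds many concatenated documents rather than one top-level value):

import ijson

with open("data.raw", "rb") as f:
    # Stream top-level JSON values one at a time instead of
    # loading the whole 200 GB file into memory.
    for record in ijson.items(f, "", multiple_values=True):
        print(record["MessageType"])  # key taken from the sample data above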

But note that your file has a .raw extension, so it may not be a JSON file at all.

To read it as raw binary, do:

import numpy as np

content = np.fromfile("data.raw", dtype=np.int16, sep="")

But this approach can crash on a big file.

In fact, if the .raw file is really a .csv file, you can create your reader like this:

import csv

def read_big_file(filename):
    # The csv module in Python 3 expects text mode with newline=""
    with open(filename, "r", newline="") as csvfile:
        reader = csv.reader(csvfile)
        for row in reader:
            yield row

Or like this for a text file:

def read_big_file(filename):
    with open(filename, "r") as _file:
        for line in _file:
            yield line

Use rb only if your file is binary.

Execute:

for line in read_big_file(filename):
    <treatment>
    <free memory after a size of chunk>
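
As an illustration of the treatment step, here is a sketch that parses each non-empty line as JSON and processes records in fixed-size batches to bound memory use (the batch size and the process function are made up for the example):

import json

def process(batch):
    # stand-in for real work on a list of parsed records
    print(len(batch), "records")

batch = []
for line in read_big_file("data.raw"):
    if not line.strip():        # tolerate blank separator lines
        continue
    batch.append(json.loads(line))
    if len(batch) >= 10_000:    # free memory after a chunk of records
        process(batch)
        batch.clear()
if batch:                       # flush the final partial batch
    process(batch)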

I can make my answer more precise if you share the first lines of your file.

1 Comment

The solution should include more details about ijson usage.
2

Sample JSON data is below. It contains records for two people, but it could just as well be a million. The code below is one solution: it reads the file line by line, accumulates the lines for one person at a time, and parses them as a JSON object.

Data:

[
  {
    "Name" : "Joy",
    "Address" : "123 Main St",
    "Schools" : [
      "University of Chicago",
      "Purdue University"
    ],
    "Hobbies" : [
      {
        "Instrument" : "Guitar",
        "Level" : "Expert"
      },
      {
        "percussion" : "Drum",
        "Level" : "Professional"
      }
    ],
    "Status" : "Student",
    "id" : 111,
    "AltID" : "J111"
  },
  {
    "Name" : "Mary",
    "Address" : "452 Jubal St",
    "Schools" : [
      "University of Pensylvania",
      "Washington University"
    ],
    "Hobbies" : [
      {
        "Instrument" : "Violin",
        "Level" : "Expert"
      },
      {
        "percussion" : "Piano",
        "Level" : "Professional"
      }
    ],
    "Status" : "Employed",
    "id" : 112,
    "AltID" : "M112"
  }
]

Code:

import json

curly_idx = []
jstr = ""
first_curly_found = False
with open("C:\\Users\\Rajeshs\\PycharmProjects\\Project1\\data\\test.json", 'r') as fp:
    # Reading the file line by line
    line = fp.readline()
    lnum = 0
    while line:
        # Track brace depth (note: this simple scan breaks if a
        # string value itself contains '{' or '}')
        for a in line:
            if a == '{':
                curly_idx.append(lnum)
                first_curly_found = True
            elif a == '}':
                curly_idx.pop()

        # when the right curly for every left curly is found,
        # it means that one complete data element was read
        if len(curly_idx) == 0 and first_curly_found:
            jstr = f'{jstr}{line}'
            jstr = jstr.rstrip()
            jstr = jstr.rstrip(',')  # drop the trailing comma between records
            print("------------")
            if len(jstr) > 10:
                print("making json")
                j = json.loads(jstr)
            print(jstr)
            jstr = ""
            line = fp.readline()
            lnum += 1
            continue

        if first_curly_found:
            jstr = f'{jstr}{line}'

        line = fp.readline()
        lnum += 1
        if lnum > 100:  # demo limit: stop after 100 lines
            break
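
A note on the design: counting braces character by character fails as soon as a string value contains '{' or '}'. A more robust alternative, sketched here under the assumption that the top-level structure is an array of objects as in the sample, is to let json.JSONDecoder.raw_decode find the end of each value:

import json

def iter_json_objects(path, chunk_size=1 << 20):
    # Yield top-level JSON objects from a file incrementally,
    # using raw_decode instead of brace counting.
    decoder = json.JSONDecoder()
    buf = ""
    with open(path, "r") as f:
        for chunk in iter(lambda: f.read(chunk_size), ""):
            buf += chunk
            while True:
                # Skip the array punctuation and whitespace between objects
                # (safe only because every value here starts with '{')
                buf = buf.lstrip(" \t\r\n,[]")
                if not buf:
                    break
                try:
                    obj, end = decoder.raw_decode(buf)
                except json.JSONDecodeError:
                    break  # buffer holds an incomplete value; read more
                yield obj
                buf = buf[end:]

for person in iter_json_objects("test.json"):
    print(person["Name"])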

1 Comment

I find this answer very useful. I modified the above code to run on Linux.
