
I am working with a very large dataset and ran into a problem I could not find an answer for. I am trying to parse the data from JSON. Here is what I did for a small piece of the whole dataset, and it works:

import json

s = set()

with open("data.raw", "r") as f:
    for line in f:
        d = json.loads(line)

The confusing part is that when I apply this code to my main data (about 200 GB in size), it raises the following error (without running out of memory):

    d = json.loads(line)
  File "C:\Users\Sathyanarayanan\AppData\Local\Programs\Python\Python35-32\lib\json\__init__.py", line 319, in loads
    return _default_decoder.decode(s)
  File "C:\Users\Sathyanarayanan\AppData\Local\Programs\Python\Python35-32\lib\json\decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "C:\Users\Sathyanarayanan\AppData\Local\Programs\Python\Python35-32\lib\json\decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 2 column 1 (char 1)

type(f) is TextIOWrapper, if that helps... but it is the same type for the small dataset as well...

Here are a few lines of my data so you can see the format:

{"MessageType": "SALES.CONTRACTS.SALESTATUSCHANGED", "Event": {"Id": {"Source": 1, "SourceId": "ZGA=-3-1-002-0801743-2330650"}, "RefInfo": {"TenantId": {"Id": "ZGA="}, "UserId": {"Id": "ZMKj"}, "SentUtc": "2013-01-14T20:17:57.9681547", "Source": 1}, "OldStatus": {"Status": 3, "AutoRemoveInfo": null}, "NewStatus": {"Status": 4, "AutoRemoveInfo": null}, "Items": {"Items": [{"Id": {"Id": 1193}, "Sku": {"Sku": "Con BM20"}, "Quantity": 1, "UnitPrice": {"amount": 11.92, "currency": 840}}], "FulfilledItems": []}, "ShippingInfo": {"Carrier": "", "Class": "", "Region": null, "Country": 0, "PostalCode": null, "Costs": null, "Charges": null}, "SaleDate": "2013-01-13T13:39:57", "PendingItems": null, "Kits": null, "Products": null, "OldSaleDate": "0001-01-01T00:00:00", "AdditionalSaleInfo": null}}
{"MessageType": "SALES.CONTRACTS.SALESHIPPINGINFOCHANGED", "Event": {"Id": {"Source": 1, "SourceId": "ZGA=-3-1-002-0801743-2330650"}, "RefInfo": {"TenantId": {"Id": "ZGA="}, "UserId": {"Id": "ZMKj"}, "SentUtc": "2013-01-14T20:17:57.9681547", "Source": 1}, "Status": {"Status": 4, "AutoRemoveInfo": null}, "Items": {"Items": [{"Id": {"Id": 1193}, "Sku": {"Sku": "Con BM20"}, "Quantity": 1, "UnitPrice": {"amount": 11.92, "currency": 840}}], "FulfilledItems": []}, "OldShippingInfo": {"Carrier": "", "Class": "", "Region": null, "Country": 0, "PostalCode": null, "Costs": null, "Charges": null}, "NewShippingInfo": {"Carrier": "USPS", "Class": "FIRST/RECTPARCEL", "Region": null, "Country": 0, "PostalCode": null, "Costs": null, "Charges": null}, "SaleDate": "0001-01-01T00:00:00", "PendingItems": null, "Kits": null, "Products": null, "OldSaleDate": "0001-01-01T00:00:00", "AdditionalSaleInfo": null}}
{"MessageType": "SALES.CONTRACTS.SALECREATED", "Event": {"Id": {"Source": 1, "SourceId": "ZGA=-3-1-002-4851828-6514632"}, "RefInfo": {"TenantId": {"Id": "ZGA="}, "UserId": {"Id": "ZMKj"}, "SentUtc": "2013-01-14T20:17:58.1402505", "Source": 1}, "Status": {"Status": 4, "AutoRemoveInfo": null}, "Items": {"Items": [{"Id": {"Id": 9223372036854775807}, "Sku": {"Sku": "NFL Blanket Seahawks"}, "Quantity": 1, "UnitPrice": {"amount": 22.99, "currency": 840}}], "FulfilledItems": []}, "ShippingInfo": {"Carrier": "USPS", "Class": "FIRST/RECTPARCEL", "Region": null, "Country": 0, "PostalCode": null, "Costs": null, "Charges": null}, "SaleDate": "2013-01-13T15:51:12", "Kits": null, "Products": null, "AdditionalSaleInfo": null}}
{"MessageType": "SALES.CONTRACTS.SALECREATED", "Event": {"Id": {"Source": 1, "SourceId": "ZGA=-3-1-102-3824485-2270645"}, "RefInfo": {"TenantId": {"Id": "ZGA="}, "UserId": {"Id": "ZMKj"}, "SentUtc": "2013-01-14T20:17:58.3436109", "Source": 1}, "Status": {"Status": 4, "AutoRemoveInfo": null}, "Items": {"Items": [{"Id": {"Id": 9223372036854775807}, "Sku": {"Sku": "NFL CD Wallet Chargers"}, "Quantity": 1, "UnitPrice": {"amount": 12.99, "currency": 840}}], "FulfilledItems": []}, "ShippingInfo": {"Carrier": "USPS", "Class": "FIRST/RECTPARCEL", "Region": null, "Country": 0, "PostalCode": null, "Costs": null, "Charges": null}, "SaleDate": "2013-01-12T02:49:58", "Kits": null, "Products": null, "AdditionalSaleInfo": null}}

It is JSON: I already parse the first 2000 lines and it works perfectly. But when I apply the same procedure to the big file, it raises the error on the very first lines of the data.

  • What changes should be done to that JSON data? Commented Jul 3, 2017 at 20:30
  • Is data.raw a JSON file or a file with a JSON object on each line? If the former, use json.load. Commented Jul 3, 2017 at 20:36
  • Your file is not valid JSON. It seems to contain valid JSON text on each line, though. My advice: fix whatever is generating this "JSON" (it is not actually JSON). Other than that, I suppose you could go line by line and accumulate the deserialized objects into a list or something. Commented Jul 3, 2017 at 20:39
  • Is this a .raw file from MATLAB? Commented Jul 3, 2017 at 20:55
  • Can you run head data.raw to see the format of your file? Commented Jul 3, 2017 at 21:04

3 Answers


Here's some simple code to see what data isn't valid JSON and where it is:

import json

with open("data.raw", "r") as f:
    for i, line in enumerate(f):
        try:
            d = json.loads(line)
        except json.decoder.JSONDecodeError:
            print('Error on line', i + 1, ':\n', repr(line))

3 Comments

Thank you @alex. I used this code and the result is weird! According to it, I get an error on every even line! But when I use only the first 2000 lines of my big file, it doesn't show any error... That is so confusing...
@Mina can you show us one of the error messages? In particular I want to see a line that failed.
You won't believe it, but that was exactly the point: the main big file contained extra blank lines between the records, and that was the reason for the error message! By the way, your suggestion was very helpful for finding the source of the error. Thank you.
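
Given that diagnosis, a minimal fix is to skip blank lines before decoding. This is a sketch assuming every non-empty line is a complete JSON document:

import json

with open("data.raw", "r") as f:
    for line in f:
        if not line.strip():  # skip the blank separator lines
            continue
        d = json.loads(line)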

A good way to read a big JSON dataset is to use a generator (yield in Python): 200 GB is too big for your RAM if the JSON parser stores the whole file in memory, whereas an iterator processes the data step by step and keeps memory use low.

You can use ijson, an iterative JSON parser with a Pythonic interface: http://pypi.python.org/pypi/ijson/.
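
For example, here is a minimal sketch of ijson usage, assuming a recent ijson release that supports the multiple_values option (needed because the file holds many concatenated documents rather than one top-level value):

import ijson

with open("data.raw", "rb") as f:
    # Stream top-level JSON values one at a time instead of
    # loading the whole 200 GB file into memory.
    for record in ijson.items(f, "", multiple_values=True):
        print(record["MessageType"])  # key taken from the sample data above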

But note that your file has a .raw extension, so it may not be a JSON file at all.

To read it as raw binary, do:

import numpy as np

content = np.fromfile("data.raw", dtype=np.int16, sep="")

But this approach can crash on a big file.

In fact, if the .raw file is really a .csv file, you can create your reader like this:

import csv

def read_big_file(filename):
    # The csv module in Python 3 expects text mode with newline=""
    with open(filename, "r", newline="") as csvfile:
        reader = csv.reader(csvfile)
        for row in reader:
            yield row

Or like this for a text file:

def read_big_file(filename):
    with open(filename, "r") as _file:
        for line in _file:
            yield line

Use rb only if your file is binary.

Execute:

for line in read_big_file(filename):
    <treatment>
    <free memory after a size of chunk>
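
As an illustration of the treatment step, here is a sketch that parses each non-empty line as JSON and processes records in fixed-size batches to bound memory use (the batch size and the process function are made up for the example):

import json

def process(batch):
    # stand-in for real work on a list of parsed records
    print(len(batch), "records")

batch = []
for line in read_big_file("data.raw"):
    if not line.strip():        # tolerate blank separator lines
        continue
    batch.append(json.loads(line))
    if len(batch) >= 10_000:    # free memory after a chunk of records
        process(batch)
        batch.clear()
if batch:                       # flush the final partial batch
    process(batch)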

I can make my answer more precise if you share the first lines of your file.

1 Comment

The solution should include more details about ijson usage.
2

Sample JSON data is below. It contains records for two people, but it could just as well be a million. The code below is one solution: it reads the file line by line, accumulates the lines for one person at a time, and parses them as a JSON object.

Data:

[
  {
    "Name" : "Joy",
    "Address" : "123 Main St",
    "Schools" : [
      "University of Chicago",
      "Purdue University"
    ],
    "Hobbies" : [
      {
        "Instrument" : "Guitar",
        "Level" : "Expert"
      },
      {
        "percussion" : "Drum",
        "Level" : "Professional"
      }
    ],
    "Status" : "Student",
    "id" : 111,
    "AltID" : "J111"
  },
  {
    "Name" : "Mary",
    "Address" : "452 Jubal St",
    "Schools" : [
      "University of Pensylvania",
      "Washington University"
    ],
    "Hobbies" : [
      {
        "Instrument" : "Violin",
        "Level" : "Expert"
      },
      {
        "percussion" : "Piano",
        "Level" : "Professional"
      }
    ],
    "Status" : "Employed",
    "id" : 112,
    "AltID" : "M112"
  }
]

Code:

import json

curly_idx = []
jstr = ""
first_curly_found = False
with open("C:\\Users\\Rajeshs\\PycharmProjects\\Project1\\data\\test.json", 'r') as fp:
    # Reading the file line by line
    line = fp.readline()
    lnum = 0
    while line:
        # Track brace depth (note: this simple scan breaks if a
        # string value itself contains '{' or '}')
        for a in line:
            if a == '{':
                curly_idx.append(lnum)
                first_curly_found = True
            elif a == '}':
                curly_idx.pop()

        # when the right curly for every left curly is found,
        # it means that one complete data element was read
        if len(curly_idx) == 0 and first_curly_found:
            jstr = f'{jstr}{line}'
            jstr = jstr.rstrip()
            jstr = jstr.rstrip(',')  # drop the trailing comma between records
            print("------------")
            if len(jstr) > 10:
                print("making json")
                j = json.loads(jstr)
            print(jstr)
            jstr = ""
            line = fp.readline()
            lnum += 1
            continue

        if first_curly_found:
            jstr = f'{jstr}{line}'

        line = fp.readline()
        lnum += 1
        if lnum > 100:  # demo limit: stop after 100 lines
            break
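
A note on the design: counting braces character by character fails as soon as a string value contains '{' or '}'. A more robust alternative, sketched here under the assumption that the top-level structure is an array of objects as in the sample, is to let json.JSONDecoder.raw_decode find the end of each value:

import json

def iter_json_objects(path, chunk_size=1 << 20):
    # Yield top-level JSON objects from a file incrementally,
    # using raw_decode instead of brace counting.
    decoder = json.JSONDecoder()
    buf = ""
    with open(path, "r") as f:
        for chunk in iter(lambda: f.read(chunk_size), ""):
            buf += chunk
            while True:
                # Skip the array punctuation and whitespace between objects
                # (safe only because every value here starts with '{')
                buf = buf.lstrip(" \t\r\n,[]")
                if not buf:
                    break
                try:
                    obj, end = decoder.raw_decode(buf)
                except json.JSONDecodeError:
                    break  # buffer holds an incomplete value; read more
                yield obj
                buf = buf[end:]

for person in iter_json_objects("test.json"):
    print(person["Name"])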

1 Comment

I find this answer very useful. I modified the above code to run on Linux.
