How to convert Json array list with multiple possible values into columns in a dataframe using pyspark

Question

I am using the Google Admin Report API via the Python SDK in Databricks (Spark + Python 3.5).

It returns data in the following format (Databricks pyspark code):

dbutils.fs.put("/tmp/test.json", '''{
    "userEmail": "[email protected]", 
    "parameters": [
        {
            "intValue": "0",
            "name": "classroom:num_courses_created"
        },
        {
            "boolValue": true,
            "name": "accounts:is_disabled"
        },
        {
            "name": "classroom:role",
            "stringValue": "student"
        }
    ]
}''', True)

There are 188 parameters and for each param it could be an int, bool, date or string. Depending on the field type the Api returns the value in the appropriate value (e.g. intValue for an int field and boolValue for a boolean).

I am writing out this JSON untouched into my datalake and processing it later by loading it into a spark dataframe:

testJsonData = sqlContext.read.json("/tmp/test.json", multiLine=True)

This results in a dataframe with this schema:

userEmail:string
parameters:array
- element:struct
  - boolValue:boolean
  - intValue:string
  - name:string
  - stringValue:string

If I display the dataframe it shows as

{"boolValue":null,"intValue":"0","name":"classroom:num_courses_created","stringValue":null} {"boolValue":true,"intValue":null,"name":"accounts:is_disabled","stringValue":null}
{"boolValue":null,"intValue":null,"name":"classroom:role","stringValue":"student"}

As you can see, it has inferred nulls for the typeValues that do not exist.

The end state that I want is columns in a dataframe like:

and the pivoted columns would be typed correctly (e.g classroom:num_courses_created would be of type int - see yellow columns above)

Here is what I have tried so far:

from pyspark.sql.functions import explode
tempDf = testJsonData.select("userEmail", explode("parameters").alias("parameters_exploded"))
explodedColsDf = tempDf.select("userEmail", "parameters_exploded.*")

This results in a dataframe with this schema:

userEmail:string
boolValue:boolean
intValue:string
name:string
stringValue:string

I then pivot the rows into columns based on the Name field (which is ""classroom:num_courses_created", "classroom:role" etc (there are 188 name/value parameter pairs):

#turn intValue into an Int column
explodedColsDf = explodedColsDf.withColumn("intValue", explodedColsDf.intValue.cast(IntegerType()))
pivotedDf = explodedColsDf.groupBy("userEmail").pivot("name").sum("intValue")

Which results in this dataframe:

userEmail:string
accounts:is_disabled:long
classroom:num_courses_created:long
classroom:role:long

which is not correct as the types for the columns are wrong.

What I need to do is somehow look at all the typeValues for a parameter column (there is no way of knowing the type from the name or inferring it - other than in the original Json where it returns just the typeValue that is relevant) and whichever one is not null is the type of that column. Each param only appears once so the string, bool, int and date values just need to be outputed for the email key, not aggregated.

This is beyond my current knowledge however I was thinking a simpler solution might be to go back all the way to the beginning and pivot the columns before I write out the Json so it would be in the format I want when I load it back into Spark, however I was reluctant to transform the raw data at all. I also would prefer not to handcode the schema for 188 fields as I want to dynamically pick which fields I want so it needs to be able to handle that.

Nilesh Ingle · Accepted Answer · 2019-04-14 02:53:00Z

The code below converts the example JSON provided to a dataframe(without using PySpark).

Import Libraries

import numpy as np
import pandas as pd

Assign variables

true = True
false = False

Assign JSON to a variable

data = [{
"userEmail": "[email protected]", 
"parameters": [
    {
        "intValue": "0",
        "name": "classroom:num_courses_created"
    },
    {
        "boolValue": true,
        "name": "accounts:is_disabled"
    },
    {
        "name": "classroom:role",
        "stringValue": "student"
    }
    ]
},
{
"userEmail": "[email protected]", 
"parameters": [
    {
        "intValue": "1",
        "name": "classroom:num_courses_created"
    },
    {
        "boolValue": false,
        "name": "accounts:is_disabled"
    },
    {
        "name": "classroom:role",
        "stringValue": "student2"
    }
    ]
}

]

Function to convert dictionary to columns

def get_col(x):
    y = pd.DataFrame(x, index=[0])
    col_name = y.iloc[0]['name']
    y = y.drop(columns=['name'])
    y.columns = [col_name]
    return y

Iterate through the JSON list

df = pd.DataFrame()

for item in range(len(data)):

    # Initialize empty dataframe
    trow = pd.DataFrame()
    temp = pd.DataFrame(data[item])

    for i in range(temp.shape[0]):

        # Read each row
        x = temp.iloc[i]['parameters']
        trow = pd.concat([trow,get_col(x)], axis=1)
        trow['userEmail'] = temp.iloc[i]['userEmail']


    df = df.append(trow) 

# Rearrange columns, drop those that are not needed
df = df[['userEmail', 'classroom:num_courses_created', 'accounts:is_disabled', 'classroom:role']]

Output:

......................... Previous edit .....................

Convert JSON/nested dictionaries to a dataframe

temp = pd.DataFrame(data)

# Initialize empty dataframe
df = pd.DataFrame()
for i in range(temp.shape[0]):
    # Read each row
    x = temp.iloc[i]['parameters']
    temp1 = pd.DataFrame([x], columns=x.keys())
    temp1['userEmail'] = temp.iloc[i]['userEmail']

    # Convert nested key:value pairs
    y = x['name'].split(sep=':')
    temp1['name_' + y[0]] = y[1]

    # Combine to dataframe
    df = df.append(temp1, sort=False)

# Rearrange columns, drop those that are not needed
df = df[['userEmail', 'intValue', 'stringValue', 'boolValue', 'name_classroom', 'name_accounts']]

Output

Edit-1 Based on the screenshot in updated question, code below should work.

Assign variables

Hi Nilesh, thanks for the code - however the output is not quite right - I have updated my question with a screenshot of the required output so it is clearer.
Hi Rodney, Thank you for the screenshot of the final desired output table. I have updated the code accordingly. Hope it helps.

Collectives™ on Stack Overflow

How to convert Json array list with multiple possible values into columns in a dataframe using pyspark

1 Answer 1

Assign variables

2 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Assign variables

2 Comments

Your Answer

Sign up or log in

Post as a guest

Related