I am using the Google Admin Report API via the Python SDK in Databricks (Spark + Python 3.5).
It returns data in the following format (Databricks pyspark code):
dbutils.fs.put("/tmp/test.json", '''{
  "userEmail": "[email protected]",
  "parameters": [
    {
      "intValue": "0",
      "name": "classroom:num_courses_created"
    },
    {
      "boolValue": true,
      "name": "accounts:is_disabled"
    },
    {
      "name": "classroom:role",
      "stringValue": "student"
    }
  ]
}''', True)
There are 188 parameters, and each one can be an int, bool, date or string. Depending on the field type, the API returns the value in the corresponding field (e.g. intValue for an int field, boolValue for a boolean).
I am writing this JSON out untouched into my data lake and processing it later by loading it into a Spark dataframe:
testJsonData = sqlContext.read.json("/tmp/test.json", multiLine=True)
This results in a dataframe with this schema:
- userEmail:string
- parameters:array
    - element:struct
        - boolValue:boolean
        - intValue:string
        - name:string
        - stringValue:string
If I display the dataframe, the parameters show as:
{"boolValue":null,"intValue":"0","name":"classroom:num_courses_created","stringValue":null}
{"boolValue":true,"intValue":null,"name":"accounts:is_disabled","stringValue":null}
{"boolValue":null,"intValue":null,"name":"classroom:role","stringValue":"student"}
As you can see, it has inferred nulls for the typeValues that do not exist.
The end state that I want is a dataframe with one row per userEmail and one column per parameter, with the pivoted columns typed correctly (e.g. classroom:num_courses_created would be of type int, accounts:is_disabled a boolean and classroom:role a string).
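To make that concrete, here is a hand-built sketch of the shape I am after (column names and values taken from the sample JSON above; the real data would have 188 such parameter columns):
# hand-built example of the desired end state, not how I intend to produce it
desiredDf = sqlContext.createDataFrame(
    [("[email protected]", 0, True, "student")],
    ["userEmail", "classroom:num_courses_created", "accounts:is_disabled", "classroom:role"])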
Here is what I have tried so far:
from pyspark.sql.functions import explode
tempDf = testJsonData.select("userEmail", explode("parameters").alias("parameters_exploded"))
explodedColsDf = tempDf.select("userEmail", "parameters_exploded.*")
This results in a dataframe with this schema:
- userEmail:string
- boolValue:boolean
- intValue:string
- name:string
- stringValue:string
I then pivot the rows into columns based on the name field (values like "classroom:num_courses_created", "classroom:role", etc.; there are 188 name/value parameter pairs):
from pyspark.sql.types import IntegerType

# turn intValue into an int column
explodedColsDf = explodedColsDf.withColumn("intValue", explodedColsDf.intValue.cast(IntegerType()))
pivotedDf = explodedColsDf.groupBy("userEmail").pivot("name").sum("intValue")
Which results in this dataframe:
- userEmail:string
- accounts:is_disabled:long
- classroom:num_courses_created:long
- classroom:role:long
which is not correct as the types for the columns are wrong.
What I need to do is somehow look at all the typeValues for a parameter column (there is no way of knowing the type from the name or inferring it, other than in the original JSON, where only the relevant typeValue is returned): whichever one is not null is the type of that column. Each param only appears once per user, so the string, bool, int and date values just need to be output against the userEmail key, not aggregated.
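The closest I can picture is a rough sketch like the one below, which collapses the typed value columns into a single string column and pivots with first() instead of an aggregate (each userEmail/name pair occurs only once, so first() just carries the value through; I have ignored the date case since the sample only has the three fields). It loses the types, which is the part I am stuck on:
from pyspark.sql.functions import coalesce, col, first

# for each row exactly one of the typed value fields is populated, so coalesce picks it
valueDf = explodedColsDf.withColumn(
    "value",
    coalesce(col("stringValue"), col("intValue").cast("string"), col("boolValue").cast("string")))
pivotedDf = valueDf.groupBy("userEmail").pivot("name").agg(first("value"))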
Getting the types right is beyond my current knowledge. I was thinking a simpler solution might be to go back to the beginning and pivot the columns before I write out the JSON, so it would already be in the format I want when I load it back into Spark, but I am reluctant to transform the raw data at all. I would also prefer not to hand-code the schema for 188 fields, as I want to dynamically pick which fields I use, so the solution needs to handle that.
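The closest I have got to doing it dynamically is the rough sketch below (it assumes the pivoted columns are all strings, as in the previous sketch, and works out each column's type from whichever typed field is ever populated for that name), but I suspect there is a cleaner way:
from pyspark.sql.functions import col, count

# count() ignores nulls, so these counts reveal which typed field each parameter actually uses
typeCounts = (explodedColsDf.groupBy("name")
              .agg(count("intValue").alias("ints"),
                   count("boolValue").alias("bools"),
                   count("stringValue").alias("strings"))
              .collect())

for row in typeCounts:
    if row["ints"] > 0:
        targetType = "int"
    elif row["bools"] > 0:
        targetType = "boolean"
    else:
        targetType = "string"
    pivotedDf = pivotedDf.withColumn(row["name"], col(row["name"]).cast(targetType))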


