1

Being new to Spark, I am working on something and facing difficulty. Any leads will help. I am trying to create a JSON from dataframe which I have but toJSON function is not helping me out. So my output data frame is something like below :-

+---------+------------------+-------------------------+
|booking_id|    status           |count(status)|
+---------+------------------+-------------------------+
|  132         |     rent count.       |                        6|
|  132         |     rent booked     |                      24|
|  132         |     rent delayed    |                        6|
|  134         |     rent booked     |                      34|
|  134         |     rent delayed.   |                       21|

The output I am looking for is a dataframe which will contain booking id and status and its count as Json

+---------+-------------------------------------------+
|booking_id|    status_json         
+---------+-------------------------------------------+
|  132         |   { "rent count": 6, "rent booked": 24, "rent delayed": 6}  
|  134        |     { "rent booked": 34, "rent delayed": 21}

Thanks in advance.

1
  • First create a map column with staus and countstatus columns. Then groupBy, agg(collect_list("yourmapcolumn")), finally call toJSON Commented Jun 15, 2020 at 3:44

2 Answers 2

3

For Spark2.4, use map_from_arrays.

from pyspark.sql import functions as F

df.groupBy("booking_id").agg(F.to_json(F.map_from_arrays(F.collect_list("status"),F.collect_list("count(status)")))\
                              .alias("status_json"))\
                              .show(truncate=False)


#+----------+--------------------------------------------------+
#|booking_id|status_json                                       |
#+----------+--------------------------------------------------+
#|132       |{"rent count":6,"rent booked":24,"rent delayed":6}|
#|134       |{"rent booked":34,"rent delayed":21}              |
#+----------+--------------------------------------------------+
Sign up to request clarification or add additional context in comments.

Comments

0
  val sourceDF = Seq(
    (132, "rent count", 6),
    (132, "rent booked", 24),
    (132, "rent delayed", 6),
    (134, "rent booked", 34),
    (134, "rent delayed", 21)
  ).toDF("booking_id", "status", "count(status)")

  val resDF = sourceDF
    .groupBy("booking_id")
    .agg(to_json(collect_list(map(col("status"), col("count(status)")))).alias("status_json"))

  //  +----------+--------------------------------------------------------+
  //  |booking_id|status_json                                             |
  //  +----------+--------------------------------------------------------+
  //  |132       |[{"rent count":6},{"rent booked":24},{"rent delayed":6}]|
  //  |134       |[{"rent booked":34},{"rent delayed":21}]                |
  //  +----------+--------------------------------------------------------+

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.