
I need to merge multiple columns of a DataFrame into a single column whose value is a list (or tuple) of the original column values, using PySpark in Python.

Input DataFrame:

+-------+-------+-------+-------+-------+
| name  |mark1  |mark2  |mark3  | Grade |
+-------+-------+-------+-------+-------+
| Jim   | 20    | 30    | 40    |  "C"  |
+-------+-------+-------+-------+-------+
| Bill  | 30    | 35    | 45    |  "A"  |
+-------+-------+-------+-------+-------+
| Kim   | 25    | 36    | 42    |  "B"  |
+-------+-------+-------+-------+-------+

The output DataFrame should be:

+-------+-----------------+
| name  |marks            |
+-------+-----------------+
| Jim   | [20,30,40,"C"]  |
+-------+-----------------+
| Bill  | [30,35,45,"A"]  |
+-------+-----------------+
| Kim   | [25,36,42,"B"]  |
+-------+-----------------+

4 Answers


Columns can be merged with Spark's array function:

import pyspark.sql.functions as f

columns = [f.col("mark1"), ...] 

output = input.withColumn("marks", f.array(columns)).select("name", "marks")

You might need to change the type of the entries for the merge to succeed, since array expects all of its inputs to have the same type.
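For reference, a complete runnable sketch of this approach, assuming the question's DataFrame is bound to input_df (the column names come from the question's table). Casting every entry to string avoids the type mismatch between the integer mark columns and the string Grade column:

import pyspark.sql.functions as f

# Cast all columns to string so f.array receives elements of a single type.
columns = [f.col(c).cast("string") for c in ["mark1", "mark2", "mark3", "Grade"]]

output = input_df.withColumn("marks", f.array(*columns)).select("name", "marks")
output.show(truncate=False)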




Have a look at this doc: https://spark.apache.org/docs/2.1.0/ml-features.html#vectorassembler

from pyspark.ml.feature import VectorAssembler

# VectorAssembler combines the given numeric columns into a single vector column.
assembler = VectorAssembler(
    inputCols=["mark1", "mark2", "mark3"],
    outputCol="marks")

output = assembler.transform(dataset)
output.select("name", "marks").show(truncate=False)

2 Comments

I have string columns that I need to merge as well. With string columns, output = assembler.transform(mydata_df) fails with: pyspark.sql.utils.IllegalArgumentException: u'Data type StringType is not supported.'
VectorAssembler only merges numeric columns, not strings; in this case there is a string column (Grade).

You can do it in a select like the following:

from pyspark.sql.functions import *    
df.select( 'name' ,
        concat( 
            col("mark1"), lit(","), 
            col("mark2"), lit(","), 
            col("mark3"), lit(","),
            col("Grade")
        ).alias('marks')  
    )

If the brackets [ ] are necessary, they can be added with the lit function:

from pyspark.sql.functions import *    
df.select( 'name' ,
        concat(lit("["), 
            col("mark1"), lit(","), 
            col("mark2"), lit(","), 
            col("mark3"), lit(","),
            col("Grade"), lit("]")
        ).alias('marks')  
    )
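Note that concat returns a single string such as [20,30,40,C], not an array column. If a plain comma-separated string is enough, concat_ws is a slightly shorter equivalent; this is a sketch, not part of the original answer, and depending on your Spark version you may need to cast the numeric columns to string explicitly:

from pyspark.sql.functions import concat_ws, col

# concat_ws joins the column values with the given separator into one string.
df.select(
    'name',
    concat_ws(",", col("mark1"), col("mark2"), col("mark3"), col("Grade")).alias('marks')
).show(truncate=False)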



If this is still relevant, you can use StringIndexer to encode your string values as numeric indices.
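A minimal sketch of that idea, combined with the VectorAssembler answer above; the indexed column name grade_idx is made up for illustration, and the other column names come from the question:

from pyspark.ml.feature import StringIndexer, VectorAssembler

# StringIndexer maps each distinct Grade value to a numeric index (a double),
# which VectorAssembler can then merge with the mark columns.
indexer = StringIndexer(inputCol="Grade", outputCol="grade_idx")
indexed = indexer.fit(df).transform(df)

assembler = VectorAssembler(
    inputCols=["mark1", "mark2", "mark3", "grade_idx"],
    outputCol="marks")

output = assembler.transform(indexed)
output.select("name", "marks").show(truncate=False)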

