My dataset has an array-type column that I need to convert to a string type. I have done it in a conventional way, but I feel it can be done in a better way. Can you please guide me?

Input Dataset

    +---------------------------+-----------+-------------------------------------------------------------------------------------------------+
    |ManufacturerSource         |upcSource  |productDescriptionSource                                                                         |
    +---------------------------+-----------+-------------------------------------------------------------------------------------------------+
    |3M                         |51115665883|[c, gdg, whl, t27, 5, x, 1, 4, x, 7, 8, grindig, flap, wheels, 36, grit, 12, 250, rpm]           |
    |3M                         |51115665937|[c, gdg, whl, t27, q, c, 6, x, 1, 4, x, 5, 8, 11, grinding, flap, wheels, 36, grit, 10, 200, rpm]|
    |3M                         |0          |[3mite, rb, cloth, 3, x, 2, wd]                                                                  |
    |3M                         |0          |[trizact, disc, cloth, 237aaa16x5, hole]                                                         |
    +---------------------------+-----------+-------------------------------------------------------------------------------------------------+

Expected Output DataSet

    +---------------------------+-----------+---------------------------------------------------------------------------------------------------+
    |ManufacturerSource         |upcSource  |productDescriptionSource                                                                           |
    +---------------------------+-----------+---------------------------------------------------------------------------------------------------+
    |3M                         |51115665883|c gdg whl t27 5 x 1 4 x 7 8 grinding flap wheels 36 grit 12 250 rpm                                |
    |3M                         |51115665937|c gdg whl t27 q c 6 x 1 4 x 5 8 11 grinding flap wheels 36 grit 10 200 rpm                         |
    |3M                         |0          |3mite  rb  cloth  3  x  2  wd                                                                      |
    |3M                         |0          |trizact  disc  cloth  237aaa16x5  hole                                                             |
    +---------------------------+-----------+---------------------------------------------------------------------------------------------------+

Conventional Approach 1

    // Select the array column, collect the rows to the driver, and join each row's tokens.
    Dataset<Row> afterstopwordsRemoved =
            stopwordsRemoved.select("productDescriptionSource");
    stopwordsRemoved.show();

    List<Row> individualRows = afterstopwordsRemoved.collectAsList();

    System.out.println("After flatmap\n");
    List<String> temp;
    for (Row individualRow : individualRows) {
        temp = individualRow.getList(0);
        System.out.println(String.join(" ", temp));
    }

Approach 2 (not yielding a result)

Exception: Failed to execute user defined function($anonfun$27: (array) => string)

    UDF1 untoken = new UDF1<String,String[]>() {
        public String call(String[] token) throws Exception {
            //return types.replaceAll("[^a-zA-Z0-9\\s+]", "");
            return Arrays.toString(token);
        }

        @Override
        public String[] call(String t1) throws Exception {
            // TODO Auto-generated method stub
            return null;
        }
    };

    sqlContext.udf().register("unTokenize", untoken, DataTypes.StringType);

    source.createOrReplaceTempView("DataSetOfTokenize");
    Dataset<Row> newDF = sqlContext.sql("select *,unTokenize(productDescriptionSource) FROM DataSetOfTokenize");
    newDF.show(4000,false);

1 Answer

I'd use concat_ws:

    sqlContext.sql("select *, concat_ws(' ', productDescriptionSource) FROM DataSetOfTokenize");

or:

    import static org.apache.spark.sql.functions.*;

    df.withColumn("foo", concat_ws(" ", col("productDescriptionSource")));
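
As a rough illustration of the difference in output: `Arrays.toString`, which the UDF in the question calls, keeps the brackets and commas of the array form, while joining on a single space gives the expected string. The sketch below is plain Java with no Spark dependency. The class and method names (`JoinTokensSketch`, `joinTokens`) are hypothetical, and the null-skipping is only meant to mirror how `concat_ws` is documented to treat null values, not to reproduce Spark's implementation.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;
import java.util.stream.Collectors;

public class JoinTokensSketch {

    // Hypothetical helper: joins tokens with a separator and skips nulls,
    // as an approximation of what concat_ws does for an array column.
    static String joinTokens(String sep, List<String> tokens) {
        return tokens.stream()
                .filter(Objects::nonNull)
                .collect(Collectors.joining(sep));
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("3mite", "rb", "cloth", "3", "x", "2", "wd");

        // Arrays.toString keeps the array notation.
        System.out.println(Arrays.toString(tokens.toArray(new String[0])));
        // prints: [3mite, rb, cloth, 3, x, 2, wd]

        // Joining on a space yields the space-separated form.
        System.out.println(joinTokens(" ", tokens));
        // prints: 3mite rb cloth 3 x 2 wd
    }
}
```

With `concat_ws` the join happens inside Spark, so no rows need to be collected to the driver as in Approach 1.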