
How can I transpose CSV data using Java Spark?

Posted by: admin, February 25, 2020

Questions:

I am using Java Spark and I want to know if there is any way I can transform the sample data given below

Incremental Cost Number | Approver Names
---------------------------------------------------------------------------------------------------------------------------
S703401                 | Ryan P Cassidy|Christopher J Mattingly|Frank E LaSota|Ryan P Cassidy|Anthony L Locricchio|Jason Monte

into something like this.

Incremental Cost Number| Approver Names                          
-------------------------------------------
S703401                | Ryan P Cassidy
S703401                | Christopher J Mattingly
S703401                | Frank E LaSota
S703401                | Ryan P Cassidy
S703401                | Anthony L Locricchio
S703401                | Jason Monte 

Also, the file I am importing is a comma-separated CSV file; it is just that this particular column contains multiple values separated by a pipe symbol. The same applies if I have multiple Incremental Cost Number values.

Answers:

You can do something like the following if you have multiple columns:

  import org.apache.spark.sql.functions._

  val df = Seq((
      "S703401",
      "Ryan P Cassidy|Christopher J Mattingly|Frank E LaSota|Ryan P Cassidy|Anthony L Locricchio|Jason Monte",
      "xyz|mnp|abc"
    )).toDF("Incremental Cost Number", "Approver Names", "3rd Column")


  df.show()   // original DataFrame with the pipe-delimited columns

  df.withColumn("Approver Names", explode(split(col("Approver Names"), "\\|")))
    .withColumn("3rd Column", explode(split(col("3rd Column"), "\\|")))
    .show()


   +-----------------------+--------------------+-----------+
   |Incremental Cost Number|      Approver Names| 3rd Column|
   +-----------------------+--------------------+-----------+
   |                S703401|Ryan P Cassidy|Ch...|xyz|mnp|abc|
   +-----------------------+--------------------+-----------+

   +-----------------------+--------------------+----------+
   |Incremental Cost Number|      Approver Names|3rd Column|
   +-----------------------+--------------------+----------+
   |                S703401|      Ryan P Cassidy|       xyz|
   |                S703401|      Ryan P Cassidy|       mnp|
   |                S703401|      Ryan P Cassidy|       abc|
   |                S703401|Christopher J Mat...|       xyz|
   |                S703401|Christopher J Mat...|       mnp|
   |                S703401|Christopher J Mat...|       abc|
   |                S703401|      Frank E LaSota|       xyz|
   |                S703401|      Frank E LaSota|       mnp|
   |                S703401|      Frank E LaSota|       abc|
   |                S703401|      Ryan P Cassidy|       xyz|
   |                S703401|      Ryan P Cassidy|       mnp|
   |                S703401|      Ryan P Cassidy|       abc|
   |                S703401|Anthony L Locricchio|       xyz|
   |                S703401|Anthony L Locricchio|       mnp|
   |                S703401|Anthony L Locricchio|       abc|
   |                S703401|         Jason Monte|       xyz|
   |                S703401|         Jason Monte|       mnp|
   |                S703401|         Jason Monte|       abc|
   +-----------------------+--------------------+----------+
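Since the data in the question actually comes from a comma-separated CSV file rather than a hard-coded Seq, the same split/explode can be applied to a DataFrame read from disk. A minimal sketch, assuming a header row and a hypothetical file path:

   import org.apache.spark.sql.SparkSession
   import org.apache.spark.sql.functions._

   val spark = SparkSession.builder().appName("TransposeCsv").getOrCreate()

   // Path and header option are assumptions; adjust them for the real file.
   val raw = spark.read
     .option("header", "true")
     .csv("/path/to/input.csv")

   // One output row per approver, with the cost number repeated on each row.
   raw.withColumn("Approver Names", explode(split(col("Approver Names"), "\\|")))
      .show(false)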

Answer:

I think you need to split the second column on “|” and then use the explode() function:

df.select(col("id"), explode(split(col("a"), "\\|")).as("a")).show()

+-------+--------------------+
|     id|                   a|
+-------+--------------------+
|S703401|      Ryan P Cassidy|
|S703401|Christopher J Mat...|
|S703401|      Frank E LaSota|
|S703401|      Ryan P Cassidy|
|S703401|Anthony L Locricchio|
|S703401|         Jason Monte|
+-------+--------------------+
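Here "id" and "a" are stand-ins for the question's columns; with the actual headers from the sample data, the same call would look roughly like this (show(false) only keeps the long names from being truncated):

  df.select(
      col("Incremental Cost Number"),
      explode(split(col("Approver Names"), "\\|")).as("Approver Names")
    ).show(false)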

Answer:

Note: This is the RDD way of doing things. It might be easier with Scala and the DataFrame API.

  1. Use SparkContext to read the file.
  2. More specifically, use the textFile() API, which gives you an RDD.
  3. Once you have the RDD, tokenize each record on the comma. This is done by invoking the map() API on the RDD and passing a map function to it; in your case, that function splits the comma-delimited string into multiple tokens. You can use the Tuple data structure to emit the output.
  4. You can choose Tuple1 to Tuple22 based on the number of fields you have. Refer here.
  5. Step 3 should give you back an RDD of tuples. Run a flatMap function on this RDD that takes the first field of each tuple and concatenates it with each of the other required tuple fields.
  6. Once that is done, you can put everything back together by concatenating all tuple fields with a comma delimiter (this will be another map function).
  7. In the end, call saveAsTextFile() to save the updated data. A rough sketch of these steps is shown below.
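A minimal sketch of the RDD route described above, assuming a two-column input with no header row where the second field is pipe-delimited (the SparkContext, file paths, and column order are assumptions):

  import org.apache.spark.SparkContext

  def transposeCsv(sc: SparkContext, inputPath: String, outputPath: String): Unit = {
    sc.textFile(inputPath)                            // steps 1-2: read the file as an RDD of lines
      .map(line => line.split(",", 2))                // step 3: split each record on the first comma
      .flatMap {                                      // step 5: emit one (cost, approver) pair per name
        case Array(cost, approvers) =>
          approvers.split("\\|").toSeq.map(name => (cost, name.trim))
        case _ => Seq.empty[(String, String)]         // skip malformed records
      }
      .map { case (cost, name) => s"$cost,$name" }    // step 6: join the fields back with a comma
      .saveAsTextFile(outputPath)                     // step 7: write the transposed data
  }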