
Spark Java Unique Id for rows on different partitions for parallel spark jobs-Exceptionshub

Posted by: admin February 25, 2020

Questions:

I need to generate a unique identifier for each row, per partition, per job, per spark-submit.
After going through answers to similar questions here on SO, I could not find a solution that fits my constraints. Below are the constraints and the approaches I have tried:

Generated ID Constraints:

  1. The generated ID must be numeric
  2. The generated ID must be at most 10 digits
  3. Within a job, the generated ID must be unique for every row across all partitions
  4. If multiple spark jobs run in parallel, the generated IDs must be unique across those jobs as well

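To make the 10-digit constraint concrete, one way to reason about it is as a digit budget split between job, partition, and row. The split below (2 digits for job, 3 for partition, 5 for row) is only a hypothetical illustration; the class name and limits are mine, not from any library:

```java
public class IdComposer {
    // Hypothetical digit budget: 2 digits for job id, 3 for partition id,
    // 5 for the row index within a partition -> at most 10 digits total.
    static final long JOB_BASE = 100_000_000L; // 10^8
    static final long PART_BASE = 100_000L;    // 10^5

    static long composeId(long jobId, long partitionId, long rowIndex) {
        if (jobId > 99 || partitionId > 999 || rowIndex > 99_999) {
            throw new IllegalArgumentException("digit budget exceeded");
        }
        return jobId * JOB_BASE + partitionId * PART_BASE + rowIndex;
    }

    public static void main(String[] args) {
        System.out.println(composeId(42, 7, 123));        // 4200700123
        System.out.println(composeId(99, 999, 99_999));   // 9999999999 (largest possible)
    }
}
```

Of course, such a fixed split only works if the per-job and per-partition counts stay inside their allotted ranges, which is exactly the kind of trade-off I am asking about.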
Probable Solutions:

  1. monotonically_increasing_id() – unique across partitions within a job, but not sequential

  2. window with row_number() – moves all data into one partition before numbering the rows, so an OOM error is possible

  3. zipWithIndex – requires a Dataset → RDD → Dataset round trip, which may hurt performance

  4. String hash manipulation – integer overflow is highly probable, ending up with negative numeric keys
  5. Slicing pre-generated keys per table for each executor for the current job

For solutions 1-4 I have listed the associated issues above.
For solution 5, I am not sure whether it is the best option available with Spark.
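As a sketch of what I mean by solution 5: reserve a disjoint, contiguous block of keys for each job, then slice that block per partition so each partition hands out its own sub-range locally (for example inside mapPartitionsWithIndex). The helper below is purely illustrative; the class name and numbers are assumptions, and the per-job offset would still need to be coordinated externally (e.g. via a database sequence):

```java
// Hypothetical helper: slices a job's reserved key range into
// per-partition blocks so rows can be numbered without coordination.
public class KeySlicer {
    private final long jobOffset; // first key reserved for this job
    private final long sliceSize; // keys reserved per partition

    public KeySlicer(long jobOffset, long sliceSize) {
        this.jobOffset = jobOffset;
        this.sliceSize = sliceSize;
    }

    // First key of a partition's slice; row i in that partition
    // then gets firstKey(partitionId) + i.
    public long firstKey(int partitionId) {
        return jobOffset + (long) partitionId * sliceSize;
    }

    public static void main(String[] args) {
        KeySlicer slicer = new KeySlicer(1_000_000L, 10_000L);
        System.out.println(slicer.firstKey(0)); // 1000000
        System.out.println(slicer.firstKey(3)); // 1030000
    }
}
```

The open question is whether this per-executor slicing is more robust than the other options, since it depends on choosing a slice size no partition will overflow.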

Any suggestions are welcome.
