I need to generate a unique identifier for each row, per partition, per job, per spark-submit.
After going through the answers to similar questions here on SO, I could not find a solution that fits my constraints. The constraints and the approaches I have tried are below.
Generated ID constraints:
- The generated ID should be numeric
- The generated ID should be up to 10 digits
- The generated ID should be unique for each row across all partitions of a job
- If multiple Spark jobs run in parallel, the generated IDs should be unique across those jobs as well
Approaches tried:
1. monotonically_increasing_id() – unique across partitions within a job, but not sequential
2. Window with row_number() – moves all the data into a single partition before assigning row numbers, so an OOM error is possible
3. zipWithIndex – requires a Dataset → RDD → Dataset conversion, which may impact performance
4. String hash manipulation – integer overflow is highly probable and would end up producing negative numeric keys
5. Slicing pre-generated keys per table for each executor for the current job
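For reference, approach 5 can be sketched as plain offset arithmetic: each spark-submit reserves a disjoint range of keys, and each partition gets a fixed-size slice of that range. The names `JOB_RANGE` and `SLICE_SIZE` are hypothetical sizing parameters, and in a real job the per-row part would run inside `mapPartitionsWithIndex`; this is only a sketch of the numbering scheme, not a Spark implementation:

```python
# Sketch of approach 5: pre-sliced numeric key ranges.
# In Spark, the per-partition arithmetic would run inside mapPartitionsWithIndex.

MAX_ID = 10**10          # IDs must stay within 10 digits
JOB_RANGE = 10**8        # hypothetical: keys reserved per spark-submit
SLICE_SIZE = 10**5       # hypothetical: keys reserved per partition

def row_id(job_seq: int, partition_id: int, row_index: int) -> int:
    """Compose a unique ID from disjoint job / partition / row slices."""
    if row_index >= SLICE_SIZE:
        raise ValueError("partition slice exhausted; widen SLICE_SIZE")
    if (partition_id + 1) * SLICE_SIZE > JOB_RANGE:
        raise ValueError("job range exhausted; widen JOB_RANGE")
    return job_seq * JOB_RANGE + partition_id * SLICE_SIZE + row_index

# Two jobs, two partitions each, three rows per partition:
# all IDs are distinct and fit in 10 digits.
ids = {row_id(j, p, r) for j in (0, 1) for p in (0, 1) for r in range(3)}
assert len(ids) == 12 and all(i < MAX_ID for i in ids)
```

Note that the job-level sequence number (`job_seq` here) would still have to come from some external coordinator, e.g. a database sequence, so that parallel spark-submits never pick the same value.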
For solutions 1-4 I have listed the associated issues above.
For solution 5, I am not sure whether it is the best option available with Spark.
Any suggestions are welcome.