We receive transactional data in which every transaction has an id that is shorter than 255 characters and is always unique.
But the limitation here is we aren’t allowed to store the transaction id in our database.
Hence, to uniquely identify a transaction, we thought of using a hash function to generate a hash with the transaction id as the input.
That way we do not save duplicate transactions, which would corrupt the metadata that we want to calculate (e.g. averages, standard deviations, etc.).
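The idea can be sketched roughly as follows. This is only an illustration: the choice of SHA-256, the in-memory set, and the names `fingerprint` and `is_new_transaction` are assumptions made for the example, not part of our actual system.

```python
import hashlib

# Hypothetical in-memory store of fingerprints seen so far; a real system
# would more likely use a database column with a unique constraint.
seen_fingerprints = set()


def fingerprint(transaction_id: str) -> bytes:
    """Return the SHA-256 digest of the id; the id itself is never stored."""
    return hashlib.sha256(transaction_id.encode("utf-8")).digest()


def is_new_transaction(transaction_id: str) -> bool:
    """Record the fingerprint and report whether it is seen for the first time."""
    digest = fingerprint(transaction_id)
    if digest in seen_fingerprints:
        return False
    seen_fingerprints.add(digest)
    return True
```

Only transactions for which `is_new_transaction` returns `True` would be included in the aggregated metadata.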
For a large amount of transaction data coming into the system, which hash function would you recommend that has a low collision probability and is fast enough?
By fast enough I mean generating a hash in < 100 ns.
The provider of this transaction data hasn’t faced the same problem because they are the generators and we are the first consumers.
I also looked up a few answers on Stack Overflow which suggested that SHA-512 is a bit faster than SHA-256 on 64-bit systems.
Also, is there a better approach for solving this?
Do not (ab)use a hash function for this. Hashing is good for protecting passwords, or for hash maps, where you have a secondary criterion to validate that an object is in fact the object you expect. But using a hash as a presumably unique key (when it is not guaranteed to be) is inherently risky.
As correctly mentioned, the likelihood of a SHA-256 collision is infinitesimal, so you could do that at a low risk.
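To put a rough number on "infinitesimal": under the usual birthday approximation, n random b-bit hashes collide with probability of about 1 - exp(-n(n-1) / 2^(b+1)). The sketch below evaluates that expression. The function name and the workload of one trillion transactions are assumptions chosen for illustration, and the approximation treats hash outputs as uniformly random.

```python
import math


def collision_probability(n_items: int, hash_bits: int) -> float:
    """Birthday approximation: p ~ 1 - exp(-n(n-1) / 2^(bits+1))."""
    exponent = -n_items * (n_items - 1) / 2 ** (hash_bits + 1)
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(exponent)


# Hypothetical workload: one trillion stored transactions.
print(collision_probability(10**12, 256))  # 256-bit hash such as SHA-256
print(collision_probability(10**12, 64))   # 64-bit hash, for comparison
```

For the 256-bit case the result is far below anything of practical concern, whereas a 64-bit hash would be practically certain to collide at that volume.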
However, because a hash function does not (and can never) guarantee the absence of collisions, it is worth considering alternatives.
Question: Can you store the transaction timestamp? If so, you could couple the timestamp with a numeric postfix to obtain an internal id (different from the transaction id you originally have). This is far superior to a hash in terms of uniqueness, and it comes with the benefit of being exceedingly fast. But you’d have to store that postfix with the original object to be able to reproduce this internal id.
Essentially, what you need is a function that determines a (definitely) unique key from a transaction object, based on attributes that you are allowed to store. Assuming you are allowed to store the timestamp and a numeric postfix, the following example is a possible solution.
In case of timestamp collisions, you increment the numeric postfix, so you’d get:
2020-02-18-14-26-15-420-0 (postfix here is -0)
2020-02-18-14-26-15-420-1 (postfix here is -1)
2020-02-18-14-26-15-420-2 (postfix here is -2)
2020-02-18-14-26-15-423-0 (postfix here is -0)
Here the first three transactions have arrived at the same time (they share the same millisecond timestamp). However, they are still uniquely identifiable.
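A minimal sketch of such an id generator could look like this. It assumes millisecond-resolution timestamps and a single process holding the counters in memory; the class name `InternalIdGenerator` and its method are hypothetical, and a real system would have to persist the postfix alongside the stored object, as noted above.

```python
from collections import defaultdict
from datetime import datetime


class InternalIdGenerator:
    """Builds ids of the form YYYY-MM-DD-HH-MM-SS-mmm-N from a timestamp."""

    def __init__(self) -> None:
        # Next free postfix for each millisecond-resolution timestamp key.
        self._next_postfix = defaultdict(int)

    def next_id(self, timestamp: datetime) -> str:
        millis = timestamp.microsecond // 1000
        key = timestamp.strftime("%Y-%m-%d-%H-%M-%S") + f"-{millis:03d}"
        postfix = self._next_postfix[key]
        self._next_postfix[key] += 1
        return f"{key}-{postfix}"


generator = InternalIdGenerator()
same_time = datetime(2020, 2, 18, 14, 26, 15, 420000)
print(generator.next_id(same_time))  # 2020-02-18-14-26-15-420-0
print(generator.next_id(same_time))  # 2020-02-18-14-26-15-420-1
```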