Home » Mysql » Efficient method to find collision free random numbers

Efficient method to find collision free random numbers

Posted by: admin November 29, 2017 Leave a comment

Questions:

I have a users table, the user ID is public. But I want to obfuscate the number of
registered user and trends of the project, so I don’t want to have public incrementing IDs.

When a new user is created I want to find a random integer number that is greater than a certain number and that is not yet in the database.

Naive code:

<?php
    $found = false;
    while(!$found) {
      $uid = rand(1000000000,4294967295) // find random number betwen minimum and maximum
      $dbh->beginTransaction();
      // check if user id is in use, and if not insert it
      if($dbh->query("SELECT * FROM users WHERE uid = $uid")) {
        $dbh->exec("INSERT INTO users (uid) VALUES ($uid)");
        $found = true;
      }
      $dbh->commit();
    }
    // we just got our new uid ...
?>

This will work it however may become inefficient. True that there is a big range and the probability of hitting an unused uid is high. But what if I want to use a smaller range, because I don’t want to have so long userids?

Example of my concerns:

  • 60% of all user ids are in use
  • the chance of hitting an unused uid are 0.4
  • the first attempt has 0.4% success rate
  • if 1st not successful the second attempt has 0.6*0.4 probability
  • so with a maximum of two tries i have 0.4 + 0.6*0.4 proability (is that right??)

So one method to optimize is that came to my mind is the following:

  • find a random number, check if its free, if not, increment it by 1 and try again and so on
  • if the maximum number is hit, continue with the minimum number

That should give me a number with a maximum runtime of O(range)

That sounds pretty bad but I think it is not, because I submit random numbers to the database and that they are all at the beginnig is very unlikely. So how good/bad is it really?

I think this would work just fine but I want it BETTER

So what about this?

  • find a random number
  • query the database for how many numbers are occupied in the range whole range, starting from that number (this first step is trivial…)
  • if there are numbers occupied in that range, divide the range by half and try again. starting with the initial number
  • if there are numbers occupied divide the range by half and try again. starting with the initial number

If I am thinking correctly this will give ma a number with a maximum of O(log(range)) time.

That is pretty satisfying because log() is pretty good. However I think this method will often be as bad as possible. Because with our random numbers we will probably always hit numbers in the large intervals.

So at the beginning our pure random method is probably better.

So what about having a limit like this

  • select current number of used numbers
  • is it greater than X, logarithmic range approach
  • if it is not, use pure random method

What would X be and why?

So final question:

This is pretty easy and pretty complicated at the same time.

I think this is a standard problem because lots and lots of system use random ids (support tickets etc), so I cannot imagine I am the first one to stumble across this.

How would you solve this? Any input is appriciated!

Is there maby an existing class / procedure for this I can use?

Or maby some database functions that I can use?

I would like to do it in PHP/Mysql

IMPORTANT EDIT:

I just thought about the range/logarithmic solution. It seems to be complete bullshit sorry for my wording because:

  • what if i hit an occupied number at start?

Then I am dividing my range so long if it is only 1. And even then the number is occoupied.

So its completely the same as the pure random method from start, only worse….

I am a bit embarassed I made this up but I will leave it in because I think its a good example of overcomplicated thinknig!

Answers:

If p is the proportion of ids in use, your “naive” solution will, on average, require 1/(1-p) attempts to find an unused id. (See Exponential distribution). In the case of 60% occupancy, that is a mere 1/0.4 = 2.5 queries …

Your “improved” solution requires about log(n) database calls, where n is the number of ids in use. That is quite a bit more than the “naive” solution. Also, your improved solution is incomplete (for instance, it does not handle the case where all number in a subrange are taken, and does not elaborate with subrange you recurse into) and is more complex to implement to boot.

Finally, note that your implementation will only be thread safe if the database provides very strict transaction isolation, which scales poorly, and might not be the default behaviour of your database system. If that turns out to be a problem, you could speculatively insert with a random id, and retry in the event of a constraint violation.

Questions:
Answers:

If you don’t want to test for used numbers, you can create a function that calculates a random looking id $id_k based on the automatically incrementing id from the database $id:

$id_k = transpose($id);

That function has either a reverse cousin or is able to transpose transparently back (ideally):

$id = transpose($id_k);

You can then used the transposed IDs on your site.

Another idea that comes into my mind is that you precalculate the random id’s per incrementing ids so you can control the use of the database better.

Questions:
Answers:

How about you compose the number from something that won’t clash and a random number in some small range.

 ddddd-(rrr+n)

The ddddd being for example the number of days your system has been live, the rrr is a random number chosen each day, the n is the increment within day.

Given any one number a person not knowing rrr for the day cannot deduce how many users have been created on a given day.

Questions:
Answers:

Joe, just implement your algorithm as above with no worry. Just look: if the probability of hitting used id is p = 0.6, then the probability that you hit used id N consecutive times is p^N. This goes down exponentially! I’d recommend to set the ID density lower, e.g. to p = 0.1. Then the probability that you don’t suceed for 10 consecutive attempts is p^10 = 0.1^10 = 1e-10!!! Absolutely negligible.

Don’t worry of the collisions and go for your solution.

Questions:
Answers:

How about when you start the application, pick a random range of 100 numbers (e.g. 100 – 199, 1000 – 1099, 5400 – 5499), check the first one, if it’s not in the database we know (based on this algorithm) that all 100 are free. Store the start of this range in memory.

Then just allocate these until you run out (or your application recycles) and then pick another random range. So you only need to go to the database once every 100 users.

This is similar to the Nhibernate hi/lo approach (except for the random bit).

Obviously tweak to 100 depending on the rate at which you allocate ids compared to the typical life span of the application in memory.

Questions:
Answers:

You could simply use any hash shuffle algorithm to generate new Id value by the known number of users (holding of this value is a usual practice). This approach could be better than your current solution because the corresponding algorithm will likely generate less collisions. The key point is to choose algorithm with appropriate strength and distribution uniformity.

Questions:
Answers:

To expand on on meriton’s and Tomas Telensky’s answers, if you want to keep your user IDs short while ensuring that you don’t run out of them, you could select each ID randomly from, say, the range 1 to 10*n+1000, where n is the current number of users you have.

That way, your effective user ID space will never be more than 10% full, while the user IDs will (in the long run) be only about one digit longer than if you’d assigned them sequentially. The down side, of course, is that the IDs will no longer be completely uncorrelated with registration order: if someone has the ID 5851, you know they must be at least the 486th registered user and that they’re unlikely to be, say, the 50000th. (Of course, you introduce the same kind of correlations if you ever manually adjust the range to accommodate more users.)

Of course you can adjust the constants 10 and 1000 above: the bigger they are, the longer and more random your user IDs will be.

Questions:
Answers:

You could use symmetric-key encryption (e.g. AES) to encrypt a counter. If you use the entire output (128-bits for AES) then you’re guaranteed no collisions, and it’s a reversible mapping.

128 bits might be more than you want to deal with though—it’s a 32-digit hex number, 39-digit decimal number. You could use a 64-bit encryption algorithm such as DES, Blowfish or Misty (16-digit hex number, 20-digit decimal number).