MySQL query to get “intersection” of numerous queries with limits

Posted by: admin November 1, 2017

Questions:

Assume I have a single mySQL table (users) with the following fields:

userid  
gender  
region  
age  
ethnicity  
income

I want to return a total number of records that the user enters, split according to additional criteria that the user also provides.

In the simplest example, they may ask for 1,000 records, where 600 records should have gender = ‘Male’ and 400 records where gender = ‘Female’. That’s simple enough to do.

Now, go one step further. Assume they now want to specify Region:

GENDER  
    Male:   600 records  
    Female: 400 records  

REGION  
    North:  100 records  
    South:  200 records  
    East:   300 records  
    West:   400 records

Again, only 1000 records should be returned, but in the end, there must be 600 males, 400 females, 100 Northerners, 200 Southerners, 300 Easterners and 400 Westerners.

I know this isn’t valid syntax, but using pseudo-mySQL code, it hopefully illustrates what I’m trying to do:

(SELECT * FROM users WHERE gender = 'Male' LIMIT 600  
UNION  
SELECT * FROM users WHERE gender = 'Female' LIMIT 400)

INTERSECT

(SELECT * FROM users WHERE region = 'North' LIMIT 100  
UNION  
SELECT * FROM users WHERE region = 'South' LIMIT 200  
UNION  
SELECT * FROM users WHERE region = 'East' LIMIT 300  
UNION  
SELECT * FROM users WHERE region = 'West' LIMIT 400)

Note that I’m not looking for a one-time query. The total number of records and the number of records within each criterion will constantly change based on user input. So I’m trying to come up with a generic solution that can be reused over and over, not a hard-coded one.

To make things more complicated, now add more criteria. There could also be age, ethnicity and income, each with its own set number of records per group. The additional code would be appended to the above:

INTERSECT

(SELECT * FROM users WHERE age >= 18 and age <= 24 LIMIT 300  
UNION  
SELECT * FROM users WHERE age >= 25 and age <= 36 LIMIT 200  
UNION  
SELECT * FROM users WHERE age >= 37 and age <= 54 LIMIT 200  
UNION  
SELECT * FROM users WHERE age >= 55 LIMIT 300)  

INTERSECT

etc.

I’m not sure if this is possible to write in one query or if this requires multiple statements and iterations.

Answers:

Flatten Your Criteria


You can flatten your multi-dimensional criteria into single-level criteria, one per gender/region combination.

These criteria can then be met in one query, as follows:

(SELECT * FROM users WHERE gender = 'Male' AND region = 'North' LIMIT 40) UNION ALL
(SELECT * FROM users WHERE gender = 'Male' AND region = 'South' LIMIT 80) UNION ALL
(SELECT * FROM users WHERE gender = 'Male' AND region = 'East' LIMIT 120) UNION ALL
(SELECT * FROM users WHERE gender = 'Male' AND region = 'West' LIMIT 160) UNION ALL
(SELECT * FROM users WHERE gender = 'Female' AND region = 'North' LIMIT 60) UNION ALL
(SELECT * FROM users WHERE gender = 'Female' AND region = 'South' LIMIT 120) UNION ALL
(SELECT * FROM users WHERE gender = 'Female' AND region = 'East' LIMIT 180) UNION ALL
(SELECT * FROM users WHERE gender = 'Female' AND region = 'West' LIMIT 240)

Problem

  • It does not always return the correct result. For example, if fewer than 40 users are male and from the North, the query will return fewer than 1,000 records.
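Since the question asks for something reusable rather than hard-coded, the flattened query can be generated from a table of cell limits. This is only a minimal sketch: the function name `build_union_query`, the shape of the `cells` dict and the `%s` placeholder style (the one used by common Python MySQL drivers) are assumptions, and the column names are assumed to come from trusted code, not from user input.

```python
def build_union_query(cells, table="users"):
    """Return (sql, params) with one UNION ALL branch per cell.

    `cells` maps a tuple of (column, value) pairs to the row limit
    for that cell. Column and table names must come from trusted code.
    """
    branches, params = [], []
    for conditions, limit in cells.items():
        where = " AND ".join(f"{column} = %s" for column, _ in conditions)
        params.extend(value for _, value in conditions)
        # LIMIT is inlined after an int() cast, so no free text reaches the SQL.
        branches.append(
            f"(SELECT * FROM {table} WHERE {where} LIMIT {int(limit)})"
        )
    return " UNION ALL\n".join(branches), params


# The four cells used in the simplified example of this answer.
cells = {
    (("gender", "Male"), ("region", "North")): 40,
    (("gender", "Male"), ("region", "South")): 80,
    (("gender", "Female"), ("region", "North")): 60,
    (("gender", "Female"), ("region", "South")): 120,
}
sql, params = build_union_query(cells)
print(sql)
print(params)
```

The generated statement has the same shape as the hand-written one above, so it has the same weakness: a cell with too few rows silently returns fewer records.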

Adjust Your Criteria


Let’s say that fewer than 40 users are male and from the North. Then you need to adjust the quantities of the other criteria to cover the shortfall in “Male” and “North”. I believe this is not possible with bare SQL. This is the pseudocode I have in mind. To keep it simple, we will only query for Male, Female, North and South.

conditions.add({ gender: 'Male',   region: 'North', limit: 40  })
conditions.add({ gender: 'Male',   region: 'South', limit: 80  })
conditions.add({ gender: 'Female', region: 'North', limit: 60  })
conditions.add({ gender: 'Female', region: 'South', limit: 120  })

foreach(conditions as condition) {
    temp = getResultFromDatabaseByCondition(condition)
    conditions.remove(condition)

    // there is not enough result for this condition,
    // increase other condition quantity
    if (temp.length < condition.limit) {
        adjust(...);
    }
}

Let’s say that there are only 30 northern males. So we need to adjust by +10 Male and +10 North.

To Adjust
---------------------------------------------------
Male        +10
North       +10

Remain Conditions
----------------------------------------------------
{ gender: 'Male',   region: 'South', limit: 80 }
{ gender: 'Female', region: 'North', limit: 60  }
{ gender: 'Female', region: 'South', limit: 120  }

‘Male’ + ‘South’ is the first condition that matches the ‘Male’ adjustment. Increase it by 10 and remove it from the “Remain Conditions” list. Since we increased South, we need to decrease it again in another condition, so add a “South” adjustment to the “To Adjust” list.

To Adjust
---------------------------------------------------
South       -10
North       +10

Remain Conditions
----------------------------------------------------
{ gender: 'Female', region: 'North', limit: 60  }
{ gender: 'Female', region: 'South', limit: 120  }

Final Conditions
----------------------------------------------------
{ gender: 'Male',   region: 'South', limit: 90 }

Find a condition that matches ‘South’ and repeat the same process.

To Adjust
---------------------------------------------------
Female      +10
North       +10

Remain Conditions
----------------------------------------------------
{ gender: 'Female', region: 'North', limit: 60  }

Final Conditions
----------------------------------------------------
{ gender: 'Female', region: 'South', limit: 110  }
{ gender: 'Male',   region: 'South', limit: 90 }

And finally

{ gender: 'Female', region: 'North', limit: 70  }
{ gender: 'Female', region: 'South', limit: 110  }
{ gender: 'Male',   region: 'South', limit: 90 }

I haven’t come up with the exact implementation of adjustment yet. It is more difficult than I have expected. I will update once I can figure out how to implement it.

Answers:

The problem that you describe is a multi-dimensional modeling problem. In particular, you are trying to get a stratified sample along multiple dimensions at the same time. The key to this is to go down to the smallest level of granularity and build up the sample from there.

I am further assuming that you want the sample to be representative at all levels. That is, you don’t want all the users from “North” to be female. Or all the “males” to be from “West”, even if that does meet the end criteria.

Start by thinking in terms of a total number of records, dimensions, and allocations along each dimension. For instance, for the first sample, think of it as:

  • 1000 records
  • 2 dimensions: gender, region
  • gender split: 60%, 40%
  • region split: 10%, 20%, 30%, 40%

Then, you want to allocate these numbers to each gender/region combination. The numbers are:

  • North, Male: 60
  • North, Female: 40
  • South, Male: 120
  • South, Female: 80
  • East, Male: 180
  • East, Female: 120
  • West, Male: 240
  • West, Female: 160

You’ll see that these add up along the dimensions.

The calculation of the numbers in each cell is pretty easy. It is the product of the percentages times the total. So, “East, Female” is 30%*40% * 1000 . . . Voila! The value is 120.
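The cell calculation can be sketched in a few lines of Python. The function name `expected_cell_counts` and the shape of the `quotas` dict are assumptions made for illustration. Exact fractions are used so that the product of percentages does not pick up floating-point noise.

```python
from fractions import Fraction
from itertools import product


def expected_cell_counts(total, quotas):
    """Split `total` over every combination of values, proportionally.

    `quotas` maps a dimension name to {value: requested count}; the
    counts of each dimension must sum to `total`.
    """
    for name, split in quotas.items():
        if sum(split.values()) != total:
            raise ValueError(f"quotas for {name!r} do not sum to {total}")
    cells = {}
    for combo in product(*(split.items() for split in quotas.values())):
        share = Fraction(1)
        for _, count in combo:
            share *= Fraction(count, total)
        # Exact arithmetic; a fractional result would still need rounding.
        cells[tuple(value for value, _ in combo)] = share * total
    return cells


quotas = {
    "region": {"North": 100, "South": 200, "East": 300, "West": 400},
    "gender": {"Male": 600, "Female": 400},
}
cells = expected_cell_counts(1000, quotas)
print(cells[("East", "Female")])  # 120
```

With other splits the products may not be whole numbers, in which case the rounding has to be handled so that the totals along each dimension still match.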

Here is the solution:

  1. Take the input along each dimension as percentages of the total, and make sure they add up to 100% along each dimension.
  2. Create a table of the expected percentages for each of the cells. This is the product of the percentages along each dimension.
  3. Multiply the expected percentages by the overall total.
  4. The final query is outlined below.

Assume that you have a table cells with the expected count and the original data (users).

select enumerated.*
from (select u.*,
             concat_ws(':', dim1, dim2, dim3) as dims,
             (@rn := if(@dims = concat_ws(':', dim1, dim2, dim3), @rn + 1,
                        if(@dims := concat_ws(':', dim1, dim2, dim3), 1, 1)
                       )
             ) as seqnum
      from users u cross join
           (select @dims := '', @rn := 0) vars
      order by dim1, dim2, dim3, rand()
     ) enumerated join
     cells
     on enumerated.dims = cells.dims
where enumerated.seqnum <= cells.expectedcount;

Note that this is a sketch of the solution. You have to fill in the details about the dimensions.

This will work as long as you have enough data for all the cells.

In practice, when doing this type of multi-dimensional stratified sampling, you do run the risk that cells will be empty or too small. When this happens, you can often fix this with an additional pass afterwards. Take what you can from the cells that are large enough. These typically account for the majority of the data needed. Then add records in to meet the final count. The records to be added in are those whose values match what is needed along the most needed dimensions. However, this solution simply assumes that there is enough data to satisfy your criteria.
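The "take what you can from each cell" step can also be sketched outside the database. This is a rough illustration only: `sample_cells`, the list-of-dicts row format and the `shortfall` report are assumptions, and the second pass that fills the gaps is left out.

```python
import random
from collections import defaultdict


def sample_cells(rows, dims, expected, seed=None):
    """Pick up to expected[cell] random rows from each cell.

    Returns (picked_rows, shortfall), where shortfall maps a cell to
    the number of rows it could not supply.
    """
    rng = random.Random(seed)
    by_cell = defaultdict(list)
    for row in rows:
        by_cell[tuple(row[d] for d in dims)].append(row)

    picked, shortfall = [], {}
    for cell, wanted in expected.items():
        available = by_cell.get(cell, [])
        take = min(wanted, len(available))
        picked.extend(rng.sample(available, take))
        if take < wanted:
            shortfall[cell] = wanted - take
    return picked, shortfall


rows = (
    [{"gender": "Male", "region": "North"}] * 3
    + [{"gender": "Female", "region": "North"}] * 5
)
expected = {("Male", "North"): 4, ("Female", "North"): 2}
picked, shortfall = sample_cells(rows, ["gender", "region"], expected, seed=1)
print(len(picked), shortfall)  # 5 {('Male', 'North'): 1}
```

The `shortfall` dict is what the additional pass would work from: it says how many records are still missing and along which dimension values.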

Answers:

The problem with your request is that there is an enormous number of ways to achieve the proposed numbers:

       Male    Female    Sum
-----------------------------
North:  100         0    100      
South:  200         0    200
East:   300         0    300 
West:     0       400    400 
Sum:    600       400
-----------------------------
North:   99         1    100      
South:  200         0    200
East:   300         0    300 
West:     1       399    400 
Sum:    600       400
-----------------------------
....
-----------------------------
North:    0       100    100      
South:  200         0    200
East:     0       300    300 
West:   400         0    400 
Sum:    600       400

Just by combining North, East and West (with South always male: 200) you’ll get 400 possible ways to achieve the proposed numbers. And it gets even more complicated when you have only a limited number of records in each “class” (Male/North = “class“).

You may need up to MIN(COUNT(gender), COUNT(location)) records for every cell in the table above (for the case where its counterpart is zero).

That is up to:

       Male    Female    
---------------------
North:  100       100      
South:  200       200
East:   300       300 
West:   400       400 

So you need to count available records of each gender/location pair AVAILABLE(gender, location).
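Counting AVAILABLE(gender, location) is straightforward once the rows (or a `GROUP BY` result) are at hand. The sketch below assumes rows as dicts and uses made-up helper names. The check it performs is only a necessary condition: it catches quotas that can never be met, but passing it does not prove that a fit exists.

```python
from collections import Counter


def available(rows):
    """Count rows per (gender, region) pair."""
    return Counter((row["gender"], row["region"]) for row in rows)


def impossible_quotas(avail, gender_quota, region_quota):
    """Return quotas larger than the rows that exist for that value.

    Each entry maps (dimension, value) to (wanted, existing).
    """
    by_gender, by_region = Counter(), Counter()
    for (gender, region), count in avail.items():
        by_gender[gender] += count
        by_region[region] += count
    problems = {}
    for value, wanted in gender_quota.items():
        if by_gender[value] < wanted:
            problems[("gender", value)] = (wanted, by_gender[value])
    for value, wanted in region_quota.items():
        if by_region[value] < wanted:
            problems[("region", value)] = (wanted, by_region[value])
    return problems


rows = (
    [{"gender": "Male", "region": "North"}] * 2
    + [{"gender": "Female", "region": "West"}] * 3
)
avail = available(rows)
problems = impossible_quotas(
    avail, {"Male": 2, "Female": 4}, {"North": 2, "West": 3}
)
print(avail[("Male", "North")])  # 2
print(problems)  # {('gender', 'Female'): (4, 3)}
```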

Finding a particular fit seems to be close to the problem of semimagic squares[1][2].

And there are several questions on math.stackexchange.com about this [3][4].

I’ve ended up reading some paper on how to construct these and I doubt it’s possible to do this with one select.

If you have enough records and won’t end up in a situation like this:

       Male    Female    
---------------------
North:  100         0      
South:  200       200
East:   300         0 
West:   200       200 

I would go with iterating through the locations and adding a proportional number of males/females in each step:

  1. M: 100 (16%); F: 0 (0%)
  2. M: 100 (16%); F: 200 (50%)
  3. M: 400 (66%); F: 200 (50%)
  4. M: 600 (100%); F: 400 (100%)

But this will give you only approximate results, and after validating those you may want to iterate through the result a few times and adjust the counts in each category to be “good enough“.

Answers:

I’d build a map of the distribution of the database and use that to implement the sampling logic. Bonuses include the possibility of giving quick demographic feedback to the user, and no additional burden on the server. On the con side, you’d need to implement a mechanism to keep the database and the map in sync.

It could look like this using JSON:

{"gender":{
  "Male":{
    "amount":35600,
    "region":{
      "North":{
        "amount":25000,
        "age":{
          "18":{
            "amount":2400,
            "ethnicity":{
              ...
              "income":{
                ...
              }
            },
            "income":{
              ...
              "ethnicity":{
                ...
              }
            }
          },
          "19":{
            ...
          },
          ...
          "120":{
            ...
          }
        },
        "ethnicity":{
          ...
        },
        "income":{
          ...
        }
      },
      "South":{
        ...
      },
      ...
    },
    "age":{
      ...
    },
    "ethnicity":{
      ...
    },
    "income":{
      ...
    }
  },
  "Female":{
    ...
  }
},
"region":{
  ...
},
"age":{
  ...
},
"ethnicity":{
  ...
},
"income":{
  ...
}}

So the user selects

total 1000
   600 Male
   400 Female

   100 North
   200 South
   300 East
   400 West

   300 <20 years old
   300 21-29 years old
   400 >=30 years old

Calculate a linear distribution:

male-north-u20: 1000*0.6*0.1*0.3=18
male-north-21to29: 18
male-north-o29: 24 (keep a track of rounding errors)
etc

then we’d check the map:

tmp.male.north.u20=getSumUnder(JSON.gender.Male.region.North.age,20) // == 10
tmp.male.north.f21to29=getSumBetween(JSON.gender.Male.region.North.age,21,29) // == 29
tmp.male.north.o29=getSumOver(JSON.gender.Male.region.North.age,29) // == 200
etc

Mark everything that meets the linear distribution as OK and keep track of the surplus. If something (like male.north.u20) falls short, first adjust within the parent (to make sure male.north, for example, still meets its criteria): here you are missing 8 for u20, so you overuse f21to29 by 8. After the first run, compensate for each missing criterion in the other regions, for example tmp.male.south.u20+=8;tmp.male.south.f21to29-=8;.

It is pretty tedious to get it right.

In the end you have the correct distribution that can be used to construct a trivial SQL query.
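The map lookups named above (getSumUnder, getSumBetween, getSumOver) are not spelled out in the answer, so here is one possible reading of them in Python. The assumptions: the age level of the map has the shape `{"18": {"amount": ...}, ...}`, "under" and "over" are strict comparisons, and the small `age_map` below is invented so that it reproduces the 10 / 29 / 200 values from the comments.

```python
def get_sum_between(age_map, low, high):
    """Sum the amounts of all age buckets with low <= age <= high."""
    return sum(
        bucket["amount"]
        for age, bucket in age_map.items()
        if low <= int(age) <= high
    )


def get_sum_under(age_map, limit):
    """Sum the amounts of all age buckets strictly below `limit`."""
    return sum(b["amount"] for a, b in age_map.items() if int(a) < limit)


def get_sum_over(age_map, limit):
    """Sum the amounts of all age buckets strictly above `limit`."""
    return sum(b["amount"] for a, b in age_map.items() if int(a) > limit)


# Hypothetical stand-in for JSON.gender.Male.region.North.age
age_map = {
    "18": {"amount": 4},
    "19": {"amount": 6},
    "25": {"amount": 29},
    "40": {"amount": 200},
}
have = {
    "u20": get_sum_under(age_map, 20),
    "f21to29": get_sum_between(age_map, 21, 29),
    "o29": get_sum_over(age_map, 29),
}
needed = {"u20": 18, "f21to29": 18, "o29": 24}
missing = {key: max(0, needed[key] - have[key]) for key in needed}
print(have)     # {'u20': 10, 'f21to29': 29, 'o29': 200}
print(missing)  # {'u20': 8, 'f21to29': 0, 'o29': 0}
```

The `missing` dict is the input to the adjustment step described above: the 8 missing rows for u20 have to be taken from a sibling bucket and compensated in another region.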

Answers:

This can be solved in two steps. I will describe how to do it for the example where gender and region are the dimensions. Then I will describe the more general case. In the first step we solve a system of equations of 8 variables, then we take the disjoint union of 8 select statements limited by the solutions found in step one. Notice that there are only 8 possibilities for any row. They can be male or female and then the region is one of north, south, east or west. Now let,

X1 equal the number of rows that are male and from the north,
X2 equal the number of rows that are male and from the south,
X3 equal the number of rows that are male and from the east,
X4 equal the number of rows that are male and from the west,
X5 equal the number of rows that are female and from the north,
X6 equal the number of rows that are female and from the south,
X7 equal the number of rows that are female and from the east,
X8 equal the number of rows that are female and from the west.

The equations are:

 X1+X2+X3+X4=600
 X5+X6+X7+X8=400
 X1+X5=100
 X2+X6=200
 X3+X7=300
 X4+X8=400

Now solve for X1,X2, …X8 in the above. There are many solutions (I will describe how to solve in a moment) Here is a solution:

X1=60, X2=120, X3=180,X4=240,X5=40,X6=80,X7=120,X8=160.

Now we can get the result by a simple union of 8 selects:

(select * from user where  gender='m' and region="north" limit 60)
union distinct(select * from user where  gender='m' and region='south' limit 120)
union distinct(select * from user where  gender='m' and region='east' limit 180)
union distinct(select * from user where  gender='m' and region='west' limit 240)
union distinct(select * from user where  gender='f' and region='north' limit 40)
union distinct(select * from user where  gender='f' and region='south' limit 80)
union distinct(select * from user where  gender='f' and region='east' limit 120)
union distinct(select * from user where  gender='f' and region='west' limit 160);

Notice that if there are not 60 rows in the database that satisfy the first select above, then the particular solution given will not work. So we have to add further constraints, LT:

0<=X1 <= (select count(*) from user where  gender='m' and region="north")
0<=X2 <= (select count(*) from user where  gender='m' and region='south')
0<=X3 <= (select count(*) from user where  gender='m' and region='east' )
0<=X4 <= (select count(*) from user where  gender='m' and region='west')
0<=X5 <= (select count(*) from user where  gender='f' and region='north' )
0<=X6 <= (select count(*) from user where  gender='f' and region='south')
0<=X7 <= (select count(*) from user where  gender='f' and region='east' )
0<=X8 <= (select count(*) from user where  gender='f' and region='west');

Now let’s generalize this case to allow any splits. The equations are E:

 X1+X2+X3+X4=n1
 X5+X6+X7+X8=n2
 X1+X5=m1
 X2+X6=m2
 X3+X7=m3
 X4+X8=m4

The numbers n1, n2, m1, m2, m3, m4 are given and satisfy n1+n2=(m1+m2+m3+m4). So we have reduced the problem to solving the equations LT and E above. This is just a linear programming problem and can be solved using the simplex method or other methods. Another possibility is to view this as a system of linear Diophantine equations and use methods for that to find solutions. In any case I have reduced the problem to finding the solution to the equations above. (Given that the equations are of a special form, there may be a faster way than using the simplex method or solving a system of linear Diophantine equations.) Once we solve for Xi the final solution is:

(select * from user where  gender='m' and region="north" limit :X1)
union distinct(select * from user where  gender='m' and region='south' limit :X2)
union distinct(select * from user where  gender='m' and region='east' limit :X3)
union distinct(select * from user where  gender='m' and region='west' limit :X4)
union distinct(select * from user where  gender='f' and region='north' limit :X5)
union distinct(select * from user where  gender='f' and region='south' limit :X6)
union distinct(select * from user where  gender='f' and region='east' limit :X7)
union distinct(select * from user where  gender='f' and region='west' limit :X8);
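For the two-dimension case, the equations E together with the upper bounds LT have the shape of a transportation problem, so one way to find integer limits is a small max-flow search. To be clear, this swaps in a different technique from the simplex method mentioned above, and it does not extend directly to three or more dimensions. The function name `solve_limits` and the dict-based inputs are assumptions made for this sketch.

```python
from collections import deque


def solve_limits(row_quota, col_quota, available):
    """Find integer cell limits meeting both sets of quotas, or None.

    row_quota / col_quota map values to requested counts; available maps
    (row_value, col_value) to the number of rows that exist for that cell.
    """
    if sum(row_quota.values()) != sum(col_quota.values()):
        return None
    source, sink = ("source",), ("sink",)
    cap = {}  # residual capacities: cap[u][v]

    def add_edge(u, v, c):
        cap.setdefault(u, {})[v] = c
        cap.setdefault(v, {}).setdefault(u, 0)

    for r, n in row_quota.items():
        add_edge(source, ("row", r), n)
    for c, m in col_quota.items():
        add_edge(("col", c), sink, m)
    for r in row_quota:
        for c in col_quota:
            add_edge(("row", r), ("col", c), available.get((r, c), 0))

    flow = 0
    while True:
        # Breadth-first search for a path that still has spare capacity.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, c in cap.get(u, {}).items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            break
        # Collect the path, find its bottleneck, and push that much flow.
        path, v = [], sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(cap[a][b] for a, b in path)
        for a, b in path:
            cap[a][b] -= push
            cap[b][a] += push
        flow += push

    if flow != sum(row_quota.values()):
        return None  # the quotas cannot be met with the available rows
    # Flow through a cell equals the residual capacity of its reverse edge.
    return {
        (r, c): cap[("col", c)][("row", r)]
        for r in row_quota
        for c in col_quota
    }


gender = {"Male": 600, "Female": 400}
region = {"North": 100, "South": 200, "East": 300, "West": 400}
counts = {(g, r): 1000 for g in gender for r in region}
counts[("Male", "North")] = 30  # only 30 northern males exist
limits = solve_limits(gender, region, counts)
print(limits)
```

The returned values play the role of X1 … X8: each one can be bound to the matching `limit` placeholder in the union above. The solution found is just one of many, and it is not guaranteed to be the most proportional one.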

Let’s denote a dimension D with n possibilities as D:n. Suppose you have dimensions D1:n1, D2:n2, …DM:nM. They would generate n1*n2*…nM variables. The number of equations generated is n1+n2+…nM. Rather than define the general method, let’s take another case with three dimensions that have 3, 4 and 2 possible values. Let the possible values for D1 be d11, d12, d13; for D2, d21, d22, d23, d24; and for D3, d31, d32. We will have 24 variables, and the equations are:

 X1+X2+...+X8=n11
 X9+X10+...+X16=n12
 X17+X18+...+X24=n13
 X1+X2+X9+X10+X17+X18=n21
 X3+X4+X11+X12+X19+X20=n22
 X5+X6+X13+X14+X21+X22=n23
 X7+X8+X15+X16+X23+X24=n24
 X1+X3+X5+...+X23=n31
 X2+X4+X6+...+X24=n32

Where

X1 equals number with D1=d11 and D2=d21 and D3=d31
X2 equals number with D1=d11 and D2=d21 and D3=d32
....
X24 equals number with D1=d13 and D2=d24 and D3=d32.

Add the less-than constraints. Then solve for X1, X2, … X24. Create the 24 select statements and take the disjoint union.
We can solve similarly for any number of dimensions.

So in summary: given dimensions D1:n1, D2:n2, …DM:nM we can solve the corresponding linear programming problem as described above for n1*n2*…nM variables and then generate a solution by taking the disjoint union over n1*n2*…nM select statements. So yes, we can generate a solution by select statements, but first we have to solve the equations and determine the limits by getting counts for each of the n1*n2*…nM variables.

Even though the bounty is over, I am going to add a bit more for those of you who are interested. I claim here that I have completely shown how to solve this if there is a solution.

To clarify my approach: in the case of 3 dimensions, let’s say we split age into one of 3 possibilities. Then we’ll use gender and region as in the question. There are 24 different possibilities for each user, corresponding to where they fall in those categories. Let Xi be the number of each of those possibilities in the final result. Let me write a matrix where each row represents one possibility. Each user will contribute at most 1 to m or f, 1 to north, south, east or west, and 1 to the age category. And there are only 24 possibilities for the user. Let’s show a matrix with (abc) the 3 ages, (nsew) the regions and
(mf) male or female: a is age less than or equal to 10, b is age between 11 and 30 and c is age between 31 and 50.

     abc nsew mf
X1   100 1000 10
X2   100 1000 01
X3   100 0100 10
X4   100 0100 01
X5   100 0010 10
X6   100 0010 01
X7   100 0001 10
X8   100 0001 01

X9   010 1000 10
X10  010 1000 01
X11  010 0100 10
X12  010 0100 01
X13  010 0010 10
X14  010 0010 01
X15  010 0001 10
X16  010 0001 01

X17   001 1000 10
X18   001 1000 01
X19   001 0100 10
X20   001 0100 01
X21   001 0010 10
X22   001 0010 01
X23   001 0001 10
X24   001 0001 01

Each row represents a user, with a 1 in a column if the user contributes to that category. For example, the first row shows 1 for a, 1 for n, and 1 for m, which means the user’s age is less than or equal to 10, and the user is from the north and male.
The Xi represents how many rows of that kind are in the final result. So if X1 is 10, the final result has 10 rows all of which are from the north, male, and aged 10 or less. OK, so now we just have to add things up. Notice that the first 8, X1+X2+X3+X4+X5+X6+X7+X8, are all the rows whose age is less than or equal to 10. They must add up to whatever we chose for that category. Similarly for the next 2 sets of 8.

So far we get the equations (na is the number with age less than or equal to 10, nb the number with age between 11 and 30, nc the number with age between 31 and 50):

X1+X2+X3+X4+X5+X6+X7+X8 =  na
X9+X10+X11 + .... X16 = nb
X17+X18+X19+...           X24=nc

Those are the age splits. Now lets look at the region splits. Just add up the variables in the “n” column,

X1+X2+X9+X10+X17+X18 = nn
X3+X4+X11+X12+X19+X20 = ns
...

etc.
Do you see how I am getting those equations by just looking down the columns?
Continue for ew and mf, giving 3+4+2 equations in total. So what I did here is quite simple. I have reasoned that any row you pick contributes one to each of the 3 dimensions and there are only 24 possibilities. Then let Xi be the number for each possibility and you get the equations that need to be solved. It seems to me that whatever method you come up with must be a solution to those equations. In other words, I simply reformulated the problem in terms of solving those equations.
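Reading the equations off the columns can be automated. The sketch below enumerates the 24 variables in the same order as the matrix (age changes slowest, gender fastest) and lists, for every category, which Xi appear in its equation. The dict layout and variable names are assumptions made for illustration.

```python
from itertools import product

dimensions = {
    "age": ["a", "b", "c"],
    "region": ["n", "s", "e", "w"],
    "gender": ["m", "f"],
}

# One variable per combination, in the same order as X1..X24 in the matrix.
variables = list(product(*dimensions.values()))

# One equation per (dimension, value): the 1-based indices of the Xi it sums.
equations = {}
for position, (name, values) in enumerate(dimensions.items()):
    for value in values:
        equations[(name, value)] = [
            i + 1
            for i, combo in enumerate(variables)
            if combo[position] == value
        ]

print(len(variables), len(equations))  # 24 9
print(equations[("region", "n")])      # [1, 2, 9, 10, 17, 18]
print(equations[("age", "a")])         # [1, 2, 3, 4, 5, 6, 7, 8]
```

Each index list is the left-hand side of one equation; the right-hand side is the count requested for that category.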

Now we want an integer solution since we cannot have a fractional row. Notice these are all linear equations. But we want an integer solution. Here is a link to a paper that describes how to solve these: https://www.math.uwaterloo.ca/~wgilbert/Research/GilbertPathria.pdf

Answers:

Putting the business logic in SQL is never a good idea, as it hampers the ability to absorb even minor changes.

My suggestion would be to do this in an ORM and keep the business logic abstracted from SQL.

For example, if you were using Django:

Your model would look like:

class User(models.Model):
    GENDER_CHOICES = (
      ('M', 'Male'),
      ('F','Female')
    )       
    gender = models.CharField(max_length=1, choices=GENDER_CHOICES)
    REGION_CHOICES = (
      ('E', 'East'),
      ('W','West'),
      ('N','North'),
      ('S','South')
    )
    region = models.CharField(max_length=1, choices=REGION_CHOICES)
    age = models.IntegerField()
    ETHNICITY_CHOICES = (
      .......
    ) 
    ethnicity = models.CharField(max_length=1, choices=ETHNICITY_CHOICES)
    income = models.FloatField()

And your query function could be something like this:

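# gender_limits is a dict like {'M':400, 'F':600}
# region_limits is a dict like {'N':100, 'E':200, 'W':300, 'S':400}
def get_users_by_gender_and_region(gender_limits,region_limits):
    # Start from empty querysets so the first "|" has something to combine with.
    gender_queryset = User.objects.none()
    region_queryset = User.objects.none()
    # Note: whether sliced querysets can be combined with | and & may
    # depend on the Django version and the database backend.
    for gender in gender_limits:
        gender_queryset = gender_queryset | User.objects.filter(gender=gender)[:gender_limits[gender]]
    for region in region_limits:
        region_queryset = region_queryset | User.objects.filter(region=region)[:region_limits[region]]
    return gender_queryset & region_queryset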

The query function can be abstracted further with the knowledge of all queries you plan to support, but this should serve as an example.

If you are using a different ORM, the same idea can be translated to that too as any good ORM would have the union and intersection abstraction.

Answers:

I would use a programming language to generate the SQL statements, but below is a solution in pure MySQL. One assumption is made: there are always enough males/females in each region to fit the numbers (e.g. what if there are no females living in the north?).

The routine pre-calculates the needed row quantities. LIMIT cannot be specified using a variable. I am more of an Oracle guy, where we have analytical functions. MySQL also provides this to some extent by allowing variables. So I set the target regions and genders and calculate the breakdown. Then I limit my output using the calculations.

This query shows the counts to prove the concept.

set @male=600;
set @female=400;
set @north=100;
set @south=200;
set @east=300;
set @west=400;
set @north_male=@north*(@male/(@male+@female));
set @south_male=@south*(@male/(@male+@female));
set @east_male =@east *(@male/(@male+@female));
set @west_male =@west *(@male/(@male+@female));
set @north_female=@north*(@female/(@male+@female));
set @south_female=@south*(@female/(@male+@female));
set @east_female =@east *(@female/(@male+@female));
set @west_female =@west *(@female/(@male+@female));

select gender, region, count(*) 
from (
          select * from (select @north_male  :=@north_male-1   as row, userid, gender, region from users where gender = 'Male' and region = 'North' ) mn where row>=0 
union all select * from (select @south_male  :=@south_male-1   as row, userid, gender, region from users where gender = 'Male' and region = 'South' ) ms where row>=0
union all select * from (select @east_male   :=@east_male-1    as row, userid, gender, region from users where gender = 'Male' and region = 'East'  ) me where row>=0
union all select * from (select @west_male   :=@west_male-1    as row, userid, gender, region from users where gender = 'Male' and region = 'West'  ) mw where row>=0
union all select * from (select @north_female:=@north_female-1 as row, userid, gender, region from users where gender = 'Female' and region = 'North' ) fn where row>=0 
union all select * from (select @south_female:=@south_female-1 as row, userid, gender, region from users where gender = 'Female' and region = 'South' ) fs where row>=0
union all select * from (select @east_female :=@east_female-1  as row, userid, gender, region from users where gender = 'Female' and region = 'East'  ) fe where row>=0
union all select * from (select @west_female :=@west_female-1  as row, userid, gender, region from users where gender = 'Female' and region = 'West'  ) fw where row>=0
) a
group by gender, region
order by gender, region;

Output:

Female  East   120
Female  North   40
Female  South   80
Female  West   160
Male    East   180
Male    North   60
Male    South  120
Male    West   240

Remove the outer part to get the real records:

set @male=600;
set @female=400;
set @north=100;
set @south=200;
set @east=300;
set @west=400;
set @north_male=@north*(@male/(@male+@female));
set @south_male=@south*(@male/(@male+@female));
set @east_male =@east *(@male/(@male+@female));
set @west_male =@west *(@male/(@male+@female));
set @north_female=@north*(@female/(@male+@female));
set @south_female=@south*(@female/(@male+@female));
set @east_female =@east *(@female/(@male+@female));
set @west_female =@west *(@female/(@male+@female));
          select * from (select @north_male  :=@north_male-1   as row, userid, gender, region from users where gender = 'Male' and region = 'North' ) mn where row>=0 
union all select * from (select @south_male  :=@south_male-1   as row, userid, gender, region from users where gender = 'Male' and region = 'South' ) ms where row>=0
union all select * from (select @east_male   :=@east_male-1    as row, userid, gender, region from users where gender = 'Male' and region = 'East'  ) me where row>=0
union all select * from (select @west_male   :=@west_male-1    as row, userid, gender, region from users where gender = 'Male' and region = 'West'  ) mw where row>=0
union all select * from (select @north_female:=@north_female-1 as row, userid, gender, region from users where gender = 'Female' and region = 'North' ) fn where row>=0 
union all select * from (select @south_female:=@south_female-1 as row, userid, gender, region from users where gender = 'Female' and region = 'South' ) fs where row>=0
union all select * from (select @east_female :=@east_female-1  as row, userid, gender, region from users where gender = 'Female' and region = 'East'  ) fe where row>=0
union all select * from (select @west_female :=@west_female-1  as row, userid, gender, region from users where gender = 'Female' and region = 'West'  ) fw where row>=0
;

For testing I have written a procedure which creates 10000 fully random sample records:

use test;
drop table if exists users;
create table users (userid int not null auto_increment, gender VARCHAR (20), region varchar(20), primary key (userid) );
drop procedure if exists load_users_table;
delimiter #
create procedure load_users_table()
begin
    declare l_max int unsigned default 10000;
    declare l_cnt int unsigned default 0;
    declare l_gender varchar(20);
    declare l_region varchar(20);
    declare l_rnd smallint;
    truncate table users;
    start transaction;
    WHILE l_cnt < l_max DO
        set l_rnd = floor( 0 + (rand()*2) );
        if l_rnd = 0 then
            set l_gender = 'Male';
        else
            set l_gender = 'Female';
        end if;
        set l_rnd=floor(0+(rand()*4));
        if l_rnd = 0 then
            set l_region = 'North';
        elseif l_rnd=1 then
            set l_region = 'South';
        elseif l_rnd=2 then
            set l_region = 'East';
        elseif l_rnd=3 then
            set l_region = 'West';
        end if;
        insert into users (gender, region) values (l_gender, l_region);
        set l_cnt=l_cnt+1;
    end while;
    commit;
end #
delimiter ;
call load_users_table();

select gender, region, count(*) 
from users
group by gender, region
order by gender, region;

Hope this all helps you. The bottom line is: Use a UNION ALL and restrict with pre-calculated variables not LIMIT.

Answers:

Well, I think the question is about randomly getting the records and not in the proportion of 60/40 for all regions. I have done for Region and Gender. It can be generalized to other fields like age, income and ethnicity in the same way.

    Declare @Mlimit bigint
    Declare @Flimit bigint
    Declare @Northlimit bigint
    Declare @Southlimit bigint 
    Declare @Eastlimit bigint
    Declare @Westlimit bigint  

    Set @Mlimit= 600
    Set @Flimit=400
    Set @Northlimit= 100
    Set @Southlimit=200
    Set @Eastlimit=300
    Set @Westlimit=400

    CREATE TABLE #Users(
        [UserId] [int]  NOT NULL,
        [gender] [varchar](10) NULL,
        [region] [varchar](10) NULL,
        [age] [int] NULL,
        [ethnicity] [varchar](50) NULL,
        [income] [bigint] NULL

    )
      Declare @MnorthCnt bigint
      Declare @MsouthCnt bigint
      Declare @MeastCnt bigint
      Declare @MwestCnt bigint

       Declare @FnorthCnt bigint
      Declare @FsouthCnt bigint
      Declare @FeastCnt bigint
      Declare @FwestCnt bigint

      Select @MnorthCnt=COUNT(*) from users where gender='male' and region='north' 
      Select @FnorthCnt=COUNT(*) from users where gender='female' and region='north' 

      Select @MsouthCnt=COUNT(*) from users where gender='male' and region='south' 
      Select @FsouthCnt=COUNT(*) from users where gender='female' and region='south' 

      Select @MeastCnt=COUNT(*) from users where gender='male' and region='east' 
      Select @FeastCnt=COUNT(*) from users where gender='female' and region='east' 
      Select @MwestCnt=COUNT(*) from users where gender='male' and region='west' 
      Select @FwestCnt=COUNT(*) from users where gender='female' and region='west' 

    If (@MnorthCnt+@FnorthCnt=@Northlimit)
    begin
     Insert into #Users select * from Users where region='north' 
    set @Northlimit=0
    set @Mlimit-=@MnorthCnt
    set @Flimit-=@FnorthCnt
    set @MnorthCnt=0 
    set @FnorthCnt=0
    end

    If (@MsouthCnt+@FsouthCnt=@Southlimit)
    begin
     Insert into #Users select * from Users where region='South' 
    set @Southlimit=0
    set @Mlimit-=@MsouthCnt
    set @Flimit-=@FsouthCnt
    set @MsouthCnt=0
    set @FsouthCnt=0
    end

    If (@MeastCnt+@FeastCnt=@Eastlimit)
    begin
     Insert into #Users select * from Users where region='East' 
    set @Eastlimit=0
    set @Mlimit-=@MeastCnt
    set @Flimit-=@FeastCnt
    set @MeastCnt=0
    set @FeastCnt=0
    end

    If (@MwestCnt+@FwestCnt=@Westlimit)
    begin
     Insert into #Users select * from Users where region='West' 
    set @Westlimit=0
    set @Mlimit-=@MwestCnt
    set @Flimit-=@FwestCnt
    set @MwestCnt=0
    set @FwestCnt=0
    end 

    If @MnorthCnt<@Northlimit
    Begin
    insert into #Users select top (@Northlimit-@MnorthCnt) * from Users where gender='female' and region='north'
    and userid not in (select userid from #users)
    set @Flimit-=(@Northlimit-@MnorthCnt)
    set @FNorthCnt-=(@Northlimit-@MnorthCnt)
    set @Northlimit-=(@Northlimit-@MnorthCnt)
    End

    If @FnorthCnt<@Northlimit
    Begin
    insert into #Users select top (@Northlimit-@FnorthCnt) * from Users where gender='male' and region='north'
    and userid not in (select userid from #users)
    set @Mlimit-=(@Northlimit-@FnorthCnt)
    set @MNorthCnt-=(@Northlimit-@FnorthCnt)
    set @Northlimit-=(@Northlimit-@FnorthCnt)
    End

    if @MsouthCnt<@southlimit
    Begin
    insert into #Users select top (@southlimit-@MsouthCnt) * from Users where gender='female' and region='south'
    and userid not in (select userid from #users)
    set @Flimit-=(@southlimit-@MsouthCnt)
    set @FSouthCnt-=(@southlimit-@MsouthCnt)
    set @southlimit-=(@southlimit-@MsouthCnt)
    End

    if @FsouthCnt<@southlimit
    Begin
    insert into #Users select top (@southlimit-@FsouthCnt) * from Users where gender='male' and region='south'
    and userid not in (select userid from #users)
    set @Mlimit-=(@southlimit-@FsouthCnt)
    set @MSouthCnt-=(@southlimit-@FsouthCnt)
    set @southlimit-=(@southlimit-@FsouthCnt)
    End

    if @MeastCnt<@eastlimit
    Begin
    insert into #Users select top (@eastlimit-@MeastCnt) * from Users where gender='female' and region='east'
    and userid not in (select userid from #users)
    set @Flimit-=(@eastlimit-@MeastCnt)
    set @FEastCnt-=(@eastlimit-@MeastCnt)
    set @eastlimit-=(@eastlimit-@MeastCnt)
    End

    if @FeastCnt<@eastlimit
    Begin
    insert into #Users select top (@eastlimit-@FeastCnt) * from Users where gender='male' and region='east'
    and userid not in (select userid from #users)
    set @Mlimit-=(@eastlimit-@FeastCnt)
    set @MEastCnt-=(@eastlimit-@FeastCnt)
    set @eastlimit-=(@eastlimit-@FeastCnt)
    End

    if @MwestCnt<@westlimit
    Begin
    insert into #Users select top (@westlimit-@MwestCnt) * from Users where gender='female' and region='west'
    and userid not in (select userid from #users)
    set @Flimit-=(@westlimit-@MwestCnt)
    set @FWestCnt-=(@westlimit-@MwestCnt)
    set @westlimit-=(@westlimit-@MwestCnt)
    End

    if @FwestCnt<@westlimit
    Begin
    insert into #Users select top (@westlimit-@FwestCnt) * from Users where gender='male' and region='west'
    and userid not in (select userid from #users)
    set @Mlimit-=(@westlimit-@FwestCnt)
    set @MWestCnt-=(@westlimit-@FwestCnt)
    set @westlimit-=(@westlimit-@FwestCnt)
    End


    IF (@MnorthCnt>=@Northlimit and @FnorthCnt>=@Northlimit and @MsouthCnt>=@Southlimit and @FsouthCnt>=@Southlimit and @MeastCnt>=@Eastlimit and @FeastCnt>=@Eastlimit and @MwestCnt>=@Westlimit and @FwestCnt>=@Westlimit and not(@Mlimit=0 and @Flimit=0))
    Begin

    ---Create Cursor
    DECLARE UC CURSOR FAST_forward
    FOR
    SELECT *
    FROM Users
    where userid not in (select userid from #users) 

    Declare @UserId [int]  ,
        @gender [varchar](10) ,
        @region [varchar](10) ,
        @age [int] ,
        @ethnicity [varchar](50) ,
        @income [bigint]   
    OPEN UC

    FETCH NEXT FROM UC
    INTO @UserId ,@gender, @region, @age, @ethnicity, @income

    WHILE @@FETCH_STATUS = 0 and not (@Mlimit=0 and @Flimit=0) 
    BEGIN
    If @gender='male' and @region='north' and @Northlimit>0 AND @Mlimit>0
    begin
    insert into #Users values (@UserId ,@gender, @region, @age, @ethnicity, @income)
    set @Mlimit-=1
    set @MNorthCnt-=1
    set @Northlimit-=1
    end  
    If @gender='male' and @region='south' and @southlimit>0 AND @Mlimit>0
    begin
    insert into #Users values (@UserId ,@gender, @region, @age, @ethnicity, @income)
    set @Mlimit-=1
    set @MsouthCnt-=1
    set @Southlimit-=1
    end 
    If @gender='male' and @region='east' and @eastlimit>0 AND @Mlimit>0
    begin
    insert into #Users values (@UserId ,@gender, @region, @age, @ethnicity, @income)
    set @Mlimit-=1
    set @MeastCnt-=1
    set @eastlimit-=1
    end  
    If @gender='male' and @region='west' and @westlimit>0 AND @Mlimit>0
    begin
    insert into #Users values (@UserId ,@gender, @region, @age, @ethnicity, @income)
    set @Mlimit-=1
    set @MwestCnt-=1
    set @westlimit-=1
    end 

    If @gender='female' and @region='north' and @Northlimit>0 AND @flimit>0
    begin
    insert into #Users values (@UserId ,@gender, @region, @age, @ethnicity, @income)
    set @Flimit-=1
    set @FNorthCnt-=1
    set @Northlimit-=1
    end  
    If @gender='female' and @region='south' and @southlimit>0 AND @flimit>0
    begin
    insert into #Users values (@UserId ,@gender, @region, @age, @ethnicity, @income)
    set @Flimit-=1
    set @FsouthCnt-=1
    set @Southlimit-=1
    end 
    If @gender='female' and @region='east' and @eastlimit>0 AND @flimit>0
    begin
    insert into #Users values (@UserId ,@gender, @region, @age, @ethnicity, @income)
    set @flimit-=1
    set @feastCnt-=1
    set @eastlimit-=1
    end  
    If @gender='female' and @region='west' and @westlimit>0 AND @flimit>0
    begin
    insert into #Users values (@UserId ,@gender, @region, @age, @ethnicity, @income)
    set @flimit-=1
    set @fwestCnt-=1
    set @westlimit-=1
    end   
    FETCH NEXT FROM UC
    INTO @UserId ,@gender, @region, @age, @ethnicity, @income
    END

    CLOSE UC

    DEALLOCATE UC

    end

    Select * from #Users

    SELECT GENDER, REGION, COUNT(*) AS COUNT FROM #USERS 
    GROUP BY GENDER, REGION
    DROP TABLE #Users
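The cursor loop in the script above is essentially a greedy single pass: a row is kept only while both its gender quota and its region quota still have room. As a rough illustration of that idea only (not a translation of the whole script), here is a minimal Python sketch; the function name `greedy_sample` and the in-memory list of dicts are assumptions made for the example.

```python
def greedy_sample(rows, gender_quota, region_quota):
    """Keep rows in order while both the gender and the region quota have room.

    Returns the picked rows plus whatever quota is left, so an unfilled
    request is visible to the caller.
    """
    gender_left = dict(gender_quota)
    region_left = dict(region_quota)
    picked = []
    for row in rows:
        g, r = row["gender"], row["region"]
        if gender_left.get(g, 0) > 0 and region_left.get(r, 0) > 0:
            picked.append(row)
            gender_left[g] -= 1
            region_left[r] -= 1
        # Stop early once every gender quota is used up (mirrors the WHILE condition).
        if not any(gender_left.values()):
            break
    return picked, gender_left, region_left


# Hypothetical data: six users, asking for 2 male / 2 female and 2 north / 2 south.
users = [
    {"userid": 1, "gender": "male", "region": "north"},
    {"userid": 2, "gender": "male", "region": "north"},
    {"userid": 3, "gender": "female", "region": "north"},
    {"userid": 4, "gender": "male", "region": "south"},
    {"userid": 5, "gender": "female", "region": "south"},
    {"userid": 6, "gender": "female", "region": "south"},
]
picked, gender_left, region_left = greedy_sample(
    users, {"male": 2, "female": 2}, {"north": 2, "south": 2}
)
print([u["userid"] for u in picked], gender_left, region_left)
```

A greedy pass like this can stall with quota left over when early rows use up a region that a later gender still needs, which is presumably why the script first handles the forced cases before opening the cursor.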

Questions:
Answers:

I expect you’d want to generate a bunch of queries based on the required filters.

I’ll explain a possible approach, with a full code sample – but note the caveats later on.
I’ll also address the issue where you can’t fulfil the requested sample from a proportional distribution, but you can from an adjusted distribution, and explain how to do that adjustment.

The basic algorithm goes like this:

Start with a set of filters {F1, F2, ... Fn}, each of which has a group of values and the percentages that should be distributed amongst those values. For example, F1 might be gender, with 2 values (F1V1 = Male: 60%, F1V2 = Female: 40%). You’ll also want the total sample size required (call this X). From this starting point you can combine the items from every filter into a single set of combined filter items, each with the quantity required for it.
The code should be able to handle any number of filters, with any number of values (either exact values, or ranges)

E.g. suppose 2 filters, F1: gender, {F1V1 = Male: 60%, F1V2 = Female: 40%}, F2: region, {F2V1 = North: 50%, F2V2 = South: 50%} and a total sample required of X = 10 people.
In this sample we’d like 6 of them to be male, 4 of them to be female, 5 to be from the north, and 5 to be from the south.

Then we do the following:

  1. Create an sql stub for each value in F1 – with an associated fraction of the initial percentage (i.e.
    • WHERE gender = 'Male' : 0.6,
    • WHERE gender = 'Female': 0.4 )
  2. For each item in F2 – create a new sql stub from every item from the step above – with the filter now being both the F1 Value & the F2 Value, and the associated fraction being the product of the 2 fractions. So we now have 2 x 2 = 4 items of
    • WHERE gender = 'Male' AND region = 'North': 0.6 * 0.5 = 0.3,
    • WHERE gender = 'Female' AND region = 'North': 0.4 * 0.5 = 0.2,
    • WHERE gender = 'Male' AND region = 'South': 0.6 * 0.5 = 0.3,
    • WHERE gender = 'Female' AND region = 'South': 0.4 * 0.5 = 0.2
  3. Repeat step 2 above for every additional Filter F3 to Fn. (in our example there were only 2 filters, so we are already done)
  4. Calculate the limit for each SQL stub as [fraction associated with stub] * X, where X is the total required sample size (so for our example that’s 0.3 * 10 = 3 for Male/North, 0.2 * 10 = 2 for Female/North, etc.)
  5. Finally, for every SQL stub, turn it into a complete SQL statement and add the limit
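Steps 1 to 4 amount to taking the cross product of the filter values and multiplying the fractions along the way. The following Python sketch shows just that part for the gender/region example; the `build_stubs` name and the dictionary layout are assumptions for illustration, and the values are inlined into the WHERE text only for readability (real code should bind them as query parameters).

```python
from itertools import product


def build_stubs(filters, total):
    """Return (where_clause, fraction, limit) for every combination of filter values.

    `filters` maps a column name to {value: fraction}. The limit is simply
    round(fraction * total), so the limits are not guaranteed to add up to
    `total` when the products are not whole numbers.
    """
    columns = list(filters)
    stubs = []
    for combo in product(*(filters[c].items() for c in columns)):
        clause = " AND ".join(
            f"{col} = '{val}'" for col, (val, _) in zip(columns, combo)
        )
        fraction = 1.0
        for _, frac in combo:
            fraction *= frac
        stubs.append((f"WHERE {clause}", fraction, round(fraction * total)))
    return stubs


filters = {
    "gender": {"Male": 0.6, "Female": 0.4},
    "region": {"North": 0.5, "South": 0.5},
}
stubs = build_stubs(filters, 10)
for where, fraction, limit in stubs:
    print(where, "->", limit)  # limits come out as 3, 3, 2, 2
```

Adding a third filter only means adding another entry to `filters`; the number of stubs is the product of the number of values in each filter, so it grows quickly.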

Code Sample

I’ll provide C# code for this, but it should be easy enough to translate this to other languages.
It would be pretty tricky to attempt this in pure dynamic SQL.

Note this is untested, and probably full of errors, but it’s an idea of the approach you could take.

I’ve defined a public method and a public class – which would be the entry point.

// This is an example of a public class you could use to hold one of your filters
// For example - if you wanted 60% male / 40% female, you could have an item with 
//    item1 = {Fraction: 0.6, ValueExact: 'Male', RangeStart: null, RangeEnd: null}
//  & item2 = {Fraction: 0.4, ValueExact: 'Female', RangeStart: null, RangeEnd: null}
public class FilterItem{
    public decimal Fraction {get; set;}
    public string ValueExact {get; set;}
    public int? RangeStart {get; set;}
    public int? RangeEnd {get; set;}
}

// This is an example of a public method you could call to build your SQL 
// - passing in a generic list of desired filters
// for example the dictionary entry for the above filter would be 
// {Key: "gender", Value: new List<FilterItem>(){item1, item2}}
public string BuildSQL(Dictionary<string, List<FilterItem>> filters, int TotalItems)
{
    // we want to build up a list of SQL stubs that can be unioned together.
    var sqlStubItems = new List<SqlItem>();
    foreach(var entry in filters)
    {
        AddFilter(entry.Key, entry.Value, sqlStubItems);
    }
    // ok - now just combine all of the sql stubs into one big union.
    var result = ""; // I'd use a StringBuilder for this normally, 
                     // but this is probably more cross-language readable.
    int limitSum = 0;
    for(int i = 0; i < sqlStubItems.Count; i++) // string.Join() would be more succinct!
    {
       var item = sqlStubItems[i];
       if (i > 0)
       {
           // the stubs select disjoint groups (assuming the values within each 
           // filter do not overlap), so UNION ALL avoids a needless de-duplication pass
           result  += " UNION ALL ";
       }
       int limit = (int)Math.Round(TotalItems * item.Fraction, 0);
       limitSum+= limit;
       if (i == sqlStubItems.Count - 1 && limitSum != TotalItems)
       {
          //may need to adjust one of the rounded items to account 
          //for rounding errors making a total that is not the 
          //originally required total limit.
          limit += (TotalItems - limitSum);
       }
       // MySQL only applies LIMIT to an individual SELECT of a UNION 
       // when that SELECT is wrapped in parentheses.
       result += "(" + item.Sql + " LIMIT " 
              + Convert.ToString(limit) + ")";

    }
    return result;
}

// This method expands the number of SQL stubs for every filter that has been added.
// each existing filter is split by the number of items in the newly added filter.
private void AddFilter(string filterType, 
                       List<FilterItem> filterValues, 
                       List<SqlItem> SqlItems)
{
   var newItems = new List<SqlItem>();

   foreach(var filterItem in filterValues)
   {
       string filterAddon; 
       if (filterItem.RangeStart.HasValue && filterItem.RangeEnd.HasValue){
           filterAddon = filterType + " >= " + filterItem.RangeStart.ToString() 
                       + " AND " + filterType + " <= " + filterItem.RangeEnd.ToString();
       } else {
           filterAddon = filterType + " = '" 
                         + filterItem.ValueExact.Replace("'","''") + "'"; 
                         //beware of SQL injection. (hence the .Replace() above)
       }
       if(SqlItems.Count == 0)
       {
           newItems.Add(new SqlItem(){Sql = "Select * FROM users WHERE " 
                                      + filterAddon, Fraction = filterItem.Fraction});
       } else {
           foreach(var existingItem in SqlItems)
           {
               newItems.Add(new SqlItem()
               {
                 // append to the existing stub's SQL text (not the SqlItem object itself)
                 Sql = existingItem.Sql +  " AND " + filterAddon, 
                 Fraction = existingItem.Fraction * filterItem.Fraction
               });
           }
       }
   }
   SqlItems.Clear();
   SqlItems.AddRange(newItems);
}



// this class is for part-built SQL strings, with the fraction
private class SqlItem{
  public string Sql { get; set;}
  public decimal Fraction{get; set;}
}

Notes (as per comment by Sign)

  • Rounding errors may mean you don’t get exactly the 600 / 400 split you were aiming for when applying a large number of filters – but should be close.
  • If your dataset is not very diverse then it may not be possible to always generate the required split. This method will require an even distribution amongst the filters (so if you were doing a total of 10 people, 6 male, 4 female, 5 from the north, 5 from the south it would require 3 males from the north, 3 males from the south, 2 females from the north and 2 females from the south.)
  • The people are not going to be retrieved at random, just in whatever the default sort order is. You would need to add something like ORDER BY RAND() (but not that, as it’s VERY inefficient) to get a random selection.
  • Beware of SQL injection. Sanitise all user input, replacing single quote ' chars, or better still pass the values as query parameters.
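On the rounding note: the code above rounds each limit and pushes any difference onto the last stub. One alternative, which is a different technique from the one in the sample, is largest-remainder rounding: take the floor of every exact quota and hand the leftover units to the entries with the largest fractional parts. A minimal Python sketch, assuming the fractions add up to exactly 1 (the `allocate` name is made up for the example):

```python
from fractions import Fraction
from math import floor


def allocate(fractions, total):
    """Split `total` into integer limits proportional to `fractions`.

    Uses largest-remainder rounding, so the limits always sum to `total`
    as long as the fractions sum to exactly 1. Fractions may be given as
    strings ("0.3") or Fraction objects to avoid float error.
    """
    exact = [Fraction(f) * total for f in fractions]
    limits = [floor(x) for x in exact]
    leftover = total - sum(limits)
    # Indices ordered by fractional part, largest first (ties keep input order).
    by_remainder = sorted(
        range(len(exact)), key=lambda i: exact[i] - limits[i], reverse=True
    )
    for i in by_remainder[:leftover]:
        limits[i] += 1
    return limits


print(allocate(["0.3", "0.3", "0.2", "0.2"], 10))  # [3, 3, 2, 2]
print(allocate([Fraction(1, 3)] * 3, 10))          # [4, 3, 3]
```

This spreads the rounding error one unit at a time across the stubs instead of concentrating it in a single one, though it still does nothing about the marginal totals drifting when many filters are combined.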

Badly distributed sample problem

How do you address the problem of there being insufficient items in one of our buckets to create our sample as per a representative split (that the above algorithm gives)? Or what if your numbers are not integers?

Well, I won’t go so far as to provide code, but I will describe a possible approach. You’d need to alter the code above quite a bit, because a flat list of SQL stubs isn’t going to cut it anymore. Instead you’d need to build an n-dimensional matrix of SQL stubs (adding a dimension for every filter F1 to Fn). After step 4 above has been completed (where we have our desired, but not necessarily possible, numbers for each SQL stub item), what I’d expect to do is:

  1. generate SQL to select counts for all the combined sql WHERE stubs.
  2. Then you’d iterate the collection – and if you hit an item where the requested limit is higher than the count (or not an integer),
    • adjust the requested limit down to the count (or nearest integer).
    • Then pick another item on each axis whose limit is at least the above adjustment lower than its max count, and adjust it up by the same amount. If it’s not possible to find qualifying items then your requested split is not possible.
    • Then adjust all the intersecting items for the upward adjusted items down again
    • Repeat the step above for intersects between the intersecting points for every additional dimension to n (but toggle the adjustment between negative and positive each time)

So suppose continuing our previous example – our representative split is:
Male/North = 3, Female/North = 2, Male/South = 3, Female/South = 2, but there are only 2 males in the north (although there are loads of people in the other groups we could pick).

  • We adjust Male/North down to 2 (-1)
  • We adjust Female/North to 3 (+1) and Male/South to 4 (+1)
  • We adjust the Intersecting Female/South to 1 (-1). Voila! (there are no additional dimensions as we only had 2 criteria/dimensions)
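For the two-filter case only, the adjustment in this example can be sketched in a few lines of Python. This is an illustrative simplification rather than the general n-dimensional procedure: it assumes a plain 2-D table of limits and a matching table of available counts, and it only tries to move a cell's whole shortfall through one rectangle of cells. The `rebalance` and `find_partner` names are invented for the sketch.

```python
def find_partner(limits, counts, r, c, short):
    """Find (r2, c2) so the rectangle (r,c),(r,c2),(r2,c),(r2,c2) can absorb `short`."""
    for r2 in range(len(limits)):
        for c2 in range(len(limits[0])):
            if r2 == r or c2 == c:
                continue
            if (counts[r][c2] - limits[r][c2] >= short      # room to raise same-row cell
                    and counts[r2][c] - limits[r2][c] >= short  # room to raise same-column cell
                    and limits[r2][c2] >= short):               # opposite corner can be lowered
                return r2, c2
    return None


def rebalance(limits, counts):
    """Return adjusted limits that fit `counts` while keeping all row/column totals."""
    limits = [row[:] for row in limits]
    for r in range(len(limits)):
        for c in range(len(limits[0])):
            short = limits[r][c] - counts[r][c]
            if short <= 0:
                continue
            partner = find_partner(limits, counts, r, c, short)
            if partner is None:
                raise ValueError(f"requested split not possible at cell ({r}, {c})")
            r2, c2 = partner
            limits[r][c] -= short
            limits[r][c2] += short
            limits[r2][c] += short
            limits[r2][c2] -= short
    return limits


# Rows are Male, Female; columns are North, South. Only 2 males live in the north.
wanted = [[3, 3], [2, 2]]
available = [[2, 100], [100, 100]]
print(rebalance(wanted, available))  # [[2, 4], [3, 1]]
```

The row totals (6 and 4) and column totals (5 and 5) are unchanged after the swap, which is the invariant the adjustment is meant to preserve.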

This illustration may be helpful when adjusting intersecting items in higher dimensions (it only shows up to 4 dimensions, but should help to picture what needs to be done). Each point represents one of our SQL stub items in the n-dimensional matrix (and has an associated limit number). A line represents a common criteria value (such as gender = male). The objective is that the total along any line should remain the same after adjustments have finished! We start with the red point, and continue for each additional dimension. In the example above we would only be looking at 2 dimensions: a square formed from the red point, the 2 orange points above and to the right of it, and the 1 green point to the NE to complete the square.

[Illustration: adjustments]

Questions:
Answers:

I’d go with GROUP BY:

SELECT gender,region,count(*) FROM users GROUP BY gender,region

+----------------------+
|gender|region|count(*)|
+----------------------+
|f     |E     |     129|
|f     |N     |      43|
|f     |S     |      84|
|f     |W     |     144|
|m     |E     |     171|
|m     |N     |      57|
|m     |S     |     116|
|m     |W     |     256|
+----------------------+

You can verify you have 600 males, 400 females, 100 North, 200 South, 300 East and 400 West.
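As a quick check of those totals, the cell counts from the table can be summed per gender and per region. A small Python sketch, with the counts copied from the table above (the variable names are just for the example):

```python
from collections import Counter

# (gender, region) -> count(*), copied from the GROUP BY result above.
cells = {
    ("f", "E"): 129, ("f", "N"): 43, ("f", "S"): 84, ("f", "W"): 144,
    ("m", "E"): 171, ("m", "N"): 57, ("m", "S"): 116, ("m", "W"): 256,
}

by_gender = Counter()
by_region = Counter()
for (gender, region), n in cells.items():
    by_gender[gender] += n
    by_region[region] += n

print(dict(by_gender))  # {'f': 400, 'm': 600}
print(dict(by_region))  # {'E': 300, 'N': 100, 'S': 200, 'W': 400}
```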

You can include other fields as well.

For range fields, like age and income, you can follow this example:

SELECT
  gender,
  region,
  case when age < 30 then 'Young'
       when age between 30 and 59 then 'Middle aged'
       else 'Old' end as age_range,
  count(*)
FROM users
GROUP BY gender,region, age_range

So, the results would be like:

+----------------------------------+
|gender|region|age_range  |count(*)|
+----------------------------------+
|f     |E     |Middle aged|      56|
|f     |E     |Old        |      31|
|f     |E     |Young      |      42|
|f     |N     |Middle aged|      14|
|f     |N     |Old        |      11|
|f     |N     |Young      |      18|
|f     |S     |Middle aged|      40|
|f     |S     |Old        |      23|
|f     |S     |Young      |      21|
|f     |W     |Middle aged|      67|
|f     |W     |Old        |      42|
|f     |W     |Young      |      35|
|m     |E     |Middle aged|      77|
|m     |E     |Old        |      56|
|m     |E     |Young      |      38|
|m     |N     |Middle aged|      13|
|m     |N     |Old        |      25|
|m     |N     |Young      |      19|
|m     |S     |Middle aged|      46|
|m     |S     |Old        |      39|
|m     |S     |Young      |      31|
|m     |W     |Middle aged|     103|
|m     |W     |Old        |      66|
|m     |W     |Young      |      87|
+----------------------------------+