Assume I have a single MySQL table (users) with the following fields:

```
userid
gender
region
age
ethnicity
income
```

I want to be able to return the number of total records based on the number a user enters. Furthermore, they will also be providing additional criteria.

In the simplest example, they may ask for 1,000 records, where 600 records should have gender = ‘Male’ and 400 records where gender = ‘Female’. That’s simple enough to do.

Now, go one step further. Assume they now want to specify Region:

```
GENDER
Male: 600 records
Female: 400 records
REGION
North: 100 records
South: 200 records
East: 300 records
West: 400 records
```

Again, only 1000 records should be returned, but in the end, there must be 600 males, 400 females, 100 Northerners, 200 Southerners, 300 Easterners and 400 Westerners.

I know this isn’t valid syntax, but using pseudo-mySQL code, it hopefully illustrates what I’m trying to do:

```
(SELECT * FROM users WHERE gender = 'Male' LIMIT 600
UNION
SELECT * FROM users WHERE gender = 'Female' LIMIT 400)
INTERSECT
(SELECT * FROM users WHERE region = 'North' LIMIT 100
UNION
SELECT * FROM users WHERE region = 'South' LIMIT 200
UNION
SELECT * FROM users WHERE region = 'East' LIMIT 300
UNION
SELECT * FROM users WHERE region = 'West' LIMIT 400)
```

Note that I’m not looking for a one-time query. The total number of records and the number of records for each criterion will constantly change based on user input. So I’m trying to come up with a generic solution that can be reused over and over, not a hard-coded one.

To make things more complicated, now add more criteria. There could also be age, ethnicity and income each with their own set number of records for each group, additional code appended to above:

```
INTERSECT
(SELECT * FROM users WHERE age >= 18 and age <= 24 LIMIT 300
UNION
SELECT * FROM users WHERE age >= 25 and age <= 36 LIMIT 200
UNION
SELECT * FROM users WHERE age >= 37 and age <= 54 LIMIT 200
UNION
SELECT * FROM users WHERE age >= 55 LIMIT 300)
INTERSECT
etc.
```

I’m not sure if this is possible to write in one query or if this requires multiple statements and iterations.

## Flatten Your Criteria

You can flatten your multi-dimensional criteria into single-level criteria, with one quota per gender/region combination.

These flattened criteria can then be met in one query, as follows:

```
(SELECT * FROM users WHERE gender = 'Male' AND region = 'North' LIMIT 60) UNION ALL
(SELECT * FROM users WHERE gender = 'Male' AND region = 'South' LIMIT 120) UNION ALL
(SELECT * FROM users WHERE gender = 'Male' AND region = 'East' LIMIT 180) UNION ALL
(SELECT * FROM users WHERE gender = 'Male' AND region = 'West' LIMIT 240) UNION ALL
(SELECT * FROM users WHERE gender = 'Female' AND region = 'North' LIMIT 40) UNION ALL
(SELECT * FROM users WHERE gender = 'Female' AND region = 'South' LIMIT 80) UNION ALL
(SELECT * FROM users WHERE gender = 'Female' AND region = 'East' LIMIT 120) UNION ALL
(SELECT * FROM users WHERE gender = 'Female' AND region = 'West' LIMIT 160)
```

**Problem**

- It does not always return the correct result. For example, if there are fewer male users from the North than that combination's limit, the query will return fewer than 1,000 records.
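Because the quotas change with every request, the flattened query is probably easiest to generate in application code. The following is a minimal sketch in Python, assuming the `users` table from the question and a MySQL driver that accepts `%s` placeholders; `build_union_query` and the sample limits are hypothetical and only illustrate the idea.

```python
def build_union_query(cell_limits):
    """Build one UNION ALL query from {(gender, region): limit}.

    Returns the SQL text and the list of parameters for the %s placeholders.
    Limits are inlined as integers; gender and region stay as parameters.
    """
    parts, params = [], []
    for (gender, region), limit in cell_limits.items():
        if limit <= 0:
            continue  # nothing requested for this combination
        parts.append(
            "(SELECT * FROM users WHERE gender = %s AND region = %s "
            f"LIMIT {int(limit)})"
        )
        params.extend([gender, region])
    return "\nUNION ALL\n".join(parts), params


# Illustrative limits for 600 males / 400 females over four regions.
limits = {
    ("Male", "North"): 60, ("Male", "South"): 120,
    ("Male", "East"): 180, ("Male", "West"): 240,
    ("Female", "North"): 40, ("Female", "South"): 80,
    ("Female", "East"): 120, ("Female", "West"): 160,
}
sql, params = build_union_query(limits)
print(sql)
```

The resulting `sql` and `params` would then be passed to the driver's `execute` call. The sketch does nothing about combinations that have too few rows.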

## Adjust Your Criteria

Let’s say that there are fewer than 40 male users from the North. Then you need to adjust the quantities of the other criteria to cover the amount missing from “Male” and “North”. I believe it is not possible to do this with bare SQL. This is the pseudocode I have in mind. To keep things simple, we will only query for Male, Female, North, and South.

```
conditions.add({ gender: 'Male', region: 'North', limit: 40 })
conditions.add({ gender: 'Male', region: 'South', limit: 80 })
conditions.add({ gender: 'Female', region: 'North', limit: 60 })
conditions.add({ gender: 'Female', region: 'South', limit: 120 })
foreach(conditions as condition) {
    temp = getResultFromDatabaseByCondition(condition)
    conditions.remove(condition)
    // there are not enough results for this condition,
    // so increase the quantity of other conditions
    if (temp.length < condition.limit) {
        adjust(...);
    }
}
```

Let’s say that there are only 30 northern males. So we need to adjust by +10 Male and +10 North.

```
To Adjust
---------------------------------------------------
Male +10
North +10
Remain Conditions
----------------------------------------------------
{ gender: 'Male', region: 'South', limit: 80 }
{ gender: 'Female', region: 'North', limit: 60 }
{ gender: 'Female', region: 'South', limit: 120 }
```

‘Male’ + ‘South’ is the first condition that matches the ‘Male’ adjustment. Increase it by 10 and remove it from the “Remain Conditions” list. Since we increased South, we need to decrease it again in another condition, so add a “South” entry to the “To Adjust” list.

```
To Adjust
---------------------------------------------------
South -10
North +10
Remain Conditions
----------------------------------------------------
{ gender: 'Female', region: 'North', limit: 60 }
{ gender: 'Female', region: 'South', limit: 120 }
Final Conditions
----------------------------------------------------
{ gender: 'Male', region: 'South', limit: 90 }
```

Find a condition that matches ‘South’ and repeat the same process.

```
To Adjust
---------------------------------------------------
Female +10
North +10
Remain Conditions
----------------------------------------------------
{ gender: 'Female', region: 'North', limit: 60 }
Final Conditions
----------------------------------------------------
{ gender: 'Female', region: 'South', limit: 110 }
{ gender: 'Male', region: 'South', limit: 90 }
```

And finally

```
{ gender: 'Female', region: 'North', limit: 70 }
{ gender: 'Female', region: 'South', limit: 110 }
{ gender: 'Male', region: 'South', limit: 90 }
```

I haven’t come up with the exact implementation of adjustment yet. It is more difficult than I have expected. I will update once I can figure out how to implement it.
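For the two-by-two walkthrough above specifically, the adjustment amounts to moving the shortfall around a rectangle: the short cell loses it, its two neighbours gain it, and the opposite corner loses it, so every gender and region total stays the same. The following Python sketch covers only that special case; `rebalance_2x2` and the `available` counts are hypothetical, and this is not a general implementation of the adjustment.

```python
def rebalance_2x2(limits, available, short):
    """Shift quota away from one short cell of a 2x2 grid.

    limits, available: {(gender, region): count}
    short: key of the cell that has fewer rows than its limit.
    Row and column totals of the returned limits are unchanged.
    """
    gender, region = short
    (other_gender,) = {g for g, _ in limits} - {gender}
    (other_region,) = {r for _, r in limits} - {region}

    deficit = limits[short] - available[short]
    if deficit <= 0:
        return dict(limits)  # the cell is not short after all

    adjusted = dict(limits)
    adjusted[short] -= deficit
    adjusted[(gender, other_region)] += deficit
    adjusted[(other_gender, region)] += deficit
    adjusted[(other_gender, other_region)] -= deficit

    for cell, count in adjusted.items():
        if count < 0 or count > available[cell]:
            raise ValueError(f"cannot rebalance: {cell} would need {count} rows")
    return adjusted


limits = {
    ("Male", "North"): 40, ("Male", "South"): 80,
    ("Female", "North"): 60, ("Female", "South"): 120,
}
# Assumed row counts: only 30 northern males, plenty of everything else.
available = {cell: 1000 for cell in limits}
available[("Male", "North")] = 30

adjusted = rebalance_2x2(limits, available, ("Male", "North"))
print(adjusted)
```

With more than two values per dimension there are many possible rectangles to choose from, which is where the difficulty mentioned above comes in.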

The problem that you describe is a multi-dimensional modeling problem. In particular, you are trying to get a stratified sample along multiple dimensions at the same time. The key to this is to go down to the smallest level of granularity and build up the sample from there.

I am further assuming that you want the sample to be representative at all levels. That is, you don’t want all the users from “North” to be female. Or all the “males” to be from “West”, even if that does meet the end criteria.

Start by thinking in terms of a total number of records, dimensions, and allocations along each dimension. For instance, for the first sample, think of it as:

- 1000 records
- 2 dimensions: gender, region
- gender split: 60%, 40%
- region split: 10%, 20%, 30%, 40%

Then, you want to allocate these numbers to each gender/region combination. The numbers are:

- North, Male: 60
- North, Female: 40
- South, Male: 120
- South, Female: 80
- East, Male: 180
- East, Female: 120
- West, Male: 240
- West, Female: 160

You’ll see that these add up along the dimensions.

The calculation of the numbers in each cell is pretty easy. It is the product of the percentages times the total. So, “East, Female” is 30%*40% * 1000 . . . Voila! The value is 120.
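As a small illustration of that product rule, here is a Python sketch that computes the expected count for every cell from the per-dimension quotas. The function name `allocate_cells` and the `quotas` layout are assumptions made for this example; exact fractions are used so that no floating-point error creeps in.

```python
from fractions import Fraction
from itertools import product


def allocate_cells(total, quotas):
    """Expected count per cell: total times the product of each dimension's share.

    quotas: {dimension: {value: requested_count}}; each dimension must sum to total.
    Returns {(value_1, value_2, ...): Fraction}, one key per combination.
    """
    for dim, counts in quotas.items():
        if sum(counts.values()) != total:
            raise ValueError(f"{dim} quotas must add up to {total}")

    cells = {}
    for combo in product(*(counts.items() for counts in quotas.values())):
        expected = Fraction(total)
        for _value, count in combo:
            expected *= Fraction(count, total)  # this dimension's share
        cells[tuple(value for value, _count in combo)] = expected
    return cells


quotas = {
    "region": {"North": 100, "South": 200, "East": 300, "West": 400},
    "gender": {"Male": 600, "Female": 400},
}
cells = allocate_cells(1000, quotas)
print(cells[("East", "Female")])  # 120
```

When the products are not whole numbers, the results still need to be rounded in a way that preserves the totals.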

Here is the solution:

- Take the input along each dimension as *percentages* of the total. And be sure they add up to 100% along each dimension.
- Create a table of the expected percentages for each of the cells. This is the product of the percentages along each dimension.
- Multiply the expected percentages by the overall total.
- The final query is outlined below.

Assume that you have a table `cells` with the expected count and the original data (`users`).

```
select enumerated.*
from (select u.*,
             concat_ws(':', dim1, dim2, dim3) as dims,
             (@rn := if(@dims = concat_ws(':', dim1, dim2, dim3), @rn + 1,
                        if(@dims := concat_ws(':', dim1, dim2, dim3), 1, 1)
                       )
             ) as seqnum
      from users u cross join
           (select @dims := '', @rn := 0) vars
      order by dim1, dim2, dim3, rand()
     ) enumerated join
     cells
     on enumerated.dims = cells.dims
where enumerated.seqnum <= cells.expectedcount;
```

Note that this is a sketch of the solution. You have to fill in the details about the dimensions.

This will work as long as you have enough data for all the cells.

In practice, when doing this type of multi-dimensional stratified sampling, you do run the risk that cells will be empty or too small. When this happens, you can often fix this with an additional pass afterwards. Take what you can from the cells that are large enough. These typically account for the majority of the data needed. Then add records in to meet the final count. The records to be added in are those whose values match what is needed along the most needed dimensions. However, this solution simply assumes that there is enough data to satisfy your criteria.

The problem with your request is that there is an enormous number of ways to achieve the proposed numbers:

```
         Male   Female   Sum
-----------------------------
North:    100        0   100
South:    200        0   200
East:     300        0   300
West:       0      400   400
Sum:      600      400
-----------------------------
North:     99        1   100
South:    200        0   200
East:     300        0   300
West:       1      399   400
Sum:      600      400
-----------------------------
....
-----------------------------
North:      0      100   100
South:    200        0   200
East:       0      300   300
West:     400        0   400
Sum:      600      400

Just by combining North, East and West (with South always male: 200) you’ll get 400 possible ways to achieve the proposed numbers. And it gets even more complicated when you have only a limited number of records in each “*class*” (Male/North = “*class*”).

You may need up to `MIN(COUNT(gender), COUNT(location))` records for every cell in the table above (for the case that its counterpart is zero).

That is up to:

```
Male Female
---------------------
North: 100 100
South: 200 200
East: 300 300
West: 400 400
```

So you need to count the available records of each gender/location pair, `AVAILABLE(gender, location)`.
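Counting the available records per pair is a single `GROUP BY`. The sketch below uses Python with an in-memory SQLite database purely as a stand-in for the MySQL `users` table; the sample rows and the helper name `available_counts` are made up for the example.

```python
import sqlite3

# In-memory stand-in for the users table (only the columns needed here).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (userid INTEGER PRIMARY KEY, gender TEXT, region TEXT)"
)
sample_rows = (
    [("Male", "North")] * 3
    + [("Female", "North")] * 1
    + [("Male", "South")] * 2
)
conn.executemany("INSERT INTO users (gender, region) VALUES (?, ?)", sample_rows)


def available_counts(conn):
    """Return {(gender, region): number of rows} for every pair that occurs."""
    cursor = conn.execute(
        "SELECT gender, region, COUNT(*) FROM users GROUP BY gender, region"
    )
    return {(gender, region): count for gender, region, count in cursor}


available = available_counts(conn)
# Pairs with no rows are simply absent, so read them with a default of 0.
print(available.get(("Female", "South"), 0))
```

The same `SELECT ... GROUP BY gender, region` statement works unchanged in MySQL.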

Finding particular fit seems to be close to *semimagic squares*[1][2].

And there are several questions on math.stackexchange.com about this [3][4].

I’ve ended up reading some paper on how to construct these and I doubt it’s possible to do this with one select.

If you have enough records and won’t end up in situation like this:

```
Male Female
---------------------
North: 100 0
South: 200 200
East: 300 0
West: 200 200
```

I would go with iterating through the locations and adding a proportional number of Males/Females in each step:

- M: 100 (16%); F: 0 (0%)
- M: 100 (16%); F: 200 (50%)
- M: 400 (66%); F: 200 (50%)
- M: 600 (100%); F: 400 (100%)

But this will give you only approximate results, and after validating those you may want to iterate through the result a few times and adjust the counts in each category until they are “*good enough*”.

I’d build a map of the distribution of the database and use that to implement the sampling logic. Bonuses include possibility to add quick demography feedback to the user and no additional burden to the server. On the con side, you’d need to implement a mechanism to keep the database and the map in sync.

It could look like this using JSON:

```
{"gender":{
"Male":{
"amount":35600,
"region":{
"North":{
"amount":25000,
"age":{
"18":{
"amount":2400,
"ethnicity":{
...
"income":{
...
}
},
"income":{
...
"ethnicity":{
...
}
}
},
"19":{
...
},
...
"120":{
...
}
},
"ethnicity":{
...
},
"income":{
...
}
},
"South":{
...
},
...
}
"age":{
...
}
"ethnicity":{
...
},
"income":{
...
}
},
"Female":{
...
}
},
"region":{
...
},
"age":{
...
},
"ethnicity":{
...
},
"income":{
...
}}
```

So the user selects

```
total 1000
600 Male
400 Female
100 North
200 South
300 East
400 West
300 <20 years old
300 21-29 years old
400 >=30 years old
```

Calculate a linear distribution:

```
male-north-u20: 1000*0.6*0.1*0.3=18
male-north-21to29: 18
male-north-o29: 24 (keep a track of rounding errors)
etc
```

then we’d check the map:

```
tmp.male.north.u20=getSumUnder(JSON.gender.Male.region.North.age,20) // == 10
tmp.male.north.f21to29=getSumBetween(JSON.gender.Male.region.North.age,21,29) // == 29
tmp.male.north.o29=getSumOver(JSON.gender.Male.region.North.age,29) // == 200
etc
```

Mark everything that meets the linear distribution as OK and keep track of the surplus. If something (like male.north.u20) falls short, first adjust within the parent (to make sure male.north, for example, still meets its criteria): you are missing 8 for u20 and overuse f21to29 by 8. After the first run, adjust each missing criterion in the other regions, for example `tmp.male.south.u20+=8;tmp.male.south.f21to29-=8;`.

It is pretty tedious to get it right.

In the end you have the correct distribution that can be used to construct a trivial SQL query.
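The linear distribution step notes that rounding errors need tracking. One common way to handle them is largest-remainder rounding: round every cell down, then hand the leftover units to the cells with the largest fractional parts. The Python sketch below is an assumption about how that step could be done, not part of the approach described above; the function name and the sample numbers are illustrative.

```python
from fractions import Fraction
from math import floor


def round_preserving_total(expected):
    """Round {cell: exact count} to integers without changing the overall sum.

    The exact counts must add up to a whole number.
    """
    total = sum(expected.values())
    if total != int(total):
        raise ValueError("expected counts must add up to a whole number")

    rounded = {cell: floor(value) for cell, value in expected.items()}
    leftover = int(total) - sum(rounded.values())
    # Cells with the largest fractional part get the leftover units first.
    by_remainder = sorted(
        expected, key=lambda cell: expected[cell] - rounded[cell], reverse=True
    )
    for cell in by_remainder[:leftover]:
        rounded[cell] += 1
    return rounded


# 50 records: 30 male, 20 female, each split evenly over three age bands.
bands = ["u20", "21to29", "o29"]
expected = {("Male", band): Fraction(30, 3) for band in bands}
expected.update({("Female", band): Fraction(20, 3) for band in bands})

rounded = round_preserving_total(expected)
print(rounded)
```

This keeps the grand total exact; the per-dimension totals can still drift by a unit and may need the same treatment.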

This can be solved in two steps. I will describe how to do it for the example where gender and region are the dimensions. Then I will describe the more general case. In the first step we solve a system of equations of 8 variables, then we take the disjoint union of 8 select statements limited by the solutions found in step one. Notice that there are only 8 possibilities for any row. They can be male or female and then the region is one of north, south, east or west. Now let,

```
X1 equal the number of rows that are male and from the north,
X2 equal the number of rows that are male and from the south,
X3 equal the number of rows that are male and from the east,
X4 equal the number of rows that are male and from the west,
X5 equal the number of rows that are female and from the north,
X6 equal the number of rows that are female and from the south,
X7 equal the number of rows that are female and from the east,
X8 equal the number of rows that are female and from the west
```

The equations are:

```
X1+X2+X3+X4=600
X5+X6+X7+X8=400
X1+X5=100
X2+X6=200
X3+X7=300
X4+X8=400
```

Now solve for X1, X2, … X8 in the above. There are many solutions (I will describe how to solve in a moment). Here is one:

```
X1=60, X2=120, X3=180,X4=240,X5=40,X6=80,X7=120,X8=160.
```
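A candidate solution can be checked against the six equations by summing the cells along each dimension. A short Python sketch, with the cell values taken from the solution above; the helper name `margins` is made up for the example:

```python
from collections import Counter


def margins(cells):
    """Sum a {(gender, region): count} table along each dimension."""
    by_gender, by_region = Counter(), Counter()
    for (gender, region), count in cells.items():
        by_gender[gender] += count
        by_region[region] += count
    return dict(by_gender), dict(by_region)


solution = {
    ("Male", "North"): 60, ("Male", "South"): 120,      # X1, X2
    ("Male", "East"): 180, ("Male", "West"): 240,       # X3, X4
    ("Female", "North"): 40, ("Female", "South"): 80,   # X5, X6
    ("Female", "East"): 120, ("Female", "West"): 160,   # X7, X8
}
by_gender, by_region = margins(solution)
print(by_gender)
print(by_region)
```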

Now we can get the result by a simple union of 8 selects:

```
(select * from user where gender='m' and region="north" limit 60)
union distinct(select * from user where gender='m' and region='south' limit 120)
union distinct(select * from user where gender='m' and region='east' limit 180)
union distinct(select * from user where gender='m' and region='west' limit 240)
union distinct(select * from user where gender='f' and region='north' limit 40)
union distinct(select * from user where gender='f' and region='south' limit 80)
union distinct(select * from user where gender='f' and region='east' limit 120)
union distinct(select * from user where gender='f' and region='west' limit 160);
```

Notice that if there are not 60 rows in the database that satisfy the first select above, then the particular solution given will not work. So we have to add other constraints, LT:

```
0 <= X1 <= (select count(*) from user where gender='m' and region='north')
0 <= X2 <= (select count(*) from user where gender='m' and region='south')
0 <= X3 <= (select count(*) from user where gender='m' and region='east')
0 <= X4 <= (select count(*) from user where gender='m' and region='west')
0 <= X5 <= (select count(*) from user where gender='f' and region='north')
0 <= X6 <= (select count(*) from user where gender='f' and region='south')
0 <= X7 <= (select count(*) from user where gender='f' and region='east')
0 <= X8 <= (select count(*) from user where gender='f' and region='west')
```

Now let’s generalize for this case allowing any splits. The equations are E:

```
X1+X2+X3+X4=n1
X5+X6+X7+X8=n2
X1+X5=m1
X2+X6=m2
X3+X7=m3
X4+X8=m4
```

The numbers n1, n2, m1, m2, m3, m4 are given and satisfy n1+n2=(m1+m2+m3+m4). So we have reduced the problem to solving the equations LT and E above. This is just a linear programming problem and can be solved using the simplex method or other methods. Another possibility is to view this as a system of linear Diophantine equations and use methods for that to find solutions. In any case, I have reduced the problem to finding the solution to the equations above. (Given that the equations are of a special form, there may be a faster way than using the simplex method or solving a system of linear Diophantine equations.) Once we solve for Xi, the final solution is:

```
(select * from user where gender='m' and region="north" limit :X1)
union distinct(select * from user where gender='m' and region='south' limit :X2)
union distinct(select * from user where gender='m' and region='east' limit :X3)
union distinct(select * from user where gender='m' and region='west' limit :X4)
union distinct(select * from user where gender='f' and region='north' limit :X5)
union distinct(select * from user where gender='f' and region='south' limit :X6)
union distinct(select * from user where gender='f' and region='east' limit :X7)
union distinct(select * from user where gender='f' and region='west' limit :X8);
```
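For exactly two dimensions, the equations E together with the bounds LT have the shape of a flow problem: source to gender (capacity n), gender to region (capacity = available rows), region to sink (capacity m). With integer capacities, a maximum flow is integral, so a flow equal to the requested total gives valid integer Xi. The Python sketch below uses a plain breadth-first augmenting-path search; `solve_cells` and the sample counts are hypothetical, and the sketch does not extend to three or more dimensions.

```python
from collections import defaultdict, deque


def solve_cells(gender_totals, region_totals, available):
    """Find integers X[(gender, region)] with the requested totals and
    0 <= X <= available, or return None if that is impossible."""
    wanted = sum(gender_totals.values())
    if wanted != sum(region_totals.values()):
        return None

    cap = defaultdict(lambda: defaultdict(int))  # residual capacities
    for g, n in gender_totals.items():
        cap["S"][("g", g)] = n
    for r, m in region_totals.items():
        cap[("r", r)]["T"] = m
    for (g, r), rows in available.items():
        cap[("g", g)][("r", r)] = rows

    def find_path():
        """Breadth-first search for a path S -> T with spare capacity."""
        parent = {"S": None}
        queue = deque(["S"])
        while queue:
            u = queue.popleft()
            for v, c in list(cap[u].items()):
                if c > 0 and v not in parent:
                    parent[v] = u
                    if v == "T":
                        return parent
                    queue.append(v)
        return None

    flow = 0
    parent = find_path()
    while parent is not None:
        path, v = [], "T"
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(cap[u][w] for u, w in path)
        for u, w in path:
            cap[u][w] -= push
            cap[w][u] += push
        flow += push
        parent = find_path()

    if flow != wanted:
        return None
    # Flow on a gender->region edge equals the residual on its reverse edge.
    return {(g, r): cap[("r", r)][("g", g)] for (g, r) in available}


gender_totals = {"Male": 600, "Female": 400}
region_totals = {"North": 100, "South": 200, "East": 300, "West": 400}
available = {(g, r): 1000 for g in gender_totals for r in region_totals}
available[("Male", "North")] = 30  # assumed shortage in one cell

cells = solve_cells(gender_totals, region_totals, available)
print(cells)
```

The returned counts can be used directly as the `:X1` … `:X8` limits. Which of the many valid solutions comes back depends on the search order, so it is not necessarily the proportional one.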

Let’s denote a dimension D with n possible values as D:n. Suppose you have the dimensions D1:n1, D2:n2, … DM:nM. These generate n1*n2*…nM variables, and the number of equations generated is n1+n2+…nM. Rather than define the general method, let’s take a case of three dimensions with 3, 4 and 2 possible values. Call the possible values of D1 d11, d12, d13; of D2 d21, d22, d23, d24; and of D3 d31, d32. We will have 24 variables, and the equations are:

```
X1+X2+...+X8=n11
X9+X10+...+X16=n12
X17+X18+...+X24=n13
X1+X2+X9+X10+X17+X18=n21
X3+X4+X11+X12+X19+X20=n22
X5+X6+X13+X14+X21+X22=n23
X7+X8+X15+X16+X23+X24=n24
X1+X3+X5+...+X23=n31
X2+X4+...+X24=n32

Where

```
X1 equals number with D1=d11 and D2=d21 and D3=d31
X2 equals number with D1=d11 and D2=d21 and D3=d32
....
X24 equals number with D1=d13 and D2=d24 and D3=d32.
```
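The variable numbering and the left-hand sides of these equations can be generated mechanically for any number of dimensions. A Python sketch, assuming the numbering convention used above (the last dimension varies fastest); `equation_indices` is a made-up helper name:

```python
from itertools import product


def equation_indices(sizes):
    """For dimension sizes such as (3, 4, 2), return
    {(dimension, value): [1-based variable numbers in that equation]}.

    Dimensions and values are 0-based, so (1, 0) stands for D2 = d21.
    """
    equations = {}
    combos = product(*(range(n) for n in sizes))
    for number, combo in enumerate(combos, start=1):
        # Variable X<number> counts rows with this combination of values,
        # so it appears once in the equation of each of its values.
        for dim, value in enumerate(combo):
            equations.setdefault((dim, value), []).append(number)
    return equations


eqs = equation_indices((3, 4, 2))
print(eqs[(1, 0)])  # variables in the n21 equation
```

Each list is the set of Xi that must add up to the requested count for that dimension value.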

Add the less-than constraints. Then solve for X1, X2, … X24. Create the 24 select statements and take the disjoint union.

We can solve similarly for any dimensions.

So in summary: given dimensions D1:n1, D2:n2, …DM:nM, we can solve the corresponding linear programming problem as described above for n1*n2*…nM variables and then generate a solution by taking the disjoint union over n1*n2*…nM select statements. So yes, we can generate a solution by select statements, but first we have to solve the equations and determine limits by getting counts for each of the n1*n2*…nM variables.

Even though the bounty is over, I am going to add a bit more for those of you who are interested. I claim here that I have completely shown how to solve this if there is a solution.

To clarify my approach: in the case of 3 dimensions, let’s say we split age into one of 3 possibilities. Then we’ll use gender and region as in the question. There are 24 different possibilities for each user, corresponding to where they fall in those categories. Let Xi be the number of each of those possibilities in the final result. Let me write a matrix where each row represents one possibility. Each user will contribute at most 1 to m or f, 1 to north, south, east or west, and 1 to the age category. And there are only 24 possibilities for the user. Let’s show the matrix, where (abc) are the 3 age groups, (nsew) the regions and (mf) male or female: a is age less than or equal to 10, b is age between 11 and 30 and c is age between 31 and 50.

```
abc nsew mf
X1 100 1000 10
X2 100 1000 01
X3 100 0100 10
X4 100 0100 01
X5 100 0010 10
X6 100 0010 01
X7 100 0001 10
X8 100 0001 01
X9 010 1000 10
X10 010 1000 01
X11 010 0100 10
X12 010 0100 01
X13 010 0010 10
X14 010 0010 01
X15 010 0001 10
X16 010 0001 01
X17 001 1000 10
X18 001 1000 01
X19 001 0100 10
X20 001 0100 01
X21 001 0010 10
X22 001 0010 01
X23 001 0001 10
X24 001 0001 01
```

Each row represents a user, with a 1 in a column if the user contributes to that category. For example, the first row shows 1 for a, 1 for n, and 1 for m, which means the user’s age is less than or equal to 10, and the user is from the north and is male.

Xi represents how many rows of that kind are in the final result. So if X1 is 10, we are saying the final result has 10 rows that are all from the north, male, and aged 10 or under. Now we just have to add things up. Notice that the first 8 variables, `X1+X2+X3+X4+X5+X6+X7+X8`, are all the rows whose age is less than or equal to 10. They must add up to whatever we chose for that category. Similarly for the next 2 sets of 8.

So far we get these equations (na is the number with age less than or equal to 10, nb the number with age between 11 and 30, and nc the number with age between 31 and 50):

```
X1+X2+X3+X4+X5+X6+X7+X8 = na
X9+X10+X11 + .... X16 = nb
X17+X18+X19+... X24=nc
```

Those are the age splits. Now lets look at the region splits. Just add up the variables in the “n” column,

```
X1+X2+X9+X10+X17+X18 = nn
X3+X4+X11+X12+X19+X20 = ns
...
```

etc.

Do you see how I am getting those equations by just looking down the columns?

Continue for ew and mf. giving 3+4+2 equations in total. So what I did here is quite simple. I have reasoned that any row you pick contributes one to each of the 3 dimensions and there are only 24 possibilities. Then let Xi be the number for each possibility and you get the equations that needs to be solved. It seems to me that whatever method you come up with must be a solution to those equations. In other words I simply reformulated the problem in terms of solving those equations.

Notice these are all linear equations, but we need an integer solution, since we cannot have a fractional row. Here is a link to a paper that describes how to solve these: https://www.math.uwaterloo.ca/~wgilbert/Research/GilbertPathria.pdf

Forming the business logic in SQL is never a good idea as it’ll hamper ability to absorb even minor changes.

My suggestion would be to do this in an ORM and keep the business logic abstracted from SQL.

For example, if you were using **Django**:

Your model would look like:

```
from django.db import models

class User(models.Model):
    GENDER_CHOICES = (
        ('M', 'Male'),
        ('F', 'Female'),
    )
    gender = models.CharField(max_length=1, choices=GENDER_CHOICES)
    REGION_CHOICES = (
        ('E', 'East'),
        ('W', 'West'),
        ('N', 'North'),
        ('S', 'South'),
    )
    region = models.CharField(max_length=1, choices=REGION_CHOICES)
    age = models.IntegerField()
    ETHNICITY_CHOICES = (
        .......
    )
    ethnicity = models.CharField(max_length=1, choices=ETHNICITY_CHOICES)
    income = models.FloatField()

And your query function could be something like this:

```
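# gender_limits is a dict like {'M':400, 'F':600}
# region_limits is a dict like {'N':100, 'E':200, 'W':300, 'S':400}
def get_users_by_gender_and_region(gender_limits, region_limits):
    # Django refuses to combine sliced querysets with | or &,
    # so collect the primary keys of each limited sample instead.
    gender_ids = set()
    for gender, limit in gender_limits.items():
        gender_ids.update(
            User.objects.filter(gender=gender).values_list('pk', flat=True)[:limit]
        )
    region_ids = set()
    for region, limit in region_limits.items():
        region_ids.update(
            User.objects.filter(region=region).values_list('pk', flat=True)[:limit]
        )
    # Intersect the two samples, as in the question's pseudo-query.
    return User.objects.filter(pk__in=gender_ids & region_ids)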

The query function can be abstracted further with the knowledge of all queries you plan to support, but this should serve as an example.

If you are using a different ORM, the same idea can be translated to that too as any good ORM would have the union and intersection abstraction.

I would use a programming language to generate the SQL statements, but below is a solution in pure MySQL. One assumption is made: there are always enough males/females in each region to fit the numbers (e.g. what if there are no females living in the North?).

The routine pre-calculates the needed row quantities. LIMIT cannot be specified using a variable. I am more of an Oracle guy, where we have analytic functions; MySQL provides something similar to some extent by allowing variables. So I set the target regions and genders and calculate the breakdown. Then I limit my output using the calculations.

This query shows the counts to prove the concept.

```
set @male=600;
set @female=400;
set @north=100;
set @south=200;
set @east=300;
set @west=400;
set @north_male=@north*(@male/(@male+@female));
set @south_male=@south*(@male/(@male+@female));
set @east_male =@east *(@male/(@male+@female));
set @west_male =@west *(@male/(@male+@female));
set @north_female=@north*(@female/(@male+@female));
set @south_female=@south*(@female/(@male+@female));
set @east_female =@east *(@female/(@male+@female));
set @west_female =@west *(@female/(@male+@female));
select gender, region, count(*)
from (
select * from (select @north_male :=@north_male-1 as row, userid, gender, region from users where gender = 'Male' and region = 'North' ) mn where row>=0
union all select * from (select @south_male :=@south_male-1 as row, userid, gender, region from users where gender = 'Male' and region = 'South' ) ms where row>=0
union all select * from (select @east_male :=@east_male-1 as row, userid, gender, region from users where gender = 'Male' and region = 'East' ) me where row>=0
union all select * from (select @west_male :=@west_male-1 as row, userid, gender, region from users where gender = 'Male' and region = 'West' ) mw where row>=0
union all select * from (select @north_female:=@north_female-1 as row, userid, gender, region from users where gender = 'Female' and region = 'North' ) fn where row>=0
union all select * from (select @south_female:=@south_female-1 as row, userid, gender, region from users where gender = 'Female' and region = 'South' ) fs where row>=0
union all select * from (select @east_female :=@east_female-1 as row, userid, gender, region from users where gender = 'Female' and region = 'East' ) fe where row>=0
union all select * from (select @west_female :=@west_female-1 as row, userid, gender, region from users where gender = 'Female' and region = 'West' ) fw where row>=0
) a
group by gender, region
order by gender, region;
```

Output:

```
Female East 120
Female North 40
Female South 80
Female West 160
Male East 180
Male North 60
Male South 120
Male West 240
```

Remove the outer part to get the real records:

```
set @male=600;
set @female=400;
set @north=100;
set @south=200;
set @east=300;
set @west=400;
set @north_male=@north*(@male/(@male+@female));
set @south_male=@south*(@male/(@male+@female));
set @east_male =@east *(@male/(@male+@female));
set @west_male =@west *(@male/(@male+@female));
set @north_female=@north*(@female/(@male+@female));
set @south_female=@south*(@female/(@male+@female));
set @east_female =@east *(@female/(@male+@female));
set @west_female =@west *(@female/(@male+@female));
select * from (select @north_male :=@north_male-1 as row, userid, gender, region from users where gender = 'Male' and region = 'North' ) mn where row>=0
union all select * from (select @south_male :=@south_male-1 as row, userid, gender, region from users where gender = 'Male' and region = 'South' ) ms where row>=0
union all select * from (select @east_male :=@east_male-1 as row, userid, gender, region from users where gender = 'Male' and region = 'East' ) me where row>=0
union all select * from (select @west_male :=@west_male-1 as row, userid, gender, region from users where gender = 'Male' and region = 'West' ) mw where row>=0
union all select * from (select @north_female:=@north_female-1 as row, userid, gender, region from users where gender = 'Female' and region = 'North' ) fn where row>=0
union all select * from (select @south_female:=@south_female-1 as row, userid, gender, region from users where gender = 'Female' and region = 'South' ) fs where row>=0
union all select * from (select @east_female :=@east_female-1 as row, userid, gender, region from users where gender = 'Female' and region = 'East' ) fe where row>=0
union all select * from (select @west_female :=@west_female-1 as row, userid, gender, region from users where gender = 'Female' and region = 'West' ) fw where row>=0
;
```

For testing I have written a procedure which does create 10000 sample records fully random:

```
use test;
drop table if exists users;
create table users (userid int not null auto_increment, gender VARCHAR (20), region varchar(20), primary key (userid) );
drop procedure if exists load_users_table;
delimiter #
create procedure load_users_table()
begin
declare l_max int unsigned default 10000;
declare l_cnt int unsigned default 0;
declare l_gender varchar(20);
declare l_region varchar(20);
declare l_rnd smallint;
truncate table users;
start transaction;
WHILE l_cnt < l_max DO
set l_rnd = floor( 0 + (rand()*2) );
if l_rnd = 0 then
set l_gender = 'Male';
else
set l_gender = 'Female';
end if;
set l_rnd=floor(0+(rand()*4));
if l_rnd = 0 then
set l_region = 'North';
elseif l_rnd=1 then
set l_region = 'South';
elseif l_rnd=2 then
set l_region = 'East';
elseif l_rnd=3 then
set l_region = 'West';
end if;
insert into users (gender, region) values (l_gender, l_region);
set l_cnt=l_cnt+1;
end while;
commit;
end #
delimiter ;
call load_users_table();
select gender, region, count(*)
from users
group by gender, region
order by gender, region;
```

Hope this all helps you. The bottom line is: use a `UNION ALL` and restrict with pre-calculated variables, not `LIMIT`.

Well, I think the question is about randomly getting the records and not in the proportion of 60/40 for all regions. I have done for Region and Gender. It can be generalized to other fields like age, income and ethnicity in the same way.

```
Declare @Mlimit bigint
Declare @Flimit bigint
Declare @Northlimit bigint
Declare @Southlimit bigint
Declare @Eastlimit bigint
Declare @Westlimit bigint
Set @Mlimit= 600
Set @Flimit=400
Set @Northlimit= 100
Set @Southlimit=200
Set @Eastlimit=300
Set @Westlimit=400
CREATE TABLE #Users(
[UserId] [int] NOT NULL,
[gender] [varchar](10) NULL,
[region] [varchar](10) NULL,
[age] [int] NULL,
[ethnicity] [varchar](50) NULL,
[income] [bigint] NULL
)
Declare @MnorthCnt bigint
Declare @MsouthCnt bigint
Declare @MeastCnt bigint
Declare @MwestCnt bigint
Declare @FnorthCnt bigint
Declare @FsouthCnt bigint
Declare @FeastCnt bigint
Declare @FwestCnt bigint
Select @MnorthCnt=COUNT(*) from users where gender='male' and region='north'
Select @FnorthCnt=COUNT(*) from users where gender='female' and region='north'
Select @MsouthCnt=COUNT(*) from users where gender='male' and region='south'
Select @FsouthCnt=COUNT(*) from users where gender='female' and region='south'
Select @MeastCnt=COUNT(*) from users where gender='male' and region='east'
Select @FeastCnt=COUNT(*) from users where gender='female' and region='east'
Select @MwestCnt=COUNT(*) from users where gender='male' and region='west'
Select @FwestCnt=COUNT(*) from users where gender='female' and region='west'
If (@MnorthCnt+@FnorthCnt=@Northlimit)
begin
Insert into #Users select * from Users where region='north'
set @Northlimit=0
set @Mlimit-=@MnorthCnt
set @Flimit-=@FnorthCnt
set @MnorthCnt=0
set @FnorthCnt=0
end
If (@MsouthCnt+@FsouthCnt=@Southlimit)
begin
Insert into #Users select * from Users where region='South'
set @Southlimit=0
set @Mlimit-=@MsouthCnt
set @Flimit-=@FsouthCnt
set @MsouthCnt=0
set @FsouthCnt=0
end
If (@MeastCnt+@FeastCnt=@Eastlimit)
begin
Insert into #Users select * from Users where region='East'
set @Eastlimit=0
set @Mlimit-=@MeastCnt
set @Flimit-=@FeastCnt
set @MeastCnt=0
set @FeastCnt=0
end
If (@MwestCnt+@FwestCnt=@Westlimit)
begin
Insert into #Users select * from Users where region='West'
set @Westlimit=0
set @Mlimit-=@MwestCnt
set @Flimit-=@FwestCnt
set @MwestCnt=0
set @FwestCnt=0
end
If @MnorthCnt<@Northlimit
Begin
insert into #Users select top (@Northlimit-@MnorthCnt) * from Users where gender='female' and region='north'
and userid not in (select userid from #users)
set @Flimit-=(@Northlimit-@MnorthCnt)
set @FNorthCnt-=(@Northlimit-@MnorthCnt)
set @Northlimit-=(@Northlimit-@MnorthCnt)
End
If @FnorthCnt<@Northlimit
Begin
insert into #Users select top (@Northlimit-@FnorthCnt) * from Users where gender='male' and region='north'
and userid not in (select userid from #users)
set @Mlimit-=(@Northlimit-@FnorthCnt)
set @MNorthCnt-=(@Northlimit-@FnorthCnt)
set @Northlimit-=(@Northlimit-@FnorthCnt)
End
if @MsouthCnt<@southlimit
Begin
insert into #Users select top (@southlimit-@MsouthCnt) * from Users where gender='female' and region='south'
and userid not in (select userid from #users)
set @Flimit-=(@southlimit-@MsouthCnt)
set @FSouthCnt-=(@southlimit-@MsouthCnt)
set @southlimit-=(@southlimit-@MsouthCnt)
End
if @FsouthCnt<@southlimit
Begin
insert into #Users select top (@southlimit-@FsouthCnt) * from Users where gender='male' and region='south'
and userid not in (select userid from #users)
set @Mlimit-=(@southlimit-@FsouthCnt)
set @MSouthCnt-=(@southlimit-@FsouthCnt)
set @southlimit-=(@southlimit-@FsouthCnt)
End
if @MeastCnt<@eastlimit
Begin
insert into #Users select top (@eastlimit-@MeastCnt) * from Users where gender='female' and region='east'
and userid not in (select userid from #users)
set @Flimit-=(@eastlimit-@MeastCnt)
set @FEastCnt-=(@eastlimit-@MeastCnt)
set @eastlimit-=(@eastlimit-@MeastCnt)
End
if @FeastCnt<@eastlimit
Begin
insert into #Users select top (@eastlimit-@FeastCnt) * from Users where gender='male' and region='east'
and userid not in (select userid from #users)
set @Mlimit-=(@eastlimit-@FeastCnt)
set @MEastCnt-=(@eastlimit-@FeastCnt)
set @eastlimit-=(@eastlimit-@FeastCnt)
End
if @MwestCnt<@westlimit
Begin
insert into #Users select top (@westlimit-@MwestCnt) * from Users where gender='female' and region='west'
and userid not in (select userid from #users)
set @Flimit-=(@westlimit-@MwestCnt)
set @FWestCnt-=(@westlimit-@MwestCnt)
set @westlimit-=(@westlimit-@MwestCnt)
End
if @FwestCnt<@westlimit
Begin
insert into #Users select top (@westlimit-@FwestCnt) * from Users where gender='male' and region='west'
and userid not in (select userid from #users)
set @Mlimit-=(@westlimit-@FwestCnt)
set @MWestCnt-=(@westlimit-@FwestCnt)
set @westlimit-=(@westlimit-@FwestCnt)
End
IF (@MnorthCnt>=@Northlimit and @FnorthCnt>=@Northlimit and @MsouthCnt>=@southlimit and @FsouthCnt>=@southlimit and @MeastCnt>=@eastlimit and @FeastCnt>=@eastlimit and @MwestCnt>=@westlimit and @FwestCnt>=@westlimit and not(@Mlimit=0 and @Flimit=0))
Begin
---Create Cursor
DECLARE UC CURSOR FAST_forward
FOR
SELECT *
FROM Users
where userid not in (select userid from #users)
Declare @UserId [int] ,
@gender [varchar](10) ,
@region [varchar](10) ,
@age [int] ,
@ethnicity [varchar](50) ,
@income [bigint]
OPEN UC
FETCH NEXT FROM UC
INTO @UserId ,@gender, @region, @age, @ethnicity, @income
WHILE @@FETCH_STATUS = 0 and not (@Mlimit=0 and @Flimit=0)
BEGIN
If @gender='male' and @region='north' and @Northlimit>0 AND @Mlimit>0
begin
insert into #Users values (@UserId ,@gender, @region, @age, @ethnicity, @income)
set @Mlimit-=1
set @MNorthCnt-=1
set @Northlimit-=1
end
If @gender='male' and @region='south' and @southlimit>0 AND @Mlimit>0
begin
insert into #Users values (@UserId ,@gender, @region, @age, @ethnicity, @income)
set @Mlimit-=1
set @MsouthCnt-=1
set @Southlimit-=1
end
If @gender='male' and @region='east' and @eastlimit>0 AND @Mlimit>0
begin
insert into #Users values (@UserId ,@gender, @region, @age, @ethnicity, @income)
set @Mlimit-=1
set @MeastCnt-=1
set @eastlimit-=1
end
If @gender='male' and @region='west' and @westlimit>0 AND @Mlimit>0
begin
insert into #Users values (@UserId ,@gender, @region, @age, @ethnicity, @income)
set @Mlimit-=1
set @MwestCnt-=1
set @westlimit-=1
end
If @gender='female' and @region='north' and @Northlimit>0 AND @flimit>0
begin
insert into #Users values (@UserId ,@gender, @region, @age, @ethnicity, @income)
set @Flimit-=1
set @FNorthCnt-=1
set @Northlimit-=1
end
If @gender='female' and @region='south' and @southlimit>0 AND @flimit>0
begin
insert into #Users values (@UserId ,@gender, @region, @age, @ethnicity, @income)
set @Flimit-=1
set @FsouthCnt-=1
set @Southlimit-=1
end
If @gender='female' and @region='east' and @eastlimit>0 AND @flimit>0
begin
insert into #Users values (@UserId ,@gender, @region, @age, @ethnicity, @income)
set @flimit-=1
set @feastCnt-=1
set @eastlimit-=1
end
If @gender='female' and @region='west' and @westlimit>0 AND @flimit>0
begin
insert into #Users values (@UserId ,@gender, @region, @age, @ethnicity, @income)
set @flimit-=1
set @fwestCnt-=1
set @westlimit-=1
end
FETCH NEXT FROM UC
INTO @UserId ,@gender, @region, @age, @ethnicity, @income
END
CLOSE UC
DEALLOCATE UC
end
Select * from #Users
SELECT GENDER, REGION, COUNT(*) AS COUNT FROM #USERS
GROUP BY GENDER, REGION
DROP TABLE #Users
```

I expect you’d want to generate a bunch of queries based on the required filters.

I’ll explain a possible approach, with a full code sample – but note the caveats later on.

I’ll also address the case where you can’t fulfil the requested sample from a proportional distribution but can from an adjusted distribution, and explain how to do that adjustment.

The basic algorithm goes like this:

Start with a set of filters `{F1, F2, ... Fn}`, each of which has a group of values and the percentages that should be distributed amongst those values. For example, F1 might be gender, with 2 values (F1V1 = Male: 60%, F1V2 = Female: 40%). You’ll also want the total sample size required (call this `X`). From this starting point you can combine the filter items from every filter into a single set of combined filter items, with the quantity required for each.

The code should be able to handle any number of filters, with any number of values (either exact values or ranges).

For example, suppose there are 2 filters, F1: gender, {F1V1 = Male: 60%, F1V2 = Female: 40%} and F2: region, {F2V1 = North: 50%, F2V2 = South: 50%}, and a total required sample of X = 10 people.

In this sample we’d like 6 of them to be male, 4 of them to be female, 5 to be from the north, and 5 to be from the south.

Then we do the following:

1. Create an SQL stub for each value in F1, with the associated fraction of the initial percentage (i.e. `WHERE gender = 'Male'`: 0.6, `WHERE gender = 'Female'`: 0.4).
2. For each item in F2, create a new SQL stub from every item from the step above, with the filter now being both the F1 value and the F2 value, and the associated fraction being the product of the 2 fractions. So we now have 2 x 2 = 4 items:
   - `WHERE gender = 'Male' AND region = 'North'`: 0.6 * 0.5 = 0.3
   - `WHERE gender = 'Female' AND region = 'North'`: 0.4 * 0.5 = 0.2
   - `WHERE gender = 'Male' AND region = 'South'`: 0.6 * 0.5 = 0.3
   - `WHERE gender = 'Female' AND region = 'South'`: 0.4 * 0.5 = 0.2
3. Repeat step 2 for every additional filter F3 to Fn (in our example there were only 2 filters, so we are already done).
4. Calculate the limit for each SQL stub as [fraction associated with stub] * X, where X is the total required sample size (so for our example that’s 0.3 * 10 = 3 for Male/North, 0.2 * 10 = 2 for Female/North, etc.).
5. Finally, turn every SQL stub into a complete SQL statement and add the limit.
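
As a quick check of the arithmetic in these steps, here is a minimal Python sketch (Python only for brevity; it is not the answer’s own code). The function name `combined_limits` and the dictionary layout are assumptions made for this illustration, and exact fractions are used so that 0.6 * 0.5 * 10 comes out as exactly 3.

```python
from fractions import Fraction
from itertools import product


def combined_limits(filters, total):
    """Return {((column, value), ...): row limit} for every combination of filter values."""
    columns = list(filters)
    limits = {}
    # product() picks one (value, share) pair from each filter per combination
    for combo in product(*(filters[col].items() for col in columns)):
        fraction = Fraction(1)
        for _, share in combo:
            fraction *= Fraction(share)  # the product of the individual shares
        key = tuple((col, value) for col, (value, _) in zip(columns, combo))
        limits[key] = fraction * total
    return limits


# Hypothetical input: shares are given as decimal strings and sum to 1 per filter
filters = {
    "gender": {"Male": "0.6", "Female": "0.4"},
    "region": {"North": "0.5", "South": "0.5"},
}

for key, limit in combined_limits(filters, 10).items():
    # Display only: real queries should not be built by pasting values like this
    where = " AND ".join(f"{col} = '{value}'" for col, value in key)
    print(f"WHERE {where}: {limit}")
```

For the example it reproduces the limits worked out by hand: 3 for each Male combination and 2 for each Female combination.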

**Code Sample**

I’ll provide C# code for this, but it should be easy enough to translate to other languages.

It would be pretty tricky to attempt this in pure dynamic SQL.

Note this is untested, and probably full of errors, but it gives an idea of the approach you could take.

I’ve defined a public method and a public class – which would be the entry point.

```
using System;
using System.Collections.Generic;

// This is an example of a public class you could use to hold one of your filters
// For example - if you wanted 60% male / 40% female, you could have an item with
// item1 = {Fraction: 0.6, ValueExact: 'Male', RangeStart: null, RangeEnd: null}
// & item2 = {Fraction: 0.4, ValueExact: 'Female', RangeStart: null, RangeEnd: null}
public class FilterItem
{
    public decimal Fraction { get; set; }
    public string ValueExact { get; set; }
    public int? RangeStart { get; set; }
    public int? RangeEnd { get; set; }
}

// Wrapper class so the methods below compile (the class name is arbitrary).
public class SampleQueryBuilder
{
    // This is an example of a public method you could call to build your SQL
    // - passing in a generic list of desired filters
    // for example the dictionary entry for the above filter would be
    // {Key: "gender", Value: new List<FilterItem>(){item1, item2}}
    public string BuildSQL(Dictionary<string, List<FilterItem>> filters, int TotalItems)
    {
        // we want to build up a list of SQL stubs that can be unioned together.
        var sqlStubItems = new List<SqlItem>();
        foreach (var entry in filters)
        {
            AddFilter(entry.Key, entry.Value, sqlStubItems);
        }
        // ok - now just combine all of the sql stubs into one big union.
        var result = ""; // I'd use a StringBuilder for this normally,
                         // but this is probably more cross-language readable.
        int limitSum = 0;
        for (int i = 0; i < sqlStubItems.Count; i++) // string.Join() would be more succinct!
        {
            var item = sqlStubItems[i];
            if (i > 0)
            {
                result += " UNION ";
            }
            int limit = (int)Math.Round(TotalItems * item.Fraction, 0);
            limitSum += limit;
            if (i == sqlStubItems.Count - 1 && limitSum != TotalItems)
            {
                // may need to adjust one of the rounded items to account
                // for rounding errors making a total that is not the
                // originally required total limit.
                limit += (TotalItems - limitSum);
            }
            // MySQL needs each SELECT in parentheses for LIMIT to apply
            // to that branch of the union rather than to the whole result.
            result += "(" + item.Sql + " LIMIT " + Convert.ToString(limit) + ")";
        }
        return result;
    }

    // This method expands the number of SQL stubs for every filter that has been added.
    // each existing filter is split by the number of items in the newly added filter.
    // filterType is used as a column name, so it must come from a fixed list, not from user input.
    private void AddFilter(string filterType,
                           List<FilterItem> filterValues,
                           List<SqlItem> SqlItems)
    {
        var newItems = new List<SqlItem>();
        foreach (var filterItem in filterValues)
        {
            string filterAddon;
            if (filterItem.RangeStart.HasValue && filterItem.RangeEnd.HasValue)
            {
                filterAddon = filterType + " >= " + filterItem.RangeStart.ToString()
                    + " AND " + filterType + " <= " + filterItem.RangeEnd.ToString();
            }
            else if (filterItem.RangeStart.HasValue)
            {
                // open-ended range, such as age >= 55
                filterAddon = filterType + " >= " + filterItem.RangeStart.ToString();
            }
            else
            {
                filterAddon = filterType + " = '"
                    + filterItem.ValueExact.Replace("'", "''") + "'";
                // beware of SQL injection. (hence the .Replace() above)
            }
            if (SqlItems.Count == 0)
            {
                newItems.Add(new SqlItem()
                {
                    Sql = "SELECT * FROM users WHERE " + filterAddon,
                    Fraction = filterItem.Fraction
                });
            }
            else
            {
                foreach (var existingItem in SqlItems)
                {
                    newItems.Add(new SqlItem()
                    {
                        // extend the existing stub's SQL text (not the object itself)
                        Sql = existingItem.Sql + " AND " + filterAddon,
                        Fraction = existingItem.Fraction * filterItem.Fraction
                    });
                }
            }
        }
        SqlItems.Clear();
        SqlItems.AddRange(newItems);
    }

    // this class is for part-built SQL strings, with the fraction
    private class SqlItem
    {
        public string Sql { get; set; }
        public decimal Fraction { get; set; }
    }
}
```

**Notes** (as per comment by Sign)

- Rounding errors may mean you don’t get exactly the 600 / 400 split you were aiming for when applying a large number of filters – but should be close.
- If your dataset is not very diverse then it may not be possible to always generate the required split. This method will require an even distribution amongst the filters (so if you were doing a total of 10 people, 6 male, 4 female , 5 from the north, 5 from the south it would require 3 males from the north, 3 males from the south, 2 females from the north and 2 females from the south.)
- The people are not going to be retrieved at random, just in whatever the default order is. You would need to add something like ORDER BY RAND() (but not exactly that, as it’s VERY inefficient) to get a random selection.
- Beware of SQL injection. Sanitise all user input, replacing single quote `'` chars with `''`.
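
On the rounding note: the C# sample pushes the whole rounding error onto the last stub. One alternative (a swapped-in technique, not what the code above does) is largest-remainder rounding, which rounds every limit down and then hands the leftover rows to the stubs with the largest fractional parts. The sketch below is in Python, the name `largest_remainder` is made up for this illustration, and it assumes the fractions sum to 1. It only guarantees the overall total; the per-filter totals can still drift by a row or two.

```python
from fractions import Fraction
from math import floor


def largest_remainder(fractions, total):
    """Round fraction * total to integers that sum exactly to total."""
    exact = [Fraction(f) * total for f in fractions]
    limits = [floor(x) for x in exact]
    shortfall = total - sum(limits)
    # Give the leftover rows to the stubs with the largest fractional parts;
    # ties keep their original order because sorted() is stable.
    by_remainder = sorted(
        range(len(exact)), key=lambda i: exact[i] - limits[i], reverse=True
    )
    for i in by_remainder[:shortfall]:
        limits[i] += 1
    return limits


# The four combined fractions from the example, for a sample of 7 people
print(largest_remainder(["0.3", "0.3", "0.2", "0.2"], 7))
```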

**Badly distributed sample problem**

How do you address the problem of there being insufficient items in one of our buckets to create our sample as per a representative split (that the above algorithm gives)? Or what if your numbers are not integers?

Well, I won’t go so far as to provide code, but I will describe a possible approach. You’d need to alter the code above quite a bit, because a flat list of SQL stubs isn’t going to cut it anymore. Instead you’d need to build an n-dimensional matrix of SQL stubs (adding a dimension for every filter F1 to Fn). After step 4 above has been completed (where we have our desired, but not necessarily possible, numbers for each SQL stub item), what I’d expect to do is:

- Generate SQL to select counts for all the combined SQL WHERE stubs.
- Then iterate the collection, and if you hit an item where the requested limit is higher than the count (or not an integer):
  - Adjust the requested limit down to the count (or nearest integer).
  - Then pick another item on each of the axes that is at least the above adjustment lower than its max count, and adjust it up by the same amount. If it’s not possible to find qualifying items then your requested split is not possible.
  - Then adjust all the intersecting items for the upward-adjusted items down again.
  - Repeat the step above for intersects between the intersecting points for every additional dimension to n (but toggle the adjustment between negative and positive each time).

So suppose continuing our previous example – our representative split is:

Male/North = 3, Female/North = 2, Male/South = 3, Female/South = 2, but there are only 2 males in the north (though there are loads of people in the other groups we could pick).

- We adjust Male/North down to 2 (-1)
- We adjust Female/North to 3 (+1) and Male/South to 4 (+1)
- We adjust the Intersecting Female/South to 1 (-1). Voila! (there are no additional dimensions as we only had 2 criteria/dimensions)
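
For the two-filter case, the adjustment can be sketched as follows. This is a Python illustration under stated assumptions, not a tested implementation: the function name `rebalance` and the dictionary layout are invented here, it handles exactly 2 dimensions, and it is greedy, so it may report failure in some cases where a cleverer search would succeed.

```python
def rebalance(limits, counts):
    """Move quota out of cells that ask for more rows than exist,
    keeping every row total and column total unchanged (2 filters only)."""
    limits = dict(limits)  # work on a copy
    rows = sorted({r for r, _ in limits})
    cols = sorted({c for _, c in limits})
    for r, c in list(limits):
        excess = limits[(r, c)] - counts[(r, c)]
        if excess <= 0:
            continue
        others = [(r2, c2) for r2 in rows if r2 != r for c2 in cols if c2 != c]
        for r2, c2 in others:
            room = min(
                counts[(r, c2)] - limits[(r, c2)],  # same row, other column: goes up
                counts[(r2, c)] - limits[(r2, c)],  # other row, same column: goes up
                limits[(r2, c2)],                   # opposite corner: goes down
            )
            move = min(excess, room)
            if move > 0:
                limits[(r, c)] -= move
                limits[(r, c2)] += move
                limits[(r2, c)] += move
                limits[(r2, c2)] -= move
                excess -= move
            if excess == 0:
                break
        if excess > 0:
            raise ValueError(f"requested split is not possible for {(r, c)}")
    return limits


# The example above: only 2 males in the north, plenty of everyone else
wanted = {("Male", "North"): 3, ("Female", "North"): 2,
          ("Male", "South"): 3, ("Female", "South"): 2}
available = {("Male", "North"): 2, ("Female", "North"): 50,
             ("Male", "South"): 50, ("Female", "South"): 50}
print(rebalance(wanted, available))
```

With the example numbers it arrives at the same result as the manual steps: Male/North 2, Female/North 3, Male/South 4, Female/South 1.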

This illustration may be helpful when adjusting intersecting items in higher dimensions (it only shows up to 4 dimensions, but should help to picture what needs to be done). Each point represents one of our SQL stub items in the n-dimensional matrix (and has an associated limit number). A line represents a common criterion value (such as gender = male). The objective is that the total along any line should remain the same after adjustments have finished. We start with the red point, and continue for each additional dimension. In the example above we would only be looking at 2 dimensions: a square formed from the red point, the 2 orange points above and to the right of it, and the 1 green point to the NE to complete the square.

I’d go with `GROUP BY`:

`SELECT gender,region,count(*) FROM users GROUP BY gender,region`

```
+--------+--------+----------+
| gender | region | count(*) |
+--------+--------+----------+
| f      | E      |      129 |
| f      | N      |       43 |
| f      | S      |       84 |
| f      | W      |      144 |
| m      | E      |      171 |
| m      | N      |       57 |
| m      | S      |      116 |
| m      | W      |      256 |
+--------+--------+----------+
```

You can verify you have 600 males, 400 females, 100 North, 200 South, 300 East and 400 West.
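
That check can also be done mechanically by summing the grouped counts per gender and per region. A small Python sketch, with the rows copied from the table above (the variable names are just for illustration):

```python
from collections import Counter

# (gender, region, count) rows as returned by the GROUP BY query above
grouped = [
    ("f", "E", 129), ("f", "N", 43), ("f", "S", 84), ("f", "W", 144),
    ("m", "E", 171), ("m", "N", 57), ("m", "S", 116), ("m", "W", 256),
]

by_gender, by_region = Counter(), Counter()
for gender, region, count in grouped:
    by_gender[gender] += count
    by_region[region] += count

print(dict(by_gender))  # {'f': 400, 'm': 600}
print(dict(by_region))  # {'E': 300, 'N': 100, 'S': 200, 'W': 400}
```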

You can include other fields as well.

For range fields, like age and income, you can follow this example:

```
SELECT
    gender,
    region,
    CASE WHEN age < 30 THEN 'Young'
         WHEN age BETWEEN 30 AND 59 THEN 'Middle aged'
         ELSE 'Old' END AS age_range,
    COUNT(*)
FROM users
GROUP BY gender, region, age_range
```

So, the results would be like:

```
+--------+--------+-------------+----------+
| gender | region | age_range   | count(*) |
+--------+--------+-------------+----------+
| f      | E      | Middle aged |       56 |
| f      | E      | Old         |       31 |
| f      | E      | Young       |       42 |
| f      | N      | Middle aged |       14 |
| f      | N      | Old         |       11 |
| f      | N      | Young       |       18 |
| f      | S      | Middle aged |       40 |
| f      | S      | Old         |       23 |
| f      | S      | Young       |       21 |
| f      | W      | Middle aged |       67 |
| f      | W      | Old         |       42 |
| f      | W      | Young       |       35 |
| m      | E      | Middle aged |       77 |
| m      | E      | Old         |       56 |
| m      | E      | Young       |       38 |
| m      | N      | Middle aged |       13 |
| m      | N      | Old         |       25 |
| m      | N      | Young       |       19 |
| m      | S      | Middle aged |       46 |
| m      | S      | Old         |       39 |
| m      | S      | Young       |       31 |
| m      | W      | Middle aged |      103 |
| m      | W      | Old         |       66 |
| m      | W      | Young       |       87 |
+--------+--------+-------------+----------+
```
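
Since the groups in the question change with user input, the `CASE` expression for a range field could be generated rather than hand-written. A hedged Python sketch: the helper name `case_expression` and the `(low, high, label)` tuple format are assumptions for this illustration, and the age ranges are the ones from the question. Rows below the lowest range get a NULL label.

```python
def case_expression(column, buckets, alias):
    """Build a CASE expression that labels each (low, high, label) range.

    high=None means the range is open-ended (e.g. age >= 55).
    """
    # Column and alias end up in the SQL text, so accept plain identifiers only
    if not column.isidentifier() or not alias.isidentifier():
        raise ValueError("column and alias must be plain identifiers")
    parts = []
    for low, high, label in buckets:
        condition = f"{column} >= {int(low)}"
        if high is not None:
            condition += f" AND {column} <= {int(high)}"
        safe_label = label.replace("'", "''")  # escape quotes in the label
        parts.append(f"WHEN {condition} THEN '{safe_label}'")
    return "CASE " + " ".join(parts) + f" END AS {alias}"


age_buckets = [(18, 24, "18-24"), (25, 36, "25-36"), (37, 54, "37-54"), (55, None, "55+")]
print(case_expression("age", age_buckets, "age_range"))
```

The generated expression can be dropped into the `SELECT` list in place of the hand-written `CASE`, with the alias repeated in the `GROUP BY`.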