I need to split my dataset `df` randomly into two sets (proportion 70:30) using batches of 2. By "batch", I mean that each group of 2 (the batch size) sequential rows should always belong to the same set.

```
col1 col2 col3
1 0.5 10
1 0.3 11
5 1.4 1
3 1.5 2
1 0.9 10
3 0.4 7
1 1.2 9
3 0.1 11
```

Sample result (due to randomness, the outputs might be different, but this serves as an example):

```
set1
col1 col2 col3
1 0.5 10
1 0.3 11
1 0.9 10
3 0.4 7
1 1.2 9
3 0.1 11
set2
5 1.4 1
3 1.5 2
```

I know how to split data randomly using batches of 1:

```
import numpy as np
msk = np.random.rand(len(df)) < 0.7
set1 = df[msk]
set2 = df[~msk]
```

However, I'm not sure how to generalize this to a flexible batch size.

Thanks.

**Update:**

This is what I currently have, but the last line of code fails. `set1` and `set2` should be pandas DataFrames.

```
n = 3
df_batches = [df[i:i+n] for i in range(0, df.shape[0],n)]
set1_idx = np.random.randint(len(df_batches), size=int(0.7*len(df_batches)))
set2_idx = np.random.randint(len(df_batches), size=int(0.3*len(df_batches)))
set1, set2 = df_batches[set1_idx,:], df_batches[set2_idx,:]
```
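One way to repair the batch-index approach above (a sketch, assuming `df` has a default integer index; `batch_split` is a hypothetical helper name): sample a permutation of the batch indices instead of `np.random.randint` (which can pick the same batch twice and lets the sets overlap), then rebuild each set with `pd.concat`.

```python
import numpy as np
import pandas as pd

def batch_split(df, n=2, frac=0.7, seed=None):
    """Split df into two DataFrames of whole batches of n sequential rows,
    roughly in proportion frac : (1 - frac)."""
    rng = np.random.default_rng(seed)
    # cut df into consecutive batches of n rows (last batch may be shorter)
    batches = [df.iloc[i:i + n] for i in range(0, len(df), n)]
    # shuffle batch indices without replacement, so the sets cannot overlap
    order = rng.permutation(len(batches))
    cut = int(round(frac * len(batches)))  # number of batches going to set1
    set1 = pd.concat([batches[i] for i in order[:cut]])
    set2 = pd.concat([batches[i] for i in order[cut:]])
    return set1, set2
```

With 8 rows and `n=2` this yields 4 batches, so `frac=0.7` puts 3 batches (6 rows) in `set1` and 1 batch (2 rows) in `set2`, matching the sample result above.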

Here's a function that does what you want: it picks a random starting position and takes the next 30% of rows as one set, with the remaining rows forming the other.

```
def split_data(df, batchsize):
    # random starting row for the slice
    x = np.random.randint(0, len(df))
    # number of rows that make up the requested fraction
    idx = round(len(df) * batchsize)
    # clamp so the slice stays within the bounds of the index
    if x + idx > len(df):
        x = len(df) - idx
    batch1 = df.loc[np.arange(x, x + idx)]
    batch2 = df.loc[~df.index.isin(batch1.index)]
    return batch1, batch2

df1, df2 = split_data(df, 0.3)
```

```
print(df1, '\n')
print(df2)
col1 col2 col3
4 1 0.9 10
5 3 0.4 7
col1 col2 col3
0 1 0.5 10
1 1 0.3 11
2 5 1.4 1
3 3 1.5 2
6 1 1.2 9
7 3 0.1 11
```

### Answer:

For more randomness you can use the NumPy function `np.random.permutation`. Here is an example:

```
batchsizes = np.asarray([0.7])
permutations = np.random.permutation(len(df))
# turn the fractions into row counts at which to split
batchsizes *= len(permutations)
slices = np.split(permutations, batchsizes.round().astype(int))
# iloc: the permutation yields positions, so this works for any index labels
batches = [df.iloc[s] for s in slices]
```

This has better randomness because it **no longer depends on the initial order of your DataFrame**, and you can produce **more than two parts**. For example, with `batchsizes = np.asarray([0.3, 0.1, 0.3])` it will slice in proportions of 30:10:30:30.
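Note that permuting individual rows does not keep sequential pairs together. To honor the batch constraint from the question, the same permutation idea can be applied to batch indices instead of row indices (a sketch under that assumption; `permutation_batch_split` is a hypothetical name):

```python
import numpy as np
import pandas as pd

def permutation_batch_split(df, n=2, fracs=(0.7,), seed=None):
    """Permute batch indices and cut at the given cumulative fractions.

    Returns len(fracs) + 1 DataFrames whose sizes follow the proportions,
    with each run of n sequential rows kept in the same part.
    """
    rng = np.random.default_rng(seed)
    n_batches = int(np.ceil(len(df) / n))
    perm = rng.permutation(n_batches)  # shuffled batch indices
    # cumulative fractions -> batch counts at which to split
    cuts = (np.cumsum(fracs) * n_batches).round().astype(int)
    parts = np.split(perm, cuts)
    # expand each batch index back into its row positions
    return [
        df.iloc[[p for b in part
                 for p in range(b * n, min((b + 1) * n, len(df)))]]
        for part in parts
    ]
```

For the 8-row example with `n=2` and `fracs=(0.7,)`, this splits the 4 batches 3:1, i.e. 6 rows and 2 rows.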

Tags: exception, python