performance – Efficiently replace elements in array based on dictionary – NumPy / Python-Exceptionshub

Posted by: admin, February 24, 2020

Questions:

First of all, my apologies if this has been answered elsewhere. All I could find were questions about replacing elements of a single given value, not elements of multiple values.

background

I have several thousand large np.arrays, like so:

# generate dummy data
input_array = np.zeros((100,100))
input_array[0:10,0:10] = 1
input_array[20:56, 21:43] = 5
input_array[34:43, 70:89] = 8

In those arrays, I want to replace values, based on a dictionary:

mapping = {1:2, 5:3, 8:6}

approach

At this time, I am using a simple loop, combined with fancy indexing:

output_array = np.zeros_like(input_array)

for key in mapping:
    output_array[input_array==key] = mapping[key]

problem

My arrays are 2000 by 2000 and the dictionaries have around 1000 entries, so these loops take forever.

question

Is there a function that simply takes an array and a mapping in the form of a dictionary (or similar) and outputs the changed values?

Help is greatly appreciated!

Update:

Solutions:

I tested the individual solutions in IPython, using

%%timeit -r 10 -n 10

input data

import numpy as np
np.random.seed(123)

sources = range(100)
outs = list(range(100))
np.random.shuffle(outs)
mapping = dict(zip(sources, outs))

For every solution:

np.random.seed(123)
input_array = np.random.randint(0,100, (1000,1000))

divakar, method 3:

%%timeit -r 10 -n 10
k = np.array(list(mapping.keys()))
v = np.array(list(mapping.values()))

mapping_ar = np.zeros(k.max()+1,dtype=v.dtype) #k,v from approach #1
mapping_ar[k] = v
out = mapping_ar[input_array]

5.01 ms ± 641 µs per loop (mean ± std. dev. of 10 runs, 10 loops each)

divakar, method 2:

%%timeit -r 10 -n 10
k = np.array(list(mapping.keys()))
v = np.array(list(mapping.values()))

sidx = k.argsort() #k,v from approach #1

k = k[sidx]
v = v[sidx]

idx = np.searchsorted(k,input_array.ravel()).reshape(input_array.shape)
idx[idx==len(k)] = 0
mask = k[idx] == input_array
out = np.where(mask, v[idx], 0)

56.9 ms ± 609 µs per loop (mean ± std. dev. of 10 runs, 10 loops each)

divakar, method 1:

%%timeit -r 10 -n 10

k = np.array(list(mapping.keys()))
v = np.array(list(mapping.values()))

out = np.zeros_like(input_array)
for key,val in zip(k,v):
    out[input_array==key] = val

113 ms ± 6.2 ms per loop (mean ± std. dev. of 10 runs, 10 loops each)

eelco:

%%timeit -r 10 -n 10
output_array = npi.remap(input_array.flatten(), list(mapping.keys()), list(mapping.values())).reshape(input_array.shape)

143 ms ± 4.47 ms per loop (mean ± std. dev. of 10 runs, 10 loops each)

yatu:

%%timeit -r 10 -n 10

keys, choices = list(zip(*mapping.items()))
# [(1, 5, 8), (2, 3, 6)]
conds = np.array(keys)[:,None,None]  == input_array
np.select(conds, choices)

157 ms ± 5 ms per loop (mean ± std. dev. of 10 runs, 10 loops each)

original, loopy method:

%%timeit -r 10 -n 10
output_array = np.zeros_like(input_array)

for key in mapping:
    output_array[input_array==key] = mapping[key]

187 ms ± 6.44 ms per loop (mean ± std. dev. of 10 runs, 10 loops each)

Thanks for the superquick help!

Answers:

Approach #1 : Loopy one with array data

One approach is to extract the keys and values into arrays and then use a similar loop:

k = np.array(list(mapping.keys()))
v = np.array(list(mapping.values()))

out = np.zeros_like(input_array)
for key,val in zip(k,v):
    out[input_array==key] = val

The benefit of this one over the original is the spatial locality of the array data used in the iterations, which makes data-fetching efficient.

Also, since you mentioned several thousand large np.arrays: if the mapping dictionary stays the same, the step to get the array versions, k and v, is a one-time setup process.
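As a rough sketch of that one-time setup, the conversion to k and v can be done once and reused for every array. The helper name remap_loop and the small batch of arrays below are made up for illustration; they are not from the question.

```python
import numpy as np

mapping = {1: 2, 5: 3, 8: 6}

# One-time setup: convert the dictionary to arrays once.
k = np.array(list(mapping.keys()))
v = np.array(list(mapping.values()))

def remap_loop(arr, k, v):
    """Approach #1 for a single array; values without a key become 0."""
    out = np.zeros_like(arr)
    for key, val in zip(k, v):
        out[arr == key] = val
    return out

# Hypothetical batch of arrays that all share the same mapping.
arrays = [np.full((4, 4), 1), np.full((4, 4), 5), np.full((4, 4), 7)]
results = [remap_loop(a, k, v) for a in arrays]
```

Only the loop body runs per array; the dictionary is never touched again after the setup step.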

Approach #2 : Vectorized one with searchsorted

A vectorized one could be suggested using np.searchsorted:

sidx = k.argsort() #k,v from approach #1

k = k[sidx]
v = v[sidx]

idx = np.searchsorted(k,input_array.ravel()).reshape(input_array.shape)
idx[idx==len(k)] = 0
mask = k[idx] == input_array
out = np.where(mask, v[idx], 0)
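To make the individual steps easier to follow, here is the same idea traced on a tiny input. The data is invented for illustration; the names k_sorted and v_sorted are mine, used so the unsorted originals stay visible.

```python
import numpy as np

k = np.array([8, 1, 5])   # keys, deliberately unsorted
v = np.array([6, 2, 3])   # values aligned with k
input_array = np.array([[1, 5, 0],
                        [8, 9, 4]])

# Sort keys and carry the values along so they stay aligned.
sidx = k.argsort()
k_sorted, v_sorted = k[sidx], v[sidx]   # keys are now [1, 5, 8]

# For each element, find the position where it would be inserted
# into the sorted keys.
idx = np.searchsorted(k_sorted, input_array.ravel()).reshape(input_array.shape)

# Elements larger than every key get position len(k_sorted), which is
# out of bounds; point them at slot 0 so the lookup below is safe.
idx[idx == len(k_sorted)] = 0

# A position is only a real match if the key stored there equals the element.
mask = k_sorted[idx] == input_array

# Matches take the mapped value; everything else becomes 0.
out = np.where(mask, v_sorted[idx], 0)
print(out)
```

Here 0, 9 and 4 have no key, so they end up as 0, which matches the behaviour of the original loop that starts from a zero-filled output.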

Approach #3 : Vectorized one with mapping-array for integer keys

For non-negative integer keys, a vectorized alternative is to build a mapping array that, when indexed by the input array, leads directly to the final output:

mapping_ar = np.zeros(k.max()+1,dtype=v.dtype) #k,v from approach #1
mapping_ar[k] = v
out = mapping_ar[input_array]
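A self-contained sketch of this approach on the dummy data from the question, checked against the plain loop. One assumption to flag: the question builds input_array with np.zeros, which gives a float array, and NumPy only accepts integer (or boolean) arrays as indices, so the sketch creates the array with an integer dtype.

```python
import numpy as np

mapping = {1: 2, 5: 3, 8: 6}
k = np.array(list(mapping.keys()))
v = np.array(list(mapping.values()))

# Dummy data as in the question, but with an integer dtype (assumption)
# so that it can be used as an index array.
input_array = np.zeros((100, 100), dtype=int)
input_array[0:10, 0:10] = 1
input_array[20:56, 21:43] = 5
input_array[34:43, 70:89] = 8

# Lookup table: position i holds the replacement for value i (0 if unmapped).
mapping_ar = np.zeros(k.max() + 1, dtype=v.dtype)
mapping_ar[k] = v
out = mapping_ar[input_array]

# Reference result from the original loop, for comparison.
expected = np.zeros_like(input_array)
for key, val in mapping.items():
    expected[input_array == key] = val

print(np.array_equal(out, expected))
```

The table has k.max()+1 entries, so this works best when the largest key is not huge compared to the number of keys.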

Answer:

Given that you’re using numpy arrays, I’d suggest you do a mapping using numpy too. Here’s a vectorized approach using np.select:

mapping = {1:2, 5:3, 8:6}
keys, choices = list(zip(*mapping.items()))
# [(1, 5, 8), (2, 3, 6)]
# we can use broadcasting to obtain a 3x100x100
# array to use as condlist
conds = np.array(keys)[:,None,None]  == input_array
# use conds as arrays of conditions and the values
# as choices
np.select(conds, choices)

array([[2, 2, 2, ..., 0, 0, 0],
       [2, 2, 2, ..., 0, 0, 0],
       [2, 2, 2, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])
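One thing to keep in mind with the broadcast comparison is that conds holds one boolean array per key. A quick back-of-the-envelope calculation for the sizes mentioned in the question (2000 by 2000 arrays, around 1000 keys), assuming one byte per boolean as in NumPy's bool dtype:

```python
import numpy as np

n_keys = 1000            # approximate dictionary size from the question
shape = (2000, 2000)     # array size from the question

# conds has shape (n_keys, 2000, 2000) with one byte per element.
conds_bytes = n_keys * shape[0] * shape[1] * np.dtype(bool).itemsize
print(conds_bytes / 1e9, "GB")
```

So for the 3-key example above this is negligible, but at the full problem size the condition array alone would take about 4 GB.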

Answer:

The numpy_indexed library (disclaimer: I am its author) provides functionality to implement this operation in an efficient vectorized manner:

import numpy_indexed as npi
output_array = npi.remap(input_array.flatten(), list(mapping.keys()), list(mapping.values())).reshape(input_array.shape)

Note: I didn't test it, but it should work along these lines. Efficiency should be good for large inputs and many items in the mapping; I imagine it is similar to Divakar's method 2 and not as fast as his method 3. But this solution is aimed more at generality: it will also work for inputs which are not positive integers, or even nd-arrays (for instance, replacing colors in an image with other colors).
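For readers without numpy_indexed installed, the "not positive integers" case can be sketched in plain NumPy. This is not the library's implementation, just one possible way to do it with np.unique and np.searchsorted; the function name remap_general and its default parameter are hypothetical.

```python
import numpy as np

def remap_general(arr, keys, values, default=0):
    """Plain-NumPy sketch: remap arbitrary sortable scalar keys.

    Elements of arr without a matching key are replaced by `default`.
    """
    keys = np.asarray(keys)
    values = np.asarray(values)

    # Sort the keys and keep the values aligned with them.
    order = np.argsort(keys)
    keys, values = keys[order], values[order]

    # Work on the distinct values only, then expand back via `inverse`.
    uniq, inverse = np.unique(arr, return_inverse=True)

    pos = np.searchsorted(keys, uniq)
    pos[pos == len(keys)] = 0          # keep out-of-range positions indexable
    found = keys[pos] == uniq          # True only for exact key matches
    mapped = np.where(found, values[pos], default)

    return mapped[inverse].reshape(arr.shape)

arr = np.array([[-1.5, 2.0],
                [3.0, -1.5]])
out = remap_general(arr, keys=[-1.5, 2.0], values=[10, 20])
print(out)
```

The lookup is done once per distinct value rather than once per element, which may help when the array contains many repeats.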

Answer:

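I think the Divakar #3 method assumes that the mapping dict covers all values (or at least the maximum value) in the target array. Otherwise, to avoid index out of range errors, you have to size the mapping array by the larger of the key maximum and the array maximum, so that both mapping_ar[k] and mapping_ar[input_array] stay in range. That means replacing the line

mapping_ar = np.zeros(k.max()+1,dtype=v.dtype)

with

mapping_ar = np.zeros(max(k.max(), input_array.max())+1,dtype=v.dtype)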

That adds considerable overhead.
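A minimal sketch of that situation, with made-up data in which the array contains a value (9) larger than any key. Taking the larger of the two maxima is my assumption here; it also covers the opposite case where the largest key does not occur in the array.

```python
import numpy as np

mapping = {1: 2, 5: 3}
k = np.array(list(mapping.keys()))
v = np.array(list(mapping.values()))

input_array = np.array([[1, 5],
                        [9, 0]])   # 9 is larger than k.max()

# Size the table so that both the keys and the array values are valid indices.
size = max(k.max(), input_array.max()) + 1
mapping_ar = np.zeros(size, dtype=v.dtype)
mapping_ar[k] = v

out = mapping_ar[input_array]
print(out)
```

The extra cost comes from input_array.max(), which has to scan the whole array before the table can be built.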