How Can I Bootstrap The Innermost Array Of A Numpy Array?
Solution 1:
The fastest/simplest answer turns out to be based on indexing a flattened version of your array:
import numpy as np

def resampFlat(arr, reps):
    n = arr.shape[-1]
    # create an array to shift random indexes as needed
    shift = np.repeat(np.arange(0, arr.size, n), n).reshape(arr.shape)
    # get a flat view of the array (ravel copies if arr is not contiguous)
    arrflat = arr.ravel()
    # sample the array by generating random ints and shifting them appropriately
    return np.array([arrflat[np.random.randint(0, n, arr.shape) + shift]
                     for _ in range(reps)])
The timings below show it to be the fastest of the three approaches tested.
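To make the shifting trick concrete, here is a minimal sketch on a tiny array. The array `small` and the seed are illustrative assumptions, not part of the original answer; the aim is only to show what `shift` contains and that each resampled value comes from its own innermost row.

```python
import numpy as np

# Illustrative toy input (an assumption for this sketch): 2 x 2 rows of 3 events.
small = np.arange(12).reshape(2, 2, 3)
n = small.shape[-1]

# Offset of each row's first element in the flattened array, repeated along the row.
shift = np.repeat(np.arange(0, small.size, n), n).reshape(small.shape)
print(shift)
# [[[0 0 0]
#   [3 3 3]]
#
#  [[6 6 6]
#   [9 9 9]]]

np.random.seed(0)  # seeded only so the sketch is repeatable
within_row = np.random.randint(0, n, small.shape)  # positions 0..n-1 inside a row
sample = small.ravel()[within_row + shift]         # one bootstrap replicate

# small is an arange, so integer division by n recovers the row a value came from.
assert sample.shape == small.shape
assert np.array_equal(sample // n, small // n)
```

Adding `shift` turns a within-row position into a position in the flattened array, which is why a single fancy-indexing call can resample every row at once.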
Timings
I tested the above resampFlat function alongside a simpler for-loop-based solution:
def resampFor(arr, reps):
    # store the shape for the return value
    shape = arr.shape
    # flatten all dimensions of arr except the last
    arr = arr.reshape(-1, arr.shape[-1])
    # preallocate the return value
    ret = np.empty((reps, *arr.shape), dtype=arr.dtype)
    # generate the indices of the resampled values
    idxs = np.random.randint(0, arr.shape[-1], (reps, *arr.shape))
    for rep, idx in zip(ret, idxs):
        # iterate over the resampled replicates
        for row, rowrep, i in zip(arr, rep, idx):
            # iterate over the event arrays within a replicate
            rowrep[...] = row[i]
    # give the return value the appropriate shape
    return ret.reshape((reps, *shape))
and a solution based on Paul Panzer's fancy indexing approach:
def resampFancyIdx(arr, reps):
    # note: the I, J, K unpacking below assumes arr has exactly 4 dimensions
    idx = np.random.randint(0, arr.shape[-1], (reps, *arr.shape))
    _, I, J, K, _ = np.ogrid[tuple(map(slice, (0, *arr.shape[:-1], 0)))]
    return arr[I, J, K, idx]
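The ogrid-based function above unpacks exactly three leading index grids, so it only handles 4-D input. As a hedged alternative that was not part of the timings in this answer, `np.take_along_axis` can express the same per-row gather for any number of leading dimensions. The function name `resamp_take_along` and the toy array `demo` are hypothetical.

```python
import numpy as np

def resamp_take_along(arr, reps):
    """Hypothetical helper: bootstrap the last axis of arr, reps times."""
    n = arr.shape[-1]
    # one independent set of n indices per replicate and per innermost row
    idx = np.random.randint(0, n, (reps, *arr.shape))
    # arr[None] adds a length-1 replicate axis that broadcasts against idx
    return np.take_along_axis(arr[None, ...], idx, axis=-1)

demo = np.arange(2 * 3 * 4).reshape(2, 3, 4)   # 3-D input, illustrative only
out = resamp_take_along(demo, 5)
print(out.shape)  # (5, 2, 3, 4)
```

No performance claim is made for this variant; it would need to be timed alongside the other three functions before drawing any conclusion.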
I tested with the following data:
shape = (10, 11, 50, 100)
data = np.arange(np.prod(shape)).reshape(shape)
Here are the results from the array flattening approach:
%%timeit
resampFlat(data, 100)
1.25 s ± 9.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The results from the for loop approach:
%%timeit
resampFor(data, 100)
1.66 s ± 16.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
and from Paul's fancy indexing:
%%timeit
resampFancyIdx(data, 100)
1.42 s ± 16.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Contrary to my expectations, resampFancyIdx beat resampFor, and I actually had to work fairly hard to come up with something better. At this point I would really like a better explanation of how fancy indexing works at the C-level, and why it's so performant.
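The numbers above come from IPython's `%%timeit`. For readers who want to rerun a comparison as a plain script, here is a rough sketch using the standard-library `timeit` module. The smaller array shape, the repetition counts, and the compact `resamp_flat` restatement are assumptions chosen to keep the run short, so absolute timings will differ from those quoted above.

```python
import timeit
import numpy as np

def resamp_flat(arr, reps):
    # compact restatement of the flattened-index approach, for timing only
    n = arr.shape[-1]
    shift = np.repeat(np.arange(0, arr.size, n), n).reshape(arr.shape)
    flat = arr.ravel()
    return np.array([flat[np.random.randint(0, n, arr.shape) + shift]
                     for _ in range(reps)])

# Smaller than the (10, 11, 50, 100) test data, to keep the sketch quick.
data = np.arange(5 * 6 * 10 * 20).reshape(5, 6, 10, 20)

# 3 runs of 2 calls each; report the best run, normalized per call
runs = timeit.repeat(lambda: resamp_flat(data, 10), repeat=3, number=2)
best_per_call = min(runs) / 2
print(f"best of 3: {best_per_call:.6f} s per call")
```

The other functions can be timed the same way by swapping the callable passed to `timeit.repeat`.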
Solution 2:
You can draw the indices of your samples and then apply fancy indexing:
>>> import numpy as np
>>>
>>> (categories, models, types, events) = (10, 11, 50, 100)
>>> data = np.random.random((categories, models, types, events))
>>> N_samples = 1000
>>>
>>> idx = np.random.randint(0, events, (categories, models, types, N_samples))
>>> I, J, K, _ = np.ogrid[:categories, :models, :types, :0]
>>>
>>> resampled = data[I, J, K, idx]
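To see why this indexing works, it can help to inspect the shapes involved. In the sketch below (small dimensions chosen purely for illustration), the open grids `I`, `J`, `K` each vary along a single axis, so broadcasting them against `idx` pairs every sampled event index with the coordinates of the row it belongs to.

```python
import numpy as np

# Small, illustrative dimensions (assumptions for this sketch).
categories, models, types, events = 2, 3, 4, 5
N_samples = 7
data = np.arange(categories * models * types * events).reshape(
    categories, models, types, events)

idx = np.random.randint(0, events, (categories, models, types, N_samples))
I, J, K, _ = np.ogrid[:categories, :models, :types, :0]

print(I.shape, J.shape, K.shape, idx.shape)
# (2, 1, 1, 1) (1, 3, 1, 1) (1, 1, 4, 1) (2, 3, 4, 7)

resampled = data[I, J, K, idx]
print(resampled.shape)  # (2, 3, 4, 7)

# data is an arange, so value // events identifies the row a value came from.
row_id = data[..., :1] // events
assert np.array_equal(resampled // events,
                      np.broadcast_to(row_id, resampled.shape))
```

The trailing `:0` slice only keeps the grids four-dimensional; its (empty) result is discarded, and `idx` takes its place in the last index position.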
A small explicit example for concreteness. The fields are labeled with "category" (A or B), "model" (a or b) and "type" (1 or 2) to make it easy to verify that sampling does preserve these.
>>> I, J, K, L = np.ix_(*(np.array(list(x), 'O') for x in ('AB', 'ab', '12', 'xyzw')))
>>> data = I+J+K+L
>>> data
array([[[['Aa1x', 'Aa1y', 'Aa1z', 'Aa1w'],
         ['Aa2x', 'Aa2y', 'Aa2z', 'Aa2w']],
        [['Ab1x', 'Ab1y', 'Ab1z', 'Ab1w'],
         ['Ab2x', 'Ab2y', 'Ab2z', 'Ab2w']]],
       [[['Ba1x', 'Ba1y', 'Ba1z', 'Ba1w'],
         ['Ba2x', 'Ba2y', 'Ba2z', 'Ba2w']],
        [['Bb1x', 'Bb1y', 'Bb1z', 'Bb1w'],
         ['Bb2x', 'Bb2y', 'Bb2z', 'Bb2w']]]], dtype=object)
>>> N_samples = 3
>>>
>>> idx = np.random.randint(0, data.shape[-1], (N_samples, *data.shape))
>>> _, I, J, K, _ = np.ogrid[tuple(map(slice, (0, *data.shape[:-1], 0)))]
>>>
>>> resampled = data[I, J, K, idx]
>>> resampled
array([[[[['Aa1z', 'Aa1y', 'Aa1y', 'Aa1x'],
          ['Aa2y', 'Aa2z', 'Aa2z', 'Aa2z']],
         [['Ab1w', 'Ab1z', 'Ab1y', 'Ab1x'],
          ['Ab2y', 'Ab2w', 'Ab2y', 'Ab2w']]],
        [[['Ba1z', 'Ba1y', 'Ba1y', 'Ba1x'],
          ['Ba2x', 'Ba2x', 'Ba2z', 'Ba2x']],
         [['Bb1x', 'Bb1x', 'Bb1y', 'Bb1z'],
          ['Bb2y', 'Bb2w', 'Bb2y', 'Bb2z']]]],
       [[[['Aa1x', 'Aa1w', 'Aa1x', 'Aa1z'],
          ['Aa2y', 'Aa2y', 'Aa2x', 'Aa2z']],
         [['Ab1y', 'Ab1x', 'Ab1w', 'Ab1z'],
          ['Ab2w', 'Ab2x', 'Ab2w', 'Ab2w']]],
        [[['Ba1x', 'Ba1z', 'Ba1x', 'Ba1z'],
          ['Ba2x', 'Ba2y', 'Ba2y', 'Ba2w']],
         [['Bb1z', 'Bb1w', 'Bb1y', 'Bb1w'],
          ['Bb2w', 'Bb2x', 'Bb2w', 'Bb2z']]]],
       [[[['Aa1w', 'Aa1w', 'Aa1w', 'Aa1y'],
          ['Aa2z', 'Aa2x', 'Aa2y', 'Aa2x']],
         [['Ab1z', 'Ab1z', 'Ab1x', 'Ab1y'],
          ['Ab2w', 'Ab2x', 'Ab2x', 'Ab2y']]],
        [[['Ba1w', 'Ba1x', 'Ba1y', 'Ba1y'],
          ['Ba2z', 'Ba2x', 'Ba2x', 'Ba2x']],
         [['Bb1z', 'Bb1w', 'Bb1x', 'Bb1x'],
          ['Bb2z', 'Bb2x', 'Bb2w', 'Bb2z']]]]], dtype=object)
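The same check can be done programmatically rather than by eye. This sketch rebuilds equivalent labeled data with a list comprehension (an assumption made to keep the example self-contained; the session above uses np.ix_ on object arrays) and asserts that every resampled label comes from the matching category/model/type row.

```python
import numpy as np

# Rebuild labels such as 'Aa1x' (category, model, type, event).
data = np.array([[[[c + m + t + e for e in "xyzw"]
                   for t in "12"]
                  for m in "ab"]
                 for c in "AB"])

N_samples = 3
idx = np.random.randint(0, data.shape[-1], (N_samples, *data.shape))
_, I, J, K, _ = np.ogrid[tuple(map(slice, (0, *data.shape[:-1], 0)))]
resampled = data[I, J, K, idx]

# For each (category, model, type) row, the resampled labels must be a
# subset of that row's original labels, in every replicate.
for i, j, k in np.ndindex(*data.shape[:-1]):
    assert set(resampled[:, i, j, k].ravel()) <= set(data[i, j, k])

print(resampled.shape)  # (3, 2, 2, 2, 4)
```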
Solution 3:
databoot = []
for i in range(5):
    idx = np.random.choice(100, 100)
    databoot.append(data[:,:,:,idx])
- shape of databoot (once converted with np.array) -> (5, 10, 11, 50, 100)
- shape of data -> (10, 11, 50, 100)
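One property of this loop is worth noting: within each replicate, the same `idx` is applied to every (category, model, type) row, whereas Solutions 1 and 2 draw indices independently per row. The sketch below, on small assumed dimensions, stacks the list into an array and checks that shared-index behavior.

```python
import numpy as np

# Small stand-in for the (10, 11, 50, 100) data (an assumption for brevity).
shape = (2, 3, 4, 6)
events = shape[-1]
data = np.arange(np.prod(shape)).reshape(shape)

databoot = []
for _ in range(5):
    idx = np.random.choice(events, events)   # one index set per replicate
    databoot.append(data[:, :, :, idx])

databoot = np.stack(databoot)                # list of arrays -> single array
print(databoot.shape)  # (5, 2, 3, 4, 6)

# data is an arange, so value % events is the event position that was picked.
picked = databoot % events
# Every row in a replicate picked the same positions as row (0, 0, 0).
shared = np.broadcast_to(picked[:, :1, :1, :1, :], picked.shape)
assert np.array_equal(picked, shared)
```

Whether that matters depends on whether the rows should be resampled jointly or independently.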