
Numpy Array Show Only Unique Rows

I want to get the rows of an array that are unique. Unlike numpy's unique function, I want to exclude all rows that occur more than once. So the input: [[1,1],[1,1],[1,2]

Solution 1:

Approach #1

Here's one approach using lex-sorting and np.bincount -

# Perform lex sort and get the sorted array version of the input
sorted_idx = np.lexsort(A.T)
sorted_Ar = A[sorted_idx,:]

# Mask of start of each unique row in sorted array
mask = np.append(True,np.any(np.diff(sorted_Ar,axis=0),1))

# Get counts of each unique row
unq_count = np.bincount(mask.cumsum()-1)

# Compare counts to 1 and select the corresponding unique row with the mask
out = sorted_Ar[mask][np.nonzero(unq_count==1)[0]]

Please note that the output would not maintain the order of elements as originally present in the input array.
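To make the steps above concrete, here is a minimal worked sketch on a small array. The input values are illustrative only (they are not the question's input), and the intermediate results in the comments follow from np.lexsort using the last column as its primary key.

```python
import numpy as np

# Illustrative input (an assumption for this sketch): [1, 1] occurs twice.
A = np.array([[1, 2],
              [1, 1],
              [3, 0],
              [1, 1]])

# lexsort treats the last key (here, the last column of A) as the primary key.
sorted_idx = np.lexsort(A.T)
sorted_Ar = A[sorted_idx, :]           # [[3, 0], [1, 1], [1, 1], [1, 2]]

# True wherever a row differs from the previous row in the sorted array.
mask = np.append(True, np.any(np.diff(sorted_Ar, axis=0), 1))

# Number of occurrences of each distinct row, in sorted order: [1, 2, 1]
unq_count = np.bincount(mask.cumsum() - 1)

# Keep only the distinct rows whose count is exactly 1.
out = sorted_Ar[mask][np.nonzero(unq_count == 1)[0]]
print(out)
```

Here [1, 2] comes first in the input but last in the result, which illustrates the ordering caveat mentioned above.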

Approach #2

If the elements are non-negative integers, you can convert the 2D array A to a 1D array by treating each row as an indexing tuple, which should be a pretty efficient solution. Note that the rows are returned in the sorted order of these 1D values; to keep the input order, sort the selected indices (for example with np.sort) before indexing A. The implementation would be -

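# Convert 2D array A to a 1D array assuming each row as an indexing tuple
# The +1 makes each column's radix exceed its largest value, so distinct rows of non-negative integers get distinct 1D values
A_1D = A.dot(np.append((A.max(0)+1)[::-1].cumprod()[::-1][1:],1))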

# Get sorting indices for the 1D array
sort_idx = A_1D.argsort()

# Mask of start of each unique row in 1D sorted array
mask = np.append(True,np.diff(A_1D[sort_idx])!=0)

# Get the counts of each unique 1D element
counts = np.bincount(mask.cumsum()-1)

# Select the IDs with counts==1 and thus the unique rows from A
out = A[sort_idx[np.nonzero(mask)[0][counts==1]]]
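The selected indices come out in the sorted order of the 1D values, so one way to get the rows back in input order is to sort those indices before indexing A. The following is a sketch under stated assumptions, not a definitive implementation: the function name unique_rows_keep_order is hypothetical, A is assumed to hold non-negative integers, and the product of (column maximum + 1) over all columns is assumed to fit in the integer dtype.

```python
import numpy as np

def unique_rows_keep_order(A):
    """Rows of A that occur exactly once, in input order.

    Assumes a non-empty 2D array of non-negative integers whose
    mixed-radix encoding does not overflow the integer dtype.
    """
    # Mixed-radix weights: column i is weighted by the product of the
    # radices (max + 1) of all columns to its right.
    bases = A.max(axis=0) + 1
    weights = np.append(bases[::-1].cumprod()[::-1][1:], 1)
    keys = A.dot(weights)

    sort_idx = keys.argsort()
    mask = np.append(True, np.diff(keys[sort_idx]) != 0)
    counts = np.bincount(mask.cumsum() - 1)

    # Indices into A of rows whose key occurs once, restored to input order.
    keep = np.sort(sort_idx[np.nonzero(mask)[0][counts == 1]])
    return A[keep]

A = np.array([[1, 2], [1, 1], [3, 0], [1, 1]])
out = unique_rows_keep_order(A)
print(out)
```

The extra np.sort runs only over the rows that are kept, so it should add little to the cost of the argsort that precedes it.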

Runtime tests and verification

Functions -

import numpy as np

def unq_rows_v1(A):
    sorted_idx = np.lexsort(A.T)
    sorted_Ar =  A[sorted_idx,:]
    mask = np.append(True,np.any(np.diff(sorted_Ar,axis=0),1))
    unq_count = np.bincount(mask.cumsum()-1) 
    return sorted_Ar[mask][np.nonzero(unq_count==1)[0]]

def unq_rows_v2(A):
    # +1 so each column's radix exceeds its largest value (avoids key collisions)
    A_1D = A.dot(np.append((A.max(0)+1)[::-1].cumprod()[::-1][1:],1))
    sort_idx = A_1D.argsort()
    mask = np.append(True,np.diff(A_1D[sort_idx])!=0)
    return A[sort_idx[np.nonzero(mask)[0][np.bincount(mask.cumsum()-1)==1]]]

Timings & Verify Outputs -

In [272]: A = np.random.randint(20,30,(10000,5))

In [273]: unq_rows_v1(A).shape
Out[273]: (9051, 5)

In [274]: unq_rows_v2(A).shape
Out[274]: (9051, 5)

In [275]: %timeit unq_rows_v1(A)
100 loops, best of 3: 5.07 ms per loop

In [276]: %timeit unq_rows_v2(A)
1000 loops, best of 3: 1.96 ms per loop
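Comparing shapes only confirms that both functions return the same number of rows. As an additional check, the result can be compared against np.unique with axis=0 and return_counts=True, which is a different technique from the two approaches above and needs NumPy 1.13 or newer. This is a sketch: the helper names unq_rows_ref and as_row_set are hypothetical, and the seed is arbitrary, so the row count will not match the session above.

```python
import numpy as np

def unq_rows_v1(A):
    # Same steps as Approach #1 above.
    sorted_idx = np.lexsort(A.T)
    sorted_Ar = A[sorted_idx, :]
    mask = np.append(True, np.any(np.diff(sorted_Ar, axis=0), 1))
    unq_count = np.bincount(mask.cumsum() - 1)
    return sorted_Ar[mask][np.nonzero(unq_count == 1)[0]]

def unq_rows_ref(A):
    # Reference via np.unique: distinct rows and how often each occurs.
    rows, counts = np.unique(A, axis=0, return_counts=True)
    return rows[counts == 1]

def as_row_set(rows):
    # Order-independent view of a 2D array, used for comparison only.
    return set(map(tuple, rows.tolist()))

rng = np.random.default_rng(0)  # arbitrary fixed seed (NumPy >= 1.17)
A = rng.integers(20, 30, size=(10000, 5))

res_v1 = unq_rows_v1(A)
res_ref = unq_rows_ref(A)

# The two results are sorted differently, so compare them as sets of rows.
same = res_v1.shape == res_ref.shape and as_row_set(res_v1) == as_row_set(res_ref)
print(same)
```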

Solution 2:

The numpy_indexed package (disclaimer: I am its author) is able to solve this problem efficiently, in a fully vectorized manner. I haven't tested it with numpy 1.9 yet, if that is still relevant, but perhaps you'd be willing to give it a spin and let me know. I don't have any reason to believe it will not work with older versions of numpy.

import numpy as np
import numpy_indexed as npi

a = np.random.rand(10000, 3).round(2)
unique, count = npi.count(a)
print(unique[count == 1])

Note that, as per your original question, this solution is not restricted to a specific number of columns or dtype.
