
Get List Of Unique String Values Per Column In A Dataframe Using Python

Here I go with another question. I have a large dataframe, about 20 columns by 400.000 rows. In this dataset I cannot have strings, since the software that will process the data only…

Solution 1:

Answer :

import pandas as pd

# try to convert all columns to numbers...
# (errors='ignore' is deprecated as of pandas 2.2)
df = df.apply(lambda x: pd.to_numeric(x, errors='ignore'))

cols = df.filter(like='FNR').select_dtypes(include=['object']).columns
st = df[cols].stack().to_frame('name')
st['cat'] = pd.factorize(st.name)[0]
df[cols] = st['cat'].unstack()

del st
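The first line of the answer relies on `errors='ignore'`, which recent pandas releases deprecate. As a hedged alternative, the sketch below converts each column only when it parses cleanly and leaves the others untouched. The helper name `to_numeric_where_possible` and the `sample` frame are made up for illustration and are not part of the original answer.

```python
import pandas as pd

def to_numeric_where_possible(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy in which every column that parses as numbers is converted."""
    out = df.copy()
    for col in out.columns:
        try:
            # default errors='raise': fails if any value is not numeric
            out[col] = pd.to_numeric(out[col])
        except (ValueError, TypeError):
            # column holds non-numeric strings; leave it unchanged
            pass
    return out

# hypothetical sample data: one numeric-looking column, one text column
sample = pd.DataFrame({"a": ["1", "2"], "b": ["NORMAL", "HIGH"]})
converted = to_numeric_where_possible(sample)
```

Here `converted["a"]` ends up with a numeric dtype, while `converted["b"]` keeps its strings and is picked up by the factorizing step.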

Demo:

In [233]: df
Out[233]:
       DATE     TIME FNRHP306H FNRHP306HC  FNRHP306_2MEC_MAX
0  7-Feb-15  0:00:00    NORMAL     NORMAL               1050
1  7-Feb-15  0:01:00    NORMAL     NORMAL               1050
2  7-Feb-15  0:02:00    NORMAL       HIGH               1050
3  7-Feb-15  0:03:00      HIGH     NORMAL               1050
4  7-Feb-15  0:04:00       LOW     NORMAL               1050
5  7-Feb-15  0:05:00    NORMAL        LOW               1050

First, we stack all object (string) columns:

In [235]: cols = df.filter(like='FNR').select_dtypes(include=['object']).columns

In [236]: st = df[cols].stack().to_frame('name')

Now we can factorize the stacked column:

In [238]: st['cat'] = pd.factorize(st.name)[0]

In [239]: st
Out[239]:
                name  cat
0 FNRHP306H   NORMAL    0
  FNRHP306HC  NORMAL    0
1 FNRHP306H   NORMAL    0
  FNRHP306HC  NORMAL    0
2 FNRHP306H   NORMAL    0
  FNRHP306HC    HIGH    1
3 FNRHP306H     HIGH    1
  FNRHP306HC  NORMAL    0
4 FNRHP306H      LOW    2
  FNRHP306HC  NORMAL    0
5 FNRHP306H   NORMAL    0
  FNRHP306HC     LOW    2
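The demo keeps only element `[0]` of what `pd.factorize` returns. The second element holds the distinct strings, so it can serve as a lookup from integer code back to the original label. A small sketch with made-up values (`values` and `code_to_label` are illustrative names, not from the answer):

```python
import pandas as pd

values = pd.Series(["NORMAL", "NORMAL", "HIGH", "LOW", "NORMAL"])

# codes:   one integer per element, numbered in order of first appearance
# uniques: the distinct strings; position i is the label behind code i
codes, uniques = pd.factorize(values)

# lookup table to translate the integer codes back into strings later
code_to_label = dict(enumerate(uniques))
```

Keeping `code_to_label` around may be useful when the downstream software returns results in terms of the integer codes.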

Assign the unstacked result back to the original DF (to the object columns):

In [241]: df[cols] = st['cat'].unstack()

In [242]: df
Out[242]:
       DATE     TIME  FNRHP306H  FNRHP306HC  FNRHP306_2MEC_MAX
0  7-Feb-15  0:00:00          0           0               1050
1  7-Feb-15  0:01:00          0           0               1050
2  7-Feb-15  0:02:00          0           1               1050
3  7-Feb-15  0:03:00          1           0               1050
4  7-Feb-15  0:04:00          2           0               1050
5  7-Feb-15  0:05:00          0           2               1050
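For anyone who wants to rerun the demo, here is a self-contained sketch that rebuilds the sample frame and applies the same stack, factorize, unstack steps. Two details are my own assumptions rather than part of the answer: the string columns are created with an explicit `dtype=object` so that `select_dtypes(include=['object'])` matches them regardless of the pandas version's default string dtype, and the unstacked frame is reindexed with `[cols]` so the assignment cannot depend on column order.

```python
import pandas as pd

# rebuild the demo data; dtype=object is spelled out on purpose (assumption)
df = pd.DataFrame({
    "DATE": ["7-Feb-15"] * 6,
    "TIME": ["0:00:00", "0:01:00", "0:02:00", "0:03:00", "0:04:00", "0:05:00"],
    "FNRHP306H": pd.Series(
        ["NORMAL", "NORMAL", "NORMAL", "HIGH", "LOW", "NORMAL"], dtype=object),
    "FNRHP306HC": pd.Series(
        ["NORMAL", "NORMAL", "HIGH", "NORMAL", "NORMAL", "LOW"], dtype=object),
    "FNRHP306_2MEC_MAX": [1050] * 6,
})

# 1. pick the string columns whose name contains 'FNR'
cols = df.filter(like="FNR").select_dtypes(include=["object"]).columns

# 2. stack them into one long column indexed by (row, column name)
st = df[cols].stack().to_frame("name")

# 3. factorize once, so a given string gets the same code in every column
st["cat"] = pd.factorize(st["name"])[0]

# 4. reshape back to wide form and overwrite the string columns
df[cols] = st["cat"].unstack()[cols]
```

Because the factorization runs over the stacked values, `NORMAL`, `HIGH` and `LOW` map to 0, 1 and 2 in both columns, matching the output shown in the demo.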

Explanation:

In [248]: df.filter(like='FNR')
Out[248]:
  FNRHP306H FNRHP306HC  FNRHP306_2MEC_MAX
0    NORMAL     NORMAL               1050
1    NORMAL     NORMAL               1050
2    NORMAL       HIGH               1050
3      HIGH     NORMAL               1050
4       LOW     NORMAL               1050
5    NORMAL        LOW               1050

In [249]: df.filter(like='FNR').select_dtypes(include=['object'])
Out[249]:
  FNRHP306H FNRHP306HC
0    NORMAL     NORMAL
1    NORMAL     NORMAL
2    NORMAL       HIGH
3      HIGH     NORMAL
4       LOW     NORMAL
5    NORMAL        LOW
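Since the question title asks for the unique string values per column, the same column selection can also be used to list them before anything is replaced. This is only a sketch on the demo's data. Note that it swaps in `exclude=['number']` instead of the answer's `include=['object']`, which is my substitution, and `unique_per_column` is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({
    "FNRHP306H": ["NORMAL", "NORMAL", "NORMAL", "HIGH", "LOW", "NORMAL"],
    "FNRHP306HC": ["NORMAL", "NORMAL", "HIGH", "NORMAL", "NORMAL", "LOW"],
    "FNRHP306_2MEC_MAX": [1050] * 6,
})

# non-numeric columns whose name contains 'FNR'
text_cols = df.filter(like="FNR").select_dtypes(exclude=["number"]).columns

# distinct values of each text column, in order of first appearance
unique_per_column = {col: df[col].unique().tolist() for col in text_cols}
```

On a 400.000-row frame this gives a quick overview of which labels actually occur before they are turned into integer codes.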
