Get List Of Unique String Values Per Column In A Dataframe Using Python
Here I go with another question. I have a large dataframe, about 20 columns by 400,000 rows. This dataset cannot contain strings, since the software that will process the data only ...
Solution 1:
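import pandas as pd
# Try to convert all columns to numbers (errors='ignore' is deprecated in pandas 2.2+)
df = df.apply(lambda x: pd.to_numeric(x, errors='ignore'))
# Select the object (string) columns whose names contain 'FNR'
cols = df.filter(like='FNR').select_dtypes(include=['object']).columns
# Stack them into one column so all values share a single set of integer codes
st = df[cols].stack().to_frame('name')
st['cat'] = pd.factorize(st.name)[0]
# Unstack the codes and write them back over the string columns
df[cols] = st['cat'].unstack()
del st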
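The steps above can be tried end to end with a small frame. This is only a sketch: the helper name `encode_string_columns` and the sample values are made up for illustration (the values mirror the demo data), and the string columns are created with `dtype=object` explicitly so that `select_dtypes(include=['object'])` picks them up regardless of the pandas version's default string dtype.

```python
import pandas as pd

def encode_string_columns(df, like="FNR"):
    """Hypothetical helper: replace strings in matching object columns with shared integer codes."""
    out = df.copy()
    cols = out.filter(like=like).select_dtypes(include=["object"]).columns
    # Stack the string columns into one long Series indexed by (row, column)
    stacked = out[cols].stack()
    # factorize returns the integer codes and the distinct values in order of first appearance
    codes, uniques = pd.factorize(stacked)
    # Reshape back to the original layout; reindex so columns line up by name
    encoded = pd.Series(codes, index=stacked.index).unstack().reindex(columns=cols)
    out[cols] = encoded
    return out, list(uniques)

# Illustrative sample data
df = pd.DataFrame({
    "DATE": ["7-Feb-15"] * 6,
    "TIME": ["0:00:00", "0:01:00", "0:02:00", "0:03:00", "0:04:00", "0:05:00"],
    "FNRHP306H": pd.Series(["NORMAL", "NORMAL", "NORMAL", "HIGH", "LOW", "NORMAL"], dtype=object),
    "FNRHP306HC": pd.Series(["NORMAL", "NORMAL", "HIGH", "NORMAL", "NORMAL", "LOW"], dtype=object),
    "FNRHP306_2MEC_MAX": [1050] * 6,
})

encoded_df, labels = encode_string_columns(df)
print(encoded_df)
print(labels)
```

Returning the list of labels alongside the encoded frame keeps the meaning of each code available after the strings are gone.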
Demo:
In [233]: df
Out[233]:
       DATE     TIME FNRHP306H FNRHP306HC  FNRHP306_2MEC_MAX
0  7-Feb-15  0:00:00    NORMAL     NORMAL               1050
1  7-Feb-15  0:01:00    NORMAL     NORMAL               1050
2  7-Feb-15  0:02:00    NORMAL       HIGH               1050
3  7-Feb-15  0:03:00      HIGH     NORMAL               1050
4  7-Feb-15  0:04:00       LOW     NORMAL               1050
5  7-Feb-15  0:05:00    NORMAL        LOW               1050
First we stack all object (string) columns:
In [235]: cols = df.filter(like='FNR').select_dtypes(include=['object']).columns
In [236]: st = df[cols].stack().to_frame('name')
Now we can factorize the stacked column:
In [238]: st['cat'] = pd.factorize(st.name)[0]
In [239]: st
Out[239]:
                name  cat
0 FNRHP306H   NORMAL    0
  FNRHP306HC  NORMAL    0
1 FNRHP306H   NORMAL    0
  FNRHP306HC  NORMAL    0
2 FNRHP306H   NORMAL    0
  FNRHP306HC    HIGH    1
3 FNRHP306H     HIGH    1
  FNRHP306HC  NORMAL    0
4 FNRHP306H      LOW    2
  FNRHP306HC  NORMAL    0
5 FNRHP306H   NORMAL    0
  FNRHP306HC     LOW    2
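The answer keeps only the first element returned by `pd.factorize`. The second element holds the distinct values, which is useful if the codes need to be translated back later. A minimal sketch, assuming a standalone Series with illustrative values (the names `values`, `code_to_label` and `decoded` are not from the original answer):

```python
import pandas as pd

# Illustrative values; dtype=object mirrors the string columns in the demo
values = pd.Series(["NORMAL", "NORMAL", "NORMAL", "HIGH", "HIGH", "NORMAL", "LOW"], dtype=object)

# codes: integer array; uniques: distinct values in order of first appearance
codes, uniques = pd.factorize(values)

# Lookup table from integer code to original string
code_to_label = dict(enumerate(uniques))

# Recover the original strings from the codes
decoded = uniques.take(codes)

print(code_to_label)
print(list(decoded))
```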
Assign the unstacked result back to the original DF (to the object columns):
In [241]: df[cols] = st['cat'].unstack()
In [242]: df
Out[242]:
       DATE     TIME  FNRHP306H  FNRHP306HC  FNRHP306_2MEC_MAX
0  7-Feb-15  0:00:00          0           0               1050
1  7-Feb-15  0:01:00          0           0               1050
2  7-Feb-15  0:02:00          0           1               1050
3  7-Feb-15  0:03:00          1           0               1050
4  7-Feb-15  0:04:00          2           0               1050
5  7-Feb-15  0:05:00          0           2               1050
Explanation:
In [248]: df.filter(like='FNR')
Out[248]:
  FNRHP306H FNRHP306HC  FNRHP306_2MEC_MAX
0    NORMAL     NORMAL               1050
1    NORMAL     NORMAL               1050
2    NORMAL       HIGH               1050
3      HIGH     NORMAL               1050
4       LOW     NORMAL               1050
5    NORMAL        LOW               1050
In [249]: df.filter(like='FNR').select_dtypes(include=['object'])
Out[249]:
FNRHP306H FNRHP306HC
0 NORMAL NORMAL
1 NORMAL NORMAL
2 NORMAL HIGH
3 HIGH NORMAL
4 LOW NORMAL
5 NORMAL LOW
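Once the string columns are isolated this way, the unique string values per column (the question in the title) can be listed directly. The following is a sketch rather than part of the original answer; the frame and the name `unique_per_column` are assumptions for illustration.

```python
import pandas as pd

# Illustrative frame mirroring the demo data
df = pd.DataFrame({
    "FNRHP306H": pd.Series(["NORMAL", "NORMAL", "NORMAL", "HIGH", "LOW", "NORMAL"], dtype=object),
    "FNRHP306HC": pd.Series(["NORMAL", "NORMAL", "HIGH", "NORMAL", "NORMAL", "LOW"], dtype=object),
    "FNRHP306_2MEC_MAX": [1050] * 6,
})

# Keep only the object (string) columns whose names contain 'FNR'
strings = df.filter(like="FNR").select_dtypes(include=["object"])

# Distinct values per column, in order of first appearance
unique_per_column = {col: strings[col].unique().tolist() for col in strings.columns}
print(unique_per_column)
```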