Replace Any String In Columns With 1

I'm working with pandas. My goal is to convert several columns within a dataframe from containing either NaN or string data, into more or less a dummy variable (0's for NaN; 1's fo

Solution 1:

You can do this with DataFrame.replace() with a regular expression:

In [14]: df
Out[14]:
   fol T_opp T_Dir T_Enh
0    1     0     0    vo
1    2    vr     0     0
2    2     0     0     0
3    3     0    bt     0

In [15]: df.replace(regex={'vr|bt|vo': '1'}).convert_objects(convert_numeric=True)
Out[15]:
   fol T_opp T_Dir T_Enh
0    1     0     0     1
1    2     1     0     0
2    2     0     0     0
3    3     0     1     0
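
DataFrame.convert_objects() was deprecated and later removed from pandas, so the call above fails on current releases. A minimal sketch of the same idea using pd.to_numeric, assuming every cell is stored as a string (the sample frame below is rebuilt by hand for illustration):

```python
import pandas as pd

# Sample frame rebuilt from the example above; every cell is assumed to be a string.
df = pd.DataFrame(
    {
        "fol": ["1", "2", "2", "3"],
        "T_opp": ["0", "vr", "0", "0"],
        "T_Dir": ["0", "0", "0", "bt"],
        "T_Enh": ["vo", "0", "0", "0"],
    }
)

# Replace the known marker strings with "1" via a regular expression...
replaced = df.replace(regex={"vr|bt|vo": "1"})

# ...then coerce each column to a numeric dtype (a stand-in for convert_objects).
numeric = replaced.apply(pd.to_numeric)
print(numeric)
```

pd.to_numeric raises on values it cannot parse by default, which makes a leftover unreplaced string easy to spot.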

If for some reason you're against dicts, you can be very explicit about it too:

In [19]: df.replace(regex='vr|bt|vo', value='1')
Out[19]:
   fol T_opp T_Dir T_Enh
0    1     0     0     1
1    2     1     0     0
2    2     0     0     0
3    3     0     1     0

But wait there's more! You can specify the columns you want to operate on by passing a nested dict (keys cannot be regular expressions, well, they can but it won't do anything except return the frame):

In [22]: df.replace({'T_opp': {'vr': 1}, 'T_Dir': {'bt': 1}})
Out[22]:
   fol T_opp T_Dir T_Enh
0    1     0     0    vo
1    2     1     0     0
2    2     0     0     0
3    3     0     1     0

EDIT: Since you want to replace all strings with the number 1 (as per your comments below), do:

In [23]: df.replace(regex={r'\D+': 1})
Out[23]:
   fol T_opp T_Dir T_Enh
0    1     0     0     1
1    2     1     0     0
2    2     0     0     0
3    3     0     1     0
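
The pattern \D matches any single character that is not a digit, so a cell matches as soon as it contains one such character. A small sketch with Python's re module (outside pandas; contains_non_digit is a hypothetical helper) shows which values would count as "strings" under this rule, including the caveat that decimal or negative numbers stored as text also match:

```python
import re

# \D matches one character that is not a digit.
non_digit = re.compile(r"\D")


def contains_non_digit(cell: str) -> bool:
    """Return True if the cell has at least one non-digit character."""
    return non_digit.search(cell) is not None


samples = ["0", "12", "vr", "bt", "1.5", "-1"]
flags = {s: contains_non_digit(s) for s in samples}
print(flags)
```

If values like "1.5" or "-1" can appear in your columns, a stricter pattern or a numeric conversion step may be a better fit.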

EDIT: Microbenchmarks might be useful here:

Andy's method (faster):

In [11]: timeit df.convert_objects(convert_numeric=True).fillna(1)
1000 loops, best of 3: 590 µs per loop

DataFrame.replace():

In [46]: timeit df.replace(regex={r'\D': 1})
1000 loops, best of 3: 801 µs per loop

If you have columns containing strings that you want to keep, restrict the replacement to specific columns:

In [45]: cols_to_replace = 'T_opp', 'T_Dir', 'T_Enh'

In [46]: d = dict(zip(cols_to_replace, [{r'\D': 1}] * len(cols_to_replace)))

In [47]: d
Out[47]: {'T_Dir': {'\\D': 1}, 'T_Enh': {'\\D': 1}, 'T_opp': {'\\D': 1}}

In [48]: df.replace(d, regex=True)
Out[48]:
   fol T_opp T_Dir T_Enh Activity
0    1     0     0     1       hf
1    2     1     0     0       hx
2    2     0     0     0       fe
3    3     0     1     0       rn
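
The dict(zip(...)) construction above reuses one inner dict object for every column, which is harmless here but easy to trip over if the mapping is mutated later. A dict comprehension is one way to sketch the same mapping with independent inner dicts. Two assumptions are made: regex=True is passed explicitly, on the understanding that current pandas treats the nested keys as patterns only when asked to, and the hand-built sample frame uses object dtype so strings and integers can share a column:

```python
import pandas as pd

# Sample frame rebuilt from the example; object dtype allows mixed str/int cells.
df = pd.DataFrame(
    {
        "fol": [1, 2, 2, 3],
        "T_opp": ["0", "vr", "0", "0"],
        "T_Dir": ["0", "0", "0", "bt"],
        "T_Enh": ["vo", "0", "0", "0"],
        "Activity": ["hf", "hx", "fe", "rn"],
    },
    dtype=object,
)

cols_to_replace = ("T_opp", "T_Dir", "T_Enh")

# One fresh inner dict per column instead of a single shared one.
d = {col: {r"\D": 1} for col in cols_to_replace}

# Only the listed columns are searched; Activity keeps its strings.
replaced = df.replace(d, regex=True)

# The replaced columns now mix "0" strings with the integer 1; make them numeric.
flags = replaced[list(cols_to_replace)].apply(pd.to_numeric)
print(flags)
```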

Yet another way is to use filter and join the results together after replacement:

In [10]: df
Out[10]:
   fol T_opp T_Dir T_Enh Activity
0    1     0     0    vo       hf
1    2    vr     0     0       hx
2    2     0     0     0       fe
3    3     0    bt     0       rn

In [11]: filtered = df.filter(regex='T_.+')

In [12]: res = filtered.replace({r'\D': 1}, regex=True)

In [13]: res
Out[13]:
  T_opp T_Dir T_Enh
0     0     0     1
1     1     0     0
2     0     0     0
3     0     1     0

In [14]: not_filtered = df[df.columns - filtered.columns]

In [15]: not_filtered
Out[15]:
  Activity  fol
0       hf    1
1       hx    2
2       fe    2
3       rn    3

In [16]: res.join(not_filtered)
Out[16]:
  T_opp T_Dir T_Enh Activity  fol
0     0     0     1       hf    1
1     1     0     0       hx    2
2     0     0     0       fe    2
3     0     1     0       rn    3

Note that the original order of the columns is not retained.
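
If the original column order matters, one option is to select the joined result by the original columns. A minimal sketch, assuming a recent pandas in which Index.difference() provides the set difference and regex=True is passed explicitly; the sample frame is rebuilt by hand with object dtype:

```python
import pandas as pd

# Sample frame rebuilt from the example; object dtype allows mixed str/int cells.
df = pd.DataFrame(
    {
        "fol": [1, 2, 2, 3],
        "T_opp": ["0", "vr", "0", "0"],
        "T_Dir": ["0", "0", "0", "bt"],
        "T_Enh": ["vo", "0", "0", "0"],
        "Activity": ["hf", "hx", "fe", "rn"],
    },
    dtype=object,
)

# Pick the T_* columns and replace any cell containing a non-digit with 1.
filtered = df.filter(regex="T_.+")
res = filtered.replace({r"\D": 1}, regex=True)

# Columns the filter did not select (Index.difference returns them sorted).
not_filtered = df[df.columns.difference(filtered.columns)]

# Join, then select the columns in their original order.
ordered = res.join(not_filtered)[df.columns]
print(ordered)
```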

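You can use regular expressions to search for column names, which might be more useful than explicitly constructing a list if you have many columns to keep. In the pandas version used here, the - operator performs set difference when used with two Index objects (df.columns is an Index); later releases removed that behavior, and df.columns.difference(filtered.columns) is the equivalent there.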

You'll probably need to call DataFrame.convert_objects() afterward unless your columns are mixed string/integer columns. My solution assumes they are all strings, so I call convert_objects() to coerce the values to int dtype.

Solution 2:

Another option is to do this the other way around, first convert to numeric:

In [11]: df.convert_objects(convert_numeric=True)
Out[11]: 
   fol  T_opp  T_Dir  T_Enh Activity
0    1      0      0    NaN       hf
1    2    NaN      0      0       hx
2    2      0      0      0       fe
3    3      0    NaN      0       rn

And then fill in the NaNs with 1:

In [12]: df.convert_objects(convert_numeric=True).fillna(1)
Out[12]: 
   fol  T_opp  T_Dir  T_Enh Activity
0    1      0      0      1       hf
1    2      1      0      0       hx
2    2      0      0      0       fe
3    3      0      1      0       rn
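
On current pandas, where convert_objects() is no longer available, roughly the same two steps can be sketched with pd.to_numeric(errors="coerce"). This is a stand-in under stated assumptions, not the original call: it is restricted to a hand-picked list of columns (cols) because coercion would otherwise turn a text column such as Activity into NaN as well, and the sample frame is rebuilt by hand.

```python
import pandas as pd

# Sample frame rebuilt from the example; object dtype allows mixed str/int cells.
df = pd.DataFrame(
    {
        "fol": [1, 2, 2, 3],
        "T_opp": ["0", "vr", "0", "0"],
        "T_Dir": ["0", "0", "0", "bt"],
        "T_Enh": ["vo", "0", "0", "0"],
        "Activity": ["hf", "hx", "fe", "rn"],
    },
    dtype=object,
)

# Columns to turn into 0/1 flags (chosen by hand for this sketch).
cols = ["T_opp", "T_Dir", "T_Enh"]

out = df.copy()
# Step 1: non-numeric strings become NaN under errors="coerce".
# Step 2: fillna(1) turns those NaNs into 1; astype(int) drops the float dtype.
out[cols] = df[cols].apply(pd.to_numeric, errors="coerce").fillna(1).astype(int)
print(out)
```

Note that a value that was already missing before the conversion would also end up as 1 with this approach.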
