pd.get_dummies DataFrame Same Size When sparse=True As When sparse=False

I have a dataframe with several string columns that I want to convert to categorical data so that I can run some models and extract important features from. However, due to the amo

Solution 1:

I looked at the pandas get_dummies source but could not spot an error so far. Below is a small experiment (the first half reproduces your problem with real data).

In [1]: import numpy as np
   ...: import pandas as pd
   ...: a = ['a', 'b'] * 100000
   ...: A = ['A', 'B'] * 100000
   ...: df1 = pd.DataFrame({'a': a, 'A': A})
   ...: df1 = pd.get_dummies(df1)
   ...: df1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Data columns (total 4 columns):
A_A    200000 non-null uint8
A_B    200000 non-null uint8
a_a    200000 non-null uint8
a_b    200000 non-null uint8
dtypes: uint8(4)
memory usage: 781.3 KB

In [2]: df2 = pd.DataFrame({'a': a, 'A': A})
   ...: df2 = pd.get_dummies(df2, sparse=True)
   ...: df2.info()
<class 'pandas.core.sparse.frame.SparseDataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Data columns (total 4 columns):
A_A    200000 non-null uint8
A_B    200000 non-null uint8
a_a    200000 non-null uint8
a_b    200000 non-null uint8
dtypes: uint8(4)
memory usage: 781.3 KB

So far the result is the same as yours (df1 and df2 have the same size), but if I explicitly convert df2 to sparse using to_sparse with fill_value=0, the reported size changes:

In [3]: df2 = df2.to_sparse(fill_value=0)
   ...: df2.info()
<class 'pandas.core.sparse.frame.SparseDataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Data columns (total 4 columns):
A_A    200000 non-null uint8
A_B    200000 non-null uint8
a_a    200000 non-null uint8
a_b    200000 non-null uint8
dtypes: uint8(4)
memory usage: 390.7 KB

Now the memory usage is halved, since half of the values are 0 and are no longer stored.
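
To check the "half of the values are 0" part directly, one can measure the fraction of zero cells in the dense dummies. This is only a minimal sketch on the same toy data; the names dense and zero_fraction are mine, and dtype=np.uint8 is passed explicitly because newer pandas versions produce bool dummies by default.

```python
import numpy as np
import pandas as pd

a = ['a', 'b'] * 100000
A = ['A', 'B'] * 100000
dense = pd.get_dummies(pd.DataFrame({'a': a, 'A': A}), dtype=np.uint8)

# Fraction of cells equal to 0; these are the cells that a sparse
# representation with fill_value=0 does not need to store.
zero_fraction = float((dense.to_numpy() == 0).mean())
print(zero_fraction)  # 0.5
```

Every row has exactly one 1 per original column, so two of the four dummy cells in each row are 0.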

In conclusion, I'm not sure why get_dummies(sparse=True) does not compress the dataframe even though it returns a SparseDataFrame, but there is a workaround: call to_sparse(fill_value=0) on the result. A related discussion took place on GitHub in the issue "get_dummies with sparse doesn't convert numeric to sparse", but the conclusion still seems to be up in the air.
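
For readers on pandas 1.0 or later, where SparseDataFrame and to_sparse were removed, a rough counterpart of the explicit conversion is to cast the dummy columns to a SparseDtype. The sketch below is not the method from the transcript above. The 100-category column cat and the variable names are illustrative assumptions, chosen so that most cells are 0. Reported sizes depend on the pandas version and on how sparse the data is, so the code compares the two sizes instead of quoting numbers.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one string column with 100 distinct values,
# so each dummy column is about 99% zeros.
cat = [f"c{i % 100}" for i in range(200000)]
df = pd.DataFrame({"cat": cat})

dense = pd.get_dummies(df, dtype=np.uint8)

# Explicitly cast every column to a sparse dtype that omits zeros.
sparse = dense.astype(pd.SparseDtype(np.uint8, fill_value=0))

dense_bytes = int(dense.memory_usage(index=False).sum())
sparse_bytes = int(sparse.memory_usage(index=False).sum())
density = float(sparse.sparse.density)  # fraction of cells actually stored

print(density)
print(dense_bytes, sparse_bytes)
```

It is worth checking the measured sizes on your own data: a sparse column also has to record where its non-zero values are, so the saving grows with the share of zeros and may disappear when the data is not sparse enough.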

