Combine Count And Percentage (normalization) In Pandas Crosstab
I know that I can have percentage values in a pandas.crosstab() when normalize=True. But I want to combine absolute and normalized values in one table. What I expect is a snipped l
Solution 1:
You can join the two resulting DataFrames and then rearrange the column index, as follows:
tab2 = taba.join(tabb, lsuffix='_n', rsuffix='_%')
tab2.columns = tab2.columns.map(lambda x: tuple(x.split('_')))
tab2 = (
    tab2.sort_index(axis=1, ascending=[True, False])
        .rename_axis(columns=['YEAR', 'count_pct'])
)
Result:
YEAR      2000           2001           2002           All
count_pct    n         %    n         %    n         %   n         %
foo
A            1  0.166667    0  0.000000    1  0.166667   2  0.333333
B            1  0.166667    1  0.166667    0  0.000000   2  0.333333
C            0  0.000000    1  0.166667    1  0.166667   2  0.333333
All          2  0.333333    2  0.333333    2  0.333333   6  1.000000
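For anyone who wants to run this end to end, here is a minimal self-contained sketch. The sample data is an assumption, reconstructed so that the counts match the table above, and the definitions of taba and tabb with margins=True are likewise assumed, since they are not shown in this answer.

```python
import pandas as pd

# Hypothetical sample data, chosen to reproduce the counts in the table above.
df = pd.DataFrame({
    "foo": ["A", "A", "B", "B", "C", "C"],
    "YEAR": [2000, 2002, 2000, 2001, 2001, 2002],
})

# Assumed definitions of the two crosstabs, each with an "All" margin.
taba = pd.crosstab(df.foo, df.YEAR, margins=True)
tabb = pd.crosstab(df.foo, df.YEAR, margins=True, normalize=True)

# Suffix the overlapping column labels, then split each label into a
# (year, kind) tuple so the columns become a two-level MultiIndex.
tab2 = taba.join(tabb, lsuffix="_n", rsuffix="_%")
tab2.columns = tab2.columns.map(lambda x: tuple(x.split("_")))

# Group by year (ascending) and put "n" before "%" (descending),
# then give the two column levels distinct names.
tab2 = (
    tab2.sort_index(axis=1, ascending=[True, False])
        .rename_axis(columns=["YEAR", "count_pct"])
)
print(tab2)
```

Note that after the suffixes are added, the year labels are strings ('2000', not 2000), so selecting a column looks like tab2[('2000', 'n')].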
Edit:
Breaking down the steps, with more explanation of .sort_index() and .rename_axis():
The interim layout of tab2 before the last step is as follows:
YEAR 2000 2001 2002 All      2000      2001      2002       All
YEAR    n    n    n   n         %         %         %         %
foo
A       1    0    1   2  0.166667  0.000000  0.166667  0.333333
B       1    1    0   2  0.166667  0.166667  0.000000  0.333333
C       0    1    1   2  0.000000  0.166667  0.166667  0.333333
All     2    2    2   6  0.333333  0.333333  0.333333  1.000000
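Both column levels carry the name YEAR in this layout because mapping a named flat Index to tuples produces a MultiIndex whose levels each inherit the original name. A small sketch of that step in isolation, with the flat labels typed in by hand as an assumption instead of being produced by the join:

```python
import pandas as pd

# Flat labels as they would look after the join; the Index name "YEAR"
# is assumed here to mirror the crosstab's column name.
flat = pd.Index(["2000_n", "2001_n", "2000_%", "2001_%"], name="YEAR")

# Mapping each label to a tuple yields a MultiIndex instead of a flat Index.
multi = flat.map(lambda label: tuple(label.split("_")))

print(type(multi).__name__)
print(list(multi.names))
print(list(multi))
```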
Here, we need 2 more fine-tunings:
- Group the columns by year, so that each n and % pair sits under the same year. We do this with sort_index(). axis=1 specifies that the sort applies to the columns instead of the row index. The ascending= parameter sets the sort order for the 2 levels of the column MultiIndex: the first True sorts the YEAR level in ascending order, while the second False sorts 'n' and '%' in descending order. Descending order is what makes 'n' appear before '%', because '%' sorts before 'n' in ascending order.
Result:
YEAR 2000           2001           2002           All
YEAR    n         %    n         %    n         %   n         %
foo
A       1  0.166667    0  0.000000    1  0.166667   2  0.333333
B       1  0.166667    1  0.166667    0  0.000000   2  0.333333
C       0  0.000000    1  0.166667    1  0.166667   2  0.333333
All     2  0.333333    2  0.333333    2  0.333333   6  1.000000
- The second fine-tuning is to change the axis name of the lower MultiIndex level from 'YEAR' to 'count_pct'. As you can see, 'YEAR' now appears twice on the left, on the first and second lines of the display. These are the axis names of the first and second levels of the column MultiIndex. We don't want both axis names to be the same, so we change the lower one with .rename_axis() to get:
YEAR      2000           2001           2002           All
count_pct    n         %    n         %    n         %   n         %
foo
A            1  0.166667    0  0.000000    1  0.166667   2  0.333333
B            1  0.166667    1  0.166667    0  0.000000   2  0.333333
C            0  0.000000    1  0.166667    1  0.166667   2  0.333333
All          2  0.333333    2  0.333333    2  0.333333   6  1.000000
The axis name of the second (lower) level of the column MultiIndex is now 'count_pct'.
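The two fine-tuning steps can also be tried on a toy frame. This is only a sketch: the frame below is hypothetical, with placeholder values, and exists purely to show how the column order and the level names change.

```python
import pandas as pd

# Toy frame in the interim layout: all "n" columns first, then all "%" columns.
# The values are placeholders, not real crosstab output.
cols = pd.MultiIndex.from_tuples(
    [("2000", "n"), ("2001", "n"), ("2000", "%"), ("2001", "%")]
)
toy = pd.DataFrame(
    [[1, 0, 0.5, 0.0], [1, 1, 0.5, 1.0]],
    index=["A", "B"],
    columns=cols,
)

# Level 0 ascending groups the years together; level 1 descending puts
# "n" before "%", since "%" sorts before "n" in ascending order.
ordered = toy.sort_index(axis=1, ascending=[True, False])

# Give the two column levels distinct names.
named = ordered.rename_axis(columns=["YEAR", "count_pct"])

print(list(named.columns))
print(list(named.columns.names))
```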
Solution 2:
Since you already calculated the two cross tabs, the simplest solution is to concatenate them into your final data frame:
import pandas as pd

# df is the DataFrame from the question
taba = pd.crosstab(df.foo, df.YEAR, dropna=False)
tabb = pd.crosstab(df.foo, df.YEAR, dropna=False, normalize=True)
tab = (
    pd.concat([taba, tabb], axis=1, keys=['n', '%'])
      .swaplevel(axis=1)
      .sort_index(axis=1, ascending=[True, False])
      .rename_axis(['YEAR', 'foo'], axis=1)
)
Resulting output is:
YEAR 2000           2001           2002
foo     n         %    n         %    n         %
foo
A       1  0.166667    0  0.000000    1  0.166667
B       1  0.166667    1  0.166667    0  0.000000
C       0  0.000000    1  0.166667    1  0.166667
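If the All margins from Solution 1 are wanted here as well, the same concat approach should work with margins=True. The following is a sketch under a few assumptions: the sample data is hypothetical, the years are converted to strings so that the 'All' label does not mix with integers when the columns are sorted, and the second level is named 'count_pct' as in Solution 1.

```python
import pandas as pd

# Hypothetical sample data, consistent with the counts shown above.
df = pd.DataFrame({
    "foo": ["A", "A", "B", "B", "C", "C"],
    "YEAR": [2000, 2002, 2000, 2001, 2001, 2002],
})

# Use string year labels so that "All" and the years are comparable when sorting.
year = df["YEAR"].astype(str)

taba = pd.crosstab(df["foo"], year, margins=True)
tabb = pd.crosstab(df["foo"], year, margins=True, normalize=True)

tab = (
    pd.concat([taba, tabb], axis=1, keys=["n", "%"])   # outer level: n / %
      .swaplevel(axis=1)                               # year becomes the outer level
      .sort_index(axis=1, ascending=[True, False])     # years ascending, n before %
      .rename_axis(columns=["YEAR", "count_pct"])
)
print(tab)
```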