Append Series To Empty Dataframe Column Always Results The Same After A Loop

January 18, 2023

import pandas as pd

df = pd.DataFrame(columns=['A', 'B'])
df2 = pd.DataFrame({'C': [5, 6, 7, 8, 9], 'D': [1, 2, 3, 4, 5]})

for i in range(5):
    df['A'] = df['A'].append(df2['C

Solution 1:

TL;DR: When a Series is assigned to a DataFrame column, the Series is conformed (reindexed) to the DataFrame's index. After the first iteration, the result of append() has more elements than the index of df, so the extra elements are dropped and the column value doesn't change.

There is nothing wrong with the append() function; the problem is in the assignment to df["A"].
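
The effect can be reproduced outside the loop. The snippet below is a minimal sketch, not the asker's code: it assumes a frame that already has five rows, and it uses pd.concat in place of Series.append, because Series.append was deprecated in pandas 1.4 and removed in pandas 2.0. The name `longer` is made up for illustration.

```python
import pandas as pd

# A frame that already has an index with labels 0..4.
df = pd.DataFrame({"A": [5, 6, 7, 8, 9]})

# Ten elements with labels 0..9 (pd.concat stands in for Series.append here).
longer = pd.concat([df["A"], pd.Series([5, 6, 7, 8, 9])], ignore_index=True)

# The assignment aligns `longer` to df.index, so labels 5..9 are discarded.
df["A"] = longer

print(len(longer), len(df))
```

The appended Series really does hold ten values; they are lost only at the moment of assignment.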

With df["A"] = xx, we are calling __setitem__():

    def __setitem__(self, key, value):
        key = com.apply_if_callable(key, self)

        # see if we can slice the rows
        indexer = convert_to_index_sliceable(self, key)
        if indexer is not None:
            # either we have a slice or we have a string that can be converted
            #  to a slice for partial-string date indexing
            return self._setitem_slice(indexer, value)

        if isinstance(key, DataFrame) or getattr(key, "ndim", None) == 2:
            self._setitem_frame(key, value)
        elif isinstance(key, (Series, np.ndarray, list, Index)):
            self._setitem_array(key, value)
        else:
            # set column
            self._set_item(key, value)

In this case we are not slicing the dataframe (as in df[:]), so indexer is None. The key is "A", which is a plain string, so neither the DataFrame branch nor the array branch matches. We therefore end up in the else branch and actually call:

self._set_item(key, value)

Let's see how _set_item() is defined:

    def _set_item(self, key, value):
        """
        Add series to DataFrame in specified column.
        If series is a numpy-array (not a Series/TimeSeries), it must be the
        same length as the DataFrames index or an error will be thrown.
        Series/TimeSeries will be conformed to the DataFrames index to
        ensure homogeneity.
        """
        self._ensure_valid_index(value)
        value = self._sanitize_column(key, value)
        NDFrame._set_item(self, key, value)

        # check if we are modifying a copy
        # try to set first as we want an invalid
        # value exception to occur first
        if len(self):
            self._check_setitem_copy()

From the docstring, we can see that a Series/TimeSeries will be conformed to the DataFrame's index to ensure homogeneity. This explains why the dataframe df doesn't change: after the first iteration, the result of append() is longer than the index of df, so the elements whose labels are not in that index are dropped.

If so, why does the assignment to df succeed in the first iteration? The answer lies in self._ensure_valid_index(value):
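
The two cases described in the docstring can be checked directly. This is a small sketch with made-up values (the three-row frame and the labels 1, 2, 3 are assumptions chosen for illustration): a Series is aligned by label, while a NumPy array of the wrong length raises an error.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [0, 0, 0]})  # index labels 0, 1, 2

# A Series is aligned by label: label 0 is missing from the Series and
# becomes NaN, label 3 is not in df.index and is dropped.
df["A"] = pd.Series([10, 20, 30], index=[1, 2, 3])
aligned = df["A"].tolist()

# A NumPy array has no labels, so a length mismatch raises ValueError.
try:
    df["A"] = np.arange(5)
    raised = False
except ValueError:
    raised = True

print(aligned, raised)
```

Note that the alignment is based on index labels, not on position, which is why "truncation" only looks positional when both indexes start at 0.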

    def _ensure_valid_index(self, value):
        """
        Ensure that if we don't have an index, that we can create one from the
        passed value.
        """

If the dataframe has no index yet, this method extends it to len(value) rows by len(columns) columns, filled with NaN values. Then, with NDFrame._set_item(self, key, value), the column key is replaced with value.

In the second example, we append to column B after column A:
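
A quick way to observe this is to assign a Series to one column of an empty frame and then inspect the other column. The sketch below reuses the column names from the question; the values are only illustrative.

```python
import pandas as pd

df = pd.DataFrame(columns=["A", "B"])  # no rows, hence no index labels yet
rows_before = len(df)

# The first assignment creates the index from the Series (labels 0..4)
# and fills the other existing columns with NaN.
df["A"] = pd.Series([5, 6, 7, 8, 9])

print(rows_before, df.shape)
print(df)
```

After this single assignment the frame has a fixed five-row index, and every later Series assignment is aligned to it.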

for i in range(5):
    df["A"] = df["A"].append(df2["C"], ignore_index=True)
    df["B"] = df["B"].append(df2["D"], ignore_index=True)

In the first iteration, after the assignment to column A, column B of df is filled with NaN. df["B"].append(df2["D"], ignore_index=True) then appends the new values after those NaN entries. When the result is assigned to df["B"], it is conformed to the DataFrame's index, which keeps only the leading NaN entries. That's why df["B"] remains NaN.

In the third example, we simply rebind the name df to the result of append(); DataFrame.__setitem__() is not involved, so no alignment takes place.
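
The second example can be replayed on current pandas versions with pd.concat standing in for Series.append (that substitution is an assumption of this sketch, since Series.append was removed in pandas 2.0). Recent pandas releases may emit a FutureWarning about concatenating empty entries; it does not affect the result shown here.

```python
import pandas as pd

df = pd.DataFrame(columns=["A", "B"])
df2 = pd.DataFrame({"C": [5, 6, 7, 8, 9], "D": [1, 2, 3, 4, 5]})

for i in range(5):
    # Iteration 1: df is empty, so this creates index 0..4 and fills B with NaN.
    # Later iterations: the 10-element result is cut back to labels 0..4.
    df["A"] = pd.concat([df["A"], df2["C"]], ignore_index=True)
    # B already holds five NaN values; the appended values get labels 5..9
    # and are dropped when the result is aligned to df.index.
    df["B"] = pd.concat([df["B"], df2["D"]], ignore_index=True)

print(df)
```

Column A keeps the five values from the first iteration, and column B never receives anything other than NaN.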

for i in range(5):
    df = df.append(df2, ignore_index=True)
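
DataFrame.append was also deprecated in pandas 1.4 and removed in pandas 2.0, so on a current installation the third example would need a replacement. A hedged equivalent, assuming pd.concat is an acceptable substitute, looks like this:

```python
import pandas as pd

df = pd.DataFrame(columns=["A", "B"])
df2 = pd.DataFrame({"C": [5, 6, 7, 8, 9], "D": [1, 2, 3, 4, 5]})

for i in range(5):
    # Rebinding df to a new object: no column assignment, so no alignment.
    df = pd.concat([df, df2], ignore_index=True)

print(df.shape)
```

Because df2 has columns C and D rather than A and B, the result gains those two columns and grows by five rows per iteration, while A and B stay NaN.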
