
Pyspark Hive Context -- Read Table With Utf-8 Encoding

I have a table in Hive, and I am reading that table in PySpark into df_sprk_df:

from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext()
hive_context = Hi

Solution 1:

This workaround solved it: change the default encoding for the session (Python 2 only; reload() and sys.setdefaultencoding() no longer exist in Python 3):

import sys
reload(sys)
sys.setdefaultencoding('UTF-8')

and then

df_pandas_df = df_pandas_df.astype(str)

which casts the whole DataFrame to strings.
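On Python 3 the setdefaultencoding trick is unavailable (and unnecessary, since strings are Unicode by default), but the astype(str) cast still works the same way. A minimal sketch, using a small hypothetical DataFrame standing in for the result of df_sprk_df.toPandas():

```python
import pandas as pd

# Hypothetical sample standing in for df_sprk_df.toPandas()
df_pandas_df = pd.DataFrame({"name": ["café", "naïve"], "id": [1, 2]})

# Python 3 strings are Unicode already, so no setdefaultencoding is
# needed; casting every column to str still normalizes mixed columns:
df_pandas_df = df_pandas_df.astype(str)

print(df_pandas_df.dtypes)
```

After the cast every column has dtype object and every cell is a plain str, including the former integer column.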

Solution 2:

Instead of casting directly to string, try to infer the types of the pandas DataFrame using the following statement:

df_pandas_df.apply(lambda x: pd.lib.infer_dtype(x.values))
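Note that pd.lib.infer_dtype has since moved; in current pandas the same helper lives at pandas.api.types.infer_dtype. A short sketch with a hypothetical two-column DataFrame:

```python
import pandas as pd
from pandas.api.types import infer_dtype  # modern home of pd.lib.infer_dtype

# Hypothetical stand-in for df_pandas_df
df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

# Report the inferred element type of each column
inferred = df.apply(lambda col: infer_dtype(col.values))
print(inferred)
```

This tells you which columns actually hold strings versus numbers before you decide what to re-encode.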

Update: try performing the mapping without invoking the .str accessor.

Maybe something like below:

for cols in df_pandas_df.columns:
    # unicode() is a Python 2 built-in (removed in Python 3)
    df_pandas_df[cols] = df_pandas_df[cols].apply(lambda x: unicode(x, errors='ignore'))
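Since unicode() does not exist in Python 3, a rough equivalent of unicode(x, errors='ignore') is decoding raw bytes with errors="ignore". A hedged sketch, with a hypothetical helper name and sample data:

```python
import pandas as pd

def to_text(value):
    # Python 3 stand-in for unicode(x, errors='ignore'):
    # decode raw bytes as UTF-8, silently dropping undecodable sequences.
    if isinstance(value, bytes):
        return value.decode("utf-8", errors="ignore")
    return str(value)

# Hypothetical DataFrame with one valid and one corrupted UTF-8 value
df = pd.DataFrame({"name": [b"caf\xc3\xa9", b"bad\xffbyte"]})
for col in df.columns:
    df[col] = df[col].apply(to_text)

print(df["name"].tolist())  # the invalid 0xff byte is dropped
```

errors="ignore" trades data loss for robustness, just like the original answer; errors="replace" would keep a visible marker for each bad byte instead.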
