Skip to content Skip to sidebar Skip to footer

Pandas To_json() Redundant Backslashes

I have a '.csv' file containing data about movies and I'm trying to reformat it as a JSON file to use it in MongoDB. So I loaded that csv file to a pandas DataFrame and then used t

Solution 1:

I had the same issue : the solution is in 3 steps

1- Data-frame form csv or in my case from xlsx:

excel_df= pd.read_excel(dataset ,sheet_name=my_sheet_name)

2- convert to json (if you have date in your data)

json_str = excel_df.to_json(orient='records' ,date_format='iso')

3-The most important thing : json.loads **** this is it !

parsed = json.loads(json_str)

4- (facultative) you can write or send the json file : for example : write locally

withopen(out, 'w') as json_file:
    json_file.write(json.dumps({"data": parsed}, indent=4))

more info : https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_json.html

Solution 2:

Pandas is escaping the " character because it thinks the values in the json columns are text. To get the desired behaviour, simply parse the values in the json column as json.

let the file data.csv have the following content (with quotes escaped).

# data.csv
movie_id,title,cast
19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""credit_id"": ""5602a8a7c3a3685532001c9a"", ""gender"": 2, ""id"": 65731, ""name"": ""Sam Worthington"", ""order"": 0}, {""cast_id"": 3, ""character"": ""Neytiri"", ""credit_id"": ""52fe48009251416c750ac9cb"", ""gender"": 1, ""id"": 8691, ""name"": ""Zoe Saldana"", ""order"": 1}, {""cast_id"": 25, ""character"": ""Dr. Grace Augustine"", ""credit_id"": ""52fe48009251416c750aca39"", ""gender"": 1, ""id"": 10205, ""name"": ""Sigourney Weaver"", ""order"": 2}, {""cast_id"": 4, ""character"": ""Col. Quaritch"", ""credit_id"": ""52fe48009251416c750ac9cf"", ""gender"": 2, ""id"": 32747, ""name"": ""Stephen Lang"", ""order"": 3}]"

read this into a dataframe, then apply the json.loads function & write out to a file as json.

df = pd.read_csv('data.csv')
df.cast = df.cast.apply(json.loads)
df.to_json('data.json', orient='records', lines=True)

The output is a properly formatted json (extra newlines added by me)

# data.json
{"movie_id":19995,
 "title":"Avatar",
 "cast":[{"cast_id":242,"character":"Jake Sully","credit_id":"5602a8a7c3a3685532001c9a","gender":2,"id":65731,"name":"Sam Worthington","order":0},
         {"cast_id":3,"character":"Neytiri","credit_id":"52fe48009251416c750ac9cb","gender":1,"id":8691,"name":"Zoe Saldana","order":1},
         {"cast_id":25,"character":"Dr. Grace Augustine","credit_id":"52fe48009251416c750aca39","gender":1,"id":10205,"name":"Sigourney Weaver","order":2},
         {"cast_id":4,"character":"Col. Quaritch","credit_id":"52fe48009251416c750ac9cf","gender":2,"id":32747,"name":"Stephen Lang","order":3}]
}

Post a Comment for "Pandas To_json() Redundant Backslashes"