Skip to content Skip to sidebar Skip to footer

How To Optimize Time While Converting List To Dataframe?(Part II)

I didn't get any proper answers to my previous question: How to optimize time while converting list to dataframe? Let me explain the example more: Let's consider the data frame mor

Solution 1:

Use:

print (df)
  First Name Last Name Country     Address  Age Age-Group Photo1 Photo2  \
0       Mark    Shelby      US  Petersburg   42     Adult  1.jpg  2.jpg   
1       Andy    Carnot      GE    Freiburg   16      Teen  1.jpg    NaN   

  Phototype  
0        PP  
1        PP

First is defined dictionary for keys by first value in final list with all columns strating with strings:

d = {'N':['First Name','Last Name', 'Country'],
     'AG':['Age','Age-Group'],
     'AD':['Address'],
     'PH':['Photo','Phototype']}

Then filter DataFrames by lists from dictionary:

out = {k: df.loc[:, df.columns.str.startswith(tuple(v))] for k, v in d.items()}

For PH is necessary melting for change format:

out['PH'] = (out['PH'].melt('Phototype', 
                           value_name='Photo',
                           ignore_index=False)
                      .drop('variable',1)[['Photo','Phototype']]
                      .dropna(subset=['Photo']))

Last create same columns and join by concat with sorting for correct ordering:

out = {k: v.set_axis(range(len(v.columns)), axis=1) for k, v in out.items()}

df = pd.concat(out).sort_index(level=1,sort_remaining=False).reset_index(level=0).fillna('')
print (df)
  level_0           0       1   2
0       N        Mark  Shelby  US
0      AG          42   Adult    
0      AD  Petersburg            
0      PH       1.jpg      PP    
0      PH       2.jpg      PP    
1       N        Andy  Carnot  GE
1      AG          16    Teen    
1      AD    Freiburg            
1      PH       1.jpg      PP   

Last create lists with different lengths by remove empty strings:

fin = [x[x!= ''].tolist() for x in df.to_numpy() ]
print (fin)
[['N', 'Mark', 'Shelby', 'US'],
 ['AG', 42, 'Adult'],
 ['AD', 'Petersburg'], 
 ['PH', '1.jpg', 'PP'], 
 ['PH', '2.jpg', 'PP'], 
 ['N', 'Andy', 'Carnot', 'GE'], 
 ['AG', 16, 'Teen'], 
 ['AD', 'Freiburg'], 
 ['PH', '1.jpg', 'PP']]

EDIT: For match Photo with digits is use regex, so instead startswith is used contains with joined values of lists by | for regex OR:

d = {'N':['First Name','Last Name', 'Country'],
     'AG':['Age','Age-Group'],
     'AD':['Address'],
     'PH':['Photo\d+','Phototype']}

out = {k: df.loc[:, df.columns.str.contains('|'.join(v))] for k, v in d.items()}
print (out)
{'N':   First Name Last Name Country
0       Mark    Shelby      US
1       Andy    Carnot      GE, 'AG':    Age Age-Group
0   42     Adult
1   16      Teen, 'AD':       Address
0  Petersburg
1    Freiburg, 'PH':   Photo1 Photo2 Phototype
0  1.jpg  2.jpg        PP
1  1.jpg    NaN        PP}

EDIT: Trick is add ^ to start of strings and $ to end of string for exact match values, then is necessary for correct working Photo+ 'digit':

print (df)
  First Name Last Name Country     Address  Age Age-Group Photo1 Photo2  \
0       Mark    Shelby      US  Petersburg   42     Adult  1.jpg  2.jpg   
1       Andy    Carnot      GE    Freiburg   16      Teen  1.jpg    NaN   

  Phototype Age Detail Address Detail  
0        PP      Young            Far  
1        PP  Too Young           Near  


d = {'N':['First Name','Last Name', 'Country'],
     'AG':['Age','Age-Group'],
     'AD':['Address'],
     'PH':['Photo\d+','Phototype']}

d = {k: [rf'^{x}$' for x in v] for k, v in d.items()}
print (d)
{'N': ['^First Name$', '^Last Name$', '^Country$'], 
 'AG': ['^Age$', '^Age-Group$'], 
 'AD': ['^Address$'], 
 'PH': ['^Photo\\d+$', '^Phototype$']}

out = {k: df.loc[:, df.columns.str.contains('|'.join(v))] for k, v in d.items()}

print (out['AG'])
   Age Age-Group
0   42     Adult
1   16      Teen

print (out['AD'])
      Address
0  Petersburg
1    Freiburg

Post a Comment for "How To Optimize Time While Converting List To Dataframe?(Part II)"