
Pandas Scraped Data Not Working In Pandas

Why is it that when I enter data manually into an Excel file, pandas works, yet when I scrape data and put it into a CSV, it gives me an error on this line: zz = df1.WE=np.where(df3.AL.isin(df1.EW),df1.WE,np.nan)

Solution 1:

I think you need to change:

df1.WE=np.where(df3.AL.isin(df1.EW),df1.WE,np.nan)

to

df1.WE=np.where(df1.EW.isin(df2.AL),df1.WE,np.nan)

The problem is the different lengths of the DataFrames with real data. So you need to build the mask from df1 itself when comparing against the other data - the comparison then returns a mask with the same length as df1, and there is no error.
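The length mismatch can be reproduced with small stand-in frames (the data below is hypothetical, just mimicking the shape of the question's df1/df2):

```python
import numpy as np
import pandas as pd

# tiny stand-in frames; like the real data, df1 is longer than df2
df1 = pd.DataFrame({'EW': ['a', 'b', 'c', 'd', 'e'], 'WE': [1, 2, 3, 4, 5]})
df2 = pd.DataFrame({'AL': ['a', 'c']})

# wrong direction: the mask has len(df2) == 2 but df1 has 5 rows,
# so np.where cannot broadcast the arguments and raises ValueError
try:
    np.where(df2.AL.isin(df1.EW), df1.WE, np.nan)
except ValueError as e:
    print('error:', e)

# right direction: the mask is built from df1, so it has len(df1) == 5
df1.WE = np.where(df1.EW.isin(df2.AL), df1.WE, np.nan)
print(df1)
```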

With your data:

df1 = pd.read_csv('df1.csv', names=['a','b','c'])
print (df1.head())
                                           a     b  \
0             Ponte Preta U20 v Cruzeiro U20  2.10   
1  Fluminense RJ U20 v Defensor Sporting U20  2.00   
2              Gremio RS U20 v Palmeiras U20  3.30   
3                       Barcelona v Sporting  1.33   
4                        Bayern Munich v PSG  2.40   

                                                   c  
0  https://www.bet365.com.au/#/AC/B1/C1/D13/E40/F...  
1  https://www.bet365.com.au/#/AC/B1/C1/D13/E40/F...  
2  https://www.bet365.com.au/#/AC/B1/C1/D13/E40/F...  
3  https://www.bet365.com.au/#/AC/B1/C1/D13/E40/F...  
4  https://www.bet365.com.au/#/AC/B1/C1/D13/E40/F...  

df2 = pd.read_csv('df2.csv', names=['a','b','c', 'd', 'e'])
print (df2.head())
                 a                    b                  c     d  \
0          In-Play      CSKA Moscow U19        Man Utd U19  1.14
1          In-Play  Atletico Madrid U19        Chelsea U19  1.01
2          In-Play         Juventus U19     Olympiakos U19  1.40
3  Starting in22'       Paris St-G U19  Bayern Munich U19  2.24
4      Today 21:00         Man City U19       Shakhtar U19  2.66

                                                   e
0  https://www.betfair.com.au/exchange/plus/footb...
1  https://www.betfair.com.au/exchange/plus/footb...
2  https://www.betfair.com.au/exchange/plus/footb...
3  https://www.betfair.com.au/exchange/plus/footb...
4  https://www.betfair.com.au/exchange/plus/footb...

Compare the numeric columns, here b and d:

df1.b=np.where(df1.b.isin(df2.d),df1.b,np.nan) # first 5 values are NaNs
print (df1.head())
                                           a   b  \
0             Ponte Preta U20 v Cruzeiro U20 NaN
1  Fluminense RJ U20 v Defensor Sporting U20 NaN
2              Gremio RS U20 v Palmeiras U20 NaN
3                       Barcelona v Sporting NaN
4                        Bayern Munich v PSG NaN

                                                   c  
0  https://www.bet365.com.au/#/AC/B1/C1/D13/E40/F...  
1  https://www.bet365.com.au/#/AC/B1/C1/D13/E40/F...  
2  https://www.bet365.com.au/#/AC/B1/C1/D13/E40/F...  
3  https://www.bet365.com.au/#/AC/B1/C1/D13/E40/F...  
4  https://www.bet365.com.au/#/AC/B1/C1/D13/E40/F...  

Check for non-NaN values in column b:

print (df1[df1.b.notnull()])
                                       a      b  \
23                Swindon v Forest Green   1.40   
50       Sportivo Barracas v Canuelas FC  13.00   
80                              FC Nitra   1.53   
81                                   0-0   1.40   
83       Cape Town City v Maritzburg Utd   1.53   
84         Mamelodi Sundowns v Baroka FC   3.75   
90  Dorking Wanderers v Tonbridge Angels   1.53   
95             Coalville Town v Stamford   1.40   

                                                    c  
23  https://www.bet365.com.au/#/AC/B1/C1/D13/E40/F...  
50  https://www.bet365.com.au/#/AC/B1/C1/D13/E40/F...  
80  https://www.bet365.com.au/#/AC/B1/C1/D13/E40/F...  
81  https://www.bet365.com.au/#/AC/B1/C1/D13/E40/F...  
83  https://www.bet365.com.au/#/AC/B1/C1/D13/E40/F...  
84  https://www.bet365.com.au/#/AC/B1/C1/D13/E40/F...  
90  https://www.bet365.com.au/#/AC/B1/C1/D13/E40/F...  
95  https://www.bet365.com.au/#/AC/B1/C1/D13/E40/F...  

Also, the problem with your test data is that both DataFrames have the same number of rows (4), so no error occurs.
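That is why the test data hid the bug: with equal lengths, even the "wrong" comparison broadcasts without error, it just produces a mask aligned to the wrong frame. A sketch with hypothetical 4-row frames:

```python
import numpy as np
import pandas as pd

# equal-length toy frames, mimicking the 4-row test data
df1 = pd.DataFrame({'EW': ['a', 'b', 'c', 'd'], 'WE': [1, 2, 3, 4]})
df2 = pd.DataFrame({'AL': ['a', 'x', 'c', 'y']})

# mask built from df2, i.e. the "wrong" direction - but since both
# frames have 4 rows, broadcasting succeeds and no error is raised,
# even though the mask is aligned to df2's rows, not df1's
out = np.where(df2.AL.isin(df1.EW), df1.WE, np.nan)
print(out)  # [ 1. nan  3. nan]
```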

Solution 2:

On a side note, I'd recommend using pandas functions with pandas:

df1.loc[~df1.EW.isin(df2.AL), 'WE'] = np.nan
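A minimal usage sketch of this `.loc` approach, with hypothetical stand-in data (not the question's real columns): the boolean mask is built from df1's own index, so its length always matches df1.

```python
import numpy as np
import pandas as pd

# tiny stand-in frames
df1 = pd.DataFrame({'EW': ['a', 'b', 'c'], 'WE': [1.0, 2.0, 3.0]})
df2 = pd.DataFrame({'AL': ['a']})

# blank out WE wherever EW has no match in df2.AL; since the mask
# comes from df1.EW, it is index-aligned with df1 and lengths agree
df1.loc[~df1.EW.isin(df2.AL), 'WE'] = np.nan
print(df1)
```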

Solution 3:

OK, let's get back to the drawing board. The code above is cleaner, but it does exactly the same thing you're doing with numpy. Let's take your code apart.

1) I highly recommend using Jupyter notebooks to play with the data and understand what is going on at each line. Take a look here, for example: https://gist.github.com/Casyfill/f432966ebabd93f4271e27a1e2e76579

So, your df1 has 100 rows and 3 columns; your df2 has 42 rows and 5 columns.

Now, you create df3 as an empty DataFrame (0 rows) but with 12 columns (by the way, perhaps you should use more descriptive column names). This step is totally fine, though you don't have to define all the columns beforehand.

Lets go to the second line: df3['DAT'] = df2['AA']

Here you basically copy the column from the second DataFrame. Since df3 had no rows before, this is a totally legitimate operation; by doing it, you create 42 rows in df3. Again, this line by itself is fine.
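The row-creating side effect of that assignment can be seen with a toy example (hypothetical data and column names): assigning a Series to a DataFrame that has no rows makes pandas adopt the Series' index.

```python
import pandas as pd

# toy stand-in for the question's df2['AA']
df2 = pd.DataFrame({'AA': ['x', 'y', 'z']})

# an empty frame with some predefined columns, like the question's df3
df3 = pd.DataFrame(columns=['AL', 'WE'])
print(len(df3))        # 0 rows

# assigning a Series to a frame with no rows makes pandas adopt the
# Series' index, so df3 suddenly has as many rows as df2
df3['DAT'] = df2['AA']
print(len(df3))        # 3 rows; the predefined columns are all NaN
```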

Now, the last line. The logic here is the following: first, for each row in df3, we check whether the value of the df3.AL cell is in the df1.EW column. Just note that we never defined df3.AL before, so the whole column contains only NaNs; therefore this check by itself does not make any sense.
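To see why the check is pointless on an undefined column: NaN never matches anything in isin, so the mask comes out all False (toy data below is hypothetical):

```python
import numpy as np
import pandas as pd

# a column that was never filled in contains only NaNs
df3 = pd.DataFrame({'AL': [np.nan, np.nan, np.nan]})

# NaN never matches anything, so isin returns all False
mask = df3.AL.isin(['CSKA Moscow U19', 'Man Utd U19'])
print(mask.tolist())  # [False, False, False]
```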

Next, let's assume there is something in df3.AL. As we check everything row-wise, we get a pd.Series (think: one column) of booleans as the result of this test - a column with 42 rows. Now we're trying to use this column as a "mask" that defines whether df1.WE should stay the same or default to NaN. But you can't do that, because df1 has 100 rows, not 42! Hence, we get an error.

So you need to rethink what you actually want to do here - it's not clear what you're trying to achieve.
