
Unable To Retrieve Data From Macro Trends Using Selenium And Read_html To Create A Data Frame?

I want to import data from Macro Trends into a pandas DataFrame. From looking at the page source of the website, it appears that the data is in a jqxgrid. I have tried using pandas/BeautifulSoup

Solution 1:

Here's an alternative that's quicker than Selenium and keeps the headers as shown on the page. The data is embedded as JSON in a JavaScript variable (`originalData`) in the page source, so it can be extracted with a regex and parsed directly.

import requests
from bs4 import BeautifulSoup as bs
import re
import json
import pandas as pd

r = requests.get('https://www.macrotrends.net/stocks/charts/AMZN/amazon/income-statement?freq=Q')
p = re.compile(r' var originalData = (.*?);\r\n\r\n\r',re.DOTALL)
data = json.loads(p.findall(r.text)[0])
headers = list(data[0].keys())
headers.remove('popup_icon')
result = []

for row in data:
    # The field name is wrapped in an <a> or <span> tag; parse the label out.
    soup = bs(row['field_name'], 'html.parser')
    field_name = soup.select_one('a, span').text
    # Skip the first two values (the field_name HTML and the popup_icon).
    fields = list(row.values())[2:]
    fields.insert(0, field_name)
    result.append(fields)

df = pd.DataFrame(result, columns=headers)
# option_context is a context manager; it only takes effect inside a with block.
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(df.head())
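A possible follow-up, not part of the original answer: the scraped cells are strings like "$4,135" or "$-6", so they need cleaning before any numeric work. A minimal sketch on a made-up two-row frame:

```python
import pandas as pd

# Hypothetical sample mimicking the scraped output.
df = pd.DataFrame({
    'field_name': ['Revenue', 'Gross Profit'],
    '2007-12-31': ['$5,672', '$1,171'],
    '2007-09-30': ['$3,262', '$762'],
})

value_cols = df.columns.drop('field_name')
# Strip the "$" and thousands separators, then cast to numbers.
df[value_cols] = (df[value_cols]
                  .replace({r'\$': '', ',': ''}, regex=True)
                  .apply(pd.to_numeric, errors='coerce'))
print(df)
```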



Solution 2:

The problem is that the data is not in a table but in 'div' elements. I'm not an expert on pandas, but you can do it with BeautifulSoup.

Insert this line after your other imports:

from bs4 import BeautifulSoup

then replace your last line with:

soup = BeautifulSoup(grid.get_attribute('outerHTML'), "html.parser")
divList = soup.findAll('div', {'role': 'row'})
data = [[x.text for x in div.findChildren('div', recursive=False)] for div in divList]
df = pd.DataFrame(data)
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(df)

This finds all the 'div' elements with the attribute role='row', then reads the text of each 'div' found directly under those row elements. It descends only one level (recursive=False), because some cells contain nested 'div' elements that would otherwise produce extra columns.
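The effect of recursive=False can be seen on a tiny hypothetical snippet (the markup below is illustrative, not copied from the site):

```python
from bs4 import BeautifulSoup

# A row cell containing a nested div, as on the jqxgrid page.
html = '<div role="row"><div>Revenue<div class="icon"></div></div><div>$4,135</div></div>'
soup = BeautifulSoup(html, 'html.parser')
rows = soup.findAll('div', {'role': 'row'})

# Only direct children: two cells per row.
direct = [[x.text for x in div.findChildren('div', recursive=False)] for div in rows]
# All descendants: the nested div leaks in as a spurious third "cell".
nested = [[x.text for x in div.findChildren('div')] for div in rows]

print(direct)  # [['Revenue', '$4,135']]
print(nested)  # [['Revenue', '', '$4,135']]
```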

Output:

                                     0  1       2       3       4       5   \
0                               Revenue     $4,135  $5,672  $3,262  $2,886   
1                    Cost Of Goods Sold     $3,179  $4,501  $2,500  $2,185   
2                          Gross Profit       $956  $1,171    $762    $701   
3     Research And Development Expenses       $234    $222    $209    $201   
4                         SG&A Expenses       $518    $675    $427    $381   
5    Other Operating Income Or Expenses        $-6     $-3     $-3     $-3   
...
        6       7       8       9       10      11      12      13      14  
0   $3,015  $3,986  $2,307  $2,139  $2,279  $2,977  $1,858  $1,753  $1,902  
1   $2,296  $3,135  $1,758  $1,630  $1,732  $2,309  $1,395  $1,303  $1,444  
2     $719    $851    $549    $509    $547    $668    $463    $450    $458  
3     $186    $177    $172    $167    $146    $132    $121    $106     $92  
4     $388    $476    $335    $292    $292    $367    $247    $238    $257  
5        -     $-2     $-2     $-3     $-3     $-4    $-40     $-2     $-1  
...

However, as you scroll across the page, the items on the left are removed from the page source, so not all the data is scraped in one pass.

Updated in response to comment. To set the column headers, use:

import time

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver import ActionChains

driver = webdriver.Chrome()
driver.maximize_window()
driver.get('https://www.macrotrends.net/stocks/charts/AMZN/amazon/income-statement?freq=Q')
driver.implicitly_wait(2)

# find_element_by_id was removed in Selenium 4; use find_element(By.ID, ...)
grid = driver.find_element(By.ID, 'wrapperjqxgrid')
time.sleep(1)
driver.execute_script("window.scrollBy(0, 600);")
scrollbar = driver.find_element(By.ID, 'jqxScrollThumbhorizontalScrollBarjqxgrid')

time.sleep(1)

actions = ActionChains(driver)
time.sleep(1)

for i in range(1, 6):
    actions.drag_and_drop_by_offset(scrollbar, i * 70, 0).perform()
    time.sleep(1)


soup = BeautifulSoup(grid.get_attribute('outerHTML'), "html.parser")
headersList = soup.findAll('div', {'role': 'columnheader'})
col_names=[h.text for h in headersList]
divList = soup.findAll('div', {'role': 'row'})
data = [[x.text for x in div.findChildren('div', recursive=False)] for div in divList]
df = pd.DataFrame(data, columns=col_names)
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(df)

Outputs:

   Quarterly Data | Millions of US $ except per share data   2008-03-31  \
0                                             Revenue            $4,135
1                                  Cost Of Goods Sold            $3,179
2                                        Gross Profit              $956
...
   2007-12-31 2007-09-30 2007-06-30 2007-03-31 2006-12-31 2006-09-30  \
0      $5,672     $3,262     $2,886     $3,015     $3,986     $2,307
1      $4,501     $2,500     $2,185     $2,296     $3,135     $1,758
...
   2006-06-30 2006-03-31 2005-12-31 2005-09-30 2005-06-30 2005-03-31  
0      $2,139     $2,279     $2,977     $1,858     $1,753     $1,902
1      $1,630     $1,732     $2,309     $1,395     $1,303     $1,444
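One way to take this further (a sketch, not from the original answers): once the headers are attached, the wide table can be melted into long form, which is handier for time-series plotting. The header label and values below are made-up stand-ins for the scraped frame.

```python
import pandas as pd

label = 'Quarterly Data | Millions of US $ except per share data'
df = pd.DataFrame({
    label: ['Revenue', 'Gross Profit'],
    '2008-03-31': ['$4,135', '$956'],
    '2007-12-31': ['$5,672', '$1,171'],
})

# Wide -> long: one row per (metric, quarter) pair.
long_df = df.melt(id_vars=label, var_name='quarter', value_name='value')
# Clean the dollar strings and parse the quarter dates.
long_df['value'] = pd.to_numeric(
    long_df['value'].str.replace(r'[$,]', '', regex=True), errors='coerce')
long_df['quarter'] = pd.to_datetime(long_df['quarter'])
print(long_df)
```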
