How Can I Scrape The Data From In Between These Span Tags?

December 27, 2023

I am attempting to scrape the figures shown on https://www.usdebtclock.org/world-debt-clock.html; however, because the numbers are constantly changing, I am unaware of how to collect thi

Solution 1:

BeautifulSoup cannot do this on its own: the value you need is written into the page by JavaScript, and BeautifulSoup only parses the HTML it is given. It does not execute scripts.

An alternative is to use a tool such as Selenium WebDriver:

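from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
try:
    driver.get('https://www.usdebtclock.org/world-debt-clock.html')
    # find_element_by_xpath was removed in Selenium 4.3; use find_element with a By locator.
    elem2 = driver.find_element(By.XPATH, '//span[@id="X4a79R9BW"]')
    print(elem2.text)
finally:
    # quit() ends the whole browser session, even if an error occurred above.
    driver.quit()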

If you have not used Selenium WebDriver before, you need to follow its installation instructions first.

In particular, you will need to download the browser driver of your choice (I use geckodriver for Firefox) and make sure the executable is on your PATH.
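Because the span is filled in by a script after the page loads, reading it immediately can return only whitespace. The following is a minimal sketch of waiting for the text with Selenium's WebDriverWait; the function name read_span_text and the 10-second default timeout are assumptions made for illustration, and the span id may differ if the site changes its markup.

```python
def read_span_text(url, span_id, timeout=10):
    """Open url in Firefox and return the span's text once it is non-empty."""
    # Imported inside the function so it can be defined even where
    # Selenium is not installed; a browser driver is still needed to call it.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Firefox()
    try:
        driver.get(url)
        # until() polls the callable and returns its first truthy result,
        # here the stripped text once the script has written a value.
        # It raises TimeoutException if nothing appears within `timeout` seconds.
        return WebDriverWait(driver, timeout).until(
            lambda d: d.find_element(By.ID, span_id).text.strip()
        )
    finally:
        # Always end the browser session, even on a timeout.
        driver.quit()
```

Calling read_span_text('https://www.usdebtclock.org/world-debt-clock.html', 'X4a79R9BW') should then return the figure as it stood when the wait finished; call it again to get a newer reading.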

(I expect there are other Python-based alternatives, also.)

Solution 2:

Based on the page's code, I think what you want to accomplish may not be possible with BS. Running your code returned [<span id="X4a79R9BW"> </span>], and calling getText() on that returned nothing. When inspecting the page, I noticed that the numerical value in the span was continuously updating, just as it does on the page.

Viewing the page source showed that X4a79R9BW appears in five places: first where aspects of the font are set, then in several places where an equation is processed, and last in the empty span scraped by your code. From the source, it appears that the counter is an equation running inside a <script type="text/javascript"> tag. Here is what I think is the equation running inside that script tag:

{'leftMargin':0,'color':-16751104,:0 */ var X3a34729DW = /*144,:14 */ 96.9230013 /*751104,:0 */; var R3a45G7S = /*7104,:54 */ 0.000000306947 /*43,451134,:5 */; var Y12 = /*241,:15457 */ 18442.16666 /*19601*2*2*/*21600*2*2; /*79301*2*2*/ var Class = new Date(); var Method = Class.getTime() / 1000 - Y12a4798; var Public = X3a34729DW + Method * R3a45G7S; var Assign = FormatNumber2(Public); document.getElementById('X3a34729DW').firstChild.nodeValue = Assign; /*'advance':4289}

This section of the page's source indicates that the text you want is computed and continuously updated by JavaScript, so it never appears in the HTML that BS parses. Given that, BS on its own is not the appropriate library for this task. Though I have not used it myself, I have seen Selenium suggested for scraping pages that are updated dynamically via JavaScript. Good luck; perhaps someone else can provide a clearer path forward.
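Since the counter is plain arithmetic on the current time, the quoted script could in principle be reproduced without a browser. The following is a minimal sketch, assuming that the constants in the quoted snippet are still current, that the undefined Y12a4798 in the quote refers to the same value as Y12 (a day count multiplied by 86,400 seconds), and that the result is in whatever unit the page displays. The function name estimate_counter is hypothetical, and the site's variable names and constants may change at any time.

```python
import time

# 86,400 seconds, written the way the quoted script writes it.
SECONDS_PER_DAY = 21600 * 2 * 2


def estimate_counter(base, rate_per_second, epoch_days, now=None):
    """Mirror the quoted script: base + (now - epoch) * rate."""
    if now is None:
        # Seconds since the Unix epoch, like Class.getTime() / 1000 in the script.
        now = time.time()
    epoch_seconds = epoch_days * SECONDS_PER_DAY
    elapsed = now - epoch_seconds
    return base + elapsed * rate_per_second


# Constants copied from the quoted snippet (assumed, not verified against the live page).
BASE = 96.9230013
RATE = 0.000000306947
EPOCH_DAYS = 18442.16666

print(estimate_counter(BASE, RATE, EPOCH_DAYS))
```

This skips the FormatNumber2 step, whose definition is not shown in the quoted source, so the printed value is the raw number rather than the formatted figure on the page.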
