
How Do I Make My Web Scraping Script More Robust?

I wrote a script to scrape the Santander website. The scraping seems to work, except that I get incorrect results, and when I run the code twice in a row, the results change. How could I make my script more robust?

Solution 1:

That data comes from an XHR request, so just use requests to post your values and parse the response with json.loads.

Use your browser's network tab to see what the request looks like.
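As a rough sketch (the endpoint URL and payload fields below are made up; copy the real ones from the request your browser shows):

import json

import requests

# Hypothetical endpoint and form values; replace them with the actual
# request details from your browser's network tab
url = "https://www.santander.example/simulator/api"
payload = {"amount": 13000, "term": 12}

response = requests.post(url, data=payload)
result = json.loads(response.text)  # equivalently: response.json()
print(result)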

Solution 2:

This is my time to shine!

Information:

I'm currently working on a financial data aggregator that was facing this exact same problem.

It collects data from about a dozen websites and organizes it into a JSON object that is then used by a Flask site to display the data.

The data is scraped from websites whose sub-directories contain similar content but use different selectors.

As you can imagine, with a framework like Selenium this becomes very complex, so the only solution is to dumb it down.

Answer:

Simplicity is key, so I removed every dependency except for the BeautifulSoup and requests libraries.

Then I created three classes and a function for each filter:

from bs4 import BeautifulSoup

class GET:
  def text(soup, selector, index = 0):
    # Return the stripped text of the match at `index` for the given
    # selector, or None if there are not enough matches
    selected = soup.select(selector)
    if len(selected) > index:
      return selected[index].text.strip()

class Parse:
  def common(soup, selector):
    # Shared parsing rule: take the sixth match for the selector
    return GET.text(soup, selector, index = 5)

class Routes:
  def main(self):
    data = {}
    if self.is_dir_1:
      data["name"] = GET.text(self.soup, "div")
      data["title-data"] = Parse.common(self.soup, "p > div:nth-child(1)")
    elif self.is_dir_2:
      data["name"] = GET.text(self.soup, "p", index = 2)
      data["title-data"] = Parse.common(self.soup, "p > div:nth-child(5)")
    return data

def filter_name(url: str, response: str, filter_type: str):
  # Dispatch to the Routes method named by filter_type, handing it the
  # parsed page plus flags describing which sub-directory the URL is in
  if hasattr(Routes, filter_type):
    return getattr(Routes, filter_type)(to_object({
      "is_dir_1": "/sub_dir_1/" in url,
      "is_dir_2": "/sub_dir_2/" in url,
      "soup": BeautifulSoup(response, "lxml")
    }))
  return {}

Using the requests library, I made the request that fetched the data, then passed the URL, response text, and filter_type to the filter_name function.

In the filter_name function, I used the filter_type argument to pass the "soup" to the target route function, where each element is selected and its data extracted.

In the target route function, I used an if condition to determine the sub-directory and assigned the text to a data object.

Once all of this was complete, I returned the data object.
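Putting it together, the call site looks roughly like this (the URL and the "main" filter_type are placeholder values, and it assumes the classes above plus the to_object helper shown further down):

import requests

url = "https://example.com/sub_dir_1/page"
response = requests.get(url)

# "main" names the Routes method that handles this site
data = filter_name(url, response.text, "main")
print(data)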

This method is very simple and has kept my code DRY; it even allows for optional key: value pairs.

Here is the code for the to_object helper class:

class to_object(object):
  def __init__(self, dictionary):
    # Expose the dictionary's keys as attributes
    self.__dict__ = dictionary

This converts dictionaries to objects, so instead of always having to write:

self["soup"]

You would write:

self.soup
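A quick illustration (the values here are made up, not part of the original scraper):

obj = to_object({"soup": "<parsed document>", "is_dir_1": True})
print(obj.soup)      # <parsed document>
print(obj.is_dir_1)  # True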

Fixing errors:

You really need to standardize the type of indentation you use because your script raises the following error:

Traceback (most recent call last):
  File "", line 84
    Amount = [13.000, 14.000, 15.000, 30.000, 45.000, 60.000]
    ^
IndentationError: unindent does not match any outer indentation level
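This error means a line's indentation does not line up with any enclosing block, often from mixing tabs and spaces. Picking one style (say, two or four spaces per level) and applying it everywhere avoids it. A tiny sketch (the surrounding function and rates variable are made up; only the Amount line comes from your traceback):

# Consistent indentation: every line in the block starts at the
# same column, using spaces only
def simulate():
  rates = [1.5, 2.0]
  Amount = [13.000, 14.000, 15.000, 30.000, 45.000, 60.000]
  return rates, Amount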

Notes:

  1. Filters are scripts that scrape different sites; my project requires scraping several sites to get the required data.
  2. Try to tidy your code more; tidy code is simpler to read and simpler to write.

I hope this helps, good luck.
