How Do I Make My Web Scraping Script More Robust?
Solution 1:
That data comes from an XHR request, so just use requests to post your values and parse the response with json.loads.
Use your browser's network tab to see what the request looks like.
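A minimal sketch of that approach, assuming a placeholder endpoint and payload (copy the real URL, headers and form values from the network tab):

import json
import requests

# Placeholder endpoint and form values -- replace them with the request
# details shown in your browser's network tab.
url = "https://example.com/api/quotes"
payload = {"symbol": "ABC", "range": "1d"}

response = requests.post(url, data=payload)
data = json.loads(response.text)  # or simply response.json()
print(data)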
Solution 2:
This is my time to shine!
Information:
I'm currently working on a financial data aggregator that was facing this exact same problem.
It collects data from about a dozen websites and organizes it into a JSON object that is then used by a Flask site to display the data.
The data is scraped from websites that have several sub-directories with similar content but different selectors.
As you can imagine, with a framework like Selenium this becomes very complex, so the only solution is to dumb it down.
Answer:
Simplicity is key, so I removed every dependency except for the BeautifulSoup and requests libraries.
Then I created three classes and a function for each filter:
from bs4 import BeautifulSoup

class GET:
    def text(soup, selector, index=0):
        # Select every element matching the CSS selector and return the
        # stripped text of the one at `index`, if it exists.
        selected = soup.select(selector)
        if len(selected) > index:
            return selected[index].text.strip()

class Parse:
    def common(soup, selector):
        return GET.text(soup, selector, index=5)

class Routes:
    def main(self):
        data = {}
        if self.is_dir_1:
            data["name"] = GET.text(self.soup, "div")
            data["title-data"] = Parse.common(self.soup, "p > div:nth-child(1)")
        elif self.is_dir_2:
            data["name"] = GET.text(self.soup, "p", index=2)
            data["title-data"] = Parse.common(self.soup, "p > div:nth-child(5)")
        return data

def filter_name(url: str, response: str, filter_type: str):
    # to_object (defined further down) wraps the dict so the route can
    # use attribute access like self.soup.
    if hasattr(Routes, filter_type):
        return getattr(Routes, filter_type)(to_object({
            "is_dir_1": "/sub_dir_1/" in url,
            "is_dir_2": "/sub_dir_2/" in url,
            "soup": BeautifulSoup(response, "lxml")
        }))
    return {}
Using the requests library I made the request that got the data, then I passed the URL, response text and filter_type to the filter_name function.
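For example, a call could look roughly like this (the URL is a placeholder, and "main" refers to the Routes.main route above):

import requests

# Hypothetical page URL; the sub-directory in the path decides which
# branch of Routes.main runs.
url = "https://example.com/sub_dir_1/some-page"
response = requests.get(url)

data = filter_name(url, response.text, "main")
print(data)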
Then in the filter_name function I used the filter_type argument to pass the "soup" to the target route function, where each element is selected and its data extracted.
Then in the target route function, I used an if condition to determine the sub-directory and assigned the text to a data object. After all this is complete, I returned the data object.
This method is very simple and has kept my code DRY; it even allows for optional key: value pairs.
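The optional pairs fall out of GET.text returning None when a selector matches nothing; here is a short sketch (my own illustration, with a hypothetical drop_missing helper, not part of the original routes) of filtering those keys out:

def drop_missing(data: dict) -> dict:
    # Keep only the keys whose selectors actually matched something;
    # GET.text returns None when a selector finds no element.
    return {key: value for key, value in data.items() if value is not None}

# Example: a route that could not find "title-data" simply omits it.
print(drop_missing({"name": "ACME Corp", "title-data": None}))
# {'name': 'ACME Corp'}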
Here is the code for the to_object helper class:
class to_object(object):
    def __init__(self, dictionary):
        self.__dict__ = dictionary
This converts dictionaries to objects, so instead of always having to write self["soup"], you can write self.soup.
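For example (the dictionary values here are just placeholders):

# to_object turns the keys of a plain dict into attributes.
page = to_object({"is_dir_1": True, "soup": None})
print(page.is_dir_1)  # True  (instead of page["is_dir_1"])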
Fixing errors:
You really need to standardize the type of indentation you use because your script raises the following error:
Traceback (most recent call last):
  File "", line 84
    Amount = [13.000, 14.000, 15.000, 30.000, 45.000, 60.000]
                                                             ^
IndentationError: unindent does not match any outer indentation level
Notes:
- Filters are scripts that scrape different sites; my project requires scraping several sites to get the required data.
- Try to tidy your code more; tidy code is simpler to read and simpler to write.
I hope this helps, good luck.