Python Re Does Not Return Anything After /ref=
I am trying to retrieve the URL and category name from Amazon's best sellers list. For some reason the regular expression I'm using stops when it encounters /ref=, and I truly don't see why. I'm
Solution 1:
You are using a regular expression, but matching HTML with such expressions gets too complicated, too fast. Don't do that.
Use an HTML parser instead; Python has several to choose from:
- ElementTree is part of the standard library (it is an XML parser, so it expects well-formed markup).
- BeautifulSoup is a popular third-party library.
- lxml is a fast and feature-rich C-based library.
The latter two also handle malformed HTML quite gracefully, making decent sense of many a botched website. In fact, BeautifulSoup 4 uses lxml under the hood as its parser of choice if it is installed.
BeautifulSoup example:
from bs4 import BeautifulSoup

soup = BeautifulSoup(htmlsource, 'html.parser')
# Quote the attribute value, since it contains ':' and '/' characters.
for link in soup.select('li > a[href^="http://www.amazon.ca/Best-Sellers"]'):
    print(link['href'], link.get_text())
This uses a CSS selector to find all <a> elements contained directly in a <li> element where the href attribute starts with the text http://www.amazon.ca/Best-Sellers.
Demo:
>>> from bs4 import BeautifulSoup
>>> htmlsource = '<li><a href="http://www.amazon.ca/Best-Sellers-Appstore-Android/zgbs/mobile-apps/ref=zg_bs_nav_0">Appstore for Android</a></li>'
>>> soup = BeautifulSoup(htmlsource, 'html.parser')
>>> for link in soup.select('li > a[href^="http://www.amazon.ca/Best-Sellers"]'):
...     print(link['href'], link.get_text())
...
http://www.amazon.ca/Best-Sellers-Appstore-Android/zgbs/mobile-apps/ref=zg_bs_nav_0 Appstore for Android
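If installing a third-party package is not an option, the rule the selector expresses (an <a> whose direct parent is a <li> and whose href starts with a given prefix) can be spelled out by hand with the standard library's html.parser module. This is only a rough sketch: html.parser is not one of the libraries listed above, the names CategoryLinkParser and extract_links are made up for illustration, and the tag tracking is much cruder than what BeautifulSoup or lxml do with broken markup.

```python
from html.parser import HTMLParser

PREFIX = "http://www.amazon.ca/Best-Sellers"

# Elements that never have a closing tag, so they must not go on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class CategoryLinkParser(HTMLParser):
    """Collect (href, text) pairs for <a> tags that are direct children of <li>."""

    def __init__(self, prefix=PREFIX):
        super().__init__()
        self.prefix = prefix
        self.stack = []    # names of the currently open tags
        self.links = []    # finished (href, text) pairs
        self._href = None  # href of the matching <a> being read, if any
        self._text = []    # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        parent = self.stack[-1] if self.stack else None
        self.stack.append(tag)
        href = dict(attrs).get("href") or ""
        # Same condition as the selector: li > a[href^=prefix]
        if tag == "a" and parent == "li" and href.startswith(self.prefix):
            self._href = href
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None
        # Unwind the stack to the matching open tag, if there is one.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def extract_links(html, prefix=PREFIX):
    parser = CategoryLinkParser(prefix)
    parser.feed(html)
    return parser.links
```

Calling extract_links(htmlsource) on the one-line sample from the demo gives a list with a single (href, text) tuple for the "Appstore for Android" link. The amount of bookkeeping needed for even this simple rule is a good argument for using the selector instead.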
Note that Amazon alters the response based on the headers:
>>> import requests
>>> from bs4 import BeautifulSoup
>>> r = requests.get('http://www.amazon.ca/gp/bestsellers')
>>> soup = BeautifulSoup(r.content, 'html.parser')
>>> soup.select('li > a[href^="http://www.amazon.ca/Best-Sellers"]')[0]
<a href="http://www.amazon.ca/Best-Sellers-Appstore-Android/zgbs/mobile-apps">Appstore for Android</a>
>>> r = requests.get('http://www.amazon.ca/gp/bestsellers', headers={
...     'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36'})
>>> soup = BeautifulSoup(r.content, 'html.parser')
>>> soup.select('li > a[href^="http://www.amazon.ca/Best-Sellers"]')[0]
<a href="http://www.amazon.ca/Best-Sellers-Appstore-Android/zgbs/mobile-apps/ref=zg_bs_nav_0/185-3312534-9864113">Appstore for Android</a>
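Since the two responses differ only in the trailing /ref=... part of the href, code that stores or compares category URLs may want to normalize them first. Here is a minimal sketch, assuming that everything from /ref= onward in the path is tracking data that can safely be dropped. That is an assumption based on the two outputs above, not documented behaviour, and strip_ref is a hypothetical helper name.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_ref(url):
    """Return url with any '/ref=...' tail removed from its path."""
    parts = urlsplit(url)
    # Keep only the part of the path before the first '/ref='.
    path = parts.path.partition("/ref=")[0]
    return urlunsplit((parts.scheme, parts.netloc, path,
                       parts.query, parts.fragment))
```

With this helper, both href values shown above reduce to the same category URL, and a URL without a /ref= segment is returned unchanged.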