Python Re Does Not Return Anything After /ref=

February 28, 2024

I am trying to retrieve the URL and category name from Amazon's best sellers list. For some reason the regular expression I'm using stops when it encounters /ref=, and I truly don't see why. I'm

Solution 1:

You are using a regular expression, but matching HTML with regular expressions gets too complicated, too fast. Don't do that.

Use an HTML parser instead; Python has several to choose from:

ElementTree is part of the standard library (it is an XML parser, so it needs well-formed markup).
BeautifulSoup is a popular third-party library.
lxml is a fast and feature-rich C-based library.

The latter two also handle malformed HTML gracefully, making decent sense of many a botched website. In fact, BeautifulSoup 4 uses lxml as its parser of choice if it is installed.

BeautifulSoup example:

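from bs4 import BeautifulSoup

# htmlsource holds the page's HTML as a string; pass 'lxml' instead if it is installed
soup = BeautifulSoup(htmlsource, 'html.parser')
for link in soup.select('li > a[href^="http://www.amazon.ca/Best-Sellers"]'):
    print(link['href'], link.get_text())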

This uses a CSS selector to find all <a> elements contained directly in a <li> element where the href attribute starts with the text http://www.amazon.ca/Best-Sellers.
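To make the selector's matching rule concrete, here is a minimal standard-library sketch of the same idea. It is not how BeautifulSoup implements selectors; the names DirectChildLinkParser and select_li_links and the sample markup are hypothetical, and the tag-stack handling is a simplification that assumes reasonably tidy HTML.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not go on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class DirectChildLinkParser(HTMLParser):
    """Collect (href, text) for <a> tags whose direct parent is <li>."""

    def __init__(self, prefix):
        super().__init__()
        self.prefix = prefix
        self.open_tags = []   # stack of currently open element names
        self.current = None   # [href, text_parts] while inside a matching <a>
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        href = dict(attrs).get("href") or ""
        parent = self.open_tags[-1] if self.open_tags else None
        # Rough equivalent of: li > a[href^="prefix"]
        if tag == "a" and parent == "li" and href.startswith(self.prefix):
            self.current = [href, []]
        self.open_tags.append(tag)

    def handle_data(self, data):
        if self.current is not None:
            self.current[1].append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.current is not None:
            href, parts = self.current
            self.links.append((href, "".join(parts).strip()))
            self.current = None
        # Pop back to the matching open tag, tolerating unclosed elements.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass


def select_li_links(markup, prefix):
    """Return a list of (href, text) pairs for matching links in markup."""
    parser = DirectChildLinkParser(prefix)
    parser.feed(markup)
    parser.close()
    return parser.links


# Hypothetical sample markup modelled on the link from the question.
SAMPLE = (
    '<ul>'
    '<li><a href="http://www.amazon.ca/Best-Sellers-Appstore-Android/zgbs/'
    'mobile-apps/ref=zg_bs_nav_0">Appstore for Android</a></li>'
    '<li><span><a href="http://www.amazon.ca/Best-Sellers-Books/zgbs/books">'
    'Books</a></span></li>'
    '<li><a href="http://www.amazon.ca/help">Help</a></li>'
    '</ul>'
)

links = select_li_links(SAMPLE, "http://www.amazon.ca/Best-Sellers")
```

In this sample only the first link matches: the second sits inside a <span> rather than directly in the <li>, and the third does not start with the prefix. Note that the /ref= part of the href comes through untouched, because the attribute value is read as a whole rather than pattern-matched.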

Demo:

>>> from bs4 import BeautifulSoup
>>> htmlsource = '<li><a href="http://www.amazon.ca/Best-Sellers-Appstore-Android/zgbs/mobile-apps/ref=zg_bs_nav_0">Appstore for Android</a></li>'
>>> soup = BeautifulSoup(htmlsource, 'html.parser')
>>> for link in soup.select('li > a[href^="http://www.amazon.ca/Best-Sellers"]'):
...     print(link['href'], link.get_text())
...
http://www.amazon.ca/Best-Sellers-Appstore-Android/zgbs/mobile-apps/ref=zg_bs_nav_0 Appstore for Android

Note that Amazon alters the response based on the request headers. Without a browser-like User-Agent header, the link has no /ref= part; with one, it does:

>>> import requests
>>> from bs4 import BeautifulSoup
>>> r = requests.get('http://www.amazon.ca/gp/bestsellers')
>>> soup = BeautifulSoup(r.content, 'html.parser')
>>> soup.select('li > a[href^="http://www.amazon.ca/Best-Sellers"]')[0]
<a href="http://www.amazon.ca/Best-Sellers-Appstore-Android/zgbs/mobile-apps">Appstore for Android</a>
>>> r = requests.get('http://www.amazon.ca/gp/bestsellers', headers={
...     'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36'})
>>> soup = BeautifulSoup(r.content, 'html.parser')
>>> soup.select('li > a[href^="http://www.amazon.ca/Best-Sellers"]')[0]
<a href="http://www.amazon.ca/Best-Sellers-Appstore-Android/zgbs/mobile-apps/ref=zg_bs_nav_0/185-3312534-9864113">Appstore for Android</a>
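Since the same link can come back with or without the /ref= part, it may help to normalize the URLs before comparing or storing them. The sketch below is one way to do that; the helper name strip_ref_suffix is hypothetical, and it assumes that everything from /ref= onward is optional tracking information, which the responses above suggest but do not prove.

```python
def strip_ref_suffix(url):
    """Return url without the '/ref=...' segment and anything after it.

    Assumption: the part from '/ref=' onward is not needed to identify
    the category page. URLs without '/ref=' are returned unchanged.
    """
    base, _sep, _tracking = url.partition("/ref=")
    return base


# The two hrefs returned in the demo above.
plain = "http://www.amazon.ca/Best-Sellers-Appstore-Android/zgbs/mobile-apps"
with_ref = ("http://www.amazon.ca/Best-Sellers-Appstore-Android/zgbs/"
            "mobile-apps/ref=zg_bs_nav_0/185-3312534-9864113")

normalized_plain = strip_ref_suffix(plain)
normalized_with_ref = strip_ref_suffix(with_ref)
```

Both calls yield the same base URL, so results fetched with different headers can be matched up.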
