How Can I Extract The List Of Urls Obtained During A Html Page Render In Python?
Solution 1:
It's likely that you'll have to render the page (not necessarily display it though) to be sure you're getting a complete list of all resources. I've used PyQT
and QtWebKit
in similar situations. Especially when you start counting resources included dynamically with javascript, trying to parse and load pages recursively with BeautifulSoup
just isn't going to work.
Ghost.py is an excellent client to get you started with PyQT. Also, check out the QWebView docs and the QNetworkAccessManager docs.
Ghost.py returns a tuple of (page, resources) when opening a page:
from ghost import Ghost
ghost = Ghost()
page, resources = ghost.open('http://my.web.page')
resources
includes all of the resources loaded by the original URL as HttpResource objects. You can retrieve the URL for a loaded resource with resource.url
.
Solution 2:
I guess you will have to create a list of all known file extensions that you do NOT want, and then scan the content of the http response, checking with "if substring not in nono-list:"
The problem is all href's ending with TLDs, forwardslashes, url-delivered variables and so on, so i think it would be easier to check for stuff you know you dont want.
Post a Comment for "How Can I Extract The List Of Urls Obtained During A Html Page Render In Python?"