How To Scrape .csv Files From A Url, When They Are Saved In A .zip File In Python?
I am trying to scrape some .csv files from a website. I currently have a list of links:

    master_links = ['http://mis.nyiso.com/public/csv/damlbmp/20161201damlbmp_zone_csv.zip', ...]
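The general technique is to download the zip's raw bytes, wrap them in an in-memory buffer, open that as a ZipFile, and feed each member straight to pandas. A minimal offline sketch of that idea follows; the small zip built in memory (with made-up member names `day1.csv`/`day2.csv`) stands in for the bytes a real download such as `urlopen(url).read()` would return:

```python
import io
import zipfile

import pandas as pd

# Build a small zip in memory to stand in for the downloaded bytes
# (in practice these would come from urlopen(url).read()).
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w') as zf:
    zf.writestr('day1.csv', 'Name,LBMP\nCAPITL,21.94\n')
    zf.writestr('day2.csv', 'Name,LBMP\nCENTRL,16.85\n')

# Wrap the raw bytes in BytesIO, open them as a ZipFile, and hand
# each member's file object directly to pandas.read_csv().
zip_file = zipfile.ZipFile(io.BytesIO(buf.getvalue()))
dfs = {name: pd.read_csv(zip_file.open(name))
       for name in zip_file.namelist()}

print(sorted(dfs))                       # ['day1.csv', 'day2.csv']
print(dfs['day1.csv'].iloc[0]['Name'])   # CAPITL
```

Nothing is written to disk: `zipfile.ZipFile` is happy with any seekable file-like object, which is why `io.BytesIO` appears on both the write and the read side.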
Solution 1:
You can do that with a custom file reader for pandas.read_csv(), like:
Code:
    def fetch_multi_csv_zip_from_url(url, filenames=(), *args, **kwargs):
        # The zip is decompressed here, so pandas must not try again.
        assert kwargs.get('compression') is None
        req = urlopen(url)
        zip_file = zipfile.ZipFile(BytesIO(req.read()))
        if filenames:
            names = zip_file.namelist()
            for filename in filenames:
                if filename not in names:
                    raise ValueError(
                        'filename {} not in {}'.format(filename, names))
        else:
            filenames = zip_file.namelist()
        return {name: pd.read_csv(zip_file.open(name), *args, **kwargs)
                for name in filenames}
Some Docs: ZipFile, BytesIO, urlopen
Test Code:
    try:
        from urllib.request import urlopen  # Python 3
    except ImportError:
        from urllib2 import urlopen  # Python 2

    from io import BytesIO
    import zipfile

    import pandas as pd

    master_links = [
        'http://mis.nyiso.com/public/csv/damlbmp/20161201damlbmp_zone_csv.zip',
        'http://mis.nyiso.com/public/csv/damlbmp/20160301damlbmp_zone_csv.zip',
        'http://mis.nyiso.com/public/csv/damlbmp/20160201damlbmp_zone_csv.zip']
dfs = fetch_multi_csv_zip_from_url(master_links[0])
print(dfs['20161201damlbmp_zone.csv'].head())
Results:
             Time Stamp    Name   PTID  LBMP($/MWHr)  \
    0  12/01/2016 00:00  CAPITL  61757         21.94
    1  12/01/2016 00:00  CENTRL  61754         16.85
    2  12/01/2016 00:00  DUNWOD  61760         20.85
    3  12/01/2016 00:00  GENESE  61753         16.16
    4  12/01/2016 00:00     H Q  61844         15.73

       Marginal Cost Losses($/MWHr)  Marginal Cost Congestion($/MWHr)
    0                          1.21                             -4.45
    1                          0.11                             -0.45
    2                          1.58                             -2.99
    3                         -0.49                             -0.36
    4                         -0.55                              0.00
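Since the function returns a dict of DataFrames (one per zip member), results from several links can be merged into a single frame with pandas.concat(). A sketch of that step, where the two hand-made frames below stand in for what fetch_multi_csv_zip_from_url() would return for real zips:

```python
import pandas as pd

# Stand-ins for the per-file DataFrames the zip reader would return.
dfs = {
    '20161201damlbmp_zone.csv': pd.DataFrame(
        {'Name': ['CAPITL', 'CENTRL'], 'LBMP($/MWHr)': [21.94, 16.85]}),
    '20161202damlbmp_zone.csv': pd.DataFrame(
        {'Name': ['CAPITL', 'CENTRL'], 'LBMP($/MWHr)': [22.10, 17.02]}),
}

# Concatenating a dict makes its keys an outer index level, so each
# row stays traceable to the zip member it came from.
combined = pd.concat(dfs, names=['source_file'])
print(len(combined))  # 4
print(combined.loc['20161201damlbmp_zone.csv']['Name'].tolist())
```

With real data, collecting `fetch_multi_csv_zip_from_url(url)` for every url in `master_links` into one dict before the concat would stitch all months together the same way.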