Unable To Get All The Data Including Links From A Tr Tag
I've written a script in python to get data from some html elements which are in a table. I have roughly picked some data which are within a tr tag. My goal is to get the data (inc
Solution 1:
You can use either bs4 or regular expressions:
bs4:
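from bs4 import BeautifulSoup as soup
# content is the HTML string containing the table row
s = soup(content, 'lxml')
# Take text and href from the same tag so the two values cannot drift out of alignment
new_data = [(a.text, a['href']) for a in s.find_all('a', href=True)]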
Output:
[(u'Charles Hard Townes', '/wiki/Charles_Hard_Townes'), (u'Nikolay Basov', '/wiki/Nikolay_Basov'), (u'Alexander Prokhorov', '/wiki/Alexander_Prokhorov'), (u'Dorothy Hodgkin', '/wiki/Dorothy_Hodgkin'), (u'Konrad Emil Bloch', '/wiki/Konrad_Emil_Bloch'), (u'Feodor Felix Konrad Lynen', '/wiki/Feodor_Felix_Konrad_Lynen'), (u'Jean-Paul Sartre', '/wiki/Jean-Paul_Sartre'), (u'[D]', '#endnote_Note1D'), (u'Martin Luther King, Jr.', '/wiki/Martin_Luther_King,_Jr.')]
Regex:
import re
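# re.findall returns (href, title) tuples in which one group is empty; keep the non-empty one
new_data = [href or title for href, title in re.findall(r'href="(.*?)"|title="(.*?)">', content)]
# Pair consecutive values: (href, title)
final_data = [(new_data[i], new_data[i+1]) for i in range(0, len(new_data)-1, 2)]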
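Output (the last pair is misaligned: the footnote link [D] has no title attribute, so its href is paired with the next link's href):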
[('/wiki/Charles_Hard_Townes', 'Charles Hard Townes'), ('/wiki/Nikolay_Basov', 'Nikolay Basov'), ('/wiki/Alexander_Prokhorov', 'Alexander Prokhorov'), ('/wiki/Dorothy_Hodgkin', 'Dorothy Hodgkin'), ('/wiki/Konrad_Emil_Bloch', 'Konrad Emil Bloch'), ('/wiki/Feodor_Felix_Konrad_Lynen', 'Feodor Felix Konrad Lynen'), ('/wiki/Jean-Paul_Sartre', 'Jean-Paul Sartre'), ('#endnote_Note1D', '/wiki/Martin_Luther_King,_Jr.')]
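The misalignment above comes from matching href and title independently. One possible workaround, shown here only as a sketch, is to capture the href and the visible link text of each anchor in a single match. The `sample` string and the name `anchor_re` are made up for illustration, and the pattern assumes double-quoted attributes and no nested tags inside the link; regular expressions remain brittle for HTML in general, so the bs4 approach is usually the safer choice.

```python
import re

# Hypothetical sample: two anchors, the second one without a title attribute.
sample = (
    '<a href="/wiki/Jean-Paul_Sartre" title="Jean-Paul Sartre">Jean-Paul Sartre</a>'
    '<sup><a href="#endnote_Note1D">[D]</a></sup>'
)

# One match per <a> element: group 1 is the href, group 2 is the link text.
anchor_re = re.compile(r'<a\b[^>]*?\bhref="([^"]*)"[^>]*>(.*?)</a>', re.S)

pairs = anchor_re.findall(sample)
print(pairs)
# [('/wiki/Jean-Paul_Sartre', 'Jean-Paul Sartre'), ('#endnote_Note1D', '[D]')]
```

Because each tuple comes from one anchor, a link with no title can no longer shift the pairs that follow it.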
Solution 2:
This modified code returns each href together with its link text:
from bs4 import BeautifulSoup
content="""
<tr><td align="center">1964</td><td><span class="sortkey">Townes, Charles Hard</span><span class="vcard"><span class="fn"><a href="/wiki/Charles_Hard_Townes" class="mw-redirect" title="Charles Hard Townes">Charles Hard Townes</a></span></span>;<br><span class="sortkey">Basov, Nikolay</span><span class="vcard"><span class="fn"><a href="/wiki/Nikolay_Basov" title="Nikolay Basov">Nikolay Basov</a></span></span>;<br><span class="sortkey">Prokhorov, Alexander</span><span class="vcard"><span class="fn"><a href="/wiki/Alexander_Prokhorov" title="Alexander Prokhorov">Alexander Prokhorov</a></span></span></td><td><span class="sortkey">Hodgkin, Dorothy</span><span class="vcard"><span class="fn"><a href="/wiki/Dorothy_Hodgkin" title="Dorothy Hodgkin">Dorothy Hodgkin</a></span></span></td><td><span class="sortkey">Bloch, Konrad Emil</span><span class="vcard"><span class="fn"><a href="/wiki/Konrad_Emil_Bloch" title="Konrad Emil Bloch">Konrad Emil Bloch</a></span></span>;<br><span class="sortkey">Lynen, Feodor Felix Konrad</span><span class="vcard"><span class="fn"><a href="/wiki/Feodor_Felix_Konrad_Lynen" class="mw-redirect" title="Feodor Felix Konrad Lynen">Feodor Felix Konrad Lynen</a></span></span></td><td><span class="sortkey">Sartre, Jean-Paul</span><span class="vcard"><span class="fn"><a href="/wiki/Jean-Paul_Sartre" title="Jean-Paul Sartre">Jean-Paul Sartre</a></span></span><sup class="reference" id="ref_Note1D"><a href="#endnote_Note1D">[D]</a></sup></td><td><span class="sortkey">King, Jr., Martin Luther</span><span class="vcard"><span class="fn"><a href="/wiki/Martin_Luther_King,_Jr." class="mw-redirect" title="Martin Luther King, Jr.">Martin Luther King, Jr.</a></span></span></td><td align="center">—</td></tr>
"""
soup = BeautifulSoup(content,"lxml")
for items in soup.select('tr'):
    # Only anchors inside elements with class "fn" are kept, which skips the footnote link
    item_name = [[item.text, item.get('href')] for item in items.select(".fn a")]
    print(item_name)
Output:
[['Charles Hard Townes', '/wiki/Charles_Hard_Townes'], ['Nikolay Basov', '/wiki/Nikolay_Basov'], ['Alexander Prokhorov', '/wiki/Alexander_Prokhorov'], ['Dorothy Hodgkin', '/wiki/Dorothy_Hodgkin'], ['Konrad Emil Bloch', '/wiki/Konrad_Emil_Bloch'], ['Feodor Felix Konrad Lynen', '/wiki/Feodor_Felix_Konrad_Lynen'], ['Jean-Paul Sartre', '/wiki/Jean-Paul_Sartre'], ['Martin Luther King, Jr.', '/wiki/Martin_Luther_King,_Jr.']]
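The hrefs in this output are site-relative. If absolute links are needed, they can be resolved with `urllib.parse.urljoin` from the standard library. This is a minimal sketch: `BASE_URL` is an assumption (the post does not say which page was scraped), and the list below copies just two of the pairs from the output for brevity.

```python
from urllib.parse import urljoin

# Assumed base URL; the original post does not state the page that was scraped.
BASE_URL = "https://en.wikipedia.org/"

# Two [text, href] pairs copied from the output above.
item_name = [
    ["Charles Hard Townes", "/wiki/Charles_Hard_Townes"],
    ["Dorothy Hodgkin", "/wiki/Dorothy_Hodgkin"],
]

# Resolve each relative href against the base URL.
absolute = [[text, urljoin(BASE_URL, href)] for text, href in item_name]
for text, url in absolute:
    print(text, url)
```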
Solution 3:
Slightly simpler: no need to select the table rows separately.
soup = BeautifulSoup(content,"lxml")
links = soup.select('tr .fn a')
for link in links:
    print(link.attrs['href'])
    print(link.text)