Extract HTML Tags From A Text File Through Iteration And Append Them To A List And Ignore All Other Characters In Python

I want to read an HTML file and extract only the tags from it. Read one character at a time from the file, ignoring everything until a '<' is reached (ignore the '<' as well). Re

Solution 1:

In [10]: re.findall('<(.*?)>', html)
Out[10]: ['html', 'body', 'h1', '/h1', 'h2', 'h2', '/body', '/html']
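
The session above does not show how `html` was defined, so here is a minimal runnable sketch of the same idea. The sample string is an assumption made up for illustration, not the original poster's file.

```python
import re

# Hypothetical sample document; the original post does not show its input.
html = "<html><body><h1>Title</h1><h2>Sub</h2><p>text</p></body></html>"

# Non-greedy match: capture everything between '<' and the next '>'.
tags = re.findall(r"<(.*?)>", html)
print(tags)
```

Note that the captured text includes any attributes written inside the tag, and that `.` does not match newlines by default, so a tag split across lines would be missed unless `re.DOTALL` is passed.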

Simply use a regex, as above, or the standard library's HTMLParser.
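
For comparison, here is a rough sketch of the two non-regex routes: a small `HTMLParser` subclass, and a plain character-by-character loop of the kind the question asks about. The names `TagCollector`, `tags_by_iteration` and `sample` are hypothetical, chosen only for this illustration. In practice the text would come from reading the file.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical helper: collects tag names in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # Called for every opening tag; attributes are ignored here.
        self.tags.append(tag)

    def handle_endtag(self, tag):
        # Prefix closing tags with '/' to mirror the regex output.
        self.tags.append("/" + tag)


def tags_by_iteration(text):
    """Walk the text one character at a time, keeping only what lies
    between '<' and the following '>'."""
    tags, current, inside = [], [], False
    for ch in text:
        if ch == "<":
            # Start a new tag; the '<' itself is not stored.
            inside, current = True, []
        elif ch == ">" and inside:
            tags.append("".join(current))
            inside = False
        elif inside:
            current.append(ch)
        # Characters outside a tag are ignored.
    return tags


# Assumed sample input, standing in for the contents of the HTML file.
sample = "<html><body><h1>Title</h1></body></html>"

collector = TagCollector()
collector.feed(sample)
collector.close()
print(collector.tags)
print(tags_by_iteration(sample))
```

The parser version reports only tag names, lowercased, while the loop keeps whatever text sits between the angle brackets, attributes included. Which is preferable depends on whether the exercise requires manual iteration.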

