
Output Ascii File From Unicode Web Scrape In Python

I am new to Python programming. I am using the following code in my Python file: import gethtml import articletext url = 'http://www.thehindu.com/news/national/india-calls-for-resu

Solution 1:

To take care of the Unicode error, we need to encode the text as UTF-8 instead of ASCII. To ensure it doesn't throw if a character can't be encoded, we ignore any characters that we don't have a mapping for. (You can also use "replace" or the other error handlers accepted by str.encode; see the Python docs on Unicode.)
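As a rough sketch of how those error handlers differ, the snippet below encodes one string several ways. The sample text is made up for illustration (it is not the scraped article), and the variable names are hypothetical.

```python
# Hypothetical sample text: "café – naïve" with three non-ASCII characters.
sample = u"caf\u00e9 \u2013 na\u00efve"

# ASCII cannot represent these characters, so the error handler decides what happens.
ascii_ignore = sample.encode("ascii", "ignore")              # drops them
ascii_replace = sample.encode("ascii", "replace")            # substitutes "?"
ascii_xml = sample.encode("ascii", "xmlcharrefreplace")      # numeric references

# UTF-8 can represent all of them, so no handler is needed here.
utf8 = sample.encode("utf-8")

print(ascii_ignore)   # b'caf  nave'
print(ascii_replace)  # b'caf? ? na?ve'
print(ascii_xml)      # b'caf&#233; &#8211; na&#239;ve'
print(utf8)           # b'caf\xc3\xa9 \xe2\x80\x93 na\xc3\xafve'
```

If the output file really has to be pure ASCII, "xmlcharrefreplace" or "replace" at least leaves a visible trace of what was lost, whereas "ignore" silently drops it.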

Best practice in opening the file would be to use the Python context manager, which will close the file even if there's an error. I'm using slashes instead of backslashes in the path to make sure this works in either Windows or Unix/Linux.

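text = text.encode('UTF-8', 'ignore')
# encode() returns bytes, so open the file in binary mode ('wb')
with open('/temp/Out.txt', 'wb') as out_file:
    out_file.write(text)

This is equivalent to

text = text.encode('UTF-8', 'ignore')
# open before the try block, so close() is only called on a file that was actually opened
out_file = open('/temp/Out.txt', 'wb')
try:
    out_file.write(text)
finally:
    out_file.close()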

But the context manager is much less verbose, and it makes it much harder to leave a file open (and possibly locked) when an error occurs partway through.
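A variation on the same idea, not part of the answer above, is to let the file object do the encoding. The sketch below assumes io.open (available in Python 2.6+ and identical to the built-in open in Python 3). The text and the temp-directory path are placeholders chosen so the example can run anywhere.

```python
import io
import os
import tempfile

# Placeholder text standing in for the scraped article (an assumption).
text = u"India calls for result\u2011oriented steps"

# Placeholder output path in the system temp directory (an assumption).
out_path = os.path.join(tempfile.gettempdir(), "Out.txt")

# io.open encodes on write, so the unicode text is passed in directly.
with io.open(out_path, "w", encoding="utf-8") as out_file:
    out_file.write(text)

# Reading back with the same encoding returns the original text.
with io.open(out_path, "r", encoding="utf-8") as in_file:
    round_trip = in_file.read()
```

The context manager still guarantees the file is closed, and the encoding is stated once at the point where the file is opened rather than at every write.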

Solution 2:

text_filefixed = open("Output.txt", "wb")
text_filefixed.write(bytes(result, 'UTF-8')) 
text_filefixed.close()

This should work, give it a try.

Why? Because UTF-8 can represent every Unicode character, so encoding the text to UTF-8 bytes and writing them in binary mode avoids those kinds of encoding errors :D
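To make that concrete, here is a small Python 3 sketch with an invented string (not the real page content): the ASCII codec rejects it, while the UTF-8 bytes round-trip cleanly.

```python
# Invented text with non-ASCII characters (an assumption for illustration).
text = "R\u00e9sum\u00e9 \u20b9 500"

# Encoding to ASCII fails, which is the kind of error the question ran into.
try:
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Encoding to UTF-8 succeeds, and decoding gives back the same text.
data = bytes(text, "UTF-8")
restored = data.decode("UTF-8")

print(ascii_failed)      # True
print(restored == text)  # True
```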

Edit: Make sure the file exists in the same folder; otherwise put this code after the imports and it will create the file itself. (Opening a file with "wb" also creates it if it is missing, so this step is optional.)

text_filefixed = open("Output.txt", "a")
text_filefixed.close()

It creates the file, writes nothing, and closes it, so the file is created automatically without human interaction.

Edit 2: Note that I only have this working in 3.3.2, but I know you can use this module to achieve the same thing in 2.7. One minor difference is that (I think) the request submodule is not needed in 2.7, but you should check that.

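from urllib import request

url = "http://www.thehindu.com/news/national/india-calls-for-resultoriented-steps-at-asem/article5339414.ece"
# read() returns bytes; decode them (assuming the page is UTF-8) rather than
# calling str(), which would store the "b'...'" repr instead of the text
result = request.urlopen(url).read().decode('UTF-8')
text_filefixed = open("Output.txt", "wb")
text_filefixed.write(bytes(result, 'UTF-8'))
text_filefixed.close()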

Just as I thought, you will only find this error in 2.7; see "urllib.request in Python 2.7".
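One common way to cope with that difference is a guarded import. The sketch below is an assumption about how you might structure it, not code from the answer: fetch_text is a hypothetical helper, and the "replace" handler is a defensive choice in case the page is not valid UTF-8.

```python
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2.7


def fetch_text(url, encoding="utf-8"):
    """Hypothetical helper: download a page and return it as Unicode text."""
    response = urlopen(url)
    try:
        # read() returns bytes; undecodable bytes become U+FFFD instead of raising.
        return response.read().decode(encoding, "replace")
    finally:
        response.close()
```

You would call it as fetch_text(url) with the article URL and then write the result out in binary mode after encoding it, as in the snippets above.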
