Remove Special Chars From A Tsv File Using Regex

March 07, 2024

I have a file called 'X.tsv'. I want to remove special characters (including double spaces, but excluding '.', single spaces, tabs, '/' and '-') using regex before I export the data to sub-files in Python.

Solution 1:

You could use pandas' read_csv() to load the TSV file, passing usecols to pick the columns you want to keep and telling it to use \t as the delimiter:

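import re

import pandas as pd


def normalise(text):
    # Replace each listed special character with a space
    text = re.sub('[{}]'.format(re.escape('",$!@#%^&*()')), ' ', text)
    # Collapse runs of whitespace into a single space
    text = re.sub(r'\s+', ' ', text)
    # Trim last, so spaces left by removed characters at either end also go
    return text.strip()


fieldnames = ['title', 'abstract', 'keywords', 'general_terms', 'acm_classification']
# dtype='object' and na_filter=False keep every cell as a plain string
df = pd.read_csv('xa.tsv', delimiter='\t', usecols=fieldnames, dtype='object', na_filter=False)
# On pandas 2.1+, DataFrame.map is the preferred name for applymap
df = df.applymap(normalise)
print(df)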

You can then use df.applymap() to apply a function to each cell to format it as you need. In this example the function replaces your list of special characters with spaces, collapses runs of whitespace into a single space and trims any leading or trailing spaces.
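The character list in normalise() is a blacklist, so anything not listed passes through. Since the question names the characters to keep ('.', single spaces, tabs, '/' and '-'), a whitelist pattern may be a closer fit. The following is only a sketch: the name normalise_whitelist is made up for illustration, and it assumes that letters, digits and underscores (everything \w matches) should also be kept.

```python
import re


def normalise_whitelist(text):
    # Hypothetical helper, not part of the original answer.
    # Replace every character that is NOT a word character, space, tab,
    # '.', '/' or '-' with a space ('-' is last in the class, so it is literal).
    text = re.sub(r'[^\w \t./-]', ' ', text)
    # Collapse runs of two or more spaces into one; tabs are left alone.
    text = re.sub(r' {2,}', ' ', text)
    # Trim spaces (only spaces) from both ends.
    return text.strip(' ')


print(normalise_whitelist('Deep  learning: a (short) survey, v1.0 - part 1/2'))
# Deep learning a short survey v1.0 - part 1/2
```

It could be passed to df.applymap() in place of normalise. Note that \w matches Unicode letters in Python 3, so accented characters are kept as well.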

The resulting dataframe could then be further processed using your all_subsets() function before saving.
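The answer does not show the saving step, and all_subsets() comes from the question and is not reproduced here. As a rough sketch of writing a cleaned dataframe back out as tab-separated text, DataFrame.to_csv() accepts sep='\t'. The sample data below is invented for illustration, and an in-memory buffer stands in for a real output file.

```python
import io

import pandas as pd

# Made-up sample standing in for the cleaned dataframe
df = pd.DataFrame({'title': ['Deep learning'], 'abstract': ['A short survey']})

# Write tab-separated text without the index column.
# A file path could be passed instead of the buffer.
buffer = io.StringIO()
df.to_csv(buffer, sep='\t', index=False)
tsv_text = buffer.getvalue()
print(tsv_text)
```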
