
Regex To Remove Words From A List That Are Not A-Z A-z (exceptions)

I would like to remove non-alpha characters from a string and convert each word into a list component such that: 'All, the above.' -> ['all', 'the', 'above'] It would seem that

Solution 1:

Although I understand you are asking specifically about regex, another solution to your overall problem is to use a library built for this express purpose, for instance nltk. It should help you split your strings in sane ways (parsing the punctuation out into separate items in a list), and you can then filter those items out.

You are right, the number of corner cases is huge precisely because human language is imprecise and vague. Using a library that already accounts for these edge cases should save you a lot of headache.

A helpful primer on dealing with raw text in nltk is here. The most useful function for your use case seems to be nltk.word_tokenize, which returns a list of strings with words and punctuation as separate items.
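The filtering step mentioned above could look roughly like this. It is a minimal sketch rather than the answerer's code: keep_alpha_words is a hypothetical helper name, and the nltk call at the bottom assumes nltk is installed and its tokenizer data has been downloaded. Note that str.isalpha() discards any token containing a period, so abbreviations such as "U.S." would be dropped as well; it only suits the simple case from the question.

```python
def keep_alpha_words(tokens):
    """Lowercase the tokens and keep only those made up entirely of letters."""
    return [tok.lower() for tok in tokens if tok.isalpha()]


if __name__ == "__main__":
    try:
        # Assumption: nltk is installed and its tokenizer data is available.
        from nltk import word_tokenize
        tokens = word_tokenize("All, the above.")
    except (ImportError, LookupError):
        # Fallback: a hand-written token list with punctuation as separate items.
        tokens = ["All", ",", "the", "above", "."]
    print(keep_alpha_words(tokens))
```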


Solution 2:

Here's a Python regex that should work for splitting the sentences you provided.

((?<![A-Z])\.)*[\W](?<!\.)|[\W]$


This assumes that every abbreviation with periods has a capital letter directly before each period. A negative lookbehind then matches only the periods that do not follow a capital letter, which leaves the abbreviation periods alone:

((?<![A-Z])\.)*

Then it splits on any other non-word character that is not a period:

[\W](?<!\.)

or on any non-word character (a period included) at the end of the line:

|[\W]$

I tested the regex on these strings:

The R.N. lives in the U.S.

The R.N., lives in the U.S. here.
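To see what the pattern does with these two strings, here is a small sketch using re.split. The names SPLITTER and split_words are made up for illustration, and the outer group is changed to a non-capturing one, because re.split would otherwise return the captured text alongside the pieces. Empty strings that appear between adjacent delimiters (such as the comma and space after "R.N.") are filtered out.

```python
import re

# The answer's pattern, with the group made non-capturing (an adjustment for
# this sketch) so that re.split() does not return the group's text as well.
SPLITTER = re.compile(r"(?:(?<![A-Z])\.)*[\W](?<!\.)|[\W]$")


def split_words(text):
    """Split text on the pattern and drop the empty strings left over."""
    return [piece for piece in SPLITTER.split(text) if piece]


print(split_words("The R.N. lives in the U.S."))
# ['The', 'R.N.', 'lives', 'in', 'the', 'U.S']
print(split_words("The R.N., lives in the U.S. here."))
# ['The', 'R.N.', 'lives', 'in', 'the', 'U.S.', 'here']
```

In the first string the end-of-line branch consumes the final period, so an abbreviation that ends the line loses its last period ("U.S" rather than "U.S."). Lowercasing each piece with str.lower() gives the format asked for in the question.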

