Pyparsing: Extract Variable Length, Variable Content, Variable Whitespace Substring

I need to extract Gleason scores from a flat file of prostatectomy final diagnostic write-ups. These scores always have the word Gleason and two numbers that add up to another numb

Solution 1:

Here is a sample to pull out the patient data and any matching Gleason data.

from pyparsing import *
num = Word(nums)
accessionDate = Combine(num + "/" + num + "/" + num)("accDate")
accessionNumber = Combine("S" + num + "-" + num)("accNum")
patMedicalRecordNum = Combine(num + "/" + num + "-" + num + "-" + num)("patientNum")
gleason = Group("GLEASON" + Optional("SCORE:") + num("left") + "+" + num("right") + "=" + num("total"))
assert 'GLEASON 5+4=9' == gleason
assert 'GLEASON SCORE:  3 + 3 = 6' == gleason

patientData = Group(accessionDate + accessionNumber + patMedicalRecordNum)
assert '01/02/11  S11-4444 20/111-22-3333' == patientData

partMatch = patientData("patientData") | gleason("gleason")

lastPatientData = None
# data is assumed to hold the full text of the flat file
for match in partMatch.searchString(data):
    if match.patientData:
        # remember the most recent patient header
        lastPatientData = match
    elif match.gleason:
        if lastPatientData is None:
            print("bad!")
            continue
        print("{0.accDate}: {0.accNum} {0.patientNum} Gleason({1.left}+{1.right}={1.total})".format(
            lastPatientData.patientData, match.gleason
        ))

Prints:

01/01/11: S11-55555 20/444-55-6666 Gleason(5+4=9)
01/02/11: S11-4444 20/111-22-3333 Gleason(3+3=6)
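
Since the question says the two numbers add up to the third, the matches can also be sanity-checked after parsing. The following is only a sketch under a few assumptions: the function name consistent_scores is made up for illustration, and CaselessKeyword is swapped in for the literal "GLEASON" so that "Gleason" and "gleason" are accepted as well.

```python
from pyparsing import CaselessKeyword, Group, Optional, Word, nums

num = Word(nums)
# Same shape as the gleason expression above, but case-insensitive
gleason = Group(
    CaselessKeyword("GLEASON")
    + Optional(CaselessKeyword("SCORE") + Optional(":"))
    + num("left") + "+" + num("right") + "=" + num("total")
)

def consistent_scores(text):
    """Return (left, right, total) tuples where left + right == total."""
    results = []
    for match in gleason.searchString(text):
        g = match[0]  # the Group holding the named results
        left, right, total = int(g.left), int(g.right), int(g.total)
        if left + right == total:
            results.append((left, right, total))
    return results
```

Entries that fail the arithmetic check are silently dropped here; in a real pipeline it may be better to log them for manual review.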

Solution 2:

Take a look at the SkipTo parse element in pyparsing. If you define a pyparsing structure for the num+num=num part, you should be able to use SkipTo to skip anything between "Gleason" and that. Roughly like this (untested pseudo-pyparsing):

from pyparsing import Word, nums, SkipTo
num = Word(nums)
score = num + "+" + num + "=" + num
Gleason = "Gleason" + SkipTo(score) + score

PyParsing by default skips whitespace anyway, and with SkipTo you can skip anything that doesn't match your desired format.
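
A slightly fuller sketch of the SkipTo idea, run against whole text with searchString. The helper name find_scores and the sample wording in the usage note are assumptions for illustration; CaselessKeyword is used so the capitalization of "Gleason" does not matter, and the skipped text is suppressed so only the numbers come back.

```python
from pyparsing import CaselessKeyword, SkipTo, Word, nums

num = Word(nums)
score = num("left") + "+" + num("right") + "=" + num("total")
# Skip whatever sits between the keyword and the first a+b=c pattern
gleason = CaselessKeyword("gleason") + SkipTo(score).suppress() + score

def find_scores(text):
    """Return (left, right, total) string tuples for every match in text."""
    return [(m.left, m.right, m.total) for m in gleason.searchString(text)]
```

For example, find_scores("Gleason grade (primary + secondary): 4 + 3 = 7") should yield one tuple with the values "4", "3" and "7". Note that SkipTo will happily skip across line breaks, so a "Gleason" mention with no score of its own can end up paired with a score further down the text.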


Solution 3:

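import re

# Spaces are stripped before matching; "+" is escaped so it matches literally.
# records is assumed to be an iterable of report strings.
gleason = re.compile(r"gleason(?:score:?)?(\d+\+\d+=\d+)")
scores = set()
for record in records:
    for line in record.lower().split("\n"):
        found = gleason.search(line.replace(" ", ""))
        if found:  # skip lines that mention Gleason without a score
            scores.add(found.group(1))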

Or something
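
A variant of the regex approach that leaves the text untouched and tolerates whitespace inside the pattern itself might look like the sketch below. The name extract_scores and the 40-character limit on filler text between "Gleason" and the numbers are arbitrary assumptions, not something taken from the question.

```python
import re

# "gleason", then up to 40 non-digit characters, then a + b = c
GLEASON_RE = re.compile(
    r"gleason\D{0,40}?(\d+)\s*\+\s*(\d+)\s*=\s*(\d+)",
    re.IGNORECASE,
)

def extract_scores(text):
    """Return (left, right, total) integer tuples found anywhere in text."""
    return [tuple(int(g) for g in m.groups()) for m in GLEASON_RE.finditer(text)]
```

Because \D also matches newlines, the filler may span a line break; tighten the character class if scores are always on the same line as the keyword.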

