
How To Find Duplicate Words In A Line Using Pandas?

Here is sample JSON data.

id  opened_date  title  exposure  state
1  06/11/2014 9:28 AM  Device rebooted and crashed with error 0x024  critical  open
2  06/11/2014 7:12 AM

Solution 1:

You can use loc to select rows by a condition built with str.contains and the parameter case=False. Finally, if you need a list, use tolist:

li = ['Sensor','0x024']

for i in li:
    print (df.loc[df['title'].str.contains(i, case=False),'id'].tolist())

[3, 4]
[1, 4]
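For a version that runs on its own, here is a minimal sketch of the same lookup. The rows are hypothetical stand-ins for the truncated sample data (only the first title comes from the question), chosen so that the results match the output shown. The helper name `ids_containing` and the `regex=False` argument are my additions; `regex=False` makes pandas treat the search term as literal text rather than as a pattern.

```python
import pandas as pd

# Hypothetical rows: only the title of id 1 appears in the question.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "title": [
        "Device rebooted and crashed with error 0x024",
        "Device not responding",
        "Sensor reading out of range",
        "Sensor failed with error 0x024",
    ],
})

def ids_containing(frame, term):
    """Return the ids of rows whose title contains term, ignoring case."""
    mask = frame["title"].str.contains(term, case=False, regex=False)
    return frame.loc[mask, "id"].tolist()

sensor_ids = ids_containing(df, "sensor")
error_ids = ids_containing(df, "0x024")
print(sensor_ids)
print(error_ids)
```

With this made-up data the two calls return the ids 3 and 4 for the sensor search and 1 and 4 for the error code.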

To store the results, you can use a dict comprehension:

dfs = { i: df.loc[df['title'].str.contains(i, case=False),'id'].tolist() for i in li }

print (dfs['Sensor'])
[3, 4]
print (dfs['0x024'])
[1, 4]

If you need a function, try get_id:

def get_id(id):
    ids = df.loc[df['title'].str.contains(id, case=False),'id'].tolist()
    # parentheses let the expression span several lines
    return ("Input String = %s : Output = ID " % id +
            " and ".join(str(x) for x in ids) +
            " has '%s' in it." % id)

print (get_id('Sensor'))
Input String = Sensor : Output = ID 3 and 4 has 'Sensor' in it.

print (get_id('0x024'))
Input String = 0x024 : Output = ID 1 and 4 has '0x024' in it.

EDIT by comment:

Now it is more complicated, because the conditions have to be combined with a logical and:

import numpy as np

def get_multiple_id(id):
    # split the input and create a list of boolean Series, one per word
    masks = [df['title'].str.contains(x, case=False) for x in id.split()]
    # http://stackoverflow.com/a/20528566/2901002
    cond = np.logical_and.reduce(masks)

    ids = df.loc[cond,'id'].tolist()
    return ("Input String = '%s' : Output = ID " % id +
            ' and '.join(str(x) for x in ids) +
            " has '%s' in it." % id)

print (get_multiple_id('0x024 Sensor'))
Input String = '0x024 Sensor' : Output = ID 4 has '0x024 Sensor' in it.
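The and-combination can also be tried in isolation. The sketch below uses hypothetical rows (only the first title comes from the question) and a helper name of my own, `ids_containing_all`. It builds one boolean mask per word and collapses them with `np.logical_and.reduce`, so a row is kept only if every word occurs in its title.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: only the title of id 1 appears in the question.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "title": [
        "Device rebooted and crashed with error 0x024",
        "Device not responding",
        "Sensor reading out of range",
        "Sensor failed with error 0x024",
    ],
})

def ids_containing_all(frame, words):
    """Return ids of rows whose title contains every whitespace-separated word."""
    # One boolean Series per word; regex=False treats each word literally.
    masks = [frame["title"].str.contains(w, case=False, regex=False)
             for w in words.split()]
    # Element-wise AND across all masks yields a single boolean array.
    cond = np.logical_and.reduce(masks)
    return frame.loc[cond, "id"].tolist()

both_ids = ids_containing_all(df, "0x024 sensor")
print(both_ids)
```

With this made-up data only id 4 has both words in its title.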

With a logical or it is easier, because or in a regular expression is |, so you can use 0x024|Sensor:

def get_multiple_id(id):
    # replace spaces with | so the words become regex alternatives
    ids = df.loc[df['title'].str.contains(id.replace(' ','|'), case=False),'id'].tolist()
    return ("Input String = '%s' : Output = ID " % id +
            ' and '.join(str(x) for x in ids) +
            " has '%s' in it." % id)

print (get_multiple_id('0x024 Sensor'))
Input String = '0x024 Sensor' : Output = ID 1 and 3 and 4 has '0x024 Sensor' in it.
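One caveat with the | approach is that the words are interpreted as a regular expression, so input containing characters such as ( or . may match unexpectedly or raise an error. A possible safeguard, sketched here with hypothetical rows and a helper name of my own (`ids_containing_any`), is to pass each word through `re.escape` before joining.

```python
import re
import pandas as pd

# Hypothetical rows: only the title of id 1 appears in the question.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "title": [
        "Device rebooted and crashed with error 0x024",
        "Device not responding",
        "Sensor reading out of range",
        "Sensor failed with error 0x024",
    ],
})

def ids_containing_any(frame, words):
    """Return ids of rows whose title contains at least one of the words."""
    # Escape each word so regex metacharacters are matched literally,
    # then join the words as alternatives.
    pattern = "|".join(re.escape(w) for w in words.split())
    mask = frame["title"].str.contains(pattern, case=False)
    return frame.loc[mask, "id"].tolist()

any_ids = ids_containing_any(df, "0x024 sensor")
print(any_ids)
```

With this made-up data the call returns ids 1, 3 and 4, and a lone `(` as input simply matches nothing instead of failing as an invalid pattern.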
