Split And Replace In One Dataframe Based On A Condition With Another Dataframe In Pandas

November 26, 2023

I have two dataframes, and both contain SQL tables. This is my first dataframe: Original_Input Cleansed_Input Core_Input Type_input TECHNOLOGIES S.A TECH

Solution 1:

IIUC, we can use some basic regex:

First, we remove any leading and trailing whitespace and split on whitespace. This returns a Series of lists, which we can flatten using chain.from_iterable.

Then we use regex with the pandas methods str.findall and str.replace to match your inputs.

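import re
from itertools import chain

# Split each extension on whitespace and flatten the lists into one list of tokens
ext = df2['Name_Extension'].str.strip().str.split(r'\s+')
ext = list(chain.from_iterable(ext))

# First matching token per row (NaN when nothing matches)
df['Type_Input'] = df['Cleansed_Input'].str.findall('|'.join(ext), flags=re.IGNORECASE).str[0]

# Cleansed_Input with every matching token removed
s = df['Cleansed_Input'].str.replace('|'.join(ext), '', regex=True, case=False).str.strip()

# Fill Core_Input only where a type was found
df.loc[df['Type_Input'].notna(), 'Core_Input'] = s

print(df)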

          Original_Input      Cleansed_Input type_input      core_input
0       TECHNOLOGIES S.A     TECHNOLOGIES SA        NaN             NaN
1  A & J INDUSTRIES, LLC  A J INDUSTRIES LLC        LLC  A J INDUSTRIES
2    A&S DENTAL SERVICES  AS DENTAL SERVICES        NaN             NaN
3     A.M.G Médicale Inc     AMG Mdicale Inc        Inc     AMG Mdicale
4       AAREN SCIENTIFIC    AAREN SCIENTIFIC        NaN             NaN
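
For reference, here is the same idea as a self-contained sketch. The contents of df2 are not shown above, so the Name_Extension values below are assumed sample data. As a variation that is not part of the answer, the tokens are escaped and the pattern is anchored to a whole word at the end of the string, so a short token such as CO cannot match inside a longer word, and str.extract is swapped in for str.findall(...).str[0].

```python
import re
import pandas as pd

# Sample data; the Name_Extension values are assumed for illustration.
df = pd.DataFrame({
    "Cleansed_Input": [
        "TECHNOLOGIES SA",
        "A J INDUSTRIES LLC",
        "AS DENTAL SERVICES",
        "AMG Mdicale Inc",
        "AAREN SCIENTIFIC",
    ]
})
df2 = pd.DataFrame({"Name_Extension": ["CO LLC", "Pvt Ltd", "Corp", "Inc"]})

# Flatten the extensions into unique tokens, longest first.
tokens = df2["Name_Extension"].str.strip().str.split().explode().unique()
tokens = sorted(tokens, key=len, reverse=True)

# Whole-word match at the end of the string.
pattern = r"\b(" + "|".join(re.escape(t) for t in tokens) + r")\s*$"

# expand=False returns a Series with the captured token (NaN if no match).
df["Type_Input"] = df["Cleansed_Input"].str.extract(
    pattern, flags=re.IGNORECASE, expand=False
)

# Strip the matched suffix; leave Core_Input empty where nothing matched.
stripped = (
    df["Cleansed_Input"]
    .str.replace(pattern, "", regex=True, flags=re.IGNORECASE)
    .str.strip()
)
df["Core_Input"] = stripped.where(df["Type_Input"].notna())

print(df)
```

With this sample data, only the LLC and Inc rows get a Type_Input and Core_Input; the other rows stay NaN, as in the output above.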

Solution 2:

Assuming you have read in the dataframes as df1 and df2, the first step is to create two lists: one for Name_Extension (keys) and one for Company_Type (values), as shown below:

keys = list(df2['Name_Extension'])
keys = [key.strip().lower() for key in keys]
print (keys)
>>> ['co llc', 'pvt ltd', 'corp', 'co ltd', 'inc', 'co']
values = list(df2['Company_Type']) 
values = [value.strip().lower() for value in values]
print (values)
>>> ['company llc', 'private limited', 'corporation', 'company limited', 'incorporated', 'company']

The next step is to map each value in Cleansed_Input to a Core_Input and a Type_Input. We can use the pandas apply method on the Cleansed_Input column. To get the Core_Input:

def get_core_input(data):
    # preprocess
    data = str(data).strip().lower()
    # check if the data ends with any of the keys
    for key in keys:
        if data.endswith(key):
            # cut off the matched key and return the part before it
            return data[:-len(key)].strip()
    return None

df1['Core_Input'] = df1['Cleansed_Input'].apply(get_core_input)
print (df1)
>>>
          Original_Input      Cleansed_Input   Core_Input  Type_input
0       TECHNOLOGIES S.A     TECHNOLOGIES SA         None         NaN
1  A & J INDUSTRIES, LLC  A J INDUSTRIES LLC         None         NaN
2    A&S DENTAL SERVICES  AS DENTAL SERVICES         None         NaN
3     A.M.G Médicale Inc     AMG Mdicale Inc  amg mdicale         NaN
4       AAREN SCIENTIFIC    AAREN SCIENTIFIC         None         NaN

To get the Type_input:

def get_type_input(data):
    # preprocess
    data = str(data).strip().lower()
    # check if the data ends with any of the keys
    for idx in range(len(keys)):
        if data.endswith(keys[idx]):
            # return the value of the corresponding matched key
            return values[idx].strip()
    return None

df1['Type_input'] = df1['Cleansed_Input'].apply(get_type_input)
print (df1)
>>>
          Original_Input      Cleansed_Input   Core_Input    Type_input
0       TECHNOLOGIES S.A     TECHNOLOGIES SA         None          None
1  A & J INDUSTRIES, LLC  A J INDUSTRIES LLC         None          None
2    A&S DENTAL SERVICES  AS DENTAL SERVICES         None          None
3     A.M.G Médicale Inc     AMG Mdicale Inc  amg mdicale  incorporated
4       AAREN SCIENTIFIC    AAREN SCIENTIFIC         None          None

This solution is easy to follow, but I am sure it is not the most efficient way to solve the problem. Hopefully it solves your use case.
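
As a rough sketch of how the two lookups could be combined into one pass: the helper name split_suffix and the mapping dict are hypothetical, and the df2 contents are rebuilt from the keys and values printed earlier. Unlike the functions above, this variant only accepts an extension that ends the name as a whole word, which is an added assumption about the desired behaviour.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Cleansed_Input": [
        "TECHNOLOGIES SA",
        "A J INDUSTRIES LLC",
        "AS DENTAL SERVICES",
        "AMG Mdicale Inc",
        "AAREN SCIENTIFIC",
    ]
})
# Rebuilt from the keys and values lists printed earlier.
df2 = pd.DataFrame({
    "Name_Extension": ["co llc", "pvt ltd", "corp", "co ltd", "inc", "co"],
    "Company_Type": ["company llc", "private limited", "corporation",
                     "company limited", "incorporated", "company"],
})

# Extension -> company type, normalized the same way as the input.
mapping = {
    k.strip().lower(): v.strip().lower()
    for k, v in zip(df2["Name_Extension"], df2["Company_Type"])
}
# Try longer extensions first in case one extension is a suffix of another.
keys = sorted(mapping, key=len, reverse=True)


def split_suffix(text):
    """Return (core, type) for the first extension ending `text` as whole words."""
    data = str(text).strip().lower()
    for key in keys:
        if data == key or data.endswith(" " + key):
            return data[: len(data) - len(key)].strip(), mapping[key]
    return None, None


pairs = df1["Cleansed_Input"].apply(split_suffix)
df1["Core_Input"] = [core for core, _ in pairs]
df1["Type_input"] = [kind for _, kind in pairs]
print(df1)
```

Because the match requires a preceding space, a name such as "zinc" is not treated as ending in the extension "inc".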

Solution 3:

Here is a possible solution you can implement:

import pandas as pd

df = pd.DataFrame({
    "Original_Input": ["TECHNOLOGIES S.A", 
                       "A & J INDUSTRIES, LLC", 
                       "A&S DENTAL SERVICES", 
                       "A.M.G Médicale Inc", 
                       "AAREN SCIENTIFIC"],
    "Cleansed_Input": ["TECHNOLOGIES SA", 
                       "A J INDUSTRIES LLC", 
                       "AS DENTAL SERVICES", 
                       "AMG Mdicale Inc", 
                       "AAREN SCIENTIFIC"]
})

df_2 = pd.DataFrame({ 
    "Name_Extension": ["llc",
                       "Pvt ltd",
                       "Corp",
                       "CO Ltd",
                       "inc", 
                       "CO",
                       "SA"],
    "Company_Type": ["Company LLC",
                     "Private Limited",
                     "Corporation",
                     "Company Limited",
                     "Incorporated",
                     "Company",
                     "Anonymous Company"],
    "Priority": [2, 8, 4, 3, 5, 1, 9]
})

# Preprocessing text
df["lower_input"] = df["Cleansed_Input"].str.lower()
df_2["lower_extension"] = df_2["Name_Extension"].str.lower()

# Getting the lowest priority matching the end of the string
extensions_list = [ (priority, extension.lower_extension.values[0]) 
                    for priority, extension in df_2.groupby("Priority") ]
df["extension_priority"] = df["lower_input"] \
    .apply(lambda p: next(( priority 
                            for priority, extension in extensions_list 
                            if p.endswith(extension)), None))

# Merging both dataframes based on priority. This step can be ignored if you only need
# one column from the df_2. In that case, just give the column you require instead of
# `priority` in the previous step.
df = df.merge(df_2, "left", left_on="extension_priority", right_on="Priority")

# Removing the matched extensions from the `Cleansed_Input` string
df["aux"] = df["lower_extension"].apply(lambda p: -len(p) if isinstance(p, str) else 0)
df["Core_Input"] = df.apply(
    lambda p: p["Cleansed_Input"]
              if p["aux"] == 0 else p["Cleansed_Input"][:p["aux"]].strip(),
    axis=1
)

# Selecting required columns
df[[ "Original_Input", "Core_Input", "Company_Type", "Name_Extension" ]]

I assumed that the "Priority" column would have unique values. However, if this isn't the case, just sort the priorities and create an index based on that order like this:

df_2.sort_values("Priority").assign(index = range(df_2.shape[0]))
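
A minimal sketch of how that unique index could replace Priority in the matching and merging steps. The extension table with a duplicated priority is hypothetical, as are the column name extension_index and the variable ranked.

```python
import pandas as pd

# Hypothetical extension table in which two rows share Priority 1.
df_2 = pd.DataFrame({
    "Name_Extension": ["llc", "inc", "CO"],
    "Company_Type": ["Company LLC", "Incorporated", "Company"],
    "Priority": [2, 1, 1],
})

# A stable sort keeps the original order among ties; `index` is then unique.
ranked = (
    df_2.sort_values("Priority", kind="stable")
        .assign(index=range(df_2.shape[0]))
)
ranked["lower_extension"] = ranked["Name_Extension"].str.lower()

# (unique key, extension) pairs in priority order, replacing the groupby step.
extensions_list = list(zip(ranked["index"], ranked["lower_extension"]))

df = pd.DataFrame({
    "Cleansed_Input": ["AMG Mdicale Inc", "ACME CO", "AAREN SCIENTIFIC"]
})
# NaN marks rows where no extension matches the end of the string.
df["extension_index"] = df["Cleansed_Input"].str.lower().apply(
    lambda p: next((i for i, ext in extensions_list if p.endswith(ext)),
                   float("nan"))
)

# Merge on the unique key instead of the possibly duplicated Priority.
out = df.merge(ranked, how="left", left_on="extension_index", right_on="index")
print(out[["Cleansed_Input", "Company_Type"]])
```

Merging on the generated key means each input row matches at most one extension row, even when several extensions share a priority.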

Also, next time please give the example data in a format that anyone can load easily. It was cumbersome to handle the format you sent.

EDIT: Not related to the question, but it might be of some help. You can simplify steps 1 to 4 with the following:

data['Cleansed_Input'] = (
    data["Original_Input"]
    .str.replace(r"[^\w ]+", "", regex=True)  # removes characters that are neither word characters nor spaces
    .str.replace(r" +", " ", regex=True)      # collapses duplicated spaces
    .str.strip()                              # removes spaces before or after the string
)
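
To check the cleansing chain on a few rows from the question, a small sketch follows. The wrapper function cleanse is a hypothetical name introduced here; regex=True is passed explicitly because recent pandas versions treat the pattern as a literal string otherwise.

```python
import pandas as pd


def cleanse(names: pd.Series) -> pd.Series:
    """Drop punctuation, collapse repeated spaces, trim both ends."""
    return (
        names
        .str.replace(r"[^\w ]+", "", regex=True)
        .str.replace(r" +", " ", regex=True)
        .str.strip()
    )


data = pd.DataFrame({
    "Original_Input": [
        "TECHNOLOGIES S.A",
        "A & J INDUSTRIES, LLC",
        "A&S DENTAL SERVICES",
    ]
})
data["Cleansed_Input"] = cleanse(data["Original_Input"])
print(data["Cleansed_Input"].tolist())
# ['TECHNOLOGIES SA', 'A J INDUSTRIES LLC', 'AS DENTAL SERVICES']
```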

EDIT 2: SQL version of the solution (I'm using PostgreSQL, but I used standard SQL operators, so the differences shouldn't be that huge).

SELECT t.Original_Name,
       t.Cleansed_Input,
       t.Name_Extension,
       t.Company_Type,
       t.Priority
FROM (
    SELECT df.Original_Name,
           df.Cleansed_Input,
           df_2.Name_Extension,
           df_2.Company_Type,
           df_2.Priority,
           ROW_NUMBER() OVER (PARTITION BY df.Original_Name ORDER BY df_2.Priority) AS rn
    FROM (VALUES ('TECHNOLOGIES S.A', 'TECHNOLOGIES SA'), ('A & J INDUSTRIES, LLC', 'A J INDUSTRIES LLC'),
                 ('A&S DENTAL SERVICES', 'AS DENTAL SERVICES'), ('A.M.G Médicale Inc', 'AMG Mdicale Inc'),
                 ('AAREN SCIENTIFIC', 'AAREN SCIENTIFIC')) df(Original_Name, Cleansed_Input)
         LEFT JOIN (VALUES ('llc', 'Company LLC', '2'), ('Pvt ltd', 'Private Limited', '8'), ('Corp', 'Corporation', '4'),
                           ('CO Ltd', 'Company Limited', '3'), ('inc', 'Incorporated', '5'), ('CO', 'Company', '1'),
                           ('SA', 'Anonymous Company', '9')) df_2(Name_Extension, Company_Type, Priority)
            ON lower(df.Cleansed_Input) LIKE ( '%' || lower(df_2.Name_Extension) )
) t
WHERE rn = 1
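
For readers who want to compare, the logic of the query (suffix join, then keep the lowest priority per name) can be approximated in pandas roughly as follows. This is only a sketch: it reuses the sample values from the query, stores Priority as integers, and the variable names pairs, matches, best and result are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "Original_Name": ["TECHNOLOGIES S.A", "A & J INDUSTRIES, LLC",
                      "A&S DENTAL SERVICES", "A.M.G Médicale Inc",
                      "AAREN SCIENTIFIC"],
    "Cleansed_Input": ["TECHNOLOGIES SA", "A J INDUSTRIES LLC",
                       "AS DENTAL SERVICES", "AMG Mdicale Inc",
                       "AAREN SCIENTIFIC"],
})
df_2 = pd.DataFrame({
    "Name_Extension": ["llc", "Pvt ltd", "Corp", "CO Ltd", "inc", "CO", "SA"],
    "Company_Type": ["Company LLC", "Private Limited", "Corporation",
                     "Company Limited", "Incorporated", "Company",
                     "Anonymous Company"],
    "Priority": [2, 8, 4, 3, 5, 1, 9],
})

# Every (name, extension) pair; keep pairs where the name ends with the extension.
pairs = df.merge(df_2, how="cross")
is_suffix = [
    name.lower().endswith(ext.lower())
    for name, ext in zip(pairs["Cleansed_Input"], pairs["Name_Extension"])
]
matches = pairs[is_suffix]

# Counterpart of ROW_NUMBER() ... ORDER BY Priority with rn = 1.
best = matches.sort_values("Priority").drop_duplicates("Original_Name")

# Left join back so that names without any match are kept.
result = df.merge(
    best[["Original_Name", "Name_Extension", "Company_Type", "Priority"]],
    how="left",
    on="Original_Name",
)
print(result)
```

The cross join grows with the product of both table sizes, so this form is mainly useful for small extension tables.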

