Split And Replace In One Dataframe Based On A Condition With Another Dataframe In Pandas
Solution 1:
IIUC, we can use some basic regex.
First we strip any leading and trailing whitespace from Name_Extension and split on whitespace. This returns a list of lists, which we can flatten using chain.from_iterable.
Then we use the pandas methods str.findall and str.replace with the resulting regex to match your inputs.
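import re
from itertools import chain

# split each extension on whitespace and flatten the list of lists into single tokens
ext = df2['Name_Extension'].str.strip().str.split(r'\s+')
ext = list(chain.from_iterable(ext))
# one alternation pattern that matches any of the tokens
pattern = '|'.join(ext)
df['Type_Input'] = df['Cleansed_Input'].str.findall(pattern, flags=re.IGNORECASE).str[0]
s = df['Cleansed_Input'].str.replace(pattern, '', regex=True, case=False).str.strip()
# only fill Core_Input for rows where an extension was found
df.loc[df['Type_Input'].notna(), 'Core_Input'] = s
print(df)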
          Original_Input      Cleansed_Input  type_input      core_input
0       TECHNOLOGIES S.A     TECHNOLOGIES SA         NaN             NaN
1  A & J INDUSTRIES, LLC  A J INDUSTRIES LLC         LLC  A J INDUSTRIES
2    A&S DENTAL SERVICES  AS DENTAL SERVICES         NaN             NaN
3     A.M.G Médicale Inc     AMG Mdicale Inc         Inc     AMG Mdicale
4       AAREN SCIENTIFIC    AAREN SCIENTIFIC         NaN             NaN
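One caveat with an unanchored alternation like the one above: a short token such as CO can also match inside a longer word. Here is a minimal sketch, using only the standard re module on plain strings, of a pattern that accepts an extension only as a whole word at the end of the name. The token list and the helper split_extension are hypothetical stand-ins for ext, not part of the original answer, and only a single trailing token is stripped.

```python
import re

# Hypothetical extension tokens, standing in for the flattened `ext` list.
ext = ["co", "llc", "pvt", "ltd", "corp", "inc"]

# Escape each token, try longer tokens first, and require the match to be
# a whole word sitting at the very end of the string.
alternation = "|".join(re.escape(tok) for tok in sorted(ext, key=len, reverse=True))
pattern = re.compile(rf"\b(?:{alternation})\b\s*$", flags=re.IGNORECASE)


def split_extension(name):
    """Return (core, extension), or (name, None) when no trailing extension matches."""
    match = pattern.search(name)
    if match is None:
        return name, None
    return name[:match.start()].strip(), match.group(0).strip()


print(split_extension("A J INDUSTRIES LLC"))
print(split_extension("COSTCO WHOLESALE"))
```

With pandas, the same compiled pattern could be passed to str.extract or str.replace instead of the unanchored one, assuming whole-word, end-of-string matching is what you want.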
Solution 2:
Assuming you have read in the dataframes as df1 and df2, the first step is to create two lists: one for the Name_Extension column (keys) and one for the Company_Type column (values), as shown below:
keys = list(df2['Name_Extension'])
keys = [key.strip().lower() for key in keys]
print (keys)
>>> ['co llc', 'pvt ltd', 'corp', 'co ltd', 'inc', 'co']
values = list(df2['Company_Type'])
values = [value.strip().lower() for value in values]
print (values)
>>> ['company llc', 'private limited', 'corporation', 'company limited', 'incorporated', 'company']
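Since the two lists are index-aligned, one optional way to keep them in sync is to pair them in a dictionary. This is only a sketch: the lists below copy the printed output above, and the name extension_to_type is hypothetical.

```python
# Stand-ins for the `keys` and `values` lists printed above.
keys = ['co llc', 'pvt ltd', 'corp', 'co ltd', 'inc', 'co']
values = ['company llc', 'private limited', 'corporation',
          'company limited', 'incorporated', 'company']

# Pair each extension with its company type.
extension_to_type = dict(zip(keys, values))

print(extension_to_type.get('inc'))   # incorporated
print(extension_to_type.get('gmbh'))  # None (unknown extension)
```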
The next step is to map each value in Cleansed_Input to a Core_Input and a Type_Input. We can use the pandas apply method on the Cleansed_Input column.
To get the Core_input:
def get_core_input(data):
    # preprocess
    data = str(data).strip().lower()
    # check if the data ends with any of the keys
    for key in keys:
        if data.endswith(key):
            # cut the matched key off the end; slicing avoids splitting on an
            # earlier occurrence of the key inside the name
            return data[:-len(key)].strip()
    return None

df1['Core_Input'] = df1['Cleansed_Input'].apply(get_core_input)
print (df1)
>>>
Original_Input Cleansed_Input Core_Input Type_input
0 TECHNOLOGIES S.A TECHNOLOGIES SA None NaN
1 A & J INDUSTRIES, LLC A J INDUSTRIES LLC None NaN
2 A&S DENTAL SERVICES AS DENTAL SERVICES None NaN
3 A.M.G Médicale Inc AMG Mdicale Inc amg mdicale NaN
4 AAREN SCIENTIFIC AAREN SCIENTIFIC None NaN
To get the Type_input:
def get_type_input(data):
    # preprocess
    data = str(data).strip().lower()
    # check if the data ends with any of the keys
    for idx in range(len(keys)):
        if data.endswith(keys[idx]):
            # return the value of the corresponding matched key
            return values[idx].strip()
    return None

df1['Type_input'] = df1['Cleansed_Input'].apply(get_type_input)
print (df1)
>>>
          Original_Input      Cleansed_Input   Core_Input    Type_input
0       TECHNOLOGIES S.A     TECHNOLOGIES SA         None          None
1  A & J INDUSTRIES, LLC  A J INDUSTRIES LLC         None          None
2    A&S DENTAL SERVICES  AS DENTAL SERVICES         None          None
3     A.M.G Médicale Inc     AMG Mdicale Inc  amg mdicale  incorporated
4       AAREN SCIENTIFIC    AAREN SCIENTIFIC         None          None
This solution is easy to follow, but I am sure it is not the most efficient way to solve the problem. Hopefully it covers your use case.
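The two helpers above each scan the keys separately. As a rough sketch of one possible tightening, the function below finds the matching extension once and returns both pieces together. The names EXTENSION_TO_TYPE and split_name are hypothetical, and the extra requirement of a space before the extension is my assumption, added so that a name like COSTCO is not treated as ending in CO.

```python
# Hypothetical lookup, mirroring the keys and values printed earlier.
EXTENSION_TO_TYPE = {
    'co llc': 'company llc',
    'pvt ltd': 'private limited',
    'corp': 'corporation',
    'co ltd': 'company limited',
    'inc': 'incorporated',
    'co': 'company',
}


def split_name(data):
    """Return (core_input, type_input); both are None when no extension matches."""
    text = str(data).strip().lower()
    # Try longer extensions first, so a longer extension wins whenever one
    # extension happens to be a suffix of another.
    for key in sorted(EXTENSION_TO_TYPE, key=len, reverse=True):
        # Require a space before the extension (assumption: whole words only).
        if text.endswith(' ' + key):
            return text[:-len(key)].strip(), EXTENSION_TO_TYPE[key]
    return None, None


print(split_name("AMG Mdicale Inc"))
print(split_name("COSTCO"))
```

Applied with df1['Cleansed_Input'].apply(split_name), this would yield a Series of tuples that can then be unpacked into the two columns.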
Solution 3:
Here is a possible solution you can implement:
import pandas as pd

df = pd.DataFrame({
"Original_Input": ["TECHNOLOGIES S.A",
"A & J INDUSTRIES, LLC",
"A&S DENTAL SERVICES",
"A.M.G Médicale Inc",
"AAREN SCIENTIFIC"],
"Cleansed_Input": ["TECHNOLOGIES SA",
"A J INDUSTRIES LLC",
"AS DENTAL SERVICES",
"AMG Mdicale Inc",
"AAREN SCIENTIFIC"]
})
df_2 = pd.DataFrame({
"Name_Extension": ["llc",
"Pvt ltd",
"Corp",
"CO Ltd",
"inc",
"CO",
"SA"],
"Company_Type": ["Company LLC",
"Private Limited",
"Corporation",
"Company Limited",
"Incorporated",
"Company",
"Anonymous Company"],
"Priority": [2, 8, 4, 3, 5, 1, 9]
})
# Preprocessing text
df["lower_input"] = df["Cleansed_Input"].str.lower()
df_2["lower_extension"] = df_2["Name_Extension"].str.lower()
# Getting the lowest priority matching the end of the string
extensions_list = [ (priority, extension.lower_extension.values[0])
for priority, extension in df_2.groupby("Priority") ]
df["extension_priority"] = df["lower_input"] \
.apply(lambda p: next(( priority
for priority, extension in extensions_list
if p.endswith(extension)), None))
# Merging both dataframes based on priority. This step can be ignored if you only need
# one column from the df_2. In that case, just give the column you require instead of
# `priority` in the previous step.
df = df.merge(df_2, "left", left_on="extension_priority", right_on="Priority")
# Removing the matched extensions from the `Cleansed_Input` string
df["aux"] = df["lower_extension"].apply(lambda p: -len(p) if isinstance(p, str) else 0)
df["Core_Input"] = df.apply(
    lambda p: p["Cleansed_Input"]
    if p["aux"] == 0 else p["Cleansed_Input"][:p["aux"]].strip(),
    axis=1
)
# Selecting required columns
df[[ "Original_Input", "Core_Input", "Company_Type", "Name_Extension" ]]
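To make the priority rule easier to see in isolation, here is a small sketch in plain Python of the selection step: among all extensions that end the name, pick the one with the lowest priority number. The list extensions copies the values from df_2 above, and best_extension is a hypothetical helper, not part of the solution itself.

```python
# (priority, extension) pairs, mirroring df_2 above.
extensions = [(2, "llc"), (8, "pvt ltd"), (4, "corp"), (3, "co ltd"),
              (5, "inc"), (1, "co"), (9, "sa")]


def best_extension(name, candidates=extensions):
    """Return the (priority, extension) pair with the lowest priority whose
    extension ends `name`, or None when nothing matches."""
    lowered = name.lower()
    # sorted() compares tuples by their first element, i.e. by priority,
    # and next() stops at the first (lowest-priority) match.
    return next((pair for pair in sorted(candidates)
                 if lowered.endswith(pair[1])), None)


print(best_extension("A J INDUSTRIES LLC"))  # (2, 'llc')
print(best_extension("AAREN SCIENTIFIC"))    # None
```

Like the pandas version, this checks only that the lowered string ends with the extension, so it does not guard against matches inside a longer final word.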
I assumed that the "Priority" column would have unique values. However, if this isn't the case, just sort the priorities and create an index based on that order like this:
df_2.sort_values("Priority").assign(index = range(df_2.shape[0]))
Also, next time please share the sample data in a format that anyone can load easily. The format you sent was cumbersome to handle.
EDIT: Not related with the question, but it might be of some help. You can simplify the steps from 1 to 4 with the following:
data['Cleansed_Input'] = (
    data["Original_Input"]
    .str.replace(r"[^\w ]+", "", regex=True)  # removes everything except word characters and spaces
    .str.replace(r" +", " ", regex=True)      # collapses repeated spaces
    .str.strip()                              # removes spaces before or after the string
)
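The same three cleansing steps can be tried on single strings with the standard re module, which makes the effect of each pattern easy to check. This is only a sketch: cleanse and its ascii_only parameter are hypothetical names. Note that in Python 3 \w also matches accented letters by default; passing re.ASCII restricts it to ASCII, which appears to be what produced "AMG Mdicale Inc" in the sample data.

```python
import re


def cleanse(name, ascii_only=False):
    """Drop punctuation, collapse repeated spaces, and trim both ends."""
    # With re.ASCII, \w matches only [a-zA-Z0-9_], so accented letters are removed too.
    flags = re.ASCII if ascii_only else 0
    no_punct = re.sub(r"[^\w ]+", "", name, flags=flags)
    single_spaced = re.sub(r" +", " ", no_punct)
    return single_spaced.strip()


print(cleanse("A & J INDUSTRIES, LLC"))  # A J INDUSTRIES LLC
print(cleanse("TECHNOLOGIES S.A"))       # TECHNOLOGIES SA
```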
EDIT 2: SQL version of the solution (I'm using PostgreSQL, but I used standard SQL operators, so the differences shouldn't be that huge).
SELECT t.Original_Name,
       t.Cleansed_Input,
       t.Name_Extension,
       t.Company_Type,
       t.Priority
FROM (
    SELECT df.Original_Name,
           df.Cleansed_Input,
           df_2.Name_Extension,
           df_2.Company_Type,
           df_2.Priority,
           ROW_NUMBER() OVER (PARTITION BY df.Original_Name ORDER BY df_2.Priority) AS rn
    FROM (VALUES ('TECHNOLOGIES S.A', 'TECHNOLOGIES SA'), ('A & J INDUSTRIES, LLC', 'A J INDUSTRIES LLC'),
                 ('A&S DENTAL SERVICES', 'AS DENTAL SERVICES'), ('A.M.G Médicale Inc', 'AMG Mdicale Inc'),
                 ('AAREN SCIENTIFIC', 'AAREN SCIENTIFIC')) df(Original_Name, Cleansed_Input)
    LEFT JOIN (VALUES ('llc', 'Company LLC', '2'), ('Pvt ltd', 'Private Limited', '8'), ('Corp', 'Corporation', '4'),
                      ('CO Ltd', 'Company Limited', '3'), ('inc', 'Incorporated', '5'), ('CO', 'Company', '1'),
                      ('SA', 'Anonymous Company', '9')) df_2(Name_Extension, Company_Type, Priority)
        ON lower(df.Cleansed_Input) LIKE ( '%' || lower(df_2.Name_Extension) )
) t
WHERE rn = 1