Pyspark: Select Part Of The String(file Path) Column Values
How can I select the characters (the file path) after the 4th backslash, counting from the left, in a column of a Spark DataFrame? Sample row
Solution 1:
You can use a regular expression with regexp_replace to remove everything up to and including the 4th backslash, e.g.
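from pyspark.sql import functions as F
# Strip the two leading backslashes and the next two alphanumeric segments,
# i.e. everything up to and including the 4th backslash.
df = df.withColumn('sub_path', F.regexp_replace("path", "^\\\\\\\\[a-zA-Z0-9]+\\\\[a-zA-Z0-9]+\\\\", ""))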
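To check what this pattern removes without starting Spark, the same pattern string can be tried with Python's re module. This is only a sketch: both sample paths are made up for illustration, and it assumes the constructs used here (anchors, character classes, escaped backslashes) behave the same in Python's re as in the Java regex engine behind regexp_replace. Note that the character class only allows letters and digits, so a path whose first segments contain other characters (such as a hyphen) is left unchanged.

```python
import re

# Same pattern string as passed to regexp_replace in the answer.
PATTERN = "^\\\\\\\\[a-zA-Z0-9]+\\\\[a-zA-Z0-9]+\\\\"


def strip_prefix(path: str) -> str:
    """Remove everything up to and including the 4th backslash, if the pattern matches."""
    return re.sub(PATTERN, "", path)


# Hypothetical sample values (not from the original question).
plain = "\\\\server01\\share\\folder\\file.txt"
hyphen = "\\\\server-01\\share\\folder\\file.txt"

print(strip_prefix(plain))   # prefix removed
print(strip_prefix(hyphen))  # no match, so the value is returned unchanged
```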
You can also make this solution more flexible by building the pattern from the number of backslashes to skip, e.g.
from pyspark.sql import functions as F
# number of backslashes to consider
no_of_slashes = 4
# We build the regular expression by repeating `"[a-zA-Z0-9]+\\\\"`.
# NB: we subtract 2 since the pattern already starts with the first 2 backslashes.
df = df.withColumn('sub_path', F.regexp_replace("path", "^\\\\\\\\" + ("[a-zA-Z0-9]+\\\\" * (no_of_slashes - 2)), ""))
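If the leading path segments may contain characters other than letters and digits, one possible variation (not part of the answer above) is to match "anything except a backslash" for each segment. The helper name build_prefix_pattern, its segment parameter, and the sample path are hypothetical, and the sketch assumes every path starts with two backslashes. It uses Python's re module only so the pattern can be checked locally.

```python
import re


def build_prefix_pattern(no_of_slashes: int, segment: str = "[^\\\\]+") -> str:
    """Build a regex matching a path prefix up to and including the n-th backslash.

    Assumes the path starts with two backslashes, which already count as
    the first two, so only (no_of_slashes - 2) segments are appended.
    """
    if no_of_slashes < 2:
        raise ValueError("no_of_slashes must be at least 2")
    return "^\\\\\\\\" + (segment + "\\\\") * (no_of_slashes - 2)


# Hypothetical sample value with a hyphen and an underscore in the prefix.
sample = "\\\\server-01\\my_share\\folder\\file.txt"

print(re.sub(build_prefix_pattern(4), "", sample))  # strips through the 4th backslash
print(re.sub(build_prefix_pattern(5), "", sample))  # strips through the 5th backslash
```

The string returned by build_prefix_pattern(no_of_slashes) could then be passed as the pattern argument of F.regexp_replace in place of the concatenated expression.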
Let me know if this works for you.