Pyspark: Select Part Of The String(file Path) Column Values

December 26, 2023

How can I select the characters (the file path) that come after the 4th backslash, counting from the left, in a string column of a Spark DataFrame?

Sample row

Solution 1:

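You can use a regular expression with regexp_replace to strip the leading part of the path. The pattern below matches the two leading backslashes, then two alphanumeric segments, each followed by a backslash, and replaces that prefix with an empty string, e.g.: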

from pyspark.sql import functions as F

df = df.withColumn('sub_path',F.regexp_replace("path","^\\\\\\\\[a-zA-Z0-9]+\\\\[a-zA-Z0-9]+\\\\",""))
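The pattern can be sanity-checked without a Spark session by using Python's `re` module, since this particular expression (an anchor, character classes and escaped backslashes) behaves the same way in Python and in the Java regex engine that Spark uses. This is a minimal sketch; the sample path is a hypothetical UNC-style value, not taken from the question.

```python
import re

# Same pattern string as the one passed to F.regexp_replace above.
PATTERN = "^\\\\\\\\[a-zA-Z0-9]+\\\\[a-zA-Z0-9]+\\\\"

# Hypothetical sample value; the real column contents may differ.
sample = r"\\server01\share01\reports\2023\file.csv"

# Remove everything up to and including the 4th backslash.
sub_path = re.sub(PATTERN, "", sample)
print(sub_path)
```

If the printed value still contains the server and share names, the path probably has characters outside `[a-zA-Z0-9]` in one of its first two segments.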

You can also make this solution more flexible by turning the number of slashes into a variable, e.g.:

from pyspark.sql import functions as F

no_of_slashes = 4  # number of slashes to consider here
# We build the regular expression by repeating `"[a-zA-Z0-9]+\\\\"`.
# NB: we subtract 2 since the pattern already starts with the first 2 slashes.
df = df.withColumn('sub_path',F.regexp_replace("path","^\\\\\\\\"+("[a-zA-Z0-9]+\\\\"*(no_of_slashes-2)),""))
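One limitation worth noting: `[a-zA-Z0-9]+` does not match folder names that contain hyphens, underscores, dots or spaces, in which case nothing is replaced. The sketch below wraps the pattern construction in a hypothetical helper, `build_prefix_pattern` (not part of PySpark), and compares the original segment pattern with a looser one, `[^\\]+`, that accepts any run of non-backslash characters. It uses Python's `re` module on a made-up sample path so it can run without Spark; the resulting string could then be passed to `F.regexp_replace` in the same way as above.

```python
import re


def build_prefix_pattern(no_of_slashes: int, segment: str = "[a-zA-Z0-9]+") -> str:
    """Return a regex matching everything up to and including the n-th backslash.

    Hypothetical helper: assumes the path starts with two backslashes,
    which count as the first 2 of `no_of_slashes`.
    """
    if no_of_slashes < 2:
        raise ValueError("no_of_slashes must be at least 2")
    return "^\\\\\\\\" + (segment + "\\\\") * (no_of_slashes - 2)


# Hypothetical sample value with a hyphen and an underscore in the folder names.
sample = r"\\srv-01\data_share\a\b\c.txt"

strict = build_prefix_pattern(4)                     # alphanumeric segments only
loose = build_prefix_pattern(4, segment="[^\\\\]+")  # any non-backslash characters

strict_result = re.sub(strict, "", sample)  # no match, so the path is unchanged
loose_result = re.sub(loose, "", sample)    # prefix up to the 4th backslash removed
print(strict_result)
print(loose_result)
```

Whether the looser segment pattern is appropriate depends on the data; it is offered here as an assumption about what real paths may look like, not as part of the original answer.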

Let me know if this works for you.
