How To Compare Strings Without Case Sensitivity In Spark RDD?
I have the following dataset:
drug_name,num_prescriber,total_cost
AMBIEN,2,300
BENZTROPINE MESYLATE,1,1500
CHLORPROMAZINE,2,3000
I wanted to find out the number of A's and B's from the above Dat
Solution 1:
To convert a column to lower case, use the lower() function from pyspark.sql.functions. So you could try:
import pyspark.sql.functions as F
logData = spark.createDataFrame(
    [
        (0, 'aB'),
        (1, 'AaA'),
        (2, 'bA'),
        (3, 'bB')
    ],
    ('id', 'value')
)
# Lowercase the column first so the match ignores case
numAs = logData.filter(F.lower(logData.value).contains('a')).count()
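The question asks about an RDD rather than a DataFrame, and the same idea carries over: normalize the case of both sides before comparing. Below is a minimal sketch in plain Python with no Spark dependency. The helper name contains_ci is hypothetical, and the sample rows simply mirror the logData values above.

```python
def contains_ci(value: str, needle: str) -> bool:
    """Return True if needle occurs in value, ignoring case."""
    # casefold() is a more aggressive lower() intended for caseless matching
    return needle.casefold() in value.casefold()


# Same (id, value) pairs as in the DataFrame example
rows = [(0, "aB"), (1, "AaA"), (2, "bA"), (3, "bB")]

# Count the rows whose value contains the letter, regardless of case
num_as = sum(1 for _id, value in rows if contains_ci(value, "a"))
num_bs = sum(1 for _id, value in rows if contains_ci(value, "b"))
print(num_as, num_bs)  # 3 3
```

On an RDD of such tuples, a predicate like this could presumably be passed to filter, for example rdd.filter(lambda row: contains_ci(row[1], 'a')).count().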
You mention 'I am using the following code to find out num of A's and number of B's.' Note that if you want to count the actual occurrences of a character rather than the number of rows that contain it, you could do something like:
def count_char_in_col(col: str, char: str):
    return F.length(F.regexp_replace(F.lower(F.col(col)), "[^" + char + "]", ""))
logData.select(count_char_in_col('value','a')).groupBy().sum().collect()[0][0]
which in the above example will return 5.
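To see why the result is 5, the same strip-and-measure logic can be checked in plain Python, without Spark. This is only a sketch: count_char is a hypothetical helper, and re.sub stands in for regexp_replace.

```python
import re


def count_char(value: str, char: str) -> int:
    """Count case-insensitive occurrences of a single character in value."""
    # Lowercase, delete every character that is not `char`, then measure what is left
    stripped = re.sub("[^" + re.escape(char.lower()) + "]", "", value.lower())
    return len(stripped)


values = ["aB", "AaA", "bA", "bB"]
per_row = [count_char(v, "a") for v in values]
print(per_row, sum(per_row))  # [1, 3, 1, 0] 5
```

Each row contributes the length of what remains after every other character is removed, and summing those lengths gives the total number of occurrences.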
Hope this helps!