How To Make An Integer Index Row?

March 07, 2024

I have a DataFrame:

+-----+--------+---------+
|  usn|log_type|item_code|
+-----+--------+---------+
|    0|      11|    I0938|
|  916|      19|    I0009|
|  916|      51|    I109

Solution 1:

monotonically_increasing_id only guarantees that the numbers are unique and increasing; neither the starting number nor consecutive numbering is guaranteed. If you want to be sure to get 0, 1, 2, 3, ..., you can use the RDD function zipWithIndex().

Since I'm not too familiar with Spark in Python, the example below uses Scala, but it should be easy to convert.

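import org.apache.spark.sql.{Row, SparkSession}

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

val df = Seq("I0938", "I0009", "I1097", "C0723", "I0010", "I0010",
    "C0117", "C0117", "I0009", "I0009", "I0010", "I1067",
    "I1067", "C1083", "B0250", "C1346")
  .toDF("item_code")

// Keep the distinct codes, unwrap each Row into its String,
// then pair every code with a consecutive 0-based index.
val df2 = df.distinct.rdd
  .map { case Row(item: String) => item }
  .zipWithIndex()
  .toDF("item_code", "numId")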

This gives the requested result (the row order after distinct may differ between runs):

+---------+-----+
|item_code|numId|
+---------+-----+
|    I0010|    0|
|    I1067|    1|
|    C0117|    2|
|    I0009|    3|
|    I1097|    4|
|    C1083|    5|
|    I0938|    6|
|    C0723|    7|
|    B0250|    8|
|    C1346|    9|
+---------+-----+
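
To see why monotonically_increasing_id can leave gaps while zipWithIndex() does not, here is a minimal plain-Python sketch that needs no Spark installation. It assumes the layout the Spark documentation describes for the current implementation (partition index in the upper bits, row position within the partition in the lower 33 bits). The partitions list and both helper functions are hypothetical illustrations, not Spark APIs.

```python
from itertools import accumulate

# Hypothetical split of some item codes across three partitions (an assumption).
partitions = [["I0938", "I0009", "I1097"], ["C0723", "I0010"], ["C0117"]]


def monotonic_ids(parts):
    """Mimic monotonically_increasing_id: partition index shifted left by
    33 bits, plus the row's position inside its partition."""
    return [
        (item, (p << 33) + i)
        for p, part in enumerate(parts)
        for i, item in enumerate(part)
    ]


def zip_with_index(parts):
    """Mimic zipWithIndex: each partition starts at the total size of the
    partitions before it, so the indices are consecutive."""
    offsets = [0] + list(accumulate(len(part) for part in parts))[:-1]
    return [
        (item, offsets[p] + i)
        for p, part in enumerate(parts)
        for i, item in enumerate(part)
    ]


mono = monotonic_ids(partitions)
zipped = zip_with_index(partitions)

print([i for _, i in mono])    # [0, 1, 2, 8589934592, 8589934593, 17179869184]
print([i for _, i in zipped])  # [0, 1, 2, 3, 4, 5]
```

The first scheme never has to look at other partitions, which is why it is cheap but jumps by 2^33 at each partition boundary. The second needs the size of every earlier partition, which is why zipWithIndex() has to count partition sizes before it can assign indices.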
