pyspark.sql.DataFrame.distinct

DataFrame.distinct()

Returns a new DataFrame containing the distinct rows in this DataFrame.

New in version 1.3.0.

Changed in version 3.4.0: Supports Spark Connect.

Returns
DataFrame

DataFrame with distinct records.

Examples

Remove duplicate rows from a DataFrame

>>> df = spark.createDataFrame(
...     [(14, "Tom"), (23, "Alice"), (23, "Alice")], ["age", "name"])
>>> df.distinct().show()
+---+-----+
|age| name|
+---+-----+
| 14|  Tom|
| 23|Alice|
+---+-----+

Count the number of distinct rows in a DataFrame

>>> df.distinct().count()
2

Get distinct rows from a DataFrame with multiple columns

>>> df = spark.createDataFrame(
...     [(14, "Tom", "M"), (23, "Alice", "F"), (23, "Alice", "F"), (14, "Tom", "M")],
...     ["age", "name", "gender"])
>>> df.distinct().show()
+---+-----+------+
|age| name|gender|
+---+-----+------+
| 14|  Tom|     M|
| 23|Alice|     F|
+---+-----+------+

Get distinct values from a specific column in a DataFrame

>>> df.select("name").distinct().show()
+-----+
| name|
+-----+
|  Tom|
|Alice|
+-----+

Count the number of distinct values in a specific column

>>> df.select("name").distinct().count()
2

Get distinct values from multiple columns in a DataFrame

>>> df.select("name", "gender").distinct().show()
+-----+------+
| name|gender|
+-----+------+
|  Tom|     M|
|Alice|     F|
+-----+------+

Get distinct rows from a DataFrame with null values

>>> df = spark.createDataFrame(
...     [(14, "Tom", "M"), (23, "Alice", "F"), (23, "Alice", "F"), (14, "Tom", None)],
...     ["age", "name", "gender"])
>>> df.distinct().show()
+---+-----+------+
|age| name|gender|
+---+-----+------+
| 14|  Tom|     M|
| 23|Alice|     F|
| 14|  Tom|  NULL|
+---+-----+------+

Get distinct rows whose gender is not null

>>> df.distinct().filter(df.gender.isNotNull()).show()
+---+-----+------+
|age| name|gender|
+---+-----+------+
| 14|  Tom|     M|
| 23|Alice|     F|
+---+-----+------+