pyspark.sql.DataFrame.distinct
DataFrame.distinct()
Returns a new DataFrame containing the distinct rows in this DataFrame.

New in version 1.3.0.

Changed in version 3.4.0: Supports Spark Connect.

Returns
- DataFrame: DataFrame with distinct records.
 
Examples

Remove duplicate rows from a DataFrame

>>> df = spark.createDataFrame(
...     [(14, "Tom"), (23, "Alice"), (23, "Alice")], ["age", "name"])
>>> df.distinct().show()
+---+-----+
|age| name|
+---+-----+
| 14|  Tom|
| 23|Alice|
+---+-----+

Count the number of distinct rows in a DataFrame

>>> df.distinct().count()
2

Get distinct rows from a DataFrame with multiple columns

>>> df = spark.createDataFrame(
...     [(14, "Tom", "M"), (23, "Alice", "F"), (23, "Alice", "F"), (14, "Tom", "M")],
...     ["age", "name", "gender"])
>>> df.distinct().show()
+---+-----+------+
|age| name|gender|
+---+-----+------+
| 14|  Tom|     M|
| 23|Alice|     F|
+---+-----+------+

Get distinct values from a specific column in a DataFrame

>>> df.select("name").distinct().show()
+-----+
| name|
+-----+
|  Tom|
|Alice|
+-----+

Count the number of distinct values in a specific column

>>> df.select("name").distinct().count()
2

Get distinct values from multiple columns in a DataFrame

>>> df.select("name", "gender").distinct().show()
+-----+------+
| name|gender|
+-----+------+
|  Tom|     M|
|Alice|     F|
+-----+------+

Get distinct rows from a DataFrame with null values

>>> df = spark.createDataFrame(
...     [(14, "Tom", "M"), (23, "Alice", "F"), (23, "Alice", "F"), (14, "Tom", None)],
...     ["age", "name", "gender"])
>>> df.distinct().show()
+---+-----+------+
|age| name|gender|
+---+-----+------+
| 14|  Tom|     M|
| 23|Alice|     F|
| 14|  Tom|  NULL|
+---+-----+------+

Get distinct non-null values from a DataFrame

>>> df.distinct().filter(df.gender.isNotNull()).show()
+---+-----+------+
|age| name|gender|
+---+-----+------+
| 14|  Tom|     M|
| 23|Alice|     F|
+---+-----+------+
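
The examples on this page compare whole rows, so the order of select and distinct changes the result. The following is a minimal plain-Python sketch of that behaviour that runs without a Spark session. It is only a model of the semantics, not how Spark implements distinct(): the helpers `distinct_rows` and `select_columns` are hypothetical names introduced here, rows are modelled as tuples, and the first-seen ordering is an artifact of the model (Spark does not guarantee row order after distinct()).

```python
from typing import Hashable, Iterable, List, Sequence, Tuple

# A row is modelled as a tuple of hashable column values (assumption of this sketch).
Row = Tuple[Hashable, ...]


def distinct_rows(rows: Iterable[Row]) -> List[Row]:
    """Hypothetical model of row-level deduplication: keep one copy of each full row.

    dict.fromkeys keeps first-seen order; Spark itself makes no ordering promise.
    """
    return list(dict.fromkeys(rows))


def select_columns(rows: Iterable[Row], indices: Sequence[int]) -> List[Row]:
    """Hypothetical model of a column projection: keep only the given positions."""
    return [tuple(row[i] for i in indices) for row in rows]


# Same data as the null-value example: columns are (age, name, gender).
rows = [(14, "Tom", "M"), (23, "Alice", "F"), (23, "Alice", "F"), (14, "Tom", None)]

# Whole rows are compared, so (14, "Tom", None) survives next to (14, "Tom", "M").
full = distinct_rows(rows)

# Project first, then deduplicate: one entry per name.
names_then_distinct = distinct_rows(select_columns(rows, [1]))

# Deduplicate first, then project: "Tom" appears twice, because the two Tom
# rows differed in the gender column before that column was dropped.
distinct_then_names = select_columns(distinct_rows(rows), [1])

print(full)
print(names_then_distinct)
print(distinct_then_names)
```

In this model, projecting before deduplicating corresponds to df.select("name").distinct() in the examples, which is why those calls return one row per name even though the full rows for Tom differ in gender.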