pyspark.sql.DataFrame.join

DataFrame.join(other: pyspark.sql.dataframe.DataFrame, on: Union[str, List[str], pyspark.sql.column.Column, List[pyspark.sql.column.Column], None] = None, how: Optional[str] = None) → pyspark.sql.dataframe.DataFrame[source]

Joins with another DataFrame, using the given join expression.

New in version 1.3.0.

Parameters
otherDataFrame

Right side of the join

onstr, list or Column, optional

a string for the join column name, a list of column names, a join expression (Column), or a list of Columns. If on is a string or a list of strings indicating the name of the join column(s), the column(s) must exist on both sides, and this performs an equi-join.

howstr, optional

default inner. Must be one of: inner, cross, outer, full, fullouter, full_outer, left, leftouter, left_outer, right, rightouter, right_outer, semi, leftsemi, left_semi, anti, leftanti and left_anti.

Examples

The following performs a full outer join between df1 and df2.

>>> from pyspark.sql.functions import desc
>>> df.join(df2, df.name == df2.name, 'outer').select(df.name, df2.height)                 .sort(desc("name")).collect()
[Row(name='Bob', height=85), Row(name='Alice', height=None), Row(name=None, height=80)]
>>> df.join(df2, 'name', 'outer').select('name', 'height').sort(desc("name")).collect()
[Row(name='Tom', height=80), Row(name='Bob', height=85), Row(name='Alice', height=None)]
>>> cond = [df.name == df3.name, df.age == df3.age]
>>> df.join(df3, cond, 'outer').select(df.name, df3.age).collect()
[Row(name='Alice', age=2), Row(name='Bob', age=5)]
>>> df.join(df2, 'name').select(df.name, df2.height).collect()
[Row(name='Bob', height=85)]
>>> df.join(df4, ['name', 'age']).select(df.name, df.age).collect()
[Row(name='Bob', age=5)]