Spark: Count the number of duplicate rows
Azure Cloud Data & AI Solution Engineer specializing in Microsoft Fabric, Power BI, data architecture, governance, and modern data platforms.
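To count the duplicate rows in a PySpark DataFrame, groupBy() all of its columns and count() each group, keep only the groups whose count is greater than 1, and then select the sum of those counts. The result is the total number of rows that have at least one identical copy, with every copy included (a row that appears three times contributes 3):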
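import pyspark.sql.functions as funcs

(
    df.groupBy(df.columns)  # identical rows fall into the same group
    .count()  # adds a 'count' column holding each group's size
    .where(funcs.col('count') > 1)  # keep only groups that occur more than once
    .select(funcs.sum('count'))  # total rows across those groups
    .show()
)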




