Spark: Count the number of duplicate rows

Published July 12, 2021

Azure Cloud Data & AI Solution Engineer specializing in Microsoft Fabric, Power BI, data architecture, governance, and modern data platforms.

To count the duplicate rows in a PySpark DataFrame, groupBy() all the columns and count(), keep only the groups whose count is greater than 1, and sum those counts. The result is the total number of rows that have at least one identical copy, first occurrences included:

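import pyspark.sql.functions as funcs

# Group identical rows, keep the groups that occur more than once,
# then add up their sizes. The result is null when there are no duplicates.
df.groupBy(df.columns)\
    .count()\
    .where(funcs.col('count') > 1)\
    .select(funcs.sum('count'))\
    .show()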
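
To make the meaning of that number concrete, here is a minimal sketch in plain Python (no Spark needed) that mirrors the same steps on a small list of tuples. The sample rows and variable names are hypothetical and only serve the illustration. It also shows a second figure that is sometimes what people mean by "duplicates": only the surplus copies.

```python
from collections import Counter

# Hypothetical sample rows standing in for a DataFrame's contents.
rows = [
    ("alice", 1),
    ("alice", 1),
    ("bob", 2),
    ("carol", 3),
    ("carol", 3),
    ("carol", 3),
]

# Counterpart of groupBy(all columns).count(): size of each group of identical rows.
group_sizes = Counter(rows)

# Counterpart of where(count > 1) followed by sum(count):
# every row that belongs to a duplicated group, first occurrence included.
rows_in_duplicate_groups = sum(n for n in group_sizes.values() if n > 1)

# A different definition: only the surplus copies,
# i.e. the rows that would disappear after de-duplication.
surplus_copies = len(rows) - len(group_sizes)

print(rows_in_duplicate_groups)  # 5
print(surplus_copies)            # 3
```

If the surplus figure is the one you need, it should be obtainable in PySpark as df.count() - df.dropDuplicates().count(), at the cost of two passes over the data.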

#pyspark #data-analysis #data-engineering

Ian's blog
