#pyspark

Articles tagged with #pyspark

Spark: Count number of duplicate rows
To count the number of duplicate rows in a PySpark DataFrame, groupBy() all the columns and count(), then select the sum of the counts for the groups where the count is greater than 1: import pyspark.sql.functions as funcs df.groupBy(df.col...
Jul 12, 2021 · 1 min read
Apache Spark Architecture
Spark & its Features: Apache Spark is an open-source cluster computing framework for real-time data processing. The main feature of Apache Spark is its in-memory cluster computing, which increases the processing speed of an application. Spark provides a...
Jul 3, 2021 · 8 min read