
Pandas To Spark

This is a guide to the PySpark equivalents of common Pandas methods, functions, and objects. Reference the sections below when switching between the two libraries.

PySpark (pyspark.sql)

Window

Used to apply rank, dense_rank, lag, and lead functions (similar to SQL window functions).

Parts of window functions:

  • partitionBy - Used to split the data into groups so metrics are computed per group
  • orderBy - Used to order rows within the window, since rank, lag, and lead need a defined order to work as expected

Rank

from pyspark.sql.window import Window
from pyspark.sql.functions import rank

windowSpec = Window.partitionBy('col1').orderBy('col2') # Partition by col1 and order by col2
df_ranked = df.withColumn('rank', rank().over(windowSpec)) # Create a new column 'rank' based on the window above

Lag / Lead

from pyspark.sql.functions import lag, lead

# Define window specification
windowSpec = Window.orderBy("Date")
# Use lag and lead functions
df_with_lag_lead = df.withColumn("PreviousDaySales", lag("Sales", 1).over(windowSpec)) \
                     .withColumn("NextDaySales", lead("Sales", 1).over(windowSpec))

pyspark.sql.functions

when

Used like a SQL CASE statement when assigning a value to a new column. Used along with otherwise().

from pyspark.sql.functions import when, col

df.withColumn('new_col', when(col('old_col') > 100, 'Yes').when(col('old_col') < 10, 'Kinda').otherwise('Nope'))

Pandas to PySpark Cheat Sheet

DataFrame Creation

Pandas:

import pandas as pd
pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

PySpark:

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("example").getOrCreate()
df = spark.createDataFrame([(1, 4), (2, 5), (3, 6)], ["A", "B"])
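
spark.createDataFrame also accepts a pandas DataFrame directly, which is handy mid-migration (the schema is inferred from the pandas dtypes):

pdf = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df = spark.createDataFrame(pdf)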

Reading Data

Pandas (CSV):

pd.read_csv("file.csv")

PySpark (CSV):

spark.read.csv("file.csv", header=True, inferSchema=True)
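
The same read can also be written option-by-option. Note that inferSchema costs an extra pass over the file, so supplying an explicit schema is faster on large data; a sketch with a hypothetical two-column schema:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("name", StringType()),
    StructField("age", IntegerType()),
])
df = spark.read.option("header", True).schema(schema).csv("file.csv")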

Viewing Data

Pandas:

df.head()

PySpark:

df.show()
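
One behavioral difference worth noting: pandas head() returns a DataFrame, while show() only prints. PySpark's own head()/take() return Row objects instead:

df.show(5)                          # Print the first 5 rows (default is 20)
rows = df.head(5)                   # List of the first 5 Row objects
small_pdf = df.limit(5).toPandas()  # Pull a small sample back into pandas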

Referencing Columns

Pandas

df['col_name']

PySpark

col('col_name') # Best Practice
'col_name'
df.col_name
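
col() is the best practice because it builds a Column expression without binding to a specific DataFrame, so the same expression works inside select, filter, and withColumn alike:

from pyspark.sql.functions import col

df.select((col('col_name') * 2).alias('doubled'))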

Data Selection

Selecting Columns (Pandas):

df['A']

Selecting Columns (PySpark):

df.select('A')

Selecting Rows by Position (Pandas):

df.iloc[0]

Selecting Rows by Position (PySpark):

df.take(1)
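
Unlike iloc, take(n) returns a Python list of Row objects rather than a DataFrame:

first_row = df.take(1)[0]  # A Row object
value = first_row['A']     # Field access by column name (assumes a column 'A')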

Filtering Data

Pandas:

df[df['A'] > 2]

PySpark:

df.filter(df['A'] > 2)
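
filter() also accepts col() expressions and SQL strings, and where() is an alias for it:

from pyspark.sql.functions import col

df.filter(col('A') > 2)  # Column expression
df.where('A > 2')        # SQL expression string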

Grouping and Aggregating

Pandas:

df.groupby('A').sum()

PySpark:

from pyspark.sql import functions as F
df.groupBy('A').agg(F.sum('B'))
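
agg() takes any number of aggregate expressions, and alias() keeps the output column names readable:

df.groupBy('A').agg(
    F.sum('B').alias('total_B'),
    F.avg('B').alias('avg_B'),
    F.count('*').alias('n'),
)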

Joining DataFrames

Pandas:

pd.merge(df1, df2, on='key')

PySpark:

df1.join(df2, df1.key == df2.key)
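
Passing on='key' instead of an explicit condition avoids a duplicated key column in the result, and how selects the join type:

df1.join(df2, on='key', how='left')  # how: 'inner' (default), 'left', 'right', 'outer', ...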

Handling Missing Data

Drop NA (Pandas):

df.dropna()

Drop NA (PySpark):

df.na.drop()

Fill NA (Pandas):

df.fillna(value)

Fill NA (PySpark):

df.na.fill(value)
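
Both accept a per-column mapping as well; in PySpark the fill value's type must match the column's type (the column names here are hypothetical):

df.na.fill({'A': 0, 'B': 'unknown'})  # Numeric column A filled with 0, string column B with 'unknown'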

Sorting

Pandas:

df.sort_values(by='A')

PySpark:

df.sort(df.A) # Ascending (Default)
df.sort(df.A.desc()) # Descending
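
sort() and its alias orderBy() accept multiple columns, each with its own direction:

df.orderBy(df.A.desc(), df.B.asc())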

Writing Data

Pandas (CSV):

df.to_csv("file.csv", index=False)

PySpark (CSV):

df.write.csv("file.csv", header=True)

The writer options are very similar to Pandas (see the note after this list):

  • sep
  • encoding
  • header
  • escapeQuotes
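
One notable difference: Spark writes a directory of part files rather than a single CSV, and it fails if the path already exists unless a save mode is set. A sketch with a hypothetical output path:

df.write.csv('out_dir', header=True, sep=',', mode='overwrite')
# df.coalesce(1).write.csv(...) yields a single part file, at the cost of parallelism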

Creating New Column

Pandas

df['new_col'] = df['old_col'].apply(lambda x: True if x == 'Joseph' else False)

PySpark

df = df.withColumn('new_col', when(col('old_col') == 'Joseph', True).otherwise(False))

Uses the when function described above to give different results based on the condition.

This post is licensed under CC BY 4.0 by the author.
