
Pandas To Spark

This is a guide to the PySpark equivalents of common Pandas methods, functions, and objects. Reference the sections below when switching between the two libraries.

PySpark (pyspark.sql)

Window

Used to apply rank, dense_rank, lag, and lead functions (similar to SQL window functions).

Parts of window functions:

  • partitionBy - Used to split the data into groups so metrics are computed per group
  • orderBy - Used to order rows within the window, since rank, lag, and lead need a defined order to work as expected

Rank

from pyspark.sql.window import Window
from pyspark.sql.functions import rank

windowSpec = Window.partitionBy('col1').orderBy('col2') # Partition by col1 and order by col2
df_ranked = df.withColumn('rank', rank().over(windowSpec)) # Create a new column 'rank' based on the window above

Lag / Lead

from pyspark.sql.functions import lag, lead

# Define window specification
windowSpec = Window.orderBy("Date")
# Use lag and lead functions
df_with_lag_lead = df.withColumn("PreviousDaySales", lag("Sales", 1).over(windowSpec)) \
                     .withColumn("NextDaySales", lead("Sales", 1).over(windowSpec))

pyspark.sql.functions

when

Used like a SQL CASE statement when assigning a value to a new column. Used along with otherwise().

from pyspark.sql.functions import when, col

df.withColumn('new_col', when(col('old_col') > 100, 'Yes').when(col('old_col') < 10, 'Kinda').otherwise('Nope'))

Pandas to PySpark Cheat Sheet

DataFrame Creation

Pandas:

import pandas as pd
pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

PySpark:

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("example").getOrCreate()
df = spark.createDataFrame([(1, 4), (2, 5), (3, 6)], ["A", "B"])
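
spark.createDataFrame also accepts a pandas DataFrame directly, which is handy mid-migration (the schema is inferred from the pandas dtypes):

pdf = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df = spark.createDataFrame(pdf)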

Reading Data

Pandas (CSV):

pd.read_csv("file.csv")

PySpark (CSV):

spark.read.csv("file.csv", header=True, inferSchema=True)
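
The same read can also be written option-by-option. Note that inferSchema costs an extra pass over the file, so supplying an explicit schema is faster on large data; a sketch with a hypothetical two-column schema:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("name", StringType()),
    StructField("age", IntegerType()),
])
df = spark.read.option("header", True).schema(schema).csv("file.csv")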

Viewing Data

Pandas:

df.head()

PySpark:

df.show()
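
One behavioral difference worth noting: pandas head() returns a DataFrame, while show() only prints. PySpark's own head()/take() return Row objects instead:

df.show(5)                          # Print the first 5 rows (default is 20)
rows = df.head(5)                   # List of the first 5 Row objects
small_pdf = df.limit(5).toPandas()  # Pull a small sample back into pandas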

Referencing Columns

Pandas

df['col_name']

PySpark

col('col_name') # Best Practice
'col_name'
df.col_name
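
col() is the best practice because it builds a Column expression without binding to a specific DataFrame, so the same expression works inside select, filter, and withColumn alike:

from pyspark.sql.functions import col

df.select((col('col_name') * 2).alias('doubled'))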

Data Selection

Selecting Columns (Pandas):

df['A']

Selecting Columns (PySpark):

df.select('A')

Selecting Rows by Position (Pandas):

df.iloc[0]

Selecting Rows by Position (PySpark):

df.take(1)
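
Unlike iloc, take(n) returns a Python list of Row objects rather than a DataFrame:

first_row = df.take(1)[0]  # A Row object
value = first_row['A']     # Field access by column name (assumes a column 'A')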

Filtering Data

Pandas:

df[df['A'] > 2]

PySpark:

df.filter(df['A'] > 2)
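
filter() also accepts col() expressions and SQL strings, and where() is an alias for it:

from pyspark.sql.functions import col

df.filter(col('A') > 2)  # Column expression
df.where('A > 2')        # SQL expression string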

Grouping and Aggregating

Pandas:

df.groupby('A').sum()

PySpark:

from pyspark.sql import functions as F
df.groupBy('A').agg(F.sum('B'))
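
agg() takes any number of aggregate expressions, and alias() keeps the output column names readable:

df.groupBy('A').agg(
    F.sum('B').alias('total_B'),
    F.avg('B').alias('avg_B'),
    F.count('*').alias('n'),
)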

Joining DataFrames

Pandas:

pd.merge(df1, df2, on='key')

PySpark:

df1.join(df2, df1.key == df2.key)
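
Passing on='key' instead of an explicit condition avoids a duplicated key column in the result, and how selects the join type:

df1.join(df2, on='key', how='left')  # how: 'inner' (default), 'left', 'right', 'outer', ...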

Handling Missing Data

Drop NA (Pandas):

df.dropna()

Drop NA (PySpark):

df.na.drop()

Fill NA (Pandas):

df.fillna(value)

Fill NA (PySpark):

df.na.fill(value)
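
Both accept a per-column mapping as well; in PySpark the fill value's type must match the column's type (the column names here are hypothetical):

df.na.fill({'A': 0, 'B': 'unknown'})  # Numeric column A filled with 0, string column B with 'unknown'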

Sorting

Pandas:

df.sort_values(by='A')

PySpark:

df.sort(df.A) # Ascending (Default)
df.sort(df.A.desc()) # Descending
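
sort() and its alias orderBy() accept multiple columns, each with its own direction:

df.orderBy(df.A.desc(), df.B.asc())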

Writing Data

Pandas (CSV):

df.to_csv("file.csv", index=False)

PySpark (CSV):

df.write.csv("file.csv", header=True)

The writer options are very similar to Pandas (see the note after this list):

  • sep
  • encoding
  • header
  • escapeQuotes
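
One notable difference: Spark writes a directory of part files rather than a single CSV, and it fails if the path already exists unless a save mode is set. A sketch with a hypothetical output path:

df.write.csv('out_dir', header=True, sep=',', mode='overwrite')
# df.coalesce(1).write.csv(...) yields a single part file, at the cost of parallelism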

Creating New Column

Pandas

df['new_col'] = df['old_col'].apply(lambda x: True if x == 'Joseph' else False)

PySpark

df = df.withColumn('new_col', when(col('old_col') == 'Joseph', True).otherwise(False))

Uses the when function described above to give different results based on the condition.

This post is licensed under CC BY 4.0 by the author.
