This is a guide to the PySpark equivalents of common Pandas methods, functions, and objects. Reference the sections below when switching between the two libraries.
PySpark (pyspark.sql)
Window
Used to apply the rank, dense_rank, lag, and lead functions (similar to SQL window functions).
Parts of a window specification:
- partitionBy - Splits the data into groups so the window function is computed within each group
- orderBy - Orders the rows within each partition, since rank, lag, and lead need a defined order to work as expected
Rank
```python
from pyspark.sql.window import Window
from pyspark.sql.functions import rank

windowSpec = Window.partitionBy('col1').orderBy('col2')     # Partition by col1 and order by col2
df_ranked = df.withColumn('rank', rank().over(windowSpec))  # New 'rank' column based on the window above
```
Lag / Lead
```python
from pyspark.sql.window import Window
from pyspark.sql.functions import lag, lead

# Define the window specification
windowSpec = Window.orderBy("Date")

# Use lag and lead to pull the previous and next day's sales
df_with_lag_lead = df.withColumn("PreviousDaySales", lag("Sales", 1).over(windowSpec)) \
                     .withColumn("NextDaySales", lead("Sales", 1).over(windowSpec))
```
pyspark.sql.functions
when
Works like a SQL CASE statement when assigning a value to a new column. Chain multiple when() calls and finish with otherwise() to set the default value.
```python
from pyspark.sql.functions import when, col

df.withColumn('new_col', when(col('old_col') > 100, 'Yes').when(col('old_col') < 10, 'Kinda').otherwise('Nope'))
```
Pandas to PySpark Cheat Sheet
DataFrame Creation
Pandas:
```python
import pandas as pd

pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
```
PySpark:
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("example").getOrCreate()
df = spark.createDataFrame([(1, 4), (2, 5), (3, 6)], ["A", "B"])
```
Reading Data
Pandas (CSV):
```python
pd.read_csv("file.csv")
```
PySpark (CSV):
```python
spark.read.csv("file.csv", header=True, inferSchema=True)
```
Viewing Data
Pandas:
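A minimal example, assuming the `df` created earlier:

```python
df.head()  # First 5 rows by default
```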
PySpark:
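The PySpark counterpart (same assumption); `show` prints rows to the console:

```python
df.show(5)  # Print the first 5 rows
```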
Referencing Columns
Pandas
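For comparison, typical Pandas references (assuming a DataFrame `df`):

```python
df['col_name']  # Bracket access
df.col_name     # Attribute access
```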
PySpark
```python
from pyspark.sql.functions import col

col('col_name')  # Best practice
'col_name'       # Plain string, accepted by most functions
df.col_name      # Attribute access
```
Data Selection
Selecting Columns (Pandas):
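A short sketch using the columns from the creation example:

```python
df[['A', 'B']]  # Returns a DataFrame with the listed columns
```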
Selecting Columns (PySpark):
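The usual PySpark form:

```python
df.select('A', 'B')
```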
Selecting Rows by Position (Pandas):
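A minimal Pandas sketch:

```python
df.iloc[0:5]  # First 5 rows by position
```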
Selecting Rows by Position (PySpark):
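PySpark DataFrames have no positional index, so there is no exact iloc equivalent; a common approximation for the first n rows (row order is only meaningful after a sort):

```python
df.limit(5)  # First 5 rows as a DataFrame
df.take(5)   # First 5 rows as a list of Row objects
```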
Filtering Data
Pandas:
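A minimal Pandas sketch (boolean indexing):

```python
df[df['A'] > 1]
```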
PySpark:
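A PySpark sketch using `filter` (alias: `where`):

```python
from pyspark.sql.functions import col

df.filter(col('A') > 1)  # or df.filter(df.A > 1)
```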
Grouping and Aggregating
Pandas:
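A minimal Pandas sketch:

```python
df.groupby('A')['B'].sum()
```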
PySpark:
```python
from pyspark.sql import functions as F

df.groupBy('A').agg(F.sum('B'))
```
Joining DataFrames
Pandas:
```python
pd.merge(df1, df2, on='key')
```
PySpark:
```python
df1.join(df2, df1.key == df2.key)  # Inner join by default; pass how='left', etc. to change
```
Handling Missing Data
Drop NA (Pandas):
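For example:

```python
df.dropna()
```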
Drop NA (PySpark):
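For example:

```python
df.na.drop()  # df.dropna() also works
```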
Fill NA (Pandas):
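For example (0 is just an illustrative fill value):

```python
df.fillna(0)
```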
Fill NA (PySpark):
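For example:

```python
df.na.fill(0)  # df.fillna(0) also works
```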
Sorting
Pandas:
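A minimal Pandas sketch:

```python
df.sort_values('A')                   # Ascending (default)
df.sort_values('A', ascending=False)  # Descending
```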
PySpark:
```python
df.sort(df.A)         # Ascending (default)
df.sort(df.A.desc())  # Descending
```
Writing Data
Pandas (CSV):
```python
df.to_csv("file.csv", index=False)
```
PySpark (CSV):
```python
df.write.csv("file.csv", header=True)  # Writes a directory of part files, not a single file
```
The CSV writer options are very similar to Pandas:
- sep
- encoding
- header
- escapeQuotes
Creating New Column
Pandas
```python
df['new_col'] = df['old_col'].apply(lambda x: x == 'Joseph')  # The comparison already returns a bool
```
PySpark
```python
from pyspark.sql.functions import when, col

df = df.withColumn('new_col', when(col('old_col') == 'Joseph', True).otherwise(False))
```
Uses the [[#when]] function to assign different results based on a condition.