Simple PySpark runs far too slow compared to Scala + Spark

I have a simple PySpark program for testing:

import datetime
from pyspark.sql import SparkSession

start = datetime.datetime.now()

spark = SparkSession.builder.appName("test").getOrCreate()

dept = [("D1", 10), ("D2", 20), ("D3", 30), ("D4", 40)]

deptCols = ["Deptname", "DeptId"]
deptDF = spark.createDataFrame(dept, deptCols)
deptDF.show()

end = datetime.datetime.now()
duration = end - start
print("Duration :", duration.seconds, " seconds")

It took 40 seconds to run on my system (with the PyCharm IDE). The Python version is 3.10.9 and PySpark is 3.3.2.
That is ridiculously slow compared to Scala + Spark (< 5 seconds) in the IntelliJ IDE.
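To narrow down where the time goes, here is a sketch of the same program with the timing split into two parts: session startup (JVM + Py4J gateway) versus the actual DataFrame work. The variable names (t0, t1, t2) are just illustrative; the logic is otherwise the same as above.

import datetime
from pyspark.sql import SparkSession

t0 = datetime.datetime.now()

# Starting the SparkSession launches the JVM and the Py4J gateway
spark = SparkSession.builder.appName("test").getOrCreate()

t1 = datetime.datetime.now()

dept = [("D1", 10), ("D2", 20), ("D3", 30), ("D4", 40)]
deptCols = ["Deptname", "DeptId"]
deptDF = spark.createDataFrame(dept, deptCols)
deptDF.show()

t2 = datetime.datetime.now()

print("Session startup :", (t1 - t0).total_seconds(), "seconds")
print("DataFrame work  :", (t2 - t1).total_seconds(), "seconds")

If most of the 40 seconds falls in the first measurement, the overhead is the one-time session startup rather than the DataFrame operations themselves.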

Any idea how to speed up the PySpark program, or is this standard behavior?
