I have a simple PySpark program for testing:
import datetime
from pyspark.sql import SparkSession

start = datetime.datetime.now()

# Create (or reuse) the Spark session
spark = SparkSession.builder.appName("test").getOrCreate()

# Build a small DataFrame and display it
dept = [("D1", 10), ("D2", 20), ("D3", 30), ("D4", 40)]
deptCols = ["Deptname", "DeptId"]
deptDF = spark.createDataFrame(dept, deptCols)
deptDF.show()

end = datetime.datetime.now()
duration = end - start
print("Duration :", duration.seconds, "seconds")
It took 40 seconds to run on my system (in the PyCharm IDE). The Python version is 3.10.9 and PySpark is 3.3.2.
That is ridiculously slow compared to the equivalent Scala + Spark program (< 5 seconds) in IntelliJ IDEA.
Any idea how to speed up the PySpark program, or is this the expected behavior?
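My guess is that most of the 40 seconds is JVM/Spark session startup rather than the DataFrame work itself, so I was thinking of timing the two parts separately, roughly like this (just a sketch I haven't benchmarked; the local[*] master and the spark.ui.enabled setting are guesses at what might trim startup time, not verified fixes):

import datetime
from pyspark.sql import SparkSession

t0 = datetime.datetime.now()

# Session creation is where I suspect most of the time goes.
# local[*] and disabling the web UI are guesses, not verified settings.
spark = (
    SparkSession.builder
    .appName("test")
    .master("local[*]")
    .config("spark.ui.enabled", "false")
    .getOrCreate()
)
t1 = datetime.datetime.now()

# The actual DataFrame work, timed separately.
dept = [("D1", 10), ("D2", 20), ("D3", 30), ("D4", 40)]
deptDF = spark.createDataFrame(dept, ["Deptname", "DeptId"])
deptDF.show()
t2 = datetime.datetime.now()

print("Session startup :", (t1 - t0).total_seconds(), "seconds")
print("DataFrame work  :", (t2 - t1).total_seconds(), "seconds")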