I have a dataframe and I want to use values from its rows to execute a query (on Delta Lake) and put the results in a new column. However, in a Synapse notebook I always get this error:

> It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063
I have the below function:

```python
def execute_sql(database, table):
    query = f"SELECT COUNT(*) AS count FROM {database}.{table}"
    result = spark.sql(query).collect()[0][0]
    return result
```
and I am applying it like this (there are many more columns; I am only showing two, but I want to keep the others):

```python
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType

execute_sql = udf(execute_sql, StringType())
new_df = input_df.withColumn('TotalCount', execute_sql(col("Database"), col("Table")))
display(new_df)
```
I am trying to avoid complicating things by iterating through the dataframe with `rdd.collect`.
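For reference, the driver-side iteration I am trying to avoid looks roughly like this (a sketch only, assuming `input_df` has `Database` and `Table` string columns as above and that a `spark` session exists; it runs one query per row on the driver and rebuilds the dataframe afterwards):

```python
# Collect all rows to the driver, run one spark.sql query per row,
# then rebuild a dataframe with the extra TotalCount column.
rows = input_df.collect()
results = []
for row in rows:
    count = spark.sql(
        f"SELECT COUNT(*) AS count FROM {row['Database']}.{row['Table']}"
    ).collect()[0][0]
    results.append((*row, count))

new_df = spark.createDataFrame(results, input_df.columns + ["TotalCount"])
```

This works because `spark.sql` is only ever called on the driver, but it serializes the whole dataframe to the driver and runs the queries sequentially, which is what I would like to avoid.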