Unable to set up HDFS interface with PyArrow

I am trying to access an HDFS directory using PyArrow, following these steps:

  1. Install Hadoop and Spark
  2. Locate libhdfs.so

/usr/local/hadoop-3.3.5/lib/native/libhdfs.so

  3. Set the Java/Hadoop home directories and paths as environment variables
import os

os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"   # default is changed
os.environ["HADOOP_HOME"] = "/usr/local/hadoop-3.3.5/"
os.environ["ARROW_LIBHDFS_DIR"] = "/usr/local/hadoop-3.3.5/lib/native/"

# Append the Hadoop bin directory to the existing PATH
current_path = os.environ["PATH"]
os.environ["PATH"] = current_path + ":/usr/local/hadoop-3.3.5/bin/"
  4. List the Hive (or Spark SQL) databases
spark.sql("show databases").show()

+---------+
|namespace|
+---------+
|  default|
+---------+
  5. Create new database, load tables
sql_create_database = "create database if not exists analytics location '/content/spark-warehouse/'"
result_create_db = spark.sql(sql_create_database)
sql_create_table = """
create table if not exists analytics.pandas_spark_hive
using parquet
as select to_timestamp(date) as date_parsed, *
from air_quality_sdf
"""
result_create_table = spark.sql(sql_create_table)
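
For the SQL above to resolve air_quality_sdf, that DataFrame has to be registered as a temp view; a hypothetical stand-in (the CSV path and columns are placeholders, not my actual data) would look like this:

# Hypothetical stand-in for air_quality_sdf (the CSV path is a placeholder)
air_quality_sdf = spark.read.csv("/content/air_quality.csv", header=True)

# The temp view name is what the SQL query above refers to
air_quality_sdf.createOrReplaceTempView("air_quality_sdf")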
  6. Locate parquet files with new table data
!ls /content/spark-warehouse/pandas_spark_hive
part-00000-670c5cc7-8241-4744-9fbc-eddb640ccc19-c000.snappy.parquet 
_SUCCESS
part-00001-670c5cc7-8241-4744-9fbc-eddb640ccc19-c000.snappy.parquet
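
As a point of comparison, the same data can be read straight from the local path with pyarrow (a minimal sketch; the goal here is to do the equivalent through the HDFS interface):

import pyarrow.parquet as pq

# Read the table's parquet directory straight from the local filesystem
table = pq.read_table("/content/spark-warehouse/pandas_spark_hive")
print(table.num_rows)
print(table.schema)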
  7. Create HDFS interface
import pyarrow as pa
hdfs_interface = pa.hdfs.connect(host="localhost", port=50010)
<ipython-input-71-efc100d06888>:6: FutureWarning: pyarrow.hdfs.connect is deprecated as 
of 2.0.0, please use pyarrow.fs.HadoopFileSystem instead.
hdfs_interface = pa.hdfs.connect(host="localhost", port=50010)
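
For reference, the replacement API the warning points to would be used roughly like this (a sketch only, with the same host/port as my attempt; directory listing goes through FileSelector and get_file_info in the new API):

from pyarrow import fs

# Non-deprecated equivalent of pa.hdfs.connect (same host/port as above)
hdfs = fs.HadoopFileSystem("localhost", port=50010)

# Listing a directory with the new API
for info in hdfs.get_file_info(fs.FileSelector("/content/spark-warehouse/")):
    print(info.path, info.type)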
  8. Now when I try to list a directory through the interface, I get an error
hdfs_interface.ls('/content/spark-warehouse/')
OSError: [Errno 255] HDFS list directory failed. Detail: [errno 255] Unknown error 255

I have tried different ports and file paths, but I get the same error each time.
I would be grateful for any help.
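
One thing I am unsure about is whether I am even pointing at the right port. Would checking the configured default filesystem, roughly like the snippet below, be the correct way to find it? (hdfs getconf is part of the standard Hadoop CLI; the subprocess wrapper is just so it runs from the notebook.)

import subprocess

# Ask Hadoop which filesystem / NameNode address it is configured with
result = subprocess.run(
    ["hdfs", "getconf", "-confKey", "fs.defaultFS"],
    capture_output=True, text=True,
)
print(result.stdout)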
