I am trying to access the HDFS directory using pyarrow as follows.
- Install Hadoop and Spark
- Locate libhdfs.so

```
/usr/local/hadoop-3.3.5/lib/native/libhdfs.so
```
- Set home directories and paths in the environment

```python
import os

os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"  # default is changed
os.environ["HADOOP_HOME"] = "/usr/local/hadoop-3.3.5/"
os.environ["ARROW_LIBHDFS_DIR"] = "/usr/local/hadoop-3.3.5/lib/native/"

current_path = os.environ["PATH"]
new_path = current_path + ":/usr/local/hadoop-3.3.5/bin/"
os.environ["PATH"] = new_path
```
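A quick sanity check of this configuration (just a check against the paths set above, not part of the setup itself):

```python
import os

# Confirm that the directory in ARROW_LIBHDFS_DIR actually contains libhdfs.so,
# since pyarrow loads this library when connecting to HDFS.
libhdfs_path = os.path.join(os.environ["ARROW_LIBHDFS_DIR"], "libhdfs.so")
print(libhdfs_path, os.path.exists(libhdfs_path))
```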
- Check the Hive (or Spark SQL) databases

```python
spark.sql("show databases").show()
```

```
+---------+
|namespace|
+---------+
|  default|
+---------+
```
- Create a new database and load the table

```python
sql_create_database = "create database if not exists analytics location '/content/spark-warehouse/'"
result_create_db = spark.sql(sql_create_database)

sql_create_table = """
create table if not exists analytics.pandas_spark_hive
using parquet as
select to_timestamp(date) as date_parsed, *
from air_quality_sdf
"""
result_create_table = spark.sql(sql_create_table)
```
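To confirm the table was created, I can read it back through Spark SQL (a quick check only, using the table name created above):

```python
# Quick check that the new table is queryable through Spark SQL
spark.sql("select * from analytics.pandas_spark_hive limit 5").show()
```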
- Locate parquet files with new table data
```
!ls /content/spark-warehouse/pandas_spark_hive

part-00000-670c5cc7-8241-4744-9fbc-eddb640ccc19-c000.snappy.parquet
_SUCCESS
part-00001-670c5cc7-8241-4744-9fbc-eddb640ccc19-c000.snappy.parquet
```
- Create HDFS interface
```python
import pyarrow as pa

hdfs_interface = pa.hdfs.connect(host="localhost", port=50010)
```

```
<ipython-input-71-efc100d06888>:6: FutureWarning: pyarrow.hdfs.connect is deprecated as of 2.0.0,
please use pyarrow.fs.HadoopFileSystem instead.
  hdfs_interface = pa.hdfs.connect(host="localhost", port=50010)
```
- Now when I try to list a directory over this connection, I get an error
```python
hdfs_interface.ls('/content/spark-warehouse/')
```

```
OSError: [Errno 255] HDFS list directory failed. Detail: [errno 255] Unknown error 255
```
I have tried different ports and file paths, but each time I get the same error.
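Since the FutureWarning above points to pyarrow.fs.HadoopFileSystem, I assume the equivalent connection and listing would look roughly like the sketch below (same host, port, and path as above; I am not certain this is the right way to call it):

```python
from pyarrow import fs

# Sketch using the newer FileSystem API that the FutureWarning suggests;
# host/port are the same values used with pa.hdfs.connect above.
hdfs = fs.HadoopFileSystem("localhost", 50010)

# List the warehouse directory through the new interface.
selector = fs.FileSelector("/content/spark-warehouse/", recursive=False)
print(hdfs.get_file_info(selector))
```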
I would be grateful for any help.
If you want to replicate the problem, the notebook is available at colab.research.google.com/drive/…