Importing raw unstructured text file data into a Databricks table

I have a bunch of text files (all with a file extension of “.flib”) stored in an Azure Blob Storage container. I would like to read these files into an Azure Databricks table with two columns in it: (1) the SourceFileName which would contain the name of each file (1 row in the table per FLIB file), and (2) the File_Data column which would be a string representation of the text in that file. I’m looking for what the code would need to look like in a Databricks Notebook (using SQL Mode) to accomplish this.

Why would I want to do this? Reasons. Reasons is why.

My best attempt:

# (Python cell) configure access to the storage account
storage_account_name = "<my_account_name>"
storage_account_key = "<redacted>"
spark.conf.set(f"fs.azure.account.key.{storage_account_name}.blob.core.windows.net", storage_account_key)
container_name = "raw-flibs"

%sql
USE CATALOG "flib-dev";
USE beta;
DROP TABLE IF EXISTS flib_raw_data;
CREATE TABLE flib_raw_data 
USING delta
AS
SELECT
  SUBSTRING_INDEX(REVERSE(value), "/", 1) AS SourceFileName,
  value AS File_Data
FROM EXPLODE("wasbs://raw-flibs@<my_account_name>.blob.core.windows.net/raw_data/*.flib")
WHERE
  value LIKE '%.flib';

This produces exceptions. I'm happy to post them, but I'm positive it's because I'm not using the correct syntax (I just don't know what the correct syntax is). Any ideas where I'm going awry?

  • Please provide the sample content of `value`.