I'm using Apache Beam with Python on Dataflow (GCP) to load and transform data batches.
I have a pipeline that is split into two parts.
The first part writes into BigQuery.
The second part reads the data written in the first step.
I'm using the same pipeline to run both parts, so they run at the same time, but I need them to run sequentially within a single pipeline. I tried adding a flag to trigger the second part, but since WriteToBigQuery returns nothing and I cannot make its output iterable, I was unable to achieve it.
import apache_beam as beam
from apache_beam.io.gcp.bigquery import BigQueryDisposition, WriteToBigQuery

p = beam.Pipeline(options=opts)

part_1 = (
    p
    | "F1: Read data 1"
    >> beam.io.ReadFromText(entrada, skip_header_lines=True)
    | "F1: Transform 1" >> beam.Map(format_date)
    | "F1: Transform 2" >> beam.Map(make_row)
    | "F1: Write into BQ"
    >> WriteToBigQuery(
        output_table,
        schema=table_schema,
        write_disposition=BigQueryDisposition.WRITE_APPEND,
        create_disposition=BigQueryDisposition.CREATE_IF_NEEDED,
        additional_bq_parameters={
            "timePartitioning": {"type": "DAY"},
            "clustering": {"fields": ["programcode"]},
        },
        custom_gcs_temp_location=temp_location,
    )
)
After this execution I read the data that was written (using a query that also joins other tables), like this:
part_2 = (
    p
    | "F2: Read from BigQuery from first step"
    >> beam.io.ReadFromBigQuery(
        query=query_raw,
        use_standard_sql=True,
        gcs_location=temp_location,
        project=project_id,
    )
)
and finally run the defined pipeline:
result = p.run()
result.wait_until_finish()
If I pass part_1 as the input to part_2, like this:
part_2 = (
    part_1
    | "F2: Read from BigQuery"...
I get the following error, because of what part_1 outputs:
AttributeError: Error trying to access nonexistent attribute `0` in write result. Please see __documentation__ for available attributes.
Any ideas or examples of how to achieve this sequential execution are welcome.
Hi @Sebastian, have you checked this similar Stack Overflow link? Is it helpful for you?
Hi @DhirajSingh! Thanks for the comment, I have already tried that but it didn't work: drive.google.com/file/d/1aTMFQeF4TW0nD1iNGtIxiHB5_N0Wrolc/…
I recommend using Airflow with two Dataflow jobs.
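For instance, a minimal Airflow sketch could look like the following, assuming the two parts live in hypothetical files part_1.py and part_2.py, each building and running its own Beam pipeline (the file names and DAG id are placeholders; the Google provider also ships dedicated Dataflow/Beam operators that could replace the BashOperator calls):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="bq_load_then_read",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # First Dataflow job: read the input file, transform it, write to BigQuery.
    run_part_1 = BashOperator(
        task_id="run_part_1",
        bash_command="python part_1.py --runner DataflowRunner",
    )

    # Second Dataflow job: query the table written by the first job.
    run_part_2 = BashOperator(
        task_id="run_part_2",
        bash_command="python part_2.py --runner DataflowRunner",
    )

    # The task dependency guarantees part_2 only starts after part_1 succeeds.
    run_part_1 >> run_part_2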
You could (in addition to the BigQuery output) also publish the data to Pub/Sub and read it from Pub/Sub in the second part of your pipeline. Mind, however, that this will most likely destroy the order of the elements.
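A rough sketch of that idea, reusing p, entrada, format_date and make_row from the question; the topic and subscription names are placeholders, and note that ReadFromPubSub is an unbounded source, so the pipeline would need streaming=True in its options:

import json

import apache_beam as beam

rows = (
    p
    | "F1: Read data 1" >> beam.io.ReadFromText(entrada, skip_header_lines=True)
    | "F1: Transform 1" >> beam.Map(format_date)
    | "F1: Transform 2" >> beam.Map(make_row)
)

# Write to BigQuery as before:
# rows | "F1: Write into BQ" >> beam.io.WriteToBigQuery(...)

# ...and additionally publish each row to Pub/Sub.
_ = (
    rows
    | "F1: Encode" >> beam.Map(lambda row: json.dumps(row).encode("utf-8"))
    | "F1: Publish" >> beam.io.WriteToPubSub(topic="projects/<project>/topics/<topic>")
)

# Part 2 consumes the published elements instead of querying BigQuery,
# so it only sees data once part 1 has produced it (order not preserved).
part_2 = (
    p
    | "F2: Read from PubSub"
    >> beam.io.ReadFromPubSub(
        subscription="projects/<project>/subscriptions/<subscription>"
    )
    | "F2: Decode" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
)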