Can I use a pandera DataFrameModel to validate a property involving multiple dataframes?

I have the impression that pandera's checks are primarily designed to run on a single dataframe. Suppose my code loads data into two dataframes, df1 and df2. After loading df1, I want to load df2 and validate that all values in the column df2.col2 are members of df1.col1. This seems to require defining the check dynamically at runtime. Is this possible with pandera?
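To make the property concrete, here is a minimal sketch of the membership condition in plain pandas, without pandera. The sample values are only illustrative, and the names `is_member`, `offending`, and `all_valid` are made up for this example.

```python
import pandas as pd

# Illustrative data: df1 is loaded first, df2 afterwards
df1 = pd.DataFrame({"col1": [1, 2, 2]})
df2 = pd.DataFrame({"col2": [1, 2, 3]})

# Boolean mask: True where a col2 value also occurs in df1.col1
is_member = df2["col2"].isin(df1["col1"])

# Values of df2.col2 that violate the property, and the overall verdict
offending = df2.loc[~is_member, "col2"].tolist()
all_valid = bool(is_member.all())
```

The set of allowed values is only known once df1 has been loaded, which is why a check written at class-definition time cannot express it directly.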

EDIT
There is a solution if we convert the DataFrameModel of df2 to a schema:

import pandera as pa
import pandas as pd

class MyModel(pa.DataFrameModel):
    col2: pa.typing.Series[int] = pa.Field(ge=0)

df1 = pd.DataFrame({'col1':[1,2,2]})

col1_values = df1.col1.unique()

UpdatedSchema = MyModel.to_schema().update_column('col2',
                checks=[pa.Check.isin(col1_values)])

# Validation passes: every value of col2 appears in df1.col1
df2 = pd.DataFrame({'col2':[1,2,1]})
df2 = UpdatedSchema(df2)

# Validation fails: 3 is not in df1.col1, so this raises a SchemaError
df2 = pd.DataFrame({'col2':[1,2,3]})
df2 = UpdatedSchema(df2)

But I wonder whether there is a way to do the same without converting to a DataFrameSchema?

  • If by pandera you mean pandas: merge the two dataframes on a common key, then compare the two columns, which are now in the same dataframe.

  • For more help, you need to post a sample of the two dataframes and the expected output.

  • No, I was referring to pandera (pandera.readthedocs.io).

  • I have found a solution (see the edit), but it relies on a DataFrameSchema rather than the object-oriented DataFrameModel API, so I am leaving this question open.