I have the feeling that pandera's checks are primarily designed to run on a single dataframe. Assume I have code that loads data into two dataframes, df1 and df2. For example, after having loaded df1, I want to load df2 and validate that all values in the column df2.col2 are members of df1.col1. This seems to require defining the check dynamically at runtime. Is this possible with pandera?
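(To make the intended invariant concrete, here is a plain-pandas statement of what the check should enforce; this is just an illustration with made-up data, not a pandera check.)

import pandas as pd

df1 = pd.DataFrame({'col1': [1, 2, 2]})
df2 = pd.DataFrame({'col2': [1, 2, 1]})

# the invariant I want pandera to enforce: every value of df2.col2 appears in df1.col1
assert df2['col2'].isin(df1['col1']).all()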
EDIT
There is a solution if we convert the DataFrameModel of df2 to a schema:
import pandera as pa
import pandas as pd

class MyModel(pa.DataFrameModel):
    col2: pa.typing.Series[int] = pa.Field(ge=0)

df1 = pd.DataFrame({'col1': [1, 2, 2]})
col1_values = df1.col1.unique()
UpdatedSchema = MyModel.to_schema().update_column('col2', checks=[pa.Check.isin(col1_values)])

# the validation passes
df2 = pd.DataFrame({'col2': [1, 2, 1]})
df2 = UpdatedSchema(df2)

# the validation fails
df2 = pd.DataFrame({'col2': [1, 2, 3]})
df2 = UpdatedSchema(df2)  # raises a SchemaError because 3 is not in df1.col1
But I wonder if there is a way to do the same without having to convert to a DataFrameSchema.
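One direction that might avoid the conversion is a minimal, untested sketch using pandera's @pa.check decorator for custom column checks on a DataFrameModel; the module-level ALLOWED_COL2_VALUES holder and the model name are made up for illustration. The idea is to keep the model and have its check read allowed values that are collected at runtime:

import pandas as pd
import pandera as pa

# hypothetical holder for the allowed values, filled in at runtime after df1 is loaded
ALLOWED_COL2_VALUES: set = set()

class MyRuntimeCheckedModel(pa.DataFrameModel):
    col2: pa.typing.Series[int] = pa.Field(ge=0)

    @pa.check("col2", name="col2_in_df1_col1")
    def col2_in_df1_col1(cls, series: pd.Series) -> pd.Series:
        # compares against whatever values were registered at runtime
        return series.isin(ALLOWED_COL2_VALUES)

df1 = pd.DataFrame({'col1': [1, 2, 2]})
ALLOWED_COL2_VALUES.update(df1.col1.unique())

MyRuntimeCheckedModel.validate(pd.DataFrame({'col2': [1, 2, 1]}))  # passes
MyRuntimeCheckedModel.validate(pd.DataFrame({'col2': [1, 2, 3]}))  # raises SchemaError

I am not sure this is idiomatic, though, since the allowed values live outside the model.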
If you are referring to Pandas when writing pandera, then: you need to merge the two dataframes on a common key and then compare the two columns, now in the same dataframe. For more help you need to post a sample of the two dataframes and the expected output.
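(For illustration only, a plain-pandas version of what that comment suggests; the column names are just the ones from the question above.)

import pandas as pd

df1 = pd.DataFrame({'col1': [1, 2, 2]})
df2 = pd.DataFrame({'col2': [1, 2, 3]})

# merge df2 onto df1 and flag rows of df2 whose value has no match in df1.col1
merged = df2.merge(df1, left_on='col2', right_on='col1', how='left', indicator=True)
unmatched = merged.loc[merged['_merge'] == 'left_only', 'col2']
print(unmatched)  # contains 3, which does not appear in df1.col1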
No, I was referring to Pandera (pandera.readthedocs.io).
I have found a solution (see edit), but it involves a DataFrameSchema rather than the object-oriented API with a DataFrameModel. So I leave this question open.