Linear regression output is nonsensical

Question

I have a dataset and am trying to fill in its missing values by fitting a 2D linear regression to the surrounding points and using the fitted slopes to approximate each missing value. I am not sure this is the right approach and am open to other ideas. Here is my example:

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

local_window = pd.DataFrame({102.5: {0.021917: 0.0007808776581961896,
                                 0.030136: 0.0009108521507099643,
                                 0.035616: 0.001109650616093018,
                                 0.041095: 0.0013238862647034224,
                                 0.060273: 0.0018552410055933753},
                         105.0: {0.021917: 0.0008955896980595855,
                                 0.030136: 0.001003244315807649,
                                 0.035616: 0.0011852612740301449,
                                 0.041095: 0.0013952857530607904,
                                 0.060273: 0.0018525880756980716},
                         107.5: {0.021917: np.nan,
                                 0.030136: 0.0012354997955153118,
                                 0.035616: 0.00140044893559622,
                                 0.041095: 0.0015902024099268574,
                                 0.060273: 0.001973254493672934}})

def predict_nan_local(local_window):
    if not local_window.isnull().values.any():
        return local_window
    
    # Extract x and y values for the local window
    X_local = local_window.columns.values.copy()
    y_local = local_window.index.values.copy()

    # Create a meshgrid of x and y values
    X_local, y_local = np.meshgrid(X_local, y_local)

    # Flatten x and y for fitting the model
    X_local_flat = X_local.flatten()
    y_local_flat = y_local.flatten()
    values_local_flat = local_window.values.flatten()
    
    # Find indices of non-NaN values
    non_nan_indices = ~np.isnan(values_local_flat)

    # Filter out NaN values
    X_local_flat_filtered = X_local_flat[non_nan_indices]
    y_local_flat_filtered = y_local_flat[non_nan_indices]
    values_local_flat_filtered = values_local_flat[non_nan_indices]

    regressor = LinearRegression()
    regressor.fit(np.column_stack((X_local_flat_filtered, y_local_flat_filtered)), values_local_flat_filtered)
    
    nan_indices = np.argwhere(np.isnan(local_window.values))
    X_nan = local_window.columns.values[nan_indices[:, 1]]
    y_nan = local_window.index.values[nan_indices[:, 0]]
    
    # Predict missing value
    predicted_values = regressor.predict(np.column_stack((X_nan, y_nan)))

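    # Write each prediction to its own cell. Passing two index arrays to
    # .iloc selects the cross product of rows and columns, which is only
    # correct when exactly one value is missing.
    for (row, col), value in zip(nan_indices, predicted_values):
        local_window.iat[row, col] = value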

    return local_window

The output, as you can see, doesn't make much sense. Is there anything I am missing?
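
For reference, the fit can be sanity-checked outside pandas and scikit-learn with a minimal sketch like the one below. It is not the code above: `fit_plane_and_fill` is a hypothetical helper, and it swaps `LinearRegression` for `np.linalg.lstsq` on a design matrix with an explicit intercept column, which should yield the same ordinary least squares plane z = a + b*x + c*y. The arrays are copied from the example window.

```python
import numpy as np

# Column labels (x), index labels (y), and values (z) from the example window
x = np.array([102.5, 105.0, 107.5])
y = np.array([0.021917, 0.030136, 0.035616, 0.041095, 0.060273])
z = np.array([
    [0.0007808776581961896, 0.0008955896980595855, np.nan],
    [0.0009108521507099643, 0.001003244315807649, 0.0012354997955153118],
    [0.001109650616093018, 0.0011852612740301449, 0.00140044893559622],
    [0.0013238862647034224, 0.0013952857530607904, 0.0015902024099268574],
    [0.0018552410055933753, 0.0018525880756980716, 0.001973254493672934],
])


def fit_plane_and_fill(x, y, z):
    """Hypothetical helper: fit z = a + b*x + c*y to the known cells by
    ordinary least squares and fill the NaN cells from that plane."""
    xx, yy = np.meshgrid(x, y)          # same shape as z: (len(y), len(x))
    known = ~np.isnan(z)
    missing = ~known

    # Design matrix with an intercept column, one row per known cell
    A = np.column_stack([np.ones(known.sum()), xx[known], yy[known]])
    coef, *_ = np.linalg.lstsq(A, z[known], rcond=None)

    # Residuals show how well a single plane describes the known points
    residuals = z[known] - A @ coef

    # Evaluate the plane at the missing cells only
    A_missing = np.column_stack([np.ones(missing.sum()), xx[missing], yy[missing]])
    filled = z.copy()
    filled[missing] = A_missing @ coef
    return filled, coef, residuals


filled, coef, residuals = fit_plane_and_fill(x, y, z)
print("intercept, slope in x, slope in y:", coef)
print("predicted value for the missing cell:", filled[0, 2])
print("largest absolute residual:", np.abs(residuals).max())
```

If the largest residual is of the same order as the values themselves, a single plane is probably a poor description of this window, and the value extrapolated to the corner cell can look off even though the regression itself is working as intended.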
