Data Transformation Issue on End-to-End ML Project – KeyError: ‘Date_of_Journey’

I am following the process shown in the Wine Quality Prediction End-to-End ML Project video on Krish Naik’s YouTube channel to build a Flight Fare Prediction project.

I run this cell of the data transformation pipeline in 03_data_transformation.ipynb:

try:
    config = ConfigurationManager()
    data_transformation_config = config.get_data_transformation_config()
    data_transformation = DataTransformation(config=data_transformation_config)
    # data_transformation.train_test_spliting()
    # New Line
    data_transformation.initiate_data_transformation()
except Exception as e:
    raise e

I get this error:

KeyError: 'Date_of_Journey'

Here is the full traceback:

[2023-11-24 10:34:37,441: INFO: common: yaml file: config\config.yaml loaded successfully]
[2023-11-24 10:34:37,450: INFO: common: yaml file: params.yaml loaded successfully]
[2023-11-24 10:34:37,457: INFO: common: yaml file: schema.yaml loaded successfully]
[2023-11-24 10:34:37,459: INFO: common: created directory at: artifacts]
[2023-11-24 10:34:37,462: INFO: common: created directory at: artifacts/data_transformation]
[2023-11-24 10:34:41,604: INFO: 1223503272: Read data completed]
[2023-11-24 10:34:41,604: INFO: 1223503272: df dataframe head: 
   Total_Stops  Price  journey_date  journey_month  Air Asia  Air India  GoAir  IndiGo  Jet Airways  Jet Airways Business  Multiple carriers  Multiple carriers Premium economy  SpiceJet  Vistara  Vistara Premium economy  Chennai  Mumbai  Cochin  Hyderabad  New Delhi  duration
0            0   3897            24              3         0          0      0       1            0                     0                  0                                  0         0        0                        0        0       0       0          0          1         2
1            2   7662             1              5         0          1      0       0            0                     0                  0                                  0         0        0                        0        0       0       0          0          0         3
2            2  13882             9              6         0          0      0       0            1                     0                  0                                  0         0        0                        0        0       0       1          0          0         3
3            1   6218            12              5         0          0      0       1            0                     0                  0                                  0         0        0                        0        0       0       0          0          0         2
4            1  13302             1              3         0          0      0       1            0                     0                  0                                  0         0        0                        0        0       0       0          0          1         2]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File c:\Users\2021\.conda\envs\flightfareprediction\lib\site-packages\pandas\core\indexes\base.py:3653, in Index.get_loc(self, key)
   3652 try:
-> 3653     return self._engine.get_loc(casted_key)
   3654 except KeyError as err:

File c:\Users\2021\.conda\envs\flightfareprediction\lib\site-packages\pandas\_libs\index.pyx:147, in pandas._libs.index.IndexEngine.get_loc()

File c:\Users\2021\.conda\envs\flightfareprediction\lib\site-packages\pandas\_libs\index.pyx:176, in pandas._libs.index.IndexEngine.get_loc()

File pandas\_libs\hashtable_class_helper.pxi:7080, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas\_libs\hashtable_class_helper.pxi:7088, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'Date_of_Journey'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
g:\Machine_Learning_Projects\iNeuron internship\Flight-Fare-Prediction-End-to-End-ML-Project\research\03_data_transformation.ipynb Cell 10 line 9
      7     data_transformation.initiate_data_transformation()
      8 except Exception as e:
----> 9     raise e

g:\Machine_Learning_Projects\iNeuron internship\Flight-Fare-Prediction-End-to-End-ML-Project\research\03_data_transformation.ipynb Cell 10 line 7
      4     data_transformation = DataTransformation(config=data_transformation_config)
      5     # data_transformation.train_test_spliting()
      6     # New Line
----> 7     data_transformation.initiate_data_transformation()
      8 except Exception as e:
      9     raise e

g:\Machine_Learning_Projects\iNeuron internship\Flight-Fare-Prediction-End-to-End-ML-Project\research\03_data_transformation.ipynb Cell 10 line 4
     37 df.dropna(inplace = True)
     39 ## Date of journey column transformation
---> 40 df['journey_date'] = pd.to_datetime(df['Date_of_Journey'], format ="%d/%m/%Y").dt.day
     41 df['journey_month'] = pd.to_datetime(df['Date_of_Journey'], format ="%d/%m/%Y").dt.month
     43 ## encoding total stops.

File c:\Users\2021\.conda\envs\flightfareprediction\lib\site-packages\pandas\core\frame.py:3761, in DataFrame.__getitem__(self, key)
   3759 if self.columns.nlevels > 1:
   3760     return self._getitem_multilevel(key)
-> 3761 indexer = self.columns.get_loc(key)
   3762 if is_integer(indexer):
   3763     indexer = [indexer]

File c:\Users\2021\.conda\envs\flightfareprediction\lib\site-packages\pandas\core\indexes\base.py:3655, in Index.get_loc(self, key)
   3653     return self._engine.get_loc(casted_key)
   3654 except KeyError as err:
-> 3655     raise KeyError(key) from err
   3656 except TypeError:
   3657     # If we have a listlike key, _check_indexing_error will raise
   3658     #  InvalidIndexError. Otherwise we fall through and re-raise
   3659     #  the TypeError.
   3660     self._check_indexing_error(key)

KeyError: 'Date_of_Journey'

Here is the code of the data transformation cell:

class DataTransformation:

    # New Function Added
    # https://github.com/yash1314/Flight-Price-Prediction/blob/main/src/utils.py
    def convert_to_minutes(self, duration):
        try:
            hours, minute = 0, 0
            for i in duration.split():
                if 'h' in i:
                    hours = int(i[:-1])
                elif 'm' in i:
                    minute = int(i[:-1])
            return hours * 60 + minute
        except :
            return None 

    def __init__(self, config: DataTransformationConfig):
        self.config = config

    
    ## Note: You can add different data transformation techniques such as Scaler, PCA and all
    #You can perform all kinds of EDA in ML cycle here before passing this data to the model

    # I am only adding train_test_spliting cz this data is already cleaned up

    # New Code Added Start
    def initiate_data_transformation(self):
        ## reading the data
        # df = pd.read_csv(self.config.data_path)
        # New Line
        df = pd.read_excel(self.config.data_path)

        logger.info('Read data completed')
        logger.info(f'df dataframe head: \n{df.head().to_string()}')

        ## dropping null values
        df.dropna(inplace = True)

        ## Date of journey column transformation
        df['journey_date'] = pd.to_datetime(df['Date_of_Journey'], format ="%d/%m/%Y").dt.day
        df['journey_month'] = pd.to_datetime(df['Date_of_Journey'], format ="%d/%m/%Y").dt.month

        ## encoding total stops.
        df.replace({'Total_Stops': {'non-stop' : 0, '1 stop': 1, '2 stops': 2, '3 stops': 3, '4 stops': 4}}, inplace = True)

        ## encoding airline, source, and destination
        df_airline = pd.get_dummies(df['Airline'], dtype=int)
        df_source = pd.get_dummies(df['Source'],  dtype=int)
        df_dest = pd.get_dummies(df['Destination'], dtype=int)

        ## dropping first columns of each categorical variables.
        df_airline.drop('Trujet', axis = 1, inplace = True)
        df_source.drop('Banglore', axis = 1, inplace = True)
        df_dest.drop('Banglore', axis = 1, inplace = True)

        df = pd.concat([df, df_airline, df_source, df_dest], axis = 1)
       
        ## handling duration column
        # df['duration'] = df['Duration'].apply(convert_to_minutes)
        # New Line Added
        df['duration'] = df['Duration'].apply(self.convert_to_minutes)
        upper_time_limit = df.duration.mean() + 1.5 * df.duration.std()
        df['duration'] = df['duration'].clip(upper = upper_time_limit)

        ## encoding duration column
        bins = [0, 120, 360, 1440]  # custom bin intervals for 'Short,' 'Medium,' and 'Long'
        labels = ['Short', 'Medium', 'Long'] # creating labels for encoding

        df['duration'] = pd.cut(df['duration'], bins=bins, labels=labels)
        df.replace({'duration': {'Short':1, 'Medium':2, 'Long': 3}}, inplace = True)
        
        ## dropping the columns
        cols_to_drop = ['Airline', 'Date_of_Journey', 'Source', 'Destination', 'Route', 'Dep_Time', 'Arrival_Time', 'Duration', 'Additional_Info', 'Delhi', 'Kolkata']

        df.drop(cols_to_drop, axis = 1, inplace = True)

        logger.info('df data transformation completed')
        logger.info(f' transformed df data head: \n{df.head().to_string()}')

        # df.to_csv(self.data_transformation_config.transformed_data_file_path, index = False, header= True)
        # New Line
        df.to_excel(self.config.data_path, index = False, header= True)
        # df.to_excel(self.config.transformed_data_file_path, index = False, header= True)
        # df.to_excel(self.data_transformation_config.transformed_data_file_path, index = False, header= True)
        logger.info("transformed data is stored")
        df.head(1)
        ## splitting the data into training and target data
        X = df.drop('Price', axis = 1)
        y = df['Price']
        
        ## accessing the feature importance.
        select = ExtraTreesRegressor()
        select.fit(X, y)

        # plt.figure(figsize=(12, 8))
        # fig_importances = pd.Series(select.feature_importances_, index=X.columns)
        # fig_importances.nlargest(20).plot(kind='barh')
    
        # ## specify the path to the "visuals" folder using os.path.join
        # visuals_folder="visuals"
        # if not os.path.exists(visuals_folder):
        #     os.makedirs(visuals_folder)

        # ## save the plot in the visuals folder
        # plt.savefig(os.path.join(visuals_folder, 'feature_importance_plot.png'))
        # logger.info('feature imp figure saving is successful')

        ## further Splitting the data.
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42, shuffle = True) 
        logger.info('final splitting the data is successful')
        

        ## returning splitted data and data_path.
        return (
            X_train, 
            X_test, 
            y_train, 
            y_test,
            self.config.data_path
            # self.data_transformation_config.transformed_data_file_path
        ) 

Here is my file on GitHub.

My file encoding is UTF-8.

How can I fix this issue?

  • I believe the issue stems from self.config.data_path being empty, so you may not actually be reading any file other than a blank one. Are you passing in the correct file path (are actual paths going to ConfigurationManager)? Is the logger telling you which columns are in the spreadsheet you are trying to read?

  • @AndrewRyan I believe it’s reading the right file, which is why the output shows string values converted to integers. Did you see the complete traceback?

  • What file are you trying to read, and what file are you actually reading? The dataframe printed before the error does not have a column labeled Date_of_Journey.

  • @AndrewRyan I am trying to read the Data_Train.xlsx file. You can find it here: github.com/MdEhsanulHaqueKanan/… It contains a column called Date_of_Journey.
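
One possible reading of the log, offered as a guess rather than a confirmed diagnosis: the dataframe head printed right after "Read data completed" already shows journey_date, journey_month and the one-hot airline columns, and initiate_data_transformation() ends with df.to_excel(self.config.data_path, ...). If an earlier run got that far, it may have overwritten the input file with the transformed data, so the next run would read a file that no longer has Date_of_Journey. The sketch below uses a tiny made-up dataframe (not Data_Train.xlsx) and two hypothetical helpers, missing_columns and add_journey_parts, to show how a column check would turn that situation into an explicit error.

```python
import pandas as pd


def missing_columns(df, required):
    """Return the names in `required` that are not columns of df."""
    return [col for col in required if col not in df.columns]


def add_journey_parts(df):
    """Return a copy of df with journey_date and journey_month added.

    Raises ValueError with the columns actually present if
    Date_of_Journey is absent, instead of a bare KeyError.
    """
    missing = missing_columns(df, ["Date_of_Journey"])
    if missing:
        raise ValueError(
            f"Input is missing {missing}; columns found: {list(df.columns)}"
        )
    out = df.copy()
    parsed = pd.to_datetime(out["Date_of_Journey"], format="%d/%m/%Y")
    out["journey_date"] = parsed.dt.day
    out["journey_month"] = parsed.dt.month
    return out


# Made-up rows standing in for the raw spreadsheet (assumption, not real data).
raw = pd.DataFrame(
    {"Date_of_Journey": ["24/03/2019", "01/05/2019"], "Price": [3897, 7662]}
)

# First pass: works, and the source column is dropped afterwards.
transformed = add_journey_parts(raw).drop(columns=["Date_of_Journey"])

# Second pass on already-transformed data: the check reports the problem.
try:
    add_journey_parts(transformed)
except ValueError as err:
    print(err)
```

If this is what happened, restoring the original Data_Train.xlsx and writing the transformed frame to a different path than self.config.data_path (for example the transformed_data_file_path attribute mentioned in the commented-out lines, assuming it exists in the config) should keep the raw input intact between runs.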
