I am following the process shown in the Wine Quality Prediction End-to-End ML Project on Krish Naik’s YouTube channel to build a Flight Fare Prediction project.
I run this data transformation pipeline cell in 03_data_transformation.ipynb:
try:
    config = ConfigurationManager()
    data_transformation_config = config.get_data_transformation_config()
    data_transformation = DataTransformation(config=data_transformation_config)
    # data_transformation.train_test_spliting()
    # New Line
    data_transformation.initiate_data_transformation()
except Exception as e:
    raise e
I get this error:
KeyError: 'Date_of_Journey'
Here is the full traceback:
[2023-11-24 10:34:37,441: INFO: common: yaml file: config\config.yaml loaded successfully]
[2023-11-24 10:34:37,450: INFO: common: yaml file: params.yaml loaded successfully]
[2023-11-24 10:34:37,457: INFO: common: yaml file: schema.yaml loaded successfully]
[2023-11-24 10:34:37,459: INFO: common: created directory at: artifacts]
[2023-11-24 10:34:37,462: INFO: common: created directory at: artifacts/data_transformation]
[2023-11-24 10:34:41,604: INFO: 1223503272: Read data completed]
[2023-11-24 10:34:41,604: INFO: 1223503272: df dataframe head:
Total_Stops Price journey_date journey_month Air Asia Air India GoAir IndiGo Jet Airways Jet Airways Business Multiple carriers Multiple carriers Premium economy SpiceJet Vistara Vistara Premium economy Chennai Mumbai Cochin Hyderabad New Delhi duration
0 0 3897 24 3 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 2
1 2 7662 1 5 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3
2 2 13882 9 6 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 3
3 1 6218 12 5 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 2
4 1 13302 1 3 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 2]
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
File c:\Users\2021\.conda\envs\flightfareprediction\lib\site-packages\pandas\core\indexes\base.py:3653, in Index.get_loc(self, key)
3652 try:
-> 3653 return self._engine.get_loc(casted_key)
3654 except KeyError as err:
File c:\Users\2021\.conda\envs\flightfareprediction\lib\site-packages\pandas\_libs\index.pyx:147, in pandas._libs.index.IndexEngine.get_loc()
File c:\Users\2021\.conda\envs\flightfareprediction\lib\site-packages\pandas\_libs\index.pyx:176, in pandas._libs.index.IndexEngine.get_loc()
File pandas\_libs\hashtable_class_helper.pxi:7080, in pandas._libs.hashtable.PyObjectHashTable.get_item()
File pandas\_libs\hashtable_class_helper.pxi:7088, in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 'Date_of_Journey'
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
g:\Machine_Learning_Projects\iNeuron internship\Flight-Fare-Prediction-End-to-End-ML-Project\research\03_data_transformation.ipynb Cell 10 line 9
      7 data_transformation.initiate_data_transformation()
      8 except Exception as e:
----> 9 raise e
g:\Machine_Learning_Projects\iNeuron internship\Flight-Fare-Prediction-End-to-End-ML-Project\research\03_data_transformation.ipynb Cell 10 line 7
      4 data_transformation = DataTransformation(config=data_transformation_config)
      5 # data_transformation.train_test_spliting()
      6 # New Line
----> 7 data_transformation.initiate_data_transformation()
      8 except Exception as e:
      9 raise e
g:\Machine_Learning_Projects\iNeuron internship\Flight-Fare-Prediction-End-to-End-ML-Project\research\03_data_transformation.ipynb Cell 10 line 4
     37 df.dropna(inplace = True)
     39 ## Date of journey column transformation
---> 40 df['journey_date'] = pd.to_datetime(df['Date_of_Journey'], format ="%d/%m/%Y").dt.day
     41 df['journey_month'] = pd.to_datetime(df['Date_of_Journey'], format ="%d/%m/%Y").dt.month
     43 ## encoding total stops.
File c:\Users\2021\.conda\envs\flightfareprediction\lib\site-packages\pandas\core\frame.py:3761, in DataFrame.__getitem__(self, key)
3759 if self.columns.nlevels > 1:
3760 return self._getitem_multilevel(key)
-> 3761 indexer = self.columns.get_loc(key)
3762 if is_integer(indexer):
3763 indexer = [indexer]
File c:\Users\2021\.conda\envs\flightfareprediction\lib\site-packages\pandas\core\indexes\base.py:3655, in Index.get_loc(self, key)
3653 return self._engine.get_loc(casted_key)
3654 except KeyError as err:
-> 3655 raise KeyError(key) from err
3656 except TypeError:
3657 # If we have a listlike key, _check_indexing_error will raise
3658 # InvalidIndexError. Otherwise we fall through and re-raise
3659 # the TypeError.
3660 self._check_indexing_error(key)
KeyError: 'Date_of_Journey'
Here is the code of the data transformation cell:
class DataTransformation:
    # New Function Added
    # https://github.com/yash1314/Flight-Price-Prediction/blob/main/src/utils.py
    def convert_to_minutes(self, duration):
        try:
            hours, minute = 0, 0
            for i in duration.split():
                if 'h' in i:
                    hours = int(i[:-1])
                elif 'm' in i:
                    minute = int(i[:-1])
            return hours * 60 + minute
        except:
            return None

    def __init__(self, config: DataTransformationConfig):
        self.config = config

    ## Note: You can add different data transformation techniques such as Scaler, PCA and all
    # You can perform all kinds of EDA in ML cycle here before passing this data to the model
    # I am only adding train_test_spliting because this data is already cleaned up

    # New Code Added Start
    def initiate_data_transformation(self):
        ## reading the data
        # df = pd.read_csv(self.config.data_path)
        # New Line
        df = pd.read_excel(self.config.data_path)
        logger.info('Read data completed')
        logger.info(f'df dataframe head: \n{df.head().to_string()}')

        ## dropping null values
        df.dropna(inplace=True)

        ## Date of journey column transformation
        df['journey_date'] = pd.to_datetime(df['Date_of_Journey'], format="%d/%m/%Y").dt.day
        df['journey_month'] = pd.to_datetime(df['Date_of_Journey'], format="%d/%m/%Y").dt.month

        ## encoding total stops
        df.replace({'Total_Stops': {'non-stop': 0, '1 stop': 1, '2 stops': 2, '3 stops': 3, '4 stops': 4}}, inplace=True)

        ## encoding airline, source, and destination
        df_airline = pd.get_dummies(df['Airline'], dtype=int)
        df_source = pd.get_dummies(df['Source'], dtype=int)
        df_dest = pd.get_dummies(df['Destination'], dtype=int)

        ## dropping first columns of each categorical variable
        df_airline.drop('Trujet', axis=1, inplace=True)
        df_source.drop('Banglore', axis=1, inplace=True)
        df_dest.drop('Banglore', axis=1, inplace=True)
        df = pd.concat([df, df_airline, df_source, df_dest], axis=1)

        ## handling duration column
        # df['duration'] = df['Duration'].apply(convert_to_minutes)
        # New Line Added
        df['duration'] = df['Duration'].apply(self.convert_to_minutes)
        upper_time_limit = df.duration.mean() + 1.5 * df.duration.std()
        df['duration'] = df['duration'].clip(upper=upper_time_limit)

        ## encoding duration column
        bins = [0, 120, 360, 1440]  # custom bin intervals for 'Short', 'Medium', and 'Long'
        labels = ['Short', 'Medium', 'Long']  # creating labels for encoding
        df['duration'] = pd.cut(df['duration'], bins=bins, labels=labels)
        df.replace({'duration': {'Short': 1, 'Medium': 2, 'Long': 3}}, inplace=True)

        ## dropping the columns
        cols_to_drop = ['Airline', 'Date_of_Journey', 'Source', 'Destination', 'Route', 'Dep_Time', 'Arrival_Time', 'Duration', 'Additional_Info', 'Delhi', 'Kolkata']
        df.drop(cols_to_drop, axis=1, inplace=True)
        logger.info('df data transformation completed')
        logger.info(f'transformed df data head: \n{df.head().to_string()}')

        # df.to_csv(self.data_transformation_config.transformed_data_file_path, index=False, header=True)
        # New Line
        df.to_excel(self.config.data_path, index=False, header=True)
        # df.to_excel(self.config.transformed_data_file_path, index=False, header=True)
        # df.to_excel(self.data_transformation_config.transformed_data_file_path, index=False, header=True)
        logger.info("transformed data is stored")
        df.head(1)

        ## splitting the data into training and target data
        X = df.drop('Price', axis=1)
        y = df['Price']

        ## accessing the feature importance
        select = ExtraTreesRegressor()
        select.fit(X, y)
        # plt.figure(figsize=(12, 8))
        # fig_importances = pd.Series(select.feature_importances_, index=X.columns)
        # fig_importances.nlargest(20).plot(kind='barh')
        # ## specify the path to the "visuals" folder using os.path.join
        # visuals_folder = "visuals"
        # if not os.path.exists(visuals_folder):
        #     os.makedirs(visuals_folder)
        # ## save the plot in the visuals folder
        # plt.savefig(os.path.join(visuals_folder, 'feature_importance_plot.png'))
        # logger.info('feature imp figure saving is successful')

        ## further splitting the data
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, shuffle=True)
        logger.info('final splitting the data is successful')

        ## returning the split data and data_path
        return (
            X_train,
            X_test,
            y_train,
            y_test,
            self.config.data_path
            # self.data_transformation_config.transformed_data_file_path
        )
Here is my file on GitHub.
My file encoding is UTF-8.
How can I fix this issue?
I believe the issue stems from self.config.data_path being empty, so you may not actually be reading any file besides a blank one. Are you passing in the correct file path (are actual paths going to ConfigurationManager)? Is the logger telling you the columns that are in the spreadsheet you are trying to read from?
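For illustration, a minimal sketch of that check, placed right after the pd.read_excel call inside initiate_data_transformation (the two extra logger lines are debugging additions, not part of the original code):

        df = pd.read_excel(self.config.data_path)
        # debugging additions: confirm which file was loaded and which columns it actually contains
        logger.info(f'data_path being read: {self.config.data_path}')
        logger.info(f'columns in the loaded file: {df.columns.tolist()}')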
@AndrewRyan I believe it is reading the right file, and that is why it produces output after converting the string values to integers. Did you see the complete traceback?
What file are you trying to read, and what file are you actually reading? Because the dataframe that is printed out before the error does not have a column labeled Date_of_Journey.
@AndrewRyan I am trying to read the Data_Train.xlsx file. You can find it here: github.com/MdEhsanulHaqueKanan/… It contains a column called Date_of_Journey.
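One way to settle which columns the file on disk actually has is to read it directly, outside the pipeline. A minimal sketch, using a placeholder path (substitute whatever self.config.data_path resolves to in config.yaml):

import pandas as pd

# placeholder path: point this at the file your config.yaml data_path refers to
raw = pd.read_excel('Data_Train.xlsx')
print(raw.columns.tolist())
print('Date_of_Journey' in raw.columns)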