I am reading the book “Beginning Anomaly Detection Using Python” by Sridhar Alla and Suman Kalyan Adari. In Chapter 2, the authors explain traditional methods of anomaly detection and provide a sample Isolation Forest implementation. They use the KDD Cup 1999 dataset to train the model.
While training the model, they did not remove the label column, in which each connection is marked as normal, back, and so on. I am confused as to why they did not drop this column, since it already encodes whether the traffic is normal or not. It seems clear to me that this could lead to data leakage.
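For context, here is a rough sketch (not the book’s exact code) of the setup I had expected, using scikit-learn’s copy of the dataset, where the label column is kept separate and only the features are passed to the model:

```python
# Rough sketch only, using scikit-learn's bundled 10% subset of KDD Cup 1999;
# the book works with the raw files, so the details here are illustrative.
from sklearn.datasets import fetch_kddcup99
from sklearn.ensemble import IsolationForest

# return_X_y keeps the label column (b'normal.', b'back.', ...) separate in y
X, y = fetch_kddcup99(percent10=True, as_frame=True, return_X_y=True)

# Keep only numeric features for this sketch; the categorical columns would
# need encoding before being fed to the model
X_num = X.drop(columns=["protocol_type", "service", "flag"]).astype(float)

# Train on features only -- the labels in y are never shown to the model
model = IsolationForest(n_estimators=100, random_state=42)
model.fit(X_num)
```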
I would like to ask whether my understanding is correct or not.
This is an example of supervised learning in AI/ML: the model is given the feature/attribute columns as well as the final result column (in this case, the column of labels marking each connection as “bad” or “good”/normal). This final result column is used to teach the model that a given set of characteristics likely corresponds to a certain label.
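As a generic illustration on synthetic data (not the book’s code), the labels are consumed only inside fit(); at prediction time the model sees features alone:

```python
# Minimal supervised-learning sketch: y is used during training (fit),
# withheld at prediction time, and then used only to score the predictions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = RandomForestClassifier(random_state=0)
clf.fit(X_train, y_train)            # training: features AND labels
preds = clf.predict(X_test)          # prediction: features only
print("test accuracy:", accuracy_score(y_test, preds))  # labels used only for scoring
```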
Think of it like a student learning something in class. The teacher teaches the student how to do certain things or how to infer information from given information, and at the same time provides the correct answers so the student knows what is right and what is wrong. This is the training phase. It is only in the validation and testing phases that the model is not shown the “bad” or “good” labels. The validation phase is like the quizzes a teacher gives in class as revision for the final exam, and the testing phase is the final exam itself.
Thus, there is no data leakage of any sort. What you may be thinking of is “overfitting”, a scenario that arises when the train-test-validation split is not done optimally: the model learns the training data a little too well and then performs poorly on new data.
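For completeness, a common way to carve out all three sets is two successive splits; the 60/20/20 proportions below are just an example:

```python
# Illustrative 60/20/20 train/validation/test split on synthetic data
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # roughly 600 / 200 / 200
```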
If no labels are available or given, the problem falls under unsupervised learning, which is still a tough task, as we have no ground-truth annotations with which to truly evaluate the model.
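For example, an Isolation Forest can be fit without any labels at all (a small sketch on made-up data, not the book’s code); it flags anomalies purely from the data, and without ground truth there is nothing to score those flags against:

```python
# Unsupervised sketch: no labels are ever passed in; fit_predict returns
# -1 for points flagged as anomalies and 1 for inliers. Without ground-truth
# labels there is no accuracy or recall number to compute for these flags.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(500, 2)),   # bulk of "normal" points
               rng.normal(6, 1, size=(10, 2))])   # a handful of outliers

iso = IsolationForest(contamination=0.02, random_state=0)
flags = iso.fit_predict(X)
print("flagged as anomalous:", int((flags == -1).sum()))
```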