Test data leakage in step 05a (PyTorch and Tensorflow) #55

Open
jhauffa opened this issue Nov 1, 2022 · 1 comment

jhauffa commented Nov 1, 2022

The two notebooks 05a - Deep Neural Networks (PyTorch).ipynb and 05a - Deep Neural Networks (TensorFlow).ipynb contain the following piece of code:

	 # The dataset is too small to be useful for deep learning
	 # So we'll oversample it to increase its size
	 for i in range(1,3):
		 penguins = penguins.append(penguins)

This creates a new dataframe that contains four copies of each row of the original dataframe. Since this happens before the training/test split, the probability that a given row of the original dataframe appears in both the training set and the test set is approximately 0.75. In other words, one can expect about 3/4 of the original rows to be present in both sets.
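
The 0.75 figure can be checked with a short calculation. The sketch below assumes a 70/30 training/test split (the split ratio is not stated in this issue, so it is an assumption) and treats each copy as landing in the test set independently, which is only an approximation of a real shuffle-and-split. The function names are made up for illustration.

```python
import random


def overlap_probability(copies: int, test_fraction: float) -> float:
    """Chance that a row duplicated `copies` times lands in both splits.

    Approximation: each copy is assigned to the test set independently
    with probability `test_fraction`.
    """
    all_in_train = (1.0 - test_fraction) ** copies
    all_in_test = test_fraction ** copies
    return 1.0 - all_in_train - all_in_test


def simulated_overlap(n_rows: int, copies: int, test_fraction: float, seed: int = 0) -> float:
    """Shuffle the duplicated row ids, split once, and measure the overlap."""
    rng = random.Random(seed)
    ids = [row_id for row_id in range(n_rows) for _ in range(copies)]
    rng.shuffle(ids)
    n_test = int(len(ids) * test_fraction)
    test_ids, train_ids = set(ids[:n_test]), set(ids[n_test:])
    # Fraction of original rows that have at least one copy in each split.
    return len(test_ids & train_ids) / n_rows


# Assumed 70/30 split, four copies per row as produced by the loop above.
print(round(overlap_probability(4, 0.3), 4))  # 0.7518
print(simulated_overlap(1000, 4, 0.3))        # close to the analytic value
```

With other split ratios the number changes, but it stays high for any reasonable test fraction because a row only escapes the overlap if all four copies fall on the same side.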

This constitutes a leakage of information from the test set into the training set, so the test set can no longer measure how well the trained model generalizes. In the case of the penguin toy dataset, this does not matter much: the three species appear to be well separated in feature space, so overfitting is not an immediate concern. Still, mixing training and test data is bad practice and should not be taught to ML beginners.
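
One way to make the leakage visible is to count the test rows that also occur verbatim in the training set. This is only a sketch: `count_leaked_rows` is a hypothetical helper, not part of the notebooks, and the toy column names are invented. Note that a dataset can also contain a few natural duplicates, so a small non-zero count is not proof of a bug on its own.

```python
import pandas as pd


def count_leaked_rows(train: pd.DataFrame, test: pd.DataFrame) -> int:
    """Number of test rows whose values also appear as a row in train."""
    # Deduplicate train so each test row matches at most once in the merge.
    train_unique = train.drop_duplicates()
    merged = test.merge(
        train_unique, how="left", on=list(test.columns), indicator=True
    )
    return int((merged["_merge"] == "both").sum())


# Toy stand-in data with invented column names.
toy = pd.DataFrame({"bill_length": [39.1, 46.5, 50.0], "species": [0, 1, 2]})

train = toy.iloc[[0, 1]]  # rows 0 and 1
test = toy.iloc[[1, 2]]   # rows 1 and 2, so row 1 is shared

print(count_leaked_rows(train, test))  # 1
```

Running such a check right after the split on the oversampled dataframe should report that most test rows have an identical twin in the training set.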

I therefore suggest the removal of the piece of code shown above. Since the model is no longer exposed to multiple copies of each row in one epoch of training, the number of epochs has to be increased to achieve the same test set accuracy. Training for 100 instead of 50 epochs worked well in my tests.

hassou commented Aug 13, 2023

This piece of code was a problem for me as well.
Executing it results in the following error:
AttributeError: 'DataFrame' object has no attribute 'append'
I am using an Azure Machine Learning workspace with Python 3.8.
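
A likely cause, assuming the workspace has pandas 2.0 or later installed: `DataFrame.append` was deprecated in pandas 1.4 and removed in 2.0. The sketch below shows how `pd.concat` would reproduce what the notebook loop did; the small dataframe and its column names are stand-ins, not the real penguins data. It only fixes the error and still quadruples the rows before the split, so the leakage described in the first comment remains.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's penguins dataframe.
penguins = pd.DataFrame({"CulmenLength": [39.1, 46.5], "Species": [0, 1]})

# Same effect as the removed penguins.append(penguins), applied twice:
# each pass doubles the dataframe, giving four copies of every row.
for _ in range(2):
    penguins = pd.concat([penguins, penguins])

print(len(penguins))  # 8
```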
