The two notebooks 05a - Deep Neural Networks (PyTorch).ipynb and 05a - Deep Neural Networks (TensorFlow).ipynb contain the following piece of code:
# The dataset is too small to be useful for deep learning
# So we'll oversample it to increase its size
for i in range(1,3):
    penguins = penguins.append(penguins)
This creates a new dataframe that contains four copies of each row of the original dataframe. Since this happens before the training/test split, the probability that a given row of the original dataframe ends up in both the training and the test set is approximately 0.75. In other words, one can expect 3/4 of the original rows to be present in both sets.
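The 0.75 figure can be reproduced with a short calculation. The sketch below assumes a 70/30 training/test split (the split ratio is not stated above) and treats the assignment of each copy as independent, which is only an approximation of a fixed-size random split. The function name `overlap_probability` is made up for illustration.

```python
def overlap_probability(copies: int, test_fraction: float) -> float:
    """Probability that a row with `copies` duplicates lands in both sets,
    treating each copy's assignment as an independent draw (approximation)."""
    train_fraction = 1.0 - test_fraction
    # A row avoids overlap only if all copies go to train or all go to test.
    all_in_train = train_fraction ** copies
    all_in_test = test_fraction ** copies
    return 1.0 - all_in_train - all_in_test


# Assumed values: 4 copies per row, 30% of rows held out for testing.
p = overlap_probability(copies=4, test_fraction=0.3)
print(round(p, 3))
```

With these assumed values the result is close to 3/4; a different split ratio would give a somewhat different number.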
This constitutes a leakage of information from the test set into the training set, which renders the test set incapable of assessing the generalization capability of the trained model. In the case of the penguin toy dataset, this does not matter much: The three species appear to be well-separated in feature space, so that overfitting is not an immediate concern. Still, mixing training and test data is bad practice and should not be taught to ML beginners.
I therefore suggest the removal of the piece of code shown above. Since the model is no longer exposed to multiple copies of each row in one epoch of training, the number of epochs has to be increased to achieve the same test set accuracy. Training for 100 instead of 50 epochs worked well in my tests.
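If duplicating rows is still wanted for some reason, one alternative to removing the code outright would be to split first and oversample only the training set. This is a minimal sketch, not the notebooks' actual code: the small dataframe, its column names, and its values are hypothetical stand-ins for the penguins data, and the 30% test fraction is an assumption.

```python
import pandas as pd

# Hypothetical stand-in for the penguins dataframe (made-up values).
penguins = pd.DataFrame({
    "CulmenLength": [39.1, 46.5, 49.3, 38.6, 45.2, 50.0, 37.8, 47.6, 51.3, 40.3],
    "Species": [0, 1, 2, 0, 1, 2, 0, 1, 2, 0],
})

# 1. Split first, so every original row belongs to exactly one set.
test = penguins.sample(frac=0.3, random_state=0)
train = penguins.drop(test.index)

# 2. Duplicate the training rows only; the test set stays untouched.
train_oversampled = pd.concat([train] * 4)

# Original row indices that appear in both sets (should be none).
shared = set(train_oversampled.index) & set(test.index)
print(len(train_oversampled), len(test), len(shared))
```

Because the split happens before any duplication, no test row can have a copy in the training data.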
This piece of code caused a problem for me as well.
Executing the code results in the following error.
AttributeError: 'DataFrame' object has no attribute 'append'.
I am using an Azure Machine Learning workspace with Python 3.8.
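A likely cause of this error is the installed pandas version: DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, where pd.concat is the replacement. The sketch below shows the equivalent call on a hypothetical three-row stand-in dataframe. Note that it only reproduces the original behaviour, including the duplication before the split that the issue objects to.

```python
import pandas as pd

# Hypothetical stand-in for the penguins dataframe.
penguins = pd.DataFrame({"Species": [0, 1, 2]})

# Equivalent of running `penguins = penguins.append(penguins)` twice:
# each pass doubles the dataframe, giving four copies of every row.
for _ in range(2):
    penguins = pd.concat([penguins, penguins])

print(len(penguins))  # 12
```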