Training starts but then freezes without any error message

When trying to train the following model:


the training starts but then freezes without any error message displayed:

It seems like DLS has run out of memory.

The ResNet50 pre-trained layer is set to:
27

21 GB of RAM should be enough to train ResNet50. Is there a memory allocation issue?

Since it is a shared instance, this may have been caused by other users' training jobs. Please retry and let me know if you continue to see this error.

Yes, I have been able to reproduce it.

The training phase begins like this:


and then freezes with this displayed:

The project is EFE_TL3 on my account, if you want to test it.

I am still experiencing the same problem on the new instance. The training suddenly freezes when the program seems to run out of memory. AFAIU there should be a way to check for this, or at least warn, when the GPU runs out of memory.
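For what it's worth, here is a rough sketch of the kind of check I have in mind. It is not a DLS feature. It assumes an NVIDIA GPU, that `nvidia-smi` is on the PATH, and that you can run Python on the instance. The function names and the 90% threshold are my own, purely for illustration.

```python
import subprocess


def parse_gpu_memory(csv_text):
    """Parse 'used, total' lines (in MiB) into a list of (used, total) tuples."""
    usage = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        usage.append((used, total))
    return usage


def gpus_near_limit(usage, threshold=0.9):
    """Return the indices of GPUs whose used/total ratio is at or above threshold."""
    return [
        index
        for index, (used, total) in enumerate(usage)
        if total > 0 and used / total >= threshold
    ]


def query_gpu_memory():
    """Ask nvidia-smi for per-GPU memory usage (assumes nvidia-smi is installed)."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_memory(result.stdout)
```

Calling `gpus_near_limit(query_gpu_memory())` before or during training would list the GPUs that are close to full, which could be turned into a warning instead of a silent freeze.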

Further to that, even when the training freezes, a result set is still created in the Result tab. That should not happen, because the result set is empty: