Why use convolutions instead of transformers? Since upsampling is already used to obtain the mask matrix, it seems that transformers could be used as well.