Why not use the masked transformers directly in the first two stages?

Why use convolutions instead? Since upsampling is already used to obtain the mask matrix, it seems that masked transformers could be applied in those stages as well.