Replies: 1 comment
I'm sure a pure ViT is perfectly usable, but I found the hybrid version to be more stable during training. Visualising the resnet embeddings seems reasonable, but I don't have code for that I could share. For the ViT, it's the attention mask you want to see. The model will most likely only pay attention to the current character in the sequence. I don't think it's very interesting to do either, but I won't stop you :)
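As a rough illustration of what inspecting the attention mask could look like (not code from this repo), the sketch below hooks the qkv projection of the last attention block in a timm-style ViT and recomputes the softmax attention map. The model name and module paths (`vit_base_patch16_224`, `blocks[-1].attn.qkv`, `num_heads`) are assumptions made so the example is self-contained; the hybrid encoder used here would need its own paths.

```python
# Hedged sketch: recover per-head attention maps from a timm-style ViT by
# hooking the qkv projection and recomputing softmax(QK^T / sqrt(d)).
# The model and module names are assumptions, not this project's encoder.
import torch
import timm
import matplotlib.pyplot as plt

model = timm.create_model("vit_base_patch16_224", pretrained=False).eval()  # set pretrained=True for meaningful maps
captured = {}

def grab_qkv(module, inputs, output):
    captured["qkv"] = output.detach()          # (B, N, 3*dim)

attn = model.blocks[-1].attn
attn.qkv.register_forward_hook(grab_qkv)

img = torch.randn(1, 3, 224, 224)              # replace with a real preprocessed image
with torch.no_grad():
    model(img)

B, N, _ = captured["qkv"].shape
H = attn.num_heads
qkv = captured["qkv"].reshape(B, N, 3, H, -1).permute(2, 0, 3, 1, 4)
q, k = qkv[0], qkv[1]                          # each (B, H, N, head_dim)
attn_map = (q @ k.transpose(-2, -1)) * q.shape[-1] ** -0.5
attn_map = attn_map.softmax(dim=-1)            # (B, H, N, N)

# Attention from the CLS token to every patch, averaged over heads.
cls_attn = attn_map[0, :, 0, 1:].mean(0)
side = int(cls_attn.numel() ** 0.5)
plt.imshow(cls_attn.reshape(side, side).cpu(), cmap="viridis")
plt.title("CLS-token attention, last block (head average)")
plt.colorbar()
plt.show()
```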
Hi Lukas,
I appreciate your work on this project, but I have a question: why is ViT+ResNetV2 better than a pure ViT, and how do the input images maintain spatial information after being processed by ResNetV2?
My idea is to access the encoder and visualize the feature maps of the last ResNetV2 layer, and also the feature maps of the last ViT layer, to get better insight.
Could you share some code to plot the feature maps for a given input image?
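For the CNN part, a minimal sketch along these lines might be enough, assuming a timm ResNetV2 backbone and its `features_only` API; the actual hybrid encoder in this repo would need its own module path instead:

```python
# Hedged sketch: plot the spatial feature maps produced by a ResNetV2 backbone.
# "resnetv2_50" and features_only=True are assumptions for a runnable example;
# swap in the hybrid encoder's CNN stem from this project.
import torch
import timm
import matplotlib.pyplot as plt

backbone = timm.create_model("resnetv2_50", pretrained=False, features_only=True).eval()

img = torch.randn(1, 3, 224, 224)   # replace with a real preprocessed image tensor
with torch.no_grad():
    feats = backbone(img)           # list of feature maps, one per stage

last = feats[-1][0]                 # (C, H, W) from the deepest stage
print("last-stage feature map shape:", tuple(last.shape))

# Show the first 16 channels as heatmaps; each channel keeps a coarse spatial
# grid, which is how the hybrid preserves spatial layout before the ViT.
fig, axes = plt.subplots(4, 4, figsize=(8, 8))
for i, ax in enumerate(axes.flat):
    ax.imshow(last[i].cpu(), cmap="viridis")
    ax.axis("off")
fig.suptitle("ResNetV2 last-stage feature maps (first 16 channels)")
plt.show()
```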