Hello, thanks for the awesome collection of demos and code.
I wonder if you have benchmarks or comparisons of the text-grounded segmentation capabilities of Grounding DINO vs. Florence-2? Testing both with SAM2, my qualitative impression is that Florence-2 is more precise, matching more tokens to boundaries, and that its base model (not yet fine-tuned) can detect a more diverse set of objects.
At the same time, I wasn't able to extract confidence scores for the individual bounding boxes generated by Florence-2.
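For context, Florence-2 does not emit per-box confidence scores at all: its detection output is a generated token sequence in which each coordinate appears as a quantized location token (`<loc_0>` … `<loc_999>`, i.e. 1000 bins across each image dimension), with no score tokens. Below is a minimal, self-contained sketch of parsing such a sequence into pixel boxes; the exact string format here is an assumption modeled on Florence-2's published output scheme, not code from this repo.

```python
import re

def parse_florence2_boxes(generated_text, image_width, image_height):
    """Parse a Florence-2-style token string into (label, box) pairs.

    Each detected phrase is followed by four quantized location tokens,
    e.g. "a cat<loc_100><loc_200><loc_500><loc_800>". Note that the
    sequence carries no confidence scores, which is why none can be
    extracted downstream.
    """
    results = []
    # A phrase (any run of non-'<' characters) followed by exactly four
    # <loc_*> tokens: x1, y1, x2, y2.
    for match in re.finditer(r"([^<]+)((?:<loc_\d+>){4})", generated_text):
        label = match.group(1).strip()
        x1, y1, x2, y2 = (int(c) for c in re.findall(r"<loc_(\d+)>", match.group(2)))
        # De-quantize from the 0-999 bin range to pixel coordinates.
        box = (
            x1 / 1000 * image_width,
            y1 / 1000 * image_height,
            x2 / 1000 * image_width,
            y2 / 1000 * image_height,
        )
        results.append((label, box))
    return results
```

By contrast, Grounding DINO produces per-box logits, so a score threshold can be applied when filtering detections before passing boxes to SAM2.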
Your observations are thorough, and the questions you've raised are valuable.
We haven't benchmarked the two approaches implemented in this repo ourselves, but I believe each model currently has its own strengths.
Grounding DINO 1.5's zero-shot detection capability is stronger than Florence-2's: it achieves 54.3 AP and 55.7 AP zero-shot on LVIS minival, whereas Florence-2 achieves 43.4 AP on the COCO zero-shot benchmark.
But after training on the FLD-5B dataset, Florence-2 can not only localize the main phrases in a caption but also has strong referring capability; you can refer to the following table:
It can also serve as a foundation model that users can fine-tune for their specific scenarios.
rentainhe changed the title "Florence-2 vs GroundingDino + SAM2" to "Florence-2 vs Grounding DINO + SAM2" on Aug 30, 2024