Hello, thanks for the awesome collection of demos and code.
I wonder if you have benchmarks or comparisons of the text-grounded segmentation capabilities of Grounding DINO vs. Florence-2? Testing both with SAM2, my qualitative impression is that Florence-2 is more precise, matching more tokens to boundaries, and that its base model (not yet fine-tuned) can detect a more diverse set of objects.
At the same time, I wasn't able to extract confidence scores for the individual bounding boxes generated by Florence-2.
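For context, Florence-2 does not emit per-box confidence scores at all: its detection output is a generated token sequence in which each coordinate appears as a quantized location token (`<loc_0>` … `<loc_999>`, i.e. 1000 bins across each image dimension), with no score tokens. Below is a minimal, self-contained sketch of parsing such a sequence into pixel boxes; the exact string format here is an assumption modeled on Florence-2's published output scheme, not code from this repo.

```python
import re

def parse_florence2_boxes(generated_text, image_width, image_height):
    """Parse a Florence-2-style token string into (label, box) pairs.

    Each detected phrase is followed by four quantized location tokens,
    e.g. "a cat<loc_100><loc_200><loc_500><loc_800>". Note that the
    sequence carries no confidence scores, which is why none can be
    extracted downstream.
    """
    results = []
    # A phrase (any run of non-'<' characters) followed by exactly four
    # <loc_*> tokens: x1, y1, x2, y2.
    for match in re.finditer(r"([^<]+)((?:<loc_\d+>){4})", generated_text):
        label = match.group(1).strip()
        x1, y1, x2, y2 = (int(c) for c in re.findall(r"<loc_(\d+)>", match.group(2)))
        # De-quantize from the 0-999 bin range to pixel coordinates.
        box = (
            x1 / 1000 * image_width,
            y1 / 1000 * image_height,
            x2 / 1000 * image_width,
            y2 / 1000 * image_height,
        )
        results.append((label, box))
    return results
```

By contrast, Grounding DINO produces per-box logits, so a score threshold can be applied when filtering detections before passing boxes to SAM2.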
Your observations are thorough, and the questions you've raised are valuable.
We haven't benchmarked the two approaches implemented in this repo ourselves, but I believe each model currently has its own strengths.
Grounding DINO 1.5's zero-shot detection capability is stronger than Florence-2's: it achieves 54.3 AP and 55.7 AP zero-shot on LVIS minival, whereas Florence-2 achieves 43.4 AP on the COCO zero-shot benchmark.
But after training on the FLD-5B dataset, Florence-2 can not only localize the main phrases in a caption but also has strong referring capability; you can refer to the following table:
It can also serve as a foundation model that users can fine-tune for their specific scenarios.
rentainhe changed the title "Florence-2 vs GroundingDino + SAM2" to "Florence-2 vs Grounding DINO + SAM2" on Aug 30, 2024