Discuss: Auto-detection and its difficulties #1213
@onury Please provide a reproducible example or a https://jsfiddle.net of this issue. I tried to reproduce it just now and I can't get yaml to flag for those samples.
Can we come up with a pseudo-template response for requests like this that (after some investigation) prove truly impossible to fix? I've looked into a few now and my response is usually something along these lines:

Many of the cases I've seen are similar (small or ambiguous code samples)... things that are really going to be impossible for us to get right 100% of the time... so perhaps we need to find some way to set proper expectations here? Obviously I'm not suggesting we just brush off every report with a form letter... when we have time we should take a look and see if there is a larger problem that's potentially correctable... but in this case it's just that.

Thoughts?
It's possible we could make more liberal use of assertions like "Given this snippet, it SHOULD always be JS, period" or "Given this snippet, it should score 0 on properties." I'm not sure we want to start coding specific failing cases into the tests like that because there will be no end... but maybe for common failures it's a good idea? I think the issue is we have no metrics for what might be a "common" case, so it's all a judgement call?
FYI: The current web of detect tests would seem to be a bit of a deterrent for contributors (not sure if that's intentional or not). I understand we need a way to monitor for regressions, but at some point it also feels a bit contrived... since sometimes we just have to alter the samples to make the detection happy.

It seems it'd be better (but harder) to have a huge library of KNOWN snippets (for every language) and then let the parser loose on ALL of them... and require a certain hit ratio... like our auto-detection must be 90% correct for the tests to pass (accepting that 100% isn't feasible with a wider sampling). Or maybe it's 95 or 85, but you take the point. Not saying that would be easy to do, but it's food for thought.

In fact (and I just realized this) you could even change our existing system to do that - simply by dropping the requirement from 100% to some other number, and having nicer output about failures (for people who want to try and improve the tuning). In some ways this would be more "honest" and would allow us to surface some visibility to these types of issues, like:

Right now all these things are true, but they are hiding just beneath the surface, with no visibility. It would also provide a framework for using `illegal` better: Before:

After:

That's just a rough idea.
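Purely as an illustration of that ratio idea - assuming the existing test/detect/<language>/default.txt sample layout, with 0.9 standing in for whatever threshold would actually be chosen - such a harness could look roughly like this:

```js
// Rough sketch of a ratio-based detect check, not an existing script.
// Assumes the test/detect/<language>/default.txt sample layout; 0.9 is only
// the example threshold mentioned in the comment above.
const fs = require('fs');
const path = require('path');
const hljs = require('highlight.js');

const DETECT_DIR = path.join(__dirname, 'test', 'detect'); // assumed location
const THRESHOLD = 0.9;

const languages = fs.readdirSync(DETECT_DIR);
let hits = 0;
const misses = [];

for (const lang of languages) {
  const sample = fs.readFileSync(path.join(DETECT_DIR, lang, 'default.txt'), 'utf8');
  const detected = hljs.highlightAuto(sample).language;
  if (detected === lang) {
    hits += 1;
  } else {
    misses.push(`${lang} was detected as ${detected}`);
  }
}

const ratio = hits / languages.length;
console.log(`auto-detect hit ratio: ${(ratio * 100).toFixed(1)}%`);
misses.forEach((miss) => console.log(`  miss: ${miss}`));
process.exitCode = ratio >= THRESHOLD ? 0 : 1;
```

With something like this, failures become visibility (a list of misses to chip away at) rather than hard test breakage.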
Then instead of just saying "well, thank goodness all the tests still pass!" you could say things like

Just by looking at the things they WANTED to match before, and now seeing that they are no longer inclined to do so (because of tuning, adding `illegal`, etc). All of this has relevance for the larger "languages are in separate repositories" discussion also... since this type of system would be helpful for someone who wanted to download core + download 100 "plugin" languages and then see metrics on which parsers are "best behaved" (regarding auto-detect, etc). And all the languages could test themselves against every other language. So every time a new language is added, the existing ones actually have an opportunity to get even better, because they have new sample data to analyze and improve their own relevancy.
Autodetection is a nice feature, and still it often fails, especially on small snippets. Just a thought: having a library of snippets, it should probably be possible to automatically tune relevances so that all/most snippets are detected correctly...
Could you elaborate just a bit more? I'm not sure I follow.
It's a problem when it makes contributing to existing grammars a huge headache, because even the tiniest changes upset the delicate balance. :-) And the balance is very delicate. Many languages are just 1 relevance point away from being flagged as the wrong thing in tests, etc. And I'm not sure the tests all being green really has much to do with "how well does auto-detect work" in any case - because we actually have no unbiased way to measure that. It's possible that changes that would improve auto-detect greatly would be rejected now because they break the delicate balance of the tests. Our data set is biased because it either passed the tests from the beginning or else it was initially tweaked to do so. And the way it's done, you can't tweak just one thing at a time, because everything is compared to everything else. Note: I'm not saying it doesn't work, or that it doesn't even work well. I'm just saying I think "all tests green" is a very status quo thing at this point, not a measurement of anything other than test greenness. :-)
I think in some cases you could, but that's another discussion and not super related to this. :)

People who file issues seem to really like the auto-detection. :-) But I agree in principle: if we downplayed how valued it is, except in the circumstances where it's really useful, that'd be good I think.
For example, here's another way of thinking about matching. I haven't quite worked out the terminology or the way to talk about it yet, but this is the early version.
The only grammar in use here is `tex`. You'll see Tex (a CRAZY simple grammar) matches almost 10% of its size with relevance. That's pretty much within the range of "normal". A lot of grammars will match themselves in the 5-15% range quite easily - so by that we know that Tex isn't an outlier, or just crazy about itself (i.e., overly self-relevant, which some other grammars are). If it were ridiculously crazy about itself, that would make the other numbers less relevant, so that matters. And the rest of the list are things `tex` matches far less densely...

This means empirically (regardless of anything else, and assuming the tex sample is reasonable) that tex is a VERY good grammar at detecting itself. We can say that without knowing anything about what the OTHER grammars might think about `tex`.

Now it might be true that another grammar is really good at false-positives on Tex, but that's not really tex's problem (so I don't think it makes sense to think about it at the macro level). It might be the other grammar's problem, or they might just be very similar languages. But regardless, I think that something can be said for the above...

So now perhaps you're saying "well, tex is an easy one, it's very distinctive"... pick something else... and I will... but I think that is exactly the point. If we come to the (general) conclusion that "these 10 grammars are GREAT at auto-detect" and those other 170 really are so loosey-goosey as to be almost meaningless, then I think we need to change our tests to take that into account - and focus on the VALUE - not just the meaningless raw relevancy numbers. So what might we want to test for?
Those types of metrics could be coded into the languages themselves... Of course, all this is most useful as a before/after metric... making changes and watching these numbers go up or down, etc.
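As a sketch of how that "density" number could be computed (illustrative only; it assumes the test/detect/<language>/default.txt samples, and uses highlightAuto with a one-language subset as the way to get a single grammar's relevance score):

```js
// Sketch of the "density" metric described above: how much relevance a
// grammar assigns per byte of a sample. 'tex' is used because it is the
// example in the discussion; substitute any registered name or alias.
const fs = require('fs');
const path = require('path');
const hljs = require('highlight.js');

// Relevance a single grammar assigns to a sample, per byte.
function density(lang, sample) {
  const { relevance } = hljs.highlightAuto(sample, [lang]);
  return relevance / sample.length;
}

function report(lang) {
  const sample = fs.readFileSync(
    path.join('test', 'detect', lang, 'default.txt'), 'utf8');
  console.log(`${lang} self-density: ${(density(lang, sample) * 100).toFixed(1)}%`);
  // How dense does every OTHER grammar find this same sample?
  hljs.listLanguages()
    .filter((other) => other !== lang)
    .map((other) => ({ other, d: density(other, sample) }))
    .sort((a, b) => b.d - a.d)
    .slice(0, 10)
    .forEach(({ other, d }) => console.log(`  ${other}: ${(d * 100).toFixed(1)}%`));
}

report('tex');
```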
As a point of comparison:
Now smalltalk is on the low end of the density scale at 3% (which doesn't help it), so perhaps it could be improved to identify itself better... I've included all languages that are 25% less dense than it or greater. Remember, with Tex I went all the way down to 90% less dense and only found 7. Here we go down just a little and find 34 languages that smalltalk thinks are quite like smalltalk. And several of them look more like Smalltalk than Smalltalk itself does. Gone is the order-of-magnitude difference we saw with Tex. So we can conclude that Smalltalk is pretty bad at identifying itself vs other languages. Now we just need to determine: is that because it's genuinely too similar to other languages, or because its own relevance isn't well tuned?
And there you have two discrete analyses of TWO grammars, only looking at language sample files... i.e., entirely isolated from OTHER grammars. I.e., we're judging the grammars on their OWN merits, nothing else. I find this a very refreshing way of looking at the problem. And one that's super helpful for contributors writing 3rd party grammars.
So what implications might this have more broadly? Well, let's say we have an unknown snippet. Smalltalk scores it x relevance, Tex scores it y. We can't just compare x and y directly (which is what we do now). We need to take density into account. Smalltalk doesn't think of itself as highly as Tex does, so Tex naturally might beat it on score... by taking that into account, you eliminate the advantage some languages have of being more "self-important" than others. But after we take that into account... let's say they have reasonably equal density... i.e., it's a tex file that has enough smalltalk-ness (density-wise) to perhaps pass for smalltalk (in Smalltalk's opinion)... Then we look at the fact that Tex is (say) 10x more likely to correctly identify itself than Smalltalk is... making it far more likely that Tex is correct than Smalltalk. Tex wins.
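A tiny sketch of that density-normalized tie-break (the selfDensity values are invented placeholders; in practice they would come from measuring each grammar against its own sample, as in the earlier sketch - this is only meant to illustrate the idea, not a worked-out scoring model):

```js
// Hypothetical density-normalized comparison for an unknown snippet.
const hljs = require('highlight.js');

const selfDensity = { tex: 0.10, smalltalk: 0.03 }; // illustrative placeholders

function normalizedScore(lang, snippet) {
  const raw = hljs.highlightAuto(snippet, [lang]).relevance;
  // Divide by how "self-important" the grammar is, so a grammar that hands
  // out relevance freely doesn't win on raw score alone.
  const expected = (selfDensity[lang] || 0.05) * snippet.length;
  return raw / expected;
}

function pick(snippet, candidates) {
  return candidates
    .map((lang) => ({ lang, score: normalizedScore(lang, snippet) }))
    .sort((a, b) => b.score - a.score)[0];
}

// e.g. pick(someSnippet, ['tex', 'smalltalk'])
```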
First actual work based on this research: The good thing is that because we're TIGHTENING the relevancy, scores don't go up much and the likeliness to match other languages goes WAY down. So in addition to being independently better, usually the existing "detect" tests will all just pass without any complaint.
It's probably possible to build a neural network or genetic algorithm to calculate relevance coefficients automatically (: |
Interesting research. I think a weak point in it is that you're relying heavily on our autodetection tests, which are just random code-like fragments. I think you need to collect a library of real-life snippets to get figures you can trust.
We will assign that task to you. :-)
Oh, for sure. More data samples would be great. I'm trying to flesh out the reasoning and a way of thinking about this that doesn't get entangled in every grammar having to be expressly tuned against every other grammar, which is a losing battle IMHO - and shown by the fragility of the existing testing framework.

You mentioned needing more tests and better samples for access log - I was actually a little afraid to add a better sample, because there is a high likelihood that a larger sample will cause one of the other grammars to now see something it doesn't like (meaning something it actually really DOES like, has relevancy for, etc.), and now the tests are broken and I'm going down a rabbit hole trying to fix the balance again. :-)

That's what I'm trying to get us to a place to avoid... where someone could add better samples, or a better grammar for any language... and it could be accepted based on those merits alone, not on whether some other language suddenly decides the new samples look "too much like X". A good start would actually be adding longer, more fully representative samples for each language... but how to go about that without breaking the existing tests... Should we just copy all the detect samples into a new folder?
@onury Sorry to co-opt your issue, but all of this is relevant to the core issue you reported. :-)

For me, something like a (non-mandatory) samples folder.

What do you mean by non-mandatory? Not sure I followed there.

I just mean we mustn't rely on this folder existing for all languages.

Why not? We'd start by populating it with the detect samples... so there would be a sample for every language already. :-) And we're not accepting more languages into core presently, so it seems like a solved problem? :)

It seems so.
Is it possible to run a formal benchmark between highlight.js and other programming language detection tools? That way we might be able to see how accurate it is.

Are you going to provide the benchmark? :-) It'd be great if there were such a thing just lying around that one could use. Indeed, I mention above the idea of developing such a framework just for our use, but if someone else has already done it or wants to do it, that's great. :-)

@yyyc514 we can start with https://github.com/andreasjansson/language-detection.el#model-performance, but we can surely come up with even better benchmarks later on once we can source it.

OK, where are the samples in language-detection.el from? They are divided into Linguist, StackOverflow and Rosetta folders... but Linguist is just a project and StackOverflow is a website... so I'm not 100% positive where they are getting these samples from. Are they aggregating them from other places? Are you involved with any of these projects?

@yyyc514 it is more of a heavy observation than an involvement, as I would like to see the software used in such systems be tested through benchmarks. This is the dataset: https://github.com/andreasjansson/language-detection.el/tree/master/test/data

Yeah, I saw that. I just wondered if they were pulling those 3 folders in from other sources.

FYI, we already do tune the auto-detection against our own sample data, but that's a very small and biased sample set, which is why we'd benefit from a larger set like this. Thanks for bringing this to our attention. :-)
I think the README is wrong... once you fix a bug or two in the analysis (with regard to how we handle php and html/xml) and filter out items < 100 bytes (which we know we will do poorly on), we do reasonably well IMHO:

I would start bragging about how great that is, but if they messed up the stats on us so badly, perhaps they also messed up the stats on the others. :-/
If you have time to work on this, feel free. There are probably some low-hanging-fruit wins to be had. I found two just by glancing at the list of things we failed to auto-detect from language-detection.el. So finding "obvious" things and fixing grammars is only going to help. Really we need a formalized benchmark. If I have time I might whip up something a bit nicer based on the language-detection.el stuff, since that gives us 3 different data sets to work from. So you could make an improvement and then test and see (across all 3) if you were moving in the right direction or not. So what would be worth doing for sure:
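If someone does whip up such a harness, a rough outline over the language-detection.el data might look like this (the folder names, the extension-to-grammar map and the 100-byte cutoff are assumptions to be adjusted to the real dataset layout):

```js
// Outline of a benchmark over the language-detection.el test data
// (https://github.com/andreasjansson/language-detection.el/tree/master/test/data).
// The folder names and the extension-to-language map are illustrative only.
const fs = require('fs');
const path = require('path');
const hljs = require('highlight.js');

const EXT_TO_LANG = { js: 'javascript', py: 'python', rb: 'ruby', yml: 'yaml' };
const MIN_BYTES = 100; // tiny fragments are known to detect poorly

function benchmark(dir) {
  let total = 0;
  let correct = 0;
  for (const file of fs.readdirSync(dir)) {
    const expected = EXT_TO_LANG[path.extname(file).slice(1)];
    if (!expected) continue;
    const code = fs.readFileSync(path.join(dir, file), 'utf8');
    if (code.length < MIN_BYTES) continue;
    total += 1;
    if (hljs.highlightAuto(code).language === expected) correct += 1;
  }
  return { total, correct, accuracy: total ? correct / total : 0 };
}

// Run over each dataset folder (assumed names) and compare:
for (const set of ['linguist', 'stackoverflow', 'rosetta']) {
  console.log(set, benchmark(path.join('data', set)));
}
```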
Closing and referencing #2762.
Puppet configuration management supports two kinds of templating system to generate file resources based on Puppet variables:
- Embedded Ruby (erb), which is already supported by highlightjs
- Embedded Puppet (epp), which this pull request adds support for

epp is similar, with some differences:
* variables are decorated with `$`
* curly braces are used instead of `do` / `end` blocks
* arguments of blocks are in a slightly different position
* interpolated strings use `${...}` instead of `#{...}`
* etc.

It is similar to erb, but the backing language is Puppet rather than plain Ruby.

[detect] The detect test is erroneously recognized as `mel`, and some attempts to tweak the input test gave me `awk` or `csharp`. Based on highlightjs#1213, claim auto-detection to be too broad and disable it for `epp`.

[markup] Epp can represent any other language; configure it with `subLanguage: []` and provide a markup test based on XML (which is properly detected by the auto-detector). Additional tests cover epp comments and containment of Puppet.

More or less related: the `erb` language has `subLanguage: xml` when it should be `[]`; I have left a TODO note for later.

Co-authored-by: Wikimedia Foundation Inc.
Co-authored-by: Ben Ford <[email protected]>
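For context on the two mechanisms the description leans on - `subLanguage` delegation and `disableAutodetect` - a minimal illustrative grammar (not the PR's actual epp definition) might look roughly like this, assuming the puppet grammar is loaded:

```js
// Minimal illustrative grammar, NOT the actual epp definition from the PR.
// It only combines the two options mentioned above: subLanguage delegation
// and opting out of auto-detection.
const hljs = require('highlight.js');

hljs.registerLanguage('epp-demo', (hljs) => ({
  name: 'EPP (demo)',
  subLanguage: [],          // [] = auto-detect the surrounding content
  disableAutodetect: true,  // keep this grammar out of highlightAuto()
  contains: [
    hljs.COMMENT('<%#', '%>'),
    {
      begin: '<%[-=]?',
      end: '-?%>',
      subLanguage: 'puppet', // the embedded language is Puppet, not Ruby
      excludeBegin: true,
      excludeEnd: true
    }
  ]
}));
```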
If js and yaml are both included, js code is generally detected as yaml.
Examples below are all detected as yaml.
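As a rough way to reproduce and dig into this kind of misdetection (the snippet below is a hypothetical stand-in, not one of the original examples), you can score a snippet against each grammar in isolation and then compare that with what auto-detection over both decides:

```js
// Hypothetical JS snippet standing in for the reported examples.
const hljs = require('highlight.js');

const snippet = `const config = {
  name: "demo",
  items: [1, 2, 3]
};
`;

// Relevance each grammar assigns on its own:
for (const lang of ['javascript', 'yaml']) {
  console.log(lang, hljs.highlightAuto(snippet, [lang]).relevance);
}

// What auto-detection picks when both are in play:
console.log('auto:', hljs.highlightAuto(snippet, ['javascript', 'yaml']).language);
```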