
Discuss: Auto-detection discussion and it's difficulties #1213

Closed
onury opened this issue Jun 13, 2016 · 35 comments
Labels
auto-detect (Issue with auto detection of language type) · help welcome (Could use help from community)

Comments

@onury

onury commented Jun 13, 2016

If both js and yaml are included, js code is generally detected as yaml.
The examples below are all detected as yaml.

// Calculate distance between two points
var result = calcDistance({
    from: {
        x: 12,
        y: -5
    },
    to: {
        x: 4,
        y: 2
    }
});
// log(result);
var options = {
    opt1: 1,
    opt2: 2,
    opt3: 3
};
mylib.someFunc(options, function (err, result) {
    console.log(err || result);
});
// result:
[
  {
    from: "some name",
    to: "some other name",
    details: {
      prop1: 123,
      text: "some text"
    },
    prop: undefined,
    timestamp: 1456795956380
  }
]
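
For reference, a minimal way to exercise auto-detection with only those two grammars registered (a sketch using the Node "core" build; not necessarily the reporter's original setup):

// Sketch: register only javascript and yaml, then let highlightAuto guess.
// Using the core entry point keeps every other grammar out of the contest.
const hljs = require('highlight.js/lib/core');
hljs.registerLanguage('javascript', require('highlight.js/lib/languages/javascript'));
hljs.registerLanguage('yaml', require('highlight.js/lib/languages/yaml'));

const snippet = `var options = {
    opt1: 1,
    opt2: 2,
    opt3: 3
};`;

console.log(hljs.highlightAuto(snippet).language); // reportedly "yaml" rather than "javascript"
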
@joshgoebel
Member

Is { legal in YAML outside of strings?

@joshgoebel
Member

@onury Please provide a reproducible example or https://jsfiddle.net of this issue. I tried to reproduce it just now and I can't get yaml to flag for those samples. dts does seem eager to grab them, but that would be a slightly different issue than the one you're describing here. :)

@joshgoebel
Member

joshgoebel commented Oct 6, 2019

@egor-rogov @marcoscaceres

Can we come up with a pseudo-template response for requests like this that (after some investigation) prove truly impossible to fix? I've looked into a few now and my response is usually something along these lines:

Highlight.js does not contain a FULL syntax/grammar parser for every language. It only makes a "best-guess" effort to determine the type of language. If the sample is large (or distinctive) it can often do a pretty great job for many languages. However, if the sample is small and bears a resemblance to a LOT of different languages, then this can become an almost impossible problem.

There are likely some language combos where we could carefully craft samples so that EVEN A HUMAN familiar with both couldn't tell one language from the other. So given that, it's going to be impossible for us to get it right some of the time.

If you have any specific suggestions for improving the detection and preventing false positives we're all ears and would love PRs (that don't break any of the tests; they are a very interconnected web).

Many of the cases I've seen are similar (small or ambiguous code samples)... things that are really going to be impossible for us to get right 100% of the time... so perhaps we need to find some way to set proper expectations here?


Obviously I'm not suggesting we just dismiss every report with a form letter... when we have time we should take a look and see if there is a larger problem that's potentially correctable... but in this case it's just that key: value is so common a concept.

  • If I lower the relevance of dts to 0 then properties starts claiming them
  • If I lower properties then yaml starts grabbing it again

Thoughts?
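
For anyone unfamiliar with those knobs: relevance is just a per-mode number inside each grammar definition, so "lowering dts" means editing its modes. A minimal sketch of the idea (the mode shown is illustrative, not the actual dts grammar):

// Illustrative grammar excerpt showing where relevance lives.
// A "key:" style attribute pattern is exactly what yaml, properties and dts
// all share, so claiming relevance: 0 for it stops that pattern from
// winning auto-detection on its own.
export default function(hljs) {
  return {
    name: 'example-grammar',
    contains: [
      {
        className: 'attr',
        begin: /[A-Za-z_][\w-]*(?=\s*:)/,
        relevance: 0
      }
    ]
  };
}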

@joshgoebel
Member

joshgoebel commented Oct 6, 2019

It's possible a more liberal use of illegal could help with some of this, but I don't think we currently have a great test framework in place for "anti" cases like this:

"Given this snippet, it SHOULD always be JS, period".

OR

"Given this snippet it should score 0 on properties."
"Given this snippet it should score 0 on dts."

I'm not sure we want to start coding specific failing cases into the tests like that because there will be no end... but maybe for common failures it's a good idea?

I think the issue is we have no metrics for what might be a "common" case, so it's all a judgement call?
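
A sketch of what such an "anti" test could look like, using the public highlightAuto API (mocha and Node's assert are assumed; the snippet is only illustrative):

// "Anti" detection cases: assert what a snippet must NOT be detected as.
const hljs = require('highlight.js');
const assert = require('assert');

describe('auto-detection anti-cases', () => {
  const snippet = 'var options = {\n  opt1: 1,\n  opt2: 2\n};';

  it('should always resolve to javascript', () => {
    const result = hljs.highlightAuto(snippet);
    assert.strictEqual(result.language, 'javascript');
  });

  it('should score 0 on properties', () => {
    // restricting the language subset lets us check a single grammar's claim
    const result = hljs.highlightAuto(snippet, ['properties']);
    assert.strictEqual(result.relevance, 0);
  });
});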

@joshgoebel
Member

joshgoebel commented Oct 6, 2019

FYI: The current web of detect tests would seem to be a bit of a deterrent for contributors (not sure if that's intentional or not). I understand we need a way to monitor for regressions, but at some point it also feels a bit contrived... since sometimes we just have to alter the samples to make the detection happy.

It seems it'd be better (but harder) to have a huge library of KNOWN snippets (for every language) and then let the parser loose on ALL of them... and require a certain hit ratio... like our auto-detection must be 90% for tests to pass (accepting that 100% isn't feasible with a wider sampling). Or maybe it's 95 or 85, but you take the point.

Not saying that would be easy to do, but it's food for thought. In fact (and I just realized this) you could even change our existing system to do that - simply by dropping the requirement from 100% to some other number. And having nicer output about failures (for people who want to try and improve the tuning).

In some ways this would be more "honest" and would allow us to surface some visibility to these types of issues, like:

  • dts, javascript, and properties are fighting over the same files
  • blah-syntax will match ALMOST anything if you give it a chance

Right now all these things are true, but they are hiding just beneath the surface, with no visibility.

It would also provide a framework for using illegal better:

Before:

- javascript.sample.txt
- 15 languages have claimed some relevance
- 3 languages have a STRONG claim (dts, javascript, properties)
- - JS: 23
- - DTS: 34
- - properties: 20

After:

- javascript.sample.txt
- 12 languages have claimed some relevance
- 2 languages rejected as illegal
- 1 language has a STRONG claim (javascript) [GOOD]
- - JS: 23

That's just a rough idea.
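
The hit-ratio idea above could be prototyped with something like this (the samples layout mirrors the existing test/detect folder; the 90% threshold is just the example number from above):

// Sketch: run auto-detection over a library of known snippets and require a
// minimum hit ratio instead of a perfect score.
const fs = require('fs');
const path = require('path');
const hljs = require('highlight.js');

const root = path.join(__dirname, 'test', 'detect'); // assumes only language folders live here
let total = 0;
let correct = 0;

for (const language of fs.readdirSync(root)) {
  const dir = path.join(root, language);
  for (const file of fs.readdirSync(dir)) {
    const code = fs.readFileSync(path.join(dir, file), 'utf8');
    total++;
    if (hljs.highlightAuto(code).language === language) correct++;
  }
}

const ratio = correct / total;
console.log(`auto-detect hit ratio: ${(ratio * 100).toFixed(1)}%`);
process.exitCode = ratio >= 0.9 ? 0 : 1; // tests pass at >= 90%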

@joshgoebel
Member

joshgoebel commented Oct 6, 2019

Then instead of just saying "well, thank goodness all the tests still pass!"

You could say things like

  • dts and properties auto-detection improved 15%

Just by looking at the things they WANTED to match before, and now seeing that they are no longer inclined to do so (because of tuning, adding illegal, etc).

All of this has relevance for the larger "languages in separate repositories" discussion also... since this type of system would be helpful for someone who wanted to download core + download 100 "plugin" languages and then see metrics on which parsers are "best behaved" (regarding auto-detect, etc).

And all the languages could test themselves against every other language. So every time a new language is added the existing ones actually have an opportunity to get even better because they have new sample data to analyze and improve their own relevancy.

@egor-rogov
Collaborator

Autodetection is a nice feature, and yet it often fails, especially on small snippets.
But I wonder if this is a real problem? When you want to highlight some snippet, you usually know your language and it should be no problem to specify it. There is one case, though, when you can't do it explicitly: there are languages that contain other languages. An example is an SQL-like language for PostgreSQL (pgsql), in which you can write a function header, while the function body is written in some other language. And in this case we usually have a limited number of sub-languages.
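
When the language is known it can simply be passed explicitly and no guessing happens at all; a minimal sketch using the current object-form highlight() API:

// Explicit language vs. auto-detection: only the first call is guaranteed to
// use the pgsql grammar, the second is a best-effort guess.
const hljs = require('highlight.js');

const snippet = 'SELECT count(*) FROM users;';
const explicit = hljs.highlight(snippet, { language: 'pgsql' });
const guessed = hljs.highlightAuto(snippet);
console.log(explicit.language, guessed.language);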

Just a thought: having a library of snippets, it should probably be possible to automatically tune relevances so that all/most snippets are detected correctly...

@joshgoebel joshgoebel added the auto-detect (Issue with auto detection of language type) label Oct 7, 2019
@joshgoebel
Member

joshgoebel commented Oct 7, 2019

Having a library of snippets, it should probably be possible to automatically tune relevances so that all/most snippets are detected correctly...

Could you elaborate just a bit more? I'm not sure I follow.

But I wonder if this is a real problem?

It's a problem when it makes contributing to existing grammars a huge headache because even the tiniest changes upset the delicate balance. :-) And the balance is very delicate. Many languages are just 1 relevance point away from being flagged as the wrong thing in tests, etc. And I'm not sure the tests all being green really has much to do with "how well does auto-detect work" in any case - because we actually have no unbiased way to measure that.

It's possible that changes which would greatly improve auto-detect would be rejected now because they break the delicate balance of the tests. Our data set is biased because it's either passed the tests from the beginning or else it was initially tweaked to do so. And the way it's done you can't tweak just one thing at a time, because everything is compared to everything else.

Note: I'm not saying it doesn't work, or that it doesn't even work well. I'm just saying I think "all tests green" is a very status quo thing at this point, not a measurement of anything other than test greenness. :-)

@joshgoebel
Member

There is one case, though, when you can't do it explicitly: there are languages that contain other languages.

I think in some cases you could, but that's another discussion and not super related to this. :)

@joshgoebel
Member

When you want to highlight some snippet, you usually know your language and it should be no problem to specify it.

People who file issues seem to really like the auto-detection. :-) But I agree in principle: if we de-emphasized how much we value it, except in the circumstances where it's really useful, that'd be good I think.

@joshgoebel
Member

joshgoebel commented Oct 7, 2019

For example, here is another way of thinking about matching. I haven't quite worked out the terminology or the way to talk about it yet, but this is the early version.

tex 37 0.09
  3 cpp 0.01
  7 awk 0.01
  21 crmsh 0.01
  4 crystal 0.01
  15 dsconfig 0.01
  9 jboss-cli 0.01
  4 properties 0.02

The only grammar in use here is tex. It's being asked to analyze itself and report a score (based on density of relevance per size of file)... then it's asked to do the same to ALL the other detect cases (since that's all the samples we have laying around).

You'll see Tex (a CRAZY simple grammar) matches almost 10% of its size with relevance. That's well within the range of "normal". A lot of grammars will match themselves in the 5-15% range quite easily - so by that we know that Tex isn't an outlier, or just crazy about itself (ie, overly self-relevant, which some other grammars are). If it were ridiculously crazy about itself, that would make the other numbers less relevant, so that matters. But tex seems to have a good head on its shoulders.

And the rest of the list are things tex MIGHT misidentify, going all the way down to one whole magnitude of difference in density. The key here is that there are only 7 possibilities (out of 185). And the difference between tex identifying itself and "not tex" is about an order of magnitude.

This means empirically (regardless of anything else, and assuming the tex sample is reasonable) that tex is a VERY good grammar at detecting itself. We can say that without knowing anything about what the OTHER grammars might think about tex files.


Now it might be true that another grammar is really prone to false positives on Tex, but that's not really tex's problem (so I don't think it makes sense to think about it at the macro level). It might be the other grammar's problem, or they might just be very similar languages. But regardless, I think that something can be said for the above... that tex is very "picky"/"sticky" (or some such concept) about its own files. And that's a useful metric to have.


So now perhaps you're saying "well, tex is an easy one, it's very distinctive"... pick something else... and I will... but I think that is exactly the point. If we come to the (general) conclusion that "these 10 grammars are GREAT at auto-detect" and the other 170 really are so loosey-goosey as to be almost meaningless, then I think we need to change our tests to take that into account and focus on the VALUE, not just the meaningless raw relevancy numbers.

So what might we want to test for?

  • Tex can identify itself in a given density range (a reasonable range)
  • Tex does not falsely identify other languages (by a given margin)

Those types of metrics could be coded into the languages themselves... then as long as tex always performs the same, tests pass. If it suddenly performs much worse, tests fail. If it suddenly performs much better, tests also fail, and someone decides whether we've hit a new norm or whether there is some type of "too self-important" glitch happening.

Of course all this is most useful as a before/after metric... making changes and watching these numbers go up or down, etc.
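
The density measurement described above can be approximated against today's API roughly like this (the exact divisor behind the numbers shown isn't spelled out, so treat this as a sketch of the idea rather than the script that produced them):

// Sketch: "relevance density" = relevance claimed by a fixed grammar,
// divided by the size of the sample it was run against.
const hljs = require('highlight.js');

function relevanceDensity(languageName, code) {
  const result = hljs.highlight(code, { language: languageName, ignoreIllegals: true });
  return result.relevance / code.length;
}

// Self-density: how strongly does tex claim its own sample?
// Cross-density: how strongly does tex claim, say, a cpp sample?
// A large gap between the two is what makes a grammar "picky" about its own files.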

@joshgoebel
Member

joshgoebel commented Oct 7, 2019

As a point of comparison:

smalltalk 28 0.03
  6 cpp 0.02
  15 autoit 0.04
  10 awk 0.02
  6 bash 0.02
  12 bnf 0.02
  15 cos 0.03
  5 csp 0.02
  43 ebnf 0.08
  2 excel 0.04
  14 gherkin 0.02
  4 golo 0.02
  8 hsp 0.03
  11 http 0.04
  5 ini 0.03
  16 json 0.07
  4 makefile 0.02
  13 mel 0.03
  13 mipsasm 0.03
  45 perl 0.05
  28 mojolicious 0.04
  13 moonscript 0.02
  12 n1ql 0.03
  11 nix 0.03
  15 powershell 0.04
  20 puppet 0.02
  24 purebasic 0.03
  24 rib 0.04
  27 ruleslanguage 0.03
  13 sql 0.04
  31 step21 0.03
  11 stylus 0.02
  23 taggerscript 0.03
  23 tcl 0.04
  36 xquery 0.04

Now smalltalk is on the low end of the density scale at 3% (which doesn't help it), so perhaps it could be improved to identify itself better...

I've included all languages that are 25% less dense than it or greater. Remember with Tex I went all the way down to 90% less dense, and only found 7. Here we go down just a little and find 34 languages that smalltalk thinks are quite like smalltalk. And several of them look more like Smalltalk than Smalltalk itself does. Gone is the order of magnitude difference we saw with Tex.

So we can conclude that Smalltalk is pretty bad at identifying itself vs other languages. Now we just need to determine: is that because it's too similar to other languages, or because its own relevance isn't well tuned?

@joshgoebel
Member

joshgoebel commented Oct 7, 2019

And there you have two discrete analyses of TWO grammars, looking only at language sample files... ie, entirely isolated from OTHER grammars. Ie, we're judging the grammars on their OWN merits, nothing else. I find this a very refreshing way of looking at the problem, and one that's super helpful for contributors writing 3rd party grammars.

@joshgoebel
Member

joshgoebel commented Oct 7, 2019

So what implications might this have more broadly? Well let's say we have an unknown snippet.

Smalltalk scores it x relevance.
Tex scores it y relevance.

We can't just compare x and y directly (which is what we do now). We need to take density into account. Smalltalk doesn't think of itself as highly as Tex, so Tex naturally might beat it at score... by taking that into account you eliminate the advantage some languages have of being more "self-important" than others... but after we take that into account... let's say they have a reasonably equal density... ie, it's a tex file that has enough smalltalk-ness (density-wise) to perhaps pass for smalltalk (in smalltalk's opinion)...

Then we look at the fact that Tex is (say) 10x more likely to correctly identify itself than Smalltalk... making it far more likely it would be correct than Smalltalk.

Tex wins.
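
One way to express that comparison (the weighting is illustrative, not a finished formula):

// Sketch: normalize each grammar's raw relevance by its typical self-density
// (measured ahead of time on its own samples), then weight by how reliably
// the grammar identifies itself (the ~10x gap tex showed above).
function adjustedScore(rawRelevance, snippetLength, selfDensity, selfAccuracy) {
  const density = rawRelevance / snippetLength; // relevance per character
  return (density / selfDensity) * selfAccuracy; // 1.0 density ratio ~ "as dense as its own samples"
}

// Compare adjusted scores instead of raw relevance:
//   adjustedScore(texRelevance, code.length, texSelfDensity, texSelfAccuracy)
//   adjustedScore(smalltalkRelevance, code.length, smalltalkSelfDensity, smalltalkSelfAccuracy)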

@joshgoebel
Member

First actual work based on this research:

#2172

The good thing is that because we're TIGHTENING the relevancy, scores don't go up much and the likelihood of matching other languages goes WAY down. So in addition to being independently better, usually the existing "detect" tests will all just pass without any complaint.

@egor-rogov
Collaborator

Having a library of snippets, it should probably be possible to automatically tune relevances so that all/most snippets are detected correctly...

Could you elaborate just a bit more? I'm not sure I follow.

It's probably possible to build a neural network or genetic algorithm to calculate relevance coefficients automatically (:

@egor-rogov
Collaborator

Interesting research. I think a weak point in it is that you're relying heavily on our autodetection tests, which are just random code-like fragments. I think you need to collect a library of real-life snippets, to get figures you can trust.

@joshgoebel
Member

joshgoebel commented Oct 9, 2019

It's probably possible to build a neural network or genetic algorithm to calculate relevance coefficients automatically (:

We will assign that task to you. :-)

I think a weak point in it is that you're relying heavily on our autodetection tests, which are just random code-like fragments.

Oh, for sure. More data samples would be great. I'm trying to flesh out the reasoning and a way of thinking about this that doesn't get entangled in every grammar having to be expressly tuned against every other grammar, which is a losing battle IMHO - as shown by the fragility of the existing testing framework.

You mentioned needing more tests and better samples for access log - I was actually a little afraid to add a better sample because there is a high likelihood that a larger sample will cause one of the other grammars to see something it doesn't like (meaning something it actually really DOES like, ie, has relevancy, etc) and now the tests are broken and I'm going down a rabbit hole trying to fix the balance again. :-)

That's what I'm trying to get us in a place to avoid... where someone could add better samples, or a better grammar for any language... and it could be accepted based on those merits alone, not on whether some other language suddenly decides the new samples look "too much like X".


A good start would actually be adding longer more fully representative samples for each language... but how to go about that without breaking the existing tests...

Should we just copy all the autodetect files into a new samples folder and then go from there, building up better samples? Then a new minimal test could be added such that:

  • Ruby must always successfully parse all Ruby samples (ie, not bailing because of an illegal error, etc) - see the sketch below
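
A sketch of that minimal test (a samples/<language>/ layout and mocha/assert are assumed; `illegal` is the flag the highlight result reports when a grammar bails):

// Sketch: every Ruby sample must highlight with the ruby grammar without
// tripping any of its illegal rules.
const fs = require('fs');
const path = require('path');
const assert = require('assert');
const hljs = require('highlight.js');

describe('samples parse cleanly', () => {
  const dir = path.join(__dirname, 'samples', 'ruby'); // hypothetical layout
  for (const file of fs.readdirSync(dir)) {
    it(`ruby parses ${file} without illegal matches`, () => {
      const code = fs.readFileSync(path.join(dir, file), 'utf8');
      const result = hljs.highlight(code, { language: 'ruby' });
      assert.strictEqual(result.illegal, false);
    });
  }
});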

@joshgoebel joshgoebel added the other and big picture (Policy or high level discussion) labels Oct 9, 2019
@joshgoebel joshgoebel changed the title from "Auto-detection is generally incorrect if yaml included" to "BIG picture: Auto-detection discussion" Oct 9, 2019
@joshgoebel
Member

joshgoebel commented Oct 9, 2019

@onury Sorry to co-opt your issue, but all of this is relevant to the core issue you reported. :-)

@egor-rogov
Collaborator

For me something like a (non-mandatory) samples folder is fine. You can first use it solely for research purposes, and after a while we can switch tests to make use of it, too.

@joshgoebel
Member

What do you mean by non-mandatory? Not sure I followed there.

@egor-rogov
Collaborator

I just mean we mustn't rely on this folder to exist for all languages.

@joshgoebel
Member

I just mean we mustn't rely on this folder to exist for all languages.

Why not? We'd start by populating it with the detect samples... so there would be a sample for every language already. :-)

And we're not accepting more languages into core presently, so seems like a solved problem? :)

@egor-rogov
Collaborator

It seems so.
(I don't quite see how external repos fit into the overall picture, but probably it's not that important at this stage.)

@joshgoebel joshgoebel changed the title from "BIG picture: Auto-detection discussion" to "Discuss: Auto-detection discussion and it's difficulties" Oct 22, 2019
@DonaldTsang

Is it possible to run a formal benchmark comparing highlight.js with other programming language detection tools? That way we might be able to see how accurate it is.

@joshgoebel
Member

joshgoebel commented Nov 23, 2019

Are you going to provide the benchmark? :-) It'd be great if there were such a thing just lying around that one could use. Indeed, I mention above the idea of developing such a framework just for our use, but if someone else has already done it or wants to do it, that's great. :-)

@DonaldTsang

DonaldTsang commented Nov 23, 2019

@yyyc514 We can start with https://github.com/andreasjansson/language-detection.el#model-performance, but we can surely come up with even better benchmarks later on once we can source them.

@joshgoebel
Member

Ok, where are the samples in language-detection from? They are divided into Linguist, StackOverflow, and rosetta folders... but Linguist is just a project and Stack Overflow is a website... so I'm not 100% positive where they are getting these samples from.

Are they aggregating them from other places? Are you involved with any of these projects?

@DonaldTsang

@yyyc514 it is more of a heavy observation than an involvement, as I would like to see the software used in such systems be tested through benchmarks. This is the dataset https://github.com/andreasjansson/language-detection.el/tree/master/test/data

@joshgoebel
Member

Yeah I saw that, I just wondered if they were pulling those 3 folders in from other sources.

@joshgoebel
Member

joshgoebel commented Nov 23, 2019

I would like to see the software used in such systems be tested through benchmarks.

FYI, we already do tune the auto-detection against our own sample data, but that's a very small and biased sample set, which is why we'd benefit from a larger set like this.

Thanks for bringing this to our attention. :-)

@joshgoebel
Member

joshgoebel commented Nov 24, 2019

I think the README is wrong... once you fix a bug or two in the analysis (with regards to how we handle php and html/xml) and filter out items < 100 bytes (which we know we will do poorly on) we do reasonably well IMHO:

  • 73% correct on the first guess, 83% correct if you count the 2nd guess as matching
  • 68%/79% if you force us to look at ALL files.

I would start bragging about how great that is but if they messed up the stats on us so badly perhaps they also messed up the stats on the others. :-/
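
For anyone wanting to reproduce numbers like that, the tallying can be done along these lines (the dataset layout here is hypothetical; `secondBest` is the property current releases expose, older releases call it `second_best`):

// Sketch: first-guess and "top two" accuracy over a labelled dataset,
// skipping files under 100 bytes as described above.
const fs = require('fs');
const path = require('path');
const hljs = require('highlight.js');

const root = 'data'; // data/<language>/<file> (hypothetical)
let total = 0;
let first = 0;
let topTwo = 0;

for (const language of fs.readdirSync(root)) {
  for (const file of fs.readdirSync(path.join(root, language))) {
    const code = fs.readFileSync(path.join(root, language, file), 'utf8');
    if (code.length < 100) continue; // too small to expect a reliable guess
    total++;
    const result = hljs.highlightAuto(code);
    if (result.language === language) { first++; topTwo++; }
    else if (result.secondBest && result.secondBest.language === language) topTwo++;
  }
}

console.log(`first guess: ${(100 * first / total).toFixed(0)}%`);
console.log(`top two:     ${(100 * topTwo / total).toFixed(0)}%`);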

@DonaldTsang

@yyyc514 We should definitely try and "earn" the bragging rights by comparing our tool with the list in #2299

@joshgoebel
Member

If you have time to work on this feel free. There are probably some low hanging fruit wins to be had. I found two just by glancing at the list of things we failed to auto-detect from language-detection.el. So finding "obvious" things and fixing grammars is only going to help.

Really we need a formalized benchmark. If I have time I might whip up something a bit nicer based on the language-detection.el stuff since that gives us 3 different data sets to work from. So you could make an improvement and then test and see (across all 3) if you were moving in the right direction or not.

So what would be worth doing for sure:

  • more formalized benchmark with better output
  • analysis of failures for "low hanging fruit" fixes

@joshgoebel joshgoebel removed the other and big picture (Policy or high level discussion) labels Jan 31, 2020
@joshgoebel joshgoebel added the help welcome (Could use help from community) label Aug 6, 2020
@joshgoebel
Member

Closing and referencing at #2762.

hashar added a commit to hashar/highlight.js that referenced this issue Jun 14, 2023
Puppet configuration management supports two kinds of templating systems
to generate file resources based on Puppet variables:

- Embedded Ruby (erb) which is already supported by highlightjs
- Embedded Puppet (epp) which this pull request adds support for

epp is similar with some differences:
* variables are decorated with `$`
* curly braces are used instead of `do` / `end` blocks
* arguments of blocks are in slightly different position
* interpolated strings use `${...}` instead of the `#`
* etc

It is similar to erb but the backing language is Puppet rather than
plain ruby.

[detect]

The detect test is erroneously recognized as `mel` and some attempts to
tweak the input test gave me `awk` or `csharp`. Based on highlightjs#1213,
consider auto-detection to be too broad and disable it for `epp`.

[markup]

Epp can represent any other language, so configure it with `subLanguage: []`
and provide a markup test based on XML (which is properly detected by
the auto-detector).

Additional tests cover epp comment and containment of Puppet.

More or less related: the `erb` language has `subLanguage: xml` when it
should be `[]`, I have left a TODO note for later.

Co-authored-by: Wikimedia Foundation Inc.
Co-authored-by: Ben Ford <[email protected]>
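
For readers unfamiliar with those grammar options, the shape of such a stub looks roughly like this (an illustrative sketch, not the actual epp submission; `disableAutodetect` and a top-level `subLanguage` are existing grammar options):

// Sketch of a template-language grammar that opts out of auto-detection.
// subLanguage: [] means "highlight the surrounding markup as whatever it is";
// the embedded <% ... %> regions are handed to the puppet grammar.
export default function(hljs) {
  return {
    name: 'EPP (Embedded Puppet)',
    disableAutodetect: true,
    subLanguage: [],
    contains: [
      hljs.COMMENT(/<%#/, /%>/),
      {
        begin: /<%[-=]?/, end: /-?%>/,
        subLanguage: 'puppet',
        excludeBegin: true, excludeEnd: true
      }
    ]
  };
}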