[Correlation] Generate correlations through multiple requests #9884
What does it do?
This one might be up for debate, but I think it's a healthy addition that ensures consistent performance across all MySQL implementations (be that MariaDB, MySQL, AWS RDS or Azure MySQL etc.).
While ingesting events with heavily over-correlating values, we saw a major degradation in performance when generating correlations.
The current code looks for the value in either `value1` or `value2` via an OR clause, leaving MySQL to pick the optimisation strategy for the query. We found that on our MySQL DB (which wasn't MariaDB), the chosen strategy was not optimal for every request. For over-correlating IOCs, instead of using a `sort_union` strategy on the `value1` and `value2` indexes, it opted to use `event_id` as the selected index. I couldn't find a consistent method to force the correct strategy.
What did this mean for our MISP instance? Instead of correlations taking <= 0.1s per attribute, they were taking around 5-7s per attribute. When trying to ingest an event with 4K objects and 80K attributes (every object had high correlations), it would've taken around 6 days to finish ingestion.
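As a rough sanity check on that estimate, here is the arithmetic as a small Python sketch. It uses only the figures quoted above (80K attributes, 5-7s versus 0.1s per attribute) and assumes correlation time dominates ingestion; the function name is purely illustrative.

```python
ATTRIBUTES = 80_000          # attributes in the event described above
SECONDS_PER_DAY = 86_400


def ingestion_days(seconds_per_attribute: float) -> float:
    """Total correlation time in days, assuming a constant per-attribute cost."""
    return ATTRIBUTES * seconds_per_attribute / SECONDS_PER_DAY


slow_low = ingestion_days(5)         # lower bound of the degraded case
slow_high = ingestion_days(7)        # upper bound of the degraded case
fast_hours = ingestion_days(0.1) * 24  # expected case, expressed in hours

print(f"degraded: {slow_low:.1f} to {slow_high:.1f} days")
print(f"expected: {fast_hours:.1f} hours")
```

So the degraded case works out to somewhere between 4.6 and 6.5 days, against a little over 2 hours at 0.1s per attribute.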
So what does this PR do? Instead of relying upon SQL to find a suitable optimisation strategy for the `value` conditions within the OR clause, separate DB calls are made for each value condition, enabling straightforward indexes to be used for the query. Correlation limits and bits like ACL are still adhered to during the lookups.
Potential impact - While this will ensure consistent performance when generating correlations, MariaDB appears to consistently create well-optimised queries already (although I haven't done much validation here), so a slight reduction in performance may be seen there, as it now must make up to 3 DB calls instead of just 1. However, given many MISP instances will not be using MariaDB, I reckon this is a good tradeoff.
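To illustrate the idea outside of MISP, here is a minimal Python sketch of splitting an OR lookup into one query per column and merging the results. It is not the PR's code: it uses an in-memory SQLite database as a stand-in for MySQL, and the `attributes` table, its index names, and the `find_correlating_ids` helper are simplified assumptions for illustration. Correlation limits and ACL checks are left out.

```python
import sqlite3


def find_correlating_ids(conn, value):
    """Look up a value with one query per column, then merge the results."""
    seen = {}
    # Fixed whitelist of column names; never built from user input.
    for column in ("value1", "value2"):
        sql = f"SELECT id, event_id FROM attributes WHERE {column} = ?"
        for attr_id, event_id in conn.execute(sql, (value,)):
            # A row matching on both columns is only recorded once.
            seen[attr_id] = event_id
    return sorted(seen.items())


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE attributes (
        id INTEGER PRIMARY KEY,
        event_id INTEGER,
        value1 TEXT,
        value2 TEXT
    );
    CREATE INDEX idx_value1 ON attributes (value1);
    CREATE INDEX idx_value2 ON attributes (value2);
    CREATE INDEX idx_event_id ON attributes (event_id);
    """
)
conn.executemany(
    "INSERT INTO attributes (id, event_id, value1, value2) VALUES (?, ?, ?, ?)",
    [
        (1, 10, "evil.example", ""),
        (2, 11, "file.exe", "evil.example"),
        (3, 12, "evil.example", "evil.example"),
        (4, 13, "benign.example", ""),
    ],
)

# Original approach: one query, with the planner left to handle the OR.
single = sorted(
    conn.execute(
        "SELECT id, event_id FROM attributes WHERE value1 = ? OR value2 = ?",
        ("evil.example", "evil.example"),
    ).fetchall()
)

# Split approach: each query has a single equality condition on one indexed column.
split = find_correlating_ids(conn, "evil.example")

assert single == split
```

The point of the split is that each query gives the planner only one sensible index to choose, so performance no longer depends on how a particular MySQL flavour decides to handle the OR. The cost is the extra round trips, plus de-duplicating rows that match on both columns.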
Questions