Optimization of subunit clusterer for quaternary sym detection of PDB deposited structures#857
Conversation
… uses entity id infor. Some tests are failing
lafita
left a comment
There was a problem hiding this comment.
Looks good! I did not test the code with large viral assemblies... Thanks Jose!
| clusters.remove(c2); | ||
| } else if (clusters.get(c1).mergeIdentical(clusters.get(c2))) { | ||
| // This always makes sense as an optimization: it's far cheaper to compare the sequence | ||
| // string than doing a full S-W alignment |
There was a problem hiding this comment.
I think you are correct here but there are some problems like missing residues at the ends or modified amino acids, so using S-W alignment was an easy solution at the time, but clearly not optimal. Do you have an idea why the test is failing?
There was a problem hiding this comment.
With this optimization enabled, the test that fails is TestQuatSymmetryDetectorExamples.testLocal(), but I have no idea why. I then decided to remove this optimization as anyway for my use-case the entity id comparison takes care of it.
|
|
||
| private List<Subunit> subunits = new ArrayList<Subunit>(); | ||
| private List<List<Integer>> subunitEQR = new ArrayList<List<Integer>>(); | ||
| private List<Subunit> subunits = new ArrayList<>(); |
There was a problem hiding this comment.
Why did you remove the Class of elements in the List when it is defined? Has something changed in Java8+?
There was a problem hiding this comment.
Since Java 7 it is redundant to use the types in the variable initialization. They <> is called "diamond". See "The diamond" section here: https://docs.oracle.com/javase/tutorial/java/generics/types.html
I think it is good style to remove them, since they are redundant. It helps readability.
|
FYI for all: after this is merged I'd like to cut a 5.4.0 release (bump in minor, since this one introduces a tiny new feature). |
This pull request introduces a new switch for the subunit clusterer used for symmetry detection. If the switch
useEntityIdForSeqIdentityDeterminationis enabled, it uses the entity id of the subunits to establish identity of sequences, saving the full Smith-Waterman alignment calculation.This optimization is important in cases like large viral capsids, where there are many thousands of chains and all-to-all sequence alignments become a bottleneck. E.g. for 6Q1F the runtime goes from ~ 6 hours to 7 minutes.