Omicron clustering algorithm in EventTable.cluster() method #1201
Conversation
Codecov Report
@@ Coverage Diff @@
## master #1201 +/- ##
==========================================
+ Coverage 92.28% 92.32% +0.04%
==========================================
Files 229 229
Lines 17685 17774 +89
==========================================
+ Hits 16319 16409 +90
+ Misses 1366 1365 -1
Continue to review full report at Codecov.
Hi @duncanmmacleod, after some time I can work again on the clustering and pyomicron things. As we discussed, I simply added a conditional check within `table.cluster()` to choose between the current clustering method and the one implemented in omicron. However, according to codeclimate it is a bit too complex due to the break in the linear flow of the code. So I'm wondering whether we can ignore the codeclimate check? Otherwise I guess I will need to write two distinct methods. Regarding the test failures, I am trying to understand where the problem with `test_fetch_hacr()` comes from. Any ideas?
@dcorre, please rebase this branch against the upstream `master` branch.
Thanks @dcorre, this will make a nice addition to gwpy!
I have requested some pedantic changes, but also some restructuring to make the code more modular, and more atomic.
I am a little confused why the original cluster method is much more complex when moved into the new clustering method, but that might end up being clear once the other changes have been made.
gwpy/table/table.py
Outdated
@@ -720,20 +722,159 @@ def filter(self, *column_filters):
        """
        return filter_table(self, *column_filters)

    def clustering(self, index, window):
I think it makes more sense to rename this function to something like this:
-    def clustering(self, index, window):
+    def identify_clusters(self, index, window):
so that the purpose is more obvious to the user.
        """
        # Use same algorithm as omicron
        if index == 'omicron':
It might be nice at this point to break out into two private methods that perform the different clustering routines, rather than having a large `if` block. Something like:

    def identify_clusters(self, index, window):
        if index == "omicron":
            return self._identify_omicron_clusters(window)
        return self._identify_downselect_clusters(index, window)
Having smaller, more modular, private methods should make this easier to maintain, and would make adding additional clustering algorithms easier in the future.
gwpy/table/table.py
Outdated
        self.sort('tstart')

        # Initialise lists and Tend of the first cluster
        ClustersList = []
Please use only lowercase names for variables (CamelCase is for class names), as suggested in PEP-8.
-        ClustersList = []
+        clusterslist = []
gwpy/table/table.py
Outdated
        N = len(self)
        # Loop over the triggers sorted by tstart column
        for i in range(N):
If `N` is not used anywhere else, we can just skip it entirely:

-        N = len(self)
-        # Loop over the triggers sorted by tstart column
-        for i in range(N):
+        # Loop over the triggers sorted by tstart column
+        for i in range(len(self)):
gwpy/table/table.py
Outdated
            if window <= 0.0:
                raise ValueError('Window must be a positive value')
If both algorithms require this condition, it makes sense to move this up and outside of the `if index == "omicron"` block.
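Putting the two suggestions together, a rough sketch (the method names are the placeholders from the earlier comment, not a final API) could be:

```python
def identify_clusters(self, index, window):
    # sketch only: validate the window once, then dispatch to a private helper
    if window <= 0.0:
        raise ValueError('Window must be a positive value')
    if index == 'omicron':
        return self._identify_omicron_clusters(window)
    return self._identify_downselect_clusters(index, window)
```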
gwpy/table/table.py
Outdated
        cluster_id_tile = []
        for i, s in enumerate(ClustersList):
            cluster_id_tile.extend([cluster_id[i]] * len(s))
        self['clusterID'] = cluster_id_tile
I'm not sure I like this forced addition of a new column; I would rather that this method just return the array that identifies the clusters, and users can add it to the table themselves if they wish; however, I can be convinced. But, if a column is added, I would prefer it to be called `cluster_id` (all lower-case, underscore-delimited).
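If the method did return the identifiers rather than writing the column, usage might look something like this (`identify_clusters` and its arguments are hypothetical here, shown only to illustrate the suggestion):

```python
# hypothetical: the method returns one cluster ID per row ...
cluster_id = table.identify_clusters('time', 0.1)
# ... and users who want the column add it themselves
table['cluster_id'] = cluster_id
```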
gwpy/table/table.py
Outdated
        # without reloading the EventTable, the Tstart and Tend
        # are the updated ones, not the original ones for the trigger
        # which represent a cluster.
        triggers = copy.deepcopy(self.clustering(index, window))
This copy seems unnecessary. It makes more sense to me to start the new table as an empty table, then append rows to it as you iterate over the identified clusters.
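A minimal sketch of that approach, assuming `clusters` holds the per-cluster lists of row indices and `rank` names the column being maximised (both placeholders):

```python
# build the clustered table by appending rows, rather than deep-copying self
out = self[:0].copy()  # empty table with the same columns
for cluster in clusters:
    loudest = cluster[self[rank][cluster].argmax()]  # row with the largest rank value
    out.add_row(list(self[loudest]))
```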
gwpy/table/tests/test_table.py
Outdated
        assert all(t['time'] == [15.09375, 18.708007, 23.265625, 23.90625])
        assert all(t['snr'] == [5.92, 5.18, 5.1, 5.78])
        assert all(t['amplitude'] == [6.08, 3.13, 8.93, 1.11])
        assert all(t['tstart'] == [15.0625, 18.707031, 23.25, 23.875])
        assert all(t['tend'] == [15.125, 18.708985, 23.28125, 24.0])
It might be simpler to use `numpy.testing.assert_array_equal` or similar here.
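For example, reusing the values from the first assertion above:

```python
from numpy.testing import assert_array_equal

assert_array_equal(t['time'], [15.09375, 18.708007, 23.265625, 23.90625])
```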
gwpy/table/table.py
Outdated
        # the omicron method. Might need to use it as well for other method
        if index == 'omicron':
            triggers['tstart'][maxidx] = tstart_list
            triggers['tend'][maxidx] = tend_list
Please add the necessary logic to manipulate all other columns that are modified by the C++ version of this method, including `flow`/`fhigh` etc. We should strive to have the python method reproduce the C++ method exactly (including all parameter values).
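For example, the same pattern extended to the frequency columns might look roughly like this; `clusters` stands for the per-cluster lists of row indices, and whether min/max is the right aggregation would need to be checked against the C++ code:

```python
# illustrative only: mirror the tstart/tend update above for flow/fhigh
triggers['flow'][maxidx] = [triggers['flow'][c].min() for c in clusters]
triggers['fhigh'][maxidx] = [triggers['fhigh'][c].max() for c in clusters]
```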
I included all of your recommendations. Apart from codeclimate complaining about some cognitive complexity, it looks fine. I'll let you check the codeclimate output to see if it really matters. There is also a problem related to flake8, but it is not related to this pull request.

Regarding the reason why I changed the original method so much: well, it is quite a while now since I made these changes... but I think it was because it did not allow a tile/trigger (i.e. a row) to be associated with a given cluster, as I intend to do with these changes. I'll make some tests, probably tomorrow, to see if I can keep the old format with these changes.

By the way, it would make more sense if the fstart/fend and tstart/tend values of the clusters identified by the original method corresponded to the min/max of the rows included in them, and not to the values of the row with the highest snr (or whatever quantity you want to maximise over). With these changes it is straightforward; with the original method, I am not sure.
Actually they are the same methods, but I just added some checks for missing triggers. The reason is that I'm using an array, `self.clustersList`, containing the lists of row indexes included in a given cluster. This array is the output of both the omicron and the 'new' original method. Then in `identify_clusters()` these groups of points are assigned to a given cluster, and each point/row/tile is assigned to a given cluster and stored in `self.tile_cluster_id`. So I need to associate each point with a cluster, even if it is a single-point cluster.

The original method does not identify successive rows that do not fulfil the window criteria. For example:

    [array([0, 1, 2]), array([3, 4, 5, 6, 7]), array([8, 9]), array([12, 13, 14, 15, 16]), array([17, 18, 19])]

You see that rows 10 and 11 are missing here because they do not fulfil the window criteria. In the original method, this is not a problem thanks to the use of:

    mask1 = numpy.ones_like(col, dtype=bool)

so missing points (i.e. individual trigger clusters) are set to True. The problem is that this workaround is not compatible with the new implementation, where I explicitly need an array of lists containing the points in the clusters, even if it is a single-point cluster.

To summarise, the old and new implementations are based on the same method, but as I want to associate each point in the EventTable with a given cluster, I needed to add some code to search for missing points (which can also be at the beginning or end of the EventTable) and then insert them into the "padded_sublists" (i.e. "clustersList" in the new implementation). Do you have a better solution?
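To make the gap concrete, here is a small standalone illustration of the example above (not part of the PR code):

```python
import numpy

# clusters found by the pooling step, as listed above
clusters = [numpy.array([0, 1, 2]), numpy.array([3, 4, 5, 6, 7]),
            numpy.array([8, 9]), numpy.array([12, 13, 14, 15, 16]),
            numpy.array([17, 18, 19])]

# rows 10 and 11 never satisfy the window criterion, so they must be
# inserted explicitly as single-row clusters
covered = numpy.concatenate(clusters)
missing = numpy.setdiff1d(numpy.arange(20), covered)
print(missing)  # [10 11]
```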
            if tend[i] > cluster_Tend:
                cluster_Tend = tend[i]
            # Start new cluster
            else:
                clustersList.append(singleClusterIdxList)
                singleClusterIdxList = []
                singleClusterIdxList.append(i)
                cluster_Tend = tend[i]

        # Append last cluster list of indexes
        clustersList.append(singleClusterIdxList)

        self.clustersList = clustersList

        # keep track of the order indexes
        self.orderidx = orderidx

        return True

    def _identify_downselect_clusters(self, index, window):
        """ Identify clusters of this `EventTable` over a given column,
        `index`.

        The clustering algorithm uses a pooling method to identify groups
        of points that are all separated in `index` by less than `window`.

        Add the list of identified clusters to `self`.

        Parameters
        ----------
        index : `str`
            name of the column which is used to search for clusters.
            If index == 'omicron': use omicron clustering algorithm

        window : `float`
            window to use when clustering data points, will raise
            ValueError if `window > 0` is not satisfied

        Returns
        -------
        None

        """

        # sort table by the required column
        orderidx = numpy.argsort(self[index])
        col = self[index][orderidx]

        # Find all points where the index vector changes by less than
        # window and divide the resulting array into clusters of
        # adjacent points
        clusterpoints = numpy.where(numpy.diff(col) <= window)[0]
        sublists = numpy.split(clusterpoints,
                               numpy.where(
                                   numpy.diff(clusterpoints) > 1)[0]+1)

        # Add end-points to each cluster and find the index of the maximum
        # point in each list
        clustersList = [numpy.append(s, numpy.array([s[-1]+1]))
                        for s in sublists]
        # Find triggers that are between two clusters
        # Happens when successive triggers do not fulfil the window
        # criteria. N triggers missing --> add N clusters
        missing_trigger = []
        missing_trigger_idx = []
        for i in range(len(clustersList)-1):
            if clustersList[i+1][0] - clustersList[i][-1] > 1:
                missing_trigger_idx.append(i+1)
                missing_trigger.append(numpy.arange(
                    clustersList[i][-1]+1,
                    clustersList[i+1][0]))
        # Insert them in the cluster list
        # Need to reverse the list, otherwise the list size update
        # is a problem
        for i in range(len(missing_trigger))[::-1]:
            for j in range(len(missing_trigger[i])):
                clustersList.insert(missing_trigger_idx[i]+j,
                                    numpy.atleast_1d(
                                        missing_trigger[i][j]))

        # Check if there are some missing points at the beginning
        # and end of this list
        if clustersList[0][0] != 0:
            missing_trigger = numpy.arange(0, clustersList[0][0])
            for i in range(len(missing_trigger)):
                clustersList.insert(i,
                                    numpy.atleast_1d(missing_trigger[i]))
        if clustersList[-1][-1] != len(self):
            missing_trigger = numpy.arange(clustersList[-1][-1]+1,
                                           len(self))
            for i in range(len(missing_trigger)):
                clustersList.insert(len(clustersList)+i,
                                    numpy.atleast_1d(missing_trigger[i]))

        self.clustersList = clustersList

        # keep track of the order indexes
        self.orderidx = orderidx

        return True
It makes more sense to me to have these functions return the list of clusters, rather than storing them in another internal property, since the list is only ever used internal to the `identify_clusters` method.
Can you please consider that change?
@dcorre, what's the status of this work? Can you please rebase against the upstream master branch to refresh things, and then address the comments above?
Ping @dcorre.
@dcorre, what is the status of this work? I'm sorry that I wasn't able to devote enough time to it when you originally proposed it. |
In addition to the current `EventTable.cluster()` algorithm, one can set `index='omicron'` to perform the clustering similarly to omicron.

Examples
--------
As before, to cluster an `EventTable` (`table`) whose `index` is `end_time`, `window` is `0.1`, and maximize over `snr`:

New: to cluster an `EventTable` (`table`) with the omicron clustering algorithm, with a `window` of `0.1`, and maximize over `snr`:
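Presumably (the rendered code snippets are not preserved in this view), given the existing `EventTable.cluster(index, rank, window)` call signature, the two examples would look something like:

```python
# existing behaviour: cluster on end_time, keeping the row with the largest snr
table.cluster('end_time', 'snr', 0.1)

# proposed behaviour: select the omicron-style clustering via the index argument
table.cluster('omicron', 'snr', 0.1)
```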