
Omicron clustering algorithm in EventTable.cluster() method #1201


Draft
wants to merge 28 commits into main

Conversation

dcorre

@dcorre dcorre commented Jan 16, 2020

In addition to the current EventTable.cluster() algorithm, one can set index='omicron' to perform the clustering in the same way as omicron.

Examples
--------

As before, to cluster an EventTable (table) whose index is
end_time, window is 0.1, and maximize over snr:

    >>> table.cluster('end_time', 'snr', 0.1)

New: to cluster an EventTable (table) with the omicron clustering
algorithm, with a window of 0.1, and maximize over snr:

    >>> table.cluster('omicron', 'snr', 0.1)

@pep8speaks

pep8speaks commented Jan 16, 2020

Hello @dcorre! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-07-16 10:50:35 UTC

@codecov

codecov bot commented Jan 16, 2020

Codecov Report

Merging #1201 into master will increase coverage by 0.04%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1201      +/-   ##
==========================================
+ Coverage   92.28%   92.32%   +0.04%     
==========================================
  Files         229      229              
  Lines       17685    17774      +89     
==========================================
+ Hits        16319    16409      +90     
+ Misses       1366     1365       -1     
Flag Coverage Δ
#Darwin 91.90% <100.00%> (+0.05%) ⬆️
#Linux 92.22% <100.00%> (+0.06%) ⬆️
#Windows 75.44% <100.00%> (+0.12%) ⬆️
#conda 92.23% <100.00%> (+0.04%) ⬆️
#python 92.32% <100.00%> (+0.04%) ⬆️
#python36 92.10% <100.00%> (+0.05%) ⬆️
#python37 92.08% <100.00%> (+0.05%) ⬆️
#python38 92.29% <100.00%> (+0.04%) ⬆️
Impacted Files Coverage Δ
gwpy/table/table.py 93.47% <100.00%> (+2.12%) ⬆️
gwpy/table/tests/test_table.py 98.82% <100.00%> (+0.09%) ⬆️
gwpy/table/tests/test_gravityspy.py 94.44% <0.00%> (-5.56%) ⬇️
gwpy/table/gravityspy.py 31.52% <0.00%> (+2.17%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e5e1987...56ea67e. Read the comment docs.

@dcorre
Author

dcorre commented Mar 10, 2020

Hi @duncanmmacleod, after some time I can work again on the clustering and pyomicron things. As we discussed, I simply added a conditional check within table.cluster() to choose between the current clustering method and the one implemented in omicron. However, according to codeclimate it is a bit too complex because of the break in the linear flow of the code.

So I'm wondering whether we can ignore the codeclimate check? Otherwise I guess I will need to write two distinct methods.

Regarding the test failures, I am trying to understand where the problem with test_fetch_hacr() comes from. Any ideas?

@duncanmmacleod
Member

@dcorre, please rebase this branch against the upstream master branch, which should have fixed the unrelated test failures.

@duncanmmacleod duncanmmacleod added this to the 2.1.0 milestone Jul 3, 2020
Member

@duncanmmacleod duncanmmacleod left a comment


Thanks @dcorre, this will make a nice addition to gwpy!

I have requested some pedantic changes, but also some restructuring to make the code more modular, and more atomic.

I am a little confused why the original cluster method is much more complex when moved into the new clustering method, but that might end up being clear once the other changes have been made.

@@ -720,20 +722,159 @@ def filter(self, *column_filters):
"""
return filter_table(self, *column_filters)

def clustering(self, index, window):

I think it makes more sense to rename this function to something like this:

Suggested change
def clustering(self, index, window):
def identify_clusters(self, index, window):

so that the purpose is more obvious to the user.


"""
# Use same algorithm as omicron
if index == 'omicron':

It might be nice at this point to break out into two private methods that perform the different clustering routines, rather than having a large if block. Something like:

    def identify_clusters(...):
        if index == "omicron":
            return self._identify_omicron_clusters(window)
        return self._identify_downselect_clusters(index, window)

Having smaller, more modular, private methods should make this easier to maintain, and would make adding additional clustering algorithms easier in the future.

self.sort('tstart')

# Initialise lists and Tend of the first cluster
ClustersList = []

Please use only lowercase names for variables (CamelCase is for class names), as suggested in PEP-8.

Suggested change
ClustersList = []
clusterslist = []

Comment on lines 775 to 777
N = len(self)
# Loop over the triggers sorted by tstart column
for i in range(N):

If N is not used anywhere else, we can just skip it entirely:

Suggested change
N = len(self)
# Loop over the triggers sorted by tstart column
for i in range(N):
# Loop over the triggers sorted by tstart column
for i in range(len(self)):

Comment on lines 797 to 798
if window <= 0.0:
raise ValueError('Window must be a positive value')

If both algorithms require this condition, it makes sense to move this up and outside of the if index == "omicron" block.
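A minimal sketch of that suggestion (the method and helper names here are illustrative placeholders, not the actual gwpy API): the shared precondition runs once, before dispatching to the algorithm-specific routine.

```python
# Hypothetical sketch only -- not the gwpy implementation.
# The window check is common to both algorithms, so it is hoisted
# above the dispatch on `index`.
def identify_clusters(table, index, window):
    if window <= 0.0:
        raise ValueError('Window must be a positive value')
    if index == 'omicron':
        return _identify_omicron_clusters(table, window)
    return _identify_downselect_clusters(table, index, window)

def _identify_omicron_clusters(table, window):
    # placeholder standing in for the omicron-style routine
    return ('omicron', window)

def _identify_downselect_clusters(table, index, window):
    # placeholder standing in for the original pooling routine
    return ('downselect', index, window)
```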

cluster_id_tile = []
for i, s in enumerate(ClustersList):
cluster_id_tile.extend([cluster_id[i]] * len(s))
self['clusterID'] = cluster_id_tile

I'm not sure I like this forced addition of a new column; I would rather this method just return the array that identifies the clusters, and users can add it to the table themselves if they wish; however, I can be convinced. But if a column is added, I would prefer it to be called cluster_id (all lower-case, underscore-delimited).

# without reloading the EventTable, the Tstart and Tend
# are the updated ones, not the original ones for the trigger
# which represent a cluster.
triggers = copy.deepcopy(self.clustering(index, window))

This copy seems unnecessary. It makes more sense to me to start the new table as an empty table, then append rows to it as you iterate over the identified clusters.
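A minimal sketch of that approach, using plain dicts in place of table rows (purely illustrative data, not the gwpy implementation):

```python
# Illustrative sketch: build the clustered table by appending one
# representative row per cluster, instead of deep-copying the whole
# input table first.
rows = [
    {'time': 1.00, 'snr': 5.0},
    {'time': 1.05, 'snr': 7.0},
    {'time': 3.00, 'snr': 6.0},
]
clusters = [[0, 1], [2]]  # row indices belonging to each cluster

clustered = []  # the new table starts empty
for cluster in clusters:
    # keep the loudest row (maximum 'snr') from each cluster
    loudest = max(cluster, key=lambda i: rows[i]['snr'])
    clustered.append(rows[loudest])
```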

Comment on lines 524 to 528
assert all(t['time'] == [15.09375, 18.708007, 23.265625, 23.90625])
assert all(t['snr'] == [5.92, 5.18, 5.1, 5.78])
assert all(t['amplitude'] == [6.08, 3.13, 8.93, 1.11])
assert all(t['tstart'] == [15.0625, 18.707031, 23.25, 23.875])
assert all(t['tend'] == [15.125, 18.708985, 23.28125, 24.0])

It might be simpler to use numpy.testing.assert_array_equal or similar here.
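For example, each of the all(...) assertions above could collapse to a single call (reusing one set of expected values from the test); assert_array_equal also reports element-wise mismatches, which gives clearer failure messages:

```python
import numpy
from numpy.testing import assert_array_equal

# Sketch of the suggested rewrite, with expected values copied from
# the test above.
t = {'snr': numpy.array([5.92, 5.18, 5.1, 5.78])}
assert_array_equal(t['snr'], [5.92, 5.18, 5.1, 5.78])
```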

# the omicron method. Might need to use it as well for other method
if index == 'omicron':
triggers['tstart'][maxidx] = tstart_list
triggers['tend'][maxidx] = tend_list

Please add the necessary logic to manipulate all other columns that are modified by the C++ version of this method, including flow/fhigh etc. We should strive to have the python method reproduce the C++ method exactly (including all parameter values).

@dcorre
Author

dcorre commented Jul 16, 2020

I included all of your recommendations.

Apart from codeclimate complaining about some cognitive complexity, it looks fine. I'll let you check the codeclimate outputs to see if it really matters. There is also a flake8 problem, but it is not related to this pull request.

Regarding the reason why I changed the original method so much: it has been quite a while since I made these changes... but I think it was because the original did not allow associating a tile/trigger (i.e. a row) with a given cluster, as I intend to do with these changes.

I'll make some tests, probably tomorrow, to see if I can keep the old format with these changes.

By the way, it would make more sense if the fstart/fend and tstart/tend values of the clusters identified by the original method corresponded to the min/max over the rows included in them, rather than to the values of the row with the highest snr (or whatever quantity you want to maximise over). With these changes it is straightforward; with the original method, I am not sure.
So if there are no particular reasons, apart from the fact that it was easier to code, we could also update this.
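A sketch of what that would look like, with illustrative arrays rather than real trigger data: each cluster's bounds come from all of its member rows, not from the loudest row alone.

```python
import numpy

# Illustrative sketch: compute each cluster's tstart/tend as the
# min/max over all member rows, instead of taking the values of the
# single row with the highest snr.
tstart = numpy.array([15.0625, 15.09, 18.707031])
tend = numpy.array([15.125, 15.20, 18.708985])
clusters = [[0, 1], [2]]  # row indices in each identified cluster

cluster_tstart = [tstart[c].min() for c in clusters]
cluster_tend = [tend[c].max() for c in clusters]
```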

@dcorre
Author

dcorre commented Jul 16, 2020

Actually they are the same method; I just added some checks for missing triggers.

The reason is that I'm using an array, self.clustersList, containing the lists of row indexes included in each cluster. This array is the output of both the omicron and the 'new' original method. Then, in identify_clusters(), each point/row/tile is assigned to a given cluster and stored in self.tile_cluster_id. So I need to associate each point with a cluster, even if it is a single-point cluster.

The original method does not identify successive rows that do not fulfil the window criteria.
To illustrate, here is an output of "padded_sublists" (whose aim is to group rows together):

[array([0, 1, 2]), array([3, 4, 5, 6, 7]), array([8, 9]), array([12, 13, 14, 15, 16]), array([17, 18, 19])]

You can see that rows 10 and 11 are missing here because they do not fulfil the window criteria.

In the original method, this is not a problem thanks to the use of :

mask1 = numpy.ones_like(col, dtype=bool)
mask1[numpy.concatenate(padded_sublists)] = False
mask1[maxidx] = True

So missing points (i.e. individual trigger clusters) are set to True. The problem is that this workaround is not compatible with the new implementation, where I explicitly need a list of arrays containing the points in each cluster, even if it is a single-point cluster.

To summarise, the old and new implementations are based on the same method, but as I want to associate each point in the EventTable with a given cluster, I needed to add some code to search for missing points (which can also be at the beginning or end of the EventTable) and then insert them into "padded_sublists" (i.e. "clustersList" in the new implementation).

Do you have a better solution?
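One possible answer, sketched with a hypothetical helper (not the code in this PR): collect every row index not covered by the grouped sublists and insert it as a singleton cluster, so that each row belongs to exactly one cluster.

```python
import numpy

# Hypothetical sketch: fill the gaps in the grouped indices (as in
# "padded_sublists") with singleton clusters, covering missing rows
# in the middle as well as at either end of the table.
def fill_missing_clusters(sublists, nrows):
    covered = set(numpy.concatenate(sublists)) if sublists else set()
    clusters = list(sublists)
    for i in range(nrows):
        if i not in covered:
            clusters.append(numpy.atleast_1d(i))
    # re-sort clusters by their first member to restore time order
    clusters.sort(key=lambda c: c[0])
    return clusters

filled = fill_missing_clusters(
    [numpy.array([0, 1, 2]), numpy.array([3, 4]), numpy.array([7, 8])],
    nrows=9,
)
# rows 5 and 6 are not in any sublist, so each becomes its own cluster
```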

Comment on lines +723 to +858
if tend[i] > cluster_Tend:
cluster_Tend = tend[i]
# Start new cluster
else:
clustersList.append(singleClusterIdxList)
singleClusterIdxList = []
singleClusterIdxList.append(i)
cluster_Tend = tend[i]

# Append last cluster list of indexes
clustersList.append(singleClusterIdxList)

self.clustersList = clustersList

# keep track of the order indexes
self.orderidx = orderidx

return True

def _identify_downselect_clusters(self, index, window):
""" Identify clusters of this `EventTable` over a given column,
`index`.

The clustering algorithm uses a pooling method to identify groups
of points that are all separated in `index` by less than `window`.

Add the list of identified clusters to `self`.

Parameters
----------
index : `str`
name of the column which is used to search for clusters.
If index == 'omicron': use the omicron clustering algorithm

window : `float`
window to use when clustering data points, will raise
ValueError if `window > 0` is not satisfied

Returns
-------
None

"""

# sort table by the required column
orderidx = numpy.argsort(self[index])
col = self[index][orderidx]

# Find all points where the index vector changes by less than
# window and divide the resulting array into clusters of
# adjacent points
clusterpoints = numpy.where(numpy.diff(col) <= window)[0]
sublists = numpy.split(clusterpoints,
numpy.where(
numpy.diff(clusterpoints) > 1)[0]+1)

# Add end-points to each cluster and find the index of the maximum
# point in each list
clustersList = [numpy.append(s, numpy.array([s[-1]+1]))
for s in sublists]
# Find triggers that are between two clusters
# Happens when successive triggers do not fill the window
# criteria. N triggers missing --> add N clusters
missing_trigger = []
missing_trigger_idx = []
for i in range(len(clustersList)-1):
if clustersList[i+1][0] - clustersList[i][-1] > 1:
missing_trigger_idx.append(i+1)
missing_trigger.append(numpy.arange(
clustersList[i][-1]+1,
clustersList[i+1][0]))
# Insert them in the cluster list
# Need to reverse the list, otherwise the list size update
# is a problem
for i in range(len(missing_trigger))[::-1]:
for j in range(len(missing_trigger[i])):
clustersList.insert(missing_trigger_idx[i]+j,
numpy.atleast_1d(
missing_trigger[i][j]))

# Check if there are some missing points at the beginning
# and end of this list
if clustersList[0][0] != 0:
missing_trigger = numpy.arange(0, clustersList[0][0])
for i in range(len(missing_trigger)):
clustersList.insert(i,
numpy.atleast_1d(missing_trigger[i]))
if clustersList[-1][-1] != len(self):
missing_trigger = numpy.arange(clustersList[-1][-1]+1,
len(self))
for i in range(len(missing_trigger)):
clustersList.insert(len(clustersList)+i,
numpy.atleast_1d(missing_trigger[i]))

self.clustersList = clustersList

# keep track of the order indexes
self.orderidx = orderidx

return True

It makes more sense to me to have these functions return the list of clusters, rather than storing them in another internal property, since the list is only ever used internal to the identify_clusters method.

Can you please consider that change?

@duncanmmacleod
Member

@dcorre, what's the status of this work? Can you please rebase against the upstream master branch to refresh things, and then address the comments above?

@eagoetz
Contributor

eagoetz commented Aug 3, 2021

> @dcorre, what's the status of this work? Can you please rebase against the upstream master branch to refresh things, and then address the comments above?

@dcorre A gentle nudge on this PR.

@duncanmmacleod duncanmmacleod modified the milestones: 2.1.0, 2.2.0 Aug 24, 2021
@duncanmmacleod
Member

Ping @dcorre.

@duncanmmacleod duncanmmacleod marked this pull request as draft February 3, 2022 14:45
@duncanmmacleod duncanmmacleod modified the milestones: 2.2.0, 3.0.0 Mar 17, 2022
@duncanmmacleod
Member

@dcorre, what is the status of this work? I'm sorry that I wasn't able to devote enough time to it when you originally proposed it.

@duncanmmacleod duncanmmacleod modified the milestones: 3.0.0, 3.1.0 Oct 2, 2022