
Omicron clustering algorithm in EventTable.cluster() method #1201


Draft
wants to merge 28 commits into main

Conversation

dcorre

@dcorre dcorre commented Jan 16, 2020

In addition to the current EventTable.cluster() algorithm, one can set index='omicron' to perform the clustering in the same way as omicron.

Examples
--------

As before, to cluster an EventTable (table) whose index is
end_time, window is 0.1, and maximize over snr:

    >>> table.cluster('end_time', 'snr', 0.1)

New: to cluster an EventTable (table) with the omicron clustering
algorithm, with a window of 0.1, and maximize over snr:

    >>> table.cluster('omicron', 'snr', 0.1)

@pep8speaks

pep8speaks commented Jan 16, 2020

Hello @dcorre! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-07-16 10:50:35 UTC

@codecov

codecov bot commented Jan 16, 2020

Codecov Report

Merging #1201 into master will increase coverage by 0.04%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1201      +/-   ##
==========================================
+ Coverage   92.28%   92.32%   +0.04%     
==========================================
  Files         229      229              
  Lines       17685    17774      +89     
==========================================
+ Hits        16319    16409      +90     
+ Misses       1366     1365       -1     
Flag Coverage Δ
#Darwin 91.90% <100.00%> (+0.05%) ⬆️
#Linux 92.22% <100.00%> (+0.06%) ⬆️
#Windows 75.44% <100.00%> (+0.12%) ⬆️
#conda 92.23% <100.00%> (+0.04%) ⬆️
#python 92.32% <100.00%> (+0.04%) ⬆️
#python36 92.10% <100.00%> (+0.05%) ⬆️
#python37 92.08% <100.00%> (+0.05%) ⬆️
#python38 92.29% <100.00%> (+0.04%) ⬆️
Impacted Files Coverage Δ
gwpy/table/table.py 93.47% <100.00%> (+2.12%) ⬆️
gwpy/table/tests/test_table.py 98.82% <100.00%> (+0.09%) ⬆️
gwpy/table/tests/test_gravityspy.py 94.44% <0.00%> (-5.56%) ⬇️
gwpy/table/gravityspy.py 31.52% <0.00%> (+2.17%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e5e1987...56ea67e. Read the comment docs.

@dcorre
Author

dcorre commented Mar 10, 2020

Hi @duncanmmacleod, after some time I can work again on the clustering and pyomicron things. As we discussed, I simply added a conditional check within table.cluster() to choose between the current clustering method and the one implemented in omicron. However, according to codeclimate it is a bit too complex because of the break in the linear flow of the code.

So I'm wondering whether we can ignore the codeclimate check? Otherwise I guess I will need to write two distinct methods.

Regarding the test failures, I am trying to understand where the problem with test_fetch_hacr() comes from. Any ideas?

@duncanmmacleod
Member

@dcorre, please rebase this branch against the upstream master branch, which should have fixed the unrelated test failures.

@duncanmmacleod duncanmmacleod added this to the 2.1.0 milestone Jul 3, 2020
Member

@duncanmmacleod duncanmmacleod left a comment


Thanks @dcorre, this will make a nice addition to gwpy!

I have requested some pedantic changes, but also some restructuring to make the code more modular, and more atomic.

I am a little confused why the original cluster method is much more complex when moved into the new clustering method, but that might end up being clear once the other changes have been made.

@@ -720,20 +722,159 @@ def filter(self, *column_filters):
"""
return filter_table(self, *column_filters)

def clustering(self, index, window):

I think it makes more sense to rename this function to something like this:

Suggested change
def clustering(self, index, window):
def identify_clusters(self, index, window):

so that the purpose is more obvious to the user.


"""
# Use same algorithm as omicron
if index == 'omicron':

It might be nice at this point to break out into two private methods that perform the different clustering routines, rather than having a large if block. Something like:

    def identify_clusters(...):
        if index == "omicron":
            return self._identify_omicron_clusters(window)
        return self._identify_downselect_clusters(index, window)

Having smaller, more modular, private methods should make this easier to maintain, and would make adding additional clustering algorithms easier in the future.

self.sort('tstart')

# Initialise lists and Tend of the first cluster
ClustersList = []

Please use only lowercase names for variables (CamelCase is for class names), as suggested in PEP-8.

Suggested change
ClustersList = []
clusterslist = []

Comment on lines 775 to 777
N = len(self)
# Loop over the triggers sorted by tstart column
for i in range(N):

If N is not used anywhere else, we can just skip it entirely:

Suggested change
N = len(self)
# Loop over the triggers sorted by tstart column
for i in range(N):
# Loop over the triggers sorted by tstart column
for i in range(len(self)):

Comment on lines 797 to 798
if window <= 0.0:
raise ValueError('Window must be a positive value')

If both algorithms require this condition, it makes sense to move this up and outside of the if index == "omicron" block.
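A minimal sketch of that suggestion (the method and helper names here are illustrative placeholders, not the actual gwpy API): the shared precondition runs once, before dispatching to the algorithm-specific routine.

```python
# Hypothetical sketch only -- not the gwpy implementation.
# The window check is common to both algorithms, so it is hoisted
# above the dispatch on `index`.
def identify_clusters(table, index, window):
    if window <= 0.0:
        raise ValueError('Window must be a positive value')
    if index == 'omicron':
        return _identify_omicron_clusters(table, window)
    return _identify_downselect_clusters(table, index, window)

def _identify_omicron_clusters(table, window):
    # placeholder standing in for the omicron-style routine
    return ('omicron', window)

def _identify_downselect_clusters(table, index, window):
    # placeholder standing in for the original pooling routine
    return ('downselect', index, window)
```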

cluster_id_tile = []
for i, s in enumerate(ClustersList):
cluster_id_tile.extend([cluster_id[i]] * len(s))
self['clusterID'] = cluster_id_tile

I'm not sure I like this forced addition of a new column; I would rather this method just return the array that identifies the clusters, and users can add it to the table themselves if they wish; however, I can be convinced. But if a column is added, I would prefer it to be called cluster_id (all lower-case, underscore-delimited).

# without reloading the EventTable, the Tstart and Tend
# are the updated ones, not the original ones for the trigger
# which represent a cluster.
triggers = copy.deepcopy(self.clustering(index, window))

This copy seems unnecessary. It makes more sense to me to start the new table as an empty table, then append rows to it as you iterate over the identified clusters.
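A minimal sketch of that approach, using plain dicts in place of table rows (purely illustrative data, not the gwpy implementation):

```python
# Illustrative sketch: build the clustered table by appending one
# representative row per cluster, instead of deep-copying the whole
# input table first.
rows = [
    {'time': 1.00, 'snr': 5.0},
    {'time': 1.05, 'snr': 7.0},
    {'time': 3.00, 'snr': 6.0},
]
clusters = [[0, 1], [2]]  # row indices belonging to each cluster

clustered = []  # the new table starts empty
for cluster in clusters:
    # keep the loudest row (maximum 'snr') from each cluster
    loudest = max(cluster, key=lambda i: rows[i]['snr'])
    clustered.append(rows[loudest])
```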

Comment on lines 524 to 528
assert all(t['time'] == [15.09375, 18.708007, 23.265625, 23.90625])
assert all(t['snr'] == [5.92, 5.18, 5.1, 5.78])
assert all(t['amplitude'] == [6.08, 3.13, 8.93, 1.11])
assert all(t['tstart'] == [15.0625, 18.707031, 23.25, 23.875])
assert all(t['tend'] == [15.125, 18.708985, 23.28125, 24.0])

It might be simpler to use numpy.testing.assert_array_equal or similar here.
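For example, each of the all(...) assertions above could collapse to a single call (reusing one set of expected values from the test); assert_array_equal also reports element-wise mismatches, which gives clearer failure messages:

```python
import numpy
from numpy.testing import assert_array_equal

# Sketch of the suggested rewrite, with expected values copied from
# the test above.
t = {'snr': numpy.array([5.92, 5.18, 5.1, 5.78])}
assert_array_equal(t['snr'], [5.92, 5.18, 5.1, 5.78])
```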

# the omicron method. Might need to use it as well for other method
if index == 'omicron':
triggers['tstart'][maxidx] = tstart_list
triggers['tend'][maxidx] = tend_list

Please add the necessary logic to manipulate all other columns that are modified by the C++ version of this method, including flow/fhigh etc. We should strive to have the python method reproduce the C++ method exactly (including all parameter values).

@dcorre
Author

dcorre commented Jul 16, 2020

I included all of your recommendations.

Apart from codeclimate complaining about some cognitive complexity, it looks fine. I'll let you check the codeclimate outputs to see if it really matters. There is also a flake8 problem, but it is not related to this pull request.

Regarding the reason why I changed the original method so much: it has been quite a while since I made these changes... but I think it was because the original did not allow associating a tile/trigger (i.e. a row) with a given cluster, as I intend to do with these changes.

I'll make some tests, probably tomorrow, to see if I can keep the old format with these changes.

By the way, it would make more sense if the fstart/fend and tstart/tend values of the clusters identified by the original method corresponded to the min/max over the rows included in them, rather than to the values of the row with the highest snr (or whatever quantity you want to maximise over). With these changes it is straightforward; with the original method, I am not sure.
So if there are no particular reasons, apart from the fact that it was easier to code, we could also update this.
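A sketch of what that would look like, with illustrative arrays rather than real trigger data: each cluster's bounds come from all of its member rows, not from the loudest row alone.

```python
import numpy

# Illustrative sketch: compute each cluster's tstart/tend as the
# min/max over all member rows, instead of taking the values of the
# single row with the highest snr.
tstart = numpy.array([15.0625, 15.09, 18.707031])
tend = numpy.array([15.125, 15.20, 18.708985])
clusters = [[0, 1], [2]]  # row indices in each identified cluster

cluster_tstart = [tstart[c].min() for c in clusters]
cluster_tend = [tend[c].max() for c in clusters]
```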

@dcorre
Author

dcorre commented Jul 16, 2020

Actually they are the same method; I just added some checks for missing triggers.

The reason is that I'm using an array, self.clustersList, containing the lists of row indexes included in each cluster. This array is the output of both the omicron and the 'new' original method. Then, in identify_clusters(), each point/row/tile is assigned to a given cluster and stored in self.tile_cluster_id. So I need to associate each point with a cluster, even if it is a single-point cluster.

The original method does not identify successive rows that do not fulfil the window criteria.
To illustrate, here is an output of "padded_sublists" (whose aim is to group rows together):

[array([0, 1, 2]), array([3, 4, 5, 6, 7]), array([8, 9]), array([12, 13, 14, 15, 16]), array([17, 18, 19])]

You can see that rows 10 and 11 are missing here because they do not fulfil the window criteria.

In the original method, this is not a problem thanks to the use of :

mask1 = numpy.ones_like(col, dtype=bool)
mask1[numpy.concatenate(padded_sublists)] = False
mask1[maxidx] = True

So missing points (i.e. individual trigger clusters) are set to True. The problem is that this workaround is not compatible with the new implementation, where I explicitly need a list of arrays containing the points in each cluster, even if it is a single-point cluster.

To summarise, the old and new implementations are based on the same method, but as I want to associate each point in the EventTable with a given cluster, I needed to add some code to search for missing points (which can also be at the beginning or end of the EventTable) and then insert them into "padded_sublists" (i.e. "clustersList" in the new implementation).

Do you have a better solution?
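One possible answer, sketched with a hypothetical helper (not the code in this PR): collect every row index not covered by the grouped sublists and insert it as a singleton cluster, so that each row belongs to exactly one cluster.

```python
import numpy

# Hypothetical sketch: fill the gaps in the grouped indices (as in
# "padded_sublists") with singleton clusters, covering missing rows
# in the middle as well as at either end of the table.
def fill_missing_clusters(sublists, nrows):
    covered = set(numpy.concatenate(sublists)) if sublists else set()
    clusters = list(sublists)
    for i in range(nrows):
        if i not in covered:
            clusters.append(numpy.atleast_1d(i))
    # re-sort clusters by their first member to restore time order
    clusters.sort(key=lambda c: c[0])
    return clusters

filled = fill_missing_clusters(
    [numpy.array([0, 1, 2]), numpy.array([3, 4]), numpy.array([7, 8])],
    nrows=9,
)
# rows 5 and 6 are not in any sublist, so each becomes its own cluster
```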

Comment on lines +723 to +858
if tend[i] > cluster_Tend:
cluster_Tend = tend[i]
# Start new cluster
else:
clustersList.append(singleClusterIdxList)
singleClusterIdxList = []
singleClusterIdxList.append(i)
cluster_Tend = tend[i]

# Append last cluster list of indexes
clustersList.append(singleClusterIdxList)

self.clustersList = clustersList

# keep track of the order indexes
self.orderidx = orderidx

return True

def _identify_downselect_clusters(self, index, window):
""" Identify clusters of this `EventTable` over a given column,
`index`.

The clustering algorithm uses a pooling method to identify groups
of points that are all separated in `index` by less than `window`.

Add the list of identified clusters to `self`.

Parameters
----------
index : `str`
name of the column which is used to search for clusters.
If index == 'omicron': use the omicron clustering algorithm

window : `float`
window to use when clustering data points, will raise
ValueError if `window > 0` is not satisfied

Returns
-------
None

"""

# sort table by the required column
orderidx = numpy.argsort(self[index])
col = self[index][orderidx]

# Find all points where the index vector changes by less than
# window and divide the resulting array into clusters of
# adjacent points
clusterpoints = numpy.where(numpy.diff(col) <= window)[0]
sublists = numpy.split(clusterpoints,
numpy.where(
numpy.diff(clusterpoints) > 1)[0]+1)

# Add end-points to each cluster and find the index of the maximum
# point in each list
clustersList = [numpy.append(s, numpy.array([s[-1]+1]))
for s in sublists]
# Find triggers that are between two clusters
# Happens when successive triggers do not fill the window
# criteria. N triggers missing --> add N clusters
missing_trigger = []
missing_trigger_idx = []
for i in range(len(clustersList)-1):
if clustersList[i+1][0] - clustersList[i][-1] > 1:
missing_trigger_idx.append(i+1)
missing_trigger.append(numpy.arange(
clustersList[i][-1]+1,
clustersList[i+1][0]))
# Insert them in the cluster list
# Need to reverse the list, otherwise the list size update
# is a problem
for i in range(len(missing_trigger))[::-1]:
for j in range(len(missing_trigger[i])):
clustersList.insert(missing_trigger_idx[i]+j,
numpy.atleast_1d(
missing_trigger[i][j]))

# Check if there are some missing points at the beginning
# and end of this list
if clustersList[0][0] != 0:
missing_trigger = numpy.arange(0, clustersList[0][0])
for i in range(len(missing_trigger)):
clustersList.insert(i,
numpy.atleast_1d(missing_trigger[i]))
if clustersList[-1][-1] != len(self):
missing_trigger = numpy.arange(clustersList[-1][-1]+1,
len(self))
for i in range(len(missing_trigger)):
clustersList.insert(len(clustersList)+i,
numpy.atleast_1d(missing_trigger[i]))

self.clustersList = clustersList

# keep track of the order indexes
self.orderidx = orderidx

return True

It makes more sense to me to have these functions return the list of clusters, rather than storing them in another internal property, since the list is only ever used internal to the identify_clusters method.

Can you please consider that change?

@duncanmmacleod
Member

@dcorre, what's the status of this work? Can you please rebase against the upstream master branch to refresh things, and then address the comments above?

@eagoetz
Contributor

eagoetz commented Aug 3, 2021

> @dcorre, what's the status of this work? Can you please rebase against the upstream master branch to refresh things, and then address the comments above?

@dcorre A gentle nudge on this PR.

@duncanmmacleod duncanmmacleod modified the milestones: 2.1.0, 2.2.0 Aug 24, 2021
@duncanmmacleod
Member

Ping @dcorre.

@duncanmmacleod duncanmmacleod marked this pull request as draft February 3, 2022 14:45
@duncanmmacleod duncanmmacleod modified the milestones: 2.2.0, 3.0.0 Mar 17, 2022
@duncanmmacleod
Member

@dcorre, what is the status of this work? I'm sorry that I wasn't able to devote enough time to it when you originally proposed it.

@duncanmmacleod duncanmmacleod modified the milestones: 3.0.0, 3.1.0 Oct 2, 2022