Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sum edges merge strategy gives surprising results #2038

Closed
kortschak opened this issue Oct 12, 2018 · 9 comments
Closed

Sum edges merge strategy gives surprising results #2038

kortschak opened this issue Oct 12, 2018 · 9 comments

Comments

@kortschak
Copy link

Expected Behavior

The "Sum" merge strategy would give a merged edge that has an edge weight that is the sum of the edge weights that are merged and has the union of the attributes of the merged edges.

Current Behavior

Edge weights are not summed and edge attributes are serially overwritten.

Possible Solution

Given that the last attvalue is reflected in the edge, it appears that the container for edge attributes is not being extended on read.

Steps to Reproduce

  1. Load a graph from the following GEFX
<?xml version="1.0" encoding="UTF-8"?>

<gexf xmlns="http://www.gexf.net/1.2draft" version="1.2">
	<graph timeformat="dateTime" defaultedgetype="undirected" mode="dynamic">
		<attributes class="edge" mode="dynamic">
			<attribute id="mid" title="message-ID" type="string"></attribute>
		</attributes>
		<nodes count="2">
			<node id="5527" label="[email protected]"></node>
			<node id="5554" label="[email protected]"></node>
		</nodes>
		<edges count="5">
			<edge id="561202" start="2007-05-30T16:09:03" end="2007-05-30T16:09:03" source="5554" target="5527">
				<attvalues>
					<attvalue for="mid" value="&lt;F78F4A7A&gt;" start="2007-05-30T16:09:03" end="2007-05-30T16:09:03"></attvalue>
				</attvalues>
			</edge>
			<edge id="595403" start="2007-06-22T16:28:37" end="2007-06-22T16:28:37" source="5554" target="5527">
				<attvalues>
					<attvalue for="mid" value="&lt;F3F36A76&gt;" start="2007-06-22T16:28:37" end="2007-06-22T16:28:37"></attvalue>
				</attvalues>
			</edge>
			<edge id="557989" start="2007-05-05T15:14:13" end="2007-05-05T15:14:13" source="5554" target="5527">
				<attvalues>
					<attvalue for="mid" value="&lt;243A4AA0&gt;" start="2007-05-05T15:14:13" end="2007-05-05T15:14:13"></attvalue>
				</attvalues>
			</edge>
			<edge id="559298" start="2007-05-18T15:28:54" end="2007-05-18T15:28:54" source="5554" target="5527">
				<attvalues>
					<attvalue for="mid" value="&lt;98DDF62C&gt;" start="2007-05-18T15:28:54" end="2007-05-18T15:28:54"></attvalue>
				</attvalues>
			</edge>
			<edge id="560765" start="2007-05-27T10:58:26" end="2007-05-27T10:58:26" source="5554" target="5527">
				<attvalues>
					<attvalue for="mid" value="&lt;3626B2EC&gt;" start="2007-05-27T10:58:26" end="2007-05-27T10:58:26"></attvalue>
				</attvalues>
			</edge>
		</edges>
	</graph>
</gexf>
  1. Choose "Sum" edge merge strategy.
  2. Look at edges in data laboratory.
  3. Observe an edge weight of 1.0 rather than 5.0, and only a single attvalue (the last in the GEXF document) rather than 5.

Context

I have an mbox contacts construction program here that writes out a GEXF formatted email contact graph with message-ID and date information annotated onto the edges. The edge weight is used to perform community detection and Page Rank on the nodes, but the edge weights here are underestimated and the message-IDs are lost.

Your Environment

>System Info: 
  Product Version         = Gephi 0.9.2 201709241107
  Operating System        = Linux version 4.4.0-137-generic running on amd64
  Java; VM; Vendor        = 1.8.0_181; OpenJDK 64-Bit Server VM 25.181-b13; Oracle Corporation
  Runtime                 = OpenJDK Runtime Environment 1.8.0_181-8u181-b13-0ubuntu0.16.04.1-b13
  Java Home               = /usr/lib/jvm/java-8-openjdk-amd64/jre
  System Locale; Encoding = en_AU (gephi); UTF-8
@eduramiba
Copy link
Member

Hi,
I think these edges are not merged because they are not equivalent: they exist in different time periods.
And their intervals are being correctly merged, as you can see in the resulting edge, it has all the intervals.

If the edges don't have an interval, their weights will be summed, if there are intervals/timestamps of existence, then the intervals/timestamps will be merged instead.

@kortschak
Copy link
Author

There are multiple possible semantics here. The implementation that exists doesn't cover the two that are arguably the most reasonable.

  1. The edges are not equivalent and and not merged meaning that they should be normalised over the actual complete interval, resulting in sub-unit weights for all the edges. This is what is almost what is approximated except that there is no weight normalisation.
  2. The edges are equivalent and should be summed (possibly weighted).

Given that there are multiple semantics, the behaviour should be properly documented and probably there should be alternatives to how the edge merge is performed.

By way of background. Under the analysis being performed the edges are indeed equivalent, representing multiple interactions between the addresses in the mail history.

Note that the bug also includes serial overwrites of existing data; if the edges are not equivalent this should not be happening. It is.

@kortschak
Copy link
Author

Please remove the Not an issue label, it is incorrect.

@eduramiba
Copy link
Member

We will need to review the behavior in each case, but I guess that when merged, weight values can only be summed if the intervals overlap/are the same.

The overwriting of non-dynamic values seems unavoidable given that edges are chosen to be merged. You can still choose not to merge them and will have the original edges.

@kortschak
Copy link
Author

At the moment, data is being lost, but not in all cases. It looks to me very much like the intended behaviour was what I originally expected. The intervals are merged when the edge merge is done — each individual spell is listed in the spells. The attributes are not, despite the fact that the spec describes attributes as unbounded list items.

Not merging the edges is not an option. The graphs I am working with have on the order of half a million edges and I am finding that Gephi is unable to do layout on these effectively unless there is edge flattening. If I performing the edge flattening in the code that is actually constructing the graph, Gephi complains that attributes are overlapping in time (despite being attached to different edges).

@kortschak
Copy link
Author

To illustrate, the following is the GEXF export of the summed data from the fille above:

<?xml version="1.0" encoding="UTF-8"?>
<gexf xmlns="http://www.gexf.net/1.3" version="1.3" xmlns:viz="http://www.gexf.net/1.3/viz" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.gexf.net/1.3 http://www.gexf.net/1.3/gexf.xsd">
  <meta lastmodifieddate="2018-10-27">
    <creator>Gephi 0.9</creator>
    <description></description>
  </meta>
  <graph defaultedgetype="undirected" timeformat="datetime" timerepresentation="interval" mode="dynamic">
    <attributes class="edge" mode="dynamic">
      <attribute id="mid" title="message-ID" type="string"></attribute>
    </attributes>
    <nodes>
      <node id="5527" label="[email protected]">
        <viz:size value="10.0"></viz:size>
        <viz:position x="-27.991882" y="-74.60654"></viz:position>
      </node>
      <node id="5554" label="[email protected]">
        <viz:size value="10.0"></viz:size>
        <viz:position x="27.991882" y="74.60654"></viz:position>
      </node>
    </nodes>
    <edges>
      <edge id="561202" source="5554" target="5527">
        <spells>
          <spell start="2007-05-06T00:44:13.000+09:30" end="2007-05-06T00:44:13.000+09:30"></spell>
          <spell start="2007-05-19T00:58:54.000+09:30" end="2007-05-19T00:58:54.000+09:30"></spell>
          <spell start="2007-05-27T20:28:26.000+09:30" end="2007-05-27T20:28:26.000+09:30"></spell>
          <spell start="2007-05-31T01:39:03.000+09:30" end="2007-05-31T01:39:03.000+09:30"></spell>
          <spell start="2007-06-23T01:58:37.000+09:30" end="2007-06-23T01:58:37.000+09:30"></spell>
        </spells>
        <attvalues>
          <attvalue for="mid" value="&lt;3626B2EC&gt;" start="2007-05-27T20:28:26.000+09:30" end="2007-05-27T20:28:26.000+09:30"></attvalue>
        </attvalues>
      </edge>
    </edges>
  </graph>
</gexf>

It would be perfectly reasonable to expect it to end up looking like this

<?xml version="1.0" encoding="UTF-8"?>
<gexf xmlns="http://www.gexf.net/1.3" version="1.3" xmlns:viz="http://www.gexf.net/1.3/viz" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.gexf.net/1.3 http://www.gexf.net/1.3/gexf.xsd">
  <meta lastmodifieddate="2018-10-27">
    <creator>Gephi 0.9</creator>
    <description></description>
  </meta>
  <graph defaultedgetype="undirected" timeformat="datetime" timerepresentation="interval" mode="dynamic">
    <attributes class="edge" mode="dynamic">
      <attribute id="mid" title="message-ID" type="string"></attribute>
    </attributes>
    <nodes>
      <node id="5527" label="[email protected]">
        <viz:size value="10.0"></viz:size>
        <viz:position x="-27.991882" y="-74.60654"></viz:position>
      </node>
      <node id="5554" label="[email protected]">
        <viz:size value="10.0"></viz:size>
        <viz:position x="27.991882" y="74.60654"></viz:position>
      </node>
    </nodes>
    <edges>
      <edge id="561202" source="5554" target="5527">
        <spells>
          <spell start="2007-05-06T00:44:13.000+09:30" end="2007-05-06T00:44:13.000+09:30"></spell>
          <spell start="2007-05-19T00:58:54.000+09:30" end="2007-05-19T00:58:54.000+09:30"></spell>
          <spell start="2007-05-27T20:28:26.000+09:30" end="2007-05-27T20:28:26.000+09:30"></spell>
          <spell start="2007-05-31T01:39:03.000+09:30" end="2007-05-31T01:39:03.000+09:30"></spell>
          <spell start="2007-06-23T01:58:37.000+09:30" end="2007-06-23T01:58:37.000+09:30"></spell>
        </spells>
        <attvalues>
          <attvalue for="mid" value="&lt;F78F4A7A&gt;" start="2007-05-30T16:09:03" end="2007-05-30T16:09:03"></attvalue>
          <attvalue for="mid" value="&lt;F3F36A76&gt;" start="2007-06-22T16:28:37" end="2007-06-22T16:28:37"></attvalue>
          <attvalue for="mid" value="&lt;243A4AA0&gt;" start="2007-05-05T15:14:13" end="2007-05-05T15:14:13"></attvalue>
          <attvalue for="mid" value="&lt;98DDF62C&gt;" start="2007-05-18T15:28:54" end="2007-05-18T15:28:54"></attvalue>
          <attvalue for="mid" value="&lt;3626B2EC&gt;" start="2007-05-27T10:58:26" end="2007-05-27T10:58:26"></attvalue>
        </attvalues>
      </edge>
    </edges>
  </graph>
</gexf>

@mbastian mbastian added this to the 0.9.3 milestone Jan 5, 2022
@mbastian
Copy link
Member

mbastian commented Mar 7, 2022

Hi, thanks for the detailed case. I've reviewed it and my conclusion is that the weight should indeed be merged. For static weights, we generally define edge existence as a triplet (source, target, type). Given that your GEXF doesn't specify the weight column, it's implied that weight is static (default value) and therefore it should be merged.

With this change, we would still give two options for the users not wanting to merge

  1. Select "Don't merge" as merge strategy
  2. Define the edge weight column as dynamic

I hope I'm getting this right. let me know if I'm missing something.

Regarding the attributes not being merged correctly, this was a separate issue. Resolved by now.

@mbastian mbastian assigned mbastian and unassigned eduramiba Mar 7, 2022
@kortschak
Copy link
Author

Yeah, it's been three years since I was thinking about this. That's probably right.

mbastian added a commit that referenced this issue Mar 7, 2022
@mbastian
Copy link
Member

mbastian commented Mar 7, 2022

Thanks for your patience! The fix will be included in the upcoming 0.9.3 release.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants