e_forward() not returning results without IDs named 'src' and 'dest'
Describe the bug
Customer reported bug with GFQL's e_forward() not returning the correct results unless the edge df IDs are named 'src' and 'dest'
To Reproduce
import graphistry
import pandas as pd
from graphistry import (
# graph operators
n, e_undirected, e_forward, e_reverse, e,
# attribute predicates
is_in, ge, startswith, contains, match as match_re
)
import pandas as pd
graphistry.register(...)
edges_df = pd.read_csv('blueprint_edges.csv')
nodes_df = pd.read_csv('blueprint_nodes.csv')
edges_df['src'] = edges_df['parent_elid']
edges_df['dest'] = edges_df['elid']
# create two different graphs with the same cols, just named differently:
g = graphistry.edges(edges_df, 'parent_elid', 'elid').nodes(nodes_df, 'elid')
g2 = graphistry.edges(edges_df, 'src', 'dest').nodes(nodes_df, 'elid')
# Test:
# expected behavior: be able to use e_forward to get edges
# observed behavior: get no edges back unless using e() or e_undirected()
def test_eforward(g):
guc = g.chain([
n({"elid":"905e3174aa"}),
e_forward(),
n()
])
print('nodes:\n', guc._nodes, '\n' )
print('edges:\n', guc._edges)
test_eforward(g)
test_eforward(g2)
# notice that the first call with 'g' does not return any edges, where the call with 'g2' does
Expected behavior
be able to use e_forward to get edges
Actual behavior
get no edges back unless using e() or e_undirected(), or change the IDs to src and dest
Screenshot showing differences of counts with same column contents for IDs, but different names:
Graphistry GPU server environment
Hub v2.41.10
PyGraphistry API client environment
- Where run
Jupyter Lab local - Version
0.34.17 - Python Version
Python 3.8.5
@DataBoyTX any way to know if they have additional surprisingly named columns? we don't assume hard-coded names, so the issue is generally excess results, not missing...
@lmeyerov - I am not sure what you mean by surprisingly named columns, the column names that don't work are: parent_elid for src ID and elid for dest ID, that returns 0 results, but using src and dest IDs named src and dest does work.
Also curious what you mean by we don't assume hard-coded names?
elid is not a variable name in this repo so I don't know why it would collide, so I'm wondering if their data has more columns , and the bug report does not provide the full schema
@lmeyerov - sorry I missed this comment. I added the link to the dataset above a few weeks ago, there's elid in both the files, and the source code (renamed to src/dest for the edges to show that it works), but it's definitely there.
data file (also in description above): link
$ ls -lh
total 548K
-rwxrwxrwx 1 root root 118K Dec 10 00:58 blueprint_edges.csv
-rwxrwxrwx 1 root root 426K Dec 10 00:58 blueprint_nodes.csv
$ head -n 1 *
==> blueprint_edges.csv <==
elid,parent_elid,depth,bottom_flag,top_flag
==> blueprint_nodes.csv <==
elid,ui_role,bottom_flag
This is fascinating, I think the actual issue is when the node id col matches one of the edge src/dst col names:
g = graphistry.nodes(df1, n).edges(df2, e1, e2)
assert g._node != g._source
assert g._node != g._destination
This is surprising behavior, so digging in
Yep that's it:
=================================== FAILURES ===================================
____________________________ test_hop_binding_reuse ____________________________
def test_hop_binding_reuse():
edges_df = pd.DataFrame({'s': ['a', 'b'], 'd': ['b', 'c']})
nodes1_df = pd.DataFrame({'v': ['a', 'b', 'c']})
nodes2_df = pd.DataFrame({'s': ['a', 'b', 'c']})
nodes3_df = pd.DataFrame({'d': ['a', 'b', 'c']})
g1 = CGFull().nodes(nodes1_df, 'v').edges(edges_df, 's', 'd')
g2 = CGFull().nodes(nodes2_df, 's').edges(edges_df, 's', 'd')
g3 = CGFull().nodes(nodes3_df, 'd').edges(edges_df, 's', 'd')
g1_hop = g1.hop()
g2_hop = g2.hop()
g3_hop = g3.hop()
assert g1_hop._nodes.shape == g2_hop._nodes.shape
assert g1_hop._edges.shape == g2_hop._edges.shape
> assert g1_hop._nodes.shape == g3_hop._nodes.shape
E assert (3, 1) == (2, 1)
E
E At index 0 diff: 3 != 2
E
E Full diff:
E (
E - 2,
E ? ^
E + 3,
E ? ^
E 1,
E )
This can likely be solved through renaming src/dst col names during .hop() (see test in https://github.com/graphistry/pygraphistry/pull/623) . It's a bit tricky because predicates may use the original column name. I gave it a go and was a bit trickier than hoped, so backed out and switched to an interim workaround of detecting the issue and raising an informative exception.
Longer-term, we should be using symbol tables for names to avoid this kind of issue.