pygraphistry icon indicating copy to clipboard operation
pygraphistry copied to clipboard

e_forward() not returning results without IDs named 'src' and 'dest'

Open DataBoyTX opened this issue 1 year ago • 7 comments

Describe the bug

Customer reported bug with GFQL's e_forward() not returning the correct results unless the edge df IDs are named 'src' and 'dest'

To Reproduce

import graphistry
import pandas as pd
from graphistry import (
    # graph operators
    n, e_undirected, e_forward, e_reverse, e,
    # attribute predicates
    is_in, ge, startswith, contains, match as match_re
)
import pandas as pd

graphistry.register(...)

edges_df = pd.read_csv('blueprint_edges.csv')
nodes_df = pd.read_csv('blueprint_nodes.csv')

edges_df['src'] = edges_df['parent_elid']
edges_df['dest'] = edges_df['elid']

# create two different graphs with the same cols, just named differently: 
g = graphistry.edges(edges_df, 'parent_elid', 'elid').nodes(nodes_df, 'elid') 
g2 = graphistry.edges(edges_df, 'src', 'dest').nodes(nodes_df, 'elid') 

# Test: 
# expected behavior: be able to use e_forward to get edges
# observed behavior: get no edges back unless using e() or e_undirected()
def test_eforward(g):
    guc = g.chain([
        n({"elid":"905e3174aa"}),
        e_forward(),
        n()
    ])

    print('nodes:\n', guc._nodes, '\n' )
    print('edges:\n', guc._edges)


test_eforward(g)
test_eforward(g2)

# notice that the first call with 'g' does not return any edges, where the call with 'g2' does 

blueprint_nodes.zip

Expected behavior

be able to use e_forward to get edges

Actual behavior

get no edges back unless using e() or e_undirected(), or change the IDs to src and dest

Screenshot showing differences of counts with same column contents for IDs, but different names:

image

Graphistry GPU server environment

Hub v2.41.10

PyGraphistry API client environment

  • Where run Jupyter Lab local
  • Version 0.34.17
  • Python Version Python 3.8.5

DataBoyTX avatar Nov 20 '24 18:11 DataBoyTX

@DataBoyTX any way to know if they have additional surprisingly named columns? we don't assume hard-coded names, so the issue is generally excess results, not missing...

lmeyerov avatar Nov 20 '24 18:11 lmeyerov

@lmeyerov - I am not sure what you mean by surprisingly named columns, the column names that don't work are: parent_elid for src ID and elid for dest ID, that returns 0 results, but using src and dest IDs named src and dest does work.

Also curious what you mean by we don't assume hard-coded names?

DataBoyTX avatar Nov 26 '24 16:11 DataBoyTX

elid is not a variable name in this repo so I don't know why it would collide, so I'm wondering if their data has more columns , and the bug report does not provide the full schema

lmeyerov avatar Nov 26 '24 16:11 lmeyerov

@lmeyerov - sorry I missed this comment. I added the link to the dataset above a few weeks ago, there's elid in both the files, and the source code (renamed to src/dest for the edges to show that it works), but it's definitely there.

data file (also in description above): link

$  ls -lh
total 548K
-rwxrwxrwx 1 root root 118K Dec 10 00:58 blueprint_edges.csv
-rwxrwxrwx 1 root root 426K Dec 10 00:58 blueprint_nodes.csv

$  head -n 1 *
==> blueprint_edges.csv <==
elid,parent_elid,depth,bottom_flag,top_flag

==> blueprint_nodes.csv <==
elid,ui_role,bottom_flag

DataBoyTX avatar Dec 10 '24 07:12 DataBoyTX

This is fascinating, I think the actual issue is when the node id col matches one of the edge src/dst col names:

g = graphistry.nodes(df1, n).edges(df2, e1, e2)
assert g._node != g._source
assert g._node != g._destination

This is surprising behavior, so digging in

lmeyerov avatar Dec 24 '24 06:12 lmeyerov

Yep that's it:

=================================== FAILURES ===================================
____________________________ test_hop_binding_reuse ____________________________

    def test_hop_binding_reuse():
        edges_df = pd.DataFrame({'s': ['a', 'b'], 'd': ['b', 'c']})
        nodes1_df = pd.DataFrame({'v': ['a', 'b', 'c']})
        nodes2_df = pd.DataFrame({'s': ['a', 'b', 'c']})
        nodes3_df = pd.DataFrame({'d': ['a', 'b', 'c']})
    
        g1 = CGFull().nodes(nodes1_df, 'v').edges(edges_df, 's', 'd')
        g2 = CGFull().nodes(nodes2_df, 's').edges(edges_df, 's', 'd')
        g3 = CGFull().nodes(nodes3_df, 'd').edges(edges_df, 's', 'd')
    
        g1_hop = g1.hop()
        g2_hop = g2.hop()
        g3_hop = g3.hop()
    
        assert g1_hop._nodes.shape == g2_hop._nodes.shape
        assert g1_hop._edges.shape == g2_hop._edges.shape
>       assert g1_hop._nodes.shape == g3_hop._nodes.shape
E       assert (3, 1) == (2, 1)
E         
E         At index 0 diff: 3 != 2
E         
E         Full diff:
E           (
E         -     2,
E         ?     ^
E         +     3,
E         ?     ^
E               1,
E           )

lmeyerov avatar Dec 24 '24 07:12 lmeyerov

This can likely be solved through renaming src/dst col names during .hop() (see test in https://github.com/graphistry/pygraphistry/pull/623) . It's a bit tricky because predicates may use the original column name. I gave it a go and was a bit trickier than hoped, so backed out and switched to an interim workaround of detecting the issue and raising an informative exception.

Longer-term, we should be using symbol tables for names to avoid this kind of issue.

lmeyerov avatar Dec 24 '24 09:12 lmeyerov