S4Vectors

merging DFrames

Open lgatto opened this issue 5 years ago • 4 comments

This issue is a follow-up to this email on the bioc-devel mailing list.

When merging DFrame instances, the *List types are lost:

The following two instances each have a NumericList column (y in d1, z in d2):

> d1 <- DataFrame(x = letters[1:3], y = List(1, 1:2, 1:3))
> d2 <- DataFrame(x = letters[1:3], z = List(1:3, 1:2, 1))

These are, however, converted to plain list columns when merged:

> merge(d1, d2, by = "x")
## DataFrame with 3 rows and 3 columns
##             x      y      z
##   <character> <list> <list>
## 1           a      1  1,2,3
## 2           b    1,2    1,2
## 3           c  1,2,3      1
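One possible workaround, until merge() itself preserves the column classes, is to compute the row matching separately and then subset each DFrame by row. The following is only a sketch under assumptions: it handles a single key column, does an inner join, and keeps only the first match per row. The variable names (key, j, keep, merged) are illustrative and not part of S4Vectors.

```r
# Assumes the Bioconductor packages S4Vectors and IRanges are installed;
# the block does nothing otherwise.
if (requireNamespace("S4Vectors", quietly = TRUE) &&
    requireNamespace("IRanges", quietly = TRUE)) {
    suppressPackageStartupMessages({
        library(S4Vectors)
        library(IRanges)
    })
    d1 <- DataFrame(x = letters[1:3], y = List(1, 1:2, 1:3))
    d2 <- DataFrame(x = letters[1:3], z = List(1:3, 1:2, 1))
    key <- "x"  # illustrative: a single key column

    # Match rows of d1 to rows of d2 on the key (first match only).
    j <- match(d1[[key]], d2[[key]])
    keep <- !is.na(j)

    # Subset rows, then bind the non-key columns of d2 onto d1.
    merged <- cbind(d1[keep, , drop = FALSE],
                    d2[j[keep], setdiff(names(d2), key), drop = FALSE])

    # Row subsetting leaves the class of each column unchanged.
    class(merged$y)
}
```

Because the columns are only subset and never rebuilt, their classes are the same as in the inputs.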

I would be happy to help out with some guidance. @lawremi already mentioned:

There's an opportunity to implement faster matching than base::merge(), using stuff like matchIntegerQuads(), findMatches(), and grouping().

grouping() can be really fast for character vectors, since it takes advantage of string internalization. For example, let's say you're merging on three character vector keys. Concatenate the keys of 'y' onto the keys of 'x'. Then call grouping(k1, k2, k3) and you effectively have a matching. Should be way faster than the paste() approach used by base::merge(). Would be interesting to see.
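A minimal sketch of that idea, using base R's grouping() on two hypothetical character keys instead of three. The key vectors and variable names are made up for illustration, and this is one reading of the suggestion rather than a tested implementation for S4Vectors.

```r
# Hypothetical keys: two character key columns for 'x' (3 rows) and 'y' (3 rows).
x_k1 <- c("a", "b", "c"); x_k2 <- c("u", "v", "w")
y_k1 <- c("c", "a", "z"); y_k2 <- c("w", "u", "u")
nx <- length(x_k1)

# Concatenate the keys of 'y' onto the keys of 'x', column by column.
k1 <- c(x_k1, y_k1)
k2 <- c(x_k2, y_k2)

# grouping() returns a permutation that places identical key tuples next to
# each other; the "ends" attribute marks where each group ends.
g <- grouping(k1, k2)
ends <- attr(g, "ends")

# Assign one integer group id per row of the concatenated keys.
group_id <- integer(length(g))
group_id[as.integer(g)] <- rep(seq_along(ends), times = diff(c(0L, ends)))

# Rows of 'x' and 'y' match when they share a group id.
x_to_y <- match(group_id[seq_len(nx)], group_id[-seq_len(nx)])
x_to_y
```

Here x_to_y holds, for each row of 'x', the index of its first matching row in 'y', or NA when there is none. A full merge would also need to handle multiple matches per key.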

I'll have a look at these functions and report back here if I have any questions or leads.

lgatto, Oct 21 '20 18:10

I am still miles away from any speed considerations, but already have some questions. The code for this is available here:

Simple case: the by columns are vectors - WORKS

> d1 <- DataFrame(a = 1:3, x = letters[1:3], y = List(1, 1:2, 1:3))
> d2 <- DataFrame(a = 1:3, x = letters[1:3], z = List(1:3, 1:2, 1))
> S4Vectors::mergeDFrame(d1, d2)
DataFrame with 3 rows and 4 columns
          a           x             y             z
  <integer> <character> <NumericList> <NumericList>
1         1           a             1         1,2,3
2         2           b           1,2           1,2
3         3           c         1,2,3             1
> S4Vectors::mergeDFrame(d1, d2, by = "a", suffixes = c("_x", "_y"))
DataFrame with 3 rows and 5 columns
          a         x_x             y         x_y             z
  <integer> <character> <NumericList> <character> <NumericList>
1         1           a             1           a         1,2,3
2         2           b           1,2           b           1,2
3         3           c         1,2,3           c             1

Simple case: the by columns are a vector and Lists of the same type - WORKS

> d1 <- DataFrame(a = 1:3, x = List(letters[1:3]), y = List(1, 1:2, 1:3))
> d2 <- DataFrame(a = 1:3, x = List(letters[1:3]), z = List(1:3, 1:2, 1))
> S4Vectors::mergeDFrame(d1, d2)
DataFrame with 3 rows and 4 columns
          a               x             y             z
  <integer> <CharacterList> <NumericList> <NumericList>
1         1           a,b,c             1         1,2,3
2         2           a,b,c           1,2           1,2
3         3           a,b,c         1,2,3             1
> S4Vectors::mergeDFrame(d2, d1)
DataFrame with 3 rows and 4 columns
          a               x             z             y
  <integer> <CharacterList> <NumericList> <NumericList>
1         1           a,b,c         1,2,3             1
2         2           a,b,c           1,2           1,2
3         3           a,b,c             1         1,2,3
> S4Vectors::mergeDFrame(d1, d2, by = "a", suffixes = c("_x", "_y"))
DataFrame with 3 rows and 5 columns
          a             x_x             y             x_y             z
  <integer> <CharacterList> <NumericList> <CharacterList> <NumericList>
1         1           a,b,c             1           a,b,c         1,2,3
2         2           a,b,c           1,2           a,b,c           1,2
3         3           a,b,c         1,2,3           a,b,c             1

Lists and lists - this should probably fail!!

> d1 <- DataFrame(a = 1:3, x = List(letters[1:3]), y = List(1, 1:2, 1:3))
> d2 <- DataFrame(a = 1:3, x = List(letters[1:3]), z = List(1:3, 1:2, 1))
> d2$x <- rep(list(letters[1:3]), 3)
> S4Vectors::mergeDFrame(d1, d2) ## x CharacterList
DataFrame with 3 rows and 4 columns
          a               x             y             z
  <integer> <CharacterList> <NumericList> <NumericList>
1         1           a,b,c             1         1,2,3
2         2           a,b,c           1,2           1,2
3         3           a,b,c         1,2,3             1
> S4Vectors::mergeDFrame(d2, d1) ## x is list
DataFrame with 3 rows and 4 columns
          a      x             z             y
  <integer> <list> <NumericList> <NumericList>
1         1  a,b,c         1,2,3             1
2         2  a,b,c           1,2           1,2
3         3  a,b,c             1         1,2,3

Lists of different types - this should fail too!!

> d1 <- DataFrame(a = 1:3, x = List(as.character(1:3)), y = List(1, 1:2, 1:3))
> d2 <- DataFrame(a = 1:3, x = List(1:3), z = List(1:3, 1:2, 1))
> S4Vectors::mergeDFrame(d1, d2) ## x is CharacterList
DataFrame with 3 rows and 4 columns
          a               x             y             z
  <integer> <CharacterList> <NumericList> <NumericList>
1         1           1,2,3             1         1,2,3
2         2           1,2,3           1,2           1,2
3         3           1,2,3         1,2,3             1
> S4Vectors::mergeDFrame(d2, d1) ## x is IntegerList
DataFrame with 3 rows and 4 columns
          a             x             z             y
  <integer> <IntegerList> <NumericList> <NumericList>
1         1         1,2,3         1,2,3             1
2         2         1,2,3           1,2           1,2
3         3         1,2,3             1         1,2,3

From this I conclude that by should not rely on names alone, but should also make sure that the columns with matching names are of the same class.

Update: this actually also happens for vectors of different modes in data.frames:

> d1 <- data.frame(a = 1:3, x = letters[1:3])
> d2 <- data.frame(a = as.character(1:3), y = letters[1:3])
> str(merge(d1, d2))
'data.frame':	3 obs. of  3 variables:
 $ a: int  1 2 3
 $ x: chr  "a" "b" "c"
 $ y: chr  "a" "b" "c"
> str(merge(d2, d1))
'data.frame':	3 obs. of  3 variables:
 $ a: chr  "1" "2" "3"
 $ y: chr  "a" "b" "c"
 $ x: chr  "a" "b" "c"

lgatto, Oct 21 '20 21:10

I would prefer a behaviour like this one, where an error is thrown if the columns used for merging are of different classes:

> d1 <- DataFrame(a = 1:3, x = List(as.character(1:3)), y = List(1, 1:2, 1:3))
> d2 <- DataFrame(a = 1:3, x = List(1:3), z = List(1:3, 1:2, 1))
> S4Vectors::mergeDFrame(d1, d2)
Error in S4Vectors::mergeDFrame(d1, d2) : 
  The matching columns in 'x' and 'y' used for merging must be of identical classes.
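Such a check could look roughly like the sketch below. The helper name check_by_classes() and its error message are hypothetical and are not taken from the mergeDFrame() code. The example uses base data.frames so it runs without Bioconductor, but it only relies on names() and [[, so it should apply to DFrames as well.

```r
# Hypothetical helper: stop if any 'by' column differs in class between x and y.
check_by_classes <- function(x, y, by = intersect(names(x), names(y))) {
    same <- vapply(by, function(nm) {
        identical(class(x[[nm]]), class(y[[nm]]))
    }, logical(1))
    if (!all(same))
        stop("columns used for merging must be of identical classes: ",
             paste(by[!same], collapse = ", "))
    invisible(TRUE)
}

d1 <- data.frame(a = 1:3, x = letters[1:3])
d2 <- data.frame(a = as.character(1:3), y = letters[1:3])

# 'a' is integer in d1 but character in d2, so the check fails.
res <- tryCatch(check_by_classes(d1, d2),
                error = function(e) conditionMessage(e))
res
```

Comparing class() with identical() is deliberately strict: an integer key and a numeric key would also be rejected, which may or may not be desirable.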

lgatto, Oct 21 '20 22:10

@lawremi @hpages - I would like to get some feedback on the current situation and future steps:

  • The current mergeDFrame() implementation is essentially the default merge,data.frame,data.frame method adapted to work with list and vector columns, plus the check that the by columns have identical classes (see comment above).
  • This has the advantage that the behaviour matches merge,data.frame,data.frame quite closely, at the cost of efficiency (see @lawremi's comments at the top). It seems at least a sensible starting point.
  • There is a call to .Internal(merge(...)), which is only for true R wizards (the package checker warns me that I don't qualify).

lgatto, Oct 29 '20 14:10

@hpages any thoughts on this?

vjcitn, Oct 03 '21 10:10