merging DFrames
This issue is a follow-up to this email on the bioc-devel mailing list.
When merging DFrame instances, the *List types are lost:
The following two instances have NumericList columns (y and z):
> d1 <- DataFrame(x = letters[1:3], y = List(1, 1:2, 1:3))
> d2 <- DataFrame(x = letters[1:3], z = List(1:3, 1:2, 1))
These are, however, converted to list when merged:
> merge(d1, d2, by = "x")
## DataFrame with 3 rows and 3 columns
## x y z
## <character> <list> <list>
## 1 a 1 1,2,3
## 2 b 1,2 1,2
## 3 c 1,2,3 1
I would be happy to help out with some guidance. @lawremi already mentioned:
There's an opportunity to implement faster matching than base::merge(), using stuff like matchIntegerQuads(), findMatches(), and grouping().
grouping() can be really fast for character vectors, since it takes advantage of string internalization. For example, let's say you're merging on three character vector keys. Concatenate the keys of 'y' onto the keys of 'x'. Then call grouping(k1, k2, k3) and you effectively have a matching. Should be way faster than the paste() approach used by base::merge(). Would be interesting to see.
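To make the grouping() suggestion concrete, here is a rough base R sketch of how I read it. The function name match_by_grouping() and its arguments are made up for illustration, and it relies on the "ends" attribute of the grouping() result, which the R documentation describes as experimental. It returns inner-join index pairs only and makes no claim about speed.

```r
# Hypothetical helper: x_keys and y_keys are lists of key vectors
# (one vector per key column, same column order in both lists).
match_by_grouping <- function(x_keys, y_keys) {
  nx <- length(x_keys[[1L]])
  # Concatenate the keys of 'y' onto the keys of 'x', column by column.
  keys <- unname(Map(c, x_keys, y_keys))
  g <- do.call(grouping, keys)
  ends <- attr(g, "ends")              # last position of each group in g
  idx <- as.integer(g)                 # permutation placing equal keys together
  grp_id <- rep.int(seq_along(ends), diff(c(0L, ends)))
  is_x <- idx <= nx                    # positions <= nx come from 'x'
  lev <- seq_along(ends)
  x_by_grp <- split(idx[is_x], factor(grp_id[is_x], levels = lev))
  y_by_grp <- split(idx[!is_x] - nx, factor(grp_id[!is_x], levels = lev))
  # Within each group, every 'x' row pairs with every 'y' row.
  pairs <- Map(function(xi, yi) {
    expand.grid(x = xi, y = yi, KEEP.OUT.ATTRS = FALSE)
  }, x_by_grp, y_by_grp)
  out <- do.call(rbind, unname(pairs))
  out <- out[order(out$x, out$y), ]
  rownames(out) <- NULL
  out
}

# Two key columns; rows 1 and 3 of 'x' have a partner in 'y'.
x_keys <- list(c("a", "b", "c"), c("u", "v", "w"))
y_keys <- list(c("c", "a", "z"), c("w", "u", "q"))
hits <- match_by_grouping(x_keys, y_keys)
```

The per-group expand.grid() is only there to keep the sketch short; a real implementation would build the index vectors directly.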
I'll have a look at these functions and report back here if I have any questions or leads.
I am still miles away from any speed considerations, but already have some questions. The code for this is available here:
Simple case: by columns are vectors - WORKS
> d1 <- DataFrame(a = 1:3, x = letters[1:3], y = List(1, 1:2, 1:3))
> d2 <- DataFrame(a = 1:3, x = letters[1:3], z = List(1:3, 1:2, 1))
> S4Vectors::mergeDFrame(d1, d2)
DataFrame with 3 rows and 4 columns
a x y z
<integer> <character> <NumericList> <NumericList>
1 1 a 1 1,2,3
2 2 b 1,2 1,2
3 3 c 1,2,3 1
> S4Vectors::mergeDFrame(d1, d2, by = "a", suffixes = c("_x", "_y"))
DataFrame with 3 rows and 5 columns
a x_x y x_y z
<integer> <character> <NumericList> <character> <NumericList>
1 1 a 1 a 1,2,3
2 2 b 1,2 b 1,2
3 3 c 1,2,3 c 1
Simple case: by vector and Lists of same type - WORKS
> d1 <- DataFrame(a = 1:3, x = List(letters[1:3]), y = List(1, 1:2, 1:3))
> d2 <- DataFrame(a = 1:3, x = List(letters[1:3]), z = List(1:3, 1:2, 1))
> S4Vectors::mergeDFrame(d1, d2)
DataFrame with 3 rows and 4 columns
a x y z
<integer> <CharacterList> <NumericList> <NumericList>
1 1 a,b,c 1 1,2,3
2 2 a,b,c 1,2 1,2
3 3 a,b,c 1,2,3 1
> S4Vectors::mergeDFrame(d2, d1)
DataFrame with 3 rows and 4 columns
a x z y
<integer> <CharacterList> <NumericList> <NumericList>
1 1 a,b,c 1,2,3 1
2 2 a,b,c 1,2 1,2
3 3 a,b,c 1 1,2,3
> S4Vectors::mergeDFrame(d1, d2, by = "a", suffixes = c("_x", "_y"))
DataFrame with 3 rows and 5 columns
a x_x y x_y z
<integer> <CharacterList> <NumericList> <CharacterList> <NumericList>
1 1 a,b,c 1 a,b,c 1,2,3
2 2 a,b,c 1,2 a,b,c 1,2
3 3 a,b,c 1,2,3 a,b,c 1
Lists and lists - this should probably fail!!
> d1 <- DataFrame(a = 1:3, x = List(letters[1:3]), y = List(1, 1:2, 1:3))
> d2 <- DataFrame(a = 1:3, x = List(letters[1:3]), z = List(1:3, 1:2, 1))
> d2$x <- rep(list(letters[1:3]), 3)
> S4Vectors::mergeDFrame(d1, d2) ## x CharacterList
DataFrame with 3 rows and 4 columns
a x y z
<integer> <CharacterList> <NumericList> <NumericList>
1 1 a,b,c 1 1,2,3
2 2 a,b,c 1,2 1,2
3 3 a,b,c 1,2,3 1
> S4Vectors::mergeDFrame(d2, d1) ## x is list
DataFrame with 3 rows and 4 columns
a x z y
<integer> <list> <NumericList> <NumericList>
1 1 a,b,c 1,2,3 1
2 2 a,b,c 1,2 1,2
3 3 a,b,c 1 1,2,3
Lists of different types - this should fail too!!
> d1 <- DataFrame(a = 1:3, x = List(as.character(1:3)), y = List(1, 1:2, 1:3))
> d2 <- DataFrame(a = 1:3, x = List(1:3), z = List(1:3, 1:2, 1))
> S4Vectors::mergeDFrame(d1, d2) ## x is CharacterList
DataFrame with 3 rows and 4 columns
a x y z
<integer> <CharacterList> <NumericList> <NumericList>
1 1 1,2,3 1 1,2,3
2 2 1,2,3 1,2 1,2
3 3 1,2,3 1,2,3 1
> S4Vectors::mergeDFrame(d2, d1) ## x is IntegerList
DataFrame with 3 rows and 4 columns
a x z y
<integer> <IntegerList> <NumericList> <NumericList>
1 1 1,2,3 1,2,3 1
2 2 1,2,3 1,2 1,2
3 3 1,2,3 1 1,2,3
From this I conclude that by should not only rely on names, but should also make sure that the columns with matching names are of the same class.
Update: this actually also happens for vectors of different modes in data.frames:
> d1 <- data.frame(a = 1:3, x = letters[1:3])
> d2 <- data.frame(a = as.character(1:3), y = letters[1:3])
> str(merge(d1, d2))
'data.frame': 3 obs. of 3 variables:
$ a: int 1 2 3
$ x: chr "a" "b" "c"
$ y: chr "a" "b" "c"
> str(merge(d2, d1))
'data.frame': 3 obs. of 3 variables:
$ a: chr "1" "2" "3"
$ y: chr "a" "b" "c"
$ x: chr "a" "b" "c"
I would prefer a behaviour like this one, where an error is thrown if the columns used for merging are of different classes:
> d1 <- DataFrame(a = 1:3, x = List(as.character(1:3)), y = List(1, 1:2, 1:3))
> d2 <- DataFrame(a = 1:3, x = List(1:3), z = List(1:3, 1:2, 1))
> S4Vectors::mergeDFrame(d1, d2)
Error in S4Vectors::mergeDFrame(d1, d2) :
The marching columns in 'x' and 'y' used for merging must be of identical classes.
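For reference, the kind of check I have in mind could look roughly like the sketch below. The name check_by_classes() is hypothetical and this is not the code in mergeDFrame(); it only uses names(), [[ and class(), so I would expect it to behave the same on a data.frame and on a DataFrame, although I only show it on data.frames here.

```r
# Hypothetical helper: stop if any column used for merging differs in class.
check_by_classes <- function(x, y, by = intersect(names(x), names(y))) {
  same <- vapply(by, function(nm) {
    identical(class(x[[nm]]), class(y[[nm]]))
  }, logical(1))
  if (!all(same)) {
    stop("columns used for merging differ in class: ",
         paste(by[!same], collapse = ", "))
  }
  invisible(TRUE)
}

d1 <- data.frame(a = 1:3, x = letters[1:3])
d2 <- data.frame(a = as.character(1:3), y = letters[1:3])
# 'a' is integer in d1 but not in d2, so this signals an error.
res <- tryCatch(check_by_classes(d1, d2), error = function(e) conditionMessage(e))
```

Comparing with identical(class(...), class(...)) is deliberately strict; whether integer and numeric keys should be allowed to match is a separate question.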
@lawremi @hpages - I would like to get some feedback on the current situation and future steps:
- The current mergeDFrame() implementation is essentially the default merge,data.frame,data.frame method adapted to work with list and vector columns, plus the check on identical by column classes (see comment above).
- This has the advantage that the behaviour fits quite well to merge,data.frame,data.frame, at the cost of efficiency (see @lawremi's comments at the top). It seems at least a sensible starting point.
- There is a call to .Internal(merge(...)), which is only for true R wizards (the package checker warns me that I don't qualify).
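One possible way around the .Internal(merge(...)) call would be to compute the matching row indices in plain R. The sketch below is an assumption about what such a replacement could look like, not what mergeDFrame() does: merge_index() is a made-up name, it takes a single key vector per side, returns inner-join pairs only, and never matches NA keys (unlike base::merge() with its default incomparables = NULL).

```r
# Hypothetical helper: bx and by are the (single) key vectors of 'x' and 'y'.
merge_index <- function(bx, by) {
  # Positions of every distinct key value in 'by' (NA keys are dropped).
  by_groups <- split(seq_along(by), by)
  # Look up each 'x' key; keys absent from 'by' yield NULL (no match).
  hits <- unname(by_groups[as.character(bx)])
  list(xi = rep.int(seq_along(bx), lengths(hits)),
       yi = as.integer(unlist(hits)))
}

idx <- merge_index(c("a", "b", "b", "d"), c("b", "a", "b"))
# The merged rows would then be x[idx$xi, ] alongside y[idx$yi, ].
```

Keys are compared through as.character(), so this shares some of the drawbacks of the paste() approach; it is only meant to show that the package check warning can be avoided.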
@hpages any thoughts on this?