systemds icon indicating copy to clipboard operation
systemds copied to clipboard

[SYSTEMDS-3444] Spark Write Compressed

Open Baunsgaard opened this issue 3 years ago • 2 comments

Recent commit did not address the spark writing completely therefore this PR follows up and adds the writing of a RDD to disk.

Baunsgaard avatar Oct 12 '22 21:10 Baunsgaard

Preview Compressed format for binary matrices, comparing text binary and compressed.: image

Baunsgaard avatar Oct 17 '22 11:10 Baunsgaard

Reading -> plus -> sum of ... empty ... 256k x 2k matrix.

Binary:

1.536E9
SystemDS Statistics:
Total elapsed time:		5.827 sec.
Total compilation time:		0.383 sec.
Total execution time:		5.444 sec.
Cache hits (Mem/Li/WB/FS/HDFS):	1/0/0/0/1.
Cache writes (Li/WB/FS/HDFS):	0/0/1/0.
Cache times (ACQr/m, RLS, EXP):	0.076/0.000/3.880/0.000 sec.
HOP DAGs recompiled (PRED, SB):	0/0.
HOP DAGs recompile time:	0.000 sec.
Total JIT compile time:		0.253 sec.
Total JVM GC count:		0.
Total JVM GC time:		0.0 sec.
Heavy hitter instructions:
 #  Instruction  Time(s)  Count
 1  +              5.190      1
 2  uak+           0.246      1
 3  createvar      0.006      2
 4  print          0.001      1
 5  rmvar          0.001      2

Compressed:

1.536E9
SystemDS Statistics:
Total elapsed time:		0.437 sec.
Total compilation time:		0.376 sec.
Total execution time:		0.060 sec.
Cache hits (Mem/Li/WB/FS/HDFS):	1/0/0/0/1.
Cache writes (Li/WB/FS/HDFS):	0/0/0/0.
Cache times (ACQr/m, RLS, EXP):	0.045/0.000/0.000/0.000 sec.
HOP DAGs recompiled (PRED, SB):	0/0.
HOP DAGs recompile time:	0.000 sec.
Total JIT compile time:		0.132 sec.
Total JVM GC count:		0.
Total JVM GC time:		0.0 sec.
Heavy hitter instructions:
 #  Instruction  Time(s)  Count
 1  +              0.050      1
 2  createvar      0.006      2
 3  uak+           0.003      1
 4  print          0.001      1
 5  rmvar          0.000      2

Baunsgaard avatar Oct 17 '22 11:10 Baunsgaard

Initial Results: Writing a Matrix Block of 64k x 2k.

Disk and Memory Size Screenshot from 2022-10-18 18-27-38 M = in Memory size D = on Disk size

Read and Write end-to-end times:

Screenshot from 2022-10-18 18-27-58 script1 - write:

m = rand(rows= $1, cols=$2, min=$3, max=$4, sparsity=$5)
write(m, $7, format=$6)

script2 - read:

m = read($1)
m  = m + 3
print(sum(m))

Binary -> reading and writing using our binary format Comp -> Compressed writing and reading using new compressed format.

The times are end to end, so the read script include time to do addition and sum.

Baunsgaard avatar Oct 18 '22 16:10 Baunsgaard

Optimization 1. Do not use Hadoop compression but only internal serialization: Improved performance Read: 1.6 -> 0.939 sec 1k blocks and 0.9 to 0.6 with 128k blocks. Write: 2 -> 1.4 sec

image

Baunsgaard avatar Oct 19 '22 14:10 Baunsgaard

Optimization 2: n way combining.

This optimization makes smaller blocks viable, since the combination of the column groups and allocation of all sub combinations was done before this. Note the execution did in general improve and reduce the time of execution, but it is not obvious in the plot since 1. clock speed was low since the experiments were run multiple times on laptop 2. the overhead of other things in execution. One such example is compilation of 300 ms.

more detailed breakdown (Binary matrix 0.5 sparsity 64k x 2k):

Compressed:
blockSize (k),  Total Read       , Read Time        , Plus Time        , Sum Time
          0.5,       1.013+-0.029,      0.386+-0.012,      0.403+-0.012,      0.037+-0.005
          1.0,       0.884+-0.030,      0.273+-0.014,      0.291+-0.015,      0.034+-0.005
          2.0,       0.816+-0.025,      0.210+-0.008,      0.227+-0.008,      0.036+-0.005
          4.0,       0.787+-0.031,      0.176+-0.006,      0.196+-0.011,      0.036+-0.006
          8.0,       0.754+-0.033,      0.153+-0.008,      0.170+-0.008,      0.031+-0.004
         16.0,              0.611,             0.121,             0.133,             0.026
         32.0,       0.749+-0.033,      0.141+-0.007,      0.158+-0.008,      0.034+-0.005
         64.0,       0.739+-0.034,      0.136+-0.004,      0.152+-0.004,      0.032+-0.005
        128.0,       0.732+-0.038,      0.117+-0.011,      0.134+-0.011,      0.033+-0.005

Binary:
blockSize (k),  Total Read       , Read Time        , Plus Time        , Sum Time
          0.5,       1.660+-0.014,      0.768+-0.018,      1.161+-0.022,      0.121+-0.013
          1.0,       1.665+-0.040,      0.739+-0.017,      1.147+-0.025,      0.106+-0.008
          2.0,       1.774+-0.029,      0.805+-0.014,      1.233+-0.026,      0.117+-0.008
          4.0,       1.811+-0.016,      0.833+-0.011,      1.269+-0.012,      0.116+-0.009
          8.0,       1.995+-0.039,      0.968+-0.020,      1.423+-0.029,      0.126+-0.013
         16.0,              1.979,             0.999,             1.426,             0.114
         32.0,       2.313+-0.058,      1.372+-0.044,      1.781+-0.048,      0.120+-0.019
         64.0,       2.420+-0.120,      1.350+-0.060,      1.770+-0.065,      0.115+-0.004
        128.0,       3.040+-0.127,      2.093+-0.090,      2.508+-0.099,      0.112+-0.008

image

When increasing the size of the matrix the differences become larger, and the overhead of compressing and writing to disk disappear:

example 256k x 2k binary matrix:

Compressed:
blockSize (k), Disk Size        , Total Read       , Read Time        , Plus Time        , Sum Time
          0.5,            103116,      1.175+-0.031,      0.763+-0.024,      0.779+-0.024,      0.027+-0.003
          1.0,             83536,      1.014+-0.030,      0.585+-0.016,      0.599+-0.016,      0.027+-0.003
          2.0,             73760,      0.931+-0.022,      0.469+-0.015,      0.483+-0.015,      0.027+-0.004
          4.0,             67880,      0.922+-0.040,      0.424+-0.017,      0.438+-0.018,      0.032+-0.005
          8.0,             64940,      0.838+-0.034,      0.346+-0.021,      0.360+-0.021,      0.035+-0.007
         16.0,             63720,      0.817+-0.026,      0.321+-0.007,      0.335+-0.007,      0.028+-0.003
         32.0,             63112,      0.814+-0.027,      0.309+-0.011,      0.324+-0.012,      0.027+-0.003
         64.0,             63112,      0.987+-0.081,      0.348+-0.026,      0.364+-0.026,      0.034+-0.004
        128.0,             62656,      0.962+-0.117,      0.337+-0.037,      0.354+-0.038,      0.030+-0.003
Binary:
blockSize (k),  Disk Size          , Total Read       , Read Time        , Plus Time        , Sum Time
          0.5,              4000152,     10.187+-0.181,      3.308+-0.236,      9.306+-0.142,      0.281+-0.032
          1.0,   4000070.667+-4.619,     10.449+-0.419,      3.373+-0.172,      9.635+-0.396,      0.284+-0.007
          2.0,   4000073.333+-4.619,     10.391+-0.270,      3.385+-0.258,      9.559+-0.286,      0.298+-0.008
          4.0,   4000070.667+-4.619,     10.644+-0.326,      3.645+-0.301,      9.804+-0.345,      0.298+-0.016
          8.0,  4000101.333+-29.484,      9.954+-0.190,      3.707+-0.139,      9.123+-0.160,      0.288+-0.025
         16.0,              4000080,     10.589+-0.650,      4.211+-0.601,      9.718+-0.650,      0.299+-0.003
         32.0,              4000056,     10.757+-0.679,      4.422+-0.548,      9.904+-0.663,      0.307+-0.009
         64.0,   4000057.333+-9.238,     10.969+-0.388,      4.224+-0.395,     10.131+-0.386,      0.296+-0.007
        128.0,              4000020,     12.660+-0.556,      6.316+-0.380,     11.765+-0.559,      0.325+-0.019

image

Baunsgaard avatar Oct 20 '22 16:10 Baunsgaard