Samstag, 20. Juni 2015

Minimal Coding versus Indirection

Yesterday I tried to use a C implementation of the LCS problem within Perl via XS.

Reuse an existing library (libmba) sounded nice. It has good documentation and minimal foreign dependencies. The promise for libmba::diff is working for sequences of anything.

First problem was to get it compiling, stepping from one missing include to the next. Also had to patch a source file, because of some incompatible code for printing error messages.

Per default libmba::diff works on byte-strings. The comparison with String::Similarity showed only 40% of the speed of String::Similarity. Both use the same algorithm ["An O(ND) Difference Algorithm and its Variations", Eugene Myers,
 Algorithmica Vol. 1 No. 2, 1986, pp. 251-266;], both written in C, both interfaced from Perl via XS.

But ok, fast enough for the first step, can tune later.

In the next step I wanted to support array references instead of strings as input. This needs to call libmba::diff() with references of two subroutines, one for comparing two elements, one for indexing an element. Here I lost (or wasted) some time getting it compiled without warnings (see on Stackoverflow).

The speed slowed down now to 24%, 0.237 MHz versus 1 MHz (i.e. 1 million cases processed per second) of String::Similarity. 

What's the difference in detail?

Let's measure rough figures.

The lines of code of String::Similarity:

~/github/String-Similarity-1.04$ cloc --by-file-by-lang .
      17 text files.
      13 unique files.                              
      17 files ignored.

http://cloc.sourceforge.net v 1.62  T=0.20 s (45.6 files/s, 10777.8 lines/s)

[ details snipped ]

---------------------------------------------------------------------
Language           files          blank        comment           code
---------------------------------------------------------------------
make                   1            231            106            702
C                      2            127            193            570
YAML                   2              0              0             41
JSON                   1              0              0             39
Perl                   2             31             24             38
C/C++ Header           1              7             14              4
---------------------------------------------------------------------
SUM:                   9            396            337           1394
---------------------------------------------------------------------


And the same of LCS::XS:



~/github/LCS-XS$ cloc --by-file-by-lang .
      44 text files.
      42 unique files.                              
       8 files ignored.

http://cloc.sourceforge.net v 1.62  T=0.33 s (115.4 files/s, 37434.7 lines/s)

[ details snipped ]

----------------------------------------------------------------------
Language            files          blank        comment           code
----------------------------------------------------------------------
C/C++ Header           25           1109           3652           4317
C                       6            230            252           1376
make                    1            231            111            704
Perl                    3             70             63            146
JSON                    1              0              0             39
YAML                    2              1              1             28
----------------------------------------------------------------------
SUM:                   38           1641           4079           6610
----------------------------------------------------------------------

The lines of code in C are 1376 versus 570 for the same algorithm. The reason is more abstraction, more subroutines, more internal interfaces, more flexibility. This also causes more documentation for interfaces and more time to read the documentation and master the interfaces.

The size of the compiled objects is 7 KB versus 33 KB.

And a difference of factor 4 in speed.



Keine Kommentare:

Kommentar veröffentlichen