Mittwoch, 10. Juni 2015

Porting some modules to Perl6

Playing around with Perl6 since 2008 I wanted to port two of my CPAN distributions, Set::Similarity and Bag::Similarity, to Perl6. It would be a nice example of using the Perl6 built-in types Set and Bag (also known as multisets).

Last year I visited the Austrian Perl Workshop and also stayed the additional two days on the hackaton. My intention was to discuss details of the long missing NFG (Normalization Form Grapheme) with the core developers, but they spent their time discussing GLR (Grand List Refactor/Redesign).

So I worked on the port of Set::Similarity and Bag::Similarity.

These modules are reference implementations of similarity coefficients known under the names Dice, Jaccard, cosine and overlap.

Input are always two sequences, which are tokenized if necessary, transformed to sets (arrays of unique tokens in Perl6) or bags (hashes in Perl5).

In case of strings as input they are tokenized into single characters by default, or into n-grams if n is specified.

Splitting into single characters is easy in Perl6, if you know that the method is named comb.


$ perl6
> say "banana".comb;
b a n a n a
Transforming it to a set

> say "banana".comb.Set;
set(b, a, n)
With sets we can use the intersection operator (a logical and reduction)

> say "banana".comb.Set (&) "anna".comb.Set;
set(a, n)
But Perl6 understands it shorter

> say "banana".comb (&) "anna".comb;
set(a, n)
N-grams are not so obvious (but there is more than one way ...)

> my $w = 4;'wollmersdorfer'.comb.rotor($w,$w-1).map({[~] @$_})
woll ollm llme lmer mers ersd rsdo sdor dorf orfe rfer
The Dice coefficient is defined as count(intersection(A,B)) / (count(A) + count(B)). With the pieces from above we can code:

> my $a="banana".comb; my $b="anna".comb; say ($a (&) $b)/($a.Set + $b.Set)
0.4
Very short, isn't it? Note, that Perl6 automatically uses the number of elements of a set in the context of numeric operations.
Next the Jaccard coefficient is defined as count(intersection(A,B)) / count(union(A,B)). If the operator for intersection is the logical and, then the operator for the union should be the logical or. Let's try.

> my $a = 'banana';my $b = 'anna'; my $union = ($a.comb (|) $b.comb)
set(b, a, n)
Put the pieces together:

> y $a = 'banana';my $b = 'anna'; my $jaccard = ($a.comb (&) $b.comb) / ($a.comb (|) $b.comb);
0.666667
Understanding the code for for cosine and overlap is left as an exercise to the reader:

my $cosine = ($a.comb (&) $b.comb) / ($a.comb.Set.elems.sqrt * $b.comb.Set.elems.sqrt);
my $overlap = ($a.comb (&) $b.comb) / ($a.comb.Set.elems, $b.comb.Set.elems).min;
Is this short code really worth own modules? No, it isn't. Thus I contributed it to Perl6-One-Liners as chapter Text analysis.
Many thanks to the Perl6 hackers answering my questions.

Keine Kommentare:

Kommentar veröffentlichen