using this we can save a few more cycles (code size not yet compared).
working principle:
the lower nibble alone is enough to distinguish which number to
multiply by. to get maximum overlap between the multiplication
paths, a decision tree on the lower nibble selects where to go. the
upper nibble is then just a straight path (there is not much to
deduplicate there).
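
a minimal sketch of the idea in c rather than avr assembly. the
multiplier set (0x12, 0x35, 0x5b, 0x7e) and the names mul_tree and
mul_upper are hypothetical, picked only so that every lower nibble is
unique; the real constants and code layout are not shown in these notes.
each branch of the tree first adds the partial product its leaves share,
then the upper nibble is handled without any further dispatch:

```c
#include <stdint.h>

/* hypothetical multiplier set, chosen so that every low nibble is unique:
 *   0x12 -> 0x2, 0x35 -> 0x5, 0x5b -> 0xb, 0x7e -> 0xe */

/* upper nibble: straight path, no dispatch (assumed shape) */
static uint16_t mul_upper(uint8_t m, uint8_t x)
{
    return (uint16_t)(((m >> 4) * x) << 4);
}

/* multiply x by m, where m is one of the four constants above.
 * only the low nibble of m is inspected to pick the path. */
static uint16_t mul_tree(uint8_t m, uint8_t x)
{
    uint16_t w = x;
    uint16_t lo;

    if (m & 0x01) {
        lo = w;                          /* shared by 0x5 and 0xb: 1*x */
        if (m & 0x02)
            lo += (w << 1) + (w << 3);   /* 0xb = 8 + 2 + 1 */
        else
            lo += (w << 2);              /* 0x5 = 4 + 1 */
    } else {
        lo = (w << 1);                   /* shared by 0x2 and 0xe: 2*x */
        if (m & 0x04)
            lo += (w << 2) + (w << 3);   /* 0xe = 8 + 4 + 2 */
    }

    return (uint16_t)(lo + mul_upper(m, x));
}
```

two bit tests are enough to separate the four low nibbles here, and the
work done before the second test is shared by both leaves of that branch.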
worst case (roughly):
- mul-tree: 30cy tree + 5cy to load data[t] = 35cy
- cpi-breq-rjmp: 37cy mul_xx + 5cy to load data[t] = 42cy
- ldZ-ijmp: 28cy mul_xx + 10cy to load mul_jmptable[t] = 38cy