TODO:
- najit, kde to zije, kdo to udrzuje
- pripadne submitnout internacionalni patch
- v tom pripade ho musis udelat pres binmode utf-8 a lc a ne prasit tr!
- opravdu absolutni vaha na 4-gramech ⇒ vety ze tri slov maji 0
sub NormalizeText { my ($norm_text) = @_; # language-independent part: $norm_text =~ s/<skipped>//g; # strip "skipped" tags $norm_text =~ s/-\n//g; # strip end-of-line hyphenation and join lines ... # language-dependent part (assuming Western languages): $norm_text = " $norm_text "; $norm_text =~ tr/[A-Z]/[a-z]/ unless $preserve_case; # BEWARE! PRIDAL JSEM RUCNE !!! $norm_text =~ tr/ÁÉÍÓÚŮČĎĚŘŠŤŽ/áéíóúůčďěřšťž/ unless $preserve_case; # BEWARE! PRIDAL JSEM RUCNE !!!