You will still get a score of +1 if you change back to Latin or
anything else afterwards, but at least the Latin/anything -> Emoticon
transition is free now (score 0). Ending with an emoji also adds
a score of 0 (as far as this transition is concerned).
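
The rule above can be sketched as follows. This is a simplified
illustration, not the actual antimixedutf8 scoring: the Block enum and
the switch_score() function are hypothetical names, and the real module
may weigh switches differently.

```c
#include <stddef.h>

/* Hypothetical block identifiers, for illustration only. */
typedef enum { BLOCK_LATIN, BLOCK_CYRILLIC, BLOCK_EMOTICONS } Block;

/*
 * Add +1 for every switch to a different block, except that a
 * switch INTO the emoticons block is free (score 0).
 * Switching away from emoticons still costs +1.
 */
int switch_score(const Block *blocks, size_t n)
{
    int score = 0;

    for (size_t i = 1; i < n; i++) {
        if (blocks[i] == blocks[i - 1])
            continue; /* same block, no switch */
        if (blocks[i] == BLOCK_EMOTICONS)
            continue; /* anything -> emoticon is free */
        score++;
    }
    return score;
}
```

With this sketch, "Latin, Emoticon" scores 0 and "Latin, Emoticon, Latin"
scores 1, matching the description above.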
Example output:
*** SPAMINFO ***
This will show the original text and the deconfused text which can be used in a spamfilter block with input-conversion deconfused;
Original spam text: ẔŽŽẐ𝞕ȤℤΖℨℨ𝒁𝓩ẒŹƵᏃŻẒŽℨŹ𝒵𝛧Ż𝝛𝛧ℨℤ𝜡Ƶ𝞕𝘡ŹẐ𝑍ẔẐẐΖ𝜡Ẕ𝜡Ẕ𝞕ꓜ𝚭ᏃẐẔ𝙕
Deconfused spam text: ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ
AntiMixedUTF8 points: 64
Number of Unicode characters in total: 50
Number of different Unicode blocks used: 8
Unicode Block breakdown (name: bytes [capped at 255]):
- Latin Extended-A: 8
- Latin Extended-B: 3
- Greek and Coptic: 2
- Cherokee: 2
- Latin Extended Additional: 12
- Letterlike Symbols: 6
- Lisu: 1
- Mathematical Alphanumeric Symbols: 16
Make match_spamfilter use clictx->textanalysis->deconfused rather than
calculating its own. The latter will probably disappear altogether.
Unrelated but also fixed: properly set e->unicode_blocks.
switches like antimixedutf8 did, and counts the number of characters
used per Unicode block. More can potentially be added later: this is
flexible and modules can add stuff (..well, not yet.. the struct is
missing some members..).
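
A minimal sketch of per-block counting, assuming the text has already
been decoded into Unicode code points. The names BlockRange, block_table
and count_per_block() are hypothetical and only a handful of blocks are
listed; the real code in utf8functions may look quite different. The cap
at 255 is borrowed from the "[capped at 255]" label in the SPAMINFO
output.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table entry: a block name and its code point range. */
typedef struct {
    const char *name;
    uint32_t first;
    uint32_t last;
} BlockRange;

/* Small excerpt of real Unicode block ranges (not a complete list). */
static const BlockRange block_table[] = {
    { "Basic Latin",        0x0000, 0x007F },
    { "Latin Extended-A",   0x0100, 0x017F },
    { "Greek and Coptic",   0x0370, 0x03FF },
    { "Cherokee",           0x13A0, 0x13FF },
    { "Letterlike Symbols", 0x2100, 0x214F },
};

#define NUM_BLOCKS (sizeof(block_table) / sizeof(block_table[0]))

/* Return the index into block_table, or -1 if the block is not listed. */
static int find_block(uint32_t cp)
{
    for (size_t i = 0; i < NUM_BLOCKS; i++)
        if (cp >= block_table[i].first && cp <= block_table[i].last)
            return (int)i;
    return -1;
}

/* Count characters per block; each counter saturates at 255. */
void count_per_block(const uint32_t *cps, size_t n, unsigned char counts[])
{
    for (size_t i = 0; i < NUM_BLOCKS; i++)
        counts[i] = 0;

    for (size_t i = 0; i < n; i++) {
        int b = find_block(cps[i]);
        if (b >= 0 && counts[b] < 255)
            counts[b]++;
    }
}
```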
Use it from antimixedutf8 so that it now uses the new code, which is
similar to what I made and then reverted in July 2023:
https://github.com/unrealircd/unrealircd/commit/3e2f668f10fccedfd035526d7b20d7ca6819a8ae
..except that it is now calculated in src/modules/utf8functions.c.
But yeah, this needs more testing and possibly (default) score
adjustments to deal with false positives !! And a warning in release notes :D
Put the text analysis in ClientContext member textanalysis,
so typically accessed through clictx->textanalysis.
Note that this struct can be (and often is) NULL, for example if it is
a remote client, if it is not a PRIVMSG/NOTICE (will improve later)
or if the utf8functions module is not loaded (to keep things optional).
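
Since textanalysis may be NULL, callers need to guard every access. A
sketch of such a guard is shown here with stand-in types: only the
member names textanalysis and deconfused come from these commits, while
the TextAnalysis type name, the struct layouts and the text_to_match()
helper are assumptions made for illustration.

```c
#include <stddef.h>

/* Stand-in types for this sketch, NOT the real UnrealIRCd definitions. */
typedef struct {
    const char *deconfused; /* assumed to be a string pointer here */
} TextAnalysis;

typedef struct {
    TextAnalysis *textanalysis; /* may be NULL, see above */
} ClientContext;

/*
 * Hypothetical helper: prefer the deconfused text when the analysis
 * is available, otherwise fall back to the original text.
 */
const char *text_to_match(const ClientContext *clictx, const char *original)
{
    if (clictx && clictx->textanalysis && clictx->textanalysis->deconfused)
        return clictx->textanalysis->deconfused;
    return original;
}
```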
BREAKING CHANGE is that ClientContext is now passed in the
HOOKTYPE_CAN_SEND_TO_CHANNEL and HOOKTYPE_CAN_SEND_TO_USER hooks.
So HOOKTYPE_CAN_SEND_TO_USER prototype changed from:
int hooktype_can_send_to_user(Client *client, Client *target, const char **text, const char **errmsg, SendType sendtype);
To:
int hooktype_can_send_to_user(Client *client, Client *target, const char **text, const char **errmsg, SendType sendtype, ClientContext *clictx);
And HOOKTYPE_CAN_SEND_TO_CHANNEL prototype changed from:
int hooktype_can_send_to_channel(Client *client, Channel *channel, Membership *member, const char **text, const char **errmsg, SendType sendtype);
To:
int hooktype_can_send_to_channel(Client *client, Channel *channel, Membership *member, const char **text, const char **errmsg, SendType sendtype, ClientContext *clictx);
A side effect of this change for antimixedutf8 purposes is that,
while the analysis is only done once per line, the 'actions' are
performed for each target, so the action will run 4 times for
"PRIVMSG a,b,c,d :text", although that may not be important in
practice. Just mentioning.
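
To illustrate the once-per-line versus once-per-target behavior, here is
a toy sketch. All names in it (analyze_line, run_action, deliver and the
two counters) are hypothetical and do not correspond to actual
UnrealIRCd functions.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical counters so the behavior can be observed. */
static int analysis_runs = 0;
static int action_runs = 0;

/* Stand-in for the text analysis, done once per line. */
static void analyze_line(const char *text)
{
    (void)text;
    analysis_runs++;
}

/* Stand-in for the per-target action (e.g. run from a hook). */
static void run_action(const char *target, size_t len, const char *text)
{
    (void)target;
    (void)len;
    (void)text;
    action_runs++;
}

/* Deliver one line to a comma-separated target list such as "a,b,c,d". */
void deliver(const char *targets, const char *text)
{
    const char *p = targets;

    analyze_line(text); /* once per line */

    while (*p) {
        size_t len = strcspn(p, ","); /* length of the current target */
        if (len > 0)
            run_action(p, len, text); /* once per target */
        p += len;
        if (*p == ',')
            p++;
    }
}
```

Calling deliver("a,b,c,d", "text") in this sketch runs the analysis once
and the action four times.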
generators/sources plus some manual tweaking.
This is not complete and not always correct. Sometimes there are
simple mistakes like ф -> f because that is a Cyrillic f but it
should be seen as an o or something like that. Those still need to
be polished out. And some other things are just plain weird but
probably similar cases. In any case, with this commit things are
getting better. It will never be perfect or anything close to perfect
anyway!
I started work on this back then but didn't finalize it. Now I
have to figure out what was left to be done :D. Other than the
obvious, that is: there is still some debugging code that prints
output for every converted character. Not yet visible / usable by end-users!