1
0
mirror of https://github.com/unrealircd/unrealircd.git synced 2026-07-02 16:13:13 +02:00
Commit Graph

11 Commits

Author SHA1 Message Date
Bram Matthys e58768eb65 antimixedutf8: ignore general punctuation block transitions
Since those can happen in ordinary text.
2025-09-06 14:02:31 +02:00
Bram Matthys 641413cfa9 Update Unicode block lists with Unicode 16.0.0 from 2024-02-02.
And provide instructions on how to generate this thing.
2025-03-24 09:32:50 +01:00
Bram Matthys fafe16a673 AntiMixedUTF8: change emoticon transition score from 1 to 0
You will still get a score of +1 if afterwards changing back to Latin
or anything else, but at least the Latin/anything -> Emoticon
transition is free now (score 0). And if ending with an emoji it
also means a score 0 (as far as this is concerned).
2025-03-23 13:21:01 +01:00
Bram Matthys 74e17b7a26 Make SPAMINFO show the UTF8 block names a text uses.
Example output:
*** SPAMINFO ***
This will show the original text and the deconfused text which can be used in a spamfilter block with input-conversion deconfused;
Original spam text: ẔŽŽẐ𝞕ȤℤΖℨℨ𝒁𝓩ẒŹƵᏃŻẒŽℨŹ𝒵𝛧Ż𝝛𝛧ℨℤ𝜡Ƶ𝞕𝘡ŹẐ𝑍ẔẐẐΖ𝜡Ẕ𝜡Ẕ𝞕ꓜ𝚭ᏃẐẔ𝙕
Deconfused spam text: ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ
AntiMixedUTF8 points: 64
Number of Unicode characters in total: 50
Number of different Unicode blocks used: 8
Unicode Block breakdown (name: bytes [capped at 255]):
- Latin Extended-A: 8
- Latin Extended-B: 3
- Greek and Coptic: 2
- Cherokee: 2
- Latin Extended Additional: 12
- Letterlike Symbols: 6
- Lisu: 1
- Mathematical Alphanumeric Symbols: 16
2025-03-23 13:03:58 +01:00
Bram Matthys 6bd6e974d4 Add num_bytes and num_unicode_characters to TextAnalysis struct.
Also so you can easily put the unicode_blockmap[] in perspective
e.g. if you want to do percentages.
2025-03-23 12:43:01 +01:00
Bram Matthys 9b89166280 Add deconfused to TextAnalysis. Add ClientContext * to match_spamfilter().
Make match_spamfilter use the clictx->textanalysis->deconfused rather than
calculating its own. The latter will probably disappear altogether.

Unrelated but also fixed: properly set e->unicode_blocks.
2025-03-23 12:13:38 +01:00
Bram Matthys 9691a6d819 Create TextAnalysis framework (hook), this counts the unicode block
switches like antimixedutf8 did, and counts the number of characters
used per unicode block. Potentially more can be added later, this is
flexible and modules can add stuff (..well not yet.. the struct is
missing some members..).

Use it from antimixedutf8 so that it now uses the new code, which is
similar to what I made and then reverted in July 2023:
https://github.com/unrealircd/unrealircd/commit/3e2f668f10fccedfd035526d7b20d7ca6819a8ae
..except that it now calculated in src/modules/utf8functions.c.
But yeah, this needs more testing and possibly (default) score
adjustments to deal with false positives !! And a warning in release notes :D

Put the text analysis in ClientContext member textanalysis,
so typically accessed through clictx->textanalysis.
Note that this struct can (and often is) NULL, for example if it is
a remote client, if it is not a PRIVMSG/NOTICE (will improve later)
or if the utf8functions module is not loaded (to keep things optional).

BREAKING CHANGE is that ClientContext is now passed in the
HOOKTYPE_CAN_SEND_TO_CHANNEL and HOOKTYPE_CAN_SEND_TO_USER hooks.

So HOOKTYPE_CAN_SEND_TO_USER prototype changed from:
int hooktype_can_send_to_user(Client *client, Client *target, const char **text, const char **errmsg, SendType sendtype);
To:
int hooktype_can_send_to_user(Client *client, Client *target, const char **text, const char **errmsg, SendType sendtype, ClientContext *clictx);

And HOOKTYPE_CAN_SEND_TO_CHANNEL prototype changes from:
int hooktype_can_send_to_channel(Client *client, Channel *channel, Membership *member, const char **text, const char **errmsg, SendType sendtype);
To:
int hooktype_can_send_to_channel(Client *client, Channel *channel, Membership *member, const char **text, const char **errmsg, SendType sendtype, ClientContext *clictx);

A side-affect of this change for antimixedutf8 purposes is that,
while the analysis is only done once per line, the 'actions' are
performed for each target, so the action will run 4 times for
"PRIVMSG a,b,c,d :text" although that may not be important in
practice. Just mentioning.
2025-03-23 11:44:24 +01:00
Bram Matthys 2c33103d28 Fix OOB read, write and NULL dereference code from yesterday. 2025-03-23 07:21:00 +01:00
Bram Matthys d137a95606 Update confusables. Generated with a python script from 2 different
generators/sources plus some manual tweaking.
This is not complete and not always correct. Sometimes there are
simple mistakes like ф -> f because that is a cyrillic f but it
should be seen as an o or something like that. Those still need to
be polished out. And some other things are just plain weird but
probably similar cases. In any case, with this commit things are
getting better. It will never be perfect or anything close to perfect
anyway!
2025-03-22 15:40:32 +01:00
Bram Matthys e1fac402d5 Add spamfilter { input-conversion confusables; ..... } for UTF8 conversion
of lookalike characters to simple latin characters.

Also add SPAMINFO command so you can see the result of the conversion.
2025-03-22 08:31:22 +01:00
Bram Matthys 9b3d219743 Add utf8functions with utf8_convert_confusables() from July 16 2023.
I started work on this back then but didn't finalize it. Now I
have to figure out what was left to be done :D. Other than the
obvious case of seeing some debugging code that prints out for
every converted character. Not yet visible / usable by end-users!
2025-03-22 07:56:11 +01:00