unrealircd

mirror of https://github.com/unrealircd/unrealircd.git synced 2026-07-02 16:13:13 +02:00

Author	SHA1	Message	Date
Bram Matthys	e58768eb65	antimixedutf8: ignore general punctuation block transitions, since those can happen in ordinary text.	2025-09-06 14:02:31 +02:00
Bram Matthys	641413cfa9	Update Unicode block lists with Unicode 16.0.0 from 2024-02-02. And provide instructions on how to generate this thing.	2025-03-24 09:32:50 +01:00
Bram Matthys	fafe16a673	AntiMixedUTF8: change emoticon transition score from 1 to 0. You will still get a score of +1 if afterwards changing back to Latin or anything else, but at least the Latin/anything -> Emoticon transition is free now (score 0). And if ending with an emoji it also means a score of 0 (as far as this is concerned).	2025-03-23 13:21:01 +01:00
Bram Matthys	74e17b7a26	Make SPAMINFO show the UTF8 block names a text uses. Example output: * SPAMINFO * This will show the original text and the deconfused text which can be used in a spamfilter block with input-conversion deconfused; Original spam text: ẔŽŽẐ𝞕ȤℤΖℨℨ𝒁𝓩ẒŹƵᏃŻẒŽℨŹ𝒵𝛧Ż𝝛𝛧ℨℤ𝜡Ƶ𝞕𝘡ŹẐ𝑍ẔẐẐΖ𝜡Ẕ𝜡Ẕ𝞕ꓜ𝚭ᏃẐẔ𝙕 Deconfused spam text: ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ AntiMixedUTF8 points: 64 Number of Unicode characters in total: 50 Number of different Unicode blocks used: 8 Unicode Block breakdown (name: bytes [capped at 255]): - Latin Extended-A: 8 - Latin Extended-B: 3 - Greek and Coptic: 2 - Cherokee: 2 - Latin Extended Additional: 12 - Letterlike Symbols: 6 - Lisu: 1 - Mathematical Alphanumeric Symbols: 16	2025-03-23 13:03:58 +01:00
Bram Matthys	6bd6e974d4	Add num_bytes and num_unicode_characters to TextAnalysis struct. Also so you can easily put the unicode_blockmap[] in perspective e.g. if you want to do percentages.	2025-03-23 12:43:01 +01:00
Bram Matthys	9b89166280	Add deconfused to TextAnalysis. Add ClientContext * to match_spamfilter(). Make match_spamfilter use the clictx->textanalysis->deconfused rather than calculating its own. The latter will probably disappear altogether. Unrelated but also fixed: properly set e->unicode_blocks.	2025-03-23 12:13:38 +01:00
Bram Matthys	9691a6d819	Create TextAnalysis framework (hook), this counts the unicode block switches like antimixedutf8 did, and counts the number of characters used per unicode block. Potentially more can be added later, this is flexible and modules can add stuff (..well not yet.. the struct is missing some members..). Use it from antimixedutf8 so that it now uses the new code, which is similar to what I made and then reverted in July 2023: https://github.com/unrealircd/unrealircd/commit/3e2f668f10fccedfd035526d7b20d7ca6819a8ae ..except that it is now calculated in src/modules/utf8functions.c. But yeah, this needs more testing and possibly (default) score adjustments to deal with false positives !! And a warning in release notes :D Put the text analysis in ClientContext member textanalysis, so typically accessed through clictx->textanalysis. Note that this struct can be (and often is) NULL, for example if it is a remote client, if it is not a PRIVMSG/NOTICE (will improve later) or if the utf8functions module is not loaded (to keep things optional). BREAKING CHANGE is that ClientContext is now passed in the HOOKTYPE_CAN_SEND_TO_CHANNEL and HOOKTYPE_CAN_SEND_TO_USER hooks.
So HOOKTYPE_CAN_SEND_TO_USER prototype changed from: int hooktype_can_send_to_user(Client *client, Client *target, const char **text, const char **errmsg, SendType sendtype); To: int hooktype_can_send_to_user(Client *client, Client *target, const char **text, const char **errmsg, SendType sendtype, ClientContext *clictx); And HOOKTYPE_CAN_SEND_TO_CHANNEL prototype changed from: int hooktype_can_send_to_channel(Client *client, Channel *channel, Membership *member, const char **text, const char **errmsg, SendType sendtype); To: int hooktype_can_send_to_channel(Client *client, Channel *channel, Membership *member, const char **text, const char **errmsg, SendType sendtype, ClientContext *clictx); A side effect of this change for antimixedutf8 purposes is that, while the analysis is only done once per line, the 'actions' are performed for each target, so the action will run 4 times for "PRIVMSG a,b,c,d :text" although that may not be important in practice. Just mentioning.	2025-03-23 11:44:24 +01:00
Bram Matthys	2c33103d28	Fix OOB read, write and NULL dereference code from yesterday.	2025-03-23 07:21:00 +01:00
Bram Matthys	d137a95606	Update confusables. Generated with a python script from 2 different generators/sources plus some manual tweaking. This is not complete and not always correct. Sometimes there are simple mistakes like ф -> f because that is a Cyrillic f but it should be seen as an o or something like that. Those still need to be polished out. And some other things are just plain weird but probably similar cases. In any case, with this commit things are getting better. It will never be perfect or anything close to perfect anyway!	2025-03-22 15:40:32 +01:00
Bram Matthys	e1fac402d5	Add spamfilter { input-conversion confusables; ..... } for UTF8 conversion of lookalike characters to simple Latin characters. Also add SPAMINFO command so you can see the result of the conversion.	2025-03-22 08:31:22 +01:00
Bram Matthys	9b3d219743	Add utf8functions with utf8_convert_confusables() from July 16 2023. I started work on this back then but didn't finalize it. Now I have to figure out what was left to be done :D. Other than the obvious case of seeing some debugging code that prints out for every converted character. Not yet visible / usable by end-users!	2025-03-22 07:56:11 +01:00
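
The scoring rules described in the antimixedutf8 commits above (a point per Unicode block switch, a free transition into Emoticons, and general punctuation ignored) can be sketched roughly as follows. This is not the code from src/modules/utf8functions.c: the names `Block`, `block_of`, `utf8_decode` and `mixed_utf8_score` are hypothetical, the block table is cut down to a handful of ranges with the Latin blocks lumped together, and skipping general punctuation entirely is only one possible reading of "ignore general punctuation block transitions".

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, heavily shortened block list (assumption, not the real table). */
typedef enum { BLK_LATIN, BLK_GREEK, BLK_CYRILLIC, BLK_GENERAL_PUNCT,
               BLK_EMOTICONS, BLK_OTHER } Block;

static Block block_of(uint32_t cp)
{
    if (cp <= 0x024F) return BLK_LATIN;            /* Basic Latin .. Latin Extended-B, lumped */
    if (cp >= 0x0370 && cp <= 0x03FF) return BLK_GREEK;
    if (cp >= 0x0400 && cp <= 0x04FF) return BLK_CYRILLIC;
    if (cp >= 0x2000 && cp <= 0x206F) return BLK_GENERAL_PUNCT;
    if (cp >= 0x1F600 && cp <= 0x1F64F) return BLK_EMOTICONS;
    return BLK_OTHER;
}

/* Minimal UTF-8 decoder: returns the number of bytes consumed.
 * Malformed sequences yield U+FFFD and consume one byte. */
static size_t utf8_decode(const unsigned char *s, uint32_t *cp)
{
    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    if ((s[0] & 0xE0) == 0xC0 && (s[1] & 0xC0) == 0x80) {
        *cp = ((uint32_t)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
        return 2;
    }
    if ((s[0] & 0xF0) == 0xE0 && (s[1] & 0xC0) == 0x80 && (s[2] & 0xC0) == 0x80) {
        *cp = ((uint32_t)(s[0] & 0x0F) << 12) | ((uint32_t)(s[1] & 0x3F) << 6) | (s[2] & 0x3F);
        return 3;
    }
    if ((s[0] & 0xF8) == 0xF0 && (s[1] & 0xC0) == 0x80 &&
        (s[2] & 0xC0) == 0x80 && (s[3] & 0xC0) == 0x80) {
        *cp = ((uint32_t)(s[0] & 0x07) << 18) | ((uint32_t)(s[1] & 0x3F) << 12) |
              ((uint32_t)(s[2] & 0x3F) << 6) | (s[3] & 0x3F);
        return 4;
    }
    *cp = 0xFFFD;
    return 1;
}

/* Count block switches in a NUL-terminated UTF-8 string. */
int mixed_utf8_score(const char *text)
{
    const unsigned char *p = (const unsigned char *)text;
    int score = 0;
    int have_prev = 0;
    Block prev = BLK_LATIN;

    while (*p) {
        uint32_t cp;
        p += utf8_decode(p, &cp);
        Block cur = block_of(cp);

        if (cur == BLK_GENERAL_PUNCT)
            continue;                 /* assumption: neither scored nor remembered */
        if (have_prev && cur != prev && cur != BLK_EMOTICONS)
            score++;                  /* switching into Emoticons is free */
        prev = cur;
        have_prev = 1;
    }
    return score;
}
```

With these assumptions, text that ends in an emoji costs nothing, while text that continues in Latin after the emoji pays one point for switching back, which matches the behaviour described in the emoticon commit.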

11 Commits
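
For module authors affected by the breaking hook change, a handler with the new trailing ClientContext argument might look roughly like the sketch below. Everything declared here is a stand-in so that the example compiles on its own: the real types and constants come from the UnrealIRCd headers, the field types inside `TextAnalysis` are guesses (only the member names deconfused, num_bytes and num_unicode_characters appear in the commit messages), and `BANNED_WORD` and `example_can_send_to_user` are hypothetical. The point it illustrates is that clictx->textanalysis can be NULL and has to be checked before use.

```c
#include <stddef.h>
#include <string.h>

/* Stand-ins for UnrealIRCd definitions (assumptions, for illustration only). */
#define HOOK_CONTINUE 0
#define HOOK_DENY     1
typedef struct Client Client;
typedef int SendType;

typedef struct TextAnalysis {
    const char *deconfused;          /* member name from the log; type assumed */
    int num_bytes;                   /* member name from the log; type assumed */
    int num_unicode_characters;      /* member name from the log; type assumed */
} TextAnalysis;

typedef struct ClientContext {
    TextAnalysis *textanalysis;      /* may be NULL, see the commit message */
} ClientContext;

#define BANNED_WORD "ZZZZ"           /* hypothetical example pattern */

/* Hook handler using the new prototype with the extra clictx parameter. */
int example_can_send_to_user(Client *client, Client *target,
                             const char **text, const char **errmsg,
                             SendType sendtype, ClientContext *clictx)
{
    (void)client; (void)target; (void)text; (void)sendtype;

    /* No analysis available: remote client, not a PRIVMSG/NOTICE,
     * or the utf8functions module is not loaded. */
    if (clictx == NULL || clictx->textanalysis == NULL)
        return HOOK_CONTINUE;

    if (clictx->textanalysis->deconfused != NULL &&
        strstr(clictx->textanalysis->deconfused, BANNED_WORD) != NULL) {
        *errmsg = "Message blocked";
        return HOOK_DENY;
    }
    return HOOK_CONTINUE;
}
```

Because the hook runs once per target, a handler like this would be called four times for "PRIVMSG a,b,c,d :text", even though the analysis it reads was computed only once for the line.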