This was previously tried on 19-apr-2020 in bc70882bd3
in UnrealIRCd 5.0.5. Sadly it had to be reverted immediately with a quick 5.0.5.1
release, all because of a PCRE2 bug causing 100% CPU usage. Since then that bug has
been fixed, plus another bug. I'm now re-adding it "as an option" that is marked
experimental. Hopefully people test it out and report back if it works well, and
then we can make it the default someday.
This makes it a runtime setting, which makes it much easier to switch back and forth
if there are any issues, without recompiling anything. It did need a bit more code,
though, to handle the recompiling of spamfilters when the setting is changed.
Original issue was https://bugs.unrealircd.org/view.php?id=5187
* [Spamfilter](https://www.unrealircd.org/docs/Spamfilter) can be made UTF8-aware.
* This is experimental, to enable: `set { spamfilter { utf8 yes; } }`
* Case insensitive matches will then work better. For example, with extended
Latin, a spamfilter on `ę` then also matches `Ę`.
* Other PCRE2 features such as [\p](https://www.pcre.org/current/doc/html/pcre2syntax.html#SEC5)
can then be used. For example you can then set a spamfilter with the regex
`\p{Arabic}` to block all Arabic script.
Please do use these new tools with care. Blocking an entire language
or script is quite a drastic measure.
* As a consequence of this we require PCRE2 10.36 or newer. If your system
PCRE2 is older, the UnrealIRCd-shipped library version will be compiled
instead and `./Config` may take a little longer than usual.
When you set this to 'yes' you get more options...
See next (modified) copy-paste from April 2020, which had to be reverted
because PCRE2 was broken. Now it's an opt-in and hopefully matured a bit.
This means:
* Case insensitive matches work better in UTF8 now, such as extended Latin.
For example, a spamfilter on "ę" now also matches "Ę", while previously
it did not catch this.
* Other PCRE2 features such as https://www.pcre.org/current/doc/html/pcre2syntax.html#SEC5
are now available. For example you can now set a spamfilter with the regex
\p{Arabic} to block all Arabic script, or
\p{Cyrillic} to block all Cyrillic script (such as Russian)
Use these new tools with care, of course. Blocking an entire language,
or script, is quite a drastic measure.
All of this was possible because of the new PCRE2_MATCH_INVALID_UTF
compile time option which was introduced in PCRE2 10.34. Now, that
version turned out to be buggy. As recently as PCRE2 10.36 some major bugs
were fixed. This also means we now require at least PCRE2 10.36 version
so everyone can benefit from this new spamfilter UTF8 feature, IF they
enable set::spamfilter::utf8-support, that is.
Many systems come with older PCRE2 versions so this means we will
fall back to the shipped PCRE2 version in UnrealIRCd. This means
./Config will take a little longer to compile things.
For packagers (rpm/deb/ports): if you choose to patch configure to
not require such a recent PCRE2, then please do not allow enabling
of set::spamfilter::utf8-support since it will likely cause crashes
and misbehavior. Check PCRE2 changelog, CTRL+F at PCRE2_MATCH_INVALID_UTF
version or newer on the system, otherwise we fall back to shipped version.
This fixes https://bugs.unrealircd.org/view.php?id=5187 among others.
It means:
* Case insensitive matches work better in UTF8 now, such as extended Latin.
For example, a spamfilter on "ę" now also matches "Ę", while previously
it did not catch this.
* Other PCRE2 features such as https://www.pcre.org/current/doc/html/pcre2syntax.html#SEC5
are now available. For example you can now set a spamfilter with the regex
\p{Arabic} to block all Arabic script, or
\p{Cyrillic} to block all Cyrillic script (such as Russian)
Use these new tools with care, of course. Blocking an entire language,
or script, is quite a drastic measure.
All of this was possible because of the new PCRE2_MATCH_INVALID_UTF
compile time option which was introduced in PCRE2 10.34.
This also means we now require at least that PCRE2 version so
everyone can benefit from this new spamfilter UTF8 feature.
Many systems come with older PCRE2 versions so this means we will
fall back to the shipped PCRE2 version in UnrealIRCd. This means
./Config will take a little longer to compile things.
Although there is no indication of problems as of now, if this feature
breaks things heavily then it might get reverted or made configurable.
This is also why it was added just after the 5.0.4 release and not right
before it: it needs some heavy testing.
extern int strnatcmp(char const *a, char const *b);
extern int strnatcasecmp(char const *a, char const *b);
This will be handy for version comparisons. For example they will
return -1 (=lower) for things like ("1.4.9", "1.4.10"), unlike strcmp.
Also, some loosely related spelling fixes elsewhere.
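To illustrate why natural ordering matters for version strings, here is a minimal sketch of the idea. It is not the strnatcmp() code itself: the name `natcmp_sketch` is made up for this example, and it only assumes that a run of ASCII digits should be compared as one number.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper, NOT the real strnatcmp(): compares two strings while
 * treating each run of ASCII digits as a single number.
 * Returns -1 (a is lower), 0 (equal) or 1 (a is higher). */
int natcmp_sketch(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    while (*a && *b) {
        if (isdigit((unsigned char)*a) && isdigit((unsigned char)*b)) {
            size_t la = 0, lb = 0, i;

            /* Skip leading zeros so "007" and "7" count as the same number. */
            while (*a == '0') a++;
            while (*b == '0') b++;
            /* Measure both digit runs: the longer run is the larger number. */
            while (isdigit((unsigned char)a[la])) la++;
            while (isdigit((unsigned char)b[lb])) lb++;
            if (la != lb)
                return la < lb ? -1 : 1;
            /* Same length: the first differing digit decides. */
            for (i = 0; i < la; i++) {
                if (a[i] != b[i])
                    return a[i] < b[i] ? -1 : 1;
            }
            a += la;
            b += lb;
        } else {
            if (*a != *b)
                return (unsigned char)*a < (unsigned char)*b ? -1 : 1;
            a++;
            b++;
        }
    }
    if (*a == *b)
        return 0;           /* both strings ended at the same time */
    return *a ? 1 : -1;     /* the shorter string sorts first */
}
```

With this sketch, ("1.4.9", "1.4.10") compares as lower because the last component is read as 9 versus 10, whereas strcmp stops at the character '9' versus '1' and reports the opposite.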
aChannel to Channel, and some more. Third party module coders will
love this. But.. it makes things more logical and the doxygen output
will look cleaner and more logical as well.
(More changes will follow)
of match_simple() and match_esc(). So, developers, be aware, this is how
you should use the function in a correct way:
    if (match_simple("*fun*", str))
        printf("It was fun\n");
Rationale:
I've always been annoyed by the inverted logic, even though it was similar
to strcmp. So I've reversed it.
I could have chosen to maintain match() rather than this match_simple()
name, but this way I force (3rd party module) devs to update their function,
while otherwise everything would mysteriously fail due to the inverted logic.
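For module developers who want to see the new convention in isolation, here is a small stand-alone wildcard matcher that follows it: non-zero means "matched". This is only a sketch and not the code in src/match.c. The name `wild_match_sketch` is hypothetical and the case-insensitive comparison is an assumption made for the example.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical stand-in for a match_simple()-style function.
 * Supports '*' (any run of characters) and '?' (any single character).
 * Returns 1 on a match and 0 otherwise, so it can be used directly in an
 * if () without the strcmp-like "== 0" test. */
int wild_match_sketch(const char *pattern, const char *text)
{
    const char *star = NULL;    /* pattern position just after the last '*' */
    const char *retry = NULL;   /* text position to resume from on backtrack */

    assert(pattern != NULL && text != NULL);
    while (*text) {
        if (*pattern == '*') {
            /* Remember where to come back to if the rest fails. */
            star = ++pattern;
            retry = text;
        } else if (*pattern == '?' ||
                   tolower((unsigned char)*pattern) ==
                   tolower((unsigned char)*text)) {
            pattern++;
            text++;
        } else if (star) {
            /* Let the last '*' swallow one more character and try again. */
            pattern = star;
            text = ++retry;
        } else {
            return 0;
        }
    }
    /* Any trailing '*' can match the empty string. */
    while (*pattern == '*')
        pattern++;
    return *pattern == '\0';
}
```

Code written for the old convention would have tested `!match(...)`. With the inverted return value that test silently does the opposite, which is why the rename forces a compile error instead.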
and has various outstanding crash and 100% CPU issues.
We have been encouraging the PCRE2 engine since the start of
UnrealIRCd 4 already.
TRE is being phased out of U4 by the end of the year, so we can
safely remove it in U5 already.
usually the fast badwords system is used instead)
* Code deduplication in src/modules/{chanmodes,usermodes}/censor.c
to src/match.c -- which may be moved later again to efuncs.
* Add --without-tre:
This means USE_TRE will be enabled by default right now
but if using --without-tre it will be undef'ed. This is so we
can prepare for the TRE phase-out in 2020.
* Remove include/badwords.h, put contents in include/struct.h
In 3.2.x we didn't fix these bugs since servers are trusted and
should send correct commands. In 4.0.x we changed this so we would
fix them when we come across such issues at normal priority (not
considering them security issues). I have now taken it a step further and
actively checked/looked for these issues and a bunch of them were
found. Almost all are NULL pointer dereferences, with some exceptions.
* S2S: MODE: check conv_param return value (NULL ptr crash)
* S2S: MODE: floodprot: More checks (NULL ptr crash)
* S2S: MODE: OOB write of NULL (write NULL past last element in an array)
* S2S: NICK: old compat fixes (NULL ptr crash)
* S2S: PROTOCTL: Check for double SID=
* S2S: SERVER: require at least 3 parameters (NULL ptr crash)
* S2S: SJOIN: require at least 3 parameters (NULL ptr crash)
* S2S: SJOIN: Fix OOB read (read 1 byte past buffer)
* S2S: TKL: validate set_at and expire_at (NULL ptr crash)
* S2S: TKL: require at least 9 parameters for spamf, not 8 (NULL ptr crash)
* S2S: TKL: ignore invalid spamfilter matching type (remove abort() call)
* S2S: TOPIC: querying for topic is not permitted (NULL ptr crash)
* S2S: UID: require 12 parameters (NULL ptr crash)
* S2S: WATCH: this is not a server command (NULL ptr crash)
* Fix OOB read (1 byte beyond string) for timevals. This was reachable
from config code, TKL (S2S) and /*LINE (Oper). In practice no crash.
* MODE: make code less confusing (effectively no change)
* TRACE: remove strange output in case of 0 lines of output
* Fix unimportant memory leak on boot (#4713, reported by dg)
* Fix small memory leak upon 'DNS i' (oper only command)
* Always work on a copy in clean_ban_mask(). This fixes a bug that could
result in a strlcpy(buf, buf, sizeof(buf)). So, overlapping strings,
which is undefined behavior.
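The "always work on a copy" fix from the last item can be sketched like this. It is not the real clean_ban_mask(): the function name, the buffer size and the trivial "add !*@* to a bare nick" cleanup are all assumptions made to keep the example short. The point is only that the input is copied before the static output buffer is written.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MASK_BUF_LEN 128    /* assumed size, for illustration only */

/* Hypothetical sketch of a mask cleaner that returns a static buffer.
 * The caller may pass in a pointer that was returned by an earlier call,
 * so the input can alias the output buffer. */
const char *clean_mask_sketch(const char *mask)
{
    static char out[MASK_BUF_LEN];
    char copy[MASK_BUF_LEN];

    assert(mask != NULL);
    /* Take a private copy first. 'copy' never overlaps 'out', so this is
     * well defined even when mask == out. */
    snprintf(copy, sizeof(copy), "%s", mask);

    /* From here on only 'copy' is read and only 'out' is written. */
    if (strchr(copy, '!') == NULL && strchr(copy, '@') == NULL)
        snprintf(out, sizeof(out), "%s!*@*", copy);
    else
        snprintf(out, sizeof(out), "%s", copy);
    return out;
}
```

Without the intermediate copy, feeding the result of one call into the next would copy a buffer onto itself, which is the overlapping-strings case described above.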
Tizen, DBoyz and Valdebrick helped trace the issue.
Removed MATCH_USE_IDENT since it had no useful purpose.. for all cases one has to check identd first and then non-identd anyway.
* add general matching framework (aMatch type, unreal_match_xxx functions)
* change spamfilter { } block syntax
* add support for simple wildcard matching (non-regex, just '?' and '*')
This is the initial commit so the new lib is not in yet, 'regex' is not
functional (but 'posix' and 'simple' are working), linking has not been
fully tested and no warnings are printed yet. IOW: work in progress!
means no longer weird issues with +b *\* etc not banning nicks with \ in it.
ExtBan ~c/~r get special treatment and will use our match_esc [match with escaping]
routine, that way you can ban channels such as "#f*ck" via "+b ~c:#f\*ck".
Fix triggered by bugreport of vonitsanet (#0002782).
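How matching with escaping differs from plain wildcard matching can be shown with a small sketch. This is not the match_esc routine itself: `esc_match_sketch` is a made-up name, and the rules used here (a backslash makes the next pattern character literal, comparison is case-insensitive) are assumptions for the example.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Case-insensitive comparison of two single characters. */
static int same_char(char a, char b)
{
    return tolower((unsigned char)a) == tolower((unsigned char)b);
}

/* Hypothetical wildcard matcher with escaping. '*' and '?' are wildcards
 * unless preceded by a backslash, in which case they must appear literally
 * in the text. Returns 1 on a match, 0 otherwise. */
int esc_match_sketch(const char *pattern, const char *text)
{
    const char *star = NULL;    /* pattern position just after the last '*' */
    const char *retry = NULL;   /* text position to resume from on backtrack */

    assert(pattern != NULL && text != NULL);
    while (*text) {
        int matched;
        size_t step = 1;

        if (*pattern == '*') {
            star = ++pattern;
            retry = text;
            continue;
        }
        if (*pattern == '\\' && pattern[1] != '\0') {
            /* Escaped character: compare it literally, consume two bytes. */
            matched = same_char(pattern[1], *text);
            step = 2;
        } else if (*pattern == '?') {
            matched = 1;
        } else {
            matched = same_char(*pattern, *text);
        }

        if (matched) {
            pattern += step;
            text++;
        } else if (star) {
            /* Let the last '*' swallow one more character and try again. */
            pattern = star;
            text = ++retry;
        } else {
            return 0;
        }
    }
    while (*pattern == '*')
        pattern++;
    return *pattern == '\0';
}
```

With these rules `#f\*ck` matches only the channel literally named `#f*ck`, while the unescaped `#f*ck` would also match names such as `#flock`.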
the switchover we were accidentally using different ones which caused funny kill messages
like "You were killed by a.b.c (a!a.b.c (SOMENICK[N\A](?) <- d.e.f))." This also broke
some bans in pre2/rc1. Bug reported by HERZ (#0002772).
but is actually understandable and has fewer bugs. This fixes +b ~c:#c\*t not properly
matching #c*t, reported by Jason (#0002752). Initial results look good, but this needs
some good testing ;).
still cut off if the nick is too long. Basically this is the same way Hybrid does it
so it should work ok :).
- Added nick character system. This allows you to choose which (additional) characters
to allow in nicks via set::allowed-nickchars. See unreal32docs.html -> section 3.16
for a list of available languages and more info on how to use it.
Current list: dutch, french, german, italian, spanish, euro-west, chinese-trad,
chinese-simp, chinese-ja, chinese.
If you wonder why your language is not yet included or why a certain mistake is present,
then please understand that we are most likely not experienced (at all) in your language.
If you are a native of your language (or know the language well), and your language
is not included yet or you have some corrections, then contact syzop@vulnscan.org or
report it as a bug on http://bugs.unrealircd.org/
These bans look like ~<type>:<stuff>. Currently the following bans are available:
~q: quiet bans (ex: ~q:*!*@blah.blah.com). People matching these bans can join
but are unable to speak, unless they have +v or higher.
~c: channel bans (ex: ~c:#idiots). People in #idiots are unable to join the channel.
~r: gecos (realname) bans (ex: ~r:*Stupid_bot_script*). If the realname of a user
matches this then (s)he is unable to join.
NOTE: an underscore ('_') matches both a space (' ') and an underscore ('_'),
so this ban would match 'Stupid bot script v1.4'.
These bantypes can also be used in the channel exception list (+e).
+e ~r:*w00t* makes anyone with 'w00t' in their realname able to join,
and +e ~c:#admin makes anyone in #admin able to join, etc..
This system allows modules to add extended bantypes too.
This feature requires some additional testing, also the module interface will
probably be changed in the next few weeks, and perhaps more extended bans will
be added before next release.. we'll see...
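As a rough illustration of the ~<type>:<stuff> format, the following sketch splits a ban string into its type letter and its argument. It is not the module interface mentioned above. Both function names are hypothetical, and it assumes the type is exactly one character, as in ~q, ~c and ~r.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helpers, not the real extended ban API.
 * Assumption: an extended ban is '~', one type letter, ':' and then the
 * argument. Anything else is treated as an ordinary ban mask. */

/* Returns the type letter of an extended ban, or 0 for an ordinary mask. */
char extban_type_sketch(const char *ban)
{
    assert(ban != NULL);
    if (ban[0] == '~' && ban[1] != '\0' && ban[2] == ':')
        return ban[1];
    return 0;
}

/* Returns the <stuff> part after "~X:", or NULL for an ordinary mask. */
const char *extban_arg_sketch(const char *ban)
{
    return extban_type_sketch(ban) ? ban + 3 : NULL;
}
```

A caller could switch on the returned letter ('q', 'c', 'r') and hand the argument to the matching check, which is also roughly where a module-provided type would plug in.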