On Jun 1, 2009, at 8:36 PM, till wrote:
And what's the performance trade off to always converting?
It isn't the processing needed to do the encoding conversion; it is
that each message has a regex search to see how it should be converted.
The way I read that code, even if the message is UTF-8, the regex
will be done to determine the validity of the if statement.
if ($from == "ISO-8859-1" && preg_match("/[\x80-\x9F]/", $str)) $from = "WINDOWS-1252";
Maybe there should be a nested if statement, so that only messages
that are marked as ISO-8859-1 are tested for the Black Hole of
Windows-1252.
if ($from == "ISO-8859-1") if (preg_match("/[\x80-\x9F]/", $str)) $from = "WINDOWS-1252";
With UTF-8 becoming more common, the regex would then be skipped for
what is likely the bulk of messages.
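
A minimal sketch of that guard, written as a standalone function. The
name guess_source_charset() is hypothetical and is not RoundCube's actual
code; it assumes $from has already been normalised to upper case. Note
that PHP's && operator short-circuits, so the one-line form should
already skip preg_match() when $from is not ISO-8859-1; the nested form
mainly makes the guard explicit. The counter at the end is only there to
show that.

```php
<?php
// Hypothetical helper (not RoundCube's API): decide which charset to
// convert from, given the declared charset and the raw message bytes.
function guess_source_charset(string $from, string $str): string
{
    // Only messages declared as ISO-8859-1 are candidates for the check.
    if ($from === "ISO-8859-1") {
        // Bytes 0x80-0x9F are C1 control codes in ISO-8859-1 but printable
        // characters (curly quotes, dashes, euro sign) in Windows-1252.
        if (preg_match("/[\x80-\x9F]/", $str)) {
            return "WINDOWS-1252";
        }
    }
    return $from;
}

// Short-circuit check: count how often the regex actually runs.
$regex_runs = 0;
$has_c1_bytes = function (string $s) use (&$regex_runs): bool {
    $regex_runs++;
    return preg_match("/[\x80-\x9F]/", $s) === 1;
};

$from = "UTF-8";
$str  = "it\xE2\x80\x99s";  // UTF-8 right single quote, contains 0x80 and 0x99
if ($from == "ISO-8859-1" && $has_c1_bytes($str)) {
    $from = "WINDOWS-1252";
}
// $regex_runs is still 0 here and $from is still "UTF-8".
```

The UTF-8 sample is chosen on purpose: its continuation bytes fall in
0x80-0x9F, so the regex would match if it were ever run on UTF-8 data.
That is why the ISO-8859-1 guard has to come first.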
However, the same problem could occur no matter what the message
header says the encoding should be. A message that has a UTF-8 header
could very well have WINDOWS-1252 encoding inside it. The above
solution works because, as the OP said:
The Windows-1252 character set is effectively a superset of the
iso-8859-1 character set,
That is not true of WINDOWS-1252 encoded data marked as, or should I
say masquerading as, UTF-8 content.
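
To illustrate what a heuristic for that case would have to look like,
here is a rough sketch. The function name to_utf8_with_fallback() is
made up, it assumes the mbstring extension is loaded, and the choice of
Windows-1252 as the fallback is a guess, not something the message
itself tells you. Data that fails UTF-8 validation could just as well be
in some other 8-bit encoding.

```php
<?php
// Hypothetical fallback (assumes ext/mbstring): return $str as UTF-8,
// guessing Windows-1252 when data labelled UTF-8 is not valid UTF-8.
function to_utf8_with_fallback(string $str, string $declared): string
{
    if (strcasecmp($declared, "UTF-8") === 0) {
        // Well-formed UTF-8 is passed through untouched.
        if (mb_check_encoding($str, "UTF-8")) {
            return $str;
        }
        // Invalid UTF-8: assume (and it is only an assumption) that the
        // sender really used Windows-1252.
        return mb_convert_encoding($str, "UTF-8", "Windows-1252");
    }
    // Any other label is trusted as-is in this sketch.
    return mb_convert_encoding($str, "UTF-8", $declared);
}

// Byte 0x92 is a curly apostrophe in Windows-1252 and invalid as UTF-8.
$fixed = to_utf8_with_fallback("it\x92s", "UTF-8");
```

Unlike the ISO-8859-1 case, this needs a full validity scan of every
message labelled UTF-8, which is exactly the per-message cost being
questioned here.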
Does RC really want to parse all messages and apply heuristics to
determine the encoding?
Yes, this is a relatively simple case, but you open the door for
other patches to solve other specific encoding mismatches.
We have no numbers on how often this exact encoding mismatch happens
other than "I ran into this once."
No offense to the OP; he provided a simple fix to the problem, but it
is a very specific problem.
Here's one to fix:
If you subscribe to a mailing list run by mailman in plain digest mode,
it doesn't convert the incoming messages to a consistent encoding. It
just mashes each original message, in its original encoding, into a
digest message that is labeled as 7-bit us-ascii. How does RoundCube
handle that? It punts, because it is an upstream problem.
BTW, the MIME digest mode of mailman makes each message a separate
part that is labeled with its own encoding (but then you get
attachments to messages, which is sub-optimal for me).