Devs,
I recently received a bulk e-mail from an event organizer that displayed in RoundCube (using Firefox 3) with the little square hex-code glyphs in place of some of the punctuation marks. I researched why this was happening, and tracked it down to an encoding issue.
The text/html message part in the e-mail source specified iso-8859-1 encoding. After RoundCube converted the message part to UTF-8, there were still non-UTF8 characters in the resulting text. One such character was 0x92, which is not a printable iso-8859-1 character (that charset reserves 0x80-0x9F for C1 control codes). It turns out that the message originator must have been using Windows-1252 encoding (in which 0x92 is the right single quotation mark, which was correct in the context in which it appeared), but incorrectly specified iso-8859-1 encoding in the MIME message.
The Windows-1252 character set is effectively a superset of the iso-8859-1 character set, replacing some of the seldom-used control character code points with additional punctuation and accent characters. Some mail agents incorrectly blur the line between these two encodings, and send Windows-1252 characters in iso-8859-1 messages.
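To make the difference concrete, here is a minimal sketch in Python (not RoundCube's PHP) of how one byte string decodes under each label. The sample bytes are an invented example, not taken from the original message.

```python
# Hypothetical sample: "It's" with the apostrophe stored as byte 0x92.
raw = b"It\x92s"

# Under the advertised label, 0x92 becomes U+0092, an invisible C1 control
# that browsers tend to render as a hex-code box.
as_latin1 = raw.decode("iso-8859-1")

# Under Windows-1252, 0x92 becomes U+2019, the right single quotation mark.
as_cp1252 = raw.decode("windows-1252")

print(repr(as_latin1))
print(repr(as_cp1252))
```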
The following workaround (in rcube_charset_convert()) corrects the issue (at least for my one test case):
// Workaround for mail agents that include Windows-1252 characters
// in text advertised as ISO-8859-1
if ($from == "ISO-8859-1" && preg_match("/[\x80-\x9F]/", $str))
    $from = "WINDOWS-1252";
What does everyone think of including a workaround like this? I'm generally reluctant to work around improper behavior from other software, but this particular kind of relaxed interpretation seems common (check out the ISO-8859-1 page on Wikipedia).
On Jun 1, 2009, at 11:25 AM, Eric Stadtherr wrote:
// Workaround for mail agents that include Windows-1252 characters
// in text advertised as ISO-8859-1
if ($from == "ISO-8859-1" && preg_match("/[\x80-\x9F]/", $str))
    $from = "WINDOWS-1252";
Since $str can be very large, isn't there a performance / resource
penalty to the preg_match in order to compensate for a mistake that
is outside of RoundCube ?
Yes, this may help you with this particular sender, but the potential
is for each and every RoundCube user to pay this performance price
processing each and every message. Maybe my old-school sensibilities
are too worried about CPU time with today's CPUs.
On Mon, 1 Jun 2009 12:44:35 -0500, chasd chasd@silveroaks.com wrote:
On Jun 1, 2009, at 11:25 AM, Eric Stadtherr wrote:
// Workaround for mail agents that include Windows-1252 characters
// in text advertised as ISO-8859-1
if ($from == "ISO-8859-1" && preg_match("/[\x80-\x9F]/", $str))
    $from = "WINDOWS-1252";
Since $str can be very large, isn't there a performance / resource
penalty to the preg_match in order to compensate for a mistake that
is outside of RoundCube ?
Yes, this may help you with this particular sender, but the potential
is for each and every RoundCube user to pay this performance price
processing each and every message. Maybe my old-school sensibilities
are too worried about CPU time with today's CPUs.
At least preg_match stops after first encountering the search string. I guess the worst case is a large iso-8859-1 part that is truly iso-8859-1 and preg_match has to search the whole thing.
The other option (and this is *shudder* recommended in the HTML 5 draft spec) is to just always decode iso-8859-1 strings as windows-1252.
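The "always decode as windows-1252" option could look roughly like the following Python sketch. The function name `decode_part` and the label set are hypothetical, and the fallback exists only because Python's cp1252 codec leaves five bytes undefined; a PHP converter may behave differently there.

```python
# Labels treated as ISO-8859-1 (an assumed, incomplete alias list).
LATIN1_LABELS = {"iso-8859-1", "latin1", "latin-1"}

def decode_part(raw: bytes, label: str) -> str:
    """Decode a message part, reading an ISO-8859-1 label as Windows-1252."""
    charset = label.strip().lower()
    if charset not in LATIN1_LABELS:
        # Any other label is honoured as written.
        return raw.decode(charset)
    try:
        return raw.decode("windows-1252")
    except UnicodeDecodeError:
        # Python's cp1252 codec has no mapping for 0x81, 0x8D, 0x8F, 0x90
        # and 0x9D, so fall back to the label's literal meaning.
        return raw.decode("iso-8859-1")
```

No scan of the text is needed beforehand: the label alone decides which decoder runs.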
On Mon, Jun 1, 2009 at 8:34 PM, Eric Stadtherr estadtherr@gmail.com wrote:
On Mon, 1 Jun 2009 12:44:35 -0500, chasd chasd@silveroaks.com wrote:
On Jun 1, 2009, at 11:25 AM, Eric Stadtherr wrote:
// Workaround for mail agents that include Windows-1252 characters
// in text advertised as ISO-8859-1
if ($from == "ISO-8859-1" && preg_match("/[\x80-\x9F]/", $str))
    $from = "WINDOWS-1252";
Since $str can be very large, isn't there a performance / resource penalty to the preg_match in order to compensate for a mistake that is outside of RoundCube ?
Yes, this may help you with this particular sender, but the potential is for each and every RoundCube user to pay this performance price processing each and every message. Maybe my old-school sensibilities are too worried about CPU time with today's CPUs.
At least preg_match stops after first encountering the search string. I guess the worst case is a large iso-8859-1 part that is truly iso-8859-1 and preg_match has to search the whole thing.
The other option (and this is *shudder* recommended in the HTML 5 draft spec) is to just always decode iso-8859-1 strings as windows-1252.
And what's the performance trade off to always converting?
Maybe we open an issue and keep trac(k) of the problem. If more people have the same issue, then we should think about fixing it. My proposal in this case would be to tell them [the event organizer] that there's an obvious flaw in their mailings which prevents their customer from viewing it.
I personally don't really want to fix other people's issues and make RoundCube slower. Doesn't sound like win, win. ;-)
Till
List info: http://lists.roundcube.net/dev/
till wrote:
The other option (and this is *shudder* recommended in the HTML 5 draft spec) is to just always decode iso-8859-1 strings as windows-1252.
And what's the performance trade off to always converting?
In my opinion this is performance neutral. +1 for this option.
till wrote:
The other option (and this is *shudder* recommended in the HTML 5 draft spec) is to just always decode iso-8859-1 strings as windows-1252.
And what's the performance trade off to always converting?
Maybe we open an issue and keep trac(k) of the problem. If more people have the same issue, then we should think about fixing it. My proposal in this case would be to tell them [the event organizer] that there's an obvious flaw in their mailings which prevents their customer from viewing it.
I personally don't really want to fix other people's issues and make RoundCube slower. Doesn't sound like win, win. ;-)
If you think about it, there's not really a performance penalty to interpreting iso-8859-1 text as windows-1252. The only difference is that if a byte in the range 0x80-0x9F is encountered in the text, it is converted to the UTF-8 encoding of the corresponding windows-1252 character. If no bytes are encountered in that range, there is no difference in behavior.
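That claim is easy to check with Python's built-in codecs, used here as a stand-in for whatever converter RoundCube calls. The variable names are illustrative only.

```python
# Every byte value except the disputed 0x80-0x9F range.
outside = bytes(b for b in range(256) if not 0x80 <= b <= 0x9F)

# Both charsets should decode these bytes to exactly the same text.
same_outside = outside.decode("iso-8859-1") == outside.decode("windows-1252")

# Inside the range, count the bytes whose decoding differs. errors="replace"
# covers the few bytes Python's cp1252 codec leaves undefined.
differing = [
    b for b in range(0x80, 0xA0)
    if bytes([b]).decode("iso-8859-1")
    != bytes([b]).decode("windows-1252", errors="replace")
]

print(same_outside, len(differing))
```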
I'm starting to like that option, despite it being a deliberate workaround for rule-breakers!
-Eric
On Jun 1, 2009, at 8:36 PM, till wrote:
And what's the performance trade off to always converting?
It isn't the processing to do the encoding conversion, it is that
each message has a regex search to see how it should be converted.
The way I read that code, even if the message is UTF-8, the regex
will be done to determine the validity of the if statement.
if ($from == "ISO-8859-1" && preg_match("/[\x80-\x9F]/", $str))
    $from = "WINDOWS-1252";
Maybe there should be a nested if statement, so that only messages
that are marked as ISO-8859-1 are tested for the Black Hole of Windows-1252.
if ($from == "ISO-8859-1")
    if (preg_match("/[\x80-\x9F]/", $str))
        $from = "WINDOWS-1252";
With UTF-8 becoming more common, that would make the regex be skipped
for likely the bulk of messages.
However, the same problem could occur no matter what the message
header says the encoding should be. A message that has a UTF-8 header
could very well have WINDOWS-1252 encoding inside it. The above
solution works because as the OP said :
The Windows-1252 character set is effectively a superset of the
iso-8859-1 character set,
Not true of WINDOWS-1252 encoded data marked as, or should I say
masquerading as, UTF-8 content.
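A short Python sketch of why the mislabeled-as-UTF-8 case is different: a lone Windows-1252 byte is not valid UTF-8 at all, and reading real UTF-8 as Windows-1252 garbles it. The sample strings are invented for illustration.

```python
# Windows-1252 bytes that a sender has labelled as UTF-8.
cp1252_bytes = b"It\x92s"

try:
    cp1252_bytes.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    # 0x92 is a continuation byte with no lead byte, so decoding fails.
    valid_utf8 = False

# The reverse mistake: genuine UTF-8 read as Windows-1252 yields mojibake,
# so UTF-8 cannot simply be reinterpreted the way ISO-8859-1 can.
mojibake = "\u00e9".encode("utf-8").decode("windows-1252")

print(valid_utf8, mojibake)
```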
Does RC really want to parse all messages and apply heuristics to
determine the encoding ?
Yes, this is a relatively simple case, but you open the door for
other patches to solve other specific encoding mismatches.
We have no numbers as to how often this exact encoding mismatch
happens other than " I ran into this once. "
No offense to the OP, he provided a simple fix to the problem, but it
is a very specific problem.
Here's one to fix :
If you subscribe to a mail list run by mailman in plain digest mode,
it doesn't convert the incoming messages to a consistent encoding, it
just mashes the original message in its original encoding into the
digest message that is labeled as 7-bit us-ascii. How does RoundCube
handle that ? It punts because it is an upstream problem.
BTW, the MIME digest mode of mailman makes each message a separate
part that is labeled with its own encoding ( but then you get
attachments to messages, which is sub-optimal for me).
On Tue, 2 Jun 2009 09:24:01 -0500, chasd chasd@silveroaks.com wrote:
On Jun 1, 2009, at 8:36 PM, till wrote:
And what's the performance trade off to always converting?
It isn't the processing to do the encoding conversion, it is that
each message has a regex search to see how it should be converted.
The way I read that code, even if the message is UTF-8, the regex
will be done to determine the validity of the if statement.
if ($from == "ISO-8859-1" && preg_match("/[\x80-\x9F]/", $str))
    $from = "WINDOWS-1252";
Like most other languages, PHP won't evaluate the second sub-expression in an " && " expression if the first evaluates to false. My proposed order was intentional based on that fact.
In any case (see later e-mails) it seems most efficient to skip the regex search and just interpret ISO-8859-1 as Windows-1252 in all cases. No harm done if the text was labeled correctly.
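The short-circuit behavior described above can be shown with a Python analogue of the proposed check (Python's `and` short-circuits the same way PHP's `&&` does). The helper names and the `scan_log` list are hypothetical and exist only to record when the regex actually runs.

```python
import re

scan_log = []  # one entry per regex scan actually performed

def has_c1_bytes(data: bytes) -> bool:
    """Scan for any byte in 0x80-0x9F, recording that a scan happened."""
    scan_log.append(len(data))
    return re.search(rb"[\x80-\x9f]", data) is not None

def effective_charset(label: str, data: bytes) -> str:
    # The label comparison comes first, so the scan is skipped entirely
    # unless the part is labelled ISO-8859-1.
    if label.upper() == "ISO-8859-1" and has_c1_bytes(data):
        return "WINDOWS-1252"
    return label

print(effective_charset("UTF-8", b"It\x92s"), len(scan_log))
print(effective_charset("ISO-8859-1", b"It\x92s"), len(scan_log))
```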
Maybe there should be a nested if statement, so that only messages
that are marked as ISO-8859-1 are tested for the Black Hole of Windows-1252.
if ($from == "ISO-8859-1")
    if (preg_match("/[\x80-\x9F]/", $str))
        $from = "WINDOWS-1252";
With UTF-8 becoming more common, that would make the regex be skipped
for likely the bulk of messages.
However, the same problem could occur no matter what the message
header says the encoding should be. A message that has a UTF-8 header
could very well have WINDOWS-1252 encoding inside it. The above
solution works because as the OP said :
The Windows-1252 character set is effectively a superset of the
iso-8859-1 character set,
Not true of WINDOWS-1252 encoded data marked as, or should I say
masquerading as, UTF-8 content.
Does RC really want to parse all messages and apply heuristics to
determine the encoding ?
Yes, this is a relatively simple case, but you open the door for
other patches to solve other specific encoding mismatches.
We have no numbers as to how often this exact encoding mismatch
happens other than " I ran into this once. "
No offense to the OP, he provided a simple fix to the problem, but it
is a very specific problem.
It is a very specific problem, but a common problem nonetheless. For example, HTML 5 *requires* this misinterpretation:
http://dev.w3.org/html5/spec/Overview.html#character-encodings-0
Here's one to fix :
If you subscribe to a mail list run by mailman in plain digest mode,
it doesn't convert the incoming messages to a consistent encoding, it
just mashes the original message in its original encoding into the
digest message that is labeled as 7-bit us-ascii. How does RoundCube
handle that ? It punts because it is an upstream problem.
BTW, the MIME digest mode of mailman makes each message a separate
part that is labeled with its own encoding ( but then you get
attachments to messages, which is sub-optimal for me).
In a case like you described, RoundCube has no knowledge of the original encoding. In the workaround I'm suggesting, a specific no-cost re-interpretation would be applied based on foreknowledge of common mislabeling.
On Jun 2, 2009, at 10:09 AM, Eric Stadtherr wrote:
Like most other languages, PHP won't evaluate the second sub-expression in an " && " expression if the first evaluates to false. My proposed order was intentional based on that fact.
OK, I don't have a hard CS background, didn't know that.
It is a very specific problem, but a common problem nonetheless. For example, HTML 5 *requires* this misinterpretation:
http://dev.w3.org/html5/spec/Overview.html#character-encodings-0
Wow.
That is a huge break from XHTML.
I haven't really read through the HTML 5 spec ( in fact, the draft
was updated today, again ) so I wasn't aware of that.
At least there is a big disclaimer :
Note: The requirement to treat certain encodings as other encodings
according to the table above is a willful violation of the W3C
Character Model specification, motivated by a desire for
compatibility with legacy content
I see what's going on now, and I agree, mimicking what a HTML 5
browser would do is a good idea.
Eric Stadtherr wrote:
It is a very specific problem, but a common problem nonetheless. For example, HTML 5 *requires* this misinterpretation:
http://dev.w3.org/html5/spec/Overview.html#character-encodings-0
To interpret the encoding label as written and keep the behavior clearly defined, while sparing properly labelled messages the cost of a regex search, I have the following suggestion:
On iso-8859-1-labelled messages, provide a "fix encoding" button in an unobtrusive place, like the lower edge of the message. Users can then click this button when they see an encoding problem.
When the button is clicked RC would re-read the message, interpreting the iso-8859-1 part as windows-1252.
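The decoding side of such a button might be sketched as below, in Python rather than RoundCube's PHP. The function name `render_part` and the `fix_encoding` flag are hypothetical; the flag stands for "the user clicked the button".

```python
def render_part(raw: bytes, label: str, fix_encoding: bool = False) -> str:
    """Decode a part as labelled unless the user asked for the fix."""
    charset = label.strip().lower()
    if fix_encoding and charset == "iso-8859-1":
        # Re-read the same bytes as Windows-1252. errors="replace" covers
        # the few bytes that Python's cp1252 codec leaves undefined.
        return raw.decode("windows-1252", errors="replace")
    # Default path: trust the label exactly as the sender wrote it.
    return raw.decode(charset)

default_view = render_part(b"It\x92s", "ISO-8859-1")
fixed_view = render_part(b"It\x92s", "ISO-8859-1", fix_encoding=True)
print(repr(default_view), repr(fixed_view))
```

With this shape, properly labelled messages never pay for a scan, and the reinterpretation happens only on an explicit user request.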
Sincerely, Sebastian