[RCD] URLs with 8bit chars?

Thomas Bruederli thomas at roundcube.net
Sat Feb 22 16:02:02 CET 2014

On Sat, Feb 22, 2014 at 3:47 PM, Rimas Kudelis <rq at akl.lt> wrote:
> On 2014.02.22 14:35, Thomas Bruederli wrote:
>> On Mon, Feb 17, 2014 at 11:54 PM, Reindl Harald <h.reindl at thelounge.net>
>> wrote:
>>>>> Roundcube does not fully recognize URLs with 8bit chars, they are being
>>>>> truncated upon the first occurrence of any such 8 bit char
>>> where does roundcube need to recognize any URL?
>>> in which context should it recognize what URL and why?
>> The context where Roundcube should (and does) try to recognize URLs is
>> when displaying a plain text message. For convenience we want to make
>> detected URLs clickable so the user doesn't have to copy & paste them.
>> This is done using regular expressions, and here we stick to the RFC
>> specification of allowed characters in URLs, which doesn't include any
>> 8bit characters. Indeed, it's stupid of mail senders not to properly
>> encode their URLs, and unfortunately there's little we can or want to
>> do about this. It's already hard enough to reliably detect URLs in a
>> plain text string, especially finding where they end. If 8bit
>> characters were taken into account as well, we'd likely pull more
>> characters from the surrounding text into the URL, which may lead to
>> false detections even for correctly encoded URLs.
>> Thus, I'm sorry, but this is strictly a sender issue, and in this case
>> you'd need to manually copy the URL and paste it into your browser's
>> location bar. You might argue that FF supports these URLs, and you're
>> right. But unlike Roundcube, FF treats the entire string as a URL and
>> doesn't need to "find" it within random text. Therefore FF can accept
>> any string of characters. And even FF first converts the string into
>> properly URL-encoded characters before it actually sends the URL to
>> the server.
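For illustration, here is a minimal sketch (in Python rather than Roundcube's PHP, and not the actual pattern Roundcube uses) of why an RFC-conformant, ASCII-only character class truncates such URLs at the first 8bit character:

```python
import re

# Hedged sketch: an ASCII-only URL matcher limited to characters that
# RFC 3986 permits in URIs. This is NOT Roundcube's real pattern; it
# only illustrates the truncation behaviour described above.
ASCII_URL = re.compile(r"https?://[A-Za-z0-9._~:/?#\[\]@!$&'()*+,;=%-]+")

text = "See http://en.wikipedia.org/wiki/.рф for details"
match = ASCII_URL.search(text)
# The match stops at the first 8bit (Cyrillic) character:
print(match.group(0))  # http://en.wikipedia.org/wiki/.
```

The Cyrillic "рф" falls outside the character class, so the detected link ends with the dot before it, exactly the truncation reported at the top of this thread.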
> Hi Thomas,
> let me disagree here. While it's sort of true that a *real* URL may only
> contain a limited subset of ASCII characters, there's also such a thing
> as a *visible* URL, which should be taken into account. As an extreme
> example, Russia has had the .рф (Cyrillic) top-level domain [1] for
> quite some time now. Most, if not all, subdomains of that domain are
> written in Cyrillic characters. And surely, the web servers serving
> these domains might contain pages with Cyrillic names as well.
> Technically, URLs of these pages would be a mix of punycode and
> URL-escaped entities (%xx%yy%zz...). However, from a user's point of
> view, such a low-level representation is absolutely unfriendly and
> looks like a bunch of random symbols. I think most users would favor
> writing URLs like these in their native alphabet instead of their
> low-level ASCII representation.
> Regarding the difficulty of detection, I would dare to disagree with
> you as well. Since PHP 5.1, PCRE has had support for Unicode character
> properties, so I'm pretty sure it must be possible to add all
> alphanumeric characters to your regex easily.
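Rimas's suggestion can be sketched as follows (Python here for brevity; the PCRE equivalent in PHP >= 5.1 would use \p{L} and \p{N} with the /u modifier — the example URLs are invented, and this is not Roundcube code):

```python
import re

# \w in Python 3 matches Unicode letters and digits by default,
# roughly playing the role of PCRE's \p{L}\p{N} character properties.
UNICODE_URL = re.compile(r"https?://[\w.\-~:/?#\[\]@!$&'()*+,;=%]+")

m = UNICODE_URL.search("Plain text with http://кто.рф/страница inside")
print(m.group(0))  # http://кто.рф/страница

# Thomas's concern in practice: trailing sentence punctuation is
# indistinguishable from URL characters, so this sketch over-matches:
m2 = UNICODE_URL.search("Жмите http://кто.рф.")
print(m2.group(0))  # http://кто.рф.  (final dot wrongly included)
```

So the Unicode class itself is easy to add; the open problem remains deciding where the URL ends in surrounding prose.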

I certainly agree with this. And we'd very much appreciate any
contribution here, preferably in the form of a regex that detects
Unicode URLs, or even better a set of test cases that demonstrate
the correct detection of real and false URLs within plain text.
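A hypothetical starting point for such a test set could pair each plain-text input with the URL that should (or should not) be detected; all of the concrete examples below are invented for illustration:

```python
# Hypothetical detection test cases; None means nothing should match.
CASES = [
    # straightforward ASCII URL
    ("Visit http://example.com/page now", "http://example.com/page"),
    # Unicode host and path, the case discussed in this thread
    ("См. http://кто.рф/страница", "http://кто.рф/страница"),
    # trailing sentence punctuation must not become part of the URL
    ("Read http://example.com.", "http://example.com"),
    # a false URL: broken scheme, nothing should be detected
    ("This http//broken is not a URL", None),
]

for text, expected in CASES:
    print(repr(text), "->", repr(expected))
```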
> Regards,
> Rimas
> [1] http://en.wikipedia.org/wiki/.%D1%80%D1%84 . Note how this looks hardly
> readable compared to http://en.wikipedia.org/wiki/.рф .

A possible optimization on our side could be to decode the URL
encoding (and punycode) when displaying links in the message view.
This, however, alters the actual message content, which might be
undesirable.
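A rough sketch of what such display-side decoding could look like (Python standard library; an illustration only, not Roundcube code, with minimal error handling and query/fragment parts ignored):

```python
from urllib.parse import unquote, urlsplit

def display_form(url: str) -> str:
    """Decode punycode in the host and percent-escapes in the path,
    yielding a human-readable form of the URL for display only."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    try:
        # The idna codec turns each xn-- label back into Unicode.
        host = host.encode("ascii").decode("idna")
    except UnicodeError:
        pass  # leave hosts that are not valid IDNA untouched
    return f"{parts.scheme}://{host}{unquote(parts.path)}"

# The wire form from Rimas's footnote becomes readable again:
print(display_form("http://en.wikipedia.org/wiki/.%D1%80%D1%84"))
# -> http://en.wikipedia.org/wiki/.рф
```

Since only the rendered link text changes while the href keeps the original wire form, the underlying message could stay untouched.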

