URLs with 8bit chars?

Michael Heydekamp

17 Feb 2014

11:18 p.m.

Roundcube does not fully recognize URLs containing 8-bit characters: they are truncated at the first occurrence of any such character. But much to my surprise, such URLs do exist and they do work, at least in FF 27.0.1. For instance, this one:

http://suchen.mobile.de/auto-inserat/bmw-335i-a-navi-prof-hifi-h-k-el-glasda...

???

(And no, I didn't buy the car... ;))

Cheers,

Michael Heydekamp Co-Admin freexp.de Düsseldorf/Germany

Reindl Harald

17 Feb

11:29 p.m.

Am 17.02.2014 23:18, schrieb Michael Heydekamp:

...

Roundcube does not fully recognize URLs with 8bit chars, they are being truncated upon the first occurrence of any such 8 bit char. But much to my surprise, they do exist and they do work - at least in FF 27.0.1. For instance this one:

http://suchen.mobile.de/auto-inserat/bmw-335i-a-navi-prof-hifi-h-k-el-glasda...

???

(And no, I didn't buy the car... ;))

what are you trying to tell us? nobody right in his mind is using non-encoded URLs! what has this to do with roundcube?

Michael Heydekamp

11:43 p.m.

Am 17.02.2014 23:29, schrieb Reindl Harald:

...

what are you trying to tell us?

Who do you think is "us"? Harald, believe me, especially you I didn't try to tell anything, as I really don't appreciate your unfriendly attitude.

Cheers,

Michael Heydekamp Co-Admin freexp.de Düsseldorf/Germany

Reindl Harald

11:54 p.m.

Am 17.02.2014 23:43, schrieb Michael Heydekamp:

...

Am 17.02.2014 23:29, schrieb Reindl Harald:

...
what are you trying to tell us?

Who do you think is "us"?

anybody receiving your message

...

Harald, believe me, especially you I didn't try to tell anything

to what topic? your question does not parse

...

...
Roundcube does not fully recognize URLs with 8bit chars, they are being truncated upon the first occurrence of any such 8 bit char

where does roundcube need to recognize any URL? in which context should it recognize what URL and why?

...

as I really don't appreciate your unfriendly attitude

so be it

Thomas Bruederli

22 Feb

1:35 p.m.

On Mon, Feb 17, 2014 at 11:54 PM, Reindl Harald h.reindl@thelounge.net wrote:

...

...
...
Roundcube does not fully recognize URLs with 8bit chars, they are being truncated upon the first occurrence of any such 8 bit char

where does roundcube need to recognize any URL? in which context should it recognize what URL and why?

The context where Roundcube should (and does) try to recognize URLs is when displaying a plain text message. For convenience, we want to make detected URLs clickable rather than leave the user to copy and paste them. This is done using regular expressions, and here we stick to the RFC specification of allowed characters in URLs, which doesn't include any 8-bit characters. Indeed, it's stupid for mail senders not to encode their URLs properly, and unfortunately there's little we can or want to do about this. It's already hard enough to reliably detect URLs in a plain text string, especially to find where they end. If 8-bit characters were taken into account as well, we would likely add more characters from the surrounding text to the URL, which may lead to false detections even for correctly encoded URLs.

Thus, I'm sorry, but this is strictly a sender issue, and in this case you'd need to manually copy the URL and paste it into your browser's location bar. You might argue that FF supports these URLs, and you're right. But unlike Roundcube, FF understands the entire string to be a URL and doesn't need to "find" it within arbitrary text. Therefore FF can accept any string of characters. But FF also first converts it into properly URL-encoded characters before it actually sends the URL to the server.

Kind regards, Thomas
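
To illustrate the truncation described above, here is a minimal sketch in Python. The pattern `ASCII_URL` is a hypothetical ASCII-only regex loosely based on the characters RFC 3986 permits; it is not Roundcube's actual expression, and the example URL is made up. The second helper shows the kind of percent-encoding a browser applies before sending a request.

```python
import re
from urllib.parse import quote

# Hypothetical ASCII-only pattern (an assumption for illustration,
# NOT the regex Roundcube uses).
ASCII_URL = re.compile(r"https?://[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=%]+")


def find_ascii_url(text):
    """Return the first URL-like match in text, or None if there is none."""
    match = ASCII_URL.search(text)
    return match.group(0) if match else None


def percent_encode(url):
    """Percent-encode non-ASCII characters as UTF-8, keeping URL delimiters."""
    return quote(url, safe=":/?#[]@!$&'()*+,;=%-._~")


# \u00e4 is the letter a-umlaut, an 8-bit character in Latin-1.
raw_url = "http://example.com/auto/schiebed\u00e4cher/123"
text = "See " + raw_url + " for details"

print(find_ascii_url(text))     # http://example.com/auto/schiebed
print(percent_encode(raw_url))  # http://example.com/auto/schiebed%C3%A4cher/123
```

The match stops at the first character outside the ASCII class, which is the behaviour reported at the start of the thread.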

Rimas Kudelis

3:47 p.m.

2014.02.22 14:35, Thomas Bruederli rašė:

...

On Mon, Feb 17, 2014 at 11:54 PM, Reindl Harald h.reindl@thelounge.net wrote:

...
...
...
Roundcube does not fully recognize URLs with 8bit chars, they are being truncated upon the first occurrence of any such 8 bit char

where does roundcube need to recognize any URL? in which context should it recognize what URL and why?

The context where Roundcube should (and does) try to recognize URLs is when displaying a plain text message. For convenience reasons we want to make detected URLs clickable and not leave the user to copy & paste it. This is done using regular expressions and we hereby stick to the RFC specification of allowed chars in URLs which doesn't include any 8bit characters. Indeed, it's stupid for mail senders to not properly encode their URLs and unfortunately there's little we can and want do about this. It's already hard enough to reliably detect URLs in a plain text string, especially finding the end of it. If 8bit characters should be taken into account as well, we'll likely add more characters from the surrounding text to the URL which may leads to false detections even for correctly encoded URLs.

Thus, I'm sorry but this is strictly a sender issue and in this case you'd need to manually copy the URL and paste it to your browser's location bar. You might argue that FF supports these URLs and you're right. But unlike Roundcube, FF understands the entire string to be an URL and doesn't need to "find" it within a random text. Therefore FF can accept any string of characters. But also FF first converts it into proper URL encoded characters before it actually sends the URL to the server.

Hi Thomas,

let me disagree here. While it's sort of true that a *real* URL may only contain a limited subset of ASCII characters, there's also such a thing as *visible* URLs, which should be taken into account. As an extreme example, Russia has had the .рф (Cyrillic) top-level domain [1] for quite some time now. Most, if not all, subdomains of that domain are written in Cyrillic characters. And surely, the web servers serving these domains might contain pages with Cyrillic names as well. Technically, the URLs of these pages would be a mix of punycode and URL-escaped entities (%xx%yy%zz...). However, from a user's point of view, such a low-level representation is absolutely unfriendly and looks like a bunch of random symbols. I think most users would favor writing URLs like these in their native alphabet instead of their low-level ASCII representation.

Regarding difficulty of detection, I would dare to disagree with you as well. Since PHP 5.1, PCRE has had support for Unicode character properties, so I'm pretty sure that it must be possible to add all alphanumeric characters to your regex easily.

Regards, Rimas

[1] http://en.wikipedia.org/wiki/.%D1%80%D1%84 . Note how this looks hardly readable compared to http://en.wikipedia.org/wiki/.рф .
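
As a rough illustration of the suggestion above, the sketch below widens the character class with `\w`, which in Python matches letters and digits of any script for `str` patterns. In PHP/PCRE the comparable approach would presumably be `\p{L}` and `\p{N}` with the `u` modifier, but that is not shown here. The pattern, the `TRAILING` set, and the function name are all assumptions made for this example, not a proposed patch.

```python
import re

# Hypothetical pattern: \w is Unicode-aware for str patterns in Python 3.
UNICODE_URL = re.compile(r"https?://[\w\-._~:/?#\[\]@!$&'()*+,;=%]+")

# Punctuation that more often ends a sentence than a URL (a heuristic choice).
TRAILING = ".,;:!?)'"


def find_unicode_urls(text):
    """Return all URL-like matches, with trailing punctuation stripped."""
    return [m.group(0).rstrip(TRAILING) for m in UNICODE_URL.finditer(text)]


sample = (
    "Read http://en.wikipedia.org/wiki/.рф, "
    "then http://example.com/schiebed\u00e4cher."
)
for url in find_unicode_urls(sample):
    print(url)
```

Stripping trailing punctuation is exactly the "finding the end" problem raised earlier in the thread: this heuristic would also cut a closing parenthesis that genuinely belongs to a URL, so any real patch would need test cases for both real and false matches.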

Thomas Bruederli

4:02 p.m.

On Sat, Feb 22, 2014 at 3:47 PM, Rimas Kudelis rq@akl.lt wrote:

...

2014.02.22 14:35, Thomas Bruederli rašė:

...
On Mon, Feb 17, 2014 at 11:54 PM, Reindl Harald h.reindl@thelounge.net wrote:

...
...
...
Roundcube does not fully recognize URLs with 8bit chars, they are being truncated upon the first occurrence of any such 8 bit char

where does roundcube need to recognize any URL? in which context should it recognize what URL and why?

The context where Roundcube should (and does) try to recognize URLs is when displaying a plain text message. For convenience reasons we want to make detected URLs clickable and not leave the user to copy & paste it. This is done using regular expressions and we hereby stick to the RFC specification of allowed chars in URLs which doesn't include any 8bit characters. Indeed, it's stupid for mail senders to not properly encode their URLs and unfortunately there's little we can and want do about this. It's already hard enough to reliably detect URLs in a plain text string, especially finding the end of it. If 8bit characters should be taken into account as well, we'll likely add more characters from the surrounding text to the URL which may leads to false detections even for correctly encoded URLs.

Thus, I'm sorry but this is strictly a sender issue and in this case you'd need to manually copy the URL and paste it to your browser's location bar. You might argue that FF supports these URLs and you're right. But unlike Roundcube, FF understands the entire string to be an URL and doesn't need to "find" it within a random text. Therefore FF can accept any string of characters. But also FF first converts it into proper URL encoded characters before it actually sends the URL to the server.

Hi Thomas,

let me disagree here. While it's sort of true that a *real* URL may only contain a limited subset of ASCII characters, there's also such thing as *visible* URLs, which should be taken into account. As an extreme example, Russia has had the .рф (Cyrillic) top-level domain [1] for quite some time now. Most, if not all, subdomains of that domain are written in Cyrillic characters. And surely, the web servers serving these domains might contain pages with Cyrillic names as well. Technically, URL's of these pages would are a mix of punycode and URL escaped entities (%xx%yy%zz...). However, from a users point of view, such low-level representation is absolutely unfriendly and looks like a bunch of random symbols. I think most of the users would favor writing URL's like these in native alphabet instead of their low-level ASCII representation.

Regarding difficulty of detection, I would dare to disagree with you as well. Since PHP 5.1, PCRE has had support for Unicode character properties, so I'm pretty sure that it must be possible to add all alphanumeric characters to your regex easily.

I certainly agree with this. And we'd very much appreciate any contribution here, preferably a regex that detects Unicode URLs or, even better, one that comes with a set of test cases demonstrating the correct detection of real and false URLs within plain text.

...

Regards, Rimas

[1] http://en.wikipedia.org/wiki/.%D1%80%D1%84 . Note how this looks hardly readable compared to http://en.wikipedia.org/wiki/.рф .

A possible optimization on our side could be to decode the URL encoding (and punycode) when displaying links in message view. This, however, alters the actual message content, which might be undesirable.

~Thomas
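
The decoding idea mentioned above could look roughly like this in Python, using only the standard library. `display_form` is a hypothetical helper name; the sketch drops any user-info part of the URL and relies on Python's built-in `idna` codec, which implements the older IDNA 2003 rules.

```python
from urllib.parse import urlsplit, urlunsplit, unquote


def display_form(url):
    """Return a human-readable variant of url for display purposes only."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    try:
        # Turn punycode labels (xn--...) back into Unicode.
        host = host.encode("ascii").decode("idna")
    except UnicodeError:
        pass  # Host is already Unicode or not valid punycode: keep it as-is.
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    # Decode %xx escapes in the path; query and fragment are left untouched.
    return urlunsplit(
        (parts.scheme, netloc, unquote(parts.path), parts.query, parts.fragment)
    )


print(display_form("http://en.wikipedia.org/wiki/.%D1%80%D1%84"))
# http://en.wikipedia.org/wiki/.рф
```

Only the visible link text would change under such a scheme; the link target should still be the encoded form from the message.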

Rimas Kudelis

5:51 p.m.

Hello,

2014.02.22 17:02, Thomas Bruederli wrote:

...

On Sat, Feb 22, 2014 at 3:47 PM, Rimas Kudelis rq@akl.lt wrote:

...
Regarding difficulty of detection, I would dare to disagree with you as well. Since PHP 5.1, PCRE has had support for Unicode character properties, so I'm pretty sure that it must be possible to add all alphanumeric characters to your regex easily.

I certainly agree to this. And we'd very much appreciate any contribution for this, preferably in terms of a regex that detect unicode URLs or even better with a set of text cases that demonstrate the correct detection of real and false urls within plain text.

I could take a look if you point me to the right file to edit.

...

...
[1] http://en.wikipedia.org/wiki/.%D1%80%D1%84 . Note how this looks hardly readable compared to http://en.wikipedia.org/wiki/.рф .

A possible optimization on our side could be to decode the URL encoding (and punycode) when displaying links in message view. This however, alters the actual message content which might be undesirable.

I don't think there's a need for that. Especially if the assumption is that you can just write URLs as you see them.

Rimas

Reindl Harald

4:03 p.m.

Am 22.02.2014 15:47, schrieb Rimas Kudelis:

...

[1] http://en.wikipedia.org/wiki/.%D1%80%D1%84 . Note how this looks hardly readable compared to http://en.wikipedia.org/wiki/.рф

and now look at exactly what happens if you click on the second one: for a short moment you see in the browser exactly the same as for the first. technically, the second URL doesn't exist

the complete web was and is ASCII as far as domains and URLs are concerned. at the low level you only have punycode and ASCII encodings

frankly, the idea to allow special chars in domains by means of technical tricks was the largest mistake of the last 20 years

what people mostly do not realize is the security impact: frankly, i can register a punycode domain that looks like a well-known one to the user in the address bar and use it for phishing attacks, including a valid and accepted certificate. that is why, not that long ago, Firefox switched back to displaying punycode when the first attacks of this sort appeared. now it's the dangerous way again
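
The look-alike problem described above can be made concrete with a short Python sketch. The script check below is a deliberately crude heuristic based on Unicode character names; it is an assumption for illustration and not how Firefox or any other browser decides what to display. The spoofed host name is invented.

```python
import unicodedata


def scripts_in_label(label):
    """Guess the script of each letter from the first word of its Unicode name."""
    return {unicodedata.name(ch).split()[0] for ch in label if ch.isalpha()}


def looks_mixed(hostname):
    """True if any dot-separated label mixes letters from several scripts."""
    return any(len(scripts_in_label(label)) > 1 for label in hostname.split("."))


genuine = "example.com"
spoof = "ex\u0430mple.com"  # U+0430 is CYRILLIC SMALL LETTER A

print(genuine == spoof)      # False, although both render almost identically
print(spoof.encode("idna"))  # the ASCII form starts with b"xn--"
print(looks_mixed(genuine))  # False
print(looks_mixed(spoof))    # True
```

A label written entirely in look-alike characters from one script would pass this check, so mixed-script detection alone does not close the hole.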

Rimas Kudelis

8:36 p.m.

Hello Reindl,

2014.02.22 17:03, Reindl Harald wrote:

...

Am 22.02.2014 15:47, schrieb Rimas Kudelis:

...
[1] http://en.wikipedia.org/wiki/.%D1%80%D1%84 . Note how this looks hardly readable compared to http://en.wikipedia.org/wiki/.рф

and now look exactly what happens if you click on the second one for a short moment you see in the browser exactly the same a for the first, technically the second URL don't exist

the complete web was and is ASCII in case of domains and URLs on any lowlevel you only have punnycode and ASCII ecnodings

frankly the idea to allow special chars with technical tricks in domains was the largest mistake of the last 20 years

what people mostly do not realize is the security impact frankly i can register a punnycode domain for the user in the addressbar looking like a well known one and use that for phising attacks including a valid and accepted certificate - that is why not that long ago Firefox switched back to display Punnycode as the first attacks of this sort appeared, now it's again the dangerous way

Of course, security is important. But it's not the only thing that matters. HTML e-mails were, and perhaps still are, considered insecure, but Roundcube supports them and takes every precaution it can to avoid these security issues. With browsers and Unicode domains, the case is somewhat similar: when there is no regulation, the issues you are talking about might of course arise. That's why many TLD registries have implemented strict rules on which Unicode characters are and aren't allowed in domain names registered under particular TLDs. For example, in the Lithuanian (.lt) zone, only IDNs composed of "usual" ASCII and specific Lithuanian letters are allowed, and nothing else. You cannot register a domain name containing a Cyrillic letter under the .lt zone. IIRC, browsers have whitelists of such zones: they don't blindly enable Unicode display of punycode domains for all zones, but only for specific ones that enforce such strict rules.

Rimas
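
A per-zone character policy of the kind described above might be sketched as follows. The table `ALLOWED_EXTRA` is purely hypothetical: it lists the nine Lithuanian letters with diacritics as a stand-in for the .lt rules, and real policies would have to be taken from each registry's published IDN tables.

```python
import string
import unicodedata

ASCII_LABEL_CHARS = set(string.ascii_lowercase + string.digits + "-")

# Hypothetical policy table (an assumption, not an official registry list).
ALLOWED_EXTRA = {
    "lt": set(unicodedata.normalize("NFC", "ąčęėįšųūž")),
}


def hostname_allowed(hostname):
    """Check every label below the TLD against that TLD's allowed characters."""
    labels = unicodedata.normalize("NFC", hostname).lower().split(".")
    tld = labels[-1]
    allowed = ASCII_LABEL_CHARS | ALLOWED_EXTRA.get(tld, set())
    return all(ch in allowed for label in labels[:-1] for ch in label)


print(hostname_allowed("ąžuolas.lt"))        # True
print(hostname_allowed("ex\u0430mple.lt"))   # False: Cyrillic letter in a .lt name
print(hostname_allowed("ąžuolas.com"))       # False: no extra letters listed for .com
```

Zones missing from the table fall back to plain ASCII, which mirrors the whitelist behaviour described for browsers.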

A.L.E.C

23 Feb

1:09 p.m.

On 02/22/2014 01:35 PM, Thomas Bruederli wrote:

...

Thus, I'm sorry but this is strictly a sender issue and in this case you'd need to manually copy the URL and paste it to your browser's location bar. You might argue that FF supports these URLs and you're right. But unlike Roundcube, FF understands the entire string to be an URL and doesn't need to "find" it within a random text. Therefore FF can accept any string of characters. But also FF first converts it into proper URL encoded characters before it actually sends the URL to the server.

However, Thunderbird is able to recognize such URLs in plain text messages. So, doing the same in Roundcube is not unreasonable. Also, we already support unicode in domain names, why not in the rest of the URL?

So, the place in the code to look at is the rcube_string_replacer class. If anyone provides a patch with a set of test scripts, I'll be in favor of accepting that addition. Feel free to open a ticket.

--
Aleksander 'A.L.E.C' Machniak
LAN Management System Developer [http://lms.org.pl]
Roundcube Webmail Developer [http://roundcube.net]
---------------------------------------------------
PGP: 19359DC1 @@ GG: 2275252 @@ WWW: http://alec.pl

dev@lists.roundcube.net
