View Issue Details

ID: 0000625
Project: Subversion for OS/2 & eCS
Category: Bug
View Status: public
Last Update: 2020-09-03 13:47
Reporter: dmik
Assigned To:
Priority: normal
Severity: major
Reproducibility: always
Status: new
Resolution: open
Product Version:
Target Version:
Fixed in Version:
Summary: 0000625: Unicode support
Description: It seems that both subversion 1.6 and subversion 1.7 have problems with characters outside the range of the current 8-bit OS/2 character set.

For instance, there is a repository http://svn.netlabs.org/repos/java/branches/vendor/sourceforge/icedtea-web which contains a few files with characters from the extended Latin set (the ones with diacritics) in their names. Since I have the Russian locale with the CP866 codepage, these characters are missing from my codepage, and this is where the problems start.

I need to attach screenshots to the ticket and since Mantis doesn't let you attach more than one screenshot at a time, I will put the steps to reproduce in separate comments.
Tags: No tags attached.

Activities

dmik

2014-09-25 18:48

reporter  

svn16.png (15,206 bytes)

dmik

2014-09-25 18:52

reporter   ~0002837

With subversion 1.6, checkout per se works (meaning it doesn't abort) but any further operation with the affected files (update, change & commit, etc) fails.

After doing `svn co http://svn.netlabs.org/repos/java/branches/vendor/sourceforge/icedtea-web/current@430` followed by `svn stat` I get what you see on svn16.png. That is, for the file named `encodingTests?Š??ŽÝÁÍÉ?É??ÝÚ?ÍÓÁŠ?Ž??` I get the Latin letters corresponding to the diacritic ones, followed by a character with code 0x7F (the "house" mark on the screenshot). You can also see that svn doesn't recognize these files properly after the checkout: they are marked BOTH as missing and as new (untracked).

dmik

2014-09-25 18:54

reporter   ~0002838

Apparently, Mantis itself has big problems with Unicode. The `encodingTests` file name should look as it does in the screenshot `svn_proper.png`.

dmik

2014-09-25 18:54

reporter  

svn_proper.png (35,273 bytes)

dmik

2014-09-25 18:58

reporter  

svn17.png (31,819 bytes)

dmik

2014-09-25 19:04

reporter   ~0002839

With subversion 1.7, after doing the same checkout as above, the picture is a bit different. See `svn17.png`. In particular, the names of the "missing" and "new" files don't match: the files I actually see created in the file system are the "new" ones.

That is, svn 1.6 creates diacritics with the "house" marks in the file system (and these names still do not look the same as the ones it stores in the index and shows in stat, although "house" is used there as well), while svn 1.7 uses "house" marks in the index and various graphical characters in the file system.

dmik

2014-09-25 19:06

reporter   ~0002840

I see only one solution to this problem: svn should refuse to check out repositories with file names which it can't create in the local character set. Any other solution is unacceptable since it would be too dangerous and error-prone. It is easy to detect a failure in character set conversion, so it shouldn't be too much work to implement this.
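To illustrate how cheap such a check can be, here is a minimal Python sketch (not Subversion code). It assumes the repository paths are available as Unicode strings and that the local code page has a Python codec name such as `cp866`; the function names `unrepresentable_chars` and `can_check_out` are hypothetical.

```python
from typing import Iterable, List


def unrepresentable_chars(name: str, codepage: str = "cp866") -> List[str]:
    """Return the characters of `name` that the given code page cannot encode."""
    bad = []
    for ch in name:
        try:
            ch.encode(codepage)
        except UnicodeEncodeError:
            # The conversion failure is the signal we are looking for.
            bad.append(ch)
    return bad


def can_check_out(names: Iterable[str], codepage: str = "cp866") -> bool:
    """True only if every file name is fully representable in the code page."""
    return all(not unrepresentable_chars(n, codepage) for n in names)


if __name__ == "__main__":
    for name in ["README", "файл.txt", "encodingTestsŠŽ.txt"]:
        print(name, unrepresentable_chars(name))
```

A client following this idea would run the check over all paths before creating anything on disk and abort the checkout on the first name that fails.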

psmedley

2014-10-03 14:53

administrator   ~0002843

This seems to describe a similar situation on MacOS from the SVN 1.5.x timeframe: http://subversion.tigris.org/issues/show_bug.cgi?id=2464

dmik

2014-10-03 19:12

reporter   ~0002844

Last edited: 2014-10-03 19:52

JFYI, there is also a somewhat similar problem in git: http://stackoverflow.com/questions/5581857/git-and-the-umlaut-problem-on-mac-os-x.

BTW, there is another solution besides refusing to work with such repos at all: we may escape non-representable chars with printable ASCII using one of the known algorithms (e.g. like they do when passing URLs between systems on the Internet). This would let us fully support all file operations (adding, renaming, deletion) on any Unicode character within any 8-bit codepage. The only drawback is that, for characters that are not in their code page, the user will see the escaped form rather than the actual characters.

For URLs they do what is called "percent-encoding": http://en.wikipedia.org/wiki/Percent-encoding. We may just use that algorithm. It can operate on UTF-8 strings (which are what the SVN/GIT server stores and transfers on the wire, if I get it right), so there is not much to implement: only one mangle/demangle function. We should find the right place for it though. And the existing MacOS solution may help with that.

BTW, we should not percent-encode characters that CAN be represented using the current 8-bit code page, so that the user's native characters keep looking native. This makes the working copy non-portable (you won't be able to zip it and transfer it to another machine with a different code page), but this doesn't make things any worse since SVN working copies are already non-portable on OS/2 due to the nature of 8-bit code pages. And in either case non-portability is not a big issue at all. Nobody should normally do that.
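The mangle/demangle idea above can be sketched in Python as follows. This is only an illustration under assumptions, not a patch for svn: the names `mangle` and `demangle` are hypothetical, names are handled as Unicode strings, and the literal `%` is itself escaped (as `%25`) so that the mapping stays reversible.

```python
def mangle(name: str, codepage: str = "cp866") -> str:
    """Percent-encode only the characters the code page cannot represent."""
    out = []
    for ch in name:
        if ch == "%":
            # Escape the escape character itself to keep the mapping reversible.
            out.append("%25")
            continue
        try:
            ch.encode(codepage)
            out.append(ch)  # representable: keep the native character
        except UnicodeEncodeError:
            # Not representable: emit %XX for each UTF-8 byte of the character.
            out.append("".join("%{:02X}".format(b) for b in ch.encode("utf-8")))
    return "".join(out)


def demangle(mangled: str) -> str:
    """Reverse mangle(): rebuild the UTF-8 byte string and decode it."""
    raw = bytearray()
    i = 0
    while i < len(mangled):
        if mangled[i] == "%":
            raw.append(int(mangled[i + 1:i + 3], 16))
            i += 3
        else:
            raw.extend(mangled[i].encode("utf-8"))
            i += 1
    return raw.decode("utf-8")


if __name__ == "__main__":
    local = mangle("Š.txt")       # "%C5%A0.txt" under cp866
    print(local, demangle(local))
```

With CP866 as the local code page, Cyrillic names pass through unchanged while a name such as `Š.txt` is stored on disk as `%C5%A0.txt`, and `demangle` restores the original before the name is sent back to the server.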

Issue History

Date Modified Username Field Change
2014-09-25 18:48 dmik New Issue
2014-09-25 18:48 dmik File Added: svn16.png
2014-09-25 18:52 dmik Note Added: 0002837
2014-09-25 18:54 dmik Note Added: 0002838
2014-09-25 18:54 dmik File Added: svn_proper.png
2014-09-25 18:58 dmik File Added: svn17.png
2014-09-25 19:04 dmik Note Added: 0002839
2014-09-25 19:06 dmik Note Added: 0002840
2014-10-03 14:53 psmedley Note Added: 0002843
2014-10-03 19:12 dmik Note Added: 0002844
2014-10-03 19:49 dmik Note Edited: 0002844
2014-10-03 19:52 dmik Note Edited: 0002844