From: davis@pacific.mps.ohio-state.edu ("John E. Davis") Subject: 8 bit clean implies what? Reply-To: davis@pacific.mps.ohio-state.edu (John E. Davis) Date: Sat, 6 Feb 1993 18:22:29 GMT Lines: 46 Hi, I have a few questions regarding the meaning of 8 bit clean editors. As I understand it, an editor which is 8 bit clean can display ALL 256 characters on the output device. That is, the character should not be mapped to a displayable representation (i.e., ascii char 1 to two character sequence ^A). So for example, if character 235 corresponds to the greek letter alpha on the output device, an alpha should appear when char 235 is sent. In addition, the editor should be able to take ANY 8 bit character form the input device and display it. That is, if the input device is capable of sending the char 235 (alpha), then the char should be deisplayed as above. Is this correct? On my PC, the char 255 do not display anything on the screen (just a space). 255 is also -1 when converted to signed char and usually denoted end of file or something special like that. Is it just a coincidence that 255 displays nothing on my PC or is this a general feature? Should I make any assumptions regarding 255? I would like to reserve it for my own purposes. Finally, are characters with the hi bit set (>= 128) ever involved in keymaps? This might seem like a silly question but for my purposes, it is the most important question. I tend to think of keymaps as involving only 7 bit chars, e.g., escape map. But is any known case of a keymap where the prefix character has the high bit set? In case you are wondering, I am working on an editor (JED). Recently, I released version 0.80 which I thought to be 8 bit clean, but in retrospect, it is not. I hear people say ``Just treat ALL characters the same!''. However, I am concerned with memory usage on PCs and I would like to cut corners wherever I can. Berfore I release the next version (0.81), I want to make SURE that I get the 8 bit thing correct. I appreciate any comments on the subject. Thank You. -- _____________ #___/John E. Davis\_________________________________________________________ # # internet: davis@amy.tch.harvard.edu # bitnet: davis@ohstpy # office: 617-735-6746 # From: michael@chpc.utexas.edu (Michael Lemke) Subject: Re: 8 bit clean implies what? Organization: The University of Texas System - CHPC Date: Sat, 6 Feb 93 20:38:11 GMT Lines: 38 In article davis@pacific.mps.ohio-state.edu (John E. Davis) writes: >Hi, > >I have a few questions regarding the meaning of 8 bit clean editors. > > >As I understand it, an editor which is 8 bit clean can display ALL 256 >characters on the output device. >That is, the character should not be mapped >to a displayable representation (i.e., ascii char 1 to two character sequence >^A). I don't think this is correct but I am not an expert on this. Check out what ISO-Latin-1 means. There are quite a lot of control sequences >128. e.g., CSI which is ESC [? in 7bit. Any terminal sending or accepting 8bit controls will use them. Secondly, an 8bit clean editor needs to know what are corresponding uppercase and lower case characters, e.g. åis lower case of Å. >Finally, are characters with the hi bit set (>= 128) ever involved in keymaps? Yes, see my comment above. >This might seem like a silly question but for my purposes, it is the most >important question. I tend to think of keymaps as involving only 7 bit chars, >e.g., escape map. But is any known case of a keymap where the prefix character >has the high bit set? I do think but haven't tried that my vt220 will send CSI something or so if I tell it to use 8bit control chars, which I haven't. You might also look at the C LC_CTYPE stuff or how that is called, don't have my C book handy. There is support for 8bit char sets. Michael -- Michael Lemke Astronomy, UT Austin, Texas (michael@io.as.utexas.edu or UTSPAN::UTADNX::IO::MICHAEL [SPAN]) From: davis@pacific.mps.ohio-state.edu ("John E. Davis") Subject: Re: 8 bit clean implies what? Reply-To: davis@pacific.mps.ohio-state.edu (John E. Davis) Organization: "Dept. of Physics, The Ohio State University" Date: Sat, 6 Feb 1993 21:16:29 GMT Lines: 19 In article <1993Feb6.203811.24134@chpc.utexas.edu> michael@chpc.utexas.edu (Michael Lemke) writes: ...accepting 8bit controls will use them. Secondly, an 8bit clean editor needs to know what are corresponding uppercase and lower case characters, e.g. åis lower case of Å. This is an excellent point that I have not thought of. The natural solution is through the use of a lookup table. But, in general, this requires TWO tables: uppercase and lowercase. However, a single CHANGE_CASE table is sufficient if it is guaranteed that lower_case(x) >= upper_case(x). Does anyone know if this assumption is valid? -- _____________ #___/John E. Davis\_________________________________________________________ # # internet: davis@amy.tch.harvard.edu # bitnet: davis@ohstpy # office: 617-735-6746 # From: michael@chpc.utexas.edu (Michael Lemke) Subject: Re: 8 bit clean implies what? Organization: The University of Texas System - CHPC Date: Sat, 6 Feb 93 22:49:10 GMT Lines: 31 In article davis@pacific.mps.ohio-state.edu (John E. Davis) writes: >In article <1993Feb6.203811.24134@chpc.utexas.edu> michael@chpc.utexas.edu >(Michael Lemke) writes: > ...accepting 8bit controls will use them. Secondly, an 8bit clean editor > needs to know what are corresponding uppercase and lower case > characters, e.g. å is lower case of Å. > >This is an excellent point that I have not thought of. The natural solution >is through the use of a lookup table. But, in general, this requires TWO >tables: uppercase and lowercase. However, a single CHANGE_CASE table is >sufficient if it is guaranteed that lower_case(x) >= upper_case(x). Does >anyone know if this assumption is valid? Yes, this is true for at least ISO-Latin-1 and DEC Multinational Character Set (almost identical). The high order part of these char sets are pretty much an image of the low order part except the 8th bit is 1. You should really have a look at the ISO-Latin-1 character tables, for example in the appendix of a terminal that has 8bit chars (e.g., vt220 and higher, GraphOn225 and higher). Also think about sort sequences. An ANSI C implementation must provide functions like isupper(int C) which are controled by the current locale which in turn is controled by the setlocale function. I haven't done anything with it but this is exactly the kind of problem the stuff was invented for. The world isn't ASCII anymore. Michael -- Michael Lemke Astronomy, UT Austin, Texas (michael@io.as.utexas.edu or UTSPAN::UTADNX::IO::MICHAEL [SPAN]) From: scott@inferno.Kodak.COM (Kevin Scott) Subject: Re: 8 bit clean implies what? Organization: Eastman Kodak Company Date: Sun, 7 Feb 93 18:30:16 GMT Lines: 19 For what it's worth, here is my OPINION on what 8-bit clean means: 1) you can use an 8-bit-clean text editor to edit non-text files (such as .EXE files or .COM files or binary data files). This would be of occasional use to hack in changes in any embedded text in the file you are editing. I have been able to use the Turbo C editor (ver 1.0) to do this type of thing (or perhaps it was a Turbo Pascal editor; I forget; the timeframe was 1987 or so). Of course, if you are editing a file that is not intended to be text, the editor must not have any restriction on line length or requirement that non-empty files end with a newline (sequence). 2) it is perfectly OK to represent non-printable characters as a multicharacter sequence (such as ^A for ASCII code 1). What is "printable" vs. "non-printable" is determined by the environment. 3) it is possible to enter any 8-bit character from the keyboard on any IBM-PC compatible system. Just hold down Alt while typing the desired character code on the numeric keypad. From: jhallen@world.std.com (Joseph H Allen) Subject: Re: 8 bit clean implies what? Organization: The World Public Access UNIX, Brookline, MA Date: Sun, 7 Feb 1993 21:25:10 GMT Lines: 73 In article davis@pacific.mps.ohio-state.edu (John E. Davis) writes: >Hi, >I have a few questions regarding the meaning of 8 bit clean editors. >As I understand it, an editor which is 8 bit clean can display ALL 256 >characters on the output device. That is, the character should not be mapped >to a displayable representation (i.e., ascii char 1 to two character sequence >^A). So for example, if character 235 corresponds to the greek letter alpha >on the output device, an alpha should appear when char 235 is sent. In >addition, the editor should be able to take ANY 8 bit character form the input >device and display it. That is, if the input device is capable of sending the >char 235 (alpha), then the char should be deisplayed as above. Is this >correct? Yes. But here's another fly in the ointment: You shouldn't be so eurocentric... there are apparently versions of vt220s which display two successive characters as a single chinese or japanese character. So you need to make a mode where all deletes operate on two characters... (I would appreciate it if someone would elaborate on this more.. I still need to add this mode to my editor JOE. I don't understand yet if the character set is broken into half-charatcers which fit together or if the first character is really a prefix character). >On my PC, the char 255 do not display anything on the screen (just a space). >255 is also -1 when converted to signed char and usually denoted end of file >or something special like that. Is it just a coincidence that 255 displays >nothing on my PC or is this a general feature? Should I make any assumptions >regarding 255? I would like to reserve it for my own purposes. Nope, can't do that. Some international character sets use 255. Originally I tried to make all of the 'chars' in JOE 'unsigned chars' so that when a character was returned in an 'int' the range was 0 - 255 instead of -128 - 127. That way -1 could still be an error return. The only problem is that stupid ANSI compilers give bazillions of warnings (it's bad enough they give warning for 'char *' being mixed with 'const char *', but char/unsigned-char warnings are rediculous. I hate ANSI-C. I wish it would go away. (stupid catering to the IBM PC..)). Anyway, I now use MAXINT (defined as 2^31-1 or 32767) for error returns and have characters in the range of -128 to 127. You still need to cast them to unsigned sometimes (for table lookup), but not very often. I've decided that 'unsigned' as a C keyword is close to useless because of the compatibility problems, so I now avoid it as much as possible. >Finally, are characters with the hi bit set (>= 128) ever involved in keymaps? >This might seem like a silly question but for my purposes, it is the most >important question. I tend to think of keymaps as involving only 7 bit chars, >e.g., escape map. But is any known case of a keymap where the prefix character >has the high bit set? True gnu-emacs keyboards are supposed to have a Meta- key, which sets the high bit. In Linux, there is a mode which makes the ALT- key the Meta- key... >In case you are wondering, I am working on an editor (JED). Recently, I >released version 0.80 which I thought to be 8 bit clean, but in retrospect, it >is not. I hear people say ``Just treat ALL characters the same!''. However, I >am concerned with memory usage on PCs and I would like to cut corners wherever >I can. :-) Software virtual memory... > Berfore I release the next version (0.81), I want to make SURE that I >get the 8 bit thing correct. JED is neat. The extension language looks like reverse-polish LISP, but without parenthasis. -- /* jhallen@world.std.com (192.74.137.5) */ /* Joseph H. Allen */ int a[1817];main(z,p,q,r){for(p=80;q+p-80;p-=2*a[p])for(z=9;z--;)q=3&(r=time(0) +r*57)/7,q=q?q-1?q-2?1-p%79?-1:0:p%79-77?1:0:p<1659?79:0:p>158?-79:0,q?!a[p+q*2 ]?a[p+=a[p+=q]=q]=q:0:0;for(;q++-1817;)printf(q%79?"%c":"%c\n"," #"[!a[q-1]]);} From: michael@chpc.utexas.edu (Michael Lemke) Subject: Re: 8 bit clean implies what? Organization: The University of Texas System - CHPC Date: Sun, 7 Feb 93 21:39:01 GMT Lines: 36 In article <1993Feb7.183016.23290@kodak.kodak.com> scott@inferno.Kodak.COM (Kevin Scott) writes: >For what it's worth, here is my OPINION on what 8-bit clean means: > >1) you can use an 8-bit-clean text editor to edit non-text files > (such as .EXE files or .COM files or binary data files). This > would be of occasional use to hack in changes in any embedded text > in the file you are editing. I have been able to use the Turbo C > editor (ver 1.0) to do this type of thing (or perhaps it was a > Turbo Pascal editor; I forget; the timeframe was 1987 or so). > Of course, if you are editing a file that is not intended to be > text, the editor must not have any restriction on line length or > requirement that non-empty files end with a newline (sequence). > >2) it is perfectly OK to represent non-printable characters as a > multicharacter sequence (such as ^A for ASCII code 1). What is > "printable" vs. "non-printable" is determined by the environment. > >3) it is possible to enter any 8-bit character from the keyboard on > any IBM-PC compatible system. Just hold down Alt while typing the > desired character code on the numeric keypad. This I would call a *binary* editor. A 8-bit clean text editor allows me to enter any *printable* character from my keyboard, which is done with the compose key in my current set up. Don't restrict your views to PCs. As I said in an other post in this thread, it also means the editor knows how to capitalize åNgSTrÖm as Ångström. The thing I am using right now does not allow me to do either of these but I can enter the characters numerically in a similar fashion as you describe above. Not really 8bit clean but quite a pain. Michael -- Michael Lemke Astronomy, UT Austin, Texas (michael@io.as.utexas.edu or UTSPAN::UTADNX::IO::MICHAEL [SPAN]) From: upham@cs.ubc.ca (Derek Upham) Subject: Re: 8 bit clean implies what? Date: 7 Feb 1993 18:32:33 -0800 Organization: Raven's Auto Body Repair Shop, Mega-Tokyo. Lines: 24 jhallen@world.std.com (Joseph H Allen) writes: >Yes. But here's another fly in the ointment: You shouldn't be so >eurocentric... there are apparently versions of vt220s which display two >successive characters as a single chinese or japanese character. So you >need to make a mode where all deletes operate on two characters... Actually, it gets worse than that. The GB and Big5 character sets used in Taiwan have FOUR-byte characters. In general, an application looks at the high-order bit of the byte "n". If it is zero, the byte is interpreted as 7-bit ASCII. Otherwise it is interpreted as the first byte of some character in the alternate set. What's more, there are various ways of interpreting high-bits in successive bytes to switch between character sets and save space (the specifics now escape me, unfortunately). In general, if you want to be safe, do everything on four byte characters internally, and then add conversion interfaces to work with whatever character set is needed. Derek -- Derek Lynn Upham University of British Columbia upham@cs.ubc.ca Computer Science Department ============================================================================= "Ha! Your Leaping Tiger Kung Fu is no match for my Frightened Piglet Style!" From: ketil@edb.tih.no (Ketil Albertsen,TIH) Subject: Re: 8 bit clean implies what? Organization: T I H / T I S I P Date: Mon, 8 Feb 1993 09:07:53 GMT Lines: 48 In article , davis@pacific.mps.ohio-state.edu ("John E. Davis") writes: >As I understand it, an editor which is 8 bit clean can display ALL 256 >characters on the output device. If you go by international standards (frequently called ANSI standards by the US community... :->): No. The 190 (+space) characters. There just aren't 256 character (code)s. The 64 codes from 0 to 31, and 127 (DEL) to 160 are NOT character codes but control codes. The correct handling is to *process* them rather than to display them. The processing may have an effect on the display, eg. CR and LF, both changing the active position, or ESC-sequences switching to a different character set (among other things), but the codes are not, per se, "displayed". The "processing" may be limited to simply storing (conserving) them because the display or software does not support the defined function for the control code. Wrt. input: There should be no restriction on how you enter the control functions. CR (13) has its own key (did you ever notice that uppercase letters are entered by key combinations?), but there is nothing wrong by entering ESC [ 1 2 m ("Second alternative font") as a menu choice rather than as five separate hex values. But obviously this assumes that you plan to honor ISO character code definitions. So, many people would say that it is not "clean". But as another poster commented, there is a distinction between a binary editor and an 8-bit clean editor. If you want to be able to edit arbitrary character sets, with arbitrary use of the control codes (CR/LF relocated to other code positions...), then you need a binary editor. IMHO it is sufficient for an editor anno 1993 to support ISO character sets - preferably all of them. >Is it just a coincidence that 255 displays >nothing on my PC or is this a general feature? Should I make any assumptions >regarding 255? I would like to reserve it for my own purposes. In 8859/1, 255 is umlaut y. In 8859/2, /3 and /4 it is dot above. Several character sets do not use 160 and 255 because it would prohibit representation in a 7-bit environment; ISO 2022 distinguishes between 94 and 96 character C1 sets. Before you run out to buy the entire collection of ISO standards for character sets and control functions: If you were to implement all of it, you'd have enough to do for the rest of your life... Writing a binary editor may be simpler. From: wolff@inf.fu-berlin.de (Thomas Wolff) Subject: Re: 8 bit clean implies what? Organization: Free University of Berlin, Germany Date: Mon, 8 Feb 1993 18:15:16 GMT Lines: 10 scott@inferno.Kodak.COM (Kevin Scott) writes: >For what it's worth, here is my OPINION on what 8-bit clean means: >3) it is possible to enter any 8-bit character from the keyboard on > any IBM-PC compatible system. Just hold down Alt while typing the > desired character code on the numeric keypad. Due to some ridiculous mis-function in Microsoft's standard keyboard driver, one of the codes can not be entered that way. (I seem to remember it was 156 or so.) From: guy@Auspex.COM (Guy Harris) Subject: Re: 8 bit clean implies what? Date: 8 Feb 93 19:07:55 GMT Organization: Auspex Systems, Santa Clara Lines: 15 >>However, a single CHANGE_CASE table is >>sufficient if it is guaranteed that lower_case(x) >= upper_case(x). Does >>anyone know if this assumption is valid? > >Yes, this is true for at least ISO-Latin-1 and DEC Multinational >Character Set (almost identical). I.e., the upper-case version of the German "sz" character has a loweer code than the lower-case version? Warning: that is a trick question, at least as I understand the German case conventions. It may be unwise to assume that translating a string from lower-case to upper-case can be done simply by replacing each lower-case letter in the string with a character that's the upper-case version of that letter. From: rnelson@wsuaix.csc.wsu.edu (roger nelson;S23487) Subject: Re: 8 bit clean implies what? Organization: Washington State University Date: Mon, 8 Feb 93 20:22:05 GMT Lines: 34 In article <1993Feb6.224910.3822@chpc.utexas.edu> michael@chpc.utexas.edu (Michael Lemke) writes: >In article davis@pacific.mps.ohio-state.edu (John E. Davis) writes: >>In article <1993Feb6.203811.24134@chpc.utexas.edu> michael@chpc.utexas.edu >>(Michael Lemke) writes: >> ...accepting 8bit controls will use them. Secondly, an 8bit clean editor >> needs to know what are corresponding uppercase and lower case >> characters, e.g. å is lower case of Å. >> >>This is an excellent point that I have not thought of. The natural solution >>is through the use of a lookup table. But, in general, this requires TWO >>tables: uppercase and lowercase. However, a single CHANGE_CASE table is >>sufficient if it is guaranteed that lower_case(x) >= upper_case(x). Does >>anyone know if this assumption is valid? > > >Yes, this is true for at least ISO-Latin-1 and DEC Multinational >Character Set (almost identical). The high order part of these char >sets are pretty much an image of the low order part except the 8th bit >is 1. You should really have a look at the ISO-Latin-1 character >tables, for example in the appendix of a terminal that has 8bit chars >(e.g., vt220 and higher, GraphOn225 and higher). Also think about sort >sequences. An ANSI C implementation must provide functions like >isupper(int C) which are controled by the current locale which in turn >is controled by the setlocale function. I haven't done anything with it >but this is exactly the kind of problem the stuff was invented for. >The world isn't ASCII anymore. > >Michael >-- >Michael Lemke >Astronomy, UT Austin, Texas >(michael@io.as.utexas.edu or UTSPAN::UTADNX::IO::MICHAEL [SPAN]) From: goer@ellis.uchicago.edu (Richard L. Goerwitz) Subject: Wide Characters (was Re: 8 bit clean implies what?) Date: 9 Feb 93 01:43:10 GMT Organization: University of Chicago Lines: 18 allen@world.std.com (Joseph H Allen) writes: >>As I understand it, an editor which is 8 bit clean can display ALL 256 >>characters on the output device. >Yes. But here's another fly in the ointment: You shouldn't be so >eurocentric... there are apparently versions of vt220s which display two >successive characters as a single chinese or japanese character.... The idea is to keep all character-based code potentially indifferent to char- acter size. Soon we hope that the internationalization/localization issue will be solved, to some extent, by ISO 10646, which specifies, as I recall, 32-bit wide characters. Somebody correct me if I'm wrong. -- -Richard L. Goerwitz goer%midway@uchicago.bitnet goer@midway.uchicago.edu rutgers!oddjob!ellis!goer From: michael@chpc.utexas.edu (Michael Lemke) Subject: Re: 8 bit clean implies what? Organization: The University of Texas System - CHPC Date: Tue, 9 Feb 93 03:28:15 GMT Lines: 28 In article <16849@auspex-gw.auspex.com> guy@Auspex.COM (Guy Harris) writes: >>>However, a single CHANGE_CASE table is >>>sufficient if it is guaranteed that lower_case(x) >= upper_case(x). Does >>>anyone know if this assumption is valid? >> >>Yes, this is true for at least ISO-Latin-1 and DEC Multinational >>Character Set (almost identical). > >I.e., the upper-case version of the German "sz" character has a loweer >code than the lower-case version? Well, not really. But this is indeed tricky as the reverse, lowercasing SS, is not unique. `MASSE' can be `Masse' or `Maße', depending on context. As a native German I'd let these cases alone. > >Warning: that is a trick question, at least as I understand the German >case conventions. It may be unwise to assume that translating a string >from lower-case to upper-case can be done simply by replacing each >lower-case letter in the string with a character that's the upper-case >version of that letter. Michael -- Michael Lemke Astronomy, UT Austin, Texas (michael@io.as.utexas.edu or UTSPAN::UTADNX::IO::MICHAEL [SPAN]) From: rnelson@wsuaix.csc.wsu.edu (roger nelson;S23487) Subject: Re: 8 bit clean implies what? Sender: news@serval.net.wsu.edu (USENET News System) Organization: Washington State University Date: Tue, 9 Feb 93 07:59:20 GMT Lines: 34 >>This is an excellent point that I have not thought of. The natural solution >>is through the use of a lookup table. But, in general, this requires TWO >>tables: uppercase and lowercase. However, a single CHANGE_CASE table is >>sufficient if it is guaranteed that lower_case(x) >= upper_case(x). Does >>anyone know if this assumption is valid? One will notice that (with the exception of codes 32-63) the lower order nyble of the ASCII code for the uppercase character is the same as the respective lowercase character (and also the respective control character). The most significant bit of the upper order nyble is an encoding of the shift keys used: codes 0000 - 0001 Denote a ctrl character Ie Ctrl-A = 0000 0001 codes 0010 - 0011 don't follow the general encoding scheme ^ codes 0100 - 0101 Denote a shifted char. Ie A = 0100 0001 ^ codes 1000 - 1111 Denote an unshifted char Ie a = 0110 0001 ^ (Note that there are a few exceptions to the shift key encoding: 64,94,95,96,126 and 127.) >Yes, this is true for at least ISO-Latin-1 and DEC Multinational >Character Set (almost identical). The high order part of these char >sets are pretty much an image of the low order part except the 8th bit >is 1. Is the shift key encoding of the characters in the 8-bit character set preserved? Roger From: ketil@edb.tih.no (Ketil Albertsen,TIH) Subject: Re: 8 bit clean implies what? Sender: ketil@edb.tih.no (Ketil Albertsen,TIH) Organization: T I H / T I S I P Date: Tue, 9 Feb 1993 15:37:50 GMT Lines: 17 In article <1993Feb9.075920.20683@serval.net.wsu.edu>, rnelson@wsuaix.csc.wsu.edu (roger nelson;S23487) writes: >One will notice that (with the exception of codes 32-63) the lower order >nyble of the ASCII code for the uppercase character is the same as the >respective lowercase character (and also the respective control character). Basing a case conversion on this is not a good solution. Eg. 8859/2 (suiting a number of East European languages) follows this pattern with a distance of 16 for the codes A9 to AF, but a distance of 32 for C0 to DE. True, the low nibble is the same, but it doesn't help you that much. IS 6937 also has a distance of 16 for most upper-half codes, but with exceptions. And there will always be a number of special cases, such as the German double-s. So, a translation table gets you a lot further. If you extend the table with some trapping mechanism for special cases, you could get it good enough for "any" use. From: ant@mks.com (Anthony Howe) Subject: Re: 8 bit clean implies what? Organization: Mortice Kern Systems Inc., Waterloo, Ontario, CANADA Date: Tue, 9 Feb 1993 14:21:34 GMT Lines: 52 To my knowledge, 8-bit clean means that you must make no assumptions about any characters in the character set other than what the ctype macro/functions tell you. (See ANSI C section 4.3 Character Handling.) "The header declares several functions useful for testing and mapping characters. In all cases the argument is an int, the value of which shall be representable as an assigned char or shall equal the value of the macro EOF. If the argument has any other value, the behaviour is undefined." int isalnum(int c); int iscntrl(int c); int isupper(int c); ... P.J. Plauger has a column in "The C User Journal". In one issue (which I can't remember) he discuss and issues concerning 8-bit clean. I recommend doing a search back over that last two years for it. >From my understanding of the quote above, the ctype table must be at least 256 bytes. You must be careful of sign-extension with char pointers, like { char *p = "Hi\375\376\377 there"; ... if (isalpha(p[2])) { ... } else if (iscntrl(p[4])) { ... } } If your compiler defaults to chars being signed, the results of the ctype ctype table look up will be undefined, since p[2] will be sign-extended to -3 and p[4] will be sign-extended to -1 and so fall off the bottom of the table. Also EOF, typically -1, does NOT equal 255. Remember that the argument is an int, so EOF is really going to be (int) -1 while 255 will be (unsigned char) 255, which are not the same. You should come up with a mapping function something like unctrl() that will represent control characters (non-printables) in a sensible manner. Allow the mapping to be altered/configured from system to system. You also have to be careful about 9-bit char. There are still systems out there that have 9-bit bytes, which would mean a ctype table of 512 bytes. Plauger's article covers all these issues very well. -ant -- ant@mks.com Anthony C Howe Mortice Kern Systems Inc. 35 King St. N., Waterloo, Ontario, Canada, N2J 6W9 "Nice legs. For a human that is." - Worf (Q-pid) From: jschief@finbol.toppoint.de (Joerg Schlaeger) Subject: Re: 8 bit clean implies what? Date: Tue, 09 Feb 93 17:24:01 GMT Lines: 35 upham@cs.ubc.ca writes in article <1l4go1INNq8v@cascade.cs.ubc.ca>: > .................. > switch between character sets and save space (the specifics now escape > me, unfortunately). In general, if you want to be safe, do everything > on four byte characters internally, and then add conversion interfaces > to work with whatever character set is needed. > > Derek > > -- > Derek Lynn Upham University of British Columbia > upham@cs.ubc.ca Computer Science Department > ============================================================================= > "Ha! Your Leaping Tiger Kung Fu is no match for my Frightened Piglet Style!" > Hi, and please don't forget the big- & small endian problem for more than one byte per character, because you can't besure that your stdin is a keyboard and that the byteorder is the allways the same. I've the problem with named pipe's filled with messages from Intel & Motorala Workstations and 16-Bit charsets. Is there anyone who knows the solution for every character set,1 & 2 & 4 Byte. Perhabs a sign like "\n" thats all the same, to detect the need for byteswapping. Joerg -- +++++++++++++++++++++++++++++++++++ Joerg Schlaeger Home: +49 431 682210 (voice & fax & ...) jschief@finbol.toppoint.de ----------------------------------- (to be faster with the /2) +++++++++++++++++++++++++++++++++++ From: jimc@tau-ceti.isc-br.com (Jim Cathey) Newsgroups: comp.editors Subject: Re: 8 bit clean implies what? Date: 10 Feb 93 23:45:35 GMT Organization: Olivetti North America, Spokane, WA Lines: 20 In article <729278641snx@finbol.toppoint.de> jschief@finbol.toppoint.de (Joerg Schlaeger) writes: >and please don't forget the big- & small endian problem for more than one >byte per character, because you can't besure that your stdin is a keyboard >and that the byteorder is the allways the same. Unicode has a magic cookie that's a NOP whose byte-reversed form is also a (different) NOP. It may be embedded in any string (presumably at the front) as a byte-sex tag if you need such things. -- +----------------+ ! II CCCCCC ! Jim Cathey ! II SSSSCC ! ISC-Bunker Ramo ! II CC ! TAF-C8; Spokane, WA 99220 ! IISSSS CC ! UUCP: uunet!isc-br!jimc (jimc@isc-br.isc-br.com) ! II CCCCCC ! (509) 927-5757 +----------------+ One Design to rule them all; one Design to find them. One Design to bring them all and in the darkness bind them. In the land of Mediocrity where the PC's lie.