753 lines
31 KiB
Plaintext
753 lines
31 KiB
Plaintext
|
From: davis@pacific.mps.ohio-state.edu ("John E. Davis")
|
|||
|
Subject: 8 bit clean implies what?
|
|||
|
Reply-To: davis@pacific.mps.ohio-state.edu (John E. Davis)
|
|||
|
Date: Sat, 6 Feb 1993 18:22:29 GMT
|
|||
|
Lines: 46
|
|||
|
|
|||
|
Hi,
|
|||
|
|
|||
|
I have a few questions regarding the meaning of 8 bit clean editors.
|
|||
|
|
|||
|
|
|||
|
As I understand it, an editor which is 8 bit clean can display ALL 256
|
|||
|
characters on the output device. That is, the character should not be mapped
|
|||
|
to a displayable representation (i.e., ascii char 1 to two character sequence
|
|||
|
^A). So for example, if character 235 corresponds to the greek letter alpha
|
|||
|
on the output device, an alpha should appear when char 235 is sent. In
|
|||
|
addition, the editor should be able to take ANY 8 bit character form the input
|
|||
|
device and display it. That is, if the input device is capable of sending the
|
|||
|
char 235 (alpha), then the char should be deisplayed as above. Is this
|
|||
|
correct?
|
|||
|
|
|||
|
On my PC, the char 255 do not display anything on the screen (just a space).
|
|||
|
255 is also -1 when converted to signed char and usually denoted end of file
|
|||
|
or something special like that. Is it just a coincidence that 255 displays
|
|||
|
nothing on my PC or is this a general feature? Should I make any assumptions
|
|||
|
regarding 255? I would like to reserve it for my own purposes.
|
|||
|
|
|||
|
Finally, are characters with the hi bit set (>= 128) ever involved in keymaps?
|
|||
|
This might seem like a silly question but for my purposes, it is the most
|
|||
|
important question. I tend to think of keymaps as involving only 7 bit chars,
|
|||
|
e.g., escape map. But is any known case of a keymap where the prefix character
|
|||
|
has the high bit set?
|
|||
|
|
|||
|
In case you are wondering, I am working on an editor (JED). Recently, I
|
|||
|
released version 0.80 which I thought to be 8 bit clean, but in retrospect, it
|
|||
|
is not. I hear people say ``Just treat ALL characters the same!''. However, I
|
|||
|
am concerned with memory usage on PCs and I would like to cut corners wherever
|
|||
|
I can. Berfore I release the next version (0.81), I want to make SURE that I
|
|||
|
get the 8 bit thing correct.
|
|||
|
|
|||
|
I appreciate any comments on the subject. Thank You.
|
|||
|
|
|||
|
|
|||
|
|
|||
|
--
|
|||
|
_____________
|
|||
|
#___/John E. Davis\_________________________________________________________
|
|||
|
#
|
|||
|
# internet: davis@amy.tch.harvard.edu
|
|||
|
# bitnet: davis@ohstpy
|
|||
|
# office: 617-735-6746
|
|||
|
#
|
|||
|
|
|||
|
|
|||
|
From: michael@chpc.utexas.edu (Michael Lemke)
|
|||
|
Subject: Re: 8 bit clean implies what?
|
|||
|
Organization: The University of Texas System - CHPC
|
|||
|
Date: Sat, 6 Feb 93 20:38:11 GMT
|
|||
|
Lines: 38
|
|||
|
|
|||
|
In article <DAVIS.93Feb6132229@pacific.mps.ohio-state.edu> davis@pacific.mps.ohio-state.edu (John E. Davis) writes:
|
|||
|
>Hi,
|
|||
|
>
|
|||
|
>I have a few questions regarding the meaning of 8 bit clean editors.
|
|||
|
>
|
|||
|
>
|
|||
|
>As I understand it, an editor which is 8 bit clean can display ALL 256
|
|||
|
>characters on the output device.
|
|||
|
>That is, the character should not be mapped
|
|||
|
>to a displayable representation (i.e., ascii char 1 to two character sequence
|
|||
|
>^A).
|
|||
|
|
|||
|
I don't think this is correct but I am not an expert on this. Check out
|
|||
|
what ISO-Latin-1 means. There are quite a lot of control sequences
|
|||
|
>128. e.g., CSI which is ESC [? in 7bit. Any terminal sending or
|
|||
|
accepting 8bit controls will use them. Secondly, an 8bit clean editor
|
|||
|
needs to know what are corresponding uppercase and lower case
|
|||
|
characters, e.g. <20>is lower case of <20>.
|
|||
|
|
|||
|
>Finally, are characters with the hi bit set (>= 128) ever involved in keymaps?
|
|||
|
|
|||
|
Yes, see my comment above.
|
|||
|
|
|||
|
>This might seem like a silly question but for my purposes, it is the most
|
|||
|
>important question. I tend to think of keymaps as involving only 7 bit chars,
|
|||
|
>e.g., escape map. But is any known case of a keymap where the prefix character
|
|||
|
>has the high bit set?
|
|||
|
|
|||
|
I do think but haven't tried that my vt220 will send CSI something or so
|
|||
|
if I tell it to use 8bit control chars, which I haven't. You might also
|
|||
|
look at the C LC_CTYPE stuff or how that is called, don't have my C book
|
|||
|
handy. There is support for 8bit char sets.
|
|||
|
|
|||
|
Michael
|
|||
|
--
|
|||
|
Michael Lemke
|
|||
|
Astronomy, UT Austin, Texas
|
|||
|
(michael@io.as.utexas.edu or UTSPAN::UTADNX::IO::MICHAEL [SPAN])
|
|||
|
|
|||
|
|
|||
|
From: davis@pacific.mps.ohio-state.edu ("John E. Davis")
|
|||
|
Subject: Re: 8 bit clean implies what?
|
|||
|
Reply-To: davis@pacific.mps.ohio-state.edu (John E. Davis)
|
|||
|
Organization: "Dept. of Physics, The Ohio State University"
|
|||
|
Date: Sat, 6 Feb 1993 21:16:29 GMT
|
|||
|
Lines: 19
|
|||
|
|
|||
|
In article <1993Feb6.203811.24134@chpc.utexas.edu> michael@chpc.utexas.edu
|
|||
|
(Michael Lemke) writes:
|
|||
|
...accepting 8bit controls will use them. Secondly, an 8bit clean editor
|
|||
|
needs to know what are corresponding uppercase and lower case
|
|||
|
characters, e.g. <20>is lower case of <20>.
|
|||
|
|
|||
|
This is an excellent point that I have not thought of. The natural solution
|
|||
|
is through the use of a lookup table. But, in general, this requires TWO
|
|||
|
tables: uppercase and lowercase. However, a single CHANGE_CASE table is
|
|||
|
sufficient if it is guaranteed that lower_case(x) >= upper_case(x). Does
|
|||
|
anyone know if this assumption is valid?
|
|||
|
--
|
|||
|
_____________
|
|||
|
#___/John E. Davis\_________________________________________________________
|
|||
|
#
|
|||
|
# internet: davis@amy.tch.harvard.edu
|
|||
|
# bitnet: davis@ohstpy
|
|||
|
# office: 617-735-6746
|
|||
|
#
|
|||
|
|
|||
|
|
|||
|
From: michael@chpc.utexas.edu (Michael Lemke)
|
|||
|
Subject: Re: 8 bit clean implies what?
|
|||
|
Organization: The University of Texas System - CHPC
|
|||
|
Date: Sat, 6 Feb 93 22:49:10 GMT
|
|||
|
Lines: 31
|
|||
|
|
|||
|
In article <DAVIS.93Feb6161629@pacific.mps.ohio-state.edu> davis@pacific.mps.ohio-state.edu (John E. Davis) writes:
|
|||
|
>In article <1993Feb6.203811.24134@chpc.utexas.edu> michael@chpc.utexas.edu
|
|||
|
>(Michael Lemke) writes:
|
|||
|
> ...accepting 8bit controls will use them. Secondly, an 8bit clean editor
|
|||
|
> needs to know what are corresponding uppercase and lower case
|
|||
|
> characters, e.g. <20> is lower case of <20>.
|
|||
|
>
|
|||
|
>This is an excellent point that I have not thought of. The natural solution
|
|||
|
>is through the use of a lookup table. But, in general, this requires TWO
|
|||
|
>tables: uppercase and lowercase. However, a single CHANGE_CASE table is
|
|||
|
>sufficient if it is guaranteed that lower_case(x) >= upper_case(x). Does
|
|||
|
>anyone know if this assumption is valid?
|
|||
|
|
|||
|
|
|||
|
Yes, this is true for at least ISO-Latin-1 and DEC Multinational
|
|||
|
Character Set (almost identical). The high order part of these char
|
|||
|
sets are pretty much an image of the low order part except the 8th bit
|
|||
|
is 1. You should really have a look at the ISO-Latin-1 character
|
|||
|
tables, for example in the appendix of a terminal that has 8bit chars
|
|||
|
(e.g., vt220 and higher, GraphOn225 and higher). Also think about sort
|
|||
|
sequences. An ANSI C implementation must provide functions like
|
|||
|
isupper(int C) which are controled by the current locale which in turn
|
|||
|
is controled by the setlocale function. I haven't done anything with it
|
|||
|
but this is exactly the kind of problem the stuff was invented for.
|
|||
|
The world isn't ASCII anymore.
|
|||
|
|
|||
|
Michael
|
|||
|
--
|
|||
|
Michael Lemke
|
|||
|
Astronomy, UT Austin, Texas
|
|||
|
(michael@io.as.utexas.edu or UTSPAN::UTADNX::IO::MICHAEL [SPAN])
|
|||
|
|
|||
|
|
|||
|
From: scott@inferno.Kodak.COM (Kevin Scott)
|
|||
|
Subject: Re: 8 bit clean implies what?
|
|||
|
Organization: Eastman Kodak Company
|
|||
|
Date: Sun, 7 Feb 93 18:30:16 GMT
|
|||
|
Lines: 19
|
|||
|
|
|||
|
For what it's worth, here is my OPINION on what 8-bit clean means:
|
|||
|
|
|||
|
1) you can use an 8-bit-clean text editor to edit non-text files
|
|||
|
(such as .EXE files or .COM files or binary data files). This
|
|||
|
would be of occasional use to hack in changes in any embedded text
|
|||
|
in the file you are editing. I have been able to use the Turbo C
|
|||
|
editor (ver 1.0) to do this type of thing (or perhaps it was a
|
|||
|
Turbo Pascal editor; I forget; the timeframe was 1987 or so).
|
|||
|
Of course, if you are editing a file that is not intended to be
|
|||
|
text, the editor must not have any restriction on line length or
|
|||
|
requirement that non-empty files end with a newline (sequence).
|
|||
|
|
|||
|
2) it is perfectly OK to represent non-printable characters as a
|
|||
|
multicharacter sequence (such as ^A for ASCII code 1). What is
|
|||
|
"printable" vs. "non-printable" is determined by the environment.
|
|||
|
|
|||
|
3) it is possible to enter any 8-bit character from the keyboard on
|
|||
|
any IBM-PC compatible system. Just hold down Alt while typing the
|
|||
|
desired character code on the numeric keypad.
|
|||
|
|
|||
|
|
|||
|
From: jhallen@world.std.com (Joseph H Allen)
|
|||
|
Subject: Re: 8 bit clean implies what?
|
|||
|
Organization: The World Public Access UNIX, Brookline, MA
|
|||
|
Date: Sun, 7 Feb 1993 21:25:10 GMT
|
|||
|
Lines: 73
|
|||
|
|
|||
|
In article <DAVIS.93Feb6132229@pacific.mps.ohio-state.edu> davis@pacific.mps.ohio-state.edu (John E. Davis) writes:
|
|||
|
>Hi,
|
|||
|
|
|||
|
>I have a few questions regarding the meaning of 8 bit clean editors.
|
|||
|
|
|||
|
>As I understand it, an editor which is 8 bit clean can display ALL 256
|
|||
|
>characters on the output device. That is, the character should not be mapped
|
|||
|
>to a displayable representation (i.e., ascii char 1 to two character sequence
|
|||
|
>^A). So for example, if character 235 corresponds to the greek letter alpha
|
|||
|
>on the output device, an alpha should appear when char 235 is sent. In
|
|||
|
>addition, the editor should be able to take ANY 8 bit character form the input
|
|||
|
>device and display it. That is, if the input device is capable of sending the
|
|||
|
>char 235 (alpha), then the char should be deisplayed as above. Is this
|
|||
|
>correct?
|
|||
|
|
|||
|
Yes. But here's another fly in the ointment: You shouldn't be so
|
|||
|
eurocentric... there are apparently versions of vt220s which display two
|
|||
|
successive characters as a single chinese or japanese character. So you
|
|||
|
need to make a mode where all deletes operate on two characters... (I would
|
|||
|
appreciate it if someone would elaborate on this more.. I still need to add
|
|||
|
this mode to my editor JOE. I don't understand yet if the character set is
|
|||
|
broken into half-charatcers which fit together or if the first character is
|
|||
|
really a prefix character).
|
|||
|
|
|||
|
>On my PC, the char 255 do not display anything on the screen (just a space).
|
|||
|
>255 is also -1 when converted to signed char and usually denoted end of file
|
|||
|
>or something special like that. Is it just a coincidence that 255 displays
|
|||
|
>nothing on my PC or is this a general feature? Should I make any assumptions
|
|||
|
>regarding 255? I would like to reserve it for my own purposes.
|
|||
|
|
|||
|
Nope, can't do that. Some international character sets use 255. Originally
|
|||
|
I tried to make all of the 'chars' in JOE 'unsigned chars' so that when a
|
|||
|
character was returned in an 'int' the range was 0 - 255 instead of -128 -
|
|||
|
127. That way -1 could still be an error return. The only problem is that
|
|||
|
stupid ANSI compilers give bazillions of warnings (it's bad enough they give
|
|||
|
warning for 'char *' being mixed with 'const char *', but char/unsigned-char
|
|||
|
warnings are rediculous. I hate ANSI-C. I wish it would go away. (stupid
|
|||
|
catering to the IBM PC..)). Anyway, I now use MAXINT (defined as 2^31-1 or
|
|||
|
32767) for error returns and have characters in the range of -128 to 127.
|
|||
|
You still need to cast them to unsigned sometimes (for table lookup), but
|
|||
|
not very often.
|
|||
|
|
|||
|
I've decided that 'unsigned' as a C keyword is close to useless because of
|
|||
|
the compatibility problems, so I now avoid it as much as possible.
|
|||
|
|
|||
|
>Finally, are characters with the hi bit set (>= 128) ever involved in keymaps?
|
|||
|
>This might seem like a silly question but for my purposes, it is the most
|
|||
|
>important question. I tend to think of keymaps as involving only 7 bit chars,
|
|||
|
>e.g., escape map. But is any known case of a keymap where the prefix character
|
|||
|
>has the high bit set?
|
|||
|
|
|||
|
True gnu-emacs keyboards are supposed to have a Meta- key, which sets the
|
|||
|
high bit. In Linux, there is a mode which makes the ALT- key the Meta-
|
|||
|
key...
|
|||
|
|
|||
|
>In case you are wondering, I am working on an editor (JED). Recently, I
|
|||
|
>released version 0.80 which I thought to be 8 bit clean, but in retrospect, it
|
|||
|
>is not. I hear people say ``Just treat ALL characters the same!''. However, I
|
|||
|
>am concerned with memory usage on PCs and I would like to cut corners wherever
|
|||
|
>I can.
|
|||
|
|
|||
|
:-) Software virtual memory...
|
|||
|
|
|||
|
> Berfore I release the next version (0.81), I want to make SURE that I
|
|||
|
>get the 8 bit thing correct.
|
|||
|
|
|||
|
JED is neat. The extension language looks like reverse-polish LISP, but
|
|||
|
without parenthasis.
|
|||
|
--
|
|||
|
/* jhallen@world.std.com (192.74.137.5) */ /* Joseph H. Allen */
|
|||
|
int a[1817];main(z,p,q,r){for(p=80;q+p-80;p-=2*a[p])for(z=9;z--;)q=3&(r=time(0)
|
|||
|
+r*57)/7,q=q?q-1?q-2?1-p%79?-1:0:p%79-77?1:0:p<1659?79:0:p>158?-79:0,q?!a[p+q*2
|
|||
|
]?a[p+=a[p+=q]=q]=q:0:0;for(;q++-1817;)printf(q%79?"%c":"%c\n"," #"[!a[q-1]]);}
|
|||
|
|
|||
|
|
|||
|
From: michael@chpc.utexas.edu (Michael Lemke)
|
|||
|
Subject: Re: 8 bit clean implies what?
|
|||
|
Organization: The University of Texas System - CHPC
|
|||
|
Date: Sun, 7 Feb 93 21:39:01 GMT
|
|||
|
Lines: 36
|
|||
|
|
|||
|
In article <1993Feb7.183016.23290@kodak.kodak.com> scott@inferno.Kodak.COM (Kevin Scott) writes:
|
|||
|
>For what it's worth, here is my OPINION on what 8-bit clean means:
|
|||
|
>
|
|||
|
>1) you can use an 8-bit-clean text editor to edit non-text files
|
|||
|
> (such as .EXE files or .COM files or binary data files). This
|
|||
|
> would be of occasional use to hack in changes in any embedded text
|
|||
|
> in the file you are editing. I have been able to use the Turbo C
|
|||
|
> editor (ver 1.0) to do this type of thing (or perhaps it was a
|
|||
|
> Turbo Pascal editor; I forget; the timeframe was 1987 or so).
|
|||
|
> Of course, if you are editing a file that is not intended to be
|
|||
|
> text, the editor must not have any restriction on line length or
|
|||
|
> requirement that non-empty files end with a newline (sequence).
|
|||
|
>
|
|||
|
>2) it is perfectly OK to represent non-printable characters as a
|
|||
|
> multicharacter sequence (such as ^A for ASCII code 1). What is
|
|||
|
> "printable" vs. "non-printable" is determined by the environment.
|
|||
|
>
|
|||
|
>3) it is possible to enter any 8-bit character from the keyboard on
|
|||
|
> any IBM-PC compatible system. Just hold down Alt while typing the
|
|||
|
> desired character code on the numeric keypad.
|
|||
|
|
|||
|
|
|||
|
This I would call a *binary* editor. A 8-bit clean text editor allows
|
|||
|
me to enter any *printable* character from my keyboard, which is done
|
|||
|
with the compose key in my current set up. Don't restrict your views
|
|||
|
to PCs. As I said in an other post in this thread, it also means the
|
|||
|
editor knows how to capitalize <20>NgSTr<54>m as <20>ngstr<74>m.
|
|||
|
The thing I am using right now does not allow me to do either of these
|
|||
|
but I can enter the characters numerically in a similar fashion as you
|
|||
|
describe above. Not really 8bit clean but quite a pain.
|
|||
|
|
|||
|
Michael
|
|||
|
--
|
|||
|
Michael Lemke
|
|||
|
Astronomy, UT Austin, Texas
|
|||
|
(michael@io.as.utexas.edu or UTSPAN::UTADNX::IO::MICHAEL [SPAN])
|
|||
|
|
|||
|
|
|||
|
From: upham@cs.ubc.ca (Derek Upham)
|
|||
|
Subject: Re: 8 bit clean implies what?
|
|||
|
Date: 7 Feb 1993 18:32:33 -0800
|
|||
|
Organization: Raven's Auto Body Repair Shop, Mega-Tokyo.
|
|||
|
Lines: 24
|
|||
|
|
|||
|
jhallen@world.std.com (Joseph H Allen) writes:
|
|||
|
>Yes. But here's another fly in the ointment: You shouldn't be so
|
|||
|
>eurocentric... there are apparently versions of vt220s which display two
|
|||
|
>successive characters as a single chinese or japanese character. So you
|
|||
|
>need to make a mode where all deletes operate on two characters...
|
|||
|
|
|||
|
Actually, it gets worse than that. The GB and Big5 character sets
|
|||
|
used in Taiwan have FOUR-byte characters. In general, an application
|
|||
|
looks at the high-order bit of the byte "n". If it is zero, the byte
|
|||
|
is interpreted as 7-bit ASCII. Otherwise it is interpreted as the
|
|||
|
first byte of some character in the alternate set. What's more, there
|
|||
|
are various ways of interpreting high-bits in successive bytes to
|
|||
|
switch between character sets and save space (the specifics now escape
|
|||
|
me, unfortunately). In general, if you want to be safe, do everything
|
|||
|
on four byte characters internally, and then add conversion interfaces
|
|||
|
to work with whatever character set is needed.
|
|||
|
|
|||
|
Derek
|
|||
|
|
|||
|
--
|
|||
|
Derek Lynn Upham University of British Columbia
|
|||
|
upham@cs.ubc.ca Computer Science Department
|
|||
|
=============================================================================
|
|||
|
"Ha! Your Leaping Tiger Kung Fu is no match for my Frightened Piglet Style!"
|
|||
|
|
|||
|
|
|||
|
From: ketil@edb.tih.no (Ketil Albertsen,TIH)
|
|||
|
Subject: Re: 8 bit clean implies what?
|
|||
|
Organization: T I H / T I S I P
|
|||
|
Date: Mon, 8 Feb 1993 09:07:53 GMT
|
|||
|
Lines: 48
|
|||
|
|
|||
|
In article <DAVIS.93Feb6132229@pacific.mps.ohio-state.edu>, davis@pacific.mps.ohio-state.edu
|
|||
|
("John E. Davis") writes:
|
|||
|
|
|||
|
>As I understand it, an editor which is 8 bit clean can display ALL 256
|
|||
|
>characters on the output device.
|
|||
|
|
|||
|
If you go by international standards (frequently called ANSI standards by the
|
|||
|
US community... :->): No. The 190 (+space) characters. There just aren't 256
|
|||
|
character (code)s.
|
|||
|
|
|||
|
The 64 codes from 0 to 31, and 127 (DEL) to 160 are NOT character codes but
|
|||
|
control codes. The correct handling is to *process* them rather than to
|
|||
|
display them. The processing may have an effect on the display, eg. CR and
|
|||
|
LF, both changing the active position, or ESC-sequences switching to a
|
|||
|
different character set (among other things), but the codes are not, per se,
|
|||
|
"displayed". The "processing" may be limited to simply storing (conserving)
|
|||
|
them because the display or software does not support the defined function
|
|||
|
for the control code.
|
|||
|
|
|||
|
Wrt. input: There should be no restriction on how you enter the control
|
|||
|
functions. CR (13) has its own key (did you ever notice that uppercase
|
|||
|
letters are entered by key combinations?), but there is nothing wrong by
|
|||
|
entering ESC [ 1 2 m ("Second alternative font") as a menu choice rather
|
|||
|
than as five separate hex values.
|
|||
|
|
|||
|
But obviously this assumes that you plan to honor ISO character code
|
|||
|
definitions. So, many people would say that it is not "clean". But as
|
|||
|
another poster commented, there is a distinction between a binary
|
|||
|
editor and an 8-bit clean editor. If you want to be able to edit arbitrary
|
|||
|
character sets, with arbitrary use of the control codes (CR/LF relocated
|
|||
|
to other code positions...), then you need a binary editor. IMHO it is
|
|||
|
sufficient for an editor anno 1993 to support ISO character sets -
|
|||
|
preferably all of them.
|
|||
|
|
|||
|
>Is it just a coincidence that 255 displays
|
|||
|
>nothing on my PC or is this a general feature? Should I make any assumptions
|
|||
|
>regarding 255? I would like to reserve it for my own purposes.
|
|||
|
|
|||
|
In 8859/1, 255 is umlaut y. In 8859/2, /3 and /4 it is dot above. Several
|
|||
|
character sets do not use 160 and 255 because it would prohibit representation
|
|||
|
in a 7-bit environment; ISO 2022 distinguishes between 94 and 96 character
|
|||
|
C1 sets.
|
|||
|
|
|||
|
Before you run out to buy the entire collection of ISO standards for character
|
|||
|
sets and control functions: If you were to implement all of it, you'd have
|
|||
|
enough to do for the rest of your life... Writing a binary editor may be
|
|||
|
simpler.
|
|||
|
|
|||
|
|
|||
|
|
|||
|
From: wolff@inf.fu-berlin.de (Thomas Wolff)
|
|||
|
Subject: Re: 8 bit clean implies what?
|
|||
|
Organization: Free University of Berlin, Germany
|
|||
|
Date: Mon, 8 Feb 1993 18:15:16 GMT
|
|||
|
Lines: 10
|
|||
|
|
|||
|
scott@inferno.Kodak.COM (Kevin Scott) writes:
|
|||
|
|
|||
|
>For what it's worth, here is my OPINION on what 8-bit clean means:
|
|||
|
|
|||
|
>3) it is possible to enter any 8-bit character from the keyboard on
|
|||
|
> any IBM-PC compatible system. Just hold down Alt while typing the
|
|||
|
> desired character code on the numeric keypad.
|
|||
|
Due to some ridiculous mis-function in Microsoft's standard keyboard
|
|||
|
driver, one of the codes can not be entered that way. (I seem to remember
|
|||
|
it was 156 or so.)
|
|||
|
|
|||
|
|
|||
|
From: guy@Auspex.COM (Guy Harris)
|
|||
|
Subject: Re: 8 bit clean implies what?
|
|||
|
Date: 8 Feb 93 19:07:55 GMT
|
|||
|
Organization: Auspex Systems, Santa Clara
|
|||
|
Lines: 15
|
|||
|
|
|||
|
>>However, a single CHANGE_CASE table is
|
|||
|
>>sufficient if it is guaranteed that lower_case(x) >= upper_case(x). Does
|
|||
|
>>anyone know if this assumption is valid?
|
|||
|
>
|
|||
|
>Yes, this is true for at least ISO-Latin-1 and DEC Multinational
|
|||
|
>Character Set (almost identical).
|
|||
|
|
|||
|
I.e., the upper-case version of the German "sz" character has a loweer
|
|||
|
code than the lower-case version?
|
|||
|
|
|||
|
Warning: that is a trick question, at least as I understand the German
|
|||
|
case conventions. It may be unwise to assume that translating a string
|
|||
|
from lower-case to upper-case can be done simply by replacing each
|
|||
|
lower-case letter in the string with a character that's the upper-case
|
|||
|
version of that letter.
|
|||
|
|
|||
|
|
|||
|
From: rnelson@wsuaix.csc.wsu.edu (roger nelson;S23487)
|
|||
|
Subject: Re: 8 bit clean implies what?
|
|||
|
Organization: Washington State University
|
|||
|
Date: Mon, 8 Feb 93 20:22:05 GMT
|
|||
|
Lines: 34
|
|||
|
|
|||
|
In article <1993Feb6.224910.3822@chpc.utexas.edu> michael@chpc.utexas.edu (Michael Lemke) writes:
|
|||
|
>In article <DAVIS.93Feb6161629@pacific.mps.ohio-state.edu> davis@pacific.mps.ohio-state.edu (John E. Davis) writes:
|
|||
|
>>In article <1993Feb6.203811.24134@chpc.utexas.edu> michael@chpc.utexas.edu
|
|||
|
>>(Michael Lemke) writes:
|
|||
|
>> ...accepting 8bit controls will use them. Secondly, an 8bit clean editor
|
|||
|
>> needs to know what are corresponding uppercase and lower case
|
|||
|
>> characters, e.g. <20> is lower case of <20>.
|
|||
|
>>
|
|||
|
>>This is an excellent point that I have not thought of. The natural solution
|
|||
|
>>is through the use of a lookup table. But, in general, this requires TWO
|
|||
|
>>tables: uppercase and lowercase. However, a single CHANGE_CASE table is
|
|||
|
>>sufficient if it is guaranteed that lower_case(x) >= upper_case(x). Does
|
|||
|
>>anyone know if this assumption is valid?
|
|||
|
>
|
|||
|
>
|
|||
|
>Yes, this is true for at least ISO-Latin-1 and DEC Multinational
|
|||
|
>Character Set (almost identical). The high order part of these char
|
|||
|
>sets are pretty much an image of the low order part except the 8th bit
|
|||
|
>is 1. You should really have a look at the ISO-Latin-1 character
|
|||
|
>tables, for example in the appendix of a terminal that has 8bit chars
|
|||
|
>(e.g., vt220 and higher, GraphOn225 and higher). Also think about sort
|
|||
|
>sequences. An ANSI C implementation must provide functions like
|
|||
|
>isupper(int C) which are controled by the current locale which in turn
|
|||
|
>is controled by the setlocale function. I haven't done anything with it
|
|||
|
>but this is exactly the kind of problem the stuff was invented for.
|
|||
|
>The world isn't ASCII anymore.
|
|||
|
>
|
|||
|
>Michael
|
|||
|
>--
|
|||
|
>Michael Lemke
|
|||
|
>Astronomy, UT Austin, Texas
|
|||
|
>(michael@io.as.utexas.edu or UTSPAN::UTADNX::IO::MICHAEL [SPAN])
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
From: goer@ellis.uchicago.edu (Richard L. Goerwitz)
|
|||
|
Subject: Wide Characters (was Re: 8 bit clean implies what?)
|
|||
|
Date: 9 Feb 93 01:43:10 GMT
|
|||
|
Organization: University of Chicago
|
|||
|
Lines: 18
|
|||
|
|
|||
|
allen@world.std.com (Joseph H Allen) writes:
|
|||
|
|
|||
|
>>As I understand it, an editor which is 8 bit clean can display ALL 256
|
|||
|
>>characters on the output device.
|
|||
|
|
|||
|
>Yes. But here's another fly in the ointment: You shouldn't be so
|
|||
|
>eurocentric... there are apparently versions of vt220s which display two
|
|||
|
>successive characters as a single chinese or japanese character....
|
|||
|
|
|||
|
The idea is to keep all character-based code potentially indifferent to char-
|
|||
|
acter size. Soon we hope that the internationalization/localization issue
|
|||
|
will be solved, to some extent, by ISO 10646, which specifies, as I recall,
|
|||
|
32-bit wide characters. Somebody correct me if I'm wrong.
|
|||
|
|
|||
|
--
|
|||
|
|
|||
|
-Richard L. Goerwitz goer%midway@uchicago.bitnet
|
|||
|
goer@midway.uchicago.edu rutgers!oddjob!ellis!goer
|
|||
|
|
|||
|
|
|||
|
From: michael@chpc.utexas.edu (Michael Lemke)
|
|||
|
Subject: Re: 8 bit clean implies what?
|
|||
|
Organization: The University of Texas System - CHPC
|
|||
|
Date: Tue, 9 Feb 93 03:28:15 GMT
|
|||
|
Lines: 28
|
|||
|
|
|||
|
In article <16849@auspex-gw.auspex.com> guy@Auspex.COM (Guy Harris) writes:
|
|||
|
>>>However, a single CHANGE_CASE table is
|
|||
|
>>>sufficient if it is guaranteed that lower_case(x) >= upper_case(x). Does
|
|||
|
>>>anyone know if this assumption is valid?
|
|||
|
>>
|
|||
|
>>Yes, this is true for at least ISO-Latin-1 and DEC Multinational
|
|||
|
>>Character Set (almost identical).
|
|||
|
>
|
|||
|
>I.e., the upper-case version of the German "sz" character has a loweer
|
|||
|
>code than the lower-case version?
|
|||
|
|
|||
|
Well, not really. But this is indeed tricky as the reverse, lowercasing
|
|||
|
SS, is not unique. `MASSE' can be `Masse' or `Ma<4D>e', depending on context.
|
|||
|
As a native German I'd let these cases alone.
|
|||
|
|
|||
|
>
|
|||
|
>Warning: that is a trick question, at least as I understand the German
|
|||
|
>case conventions. It may be unwise to assume that translating a string
|
|||
|
>from lower-case to upper-case can be done simply by replacing each
|
|||
|
>lower-case letter in the string with a character that's the upper-case
|
|||
|
>version of that letter.
|
|||
|
|
|||
|
|
|||
|
Michael
|
|||
|
--
|
|||
|
Michael Lemke
|
|||
|
Astronomy, UT Austin, Texas
|
|||
|
(michael@io.as.utexas.edu or UTSPAN::UTADNX::IO::MICHAEL [SPAN])
|
|||
|
|
|||
|
|
|||
|
From: rnelson@wsuaix.csc.wsu.edu (roger nelson;S23487)
|
|||
|
Subject: Re: 8 bit clean implies what?
|
|||
|
Sender: news@serval.net.wsu.edu (USENET News System)
|
|||
|
Organization: Washington State University
|
|||
|
Date: Tue, 9 Feb 93 07:59:20 GMT
|
|||
|
Lines: 34
|
|||
|
|
|||
|
>>This is an excellent point that I have not thought of. The natural solution
|
|||
|
>>is through the use of a lookup table. But, in general, this requires TWO
|
|||
|
>>tables: uppercase and lowercase. However, a single CHANGE_CASE table is
|
|||
|
>>sufficient if it is guaranteed that lower_case(x) >= upper_case(x). Does
|
|||
|
>>anyone know if this assumption is valid?
|
|||
|
|
|||
|
One will notice that (with the exception of codes 32-63) the lower order
|
|||
|
nyble of the ASCII code for the uppercase character is the same as the
|
|||
|
respective lowercase character (and also the respective control character).
|
|||
|
The most significant bit of the upper order nyble is an encoding of the
|
|||
|
shift keys used:
|
|||
|
|
|||
|
codes 0000 - 0001 Denote a ctrl character Ie Ctrl-A = 0000 0001
|
|||
|
codes 0010 - 0011 don't follow the general encoding scheme
|
|||
|
^
|
|||
|
codes 0100 - 0101 Denote a shifted char. Ie A = 0100 0001
|
|||
|
^
|
|||
|
codes 1000 - 1111 Denote an unshifted char Ie a = 0110 0001
|
|||
|
^
|
|||
|
(Note that there are a few exceptions to the shift key encoding:
|
|||
|
64,94,95,96,126 and 127.)
|
|||
|
|
|||
|
>Yes, this is true for at least ISO-Latin-1 and DEC Multinational
|
|||
|
>Character Set (almost identical). The high order part of these char
|
|||
|
>sets are pretty much an image of the low order part except the 8th bit
|
|||
|
>is 1.
|
|||
|
|
|||
|
Is the shift key encoding of the characters in the 8-bit character set
|
|||
|
preserved?
|
|||
|
|
|||
|
Roger
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
From: ketil@edb.tih.no (Ketil Albertsen,TIH)
|
|||
|
Subject: Re: 8 bit clean implies what?
|
|||
|
Sender: ketil@edb.tih.no (Ketil Albertsen,TIH)
|
|||
|
Organization: T I H / T I S I P
|
|||
|
Date: Tue, 9 Feb 1993 15:37:50 GMT
|
|||
|
Lines: 17
|
|||
|
|
|||
|
In article <1993Feb9.075920.20683@serval.net.wsu.edu>, rnelson@wsuaix.csc.wsu.edu
|
|||
|
(roger nelson;S23487) writes:
|
|||
|
|
|||
|
>One will notice that (with the exception of codes 32-63) the lower order
|
|||
|
>nyble of the ASCII code for the uppercase character is the same as the
|
|||
|
>respective lowercase character (and also the respective control character).
|
|||
|
|
|||
|
Basing a case conversion on this is not a good solution. Eg. 8859/2 (suiting
|
|||
|
a number of East European languages) follows this pattern with a distance
|
|||
|
of 16 for the codes A9 to AF, but a distance of 32 for C0 to DE. True, the
|
|||
|
low nibble is the same, but it doesn't help you that much.
|
|||
|
IS 6937 also has a distance of 16 for most upper-half codes, but with
|
|||
|
exceptions. And there will always be a number of special cases, such as
|
|||
|
the German double-s. So, a translation table gets you a lot further. If
|
|||
|
you extend the table with some trapping mechanism for special cases, you
|
|||
|
could get it good enough for "any" use.
|
|||
|
|
|||
|
|
|||
|
|
|||
|
From: ant@mks.com (Anthony Howe)
|
|||
|
Subject: Re: 8 bit clean implies what?
|
|||
|
Organization: Mortice Kern Systems Inc., Waterloo, Ontario, CANADA
|
|||
|
Date: Tue, 9 Feb 1993 14:21:34 GMT
|
|||
|
Lines: 52
|
|||
|
|
|||
|
To my knowledge, 8-bit clean means that you must make no assumptions about
|
|||
|
any characters in the character set other than what the ctype macro/functions
|
|||
|
tell you. (See ANSI C section 4.3 Character Handling.)
|
|||
|
|
|||
|
"The header <ctype.h> declares several functions useful for testing
|
|||
|
and mapping characters. In all cases the argument is an int, the
|
|||
|
value of which shall be representable as an assigned char or shall
|
|||
|
equal the value of the macro EOF. If the argument has any other
|
|||
|
value, the behaviour is undefined."
|
|||
|
|
|||
|
int isalnum(int c);
|
|||
|
int iscntrl(int c);
|
|||
|
int isupper(int c);
|
|||
|
...
|
|||
|
|
|||
|
P.J. Plauger has a column in "The C User Journal". In one issue (which I
|
|||
|
can't remember) he discuss <ctype.h> and issues concerning 8-bit clean.
|
|||
|
I recommend doing a search back over that last two years for it.
|
|||
|
|
|||
|
>From my understanding of the quote above, the ctype table must be at least
|
|||
|
256 bytes. You must be careful of sign-extension with char pointers, like
|
|||
|
|
|||
|
{
|
|||
|
char *p = "Hi\375\376\377 there";
|
|||
|
...
|
|||
|
if (isalpha(p[2])) {
|
|||
|
...
|
|||
|
} else if (iscntrl(p[4])) {
|
|||
|
...
|
|||
|
}
|
|||
|
}
|
|||
|
|
|||
|
If your compiler defaults to chars being signed, the results of the ctype
|
|||
|
ctype table look up will be undefined, since p[2] will be sign-extended to
|
|||
|
-3 and p[4] will be sign-extended to -1 and so fall off the bottom of the
|
|||
|
table. Also EOF, typically -1, does NOT equal 255. Remember that the
|
|||
|
argument is an int, so EOF is really going to be (int) -1 while 255 will
|
|||
|
be (unsigned char) 255, which are not the same.
|
|||
|
|
|||
|
You should come up with a mapping function something like unctrl() that will
|
|||
|
represent control characters (non-printables) in a sensible manner. Allow
|
|||
|
the mapping to be altered/configured from system to system.
|
|||
|
|
|||
|
You also have to be careful about 9-bit char. There are still systems
|
|||
|
out there that have 9-bit bytes, which would mean a ctype table of 512
|
|||
|
bytes. Plauger's article covers all these issues very well.
|
|||
|
|
|||
|
-ant
|
|||
|
--
|
|||
|
ant@mks.com Anthony C Howe
|
|||
|
Mortice Kern Systems Inc. 35 King St. N., Waterloo, Ontario, Canada, N2J 6W9
|
|||
|
"Nice legs. For a human that is." - Worf (Q-pid)
|
|||
|
|
|||
|
|
|||
|
From: jschief@finbol.toppoint.de (Joerg Schlaeger)
|
|||
|
Subject: Re: 8 bit clean implies what?
|
|||
|
Date: Tue, 09 Feb 93 17:24:01 GMT
|
|||
|
Lines: 35
|
|||
|
|
|||
|
upham@cs.ubc.ca writes in article <1l4go1INNq8v@cascade.cs.ubc.ca>:
|
|||
|
> ..................
|
|||
|
> switch between character sets and save space (the specifics now escape
|
|||
|
> me, unfortunately). In general, if you want to be safe, do everything
|
|||
|
> on four byte characters internally, and then add conversion interfaces
|
|||
|
> to work with whatever character set is needed.
|
|||
|
>
|
|||
|
> Derek
|
|||
|
>
|
|||
|
> --
|
|||
|
> Derek Lynn Upham University of British Columbia
|
|||
|
> upham@cs.ubc.ca Computer Science Department
|
|||
|
> =============================================================================
|
|||
|
> "Ha! Your Leaping Tiger Kung Fu is no match for my Frightened Piglet Style!"
|
|||
|
>
|
|||
|
|
|||
|
Hi,
|
|||
|
and please don't forget the big- & small endian problem for more than one
|
|||
|
byte per character, because you can't besure that your stdin is a keyboard
|
|||
|
and that the byteorder is the allways the same.
|
|||
|
I've the problem with named pipe's filled with messages from Intel & Motorala
|
|||
|
Workstations and 16-Bit charsets.
|
|||
|
Is there anyone who knows the solution for every character set,1 & 2 & 4 Byte.
|
|||
|
Perhabs a sign like "\n" thats all the same, to detect the need for byteswapping.
|
|||
|
|
|||
|
Joerg
|
|||
|
|
|||
|
--
|
|||
|
+++++++++++++++++++++++++++++++++++
|
|||
|
Joerg Schlaeger
|
|||
|
Home: +49 431 682210 (voice & fax & ...)
|
|||
|
jschief@finbol.toppoint.de
|
|||
|
-----------------------------------
|
|||
|
(to be faster with the /2)
|
|||
|
+++++++++++++++++++++++++++++++++++
|
|||
|
|
|||
|
|
|||
|
From: jimc@tau-ceti.isc-br.com (Jim Cathey)
|
|||
|
Newsgroups: comp.editors
|
|||
|
Subject: Re: 8 bit clean implies what?
|
|||
|
Date: 10 Feb 93 23:45:35 GMT
|
|||
|
Organization: Olivetti North America, Spokane, WA
|
|||
|
Lines: 20
|
|||
|
|
|||
|
In article <729278641snx@finbol.toppoint.de> jschief@finbol.toppoint.de (Joerg Schlaeger) writes:
|
|||
|
>and please don't forget the big- & small endian problem for more than one
|
|||
|
>byte per character, because you can't besure that your stdin is a keyboard
|
|||
|
>and that the byteorder is the allways the same.
|
|||
|
|
|||
|
Unicode has a magic cookie that's a NOP whose byte-reversed form is also
|
|||
|
a (different) NOP. It may be embedded in any string (presumably at the front)
|
|||
|
as a byte-sex tag if you need such things.
|
|||
|
|
|||
|
--
|
|||
|
+----------------+
|
|||
|
! II CCCCCC ! Jim Cathey
|
|||
|
! II SSSSCC ! ISC-Bunker Ramo
|
|||
|
! II CC ! TAF-C8; Spokane, WA 99220
|
|||
|
! IISSSS CC ! UUCP: uunet!isc-br!jimc (jimc@isc-br.isc-br.com)
|
|||
|
! II CCCCCC ! (509) 927-5757
|
|||
|
+----------------+
|
|||
|
One Design to rule them all; one Design to find them.
|
|||
|
One Design to bring them all and in the darkness bind
|
|||
|
them. In the land of Mediocrity where the PC's lie.
|
|||
|
|
|||
|
|