SRFI 175

Title

ASCII character library

Author

Lassi Kortela

Status

This SRFI is currently in draft status. Here is an explanation of each status that a SRFI can hold. To provide input on this SRFI, please send email to srfi-175@nospamsrfi.schemers.org. To subscribe to the list, follow these instructions. You can access previous messages via the mailing list archive.

Abstract

This SRFI defines ASCII-only equivalents to many of the character procedures in standard Scheme plus a few extra ones. Recent Scheme standards are based around Unicode but the significant syntactic elements in many file formats and network protocols are all ASCII. Such low-level code can run faster and its behavior can be easier to understand when it uses ASCII primitives.

Rationale

Procedures dealing with character objects have been included in standard Scheme since R2RS (1985) with identical arguments and return values. The early Scheme reports did not mandate any particular character set, though in practice most (perhaps all) implementations used extended ASCII. R6RS (2007) was the first standard to strongly favor Unicode.

Unicode is a fine choice for high-level work, but is overkill for most low-level work dealing with file formats and network protocols. ASCII-only procedures are much simpler to implement and their behavior is much easier to understand than their Unicode equivalents. They have shorter code paths with fewer and simpler failure modes, and need no lookup tables.

Characters as integers

Scheme has a standard character data type which is very useful for disambiguating between characters and integers. However, code dealing with low-level binary formats typically uses byte ports and bytevectors whose elements are small, exact nonnegative integers. It is convenient to treat those integers as if they were characters (which they often represent, as most binary formats also contain strings of text). For this reason, the procedures in this SRFI taking character objects also accept integers in their place.
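
As a minimal sketch of this convention, a procedure can reduce either kind of argument to an exact integer codepoint before doing any work. The names codepoint-of and my-ascii-digit? are hypothetical helpers used only for illustration; they are not part of this SRFI.

```scheme
(import (scheme base))

;; Hypothetical helper: accept a character object or an integer and
;; return an exact integer codepoint.
(define (codepoint-of x)
  (if (char? x) (char->integer x) x))

;; The same range test then serves characters and bytes alike.
(define (my-ascii-digit? x)
  (<= #x30 (codepoint-of x) #x39))

(my-ascii-digit? #\7)                                           ; => #t
(my-ascii-digit? (bytevector-u8-ref (bytevector #x37 #x41) 0))  ; => #t
(my-ascii-digit? #x41)                                          ; => #f
```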

This SRFI has been designed with the assumption that codepoints 0..127 correspond to ASCII in the Scheme implementation's native character datatype. We could not come up with any implementations where this is not the case. The only non-ASCII-superset character set we could think of is EBCDIC, which is fringe enough that it does not seem worth worrying about.

Procedure equivalence

The following table lists all procedures defined in this SRFI that have direct equivalents in the Scheme RnRS standards.

This SRFI           RnRS              Since
ascii-char?         char?             R2RS
ascii-string?       string?           R2RS
ascii-ci=?          char-ci=?         R2RS
ascii-ci<?          char-ci<?         R2RS
ascii-ci>?          char-ci>?         R2RS
ascii-ci<=?         char-ci<=?        R2RS
ascii-ci>=?         char-ci>=?        R2RS
ascii-string-ci=?   string-ci=?       R2RS
ascii-string-ci<?   string-ci<?       R2RS
ascii-string-ci>?   string-ci>?       R2RS
ascii-string-ci<=?  string-ci<=?      R2RS
ascii-string-ci>=?  string-ci>=?      R2RS
ascii-alphabetic?   char-alphabetic?  R2RS
ascii-numeric?      char-numeric?     R2RS
ascii-whitespace?   char-whitespace?  R2RS
ascii-upper-case?   char-upper-case?  R2RS
ascii-lower-case?   char-lower-case?  R2RS
ascii-upcase        char-upcase       R2RS
ascii-downcase      char-downcase     R2RS
ascii-digit-value   digit-value       R7RS*

*Note that the ascii-digit-value procedure takes a limit argument that the standard digit-value procedure does not take.

The standard Scheme character procedures listed above require their arguments to be character objects. The equivalents in this SRFI accept integers in addition to character objects. However, ascii-char?, like the standard char?, only tests for a character object.

Capsule history of ASCII

The ASCII (American Standard Code for Information Interchange) character set is standardized by ANSI (American National Standards Institute). The present ASCII standard was first published in 1967. The organization was not yet called ANSI back then; its name was the United States of America Standards Institute (USASI).

Most computers now deal with 8-bit bytes, and ASCII is often thought of as an 8-bit character set. However, it is actually only 7-bit. The 8th bit was left unused because 8-bit hardware was not yet ubiquitous in the sixties. Through the decades many applications have used the 8th bit as a parity or flag bit.

Once international character sets were created, most of them took the 7-bit ASCII code as a basis. 8-bit character sets for alphabets generally took ASCII as the first half, using the other half for national letters as well as typographic elements and more control characters. Multi-byte character sets for complex writing systems are also generally based on ASCII but encoding them into 8-bit bytes is more complex. UTF-8, the dominant encoding of Unicode, is a multi-byte character encoding where 8-bit bytes using only the low 7 bits represent ASCII characters.

More complete histories of ASCII are available on Wikipedia and in numerous other places. Of particular interest is that these histories explain why the allocation of character codes is almost perfectly logical but not quite.

ASCII character table

#x00 NUL  #x10 DLE  #x20    #x30 0  #x40 @  #x50 P  #x60 `  #x70 p
#x01 SOH  #x11 DC1  #x21 !  #x31 1  #x41 A  #x51 Q  #x61 a  #x71 q
#x02 STX  #x12 DC2  #x22 "  #x32 2  #x42 B  #x52 R  #x62 b  #x72 r
#x03 ETX  #x13 DC3  #x23 #  #x33 3  #x43 C  #x53 S  #x63 c  #x73 s
#x04 EOT  #x14 DC4  #x24 $  #x34 4  #x44 D  #x54 T  #x64 d  #x74 t
#x05 ENQ  #x15 NAK  #x25 %  #x35 5  #x45 E  #x55 U  #x65 e  #x75 u
#x06 ACK  #x16 SYN  #x26 &  #x36 6  #x46 F  #x56 V  #x66 f  #x76 v
#x07 BEL  #x17 ETB  #x27 '  #x37 7  #x47 G  #x57 W  #x67 g  #x77 w
#x08 BS   #x18 CAN  #x28 (  #x38 8  #x48 H  #x58 X  #x68 h  #x78 x
#x09 HT   #x19 EM   #x29 )  #x39 9  #x49 I  #x59 Y  #x69 i  #x79 y
#x0a LF   #x1a SUB  #x2a *  #x3a :  #x4a J  #x5a Z  #x6a j  #x7a z
#x0b VT   #x1b ESC  #x2b +  #x3b ;  #x4b K  #x5b [  #x6b k  #x7b {
#x0c FF   #x1c FS   #x2c ,  #x3c <  #x4c L  #x5c \  #x6c l  #x7c |
#x0d CR   #x1d GS   #x2d -  #x3d =  #x4d M  #x5d ]  #x6d m  #x7d }
#x0e SO   #x1e RS   #x2e .  #x3e >  #x4e N  #x5e ^  #x6e n  #x7e ~
#x0f SI   #x1f US   #x2f /  #x3f ?  #x4f O  #x5f _  #x6f o  #x7f DEL

ASCII character classes

#x00..#x1f  control         #x20        space
#x21..#x2f  punctuation     #x30..#x39  digit
#x3a..#x40  punctuation     #x41..#x5a  upper-case
#x5b..#x60  punctuation     #x61..#x7a  lower-case
#x7b..#x7e  punctuation     #x7f        control

Letter and number transformations

Many letter and number tasks are naturally expressed by treating decimal digits and the Latin alphabet as integer ranges. Recall that characters themselves are just integer codes under the hood.

Hence by adding a (positive or negative) integer offset we can:

Converting letters from upper-case to lower-case or vice versa is a simple matter of checking whether a letter is in the opposite case, and if so, offsetting it onto the case we want.

Converting digits to numbers is a matter of checking that a character is in the ASCII digit range and then offsetting it to map it onto the integers 0..9. Vice versa for numbers to ASCII digits.

We can use only a part of the letter or digit range by specifying a limit. For example, to use the letters abcdef or ABCDEF for hex digits, we’d use a limit of 6 on the upper-case or lower-case range.

For tasks that mix letters and digits, or upper-case and lower-case letters, we have to chain multiple transforms together. Each transform checks the source character to find out whether it matches. If it does, the transformation is performed. Otherwise the job is deferred to the next transformation. In the case of hex conversion, we’d first check whether a character matches the ASCII digit range, and if not, defer to a 6-limited letter range.

To map letters to other letters, it is advantageous to treat the alphabet as a circular range that repeats infinitely in both directions. We can easily perform letter rotations by adding an arbitrary offset and taking the result modulo 26 (the count of letters in the alphabet).

This SRFI wraps the above transformations into reusable combinators. They are specified in the Transformation procedures section. Since there are countless minor variations on real-world transformation tasks such as number parsing, this SRFI doesn’t provide any ready-made parsing procedures. Instead, the combinators have been designed with the goal of making it easy to roll your own. The Examples section will get you started.

To recap the above, each transform:

The combinators ascii-upper-case-value and ascii-lower-case-value each do all of the above jobs. The ascii-digit-value combinator does all of them except offsetting, since that is less useful for digits than letters.

The combinators ascii-nth-upper-case and ascii-nth-lower-case do the opposite conversion from numeric values to characters, also handling alphabet rotations. The ascii-nth-digit combinator does not do rotations, since once again those are less useful on digits.

Specification

Numerical limits

Let the char-fix range be an implementation-defined range of exact integer values such that:

For every procedure in this SRFI:

Hence in a Scheme implementation where all character codepoints fit in a fixnum, the char-fix range can be identical to the fixnum range and this SRFI can be implemented using fast fixnum math. In particular, R6RS supplies standard fixnum procedures with the fx prefix. In a Scheme implementation where some codepoints are bigger than a fixnum, generic math has to be used.

Predicates to test for ASCII vs non-ASCII objects

(ascii-codepoint? obj)

Returns #t if obj is an exact integer in the inclusive range #x00..#x7f. Else returns #f.

(ascii-bytevector? obj)

Returns #t if obj is a bytevector and contains no byte value outside the inclusive range #x00..#x7f. Else returns #f.

A zero-length bytevector is considered an ASCII bytevector.

(ascii-char? obj)

Returns #t if obj is a character object whose codepoint lies in the inclusive range #x00..#x7f. Else returns #f.

(ascii-string? obj)

Returns #t if obj is a string and contains no character with a codepoint outside the inclusive range #x00..#x7f. Else returns #f.

A zero-length string is considered an ASCII string.
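
The four predicates above could be sketched in portable R7RS as follows. This is an illustration only, not the sample implementation; the my- prefixed names are hypothetical.

```scheme
(import (scheme base))

(define (my-ascii-codepoint? obj)
  (and (exact-integer? obj) (<= 0 obj #x7f)))

(define (my-ascii-char? obj)
  (and (char? obj) (< (char->integer obj) #x80)))

;; Walk the bytevector and stop at the first byte above #x7f.
;; An empty bytevector passes because the loop ends immediately.
(define (my-ascii-bytevector? obj)
  (and (bytevector? obj)
       (let loop ((i 0))
         (or (= i (bytevector-length obj))
             (and (< (bytevector-u8-ref obj i) #x80)
                  (loop (+ i 1)))))))

;; Same idea for strings, testing each character's codepoint.
(define (my-ascii-string? obj)
  (and (string? obj)
       (let loop ((i 0))
         (or (= i (string-length obj))
             (and (my-ascii-char? (string-ref obj i))
                  (loop (+ i 1)))))))
```

Note that an inexact number such as 65.0 is rejected by my-ascii-codepoint?, since the specification asks for an exact integer.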

Predicates to test for subsets of ASCII

(ascii-control? char)

Returns #t if char represents an ASCII character in the control class. Else returns #f.

Note that carriage return, line feed and tab are control characters but space is not.

(ascii-display? char)

Returns #t if char represents an ASCII character that is not in the control class. Else returns #f.

The point is that display characters are safe to write to a device that may not be able to sensibly interpret control characters or non-ASCII characters.

Note that we consider space to be a display character but not tab, carriage return or line feed. This convention is popular but not universal.

(ascii-space-or-tab? char)

Returns #t if char represents an ASCII character with the integer value #x09 (tab) or #x20 (space). Else returns #f.

The point is that space and tab are very often useful to distinguish from other whitespace characters, notably newlines.

(ascii-punctuation? char)

Returns #t if char represents an ASCII character in the punctuation class. Else returns #f.

(ascii-alphanumeric? char)

Returns #t if char represents an ASCII character in the upper-case or lower-case or digit class. Else returns #f.
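
Each predicate in this section reduces to a handful of range tests taken from the character class table. The following is a sketch under that reading, with hypothetical my- prefixed names and a hypothetical codepoint-of helper; it is not the sample implementation.

```scheme
(import (scheme base))

;; Hypothetical helper: character object or integer -> integer.
(define (codepoint-of x) (if (char? x) (char->integer x) x))

(define (my-ascii-control? x)
  (let ((cc (codepoint-of x)))
    (or (<= 0 cc #x1f) (= cc #x7f))))

;; Everything in ASCII that is not a control character.
(define (my-ascii-display? x)
  (<= #x20 (codepoint-of x) #x7e))

(define (my-ascii-space-or-tab? x)
  (let ((cc (codepoint-of x)))
    (or (= cc #x09) (= cc #x20))))

;; The four punctuation ranges from the class table.
(define (my-ascii-punctuation? x)
  (let ((cc (codepoint-of x)))
    (or (<= #x21 cc #x2f) (<= #x3a cc #x40)
        (<= #x5b cc #x60) (<= #x7b cc #x7e))))

(define (my-ascii-alphanumeric? x)
  (let ((cc (codepoint-of x)))
    (or (<= #x30 cc #x39) (<= #x41 cc #x5a) (<= #x61 cc #x7a))))
```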

Subset predicates with standard Scheme equivalents

(ascii-alphabetic? char)

Returns #t if char represents an ASCII character in the upper-case or lower-case class. Else returns #f.

(ascii-numeric? char)

Returns #t if char represents an ASCII character in the digit class. Else returns #f.

(ascii-whitespace? char)

Returns #t if char represents an ASCII character with the integer value #x09 (tab) or #x0a (line feed) or #x0b (vertical tab) or #x0c (form feed) or #x0d (carriage return) or #x20 (space). Else returns #f.

Notice how the other whitespace characters form a contiguous range of control characters, but space stands alone as a separate non-control character.
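
That observation means the whitespace test needs only one range check plus one equality check. A minimal sketch, using the hypothetical name my-ascii-whitespace?:

```scheme
(import (scheme base))

(define (my-ascii-whitespace? x)
  (let ((cc (if (char? x) (char->integer x) x)))
    (or (<= #x09 cc #x0d)   ; tab, line feed, vertical tab, form feed, carriage return
        (= cc #x20))))      ; space stands alone

(my-ascii-whitespace? #x0b)     ; => #t
(my-ascii-whitespace? #\space)  ; => #t
(my-ascii-whitespace? #x08)     ; => #f (backspace)
```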

(ascii-upper-case? char)

Returns #t if char represents an ASCII character in the upper-case class. Else returns #f.

(ascii-lower-case? char)

Returns #t if char represents an ASCII character in the lower-case class. Else returns #f.

Case-insensitive character comparison procedures

(ascii-ci=? char1 char2)

(ascii-ci<? char1 char2)

(ascii-ci>? char1 char2)

(ascii-ci<=? char1 char2)

(ascii-ci>=? char1 char2)

These procedures test whether the codepoint of char1 is equal to, less than, greater than, less than or equal to, or greater than or equal to the codepoint of char2.

The comparison is case-insensitive. Specifically, ASCII upper-case letters are converted to their lower-case equivalents before the codepoints are compared. Mapping upper-case to lower-case matches the standard Unicode case-folding algorithm. The direction of folding is important when comparing a letter and a non-letter to find out which is less than the other. These procedures do not apply any case-folding to non-ASCII characters.

Note that char1 and char2 do not need to be of the same type. It is permitted for one of them to be a character object and the other to be an integer.

For case-sensitive comparison, the standard character comparison procedures char=? etc. as well as the standard number and fixnum comparison procedures =, fx= etc. work fine for ASCII; hence this SRFI does not provide case-sensitive equivalents.
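
To see why the direction of folding matters, consider the underscore (#x5f), which sits between the upper-case letters and the lower-case letters. The sketch below folds upper-case to lower-case as described above; fold-codepoint and the my- prefixed names are hypothetical and only two of the five comparisons are shown.

```scheme
(import (scheme base))

;; Fold A..Z onto a..z; leave every other codepoint alone.
;; Accepts a character object or an integer.
(define (fold-codepoint x)
  (let ((cc (if (char? x) (char->integer x) x)))
    (if (<= #x41 cc #x5a) (+ cc #x20) cc)))

(define (my-ascii-ci=? a b) (= (fold-codepoint a) (fold-codepoint b)))
(define (my-ascii-ci<? a b) (< (fold-codepoint a) (fold-codepoint b)))

(my-ascii-ci=? #\A #x61)  ; => #t, mixed argument types are permitted
(my-ascii-ci<? #\_ #\A)   ; => #t, because A folds to a (#x61) and _ is #x5f
(char<? #\_ #\A)          ; => #f, since unfolded A is #x41
```

Had the folding gone from lower-case to upper-case instead, the underscore would compare greater than every letter rather than less.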

Case-insensitive string comparison procedures

(ascii-string-ci=? string1 string2)

(ascii-string-ci<? string1 string2)

(ascii-string-ci>? string1 string2)

(ascii-string-ci<=? string1 string2)

(ascii-string-ci>=? string1 string2)

These procedures test whether string1 is equal to, less than, greater than, less than or equal to, or greater than or equal to string2.

Corresponding characters of string1 and string2 (those at the same index) are compared pairwise as with ascii-ci=?, ascii-ci<?, etc. Comparison stops when either string ends, or when an unequal pair of characters is found. If the two strings are of different lengths, and their characters are equal all the way up to the length of the shorter string, then the shorter string is considered less than the longer one. A zero-length string is considered less than a non-zero-length string. Two zero-length strings are considered equal.

For case-sensitive comparison, the standard string=? etc. work fine for ASCII; hence this SRFI does not provide case-sensitive equivalents.
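
One way to sketch the comparison rule is a single three-way helper from which all five predicates can be derived. The helper my-string-ci-compare and the my- prefixed predicates are hypothetical names; only two predicates are shown.

```scheme
(import (scheme base))

(define (fold-char c)
  (let ((cc (char->integer c)))
    (if (<= #x41 cc #x5a) (+ cc #x20) cc)))

;; Returns -1, 0 or 1 when s1 is less than, equal to, or greater
;; than s2 under ASCII case folding.
(define (my-string-ci-compare s1 s2)
  (let ((n1 (string-length s1))
        (n2 (string-length s2)))
    (let loop ((i 0))
      (cond ((and (= i n1) (= i n2)) 0)
            ((= i n1) -1)              ; s1 ended first: shorter is less
            ((= i n2) 1)               ; s2 ended first
            (else
             (let ((a (fold-char (string-ref s1 i)))
                   (b (fold-char (string-ref s2 i))))
               (cond ((< a b) -1)
                     ((> a b) 1)
                     (else (loop (+ i 1))))))))))

(define (my-ascii-string-ci=? s1 s2) (= (my-string-ci-compare s1 s2) 0))
(define (my-ascii-string-ci<? s1 s2) (< (my-string-ci-compare s1 s2) 0))

(my-ascii-string-ci=? "Hello" "hELLO")  ; => #t
(my-ascii-string-ci<? "abc" "ABCD")     ; => #t, equal prefix and shorter
(my-ascii-string-ci<? "" "a")           ; => #t
```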

Case conversion procedures

(ascii-upcase char)

If char represents an ASCII character in the lower-case class, returns the same letter from the upper-case class. Else returns char unchanged.

char can be a character object or an integer; the same type of object is returned.

(ascii-downcase char)

If char represents an ASCII character in the upper-case class, returns the same letter from the lower-case class. Else returns char unchanged.

char can be a character object or an integer; the same type of object is returned.
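
The requirement that the result has the same type as the argument can be met by converting back only when the input was a character object. A sketch with hypothetical my- prefixed names:

```scheme
(import (scheme base))

(define (my-ascii-upcase x)
  (let ((cc (if (char? x) (char->integer x) x)))
    (if (<= #x61 cc #x7a)
        (let ((up (- cc #x20)))          ; a..z sits #x20 above A..Z
          (if (char? x) (integer->char up) up))
        x)))                             ; anything else is returned unchanged

(define (my-ascii-downcase x)
  (let ((cc (if (char? x) (char->integer x) x)))
    (if (<= #x41 cc #x5a)
        (let ((down (+ cc #x20)))
          (if (char? x) (integer->char down) down))
        x)))

(my-ascii-upcase #\a)    ; => #\A
(my-ascii-upcase #x61)   ; => 65, that is #x41
(my-ascii-downcase #\1)  ; => #\1
```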

Bracket matching procedures

ASCII includes four pairs of open and close brackets:

Open  Close  Known as
(     )      Parentheses
[     ]      Square brackets
{     }      Curly braces
<     >      Angle brackets

(ascii-open-bracket char)

If char represents one of the four ASCII open brackets, returns char. Else returns #f.

char can be a character object or an integer; the same type of object is returned.

(ascii-close-bracket char)

If char represents one of the four ASCII close brackets, returns char. Else returns #f.

char can be a character object or an integer; the same type of object is returned.

(ascii-mirror-bracket char)

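If char represents one of the eight ASCII brackets, returns the opposite bracket of the pair. Else returns #f.

char can be a character object or an integer; the same type of object is returned.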
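
The three bracket procedures could be sketched as lookups on integer codepoints. Here ascii-mirror-bracket is taken to return the other member of the bracket pair, and #f for a non-bracket. The helper map-codepoint and the my- prefixed names are hypothetical.

```scheme
(import (scheme base))

;; Hypothetical helper: apply an integer -> (integer or #f) mapping to
;; a character object or an integer, preserving the argument's type.
(define (map-codepoint proc x)
  (if (char? x)
      (let ((cc (proc (char->integer x))))
        (and cc (integer->char cc)))
      (proc x)))

(define (my-ascii-open-bracket x)
  (map-codepoint
   (lambda (cc) (case cc ((#x28 #x3c #x5b #x7b) cc) (else #f)))  ; ( < [ {
   x))

(define (my-ascii-close-bracket x)
  (map-codepoint
   (lambda (cc) (case cc ((#x29 #x3e #x5d #x7d) cc) (else #f)))  ; ) > ] }
   x))

(define (my-ascii-mirror-bracket x)
  (map-codepoint
   (lambda (cc)
     (case cc
       ((#x28) #x29) ((#x29) #x28)   ; ( )
       ((#x3c) #x3e) ((#x3e) #x3c)   ; < >
       ((#x5b) #x5d) ((#x5d) #x5b)   ; [ ]
       ((#x7b) #x7d) ((#x7d) #x7b)   ; { }
       (else #f)))
   x))
```

Because non-brackets yield #f, the result of any of these can be used directly as a condition, for example (and (my-ascii-open-bracket c) (my-ascii-mirror-bracket c)) to find the expected closer for c.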

Control character display procedures

Every ASCII control character has a corresponding display character. The control characters #x00..#x1f are displayed as @ A B C ... X Y Z [ \ ] ^ _. The control character #x7f is displayed as ?. For example, when you press Control-A in a Unix terminal, the program running in the terminal receives the ASCII character #x01. Control-A is sometimes written ^A. Likewise, Control-@ can be written as ^@ and Control-^ as ^^, etc.

(ascii-control->display char)

If char represents an ASCII character in the control class, returns the corresponding display character as above. Else returns #f.

char can be a character object or an integer; the same type of object is returned.

(ascii-display->control char)

If char represents one of the ASCII display characters given above, returns the corresponding control character. Else returns #f.

char can be a character object or an integer; the same type of object is returned.
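
The mapping between control and display characters is a fixed offset of #x40 for #x00..#x1f, with #x7f and ? as the one special pair. A sketch with hypothetical names (map-codepoint and the my- prefixed procedures):

```scheme
(import (scheme base))

;; Hypothetical helper: apply an integer -> (integer or #f) mapping to
;; a character object or an integer, preserving the argument's type.
(define (map-codepoint proc x)
  (if (char? x)
      (let ((cc (proc (char->integer x))))
        (and cc (integer->char cc)))
      (proc x)))

(define (my-ascii-control->display x)
  (map-codepoint
   (lambda (cc)
     (cond ((<= 0 cc #x1f) (+ cc #x40))   ; NUL..US -> @ A B ... _
           ((= cc #x7f) #x3f)             ; DEL -> ?
           (else #f)))
   x))

(define (my-ascii-display->control x)
  (map-codepoint
   (lambda (cc)
     (cond ((<= #x40 cc #x5f) (- cc #x40))
           ((= cc #x3f) #x7f)
           (else #f)))
   x))

(my-ascii-control->display #x01)  ; => 65, that is #x41 or A
(my-ascii-display->control #\?)   ; => the character with codepoint #x7f
(my-ascii-control->display #\a)   ; => #f
```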

Transformation procedures

These procedures serve as versatile building blocks for various letter and number transformations.

(ascii-nth-digit n)

Returns a character object representing the n'th decimal digit in ASCII. n counts from zero so that 0 returns 0 and 9 returns 9.

If n is not an exact integer in the range 0..9, #f is returned.

(ascii-nth-upper-case n)

Returns a character object representing the n'th letter in the upper-case Latin alphabet in ASCII. n counts from zero so that 0 returns A and 25 returns Z.

n is taken modulo 26 so values less than 0 or greater than 25 are permitted. Use R5RS modulo (not remainder) when implementing the procedures in this SRFI.

(ascii-nth-lower-case n)

Returns a character object representing the n'th letter in the lower-case Latin alphabet in ASCII. n counts from zero so that 0 returns a and 25 returns z.

n is taken modulo 26 so values less than 0 or greater than 25 are permitted. Use R5RS modulo (not remainder) when implementing the procedures in this SRFI.
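
The difference between modulo and remainder shows up for negative n. The sketch below uses hypothetical my- prefixed names:

```scheme
(import (scheme base))

(define (my-ascii-nth-upper-case n)
  (integer->char (+ #x41 (modulo n 26))))

(define (my-ascii-nth-lower-case n)
  (integer->char (+ #x61 (modulo n 26))))

(my-ascii-nth-upper-case 0)    ; => #\A
(my-ascii-nth-upper-case 26)   ; => #\A, wrapped around once
(my-ascii-nth-lower-case -1)   ; => #\z, since (modulo -1 26) is 25
(remainder -1 26)              ; => -1, which would land below the alphabet
```

With modulo the result always has the sign of the divisor, so it stays within 0..25 for any exact integer n.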

(ascii-digit-value char limit)

If char represents an ASCII decimal digit, returns the numeric value 0..9 of that digit. Only digit values less than limit are accepted: for example, a limit of 8 accepts only octal digits. To accept the entire range, pass a limit of 10.

If char does not represent an acceptable digit, #f is returned.

(ascii-upper-case-value char offset limit)

If char represents an ASCII upper-case letter, its distance from A is taken as an integer 0..25. Only distances less than limit are accepted: for example, a limit of 6 accepts only the letters ABCDEF. To accept the entire range, pass a limit of 26.

An acceptable distance is returned with offset added to it; give an offset of 0 to add nothing.

If char does not represent an acceptable letter, #f is returned.

(ascii-lower-case-value char offset limit)

If char represents an ASCII lower-case letter, its distance from a is taken as an integer 0..25. Only distances less than limit are accepted: for example, a limit of 6 accepts only the letters abcdef. To accept the entire range, pass a limit of 26.

An acceptable distance is returned with offset added to it; give an offset of 0 to add nothing.

If char does not represent an acceptable letter, #f is returned.
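
All three value procedures follow the same pattern: measure the distance from the start of a range, accept it only when it is below the limit, then add an offset. A sketch built on one shared helper follows. The name range-value and the my- prefixed names are hypothetical, and clamping the limit to the size of the range is an assumption made here so that an oversized limit cannot admit characters past 9 or Z.

```scheme
(import (scheme base))

;; Hypothetical shared helper: distance of x from base, accepted only
;; when it falls in 0..(limit - 1); offset is then added.
(define (range-value x base offset limit)
  (let ((distance (- (if (char? x) (char->integer x) x) base)))
    (and (<= 0 distance)
         (< distance limit)
         (+ offset distance))))

(define (my-ascii-digit-value x limit)
  (range-value x #x30 0 (min limit 10)))

(define (my-ascii-upper-case-value x offset limit)
  (range-value x #x41 offset (min limit 26)))

(define (my-ascii-lower-case-value x offset limit)
  (range-value x #x61 offset (min limit 26)))

(my-ascii-digit-value #\7 10)          ; => 7
(my-ascii-digit-value #\8 8)           ; => #f, not an octal digit
(my-ascii-upper-case-value #\F 10 6)   ; => 15
(my-ascii-upper-case-value #\G 10 6)   ; => #f
```

Since 0 counts as true in Scheme, a result of 0 can still be chained with or, which is what makes the combinators composable.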

Examples

Case conversion

The case conversion procedures in this SRFI can be implemented in terms of the letter transformation procedures. For the sake of simplicity, the following examples do not take fixnum-to-character conversion into account.

(define (my-upcase char)
  (or (ascii-lower-case-value char #x41 26) char))

(define (my-downcase char)
  (or (ascii-upper-case-value char #x61 26) char))

Number parsing

Since there are lots of slightly different number syntaxes, this SRFI does not provide procedures to convert between numbers and strings. Instead, the transformation procedures let you easily roll your own. Here is one way to do it:

(define (parse-binary-digit  char) (ascii-digit-value char 2))
(define (parse-octal-digit   char) (ascii-digit-value char 8))
(define (parse-decimal-digit char) (ascii-digit-value char 10))

(define (parse-hex-digit char)
  (or (ascii-digit-value char 10)
      (ascii-lower-case-value char 10 6)
      (ascii-upper-case-value char 10 6)))

(define (quote-hex-digit n)
  (cond ((< n 10) (ascii-nth-digit n))
        ((< n 16) (ascii-nth-lower-case (- n 10)))
        (else #f)))

Caesar cipher

The Caesar cipher is a naive encryption method used successfully in ancient Rome. It involves rotating each letter by rot alphabet positions so that it becomes another letter. Letters rotated beyond Z wrap around and resume counting from A; likewise, negative rotations beyond A wrap around and resume from Z. ROT13 is a Caesar variant that is its own inverse: a rotation by 13 is identical to a rotation by -13. Non-alphabetic characters are left intact.

(define (caesar-char rot char)
  (or (let ((n (ascii-lower-case-value char rot 26)))
        (and n (ascii-nth-lower-case n)))
      (let ((n (ascii-upper-case-value char rot 26)))
        (and n (ascii-nth-upper-case n)))
      char))

Strings utility

The Unix strings utility reads a binary file, looking for contiguous sequences of displayable ASCII bytes and showing each sequence as it is found. The idea is to find human-readable text in the file. The following is the main loop of strings. It relies on a show helper procedure that displays (list->string (map integer->char (reverse stride))) if stride is at least 4 bytes long.

(let loop ((stride '()))
  (let ((byte (read-u8 port)))
    (cond ((eof-object? byte)
           (show stride))
          ((not (ascii-display? byte))
           (show stride)
           (loop '()))
          (else
           (loop (cons byte stride))))))
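
For experimenting without a file, the same loop can be run over an in-memory bytevector port. The following self-contained variation collects the strings into a list instead of displaying them. The names my-ascii-display?, stride->string and find-strings are hypothetical, and my-ascii-display? is a stand-in for ascii-display? restricted to byte values.

```scheme
(import (scheme base))

;; Stand-in for ascii-display? on bytes: #x20..#x7e.
(define (my-ascii-display? byte) (<= #x20 byte #x7e))

;; Plays the role of the show helper: yields a string when the stride
;; is at least 4 bytes long, else #f.
(define (stride->string stride)
  (and (>= (length stride) 4)
       (list->string (map integer->char (reverse stride)))))

(define (find-strings port)
  (define (keep stride found)
    (let ((s (stride->string stride)))
      (if s (cons s found) found)))
  (let loop ((stride '()) (found '()))
    (let ((byte (read-u8 port)))
      (cond ((eof-object? byte)
             (reverse (keep stride found)))
            ((not (my-ascii-display? byte))
             (loop '() (keep stride found)))
            (else
             (loop (cons byte stride) found))))))

(find-strings
 (open-input-bytevector
  (bytevector 0 1 #x48 #x65 #x6c #x6c #x6f 0 #x68 #x69 #xff
              #x53 #x52 #x46 #x49)))
; => ("Hello" "SRFI")
```

The two-byte run "hi" is dropped because it is shorter than 4 bytes.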

Implementation

A sample implementation is available at:

github.com/scheme-requests-for-implementation/srfi-175

It provides two equivalent libraries: one for R6RS and one for R7RS. Each library depends only on standard language features. The R6RS library uses number procedures specialized for fixnums. A test suite as well as ready-to-run examples are included.

The R6RS code is a fully automatic conversion of the R7RS code. The R7RS program doing the conversion is included.

The sample implementation has passed all its tests and successfully run all the examples in at least the following Scheme implementations:

Acknowledgements

John Cowan brought helpful knowledge of ASCII and Unicode support in Scheme implementations and encouraged me to add more procedures to make life easier for users. John and Duy Nguyen convinced me to omit character class constants from this SRFI, leaving them to SRFI 14 and its successors. John and Shiro Kawai gave valuable feedback on procedure names. Arthur Gleckler was an encouraging and responsive editor.

Since the sample implementation was written specifically for standard R6RS/R7RS, it unearthed several gotchas in the murkier corners of Scheme implementations' R6RS/R7RS support. A big thank you to the developers of Cyclone, Gambit, Gerbil and IronScheme for being extremely responsive and helpful and taking great care of their implementations.

Copyright

Copyright © Lassi Kortela (2019)

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice (including the next paragraph) shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.


Editor: Arthur A. Gleckler