SRFI 169: Underscores in Numbers
Lassi Kortela
This SRFI is currently in draft status. Here is an explanation of each status that a SRFI can hold. To provide input on this SRFI, please send email to srfi-169@nospamsrfi.schemers.org. To subscribe to the list, follow these instructions. You can access previous messages via the mailing list archive.
Many people find that large numbers are easier to read when the digits are broken into small groups. For example, the number 1582439 might be easier to read if written as 1 582 439. This applies to source code as it does to other writing. We propose an extension of Scheme syntax to allow the underscore as a digit separator in numerical constants.
Western cultures tend to divide digits into groups of three. This convention is not universal. For example, in India people write numbers like 3 14 15 926 (read "three crore fourteen lakh fifteen thousand nine hundred and twenty-six" in Indian English).
For simplicity and universality, we propose that digit groups of all sizes may be mixed freely when writing a number. It is permissible to have just one digit in a group, and groups in a number don’t need to be ordered by increasing or decreasing digit count.
Human cultures and programming languages differ in what separator to use between groups.
The examples in this document so far have used a space. This is familiar to humans but not a good fit for most programming languages since whitespace has a prominent role as token separator. Scheme is no exception here.
The next natural alternative is to use a comma or a period. This is likely to cause confusion in an international community, since countries that use a comma as the decimal separator are as numerous as those that use a period. More trouble comes from Scheme using the comma to unquote expressions inside a quasiquoted list: e.g. `(1,2) evaluates to (1 2). Allowing commas in numbers would change unquoting behavior in a confusing way.
C++ uses an apostrophe, which is somewhat exotic and may call to mind units of measure, e.g. feet and inches. Scheme also uses the apostrophe for quotation, e.g. '(1'2) evaluates to (1 (quote 2)). Allowing apostrophes in numbers would change the meaning of this syntax.
The most popular digit group separator among programming languages is the underscore. It is in the standard syntax of Ada, C#, Clojure, Eiffel, Frink, Java, Julia, Kotlin, OCaml, Perl, Python, Ruby, Rust and Swift. It is also being added to JavaScript and is a common syntax extension in implementations of Standard ML. The Common Lisp standard permits it under the umbrella of potential numbers but we are not aware of implementations that use the opportunity. Of Scheme implementations, Gauche can read numbers with underscores when they have a radix or exactness prefix.
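As a brief illustration of how one of the languages listed above treats the separator, the following snippet exercises Python's built-in behavior (Python 3.6 or later). It only demonstrates existing Python semantics and is not part of this proposal; the helper name `accepts` is made up for the example.

```python
# Python 3.6+ allows single underscores between digits in numeric
# literals and in strings passed to int().
grouped = 1_582_439
print(grouped == 1582439)          # True

# Group sizes may be mixed freely.
print(int("3_14_15_926"))          # 31415926

def accepts(text):
    """Return True if Python's int() accepts the string."""
    try:
        int(text)
        return True
    except ValueError:
        return False

# Leading, trailing and doubled underscores are rejected by int().
for candidate in ("1_582_439", "_123", "123_", "1__2"):
    print(candidate, accepts(candidate))
```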
In light of the above, we consider the underscore to be the clear winner. It is the most widely compatible and least ambiguous choice, in both human and machine terms.
Languages in the Lisp family traditionally allow a larger set of characters in identifiers than do most other languages. For example, 1+ and 3*/! would parse as symbols in Common Lisp. Scheme is slightly more restrictive: R5RS, R6RS and R7RS do not recognize identifiers that begin with a decimal digit. Some implementations are more relaxed. For example, MIT Scheme comes with 1+ and -1+ procedures to increment and decrement numbers.
Several implementations presently parse tokens consisting
entirely of digits and underscores as symbols.
Countless languages outside the Lisp family use the underscore as a word separator in multi-word identifiers; for example, Scheme’s open-input-file would be spelled open_input_file instead. In these languages, it’s common to use a leading underscore to mark private (as opposed to public or exported) identifiers. This leads to potential ambiguity with tokens such as _123 that start with an underscore and contain only underscores and digits. Those tokens often parse as identifiers. If we made them parse as numbers in Scheme instead, it could confuse users.
Scheme supports a rich numeric tower of integers, ratios, real and complex numbers. These come in exact and inexact variants. For real numbers, we have decimal-point and exponent notation. Particular implementations add quaternions and units of measure to the mix. Common Lisp’s potential numbers offer a glimpse of how far numerical syntax can go. These intricate extensions, some of which we cannot even anticipate yet, make it even trickier for us to specify a digit-separation scheme devoid of ambiguity.
We attempt to solve these problems with a conservative rule that allows underscores only between digits. After considering everything in the above paragraph, we did not manage to come up with any concrete examples of present or future tasks that would be impeded by this restricted version of the syntax extension.
We stipulate that conforming implementations must allow one underscore between any two digits, in any part of a number.
The rule includes:
Underscores in numbers of any radix (binary, octal, decimal, hexadecimal).
Underscores between letters that represent digits in a radix higher than 10 (hexadecimal in particular).
Underscores in the numerator and/or denominator of a ratio.
Underscores in the integer, fractional and/or exponent part of a real number.
Underscores in the real and/or imaginary part of a complex number.
Underscores in any dimension of a hypercomplex number (for implementations with syntax for such numbers).
Underscores in both exact and inexact numbers.
Underscores in the quantity part of a number with a unit of measure (for implementations with syntax for units of measure).
Underscores between leading zeros (but not before the first zero).
The rule excludes:
Leading underscores. They are potentially confused with symbols that are coming from or going to other programming languages. For example, the C language permits the symbol (identifier) _123.
Underscores between sign and magnitude.
Underscores between a radix or exactness prefix, and the digits.
Trailing underscores. They may cause trouble if another syntax extension is made later to support units of measure. Should the name of a unit begin with a digit, it would be ambiguous where the quantity ends and where the unit begins.
Two or more consecutive underscores. We did not think of concrete situations where these would be problematic, but decided to avoid them anyway. There are enough similar gotchas that caution seems the wise choice.
Conforming implementations may be more lenient in what they allow (to maintain compatibility with existing code). In this document, numbers written according to the above rule are called conforming. Other numbers (which may or may not be valid depending on the implementation) are called non-conforming.
0123 ; conforming
0_1_2_3 ; conforming
0_123 ; conforming
01_23 ; conforming
012_3 ; conforming
+0123 ; conforming
+0_123 ; conforming
-0123 ; conforming
-0_123 ; conforming
_0123 ; non-conforming
0123_ ; non-conforming
0123__ ; non-conforming
01__23 ; non-conforming
0_1__2___3 ; non-conforming
+_0123 ; non-conforming
+0123_ ; non-conforming
-_0123 ; non-conforming
-0123_ ; non-conforming
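The integer examples above can be checked mechanically. The following is a minimal sketch in Python (chosen only for illustration), assuming decimal integers with an optional sign; the names `DIGIT_RUN`, `is_conforming_integer` and `integer_value` are hypothetical and not part of this SRFI.

```python
import re

# One or more digits followed by any number of "_digits" groups. This
# permits exactly one underscore between two digits and nothing else.
DIGIT_RUN = r"[0-9]+(?:_[0-9]+)*"

# An optional sign must be followed directly by a digit.
INTEGER = re.compile(r"[+-]?" + DIGIT_RUN)

def is_conforming_integer(text):
    """True if text is a conforming decimal integer under the rule."""
    return INTEGER.fullmatch(text) is not None

def integer_value(text):
    """Value of a conforming integer; ValueError for anything else."""
    if not is_conforming_integer(text):
        raise ValueError("non-conforming integer: " + text)
    return int(text.replace("_", ""))

for token in ("0_1_2_3", "+0_123", "_0123", "01__23", "-0123_"):
    print(token, is_conforming_integer(token))
```

Once a token is known to be conforming, the underscores carry no meaning, so the value is obtained by simply deleting them.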
1_2_3/4_5_6_7 ; conforming
12_34/5_678 ; conforming
1_2_3/_4_5_6_7 ; non-conforming
_12_34/5_678 ; non-conforming
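For ratios, the rule applies to the numerator and the denominator independently. A minimal Python sketch under that reading follows; `parse_ratio` is a hypothetical helper, only decimal digits are handled, and a zero denominator is not treated specially (Python raises `ZeroDivisionError` in that case).

```python
import re
from fractions import Fraction

# Digits with at most one underscore between any two of them.
DIGIT_RUN = r"[0-9]+(?:_[0-9]+)*"

# Signed numerator, a slash, then an unsigned denominator.
RATIO = re.compile(r"([+-]?" + DIGIT_RUN + r")/(" + DIGIT_RUN + r")")

def parse_ratio(text):
    """Return a Fraction for a conforming ratio, or None otherwise."""
    match = RATIO.fullmatch(text)
    if match is None:
        return None
    numerator = int(match.group(1).replace("_", ""))
    denominator = int(match.group(2).replace("_", ""))
    return Fraction(numerator, denominator)

for token in ("1_2_3/4_5_6_7", "12_34/5_678", "1_2_3/_4_5_6_7", "_12_34/5_678"):
    print(token, parse_ratio(token))
```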
0_1_23.4_5_6 ; conforming
1_2_3.5e6 ; conforming
1_2e1_2 ; conforming
_0123.456 ; non-conforming
0123_.456 ; non-conforming
0123._456 ; non-conforming
0123.456_ ; non-conforming
123_.5e6 ; non-conforming
123._5e6 ; non-conforming
123.5_e6 ; non-conforming
123.5e_6 ; non-conforming
123.5e6_ ; non-conforming
12_e12 ; non-conforming
12e_12 ; non-conforming
12e12_ ; non-conforming
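The decimal examples above treat the integer part, the fractional part and the exponent as three separate digit runs, with no underscore allowed next to the decimal point or the exponent marker. The sketch below encodes that reading in Python. It is an illustration only: `parse_real` is a hypothetical name, only the `e` exponent marker is recognized, and plain integers are accepted and returned as floats.

```python
import re

# Digits with at most one underscore between any two of them.
RUN = r"[0-9]+(?:_[0-9]+)*"

# Either "digits", "digits." or "digits.digits", or else ".digits".
MANTISSA = RUN + r"(?:\.(?:" + RUN + r")?)?|\." + RUN
EXPONENT = r"[eE][+-]?" + RUN

REAL = re.compile(r"[+-]?(?:" + MANTISSA + r")(?:" + EXPONENT + r")?")

def parse_real(text):
    """Return a float for a conforming decimal real, or None otherwise."""
    if REAL.fullmatch(text) is None:
        return None
    return float(text.replace("_", ""))

for token in ("0_1_23.4_5_6", "1_2_3.5e6", "1_2e1_2", "123._5e6", "12_e12"):
    print(token, parse_real(token))
```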
-12_3.0_00_00-12_34.56_78i ; conforming
-12_3.0_00_00@-12_34.56_78 ; conforming
-12_3.0_00_00-12_34.56_78_i ; non-conforming
-12_3.0_00_00-12_34.56_78i_ ; non-conforming
-12_3.0_00_00_@-12_34.56_78 ; non-conforming
-12_3.0_00_00@_-12_34.56_78 ; non-conforming
Kawa supports quaternions using the following syntax:
1+2i-3j+4k
By applying the rule, a syntax like that can be extended as follows:
1_0+2_0i-3_0j+4_0k ; conforming
1_0_+2_0i-3_0j+4_0k ; non-conforming
1_0+2_0_i-3_0j+4_0k ; non-conforming
1_0+2_0i-3_0j_+4_0k ; non-conforming
1_0+2_0i-3_0j+4_0k_ ; non-conforming
Kawa supports units of measure using the following syntax:
123456cm^2
By applying the rule, a syntax like that can be extended as follows:
123_456cm^2 ; conforming
123_456_cm^2 ; non-conforming
123_456.78_cm^2 ; non-conforming
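Because the rule only looks at the two characters around each underscore, it can be checked without knowing the rest of the number syntax, which is what lets it carry over to extensions such as quaternions and units. The Python sketch below illustrates this idea under stated assumptions: `underscores_conform` is a hypothetical helper, it checks only the placement of underscores, and it does not verify that the token is otherwise a valid number.

```python
DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"

def underscores_conform(token, radix=10):
    """True if every underscore sits directly between two digits.

    Only underscore placement is checked; the rest of the token is
    not validated.
    """
    # A set is used so that the empty string (no neighbor) is not a member.
    digits = set(DIGITS[:radix])
    for index, char in enumerate(token):
        if char != "_":
            continue
        before = token[index - 1].lower() if index > 0 else ""
        after = token[index + 1].lower() if index + 1 < len(token) else ""
        if before not in digits or after not in digits:
            return False
    return True

for token in ("1_0+2_0i-3_0j+4_0k", "1_0+2_0_i-3_0j+4_0k",
              "123_456cm^2", "123_456_cm^2"):
    print(token, underscores_conform(token))
```

Two consecutive underscores fail this check automatically, since each of them has an underscore rather than a digit as a neighbor.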
#b10_10_10 ; conforming
#o23_45_67 ; conforming
#d45_67_89 ; conforming
#xAB_CD_EF ; conforming
#x-2_0 ; conforming
#o+2_345_6 ; conforming
#x-_2 ; non-conforming
_#x-_2 ; non-conforming
#d_45_67_89 ; non-conforming
#e_45/67_89 ; non-conforming
#i#o_1234 ; non-conforming
#i_#o_1234 ; non-conforming
#e#x1234_ ; non-conforming
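With radix prefixes, the set of characters that count as digits depends on the prefix, and no underscore may appear between the prefix (or sign) and the first digit. The following Python sketch shows one way this could be handled. It is an illustration under stated assumptions: `parse_prefixed_integer` is a hypothetical name, only integer bodies are handled, and an exactness prefix is accepted but otherwise ignored.

```python
import re

RADIX_BY_LETTER = {"b": 2, "o": 8, "d": 10, "x": 16}
DIGITS = "0123456789abcdef"

# At most one exactness prefix and one radix prefix, in either order.
PREFIXES = re.compile(r"(?:#[ei](?:#[bodx])?|#[bodx](?:#[ei])?)?",
                      re.IGNORECASE)

def parse_prefixed_integer(text):
    """Return the integer value of a conforming token, or None."""
    prefixes = PREFIXES.match(text).group(0)
    body = text[len(prefixes):]
    radix = 10
    for letter in prefixes.lower().replace("#", ""):
        radix = RADIX_BY_LETTER.get(letter, radix)
    # Build a digit class for this radix, e.g. [01] or [0123456789abcdef].
    digit = "[" + DIGITS[:radix] + "]"
    pattern = r"[+-]?" + digit + r"+(?:_" + digit + r"+)*"
    if re.fullmatch(pattern, body, re.IGNORECASE) is None:
        return None
    return int(body.replace("_", ""), radix)

for token in ("#b10_10_10", "#xAB_CD_EF", "#x-2_0", "#x-_2", "#d_45_67_89"):
    print(token, parse_prefixed_integer(token))
```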
TODO
TODO
Copyright (C) TODO 2019
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.