1<!--===- docs/Character.md
2
3   Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
4   See https://llvm.org/LICENSE.txt for license information.
5   SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
6
7-->
8
9# Implementation of `CHARACTER` types in f18
10
11```eval_rst
12.. contents::
13   :local:
14```
15
16## Kinds and Character Sets
17
18The f18 compiler and runtime support three kinds of the intrinsic
19`CHARACTER` type of Fortran 2018.
20The default (`CHARACTER(KIND=1)`) holds 8-bit character codes;
21`CHARACTER(KIND=2)` holds 16-bit character codes;
22and `CHARACTER(KIND=4)` holds 32-bit character codes.
23
24We assume that code values 0 through 127 correspond to
25the 7-bit ASCII character set (ISO-646) in every kind of `CHARACTER`.
26This is a valid assumption for Unicode (UCS == ISO/IEC-10646),
27ISO-8859, and many legacy character sets and interchange formats.
28
29`CHARACTER` data in memory and unformatted files are not in an
interchange representation (like UTF-8, Shift-JIS, EUC-JP, or a JIS X encoding).
Each character's code in memory occupies a 1-, 2-, or 4-byte
word, and substrings can be indexed with simple arithmetic.
33In formatted I/O, however, `CHARACTER` data may be assumed to use
34the UTF-8 variable-length encoding when it is selected with
35`OPEN(ENCODING='UTF-8')`.
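
To illustrate the difference between the fixed-width in-memory form and the variable-length external form, here is a minimal sketch of how one `CHARACTER(KIND=4)` code might be expanded into UTF-8 bytes for formatted output.
`EncodeUtf8` is a hypothetical helper written for this illustration, not an f18 runtime entry point, and it assumes the code is no larger than 0x10FFFF.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not part of the f18 runtime): encode one 32-bit
// character code as a UTF-8 byte sequence of 1 to 4 bytes.
std::string EncodeUtf8(char32_t code) {
  assert(code <= 0x10FFFF); // assumption: input is within the Unicode range
  std::string out;
  if (code < 0x80) {
    // 7-bit ASCII is emitted unchanged.
    out += static_cast<char>(code);
  } else if (code < 0x800) {
    out += static_cast<char>(0xC0 | (code >> 6));
    out += static_cast<char>(0x80 | (code & 0x3F));
  } else if (code < 0x10000) {
    out += static_cast<char>(0xE0 | (code >> 12));
    out += static_cast<char>(0x80 | ((code >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (code & 0x3F));
  } else {
    out += static_cast<char>(0xF0 | (code >> 18));
    out += static_cast<char>(0x80 | ((code >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((code >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (code & 0x3F));
  }
  return out;
}
```

A single in-memory element can therefore become anywhere from one to four bytes in a UTF-8 file, which is why simple index arithmetic applies only to the in-memory form.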
36
37`CHARACTER(KIND=1)` literal constants in Fortran source files,
38Hollerith constants, and formatted I/O with `ENCODING='DEFAULT'`
39are not translated.
40
41For the purposes of non-default-kind `CHARACTER` constants in Fortran
42source files, formatted I/O with `ENCODING='UTF-8'` or non-default-kind
`CHARACTER` values, and conversions between kinds of `CHARACTER`,
44by default:
45* `CHARACTER(KIND=1)` is assumed to be ISO-8859-1 (Latin-1),
46* `CHARACTER(KIND=2)` is assumed to be UCS-2 (16-bit Unicode), and
47* `CHARACTER(KIND=4)` is assumed to be UCS-4 (full Unicode in a 32-bit word).
48
49In particular, conversions between kinds are assumed to be
50simple zero-extensions or truncation, not table look-ups.
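
As a rough sketch of what conversion by zero-extension or truncation means, the following models `KIND=1` data as `std::string` and `KIND=4` data as `std::u32string`.
The function names are hypothetical, and the sketch assumes that truncation keeps the low-order 8 bits of each code.

```cpp
#include <cassert>
#include <string>

// Hypothetical helpers for illustration; these are not f18 runtime routines.

// KIND=1 -> KIND=4: zero-extend each 8-bit code to 32 bits.
std::u32string WidenKind1To4(const std::string &in) {
  std::u32string out;
  out.reserve(in.size());
  for (char c : in) {
    // Go through unsigned char so codes 128..255 are not sign-extended.
    out.push_back(static_cast<char32_t>(static_cast<unsigned char>(c)));
  }
  assert(out.size() == in.size()); // one element in, one element out
  return out;
}

// KIND=4 -> KIND=1: keep only the low 8 bits of each 32-bit code
// (assumed meaning of "truncation"); no table look-up is involved.
std::string NarrowKind4To1(const std::u32string &in) {
  std::string out;
  out.reserve(in.size());
  for (char32_t c : in) {
    out.push_back(static_cast<char>(c & 0xFFu));
  }
  return out;
}
```

Under the Latin-1 assumption, widening is lossless because the first 256 Unicode code points coincide with ISO-8859-1, while narrowing loses information for any code above 255.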
51
52We might want to support one or more environment variables to change these
53assumptions, especially for `KIND=1` users of ISO-8859 character sets
54besides Latin-1.
55
56## Lengths
57
58Allocatable `CHARACTER` objects in Fortran may defer the specification
59of their lengths until the time of their allocation or whole (non-substring)
60assignment.
61Non-allocatable objects (and non-deferred-length allocatables) have
62lengths that are fixed or assumed from an actual argument, or,
63in the case of assumed-length `CHARACTER` functions, their local
64declaration in the calling scope.
65
All elements of a `CHARACTER` array have the same length.
67
68Assignments to targets that are not deferred-length allocatables will
69truncate or pad the assigned value to the length of the left-hand side
70of the assignment.
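
The truncate-or-pad rule can be sketched as follows for `KIND=1` data.
`AssignFixedLength` is a hypothetical name used only for this illustration; it is not the runtime's interface.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical illustration of assignment to a fixed-length KIND=1 target:
// copy as much of the right-hand side as fits, then blank-fill the rest.
void AssignFixedLength(char *lhs, std::size_t lhsLength, const char *rhs,
    std::size_t rhsLength) {
  assert(lhs != nullptr && rhs != nullptr);
  std::size_t copied = std::min(lhsLength, rhsLength);
  // memmove tolerates overlap between the two sides.
  std::memmove(lhs, rhs, copied);
  // Pad with blanks when the right-hand side is shorter.
  std::memset(lhs + copied, ' ', lhsLength - copied);
}
```

Assigning `"ab"` to a length-5 target yields `"ab   "`, and assigning `"abcdefg"` yields `"abcde"`.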
71
72Lengths and offsets that are used by or exposed to Fortran programs via
73declarations, substring bounds, and the `LEN()` intrinsic function are always
74represented in units of characters, not bytes.
In generated code, assumed-length arguments, the runtime support library,
and the `elem_len` field of the interoperable descriptor `cdesc_t`,
lengths are always in units of bytes.
78The distinction matters only for kinds other than the default.
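
Since each kind's number equals its element size in bytes, the two units are related by a simple multiplication.
The helpers below are hypothetical and only make that arithmetic explicit.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helpers: convert between the character counts seen by
// Fortran programs and the byte counts used in generated code.
inline std::size_t ByteLength(std::size_t characters, int kind) {
  assert(kind == 1 || kind == 2 || kind == 4);
  return characters * static_cast<std::size_t>(kind);
}

inline std::size_t CharacterLength(std::size_t bytes, int kind) {
  assert(kind == 1 || kind == 2 || kind == 4);
  return bytes / static_cast<std::size_t>(kind);
}
```

For example, a `CHARACTER(KIND=4, LEN=10)` variable reports `LEN()` as 10 but occupies 40 bytes.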
79
80Fortran substrings are rather like subscript triplets into a hidden
81"zero" dimension of a scalar `CHARACTER` value, but they cannot have
82strides.
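
Because elements are fixed-width and substrings have no stride, locating a substring `(lower:upper)` reduces to one offset and one length.
The following is a sketch with hypothetical names; it uses 1-based bounds as in Fortran and treats `lower > upper` as a zero-length substring.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical result type: where a substring lives inside its parent.
struct SubstringBytes {
  std::size_t offset; // bytes from the start of the parent value
  std::size_t length; // bytes occupied by the substring
};

// Hypothetical helper: map 1-based character bounds to a byte range.
SubstringBytes LocateSubstring(long lower, long upper, int kind) {
  assert(kind == 1 || kind == 2 || kind == 4);
  SubstringBytes result{0, 0};
  if (lower > upper) {
    return result; // zero-length substring
  }
  assert(lower >= 1);
  std::size_t width = static_cast<std::size_t>(kind);
  result.offset = static_cast<std::size_t>(lower - 1) * width;
  result.length = static_cast<std::size_t>(upper - lower + 1) * width;
  return result;
}
```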
83
84## Concatenation
85
86Fortran has one `CHARACTER`-valued intrinsic operator, `//`, which
87concatenates its operands (10.1.5.3).
88The operands must have the same kind type parameter.
89One or both of the operands may be arrays; if both are arrays, their
90shapes must be identical.
91The effective length of the result is the sum of the lengths of the
92operands.
93Parentheses may be ignored, so any `CHARACTER`-valued expression
94may be "flattened" into a single sequence of concatenations.
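
The flattening step can be pictured with a small expression tree.
The `Expr` type and the `Leaf`, `Concat`, and `Flatten` functions below are hypothetical and do not reflect the compiler's actual expression representation; they only show that the grouping of `//` operations does not affect the resulting operand sequence.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical expression node: either an operand (leaf) or a `//` node
// with two children.
struct Expr {
  std::string leaf;
  std::unique_ptr<Expr> left;
  std::unique_ptr<Expr> right;
};

std::unique_ptr<Expr> Leaf(std::string text) {
  std::unique_ptr<Expr> e(new Expr());
  e->leaf = std::move(text);
  return e;
}

std::unique_ptr<Expr> Concat(std::unique_ptr<Expr> l, std::unique_ptr<Expr> r) {
  assert(l && r);
  std::unique_ptr<Expr> e(new Expr());
  e->left = std::move(l);
  e->right = std::move(r);
  return e;
}

// Collect the operands left to right, ignoring how they were parenthesized.
void Flatten(const Expr &e, std::vector<std::string> &operands) {
  if (e.left && e.right) {
    Flatten(*e.left, operands);
    Flatten(*e.right, operands);
  } else {
    operands.push_back(e.leaf);
  }
}
```

Both `(a // b) // c` and `a // (b // c)` flatten to the same three-operand sequence.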
95
96The result of `//` may be used
97* as an operand to another concatenation,
98* as an operand of a `CHARACTER` relation,
99* as an actual argument,
100* as the right-hand side of an assignment,
* as the `SOURCE=` or `MOLD=` of an `ALLOCATE` statement,
102* as the selector or case-expr of an `ASSOCIATE` or `SELECT` construct,
103* as a component of a structure or array constructor,
104* as the value of a named constant or initializer,
105* as the `NAME=` of a `BIND(C)` attribute,
106* as the stop-code of a `STOP` statement,
107* as the value of a specifier of an I/O statement,
108* or as the value of a statement function.
109
110The f18 compiler has a general (but slow) means of implementing concatenation
111and a specialized (fast) option to optimize the most common case.
112
113### General concatenation
114
115In the most general case, the f18 compiler's generated code and
116runtime support library represent the result as a deferred-length allocatable
117`CHARACTER` temporary scalar or array variable that is initialized
118as a zero-length array by `AllocatableInitCharacter()`
119and then progressively augmented in place by the values of each of the
120operands of the concatenation sequence in turn with calls to
121`CharacterConcatenate()`.
122Conformability errors are fatal -- Fortran has no means by which a program
123may recover from them.
124The result is then used as any other deferred-length allocatable
125array or scalar would be, and finally deallocated like any other
126allocatable.
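
The overall shape of the general scheme, for the scalar case only, can be sketched with `std::string` standing in for the deferred-length allocatable temporary.
`ConcatenateSequence` is a hypothetical function; it does not reproduce the signatures of `AllocatableInitCharacter()` or `CharacterConcatenate()`, which take descriptors.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical scalar-only model of general concatenation: begin with a
// zero-length temporary and grow it in place, one operand at a time.
std::string ConcatenateSequence(const std::vector<std::string> &operands) {
  std::string result; // zero-length to start
  std::size_t expected = 0;
  for (const std::string &operand : operands) {
    result += operand; // may reallocate, like growing an allocatable
    expected += operand.size();
  }
  // The effective length is the sum of the operand lengths.
  assert(result.size() == expected);
  return result;
}
```

Each append may reallocate and copy the temporary, which is the cost that makes this path slower than the specialized one.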
127
128The runtime routine `CharacterAssign()` takes care of
129truncating, padding, or replicating the value(s) assigned to the left-hand
side, as well as reallocating a nonconforming or deferred-length allocatable
left-hand side.  It takes the descriptors of the left- and right-hand sides of
a `CHARACTER` assignment as its arguments.
133
134When the left-hand side of a `CHARACTER` assignment is a deferred-length
135allocatable and the right-hand side is a temporary, use of the runtime's
136`MoveAlloc()` subroutine instead can save an allocation and a copy.
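
The saving comes from transferring ownership of the temporary's storage instead of duplicating it.
The sketch below models that idea with a hypothetical `CharAllocatable` type and a hypothetical `MoveAllocSketch` function; the runtime's real `MoveAlloc()` works on descriptors and its signature is not shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>
#include <utility>

// Hypothetical model of a deferred-length allocatable KIND=1 scalar.
struct CharAllocatable {
  std::unique_ptr<char[]> data; // null when unallocated
  std::size_t length = 0;
};

// Hypothetical helper: build a temporary holding a copy of `text`.
CharAllocatable MakeTemporary(const char *text) {
  CharAllocatable t;
  t.length = std::strlen(text);
  t.data.reset(new char[t.length]);
  std::memcpy(t.data.get(), text, t.length);
  return t;
}

// Release any storage held by `to`, hand it the storage of `from`, and
// leave `from` unallocated. No character data is copied.
void MoveAllocSketch(CharAllocatable &to, CharAllocatable &from) {
  assert(&to != &from);
  to.data = std::move(from.data);
  to.length = from.length;
  from.length = 0;
}
```

After the move, the left-hand side points at the very buffer that the temporary produced, so neither a second allocation nor a copy of the characters is needed.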
137
138### Optimized concatenation
139
140Scalar `CHARACTER(KIND=1)` expressions evaluated as the right-hand sides of
141assignments to independent substrings or whole variables that are not
142deferred-length allocatables can be optimized into a sequence of
143calls to the runtime support library that do not allocate temporary
144memory.
145
146The routine `CharacterAppend()` copies data from the right-hand side value
147to the remaining space, if any, in the left-hand side object, and returns
148the new offset of the reduced remaining space.
149It is essentially `memcpy(lhs + offset, rhs, min(lhsLength - offset, rhsLength))`.
150It does nothing when `offset > lhsLength`.
151
`CharacterPad()` adds any necessary trailing blank characters.
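
A minimal sketch of how such a pair of routines could cooperate is shown below.
The names `AppendSketch` and `PadSketch` and their parameter lists are assumptions made for this illustration; the document does not give the actual signatures of `CharacterAppend()` and `CharacterPad()`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical stand-in for the append step: copy what fits of `rhs` into
// `lhs` starting at `offset`, and return the offset of the remaining space.
std::size_t AppendSketch(char *lhs, std::size_t lhsLength, std::size_t offset,
    const char *rhs, std::size_t rhsLength) {
  assert(lhs != nullptr && rhs != nullptr);
  if (offset >= lhsLength) {
    return offset; // no space left, so nothing is copied
  }
  std::size_t count = std::min(lhsLength - offset, rhsLength);
  // memcpy is acceptable only because the two sides are independent.
  std::memcpy(lhs + offset, rhs, count);
  return offset + count;
}

// Hypothetical stand-in for the padding step: blank-fill the unused tail.
void PadSketch(char *lhs, std::size_t lhsLength, std::size_t offset) {
  assert(lhs != nullptr);
  if (offset < lhsLength) {
    std::memset(lhs + offset, ' ', lhsLength - offset);
  }
}
```

Under these assumptions, an assignment such as `x = a // b` to a fixed-length `x` would become two append calls that thread the offset through, followed by one pad call, with no temporary allocated.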
153