<!--===- docs/Character.md

   Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
   See https://llvm.org/LICENSE.txt for license information.
   SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception

-->

# Implementation of `CHARACTER` types in f18

```eval_rst
.. contents::
   :local:
```

## Kinds and Character Sets

The f18 compiler and runtime support three kinds of the intrinsic
`CHARACTER` type of Fortran 2018.
The default (`CHARACTER(KIND=1)`) holds 8-bit character codes;
`CHARACTER(KIND=2)` holds 16-bit character codes;
and `CHARACTER(KIND=4)` holds 32-bit character codes.

We assume that code values 0 through 127 correspond to
the 7-bit ASCII character set (ISO-646) in every kind of `CHARACTER`.
This is a valid assumption for Unicode (UCS == ISO/IEC-10646),
ISO-8859, and many legacy character sets and interchange formats.

`CHARACTER` data in memory and unformatted files are not in an
interchange representation (like UTF-8, Shift-JIS, EUC-JP, or a JIS X).
Each character's code in memory occupies a 1-, 2-, or 4-byte
word and substrings can be indexed with simple arithmetic.
In formatted I/O, however, `CHARACTER` data may be assumed to use
the UTF-8 variable-length encoding when it is selected with
`OPEN(ENCODING='UTF-8')`.

`CHARACTER(KIND=1)` literal constants in Fortran source files,
Hollerith constants, and formatted I/O with `ENCODING='DEFAULT'`
are not translated.

For the purposes of non-default-kind `CHARACTER` constants in Fortran
source files, formatted I/O with `ENCODING='UTF-8'` or non-default-kind
`CHARACTER` values, and conversions between kinds of `CHARACTER`,
by default:
* `CHARACTER(KIND=1)` is assumed to be ISO-8859-1 (Latin-1),
* `CHARACTER(KIND=2)` is assumed to be UCS-2 (16-bit Unicode), and
* `CHARACTER(KIND=4)` is assumed to be UCS-4 (full Unicode in a 32-bit word).

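
As an illustration of the default assumptions listed above, the following C++
sketch shows how a single code value held in a `CHARACTER(KIND=4)` element
(assumed to be UCS-4) could be written as UTF-8 for formatted I/O.
The helper `EncodeUtf8` is hypothetical and is not the f18 runtime's routine;
it also omits validation of surrogates and of values above 0x10FFFF.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper (an assumption, not part of the f18 runtime):
// encodes one code point, as stored in a CHARACTER(KIND=4) element,
// into its UTF-8 byte sequence. No validation is performed.
std::string EncodeUtf8(char32_t code) {
  std::string out;
  if (code < 0x80) {
    // 7-bit ASCII is a single byte in every kind and in UTF-8.
    out += static_cast<char>(code);
  } else if (code < 0x800) {
    out += static_cast<char>(0xC0 | (code >> 6));
    out += static_cast<char>(0x80 | (code & 0x3F));
  } else if (code < 0x10000) {
    out += static_cast<char>(0xE0 | (code >> 12));
    out += static_cast<char>(0x80 | ((code >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (code & 0x3F));
  } else {
    out += static_cast<char>(0xF0 | (code >> 18));
    out += static_cast<char>(0x80 | ((code >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((code >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (code & 0x3F));
  }
  return out;
}
```

Because `KIND=1` is assumed to be Latin-1, whose codes coincide with the first
256 Unicode code points, the Latin-1 code 0xE9 would pass through the same
helper unchanged and yield the two bytes 0xC3 0xA9.
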

In particular, conversions between kinds are assumed to be
simple zero-extensions or truncations, not table look-ups.

We might want to support one or more environment variables to change these
assumptions, especially for `KIND=1` users of ISO-8859 character sets
besides Latin-1.

## Lengths

Allocatable `CHARACTER` objects in Fortran may defer the specification
of their lengths until the time of their allocation or whole (non-substring)
assignment.
Non-allocatable objects (and non-deferred-length allocatables) have
lengths that are fixed or assumed from an actual argument, or,
in the case of assumed-length `CHARACTER` functions, their local
declaration in the calling scope.

The elements of `CHARACTER` arrays have the same length.

Assignments to targets that are not deferred-length allocatables will
truncate or pad the assigned value to the length of the left-hand side
of the assignment.

Lengths and offsets that are used by or exposed to Fortran programs via
declarations, substring bounds, and the `LEN()` intrinsic function are always
represented in units of characters, not bytes.
In generated code, in assumed-length arguments, in the runtime support library,
and in the `elem_len` field of the interoperable descriptor `cdesc_t`,
lengths are always in units of bytes.
The distinction matters only for kinds other than the default.

Fortran substrings are rather like subscript triplets into a hidden
"zero" dimension of a scalar `CHARACTER` value, but they cannot have
strides.

## Concatenation

Fortran has one `CHARACTER`-valued intrinsic operator, `//`, which
concatenates its operands (10.1.5.3).
The operands must have the same kind type parameter.
One or both of the operands may be arrays; if both are arrays, their
shapes must be identical.
The effective length of the result is the sum of the lengths of the
operands.
Parentheses may be ignored, so any `CHARACTER`-valued expression
may be "flattened" into a single sequence of concatenations.

The result of `//` may be used
* as an operand to another concatenation,
* as an operand of a `CHARACTER` relation,
* as an actual argument,
* as the right-hand side of an assignment,
* as the `SOURCE=` or `MOLD=` of an `ALLOCATE` statement,
* as the selector or case-expr of an `ASSOCIATE` or `SELECT` construct,
* as a component of a structure or array constructor,
* as the value of a named constant or initializer,
* as the `NAME=` of a `BIND(C)` attribute,
* as the stop-code of a `STOP` statement,
* as the value of a specifier of an I/O statement,
* or as the value of a statement function.

The f18 compiler has a general (but slow) means of implementing concatenation
and a specialized (fast) option to optimize the most common case.

### General concatenation

In the most general case, the f18 compiler's generated code and
runtime support library represent the result as a deferred-length allocatable
`CHARACTER` temporary scalar or array variable that is initialized
as a zero-length array by `AllocatableInitCharacter()`
and then progressively augmented in place by the values of each of the
operands of the concatenation sequence in turn with calls to
`CharacterConcatenate()`.
Conformability errors are fatal -- Fortran has no means by which a program
may recover from them.
The result is then used as any other deferred-length allocatable
array or scalar would be, and finally deallocated like any other
allocatable.

The runtime routine `CharacterAssign()` takes care of
truncating, padding, or replicating the value(s) assigned to the left-hand
side, as well as reallocating a nonconforming or deferred-length allocatable
left-hand side.
It takes the descriptors of the left- and right-hand sides of
a `CHARACTER` assignment as its arguments.

When the left-hand side of a `CHARACTER` assignment is a deferred-length
allocatable and the right-hand side is a temporary, use of the runtime's
`MoveAlloc()` subroutine instead can save an allocation and a copy.

### Optimized concatenation

Scalar `CHARACTER(KIND=1)` expressions evaluated as the right-hand sides of
assignments to independent substrings or whole variables that are not
deferred-length allocatables can be optimized into a sequence of
calls to the runtime support library that do not allocate temporary
memory.

The routine `CharacterAppend()` copies data from the right-hand side value
to the remaining space, if any, in the left-hand side object, and returns
the new offset of the reduced remaining space.
It is essentially `memcpy(lhs + offset, rhs, min(lhsLength - offset, rhsLength))`.
It does nothing when `offset > lhsLength`.

`CharacterPad()` adds any necessary trailing blank characters.
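
The append-then-pad sequence described above can be sketched in C++ as follows.
This is a minimal illustration for `KIND=1` data only, in which lengths in
characters and in bytes coincide.
The names `AppendSketch` and `PadSketch` and their parameter lists are
assumptions made for this example; they are not the runtime's actual
signatures.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>

// Hypothetical stand-in for CharacterAppend(); the parameter list is assumed.
// Copies as much of rhs as fits into the space remaining in lhs after
// `offset`, and returns the offset of the space that is still unused.
std::size_t AppendSketch(char *lhs, std::size_t lhsLength, std::size_t offset,
                         const char *rhs, std::size_t rhsLength) {
  if (offset >= lhsLength) {
    return offset;  // no space remains, so nothing is copied
  }
  std::size_t n{std::min(lhsLength - offset, rhsLength)};
  std::memcpy(lhs + offset, rhs, n);
  return offset + n;
}

// Hypothetical stand-in for CharacterPad(); the parameter list is assumed.
// Fills whatever space remains after `offset` with blanks.
void PadSketch(char *lhs, std::size_t lhsLength, std::size_t offset) {
  if (offset < lhsLength) {
    std::memset(lhs + offset, ' ', lhsLength - offset);
  }
}
```

Under these assumptions, an assignment such as `x = a // b` to a fixed-length
`x` would become one `AppendSketch` call per operand, each starting at the
offset returned by the previous call, followed by a single `PadSketch` call.
An operand that overflows the left-hand side is truncated, and later operands
copy nothing.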