Arrays for C
Early draft proposal

John Nagle
Animats


Safer arrays for C - a proposal
(DRAFT 0.2)

 

Abstract

For many years, C programs have suffered from pointer problems. We offer a way out which is backwards compatible.The goal here is to extend C in a way which allows efficient subscript checking without losing backwards compatibility and without impacting performance significantly.

The problem: Array == Pointer

In C, the declarations

float tab[ ];
and
float* tab;

are equivalent. This has its uses, but since no size information, or even whether a pointer refers to an array or a single element, is carried along with an array, it is a source of error.

C does, of course, allow

float tab[100];

but the size must be a compile-time constant. To pass that array to a function, one would declare a function

void fn(size_t n, float tab[ ])
{
...
}

with the size of tab unknown to the language, although hopefully known to the programmer. This construct is the source of most of the buffer overflow problems in the world today.

C99 extensions

C99 introduced "variable length arrays", using this syntax:

void fn(size_t n)
{ float tab[n];
...
}

This feature was provided to allow variable length arrays of storage class auto, or "on the stack" arrays, which made simple numerical subroutines much more straightforward to write. (See the original edition of "Numerical Recipes in C" for the contortions required for temporary arrays prior to C99.) However, the only context in which this construct is allowed is for "auto" variables. This is limiting.

C99 also introduced "static parameters":

void fn(static float tab[10])
{
...
}

At last, it's possible to talk about the size of an array parameter in C. But C99 limits the size to a constant.

Our proposal

We propose to allow the C99 variable length array syntax in more contexts, including formal function arguments:

void fn(size_t n, static float tab[n])
{
...
}

Here, we've provided a way to express in C what is really going on - an array of size n is being passed. Note that the information about the size of the array is not bundled with the array; the expected implementation remains the same. But the compiler now knows the size of the array.

So what is this good for? Subscript checking, call checking, and simplifying array manipulation.

As with C99 variable length arrays, sizeof (and, in most implementations, lengthof) become variable-valued functions which report the size in bytes and the length in elements of the array.

As a convenience, we suggest that the scope of formal argument declarations within a function prototype cover the entire function prototype, including the portion of the prototype preceding the declaration. This allows the common C idiom

void fn(float tab[n], size_t n)
{
...
}

and, of course, the common declaration

int read(int fd, char buf[n], size_t n);

None of this involves changing the run-time representation of arrays. This is entirely a compile time construct.

Subarrays

It's common in C to talk about a portion of an array using pointer operations. The usual syntax is

float tab[100];
float* tab2 = tab + 20

tab2 then represents an array beginning at tab[20], of undefined length.

We propose the syntax

float tab[100];
float tab2[80] = tab[20:100];

to express such things. The number before the colon is the index of the start of the subarray, and the second number is the index of the end + 1. (This follows Python's notational convention.) Again, the run-time representation is the same, and this is still a pointer assignment, not an array copy. We're now associating a length with "tab2", one which can be retrieved with lengthof or used for checking. In the example above, lengthof(tab) is 100 and lengthof(tab2) is 80.

Initialization

The common idiom

char s[ ] = "Hello, world.";

needs to be dealt with. This is actually a special case of array initialization:

float tab[ ] = {1.0, 1.5, 2.0} ;

This defines a fixed length array, so that the above is equivalent to

float tab[3] = {1.0, 1.5, 2.0};

and the first statement above is equivalent to

char s[14] = "Hello, world.";

These are now considered fixed-length arrays.

Auto

A proposed addition to C++ allows the "auto" keyword to simplify declarations. The newly declared variable takes the type of the initialization expression. Adding this feature to C makes initialization of sized objects simpler:.

auto s = "Hello, world.";

This eliminates the need to count characters.

Casts

Conversions will, of course, be necessary. Syntax is a problem, because, in C, array size is associated with the variable, not the type, and casts don't have a variable name. We'd like to be able to write a cast such as

(float[n])

This needs further discussion.

Void

To reduce the need for "void *", which bypasses most type checking, we would add two new types:

void_t

which, like void, represents data of unknown type. sizeof(void_byte_t) == 1, and arrays of void_t are permitted.

void_zero_t

which is a void_t whose value must be zero. (These are merely names for discussion; in practice, "void_t" is seen in existing code, and a different keyword may be preferable.)

Conversion rules are as follows:

  1. void_zero_t and arrays thereof can be converted to any type of equivalent size.
  2. void_t and arrays thereof can be converted to any type of equivalent size other than a struct which contains one or more pointer types.

The standard malloc function would then be declared:

void_t[size] malloc(size_t size);

and the standard "bzero" function would be defined as

void_zero_t[n] bzero(void_t s[n], size_t n);

Checking

Given this information, we can perform subscript checking. The following checks should be made:

  1. For subscript expressions, if lengthof is defined for the array being subscripted, the subscript must be within the range 0 .. lengthof(arr) -1.
  2. For function calls, the length of the formal argument must be less than or equal to the length of the actual argument.
  3. For assignments, the length of the LHS must be less than or equal to the length of the RHS.

This is backwards-compatible with correct existing code, since checking is only enabled for arrays with either constant length or for which the new syntax is being used.

Extension to structures

A classic idiom in C is:

struct buf {
size_t len;
char buffer[1];
};

This usually indicates a buffer of variable length. But C provides no way to talk about a variable length item within a structure, so the programmer is obliged to lie to the language about it. We would propose

struct buf {
size_t len;
char buffer[len];
};

As with the classic idiom, the variable length array can appear only as the last item of the structure.

The type declaration becomes a bit unusual. We need a type like this:

size_t n = 100;
struct buf ourbuffer(n);

Allocation

To allocate a variable length array, the syntax would be

size_t n = 100;
float tab[n] = (float[n]) malloc(sizeof(float)*n);

It would be tempting to introduce the "new" syntax from C++, so that one could simply write

float tab[n] = new float[n];

but that might be controversial. Certainly one could have

#define NEW(TYPE,LENGTH) (TYPE[(LENGTH)])malloc(sizeof(TYPE)*(LENGTH))
...
float tab[n] = NEW(float,n);

which avoids conflicts with C++.

Discussion

This is a limited extension to C, not an attempt to make C into C++ or arrays into first-class objects. It's still not possible to copy a variable length array with the assignment operator, or pass an array by value. The goal here is to provide the minimal backwards-compatible extension which allows the programmer to say within C the size of an array.

Strict mode

The next step is "strict mode", an optional mode in which the checking is stricter, and sufficient to prevent buffer overflows. In strict mode, we disallow conversions between pointers and arrays. In this mode, a pointer means a pointer to a single object, never an array. Some new restrictions apply:

  1. Subscripting a pointer is prohibited.
  2. Casts from pointers to arrays are normally disallowed.
  3. Pointer arithmetic is prohibited except when all the following hold:
    1. The pointer has storage class "auto" and local scope.
    2. All assignments to the pointer are from the same array.

These restrictions allow subscript checking for pointer arithmetic. The compiler can associate the pointer with a single array, and thus perform subscript checking on it.

Rule 3 allows the classic C idiom

void bcopy(char dest[n], char src[n], size_t n)
{ while (n-- > 0) *dest++ = *src++; }

The compiler can interpret this unambiguously as the equivalent form

void bcopy(char dest[n], char src[n], size_t n)
{ destix = 0; srcix = 0; while (n-- > 0) dest[destix++] = src[srcix++]; }

This form is suitable for subscript checking. A clever compiler could hoist the subscript checks out of the loop.

It seems desirable to allow pointer arithmetic in the cases where it can be easily checked, to support legacy code in strict mode to the greatest extent possible. Where pointer arithmetic cannot be checked, though, it must be disallowed for safety.

Problems

Variable-length null-terminated strings

It's not clear what to do about the common idiom of null-terminated constant strings. This is serious enough that we may want to allow unchecked subscripting for "const" objects. Errors from such operations can't corrupt memory, although they may read junk.

 

April 21, 2008