11. Rules When Writing New C Code

The XEmacs C Code is extremely complex and intricate, and there are many rules that are more or less consistently followed throughout the code. Many of these rules are not obvious, so they are explained here. It is of the utmost importance that you follow them. If you don't, you may get something that appears to work, but which will crash in odd situations, often in code far away from where the actual breakage is.

11.1 Introduction to Writing C Code  
11.2 Writing New Modules  
11.3 Working with Lisp Objects  
11.4 Writing Lisp Primitives  
11.5 Writing Good Comments  
11.6 Adding Global Lisp Variables  
11.7 Writing Macros  
11.8 Proper Use of Unsigned Types  
11.9 Major Textual Changes  
11.10 Debugging and Testing  

See also 26.10 Coding for Mule.


11.1 Introduction to Writing C Code

The C code is actually written in a dialect of C called Clean C, meaning that it can be compiled, warning-free, with either a C or C++ compiler. Coding in Clean C has several advantages over plain C. C++ compilers are more nit-picking, and a number of coding errors have been found by compiling with C++. The ability to use both C and C++ tools means that a greater variety of development tools are available to the developer. In addition, the ability to overload operators in C++ means it is possible, for error-checking purposes, to redefine certain simple types (normally defined as aliases for simple built-in types such as unsigned char or long) as classes, strictly limiting the permissible operations and catching illegal implicit casts and such.
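
As a small illustration of the kind of difference Clean C has to respect, consider the result of malloc(). The sketch below is not taken from the XEmacs sources; copy_string is a hypothetical helper. It shows one construct that plain C accepts without a cast but C++ rejects, and the spelling that is valid in both languages.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* copy_string is a hypothetical helper, not an XEmacs function. */
char *
copy_string (const char *src)
{
  size_t len;
  char *dst;

  assert (src != NULL);
  len = strlen (src) + 1;
  /* C converts the void * from malloc() implicitly; C++ does not, so
     the explicit cast keeps this line valid in both languages. */
  dst = (char *) malloc (len);
  if (dst != NULL)
    memcpy (dst, src, len);
  return dst;
}
```

Declaring the locals at the top of the block is only a matter of keeping to a conservative C style here; it is not required for C++ compatibility.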

XEmacs follows the GNU coding standards, which are documented separately; see section `top' in GNU Coding Standards. This section mainly documents standards that are not included in that document; typically this consists of standards that are specifically relevant to the XEmacs code itself.

First, a recap of the GNU standards:

Now, the XEmacs coding standards:

Specially-prefixed functions/variables:

Functions for manipulating Lisp types:

Other:


11.2 Writing New Modules

Every module includes `<config.h>' (angle brackets so that `--srcdir' works correctly; `config.h' may or may not be in the same directory as the C sources) and `lisp.h'. `config.h' must always be included before any other header files (including system header files) to ensure that certain tricks played by various `s/' and `m/' files work out correctly.

When including header files, always use angle brackets, not double quotes, except when the file to be included is always in the same directory as the including file. If either file is a generated file, then that is not likely to be the case. In order to understand why we have this rule, imagine what happens when you do a build in the source directory using `./configure' and another build in another directory using `../work/configure'. There will be two different `config.h' files. Which one will be used if you `#include "config.h"'?

Almost every module contains a syms_of_*() function and a vars_of_*() function. The former declares any Lisp primitives you have defined and defines any symbols you will be using. The latter declares any global Lisp variables you have added and initializes global C variables in the module. Important: There are stringent requirements on exactly what can go into these functions. See the comment in `emacs.c'. The reason for this is to avoid obscure unwanted interactions during initialization. If you don't follow these rules, you'll be sorry! If you want to do anything that isn't allowed, create a complex_vars_of_*() function for it. Doing this is tricky, though: you have to make sure your function is called at the right time so that all the initialization dependencies work out.

Declare each function of these kinds in `symsinit.h'. Make sure it's called in the appropriate place in `emacs.c'. You never need to include `symsinit.h' directly, because it is included by `lisp.h'.

All global and static variables that are to be modifiable must be declared uninitialized. This means that you may not use the "declare with initializer" form for these variables, such as int some_variable = 0;. The reason for this has to do with some kludges done during the dumping process: If possible, the initialized data segment is re-mapped so that it becomes part of the (unmodifiable) code segment in the dumped executable. This allows this memory to be shared among multiple running XEmacs processes. XEmacs is careful to place as much constant data as possible into initialized variables during the `temacs' phase.

Please note: This kludge only works on a few systems nowadays, and is rapidly becoming irrelevant because most modern operating systems provide copy-on-write semantics. All data is initially shared between processes, and a private copy is automatically made (on a page-by-page basis) when a process first attempts to write to a page of memory.
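
The rule about uninitialized variables can be sketched as follows. All names here (example_cache_limit, vars_of_example, example_cache_has_room) are hypothetical and stand in for a real module's variables and its vars_of_*() function; the point is only that the starting value is assigned at initialization time rather than in the declaration.

```c
#include <assert.h>

/* Hypothetical module state; these names are illustrative only. */
int some_variable;         /* no initializer: starts out as zero */
int example_cache_limit;   /* needs a non-zero starting value */

/* Stand-in for the module's vars_of_*() function. */
void
vars_of_example (void)
{
  some_variable = 0;
  example_cache_limit = 128;
}

/* Only meaningful once vars_of_example() has run. */
int
example_cache_has_room (int used)
{
  assert (example_cache_limit > 0);
  return used < example_cache_limit;
}
```

Code that runs before vars_of_example() would see zero in both variables, which is one reason the order of the initialization calls matters.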

Formerly, there was a requirement that static variables not be declared inside of functions. This had to do with another hack in the same vein as what was just described: old USG systems put statically-declared variables in the initialized data space, so the header files for those systems had a #define static declaration. (That way, the data-segment remapping described above could still work.) This fails badly on static variables inside of functions, which suddenly become automatic variables; therefore, you weren't supposed to have any of them. This awful kludge has been removed in XEmacs because

  1. almost all of the systems that used this kludge ended up having to disable the data-segment remapping anyway;
  2. the only systems that didn't were extremely outdated ones;
  3. this hack completely messed up inline functions.
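
The effect of that old #define on function-local statics is easy to reproduce in isolation. The following sketch is a stand-alone demonstration, not XEmacs code; the function names are made up. The macro is defined after all #include lines and removed again immediately, since redefining a keyword around standard headers is not allowed.

```c
#include <assert.h>

/* With the real keyword, `calls' persists across invocations. */
int
count_calls_static (void)
{
  static int calls = 0;
  return ++calls;
}

/* Simulate the old header, which made `static' expand to nothing. */
#define static
int
count_calls_kludged (void)
{
  static int calls = 0;   /* now an automatic variable, reset on each call */
  return ++calls;
}
#undef static

/* The first counter advances; the second one is stuck at 1. */
void
check_static_kludge (void)
{
  assert (count_calls_static () == 1);
  assert (count_calls_static () == 2);
  assert (count_calls_kludged () == 1);
  assert (count_calls_kludged () == 1);
}
```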

Here are things to know when you create a new source file:


11.3 Working with Lisp Objects

Conventions involving Lisp objects

Of course the low-level implementation language of XEmacs is C, but much of that uses the Lisp engine to do its work. However, because the code is "inside" of the protective containment shell around the "reactor core," you'll see lots of complex "plumbing" needed to do the work and "safety mechanisms," whose failure results in a meltdown. This section provides a quick overview (or review) of the various components of the implementation of Lisp objects.

Two typographic conventions help to identify C objects that implement Lisp objects. The first is that capitalized identifiers, especially those beginning with the letters `Q', `V', `F', and `S' for C variables and functions, and C macros beginning with the letter `X', are used to implement Lisp. The second is that where Lisp uses the hyphen `-' in symbol names, the corresponding C identifiers use the underscore `_'. Of course, since XEmacs Lisp contains interfaces to many external libraries, those external names will follow the coding conventions their authors chose, and may overlap the "XEmacs name space." However, these cases are usually pretty obvious.

All Lisp objects are handled indirectly. The Lisp_Object type is usually a pointer to a structure, except for a very small number of types with immediate representations (currently characters and fixnums). However, these types cannot be directly operated on in C code, either, so they can also be considered indirect. Types that do not have an immediate representation always have a C typedef Lisp_type for a corresponding structure.

In older code, it was common practice to pass around pointers to Lisp_type, but this is now deprecated in favor of using Lisp_Object for all function arguments and return values that are Lisp objects. The Xtype macro is used to extract the pointer and cast it to (Lisp_type *) for the desired type.

Convention: macros whose names begin with `X' operate on Lisp_Objects and do no type-checking. Many such macros are type extractors, but others implement Lisp operations in C (e.g., XCAR implements the Lisp car function). These are unsafe, and must only be used where types of all data have already been checked. Such macros are only applied to Lisp_Objects. In internal implementations where the pointer has already been converted, the structure is operated on directly using the C -> member access operator.

The typeP, CHECK_type, and CONCHECK_type macros are used to test types. The first returns a Boolean value, and the latter two signal errors. (The `CONCHECK' variety allows execution to be CONtinued under some circumstances, thus the name.) Functions that expect to be passed user data invariably call `CHECK' macros on their arguments.
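
The division of labor between predicates and unchecked extractors can be seen in a toy model. Everything below is hypothetical (the toy_ names, the tagged struct, the -1 return code) and is far simpler than the real Lisp_Object machinery; in particular, the real CHECK_type macros signal a Lisp error rather than returning a status.

```c
#include <assert.h>
#include <stddef.h>

enum toy_type { TOY_FIXNUM, TOY_STRING };

struct toy_object
{
  enum toy_type type;
  union { long fixnum; const char *string; } u;
};

/* Predicate in the style of typeP: yields a C truth value. */
#define TOY_FIXNUMP(obj) ((obj)->type == TOY_FIXNUM)

/* Extractor in the style of Xtype: performs no check at all. */
#define XTOY_FIXNUM(obj) ((obj)->u.fixnum)

/* Returns 0 and stores the successor in *result, or -1 if OBJ has
   the wrong type.  The check comes first; only then is the unchecked
   extractor safe to use. */
int
toy_add1 (const struct toy_object *obj, long *result)
{
  assert (obj != NULL && result != NULL);
  if (!TOY_FIXNUMP (obj))
    return -1;
  *result = XTOY_FIXNUM (obj) + 1;
  return 0;
}
```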

There are many types of specialized Lisp objects implemented in C, but the most pervasive type is the symbol. Symbols are used as identifiers, variables, and functions.

Convention: Global variables whose names begin with `Q' are constants whose value is a symbol. The name of the variable should be derived from the name of the symbol using the same rules as for Lisp primitives. Such variables allow the C code to check whether a particular Lisp_Object is equal to a given symbol. Symbols are Lisp objects, so these variables may be passed to Lisp primitives. (A tempting alternative to the use of `Q...' variables is to call the intern function at initialization in the vars_of_module function. But this does not staticpro the symbol, which in theory could get uninterned, and then garbage collected while you're not looking. You could staticpro yourself, but in a production XEmacs intern and staticpro is all that DEFSYMBOL does, while in a debugging XEmacs it also does some error-checking, which you normally want.)

Convention: Global variables whose names begin with `V' are variables that contain Lisp objects. The convention here is that all global variables of type Lisp_Object begin with `V', and no others do (not even fixnum and boolean variables that have Lisp equivalents). Most of the time, these variables have equivalents in Lisp, which are defined via the `DEFVAR' family of macros, but some don't. Since the variable's value is a Lisp_Object, it can be passed to Lisp primitives.

The implementation of Lisp primitives is more complex. Convention: Global variables with names beginning with `S' contain a structure that allows the Lisp engine to identify and call a C function. In modern versions of XEmacs, these identifiers are almost always completely hidden in the DEFUN and SUBR macros, but you will encounter them if you look at very old versions of XEmacs or at GNU Emacs. Convention: Functions with names beginning with `F' implement Lisp primitives. Of course all their arguments and their return values must be Lisp_Objects. (This is hidden in the DEFUN macro.)

Working with Lisp lists

Lisp lists are popular data structures in the C code as well as in Elisp. There are two sets of macros that iterate over lists. EXTERNAL_LIST_LOOP_n should be used when the list has been supplied by the user, and cannot be trusted to be acyclic and nil-terminated. A malformed-list or circular-list error will be generated if the list being iterated over is not entirely kosher. LIST_LOOP_n, on the other hand, is faster and less safe, and can be used only on trusted lists.

Related macros are GET_EXTERNAL_LIST_LENGTH and GET_LIST_LENGTH, which calculate the length of a list; GET_EXTERNAL_LIST_LENGTH also validates that the list is proper. The macros EXTERNAL_LIST_LOOP_DELETE_IF and LIST_LOOP_DELETE_IF delete from a Lisp list the elements that satisfy some predicate.
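
To see why user-supplied lists need the slower macros, here is a toy comparison of a trusted and an untrusted length computation. It is a sketch on a made-up cons cell, not the XEmacs macros: the two-pointer cycle check is one well-known technique and is assumed here for illustration, and the -1 return value stands in for the circular-list error that the real macros signal.

```c
#include <assert.h>
#include <stddef.h>

/* Toy cons cell; NULL plays the role of nil. */
struct toy_cons { int car; struct toy_cons *cdr; };

/* Trusted variant: fast, but never returns on a circular list. */
int
toy_list_length (const struct toy_cons *list)
{
  int len = 0;
  for (; list != NULL; list = list->cdr)
    len++;
  return len;
}

/* Untrusted variant: returns -1 for a circular list. */
int
toy_external_list_length (const struct toy_cons *list)
{
  const struct toy_cons *hare = list, *tortoise = list;
  int len = 0;

  while (hare != NULL)
    {
      hare = hare->cdr;
      len++;
      if ((len & 1) == 0)
        tortoise = tortoise->cdr;   /* advances at half the hare's speed */
      if (hare == tortoise)
        return -1;                  /* the hare caught up: a cycle */
    }
  return len;
}

void
toy_list_demo (void)
{
  struct toy_cons c = { 3, NULL }, b = { 2, &c }, a = { 1, &b };

  assert (toy_list_length (&a) == 3);
  assert (toy_external_list_length (&a) == 3);
  c.cdr = &a;                       /* make the list circular */
  assert (toy_external_list_length (&a) == -1);
}
```

Calling toy_list_length() on the circular list would loop forever, which is the toy counterpart of using LIST_LOOP_n on data that should have gone through EXTERNAL_LIST_LOOP_n.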

Implementation of Lisp objects

At the lowest levels, XEmacs makes heavy use of object-oriented techniques to promote code-sharing and uniform interfaces for different devices and platforms. Commonly, but not always, such objects are "wrapped" and exported to Lisp as Lisp objects. Usually they use the internal structures developed for Lisp objects (the `lrecord' structure) in order to take advantage of Lisp memory management. Unfortunately, XEmacs was originally written in C, so these techniques are based on heavy use of C macros.

A module defining a class is likely to use most of the following declarations and macros. In the following, the notation `<type>' will stand for the full name of the class, and will be capitalized in the way normal for its context. The notation `<typ>' will stand for the abbreviated form commonly used in macro names, while `ty' will be used as the typical name for instances of the class. (See the entry for `MAYBE_<TY>METH' below for an example using all three notations.)

In the interface (`.h' file), the following declarations are used often. Others may be used in particular modules. Since they're quite short in most cases, the definitions are given as well. The generic macros used are defined in `lisp.h' or `lrecord.h'.

`typedef struct Lisp_<Type> Lisp_<Type>'
This refers to the internal structure used by C code. The XEmacs coding style now forbids passing pointers to `Lisp_<Type>' structures into or out of a function; instead, a `Lisp_Object' should be passed or returned (created using `wrap_<type>', if necessary).

`DECLARE_LISP_OBJECT (<type>, Lisp_<Type>)'
Declares a Lisp object for `<Type>', which is the unit of allocation.

`#define X<TYPE>(x) XRECORD (x, <type>, Lisp_<Type>)'
Turns a Lisp_Object into a pointer to `struct Lisp_<Type>'.

`#define wrap_<type>(p) wrap_record (p, <type>)'
Turns a pointer to `struct Lisp_<Type>' into a Lisp_Object.

`#define <TYPE>P(x) RECORDP (x, <type>)'
Tests whether a given Lisp_Object is of type `Lisp_<Type>'. Returns a C int, not a Lisp Boolean value.

`#define CHECK_<TYPE>(x) CHECK_RECORD (x, <type>)'
`#define CONCHECK_<TYPE>(x) CONCHECK_RECORD (x, <type>)'
Tests whether a given Lisp_Object is of type `Lisp_<Type>', and signals a Lisp error if not. The `CHECK' version of the macro never returns if the type is wrong, while the `CONCHECK' version can return if the user catches it in the debugger and explicitly requests a return.

`#define RAW_<TYP>METH(ty, m) ((ty)->methods->m##_method)'
Return a function pointer for the method for an object TY of class `Lisp_<Type>', or `NULL' if there is none for this type.

`#define HAS_<TYP>METH_P(ty, m) (!!RAW_<TYP>METH (ty, m))'
Test whether the class that TY is an instance of has the method.

`#define <TYP>METH(ty, m, args) ((RAW_<TYP>METH (ty, m)) args)'
Call the method on `args'. `args' must be enclosed in parentheses in the call. It is the programmer's responsibility to ensure that the method is available. The standard convenience macro `MAYBE_<TYP>METH' is often provided for the common case where a void-returning method of `<Type>' is called.

`#define MAYBE_<TYP>METH(ty, m, args) do { ... } while (0)'
Call a void-returning `<Type>' method, if it exists. Note the use of the `do ... while (0)' idiom to give the macro call C statement semantics. The full definition is equally idiomatic:

 
#define MAYBE_<TYP>METH(ty, m, args) do {	\
  Lisp_<Type> *maybe_<typ>meth_ty = (ty);	\
  if (HAS_<TYP>METH_P (maybe_<typ>meth_ty, m))	\
    <TYP>METH (maybe_<typ>meth_ty, m, args);	\
} while (0)

The use of macros for invoking an object's methods makes life a bit difficult for the student or maintainer when browsing the code. In particular, calls are of the form `<TYP>METH (ty, some_method, (x, y))', but definitions typically are for `<subtype>_some_method'. Thus, when you are trying to find calls, you need to grep for `some_method', but this will also catch calls and definitions of that method for instances of other subtypes of `<Type>', and there may be a rather large number of them.
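
The method macros above can be tried out on a toy class. In this sketch `<typ>' is instantiated as the made-up abbreviation TDEV, and struct toy_device, its method table, and the tty_ functions are all hypothetical; only the shape of the four macros follows the definitions given above.

```c
#include <assert.h>
#include <stddef.h>

struct toy_device;

struct toy_device_methods
{
  void (*beep_method) (struct toy_device *d, int volume);
  void (*flush_method) (struct toy_device *d);   /* optional: may be NULL */
};

struct toy_device
{
  const struct toy_device_methods *methods;
  int beeps;
};

#define RAW_TDEVMETH(d, m) ((d)->methods->m##_method)
#define HAS_TDEVMETH_P(d, m) (!!RAW_TDEVMETH (d, m))
#define TDEVMETH(d, m, args) ((RAW_TDEVMETH (d, m)) args)
#define MAYBE_TDEVMETH(d, m, args) do {		\
  struct toy_device *maybe_tdevmeth_d = (d);	\
  if (HAS_TDEVMETH_P (maybe_tdevmeth_d, m))	\
    TDEVMETH (maybe_tdevmeth_d, m, args);	\
} while (0)

/* The "tty" subtype supplies beep but no flush. */
static void
tty_beep (struct toy_device *d, int volume)
{
  d->beeps += (volume > 0);
}

static const struct toy_device_methods tty_methods = { tty_beep, NULL };

/* Returns the number of beeps actually delivered. */
int
toy_ring_and_flush (void)
{
  struct toy_device tty = { &tty_methods, 0 };

  assert (HAS_TDEVMETH_P (&tty, beep));
  TDEVMETH (&tty, beep, (&tty, 50));       /* caller knows beep exists */
  MAYBE_TDEVMETH (&tty, flush, (&tty));    /* skipped: no flush method */
  MAYBE_TDEVMETH (&tty, beep, (&tty, 50));
  return tty.beeps;
}
```

Note that the definition is named tty_beep while every call site mentions only beep, which is exactly the grep problem described above.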

Here is a checklist of things to do when creating a new Lisp object type named foo:

  1. Create foo.h
  2. Create foo.c
  3. Add definitions of syms_of_foo, etc. to `foo.c'
  4. Add declarations of syms_of_foo, etc. to `symsinit.h'
  5. Add calls to syms_of_foo, etc. to `emacs.c'
  6. Add definitions of macros like CHECK_FOO and FOOP to `foo.h'
  7. Add the new type index to enum lrecord_type
  8. Add a DEFINE_*_LISP_OBJECT() to `foo.c'
  9. Add an INIT_LISP_OBJECT call to syms_of_foo() in `foo.c'


11.4 Writing Lisp Primitives

Lisp primitives are Lisp functions implemented in C. The details of interfacing the C function so that Lisp can call it are handled by a few C macros. The only way to really understand how to write new C code is to read the source, but we can explain some things here.

An example of a special operator is the definition of prog1, from `eval.c'. (An ordinary function would have the same general appearance.)

 
DEFUN ("prog1", Fprog1, 1, UNEVALLED, 0, /*
Similar to `progn', but the value of the first form is returned.
\(prog1 FIRST BODY...): All the arguments are evaluated sequentially.
The value of FIRST is saved during evaluation of the remaining args,
whose values are discarded.
*/
       (args))
{
  /* This function can GC */
  REGISTER Lisp_Object val, form, tail;
  struct gcpro gcpro1;

  val = Feval (XCAR (args));

  GCPRO1 (val);

  LIST_LOOP_3 (form, XCDR (args), tail)
    Feval (form);

  UNGCPRO;
  return val;
}

Let's start with a precise explanation of the arguments to the DEFUN macro. Here is a template for them:

 
DEFUN (lname, fname, min_args, max_args, interactive, /*
docstring
*/
   (arglist))

lname
This string is the name of the Lisp symbol to define as the function name; in the example above, it is "prog1".

fname
This is the C function name for this function. This is the name that is used in C code for calling the function. The name is, by convention, `F' prepended to the Lisp name, with all dashes (`-') in the Lisp name changed to underscores. Thus, to call this function from C code, call Fprog1. Remember that the arguments are of type Lisp_Object; various macros and functions for creating values of type Lisp_Object are declared in the file `lisp.h'.

Primitives whose names are special characters (e.g. + or <) are named by spelling out, in some fashion, the special character: e.g. Fplus() or Flss(). Primitives whose names begin with normal alphanumeric characters but also contain special characters are spelled out in some creative way, e.g. let* becomes FletX().

Each function also has an associated structure that holds the data for the subr object that represents the function in Lisp. This structure conveys the Lisp symbol name to the initialization routine that will create the symbol and store the subr object as its definition. The C variable name of this structure is always `S' prepended to the fname. You hardly ever need to be aware of the existence of this structure, since DEFUN plus DEFSUBR takes care of all the details.

min_args
This is the minimum number of arguments that the function requires. The function prog1 allows a minimum of one argument.

max_args
This is the maximum number of arguments that the function accepts, if there is a fixed maximum. Alternatively, it can be UNEVALLED, indicating a special operator that receives unevaluated arguments, or MANY, indicating an unlimited number of evaluated arguments (the C equivalent of &rest). Both UNEVALLED and MANY are macros. If max_args is a number, it may not be less than min_args and it may not be greater than 8. (If you need to add a function with more than 8 arguments, use the MANY form. Resist the urge to edit the definition of DEFUN in `lisp.h'. If you do it anyway, make sure to also add another clause to the switch statement in primitive_funcall().)

interactive
This is an interactive specification, a string such as might be used as the argument of interactive in a Lisp function. In the case of prog1, it is 0 (a null pointer), indicating that prog1 cannot be called interactively. A value of "" indicates a function that should receive no arguments when called interactively.

docstring
This is the documentation string. It is written just like a documentation string for a function defined in Lisp; in particular, the first line should be a single sentence. Note how the documentation string is enclosed in a comment, none of the documentation is placed on the same lines as the comment-start and comment-end characters, and the comment-start characters are on the same line as the interactive specification. `make-docfile', which scans the C files for documentation strings, is very particular about what it looks for, and will not properly extract the doc string if it's not in this exact format.

In order to make both `etags' and `make-docfile' happy, make sure that the DEFUN line contains the lname and fname, and that the comment-start characters for the doc string are on the same line as the interactive specification, and put a newline directly after them (and before the comment-end characters).

arglist
This is the comma-separated list of arguments to the C function. For a function with a fixed maximum number of arguments, provide a C argument for each Lisp argument. In this case, unlike regular C functions, the types of the arguments are not declared; they are simply always of type Lisp_Object.

The names of the C arguments will be used as the names of the arguments to the Lisp primitive as displayed in its documentation, modulo the same concerns described above for F... names (in particular, underscores in the C arguments become dashes in the Lisp arguments).

There is one additional kludge: A trailing `_' on the C argument is discarded when forming the Lisp argument. This allows C language reserved words (like default) or global symbols (like dirname) to be used as argument names without compiler warnings or errors.

A Lisp function with max_args = UNEVALLED is a special operator; its arguments are not evaluated. Instead it receives one argument of type Lisp_Object, a (Lisp) list of the unevaluated arguments, conventionally named (args).

When a Lisp function has no upper limit on the number of arguments, specify max_args = MANY. In this case its implementation in C actually receives exactly two arguments: the number of Lisp arguments (an int) and the address of a block containing their values (a Lisp_Object *). In this case only are the C types specified in the arglist: (int nargs, Lisp_Object *args).

Within the function Fprog1 itself, note the use of the macros GCPRO1 and UNGCPRO. GCPRO1 is used to "protect" a variable from garbage collection--to inform the garbage collector that it must look in that variable and regard the object pointed at by its contents as an accessible object. This is necessary whenever you call Feval or anything that can directly or indirectly call Feval (this includes the QUIT macro!). At such a time, any Lisp object that you intend to refer to again must be protected somehow. UNGCPRO cancels the protection of the variables that are protected in the current function. It is necessary to do this explicitly.

The macro GCPRO1 protects just one local variable. If you want to protect two, use GCPRO2 instead; repeating GCPRO1 will not work. Macros GCPRO3 and GCPRO4 also exist.

These macros implicitly use local variables such as gcpro1; you must declare these explicitly, with type struct gcpro. Thus, if you use GCPRO2, you must declare gcpro1 and gcpro2.
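
A simplified model may help explain why the gcproN variables must be declared and why GCPRO1 cannot simply be repeated. The sketch below is an assumption-laden toy, not the definitions from `lisp.h': the TOY_ macros, struct toy_gcpro, and toy_gcprolist are hypothetical names, and the real records may carry more information. It models protection as pushing records, which live in the caller's stack frame, onto a global chain.

```c
#include <assert.h>
#include <stddef.h>

typedef long toy_object;            /* stand-in for Lisp_Object */

struct toy_gcpro
{
  struct toy_gcpro *next;           /* record registered before this one */
  toy_object *var;                  /* address of the protected local */
};

struct toy_gcpro *toy_gcprolist;    /* chain a collector would walk */

#define TOY_GCPRO1(v) \
  (gcpro1.next = toy_gcprolist, gcpro1.var = &(v), toy_gcprolist = &gcpro1)
#define TOY_GCPRO2(v1, v2) \
  (gcpro1.next = toy_gcprolist, gcpro1.var = &(v1), \
   gcpro2.next = &gcpro1, gcpro2.var = &(v2), toy_gcprolist = &gcpro2)
#define TOY_UNGCPRO (toy_gcprolist = gcpro1.next)

/* Count the records currently on the chain. */
int
toy_count_protected (void)
{
  int n = 0;
  struct toy_gcpro *p;

  for (p = toy_gcprolist; p != NULL; p = p->next)
    n++;
  return n;
}

int
toy_protect_two (void)
{
  toy_object a = 1, b = 2;
  struct toy_gcpro gcpro1, gcpro2;   /* the macros refer to these by name */
  int seen;

  TOY_GCPRO2 (a, b);
  assert (toy_gcprolist == &gcpro2);
  seen = toy_count_protected ();     /* both locals are on the chain */
  TOY_UNGCPRO;                       /* pop them before returning */
  return seen;
}
```

In this model, a second TOY_GCPRO1 in the same function would overwrite gcpro1, dropping the first variable and making the record point at itself. Forgetting TOY_UNGCPRO would leave the chain pointing into a stack frame that no longer exists.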

Note also that the general rule is caller-protects; i.e. you are only responsible for protecting those Lisp objects that you create. Any objects passed to you as arguments should have been protected by whoever created them, so you don't in general have to protect them.

In particular, the arguments to any Lisp primitive are always automatically GCPROed when it is called "normally" from Lisp code or bytecode. So only a few Lisp primitives that are called frequently from C code, such as Fprogn, protect their arguments as a service to their caller. You don't need to protect your arguments when writing a new DEFUN.

GCPROing is perhaps the trickiest and most error-prone part of XEmacs coding. It is extremely important that you get this right and use a great deal of discipline when writing this code. See section GCPROing, for full details on how to do this.

What DEFUN actually does is declare a global structure of type Lisp_Subr whose name begins with capital `SF' and which contains information about the primitive (e.g. a pointer to the function, its minimum and maximum allowed arguments, a string describing its Lisp name); DEFUN then begins a normal C function declaration using the F... name. The Lisp subr object that is the function definition of a primitive (i.e. the object in the function slot of the symbol that names the primitive) actually points to this `SF' structure; when Feval encounters a subr, it looks in the structure to find out how to call the C function.

Defining the C function is not enough to make a Lisp primitive available; you must also create the Lisp symbol for the primitive (the symbol is interned; see section 23.2 Obarrays) and store a suitable subr object in its function cell. (If you don't do this, the primitive won't be seen by Lisp code.) The code looks like this:

 
DEFSUBR (fname);

Here fname is the same name you used as the second argument to DEFUN.

This call to DEFSUBR should go in the syms_of_*() function at the end of the module. If no such function exists, create it and make sure to also declare it in `symsinit.h' and call it from the appropriate spot in main(). See section 11.2 Writing New Modules.

Note that C code cannot call functions by name unless they are defined in C. The way to call a function written in Lisp from C is to use Ffuncall, which embodies the Lisp function funcall. Since the Lisp function funcall accepts an unlimited number of arguments, in C it takes two: the number of Lisp-level arguments, and a one-dimensional array containing their values. The first Lisp-level argument is the Lisp function to call, and the rest are the arguments to pass to it. Since Ffuncall can call the evaluator, you must protect pointers from garbage collection around the call to Ffuncall. (However, Ffuncall explicitly protects all of its parameters, so you don't have to protect any pointers passed as parameters to it.)

The C functions call0, call1, call2, and so on, provide handy ways to call a Lisp function conveniently with a fixed number of arguments. They work by calling Ffuncall.
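
The relationship between the count-plus-array convention and the fixed-argument helpers can be sketched with toy types. Nothing here is the real API: toy_object, toy_plus, toy_funcall, and toy_call2 are hypothetical stand-ins modeled loosely on a MANY primitive, Ffuncall, and call2 respectively, and garbage-collection protection is left out entirely.

```c
#include <assert.h>

typedef struct toy_object toy_object;
typedef toy_object (*toy_function) (int nargs, toy_object *args);

struct toy_object
{
  toy_function fn;   /* set when the object is a function */
  long num;          /* set when the object is a number */
};

/* A MANY-style primitive: sums however many numbers it is handed. */
toy_object
toy_plus (int nargs, toy_object *args)
{
  toy_object sum = { 0, 0 };
  int i;

  for (i = 0; i < nargs; i++)
    sum.num += args[i].num;
  return sum;
}

/* Modeled on Ffuncall: args[0] is the function, the rest its arguments. */
toy_object
toy_funcall (int nargs, toy_object *args)
{
  assert (nargs >= 1 && args[0].fn != 0);
  return args[0].fn (nargs - 1, args + 1);
}

/* Modeled on call2: packs a fixed argument list into an array. */
toy_object
toy_call2 (toy_object fn, toy_object arg0, toy_object arg1)
{
  toy_object args[3];

  args[0] = fn;
  args[1] = arg0;
  args[2] = arg1;
  return toy_funcall (3, args);
}
```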

`eval.c' is a very good file to look through for examples; `lisp.h' contains the definitions for important macros and functions.


11.5 Writing Good Comments

Comments are a lifeline for programmers trying to understand tricky code. In general, the less obvious it is what you are doing, the more you need a comment, and the more detailed it needs to be. You should always be on guard when you're writing code for stuff that's tricky, and should constantly be putting yourself in someone else's shoes and asking if that person could figure out without much difficulty what's going on. (Assume they are a competent programmer who understands the essentials of how the XEmacs code is structured but doesn't know much about the module you're working on or any algorithms you're using.) If you're not sure whether they would be able to, add a comment. Always err on the side of more comments, rather than less.

Generally, when making comments, there is no need to attribute them with your name or initials. This especially goes for small, easy-to-understand, non-opinionated ones. Also, comments indicating where, when, and by whom a file was changed are strongly discouraged, and in general will be removed as they are discovered. This is exactly what `ChangeLogs' are there for. However, it can occasionally be useful to mark exactly where (but not when or by whom) changes are made, particularly when making small changes to a file imported from elsewhere. These marks help when later on a newer version of the file is imported and the changes need to be merged. (If everything were always kept in CVS, there would be no need for this. But in practice, this often doesn't happen, or the CVS repository is later on lost or unavailable to the person doing the update.)

When putting in an explicit opinion in a comment, you should always attribute it with your name and the date. This also goes for long, complex comments explaining in detail the workings of something -- by putting your name there, you make it possible for someone who has questions about how that thing works to determine who wrote the comment so they can write to them. Use your actual name or your alias at xemacs.org, and not your initials or nickname, unless that is generally recognized (e.g. `jwz'). Even then, please consider requesting a virtual user at xemacs.org (forwarding address; we can't provide an actual mailbox). Otherwise, give first and last name. If you're not a regular contributor, you might consider putting your email address in -- it may be in the ChangeLog, but after a while ChangeLogs have a tendency of disappearing or getting muddled. (E.g. your comment may get copied somewhere else or even into another program, and tracking down the proper ChangeLog may be very difficult.)

If you come across an opinion that is not or is no longer valid, or you come across any comment that no longer applies but you want to keep it around, enclose it in `[[ ' and ` ]]' marks and add a comment afterwards explaining why the preceding comment is no longer valid. Put your name on this comment, as explained above.

Just as comments are a lifeline to programmers, incorrect comments are death. If you come across an incorrect comment, immediately correct it or flag it as incorrect, as described in the previous paragraph. Whenever you work on a section of code, always make sure to update any comments to be correct -- or, at the very least, flag them as incorrect.

To indicate a "todo" or other problem, use four pound signs -- i.e. `####'.


11.6 Adding Global Lisp Variables

Global variables whose names begin with `Q' are constants whose value is a symbol of a particular name. The name of the variable should be derived from the name of the symbol using the same rules as for Lisp primitives. These variables are initialized using a call to defsymbol() in the syms_of_*() function. (This call interns a symbol, sets the C variable to the resulting Lisp object, and calls staticpro() on the C variable to tell the garbage-collection mechanism about this variable. What staticpro() does is add a pointer to the variable to a large global array; when garbage-collection happens, all pointers listed in the array are used as starting points for marking Lisp objects. This is important because it's quite possible that the only current reference to the object is the C variable. In the case of symbols, the staticpro() doesn't matter all that much because the symbol is contained in obarray, which is itself staticpro()ed. However, it's possible that a naughty user could do something like uninterning the symbol out of obarray or even setting obarray to a different value [although this is likely to make XEmacs crash!].)

Please note: It is potentially deadly if you declare a `Q...' variable in two different modules. The two calls to defsymbol() are no problem, but some linkers will complain about multiply-defined symbols. The most insidious aspect of this is that often the link will succeed anyway, but then the resulting executable will sometimes crash in obscure ways during certain operations!

To avoid this problem, declare any symbols with common names (such as text) that are not obviously associated with this particular module in the file `general-slots.h'. The "-slots" suffix indicates that this is a file that is included multiple times in `general.c'. Redefinition of preprocessor macros allows the effects to be different in each context, so this is actually more convenient and less error-prone than doing it in your module.

Global variables whose names begin with `V' are variables that contain Lisp objects. The convention here is that all global variables of type Lisp_Object begin with `V', and all others don't (including fixnum and boolean variables that have Lisp equivalents). Most of the time, these variables have equivalents in Lisp, but some don't. Those that do are declared this way by a call to DEFVAR_LISP() in the vars_of_*() initializer for the module. What this does is create a special symbol-value-forward Lisp object that contains a pointer to the C variable, intern a symbol whose name is as specified in the call to DEFVAR_LISP(), and set its value to the symbol-value-forward Lisp object; it also calls staticpro() on the C variable to tell the garbage-collection mechanism about the variable. When eval (or actually symbol-value) encounters this special object in the process of retrieving a variable's value, it follows the indirection to the C variable and gets its value. setq does similar things so that the C variable gets changed.
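
The indirection just described can be sketched in a few lines of plain C. This is only a toy model under stated assumptions, not XEmacs code: `toy_symbol`, `toy_defvar_lisp()`, `toy_symbol_value()`, `toy_setq()` and `Vtoy_variable` are hypothetical names, and a Lisp object is reduced to a `long`.

```c
#include <assert.h>
#include <stddef.h>

typedef long Lisp_Object;            /* stand-in for the real Lisp_Object */

/* A symbol whose value cell either holds a value directly or forwards
   to a C variable (the role played by the symbol-value-forward object). */
struct toy_symbol
{
  const char *name;
  Lisp_Object value;                 /* used when forward == NULL */
  Lisp_Object *forward;              /* address of the C variable, if any */
};

#define TOY_MAX_SYMBOLS 16
static struct toy_symbol toy_obarray[TOY_MAX_SYMBOLS];
static int toy_symbol_count;

/* Hypothetical analogue of DEFVAR_LISP(): intern NAME and make its
   value cell forward to *C_VAR.  Returns NULL if the table is full. */
static struct toy_symbol *
toy_defvar_lisp (const char *name, Lisp_Object *c_var)
{
  struct toy_symbol *sym;

  assert (c_var != NULL);
  if (toy_symbol_count >= TOY_MAX_SYMBOLS)
    return NULL;
  sym = &toy_obarray[toy_symbol_count++];
  sym->name = name;
  sym->value = 0;
  sym->forward = c_var;
  return sym;
}

/* Analogue of symbol-value: follow the indirection when present. */
static Lisp_Object
toy_symbol_value (const struct toy_symbol *sym)
{
  return sym->forward ? *sym->forward : sym->value;
}

/* Analogue of setq: write through to the C variable when forwarded. */
static void
toy_setq (struct toy_symbol *sym, Lisp_Object new_value)
{
  if (sym->forward)
    *sym->forward = new_value;
  else
    sym->value = new_value;
}

/* The C-level variable, named with the V prefix by convention. */
static Lisp_Object Vtoy_variable = 70;
```

After `toy_defvar_lisp ("toy-variable", &Vtoy_variable)`, reading the symbol returns whatever the C variable currently holds, and `toy_setq()` changes the C variable itself, so the two views cannot drift apart.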

Whether or not you DEFVAR_LISP() a variable, you need to initialize it in the vars_of_*() function; otherwise it will end up as all zeroes, which is the integer 0 (not nil), and this is probably not what you want. Also, if the variable is not DEFVAR_LISP()ed, you must call staticpro() on the C variable in the vars_of_*() function. Otherwise, the garbage-collection mechanism won't know that the object in this variable is in use, and will happily collect it and reuse its storage for another Lisp object, and you will be the one who's unhappy when you can't figure out how your variable got overwritten.
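
To see why a forgotten staticpro() is so dangerous, here is a deliberately tiny model of a root array and a mark-and-sweep pass. Everything in it is an assumption made for illustration (`toy_staticpro()`, `toy_alloc()`, `toy_garbage_collect()`, the fixed-size heap); the real collector is far more involved, and the toy objects have no children, so marking is a single step.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy heap object; a stand-in for a real Lisp object. */
struct toy_object
{
  bool marked;
  bool in_use;
};

typedef struct toy_object *Lisp_Object;   /* stand-in type */

#define TOY_HEAP_SIZE 8
#define TOY_MAX_ROOTS 8

static struct toy_object toy_heap[TOY_HEAP_SIZE];
static Lisp_Object *toy_roots[TOY_MAX_ROOTS];  /* addresses of C variables */
static int toy_root_count;

/* Analogue of staticpro(): remember the address of a C variable. */
static void
toy_staticpro (Lisp_Object *var)
{
  assert (toy_root_count < TOY_MAX_ROOTS);
  toy_roots[toy_root_count++] = var;
}

/* Hand out the first free slot, or NULL if the heap is full. */
static Lisp_Object
toy_alloc (void)
{
  int i;

  for (i = 0; i < TOY_HEAP_SIZE; i++)
    if (!toy_heap[i].in_use)
      {
        toy_heap[i].in_use = true;
        toy_heap[i].marked = false;
        return &toy_heap[i];
      }
  return NULL;
}

/* Mark whatever the registered variables point at, then free every
   unmarked object.  Returns the number of objects reclaimed. */
static int
toy_garbage_collect (void)
{
  int i, reclaimed = 0;

  for (i = 0; i < toy_root_count; i++)
    if (*toy_roots[i] != NULL)
      (*toy_roots[i])->marked = true;

  for (i = 0; i < TOY_HEAP_SIZE; i++)
    {
      if (toy_heap[i].in_use && !toy_heap[i].marked)
        {
          toy_heap[i].in_use = false;
          reclaimed++;
        }
      toy_heap[i].marked = false;
    }
  return reclaimed;
}

static Lisp_Object Vtoy_protected;    /* registered with toy_staticpro() */
static Lisp_Object Vtoy_forgotten;    /* never registered: the bug */
```

If only `Vtoy_protected` is passed to `toy_staticpro()`, a collection reclaims the object that `Vtoy_forgotten` still points at, and the next `toy_alloc()` hands the same storage to someone else. That is the "variable got overwritten" symptom in miniature.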



11.7 Writing Macros

Heavily used small code fragments need to be fast. The traditional way to implement such code fragments in C is with macros. But macros in C are known to be broken.

Macro arguments that are repeatedly evaluated may suffer from repeated side effects or suboptimal performance.

Variable names used in macros may collide with caller's variables, causing (at least) unwanted compiler warnings.

In order to solve these problems, and maintain statement semantics, one should use the do { ... } while (0) trick (which safely works inside of if statements) while trying to reference macro arguments exactly once using local variables.

Let's take a look at this poor macro definition:

 
#define MARK_OBJECT(obj) \
  if (!marked_p (obj)) mark_object (obj), did_mark = 1

This macro evaluates its argument twice, and also fails if used like this:

 
  if (flag) MARK_OBJECT (obj); else do_something();

A much better definition is

 
#define MARK_OBJECT(obj) do { \
  Lisp_Object mo_obj = (obj); \
  if (!marked_p (mo_obj))     \
    {                         \
      mark_object (mo_obj);   \
      did_mark = 1;           \
    }                         \
} while (0)

Notice the elimination of double evaluation by using the local variable with the obscure name. Writing safe and efficient macros requires great care. One problem with macros cannot be portably worked around: since a C block has no value, a macro used as an expression rather than a statement cannot use the techniques just described to avoid multiple evaluation.
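
Both failure modes of the poor definition can be observed directly. The sketch below is self-contained and makes several assumptions: `Lisp_Object` is reduced to an array index, `marked_p()` and `mark_object()` are trivial stand-ins, the poor macro is renamed `MARK_OBJECT_POOR` so that both versions can coexist, and `next_object()` is a hypothetical argument expression with a side effect. Note that the `if ... else` case does compile with the poor macro; the trouble is that the `else` silently attaches to the `if` hidden inside the macro.

```c
#include <assert.h>
#include <stdbool.h>

typedef int Lisp_Object;   /* stand-in: here an object is just an index */

#define TOY_OBJECTS 8
static bool marked[TOY_OBJECTS];
static int did_mark;
static int evaluations;

static bool
marked_p (Lisp_Object obj)
{
  assert (obj >= 0 && obj < TOY_OBJECTS);
  return marked[obj];
}

static void
mark_object (Lisp_Object obj)
{
  assert (obj >= 0 && obj < TOY_OBJECTS);
  marked[obj] = true;
}

/* An argument expression with a side effect. */
static Lisp_Object
next_object (void)
{
  evaluations++;
  return 3;
}

#define MARK_OBJECT_POOR(obj) \
  if (!marked_p (obj)) mark_object (obj), did_mark = 1

#define MARK_OBJECT(obj) do { \
  Lisp_Object mo_obj = (obj); \
  if (!marked_p (mo_obj))     \
    {                         \
      mark_object (mo_obj);   \
      did_mark = 1;           \
    }                         \
} while (0)

/* How many times is the argument evaluated? */
static int
evaluations_poor (void)
{
  marked[3] = false;
  evaluations = 0;
  MARK_OBJECT_POOR (next_object ());
  return evaluations;
}

static int
evaluations_good (void)
{
  marked[3] = false;
  evaluations = 0;
  MARK_OBJECT (next_object ());
  return evaluations;
}

/* Does the else branch run?  Object 3 is already marked here. */
static bool
else_ran_poor (bool flag)
{
  bool else_ran = false;
  marked[3] = true;
  if (flag) MARK_OBJECT_POOR (3); else else_ran = true;
  return else_ran;
}

static bool
else_ran_good (bool flag)
{
  bool else_ran = false;
  marked[3] = true;
  if (flag) MARK_OBJECT (3); else else_ran = true;
  return else_ran;
}
```

With the poor macro the argument is evaluated twice, and the `else` branch runs when `flag` is true (because the object is already marked) but not when `flag` is false, which is the opposite of what the caller wrote. The `do { ... } while (0)` version behaves as the source text reads.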

In most cases where a macro has function semantics, an inline function is a better implementation technique. Modern compiler optimizers tend to inline functions even if they have no inline keyword, and configure magic ensures that the inline keyword can be safely used as an additional compiler hint. Inline functions used in a single .c file are easy. The function must already be defined to be static. Just add another inline keyword to the definition.

 
inline static int
heavily_used_small_function (int arg)
{
  ...
}
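
As a concrete illustration of why an inline function beats a macro that is used as an expression, compare the two definitions below. This is a sketch with made-up names (`TOY_MAX_UNSAFE`, `toy_max()`, `toy_next_value()`), not code from the XEmacs sources.

```c
#include <assert.h>

/* An expression-valued macro cannot use do { ... } while (0), so the
   larger argument ends up being evaluated twice. */
#define TOY_MAX_UNSAFE(a, b) ((a) > (b) ? (a) : (b))

/* The inline function evaluates each argument exactly once. */
static inline int
toy_max (int a, int b)
{
  return a > b ? a : b;
}

static int toy_calls;

/* An argument expression with a side effect: returns 1, 2, 3, ... */
static int
toy_next_value (void)
{
  return ++toy_calls;
}

/* Self-check: the inline version calls toy_next_value() exactly once. */
static void
toy_check_single_evaluation (void)
{
  toy_calls = 0;
  assert (toy_max (toy_next_value (), 0) == 1);
  assert (toy_calls == 1);
}
```

Starting from `toy_calls == 0`, the expression `TOY_MAX_UNSAFE (toy_next_value (), 0)` calls `toy_next_value()` twice and yields 2, whereas `toy_max (toy_next_value (), 0)` calls it once and yields 1.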

Inline functions in header files are trickier, because we would like to make the following optimization if the function is not inlined (for example, because we're compiling for debugging). We would like the function to be defined externally exactly once, and each calling translation unit would create an external reference to the function, instead of including a definition of the inline function in the object code of every translation unit that uses it. This optimization is currently only available for gcc. But you don't have to worry about the trickiness; just define your inline functions in header files using this pattern:

 
DECLARE_INLINE_HEADER (
int
i_used_to_be_a_crufty_macro_but_look_at_me_now (int arg)
)
{
  ...
}

We use DECLARE_INLINE_HEADER rather than just the modifier INLINE_HEADER to prevent warnings when compiling with gcc -Wmissing-declarations. I consider issuing this warning for inline functions a gcc bug, but the gcc maintainers disagree.

Every header which contains inline functions, either directly by using DECLARE_INLINE_HEADER or indirectly by using DECLARE_LISP_OBJECT, must be added to the includes of `inline.c' to make the optimization described above work. (Optimization note: if all INLINE_HEADER functions are in fact inlined in all translation units, then the linker can just discard inline.o, since it contains only unreferenced code.)

The three golden rules of macros:

  1. Anything that's an lvalue can be evaluated more than once.
  2. Macros where anything else can be evaluated more than once should have the word "unsafe" in their name (exceptions may be made for large sets of macros that evaluate arguments of certain types more than once, e.g. struct buffer * arguments, when clearly indicated in the macro documentation). These macros are generally meant to be called only by other macros that have already stored the calling values in temporary variables.
  3. Nothing else can be evaluated more than once. Use inline functions, if necessary, to prevent multiple evaluation.

NOTE: The functions and macros below are given full prototypes in their docs, even when the implementation is a macro. In such cases, passing an argument of a type other than expected will produce undefined results. Also, given that macros can do things functions can't (in particular, directly modify arguments as if they were passed by reference), the declaration syntax has been extended to include the call-by-reference syntax from C++, where an & after a type indicates that the argument is an lvalue and is passed by reference, i.e. the function can modify its value. (This is equivalent in C to passing a pointer to the argument, but without the need to explicitly worry about pointers.)
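
To make the `&` notation concrete, here is a hypothetical macro together with the prototype that would appear in its documentation and the pointer-taking function that plain C would otherwise require. `TOY_BUMP` and `toy_bump()` are invented for this sketch and do not exist in XEmacs.

```c
#include <assert.h>
#include <stddef.h>

/* Documented as:  void TOY_BUMP (int &counter, int amount)
   The & is documentation shorthand only; it is not valid C.
   COUNTER must be an lvalue and is modified in place. */
#define TOY_BUMP(counter, amount) do { \
  int tb_amount = (amount);            \
  (counter) += tb_amount;              \
} while (0)

/* The equivalent real C function has to take a pointer instead. */
static void
toy_bump (int *counter, int amount)
{
  assert (counter != NULL);
  *counter += amount;
}
```

A caller writes `TOY_BUMP (n, 2)` for the macro but `toy_bump (&n, 2)` for the function; the effect on `n` is the same.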

When to capitalize macros:



11.8 Proper Use of Unsigned Types

Avoid using unsigned int and unsigned long whenever possible. Unsigned types are viral -- in any arithmetic or comparison that mixes signed and unsigned operands, the signed operand is generally converted to unsigned, which is almost certainly not what you want. Many subtle and hard-to-find bugs are created by careless use of unsigned types. In general, you should almost never use an unsigned type to hold a regular quantity of any sort. The only exceptions are

  1. When there's a reasonable possibility you will actually need all 32 or 64 bits to store the quantity.
  2. When calling existing APIs that require unsigned types. In this case, you should still do all manipulation using signed types, and do the conversion at the very threshold of the API call.
  3. In existing code that you don't want to modify because you don't maintain it.
  4. In bit-field structures.

Other reasonable uses of unsigned int and unsigned long are representing non-quantities -- e.g. bit-oriented flags and such.
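
The "viral" behavior, and the advice to convert only at the threshold of an API call, can be sketched as follows. The function names are hypothetical, and `toy_api_write()` merely stands in for some existing API that insists on an unsigned count.

```c
#include <assert.h>
#include <stddef.h>

/* Mixed comparison: the int is converted to unsigned before comparing,
   so a negative value turns into a huge positive one. */
static int
toy_less_mixed (int a, unsigned int b)
{
  return a < b;
}

/* The same comparison done entirely in a signed type. */
static int
toy_less_signed (int a, int b)
{
  return a < b;
}

/* Stand-in for an existing API that requires an unsigned count;
   it just reports how many bytes it was asked to write. */
static size_t
toy_api_write (const char *data, size_t count)
{
  (void) data;
  return count;
}

/* Keep all arithmetic signed and convert only at the call itself. */
static long
toy_write_tail (const char *data, long length, long offset)
{
  long remaining;

  assert (data != NULL && offset >= 0);
  remaining = length - offset;        /* may legitimately be negative */
  if (remaining <= 0)
    return 0;
  return (long) toy_api_write (data + offset, (size_t) remaining);
}
```

Here `toy_less_mixed (-1, 1u)` is false while `toy_less_signed (-1, 1)` is true. Had `remaining` been computed in an unsigned type, an offset past the end would have wrapped around to an enormous count instead of failing the `remaining <= 0` test.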



11.9 Major Textual Changes

Sometimes major textual changes are made to the source. This means that a search-and-replace is done to change type names and such. Some people disagree with such changes, and certainly such changes, if made without good reason, will just lead to headaches. But it's important to keep the code clean and understandable, and consistent naming goes a long way towards this.

An example of the right way to do this was the so-called "great integral type renaming".

11.9.1 Great Integral Type Renaming  
11.9.2 Text/Char Type Renaming  



11.9.1 Great Integral Type Renaming

The purpose of this is to rationalize the names used for various integral types, so that they match their intended uses and follow consistent conventions, and to eliminate types that were not semantically different from each other.

The conventions are:

For the actual name changes, see the script below.

I ran the following script to do the conversion. (NOTE: This script is idempotent. You can safely run it multiple times and it will not screw up previous results -- in fact, it will do nothing if nothing has changed. Thus, it can be run repeatedly as necessary to handle patches coming in from old workspaces, or old branches.) There are two tags, just before and just after the change: `pre-integral-type-rename' and `post-integral-type-rename'. When merging code from the main trunk into a branch, the best thing to do is first merge up to `pre-integral-type-rename', then apply the script and associated changes, then merge from `post-integral-type-rename' to the present. (Alternatively, just do the merging in one operation; but you may then have a lot of conflicts needing to be resolved by hand.)

Script `fixtypes.sh' follows:

 
----------------------------------- cut ------------------------------------
files="*.[ch] s/*.h m/*.h config.h.in ../configure.in Makefile.in.in ../lib-src/*.[ch] ../lwlib/*.[ch]"
gr Memory_Count Bytecount $files
gr Lstream_Data_Count Bytecount $files
gr Element_Count Elemcount $files
gr Hash_Code Hashcode $files
gr extcount bytecount $files
gr bufpos charbpos $files
gr bytind bytebpos $files
gr memind membpos $files
gr bufbyte intbyte $files
gr Extcount Bytecount $files
gr Bufpos Charbpos $files
gr Bytind Bytebpos $files
gr Memind Membpos $files
gr Bufbyte Intbyte $files
gr EXTCOUNT BYTECOUNT $files
gr BUFPOS CHARBPOS $files
gr BYTIND BYTEBPOS $files
gr MEMIND MEMBPOS $files
gr BUFBYTE INTBYTE $files
gr MEMORY_COUNT BYTECOUNT $files
gr LSTREAM_DATA_COUNT BYTECOUNT $files
gr ELEMENT_COUNT ELEMCOUNT $files
gr HASH_CODE HASHCODE $files
----------------------------------- cut ------------------------------------

The `gr' script, and the scripts it uses, are documented in `README.global-renaming', because if placed in this file they would need to have their @ characters doubled, meaning you couldn't easily cut and paste from the source.

In addition to those programs, I needed to fix up a few other things, particularly relating to the duplicate definitions of types, now that some types merged with others. Specifically:

  1. in `lisp.h', removed duplicate declarations of Bytecount. The changed code should now look like this: (In each code snippet below, the first and last lines are the same as the original, as are all lines outside of those lines. That allows you to locate the section to be replaced, and replace the stuff in that section, verifying that there isn't anything new added that would need to be kept.)

     
    --------------------------------- snip -------------------------------------
    /* Counts of bytes or chars */
    typedef EMACS_INT Bytecount;
    typedef EMACS_INT Charcount;
    
    /* Counts of elements */
    typedef EMACS_INT Elemcount;
    
    /* Hash codes */
    typedef unsigned long Hashcode;
    
    /* ------------------------ dynamic arrays ------------------- */
    --------------------------------- snip -------------------------------------
    

  2. in `lstream.h', removed duplicate declaration of Bytecount. Rewrote the comment about this type. The changed code should now look like this:

     
    --------------------------------- snip -------------------------------------
    #endif
    
    /* The have been some arguments over the what the type should be that
       specifies a count of bytes in a data block to be written out or read in,
       using Lstream_read(), Lstream_write(), and related functions.
       Originally it was long, which worked fine; Martin ``corrected'' these to
       size_t and ssize_t on the grounds that this is theoretically cleaner and
       is in keeping with the C standards.  Unfortunately, this practice is
       horribly error-prone due to design flaws in the way that mixed
       signed/unsigned arithmetic happens.  In fact, by doing this change,
       Martin introduced a subtle but fatal error that caused the operation of
       sending large mail messages to the SMTP server under Windows to fail.
       By putting all values back to be signed, avoiding any signed/unsigned
       mixing, the bug immediately went away.  The type then in use was
       Lstream_Data_Count, so that it be reverted cleanly if a vote came to
       that.  Now it is Bytecount.
    
       Some earlier comments about why the type must be signed: This MUST BE
       SIGNED, since it also is used in functions that return the number of
       bytes actually read to or written from in an operation, and these
       functions can return -1 to signal error.
    
       Note that the standard Unix read() and write() functions define the
       count going in as a size_t, which is UNSIGNED, and the count going
       out as an ssize_t, which is SIGNED.  This is a horrible design
       flaw.  Not only is it highly likely to lead to logic errors when a
       -1 gets interpreted as a large positive number, but operations are
       bound to fail in all sorts of horrible ways when a number in the
       upper-half of the size_t range is passed in -- this number is
       unrepresentable as an ssize_t, so code that checks to see how many
       bytes are actually written (which is mandatory if you are dealing
       with certain types of devices) will get completely screwed up.
    
       --ben
    */
    
    typedef enum lstream_buffering
    --------------------------------- snip -------------------------------------
    

  3. in `dumper.c', there are four places, all inside of switch() statements, where XD_BYTECOUNT appears twice as a case tag. In each case, the two case blocks contain identical code, and you should *REMOVE THE SECOND* and leave the first.



11.9.2 Text/Char Type Renaming

The purpose of this was

  1. To distinguish between "charptr" when it refers to operations on the pointer itself and when it refers to operations on text
  2. To use consistent naming for everything referring to internal format, i.e.

 
	Itext == text in internal format
	Ibyte == a byte in such text
	Ichar == a char as represented in internal character format

Thus e.g.

 
	set_charptr_emchar -> set_itext_ichar

This was done using a script like this:

 
files="*.[ch] s/*.h m/*.h config.h.in ../configure.in Makefile.in.in ../lib-src/*.[ch] ../lwlib/*.[ch]"
gr Intbyte Ibyte $files
gr INTBYTE IBYTE $files
gr intbyte ibyte $files
gr EMCHAR ICHAR $files
gr emchar ichar $files
gr Emchar Ichar $files
gr INC_CHARPTR INC_IBYTEPTR $files
gr DEC_CHARPTR DEC_IBYTEPTR $files
gr VALIDATE_CHARPTR VALIDATE_IBYTEPTR $files
gr valid_charptr valid_ibyteptr $files
gr CHARPTR ITEXT $files
gr charptr itext $files
gr Charptr Itext $files

As noted above, the `gr' script and the scripts it uses are documented in `README.global-renaming'.

As in the integral-types change, there are pre and post tags before and after the change:

 
	pre-internal-format-textual-renaming
	post-internal-format-textual-renaming

When merging a large branch, follow the same sort of procedure documented above, using these tags -- essentially sync up to the pre tag, then apply the script yourself, then sync from the post tag to the present. You can probably do the same if you don't have a separate workspace, but do have lots of outstanding changes and you'd rather not just merge all the textual changes directly. Use something like this:

(WARNING: I'm not a CVS guru; before trying this, or any large operation that might potentially mess things up, DEFINITELY make a backup of your existing workspace.)

 
cup -r pre-internal-format-textual-renaming
<apply script>
cup -A -j post-internal-format-textual-renaming -j HEAD

This might also work:

 
cup -j pre-internal-format-textual-renaming
<apply script>
cup -j post-internal-format-textual-renaming -j HEAD

ben

The following is a script to go in the opposite direction:

 
files="*.[ch] s/*.h m/*.h config.h.in ../configure.in Makefile.in.in ../lib-src/*.[ch] ../lwlib/*.[ch]"

# Evidently Perl considers _ to be a word char ala \b, even though XEmacs
# doesn't.  We need to be careful here with ibyte/ichar because of words
# like Richard, eicharlen(), multibyte, HIBYTE, etc.

gr Ibyte Intbyte $files
gr '\bIBYTE' INTBYTE $files
gr '\bibyte' intbyte $files
gr '\bICHAR' EMCHAR $files
gr '\bichar' emchar $files
gr '\bIchar' Emchar $files
gr '\bIBYTEPTR' CHARPTR $files
gr '\bibyteptr' charptr $files
gr '\bITEXT' CHARPTR $files
gr '\bitext' charptr $files
gr '\bItext' CHARPTR $files

gr '_IBYTE' _INTBYTE $files
gr '_ibyte' _intbyte $files
gr '_ICHAR' _EMCHAR $files
gr '_ichar' _emchar $files
gr '_Ichar' _Emchar $files
gr '_IBYTEPTR' _CHARPTR $files
gr '_ibyteptr' _charptr $files
gr '_ITEXT' _CHARPTR $files
gr '_itext' _charptr $files
gr '_Itext' _CHARPTR $files



11.10 Debugging and Testing

To make a purified XEmacs, do: make puremacs. To make a quantified XEmacs, do: make quantmacs.

You simply can't dump Quantified and Purified images (unless using the portable dumper). Purify gets confused when xemacs frees memory in one process that was allocated in a different process on a different machine! Run it like so:

 
temacs -batch -l loadup.el run-temacs xemacs-args...

To make an XEmacs that can tell valgrind to do a memory leak check at runtime, configure --with-valgrind. If XEmacs has been configured --with-newgc, then valgrind must be invoked with --vex-iropt-precise-memory-exns=yes in order to handle signals properly.

Before you go through the trouble, are you compiling with all debugging and error-checking off? If not, try that first. Be warned that while Quantify is directly responsible for quite a few optimizations which have been made to XEmacs, doing a run which generates results which can be acted upon is not necessarily a trivial task.

Also, if you're still willing to do some runs make sure you configure with the `--quantify' flag. That will keep Quantify from starting to record data until after the loadup is completed and will shut off recording right before it shuts down (which generates enough bogus data to throw most results off). It also enables three additional elisp commands: quantify-start-recording-data, quantify-stop-recording-data and quantify-clear-data.

If you want to make XEmacs faster, target your favorite slow benchmark, run a profiler like Quantify, gprof, or tcov, and figure out where the cycles are going. In many cases you can localize the problem (because a particular new feature or even a single patch elicited it). Don't hesitate to use brute force techniques like a global counter incremented at strategic places, especially in combination with other performance indications (e.g., degree of buffer fragmentation into extents).
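
A brute-force counter of the kind mentioned above needs very little machinery. The following is a minimal sketch under stated assumptions: the probe names, `toy_probe()`, `toy_report_probes()` and the linear search being measured are all hypothetical, and in real use the probes would be dropped into whatever code path is under suspicion.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical probe identifiers; the names are illustrative only. */
enum toy_probe { TOY_PROBE_ALLOC, TOY_PROBE_LOOKUP, TOY_PROBE_COUNT };

/* Signed counters, in keeping with the advice on unsigned types. */
static long toy_probe_hits[TOY_PROBE_COUNT];

static inline void
toy_probe (enum toy_probe which)
{
  assert ((int) which >= 0 && (int) which < TOY_PROBE_COUNT);
  toy_probe_hits[which]++;
}

/* Stand-in for the code under investigation: a linear search. */
static int
toy_lookup (const int *table, int size, int key)
{
  int i;

  for (i = 0; i < size; i++)
    {
      toy_probe (TOY_PROBE_LOOKUP);   /* one hit per comparison */
      if (table[i] == key)
        return i;
    }
  return -1;
}

/* Dump the counters, e.g. just before exiting. */
static void
toy_report_probes (FILE *out)
{
  fprintf (out, "alloc: %ld, lookup: %ld\n",
           toy_probe_hits[TOY_PROBE_ALLOC],
           toy_probe_hits[TOY_PROBE_LOOKUP]);
}
```

Calling `toy_report_probes (stderr)` at the end of a benchmark run shows how often each instrumented spot was reached, which is often enough to tell whether a suspected hot path really is hot.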

Specific projects:

Unfortunately, Emacs Lisp is slow, and is going to stay slow. Function calls in elisp are especially expensive. Iterating over a long list is going to be 30 times faster implemented in C than in Elisp.

To get started debugging XEmacs, take a look at the `.gdbinit' and `.dbxrc' files in the `src' directory. See the section in the XEmacs FAQ on How to Debug an XEmacs problem with a debugger.

After making source code changes, run make check to ensure that you haven't introduced any regressions. If you want to make XEmacs more reliable, please improve the test suite in `tests/automated'.

Did you make sure you didn't introduce any new compiler warnings?

Before submitting a patch, please try compiling at least once with

 
configure --with-mule --use-union-type --error-checking=all



This document was generated by XEmacs Webmaster on August, 3 2012 using texi2html