Re: compiled lisp file format (Re: Skipping unexec via a big .elc file)

On May 28, 2017, at 08:43, Philipp Stephani <address@hidden> wrote:

Ken Raeburn <address@hidden> schrieb am So., 28. Mai 2017 um 13:07 Uhr:

On May 21, 2017, at 04:53, Paul Eggert <address@hidden> wrote:

> Ken Raeburn wrote:
>> The Guile project has taken this idea pretty far; they’re generating ELF object files with a few special sections for Guile objects, using the standard DWARF sections for debug information, etc. While it has a certain appeal (making C modules and Lisp files look much more similar, maybe being able to link Lisp and C together into one executable image, letting GDB understand some of your data), switching to a machine-specific format would be a pretty drastic change, when we can currently share the files across machines.
>
> Although it does indeed sound like a big change, I don't see why it would prevent us from sharing the files across machines. Emacs can use standard ELF and DWARF format on any platform if Emacs is doing the loading. And there should be some software-engineering benefit in using the same format that Guile uses.

Sorry for the delay in responding.

The ELF format has header fields indicating the word size, endianness, machine architecture (though there’s a value for “none”), and OS ABI. Some fields vary in size or order depending on whether the 32-bit or 64-bit format is in use. Some other format details (e.g., relocation types, interpretation of certain ranges of values in some fields) are architecture- or OS-dependent; we might not care about many of those details, but relocations are likely needed if we want to play linking games or use DWARF.

I think Guile is using whatever the native word size and architecture are. If we do that for Emacs, they’re not portable between platforms. Currently it works for me to put my Lisp files, both source and compiled, into ~/elisp and use them from different kinds of machines if my home directory is NFS-mounted.

We could instead pick fixed values (say, architecture “none”, little-endian, 32-bit), but then there’s no guarantee that we could use any of the usual GNU tools on them without a bunch of work, or that we’d ever be able to use non-GNU tools to treat them as object files. Then again, we couldn’t expect to do the latter portably anyway, since some of the platforms don’t even use ELF.

Is there any significant advantage of using ELF, or could this just use one of the standard binary serialization formats (protobuf, flatbuffer, ...)?

That’s an interesting idea. If one of the popular serialization libraries is compatibly licensed, easy to use, and performs well, it may be better than rolling our own. It’ll need to handle data structures with circular or cross-linked references. And we have the doc string delayed-loading optimization (that currently uses #$ and #@ syntaxes); presumably we’d like to keep that optimization in some form. It would be good not to have to build all our data structures on ones generated by the tool with its own bookkeeping fields; having anything in a cons cell besides the “car” and “cdr” slots would mean a significant increase in memory use.

I initially said, “follow the model of flat object file formats”, not “use ELF”; ELF is just one way of organizing the data of an object file, with years of experience behind it, which we could use wholesale or borrow some lessons from. One of the typical advantages of object file formats is that the data is grouped for efficient memory usage; some sections of a file will be mapped into the address space read-only (shared between processes), other sections read-write (possibly shared until copied on write), and others not mapped at all. For example, we might put symbol names (normally never modified but it can be done), doc strings (to be loaded later, only if needed), byte code, and other strings into their own sections, and create Lisp_String objects and such pointing to those bytes as needed. We don’t keep much in the way of source location information for Lisp code around, but if we ever change that, arguably it could go in a file section that’s not mapped or read until the debugger wants the information.

The Guile project’s documentation says their use of ELF is intended to build on existing work to invent a good object file format with several desired characteristics (https://www.gnu.org/software/guile/manual/html_node/Object-File-Format.html):

	• Above all else, it should be very cheap to load a compiled file.
	• It should be possible to statically allocate constants in the file. For example, a bytevector literal in source code can be emitted directly into the object file.
	• The compiled file should enable maximum code and data sharing between different processes.
	• The compiled file should contain debugging information, such as line numbers, but that information should be separated from the code itself. It should be possible to strip debugging information if space is tight.

They’re generating byte code currently, but are looking forward towards generating native code as well (instead?).

Their write-up implicitly assumes that, as with “normal” object files, the idea is to mmap the data into the address space, some of it read-only and some of it automatically getting some patching up, and then using those in-memory objects directly.  There’s no explicit discussion of the tradeoffs of loading a file all at once versus reading one object tree (S-_expression_) at a time from an input stream, but especially when mapping and using much of the data unmodified is feasible, I suspect the all-at-once approach is likely to be more efficient.  Whether that would be true in a case like Emacs, I don’t know.

They use DWARF for carrying some debug information, but so far I’m unsure what information is actually stored there.

Ken

From:	Ken Raeburn
Subject:	Re: compiled lisp file format (Re: Skipping unexec via a big .elc file)
Date:	Mon, 29 May 2017 05:33:50 -0400