Re: Translating Modula-2 identifiers to C

Hi Gaius

On Thu, 11 May 2023 at 01:05, Gaius Mulley <gaiusmod2@gmail.com> wrote:

The issue of name mangling above is an interesting idea.

I think the default use should be simple, so that the gdb user sees
procedure and identifier names exactly as they appear in the source code.
(Aiming to make it easy for first year undergrads).

That's very easy to do. You do the development and debugging with the
compiler switch turned off, so that all names in gdb appear exactly the
way they do now. Once you are ready for deployment of the library to be
used from C, then you rebuild it with the compiler switch turned on.

This way there is absolutely no need to let gdb representation issues impact
design decisions.

The compiler should also allow more complex name mangling for advanced
use.

Currently
=========

In gm2 there are named paths which are prefixed to the gcc generated
symbol name. Different libraries have named paths by default: for
example, the m2pim and m2iso libraries have path names associated with
their default locations.

For example the m2pim libraries might be installed at:

$HOME/opt/lib/gcc/x86_64-pc-linux-gnu/13.0.1/m2/m2pim/StrIO.def

and the driver gm2 sets up the named paths, so that a call to
StrIO.WriteString appears as a call to an external function
m2pim_StrIO_WriteString.

This allows ISO and PIM libraries to coexist even if they have the same
module name and a different interface (ISO Storage, PIM Storage, ISO
SYSTEM and PIM SYSTEM, for example).

I very much doubt that you would want to write any library intended for
use within C based projects and make use of either PIM or ISO libraries.
The mere dependency on those libraries will be an impediment for using
the library within C. So, I would think you'd rather use C's stdlib and hook
ALLOCATE into malloc(), and DEALLOCATE into free().

HOWEVER, you could still prefix the module prefix with a framework prefix,
if so desired. The heavy lifting of my library is done in the lower level module
that converts any string from mixed case to snake_case and stores it in a
dictionary. The user level library, which composes the qualified identifiers by
affixing module prefixes, type suffixes and local suffixes, and also converts
to uppercase for macro identifiers, mostly consists of calling the former
lower level library and then stitching the returned strings together.

It would thus be very little effort to add another optional parameter for a
project prefix to be prepended before the module prefix for fully qualified
names.
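
A rough illustration of that two-layer structure, as a Python sketch and not
the actual library: the function names, the word-boundary rules and the
single-lowline separator are all assumptions made for the example.

```python
import re

_cache = {}  # dictionary of strings that have already been converted

def to_snake_case(ident):
    """Lower level: mixed case -> snake_case, remembered in a dictionary."""
    if ident not in _cache:
        # "fooBar" -> "foo_Bar": lowercase letter or digit followed by uppercase
        s = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", ident)
        # "IOError" -> "IO_Error": end of an acronym followed by a capitalised word
        s = re.sub(r"(?<=[A-Z])(?=[A-Z][a-z])", "_", s)
        _cache[ident] = s.lower()
    return _cache[ident]

def qualified_name(module, ident, project=None):
    """User level: stitch optional project prefix, module prefix and identifier."""
    parts = [to_snake_case(p) for p in (project, module, ident) if p]
    return "_".join(parts)  # separator choice is an assumption

def macro_name(module, ident, project=None):
    """Same composition, converted to uppercase for macro identifiers."""
    return qualified_name(module, ident, project).upper()
```

With these assumed rules, qualified_name("StrIO", "WriteString") gives
str_io_write_string, and passing project="m2pim" prepends m2pim_ to it.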

gm2 allows _ in any identifier and so it is possible to choose
identifiers which will clash with the name mangling schema above.

Unless you limit lowline use to non-leading, non-trailing and non-consecutive
occurrences. With that minor restriction, it won't clash. I designed it that way
because my translator/compiler permits non-leading, non-trailing and
non-consecutive lowlines in identifiers (when enabled by compiler switch).

You can always have two lowline-identifier modes, one with the restriction
above, and one without. And when the restriction is turned off, then the
compiler switch for snake-case/macro-case identifiers will be turned off.

Very simple.
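
The restriction and the mutual exclusion of the two switches could be checked
along these lines. This is only a Python sketch; the function and parameter
names are hypothetical and do not correspond to any existing compiler option.

```python
import re

# Letters and digits separated by single lowlines:
# no leading, no trailing and no consecutive "_".
_RESTRICTED = re.compile(r"[A-Za-z][A-Za-z0-9]*(?:_[A-Za-z0-9]+)*")

def lowlines_restricted(ident):
    """True if the identifier obeys the restricted lowline rule."""
    return _RESTRICTED.fullmatch(ident) is not None

def check_options(restricted_lowlines, snake_case_output):
    """Reject snake-case output when unrestricted lowlines are enabled."""
    if snake_case_output and not restricted_lowlines:
        raise ValueError("snake-case output requires restricted lowline mode")
```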

Proposed change
===============

I wonder if the following algorithm would resolve the above issue:

In order of priority:

0. DEFINITION FOR "C". Turns off default name mangling for the entire
module.

First, DEFINITION FOR is neither PIM, nor ISO syntax, nor is it in line with
what PIM compilers in the market back in the day used for foreign definition
modules. Back in the day of PIM, the most common I have seen was
FOREIGN DEFINITION MODULE. Some compilers I have seen used
pragmas for the same purpose.

The pragma route is preferable because a compiler that does not support
this can still accept the code and ignore the pragma. The code is semantically
still valid and a Modula-2 implementation could be written to match it.
Pragmas are non-semantic directives, just the right tool for the purpose.

So, I suggest you consider changing this to the <*FFI="C"*> pragma that
we use in our M2R10 specification, but this isn't anything new, like I said
some old compilers I have seen used pragmas, too.

Second, the purpose of a foreign definition module is to provide a
Modula-2 interface for a foreign library, that is to say, there is no
implementation module then.

This won't work when you want to write a Modula-2 library for use
within C. You will have to supply a Modula-2 definition and implementation
module for that purpose. The compiler then needs to translate the
definition module into a matching C header file, and the implementation
module into an object file that can be used from C together with the
generated header file.

Thus, we have two different scenarios:

(1) using a C library from within Modula-2

(2) providing a Modula-2 library for use within C

I am talking about scenario #2. Your DEFINITION FOR "C" syntax
is for scenario #1 and probably shouldn't be shoehorned onto #2
because there are significant differences in how the two scenarios
need to be handled by the compiler. It is better to separate them.

Against this background, I'd suggest a different pragma for your
use case of having an entirely flat namespace:

DEFINITION MODULE FooLib <*FLATNAMESPACE*>;

However, whilst there may be some cases where this may be
useful, you probably don't want to write any larger piece of code
using this mode.

You would then have to write ALLCAPS for all your constants and for
all your enumeration types and enumerated values, for example.

And you are giving up one of the major advantages of Modula-2: its
module namespaces can easily be auto-translated to C by automatic
module prefixing.

Plus, you need to be aware of any potential name clashes with C.
For example, you couldn't declare a variable named switch, since that is
a reserved word in C. There are good reasons why we have
automated the translation of programming languages.

When you write the code you want to focus on the task at hand,
not have your mind divided by name translation issues.
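
As an illustration of the kind of check a flat namespace would force onto the
programmer or the compiler, here is a small Python sketch. The keyword set
covers only the lowercase C99 keywords, and the function names are
hypothetical.

```python
# Lowercase C99 keywords; identifiers passed through unchanged must avoid these.
C_KEYWORDS = frozenset("""
auto break case char const continue default do double else enum extern
float for goto if inline int long register restrict return short signed
sizeof static struct switch typedef union unsigned void volatile while
""".split())

def clashes_with_c(ident):
    """True if the identifier, emitted unchanged, would be a C keyword."""
    return ident in C_KEYWORDS

def check_flat_names(idents):
    """Return the sorted list of identifiers that cannot be emitted as-is."""
    return sorted(i for i in idents if clashes_with_c(i))
```

With automatic module prefixing, a variable switch in module FooLib would come
out as something like foo_lib_switch and the clash would not arise in the
first place.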

1. <* gcc-name: foo_bar *> The attribute will override the symbol
name as given to the GCC backend.

Having a pragma to supply a custom name for the occasional identifier
is certainly useful and I have already put that on my to-do list for my
compiler/translator:

PROCEDURE FooBar <*CNAME="foobar"*> ( baz : Bam );

where CNAME stands for custom name, not C name.

However, if you want to write an entire library this way, again, that
would be very cumbersome. Automation is your friend. And then you
supply custom names only in the odd cases, that's much less hassle.

2. <* gcc-mangle: (format specifiers to determine style of mangling)
*>

I would recommend not doing this in your code, but by compiler switch.
You may want to change the output format later for a different purpose.
Assuming that you may at some point have support for multiple
styles, it would then be a major hassle to go into every file and change
all those pragmas.

Basically, this should be considered akin to a different target architecture.
You don't tell the compiler in the code what architecture it should generate
code for. You do that on the command line or in the make file, but not in
your code.

And the same reason why you do it that way also applies here.
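
To make the analogy concrete, here is a Python sketch of a toy translator
front end that takes the identifier style from the command line. The program
name and the --ident-style option are invented for this example; they are not
gm2 options.

```python
import argparse
import re

def build_parser():
    """Command line of a hypothetical translator; all names are made up."""
    parser = argparse.ArgumentParser(prog="m2toc-sketch")
    parser.add_argument("--ident-style", choices=["asis", "snake"],
                        default="asis",
                        help="how identifiers appear in the output")
    parser.add_argument("source", help="Modula-2 source file")
    return parser

def output_name(ident, style):
    """Apply the style chosen on the command line, not in the source."""
    if style == "snake":
        return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", ident).lower()
    return ident
```

The source files stay untouched; switching styles is a matter of changing one
argument in the make file.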

3. Any symbol containing a leading or trailing or consecutive
occurrences of lowline chars attracts a warning message.

I would make the feature mutually exclusive with unrestricted use
of lowline in identifiers.

4. Non exported identifiers appear as symbols with no mangling.

That is only sensible if you can say with certainty that GM2 will never
ever generate C source code from Modula-2 input.

Because if you don't transform the private identifiers in the same way
then your generated code will have a mishmash of different styles. And
if you ever generate C output, this will become visible and it will get the
code rejected by any self-respecting open source project out there.

5. The default namedpath__modulename__procedurename schema
is applied.

As mentioned above, it would be a minor effort to add a framework prefix
to my library, even though it is rather unlikely you'd want to use PIM/ISO
library dependencies in any C based project, so you'd likely avoid using
them when you develop a library for this purpose.

The detail is in [2] above. [1] and [2] can occur on a scope or per
identifier declaration. Mangling specifiers were used in p2c iirc.
But I had thought that some of the format ideas taken from
https://github.com/gcc-mirror/gcc/blob/master/gcc/m2/gm2-compiler/M2MetaError.def
might be useful to drive/implement the format specifier code. This would
allow users to specify the mangling schema on a per module or per
identifier basis if required.

If you go from camel- and title-case in Modula-2 to a snake-case/macro-case style
in C, then you have a lossy conversion where there is opportunity for name clashes.

For this reason, you can't treat all symbols with the same transformation. Instead,
you need slightly different transformations for each kind of identifier. Ideally you
have one transformation each for (1) constants, (2) types, (3) variables, (4) functions
and (5) procedures. Then you need another transformation for each of these, with
the exception of variables, for nested functions/procedures, and in PIM/ISO dialects
also one of each for nested modules.

So, you would need to write quite a lot of boilerplate if you were to come up with
some notation that specifies transformations. Again, cumbersome to use.
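
A small Python sketch of why the conversion is lossy and how one
transformation per kind of identifier avoids the clash. The concrete affixes
(uppercase for constants, a _t suffix for types) are assumptions for the
example, and functions and procedures share one rule here only to keep it
short.

```python
import re

def to_snake_case(ident):
    """Lossy: "FooBar" and "fooBar" both become "foo_bar"."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", ident).lower()

# One transformation per kind of identifier; the affixes are assumptions.
TRANSFORMS = {
    "const": lambda s: to_snake_case(s).upper(),
    "type": lambda s: to_snake_case(s) + "_t",
    "var": to_snake_case,
    "func": to_snake_case,
    "proc": to_snake_case,
}

def translate(kind, ident):
    """Translate an identifier using the rule for its kind."""
    return TRANSFORMS[kind](ident)
```

A type FooBar and a variable fooBar collide under the plain conversion, but
translate("type", "FooBar") and translate("var", "fooBar") yield foo_bar_t and
foo_bar respectively.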

As I understand it a LGPL library can't be used as part of GCC and there
are two legal prerequisites:

1. the licence should be GPLv3 for the compiler or GPLv3 with GCC runtime
exemptions for a runtime library.
2. copyright has to be signed over to the FSF.

(but I stand to be corrected :-)

Those are certainly not legal prerequisites for inclusion into GPL projects. The
LGPL was specifically designed by Eben Moglen (FSF counsel) to allow inclusion.
They may be GCC policy though. That I wouldn't know.

However, the library will likely need some modifications if you want to incorporate
it into GM2. For example, in the user level module I pass the identifiers as
interned string objects created by my interned strings library. There is no need
for that, but since I intern all lexemes in my translator/compiler, it is just the
most sensible thing to do. But if you want to incorporate it, I will modify that
so the procedures take a const char* instead. Easy to do.

Anyway, the point is, that whatever adjustments I would be making for
incorporation into GM2, I could relicense under whatever license and terms
you need. That's not an issue.

There would need to be some mention that the code was derived from
my original library and relicensed for GM2, so that the FSF cannot then
knock on my door and tell me that my original library is a knock off. ;-)

regards

benjamin

From:	Benjamin Kowarsch
Subject:	Re: Translating Modula-2 identifiers to C
Date:	Thu, 11 May 2023 03:04:59 +0900