bug-apl

Re: [Bug-apl] Regex support


From: Elias Mårtenson
Subject: Re: [Bug-apl] Regex support
Date: Tue, 3 Oct 2017 12:14:09 +0800

In the default mode, as I have demonstrated earlier, when the regexp has parenthesised subexpressions, the strings matching those expressions will be returned as separate strings. This is logical and in my opinion makes perfect sense.

When using ⊂-mode, parenthesised expressions don't change the behaviour at all, as there is no natural behaviour to implement in this case.

However, it would be nice to have a way to use subexpressions to split strings, so I'm thinking of something like the following:

      "([0-9]{4})-([0-9]{2})-([0-9]{2})" ⎕RE[something] "foo 2010-02-03"
┏→━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃0 0 0 0 1 1 1 1 0 2 2 0 3 3┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━┛

Note that this variation is different from the previous one in that the ⊂-mode described in my previous email repeatedly calls the matching function, marking each result in the output bitmap, while the proposed version above runs the match only once, marking the subexpressions in the result.
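
To make the single-match variation concrete, here is a rough sketch in Python, whose `re` module accepts the same pattern. It only illustrates the proposed output and is not GNU APL code; the name `group_bitmap` is made up for this example.

```python
import re

def group_bitmap(pattern, text):
    """Run the match once and mark each character with the number of the
    subexpression that captured it (0 = outside every subexpression)."""
    marks = [0] * len(text)
    m = re.search(pattern, text)
    if m is None:
        return marks
    for group in range(1, m.re.groups + 1):
        start, end = m.span(group)  # (-1, -1) if the group took no part
        for i in range(start, end):
            marks[i] = group        # nested groups: the inner one wins
    return marks

print(group_bitmap(r"([0-9]{4})-([0-9]{2})-([0-9]{2})", "foo 2010-02-03"))
# [0, 0, 0, 0, 1, 1, 1, 1, 0, 2, 2, 0, 3, 3]
```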

I'm starting to think that both are needed, but what symbols should be used in the axis argument to indicate the desired mode?

An alternative output for the same expression would be something like the following, which would match pretty much exactly what the underlying PCRE function returns:

┏→━━━━┓
↓ 4 14┃
┃ 4  8┃
┃10 11┃
┃13 14┃
┃ 4 14┃
┗━━━━━┛
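
For comparison, a small Python sketch of an offsets-style result. It assumes zero-based start offsets and exclusive end offsets (the convention of Python's `re`, and of PCRE's output vector), so its rows need not agree digit for digit with the hand-written table above. The name `match_offsets` is invented for the example.

```python
import re

def match_offsets(pattern, text):
    """Return [start, end] rows: the whole match first, then one row per
    parenthesised subexpression. Offsets are zero-based, end exclusive."""
    m = re.search(pattern, text)
    if m is None:
        return []
    return [list(m.span(g)) for g in range(m.re.groups + 1)]

for row in match_offsets(r"([0-9]{4})-([0-9]{2})-([0-9]{2})", "foo 2010-02-03"):
    print(row)
# [4, 14]
# [4, 8]
# [9, 11]
# [12, 14]
```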

Would this be a useful variation too? And if so, what axis marker should be used for it?

Regards,
Elias

On 3 October 2017 at 01:30, Juergen Sauermann <address@hidden> wrote:
Hi Elias,

I believe it is better to keep things together, i.e. in a single ⎕ function than in several.

It may be intuitive to use the character ⊂ instead of B in the axis argument
to indicate that the result is meant for dyadic ⊂.

/// Jürgen


On 10/02/2017 10:47 AM, Elias Mårtenson wrote:
In playing around with this, I realise that the "B" mode is quite useful. So much so, in fact, that I'm wondering if it's warranted to have a dedicated quad-function for this specific behaviour.

Here's an example of extracting sequences of 4 characters:

      {⍵ ⊂⍨ "[a-z]{4}" ⎕RE['B'] ⍵} 'abcdef45abchello9'
┏→━━━━━━━━━━━━━━━━━━━┓
┃"abcd" "abch" "ello"┃
┗∊━━━━━━━━━━━━━━━━━━━┛

Regards,
Elias

On 2 October 2017 at 16:27, Elias Mårtenson <address@hidden> wrote:
Some progress:

The behaviour I described earlier still works, but now has the ability to work on N-dimensional arrays of strings, compiling the regex only once and then applying it to all the cells.
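
The compile-once idea can be sketched in Python as follows. Nested lists stand in for an N-dimensional array, and what each cell returns here (the matched text, or an empty string) is a placeholder chosen for the example, not the actual result format of ⎕RE. The name `apply_to_cells` is hypothetical.

```python
import re

def apply_to_cells(pattern, cells):
    """Compile the pattern once, then apply it to every string cell of a
    nested list, preserving the nesting."""
    compiled = re.compile(pattern)  # compiled exactly once

    def visit(cell):
        if isinstance(cell, str):
            m = compiled.search(cell)
            return m.group(0) if m else ""  # placeholder per-cell result
        return [visit(c) for c in cell]

    return visit(cells)

print(apply_to_cells(r"[0-9]+", [["a1", "b22"], ["c", "d333e"]]))
# [['1', '22'], ['', '333']]
```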

In addition to this, I have now also added a flag "B" (meaning "bitmap") that creates a bitmap of all matches and can be used in conjunction with ⊂ to split strings by regex.

Here's an example:

      " +" ⎕RE["B"] "this is   a     test"
┏→━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃0 0 0 0 1 0 0 2 2 2 0 3 3 3 3 3 0 0 0 0┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛

This matches any sequence of spaces, and we can easily use ⊂ to split the string:

      {⍵ ⊂⍨ 0=" +" ⎕RE["B"] ⍵} "this is   a     test"
┏→━━━━━━━━━━━━━━━━━━━━━┓
┃"this" "is" "a" "test"┃
┗∊━━━━━━━━━━━━━━━━━━━━━┛
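
As a cross-check of the idea, here is a rough Python model of the two steps. `match_bitmap` imitates the "B" flag as described above, and `partition` imitates APL2-style dyadic ⊂ (a new item starts wherever the mask increases, and cells with mask 0 are dropped). Both names are invented for this sketch; neither is GNU APL code.

```python
import re

def match_bitmap(pattern, text):
    """Number every character by the match it belongs to (0 = no match)."""
    marks = [0] * len(text)
    for number, m in enumerate(re.finditer(pattern, text), start=1):
        for i in range(m.start(), m.end()):
            marks[i] = number
    return marks

def partition(mask, text):
    """Split text by mask: start a new piece where the mask increases,
    skip characters whose mask value is 0."""
    parts, prev = [], 0
    for value, ch in zip(mask, text):
        if value > 0:
            if value > prev:
                parts.append("")
            parts[-1] += ch
        prev = value
    return parts

text = "this is   a     test"
print(match_bitmap(" +", text))
# [0, 0, 0, 0, 1, 0, 0, 2, 2, 2, 0, 3, 3, 3, 3, 3, 0, 0, 0, 0]
keep = [int(v == 0) for v in match_bitmap(" +", text)]
print(partition(keep, text))
# ['this', 'is', 'a', 'test']
```

Passing the numbered bitmap itself as the mask, rather than its `0=` inverse, keeps the matches instead of the text between them.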

However, I'm not sure if the values returned from the function are ideal. The idea of the increasing numbers is to be able to differentiate between the result of:

      " " ⎕RE["B"] "    "
┏→━━━━━━┓
┃1 2 3 4┃
┗━━━━━━━┛

vs:

      " +" ⎕RE["B"] "    "
┏→━━━━━━┓
┃1 1 1 1┃
┗━━━━━━━┛

Should it be left like this, or should it be done in some other way?
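
One way to weigh the question is to look at what a consumer can recover from the numbered form. The Python sketch below (the helper name `match_runs` is made up) turns a numbered bitmap back into individual matches; with a plain 0/1 bitmap the two results above would be indistinguishable.

```python
from itertools import groupby

def match_runs(bitmap):
    """Return (match_number, start, length) for each run of equal,
    non-zero values in a numbered bitmap."""
    runs, pos = [], 0
    for value, group in groupby(bitmap):
        length = len(list(group))
        if value != 0:
            runs.append((value, pos, length))
        pos += length
    return runs

print(match_runs([1, 2, 3, 4]))  # four one-character matches
# [(1, 0, 1), (2, 1, 1), (3, 2, 1), (4, 3, 1)]
print(match_runs([1, 1, 1, 1]))  # one four-character match
# [(1, 0, 4)]
```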

Regards,
Elias

On 25 September 2017 at 20:10, Juergen Sauermann <address@hidden> wrote:
Hi Elias,

making a quad function an operator is simple if the function argument(s) is/are primitive functions
and a little more complicated if not.

First of all you have to implement (read: override) some of the eval_XXX() functions that have function
arguments. For monadic operators these eval_XXX() functions are:

   virtual Token eval_ALB(Value_P A, Token & LO, Value_P B)
   virtual Token eval_ALXB(Value_P A, Token & LO, Value_P X, Value_P B)
   virtual Token eval_LB(Token & LO, Value_P B)
   virtual Token eval_LXB(Token & LO, Value_P X, Value_P B)

where L resp. LO stands for the left function argument. For dyadic operators they are:

   virtual Token eval_ALRB(Value_P A, Token & LO, Token & RO, Value_P B)
   virtual Token eval_ALRXB(Value_P A, Token & LO, Token & RO, Value_P X, Value_P B)
   virtual Token eval_LRB(Token & LO, Token & RO, Value_P B)
   virtual Token eval_LRXB(Token & LO, Token & RO, Value_P X, Value_P B)

where L resp. LO and R resp. RO stand for the left and right function argument(s), A and B
are the value arguments, and X the axis.

Not all of them need to be implemented, only those whose function signatures
are supported by the operator (mainly in terms of allowing an axis argument X or a
left value argument A).
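
The letter scheme in these names can be summarised with a tiny Python helper. It is purely illustrative: `eval_method_name` is not part of GNU APL, it only encodes the naming convention listed above (A = left value, L and R = function arguments, X = axis, B = right value).

```python
def eval_method_name(left_value, dyadic_operator, axis):
    """Build the eval_XXX() name for a given combination of arguments."""
    return (
        "eval_"
        + ("A" if left_value else "")       # optional left value argument
        + "L"                               # left function argument, always present
        + ("R" if dyadic_operator else "")  # right function argument
        + ("X" if axis else "")             # optional axis argument
        + "B"                               # right value argument, always present
    )

print(eval_method_name(left_value=True, dyadic_operator=False, axis=True))
# eval_ALXB
```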

If an operator supports defined functions (as opposed to primitive functions) then it will typically
implement the operator itself as a macro, which means that the implementation is written in APL
rather than in C++ (similar to "magic functions" in NARS). This is needed because primitive functions
are atomic (they either succeed or fail, but cannot be continued after a failure) while defined functions
(and operators) can continue at the point of interruption after having fixed the values that have caused
the fault.

Some of the built-in operators in GNU APL have both a primitive implementation (which is used when
the function arguments are primitive) and a macro based implementation if not. This is for performance
reasons so that the ability to take defined functions as arguments does not performance-wise harm the
cases where the function arguments are primitive.

The macro definitions are contained in Macro.def.

Please note that in GNU APL functions cannot return functions, which may or may not be a problem
in your case, depending on whether the function argument(s) of the ⎕-operator is/are primitive or not.
In standard APL you cannot assign a function to a name. The usual work-around is to return a string and ⍎ it.

My gut feeling is that if you need function arguments for implementing regular expressions then
something has gone in the wrong direction somewhere else.

Best Regards,
/// Jürgen



On 09/25/2017 05:18 AM, Elias Mårtenson wrote:
Dyalog's implementation is much more expressive than what I had proposed.

There are technical reasons why we have no hope of replicating their functionality (in particular, GNU APL does not have support for namespaces).

Their function takes arguments and returns a function, which is a matcher function that can be reused, which is useful since you'd only compile the regexp once. Jürgen, how can I make a quad-function behave like below? It seems to be similar in behaviour to ⍤ and ⍣.

      ('.at' ⎕R '\u0') 'The cat sat on the mat'
The CAT SAT on the MAT

It can also accept a function, in which case the function is called for each match, to return a replacement string. Can you explain how to make a quad-function an operator?

      ('\w+' ⎕R {⌽⍵.Match}) 'The cat sat on the mat'
ehT tac tas no eht tam

As you can see, they leverage namespaces in order to pass a lot of different fields to the replace-function. If we want to do something similar, ⍵ would probably have to be the match string, and we'll have to live without the remaining fields.
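
A rough Python analogue of that reduced design: the pattern is compiled once, a reusable function is returned, and the callback sees only the matched string. The name `replacer` is made up, and this says nothing about how a GNU APL operator would actually be wired up.

```python
import re

def replacer(pattern, transform):
    """Compile pattern once and return a function that replaces every
    match in its argument with transform(matched_text)."""
    compiled = re.compile(pattern)

    def apply(text):
        # The callback receives only the match string, no other fields.
        return compiled.sub(lambda m: transform(m.group(0)), text)

    return apply

upcase = replacer(r".at", str.upper)
reverse_words = replacer(r"\w+", lambda s: s[::-1])
print(upcase("The cat sat on the mat"))         # The CAT SAT on the MAT
print(reverse_words("The cat sat on the mat"))  # ehT tac tas no eht tam
```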

Regards,
Elias


On 23 September 2017 at 00:08, Juergen Sauermann <address@hidden> wrote:

    Hi,

    I have not looked into Dyalog's implementation myself, but if they
    have it then we should aim at being as compatible as makes sense.
    No problem if some of their capabilities are not supported (please
    avoid going over the top in the GNU APL implementation).

    Unfortunately ⎕R is already occupied in GNU APL (inherited from
    IBM APL2), so some other name(s) are needed.

    Before implementing too much in advance, it would be good to
    present the intended syntax and semantics on bug-apl and solicit
    opinions.

    /// Jürgen


    On 09/22/2017 04:59 PM, Elias Mårtenson wrote:
    I did not know this. I took a look at Dyalog's API and it's not
    possible to implement it fully, as it relies on their object
    oriented features. However, the basic functionality wouldn't be
    hard to replicate, if that is something that is desired.

    Jürgen, what is your opinion on this?

    On 22 September 2017 at 20:21, Jay Foad <address@hidden> wrote:

        FYI Dyalog has operators ⎕S (search) and ⎕R (replace) which
        are implemented with PCRE:

        ('[Aa]..'⎕S'&')'Dyalog APL'
        ┌───┬───┐
        │alo│APL│
        └───┴───┘
        ('red' 'green'⎕R'green' 'blue')'red orange yellow green blue'
        green orange yellow blue blue

        http://help.dyalog.com/16.0/Content/Language/System%20Functions/r.htm

        Jay.
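
For readers without Dyalog at hand, the two examples can be approximated in Python. This is only a sketch under assumptions: `search` and `replace_many` are invented names, replacements are treated as literal text, and the patterns passed to `replace_many` must not contain capture groups of their own.

```python
import re

def search(pattern, text):
    """Return the text of every match, like the '&' transformation above."""
    return [m.group(0) for m in re.finditer(pattern, text)]

def replace_many(patterns, replacements, text):
    """Replace each pattern with its paired replacement in a single pass,
    so text produced by one replacement is never replaced again."""
    combined = re.compile("|".join(f"({p})" for p in patterns))
    # m.lastindex is the number of the alternative that matched.
    return combined.sub(lambda m: replacements[m.lastindex - 1], text)

print(search(r"[Aa]..", "Dyalog APL"))
# ['alo', 'APL']
print(replace_many(["red", "green"], ["green", "blue"],
                   "red orange yellow green blue"))
# green orange yellow blue blue
```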









