help-bison
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Regular Expression String Search With Bison


From: Giacinto Cifelli
Subject: Re: Regular Expression String Search With Bison
Date: Wed, 13 Mar 2019 07:13:21 +0100

Hello Ricardo,
On Tue, Mar 12, 2019 at 2:21 PM Ricardo Grant <address@hidden> wrote:
>
> Hello,
>
> I am struggling to understand bison, and parser in general, so I hope I

you are in the right place!
>From what I see below, you have already a good notion of what bison
does (parsing of a context-free grammar), and most likely you know the
theory.
But you could use a book for a starter (the only one I know of is
"flex & bison", Jonh Levine).
I give you some elements of an answers below, but maybe you will have
even more questions...

> can get some help. To understand better I decided to try to create a
> program the understands simple regular expressions, and is able to show
> correct sub string matches. Here is the grammar for the language I would
> like to make:
>
> <regex>   ::= <term> '|' <regex>
>            | term
> <term>    ::= <term> <factor>
>            | <factor>
> <factor>  ::= <base> '*'
>            | <base>
> <base>    ::= '(' regex ')'
>            | char
>

I am not sure whether this is really a regular expression or a context
free grammar.
can you parse it with grep?
A CF-grammar is a Chomsky type 2 grammar, while a RE-grammar is a type
3, which is a special case of type 2.
Bison is maybe the best tool for type 2 grammars. While it can be used
for type 3, it is an overkill. I'll come to that again below.

> I have attempted to make smaller code to explain my issue, since using
> words alone is a bit difficult. Some of it may be nonsensical:
>
> %{
>    #include <stdio.h>
>    #include <string.h>
>    #include <stdbool.h>
>
>    #define BUF_MAX 31
>
>    int yylex (void);
>    void yyerror (char const *);
>
>    char regxp[BUF_MAX];
>    char text[BUF_MAX];
>    char input[BUF_MAX]
>
>    int i = 0;
>
>    struct regxp
>    {
>      bool star;
>      char str[BUF_MAX];
>    };
> %}
>
> %parse-param {char* regex}
> %define api.value.type union
> %token <char> CHAR
> %type  <struct regxp> regex term factor base
>
> %%
>    input:
>      input line
>    | %empty
>    ;
>
>    line:
>      regex '\n' { printf ($1.str); }
>    | error '\n' { yyerrork; }
>    | '\n'
>    ;
>
>    regex:
>      regex term { strncpy ($1.str, $2.str); $$ = $1; }
>    | term { $$ = $1; }

I am not sure you can copy a struct like this.
You should either use a <struct regxp *> as a type, and proceed with
dynamic allocation, or maybe consider using the %union option.

>    ;
>
>    term:
>      term factor { strncpy ($1.str, $2.str); $$ = $1; }
>    | factor { $$ = $1; }
>    ;
>
>    factor:
>      base '*' { $$ = $1; $$.star = true; }
>    | base { $$ = $1; sscanf (input, "%c", &$1.str) }
>    ;
>
>    base:
>      '(' regex ')' { $$ = $1; }
>    | CHAR { input[++i] = $1; }
> %%
>
> int
> main (int argc, char const **argv)
> {
>    if (argc < 3)
>      return 1;
>    else
>      strncpy (argv[2], regxp[], BUF_MAX);
>    return yypase (argv[1]);
> }
>
> What exactly happens for the semantic variables, especially $$? My

you get the final $$ at the end of the parsing. If you don't use
dynamic allocation, they go on the stack and are lost when you
eventually return from the parser.
So you can use them as you go, as you have done, or - if you have
constructed a tree along the way - walk through it at the end of the
parsing.

> smallest element is a character, but I have to idea how to go from a
> character to a base token in code. So I think there will have to be some
> way of passing information around to the string in the regexp type.

bison calls a lexer for getting its tokens, by default the function
yylex(), repeatedly until the end of the input.
So in general it is used with flex.
flex is specialized for RE-grammars, and constructs rules based on
real regular expressions.
For example, you can have there a rule:
[A-Za-z_][A-Za-z_0-9]*  that would match all alphanumeric strings (not
starting with numbers, and also containing underscore - a normal
identifier for several languages).
so, a flex rule could be:
[A-Za-z_][A-Za-z_0-9]*    { return ID; }
(ID is to be declared in bison as %token ID), and can be directly used
there. Bison grammars needs only this number (tokens are numbers) to
assemble a sentence, but you can also extract the value (that you
would store in a semantic variable).


>
> In my main section, I just want to get a regular expression and a string
> to test against, like ./prog ab* ababb.

this is unlikely to work, because the linux shell will try to expand
your arguments.
Better save it in a file, change the main for a simpler one:

void main(void) {
    yyparse();
}

and call your program like this:
$ ./prog <file.txt

file.txt would contain your pattern:
$ cat file.txt
ab* ababb.

> _______________________________________________
> address@hidden https://lists.gnu.org/mailman/listinfo/help-bison



reply via email to

[Prev in Thread] Current Thread [Next in Thread]