help-flex

yylex and yy_scan_string()


From: Nicolas Peyrussie
Subject: yylex and yy_scan_string()
Date: Tue, 22 Mar 2005 13:57:22 +0100
User-agent: Mozilla Thunderbird 0.9 (X11/20041127)

Hello,

I am developing a program in flex to tokenize HTML pages.
The aim is to retrieve and store the tags, the text (i.e. everything that is not a tag), as well as the URLs and domain names (the latter can be found in the tags or in the text).

This is my code:
/**************************************************************/
%{
%}
ALPHA    [0-9]
ALPHANUM [a-zA-Z0-9]
DOMAIN {ALPHANUM}+("-"|".")({ALPHANUM}|"-"|".")+
TAG      "<"(\"[^\"]*\"|\'[^\']*\'|[^\'\"">"\n])*">"

%option stack
%x CUT
%x TAGST
%x URLST WORDST
%x DOMAINST AHREF
%x SCRIPT COMMENT
%%

"&nbsp;"|"&lt;"|"&gt;"|"</"[^>]+">"   {}

"<script"(.*)?">" {}
"<!--"(.*)?"-->" {}

"<script"            {yy_push_state(SCRIPT);}
<SCRIPT>"/script>"   {yy_pop_state();}
<SCRIPT>"\n"         {}
<SCRIPT>.            {}
<SCRIPT><<EOF>>      {yy_pop_state();}

"<!--"               {yy_push_state(COMMENT);}
<COMMENT>"-->"       {yy_pop_state();}
<COMMENT>"\n"        {}
<COMMENT>.           {}
<COMMENT><<EOF>>     {yy_pop_state();}

<INITIAL>{TAG} {
 printf("TAG : %s\n", yytext);
 BEGIN URLST;
 yy_scan_string(yytext);
 yylex();
}

<INITIAL>[^{TAG}][a-zA-Z0-9"address@hidden/:"]+ {
 printf("WORD : %s\n", yytext);
 BEGIN URLST;
 yy_scan_string(yytext);
 yylex();
}

<URLST>https?:\/\/{DOMAIN}(":"{ALPHA}+)?({ALPHANUM}|["~""/""$""-""_"".""+""!""*""'""("")"","";"":""@""&""=""?"])* {
 printf("URL : %s\n", yytext);
 BEGIN DOMAINST;
 yyless(1);
}

<DOMAINST>\/\/{DOMAIN} {
 printf("DOMAIN  : %s\n", yytext);
}
%%

void lex_initFile(char *file)
{
 yyin = fopen(file, "r");
 if (yyin == NULL) {   /* do not scan if the page cannot be opened */
  perror(file);
  return;
 }
 printf("%s\n", file);
 yylex();
}
/**************************************************************/

The problem is that the program stops after the yy_scan_string() call,
whereas I expected the yylex() that follows it to resume scanning the
file where it had stopped.

For example, with the following HTML page:
<img src=www.google.fr/toto/titi/img.png/>
<img src=https://www.google.fr/toto/titi/img.png/>

I get:
TAG : <img src=www.google.fr/toto/titi/img.png/>

and with this one:
<img src=https://www.google.fr/toto/titi/img.png/>
<img src=https://www.google.fr/toto/titi/img.png/>

I get:
TAG : <img src=https://www.google.fr/toto/titi/img.png/>
URL : https://www.google.fr/toto/titi/img.png/
DOMAIN  : //www.google.fr

So the trouble is that the scanner stops instead of continuing with the
rest of the input.
How can I avoid this problem? I cannot use temporary files (I want to
use only the HTML page itself as the FILE in yyin), because many
instances of the program will run at the same time and they would write
the tokens for different HTML pages into the same file.
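
One possible direction, given here as an untested sketch rather than a
confirmed fix: yy_scan_string() makes the string buffer the current
buffer, and nothing switches back to the file buffer once the string is
exhausted, so the scan ends at the end of the string. The fragment below
remembers the file buffer before scanning the tag text and restores it
in an <<EOF>> rule, using the flex facilities YY_CURRENT_BUFFER,
yy_switch_to_buffer() and yy_delete_buffer(). The name saved_buffer and
the catch-all rule are illustrative additions that are not part of the
code above, only one level of nesting is assumed, and the fragment is
meant to be merged into the existing definitions and rules (TAG, URLST
and DOMAINST are taken from there).

/**************************************************************/
%{
/* hypothetical helper: the file buffer to return to */
static YY_BUFFER_STATE saved_buffer;
%}
%%
<INITIAL>{TAG} {
 printf("TAG : %s\n", yytext);
 saved_buffer = YY_CURRENT_BUFFER;  /* remember the file buffer */
 BEGIN URLST;
 yy_scan_string(yytext);            /* scan a copy of the tag text next */
}

<URLST,DOMAINST>.|\n  { /* assumed: discard what is neither URL nor domain */ }

<URLST,DOMAINST><<EOF>> {
 yy_delete_buffer(YY_CURRENT_BUFFER);  /* free the string buffer */
 yy_switch_to_buffer(saved_buffer);    /* resume the file where it stopped */
 BEGIN INITIAL;
}
/**************************************************************/

The nested yylex() call is left out in this sketch: after
yy_scan_string() the yylex() that is already running simply continues on
the new buffer, and the <<EOF>> rule brings it back to the file. The
WORD rule would presumably need the same treatment as the TAG rule.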

I thank you in advance for your answers.




