
Re: [gnulib-tool-py] Getting cached values of variables


From: Bruno Haible
Subject: Re: [gnulib-tool-py] Getting cached values of variables
Date: Fri, 18 May 2012 01:31:16 +0200
User-agent: KMail/4.7.4 (Linux/3.1.10-1.9-desktop; KDE/4.7.4; x86_64; ; )

Hi Dmitriy,

> I've been banging my head, trying to find an equally fast way to get the
> cached values from the gnulib-cache.m4 file, but there are a few problems:
> 1. Python's re module doesn't have loops the way sed does. We can use
> ordinary Python loops, but since we are working with Unicode (the
> unicode/str types), that will probably take a long time. As you know,
> Unicode takes more memory than plain bytes, which can become a headache
> because we use the string type everywhere.
> 2. Another disadvantage is that we would have to split one big regex into
> several pieces of code. I prefer one big regex, because then we don't
> need to feed the text to the parser many times, but can do everything we
> need at once.
> 3. If we use sed, it runs at C speed, so it will be very fast compared
> to Python.

You are seeing a problem where there is none, and banging your head
against this nonexistent problem. Namely, there is no requirement that
this part - parsing gnulib-cache.m4's contents - be particularly fast.
Remember:
  1. It occurs just once per "gnulib-tool --import" invocation.
     Therefore it can take 0.3 seconds or so.
  2. The file to be parsed has about 200 lines of text.
  3. There are about 20 regular expressions to run over this
     entire file.

Things would be different if gnulib-cache.m4 were 5 MB in size, but that's
not the case and never will be.

There's one thing you need to remember about regular expressions: they
are "prepared" or "compiled" into a form in which they can match a line
of text. I haven't measured it, but I would estimate that one
"compilation" step takes about 10 to 100 times as long as matching a
single line. Therefore, when you process several lines with several
regular expressions, what you should *not* do is:

  outer loop: over all lines
    inner loop: over all regexps
      compile the regexp, then match it against the line

Instead, one of these two algorithms will work:

  outer loop: over all regexps
    compile the regexp
    inner loop: over all lines
      match the regexp against the line

or

  loop: compile all the regexps
  outer loop: over all lines
    inner loop: over all compiled regexps
      match it against the line

Between the two, choose according to this criterion: which of them
gives the more readable code?
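
For instance, a minimal Python sketch of the second algorithm might look
like this (the pattern names and regexes are illustrative assumptions,
not the actual gnulib-cache.m4 grammar):

  import re

  # Compile all the regexps once, before touching the file.
  patterns = {
      'lgpl':    re.compile(r'gl_LGPL\(\[?(.*?)\]?\)'),
      'libname': re.compile(r'gl_LIB\(\[?(.*?)\]?\)'),
  }

  results = {}
  with open('gnulib-cache.m4') as f:
      for line in f:                             # outer loop: over all lines
          for key, regex in patterns.items():    # inner loop: over all compiled regexps
              match = regex.search(line)
              if match:
                  results[key] = match.group(1)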

Ad 1. Python's re module does not have sed-style loops; you program the
loops yourself. Pick a suitable representation for the contents of the
file:
Either a) a list of lines,
Or     b) one long string with embedded newlines.
Then program the loops yourself, either by iterating over the lines
(in the first case) or by adding a '^' anchor to each regular expression,
so that it starts to match at the beginning of a line (in the second).
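
Here is a sketch of both representations, again with an illustrative
pattern rather than the real grammar. Note that in Python's re module,
case b) also needs the re.MULTILINE flag so that '^' matches at the
start of every line:

  import re

  pattern = r'gl_LGPL\((.*)\)'

  # a) a list of lines: iterate, matching each line from its start
  regex = re.compile(pattern)
  with open('gnulib-cache.m4') as f:
      for line in f:
          m = regex.match(line)   # match() is anchored at position 0
          if m:
              print(m.group(1))

  # b) one long string: prepend '^' and use re.MULTILINE, so that '^'
  #    matches at the beginning of every line, not just of the string
  with open('gnulib-cache.m4') as f:
      text = f.read()
  regex = re.compile('^' + pattern, re.MULTILINE)
  for m in regex.finditer(text):
      print(m.group(1))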

Ad 2. I definitely prefer several small regexes. Why? Because you can
add a comment on each of them. Because you can comment out part of
the code during development, etc. With small regexes the program is 99%
Python.

Ad 3. I don't want gnulib-tool to use large 'sed' scripts. You have
seen yourself that it takes two weeks of learning to understand them.
These large 'sed' scripts are a source of unmaintainability; this is
why they have to be split up.

> What can we do? I suggest creating a new subdirectory 'shell' inside
> pygnulib. This subdirectory would contain scripts that are executed
> by the shell.
> The first script would be called 'gnulibcache.sh'. ...

This would be totally against the goal of gaining more maintainability
through a simpler implementation language. Namely, if you have some
parts in Python and some parts in shell, future maintainers need to be
familiar with *both* Python and shell.

Invoking 'find' was OK. I said there is no need to write complex
directory-traversal logic in Python, since the 'find' program exists
and can be used. This was possible because the arguments passed to
'find' are small and its expected output is quick to understand. This
is *not* the case with 'sed' or 'sh'.
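
For comparison, such a 'find' invocation stays short and obvious in
Python; the directory and options below are made up for illustration:

  import subprocess

  # Run 'find' and split its output into a list of file names.
  output = subprocess.check_output(['find', 'modules', '-type', 'f'])
  files = output.decode().splitlines()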

> I think it is not a very good idea to keep some actions in shell
> scripts, but we already have some actions that cannot be done in
> Python (e.g. getting the version from git).

Each such case needs a really really good rationale.

> What do you think?

You were thinking way too much about speed. Think like this:
What is the easiest way, in Python, to traverse a file, looking for
a string like "gl_LGPL("?
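
For example, a minimal sketch of exactly that:

  # Scan the file line by line for the literal string "gl_LGPL(".
  with open('gnulib-cache.m4') as f:
      for line in f:
          if 'gl_LGPL(' in line:
              print(line.strip())
              break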

Bruno
