qemu-devel

Re: [Qemu-devel] [PATCH v2 01/11] qapi: add qapi2texi script


From: Markus Armbruster
Subject: Re: [Qemu-devel] [PATCH v2 01/11] qapi: add qapi2texi script
Date: Fri, 28 Oct 2016 18:44:20 +0200
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/24.5 (gnu/linux)

Marc-André Lureau <address@hidden> writes:

> As the name suggests, the qapi2texi script converts JSON QAPI
> description into a standalone texi file suitable for different target
> formats.
>
> It parses the following kind of blocks with some little variations:
>
>   ##
>   # = Section
>   # == Subsection
>   #
>   # Some text foo with *emphasis*
>   # 1. with a list
>   # 2. like that
>   #
>   # And some code:
>   # | $ echo foo
>   # | <- do this
>   # | -> get that
>   #
>   ##
>
>   ##
>   # @symbol
>   #
>   # Symbol body ditto ergo sum. Foo bar
>   # baz ding.
>   #
>   # @arg: foo
>   # @arg: #optional foo
>   #
>   # Returns: returns bla bla
>   #
>   #          Or bla blah
>   #
>   # Since: version
>   # Notes: notes, comments can have
>   #        - itemized list
>   #        - like this
>   #
>   #        and continue...
>   #
>   # Example:
>   #
>   # <- { "execute": "quit" }
>   # -> { "return": {} }
>   #
>   ##
>
> Thanks to the json declaration, it's able to give extra information
> about the type of arguments and return value expected.
>
> Signed-off-by: Marc-André Lureau <address@hidden>
> ---
>  scripts/qapi.py        | 100 +++++++++++++++-
>  scripts/qapi2texi.py   | 314 +++++++++++++++++++++++++++++++++++++++++++++++++
>  docs/qapi-code-gen.txt |  44 +++++--
>  3 files changed, 446 insertions(+), 12 deletions(-)
>  create mode 100755 scripts/qapi2texi.py

My review will be easier to follow if you skip ahead to qapi-code-gen.txt,
then go to class QAPISchemaParser, then come back here.

> diff --git a/scripts/qapi.py b/scripts/qapi.py
> index 21bc32f..4efc7e7 100644
> --- a/scripts/qapi.py
> +++ b/scripts/qapi.py
> @@ -122,6 +122,79 @@ class QAPIExprError(Exception):
>              "%s:%d: %s" % (self.info['file'], self.info['line'], self.msg)
>  
>  
> +class QAPIDoc:

Old-style class.  Make this

   class QAPIDoc(object):

> +    def __init__(self, comment):
> +        self.symbol = None
> +        self.comment = "" # the main symbol comment

PEP8 wants at least two spaces before #, and I prefer the # in column 32.

> +        self.args = OrderedDict()
> +        # meta is for Since:, Notes:, Examples:, Returns:...
> +        self.meta = OrderedDict()
> +        # the current section to populate, array of [dict, key, comment...]
> +        self.section = None
> +
> +        for line in comment.split('\n'):

Makes me wonder whether the caller should pass a list of lines instead
of a string that needs to be split into lines.

> +            # remove multiple spaces
> +            sline = ' '.join(line.split())

Actually, this doesn't "remove multiple spaces", it squeezes sequences
of whitespace into a single space character.  Rather inefficiently, I
suspect, but let's not worry about that now.
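For illustration (untested): the join/split idiom also strips leading and
trailing whitespace as a side effect, which a plain regex squeeze would not:

```python
import re

line = "  foo   bar\tbaz  "

# ' '.join(line.split()) squeezes every run of whitespace to a single
# space, and strips leading/trailing whitespace as a side effect:
assert ' '.join(line.split()) == "foo bar baz"

# A regex squeeze keeps the surrounding whitespace (one space each):
assert re.sub(r'\s+', ' ', line) == " foo bar baz "
```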

> +            # take the first word out
> +            split = sline.split(' ', 1)
> +            key = split[0]
> +
> +            if key.startswith("@"):
> +                key = key[1:].rstrip(':')

Treats the colon as optional.  I understand you're trying to be
unintrusive, but since we're parsing comments already, we can make the
parser do double duty and enforce consistent comment formatting.  But
this isn't the place to discuss the format we want, review of
qapi-code-gen.txt is.

> +                sline = split[1] if len(split) > 1 else ""
> +                if self.symbol is None:
> +                    # the first is the section symbol
> +                    self.symbol = key

Let's not say "the section symbol".  It's the symbol this QAPIDoc object
documents.  The QAPIDoc object has sections.

> +                else:
> +                    # else an arg
> +                    self.start_section(self.args, key)
> +            elif self.symbol and \
> +                    key in ("Returns:",
> +                            # special case for Since often used without:
> +                            "Since:", "Since",
> +                            # those are often singular or plural
> +                            "Note:", "Notes:",
> +                            "Example:", "Examples:"):
> +                sline = split[1] if len(split) > 1 else ""
> +                line = None
> +                # new "meta" section
> +                self.start_section(self.meta, key.rstrip(':'))
> +
> +            if self.section and self.section[1].startswith("Example"):
> +                # example is verbatim
> +                self.append_comment(line)
> +            else:
> +                self.append_comment(sline)

Sections other than "Example" and "Examples" have whitespace squeezed.
I believe this is the true reason for the squeezing.  There are a few
other places that rely on it, but they could do just as well without.

> +
> +        self.end_section()
> +
> +    def append_comment(self, line):
> +        """Adds a comment to the current section, or the symbol comment"""
> +        if line is None:
> +            return
> +        if self.section is not None:
> +            if self.section[-1] == "" and line == "":
> +                self.end_section()
> +            else:
> +                self.section.append(line)
> +        elif self.comment == "":
> +            self.comment = line
> +        else:
> +            self.comment += "\n" + line

Accumulating a list and joining at the end might be easier.
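Something like this (untested sketch, names invented):

```python
class CommentBuilder(object):
    """Accumulate comment lines; join them once at the end (sketch)."""
    def __init__(self):
        self._lines = []

    def append_comment(self, line):
        if line is not None:          # ignore None, like the patch does
            self._lines.append(line)

    def comment(self):
        return '\n'.join(self._lines)

builder = CommentBuilder()
builder.append_comment("foo")
builder.append_comment(None)
builder.append_comment("bar")
assert builder.comment() == "foo\nbar"
```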

> +
> +    def end_section(self):
> +        if self.section is not None:
> +            dic = self.section[0]

Calling a dictionary "dic" could be considered a dick move ;-P

> +            key = self.section[1]
> +            doc = "\n".join(self.section[2:])
> +            dic[key] = doc

Silently trashes previous value of dic[key].
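If we want to reject duplicates instead, the idea is roughly this (sketch,
not a drop-in fix):

```python
def store_section(sections, key, doc):
    # Refuse to silently overwrite an existing section of the same name
    if key in sections:
        raise ValueError("duplicate section '%s'" % key)
    sections[key] = doc

sections = {}
store_section(sections, 'Returns', 'bla bla')
try:
    store_section(sections, 'Returns', 'bla blah')
    raised = False
except ValueError:
    raised = True
assert raised
assert sections['Returns'] == 'bla bla'
```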

> +            self.section = None
> +
> +    def start_section(self, dic, key):
> +        self.end_section()
> +        self.section = [dic, key]  # .. remaining elems will be the doc

Taken together:

* The QAPI parser is hacked to accumulate and exfiltrate some comments
  to the QAPIDoc parser.

* A QAPIDoc can have a symbol, a symbol comment, and sections.

* Certain patterns in the comment text start and end sections.

  Text within a section belongs to the section.  Text outside sections
  belongs to the symbol comment.  Unlikely to work unless the symbol
  comment is entirely before the first section.

* Two kinds of sections, argument and meta:

  - An argument section is uniquely identified by the argument's name

    Multiple argument sections with the same name make no sense.  The
    code silently ignores all but the last one.

  - A meta section is uniquely identified by a keyword

    The keywords are: "Returns", "Since", "Note", "Notes", "Example",
    "Examples".

    Multiple meta sections with the same key make sense for some keys
    ("Note") but not for others ("Returns", "Since").  Nevertheless, the
    code silently ignores all but the last one.  You can't have two
    "Example" sections, but you can have an "Example" and an "Examples"
    section.  I'm afraid the data structure isn't quite right here.

* No comment is ever rejected.  Comments adhering to the format we have
  in mind generate correct documentation.  Mistakes may produce correct
  documentation, subtly incorrect documentation, or obvious garbage.
  When you get garbage (obvious or not), you have to guess what's wrong
  with your comment.  "Fortunately", developers generally won't have to
  do that, because they generally don't read the generated
  documentation.
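A data structure that keeps repeated meta sections apart could look like
this (sketch with invented names, not the patch's actual structure):

```python
from collections import OrderedDict

class DocSections(object):
    """Args keyed by name; meta kept as an ordered (key, text) list."""
    def __init__(self):
        self.args = OrderedDict()   # one section per argument name
        self.meta = []              # repeated keys like "Note" survive

    def add_meta(self, key, text):
        self.meta.append((key, text))

doc = DocSections()
doc.add_meta('Note', 'first note')
doc.add_meta('Note', 'second note')
assert [k for k, _ in doc.meta] == ['Note', 'Note']
```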

General remarks:

* I feel the only way to keep the API documentation in order as we
  develop is to have the parser reject malformed API comments.

* The only way to create a robust and maintainable parser is to specify
  the language *first* and *rigorously*.

* Stacking parsers is not a good idea.  Figuring out one parser's
  accepted language is hard enough.

* For me, the parsing part of this patch is a prototype.  Possibly even
  good enough to commit, with some work.  But it'll have to be redone
  from first principles regardless.

> +
> +
>  class QAPISchemaParser(object):
>  
>      def __init__(self, fp, previously_included=[], incl_info=None):
> @@ -137,11 +210,14 @@ class QAPISchemaParser(object):
>          self.line = 1
>          self.line_pos = 0
>          self.exprs = []
> +        self.comment = None
> +        self.apidoc = incl_info['doc'] if incl_info else []
>          self.accept()
>  
>          while self.tok is not None:
>              expr_info = {'file': fname, 'line': self.line,
> -                         'parent': self.incl_info}
> +                         'parent': self.incl_info, 'doc': self.apidoc}
> +            self.apidoc = []
>              expr = self.get_expr(False)
>              if isinstance(expr, dict) and "include" in expr:
>                  if len(expr) != 1:

Before your patch, expr_info holds the (top-level) expression's location
plus the list of include directive locations that led to the expression.

You extend it by key 'doc'.  Let's see what it's good for.

At every (sub-)expression, we capture the current state of self.apidoc
in expr_info['doc'], then reset self.apidoc to [].  Odd; I would expect
API documentation to attach only to top-level expressions.

Looks like expr_info['doc'] and self.apidoc are lists, and self.apidoc
accumulates documentation until we move it to the next expression's
expr_info['doc'].  In other words, self.apidoc accumulates documentation
for the next expression.

I don't quite understand why you initialize self.apidoc the way you do:
if this file is included, self.apidoc is initialized to the include
directive's expr_info['doc'], else to the empty list.  What are you
trying to accomplish by that?

> @@ -162,6 +238,8 @@ class QAPISchemaParser(object):
>                      inf = inf['parent']
>                  # skip multiple include of the same file
>                  if incl_abs_fname in previously_included:
> +                    expr_info['doc'].extend(self.apidoc)
> +                    self.apidoc = expr_info['doc']
>                      continue
>                  try:
>                      fobj = open(incl_abs_fname, 'r')

Before your patch, we indeed skip, as the comment says.  After your
patch, we do something that's not yet obvious to me, then skip.  The
comment isn't quite right anymore.

Let's see what we do when we include a file a second time:

1. We append self.apidoc to expr_info['doc'].  But self.apidoc is []
here, isn't it?

2. We assign the result back to self.apidoc.

I think this is effectively

                       self.apidoc = expr_info['doc']

What are you trying to accomplish?
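Indeed, with self.apidoc == [] the .extend() is a no-op and the two
statements reduce to aliasing:

```python
# Mimic the second-include path with self.apidoc already empty:
expr_info = {'doc': ['doc-of-include-directive']}
apidoc = []

expr_info['doc'].extend(apidoc)   # extends by nothing
apidoc = expr_info['doc']         # now the very same list object

assert apidoc is expr_info['doc']
assert apidoc == ['doc-of-include-directive']
```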

> @@ -176,6 +254,12 @@ class QAPISchemaParser(object):
>                               'info': expr_info}
>                  self.exprs.append(expr_elem)
>  
> +    def append_doc(self):
> +        if self.comment:
> +            apidoc = QAPIDoc(self.comment)
> +            self.apidoc.append(apidoc)
> +            self.comment = None
> +

Too many things called apidoc.  I'd elide the local variable:

               self.apidoc.append(QAPIDoc(self.comment))

Neither the role of self.comment nor self.apidoc are obvious just yet.

>      def accept(self):
>          while True:
>              self.tok = self.src[self.cursor]
> @@ -184,8 +268,20 @@ class QAPISchemaParser(object):
>              self.val = None
>  
>              if self.tok == '#':
> -                self.cursor = self.src.find('\n', self.cursor)
> +                end = self.src.find('\n', self.cursor)
> +                line = self.src[self.cursor:end+1]

                   self.cursor = self.src.find('\n', self.cursor)
                   line = self.src[self.pos:self.cursor+1]

and drop the self.cursor = end below.

@line isn't actually the source line, it's the source line with the
initial '#' chopped off.  Hmm.

Note that .find() can't fail, because we made sure self.src ends with
'\n'.

> +                # start a comment section after ##
> +                if line[0] == "#":
> +                    if self.comment is None:
> +                        self.comment = ""
> +                # skip modeline
> +                elif line.find("-*") == -1 and self.comment is not None:
> +                    self.comment += line

Aha: self.comment is either None or a string.  When it's a string, we're
accumulating a (multi-line) comment there.  We start accumulating when a
comment starts with '##'.  We stop by calling self.append_doc(), which
processes the accumulated comment, and appends the result to the list
self.apidoc.

If a comment starting with '##' occurs while we're already accumulating,
we silently ignore it.  I don't think that's a good idea.

Likewise, we silently ignore any comment lines containing '-*'.  That's
definitely not a good idea; your pattern is far too loose.  Why do we
need to ignore such lines?
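If modelines really must be skipped, a tighter test would help
(hypothetical sketch):

```python
import re

# An Emacs file-variables line looks like "# -*- Mode: Python -*-";
# anchor on the full -*- ... -*- bracketing instead of a bare "-*":
MODELINE = re.compile(r'-\*-.*-\*-')

assert MODELINE.search(" -*- Mode: Python -*-")

line = "deref -*ptr here"        # contains "-*" but is no modeline
assert line.find("-*") != -1     # the patch's test would drop it
assert not MODELINE.search(line) # the stricter pattern keeps it
```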

> +                if self.src[end] == "\n" and self.src[end+1] == "\n":

If this is the last line, self.src[end+1] is out of bounds, I'm afraid.
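A slice sidesteps the IndexError (sketch):

```python
src = "# last line\n"              # parser guarantees a final newline
end = src.find('\n')               # here: the last character of src

# src[end + 1] would raise IndexError; slicing stays in bounds:
assert src[end + 1:end + 2] == ""  # no character after the newline

src = "# not last\n\nmore"
end = src.find('\n')
assert src[end + 1:end + 2] == "\n"  # a real blank line is detected
```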

> +                    self.append_doc()

We stop accumulating when a comment line is followed by a blank line,
and ...

> +                self.cursor = end
>              elif self.tok in "{}:,[]":
> +                self.append_doc()

... when we recognize one of the tokens '{', '}', ':', ',' '[', ']'.

This isn't a parser, it's a hack :)

In this contrived example, your code joins the two comments into one:

    { 'command': 'contrived-example-1', 'gen':
    ## lorem ipsum
      false
    # dolor sit amet
    }

In this one, it doesn't:

    { 'command': 'contrived-example-2', 'gen':
    ## lorem ipsum
      false }
    # dolor sit amet

In my opinion, this needs a healthy dose of rigor.  Make API comments a
proper *syntactical* construct: starts with a comment token whose text
is exactly '##', extends until the first non-comment token.

Two ways to parse this:

* Lexically, i.e. gobble up the whole comment block as a single token.
  Drawback: tokens spanning lines defeat the elegant treatment of
  newline in one place.  Meh.

* Treat it as an expression: have a method get_api_comment() that keeps
  calling self.accept() until it sees a non-comment token.  To ease
  getting the comment text, we can make self.accept() store it in
  self.val.

This also reduces modifiable state: self.comment goes away.  Non-local
modifiable state is evil.
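Roughly like this (sketch; FakeParser is invented test scaffolding, not
the real class):

```python
def get_api_comment(parser):
    # Keep accepting until the first non-comment token; accept() is
    # assumed to leave each comment line's text in parser.val.
    lines = []
    while parser.tok == '#':
        lines.append(parser.val)
        parser.accept()
    return '\n'.join(lines)

class FakeParser(object):
    """Minimal stand-in for QAPISchemaParser (demo only)."""
    def __init__(self, toks):
        self._toks = list(toks)   # list of (tok, val) pairs
        self.accept()

    def accept(self):
        self.tok, self.val = self._toks.pop(0)

p = FakeParser([('#', '##'), ('#', ' lorem ipsum'), ('{', None)])
assert get_api_comment(p) == "##\n lorem ipsum"
assert p.tok == '{'              # stopped at the non-comment token
```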

Now let me try to summarize how this works with parsing details ignored:

0. We accumulate the next expression's API documentation in self.apidoc,
a list of QAPIDoc.

1. We accumulate the current API comment in self.comment.

2. When it's complete, we call self.append_doc().  append_doc()
processes self.comment into a QAPIDoc, which it appends to self.apidoc,
then resets self.comment.

3. When we recognize the start of an expression, we move self.apidoc to
the expression's expr_info['doc'].

Taken together: each (sub-)expression has a list of QAPIDoc objects that
got recognized since the last start of a (sub-)expression.

Now go back to class QAPIDoc.

>                  return
>              elif self.tok == "'":
>                  string = ''
> diff --git a/scripts/qapi2texi.py b/scripts/qapi2texi.py
> new file mode 100755
> index 0000000..2706b16
> --- /dev/null
> +++ b/scripts/qapi2texi.py
[Skipping the generator for now]
> diff --git a/docs/qapi-code-gen.txt b/docs/qapi-code-gen.txt
> index 5d4c2cd..e51ae4c 100644
> --- a/docs/qapi-code-gen.txt
> +++ b/docs/qapi-code-gen.txt
> @@ -45,16 +45,13 @@ QAPI parser does not).  At present, there is no place where a QAPI
>  schema requires the use of JSON numbers or null.
>  
>  Comments are allowed; anything between an unquoted # and the following
> -newline is ignored.  Although there is not yet a documentation
> -generator, a form of stylized comments has developed for consistently
> -documenting details about an expression and when it was added to the
> -schema.  The documentation is delimited between two lines of ##, then
> -the first line names the expression, an optional overview is provided,
> -then individual documentation about each member of 'data' is provided,
> -and finally, a 'Since: x.y.z' tag lists the release that introduced
> -the expression.  Optional members are tagged with the phrase
> -'#optional', often with their default value; and extensions added
> -after the expression was first released are also given a '(since
> +newline is ignored.  The documentation is delimited between two lines
> +of ##, then the first line names the expression, an optional overview
> +is provided, then individual documentation about each member of 'data'
> +is provided, and finally, a 'Since: x.y.z' tag lists the release that
> +introduced the expression.  Optional members are tagged with the
> +phrase '#optional', often with their default value; and extensions
> +added after the expression was first released are also given a '(since
>  x.y.z)' comment.  For example:
>  
>      ##
> @@ -73,12 +70,39 @@ x.y.z)' comment.  For example:
>      #           (Since 2.0)
>      #
>      # Since: 0.14.0
> +    #
> +    # Notes: You can also make a list:
> +    #        - with items
> +    #        - like this
> +    #
> +    # Example:
> +    #
> +    # <- { "execute": ... }
> +    # -> { "return": ... }
> +    #
>      ##
>      { 'struct': 'BlockStats',
>        'data': {'*device': 'str', 'stats': 'BlockDeviceStats',
>                 '*parent': 'BlockStats',
>                 '*backing': 'BlockStats'} }
>  
> +It's also possible to create documentation sections, such as:
> +
> +    ##
> +    # = Section
> +    # == Subsection
> +    #
> +    # Some text foo with *emphasis*
> +    # 1. with a list
> +    # 2. like that
> +    #
> +    # And some code:
> +    # | $ echo foo
> +    # | <- do this
> +    # | -> get that
> +    #
> +    ##
> +
>  The schema sets up a series of types, as well as commands and events
>  that will use those types.  Forward references are allowed: the parser
>  scans in two passes, where the first pass learns all type names, and

This may be good enough to guide users, but it's not rigorous enough to
actually implement a parser.  The API doc language needs to be specified
as a grammar.  If not here, then elsewhere.

Now go back to QAPISchemaParser.


