Re: [PATCH v5 04/11] hw/arm: Add NPCM730 and NPCM750 SoC models


From: Markus Armbruster
Subject: Re: [PATCH v5 04/11] hw/arm: Add NPCM730 and NPCM750 SoC models
Date: Wed, 15 Jul 2020 11:35:15 +0200
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/26.3 (gnu/linux)

Havard Skinnemoen <hskinnemoen@google.com> writes:

> On Tue, Jul 14, 2020 at 10:11 AM Philippe Mathieu-Daudé <f4bug@amsat.org> wrote:
>>
>> On 7/14/20 6:01 PM, Markus Armbruster wrote:
>> > Philippe Mathieu-Daudé <f4bug@amsat.org> writes:
>> >
>> >> +Markus
>> >>
>> >> On 7/14/20 2:44 AM, Havard Skinnemoen wrote:
>> >>> On Mon, Jul 13, 2020 at 8:02 AM Cédric Le Goater <clg@kaod.org> wrote:
>> >>>>
>> >>>> On 7/9/20 2:36 AM, Havard Skinnemoen wrote:
>> >>>>> The Nuvoton NPCM7xx SoC family is used to implement Baseboard
>> >>>>> Management Controllers in servers. While the family includes four SoCs,
>> >>>>> this patch implements limited support for two of them: NPCM730 (targeted
>> >>>>> for Data Center applications) and NPCM750 (targeted for Enterprise
>> >>>>> applications).
>> >>>>>
>> >>>>> This patch includes little more than the bare minimum needed to boot a
>> >>>>> Linux kernel built with NPCM7xx support in direct-kernel mode:
>> >>>>>
>> >>>>>   - Two Cortex-A9 CPU cores with built-in peripherals.
>> >>>>>   - Global Configuration Registers.
>> >>>>>   - Clock Management.
>> >>>>>   - 3 Timer Modules with 5 timers each.
>> >>>>>   - 4 serial ports.
>> >>>>>
>> >>>>> The chips themselves have a lot more features, some of which will be
>> >>>>> added to the model at a later stage.
>> >>>>>
>> >>>>> Reviewed-by: Tyrone Ting <kfting@nuvoton.com>
>> >>>>> Reviewed-by: Joel Stanley <joel@jms.id.au>
>> >>>>> Signed-off-by: Havard Skinnemoen <hskinnemoen@google.com>
>> >>>>> ---
>> >> ...
>> >>
>> >>>>> +static void npcm7xx_realize(DeviceState *dev, Error **errp)
>> >>>>> +{
>> >>>>> +    NPCM7xxState *s = NPCM7XX(dev);
>> >>>>> +    NPCM7xxClass *nc = NPCM7XX_GET_CLASS(s);
>> >>>>> +    int i;
>> >>>>> +
>> >>>>> +    /* CPUs */
>> >>>>> +    for (i = 0; i < nc->num_cpus; i++) {
>> >>>>> +        object_property_set_int(OBJECT(&s->cpu[i]),
>> >>>>> +                                arm_cpu_mp_affinity(i, NPCM7XX_MAX_NUM_CPUS),
>> >>>>> +                                "mp-affinity", &error_abort);
>> >>>>> +        object_property_set_int(OBJECT(&s->cpu[i]), NPCM7XX_GIC_CPU_IF_ADDR,
>> >>>>> +                                "reset-cbar", &error_abort);
>> >>>>> +        object_property_set_bool(OBJECT(&s->cpu[i]), true,
>> >>>>> +                                 "reset-hivecs", &error_abort);
>> >>>>> +
>> >>>>> +        /* Disable security extensions. */
>> >>>>> +        object_property_set_bool(OBJECT(&s->cpu[i]), false, "has_el3",
>> >>>>> +                                 &error_abort);
>> >>>>> +
>> >>>>> +        qdev_realize(DEVICE(&s->cpu[i]), NULL, &error_abort);
>> >>>>
>> >>>> I would check the error:
>> >>>>
>> >>>>         if (!qdev_realize(DEVICE(&s->cpu[i]), NULL, errp)) {
>> >>>>             return;
>> >>>>         }
>> >>>>
>> >>>> same for the sysbus_realize() below.
>> >>>
>> >>> Hmm, I used to propagate these errors until Philippe told me not to
>> >>> (or at least that's how I understood it).
>> >>
>> >> That was before Markus's API simplifications were merged: you had to
>> >> propagate after each call. Since this is a non-hot-pluggable SoC,
>> >> I suggested using &error_abort to simplify.
>> >>
>> >>> I'll be happy to do it
>> >>> either way (and the new API makes it really easy to propagate errors),
>> >>> but I worry that I don't fully understand when to propagate errors and
>> >>> when not to.
>> >>
>> >> Markus explained it on the mailing list recently (as I found the doc
>> >> not obvious). I can't find the thread. I suppose once the work resulting
>> >> from the "Questionable aspects of QEMU Error's design" discussion is
>> >> merged, the documentation will be clarified.
>> >
>> > The Error API evolved recently.  Please peruse the big comment in
>> > include/qapi/error.h.  If still unsure, don't hesitate to ask here.
>> >
>> >> My rule of thumb so far is:
>> >> - programming error (can't happen) -> &error_abort
>> >
>> > Correct.  Quote the big comment:
>> >
>> >  * Call a function aborting on errors:
>> >  *     foo(arg, &error_abort);
>> >  * This is more concise and fails more nicely than
>> >  *     Error *err = NULL;
>> >  *     foo(arg, &err);
>> >  *     assert(!err); // don't do this
>> >
>> >> - everything triggerable by user or management layer (via QMP command)
>> >>   -> &error_fatal, as we can't risk losing the user's data, we need to
>> >>   shut down gracefully.
>> >
>> > Quote the big comment:
>> >
>> >  * Call a function treating errors as fatal:
>> >  *     foo(arg, &error_fatal);
>> >  * This is more concise than
>> >  *     Error *err = NULL;
>> >  *     foo(arg, &err);
>> >  *     if (err) { // don't do this
>> >  *         error_report_err(err);
>> >  *         exit(1);
>> >  *     }
>> >
>> > Terminating the process is generally fine during initial startup,
>> > i.e. before the guest runs.
>> >
>> > It's generally not fine once the guest runs.  Errors need to be handled
>> > more gracefully then.  A QMP command, for instance, should fail cleanly,
>> > propagating the error to the monitor core, which then sends it to the
>> > QMP client, and loops to process the next command.
>> >
>> >>> It makes sense to me to propagate errors from *_realize() and
>> >>> error_abort on failure to set simple properties, but I'd like to know
>> >>> if Philippe is on board with that.
>> >
>> > Realize methods must not use &error_fatal.  Instead, they should clean
>> > up and fail.
>> >
>> > "Clean up" is the part we often neglect.  The big advantage of
>> > &error_fatal is that you don't have to bother :)
>> >
>> > Questions?
>>
>> One on my side. So in this realize(), all &error_abort uses have
>> to be replaced by local_err + propagate ...:

Except for the ones where failure is a programming error.  For instance,
...

>> static void npcm7xx_realize(DeviceState *dev, Error **errp)
>> {
>>     NPCM7xxState *s = NPCM7XX(dev);
>>     NPCM7xxClass *nc = NPCM7XX_GET_CLASS(s);
>>     int i;
>>
>>     /* CPUs */
>>     for (i = 0; i < nc->num_cpus; i++) {
>>         object_property_set_int(OBJECT(&s->cpu[i]),
>>                                 arm_cpu_mp_affinity(i, NPCM7XX_MAX_NUM_CPUS),
>>                                 "mp-affinity", &error_abort);
>>         object_property_set_int(OBJECT(&s->cpu[i]), NPCM7XX_GIC_CPU_IF_ADDR,
>>                                 "reset-cbar", &error_abort);
>>         object_property_set_bool(OBJECT(&s->cpu[i]), true,
>>                                  "reset-hivecs", &error_abort);

... object_property_set_bool() can fail only when

* No property with that name exists (programming error)

* The property is read-only (programming error)

* Its ->set() method fails

  The method is actually set_bool(), which fails only when

  - the device is already realized (programming error)
  - visit_type_bool() fails (programming error)

Now, you may prefer not to know all that here, and instead propagate the
error.  I have two issues with that: it clutters the code, and the
impossible error path is untestable.

The common way to limit the clutter is of course skipping the cleanup ;)

You could also aim for the sour spot where the impossible error path is
wrong.  Extra points for making it subtly wrong, and tempting to copy to
a place where it's actually possible.

Bah, I'll take &error_abort, thank you very much.
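
To make the trade-off concrete, here is a minimal, self-contained sketch in
plain C.  It is a mock, not QEMU code: MockError, mock_error_abort,
mock_error_set(), MockCpu and mock_set_bool() are made-up stand-ins that only
imitate the shape of the Error API and of a boolean property setter.  Every
failure mode of the setter is a programming error, so the caller can pass the
abort sentinel and has no error path of its own to write or test.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Mock stand-ins for QEMU's Error API; none of this is real QEMU code. */
typedef struct MockError { char msg[128]; } MockError;

/* Sentinel: passing &mock_error_abort means "failure is a bug, abort()". */
static MockError *mock_error_abort;

static void mock_error_set(MockError **errp, const char *msg)
{
    /* An error must not be set on top of an earlier, unhandled one. */
    assert(errp == NULL || *errp == NULL);

    if (errp == &mock_error_abort) {
        fprintf(stderr, "Unexpected error: %s\n", msg);
        abort();
    }
    if (errp) {
        *errp = calloc(1, sizeof(**errp));
        if (!*errp) {
            abort();
        }
        snprintf((*errp)->msg, sizeof((*errp)->msg), "%s", msg);
    }
}

/* Hypothetical device with a single boolean property, "reset-hivecs". */
typedef struct MockCpu { bool realized; bool reset_hivecs; } MockCpu;

static bool mock_set_bool(MockCpu *cpu, const char *name, bool value,
                          MockError **errp)
{
    if (strcmp(name, "reset-hivecs") != 0) {
        mock_error_set(errp, "no such property");        /* programming error */
        return false;
    }
    if (cpu->realized) {
        mock_error_set(errp, "device already realized"); /* programming error */
        return false;
    }
    cpu->reset_hivecs = value;
    return true;
}
```

A caller that knows the property exists and the device is not yet realized
would write mock_set_bool(&cpu, "reset-hivecs", true, &mock_error_abort) and
move on.  Passing the address of a local MockError pointer instead buys an
error path that can never run.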

>>
>>         /* Disable security extensions. */
>>         object_property_set_bool(OBJECT(&s->cpu[i]), false, "has_el3",
>>                                  &error_abort);
>>
>>         qdev_realize(DEVICE(&s->cpu[i]), NULL, &error_abort);
>>     }
>>     [...]
>>
>> ... but the caller does:
>>
>> static void quanta_gsj_init(MachineState *machine)
>> {
>>     NPCM7xxState *soc;
>>
>>     soc = npcm7xx_create_soc(machine, QUANTA_GSJ_POWER_ON_STRAPS);
>>     npcm7xx_connect_dram(soc, machine->ram);
>>     qdev_realize(DEVICE(soc), NULL, &error_abort);
>>                                     ^^^^^^^^^^^^
>>     npcm7xx_load_kernel(machine, soc);
>> }

quanta_gsj_init() states "realizing this device can't fail".

The realize method states "this step can't fail" for a number of steps.

What's wrong with that?

>>
>> So we overload the code...
>>
>> My question: Do you confirm this is worth it to propagate?
>
> Here's my understanding. Please let me know if it sounds right.
>
> 1. Internal code failing to set simple properties to predefined values
> is a programming error, so error_abort is appropriate.

That would be my advice.

> 2. qdev_realize() may fail due to user input, so errors should be propagated.

In general, yes.  For a specific device, you may know it can't fail, and
then &error_abort may be okay.

> 3. machine init can't propagate errors any further, so all errors are
> fatal.

Basically yes.

A machine init may also choose to recover from an error.  Say, create an
optional device, and if that fails, just omit it.  That's only an
illustration; it feels like a bad idea to me.

>        But if all realize() functions follow (1) and (2), only user
> errors are propagated, so error_fatal should be used to produce a nice
> error message rather than "Unexpected error, aborting!"

Yes.

> If any of this can ever be hot-plugged, then it means errors may
> propagate somewhere other than the machine init code, so it becomes
> extra important not to let bad user input crash the whole qemu
> process. I don't know if this is a concern when none of these devices
> can currently be hot-plugged.

Many, many devices neglect to clean up properly on error, and get away
with it only because all callers treat errors as fatal.

If you decide to take cleanup shortcuts, say because the cleanup is
untestable, consider adding a comment at least.
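
For illustration, here is roughly what "clean up and fail" looks like, as a
self-contained sketch in plain C.  MockDev, mock_dev_realize() and
mock_dev_unrealize() are hypothetical names, and the fail_second flag exists
only so the error path can be exercised; a real realize method would not have
such a parameter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical device state with two separately acquired resources. */
typedef struct MockDev {
    int *timer_regs;
    int *uart_regs;
} MockDev;

/*
 * Realize-style function: returns false on failure and leaves the device
 * exactly as it found it.  fail_second simulates a failure in step two.
 */
static bool mock_dev_realize(MockDev *d, bool fail_second)
{
    assert(!d->timer_regs && !d->uart_regs);

    d->timer_regs = calloc(16, sizeof(int));
    if (!d->timer_regs) {
        return false;
    }

    d->uart_regs = fail_second ? NULL : calloc(16, sizeof(int));
    if (!d->uart_regs) {
        /* Undo step one so a caller that survives the error sees no leak. */
        free(d->timer_regs);
        d->timer_regs = NULL;
        return false;
    }
    return true;
}

/* Releases everything a successful realize acquired. */
static void mock_dev_unrealize(MockDev *d)
{
    free(d->uart_regs);
    free(d->timer_regs);
    d->uart_regs = d->timer_regs = NULL;
}
```

If you skip the undo step because every caller exits on failure anyway, the
place where it would have gone is where the comment belongs.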

> For example, if the user tries to create a machine with 64 MB RAM, the
> gcr device will report an error because it can't represent less than
> 128 MB of memory. Currently, this is reported as
>
> $ ./arm-softmmu/qemu-system-arm -machine npcm750-evb -nographic -m 64
> Unexpected error in npcm7xx_gcr_realize() at /usr/local/google/home/hskinnemoen/qemu/for-upstream/hw/misc/npcm7xx_gcr.c:151:
> qemu-system-arm: npcm7xx_gcr: DRAM size 67108864 is too small (128 MiB minimum)
> Aborted
>
> But if I change npcm7xx_realize() to propagate errors from
> sysbus_realize(gcr), and change npcm750_evb_init() to use error_fatal
> instead of error_abort, I get
>
> $ ./arm-softmmu/qemu-system-arm -machine npcm750-evb -nographic -m 64
> qemu-system-arm: npcm7xx_gcr: DRAM size 67108864 is too small (128 MiB minimum)
>
> which seems less scary and more accurate.

Looks like a bug fix to me :)
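
The shape of that fix, sketched as self-contained plain C rather than actual
QEMU code: the child reports a user-triggerable error, the SoC realize passes
it up, and only machine init decides it is fatal.  All names here (MockError,
mock_gcr_realize(), mock_soc_realize(), mock_machine_init()) are made up, and
returning an exit status stands in for what &error_fatal does, which is to
report the error and exit(1).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define MIB (1024ULL * 1024ULL)

/* Mock error object; a real QEMU Error carries more than a message. */
typedef struct MockError { char msg[96]; bool set; } MockError;

/* Stand-in for the GCR model's realize: rejects DRAM below 128 MiB. */
static bool mock_gcr_realize(uint64_t dram_size, MockError *err)
{
    assert(!err->set);
    if (dram_size < 128 * MIB) {
        snprintf(err->msg, sizeof(err->msg),
                 "npcm7xx_gcr: DRAM size %llu is too small (128 MiB minimum)",
                 (unsigned long long)dram_size);
        err->set = true;
        return false;
    }
    return true;
}

/* Stand-in for the SoC realize: hands the child's error to its caller. */
static bool mock_soc_realize(uint64_t dram_size, MockError *err)
{
    if (!mock_gcr_realize(dram_size, err)) {
        return false;   /* propagate; nothing set up so far needs undoing */
    }
    return true;
}

/* Stand-in for machine init: nowhere left to propagate to, so report. */
static int mock_machine_init(uint64_t dram_size)
{
    MockError err = { .set = false };

    if (!mock_soc_realize(dram_size, &err)) {
        fprintf(stderr, "qemu-system-arm: %s\n", err.msg);
        return 1;
    }
    return 0;
}
```

With 64 MiB the user gets the plain message on stderr and a nonzero status,
with no "Unexpected error" banner and no abort().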



