qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-devel] [PATCH 2/9] s390x: fix invalid use of cc 1 for SSCH


From: Cornelia Huck
Subject: Re: [Qemu-devel] [PATCH 2/9] s390x: fix invalid use of cc 1 for SSCH
Date: Tue, 5 Sep 2017 10:02:34 +0200

On Thu, 31 Aug 2017 12:41:05 +0200
Halil Pasic <address@hidden> wrote:

> On 08/31/2017 11:19 AM, Cornelia Huck wrote:
> > On Wed, 30 Aug 2017 18:36:02 +0200
> > Halil Pasic <address@hidden> wrote:
> >   
> >> According to the POP a start subchannel instruction (SSCH) returning with
> >> cc 1 implies that the subchannel was status pending when SSCH executed.
> >>
> >> Due to a somewhat confusing error handling, where error codes are mapped
> >> to cc value, sane looking error codes result in non AR compliant
> >> behavior.
> >>
> >> Let's fix this! Instead of cc 1 we use cc 3 which means device not
> >> operational, and is much closer to the truth in the given cases.
> >>
> >> Signed-off-by: Halil Pasic <address@hidden>
> >> Acked-by: Pierre Morel<address@hidden>
> >> ---
> >>
> >> This patch turned out quite controversial. We did not reach a consensus
> >> during the internal review.
> >>
> >> The most of the discussion revolved around the ORB flag which
> >> architecturally must be supported, but are currently not supported by
> >> vfio-ccw (not yet, or can't be). The idea showing the most promise for
> >> consensus was to handle this via device status (along the lines better a
> >> strange acting device than a non-conform machine) but since it's a
> >> radical change we decided to first discuss upstream and then do whatever
> >> needs to be done.
> >> ---
> >>  hw/s390x/css.c      | 15 ++++++---------
> >>  hw/s390x/s390-ccw.c |  2 +-
> >>  2 files changed, 7 insertions(+), 10 deletions(-)
> >>
> >> diff --git a/hw/s390x/css.c b/hw/s390x/css.c
> >> index a50fb0727e..0822538cde 100644
> >> --- a/hw/s390x/css.c
> >> +++ b/hw/s390x/css.c
> >> @@ -1034,7 +1034,7 @@ static int 
> >> sch_handle_start_func_passthrough(SubchDev *sch)
> >>       */
> >>      if (!(orb->ctrl0 & ORB_CTRL0_MASK_PFCH) ||
> >>          !(orb->ctrl0 & ORB_CTRL0_MASK_C64)) {
> >> -        return -EINVAL;
> >> +        return -ENODEV;  
> > 
> > This feels wrong. If we don't support this yet, doing something like a
> > channel-program check or an operand exception feels closer to the
> > architecture than indicating a gone device.  
> 
> I disagree, a channel-program check or an operand exception, or cc 1
> (current solution) makes the machine obviously non-conform.

Well, it *is* not compliant...

> 
> My train of thought was that architecturally you can loose connection
> to the device at any time (you can't prohibit admins pulling cables or
> smashing equipment with a 10kg hammer).

But that's not what you're doing: Iff you want to signal 'device
gone' (and I'm not convinced it's a good idea), you need to do more
than signal cc 3.

> 
> Also from the guest OS perspective I think saying device not operational
> could provoke a proper reaction form the guest OS: that is just give up
> on the device. The things you propose would in my opinion put the blame
> on the guest OS driver (making non-conform requests) so in that case
> it would make sense to give up on the driver (but the same driver could
> wonderfully work with let's say a fully emulated device).

I'd not call that 'putting blame on the driver'. What happens is that
we signal the driver 'you send us something we cannot deal with' - the
more common reason for that is that the driver built an invalid
request, but I've certainly seen rejects for stuff that simply was not
implemented in the past.

The important thing is that I don't want to lie to the driver: The
device is there, and will work with a different request. Also see my
comment above.

(The real fix is of course to implement what is required by the
architecture :), but I don't think cc 3 is a good stop-gap measure.)

> 
> As I have stated in the cover letter of this patch, I would find
> setting device status even better, but I wanted to discuss first
> before going from setting cc to something else.
> 
> Setting cc was not my idea in the first place (AFAIK the -EINVAL
> here effectively triggers cc 1).

Something else than cc 1 is better, yes (but the intention was probably
a channel-program check or so).

> 
> >   
> >>      }
> >>  
> >>      ret = s390_ccw_cmd_request(orb, s, sch->driver_data);
> >> @@ -1046,16 +1046,13 @@ static int 
> >> sch_handle_start_func_passthrough(SubchDev *sch)
> >>          break;
> >>      case -ENODEV:
> >>          break;
> >> +    case -EFAULT:
> >> +         break;
> >>      case -EACCES:
> >>          /* Let's reflect an inaccessible host device by cc 3. */
> >> -        ret = -ENODEV;
> >> -        break;
> >>      default:
> >> -       /*
> >> -        * All other return codes will trigger a program check,
> >> -        * or set cc to 1.
> >> -        */
> >> -       break;
> >> +        /* Let's make all other return codes map to cc 3.  */
> >> +        ret = -ENODEV;  
> > 
> > Why? This feels wrong. For those cases where we want to signal an error
> > but cc 1 is conceptually wrong, either an operand exception (for very
> > few cases) or a channel-program check feels more in line with the
> > architecture.  
> 
> You mean the original code feels wrong, or? I keep the program check for
> -EFAULT (that's why it's added) and just change cc 1 to cc 3 for the
> not explicitly handled error codes (reason stated in the commit message).

I mean that both feel wrong :) Moving away from abuses of cc 1 makes
sense, but I don't think cc 3 is the correct direction.

> 
> > 
> > That's a general problem with doing stuff in the hypervisor: We have
> > sets of internal problems that obviously don't show up in the
> > architecture, and we can either handle them internally or use what the
> > architecture offers for problem signaling. z/VM has probably faced the
> > same problems :)  
> 
> I agree.
> 
> >   
> >>      };
> >>  
> >>      return ret;
> >> @@ -1115,7 +1112,7 @@ static int do_subchannel_work(SubchDev *sch)
> >>      if (sch->do_subchannel_work) {
> >>          return sch->do_subchannel_work(sch);
> >>      } else {
> >> -        return -EINVAL;
> >> +        return -ENODEV;  
> > 
> > This rather seems like a job for an assert? If we don't have a function
> > for the 'asynchronous' handling of the various functions assigned for a
> > subchannel, that looks like an internal error.
> >   
> 
> IMHO it depends. Aborting qemu is heavy handed, and as an user I would not
> be happy about it. But certainly it is an assert situation.  We can look for
> an even better solution, but I think this is an improvement. The logic behind
> is that the device is broken and can't be talked to properly.

We currently don't have a vast array of subchannel types (and are
unlikely to get more types that need a different handler function). We
know the current ones are fine, and an assert would catch programming
errors early.

> 
> >>      }
> >>  }
> >>  
> >> diff --git a/hw/s390x/s390-ccw.c b/hw/s390x/s390-ccw.c
> >> index 8614dda6f8..2b0741741c 100644
> >> --- a/hw/s390x/s390-ccw.c
> >> +++ b/hw/s390x/s390-ccw.c
> >> @@ -25,7 +25,7 @@ int s390_ccw_cmd_request(ORB *orb, SCSW *scsw, void 
> >> *data)
> >>      if (cdc->handle_request) {
> >>          return cdc->handle_request(orb, scsw, data);
> >>      } else {
> >> -        return -ENOSYS;
> >> +        return -ENODEV;  
> > 
> > If we get here, it means that we called a request handler (which is
> > only done for the passthrough variety) without having assigned a
> > request handler beforehand. This also looks like an internal error to
> > me...
> >   
> 
> Certainly. Again I was not the one who wrote or accepted the original
> code. My previous comment about whether assert or not applies here as
> well.

My answer applies even more here :)



reply via email to

[Prev in Thread] Current Thread [Next in Thread]