Re: [Qemu-devel] [Bug 1745312] [NEW] Regression report: Disk subsystem I

qemu-devel
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [Qemu-devel] [Bug 1745312] [NEW] Regression report: Disk subsystem I

From:	Stefan Hajnoczi
Subject:	Re: [Qemu-devel] [Bug 1745312] [NEW] Regression report: Disk subsystem I/O failures/issues surfacing in DOS/early Windows [two separate issues: one bisected, one root-caused]
Date:	Mon, 29 Jan 2018 13:37:35 +0000
User-agent:	Mutt/1.9.1 (2017-09-22)
On Thu, Jan 25, 2018 at 07:18:52AM -0000, i336_ wrote:
> Public bug reported:
> 
> [Headsup: This report is long-ish due to the amount of detail I've
> stumbled on along the way that I think is relevant to include. I can't
> speak as to the complexity of the actual bugs, but the size of this
> report should not suggest that the reproduction process is particularly
> headache-inducing.]

I've CCed people who may be able to help.

I don't have time to read through everything you've posted.

> Hi!
> 
> I recently needed to fire up some ancient software for research purposes
> and got very distracted discovering and playing with old versions of
> Windows :). In the process I've discovered some glitches with disk I/O.
> 
> I believe I've stumbled on two completely separate issues that
> coincidentally surfaced at the same time. It's possible that components
> of this report will be re-filed as more specific new bugs, but I'm not
> an authority on QEMU internals or how to narrow down/categorize what
> I've found.
> 
> - The first bug only surfaces when the "isapc" machine type is used. It
> intermittently produces "General failure {read,writ}ing drive _" under
> MS-DOS 6.22, and also somehow interferes with early bootstrap of Windows
> NT 4 (in NTLDR). Enabling or disabling KVM (I'm on Linux) appears to
> make no difference whatsoever, which may help with debugging.

Is this using the IDE disk controller?  In that case John Snow can help
you debug what's going on at the IDE level.

> - The second issue involves
>   - a WinNT4 disk image
>   - created by running through a bog-standard NT4 install inside QEMU 2.9.0
>   - which will now fail to boot in any version of QEMU - even version 1.0
>     - but which VirtualBox will boot fine
>       - but only if I point VirtualBox at QEMU's raw disk image via a
>         hacked-together VMDK file
>       - if the raw image is converted to VHD(X), VirtualBox will also fail
>         to boot the image with exactly the same error as QEMU
>       - this state of affairs is not affected by image sparseness (which makes
>         sense)

VMDK stores the disk geometry (cylinders, heads, sectors), which may
affect guest software.  I've CCed Fam Zheng.

> 
> I'm confident I've bisected the first issue.
> 
> I wasn't able to bisect the second issue (as all tested versions of QEMU
> behaved identically), but I've figured out a working repro testcase and
> I believe I've managed to pin down a solid root cause.
> 
> 
> == #1: Intermittent I/O issues when `-M isapc` is used =====
> 
> These symptoms sometimes take a small amount of time and fiddling to
> trigger, but I AM able to consistently surface them on my machine after
> a short while. (I am very very interested to hear if others cannot
> reproduce them.)
> 
> So, first of all:
> 
> https://github.com/qemu/qemu/commit/306ec6c3cece7004429c79c1ac93d49919f1f1cc
>   (Jul 30 2013): the last version that works
> 
> https://github.com/qemu/qemu/commit/e689f7c668cbd9d08f330e17c3dd3a059c9553d3
>   (Oct 30 2013): the first version that intermittently fails
> 
> Maybe lift out and build these branches while reading. *shrug*
> (How to do this can be found at the end of this report - along with a 
> time-saving ./configure line, FWIW)
> 
> Here are the changelists between these two revisions:
> 
> https://github.com/qemu/qemu/compare/306ec6c...e689f7c
> (Compare direction: OLD to NEW) (Commits: 166  Files changed: 192)
> 
> https://github.com/qemu/qemu/compare/e689f7c...306ec6c
> (Compare direction: NEW to OLD) (Commits: 30   Files changed: 22)
> 
> (Someone else more familiar with Git might know why GitHub returns
> results for both compare directions, and/or if the 2nd link is useful
> information. The first link returns a lot more results than the 2nd one,
> at least. Does comparing new>old return deletions?)
> 
> ---
> 
> Now on to the symptoms. In a moment I'll describe reproduction.
> 
> # MS-DOS 6.22
> 
> The first symptom I discovered was that trivial read and write
> operations under MS-DOS would sometimes fail:
> 
>   C:\>echo test > hi
> 
>   General failure writing drive C
>   Abort, Retry, Fail?
> 
> Anything else that exercises the disk behaves similarly:
> 
>   C:\>dir /s > nul
> 
>   General failure reading drive C
>   Abort, Retry, Fail?
> 
> (Note that the above demonstrates both write and read failures)
> 
> (Also, FWIW, `dir /s` == `ls -R`)
> 
> The behavior of the I/O errors is not possible to characterise as it
> fluctuates so much. For example something as simple as DIR can produce
> wildly differing results: in one run, poking around with DIR ended with
> DOS deciding C:\ was empty at one point; at another point in a different
> run C:\ mysteriously dropped 50% of its contents only to magically gain
> it all back moments later after some poking around in one of the
> subdirectories that was still visible.
> 
> The time it takes to trigger these errors is also highly variable. QEMU
> may fall over as early as hanging forever at "Starting MS-DOS...", or I
> might get all the way into Windows 3.1 before it triggers (in which case
> Win3.1 reports vague memory errors - of all things).
> 
> Very occasionally I've seen _SeaBIOS itself_ report "Booting from Hard
> Disk..." "Boot failed: could not read the boot disk" ... "No bootable
> device.", and on one occasion I even got "Non-System disk or disk error"
> "Replace and strike any key when ready"!
> 
> 
> # WinNT 4 Terminal Server
> 
> Most of the time, NTLDR will fire up normally. But every so often...
> 
>   SeaBIOS (version rel-1.7.3-117-g31b8b4e-
> 20131206_080705-nilsson.home.kraxel.org)
> 
>   Booting from Hard Disk...
>   A disk read error occurred.
>   Insert a system diskette and restart
>   the system.
> 
> (NB. You're seeing the old SeaBIOS version included with e689f7c, which
> was the first buggy commit.)
> 
> If NT gets past this point without erroring out (ie, it makes it to the
> boot menu), the rest of the system is 100% fine and there are no other
> disk I/O issues whatsoever. For example, on QEMU 2.9.0 I was able to
> enable disk compression, answer "Yes" to "Compress entire disk now?" and
> have the process fully complete. No hitches.
> 
> This makes me vaguely recall/wonder that perhaps this could be somehow
> related to LBA and/or Int 13h, or something floating around near that
> bunch of functionality. (I'm woefully ignorant about such low-level
> details.) Perhaps DOS/Win3.1 are stuck using a disk mode that QEMU has a
> buggy implementation of, while NT 4 (once NTOSKRNL is up and running) is
> able to use a different disk mode or access mechanism.
> 
> I'm really interested to get some understanding of what the root issue
> is here, when this is fixed. (I wonder if it's a timing thing?)
> 
> I've observed some unusual behavior with repeated restarts. In one case,
> I attempted to start NT4 multiple times, and QEMU consistently failed
> with "No bootable device" each time. So, I removed `-M isapc`, promptly
> got a boot menu, hit ^C, readded `-M isapc` - and continued to get a
> boot menu. Yep. I'll accept "really really big coincidence" but I do
> very much wonder if something else is going on here. I've observed many
> similar incidents. It makes me wonder whether the contents of memory or
> some other system state is an influence. Very probably not, but still...
> 
> 
> -- Reproduction --------------------------------------------
> 
> First of all, there was unfortunately no way for me to avoid having to
> post entire disk images, but I've managed to compress everything down to
> 174MB total download size.
> 
> FWIW, WinWorld and many other sites seem to have no operational issues
> providing clear pointers to CD keys; I consider my distribution of my
> installed HDD images an extension of the apparent status quo.
> 
> That being said, I've put everything on Google Drive so nobody has to
> headscratch about Launchpad/Canonical/etc's stance on hosting this data.
> 
> So, this folder contains the disk images:
> https://drive.google.com/drive/folders/1WdZVBh5Trs9HLC186-nHyKeqaSxfyM2c
> ("Download all" at the top-right will create a ZIP file, but FWIW
> downloading the individual files simultaneously would implement a rough
> form of download acceleration)
> 
> File meta info:
> 
> Compressed
> |
> |      Apparent
> |      |    Actual
> |      |    |
> 38M -> 200M (103M)  win31.img.xz
> 82M -> 1G   (289M)  wnt4ts-broken.img.xz
> 55M -> 350M (146M)  wnt4ts-intermittent.img.xz
> 
> SHA-256s:
> 
> win31:        8179b8180a2ab40bd472e8a2f3fb89fc331651e56923f94ceb9e52a78ee220d2
> broken:       
> a2af5f0bc49a063b75f534b6ffe5b82e32ecc706a64a425b6626feccf6e3fdfa  
> intermittent: 77ae8c458829ebcdd64c71042012f45d5a2788e6ebd22db9d53de9ef1a574784
> 
> (Wanted to keep the checksum lines within 80 columns)
> 
> And, since I can't figure out where else in this report to put this,
> wnt4ts-broken.img's password is "admin" but something seems to have
> happened to the disk and NT doesn't actually boot properly :(, and
> wnt4ts-intermittent.img's password is "1234". (These were set up as test
> images. Now I'm _really_ glad I used simple passwords! :) )
> 
> ---
> 
> 
> I have two testcases: DOS 6.22 (+ Windows 3.1), and Windows NT 4.
> 
> 
> # MS-DOS
> 
> DOS is the simplest. It basically consists of
> 
> $ qemu-system-i386 -drive file=win31.img,format=raw -M isapc -enable-kvm
> 
> And then literally just playing around. Things to try include creating
> files (`echo blah > file`), repeatedly seeking across the entire FAT
> (`dir /s > nul` or `dir /s`), and launching Windows (`win`).
> 
> win31.img is not special (as far as I can tell) and merely consists of
> the result of installing DOS 6.22 and Windows 3.1 from WinWorldPC. I've
> basically just included the image for convenience.
> 
> Generally no single "run" is immune to starting Win3.1 and then
> launching File Manager; if that doesn't generate an error, something is
> definitely up.
> 
> The second best trigger is creating new files. That very very frequently
> produces "General Failure ...", but not always.
> 
> 
> # WinNT 4
> 
> Windows NT 4 is a bit more complicated. Because this error only occurs
> at presumably a single small point very early in boot, the window of
> opportunity for the glitch to surface within is much much narrower and
> thus often requires a larger number of tries.
> 
> Anecdotally I've had QEMU hit the boot error at the first try/run, and
> after as many as 63 "successful" boots.
> 
> I made a small test harness that automates the launch process. It
> consists of two shell scripts and requires tmux (and netcat).
> (*Potential epilepsy warning*: if you use a light-colored terminal
> background, the terminal QEMU is repeatedly invoked from will
> continuously flash rapidly from white to black.)
> 
> One of the scripts is run inside a tmux session in one terminal, while
> the other script is run in its own terminal (without any tmux).
> 
> 
> I named this one `run-qemu-loop`:
> 
> --8<--------------------------------------------------------
> 
> #!/bin/bash
> 
> # ---
> 
> qemu=/path/to/qemu-system-i386
> #or, alternatively: (I used the following line myself so I
> #could tab-complete my way to different qemu executables)
> #qemu="$1"
> 
> disk=/path/to/wnt4ts-intermittent.img
> 
> # ---
> 
> port=4444
> 
> rm -f STOP itercount
> 
> itercount=0
> 
> while :; do
>       
>       [ -f STOP ] && break
>       
>       ((itercount++))
>       echo $itercount > itercount
>       
>       $qemu \
>               -enable-kvm -vga cirrus -curses -M isapc \
>               -drive file="$disk",format=raw \
>               -chardev socket,id=mon0,host=localhost,port=$port,server,nowait 
> \
>               -mon chardev=mon0,mode=readline
>       
>       #point to an otherwise-unused terminal if you like (see also: `tty`)
>       #echo "$itercount run(s)" > /dev/pts/__
>       
> done
> 
> ------------------------------------------------------------
> 
> Not much logic above; this just repeatedly runs QEMU for as long as
> the file `STOP` does not exist in the current directory.
> 
> The key "magic" bit is that QEMU is launched in -curses mode.
> 
> The other key bit is that the above script is run inside tmux.
> 
> 
> Here's `tmux-ctl-loop`:
> 
> --8<--------------------------------------------------------
> 
> #!/bin/bash
> 
> port=4444
> 
> tmux=./tmux
> 
> printf -v l '%0.0s-' {0..25}
> h1="$l/ buffer dump begin \\$l"
> h2="$l-\\ buffer dump end /-$l"
> 
> while :; do
>       
>       while :; do
>               echo | nc localhost $port -q0 -w1 > /dev/null && break
>               echo 'Start qemu!'
>       done
>       
>       buffer="$(tmux -S $tmux capture-pane; tmux -S $tmux save-buffer -)"
>       
>       echo "$h1"
>       [[ "$buffer" ]] && echo "$buffer" || echo '( * Screen buffer is empty * 
> )'
>       echo "$h2"
>       
>       if echo "$buffer" | grep -q 'A disk read error occurred.'; then
>               
>               s="<Crashed after $(< itercount) runs.>"
>               echo "$s"
>               echo "$s" >> stats
>               
>               touch STOP
>               
>               #echo q | nc localhost $port -q0 > /dev/null
>               
>               exit
>               
>       elif echo "$buffer" | grep -q 'OS Loader V4.00'; then
>                       
>               echo '<Booted successfully, trying again>'
>               
>               echo q | nc localhost $port -q0 > /dev/null
>               
>       else
>               
>               echo '<Waiting for boot>'
>               
>       fi
>                       
> done
> 
> ------------------------------------------------------------
> 
> Nothing particularly amazing going on here either.
> 
> While `qemu-run-loop` is running inside tmux in the first terminal, this
> is running in the 2nd one.
> 
> The small infinite loop at the top only breaks when it can successfully
> ping QEMU and it knows it's running.
> 
> Then, a screendump of the contents of the terminal QEMU is in is fetched
> from tmux, and the buffer's content is analyzed.
> 
> - If NTLDR fails, the script creates `STOP` to halt qemu-run-loop,
>   sends `q` to QEMU through netcat, and then the script exits.
> 
> - If NTLDR loads successfully, the script sends `q` to QEMU and continues
>   looping. (qemu-run-loop will not find the `STOP` file, and so restart qemu.)
> 
> The scripts run very quickly, with 2-3 iterations per second on my i3
> box.
> 
> 
> # Usage
> 
> Save the two scripts above to the same directory as wnt4ts-intermittent.img,
> then:
> 
> - (If port 4444 doesn't work, the value needs to be changed in both
> scripts.)
> 
> - In the first terminal, run `tmux -S <file>`, where <file> names the socket
>   tmux will use. This needs to match "tmux=" at the top of `tmux-ctl-loop`.
>   (with `tmux=./tmux`, the command would be `tmux -S tmux`)
> 
> - Still in the first terminal (and now also inside tmux), enter
>   `./qemu-run-loop`, passing the path to qemu if you're using that approach
>   (refer to the first few lines of the script). Don't hit enter yet.
> 
> - Now, in the 2nd terminal, type `./tmux-ctl-loop`
> 
> - Hit enter in both terminals.
>  
> 
> Rationale for timing of Enter key:
> 
> - Running qemu-run-loop first will start QEMU, and if NTLDR starts
>   successfully it will immediately begin counting down from 30. If NT actually
>   starts to boot and is then hard-shut-down this /may/ affect the disk image
> 
> - tmux-ctl-loop will annoyingly spam a continuous stream of 'Start qemu!' 
> until
>   qemu-run-loop is running.
> 
> - Starting both scripts at "more or less" the same time (no rush) works out
>   well.
> 
> 
> Hopefully potential script modifications are obvious; for example
> 
> - changing tmux-ctl-loop to not send 'q' to qemu so you can connect to the HMP
>   yourself
>   (NB, if `STOP` is not created, when qemu finally exits it will of course
>   promptly be relaunched)
> 
> - pointing run-qemu-loop to a modified qemu binary
> 
> 
> == #2: QEMU-vs-VirtualBox image issue ======================
> 
> I was initially completely stumped by this issue, perhaps unsurprisingly
> so. :)
> 
> wnt4ts-broken.img is a perfectly ordinary NT 4 installation that was
> created in QEMU 2.9.0. I created a 1GB disk with `truncate`, picked NTFS
> and installed everything (which took a while).
> 
> NT setup reboots a number of times during the boot process, and IIRC
> those all went just fine. However, at some point, the image began to
> consistently bomb out with "A disk read error occurred. ...", and
> stubbornly refused to boot, regardless of the number of boot attempts I
> tried.
> 
> QEMU 2.0.0, 1.5.0, and 1.0 (the earliest version I was able to build on
> my system) all consistently hit "disk read error occurred".
> 
> I tried compiling QEMU 1.0 using clang so I could build for 32-bit on my
> 64-bit system (GCC 7 died with "Frame pointer required, but reserved").
> The resulting qemu completely crashed if I didn't enable KVM (ie, TCG
> was (understandably) broken); with KVM enabled qemu didn't crash, but
> NTLDR halted with the same error as on 64-bit qemu. (TL;DR, no
> difference whatsoever.)
> 
> My initial reaction at this point was to try the image on another
> virtualization platform. My first pick was VirtualBox.
> 
> So, I followed the official instructions for pointing VirtualBox to
> physical disk images, except I substituted a /dev/loopN device I'd
> pointed to the image file via losetup.
> 
> And... VirtualBox picked the image up fine and Just Worked(TM). Yay! -
> but not yay. What gives?!
> 
> Confused, I then tried to convert the disk image to VHD format.
> Unfortunately, for some reason, if I try `qemu-image convert ... -O vhdx
> ...`, VirtualBox chokes on the result:
> 
> -----
> 
> VD: error VERR_NOT_SUPPORTED opening image file
> '/.../wnt4ts-broken-qemuconv.vhd' (VERR_NOT_SUPPORTED).
> 
> Result Code: NS_ERROR_FAILURE (0x80004005)
> Component: MediumWrap
> Interface: IMedium {4afe423b-43e0-e9d0-82e8-ceb307940dda}
> Callee: IVirtualBox {0169423f-46b4-cde9-91af-1e9d5b6cd945}
> Callee RC: VBOX_E_OBJECT_NOT_FOUND (0x80BB0001)
> 
> -----
> 
> Welp.
> 
> Well, a bit more digging later, and I found I could do
> 
> $ VBoxManage convertfromraw wnt4ts-broken.img wnt4ts-broken.vhd
> 
> but... as soon as I pointed VirtualBox to this, it too began to choke
> with "A disk read error occurred".
> 
> And yet, the VMDK->raw image setup worked just fine.
> 
> I found I could even replace the loop device with the path of the .img
> file itself and that worked just fine too.
> 
> At my wits' end, I followed some online instructions to learn about
> manual CHS configuration so I could try and get the image working in
> Bochs. "A disk read error occurred". I wasn't surprised.
> 
> It was at this point I began to give up, but I decided to try One Last
> Thing(TM) before properly throwing in the towel.
> 
> :)
> 
> I decided to learn a bit more about how `VBoxManage internalcommands
> createrawvmdk` worked, and try one thing in particular: I can edit the
> .vmdk file, but can I point `createrawvmdk` at the .img file directly
> too?
> 
> Turns out, yes you can.
> 
> It also turns out that this promptly caused VirtualBox to bomb out.
> 
> Interesting.
> 
> For reference, here's the VMDK file I initially created (by pointing
> `createrawvmdk` at /dev/loopN) and then later edited to point straight
> to the .img file, with both approaches resulting in successful boot.
> 
> --8<--------------------------------------------------------
> 
> # Disk DescriptorFile
> version=1
> CID=e35b9a45
> parentCID=ffffffff
> createType="fullDevice"
> 
> # Extent description
> RW 1536000 FLAT "/absolute/full/path/to/wnt4ts-broken.img" 0
> 
> # The disk Data Base 
> #DDB
> 
> ddb.virtualHWVersion = "4"
> ddb.adapterType="ide"
> ddb.geometry.cylinders="1523"
> ddb.geometry.heads="16"
> ddb.geometry.sectors="63"
> ddb.uuid.image="871a6044-c8ca-48ed-b7aa-e6fc49da3db4"
> ddb.uuid.parent="00000000-0000-0000-0000-000000000000"
> ddb.uuid.modification="3661715c-3906-4e4a-ab65-486d140e03b8"
> ddb.uuid.parentmodification="00000000-0000-0000-0000-000000000000"
> ddb.geometry.biosCylinders="761"
> ddb.geometry.biosHeads="32"
> ddb.geometry.biosSectors="63"
> 
> ------------------------------------------------------------
> 
> 
> Here's the _diff_ of what happens if I point `createrawvmdk` at 
> wnt4ts-broken.img directly:
> 
> --8<--------------------------------------------------------
> 
> ddb.geometry.cylinders="2080"
> ddb.geometry.heads="16"
> ddb.geometry.sectors="63"
> 
> ------------------------------------------------------------
> 
> :D
> 
> Naturally,
> 
> $ qemu-system-i386 -drive file=wnt4ts-
> broken.img,format=raw,cyls=1523,heads=16,secs=63 -M isapc -sdl
> 
> will boot happily on 2.9.0 (notwithstanding the occasional "disk read
> error occurred" documented above).
> 
> It will also boot in 1.6.0.
> 
> (POTENTIAL BUG HEADSUP: 1.0 and 1.5.0 both lock up with a blank 640x480
> window and use 0% CPU if I specify `-M isapc`.)
> 
> And, of course, using these CHS values in Bochs also results in
> successful boot as well (after setting the CPU type to pentium).
> 
> Unfortunately, I have no idea what sequence of events caused the
> creation of the VMDK file above. No invocation of `createrawvmdk` is
> producing a VMDK file with the CHS settings above.
> 
> I've only just begun to learn about the intricacies of CHS. Am I to
> understand that these values are stored amongst the first 512 bytes of
> the disk? If this is the case, then I wonder what changed the data, and
> why. I was initially only using QEMU 2.9.0, and didn't move the image to
> different VMs or QEMU versions. Perhaps Windows NT got confused about
> the disk CHS and rewrote it?
> 
> 
> == Sporadic BIOS-level boot failure ========================
> 
> I have multiple screenshots of SeaBIOS in QEMU 2.9.0 halting with "No
> bootable device" (et al), even with the above manually-applied CHS
> settings.
> 
> Commit e689f7c also presents such errors.
> 
> Commit 306ec6c does not suffer from intermittent breakage of any kind:
> 
> - No SeaBIOS flake-outs
> - No "Non-system disk or disk error"
> - No "A disk error has occurred"
> - No "General failure ..."
> 
> While most of my confidence in commit 306ec6c is based on anecdotal
> evidence, I modified `tmux-ctl-loop` a little to soak-test BIOS-level
> I/O stability and left this modified version running for a few minutes.
> 
> --8<--------------------------------------------------------
> 
> #!/bin/bash
> 
> port=4444
> 
> tmux=./tmux
> 
> printf -v l '%0.0s-' {0..25}
> h1="$l/ buffer dump begin \\$l"
> h2="$l-\\ buffer dump end /-$l"
> 
> while :; do
>       
>       while :; do
>               echo | nc localhost $port -q0 -w1 > /dev/null && break
>               echo 'Start qemu!'
>       done
>       
>       buffer="$(tmux -S $tmux capture-pane; tmux -S $tmux save-buffer -)"
>       
>       echo "$h1"
>       [[ "$buffer" ]] && echo "$buffer" || echo '( * Screen buffer is empty * 
> )'
>       echo "$h2"
>       
>       if echo "$buffer" | grep -q 'Non-system disk' || echo "$buffer" | \
>               grep -q 'No bootable device'
>       then
>               
>               s="<Hit error after $(< itercount) runs.>"
>               echo "$s"
>               echo "$s" >> stats
>               
>               touch STOP
>               
>               #echo q | nc localhost $port -q0 > /dev/null
>               
>               exit
>               
>       elif echo "$buffer" | grep -q 'OS Loader V4.00' || echo "$buffer" | \
>               grep -q 'A disk read error'
>       then
>       
>               echo '<Boot did not hang at BIOS, trying again>'
>               
>               echo q | nc localhost $port -q0 > /dev/null
>               
>       else
>               
>               echo '<Waiting for boot>'
>               
>       fi
>                       
> done
> 
> ------------------------------------------------------------
> 
> For the above to work, the top of run-qemu-loop must also be modified to
> read something along the lines of
> 
> disk=/path/to/wnt4ts-broken.img,format=raw,cyls=1523,heads=16,secs=63
> 
> (Suggestion: modify copies of both scripts)
> 
> One small terminal-flicker-headache (and a 57°C CPU) later, I was able
> to carefully observe just over 350 successful runs in which QEMU commit
> 306ec6c only ever produced a boot menu. No other hitches.
> 
> ** Important: **
> 
> However, commit 306ec6c will fail to boot, ever, if the cylinders and
> geometry are not set to the values VirtualBox "discovered". (Of note is
> the fact that QEMU (2.9.0) was what initially created this image. I must
> admit that I don't remember what sequence of QEMU versions I fed the
> image to - and I maybe, possibly, didn't think to back the file up
> (sorry), so maybe something mangled something somewhere. But VirtualBox
> figured it out nonetheless!)
> 
> Furthermore, feeding /dev/loopN to any QEMU version will NOT result in
> correct CHS discovery (and successful boot).
> 
> This is what leads me to conclude that I've discovered two separate
> issues.
> 
> 
> == Appendix: How to build the branches =====================
> 
> It's very simple.
> 
> First, `git clone https://github.com/qemu/qemu` somewhere if you don't
> already have a local copy. If you have an old git checkout that's from
> 2014 or later, you can use that old checkout instead. (If you want to
> test an old checkout you have, the commands below will either work
> perfectly or completely bomb out with no side effects.)
> 
> A full checkout is a ~183MB download. Sorry.
> 
> Next, create two new directories somewhere. Name them what you like, eg
> `qemu-working` and `qemu-broken`.
> 
> Now, cd into the checkout directory, and run:
> 
> $ git archive 306ec6c3cece7004429c79c1ac93d49919f1f1cc | tar xC /path/to
> /qemu-working/
> 
> $ git archive e689f7c668cbd9d08f330e17c3dd3a059c9553d3 | tar xC /path/to
> /qemu-broken/
> 
> The paths can be relative.
> 
> Now, run this in both of the new directories:
> 
> $ ./configure --python=python2.7 --disable-libssh2 --disable-seccomp
> --disable-usb-redir --disable-guest-agent --disable-libiscsi --disable-
> spice --disable-smartcard-nss --disable-vhost-net --disable-docs
> --disable-attr --disable-cap-ng --disable-vde --disable-user --disable-
> bluez --disable-vnc-ws --disable-xen --disable-brlapi --enable-debug
> --target-list=i386-softmmu --disable-fdt
> 
> $ make -j64
> 
> You can open two terminals and configure and build both simultaneously
> if you like.
> 
> On my decent but very basic (2-core+HT) i3 box, -j64 actually works out - 
> make doesn't actually launch too many gcc processes. You *will* see your 
> system load spike to ~20 though :)
> (NB. Do. not. use. -j64. with. the. linux. kernel.)
> 
> On my system, a single build with -j64 takes only about 35 seconds. C
> FTW. (Although this has increased to 1min20sec for more recent builds.)
> 
> Most of the configure arguments remove functionality I'll never use (in
> this situation) and which will only slow down the build.
> 
> Once QEMU is built, run qemu-system-i386 directly from where it has been
> built.
> 
> $ /path/to/qemu-working/i386-softmmu/qemu-system-i386 ...
> $ /path/to/qemu-broken/i386-softmmu/qemu-system-i386 ...
> 
> Again, the paths can be relative.
> 
> ** Affects: qemu
>      Importance: Undecided
>          Status: New
> 
> 
> ** Tags: disk io qemu
> 
> -- 
> You received this bug notification because you are a member of qemu-
> devel-ml, which is subscribed to QEMU.
> https://bugs.launchpad.net/bugs/1745312
> 
> Title:
>   Regression report: Disk subsystem I/O failures/issues surfacing in
>   DOS/early Windows [two separate issues: one bisected, one root-caused]
> 
> Status in QEMU:
>   New
> 
> Bug description:
>   [Headsup: This report is long-ish due to the amount of detail I've
>   stumbled on along the way that I think is relevant to include. I can't
>   speak as to the complexity of the actual bugs, but the size of this
>   report should not suggest that the reproduction process is
>   particularly headache-inducing.]
> 
>   Hi!
> 
>   I recently needed to fire up some ancient software for research
>   purposes and got very distracted discovering and playing with old
>   versions of Windows :). In the process I've discovered some glitches
>   with disk I/O.
> 
>   I believe I've stumbled on two completely separate issues that
>   coincidentally surfaced at the same time. It's possible that
>   components of this report will be re-filed as more specific new bugs,
>   but I'm not an authority on QEMU internals or how to narrow
>   down/categorize what I've found.
> 
>   - The first bug only surfaces when the "isapc" machine type is used.
>   It intermittently produces "General failure {read,writ}ing drive _"
>   under MS-DOS 6.22, and also somehow interferes with early bootstrap of
>   Windows NT 4 (in NTLDR). Enabling or disabling KVM (I'm on Linux)
>   appears to make no difference whatsoever, which may help with
>   debugging.
> 
>   - The second issue involves
>     - a WinNT4 disk image
>     - created by running through a bog-standard NT4 install inside QEMU 2.9.0
>     - which will now fail to boot in any version of QEMU - even version 1.0
>       - but which VirtualBox will boot fine
>         - but only if I point VirtualBox at QEMU's raw disk image via a
>           hacked-together VMDK file
>         - if the raw image is converted to VHD(X), VirtualBox will also fail
>           to boot the image with exactly the same error as QEMU
>         - this state of affairs is not affected by image sparseness (which 
> makes
>           sense)
> 
>   I'm confident I've bisected the first issue.
> 
>   I wasn't able to bisect the second issue (as all tested versions of
>   QEMU behaved identically), but I've figured out a working repro
>   testcase and I believe I've managed to pin down a solid root cause.
> 
> 
>   == #1: Intermittent I/O issues when `-M isapc` is used =====
> 
>   These symptoms sometimes take a small amount of time and fiddling to
>   trigger, but I AM able to consistently surface them on my machine
>   after a short while. (I am very very interested to hear if others
>   cannot reproduce them.)
> 
>   So, first of all:
> 
>   https://github.com/qemu/qemu/commit/306ec6c3cece7004429c79c1ac93d49919f1f1cc
>     (Jul 30 2013): the last version that works
> 
>   https://github.com/qemu/qemu/commit/e689f7c668cbd9d08f330e17c3dd3a059c9553d3
>     (Oct 30 2013): the first version that intermittently fails
> 
>   Maybe lift out and build these branches while reading. *shrug*
>   (How to do this can be found at the end of this report - along with a 
> time-saving ./configure line, FWIW)
> 
>   Here are the changelists between these two revisions:
> 
>   https://github.com/qemu/qemu/compare/306ec6c...e689f7c
>   (Compare direction: OLD to NEW) (Commits: 166  Files changed: 192)
> 
>   https://github.com/qemu/qemu/compare/e689f7c...306ec6c
>   (Compare direction: NEW to OLD) (Commits: 30   Files changed: 22)
> 
>   (Someone else more familiar with Git might know why GitHub returns
>   results for both compare directions, and/or if the 2nd link is useful
>   information. The first link returns a lot more results than the 2nd
>   one, at least. Does comparing new>old return deletions?)
> 
>   ---
> 
>   Now on to the symptoms. In a moment I'll describe reproduction.
> 
>   # MS-DOS 6.22
> 
>   The first symptom I discovered was that trivial read and write
>   operations under MS-DOS would sometimes fail:
> 
>     C:\>echo test > hi
> 
>     General failure writing drive C
>     Abort, Retry, Fail?
> 
>   Anything else that exercises the disk behaves similarly:
> 
>     C:\>dir /s > nul
> 
>     General failure reading drive C
>     Abort, Retry, Fail?
> 
>   (Note that the above demonstrates both write and read failures)
> 
>   (Also, FWIW, `dir /s` == `ls -R`)
> 
>   The behavior of the I/O errors is not possible to characterise as it
>   fluctuates so much. For example something as simple as DIR can produce
>   wildly differing results: in one run, poking around with DIR ended
>   with DOS deciding C:\ was empty at one point; at another point in a
>   different run C:\ mysteriously dropped 50% of its contents only to
>   magically gain it all back moments later after some poking around in
>   one of the subdirectories that was still visible.
> 
>   The time it takes to trigger these errors is also highly variable.
>   QEMU may fall over as early as hanging forever at "Starting MS-
>   DOS...", or I might get all the way into Windows 3.1 before it
>   triggers (in which case Win3.1 reports vague memory errors - of all
>   things).
> 
>   Very occasionally I've seen _SeaBIOS itself_ report "Booting from Hard
>   Disk..." "Boot failed: could not read the boot disk" ... "No bootable
>   device.", and on one occasion I even got "Non-System disk or disk
>   error" "Replace and strike any key when ready"!
> 
>   
>   # WinNT 4 Terminal Server
> 
>   Most of the time, NTLDR will fire up normally. But every so often...
> 
>     SeaBIOS (version rel-1.7.3-117-g31b8b4e-
>   20131206_080705-nilsson.home.kraxel.org)
> 
>     Booting from Hard Disk...
>     A disk read error occurred.
>     Insert a system diskette and restart
>     the system.
> 
>   (NB. You're seeing the old SeaBIOS version included with e689f7c,
>   which was the first buggy commit.)
> 
>   If NT gets past this point without erroring out (ie, it makes it to
>   the boot menu), the rest of the system is 100% fine and there are no
>   other disk I/O issues whatsoever. For example, on QEMU 2.9.0 I was
>   able to enable disk compression, answer "Yes" to "Compress entire disk
>   now?" and have the process fully complete. No hitches.
> 
>   This makes me vaguely recall/wonder that perhaps this could be somehow
>   related to LBA and/or Int 13h, or something floating around near that
>   bunch of functionality. (I'm woefully ignorant about such low-level
>   details.) Perhaps DOS/Win3.1 are stuck using a disk mode that QEMU has
>   a buggy implementation of, while NT 4 (once NTOSKRNL is up and
>   running) is able to use a different disk mode or access mechanism.
> 
>   I'm really interested to get some understanding of what the root issue
>   is here, when this is fixed. (I wonder if it's a timing thing?)
> 
>   I've observed some unusual behavior with repeated restarts. In one
>   case, I attempted to start NT4 multiple times, and QEMU consistently
>   failed with "No bootable device" each time. So, I removed `-M isapc`,
>   promptly got a boot menu, hit ^C, readded `-M isapc` - and continued
>   to get a boot menu. Yep. I'll accept "really really big coincidence"
>   but I do very much wonder if something else is going on here. I've
>   observed many similar incidents. It makes me wonder whether the
>   contents of memory or some other system state is an influence. Very
>   probably not, but still...
> 
> 
>   -- Reproduction --------------------------------------------
> 
>   First of all, there was unfortunately no way for me to avoid having to
>   post entire disk images, but I've managed to compress everything down
>   to 174MB total download size.
> 
>   FWIW, WinWorld and many other sites seem to have no operational issues
>   providing clear pointers to CD keys; I consider my distribution of my
>   installed HDD images an extension of the apparent status quo.
> 
>   That being said, I've put everything on Google Drive so nobody has to
>   headscratch about Launchpad/Canonical/etc's stance on hosting this
>   data.
> 
>   So, this folder contains the disk images:
>   https://drive.google.com/drive/folders/1WdZVBh5Trs9HLC186-nHyKeqaSxfyM2c
>   ("Download all" at the top-right will create a ZIP file, but FWIW
>   downloading the individual files simultaneously would implement a
>   rough form of download acceleration)
> 
>   File meta info:
> 
>   Compressed
>   |
>   |      Apparent
>   |      |    Actual
>   |      |    |
>   38M -> 200M (103M)  win31.img.xz
>   82M -> 1G   (289M)  wnt4ts-broken.img.xz
>   55M -> 350M (146M)  wnt4ts-intermittent.img.xz
> 
>   SHA-256s:
> 
>   win31:        
> 8179b8180a2ab40bd472e8a2f3fb89fc331651e56923f94ceb9e52a78ee220d2
>   broken:       
> a2af5f0bc49a063b75f534b6ffe5b82e32ecc706a64a425b6626feccf6e3fdfa  
>   intermittent: 
> 77ae8c458829ebcdd64c71042012f45d5a2788e6ebd22db9d53de9ef1a574784
> 
>   (Wanted to keep the checksum lines within 80 columns)
> 
>   And, since I can't figure out where else in this report to put this,
>   wnt4ts-broken.img's password is "admin" but something seems to have
>   happened to the disk and NT doesn't actually boot properly :(, and
>   wnt4ts-intermittent.img's password is "1234". (These were set up as
>   test images. Now I'm _really_ glad I used simple passwords! :) )
> 
>   ---
> 
>   
>   I have two testcases: DOS 6.22 (+ Windows 3.1), and Windows NT 4.
> 
>   
>   # MS-DOS
> 
>   DOS is the simplest. It basically consists of
> 
>   $ qemu-system-i386 -drive file=win31.img,format=raw -M isapc -enable-
>   kvm
> 
>   And then literally just playing around. Things to try include creating
>   files (`echo blah > file`), repeatedly seeking across the entire FAT
>   (`dir /s > nul` or `dir /s`), and launching Windows (`win`).
> 
>   win31.img is not special (as far as I can tell) and merely consists of
>   the result of installing DOS 6.22 and Windows 3.1 from WinWorldPC.
>   I've basically just included the image for convenience.
> 
>   Generally no single "run" is immune to starting Win3.1 and then
>   launching File Manager; if that doesn't generate an error, something
>   is definitely up.
> 
>   The second best trigger is creating new files. That very very
>   frequently produces "General Failure ...", but not always.
> 
>   
>   # WinNT 4
> 
>   Windows NT 4 is a bit more complicated. Because this error only occurs
>   at presumably a single small point very early in boot, the window of
>   opportunity for the glitch to surface within is much much narrower and
>   thus often requires a larger number of tries.
> 
>   Anecdotally I've had QEMU hit the boot error at the first try/run, and
>   after as many as 63 "successful" boots.
> 
>   I made a small test harness that automates the launch process. It
>   consists of two shell scripts and requires tmux (and netcat).
>   (*Potential epilepsy warning*: if you use a light-colored terminal
>   background, the terminal QEMU is repeatedly invoked from will
>   continuously flash rapidly from white to black.)
> 
>   One of the scripts is run inside a tmux session in one terminal, while
>   the other script is run in its own terminal (without any tmux).
> 
>   
>   I named this one `run-qemu-loop`:
> 
>   --8<--------------------------------------------------------
> 
>   #!/bin/bash
> 
>   # ---
> 
>   qemu=/path/to/qemu-system-i386
>   #or, alternatively: (I used the following line myself so I
>   #could tab-complete my way to different qemu executables)
>   #qemu="$1"
> 
>   disk=/path/to/wnt4ts-intermittent.img
> 
>   # ---
> 
>   port=4444
> 
>   rm -f STOP itercount
> 
>   itercount=0
> 
>   while :; do
>       
>       [ -f STOP ] && break
>       
>       ((itercount++))
>       echo $itercount > itercount
>       
>       $qemu \
>               -enable-kvm -vga cirrus -curses -M isapc \
>               -drive file="$disk",format=raw \
>               -chardev socket,id=mon0,host=localhost,port=$port,server,nowait 
> \
>               -mon chardev=mon0,mode=readline
>       
>       #point to an otherwise-unused terminal if you like (see also: `tty`)
>       #echo "$itercount run(s)" > /dev/pts/__
>       
>   done
> 
>   ------------------------------------------------------------
> 
>   Not much logic above; this just repeatedly runs QEMU for as long as
>   the file `STOP` does not exist in the current directory.
> 
>   The key "magic" bit is that QEMU is launched in -curses mode.
> 
>   The other key bit is that the above script is run inside tmux.
> 
>   
>   Here's `tmux-ctl-loop`:
> 
>   --8<--------------------------------------------------------
> 
>   #!/bin/bash
> 
>   port=4444
> 
>   tmux=./tmux
> 
>   printf -v l '%0.0s-' {0..25}
>   h1="$l/ buffer dump begin \\$l"
>   h2="$l-\\ buffer dump end /-$l"
> 
>   while :; do
>       
>       while :; do
>               echo | nc localhost $port -q0 -w1 > /dev/null && break
>               echo 'Start qemu!'
>       done
>       
>       buffer="$(tmux -S $tmux capture-pane; tmux -S $tmux save-buffer -)"
>       
>       echo "$h1"
>       [[ "$buffer" ]] && echo "$buffer" || echo '( * Screen buffer is empty * 
> )'
>       echo "$h2"
>       
>       if echo "$buffer" | grep -q 'A disk read error occurred.'; then
>               
>               s="<Crashed after $(< itercount) runs.>"
>               echo "$s"
>               echo "$s" >> stats
>               
>               touch STOP
>               
>               #echo q | nc localhost $port -q0 > /dev/null
>               
>               exit
>               
>       elif echo "$buffer" | grep -q 'OS Loader V4.00'; then
>                       
>               echo '<Booted successfully, trying again>'
>               
>               echo q | nc localhost $port -q0 > /dev/null
>               
>       else
>               
>               echo '<Waiting for boot>'
>               
>       fi
>                       
>   done
> 
>   ------------------------------------------------------------
> 
>   Nothing particularly amazing going on here either.
> 
>   While `qemu-run-loop` is running inside tmux in the first terminal,
>   this is running in the 2nd one.
> 
>   The small infinite loop at the top only breaks when it can
>   successfully ping QEMU and it knows it's running.
> 
>   Then, a screendump of the contents of the terminal QEMU is in is
>   fetched from tmux, and the buffer's content is analyzed.
> 
>   - If NTLDR fails, the script creates `STOP` to halt qemu-run-loop,
>     sends `q` to QEMU through netcat, and then the script exits.
> 
>   - If NTLDR loads successfully, the script sends `q` to QEMU and continues
>     looping. (qemu-run-loop will not find the `STOP` file, and so restart 
> qemu.)
> 
>   The scripts run very quickly, with 2-3 iterations per second on my i3
>   box.
> 
> 
>   # Usage
> 
>   Save the two scripts above to the same directory as wnt4ts-intermittent.img,
>   then:
> 
>   - (If port 4444 doesn't work, the value needs to be changed in both
>   scripts.)
> 
>   - In the first terminal, run `tmux -S <file>`, where <file> names the socket
>     tmux will use. This needs to match "tmux=" at the top of `tmux-ctl-loop`.
>     (with `tmux=./tmux`, the command would be `tmux -S tmux`)
> 
>   - Still in the first terminal (and now also inside tmux), enter
>     `./qemu-run-loop`, passing the path to qemu if you're using that approach
>     (refer to the first few lines of the script). Don't hit enter yet.
> 
>   - Now, in the 2nd terminal, type `./tmux-ctl-loop`
> 
>   - Hit enter in both terminals.
>    
> 
>   Rationale for timing of Enter key:
> 
>   - Running qemu-run-loop first will start QEMU, and if NTLDR starts
>     successfully it will immediately begin counting down from 30. If NT 
> actually
>     starts to boot and is then hard-shut-down this /may/ affect the disk image
> 
>   - tmux-ctl-loop will annoyingly spam a continuous stream of 'Start qemu!' 
> until
>     qemu-run-loop is running.
> 
>   - Starting both scripts at "more or less" the same time (no rush) works out
>     well.
> 
>   
>   Hopefully potential script modifications are obvious; for example
> 
>   - changing tmux-ctl-loop to not send 'q' to qemu so you can connect to the 
> HMP
>     yourself
>     (NB, if `STOP` is not created, when qemu finally exits it will of course
>     promptly be relaunched)
> 
>   - pointing run-qemu-loop to a modified qemu binary
> 
> 
>   == #2: QEMU-vs-VirtualBox image issue ======================
> 
>   I was initially completely stumped by this issue, perhaps
>   unsurprisingly so. :)
> 
>   wnt4ts-broken.img is a perfectly ordinary NT 4 installation that was
>   created in QEMU 2.9.0. I created a 1GB disk with `truncate`, picked
>   NTFS and installed everything (which took a while).
> 
>   NT setup reboots a number of times during the boot process, and IIRC
>   those all went just fine. However, at some point, the image began to
>   consistently bomb out with "A disk read error occurred. ...", and
>   stubbornly refused to boot, regardless of the number of boot attempts
>   I tried.
> 
>   QEMU 2.0.0, 1.5.0, and 1.0 (the earliest version I was able to build
>   on my system) all consistently hit "disk read error occurred".
> 
>   I tried compiling QEMU 1.0 using clang so I could build for 32-bit on
>   my 64-bit system (GCC 7 died with "Frame pointer required, but
>   reserved"). The resulting qemu completely crashed if I didn't enable
>   KVM (ie, TCG was (understandably) broken); with KVM enabled qemu
>   didn't crash, but NTLDR halted with the same error as on 64-bit qemu.
>   (TL;DR, no difference whatsoever.)
> 
>   My initial reaction at this point was to try the image on another
>   virtualization platform. My first pick was VirtualBox.
> 
>   So, I followed the official instructions for pointing VirtualBox to
>   physical disk images, except I substituted a /dev/loopN device I'd
>   pointed to the image file via losetup.
> 
>   And... VirtualBox picked the image up fine and Just Worked(TM). Yay! -
>   but not yay. What gives?!
> 
>   Confused, I then tried to convert the disk image to VHD format.
>   Unfortunately, for some reason, if I try `qemu-image convert ... -O
>   vhdx ...`, VirtualBox chokes on the result:
> 
>   -----
> 
>   VD: error VERR_NOT_SUPPORTED opening image file
>   '/.../wnt4ts-broken-qemuconv.vhd' (VERR_NOT_SUPPORTED).
> 
>   Result Code: NS_ERROR_FAILURE (0x80004005)
>   Component: MediumWrap
>   Interface: IMedium {4afe423b-43e0-e9d0-82e8-ceb307940dda}
>   Callee: IVirtualBox {0169423f-46b4-cde9-91af-1e9d5b6cd945}
>   Callee RC: VBOX_E_OBJECT_NOT_FOUND (0x80BB0001)
> 
>   -----
> 
>   Welp.
> 
>   Well, a bit more digging later, and I found I could do
> 
>   $ VBoxManage convertfromraw wnt4ts-broken.img wnt4ts-broken.vhd
> 
>   but... as soon as I pointed VirtualBox to this, it too began to choke
>   with "A disk read error occurred".
> 
>   And yet, the VMDK->raw image setup worked just fine.
> 
>   I found I could even replace the loop device with the path of the .img
>   file itself and that worked just fine too.
> 
>   At my wits' end, I followed some online instructions to learn about
>   manual CHS configuration so I could try and get the image working in
>   Bochs. "A disk read error occurred". I wasn't surprised.
> 
>   It was at this point I began to give up, but I decided to try One Last
>   Thing(TM) before properly throwing in the towel.
> 
>   :)
> 
>   I decided to learn a bit more about how `VBoxManage internalcommands
>   createrawvmdk` worked, and try one thing in particular: I can edit the
>   .vmdk file, but can I point `createrawvmdk` at the .img file directly
>   too?
> 
>   Turns out, yes you can.
> 
>   It also turns out that this promptly caused VirtualBox to bomb out.
> 
>   Interesting.
> 
>   For reference, here's the VMDK file I initially created (by pointing
>   `createrawvmdk` at /dev/loopN) and then later edited to point straight
>   to the .img file, with both approaches resulting in successful boot.
> 
>   --8<--------------------------------------------------------
> 
>   # Disk DescriptorFile
>   version=1
>   CID=e35b9a45
>   parentCID=ffffffff
>   createType="fullDevice"
> 
>   # Extent description
>   RW 1536000 FLAT "/absolute/full/path/to/wnt4ts-broken.img" 0
> 
>   # The disk Data Base 
>   #DDB
> 
>   ddb.virtualHWVersion = "4"
>   ddb.adapterType="ide"
>   ddb.geometry.cylinders="1523"
>   ddb.geometry.heads="16"
>   ddb.geometry.sectors="63"
>   ddb.uuid.image="871a6044-c8ca-48ed-b7aa-e6fc49da3db4"
>   ddb.uuid.parent="00000000-0000-0000-0000-000000000000"
>   ddb.uuid.modification="3661715c-3906-4e4a-ab65-486d140e03b8"
>   ddb.uuid.parentmodification="00000000-0000-0000-0000-000000000000"
>   ddb.geometry.biosCylinders="761"
>   ddb.geometry.biosHeads="32"
>   ddb.geometry.biosSectors="63"
> 
>   ------------------------------------------------------------
> 
>   
>   Here's the _diff_ of what happens if I point `createrawvmdk` at 
> wnt4ts-broken.img directly:
> 
>   --8<--------------------------------------------------------
> 
>   ddb.geometry.cylinders="2080"
>   ddb.geometry.heads="16"
>   ddb.geometry.sectors="63"
> 
>   ------------------------------------------------------------
> 
>   :D
> 
>   Naturally,
> 
>   $ qemu-system-i386 -drive file=wnt4ts-
>   broken.img,format=raw,cyls=1523,heads=16,secs=63 -M isapc -sdl
> 
>   will boot happily on 2.9.0 (notwithstanding the occasional "disk read
>   error occurred" documented above).
> 
>   It will also boot in 1.6.0.
> 
>   (POTENTIAL BUG HEADSUP: 1.0 and 1.5.0 both lock up with a blank
>   640x480 window and use 0% CPU if I specify `-M isapc`.)
> 
>   And, of course, using these CHS values in Bochs also results in
>   successful boot as well (after setting the CPU type to pentium).
> 
>   Unfortunately, I have no idea what sequence of events caused the
>   creation of the VMDK file above. No invocation of `createrawvmdk` is
>   producing a VMDK file with the CHS settings above.
> 
>   I've only just begun to learn about the intricacies of CHS. Am I to
>   understand that these values are stored amongst the first 512 bytes of
>   the disk? If this is the case, then I wonder what changed the data,
>   and why. I was initially only using QEMU 2.9.0, and didn't move the
>   image to different VMs or QEMU versions. Perhaps Windows NT got
>   confused about the disk CHS and rewrote it?
> 
>   
>   == Sporadic BIOS-level boot failure ========================
> 
>   I have multiple screenshots of SeaBIOS in QEMU 2.9.0 halting with "No
>   bootable device" (et al), even with the above manually-applied CHS
>   settings.
> 
>   Commit e689f7c also presents such errors.
> 
>   Commit 306ec6c does not suffer from intermittent breakage of any kind:
> 
>   - No SeaBIOS flake-outs
>   - No "Non-system disk or disk error"
>   - No "A disk error has occurred"
>   - No "General failure ..."
> 
>   While most of my confidence in commit 306ec6c is based on anecdotal
>   evidence, I modified `tmux-ctl-loop` a little to soak-test BIOS-level
>   I/O stability and left this modified version running for a few
>   minutes.
> 
>   --8<--------------------------------------------------------
> 
>   #!/bin/bash
> 
>   port=4444
> 
>   tmux=./tmux
> 
>   printf -v l '%0.0s-' {0..25}
>   h1="$l/ buffer dump begin \\$l"
>   h2="$l-\\ buffer dump end /-$l"
> 
>   while :; do
>       
>       while :; do
>               echo | nc localhost $port -q0 -w1 > /dev/null && break
>               echo 'Start qemu!'
>       done
>       
>       buffer="$(tmux -S $tmux capture-pane; tmux -S $tmux save-buffer -)"
>       
>       echo "$h1"
>       [[ "$buffer" ]] && echo "$buffer" || echo '( * Screen buffer is empty * 
> )'
>       echo "$h2"
>       
>       if echo "$buffer" | grep -q 'Non-system disk' || echo "$buffer" | \
>               grep -q 'No bootable device'
>       then
>               
>               s="<Hit error after $(< itercount) runs.>"
>               echo "$s"
>               echo "$s" >> stats
>               
>               touch STOP
>               
>               #echo q | nc localhost $port -q0 > /dev/null
>               
>               exit
>               
>       elif echo "$buffer" | grep -q 'OS Loader V4.00' || echo "$buffer" | \
>               grep -q 'A disk read error'
>       then
>       
>               echo '<Boot did not hang at BIOS, trying again>'
>               
>               echo q | nc localhost $port -q0 > /dev/null
>               
>       else
>               
>               echo '<Waiting for boot>'
>               
>       fi
>                       
>   done
> 
>   ------------------------------------------------------------
> 
>   For the above to work, the top of run-qemu-loop must also be modified
>   to read something along the lines of
> 
>   disk=/path/to/wnt4ts-broken.img,format=raw,cyls=1523,heads=16,secs=63
> 
>   (Suggestion: modify copies of both scripts)
> 
>   One small terminal-flicker-headache (and a 57°C CPU) later, I was able
>   to carefully observe just over 350 successful runs in which QEMU
>   commit 306ec6c only ever produced a boot menu. No other hitches.
> 
>   ** Important: **
> 
>   However, commit 306ec6c will fail to boot, ever, if the cylinders and
>   geometry are not set to the values VirtualBox "discovered". (Of note
>   is the fact that QEMU (2.9.0) was what initially created this image. I
>   must admit that I don't remember what sequence of QEMU versions I fed
>   the image to - and I maybe, possibly, didn't think to back the file up
>   (sorry), so maybe something mangled something somewhere. But
>   VirtualBox figured it out nonetheless!)
> 
>   Furthermore, feeding /dev/loopN to any QEMU version will NOT result in
>   correct CHS discovery (and successful boot).
> 
>   This is what leads me to conclude that I've discovered two separate
>   issues.
> 
> 
>   == Appendix: How to build the branches =====================
> 
>   It's very simple.
> 
>   First, `git clone https://github.com/qemu/qemu` somewhere if you don't
>   already have a local copy. If you have an old git checkout that's from
>   2014 or later, you can use that old checkout instead. (If you want to
>   test an old checkout you have, the commands below will either work
>   perfectly or completely bomb out with no side effects.)
> 
>   A full checkout is a ~183MB download. Sorry.
> 
>   Next, create two new directories somewhere. Name them what you like,
>   eg `qemu-working` and `qemu-broken`.
> 
>   Now, cd into the checkout directory, and run:
> 
>   $ git archive 306ec6c3cece7004429c79c1ac93d49919f1f1cc | tar xC
>   /path/to/qemu-working/
> 
>   $ git archive e689f7c668cbd9d08f330e17c3dd3a059c9553d3 | tar xC
>   /path/to/qemu-broken/
> 
>   The paths can be relative.
> 
>   Now, run this in both of the new directories:
> 
>   $ ./configure --python=python2.7 --disable-libssh2 --disable-seccomp
>   --disable-usb-redir --disable-guest-agent --disable-libiscsi
>   --disable-spice --disable-smartcard-nss --disable-vhost-net --disable-
>   docs --disable-attr --disable-cap-ng --disable-vde --disable-user
>   --disable-bluez --disable-vnc-ws --disable-xen --disable-brlapi
>   --enable-debug --target-list=i386-softmmu --disable-fdt
> 
>   $ make -j64
> 
>   You can open two terminals and configure and build both simultaneously
>   if you like.
> 
>   On my decent but very basic (2-core+HT) i3 box, -j64 actually works out - 
> make doesn't actually launch too many gcc processes. You *will* see your 
> system load spike to ~20 though :)
>   (NB. Do. not. use. -j64. with. the. linux. kernel.)
> 
>   On my system, a single build with -j64 takes only about 35 seconds. C
>   FTW. (Although this has increased to 1min20sec for more recent
>   builds.)
> 
>   Most of the configure arguments remove functionality I'll never use
>   (in this situation) and which will only slow down the build.
> 
>   Once QEMU is built, run qemu-system-i386 directly from where it has
>   been built.
> 
>   $ /path/to/qemu-working/i386-softmmu/qemu-system-i386 ...
>   $ /path/to/qemu-broken/i386-softmmu/qemu-system-i386 ...
> 
>   Again, the paths can be relative.
> 
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/qemu/+bug/1745312/+subscriptions
>
signature.asc
Description: PGP signature
[Prev in Thread]
Current Thread
[Next in Thread]
[Qemu-devel] [Bug 1745312] [NEW] Regression report: Disk subsystem I/O failures/issues surfacing in DOS/early Windows [two separate issues: one bisected, one root-caused], i336_, 2018/01/25
- Re: [Qemu-devel] [Bug 1745312] [NEW] Regression report: Disk subsystem I/O failures/issues surfacing in DOS/early Windows [two separate issues: one bisected, one root-caused], Stefan Hajnoczi <=
- [Qemu-devel] [Bug 1745312] Re: Regression report: Disk subsystem I/O failures/issues surfacing in DOS/early Windows [two separate issues: one bisected, one root-caused], Fam Zheng, 2018/01/30
- [Qemu-devel] [Bug 1745312] Re: Regression report: Disk subsystem I/O failures/issues surfacing in DOS/early Windows [two separate issues: one bisected, one root-caused], John Snow, 2018/01/30
Prev by Date: Re: [Qemu-devel] [Qemu-arm] [PATCH] pl011: do not put into fifo before enabled the interruption
Next by Date: Re: [Qemu-devel] Prevent overriding the input file with the output file when using qemu-img
Previous by thread: [Qemu-devel] [Bug 1745312] [NEW] Regression report: Disk subsystem I/O failures/issues surfacing in DOS/early Windows [two separate issues: one bisected, one root-caused]
Next by thread: [Qemu-devel] [Bug 1745312] Re: Regression report: Disk subsystem I/O failures/issues surfacing in DOS/early Windows [two separate issues: one bisected, one root-caused]
Index(es):
- Date
- Thread