[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [PULL 10/56] x86: don't let decompressed kernel image clobber setup_
From: |
H. Peter Anvin |
Subject: |
Re: [PULL 10/56] x86: don't let decompressed kernel image clobber setup_data |
Date: |
Tue, 31 Jan 2023 21:40:53 -0800 |
User-agent: |
K-9 Mail for Android |
On January 31, 2023 1:22:43 PM PST, "Jason A. Donenfeld" <Jason@zx2c4.com>
wrote:
>On Tue, Jan 31, 2023, 15:55 H. Peter Anvin <hpa@zytor.com> wrote:
>
>> On January 30, 2023 12:19:14 PM PST, "Michael S. Tsirkin" <mst@redhat.com>
>> wrote:
>> >From: "Jason A. Donenfeld" <Jason@zx2c4.com>
>> >
>> >The setup_data links are appended to the compressed kernel image. Since
>> >the kernel image is typically loaded at 0x100000, setup_data lives at
>> >`0x100000 + compressed_size`, which does not get relocated during the
>> >kernel's boot process.
>> >
>> >The kernel typically decompresses the image starting at address
>> >0x1000000 (note: there's one more zero there than the compressed image
>> >above). This usually is fine for most kernels.
>> >
>> >However, if the compressed image is actually quite large, then
>> >setup_data will live at a `0x100000 + compressed_size` that extends into
>> >the decompressed zone at 0x1000000. In other words, if compressed_size
>> >is larger than `0x1000000 - 0x100000`, then the decompression step will
>> >clobber setup_data, resulting in crashes.
>> >
>> >Visually, what happens now is that QEMU appends setup_data to the kernel
>> >image:
>> >
>> > kernel image setup_data
>> > |--------------------------||----------------|
>> >0x100000 0x100000+l1 0x100000+l1+l2
>> >
>> >The problem is that this decompresses to 0x1000000 (one more zero). So
>> >if l1 is > (0x1000000-0x100000), then this winds up looking like:
>> >
>> > kernel image setup_data
>> > |--------------------------||----------------|
>> >0x100000 0x100000+l1 0x100000+l1+l2
>> >
>> > d e c o m p r e s s e d k e r n e l
>> >
>> |-------------------------------------------------------------|
>> > 0x1000000
>> 0x1000000+l3
>> >
>> >The decompressed kernel seemingly overwriting the compressed kernel
>> >image isn't a problem, because that gets relocated to a higher address
>> >early on in the boot process, at the end of startup_64. setup_data,
>> >however, stays in the same place, since those links are self referential
>> >and nothing fixes them up. So the decompressed kernel clobbers it.
>> >
>> >Fix this by appending setup_data to the cmdline blob rather than the
>> >kernel image blob, which remains at a lower address that won't get
>> >clobbered.
>> >
>> >This could have been done by overwriting the initrd blob instead, but
>> >that poses big difficulties, such as no longer being able to use memory
>> >mapped files for initrd, hurting performance, and, more importantly, the
>> >initrd address calculation is hard coded in qboot, and it always grows
>> >down rather than up, which means lots of brittle semantics would have to
>> >be changed around, incurring more complexity. In contrast, using cmdline
>> >is simple and doesn't interfere with anything.
>> >
>> >The microvm machine has a gross hack where it fiddles with fw_cfg data
>> >after the fact. So this hack is updated to account for this appending,
>> >by reserving some bytes.
>> >
>> >Fixup-by: Michael S. Tsirkin <mst@redhat.com>
>> >Cc: x86@kernel.org
>> >Cc: Philippe Mathieu-Daudé <philmd@linaro.org>
>> >Cc: H. Peter Anvin <hpa@zytor.com>
>> >Cc: Borislav Petkov <bp@alien8.de>
>> >Cc: Eric Biggers <ebiggers@kernel.org>
>> >Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
>> >Message-Id: <20221230220725.618763-1-Jason@zx2c4.com>
>> >Message-ID: <20230128061015-mutt-send-email-mst@kernel.org>
>> >Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
>> >Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
>> >Tested-by: Eric Biggers <ebiggers@google.com>
>> >Tested-by: Mathias Krause <minipli@grsecurity.net>
>> >---
>> > include/hw/i386/microvm.h | 5 ++--
>> > include/hw/nvram/fw_cfg.h | 9 +++++++
>> > hw/i386/microvm.c | 15 +++++++----
>> > hw/i386/x86.c | 52 +++++++++++++++++++++------------------
>> > hw/nvram/fw_cfg.c | 9 +++++++
>> > 5 files changed, 59 insertions(+), 31 deletions(-)
>> >
>> >diff --git a/include/hw/i386/microvm.h b/include/hw/i386/microvm.h
>> >index fad97a891d..e8af61f194 100644
>> >--- a/include/hw/i386/microvm.h
>> >+++ b/include/hw/i386/microvm.h
>> >@@ -50,8 +50,9 @@
>> > */
>> >
>> > /* Platform virtio definitions */
>> >-#define VIRTIO_MMIO_BASE 0xfeb00000
>> >-#define VIRTIO_CMDLINE_MAXLEN 64
>> >+#define VIRTIO_MMIO_BASE 0xfeb00000
>> >+#define VIRTIO_CMDLINE_MAXLEN 64
>> >+#define VIRTIO_CMDLINE_TOTAL_MAX_LEN ((VIRTIO_CMDLINE_MAXLEN + 1) *
>> 16)
>> >
>> > #define GED_MMIO_BASE 0xfea00000
>> > #define GED_MMIO_BASE_MEMHP (GED_MMIO_BASE + 0x100)
>> >diff --git a/include/hw/nvram/fw_cfg.h b/include/hw/nvram/fw_cfg.h
>> >index 2e503904dc..990dcdbb2e 100644
>> >--- a/include/hw/nvram/fw_cfg.h
>> >+++ b/include/hw/nvram/fw_cfg.h
>> >@@ -139,6 +139,15 @@ void fw_cfg_add_bytes_callback(FWCfgState *s,
>> uint16_t key,
>> > void *data, size_t len,
>> > bool read_only);
>> >
>> >+/**
>> >+ * fw_cfg_read_bytes_ptr:
>> >+ * @s: fw_cfg device being modified
>> >+ * @key: selector key value for new fw_cfg item
>> >+ *
>> >+ * Reads an existing fw_cfg data pointer.
>> >+ */
>> >+void *fw_cfg_read_bytes_ptr(FWCfgState *s, uint16_t key);
>> >+
>> > /**
>> > * fw_cfg_add_string:
>> > * @s: fw_cfg device being modified
>> >diff --git a/hw/i386/microvm.c b/hw/i386/microvm.c
>> >index 170a331e3f..29f30dd6d3 100644
>> >--- a/hw/i386/microvm.c
>> >+++ b/hw/i386/microvm.c
>> >@@ -378,7 +378,8 @@ static void microvm_fix_kernel_cmdline(MachineState
>> *machine)
>> > MicrovmMachineState *mms = MICROVM_MACHINE(machine);
>> > BusState *bus;
>> > BusChild *kid;
>> >- char *cmdline;
>> >+ char *cmdline, *existing_cmdline;
>> >+ size_t len;
>> >
>> > /*
>> > * Find MMIO transports with attached devices, and add them to the
>> kernel
>> >@@ -387,7 +388,8 @@ static void microvm_fix_kernel_cmdline(MachineState
>> *machine)
>> > * Yes, this is a hack, but one that heavily improves the UX without
>> > * introducing any significant issues.
>> > */
>> >- cmdline = g_strdup(machine->kernel_cmdline);
>> >+ existing_cmdline = fw_cfg_read_bytes_ptr(x86ms->fw_cfg,
>> FW_CFG_CMDLINE_DATA);
>> >+ cmdline = g_strdup(existing_cmdline);
>> > bus = sysbus_get_default();
>> > QTAILQ_FOREACH(kid, &bus->children, sibling) {
>> > DeviceState *dev = kid->child;
>> >@@ -411,9 +413,12 @@ static void microvm_fix_kernel_cmdline(MachineState
>> *machine)
>> > }
>> > }
>> >
>> >- fw_cfg_modify_i32(x86ms->fw_cfg, FW_CFG_CMDLINE_SIZE,
>> strlen(cmdline) + 1);
>> >- fw_cfg_modify_string(x86ms->fw_cfg, FW_CFG_CMDLINE_DATA, cmdline);
>> >-
>> >+ len = strlen(cmdline);
>> >+ if (len > VIRTIO_CMDLINE_TOTAL_MAX_LEN + strlen(existing_cmdline)) {
>> >+ fprintf(stderr, "qemu: virtio mmio cmdline too large,
>> skipping\n");
>> >+ } else {
>> >+ memcpy(existing_cmdline, cmdline, len + 1);
>> >+ }
>> > g_free(cmdline);
>> > }
>> >
>> >diff --git a/hw/i386/x86.c b/hw/i386/x86.c
>> >index 78cc131926..eaff4227bd 100644
>> >--- a/hw/i386/x86.c
>> >+++ b/hw/i386/x86.c
>> >@@ -50,6 +50,7 @@
>> > #include "hw/intc/i8259.h"
>> > #include "hw/rtc/mc146818rtc.h"
>> > #include "target/i386/sev.h"
>> >+#include "hw/i386/microvm.h"
>> >
>> > #include "hw/acpi/cpu_hotplug.h"
>> > #include "hw/irq.h"
>> >@@ -813,12 +814,18 @@ void x86_load_linux(X86MachineState *x86ms,
>> > const char *kernel_filename = machine->kernel_filename;
>> > const char *initrd_filename = machine->initrd_filename;
>> > const char *dtb_filename = machine->dtb;
>> >- const char *kernel_cmdline = machine->kernel_cmdline;
>> >+ char *kernel_cmdline;
>> > SevKernelLoaderContext sev_load_ctx = {};
>> > enum { RNG_SEED_LENGTH = 32 };
>> >
>> >- /* Align to 16 bytes as a paranoia measure */
>> >- cmdline_size = (strlen(kernel_cmdline) + 16) & ~15;
>> >+ /*
>> >+ * Add the NUL terminator, some padding for the microvm cmdline
>> fiddling
>> >+ * hack, and then align to 16 bytes as a paranoia measure
>> >+ */
>> >+ cmdline_size = (strlen(machine->kernel_cmdline) + 1 +
>> >+ VIRTIO_CMDLINE_TOTAL_MAX_LEN + 16) & ~15;
>> >+ /* Make a copy, since we might append arbitrary bytes to it later. */
>> >+ kernel_cmdline = g_strndup(machine->kernel_cmdline, cmdline_size);
>> >
>> > /* load the kernel header */
>> > f = fopen(kernel_filename, "rb");
>> >@@ -959,12 +966,6 @@ void x86_load_linux(X86MachineState *x86ms,
>> > initrd_max = x86ms->below_4g_mem_size - acpi_data_size - 1;
>> > }
>> >
>> >- fw_cfg_add_i32(fw_cfg, FW_CFG_CMDLINE_ADDR, cmdline_addr);
>> >- fw_cfg_add_i32(fw_cfg, FW_CFG_CMDLINE_SIZE, strlen(kernel_cmdline) +
>> 1);
>> >- fw_cfg_add_string(fw_cfg, FW_CFG_CMDLINE_DATA, kernel_cmdline);
>> >- sev_load_ctx.cmdline_data = (char *)kernel_cmdline;
>> >- sev_load_ctx.cmdline_size = strlen(kernel_cmdline) + 1;
>> >-
>> > if (protocol >= 0x202) {
>> > stl_p(header + 0x228, cmdline_addr);
>> > } else {
>> >@@ -1091,27 +1092,24 @@ void x86_load_linux(X86MachineState *x86ms,
>> > exit(1);
>> > }
>> >
>> >- setup_data_offset = QEMU_ALIGN_UP(kernel_size, 16);
>> >- kernel_size = setup_data_offset + sizeof(SetupData) + dtb_size;
>> >- kernel = g_realloc(kernel, kernel_size);
>> >-
>> >-
>> >- setup_data = (SetupData *)(kernel + setup_data_offset);
>> >+ setup_data_offset = cmdline_size;
>> >+ cmdline_size += sizeof(SetupData) + dtb_size;
>> >+ kernel_cmdline = g_realloc(kernel_cmdline, cmdline_size);
>> >+ setup_data = (void *)kernel_cmdline + setup_data_offset;
>> > setup_data->next = cpu_to_le64(first_setup_data);
>> >- first_setup_data = prot_addr + setup_data_offset;
>> >+ first_setup_data = cmdline_addr + setup_data_offset;
>> > setup_data->type = cpu_to_le32(SETUP_DTB);
>> > setup_data->len = cpu_to_le32(dtb_size);
>> >-
>> > load_image_size(dtb_filename, setup_data->data, dtb_size);
>> > }
>> >
>> >- if (!legacy_no_rng_seed) {
>> >- setup_data_offset = QEMU_ALIGN_UP(kernel_size, 16);
>> >- kernel_size = setup_data_offset + sizeof(SetupData) +
>> RNG_SEED_LENGTH;
>> >- kernel = g_realloc(kernel, kernel_size);
>> >- setup_data = (SetupData *)(kernel + setup_data_offset);
>> >+ if (!legacy_no_rng_seed && protocol >= 0x209) {
>> >+ setup_data_offset = cmdline_size;
>> >+ cmdline_size += sizeof(SetupData) + RNG_SEED_LENGTH;
>> >+ kernel_cmdline = g_realloc(kernel_cmdline, cmdline_size);
>> >+ setup_data = (void *)kernel_cmdline + setup_data_offset;
>> > setup_data->next = cpu_to_le64(first_setup_data);
>> >- first_setup_data = prot_addr + setup_data_offset;
>> >+ first_setup_data = cmdline_addr + setup_data_offset;
>> > setup_data->type = cpu_to_le32(SETUP_RNG_SEED);
>> > setup_data->len = cpu_to_le32(RNG_SEED_LENGTH);
>> > qemu_guest_getrandom_nofail(setup_data->data, RNG_SEED_LENGTH);
>> >@@ -1122,6 +1120,12 @@ void x86_load_linux(X86MachineState *x86ms,
>> > fw_cfg_add_bytes(fw_cfg, FW_CFG_KERNEL_DATA, kernel,
>> kernel_size);
>> > }
>> >
>> >+ fw_cfg_add_i32(fw_cfg, FW_CFG_CMDLINE_ADDR, cmdline_addr);
>> >+ fw_cfg_add_i32(fw_cfg, FW_CFG_CMDLINE_SIZE, cmdline_size);
>> >+ fw_cfg_add_bytes(fw_cfg, FW_CFG_CMDLINE_DATA, kernel_cmdline,
>> cmdline_size);
>> >+ sev_load_ctx.cmdline_data = (char *)kernel_cmdline;
>> >+ sev_load_ctx.cmdline_size = cmdline_size;
>> >+
>> > fw_cfg_add_i32(fw_cfg, FW_CFG_KERNEL_ADDR, prot_addr);
>> > fw_cfg_add_i32(fw_cfg, FW_CFG_KERNEL_SIZE, kernel_size);
>> > sev_load_ctx.kernel_data = (char *)kernel;
>> >@@ -1134,7 +1138,7 @@ void x86_load_linux(X86MachineState *x86ms,
>> > * kernel on the other side of the fw_cfg interface matches the hash
>> of the
>> > * file the user passed in.
>> > */
>> >- if (!sev_enabled()) {
>> >+ if (!sev_enabled() && first_setup_data) {
>> > SetupDataFixup *fixup = g_malloc(sizeof(*fixup));
>> >
>> > memcpy(setup, header, MIN(sizeof(header), setup_size));
>> >diff --git a/hw/nvram/fw_cfg.c b/hw/nvram/fw_cfg.c
>> >index a00881bc64..432754eda4 100644
>> >--- a/hw/nvram/fw_cfg.c
>> >+++ b/hw/nvram/fw_cfg.c
>> >@@ -741,6 +741,15 @@ void fw_cfg_add_bytes(FWCfgState *s, uint16_t key,
>> void *data, size_t len)
>> > fw_cfg_add_bytes_callback(s, key, NULL, NULL, NULL, data, len, true);
>> > }
>> >
>> >+void *fw_cfg_read_bytes_ptr(FWCfgState *s, uint16_t key)
>> >+{
>> >+ int arch = !!(key & FW_CFG_ARCH_LOCAL);
>> >+
>> >+ key &= FW_CFG_ENTRY_MASK;
>> >+ assert(key < fw_cfg_max_entry(s));
>> >+ return s->entries[arch][key].data;
>> >+}
>> >+
>> > void fw_cfg_add_string(FWCfgState *s, uint16_t key, const char *value)
>> > {
>> > size_t sz = strlen(value) + 1;
>>
>> Saying they are "appended to" is wrong; the loader is free to put them
>> anywhere in usable RAM that is not covered by the kernel image, the kernel
>> keepout area, the command line or initrd.
>>
>
>No. That sentence describes what QEMU does currently, without this patch,
>and it is accurate. This is not a kernel commit. It's a QEMU one.
>
>(Also, this patch sat on the list for over a month, during which time you
>could have chimed in. The ship for nitpicking has sailed now.)
>
>Sorry for the HTML crud; I'll be mobile for another 10 days, without a real
>computer.
>
>Jason
>
>>
Oh, right. Sorry.
[Prev in Thread] |
Current Thread |
[Next in Thread] |
- Re: [PULL 10/56] x86: don't let decompressed kernel image clobber setup_data,
H. Peter Anvin <=