[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [Qemu-devel] [PATCH v6 08/15] qdist: add module to represent frequen
From: |
Sergey Fedorov |
Subject: |
Re: [Qemu-devel] [PATCH v6 08/15] qdist: add module to represent frequency distributions of data |
Date: |
Wed, 8 Jun 2016 17:10:03 +0300 |
User-agent: |
Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.8.0 |
On 08/06/16 03:02, Emilio G. Cota wrote:
> On Tue, Jun 07, 2016 at 18:56:48 +0300, Sergey Fedorov wrote:
>> On 07/06/16 04:05, Emilio G. Cota wrote:
>>> On Sat, May 28, 2016 at 21:15:06 +0300, Sergey Fedorov wrote:
>>>> On 25/05/16 04:13, Emilio G. Cota wrote:
>>>>> diff --git a/util/qdist.c b/util/qdist.c
>>>>> new file mode 100644
>>>>> index 0000000..3343640
>>>>> --- /dev/null
>>>>> +++ b/util/qdist.c
>>>>> @@ -0,0 +1,386 @@
>>>> (snip)
>>>>> +
>>>>> +void qdist_add(struct qdist *dist, double x, long count)
>>>>> +{
>>>>> + struct qdist_entry *entry = NULL;
>>>>> +
>>>>> + if (dist->entries) {
>>>>> + struct qdist_entry e;
>>>>> +
>>>>> + e.x = x;
>>>>> + entry = bsearch(&e, dist->entries, dist->n, sizeof(e),
>>>>> qdist_cmp);
>>>>> + }
>>>>> +
>>>>> + if (entry) {
>>>>> + entry->count += count;
>>>>> + return;
>>>>> + }
>>>>> +
>>>>> + dist->entries = g_realloc(dist->entries,
>>>>> + sizeof(*dist->entries) * (dist->n + 1));
>>>> Repeated doubling?
>>> Can you please elaborate?
>> I mean dynamic array with a growth factor of 2
>> [https://en.wikipedia.org/wiki/Dynamic_array].
> Changed to:
>
> diff --git a/include/qemu/qdist.h b/include/qemu/qdist.h
> index 6d8b701..f30050c 100644
> --- a/include/qemu/qdist.h
> +++ b/include/qemu/qdist.h
> @@ -29,6 +29,7 @@ struct qdist_entry {
> struct qdist {
> struct qdist_entry *entries;
> size_t n;
> + size_t size;
> };
>
> #define QDIST_PR_BORDER BIT(0)
> diff --git a/util/qdist.c b/util/qdist.c
> index dc9dbd1..3b54354 100644
> --- a/util/qdist.c
> +++ b/util/qdist.c
> @@ -16,6 +16,7 @@
> void qdist_init(struct qdist *dist)
> {
> dist->entries = NULL;
> + dist->size = 0;
> dist->n = 0;
> }
>
> @@ -58,8 +59,11 @@ void qdist_add(struct qdist *dist, double x, long count)
> return;
> }
>
> - dist->entries = g_realloc(dist->entries,
> - sizeof(*dist->entries) * (dist->n + 1));
> + if (unlikely(dist->n == dist->size)) {
> + dist->size = dist->size ? dist->size * 2 : 1;
We could initialize 'dist->size' to 1 and allocate a 1-entry
'dist->entries' array in qdist_init() to avoid this ternary operation ;-)
Otherwise looks good.
> + dist->entries = g_realloc(dist->entries,
> + sizeof(*dist->entries) * (dist->size));
> + }
> dist->n++;
> entry = &dist->entries[dist->n - 1];
> entry->x = x;
>
>
>>>> (snip)
>>>>> +static char *qdist_pr_internal(const struct qdist *dist)
>>>>> +{
>>>>> + double min, max, step;
>>>>> + GString *s = g_string_new("");
>>>>> + size_t i;
>>>>> +
>>>>> + /* if only one entry, its printout will be either full or empty */
>>>>> + if (dist->n == 1) {
>>>>> + if (dist->entries[0].count) {
>>>>> + g_string_append_unichar(s, qdist_blocks[QDIST_NR_BLOCK_CODES
>>>>> - 1]);
>>>>> + } else {
>>>>> + g_string_append_c(s, ' ');
>>>>> + }
>>>>> + goto out;
>>>>> + }
>>>>> +
>>>>> + /* get min and max counts */
>>>>> + min = dist->entries[0].count;
>>>>> + max = min;
>>>>> + for (i = 0; i < dist->n; i++) {
>>>>> + struct qdist_entry *e = &dist->entries[i];
>>>>> +
>>>>> + if (e->count < min) {
>>>>> + min = e->count;
>>>>> + }
>>>>> + if (e->count > max) {
>>>>> + max = e->count;
>>>>> + }
>>>>> + }
>>>>> +
>>>>> + /* floor((count - min) * step) will give us the block index */
>>>>> + step = (QDIST_NR_BLOCK_CODES - 1) / (max - min);
>>>>> +
>>>>> + for (i = 0; i < dist->n; i++) {
>>>>> + struct qdist_entry *e = &dist->entries[i];
>>>>> + int index;
>>>>> +
>>>>> + /* make an exception with 0; instead of using block[0], print a
>>>>> space */
>>>>> + if (e->count) {
>>>>> + index = (int)((e->count - min) * step);
>>>> So "e->count == min" gives us one eighth block instead of just space?
>>> Yes, only 0 can print a space.
>> So our scale is not linear. I think some users might get confused by this.
> That's correct. I think special-casing 0 makes sense though, since
> it increases the signal-to-noise ratio of the histogram. For example:
>
> 1) 0 as ' ':
> TB hash occupancy 31.84% avg chain occ. Histogram: [0,10)%|▆ █
> ▅▁▃▁▁|[90,100]%
> TB hash avg chain 1.015 buckets. Histogram: 1|█▁▁|3
>
> 2) 0 as '1/8':
> TB hash occupancy 32.07% avg chain occ. Histogram:
> [0,10)%|▆▁█▁▁▅▁▃▁▁|[90,100]%
> TB hash avg chain 1.015 buckets. Histogram: 1|▇▁▁|3
>
> I think in these examples most users would be less confused by 1) than by 2).
I was meaning to represent all bars whose value < 1/8 as a space, not
only whose value is pure zero. Otherwise we can see 1/8 bar where the
actual value is negligibly differ from zero as in the second example.
>
> (snip)
>>>>> + to->n = from->n;
>>>>> + memcpy(to->entries, from->entries, sizeof(*to->entries) * to->n);
>>>>> + return;
>>>>> + }
>>>>> +
>>>>> + rebin:
>> By the way, here's a space before the 'rebin' label.
> Yes, I always do this.
> It prevents diff from mistaking the label for a function definition,
> and thus wrongly using the label as context. See:
> https://lkml.org/lkml/2010/6/16/312
Cool!
>
>
>>>>> + j_min = 0;
>>>>> + for (i = 0; i < n; i++) {
>>>>> + double x;
>>>>> + double left, right;
>>>>> +
>>>>> + left = xmin + i * step;
>>>>> + right = xmin + (i + 1) * step;
>>>>> +
>>>>> + /* Add x, even if it might not get any counts later */
>>>>> + x = left;
>>>> This way we round down to the left margin of each bin like this:
>>>>
>>>> xmin [*---*---*---*---*] xmax -- from
>>>> | /| /| /| /
>>>> | / | / | / | /
>>>> |/ |/ |/ |/
>>>> | | | |
>>>> V V V V
>>>> [* * * *] -- to
>>> (snip)
>>>> xmin [*----*----*----*] xmax -- from
>>>> \ /\ /\ /\ /
>>>> \ / \ / \ / \ /
>>>> | | | |
>>>> V V V V
>>>> [* * * *] -- to
>>>>
>>>> I'm not sure which is the more correct option from the mathematical
>>>> point of view; but multiple-binning with the last variant of the
>>>> algorithm we would still give the same result.
>>> There's no "right" or "wrong" way as long as we're consistent
>>> and we print the right counts in the right bins. I think the
>>> convention I chose is simple enough, and leads to simple printing
>>> of the labels. But yes other alternatives would be OK here.
>> Well, if we go ahead with my last suggestion the code would look like this:
>>
>> rebin:
>> /* We do the binning using the following scheme:
>> *
>> * xmin [*----*----*----*] xmax -- from
>> * \ /\ /\ /\ /
>> * \ / \ / \ / \ /
>> * | | | |
>> * V V V V
>> * [* * * *] -- to
>> *
>> */
>> step = (xmax - xmin) / (n - 1);
>> j = 0;
>> for (i = 0; i < n; i++) {
>> double x;
>> double right;
>>
>> x = xmin + i * step;
>> right = x + 0.5 * step;
>>
>> /* Add x, even if it might not get any counts later */
>> qdist_add(to, x, 0);
>>
>> /* To avoid double-counting we capture [left, right) ranges */
>> while (from->entries[j].x < right && j < from->n) {
>> qdist_add(to, x, from->entries[j].count);
>> j++;
>> }
>> }
>> assert(j == from->n);
>> }
>>
>> Actually it's simpler than current version.
> The behaviour isn't the same though. With this we have
> that the two outer bins (leftmost and rightmost) are unnecessarily
> large (since they're out of the range of the input data).
>
> For example, assume the data is between 0 and 100 and n=5 (i.e. step=25),
> it makes no sense to report the first bin as [-12.5,12.5). If we
> then truncate the unnecessary edges, we'd have [0,12.5), but
> then the second bin is [12.5,37.5). Bins of unequal size are
> possible (although a bit unusual) in histograms, but given
> our Unicode-based representation, we're limited to same-width bars.
That is why I noted that I'm not sure what is the most correct from
mathematical point of view. Maybe consider the second option? I.e.
rounding to the middle of each bin with:
x = left + step / 2;
which would give the picture like this:
xmin [*---*---*---*---*] xmax -- from
| | | | |
\ / \ / \ / \ /
| | | |
V V V V
[* * * *] -- to
Anyway, you may consider if you like whether it's possible to apply some
simplifications from my code to the final version.
Kind regards,
Sergey