[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [Rapp-dev] Pixelwise threshold and conditional pixel operations path
From: |
Hans-Peter Nilsson |
Subject: |
Re: [Rapp-dev] Pixelwise threshold and conditional pixel operations pathes |
Date: |
Fri, 27 May 2016 07:25:54 +0200 |
> From: Willie Betschart <address@hidden>
> Date: Wed, 4 May 2016 08:54:01 +0000
> Content-Type: text/x-patch; name="cond-pixop-add-v3.patch"
Thanks. Similarly to the other patch, committed with minor
changes. For the record:
> diff --git a/benchmark/rapp_benchmark.c b/benchmark/rapp_benchmark.c
> static void
> +rapp_bmark_exec_cond_u8(int (*func)(), const int *args)
> +{
> + const rapp_bmark_data_t *data = &rapp_bmark_data;
> + int idx = args[0];
> + (*func)(data->dst, data->dim_u8,
> + data->src[idx], data->dim_bin,
> + data->width, data->height, args[1]);
> +}
> +
> +static void
> rapp_bmark_exec_cond_set_u8(int (*func)(), const int *args)
> {
> const rapp_bmark_data_t *data = &rapp_bmark_data;
> int idx = args[0];
> (*func)(data->dst, data->dim_u8,
> data->src[idx], data->dim_bin,
> - data->width, data->height, 0);
> + data->width, data->height, args[1]);
> +}
> +
> +static void
> +rapp_bmark_exec_cond_u8_u8(int (*func)(), const int *args)
> +{
> + const rapp_bmark_data_t *data = &rapp_bmark_data;
> + int idx = args[0];
> + (*func)(data->dst, data->dim_u8,
> + data->set, data->dim_u8,
> + data->src[idx], data->dim_bin,
> + data->width, data->height, args[1]);
> }
Looks like you added rapp_bmark_exec_cond_u8_u8 and
rapp_bmark_exec_cond_u8, but *also* changed
rapp_bmark_exec_cond_set_u8 to be identical to the new
rapp_bmark_exec_cond_u8. That must be a thinko, so I just
dropped the new rapp_bmark_exec_cond_u8 and used cond_set_u8.
> diff --git a/compute/backend/rc_vec_neon.h b/compute/backend/rc_vec_neon.h
> +#define RC_VEC_SETMASKV(vec, maskv) \
> +do { \
> + rc_vec_t v_, andv_; \
> + rc_vec_t mask_ = (rc_vec_t){1<<0, 1<<1, 1<<2, 1<<3, \
> + 1<<4, 1<<5, 1<<6, 1<<7}; \
> + uint8_t indx0_ = vget_lane_u8(maskv, 0); \
> + RC_VEC_SPLAT(v_, indx0_); \
> + RC_VEC_AND(andv_, v_, mask_); \
> + (vec) = vceq_u8(andv_, mask_); \
> +} while (0)
Any particular reason to not use vtst instead of the and + vceq?
(If so, that should be expressed in a comment here.)
Reality (in the form of ARTPEC-6) seemed to fit the NEON
documentation I found so that's what I committed.
> diff --git a/compute/generic/rc_cond.c b/compute/generic/rc_cond.c
> +#define RC_COND_PIXOP_TEMPLATE(op1, pixop, arg1, arg2, mask) \
The arg2 was unused so I dropped that.
(Too much copy-pasting from rc_pixop.c I'm afraid; if we *need*
that argument we'll add it later.)
> +#define RC_COND_WORD_TEMPLATE(dst, pixop, arg1, arg2, word) \
> + unsigned d1_ = RC_WORD_EXTRACT(dst32_, 0, 8); \
> + unsigned d2_ = RC_WORD_EXTRACT(dst32_, 8, 8); \
> + unsigned d3_ = RC_WORD_EXTRACT(dst32_, 16, 8); \
> + unsigned d4_ = RC_WORD_EXTRACT(dst32_, 24, 8); \
> + \
> + /* Apply pixop. */ \
> + pixop(d1_, arg1, arg2); \
> + pixop(d2_, arg1, arg2); \
> + pixop(d3_, arg1, arg2); \
> + pixop(d4_, arg1, arg2); \
> + \
> + *d32_ = RC_WORD_INSERT(d1_, 0, 8) | \
> + RC_WORD_INSERT(d2_, 8, 8) | \
> + RC_WORD_INSERT(d3_, 16, 8) | \
> + RC_WORD_INSERT(d4_, 24, 8); \
I changed these 4-bytes-expansions into loops. GCC will unroll
that and nobody should have to see the repetitions.
> +#if RC_IMPL(rc_cond_addc_u8, 1)
> +void
> +rc_cond_addc_u8(uint8_t *restrict dst, int dst_dim,
> + const uint8_t *restrict map, int map_dim,
> + int width, int height, int value)
> +{
> + if (value > 0) {
> + /* Positive value - use ADDS */
> + RC_COND_TEMPLATE(dst, dst_dim, map, map_dim, width, height,
> + RC_PIXOP_ADDS, value, 0);
> + }
> + else if (value < 0) {
> + /* Negative value - use SUBS */
> + value = -value;
> + RC_COND_TEMPLATE(dst, dst_dim, map, map_dim, width, height,
> + RC_PIXOP_SUBS, value, 0);
> + }
> +}
> +#endif
The condition is better put in the driver when choosing between
different vector functions, so I separated these into
rc_cond_addc and rc_cond_subc. Not that I expect different
tunings between adds and subs, but it's one less thing to handle
in the innermost function.
> diff --git a/compute/vector/rc_cond.c b/compute/vector/rc_cond.c
> +#define RC_PIXOP_COPY(vec1, vec2, arg1, arg2) \
> + ((vec1) = (vec2))
Similar confusing abundance of unused arguments. Right,
copy-pasto; I did the same for rc_type.c, now fixed.
> +#define RC_COND_SINGLE_ITER_MAX(max, buf, map, j, i, pixop, \
> + arg1, arg2, arg3) \
> +do { \
> + rc_vec_t mv_; \
> + int k_, cnt_; \
> + RC_VEC_LOAD(mv_, &(map)[(i)]); \
> + RC_COND_COUNT(cnt_, mv_); \
> + if (cnt_ > 0) { \
> + for (k_ = 0; k_ < max; k_++, (j) += RC_VEC_SIZE) { \
> + rc_vec_t dv_, sv_, tv_; \
> + rc_vec_t exp_mv_, cdv_, cv1_, cv2_; \
> + \
> + /* Run standard pixop. */ \
> + RC_PIXOP_ITER(buf, j, sv_, cdv_, \
> + pixop, arg1, arg2, arg3); \
> + \
> + /* Conditional part. */ \
> + RC_VEC_SETMASKV(exp_mv_, mv_); \
> + RC_VEC_ANDNOT(cv1_, cdv_, exp_mv_); \
> + RC_VEC_AND(cv2_, sv_, exp_mv_); \
> + RC_VEC_OR(dv_, cv1_, cv2_); \
> + RC_VEC_SHLC(tv_, mv_, RC_VEC_SIZE / 8); \
> + mv_ = tv_; \
> + RC_VEC_STORE(&(buf)[(j)], dv_); \
> + } \
> + } \
> + else { \
> + (j) += (max) * RC_VEC_SIZE; \
> + } \
> + (i) += RC_VEC_SIZE; \
> +} while (0)
Expanding the conditional using the c=(a & m) | (b & ~m) idiom
certainly works everywhere, but maybe it'd be a good idea to
also add some vector macro that picks one of two vectors, or
maybe a masked write; maybe something combining the above
SETMASKV+ANDNOT+AND+OR+STORE and also letting the backend decide
whether it's worthwhile to check for zero mask (it certainly is
with the above expansion).
Also, I thought it'd be better to do the unrolling at the above
*innermost* level rather than at this level just outside:
> +#define RC_COND_PIXOP_TEMPLATE(dst, dst_dim, map, map_dim, \
> + width, height, pixop, arg1, arg2, arg3, \
> + unroll) \
> +do { \
> + /* We use the total number of destination vectors as the base. */ \
> + int tot_ = RC_DIV_CEIL((width), RC_VEC_SIZE * 8 / 8); \
> + \
> + /* We count whole source vectors for unrolling. */ \
> + int len_ = tot_ / (8 * unroll); \
> + int rem_ = tot_ % (8 * unroll); \
> + int y_; \
...but after benchmarking that, it was mostly a wash, except for
*one* newly added vector implementation where that was
noticeably worse. Reality strikes again.
> +/* Verify if the general condition operations are supported. */
> +#if defined RC_VEC_SETMASKV && defined RC_VEC_ANDNOT \
> + && defined RC_VEC_AND && defined RC_VEC_OR \
> + && defined RC_VEC_SHLC
No, each #if (def) should sit as a guard around only the macro
(or function) using exactly those macros. Thus better #if
define RC_COND_PIXOP_TEMPLATE.
> diff --git a/include/rapp_cond.h b/include/rapp_cond.h
> index 7ac25d7..2601309 100644
> --- a/include/rapp_cond.h
> +++ b/include/rapp_cond.h
Missing an update to the section that said that only set and
copy was implemented.
Actually, it should no longer be called "Conditional set and
copy". Fixed, I hope ("Single Conditional Operations"). That
subsection seemed to at one time have been called "Conditional
Operations" only to be renamed, with the main section (including
scatter/gather) named "Conditional Operations".
> @@ -89,6 +89,25 @@ rapp_cond_set_u8(uint8_t *restrict dst, int dst_dim,
> int width, int height, unsigned value);
>
> /**
> + * Add signed constant conditionally.
> + * Computes buf[i] = buf[i] + value, where map[i] is set.
> + * The result is saturated.
> + *
> + * @param[out] dst Destination pixel buffer.
> + * @param dst_dim Row dimension of the destination buffer.
> + * @param[in] map Binary map pixel buffer.
> + * @param map_dim Row dimension of the binary map buffer.
> + * @param width Image width in pixels.
> + * @param height Image height in pixels.
> + * @param value Add signed constant.
(To terse, range missing and also should describe the parameter,
not the operation.)
brgds, H-P