linaro-toolchain

linaro-toolchain@lists.linaro.org

3 participants
5671 discussions

[TCWG CI] Regression caused by gcc: tree-object-size: Use trees and support negative offsets

by ci_notify＠linaro.org

[TCWG CI] Regression caused by gcc: tree-object-size: Use trees and support negative offsets: commit 422f9eb7011b76c12ff00ffaee2bcc9cdddf16d5 Author: Siddhesh Poyarekar <siddhesh(a)gotplt.org> tree-object-size: Use trees and support negative offsets Results regressed to # reset_artifacts: -10 # build_abe binutils: -9 # build_abe stage1: -5 # build_abe qemu: -2 # linux_n_obj: 550 # First few build errors in logs: # 00:01:37 ./include/linux/thread_info.h:213:25: error: call to ‘__bad_copy_to’ declared with attribute error: copy destination size is too small # 00:01:37 ./include/linux/thread_info.h:213:25: error: call to ‘__bad_copy_to’ declared with attribute error: copy destination size is too small # 00:01:37 make[1]: *** [scripts/Makefile.build:287: fs/io_uring.o] Error 1 # 00:01:37 make: *** [Makefile:1846: fs] Error 2 from # reset_artifacts: -10 # build_abe binutils: -9 # build_abe stage1: -5 # build_abe qemu: -2 # linux_n_obj: 567 # linux build successful: all THIS IS THE END OF INTERESTING STUFF. BELOW ARE LINKS TO BUILDS, REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT. This commit has regressed these CI configurations: - tcwg_kernel/gnu-master-arm-mainline-allnoconfig First_bad build: https://ci.linaro.org/job/tcwg_kernel-gnu-bisect-gnu-master-arm-mainline-al… Last_good build: https://ci.linaro.org/job/tcwg_kernel-gnu-bisect-gnu-master-arm-mainline-al… Baseline build: https://ci.linaro.org/job/tcwg_kernel-gnu-bisect-gnu-master-arm-mainline-al… Even more details: https://ci.linaro.org/job/tcwg_kernel-gnu-bisect-gnu-master-arm-mainline-al… Reproduce builds: <cut> mkdir investigate-gcc-422f9eb7011b76c12ff00ffaee2bcc9cdddf16d5 cd investigate-gcc-422f9eb7011b76c12ff00ffaee2bcc9cdddf16d5 # Fetch scripts git clone https://git.linaro.org/toolchain/jenkins-scripts # Fetch manifests and test.sh script mkdir -p artifacts/manifests curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_kernel-gnu-bisect-gnu-master-arm-mainline-al… --fail curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_kernel-gnu-bisect-gnu-master-arm-mainline-al… --fail curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_kernel-gnu-bisect-gnu-master-arm-mainline-al… --fail chmod +x artifacts/test.sh # Reproduce the baseline build (build all pre-requisites) ./jenkins-scripts/tcwg_kernel-build.sh @@ artifacts/manifests/build-baseline.sh # Save baseline build state (which is then restored in artifacts/test.sh) mkdir -p ./bisect rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /gcc/ ./ ./bisect/baseline/ cd gcc # Reproduce first_bad build git checkout --detach 422f9eb7011b76c12ff00ffaee2bcc9cdddf16d5 ../artifacts/test.sh # Reproduce last_good build git checkout --detach 871504b0dd5cd023d3a28cf9e5ccbda75928b102 ../artifacts/test.sh cd .. </cut> Full commit (up to 1000 lines): <cut> commit 422f9eb7011b76c12ff00ffaee2bcc9cdddf16d5 Author: Siddhesh Poyarekar <siddhesh(a)gotplt.org> Date: Fri Dec 17 07:07:18 2021 +0530 tree-object-size: Use trees and support negative offsets Transform tree-object-size to operate on tree objects instead of host wide integers. This makes it easier to extend to dynamic expressions for object sizes. The compute_builtin_object_size interface also now returns a tree expression instead of HOST_WIDE_INT, so callers have been adjusted to account for that. The trees in object_sizes are each an object_size object with members size (the bytes from the pointer to the end of the object) and wholesize (the size of the whole object). This allows analysis of negative offsets, which can now be allowed to the extent of the object bounds. Tests have been added to verify that it actually works. gcc/ChangeLog: * tree-object-size.h (compute_builtin_object_size): Return tree instead of HOST_WIDE_INT. * builtins.c (fold_builtin_object_size): Adjust. * gimple-fold.c (gimple_fold_builtin_strncat): Likewise. * ubsan.c (instrument_object_size): Likewise. * tree-object-size.c (object_size): New structure. (object_sizes): Change type to vec<object_size>. (initval): New function. (unknown): Use it. (size_unknown_p, size_initval, size_unknown): New functions. (object_sizes_unknown_p): Use it. (object_sizes_get): Return tree. (object_sizes_initialize): Rename from object_sizes_set_force and set VAL parameter type as tree. Add new parameter WHOLEVAL. (object_sizes_set): Set VAL parameter type as tree and adjust implementation. Add new parameter WHOLEVAL. (size_for_offset): New function. (decl_init_size): Adjust comment. (addr_object_size): Change PSIZE parameter to tree and adjust implementation. Add new parameter PWHOLESIZE. (alloc_object_size): Return tree. (compute_builtin_object_size): Return tree in PSIZE. (expr_object_size, call_object_size, unknown_object_size): Adjust for object_sizes_set change. (merge_object_sizes): Drop OFFSET parameter and adjust implementation for tree change. (plus_stmt_object_size): Call collect_object_sizes_for directly instead of merge_object_size and call size_for_offset to get net size. (cond_expr_object_size, collect_object_sizes_for, object_sizes_execute): Adjust for change of type from HOST_WIDE_INT to tree. (check_for_plus_in_loops_1): Likewise and skip non-positive offsets. gcc/testsuite/ChangeLog: * gcc.dg/builtin-object-size-1.c (test9): New test. (main): Call it. * gcc.dg/builtin-object-size-2.c (test8): New test. (main): Call it. * gcc.dg/builtin-object-size-3.c (test9): New test. (main): Call it. * gcc.dg/builtin-object-size-4.c (test8): New test. (main): Call it. * gcc.dg/builtin-object-size-5.c (test5, test6, test7): New tests. Signed-off-by: Siddhesh Poyarekar <siddhesh(a)gotplt.org> --- gcc/builtins.c | 10 +- gcc/gimple-fold.c | 11 +- gcc/testsuite/gcc.dg/builtin-object-size-1.c | 30 ++ gcc/testsuite/gcc.dg/builtin-object-size-2.c | 30 ++ gcc/testsuite/gcc.dg/builtin-object-size-3.c | 31 +++ gcc/testsuite/gcc.dg/builtin-object-size-4.c | 30 ++ gcc/testsuite/gcc.dg/builtin-object-size-5.c | 25 ++ gcc/tree-object-size.c | 394 +++++++++++++++++---------- gcc/tree-object-size.h | 2 +- gcc/ubsan.c | 5 +- 10 files changed, 409 insertions(+), 159 deletions(-) diff --git a/gcc/builtins.c b/gcc/builtins.c index cd8947b4de2..abe342e111d 100644 --- a/gcc/builtins.c +++ b/gcc/builtins.c @@ -10255,7 +10255,7 @@ maybe_emit_sprintf_chk_warning (tree exp, enum built_in_function fcode) static tree fold_builtin_object_size (tree ptr, tree ost) { - unsigned HOST_WIDE_INT bytes; + tree bytes; int object_size_type; if (!validate_arg (ptr, POINTER_TYPE) @@ -10280,8 +10280,8 @@ fold_builtin_object_size (tree ptr, tree ost) if (TREE_CODE (ptr) == ADDR_EXPR) { compute_builtin_object_size (ptr, object_size_type, &bytes); - if (wi::fits_to_tree_p (bytes, size_type_node)) - return build_int_cstu (size_type_node, bytes); + if (int_fits_type_p (bytes, size_type_node)) + return fold_convert (size_type_node, bytes); } else if (TREE_CODE (ptr) == SSA_NAME) { @@ -10289,8 +10289,8 @@ fold_builtin_object_size (tree ptr, tree ost) later. Maybe subsequent passes will help determining it. */ if (compute_builtin_object_size (ptr, object_size_type, &bytes) - && wi::fits_to_tree_p (bytes, size_type_node)) - return build_int_cstu (size_type_node, bytes); + && int_fits_type_p (bytes, size_type_node)) + return fold_convert (size_type_node, bytes); } return NULL_TREE; diff --git a/gcc/gimple-fold.c b/gcc/gimple-fold.c index 1d8fd74f72c..64515aabc04 100644 --- a/gcc/gimple-fold.c +++ b/gcc/gimple-fold.c @@ -2493,17 +2493,16 @@ gimple_fold_builtin_strncat (gimple_stmt_iterator *gsi) if (!src_len || known_lower (stmt, len, src_len, true)) return false; - unsigned HOST_WIDE_INT dstsize; - bool found_dstsize = compute_builtin_object_size (dst, 1, &dstsize); - /* Warn on constant LEN. */ if (TREE_CODE (len) == INTEGER_CST) { bool nowarn = warning_suppressed_p (stmt, OPT_Wstringop_overflow_); + tree dstsize; - if (!nowarn && found_dstsize) + if (!nowarn && compute_builtin_object_size (dst, 1, &dstsize) + && TREE_CODE (dstsize) == INTEGER_CST) { - int cmpdst = compare_tree_int (len, dstsize); + int cmpdst = tree_int_cst_compare (len, dstsize); if (cmpdst >= 0) { @@ -2519,7 +2518,7 @@ gimple_fold_builtin_strncat (gimple_stmt_iterator *gsi) ? G_("%qD specified bound %E equals " "destination size") : G_("%qD specified bound %E exceeds " - "destination size %wu"), + "destination size %E"), fndecl, len, dstsize); if (nowarn) suppress_warning (stmt, OPT_Wstringop_overflow_); diff --git a/gcc/testsuite/gcc.dg/builtin-object-size-1.c b/gcc/testsuite/gcc.dg/builtin-object-size-1.c index 8cdae49a6b1..0154f4e9695 100644 --- a/gcc/testsuite/gcc.dg/builtin-object-size-1.c +++ b/gcc/testsuite/gcc.dg/builtin-object-size-1.c @@ -424,6 +424,35 @@ test8 (void) abort (); } +void +__attribute__ ((noinline)) +test9 (unsigned cond) +{ + char *buf2 = malloc (10); + char *p; + + if (cond) + p = &buf2[8]; + else + p = &buf2[4]; + + if (__builtin_object_size (&p[-4], 0) != 10) + abort (); + + for (unsigned i = cond; i > 0; i--) + p--; + + if (__builtin_object_size (p, 0) != 10) + abort (); + + p = &y.c[8]; + for (unsigned i = cond; i > 0; i--) + p--; + + if (__builtin_object_size (p, 0) != sizeof (y)) + abort (); +} + int main (void) { @@ -437,5 +466,6 @@ main (void) test6 (4); test7 (); test8 (); + test9 (1); exit (0); } diff --git a/gcc/testsuite/gcc.dg/builtin-object-size-2.c b/gcc/testsuite/gcc.dg/builtin-object-size-2.c index ad2dd296a9a..5cf29291aff 100644 --- a/gcc/testsuite/gcc.dg/builtin-object-size-2.c +++ b/gcc/testsuite/gcc.dg/builtin-object-size-2.c @@ -382,6 +382,35 @@ test7 (void) abort (); } +void +__attribute__ ((noinline)) +test8 (unsigned cond) +{ + char *buf2 = malloc (10); + char *p; + + if (cond) + p = &buf2[8]; + else + p = &buf2[4]; + + if (__builtin_object_size (&p[-4], 1) != 10) + abort (); + + for (unsigned i = cond; i > 0; i--) + p--; + + if (__builtin_object_size (p, 1) != 10) + abort (); + + p = &y.c[8]; + for (unsigned i = cond; i > 0; i--) + p--; + + if (__builtin_object_size (p, 1) != sizeof (y.c)) + abort (); +} + int main (void) { @@ -394,5 +423,6 @@ main (void) test5 (4); test6 (); test7 (); + test8 (1); exit (0); } diff --git a/gcc/testsuite/gcc.dg/builtin-object-size-3.c b/gcc/testsuite/gcc.dg/builtin-object-size-3.c index d5ca5047ee9..3a692c4e3d2 100644 --- a/gcc/testsuite/gcc.dg/builtin-object-size-3.c +++ b/gcc/testsuite/gcc.dg/builtin-object-size-3.c @@ -430,6 +430,36 @@ test8 (void) abort (); } +void +__attribute__ ((noinline)) +test9 (unsigned cond) +{ + char *buf2 = malloc (10); + char *p; + + if (cond) + p = &buf2[8]; + else + p = &buf2[4]; + + if (__builtin_object_size (&p[-4], 2) != 6) + abort (); + + for (unsigned i = cond; i > 0; i--) + p--; + + if (__builtin_object_size (p, 2) != 2) + abort (); + + p = &y.c[8]; + for (unsigned i = cond; i > 0; i--) + p--; + + if (__builtin_object_size (p, 2) + != sizeof (y) - __builtin_offsetof (struct A, c) - 8) + abort (); +} + int main (void) { @@ -443,5 +473,6 @@ main (void) test6 (4); test7 (); test8 (); + test9 (1); exit (0); } diff --git a/gcc/testsuite/gcc.dg/builtin-object-size-4.c b/gcc/testsuite/gcc.dg/builtin-object-size-4.c index 9f159e36a0f..87381620cc9 100644 --- a/gcc/testsuite/gcc.dg/builtin-object-size-4.c +++ b/gcc/testsuite/gcc.dg/builtin-object-size-4.c @@ -395,6 +395,35 @@ test7 (void) abort (); } +void +__attribute__ ((noinline)) +test8 (unsigned cond) +{ + char *buf2 = malloc (10); + char *p; + + if (cond) + p = &buf2[8]; + else + p = &buf2[4]; + + if (__builtin_object_size (&p[-4], 3) != 6) + abort (); + + for (unsigned i = cond; i > 0; i--) + p--; + + if (__builtin_object_size (p, 3) != 2) + abort (); + + p = &y.c[8]; + for (unsigned i = cond; i > 0; i--) + p--; + + if (__builtin_object_size (p, 3) != sizeof (y.c) - 8) + abort (); +} + int main (void) { @@ -407,5 +436,6 @@ main (void) test5 (4); test6 (); test7 (); + test8 (1); exit (0); } diff --git a/gcc/testsuite/gcc.dg/builtin-object-size-5.c b/gcc/testsuite/gcc.dg/builtin-object-size-5.c index 7c274cdfd42..8e63d9c7a5e 100644 --- a/gcc/testsuite/gcc.dg/builtin-object-size-5.c +++ b/gcc/testsuite/gcc.dg/builtin-object-size-5.c @@ -53,4 +53,29 @@ test4 (size_t x) abort (); } +void +test5 (void) +{ + char *p = &buf[0x90000004]; + if (__builtin_object_size (p + 2, 0) != 0) + abort (); +} + +void +test6 (void) +{ + char *p = &buf[-4]; + if (__builtin_object_size (p + 2, 0) != 0) + abort (); +} + +void +test7 (void) +{ + char *buf2 = __builtin_malloc (8); + char *p = &buf2[0x90000004]; + if (__builtin_object_size (p + 2, 0) != 0) + abort (); +} + /* { dg-final { scan-assembler-not "abort" } } */ diff --git a/gcc/tree-object-size.c b/gcc/tree-object-size.c index b4881ef198f..32ef6dd5133 100644 --- a/gcc/tree-object-size.c +++ b/gcc/tree-object-size.c @@ -45,6 +45,14 @@ struct object_size_info unsigned int *stack, *tos; }; +struct GTY(()) object_size +{ + /* Estimate of bytes till the end of the object. */ + tree size; + /* Estimate of the size of the whole object. */ + tree wholesize; +}; + enum { OST_SUBOBJECT = 1, @@ -54,13 +62,12 @@ enum static tree compute_object_offset (const_tree, const_tree); static bool addr_object_size (struct object_size_info *, - const_tree, int, unsigned HOST_WIDE_INT *); -static unsigned HOST_WIDE_INT alloc_object_size (const gcall *, int); + const_tree, int, tree *, tree *t = NULL); +static tree alloc_object_size (const gcall *, int); static tree pass_through_call (const gcall *); static void collect_object_sizes_for (struct object_size_info *, tree); static void expr_object_size (struct object_size_info *, tree, tree); -static bool merge_object_sizes (struct object_size_info *, tree, tree, - unsigned HOST_WIDE_INT); +static bool merge_object_sizes (struct object_size_info *, tree, tree); static bool plus_stmt_object_size (struct object_size_info *, tree, gimple *); static bool cond_expr_object_size (struct object_size_info *, tree, gimple *); static void init_offset_limit (void); @@ -68,13 +75,13 @@ static void check_for_plus_in_loops (struct object_size_info *, tree); static void check_for_plus_in_loops_1 (struct object_size_info *, tree, unsigned int); -/* object_sizes[0] is upper bound for number of bytes till the end of - the object. - object_sizes[1] is upper bound for number of bytes till the end of - the subobject (innermost array or field with address taken). - object_sizes[2] is lower bound for number of bytes till the end of - the object and object_sizes[3] lower bound for subobject. */ -static vec<unsigned HOST_WIDE_INT> object_sizes[OST_END]; +/* object_sizes[0] is upper bound for the object size and number of bytes till + the end of the object. + object_sizes[1] is upper bound for the object size and number of bytes till + the end of the subobject (innermost array or field with address taken). + object_sizes[2] is lower bound for the object size and number of bytes till + the end of the object and object_sizes[3] lower bound for subobject. */ +static vec<object_size> object_sizes[OST_END]; /* Bitmaps what object sizes have been computed already. */ static bitmap computed[OST_END]; @@ -82,10 +89,46 @@ static bitmap computed[OST_END]; /* Maximum value of offset we consider to be addition. */ static unsigned HOST_WIDE_INT offset_limit; +/* Initial value of object sizes; zero for maximum and SIZE_MAX for minimum + object size. */ + +static inline unsigned HOST_WIDE_INT +initval (int object_size_type) +{ + return (object_size_type & OST_MINIMUM) ? HOST_WIDE_INT_M1U : 0; +} + +/* Unknown object size value; it's the opposite of initval. */ + static inline unsigned HOST_WIDE_INT unknown (int object_size_type) { - return ((unsigned HOST_WIDE_INT) -((object_size_type >> 1) ^ 1)); + return ~initval (object_size_type); +} + +/* Return true if VAL is represents an unknown size for OBJECT_SIZE_TYPE. */ + +static inline bool +size_unknown_p (tree val, int object_size_type) +{ + return (tree_fits_uhwi_p (val) + && tree_to_uhwi (val) == unknown (object_size_type)); +} + +/* Return a tree with initial value for OBJECT_SIZE_TYPE. */ + +static inline tree +size_initval (int object_size_type) +{ + return size_int (initval (object_size_type)); +} + +/* Return a tree with unknown value for OBJECT_SIZE_TYPE. */ + +static inline tree +size_unknown (int object_size_type) +{ + return size_int (unknown (object_size_type)); } /* Grow object_sizes[OBJECT_SIZE_TYPE] to num_ssa_names. */ @@ -110,47 +153,57 @@ object_sizes_release (int object_size_type) static inline bool object_sizes_unknown_p (int object_size_type, unsigned varno) { - return (object_sizes[object_size_type][varno] - == unknown (object_size_type)); + return size_unknown_p (object_sizes[object_size_type][varno].size, + object_size_type); } -/* Return size for VARNO corresponding to OSI. */ +/* Return size for VARNO corresponding to OSI. If WHOLE is true, return the + whole object size. */ -static inline unsigned HOST_WIDE_INT -object_sizes_get (struct object_size_info *osi, unsigned varno) +static inline tree +object_sizes_get (struct object_size_info *osi, unsigned varno, + bool whole = false) { - return object_sizes[osi->object_size_type][varno]; + if (whole) + return object_sizes[osi->object_size_type][varno].wholesize; + else + return object_sizes[osi->object_size_type][varno].size; } /* Set size for VARNO corresponding to OSI to VAL. */ -static inline bool -object_sizes_set_force (struct object_size_info *osi, unsigned varno, - unsigned HOST_WIDE_INT val) +static inline void +object_sizes_initialize (struct object_size_info *osi, unsigned varno, + tree val, tree wholeval) { - object_sizes[osi->object_size_type][varno] = val; - return true; + int object_size_type = osi->object_size_type; + + object_sizes[object_size_type][varno].size = val; + object_sizes[object_size_type][varno].wholesize = wholeval; } /* Set size for VARNO corresponding to OSI to VAL if it is the new minimum or maximum. */ static inline bool -object_sizes_set (struct object_size_info *osi, unsigned varno, - unsigned HOST_WIDE_INT val) +object_sizes_set (struct object_size_info *osi, unsigned varno, tree val, + tree wholeval) { int object_size_type = osi->object_size_type; - if ((object_size_type & OST_MINIMUM) == 0) - { - if (object_sizes[object_size_type][varno] < val) - return object_sizes_set_force (osi, varno, val); - } - else - { - if (object_sizes[object_size_type][varno] > val) - return object_sizes_set_force (osi, varno, val); - } - return false; + object_size osize = object_sizes[object_size_type][varno]; + + tree oldval = osize.size; + tree old_wholeval = osize.wholesize; + + enum tree_code code = object_size_type & OST_MINIMUM ? MIN_EXPR : MAX_EXPR; + + val = size_binop (code, val, oldval); + wholeval = size_binop (code, wholeval, old_wholeval); + + object_sizes[object_size_type][varno].size = val; + object_sizes[object_size_type][varno].wholesize = wholeval; + return (tree_int_cst_compare (oldval, val) != 0 + || tree_int_cst_compare (old_wholeval, wholeval) != 0); } /* Initialize OFFSET_LIMIT variable. */ @@ -164,6 +217,48 @@ init_offset_limit (void) offset_limit /= 2; } +/* Bytes at end of the object with SZ from offset OFFSET. If WHOLESIZE is not + NULL_TREE, use it to get the net offset of the pointer, which should always + be positive and hence, be within OFFSET_LIMIT for valid offsets. */ + +static tree +size_for_offset (tree sz, tree offset, tree wholesize = NULL_TREE) +{ + gcc_checking_assert (TREE_CODE (offset) == INTEGER_CST); + gcc_checking_assert (TREE_CODE (sz) == INTEGER_CST); + gcc_checking_assert (types_compatible_p (TREE_TYPE (sz), sizetype)); + + /* For negative offsets, if we have a distinct WHOLESIZE, use it to get a net + offset from the whole object. */ + if (wholesize && tree_int_cst_compare (sz, wholesize)) + { + gcc_checking_assert (TREE_CODE (wholesize) == INTEGER_CST); + gcc_checking_assert (types_compatible_p (TREE_TYPE (wholesize), + sizetype)); + + /* Restructure SZ - OFFSET as + WHOLESIZE - (WHOLESIZE + OFFSET - SZ) so that the offset part, i.e. + WHOLESIZE + OFFSET - SZ is only allowed to be positive. */ + tree tmp = size_binop (MAX_EXPR, wholesize, sz); + offset = fold_build2 (PLUS_EXPR, sizetype, tmp, offset); + offset = fold_build2 (MINUS_EXPR, sizetype, offset, sz); + sz = tmp; + } + + /* Safe to convert now, since a valid net offset should be non-negative. */ + if (!types_compatible_p (TREE_TYPE (offset), sizetype)) + fold_convert (sizetype, offset); + + if (integer_zerop (offset)) + return sz; + + /* Negative or too large offset even after adjustment, cannot be within + bounds of an object. */ + if (compare_tree_int (offset, offset_limit) > 0) + return size_zero_node; + + return size_binop (MINUS_EXPR, size_binop (MAX_EXPR, sz, offset), offset); +} /* Compute offset of EXPR within VAR. Return error_mark_node if unknown. */ @@ -274,19 +369,22 @@ decl_init_size (tree decl, bool min) /* Compute __builtin_object_size for PTR, which is a ADDR_EXPR. OBJECT_SIZE_TYPE is the second argument from __builtin_object_size. - If unknown, return unknown (object_size_type). */ + If unknown, return size_unknown (object_size_type). */ static bool addr_object_size (struct object_size_info *osi, const_tree ptr, - int object_size_type, unsigned HOST_WIDE_INT *psize) + int object_size_type, tree *psize, tree *pwholesize) { - tree pt_var, pt_var_size = NULL_TREE, var_size, bytes; + tree pt_var, pt_var_size = NULL_TREE, pt_var_wholesize = NULL_TREE; + tree var_size, bytes, wholebytes; gcc_assert (TREE_CODE (ptr) == ADDR_EXPR); /* Set to unknown and overwrite just before returning if the size could be determined. */ - *psize = unknown (object_size_type); + *psize = size_unknown (object_size_type); + if (pwholesize) + *pwholesize = size_unknown (object_size_type); pt_var = TREE_OPERAND (ptr, 0); while (handled_component_p (pt_var)) @@ -297,13 +395,14 @@ addr_object_size (struct object_size_info *osi, const_tree ptr, if (TREE_CODE (pt_var) == MEM_REF) { - unsigned HOST_WIDE_INT sz; + tree sz, wholesize; if (!osi || (object_size_type & OST_SUBOBJECT) != 0 || TREE_CODE (TREE_OPERAND (pt_var, 0)) != SSA_NAME) { compute_builtin_object_size (TREE_OPERAND (pt_var, 0), object_size_type & ~OST_SUBOBJECT, &sz); + wholesize = sz; } else { @@ -312,46 +411,47 @@ addr_object_size (struct object_size_info *osi, const_tree ptr, collect_object_sizes_for (osi, var); if (bitmap_bit_p (computed[object_size_type], SSA_NAME_VERSION (var))) - sz = object_sizes_get (osi, SSA_NAME_VERSION (var)); + { + sz = object_sizes_get (osi, SSA_NAME_VERSION (var)); + wholesize = object_sizes_get (osi, SSA_NAME_VERSION (var), true); + } else - sz = unknown (object_size_type); + sz = wholesize = size_unknown (object_size_type); } - if (sz != unknown (object_size_type)) + if (!size_unknown_p (sz, object_size_type)) { - offset_int mem_offset; - if (mem_ref_offset (pt_var).is_constant (&mem_offset)) - { - offset_int dsz = wi::sub (sz, mem_offset); - if (wi::neg_p (dsz)) - sz = 0; - else if (wi::fits_uhwi_p (dsz)) - sz = dsz.to_uhwi (); - else - sz = unknown (object_size_type); - } + tree offset = TREE_OPERAND (pt_var, 1); + if (TREE_CODE (offset) != INTEGER_CST + || TREE_CODE (sz) != INTEGER_CST) + sz = wholesize = size_unknown (object_size_type); else - sz = unknown (object_size_type); + sz = size_for_offset (sz, offset, wholesize); } - if (sz != unknown (object_size_type) && sz < offset_limit) - pt_var_size = size_int (sz); + if (!size_unknown_p (sz, object_size_type) + && TREE_CODE (sz) == INTEGER_CST + && compare_tree_int (sz, offset_limit) < 0) + { + pt_var_size = sz; + pt_var_wholesize = wholesize; + } } else if (DECL_P (pt_var)) { - pt_var_size = decl_init_size (pt_var, object_size_type & OST_MINIMUM); + pt_var_size = pt_var_wholesize + = decl_init_size (pt_var, object_size_type & OST_MINIMUM); if (!pt_var_size) return false; } else if (TREE_CODE (pt_var) == STRING_CST) - pt_var_size = TYPE_SIZE_UNIT (TREE_TYPE (pt_var)); + pt_var_size = pt_var_wholesize = TYPE_SIZE_UNIT (TREE_TYPE (pt_var)); else return false; if (pt_var_size) { /* Validate the size determined above. */ - if (!tree_fits_uhwi_p (pt_var_size) - || tree_to_uhwi (pt_var_size) >= offset_limit) + if (compare_tree_int (pt_var_size, offset_limit) >= 0) return false; } @@ -496,28 +596,35 @@ addr_object_size (struct object_size_info *osi, const_tree ptr, bytes = size_binop (MIN_EXPR, bytes, bytes2); } } + + wholebytes + = object_size_type & OST_SUBOBJECT ? var_size : pt_var_wholesize; } else if (!pt_var_size) return false; else - bytes = pt_var_size; - - if (tree_fits_uhwi_p (bytes)) { - *psize = tree_to_uhwi (bytes); - return true; + bytes = pt_var_size; + wholebytes = pt_var_wholesize; } - return false; + if (TREE_CODE (bytes) != INTEGER_CST + || TREE_CODE (wholebytes) != INTEGER_CST) + return false; + + *psize = bytes; + if (pwholesize) + *pwholesize = wholebytes; + return true; } /* Compute __builtin_object_size for CALL, which is a GIMPLE_CALL. Handles calls to functions declared with attribute alloc_size. OBJECT_SIZE_TYPE is the second argument from __builtin_object_size. - If unknown, return unknown (object_size_type). */ + If unknown, return size_unknown (object_size_type). */ -static unsigned HOST_WIDE_INT +static tree alloc_object_size (const gcall *call, int object_size_type) { gcc_assert (is_gimple_call (call)); @@ -529,7 +636,7 @@ alloc_object_size (const gcall *call, int object_size_type) calltype = gimple_call_fntype (call); if (!calltype) - return unknown (object_size_type); + return size_unknown (object_size_type); /* Set to positions of alloc_size arguments. */ int arg1 = -1, arg2 = -1; @@ -549,7 +656,7 @@ alloc_object_size (const gcall *call, int object_size_type) || (arg2 >= 0 && (arg2 >= (int)gimple_call_num_args (call) || TREE_CODE (gimple_call_arg (call, arg2)) != INTEGER_CST))) - return unknown (object_size_type); + return size_unknown (object_size_type); tree bytes = NULL_TREE; if (arg2 >= 0) @@ -559,10 +666,7 @@ alloc_object_size (const gcall *call, int object_size_type) else if (arg1 >= 0) bytes = fold_convert (sizetype, gimple_call_arg (call, arg1)); - if (bytes && tree_fits_uhwi_p (bytes)) - return tree_to_uhwi (bytes); - - return unknown (object_size_type); + return bytes; } @@ -598,13 +702,13 @@ pass_through_call (const gcall *call) bool compute_builtin_object_size (tree ptr, int object_size_type, - unsigned HOST_WIDE_INT *psize) + tree *psize) { gcc_assert (object_size_type >= 0 && object_size_type < OST_END); /* Set to unknown and overwrite just before returning if the size could be determined. */ - *psize = unknown (object_size_type); + *psize = size_unknown (object_size_type); if (! offset_limit) init_offset_limit (); @@ -638,8 +742,7 @@ compute_builtin_object_size (tree ptr, int object_size_type, psize)) { /* Return zero when the offset is out of bounds. */ - unsigned HOST_WIDE_INT off = tree_to_shwi (offset); - *psize = off < *psize ? *psize - off : 0; + *psize = size_for_offset (*psize, offset); return true; } } @@ -747,12 +850,13 @@ compute_builtin_object_size (tree ptr, int object_size_type, print_generic_expr (dump_file, ssa_name (i), dump_flags); fprintf (dump_file, - ": %s %sobject size " - HOST_WIDE_INT_PRINT_UNSIGNED "\n", + ": %s %sobject size ", ((object_size_type & OST_MINIMUM) ? "minimum" : "maximum"), - (object_size_type & OST_SUBOBJECT) ? "sub" : "", - object_sizes_get (&osi, i)); + (object_size_type & OST_SUBOBJECT) ? "sub" : ""); + print_generic_expr (dump_file, object_sizes_get (&osi, i), + dump_flags); + fprintf (dump_file, "\n"); } } @@ -761,7 +865,7 @@ compute_builtin_object_size (tree ptr, int object_size_type, } *psize = object_sizes_get (&osi, SSA_NAME_VERSION (ptr)); - return *psize != unknown (object_size_type); + return !size_unknown_p (*psize, object_size_type); } /* Compute object_sizes for PTR, defined to VALUE, which is not an SSA_NAME. */ @@ -771,7 +875,7 @@ expr_object_size (struct object_size_info *osi, tree ptr, tree value) { int object_size_type = osi->object_size_type; unsigned int varno = SSA_NAME_VERSION (ptr); - unsigned HOST_WIDE_INT bytes; + tree bytes, wholesize; gcc_assert (!object_sizes_unknown_p (object_size_type, varno)); gcc_assert (osi->pass == 0); @@ -784,11 +888,11 @@ expr_object_size (struct object_size_info *osi, tree ptr, tree value) || !POINTER_TYPE_P (TREE_TYPE (value))); if (TREE_CODE (value) == ADDR_EXPR) - addr_object_size (osi, value, object_size_type, &bytes); + addr_object_size (osi, value, object_size_type, &bytes, &wholesize); else - bytes = unknown (object_size_type); + bytes = wholesize = size_unknown (object_size_type); - object_sizes_set (osi, varno, bytes); + object_sizes_set (osi, varno, bytes, wholesize); } @@ -799,16 +903,14 @@ call_object_size (struct object_size_info *osi, tree ptr, gcall *call) { int object_size_type = osi->object_size_type; unsigned int varno = SSA_NAME_VERSION (ptr); - unsigned HOST_WIDE_INT bytes; gcc_assert (is_gimple_call (call)); gcc_assert (!object_sizes_unknown_p (object_size_type, varno)); gcc_assert (osi->pass == 0); + tree bytes = alloc_object_size (call, object_size_type); - bytes = alloc_object_size (call, object_size_type); - - object_sizes_set (osi, varno, bytes); + object_sizes_set (osi, varno, bytes, bytes); } @@ -822,8 +924,9 @@ unknown_object_size (struct object_size_info *osi, tree ptr) gcc_checking_assert (!object_sizes_unknown_p (object_size_type, varno)); gcc_checking_assert (osi->pass == 0); + tree bytes = size_unknown (object_size_type); - object_sizes_set (osi, varno, unknown (object_size_type)); + object_sizes_set (osi, varno, bytes, bytes); } @@ -831,30 +934,22 @@ unknown_object_size (struct object_size_info *osi, tree ptr) the object size might need reexamination later. */ static bool -merge_object_sizes (struct object_size_info *osi, tree dest, tree orig, - unsigned HOST_WIDE_INT offset) +merge_object_sizes (struct object_size_info *osi, tree dest, tree orig) { int object_size_type = osi->object_size_type; unsigned int varno = SSA_NAME_VERSION (dest); - unsigned HOST_WIDE_INT orig_bytes; + tree orig_bytes, wholesize; if (object_sizes_unknown_p (object_size_type, varno)) return false; - if (offset >= offset_limit) - { - object_sizes_set (osi, varno, unknown (object_size_type)); - return false; - } if (osi->pass == 0) collect_object_sizes_for (osi, orig); orig_bytes = object_sizes_get (osi, SSA_NAME_VERSION (orig)); - if (orig_bytes != unknown (object_size_type)) - orig_bytes = (offset > orig_bytes) - ? HOST_WIDE_INT_0U : orig_bytes - offset; + wholesize = object_sizes_get (osi, SSA_NAME_VERSION (orig), true); - if (object_sizes_set (osi, varno, orig_bytes)) + if (object_sizes_set (osi, varno, orig_bytes, wholesize)) osi->changed = true; return bitmap_bit_p (osi->reexamine, SSA_NAME_VERSION (orig)); @@ -870,8 +965,9 @@ plus_stmt_object_size (struct object_size_info *osi, tree var, gimple *stmt) { int object_size_type = osi->object_size_type; unsigned int varno = SSA_NAME_VERSION (var); - unsigned HOST_WIDE_INT bytes; + tree bytes, wholesize; tree op0, op1; + bool reexamine = false; if (gimple_assign_rhs_code (stmt) == POINTER_PLUS_EXPR) { @@ -896,31 +992,38 @@ plus_stmt_object_size (struct object_size_info *osi, tree var, gimple *stmt) && (TREE_CODE (op0) == SSA_NAME || TREE_CODE (op0) == ADDR_EXPR)) { - if (! tree_fits_uhwi_p (op1)) - bytes = unknown (object_size_type); - else if (TREE_CODE (op0) == SSA_NAME) - return merge_object_sizes (osi, var, op0, tree_to_uhwi (op1)); + if (TREE_CODE (op0) == SSA_NAME) + { + if (osi->pass == 0) + collect_object_sizes_for (osi, op0); + + bytes = object_sizes_get (osi, SSA_NAME_VERSION (op0)); + wholesize = object_sizes_get (osi, SSA_NAME_VERSION (op0), true); + reexamine = bitmap_bit_p (osi->reexamine, SSA_NAME_VERSION (op0)); + } else { - unsigned HOST_WIDE_INT off = tree_to_uhwi (op1); - - /* op0 will be ADDR_EXPR here. */ - addr_object_size (osi, op0, object_size_type, &bytes); - if (bytes == unknown (object_size_type)) - ; - else if (off > offset_limit) - bytes = unknown (object_size_type); - else if (off > bytes) - bytes = 0; - else - bytes -= off; + /* op0 will be ADDR_EXPR here. We should never come here during + reexamination. */ + gcc_checking_assert (osi->pass == 0); + addr_object_size (osi, op0, object_size_type, &bytes, &wholesize); } + + /* In the first pass, do not compute size for offset if either the + maximum size is unknown or the minimum size is not initialized yet; + the latter indicates a dependency loop and will be resolved in + subsequent passes. We attempt to compute offset for 0 minimum size + too because a negative offset could be within bounds of WHOLESIZE, + giving a non-zero result for VAR. */ + if (osi->pass != 0 || !size_unknown_p (bytes, 0)) + bytes = size_for_offset (bytes, op1, wholesize); } else - bytes = unknown (object_size_type); </cut>

4 years, 1 month

[TCWG CI] 464.h264ref slowed down by 4% after llvm: [SLP]Improve multinode analysis.

by ci_notify＠linaro.org

After llvm commit bd053769867f988500dc1b451c6439eefcf7643f Author: Alexey Bataev <a.bataev(a)outlook.com> [SLP]Improve multinode analysis. the following benchmarks slowed down by more than 2%: - 464.h264ref slowed down by 4% from 11205 to 11640 perf samples Below reproducer instructions can be used to re-build both "first_bad" and "last_good" cross-toolchains used in this bisection. Naturally, the scripts will fail when triggerring benchmarking jobs if you don't have access to Linaro TCWG CI. For your convenience, we have uploaded tarballs with pre-processed source and assembly files at: - First_bad save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… - Last_good save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… - Baseline save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… Configuration: - Benchmark: SPEC CPU2006 - Toolchain: Clang + Glibc + LLVM Linker - Version: all components were built from their tip of trunk - Target: aarch64-linux-gnu - Compiler flags: -O3 -flto - Hardware: NVidia TX1 4x Cortex-A57 This benchmarking CI is work-in-progress, and we welcome feedback and suggestions at linaro-toolchain(a)lists.linaro.org . In our improvement plans is to add support for SPEC CPU2017 benchmarks and provide "perf report/annotate" data behind these reports. THIS IS THE END OF INTERESTING STUFF. BELOW ARE LINKS TO BUILDS, REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT. This commit has regressed these CI configurations: - tcwg_bmk_llvm_tx1/llvm-master-aarch64-spec2k6-O3_LTO First_bad build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… Last_good build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… Baseline build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… Even more details: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… Reproduce builds: <cut> mkdir investigate-llvm-bd053769867f988500dc1b451c6439eefcf7643f cd investigate-llvm-bd053769867f988500dc1b451c6439eefcf7643f # Fetch scripts git clone https://git.linaro.org/toolchain/jenkins-scripts # Fetch manifests and test.sh script mkdir -p artifacts/manifests curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail chmod +x artifacts/test.sh # Reproduce the baseline build (build all pre-requisites) ./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh # Save baseline build state (which is then restored in artifacts/test.sh) mkdir -p ./bisect rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /llvm/ ./ ./bisect/baseline/ cd llvm # Reproduce first_bad build git checkout --detach bd053769867f988500dc1b451c6439eefcf7643f ../artifacts/test.sh # Reproduce last_good build git checkout --detach 135d5d4a6d37f30173c1b9ea85a3a969c364b241 ../artifacts/test.sh cd .. </cut> Full commit (up to 1000 lines): <cut> commit bd053769867f988500dc1b451c6439eefcf7643f Author: Alexey Bataev <a.bataev(a)outlook.com> Date: Tue Apr 6 08:35:52 2021 -0700 [SLP]Improve multinode analysis. Changes the preliminary multinode analysis: 1. Introduced scores for reversed loads/extractelements. 2. Improved shallow score calculation. 3. Lowered the cost of external uses (no need to consider it several times, just ones). 4. The initial lane for analysis is the one with the minimal possible reorderings. These changes in general shall reduce compile time and improve the reordering in many cases. Part of D57059. Differential Revision: https://reviews.llvm.org/D101109 --- llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | 253 ++++++++++++++++----- .../AArch64/transpose-inseltpoison.ll | 30 +-- .../Transforms/SLPVectorizer/AArch64/transpose.ll | 30 +-- .../AArch64/vectorize-free-extracts-inserts.ll | 20 +- llvm/test/Transforms/SLPVectorizer/X86/PR39774.ll | 2 +- llvm/test/Transforms/SLPVectorizer/X86/addsub.ll | 24 +- .../Transforms/SLPVectorizer/X86/commutativity.ll | 20 +- .../SLPVectorizer/X86/crash_exceed_scheduling.ll | 6 +- .../Transforms/SLPVectorizer/X86/crash_smallpt.ll | 18 +- .../Transforms/SLPVectorizer/X86/extractelement.ll | 4 +- .../Transforms/SLPVectorizer/X86/insert-shuffle.ll | 34 ++- .../test/Transforms/SLPVectorizer/X86/lookahead.ll | 35 +-- .../Transforms/SLPVectorizer/X86/operandorder.ll | 44 ++-- .../Transforms/SLPVectorizer/X86/store-jumbled.ll | 4 +- .../SLPVectorizer/X86/stores_vectorize.ll | 6 +- .../test/Transforms/SLPVectorizer/X86/supernode.ll | 2 +- 16 files changed, 328 insertions(+), 204 deletions(-) diff --git a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp index d145b04c0694..c685432ae28e 100644 --- a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp +++ b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp @@ -1016,18 +1016,25 @@ public: std::swap(OpsVec[OpIdx1][Lane], OpsVec[OpIdx2][Lane]); } - // The hard-coded scores listed here are not very important. When computing - // the scores of matching one sub-tree with another, we are basically - // counting the number of values that are matching. So even if all scores - // are set to 1, we would still get a decent matching result. + // The hard-coded scores listed here are not very important, though it shall + // be higher for better matches to improve the resulting cost. When + // computing the scores of matching one sub-tree with another, we are + // basically counting the number of values that are matching. So even if all + // scores are set to 1, we would still get a decent matching result. // However, sometimes we have to break ties. For example we may have to // choose between matching loads vs matching opcodes. This is what these - // scores are helping us with: they provide the order of preference. + // scores are helping us with: they provide the order of preference. Also, + // this is important if the scalar is externally used or used in another + // tree entry node in the different lane. /// Loads from consecutive memory addresses, e.g. load(A[i]), load(A[i+1]). - static const int ScoreConsecutiveLoads = 3; + static const int ScoreConsecutiveLoads = 4; + /// Loads from reversed memory addresses, e.g. load(A[i+1]), load(A[i]). + static const int ScoreReversedLoads = 3; /// ExtractElementInst from same vector and consecutive indexes. - static const int ScoreConsecutiveExtracts = 3; + static const int ScoreConsecutiveExtracts = 4; + /// ExtractElementInst from same vector and reversed indices. + static const int ScoreReversedExtracts = 3; /// Constants. static const int ScoreConstants = 2; /// Instructions with the same opcode. @@ -1047,7 +1054,10 @@ public: /// \returns the score of placing \p V1 and \p V2 in consecutive lanes. static int getShallowScore(Value *V1, Value *V2, const DataLayout &DL, - ScalarEvolution &SE) { + ScalarEvolution &SE, int NumLanes) { + if (V1 == V2) + return VLOperands::ScoreSplat; + auto *LI1 = dyn_cast<LoadInst>(V1); auto *LI2 = dyn_cast<LoadInst>(V2); if (LI1 && LI2) { @@ -1057,8 +1067,17 @@ public: Optional<int> Dist = getPointersDiff( LI1->getType(), LI1->getPointerOperand(), LI2->getType(), LI2->getPointerOperand(), DL, SE, /*StrictCheck=*/true); - return (Dist && *Dist == 1) ? VLOperands::ScoreConsecutiveLoads - : VLOperands::ScoreFail; + if (!Dist) + return VLOperands::ScoreFail; + // The distance is too large - still may be profitable to use masked + // loads/gathers. + if (std::abs(*Dist) > NumLanes / 2) + return VLOperands::ScoreAltOpcodes; + // This still will detect consecutive loads, but we might have "holes" + // in some cases. It is ok for non-power-2 vectorization and may produce + // better results. It should not affect current vectorization. + return (*Dist > 0) ? VLOperands::ScoreConsecutiveLoads + : VLOperands::ScoreReversedLoads; } auto *C1 = dyn_cast<Constant>(V1); @@ -1068,18 +1087,41 @@ public: // Extracts from consecutive indexes of the same vector better score as // the extracts could be optimized away. - Value *EV; - ConstantInt *Ex1Idx, *Ex2Idx; - if (match(V1, m_ExtractElt(m_Value(EV), m_ConstantInt(Ex1Idx))) && - match(V2, m_ExtractElt(m_Deferred(EV), m_ConstantInt(Ex2Idx))) && - Ex1Idx->getZExtValue() + 1 == Ex2Idx->getZExtValue()) - return VLOperands::ScoreConsecutiveExtracts; + Value *EV1; + ConstantInt *Ex1Idx; + if (match(V1, m_ExtractElt(m_Value(EV1), m_ConstantInt(Ex1Idx)))) { + // Undefs are always profitable for extractelements. + if (isa<UndefValue>(V2)) + return VLOperands::ScoreConsecutiveExtracts; + Value *EV2 = nullptr; + ConstantInt *Ex2Idx = nullptr; + if (match(V2, + m_ExtractElt(m_Value(EV2), m_CombineOr(m_ConstantInt(Ex2Idx), + m_Undef())))) { + // Undefs are always profitable for extractelements. + if (!Ex2Idx) + return VLOperands::ScoreConsecutiveExtracts; + if (isUndefVector(EV2) && EV2->getType() == EV1->getType()) + return VLOperands::ScoreConsecutiveExtracts; + if (EV2 == EV1) { + int Idx1 = Ex1Idx->getZExtValue(); + int Idx2 = Ex2Idx->getZExtValue(); + int Dist = Idx2 - Idx1; + // The distance is too large - still may be profitable to use + // shuffles. + if (std::abs(Dist) > NumLanes / 2) + return VLOperands::ScoreAltOpcodes; + return (Dist > 0) ? VLOperands::ScoreConsecutiveExtracts + : VLOperands::ScoreReversedExtracts; + } + } + } auto *I1 = dyn_cast<Instruction>(V1); auto *I2 = dyn_cast<Instruction>(V2); if (I1 && I2) { - if (I1 == I2) - return VLOperands::ScoreSplat; + if (I1->getParent() != I2->getParent()) + return VLOperands::ScoreFail; InstructionsState S = getSameOpcode({I1, I2}); // Note: Only consider instructions with <= 2 operands to avoid // complexity explosion. @@ -1094,11 +1136,13 @@ public: return VLOperands::ScoreFail; } - /// Holds the values and their lane that are taking part in the look-ahead + /// Holds the values and their lanes that are taking part in the look-ahead /// score calculation. This is used in the external uses cost calculation. - SmallDenseMap<Value *, int> InLookAheadValues; + /// Need to hold all the lanes in case of splat/broadcast at least to + /// correctly check for the use in the different lane. + SmallDenseMap<Value *, SmallSet<int, 4>> InLookAheadValues; - /// \Returns the additinal cost due to uses of \p LHS and \p RHS that are + /// \returns the additional cost due to uses of \p LHS and \p RHS that are /// either external to the vectorized code, or require shuffling. int getExternalUsesCost(const std::pair<Value *, int> &LHS, const std::pair<Value *, int> &RHS) { @@ -1122,22 +1166,30 @@ public: for (User *U : V->users()) { if (const TreeEntry *UserTE = R.getTreeEntry(U)) { // The user is in the VectorizableTree. Check if we need to insert. - auto It = llvm::find(UserTE->Scalars, U); - assert(It != UserTE->Scalars.end() && "U is in UserTE"); - int UserLn = std::distance(UserTE->Scalars.begin(), It); + int UserLn = UserTE->findLaneForValue(U); assert(UserLn >= 0 && "Bad lane"); - if (UserLn != Ln) + // If the values are different, check just the line of the current + // value. If the values are the same, need to add UserInDiffLaneCost + // only if UserLn does not match both line numbers. + if ((LHS.first != RHS.first && UserLn != Ln) || + (LHS.first == RHS.first && UserLn != LHS.second && + UserLn != RHS.second)) { Cost += UserInDiffLaneCost; + break; + } } else { // Check if the user is in the look-ahead code. auto It2 = InLookAheadValues.find(U); if (It2 != InLookAheadValues.end()) { // The user is in the look-ahead code. Check the lane. - if (It2->second != Ln) + if (!It2->getSecond().contains(Ln)) { Cost += UserInDiffLaneCost; + break; + } } else { // The user is neither in SLP tree nor in the look-ahead code. Cost += ExternalUseCost; + break; } } // Limit the number of visited uses to cap compilation time. @@ -1176,32 +1228,36 @@ public: Value *V1 = LHS.first; Value *V2 = RHS.first; // Get the shallow score of V1 and V2. - int ShallowScoreAtThisLevel = - std::max((int)ScoreFail, getShallowScore(V1, V2, DL, SE) - - getExternalUsesCost(LHS, RHS)); + int ShallowScoreAtThisLevel = std::max( + (int)ScoreFail, getShallowScore(V1, V2, DL, SE, getNumLanes()) - + getExternalUsesCost(LHS, RHS)); int Lane1 = LHS.second; int Lane2 = RHS.second; // If reached MaxLevel, // or if V1 and V2 are not instructions, // or if they are SPLAT, - // or if they are not consecutive, early return the current cost. + // or if they are not consecutive, + // or if profitable to vectorize loads or extractelements, early return + // the current cost. auto *I1 = dyn_cast<Instruction>(V1); auto *I2 = dyn_cast<Instruction>(V2); if (CurrLevel == MaxLevel || !(I1 && I2) || I1 == I2 || ShallowScoreAtThisLevel == VLOperands::ScoreFail || - (isa<LoadInst>(I1) && isa<LoadInst>(I2) && ShallowScoreAtThisLevel)) + (((isa<LoadInst>(I1) && isa<LoadInst>(I2)) || + (isa<ExtractElementInst>(I1) && isa<ExtractElementInst>(I2))) && + ShallowScoreAtThisLevel)) return ShallowScoreAtThisLevel; assert(I1 && I2 && "Should have early exited."); // Keep track of in-tree values for determining the external-use cost. - InLookAheadValues[V1] = Lane1; - InLookAheadValues[V2] = Lane2; + InLookAheadValues[V1].insert(Lane1); + InLookAheadValues[V2].insert(Lane2); // Contains the I2 operand indexes that got matched with I1 operands. SmallSet<unsigned, 4> Op2Used; - // Recursion towards the operands of I1 and I2. We are trying all possbile + // Recursion towards the operands of I1 and I2. We are trying all possible // operand pairs, and keeping track of the best score. for (unsigned OpIdx1 = 0, NumOperands1 = I1->getNumOperands(); OpIdx1 != NumOperands1; ++OpIdx1) { @@ -1325,27 +1381,79 @@ public: return None; } - /// Helper for reorderOperandVecs. \Returns the lane that we should start - /// reordering from. This is the one which has the least number of operands - /// that can freely move about. + /// Helper for reorderOperandVecs. + /// \returns the lane that we should start reordering from. This is the one + /// which has the least number of operands that can freely move about or + /// less profitable because it already has the most optimal set of operands. unsigned getBestLaneToStartReordering() const { - unsigned BestLane = 0; unsigned Min = UINT_MAX; - for (unsigned Lane = 0, NumLanes = getNumLanes(); Lane != NumLanes; - ++Lane) { - unsigned NumFreeOps = getMaxNumOperandsThatCanBeReordered(Lane); - if (NumFreeOps < Min) { - Min = NumFreeOps; - BestLane = Lane; + unsigned SameOpNumber = 0; + // std::pair<unsigned, unsigned> is used to implement a simple voting + // algorithm and choose the lane with the least number of operands that + // can freely move about or less profitable because it already has the + // most optimal set of operands. The first unsigned is a counter for + // voting, the second unsigned is the counter of lanes with instructions + // with same/alternate opcodes and same parent basic block. + MapVector<unsigned, std::pair<unsigned, unsigned>> HashMap; + // Try to be closer to the original results, if we have multiple lanes + // with same cost. If 2 lanes have the same cost, use the one with the + // lowest index. + for (int I = getNumLanes(); I > 0; --I) { + unsigned Lane = I - 1; + OperandsOrderData NumFreeOpsHash = + getMaxNumOperandsThatCanBeReordered(Lane); + // Compare the number of operands that can move and choose the one with + // the least number. + if (NumFreeOpsHash.NumOfAPOs < Min) { + Min = NumFreeOpsHash.NumOfAPOs; + SameOpNumber = NumFreeOpsHash.NumOpsWithSameOpcodeParent; + HashMap.clear(); + HashMap[NumFreeOpsHash.Hash] = std::make_pair(1, Lane); + } else if (NumFreeOpsHash.NumOfAPOs == Min && + NumFreeOpsHash.NumOpsWithSameOpcodeParent < SameOpNumber) { + // Select the most optimal lane in terms of number of operands that + // should be moved around. + SameOpNumber = NumFreeOpsHash.NumOpsWithSameOpcodeParent; + HashMap[NumFreeOpsHash.Hash] = std::make_pair(1, Lane); + } else if (NumFreeOpsHash.NumOfAPOs == Min && + NumFreeOpsHash.NumOpsWithSameOpcodeParent == SameOpNumber) { + ++HashMap[NumFreeOpsHash.Hash].first; + } + } + // Select the lane with the minimum counter. + unsigned BestLane = 0; + unsigned CntMin = UINT_MAX; + for (const auto &Data : reverse(HashMap)) { + if (Data.second.first < CntMin) { + CntMin = Data.second.first; + BestLane = Data.second.second; } } return BestLane; } - /// \Returns the maximum number of operands that are allowed to be reordered - /// for \p Lane. This is used as a heuristic for selecting the first lane to - /// start operand reordering. - unsigned getMaxNumOperandsThatCanBeReordered(unsigned Lane) const { + /// Data structure that helps to reorder operands. + struct OperandsOrderData { + /// The best number of operands with the same APOs, which can be + /// reordered. + unsigned NumOfAPOs = UINT_MAX; + /// Number of operands with the same/alternate instruction opcode and + /// parent. + unsigned NumOpsWithSameOpcodeParent = 0; + /// Hash for the actual operands ordering. + /// Used to count operands, actually their position id and opcode + /// value. It is used in the voting mechanism to find the lane with the + /// least number of operands that can freely move about or less profitable + /// because it already has the most optimal set of operands. Can be + /// replaced with SmallVector<unsigned> instead but hash code is faster + /// and requires less memory. + unsigned Hash = 0; + }; + /// \returns the maximum number of operands that are allowed to be reordered + /// for \p Lane and the number of compatible instructions(with the same + /// parent/opcode). This is used as a heuristic for selecting the first lane + /// to start operand reordering. + OperandsOrderData getMaxNumOperandsThatCanBeReordered(unsigned Lane) const { unsigned CntTrue = 0; unsigned NumOperands = getNumOperands(); // Operands with the same APO can be reordered. We therefore need to count @@ -1354,11 +1462,45 @@ public: // a map. Instead we can simply count the number of operands that // correspond to one of them (in this case the 'true' APO), and calculate // the other by subtracting it from the total number of operands. - for (unsigned OpIdx = 0; OpIdx != NumOperands; ++OpIdx) - if (getData(OpIdx, Lane).APO) + // Operands with the same instruction opcode and parent are more + // profitable since we don't need to move them in many cases, with a high + // probability such lane already can be vectorized effectively. + bool AllUndefs = true; + unsigned NumOpsWithSameOpcodeParent = 0; + Instruction *OpcodeI = nullptr; + BasicBlock *Parent = nullptr; + unsigned Hash = 0; + for (unsigned OpIdx = 0; OpIdx != NumOperands; ++OpIdx) { + const OperandData &OpData = getData(OpIdx, Lane); + if (OpData.APO) ++CntTrue; - unsigned CntFalse = NumOperands - CntTrue; - return std::max(CntTrue, CntFalse); + // Use Boyer-Moore majority voting for finding the majority opcode and + // the number of times it occurs. + if (auto *I = dyn_cast<Instruction>(OpData.V)) { + if (!OpcodeI || !getSameOpcode({OpcodeI, I}).getOpcode() || + I->getParent() != Parent) { + if (NumOpsWithSameOpcodeParent == 0) { + NumOpsWithSameOpcodeParent = 1; + OpcodeI = I; + Parent = I->getParent(); + } else { + --NumOpsWithSameOpcodeParent; + } + } else { + ++NumOpsWithSameOpcodeParent; + } + } + Hash = hash_combine( + Hash, hash_value((OpIdx + 1) * (OpData.V->getValueID() + 1))); + AllUndefs = AllUndefs && isa<UndefValue>(OpData.V); + } + if (AllUndefs) + return {}; + OperandsOrderData Data; + Data.NumOfAPOs = std::max(CntTrue, NumOperands - CntTrue); + Data.NumOpsWithSameOpcodeParent = NumOpsWithSameOpcodeParent; + Data.Hash = Hash; + return Data; } /// Go through the instructions in VL and append their operands. @@ -2876,7 +3018,8 @@ void BoUpSLP::reorderTopToBottom() { // their ordering. DenseMap<const TreeEntry *, OrdersType> GathersToOrders; // Find all reorderable nodes with the given VF. - // Currently the are vectorized loads,extracts + some gathering of extracts. + // Currently the are vectorized stores,loads,extracts + some gathering of + // extracts. for_each(VectorizableTree, [this, &VFToOrderedEntries, &GathersToOrders]( const std::unique_ptr<TreeEntry> &TE) { if (Optional<OrdersType> CurrentOrder = @@ -3497,11 +3640,9 @@ void BoUpSLP::buildTree_rec(ArrayRef<Value *> VL, unsigned Depth, } } - // If any of the scalars is marked as a value that needs to stay scalar, then - // we need to gather the scalars. // The reduction nodes (stored in UserIgnoreList) also should stay scalar. for (Value *V : VL) { - if (MustGather.count(V) || is_contained(UserIgnoreList, V)) { + if (is_contained(UserIgnoreList, V)) { LLVM_DEBUG(dbgs() << "SLP: Gathering due to gathered scalar.\n"); if (TryToFindDuplicates(S)) newTreeEntry(VL, None /*not vectorized*/, S, UserTreeIdx, diff --git a/llvm/test/Transforms/SLPVectorizer/AArch64/transpose-inseltpoison.ll b/llvm/test/Transforms/SLPVectorizer/AArch64/transpose-inseltpoison.ll index fa95ec7357aa..c8aa06677f8f 100644 --- a/llvm/test/Transforms/SLPVectorizer/AArch64/transpose-inseltpoison.ll +++ b/llvm/test/Transforms/SLPVectorizer/AArch64/transpose-inseltpoison.ll @@ -167,25 +167,17 @@ define <4 x i32> @build_vec_v4i32_reuse_1(<2 x i32> %v0, <2 x i32> %v1) { define <4 x i32> @build_vec_v4i32_3_binops(<2 x i32> %v0, <2 x i32> %v1) { ; CHECK-LABEL: @build_vec_v4i32_3_binops( -; CHECK-NEXT: [[V0_0:%.*]] = extractelement <2 x i32> [[V0:%.*]], i64 0 -; CHECK-NEXT: [[V0_1:%.*]] = extractelement <2 x i32> [[V0]], i64 1 -; CHECK-NEXT: [[V1_0:%.*]] = extractelement <2 x i32> [[V1:%.*]], i64 0 -; CHECK-NEXT: [[V1_1:%.*]] = extractelement <2 x i32> [[V1]], i64 1 -; CHECK-NEXT: [[TMP0_0:%.*]] = add i32 [[V0_0]], [[V1_0]] -; CHECK-NEXT: [[TMP0_1:%.*]] = add i32 [[V0_1]], [[V1_1]] -; CHECK-NEXT: [[TMP1_0:%.*]] = mul i32 [[V0_0]], [[V1_0]] -; CHECK-NEXT: [[TMP1_1:%.*]] = mul i32 [[V0_1]], [[V1_1]] -; CHECK-NEXT: [[TMP1:%.*]] = xor <2 x i32> [[V0]], [[V1]] -; CHECK-NEXT: [[TMP2:%.*]] = shufflevector <2 x i32> [[TMP1]], <2 x i32> poison, <2 x i32> zeroinitializer -; CHECK-NEXT: [[TMP3:%.*]] = xor <2 x i32> [[V0]], [[V1]] -; CHECK-NEXT: [[TMP4:%.*]] = shufflevector <2 x i32> [[TMP3]], <2 x i32> poison, <2 x i32> <i32 1, i32 1> -; CHECK-NEXT: [[TMP2_0:%.*]] = add i32 [[TMP0_0]], [[TMP0_1]] -; CHECK-NEXT: [[TMP2_1:%.*]] = add i32 [[TMP1_0]], [[TMP1_1]] -; CHECK-NEXT: [[TMP5:%.*]] = add <2 x i32> [[TMP2]], [[TMP4]] -; CHECK-NEXT: [[TMP3_0:%.*]] = insertelement <4 x i32> poison, i32 [[TMP2_0]], i64 0 -; CHECK-NEXT: [[TMP3_1:%.*]] = insertelement <4 x i32> [[TMP3_0]], i32 [[TMP2_1]], i64 1 -; CHECK-NEXT: [[TMP6:%.*]] = shufflevector <2 x i32> [[TMP5]], <2 x i32> poison, <4 x i32> <i32 0, i32 1, i32 undef, i32 undef> -; CHECK-NEXT: [[TMP3_31:%.*]] = shufflevector <4 x i32> [[TMP3_1]], <4 x i32> [[TMP6]], <4 x i32> <i32 0, i32 1, i32 4, i32 5> +; CHECK-NEXT: [[TMP1:%.*]] = add <2 x i32> [[V0:%.*]], [[V1:%.*]] +; CHECK-NEXT: [[TMP2:%.*]] = mul <2 x i32> [[V0]], [[V1]] +; CHECK-NEXT: [[TMP3:%.*]] = shufflevector <2 x i32> [[TMP1]], <2 x i32> [[TMP2]], <2 x i32> <i32 1, i32 2> +; CHECK-NEXT: [[TMP4:%.*]] = shufflevector <2 x i32> [[TMP1]], <2 x i32> [[TMP2]], <2 x i32> <i32 0, i32 3> +; CHECK-NEXT: [[TMP5:%.*]] = xor <2 x i32> [[V0]], [[V1]] +; CHECK-NEXT: [[TMP6:%.*]] = shufflevector <2 x i32> [[TMP5]], <2 x i32> poison, <2 x i32> zeroinitializer +; CHECK-NEXT: [[TMP7:%.*]] = xor <2 x i32> [[V0]], [[V1]] +; CHECK-NEXT: [[TMP8:%.*]] = shufflevector <2 x i32> [[TMP7]], <2 x i32> poison, <2 x i32> <i32 1, i32 1> +; CHECK-NEXT: [[TMP9:%.*]] = add <2 x i32> [[TMP4]], [[TMP3]] +; CHECK-NEXT: [[TMP10:%.*]] = add <2 x i32> [[TMP6]], [[TMP8]] +; CHECK-NEXT: [[TMP3_31:%.*]] = shufflevector <2 x i32> [[TMP9]], <2 x i32> [[TMP10]], <4 x i32> <i32 0, i32 1, i32 2, i32 3> ; CHECK-NEXT: ret <4 x i32> [[TMP3_31]] ; %v0.0 = extractelement <2 x i32> %v0, i32 0 diff --git a/llvm/test/Transforms/SLPVectorizer/AArch64/transpose.ll b/llvm/test/Transforms/SLPVectorizer/AArch64/transpose.ll index dcfdbee9bc5f..307480ce8018 100644 --- a/llvm/test/Transforms/SLPVectorizer/AArch64/transpose.ll +++ b/llvm/test/Transforms/SLPVectorizer/AArch64/transpose.ll @@ -167,25 +167,17 @@ define <4 x i32> @build_vec_v4i32_reuse_1(<2 x i32> %v0, <2 x i32> %v1) { define <4 x i32> @build_vec_v4i32_3_binops(<2 x i32> %v0, <2 x i32> %v1) { ; CHECK-LABEL: @build_vec_v4i32_3_binops( -; CHECK-NEXT: [[V0_0:%.*]] = extractelement <2 x i32> [[V0:%.*]], i64 0 -; CHECK-NEXT: [[V0_1:%.*]] = extractelement <2 x i32> [[V0]], i64 1 -; CHECK-NEXT: [[V1_0:%.*]] = extractelement <2 x i32> [[V1:%.*]], i64 0 -; CHECK-NEXT: [[V1_1:%.*]] = extractelement <2 x i32> [[V1]], i64 1 -; CHECK-NEXT: [[TMP0_0:%.*]] = add i32 [[V0_0]], [[V1_0]] -; CHECK-NEXT: [[TMP0_1:%.*]] = add i32 [[V0_1]], [[V1_1]] -; CHECK-NEXT: [[TMP1_0:%.*]] = mul i32 [[V0_0]], [[V1_0]] -; CHECK-NEXT: [[TMP1_1:%.*]] = mul i32 [[V0_1]], [[V1_1]] -; CHECK-NEXT: [[TMP1:%.*]] = xor <2 x i32> [[V0]], [[V1]] -; CHECK-NEXT: [[TMP2:%.*]] = shufflevector <2 x i32> [[TMP1]], <2 x i32> poison, <2 x i32> zeroinitializer -; CHECK-NEXT: [[TMP3:%.*]] = xor <2 x i32> [[V0]], [[V1]] -; CHECK-NEXT: [[TMP4:%.*]] = shufflevector <2 x i32> [[TMP3]], <2 x i32> poison, <2 x i32> <i32 1, i32 1> -; CHECK-NEXT: [[TMP2_0:%.*]] = add i32 [[TMP0_0]], [[TMP0_1]] -; CHECK-NEXT: [[TMP2_1:%.*]] = add i32 [[TMP1_0]], [[TMP1_1]] -; CHECK-NEXT: [[TMP5:%.*]] = add <2 x i32> [[TMP2]], [[TMP4]] -; CHECK-NEXT: [[TMP3_0:%.*]] = insertelement <4 x i32> undef, i32 [[TMP2_0]], i64 0 -; CHECK-NEXT: [[TMP3_1:%.*]] = insertelement <4 x i32> [[TMP3_0]], i32 [[TMP2_1]], i64 1 -; CHECK-NEXT: [[TMP6:%.*]] = shufflevector <2 x i32> [[TMP5]], <2 x i32> poison, <4 x i32> <i32 0, i32 1, i32 undef, i32 undef> -; CHECK-NEXT: [[TMP3_31:%.*]] = shufflevector <4 x i32> [[TMP3_1]], <4 x i32> [[TMP6]], <4 x i32> <i32 0, i32 1, i32 4, i32 5> +; CHECK-NEXT: [[TMP1:%.*]] = add <2 x i32> [[V0:%.*]], [[V1:%.*]] +; CHECK-NEXT: [[TMP2:%.*]] = mul <2 x i32> [[V0]], [[V1]] +; CHECK-NEXT: [[TMP3:%.*]] = shufflevector <2 x i32> [[TMP1]], <2 x i32> [[TMP2]], <2 x i32> <i32 1, i32 2> +; CHECK-NEXT: [[TMP4:%.*]] = shufflevector <2 x i32> [[TMP1]], <2 x i32> [[TMP2]], <2 x i32> <i32 0, i32 3> +; CHECK-NEXT: [[TMP5:%.*]] = xor <2 x i32> [[V0]], [[V1]] +; CHECK-NEXT: [[TMP6:%.*]] = shufflevector <2 x i32> [[TMP5]], <2 x i32> poison, <2 x i32> zeroinitializer +; CHECK-NEXT: [[TMP7:%.*]] = xor <2 x i32> [[V0]], [[V1]] +; CHECK-NEXT: [[TMP8:%.*]] = shufflevector <2 x i32> [[TMP7]], <2 x i32> poison, <2 x i32> <i32 1, i32 1> +; CHECK-NEXT: [[TMP9:%.*]] = add <2 x i32> [[TMP4]], [[TMP3]] +; CHECK-NEXT: [[TMP10:%.*]] = add <2 x i32> [[TMP6]], [[TMP8]] +; CHECK-NEXT: [[TMP3_31:%.*]] = shufflevector <2 x i32> [[TMP9]], <2 x i32> [[TMP10]], <4 x i32> <i32 0, i32 1, i32 2, i32 3> ; CHECK-NEXT: ret <4 x i32> [[TMP3_31]] ; %v0.0 = extractelement <2 x i32> %v0, i32 0 diff --git a/llvm/test/Transforms/SLPVectorizer/AArch64/vectorize-free-extracts-inserts.ll b/llvm/test/Transforms/SLPVectorizer/AArch64/vectorize-free-extracts-inserts.ll index d7ef813d6b72..b79d2d494aa4 100644 --- a/llvm/test/Transforms/SLPVectorizer/AArch64/vectorize-free-extracts-inserts.ll +++ b/llvm/test/Transforms/SLPVectorizer/AArch64/vectorize-free-extracts-inserts.ll @@ -282,19 +282,21 @@ define void @extracts_jumbled_4_lanes(<9 x double>* %ptr.1, <4 x double>* %ptr.2 ; CHECK-NEXT: [[V2_LANE_0:%.*]] = extractelement <4 x double> [[V_2]], i32 0 ; CHECK-NEXT: [[V2_LANE_1:%.*]] = extractelement <4 x double> [[V_2]], i32 1 ; CHECK-NEXT: [[V2_LANE_2:%.*]] = extractelement <4 x double> [[V_2]], i32 2 -; CHECK-NEXT: [[A_LANE_0:%.*]] = fmul double [[V1_LANE_0]], [[V2_LANE_2]] -; CHECK-NEXT: [[A_LANE_1:%.*]] = fmul double [[V1_LANE_2]], [[V2_LANE_1]] -; CHECK-NEXT: [[A_LANE_2:%.*]] = fmul double [[V1_LANE_1]], [[V2_LANE_2]] -; CHECK-NEXT: [[A_LANE_3:%.*]] = fmul double [[V1_LANE_3]], [[V2_LANE_0]] -; CHECK-NEXT: [[A_INS_0:%.*]] = insertelement <9 x double> undef, double [[A_LANE_0]], i32 0 -; CHECK-NEXT: [[A_INS_1:%.*]] = insertelement <9 x double> [[A_INS_0]], double [[A_LANE_1]], i32 1 -; CHECK-NEXT: [[A_INS_2:%.*]] = insertelement <9 x double> [[A_INS_1]], double [[A_LANE_2]], i32 2 -; CHECK-NEXT: [[A_INS_3:%.*]] = insertelement <9 x double> [[A_INS_2]], double [[A_LANE_3]], i32 3 +; CHECK-NEXT: [[TMP0:%.*]] = insertelement <4 x double> poison, double [[V1_LANE_0]], i32 0 +; CHECK-NEXT: [[TMP1:%.*]] = insertelement <4 x double> [[TMP0]], double [[V1_LANE_2]], i32 1 +; CHECK-NEXT: [[TMP2:%.*]] = insertelement <4 x double> [[TMP1]], double [[V1_LANE_1]], i32 2 +; CHECK-NEXT: [[TMP3:%.*]] = insertelement <4 x double> [[TMP2]], double [[V1_LANE_3]], i32 3 +; CHECK-NEXT: [[TMP4:%.*]] = insertelement <4 x double> poison, double [[V2_LANE_2]], i32 0 +; CHECK-NEXT: [[TMP5:%.*]] = insertelement <4 x double> [[TMP4]], double [[V2_LANE_1]], i32 1 +; CHECK-NEXT: [[TMP6:%.*]] = insertelement <4 x double> [[TMP5]], double [[V2_LANE_2]], i32 2 +; CHECK-NEXT: [[TMP7:%.*]] = insertelement <4 x double> [[TMP6]], double [[V2_LANE_0]], i32 3 +; CHECK-NEXT: [[TMP8:%.*]] = fmul <4 x double> [[TMP3]], [[TMP7]] +; CHECK-NEXT: [[TMP9:%.*]] = shufflevector <4 x double> [[TMP8]], <4 x double> poison, <9 x i32> <i32 0, i32 1, i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef> ; CHECK-NEXT: call void @use(double [[V1_LANE_0]]) ; CHECK-NEXT: call void @use(double [[V1_LANE_1]]) ; CHECK-NEXT: call void @use(double [[V1_LANE_2]]) ; CHECK-NEXT: call void @use(double [[V1_LANE_3]]) -; CHECK-NEXT: store <9 x double> [[A_INS_3]], <9 x double>* [[PTR_1]], align 8 +; CHECK-NEXT: store <9 x double> [[TMP9]], <9 x double>* [[PTR_1]], align 8 ; CHECK-NEXT: ret void ; bb: diff --git a/llvm/test/Transforms/SLPVectorizer/X86/PR39774.ll b/llvm/test/Transforms/SLPVectorizer/X86/PR39774.ll index 51a6e1ed81b1..7668747a75ac 100644 --- a/llvm/test/Transforms/SLPVectorizer/X86/PR39774.ll +++ b/llvm/test/Transforms/SLPVectorizer/X86/PR39774.ll @@ -1,6 +1,6 @@ ; NOTE: Assertions have been autogenerated by utils/update_test_checks.py ; RUN: opt -slp-vectorizer -S < %s -mtriple=x86_64-unknown-linux-gnu -mcpu=skylake -slp-threshold=-6 | FileCheck %s --check-prefix=CHECK -; RUN: opt -slp-vectorizer -S < %s -mtriple=x86_64-unknown-linux-gnu -mcpu=skylake -slp-threshold=-8 -slp-min-tree-size=6 | FileCheck %s --check-prefix=FORCE_REDUCTION +; RUN: opt -slp-vectorizer -S < %s -mtriple=x86_64-unknown-linux-gnu -mcpu=skylake -slp-threshold=-7 -slp-min-tree-size=6 | FileCheck %s --check-prefix=FORCE_REDUCTION define void @Test(i32) { ; CHECK-LABEL: @Test( diff --git a/llvm/test/Transforms/SLPVectorizer/X86/addsub.ll b/llvm/test/Transforms/SLPVectorizer/X86/addsub.ll index c9cb8951e882..ebbbefc9f81f 100644 --- a/llvm/test/Transforms/SLPVectorizer/X86/addsub.ll +++ b/llvm/test/Transforms/SLPVectorizer/X86/addsub.ll @@ -342,18 +342,18 @@ define void @vec_shuff_reorder() #0 { ; CHECK-LABEL: @vec_shuff_reorder( ; CHECK-NEXT: [[TMP1:%.*]] = load float, float* getelementptr inbounds ([4 x float], [4 x float]* @fb, i32 0, i64 0), align 4 ; CHECK-NEXT: [[TMP2:%.*]] = load float, float* getelementptr inbounds ([4 x float], [4 x float]* @fa, i32 0, i64 0), align 4 -; CHECK-NEXT: [[TMP3:%.*]] = load <2 x float>, <2 x float>* bitcast (float* getelementptr inbounds ([4 x float], [4 x float]* @fa, i32 0, i64 1) to <2 x float>*), align 4 -; CHECK-NEXT: [[TMP4:%.*]] = load <2 x float>, <2 x float>* bitcast (float* getelementptr inbounds ([4 x float], [4 x float]* @fb, i32 0, i64 1) to <2 x float>*), align 4 -; CHECK-NEXT: [[TMP5:%.*]] = load float, float* getelementptr inbounds ([4 x float], [4 x float]* @fb, i32 0, i64 3), align 4 -; CHECK-NEXT: [[TMP6:%.*]] = load float, float* getelementptr inbounds ([4 x float], [4 x float]* @fa, i32 0, i64 3), align 4 -; CHECK-NEXT: [[TMP7:%.*]] = insertelement <4 x float> poison, float [[TMP2]], i32 0 -; CHECK-NEXT: [[TMP8:%.*]] = shufflevector <2 x float> [[TMP3]], <2 x float> poison, <4 x i32> <i32 0, i32 1, i32 undef, i32 undef> -; CHECK-NEXT: [[TMP9:%.*]] = shufflevector <4 x float> [[TMP7]], <4 x float> [[TMP8]], <4 x i32> <i32 0, i32 4, i32 5, i32 3> -; CHECK-NEXT: [[TMP10:%.*]] = insertelement <4 x float> [[TMP9]], float [[TMP5]], i32 3 -; CHECK-NEXT: [[TMP11:%.*]] = insertelement <4 x float> poison, float [[TMP1]], i32 0 -; CHECK-NEXT: [[TMP12:%.*]] = shufflevector <2 x float> [[TMP4]], <2 x float> poison, <4 x i32> <i32 0, i32 1, i32 undef, i32 undef> -; CHECK-NEXT: [[TMP13:%.*]] = shufflevector <4 x float> [[TMP11]], <4 x float> [[TMP12]], <4 x i32> <i32 0, i32 4, i32 5, i32 3> -; CHECK-NEXT: [[TMP14:%.*]] = insertelement <4 x float> [[TMP13]], float [[TMP6]], i32 3 +; CHECK-NEXT: [[TMP3:%.*]] = load float, float* getelementptr inbounds ([4 x float], [4 x float]* @fa, i32 0, i64 1), align 4 +; CHECK-NEXT: [[TMP4:%.*]] = load float, float* getelementptr inbounds ([4 x float], [4 x float]* @fb, i32 0, i64 1), align 4 +; CHECK-NEXT: [[TMP5:%.*]] = load <2 x float>, <2 x float>* bitcast (float* getelementptr inbounds ([4 x float], [4 x float]* @fb, i32 0, i64 2) to <2 x float>*), align 4 +; CHECK-NEXT: [[TMP6:%.*]] = load <2 x float>, <2 x float>* bitcast (float* getelementptr inbounds ([4 x float], [4 x float]* @fa, i32 0, i64 2) to <2 x float>*), align 4 +; CHECK-NEXT: [[TMP7:%.*]] = insertelement <4 x float> poison, float [[TMP1]], i32 0 +; CHECK-NEXT: [[TMP8:%.*]] = insertelement <4 x float> [[TMP7]], float [[TMP3]], i32 1 +; CHECK-NEXT: [[TMP9:%.*]] = shufflevector <2 x float> [[TMP5]], <2 x float> poison, <4 x i32> <i32 0, i32 1, i32 undef, i32 undef> +; CHECK-NEXT: [[TMP10:%.*]] = shufflevector <4 x float> [[TMP8]], <4 x float> [[TMP9]], <4 x i32> <i32 0, i32 1, i32 4, i32 5> +; CHECK-NEXT: [[TMP11:%.*]] = insertelement <4 x float> poison, float [[TMP2]], i32 0 +; CHECK-NEXT: [[TMP12:%.*]] = insertelement <4 x float> [[TMP11]], float [[TMP4]], i32 1 +; CHECK-NEXT: [[TMP13:%.*]] = shufflevector <2 x float> [[TMP6]], <2 x float> poison, <4 x i32> <i32 0, i32 1, i32 undef, i32 undef> +; CHECK-NEXT: [[TMP14:%.*]] = shufflevector <4 x float> [[TMP12]], <4 x float> [[TMP13]], <4 x i32> <i32 0, i32 1, i32 4, i32 5> ; CHECK-NEXT: [[TMP15:%.*]] = fadd <4 x float> [[TMP10]], [[TMP14]] ; CHECK-NEXT: [[TMP16:%.*]] = fsub <4 x float> [[TMP10]], [[TMP14]] ; CHECK-NEXT: [[TMP17:%.*]] = shufflevector <4 x float> [[TMP15]], <4 x float> [[TMP16]], <4 x i32> <i32 0, i32 5, i32 2, i32 7> diff --git a/llvm/test/Transforms/SLPVectorizer/X86/commutativity.ll b/llvm/test/Transforms/SLPVectorizer/X86/commutativity.ll index d23dc9b1d822..1a218ae02aef 100644 --- a/llvm/test/Transforms/SLPVectorizer/X86/commutativity.ll +++ b/llvm/test/Transforms/SLPVectorizer/X86/commutativity.ll @@ -16,21 +16,21 @@ define void @splat(i8 %a, i8 %b, i8 %c) { ; SSE-LABEL: @splat( -; SSE-NEXT: [[TMP1:%.*]] = insertelement <16 x i8> poison, i8 [[C:%.*]], i32 0 -; SSE-NEXT: [[SHUFFLE:%.*]] = shufflevector <16 x i8> [[TMP1]], <16 x i8> poison, <16 x i32> zeroinitializer -; SSE-NEXT: [[TMP2:%.*]] = insertelement <16 x i8> poison, i8 [[A:%.*]], i32 0 -; SSE-NEXT: [[TMP3:%.*]] = insertelement <16 x i8> [[TMP2]], i8 [[B:%.*]], i32 1 -; SSE-NEXT: [[SHUFFLE1:%.*]] = shufflevector <16 x i8> [[TMP3]], <16 x i8> poison, <16 x i32> <i32 0, i32 0, i32 0, i32 0, i32 0, i32 1, i32 0, i32 1, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0> +; SSE-NEXT: [[TMP1:%.*]] = insertelement <16 x i8> poison, i8 [[A:%.*]], i32 0 +; SSE-NEXT: [[TMP2:%.*]] = insertelement <16 x i8> [[TMP1]], i8 [[B:%.*]], i32 1 +; SSE-NEXT: [[SHUFFLE:%.*]] = shufflevector <16 x i8> [[TMP2]], <16 x i8> poison, <16 x i32> <i32 0, i32 0, i32 0, i32 0, i32 0, i32 1, i32 0, i32 1, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0> +; SSE-NEXT: [[TMP3:%.*]] = insertelement <16 x i8> poison, i8 [[C:%.*]], i32 0 +; SSE-NEXT: [[SHUFFLE1:%.*]] = shufflevector <16 x i8> [[TMP3]], <16 x i8> poison, <16 x i32> zeroinitializer ; SSE-NEXT: [[TMP4:%.*]] = xor <16 x i8> [[SHUFFLE]], [[SHUFFLE1]] ; SSE-NEXT: store <16 x i8> [[TMP4]], <16 x i8>* bitcast ([32 x i8]* @cle to <16 x i8>*), align 16 ; SSE-NEXT: ret void ; ; AVX-LABEL: @splat( -; AVX-NEXT: [[TMP1:%.*]] = insertelement <16 x i8> poison, i8 [[C:%.*]], i32 0 -; AVX-NEXT: [[SHUFFLE:%.*]] = shufflevector <16 x i8> [[TMP1]], <16 x i8> poison, <16 x i32> zeroinitializer -; AVX-NEXT: [[TMP2:%.*]] = insertelement <16 x i8> poison, i8 [[A:%.*]], i32 0 -; AVX-NEXT: [[TMP3:%.*]] = insertelement <16 x i8> [[TMP2]], i8 [[B:%.*]], i32 1 -; AVX-NEXT: [[SHUFFLE1:%.*]] = shufflevector <16 x i8> [[TMP3]], <16 x i8> poison, <16 x i32> <i32 0, i32 0, i32 0, i32 0, i32 0, i32 1, i32 0, i32 1, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0> +; AVX-NEXT: [[TMP1:%.*]] = insertelement <16 x i8> poison, i8 [[A:%.*]], i32 0 +; AVX-NEXT: [[TMP2:%.*]] = insertelement <16 x i8> [[TMP1]], i8 [[B:%.*]], i32 1 +; AVX-NEXT: [[SHUFFLE:%.*]] = shufflevector <16 x i8> [[TMP2]], <16 x i8> poison, <16 x i32> <i32 0, i32 0, i32 0, i32 0, i32 0, i32 1, i32 0, i32 1, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0> +; AVX-NEXT: [[TMP3:%.*]] = insertelement <16 x i8> poison, i8 [[C:%.*]], i32 0 +; AVX-NEXT: [[SHUFFLE1:%.*]] = shufflevector <16 x i8> [[TMP3]], <16 x i8> poison, <16 x i32> zeroinitializer ; AVX-NEXT: [[TMP4:%.*]] = xor <16 x i8> [[SHUFFLE]], [[SHUFFLE1]] ; AVX-NEXT: store <16 x i8> [[TMP4]], <16 x i8>* bitcast ([32 x i8]* @cle to <16 x i8>*), align 16 ; AVX-NEXT: ret void diff --git a/llvm/test/Transforms/SLPVectorizer/X86/crash_exceed_scheduling.ll b/llvm/test/Transforms/SLPVectorizer/X86/crash_exceed_scheduling.ll index 6be7dda2375d..098b83bb0259 100644 --- a/llvm/test/Transforms/SLPVectorizer/X86/crash_exceed_scheduling.ll +++ b/llvm/test/Transforms/SLPVectorizer/X86/crash_exceed_scheduling.ll @@ -34,9 +34,9 @@ define void @exceed(double %0, double %1) { ; CHECK-NEXT: [[TMP11:%.*]] = fadd fast <2 x double> [[TMP3]], [[TMP5]] ; CHECK-NEXT: [[TMP12:%.*]] = fmul fast <2 x double> [[TMP10]], [[TMP11]] ; CHECK-NEXT: [[IXX101:%.*]] = fsub double undef, undef -; CHECK-NEXT: [[TMP13:%.*]] = insertelement <2 x double> <double poison, double undef>, double [[TMP7]], i32 0 -; CHECK-NEXT: [[TMP14:%.*]] = insertelement <2 x double> <double undef, double poison>, double [[TMP1]], i32 1 -; CHECK-NEXT: [[TMP15:%.*]] = fmul fast <2 x double> [[TMP13]], [[TMP14]] +; CHECK-NEXT: [[TMP13:%.*]] = insertelement <2 x double> poison, double [[TMP1]], i32 1 +; CHECK-NEXT: [[TMP14:%.*]] = insertelement <2 x double> [[TMP13]], double [[TMP7]], i32 0 +; CHECK-NEXT: [[TMP15:%.*]] = fmul fast <2 x double> [[TMP14]], undef ; CHECK-NEXT: switch i32 undef, label [[BB1:%.*]] [ ; CHECK-NEXT: i32 0, label [[BB2:%.*]] ; CHECK-NEXT: ] diff --git a/llvm/test/Transforms/SLPVectorizer/X86/crash_smallpt.ll b/llvm/test/Transforms/SLPVectorizer/X86/crash_smallpt.ll index c8beac34fc90..9c8fbf8a2ed9 100644 --- a/llvm/test/Transforms/SLPVectorizer/X86/crash_smallpt.ll +++ b/llvm/test/Transforms/SLPVectorizer/X86/crash_smallpt.ll @@ -30,15 +30,19 @@ define void @main() #0 { ; CHECK-NEXT: br i1 undef, label [[COND_TRUE63_US:%.*]], label [[COND_FALSE66_US:%.*]] ; CHECK: cond.false66.us: ; CHECK-NEXT: [[ADD_I276_US:%.*]] = fadd double 0.000000e+00, undef -; CHECK-NEXT: [[TMP0:%.*]] = insertelement <2 x double> <double poison, double undef>, double [[ADD_I276_US]], i32 0 -; CHECK-NEXT: [[TMP1:%.*]] = fadd <2 x double> [[TMP0]], <double 0.000000e+00, double 0xBFA5CC2D1960285F> +; CHECK-NEXT: [[TMP0:%.*]] = insertelement <2 x double> <double poison, double 0xBFA5CC2D1960285F>, double [[ADD_I276_US]], i32 0 +; CHECK-NEXT: [[TMP1:%.*]] = fadd <2 x double> <double 0.000000e+00, double undef>, [[TMP0]] ; CHECK-NEXT: [[TMP2:%.*]] = fmul <2 x double> [[TMP1]], <double 1.400000e+02, double 1.400000e+02> ; CHECK-NEXT: [[TMP3:%.*]] = fadd <2 x double> [[TMP2]], <double 5.000000e+01, double 5.200000e+01> -; CHECK-NEXT: [[TMP4:%.*]] = fmul <2 x double> undef, [[TMP1]] -; CHECK-NEXT: [[TMP5:%.*]] = bitcast double* [[AGG_TMP99208_SROA_0_0_IDX]] to <2 x double>* -; CHECK-NEXT: store <2 x double> [[TMP3]], <2 x double>* [[TMP5]], align 8 -; CHECK-NEXT: [[TMP6:%.*]] = bitcast double* [[AGG_TMP101211_SROA_0_0_IDX]] to <2 x double>* -; CHECK-NEXT: store <2 x double> [[TMP4]], <2 x double>* [[TMP6]], align 8 +; CHECK-NEXT: [[TMP4:%.*]] = extractelement <2 x double> [[TMP1]], i32 0 +; CHECK-NEXT: [[TMP5:%.*]] = extractelement <2 x double> [[TMP1]], i32 1 +; CHECK-NEXT: [[TMP6:%.*]] = insertelement <2 x double> <double poison, double undef>, double [[TMP4]], i32 0 +; CHECK-NEXT: [[TMP7:%.*]] = insertelement <2 x double> <double undef, double poison>, double [[TMP5]], i32 1 +; CHECK-NEXT: [[TMP8:%.*]] = fmul <2 x double> [[TMP6]], [[TMP7]] +; CHECK-NEXT: [[TMP9:%.*]] = bitcast double* [[AGG_TMP99208_SROA_0_0_IDX]] to <2 x double>* +; CHECK-NEXT: store <2 x double> [[TMP3]], <2 x double>* [[TMP9]], align 8 +; CHECK-NEXT: [[TMP10:%.*]] = bitcast double* [[AGG_TMP101211_SROA_0_0_IDX]] to <2 x double>* +; CHECK-NEXT: store <2 x double> [[TMP8]], <2 x double>* [[TMP10]], align 8 ; CHECK-NEXT: unreachable ; CHECK: cond.true63.us: ; CHECK-NEXT: unreachable diff --git a/llvm/test/Transforms/SLPVectorizer/X86/extractelement.ll b/llvm/test/Transforms/SLPVectorizer/X86/extractelement.ll index 0a0c0e6763fd..1fff6841a538 100644 --- a/llvm/test/Transforms/SLPVectorizer/X86/extractelement.ll +++ b/llvm/test/Transforms/SLPVectorizer/X86/extractelement.ll @@ -85,7 +85,7 @@ define float @f_used_twice_in_tree(<2 x float> %x) { ; THRESH1-NEXT: [[TMP1:%.*]] = extractelement <2 x float> [[X:%.*]], i32 1 ; THRESH1-NEXT: [[TMP2:%.*]] = insertelement <2 x float> poison, float [[TMP1]], i32 0 ; THRESH1-NEXT: [[TMP3:%.*]] = insertelement <2 x float> [[TMP2]], float [[TMP1]], i32 1 -; THRESH1-NEXT: [[TMP4:%.*]] = fmul <2 x float> [[X]], [[TMP3]] +; THRESH1-NEXT: [[TMP4:%.*]] = fmul <2 x float> [[TMP3]], [[X]] ; THRESH1-NEXT: [[TMP5:%.*]] = extractelement <2 x float> [[TMP4]], i32 0 ; THRESH1-NEXT: [[TMP6:%.*]] = extractelement <2 x float> [[TMP4]], i32 1 ; THRESH1-NEXT: [[ADD:%.*]] = fadd float [[TMP5]], [[TMP6]] @@ -95,7 +95,7 @@ define float @f_used_twice_in_tree(<2 x float> %x) { ; THRESH2-NEXT: [[TMP1:%.*]] = extractelement <2 x float> [[X:%.*]], i32 1 ; THRESH2-NEXT: [[TMP2:%.*]] = insertelement <2 x float> poison, float [[TMP1]], i32 0 ; THRESH2-NEXT: [[TMP3:%.*]] = insertelement <2 x float> [[TMP2]], float [[TMP1]], i32 1 -; THRESH2-NEXT: [[TMP4:%.*]] = fmul <2 x float> [[X]], [[TMP3]] +; THRESH2-NEXT: [[TMP4:%.*]] = fmul <2 x float> [[TMP3]], [[X]] ; THRESH2-NEXT: [[TMP5:%.*]] = extractelement <2 x float> [[TMP4]], i32 0 ; THRESH2-NEXT: [[TMP6:%.*]] = extractelement <2 x float> [[TMP4]], i32 1 ; THRESH2-NEXT: [[ADD:%.*]] = fadd float [[TMP5]], [[TMP6]] diff --git a/llvm/test/Transforms/SLPVectorizer/X86/insert-shuffle.ll b/llvm/test/Transforms/SLPVectorizer/X86/insert-shuffle.ll index 2c983a353623..7d43465eecf8 100644 --- a/llvm/test/Transforms/SLPVectorizer/X86/insert-shuffle.ll +++ b/llvm/test/Transforms/SLPVectorizer/X86/insert-shuffle.ll @@ -11,25 +11,23 @@ define { <2 x float>, <2 x float> } @foo(%struct.sw* %v) { ; CHECK-NEXT: [[Y:%.*]] = getelementptr inbounds [[STRUCT_SW]], %struct.sw* [[V]], i64 0, i32 1 ; CHECK-NEXT: [[TMP1:%.*]] = bitcast float* [[X]] to <2 x float>* ; CHECK-NEXT: [[TMP2:%.*]] = load <2 x float>, <2 x float>* [[TMP1]], align 16 +; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <2 x float> [[TMP2]], <2 x float> poison, <4 x i32> <i32 1, i32 0, i32 0, i32 1> ; CHECK-NEXT: [[TMP3:%.*]] = load float, float* undef, align 4 -; CHECK-NEXT: [[TMP4:%.*]] = insertelement <4 x float> <float poison, float undef, float poison, float poison>, float [[TMP0]], i32 0 -; CHECK-NEXT: [[TMP5:%.*]] = shufflevector <2 x float> [[TMP2]], <2 x float> poison, <4 x i32> <i32 0, i32 1, i32 undef, i32 undef> -; CHECK-NEXT: [[TMP6:%.*]] = shufflevector <4 x float> [[TMP4]], <4 x float> [[TMP5]], <4 x i32> <i32 0, i32 1, i32 4, i32 5> -; CHECK-NEXT: [[TMP7:%.*]] = shufflevector <2 x float> [[TMP2]], <2 x float> poison, <4 x i32> <i32 1, i32 0, i32 undef, i32 undef> -; CHECK-NEXT: [[TMP8:%.*]] = shufflevector <4 x float> poison, <4 x float> [[TMP7]], <4 x i32> <i32 4, i32 5, i32 2, i32 3> -; CHECK-NEXT: [[TMP9:%.*]] = insertelement <4 x float> [[TMP8]], float [[TMP3]], i32 2 -; CHECK-NEXT: [[TMP10:%.*]] = fmul <4 x float> [[TMP6]], [[TMP9]] -; CHECK-NEXT: [[TMP11:%.*]] = fadd <4 x float> poison, [[TMP10]] -; CHECK-NEXT: [[TMP12:%.*]] = fadd <4 x float> [[TMP11]], poison -; CHECK-NEXT: [[TMP13:%.*]] = fadd <4 x float> [[TMP12]], poison -; CHECK-NEXT: [[TMP14:%.*]] = extractelement <4 x float> [[TMP13]], i32 0 -; CHECK-NEXT: [[VEC1:%.*]] = insertelement <2 x float> undef, float [[TMP14]], i32 0 -; CHECK-NEXT: [[TMP15:%.*]] = extractelement <4 x float> [[TMP13]], i32 1 -; CHECK-NEXT: [[VEC2:%.*]] = insertelement <2 x float> [[VEC1]], float [[TMP15]], i32 1 -; CHECK-NEXT: [[TMP16:%.*]] = extractelement <4 x float> [[TMP13]], i32 2 -; CHECK-NEXT: [[VEC3:%.*]] = insertelement <2 x float> undef, float [[TMP16]], i32 0 -; CHECK-NEXT: [[TMP17:%.*]] = extractelement <4 x float> [[TMP13]], i32 3 -; CHECK-NEXT: [[VEC4:%.*]] = insertelement <2 x float> [[VEC3]], float [[TMP17]], i32 1 +; CHECK-NEXT: [[TMP4:%.*]] = insertelement <4 x float> poison, float [[TMP0]], i32 0 +; CHECK-NEXT: [[TMP5:%.*]] = insertelement <4 x float> [[TMP4]], float [[TMP3]], i32 1 +; CHECK-NEXT: [[SHUFFLE1:%.*]] = shufflevector <4 x float> [[TMP5]], <4 x float> poison, <4 x i32> <i32 0, i32 undef, i32 1, i32 undef> +; CHECK-NEXT: [[TMP6:%.*]] = fmul <4 x float> [[SHUFFLE]], [[SHUFFLE1]] +; CHECK-NEXT: [[TMP7:%.*]] = fadd <4 x float> poison, [[TMP6]] +; CHECK-NEXT: [[TMP8:%.*]] = fadd <4 x float> [[TMP7]], poison +; CHECK-NEXT: [[TMP9:%.*]] = fadd <4 x float> [[TMP8]], poison +; CHECK-NEXT: [[TMP10:%.*]] = extractelement <4 x float> [[TMP9]], i32 0 +; CHECK-NEXT: [[VEC1:%.*]] = insertelement <2 x float> undef, float [[TMP10]], i32 0 +; CHECK-NEXT: [[TMP11:%.*]] = extractelement <4 x float> [[TMP9]], i32 1 +; CHECK-NEXT: [[VEC2:%.*]] = insertelement <2 x float> [[VEC1]], float [[TMP11]], i32 1 +; CHECK-NEXT: [[TMP12:%.*]] = extractelement <4 x float> [[TMP9]], i32 2 +; CHECK-NEXT: [[VEC3:%.*]] = insertelement <2 x float> undef, float [[TMP12]], i32 0 +; CHECK-NEXT: [[TMP13:%.*]] = extractelement <4 x float> [[TMP9]], i32 3 +; CHECK-NEXT: [[VEC4:%.*]] = insertelement <2 x float> [[VEC3]], float [[TMP13]], i32 1 ; CHECK-NEXT: [[INS1:%.*]] = insertvalue { <2 x float>, <2 x float> } undef, <2 x float> [[VEC2]], 0 ; CHECK-NEXT: [[INS2:%.*]] = insertvalue { <2 x float>, <2 x float> } [[INS1]], <2 x float> [[VEC4]], 1 ; CHECK-NEXT: ret { <2 x float>, <2 x float> } [[INS2]] diff --git a/llvm/test/Transforms/SLPVectorizer/X86/lookahead.ll b/llvm/test/Transforms/SLPVectorizer/X86/lookahead.ll index 96502d44acee..ba3bd26d3861 100644 --- a/llvm/test/Transforms/SLPVectorizer/X86/lookahead.ll +++ b/llvm/test/Transforms/SLPVectorizer/X86/lookahead.ll @@ -37,7 +37,7 @@ define void @lookahead_basic(double* %array) { ; CHECK-NEXT: [[TMP7:%.*]] = load <2 x double>, <2 x double>* [[TMP6]], align 8 ; CHECK-NEXT: [[TMP8:%.*]] = fsub fast <2 x double> [[TMP1]], [[TMP3]] ; CHECK-NEXT: [[TMP9:%.*]] = fsub fast <2 x double> [[TMP5]], [[TMP7]] -; CHECK-NEXT: [[TMP10:%.*]] = fadd fast <2 x double> [[TMP8]], [[TMP9]] +; CHECK-NEXT: [[TMP10:%.*]] = fadd fast <2 x double> [[TMP9]], [[TMP8]] ; CHECK-NEXT: [[TMP11:%.*]] = bitcast double* [[IDX0]] to <2 x double>* ; CHECK-NEXT: store <2 x double> [[TMP10]], <2 x double>* [[TMP11]], align 8 ; CHECK-NEXT: ret void @@ -175,7 +175,7 @@ define void @lookahead_alt2(double* %array) { ; CHECK-NEXT: [[TMP11:%.*]] = fadd fast <2 x double> [[TMP1]], [[TMP3]] ; CHECK-NEXT: [[TMP12:%.*]] = fsub fast <2 x double> [[TMP1]], [[TMP3]] ; CHECK-NEXT: [[TMP13:%.*]] = shufflevector <2 x double> [[TMP11]], <2 x double> [[TMP12]], <2 x i32> <i32 0, i32 3> -; CHECK-NEXT: [[TMP14:%.*]] = fadd fast <2 x double> [[TMP13]], [[TMP10]] +; CHECK-NEXT: [[TMP14:%.*]] = fadd fast <2 x double> [[TMP10]], [[TMP13]] ; CHECK-NEXT: [[TMP15:%.*]] = bitcast double* [[IDX0]] to <2 x double>* ; CHECK-NEXT: store <2 x double> [[TMP14]], <2 x double>* [[TMP15]], align 8 ; CHECK-NEXT: ret void @@ -237,28 +237,29 @@ define void @lookahead_external_uses(double* %A, double *%B, double *%C, double ; CHECK-NEXT: [[IDXB2:%.*]] = getelementptr inbounds double, double* [[B]], i64 2 ; CHECK-NEXT: [[IDXA2:%.*]] = getelementptr inbounds double, double* [[A]], i64 2 ; CHECK-NEXT: [[IDXB1:%.*]] = getelementptr inbounds double, double* [[B]], i64 1 -; CHECK-NEXT: [[A0:%.*]] = load double, double* [[IDXA0]], align 8 +; CHECK-NEXT: [[B0:%.*]] = load double, double* [[IDXB0]], align 8 ; CHECK-NEXT: [[C0:%.*]] = load double, double* [[IDXC0]], align 8 ; CHECK-NEXT: [[D0:%.*]] = load double, double* [[IDXD0]], align 8 -; CHECK-NEXT: [[A1:%.*]] = load double, double* [[IDXA1]], align 8 +; CHECK-NEXT: [[TMP0:%.*]] = bitcast double* [[IDXA0]] to <2 x double>* +; CHECK-NEXT: [[TMP1:%.*]] = load <2 x double>, <2 x double>* [[TMP0]], align 8 ; CHECK-NEXT: [[B2:%.*]] = load double, double* [[IDXB2]], align 8 ; CHECK-NEXT: [[A2:%.*]] = load double, double* [[IDXA2]], align 8 -; CHECK-NEXT: [[TMP0:%.*]] = bitcast double* [[IDXB0]] to <2 x double>* -; CHECK-NEXT: [[TMP1:%.*]] = load <2 x double>, <2 x double>* [[TMP0]], align 8 -; CHECK-NEXT: [[TMP2:%.*]] = insertelement <2 x double> poison, double [[C0]], i32 0 -; CHECK-NEXT: [[TMP3:%.*]] = insertelement <2 x double> [[TMP2]], double [[A1]], i32 1 -; CHECK-NEXT: [[TMP4:%.*]] = insertelement <2 x double> poison, double [[D0]], i32 0 -; CHECK-NEXT: [[TMP5:%.*]] = insertelement <2 x double> [[TMP4]], double [[B2]], i32 1 -; CHECK-NEXT: [[TMP6:%.*]] = fsub fast <2 x double> [[TMP3]], [[TMP5]] -; CHECK-NEXT: [[TMP7:%.*]] = insertelement <2 x double> poison, double [[A0]], i32 0 -; CHECK-NEXT: [[TMP8:%.*]] = insertelement <2 x double> [[TMP7]], double [[A2]], i32 1 -; CHECK-NEXT: [[TMP9:%.*]] = fsub fast <2 x double> [[TMP8]], [[TMP1]] -; CHECK-NEXT: [[TMP10:%.*]] = fadd fast <2 x double> [[TMP9]], [[TMP6]] +; CHECK-NEXT: [[B1:%.*]] = load double, double* [[IDXB1]], align 8 +; CHECK-NEXT: [[TMP2:%.*]] = insertelement <2 x double> poison, double [[B0]], i32 0 +; CHECK-NEXT: [[TMP3:%.*]] = insertelement <2 x double> [[TMP2]], double [[B2]], i32 1 +; CHECK-NEXT: [[TMP4:%.*]] = fsub fast <2 x double> [[TMP1]], [[TMP3]] +; CHECK-NEXT: [[TMP5:%.*]] = insertelement <2 x double> poison, double [[C0]], i32 0 +; CHECK-NEXT: [[TMP6:%.*]] = insertelement <2 x double> [[TMP5]], double [[A2]], i32 1 +; CHECK-NEXT: [[TMP7:%.*]] = insertelement <2 x double> poison, double [[D0]], i32 0 +; CHECK-NEXT: [[TMP8:%.*]] = insertelement <2 x double> [[TMP7]], double [[B1]], i32 1 +; CHECK-NEXT: [[TMP9:%.*]] = fsub fast <2 x double> [[TMP6]], [[TMP8]] +; CHECK-NEXT: [[TMP10:%.*]] = fadd fast <2 x double> [[TMP4]], [[TMP9]] ; CHECK-NEXT: [[IDXS0:%.*]] = getelementptr inbounds double, double* [[S:%.*]], i64 0 ; CHECK-NEXT: [[IDXS1:%.*]] = getelementptr inbounds double, double* [[S]], i64 1 ; CHECK-NEXT: [[TMP11:%.*]] = bitcast double* [[IDXS0]] to <2 x double>* ; CHECK-NEXT: store <2 x double> [[TMP10]], <2 x double>* [[TMP11]], align 8 -; CHECK-NEXT: store double [[A1]], double* [[EXT1:%.*]], align 8 +; CHECK-NEXT: [[TMP12:%.*]] = extractelement <2 x double> [[TMP1]], i32 1 +; CHECK-NEXT: store double [[TMP12]], double* [[EXT1:%.*]], align 8 ; CHECK-NEXT: ret void ; entry: @@ -607,7 +608,7 @@ define void @ChecksExtractScores_different_vectors(double* %storeArray, double* ; CHECK-NEXT: [[TMP6:%.*]] = insertelement <2 x double> poison, double [[EXTRA0]], i32 0 ; CHECK-NEXT: [[TMP7:%.*]] = insertelement <2 x double> [[TMP6]], double [[EXTRB1]], i32 1 ; CHECK-NEXT: [[TMP8:%.*]] = fmul <2 x double> [[TMP7]], [[TMP2]] -; CHECK-NEXT: [[TMP9:%.*]] = fadd <2 x double> [[TMP8]], [[SHUFFLE]] +; CHECK-NEXT: [[TMP9:%.*]] = fadd <2 x double> [[SHUFFLE]], [[TMP8]] ; CHECK-NEXT: [[SIDX0:%.*]] = getelementptr inbounds double, double* [[STOREARRAY:%.*]], i64 0 ; CHECK-NEXT: [[SIDX1:%.*]] = getelementptr inbounds double, double* [[STOREARRAY]], i64 1 ; CHECK-NEXT: [[TMP10:%.*]] = bitcast double* [[SIDX0]] to <2 x double>* diff --git a/llvm/test/Transforms/SLPVectorizer/X86/operandorder.ll b/llvm/test/Transforms/SLPVectorizer/X86/operandorder.ll index a0554d7c5a81..125cd23d0140 100644 --- a/llvm/test/Transforms/SLPVectorizer/X86/operandorder.ll +++ b/llvm/test/Transforms/SLPVectorizer/X86/operandorder.ll @@ -142,16 +142,13 @@ define void @shuffle_nodes_match1(double * noalias %from, double * noalias %to, ; CHECK-NEXT: br label [[LP:%.*]] ; CHECK: lp: ; CHECK-NEXT: [[P:%.*]] = phi double [ 1.000000e+00, [[LP]] ], [ 0.000000e+00, [[ENTRY:%.*]] ] -; CHECK-NEXT: [[FROM_1:%.*]] = getelementptr double, double* [[FROM:%.*]], i32 1 -; CHECK-NEXT: [[V0_1:%.*]] = load double, double* [[FROM]], align 4 -; CHECK-NEXT: [[V0_2:%.*]] = load double, double* [[FROM_1]], align 4 -; CHECK-NEXT: [[TMP0:%.*]] = insertelement <2 x double> poison, double [[V0_2]], i64 0 -; CHECK-NEXT: [[TMP1:%.*]] = insertelement <2 x double> [[TMP0]], double [[P]], i64 1 -; CHECK-NEXT: [[TMP2:%.*]] = insertelement <2 x double> poison, double [[V0_1]], i64 0 -; CHECK-NEXT: [[TMP3:%.*]] = shufflevector <2 x double> [[TMP2]], <2 x double> poison, <2 x i32> zeroinitializer -; CHECK-NEXT: [[TMP4:%.*]] = fadd <2 x double> [[TMP1]], [[TMP3]] -; CHECK-NEXT: [[TMP5:%.*]] = bitcast double* [[TO:%.*]] to <2 x double>* -; CHECK-NEXT: store <2 x double> [[TMP4]], <2 x double>* [[TMP5]], align 4 +; CHECK-NEXT: [[TMP0:%.*]] = bitcast double* [[FROM:%.*]] to <2 x double>* +; CHECK-NEXT: [[TMP1:%.*]] = load <2 x double>, <2 x double>* [[TMP0]], align 4 +; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <2 x double> [[TMP1]], <2 x double> poison, <2 x i32> <i32 1, i32 0> +; CHECK-NEXT: [[TMP2:%.*]] = insertelement <2 x double> [[TMP1]], double [[P]], i64 1 +; CHECK-NEXT: [[TMP3:%.*]] = fadd <2 x double> [[TMP2]], [[SHUFFLE]] +; CHECK-NEXT: [[TMP4:%.*]] = bitcast double* [[TO:%.*]] to <2 x double>* +; CHECK-NEXT: store <2 x double> [[TMP3]], <2 x double>* [[TMP4]], align 4 ; CHECK-NEXT: br i1 undef, label [[LP]], label [[EXT:%.*]] ; CHECK: ext: ; CHECK-NEXT: ret void @@ -183,11 +180,11 @@ define void @vecload_vs_broadcast4(double * noalias %from, double * noalias %to, ; CHECK-NEXT: [[P:%.*]] = phi double [ 1.000000e+00, [[LP]] ], [ 0.000000e+00, [[ENTRY:%.*]] ] ; CHECK-NEXT: [[TMP0:%.*]] = bitcast double* [[FROM:%.*]] to <2 x double>* ; CHECK-NEXT: [[TMP1:%.*]] = load <2 x double>, <2 x double>* [[TMP0]], align 4 +; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <2 x double> [[TMP1]], <2 x double> poison, <2 x i32> <i32 1, i32 0> ; CHECK-NEXT: [[TMP2:%.*]] = insertelement <2 x double> [[TMP1]], double [[P]], i64 1 -; CHECK-NEXT: [[TMP3:%.*]] = shufflevector <2 x double> [[TMP1]], <2 x double> poison, <2 x i32> <i32 1, i32 0> -; CHECK-NEXT: [[TMP4:%.*]] = fadd <2 x double> [[TMP2]], [[TMP3]] -; CHECK-NEXT: [[TMP5:%.*]] = bitcast double* [[TO:%.*]] to <2 x double>* -; CHECK-NEXT: store <2 x double> [[TMP4]], <2 x double>* [[TMP5]], align 4 +; CHECK-NEXT: [[TMP3:%.*]] = fadd <2 x double> [[TMP2]], [[SHUFFLE]] +; CHECK-NEXT: [[TMP4:%.*]] = bitcast double* [[TO:%.*]] to <2 x double>* +; CHECK-NEXT: store <2 x double> [[TMP3]], <2 x double>* [[TMP4]], align 4 ; CHECK-NEXT: br i1 undef, label [[LP]], label [[EXT:%.*]] ; CHECK: ext: ; CHECK-NEXT: ret void @@ -218,16 +215,13 @@ define void @shuffle_nodes_match2(double * noalias %from, double * noalias %to, ; CHECK-NEXT: br label [[LP:%.*]] ; CHECK: lp: ; CHECK-NEXT: [[P:%.*]] = phi double [ 1.000000e+00, [[LP]] ], [ 0.000000e+00, [[ENTRY:%.*]] ] -; CHECK-NEXT: [[FROM_1:%.*]] = getelementptr double, double* [[FROM:%.*]], i32 1 -; CHECK-NEXT: [[V0_1:%.*]] = load double, double* [[FROM]], align 4 -; CHECK-NEXT: [[V0_2:%.*]] = load double, double* [[FROM_1]], align 4 -; CHECK-NEXT: [[TMP0:%.*]] = insertelement <2 x double> poison, double [[V0_1]], i64 0 -; CHECK-NEXT: [[TMP1:%.*]] = shufflevector <2 x double> [[TMP0]], <2 x double> poison, <2 x i32> zeroinitializer -; CHECK-NEXT: [[TMP2:%.*]] = insertelement <2 x double> poison, double [[V0_2]], i64 0 -; CHECK-NEXT: [[TMP3:%.*]] = insertelement <2 x double> [[TMP2]], double [[P]], i64 1 -; CHECK-NEXT: [[TMP4:%.*]] = fadd <2 x double> [[TMP1]], [[TMP3]] -; CHECK-NEXT: [[TMP5:%.*]] = bitcast double* [[TO:%.*]] to <2 x double>* -; CHECK-NEXT: store <2 x double> [[TMP4]], <2 x double>* [[TMP5]], align 4 +; CHECK-NEXT: [[TMP0:%.*]] = bitcast double* [[FROM:%.*]] to <2 x double>* +; CHECK-NEXT: [[TMP1:%.*]] = load <2 x double>, <2 x double>* [[TMP0]], align 4 +; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <2 x double> [[TMP1]], <2 x double> poison, <2 x i32> <i32 1, i32 0> +; CHECK-NEXT: [[TMP2:%.*]] = insertelement <2 x double> [[TMP1]], double [[P]], i64 1 +; CHECK-NEXT: [[TMP3:%.*]] = fadd <2 x double> [[SHUFFLE]], [[TMP2]] +; CHECK-NEXT: [[TMP4:%.*]] = bitcast double* [[TO:%.*]] to <2 x double>* +; CHECK-NEXT: store <2 x double> [[TMP3]], <2 x double>* [[TMP4]], align 4 ; CHECK-NEXT: br i1 undef, label [[LP]], label [[EXT:%.*]] ; CHECK: ext: ; CHECK-NEXT: ret void @@ -348,7 +342,7 @@ define void @load_reorder_double(double* nocapture %c, double* noalias nocapture ; CHECK-NEXT: [[TMP2:%.*]] = load <2 x double>, <2 x double>* [[TMP1]], align 4 ; CHECK-NEXT: [[TMP3:%.*]] = bitcast double* [[A:%.*]] to <2 x double>* ; CHECK-NEXT: [[TMP4:%.*]] = load <2 x double>, <2 x double>* [[TMP3]], align 4 -; CHECK-NEXT: [[TMP5:%.*]] = fadd <2 x double> [[TMP4]], [[TMP2]] +; CHECK-NEXT: [[TMP5:%.*]] = fadd <2 x double> [[TMP2]], [[TMP4]] ; CHECK-NEXT: [[TMP6:%.*]] = bitcast double* [[C:%.*]] to <2 x double>* ; CHECK-NEXT: store <2 x double> [[TMP5]], <2 x double>* [[TMP6]], align 4 ; CHECK-NEXT: ret void diff --git a/llvm/test/Transforms/SLPVectorizer/X86/store-jumbled.ll b/llvm/test/Transforms/SLPVectorizer/X86/store-jumbled.ll index ced403ae5375..19f654e5a4f8 100644 --- a/llvm/test/Transforms/SLPVectorizer/X86/store-jumbled.ll +++ b/llvm/test/Transforms/SLPVectorizer/X86/store-jumbled.ll @@ -22,9 +22,9 @@ define i32 @jumbled-load(i32* noalias nocapture %in, i32* noalias nocapture %inn ; CHECK-NEXT: [[GEP_8:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 1 ; CHECK-NEXT: [[GEP_9:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 2 ; CHECK-NEXT: [[GEP_10:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 3 -; CHECK-NEXT: [[REORDER_SHUFFLE:%.*]] = shufflevector <4 x i32> [[TMP5]], <4 x i32> poison, <4 x i32> <i32 1, i32 3, i32 0, i32 2> +; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <4 x i32> [[TMP5]], <4 x i32> poison, <4 x i32> <i32 1, i32 3, i32 0, i32 2> ; CHECK-NEXT: [[TMP6:%.*]] = bitcast i32* [[GEP_7]] to <4 x i32>* -; CHECK-NEXT: store <4 x i32> [[REORDER_SHUFFLE]], <4 x i32>* [[TMP6]], align 4 +; CHECK-NEXT: store <4 x i32> [[SHUFFLE]], <4 x i32>* [[TMP6]], align 4 ; CHECK-NEXT: ret i32 undef ; %in.addr = getelementptr inbounds i32, i32* %in, i64 0 diff --git a/llvm/test/Transforms/SLPVectorizer/X86/stores_vectorize.ll b/llvm/test/Transforms/SLPVectorizer/X86/stores_vectorize.ll index 9983578a7058..65d1fce9e130 100644 --- a/llvm/test/Transforms/SLPVectorizer/X86/stores_vectorize.ll +++ b/llvm/test/Transforms/SLPVectorizer/X86/stores_vectorize.ll @@ -97,9 +97,9 @@ define void @store_reverse(i64* %p3) { ; CHECK-NEXT: [[TMP3:%.*]] = load <4 x i64>, <4 x i64>* [[TMP2]], align 8 ; CHECK-NEXT: [[TMP4:%.*]] = shl <4 x i64> [[TMP1]], [[TMP3]] ; CHECK-NEXT: [[ARRAYIDX14:%.*]] = getelementptr inbounds i64, i64* [[P3]], i64 4 -; CHECK-NEXT: [[TMP5:%.*]] = shufflevector <4 x i64> [[TMP4]], <4 x i64> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0> -; CHECK-NEXT: [[TMP6:%.*]] = bitcast i64* [[ARRAYIDX14]] to <4 x i64>* -; CHECK-NEXT: store <4 x i64> [[TMP5]], <4 x i64>* [[TMP6]], align 8 +; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <4 x i64> [[TMP4]], <4 x i64> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0> +; CHECK-NEXT: [[TMP5:%.*]] = bitcast i64* [[ARRAYIDX14]] to <4 x i64>* +; CHECK-NEXT: store <4 x i64> [[SHUFFLE]], <4 x i64>* [[TMP5]], align 8 ; CHECK-NEXT: ret void ; entry: diff --git a/llvm/test/Transforms/SLPVectorizer/X86/supernode.ll b/llvm/test/Transforms/SLPVectorizer/X86/supernode.ll index bf98a148e9dc..f1ff95b51f8b 100644 --- a/llvm/test/Transforms/SLPVectorizer/X86/supernode.ll +++ b/llvm/test/Transforms/SLPVectorizer/X86/supernode.ll @@ -23,7 +23,7 @@ define void @test_supernode_add(double* %Aarray, double* %Barray, double *%Carra ; ENABLED-NEXT: [[C1:%.*]] = load double, double* [[IDXC1]], align 8 ; ENABLED-NEXT: [[TMP2:%.*]] = insertelement <2 x double> poison, double [[A0]], i32 0 ; ENABLED-NEXT: [[TMP3:%.*]] = insertelement <2 x double> [[TMP2]], double [[C1]], i32 1 -; ENABLED-NEXT: [[TMP4:%.*]] = fadd fast <2 x double> [[TMP3]], [[TMP1]] +; ENABLED-NEXT: [[TMP4:%.*]] = fadd fast <2 x double> [[TMP1]], [[TMP3]] ; ENABLED-NEXT: [[TMP5:%.*]] = insertelement <2 x double> poison, double [[C0]], i32 0 ; ENABLED-NEXT: [[TMP6:%.*]] = insertelement <2 x double> [[TMP5]], double [[A1]], i32 1 ; ENABLED-NEXT: [[TMP7:%.*]] = fadd fast <2 x double> [[TMP4]], [[TMP6]] </cut>

4 years, 1 month

[ACTIVITY] week ending Dec. 12 2021

by Alex Bennée

Project Stratos =============== - posted Potential demo setup for a TSN/XDP networking Message-Id: <87wnkfkp2f.fsf(a)linaro.org> - final Stratos call of the year - CC and Arnd will look at fat virtq - nice update from EPAM on Zephyr - had another round of getting working ACPI on MachiatoBin - posted [PR to clean up some typos in EDK2] - might have a working Xen setup without needing SMC hacks [PR to clean up some typos in EDK2] <https://github.com/tianocore/edk2-platforms/pull/34> vhost-device maintainer effort ([UM-196]) - finished review of https://github.com/rust-vmm/vhost-device/pull/4 [UM-196] <https://linaro.atlassian.net/browse/UM-196> QEMU Upstream Work ([UM-2]) =========================== - discussion around Suggestions for TCG performance improvements Message-Id: <c76bde31-8f3b-2d03-b7c7-9e026d4b5873(a)huawei.com> - did a bunch of bug triage and tagging [UM-2] <https://linaro.atlassian.net/browse/UM-2> Upstream MTTCG tests ([QEMU-52]) - awaiting final review of [kvm-unit-tests PATCH v9 0/9] MTTCG sanity tests for ARM Message-Id: <20211202115352.951548-1-alex.bennee(a)linaro.org> [QEMU-52] <https://linaro.atlassian.net/browse/QEMU-52> Completed Reviews [3/3] ======================= [PATCH] tests/plugin/syscall.c: fix compiler warnings Message-Id: <20211128011551.2115468-1-juro.bystricky(a)intel.com> [PATCH for-6.2? 0/2] arm_gicv3: Fix handling of LPIs in list registers Message-Id: <20211126163915.1048353-2-peter.maydell(a)linaro.org> [PATCH] tests/docker: add libfuse3 development headers Message-Id: <20211207160025.52466-1-stefanha(a)redhat.com> Absences ======== Current Review Queue ==================== TODO [PATCH 0/8] virtio: Add vhost-user based Video decode Message-Id: <20211209145601.331477-1-peter.griffin(a)linaro.org> ======================================================================================================================== TODO [PATCH for-7.0 0/6] target/arm: Implement LVA, LPA, LPA2 features Message-Id: <20211208231154.392029-1-richard.henderson(a)linaro.org> ======================================================================================================================================== TODO [PATCH-4.16 v2] xen/efi: Fix Grub2 boot on arm64 Message-Id: <20211104141206.25153-1-luca.fancellu(a)arm.com> =============================================================================================================== TODO [PATCH 00/16] fdt: Make OF_BOARD a boolean option Message-Id: <20211013010120.96851-1-sjg(a)chromium.org> =========================================================================================================== -- Alex Bennée

4 years, 2 months

[ACTIVITY] report week ending 10 Dec

by Peter Maydell

Progress: * UM-2 [QEMU upstream maintainership] - More code review: now have a target-arm.next poised and ready to send once 6.2 is released * QEMU-420 [GICv4 emulation] - Working on the ITS changes needed for GICv4 support (this turns out to be a more tractable end to start than the redistributor) - I have a preliminary set of 25 or so patches to the ITS which clean up the code and fix some pre-existing bugs that I found while working on the GICv4 changes - have implemented the new VMAPI, VMAPTI, VMAPP ITS commands -- PMM

4 years, 2 months

[TCWG CI] 464.h264ref slowed down by 7% after llvm: [LV] Pass compare predicate to getCmpSelInstrCost.

by ci_notify＠linaro.org

After llvm commit 3d549dddf75b6ff9e0ec8c053677750bde4226ea Author: Sander de Smalen <sander.desmalen(a)arm.com> [LV] Pass compare predicate to getCmpSelInstrCost. the following benchmarks slowed down by more than 2%: - 464.h264ref slowed down by 7% from 11115 to 11846 perf samples Below reproducer instructions can be used to re-build both "first_bad" and "last_good" cross-toolchains used in this bisection. Naturally, the scripts will fail when triggerring benchmarking jobs if you don't have access to Linaro TCWG CI. For your convenience, we have uploaded tarballs with pre-processed source and assembly files at: - First_bad save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… - Last_good save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… - Baseline save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… Configuration: - Benchmark: SPEC CPU2006 - Toolchain: Clang + Glibc + LLVM Linker - Version: all components were built from their tip of trunk - Target: aarch64-linux-gnu - Compiler flags: -O2 -flto - Hardware: NVidia TX1 4x Cortex-A57 This benchmarking CI is work-in-progress, and we welcome feedback and suggestions at linaro-toolchain(a)lists.linaro.org . In our improvement plans is to add support for SPEC CPU2017 benchmarks and provide "perf report/annotate" data behind these reports. THIS IS THE END OF INTERESTING STUFF. BELOW ARE LINKS TO BUILDS, REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT. This commit has regressed these CI configurations: - tcwg_bmk_llvm_tx1/llvm-master-aarch64-spec2k6-O2_LTO First_bad build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… Last_good build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… Baseline build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… Even more details: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… Reproduce builds: <cut> mkdir investigate-llvm-3d549dddf75b6ff9e0ec8c053677750bde4226ea cd investigate-llvm-3d549dddf75b6ff9e0ec8c053677750bde4226ea # Fetch scripts git clone https://git.linaro.org/toolchain/jenkins-scripts # Fetch manifests and test.sh script mkdir -p artifacts/manifests curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail chmod +x artifacts/test.sh # Reproduce the baseline build (build all pre-requisites) ./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh # Save baseline build state (which is then restored in artifacts/test.sh) mkdir -p ./bisect rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /llvm/ ./ ./bisect/baseline/ cd llvm # Reproduce first_bad build git checkout --detach 3d549dddf75b6ff9e0ec8c053677750bde4226ea ../artifacts/test.sh # Reproduce last_good build git checkout --detach ab31d003e16e483bff298ea2f28fec0f23e8eb79 ../artifacts/test.sh cd .. </cut> Full commit (up to 1000 lines): <cut> commit 3d549dddf75b6ff9e0ec8c053677750bde4226ea Author: Sander de Smalen <sander.desmalen(a)arm.com> Date: Mon Dec 6 11:14:27 2021 +0000 [LV] Pass compare predicate to getCmpSelInstrCost. If the condition of a select is a compare, pass its predicate to TTI::getCmpSelInstrCost to get a more accurate cost value instead of passing BAD_ICMP_PREDICATE. I noticed that the commit message from D90070 had a comment about the vectorized select predicate possibly being composed of other compares with different predicate values, but I wasn't able to construct an example where this was an actual issue. If this is an issue, I guess we could add another check that the block isn't predicated for any reason. Reviewed By: dmgreen, fhahn Differential Revision: https://reviews.llvm.org/D114646 --- llvm/lib/Transforms/Vectorize/LoopVectorize.cpp | 11 ++++++++--- llvm/test/Transforms/LoopVectorize/AArch64/select-costs.ll | 14 +++++++------- 2 files changed, 15 insertions(+), 10 deletions(-) diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp index 050879144afd..c03e506b7474 100644 --- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp +++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp @@ -7570,8 +7570,12 @@ LoopVectorizationCostModel::getInstructionCost(Instruction *I, ElementCount VF, Type *CondTy = SI->getCondition()->getType(); if (!ScalarCond) CondTy = VectorType::get(CondTy, VF); - return TTI.getCmpSelInstrCost(I->getOpcode(), VectorTy, CondTy, - CmpInst::BAD_ICMP_PREDICATE, CostKind, I); + + CmpInst::Predicate Pred = CmpInst::BAD_ICMP_PREDICATE; + if (auto *Cmp = dyn_cast<CmpInst>(SI->getCondition())) + Pred = Cmp->getPredicate(); + return TTI.getCmpSelInstrCost(I->getOpcode(), VectorTy, CondTy, Pred, + CostKind, I); } case Instruction::ICmp: case Instruction::FCmp: { @@ -7581,7 +7585,8 @@ LoopVectorizationCostModel::getInstructionCost(Instruction *I, ElementCount VF, ValTy = IntegerType::get(ValTy->getContext(), MinBWs[Op0AsInstruction]); VectorTy = ToVectorTy(ValTy, VF); return TTI.getCmpSelInstrCost(I->getOpcode(), VectorTy, nullptr, - CmpInst::BAD_ICMP_PREDICATE, CostKind, I); + cast<CmpInst>(I)->getPredicate(), CostKind, + I); } case Instruction::Store: case Instruction::Load: { diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/select-costs.ll b/llvm/test/Transforms/LoopVectorize/AArch64/select-costs.ll index 62b18f44fbc5..20d2dc0b7cda 100644 --- a/llvm/test/Transforms/LoopVectorize/AArch64/select-costs.ll +++ b/llvm/test/Transforms/LoopVectorize/AArch64/select-costs.ll @@ -5,17 +5,17 @@ target datalayout = "e-m:o-i64:64-i128:128-n32:64-S128" target triple = "arm64-apple-ios5.0.0" define void @selects_1(i32* nocapture %dst, i32 %A, i32 %B, i32 %C, i32 %N) { -; CHECK: LV: Found an estimated cost of 5 for VF 2 For instruction: %cond = select i1 %cmp1, i32 10, i32 %and -; CHECK: LV: Found an estimated cost of 5 for VF 2 For instruction: %cond6 = select i1 %cmp2, i32 30, i32 %and -; CHECK: LV: Found an estimated cost of 5 for VF 2 For instruction: %cond11 = select i1 %cmp7, i32 %cond, i32 %cond6 +; CHECK: LV: Found an estimated cost of 1 for VF 2 For instruction: %cond = select i1 %cmp1, i32 10, i32 %and +; CHECK: LV: Found an estimated cost of 1 for VF 2 For instruction: %cond6 = select i1 %cmp2, i32 30, i32 %and +; CHECK: LV: Found an estimated cost of 1 for VF 2 For instruction: %cond11 = select i1 %cmp7, i32 %cond, i32 %cond6 -; CHECK: LV: Found an estimated cost of 13 for VF 4 For instruction: %cond = select i1 %cmp1, i32 10, i32 %and -; CHECK: LV: Found an estimated cost of 13 for VF 4 For instruction: %cond6 = select i1 %cmp2, i32 30, i32 %and -; CHECK: LV: Found an estimated cost of 13 for VF 4 For instruction: %cond11 = select i1 %cmp7, i32 %cond, i32 %cond6 +; CHECK: LV: Found an estimated cost of 1 for VF 4 For instruction: %cond = select i1 %cmp1, i32 10, i32 %and +; CHECK: LV: Found an estimated cost of 1 for VF 4 For instruction: %cond6 = select i1 %cmp2, i32 30, i32 %and +; CHECK: LV: Found an estimated cost of 1 for VF 4 For instruction: %cond11 = select i1 %cmp7, i32 %cond, i32 %cond6 ; CHECK-LABEL: define void @selects_1( ; CHECK: vector.body: -; CHECK: select <2 x i1> +; CHECK: select <4 x i1> entry: %cmp26 = icmp sgt i32 %N, 0 </cut>

4 years, 2 months

clang-thumbv7-full-2stage is red for 20 days

by Galina Kistanova

Dear Linaro Toolchain Working Group, clang-thumbv7-full-2stage is red for 20 days. Could you take it to the staging area and make it green again, please? Thanks Galina

4 years, 2 months

[TCWG CI] 433.milc slowed down by 4% after llvm: Add missing header

by ci_notify＠linaro.org

After llvm commit bd4c6a476fd037fb07a1c484f75d93ee40713d3d Author: David Blaikie <dblaikie(a)gmail.com> Add missing header the following benchmarks slowed down by more than 2%: - 433.milc slowed down by 4% from 12427 to 12916 perf samples Below reproducer instructions can be used to re-build both "first_bad" and "last_good" cross-toolchains used in this bisection. Naturally, the scripts will fail when triggerring benchmarking jobs if you don't have access to Linaro TCWG CI. For your convenience, we have uploaded tarballs with pre-processed source and assembly files at: - First_bad save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… - Last_good save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… - Baseline save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… Configuration: - Benchmark: SPEC CPU2006 - Toolchain: Clang + Glibc + LLVM Linker - Version: all components were built from their tip of trunk - Target: aarch64-linux-gnu - Compiler flags: -O2 -flto - Hardware: NVidia TX1 4x Cortex-A57 This benchmarking CI is work-in-progress, and we welcome feedback and suggestions at linaro-toolchain(a)lists.linaro.org . In our improvement plans is to add support for SPEC CPU2017 benchmarks and provide "perf report/annotate" data behind these reports. THIS IS THE END OF INTERESTING STUFF. BELOW ARE LINKS TO BUILDS, REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT. This commit has regressed these CI configurations: - tcwg_bmk_llvm_tx1/llvm-master-aarch64-spec2k6-O2_LTO First_bad build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… Last_good build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… Baseline build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… Even more details: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… Reproduce builds: <cut> mkdir investigate-llvm-bd4c6a476fd037fb07a1c484f75d93ee40713d3d cd investigate-llvm-bd4c6a476fd037fb07a1c484f75d93ee40713d3d # Fetch scripts git clone https://git.linaro.org/toolchain/jenkins-scripts # Fetch manifests and test.sh script mkdir -p artifacts/manifests curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail chmod +x artifacts/test.sh # Reproduce the baseline build (build all pre-requisites) ./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh # Save baseline build state (which is then restored in artifacts/test.sh) mkdir -p ./bisect rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /llvm/ ./ ./bisect/baseline/ cd llvm # Reproduce first_bad build git checkout --detach bd4c6a476fd037fb07a1c484f75d93ee40713d3d ../artifacts/test.sh # Reproduce last_good build git checkout --detach 7d4da4e1ab7f79e51db0d5c2a0f5ef1711122dd7 ../artifacts/test.sh cd .. </cut> Full commit (up to 1000 lines): <cut> commit bd4c6a476fd037fb07a1c484f75d93ee40713d3d Author: David Blaikie <dblaikie(a)gmail.com> Date: Mon Nov 29 16:29:25 2021 -0800 Add missing header --- llvm/lib/Demangle/DLangDemangle.cpp | 1 + 1 file changed, 1 insertion(+) diff --git a/llvm/lib/Demangle/DLangDemangle.cpp b/llvm/lib/Demangle/DLangDemangle.cpp index faf91b239490..f380aa90035e 100644 --- a/llvm/lib/Demangle/DLangDemangle.cpp +++ b/llvm/lib/Demangle/DLangDemangle.cpp @@ -17,6 +17,7 @@ #include "llvm/Demangle/StringView.h" #include "llvm/Demangle/Utility.h" +#include <cctype> #include <cstring> #include <limits> </cut>

4 years, 2 months

[TCWG CI] 453.povray failed to build after llvm: [SLP]Fix reused extracts cost.

by ci_notify＠linaro.org

After llvm commit ba74bb3a226e1b4660537f274627285b1bf41ee1 Author: Alexey Bataev <a.bataev(a)outlook.com> [SLP]Fix reused extracts cost. the following benchmarks slowed down by more than 2%: - 453.povray failed to build Below reproducer instructions can be used to re-build both "first_bad" and "last_good" cross-toolchains used in this bisection. Naturally, the scripts will fail when triggerring benchmarking jobs if you don't have access to Linaro TCWG CI. For your convenience, we have uploaded tarballs with pre-processed source and assembly files at: - First_bad save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… - Last_good save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… - Baseline save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… Configuration: - Benchmark: SPEC CPU2006 - Toolchain: Clang + Glibc + LLVM Linker - Version: all components were built from their tip of trunk - Target: aarch64-linux-gnu - Compiler flags: -O3 -flto - Hardware: NVidia TX1 4x Cortex-A57 This benchmarking CI is work-in-progress, and we welcome feedback and suggestions at linaro-toolchain(a)lists.linaro.org . In our improvement plans is to add support for SPEC CPU2017 benchmarks and provide "perf report/annotate" data behind these reports. THIS IS THE END OF INTERESTING STUFF. BELOW ARE LINKS TO BUILDS, REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT. This commit has regressed these CI configurations: - tcwg_bmk_llvm_tx1/llvm-master-aarch64-spec2k6-O3_LTO First_bad build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… Last_good build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… Baseline build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… Even more details: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… Reproduce builds: <cut> mkdir investigate-llvm-ba74bb3a226e1b4660537f274627285b1bf41ee1 cd investigate-llvm-ba74bb3a226e1b4660537f274627285b1bf41ee1 # Fetch scripts git clone https://git.linaro.org/toolchain/jenkins-scripts # Fetch manifests and test.sh script mkdir -p artifacts/manifests curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail chmod +x artifacts/test.sh # Reproduce the baseline build (build all pre-requisites) ./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh # Save baseline build state (which is then restored in artifacts/test.sh) mkdir -p ./bisect rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /llvm/ ./ ./bisect/baseline/ cd llvm # Reproduce first_bad build git checkout --detach ba74bb3a226e1b4660537f274627285b1bf41ee1 ../artifacts/test.sh # Reproduce last_good build git checkout --detach 78cc133c63173a4b5b7a43750cc507d4cff683cf ../artifacts/test.sh cd .. </cut> Full commit (up to 1000 lines): <cut> commit ba74bb3a226e1b4660537f274627285b1bf41ee1 Author: Alexey Bataev <a.bataev(a)outlook.com> Date: Thu Dec 2 04:22:55 2021 -0800 [SLP]Fix reused extracts cost. If the extractelement instruction is used multiple times in the different tree entries (either vectorized, or gathered), need to compensate the scalar cost of such instructions. They are completely removed if all users are part of the tree but we need to compensate the cost only once for each instruction. Differential Revision: https://reviews.llvm.org/D114958 --- llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | 29 +++++++++++++--------- .../X86/extractelement-multiple-uses.ll | 23 +++++++++-------- 2 files changed, 29 insertions(+), 23 deletions(-) diff --git a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp index 95061e9053fa..335ad6c85387 100644 --- a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp +++ b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp @@ -4287,8 +4287,8 @@ bool BoUpSLP::canReuseExtract(ArrayRef<Value *> VL, Value *OpValue, bool BoUpSLP::areAllUsersVectorized(Instruction *I, ArrayRef<Value *> VectorizedVals) const { return (I->hasOneUse() && is_contained(VectorizedVals, I)) || - llvm::all_of(I->users(), [this](User *U) { - return ScalarToTreeEntry.count(U) > 0; + all_of(I->users(), [this](User *U) { + return ScalarToTreeEntry.count(U) > 0 || MustGather.contains(U); }); } @@ -4442,9 +4442,9 @@ InstructionCost BoUpSLP::getEntryCost(const TreeEntry *E, // FIXME: it tries to fix a problem with MSVC buildbots. TargetTransformInfo &TTIRef = *TTI; auto &&AdjustExtractsCost = [this, &TTIRef, CostKind, VL, VecTy, - VectorizedVals](InstructionCost &Cost, - bool IsGather) { + VectorizedVals, E](InstructionCost &Cost) { DenseMap<Value *, int> ExtractVectorsTys; + SmallPtrSet<Value *, 4> CheckedExtracts; for (auto *V : VL) { if (isa<UndefValue>(V)) continue; @@ -4452,7 +4452,12 @@ InstructionCost BoUpSLP::getEntryCost(const TreeEntry *E, // instruction itself is not going to be vectorized, consider this // instruction as dead and remove its cost from the final cost of the // vectorized tree. - if (!areAllUsersVectorized(cast<Instruction>(V), VectorizedVals)) + // Also, avoid adjusting the cost for extractelements with multiple uses + // in different graph entries. + const TreeEntry *VE = getTreeEntry(V); + if (!CheckedExtracts.insert(V).second || + !areAllUsersVectorized(cast<Instruction>(V), VectorizedVals) || + (VE && VE != E)) continue; auto *EE = cast<ExtractElementInst>(V); Optional<unsigned> EEIdx = getExtractIndex(EE); @@ -4549,11 +4554,6 @@ InstructionCost BoUpSLP::getEntryCost(const TreeEntry *E, } return GatherCost; } - if (isSplat(VL)) { - // Found the broadcasting of the single scalar, calculate the cost as the - // broadcast. - return TTI->getShuffleCost(TargetTransformInfo::SK_Broadcast, VecTy); - } if ((E->getOpcode() == Instruction::ExtractElement || all_of(E->Scalars, [](Value *V) { @@ -4571,13 +4571,18 @@ InstructionCost BoUpSLP::getEntryCost(const TreeEntry *E, // single input vector or of 2 input vectors. InstructionCost Cost = computeExtractCost(VL, VecTy, *ShuffleKind, Mask, *TTI); - AdjustExtractsCost(Cost, /*IsGather=*/true); + AdjustExtractsCost(Cost); if (NeedToShuffleReuses) Cost += TTI->getShuffleCost(TargetTransformInfo::SK_PermuteSingleSrc, FinalVecTy, E->ReuseShuffleIndices); return Cost; } } + if (isSplat(VL)) { + // Found the broadcasting of the single scalar, calculate the cost as the + // broadcast. + return TTI->getShuffleCost(TargetTransformInfo::SK_Broadcast, VecTy); + } InstructionCost ReuseShuffleCost = 0; if (NeedToShuffleReuses) ReuseShuffleCost = TTI->getShuffleCost( @@ -4755,7 +4760,7 @@ InstructionCost BoUpSLP::getEntryCost(const TreeEntry *E, TTI->getVectorInstrCost(Instruction::ExtractElement, VecTy, I); } } else { - AdjustExtractsCost(CommonCost, /*IsGather=*/false); + AdjustExtractsCost(CommonCost); } return CommonCost; } diff --git a/llvm/test/Transforms/SLPVectorizer/X86/extractelement-multiple-uses.ll b/llvm/test/Transforms/SLPVectorizer/X86/extractelement-multiple-uses.ll index c47f255f0bfe..31696752bbb3 100644 --- a/llvm/test/Transforms/SLPVectorizer/X86/extractelement-multiple-uses.ll +++ b/llvm/test/Transforms/SLPVectorizer/X86/extractelement-multiple-uses.ll @@ -2,24 +2,25 @@ ; RUN: opt < %s -slp-vectorizer -S -mtriple=x86_64-unknown-linux -march=core-avx2 -pass-remarks-output=%t | FileCheck %s ; RUN: FileCheck %s --input-file=%t --check-prefix=YAML -; YAML: --- !Missed +; YAML: --- !Passed ; YAML: Pass: slp-vectorizer -; YAML: Name: NotBeneficial +; YAML: Name: VectorizedList ; YAML: Function: multi_uses ; YAML: Args: -; YAML: - String: 'List vectorization was possible but not beneficial with cost ' -; YAML: - Cost: '0' -; YAML: - String: ' >= ' -; YAML: - Treshold: '0' +; YAML: - String: 'SLP vectorized with cost ' +; YAML: - Cost: '-1' +; YAML: - String: ' and with tree size ' +; YAML: - TreeSize: '3' define float @multi_uses(<2 x float> %x, <2 x float> %y) { ; CHECK-LABEL: @multi_uses( -; CHECK-NEXT: [[X0:%.*]] = extractelement <2 x float> [[X:%.*]], i32 0 -; CHECK-NEXT: [[X1:%.*]] = extractelement <2 x float> [[X]], i32 1 ; CHECK-NEXT: [[Y1:%.*]] = extractelement <2 x float> [[Y:%.*]], i32 1 -; CHECK-NEXT: [[X0X0:%.*]] = fmul float [[X0]], [[Y1]] -; CHECK-NEXT: [[X1X1:%.*]] = fmul float [[X1]], [[Y1]] -; CHECK-NEXT: [[ADD:%.*]] = fadd float [[X0X0]], [[X1X1]] +; CHECK-NEXT: [[TMP1:%.*]] = insertelement <2 x float> poison, float [[Y1]], i32 0 +; CHECK-NEXT: [[TMP2:%.*]] = insertelement <2 x float> [[TMP1]], float [[Y1]], i32 1 +; CHECK-NEXT: [[TMP3:%.*]] = fmul <2 x float> [[X:%.*]], [[TMP2]] +; CHECK-NEXT: [[TMP4:%.*]] = extractelement <2 x float> [[TMP3]], i32 0 +; CHECK-NEXT: [[TMP5:%.*]] = extractelement <2 x float> [[TMP3]], i32 1 +; CHECK-NEXT: [[ADD:%.*]] = fadd float [[TMP4]], [[TMP5]] ; CHECK-NEXT: ret float [[ADD]] ; %x0 = extractelement <2 x float> %x, i32 0 </cut>

4 years, 2 months

[TCWG CI] 464.h264ref slowed down by 3% after llvm: [SLP]Improve isFixedVectorShuffle and its use.

by ci_notify＠linaro.org

After llvm commit dce6c434ead3ccbaa67b8db2301b2a9fb4319123 Author: Alexey Bataev <a.bataev(a)outlook.com> [SLP]Improve isFixedVectorShuffle and its use. the following benchmarks slowed down by more than 2%: - 464.h264ref slowed down by 3% from 10824 to 11101 perf samples Below reproducer instructions can be used to re-build both "first_bad" and "last_good" cross-toolchains used in this bisection. Naturally, the scripts will fail when triggerring benchmarking jobs if you don't have access to Linaro TCWG CI. For your convenience, we have uploaded tarballs with pre-processed source and assembly files at: - First_bad save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… - Last_good save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… - Baseline save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… Configuration: - Benchmark: SPEC CPU2006 - Toolchain: Clang + Glibc + LLVM Linker - Version: all components were built from their tip of trunk - Target: aarch64-linux-gnu - Compiler flags: -O2 -flto - Hardware: NVidia TX1 4x Cortex-A57 This benchmarking CI is work-in-progress, and we welcome feedback and suggestions at linaro-toolchain(a)lists.linaro.org . In our improvement plans is to add support for SPEC CPU2017 benchmarks and provide "perf report/annotate" data behind these reports. THIS IS THE END OF INTERESTING STUFF. BELOW ARE LINKS TO BUILDS, REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT. This commit has regressed these CI configurations: - tcwg_bmk_llvm_tx1/llvm-master-aarch64-spec2k6-O2_LTO First_bad build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… Last_good build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… Baseline build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… Even more details: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… Reproduce builds: <cut> mkdir investigate-llvm-dce6c434ead3ccbaa67b8db2301b2a9fb4319123 cd investigate-llvm-dce6c434ead3ccbaa67b8db2301b2a9fb4319123 # Fetch scripts git clone https://git.linaro.org/toolchain/jenkins-scripts # Fetch manifests and test.sh script mkdir -p artifacts/manifests curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail chmod +x artifacts/test.sh # Reproduce the baseline build (build all pre-requisites) ./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh # Save baseline build state (which is then restored in artifacts/test.sh) mkdir -p ./bisect rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /llvm/ ./ ./bisect/baseline/ cd llvm # Reproduce first_bad build git checkout --detach dce6c434ead3ccbaa67b8db2301b2a9fb4319123 ../artifacts/test.sh # Reproduce last_good build git checkout --detach 7a7c059d867554e116244ad5639d05d75ed1a7cd ../artifacts/test.sh cd .. </cut> Full commit (up to 1000 lines): <cut> commit dce6c434ead3ccbaa67b8db2301b2a9fb4319123 Author: Alexey Bataev <a.bataev(a)outlook.com> Date: Wed Nov 17 11:14:38 2021 -0800 [SLP]Improve isFixedVectorShuffle and its use. Extended support for undefined source vector/extract indices/non-fixed vector types, also no need to check for the parent of the extractelement instructions with the constant indicies. Differential Revision: https://reviews.llvm.org/D114121 --- llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | 67 +++++++++++++++------- .../X86/alternate-int-inseltpoison.ll | 24 ++++---- .../Transforms/SLPVectorizer/X86/alternate-int.ll | 24 ++++---- 3 files changed, 66 insertions(+), 49 deletions(-) diff --git a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp index e3d3d8992c23..4db630fbd063 100644 --- a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp +++ b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp @@ -327,7 +327,11 @@ static bool isCommutative(Instruction *I) { /// TargetTransformInfo::getInstructionThroughput? static Optional<TargetTransformInfo::ShuffleKind> isFixedVectorShuffle(ArrayRef<Value *> VL, SmallVectorImpl<int> &Mask) { - auto *EI0 = cast<ExtractElementInst>(VL[0]); + const auto *It = + find_if(VL, [](Value *V) { return isa<ExtractElementInst>(V); }); + if (It == VL.end()) + return None; + auto *EI0 = cast<ExtractElementInst>(*It); if (isa<ScalableVectorType>(EI0->getVectorOperandType())) return None; unsigned Size = @@ -336,33 +340,41 @@ isFixedVectorShuffle(ArrayRef<Value *> VL, SmallVectorImpl<int> &Mask) { Value *Vec2 = nullptr; enum ShuffleMode { Unknown, Select, Permute }; ShuffleMode CommonShuffleMode = Unknown; + Mask.assign(VL.size(), UndefMaskElem); for (unsigned I = 0, E = VL.size(); I < E; ++I) { + // Undef can be represented as an undef element in a vector. + if (isa<UndefValue>(VL[I])) + continue; auto *EI = cast<ExtractElementInst>(VL[I]); + if (isa<ScalableVectorType>(EI->getVectorOperandType())) + return None; auto *Vec = EI->getVectorOperand(); + // We can extractelement from undef or poison vector. + if (isa<UndefValue>(Vec)) + continue; // All vector operands must have the same number of vector elements. if (cast<FixedVectorType>(Vec->getType())->getNumElements() != Size) return None; + if (isa<UndefValue>(EI->getIndexOperand())) + continue; auto *Idx = dyn_cast<ConstantInt>(EI->getIndexOperand()); if (!Idx) return None; // Undefined behavior if Idx is negative or >= Size. - if (Idx->getValue().uge(Size)) { - Mask.push_back(UndefMaskElem); + if (Idx->getValue().uge(Size)) continue; - } unsigned IntIdx = Idx->getValue().getZExtValue(); - Mask.push_back(IntIdx); - // We can extractelement from undef or poison vector. - if (isa<UndefValue>(Vec)) - continue; + Mask[I] = IntIdx; // For correct shuffling we have to have at most 2 different vector operands // in all extractelement instructions. - if (!Vec1 || Vec1 == Vec) + if (!Vec1 || Vec1 == Vec) { Vec1 = Vec; - else if (!Vec2 || Vec2 == Vec) + } else if (!Vec2 || Vec2 == Vec) { Vec2 = Vec; - else + Mask[I] += Size; + } else { return None; + } if (CommonShuffleMode == Permute) continue; // If the extract index is not the same as the operation number, it is a @@ -4414,15 +4426,19 @@ InstructionCost BoUpSLP::getEntryCost(const TreeEntry *E, bool IsGather) { DenseMap<Value *, int> ExtractVectorsTys; for (auto *V : VL) { + if (isa<UndefValue>(V)) + continue; // If all users of instruction are going to be vectorized and this // instruction itself is not going to be vectorized, consider this // instruction as dead and remove its cost from the final cost of the // vectorized tree. - if (!areAllUsersVectorized(cast<Instruction>(V), VectorizedVals) || - (IsGather && ScalarToTreeEntry.count(V))) + if (!areAllUsersVectorized(cast<Instruction>(V), VectorizedVals)) continue; auto *EE = cast<ExtractElementInst>(V); - unsigned Idx = *getExtractIndex(EE); + Optional<unsigned> EEIdx = getExtractIndex(EE); + if (!EEIdx) + continue; + unsigned Idx = *EEIdx; if (TTIRef.getNumberOfParts(VecTy) != TTIRef.getNumberOfParts(EE->getVectorOperandType())) { auto It = @@ -4454,6 +4470,8 @@ InstructionCost BoUpSLP::getEntryCost(const TreeEntry *E, for (const auto &Data : ExtractVectorsTys) { auto *EEVTy = cast<FixedVectorType>(Data.first->getType()); unsigned NumElts = VecTy->getNumElements(); + if (Data.second % NumElts == 0) + continue; if (TTIRef.getNumberOfParts(EEVTy) > TTIRef.getNumberOfParts(VecTy)) { unsigned Idx = (Data.second / NumElts) * NumElts; unsigned EENumElts = EEVTy->getNumElements(); @@ -4516,10 +4534,12 @@ InstructionCost BoUpSLP::getEntryCost(const TreeEntry *E, // broadcast. return TTI->getShuffleCost(TargetTransformInfo::SK_Broadcast, VecTy); } - if (E->getOpcode() == Instruction::ExtractElement && allSameType(VL) && - allSameBlock(VL) && - !isa<ScalableVectorType>( - cast<ExtractElementInst>(E->getMainOp())->getVectorOperandType())) { + if ((E->getOpcode() == Instruction::ExtractElement || + all_of(E->Scalars, + [](Value *V) { + return isa<ExtractElementInst, UndefValue>(V); + })) && + allSameType(VL)) { // Check that gather of extractelements can be represented as just a // shuffle of a single/two vectors the scalars are extracted from. SmallVector<int> Mask; @@ -5111,7 +5131,11 @@ bool BoUpSLP::isFullyVectorizableTinyTree(bool ForReduction) const { [this](Value *V) { return EphValues.contains(V); }) && (allConstant(TE->Scalars) || isSplat(TE->Scalars) || TE->Scalars.size() < Limit || - (TE->getOpcode() == Instruction::ExtractElement && + ((TE->getOpcode() == Instruction::ExtractElement || + all_of(TE->Scalars, + [](Value *V) { + return isa<ExtractElementInst, UndefValue>(V); + })) && isFixedVectorShuffle(TE->Scalars, Mask)) || (TE->State == TreeEntry::NeedToGather && TE->getOpcode() == Instruction::Load && !TE->isAltShuffle())); @@ -9183,8 +9207,9 @@ bool SLPVectorizerPass::vectorizeInsertElementInst(InsertElementInst *IEI, SmallVector<Value *, 16> BuildVectorOpds; SmallVector<int> Mask; if (!findBuildAggregate(IEI, TTI, BuildVectorOpds, BuildVectorInsts) || - (llvm::all_of(BuildVectorOpds, - [](Value *V) { return isa<ExtractElementInst>(V); }) && + (llvm::all_of( + BuildVectorOpds, + [](Value *V) { return isa<ExtractElementInst, UndefValue>(V); }) && isFixedVectorShuffle(BuildVectorOpds, Mask))) return false; diff --git a/llvm/test/Transforms/SLPVectorizer/X86/alternate-int-inseltpoison.ll b/llvm/test/Transforms/SLPVectorizer/X86/alternate-int-inseltpoison.ll index 8ab137cc2d7d..9c19a32b2f41 100644 --- a/llvm/test/Transforms/SLPVectorizer/X86/alternate-int-inseltpoison.ll +++ b/llvm/test/Transforms/SLPVectorizer/X86/alternate-int-inseltpoison.ll @@ -230,25 +230,21 @@ define <8 x i32> @ashr_shl_v8i32_const(<8 x i32> %a) { define <8 x i32> @ashr_lshr_shl_v8i32(<8 x i32> %a, <8 x i32> %b) { ; SSE-LABEL: @ashr_lshr_shl_v8i32( -; SSE-NEXT: [[A6:%.*]] = extractelement <8 x i32> [[A:%.*]], i32 6 -; SSE-NEXT: [[A7:%.*]] = extractelement <8 x i32> [[A]], i32 7 -; SSE-NEXT: [[B6:%.*]] = extractelement <8 x i32> [[B:%.*]], i32 6 -; SSE-NEXT: [[B7:%.*]] = extractelement <8 x i32> [[B]], i32 7 -; SSE-NEXT: [[TMP1:%.*]] = shufflevector <8 x i32> [[A]], <8 x i32> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3> -; SSE-NEXT: [[TMP2:%.*]] = shufflevector <8 x i32> [[B]], <8 x i32> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3> +; SSE-NEXT: [[TMP1:%.*]] = shufflevector <8 x i32> [[A:%.*]], <8 x i32> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3> +; SSE-NEXT: [[TMP2:%.*]] = shufflevector <8 x i32> [[B:%.*]], <8 x i32> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3> ; SSE-NEXT: [[TMP3:%.*]] = ashr <4 x i32> [[TMP1]], [[TMP2]] ; SSE-NEXT: [[TMP4:%.*]] = lshr <4 x i32> [[TMP1]], [[TMP2]] ; SSE-NEXT: [[TMP5:%.*]] = shufflevector <4 x i32> [[TMP3]], <4 x i32> [[TMP4]], <4 x i32> <i32 0, i32 1, i32 6, i32 7> ; SSE-NEXT: [[TMP6:%.*]] = lshr <8 x i32> [[A]], [[B]] ; SSE-NEXT: [[TMP7:%.*]] = shufflevector <8 x i32> [[TMP6]], <8 x i32> poison, <2 x i32> <i32 4, i32 5> -; SSE-NEXT: [[AB6:%.*]] = shl i32 [[A6]], [[B6]] -; SSE-NEXT: [[AB7:%.*]] = shl i32 [[A7]], [[B7]] -; SSE-NEXT: [[TMP8:%.*]] = shufflevector <4 x i32> [[TMP5]], <4 x i32> poison, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef> -; SSE-NEXT: [[TMP9:%.*]] = shufflevector <2 x i32> [[TMP7]], <2 x i32> poison, <8 x i32> <i32 0, i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef> -; SSE-NEXT: [[R51:%.*]] = shufflevector <8 x i32> [[TMP8]], <8 x i32> [[TMP9]], <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 8, i32 9, i32 undef, i32 undef> -; SSE-NEXT: [[R6:%.*]] = insertelement <8 x i32> [[R51]], i32 [[AB6]], i32 6 -; SSE-NEXT: [[R7:%.*]] = insertelement <8 x i32> [[R6]], i32 [[AB7]], i32 7 -; SSE-NEXT: ret <8 x i32> [[R7]] +; SSE-NEXT: [[TMP8:%.*]] = shl <8 x i32> [[A]], [[B]] +; SSE-NEXT: [[TMP9:%.*]] = shufflevector <8 x i32> [[TMP8]], <8 x i32> poison, <2 x i32> <i32 6, i32 7> +; SSE-NEXT: [[TMP10:%.*]] = shufflevector <4 x i32> [[TMP5]], <4 x i32> poison, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef> +; SSE-NEXT: [[TMP11:%.*]] = shufflevector <2 x i32> [[TMP7]], <2 x i32> poison, <8 x i32> <i32 0, i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef> +; SSE-NEXT: [[R52:%.*]] = shufflevector <8 x i32> [[TMP10]], <8 x i32> [[TMP11]], <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 8, i32 9, i32 undef, i32 undef> +; SSE-NEXT: [[TMP12:%.*]] = shufflevector <2 x i32> [[TMP9]], <2 x i32> poison, <8 x i32> <i32 0, i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef> +; SSE-NEXT: [[R71:%.*]] = shufflevector <8 x i32> [[R52]], <8 x i32> [[TMP12]], <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 8, i32 9> +; SSE-NEXT: ret <8 x i32> [[R71]] ; ; SLM-LABEL: @ashr_lshr_shl_v8i32( ; SLM-NEXT: [[TMP1:%.*]] = shufflevector <8 x i32> [[A:%.*]], <8 x i32> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3> diff --git a/llvm/test/Transforms/SLPVectorizer/X86/alternate-int.ll b/llvm/test/Transforms/SLPVectorizer/X86/alternate-int.ll index 3af16bf404a3..783b50ae4b17 100644 --- a/llvm/test/Transforms/SLPVectorizer/X86/alternate-int.ll +++ b/llvm/test/Transforms/SLPVectorizer/X86/alternate-int.ll @@ -230,25 +230,21 @@ define <8 x i32> @ashr_shl_v8i32_const(<8 x i32> %a) { define <8 x i32> @ashr_lshr_shl_v8i32(<8 x i32> %a, <8 x i32> %b) { ; SSE-LABEL: @ashr_lshr_shl_v8i32( -; SSE-NEXT: [[A6:%.*]] = extractelement <8 x i32> [[A:%.*]], i32 6 -; SSE-NEXT: [[A7:%.*]] = extractelement <8 x i32> [[A]], i32 7 -; SSE-NEXT: [[B6:%.*]] = extractelement <8 x i32> [[B:%.*]], i32 6 -; SSE-NEXT: [[B7:%.*]] = extractelement <8 x i32> [[B]], i32 7 -; SSE-NEXT: [[TMP1:%.*]] = shufflevector <8 x i32> [[A]], <8 x i32> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3> -; SSE-NEXT: [[TMP2:%.*]] = shufflevector <8 x i32> [[B]], <8 x i32> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3> +; SSE-NEXT: [[TMP1:%.*]] = shufflevector <8 x i32> [[A:%.*]], <8 x i32> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3> +; SSE-NEXT: [[TMP2:%.*]] = shufflevector <8 x i32> [[B:%.*]], <8 x i32> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3> ; SSE-NEXT: [[TMP3:%.*]] = ashr <4 x i32> [[TMP1]], [[TMP2]] ; SSE-NEXT: [[TMP4:%.*]] = lshr <4 x i32> [[TMP1]], [[TMP2]] ; SSE-NEXT: [[TMP5:%.*]] = shufflevector <4 x i32> [[TMP3]], <4 x i32> [[TMP4]], <4 x i32> <i32 0, i32 1, i32 6, i32 7> ; SSE-NEXT: [[TMP6:%.*]] = lshr <8 x i32> [[A]], [[B]] ; SSE-NEXT: [[TMP7:%.*]] = shufflevector <8 x i32> [[TMP6]], <8 x i32> poison, <2 x i32> <i32 4, i32 5> -; SSE-NEXT: [[AB6:%.*]] = shl i32 [[A6]], [[B6]] -; SSE-NEXT: [[AB7:%.*]] = shl i32 [[A7]], [[B7]] -; SSE-NEXT: [[TMP8:%.*]] = shufflevector <4 x i32> [[TMP5]], <4 x i32> poison, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef> -; SSE-NEXT: [[TMP9:%.*]] = shufflevector <2 x i32> [[TMP7]], <2 x i32> poison, <8 x i32> <i32 0, i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef> -; SSE-NEXT: [[R51:%.*]] = shufflevector <8 x i32> [[TMP8]], <8 x i32> [[TMP9]], <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 8, i32 9, i32 undef, i32 undef> -; SSE-NEXT: [[R6:%.*]] = insertelement <8 x i32> [[R51]], i32 [[AB6]], i32 6 -; SSE-NEXT: [[R7:%.*]] = insertelement <8 x i32> [[R6]], i32 [[AB7]], i32 7 -; SSE-NEXT: ret <8 x i32> [[R7]] +; SSE-NEXT: [[TMP8:%.*]] = shl <8 x i32> [[A]], [[B]] +; SSE-NEXT: [[TMP9:%.*]] = shufflevector <8 x i32> [[TMP8]], <8 x i32> poison, <2 x i32> <i32 6, i32 7> +; SSE-NEXT: [[TMP10:%.*]] = shufflevector <4 x i32> [[TMP5]], <4 x i32> poison, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef> +; SSE-NEXT: [[TMP11:%.*]] = shufflevector <2 x i32> [[TMP7]], <2 x i32> poison, <8 x i32> <i32 0, i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef> +; SSE-NEXT: [[R52:%.*]] = shufflevector <8 x i32> [[TMP10]], <8 x i32> [[TMP11]], <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 8, i32 9, i32 undef, i32 undef> +; SSE-NEXT: [[TMP12:%.*]] = shufflevector <2 x i32> [[TMP9]], <2 x i32> poison, <8 x i32> <i32 0, i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef> +; SSE-NEXT: [[R71:%.*]] = shufflevector <8 x i32> [[R52]], <8 x i32> [[TMP12]], <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 8, i32 9> +; SSE-NEXT: ret <8 x i32> [[R71]] ; ; SLM-LABEL: @ashr_lshr_shl_v8i32( ; SLM-NEXT: [[TMP1:%.*]] = shufflevector <8 x i32> [[A:%.*]], <8 x i32> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3> </cut>

4 years, 2 months

[TCWG CI] 453.povray slowed down by 3% after llvm: [llvm] Use range-based for loops (NFC)

by ci_notify＠linaro.org

After llvm commit f240e528cea25fd2a9ae01b1e1fe77f507ed7a2c Author: Kazu Hirata <kazu(a)google.com> [llvm] Use range-based for loops (NFC) the following benchmarks slowed down by more than 2%: - 453.povray slowed down by 3% from 4906 to 5047 perf samples Below reproducer instructions can be used to re-build both "first_bad" and "last_good" cross-toolchains used in this bisection. Naturally, the scripts will fail when triggerring benchmarking jobs if you don't have access to Linaro TCWG CI. For your convenience, we have uploaded tarballs with pre-processed source and assembly files at: - First_bad save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… - Last_good save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… - Baseline save-temps: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… Configuration: - Benchmark: SPEC CPU2006 - Toolchain: Clang + Glibc + LLVM Linker - Version: all components were built from their tip of trunk - Target: aarch64-linux-gnu - Compiler flags: -O2 -flto - Hardware: NVidia TX1 4x Cortex-A57 This benchmarking CI is work-in-progress, and we welcome feedback and suggestions at linaro-toolchain(a)lists.linaro.org . In our improvement plans is to add support for SPEC CPU2017 benchmarks and provide "perf report/annotate" data behind these reports. THIS IS THE END OF INTERESTING STUFF. BELOW ARE LINKS TO BUILDS, REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT. This commit has regressed these CI configurations: - tcwg_bmk_llvm_tx1/llvm-master-aarch64-spec2k6-O2_LTO First_bad build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… Last_good build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… Baseline build: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… Even more details: https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… Reproduce builds: <cut> mkdir investigate-llvm-f240e528cea25fd2a9ae01b1e1fe77f507ed7a2c cd investigate-llvm-f240e528cea25fd2a9ae01b1e1fe77f507ed7a2c # Fetch scripts git clone https://git.linaro.org/toolchain/jenkins-scripts # Fetch manifests and test.sh script mkdir -p artifacts/manifests curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-… --fail chmod +x artifacts/test.sh # Reproduce the baseline build (build all pre-requisites) ./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh # Save baseline build state (which is then restored in artifacts/test.sh) mkdir -p ./bisect rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /llvm/ ./ ./bisect/baseline/ cd llvm # Reproduce first_bad build git checkout --detach f240e528cea25fd2a9ae01b1e1fe77f507ed7a2c ../artifacts/test.sh # Reproduce last_good build git checkout --detach c572eb1ad9d8a528bcaff0160888aff31b1f4b5f ../artifacts/test.sh cd .. </cut> Full commit (up to 1000 lines): <cut> commit f240e528cea25fd2a9ae01b1e1fe77f507ed7a2c Author: Kazu Hirata <kazu(a)google.com> Date: Mon Nov 29 09:04:44 2021 -0800 [llvm] Use range-based for loops (NFC) --- llvm/lib/CodeGen/MachinePipeliner.cpp | 7 ++--- .../CodeGen/SelectionDAG/ResourcePriorityQueue.cpp | 4 +-- llvm/lib/Object/ELFObjectFile.cpp | 4 +-- llvm/lib/ObjectYAML/COFFEmitter.cpp | 32 ++++++++++------------ llvm/lib/Passes/StandardInstrumentations.cpp | 4 +-- llvm/lib/ProfileData/InstrProf.cpp | 11 ++++---- llvm/lib/Target/Hexagon/HexagonCommonGEP.cpp | 18 ++++++------ llvm/lib/Target/NVPTX/NVPTXAsmPrinter.cpp | 32 ++++++++++------------ 8 files changed, 50 insertions(+), 62 deletions(-) diff --git a/llvm/lib/CodeGen/MachinePipeliner.cpp b/llvm/lib/CodeGen/MachinePipeliner.cpp index 21be6718b7d9..8d6459a627fa 100644 --- a/llvm/lib/CodeGen/MachinePipeliner.cpp +++ b/llvm/lib/CodeGen/MachinePipeliner.cpp @@ -1519,9 +1519,8 @@ static bool pred_L(SetVector<SUnit *> &NodeOrder, SmallSetVector<SUnit *, 8> &Preds, const NodeSet *S = nullptr) { Preds.clear(); - for (SetVector<SUnit *>::iterator I = NodeOrder.begin(), E = NodeOrder.end(); - I != E; ++I) { - for (const SDep &Pred : (*I)->Preds) { + for (const SUnit *SU : NodeOrder) { + for (const SDep &Pred : SU->Preds) { if (S && S->count(Pred.getSUnit()) == 0) continue; if (ignoreDependence(Pred, true)) @@ -1530,7 +1529,7 @@ static bool pred_L(SetVector<SUnit *> &NodeOrder, Preds.insert(Pred.getSUnit()); } // Back-edges are predecessors with an anti-dependence. - for (const SDep &Succ : (*I)->Succs) { + for (const SDep &Succ : SU->Succs) { if (Succ.getKind() != SDep::Anti) continue; if (S && S->count(Succ.getSUnit()) == 0) diff --git a/llvm/lib/CodeGen/SelectionDAG/ResourcePriorityQueue.cpp b/llvm/lib/CodeGen/SelectionDAG/ResourcePriorityQueue.cpp index 55fe26eb64cd..2695ed36991c 100644 --- a/llvm/lib/CodeGen/SelectionDAG/ResourcePriorityQueue.cpp +++ b/llvm/lib/CodeGen/SelectionDAG/ResourcePriorityQueue.cpp @@ -268,8 +268,8 @@ bool ResourcePriorityQueue::isResourceAvailable(SUnit *SU) { // Now see if there are no other dependencies // to instructions already in the packet. - for (unsigned i = 0, e = Packet.size(); i != e; ++i) - for (const SDep &Succ : Packet[i]->Succs) { + for (const SUnit *S : Packet) + for (const SDep &Succ : S->Succs) { // Since we do not add pseudos to packets, might as well // ignore order deps. if (Succ.isCtrl()) diff --git a/llvm/lib/Object/ELFObjectFile.cpp b/llvm/lib/Object/ELFObjectFile.cpp index 50035d6c7523..cf1f12d9a9a7 100644 --- a/llvm/lib/Object/ELFObjectFile.cpp +++ b/llvm/lib/Object/ELFObjectFile.cpp @@ -682,7 +682,7 @@ readDynsymVersionsImpl(const ELFFile<ELFT> &EF, std::vector<VersionEntry> Ret; size_t I = 0; - for (auto It = Symbols.begin(), E = Symbols.end(); It != E; ++It) { + for (const ELFSymbolRef &Sym : Symbols) { ++I; Expected<const typename ELFT::Versym *> VerEntryOrErr = EF.template getEntry<typename ELFT::Versym>(*VerSec, I); @@ -691,7 +691,7 @@ readDynsymVersionsImpl(const ELFFile<ELFT> &EF, " from " + describe(EF, *VerSec) + ": " + toString(VerEntryOrErr.takeError())); - Expected<uint32_t> FlagsOrErr = It->getFlags(); + Expected<uint32_t> FlagsOrErr = Sym.getFlags(); if (!FlagsOrErr) return createError("unable to read flags for symbol with index " + Twine(I) + ": " + toString(FlagsOrErr.takeError())); diff --git a/llvm/lib/ObjectYAML/COFFEmitter.cpp b/llvm/lib/ObjectYAML/COFFEmitter.cpp index 5f38ca13cfc2..66ad16db1ba4 100644 --- a/llvm/lib/ObjectYAML/COFFEmitter.cpp +++ b/llvm/lib/ObjectYAML/COFFEmitter.cpp @@ -476,29 +476,25 @@ static bool writeCOFF(COFFParser &CP, raw_ostream &OS) { assert(OS.tell() == CP.SectionTableStart); // Output section table. - for (std::vector<COFFYAML::Section>::iterator i = CP.Obj.Sections.begin(), - e = CP.Obj.Sections.end(); - i != e; ++i) { - OS.write(i->Header.Name, COFF::NameSize); - OS << binary_le(i->Header.VirtualSize) - << binary_le(i->Header.VirtualAddress) - << binary_le(i->Header.SizeOfRawData) - << binary_le(i->Header.PointerToRawData) - << binary_le(i->Header.PointerToRelocations) - << binary_le(i->Header.PointerToLineNumbers) - << binary_le(i->Header.NumberOfRelocations) - << binary_le(i->Header.NumberOfLineNumbers) - << binary_le(i->Header.Characteristics); + for (const COFFYAML::Section &S : CP.Obj.Sections) { + OS.write(S.Header.Name, COFF::NameSize); + OS << binary_le(S.Header.VirtualSize) + << binary_le(S.Header.VirtualAddress) + << binary_le(S.Header.SizeOfRawData) + << binary_le(S.Header.PointerToRawData) + << binary_le(S.Header.PointerToRelocations) + << binary_le(S.Header.PointerToLineNumbers) + << binary_le(S.Header.NumberOfRelocations) + << binary_le(S.Header.NumberOfLineNumbers) + << binary_le(S.Header.Characteristics); } assert(OS.tell() == CP.SectionTableStart + CP.SectionTableSize); unsigned CurSymbol = 0; StringMap<unsigned> SymbolTableIndexMap; - for (std::vector<COFFYAML::Symbol>::iterator I = CP.Obj.Symbols.begin(), - E = CP.Obj.Symbols.end(); - I != E; ++I) { - SymbolTableIndexMap[I->Name] = CurSymbol; - CurSymbol += 1 + I->Header.NumberOfAuxSymbols; + for (const COFFYAML::Symbol &Sym : CP.Obj.Symbols) { + SymbolTableIndexMap[Sym.Name] = CurSymbol; + CurSymbol += 1 + Sym.Header.NumberOfAuxSymbols; } // Output section data. diff --git a/llvm/lib/Passes/StandardInstrumentations.cpp b/llvm/lib/Passes/StandardInstrumentations.cpp index 8e6be6730ea4..27a6c519ff82 100644 --- a/llvm/lib/Passes/StandardInstrumentations.cpp +++ b/llvm/lib/Passes/StandardInstrumentations.cpp @@ -225,8 +225,8 @@ std::string doSystemDiff(StringRef Before, StringRef After, return "Unable to read result."; // Clean up. - for (unsigned I = 0; I < NumFiles; ++I) { - std::error_code EC = sys::fs::remove(FileName[I]); + for (const std::string &I : FileName) { + std::error_code EC = sys::fs::remove(I); if (EC) return "Unable to remove temporary file."; } diff --git a/llvm/lib/ProfileData/InstrProf.cpp b/llvm/lib/ProfileData/InstrProf.cpp index 1168ad27fe52..ab3487ecffe8 100644 --- a/llvm/lib/ProfileData/InstrProf.cpp +++ b/llvm/lib/ProfileData/InstrProf.cpp @@ -657,19 +657,18 @@ void InstrProfValueSiteRecord::merge(InstrProfValueSiteRecord &Input, Input.sortByTargetValues(); auto I = ValueData.begin(); auto IE = ValueData.end(); - for (auto J = Input.ValueData.begin(), JE = Input.ValueData.end(); J != JE; - ++J) { - while (I != IE && I->Value < J->Value) + for (const InstrProfValueData &J : Input.ValueData) { + while (I != IE && I->Value < J.Value) ++I; - if (I != IE && I->Value == J->Value) { + if (I != IE && I->Value == J.Value) { bool Overflowed; - I->Count = SaturatingMultiplyAdd(J->Count, Weight, I->Count, &Overflowed); + I->Count = SaturatingMultiplyAdd(J.Count, Weight, I->Count, &Overflowed); if (Overflowed) Warn(instrprof_error::counter_overflow); ++I; continue; } - ValueData.insert(I, *J); + ValueData.insert(I, J); } } diff --git a/llvm/lib/Target/Hexagon/HexagonCommonGEP.cpp b/llvm/lib/Target/Hexagon/HexagonCommonGEP.cpp index 43f0758f6598..8c3b9572201e 100644 --- a/llvm/lib/Target/Hexagon/HexagonCommonGEP.cpp +++ b/llvm/lib/Target/Hexagon/HexagonCommonGEP.cpp @@ -476,10 +476,10 @@ namespace { } // end anonymous namespace static const NodeSet *node_class(GepNode *N, NodeSymRel &Rel) { - for (NodeSymRel::iterator I = Rel.begin(), E = Rel.end(); I != E; ++I) - if (I->count(N)) - return &*I; - return nullptr; + for (const NodeSet &S : Rel) + if (S.count(N)) + return &S; + return nullptr; } // Create an ordered pair of GepNode pointers. The pair will be used in @@ -589,9 +589,8 @@ void HexagonCommonGEP::common() { dbgs() << "{ " << I->first << ", " << I->second << " }\n"; dbgs() << "Gep equivalence classes:\n"; - for (NodeSymRel::iterator I = EqRel.begin(), E = EqRel.end(); I != E; ++I) { + for (const NodeSet &S : EqRel) { dbgs() << '{'; - const NodeSet &S = *I; for (NodeSet::const_iterator J = S.begin(), F = S.end(); J != F; ++J) { if (J != S.begin()) dbgs() << ','; @@ -604,8 +603,7 @@ void HexagonCommonGEP::common() { // Create a projection from a NodeSet to the minimal element in it. using ProjMap = std::map<const NodeSet *, GepNode *>; ProjMap PM; - for (NodeSymRel::iterator I = EqRel.begin(), E = EqRel.end(); I != E; ++I) { - const NodeSet &S = *I; + for (const NodeSet &S : EqRel) { GepNode *Min = *std::min_element(S.begin(), S.end(), NodeOrder); std::pair<ProjMap::iterator,bool> Ins = PM.insert(std::make_pair(&S, Min)); (void)Ins; @@ -1280,8 +1278,8 @@ bool HexagonCommonGEP::runOnFunction(Function &F) { return false; // For now bail out on C++ exception handling. - for (Function::iterator A = F.begin(), Z = F.end(); A != Z; ++A) - for (BasicBlock::iterator I = A->begin(), E = A->end(); I != E; ++I) + for (const BasicBlock &BB : F) + for (const Instruction &I : BB) if (isa<InvokeInst>(I) || isa<LandingPadInst>(I)) return false; diff --git a/llvm/lib/Target/NVPTX/NVPTXAsmPrinter.cpp b/llvm/lib/Target/NVPTX/NVPTXAsmPrinter.cpp index c2ddfd6164f4..c35e67d6726f 100644 --- a/llvm/lib/Target/NVPTX/NVPTXAsmPrinter.cpp +++ b/llvm/lib/Target/NVPTX/NVPTXAsmPrinter.cpp @@ -130,10 +130,8 @@ VisitGlobalVariableForEmission(const GlobalVariable *GV, for (unsigned i = 0, e = GV->getNumOperands(); i != e; ++i) DiscoverDependentGlobals(GV->getOperand(i), Others); - for (DenseSet<const GlobalVariable *>::iterator I = Others.begin(), - E = Others.end(); - I != E; ++I) - VisitGlobalVariableForEmission(*I, Order, Visited, Visiting); + for (const GlobalVariable *GV : Others) + VisitGlobalVariableForEmission(GV, Order, Visited, Visiting); // Now we can visit ourself Order.push_back(GV); @@ -699,35 +697,33 @@ static bool useFuncSeen(const Constant *C, void NVPTXAsmPrinter::emitDeclarations(const Module &M, raw_ostream &O) { DenseMap<const Function *, bool> seenMap; - for (Module::const_iterator FI = M.begin(), FE = M.end(); FI != FE; ++FI) { - const Function *F = &*FI; - - if (F->getAttributes().hasFnAttr("nvptx-libcall-callee")) { - emitDeclaration(F, O); + for (const Function &F : M) { + if (F.getAttributes().hasFnAttr("nvptx-libcall-callee")) { + emitDeclaration(&F, O); continue; } - if (F->isDeclaration()) { - if (F->use_empty()) + if (F.isDeclaration()) { + if (F.use_empty()) continue; - if (F->getIntrinsicID()) + if (F.getIntrinsicID()) continue; - emitDeclaration(F, O); + emitDeclaration(&F, O); continue; } - for (const User *U : F->users()) { + for (const User *U : F.users()) { if (const Constant *C = dyn_cast<Constant>(U)) { if (usedInGlobalVarDef(C)) { // The use is in the initialization of a global variable // that is a function pointer, so print a declaration // for the original function - emitDeclaration(F, O); + emitDeclaration(&F, O); break; } // Emit a declaration of this function if the function that // uses this constant expr has already been seen. if (useFuncSeen(C, seenMap)) { - emitDeclaration(F, O); + emitDeclaration(&F, O); break; } } @@ -746,11 +742,11 @@ void NVPTXAsmPrinter::emitDeclarations(const Module &M, raw_ostream &O) { // appearing in the module before the callee. so print out // a declaration for the callee. if (seenMap.find(caller) != seenMap.end()) { - emitDeclaration(F, O); + emitDeclaration(&F, O); break; } } - seenMap[F] = true; + seenMap[&F] = true; } } </cut>

4 years, 2 months

Jump to page:

2026

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

linaro-toolchain