ds-alignment.ll - OpenGrok history log for /llvm-project-15.0.7/llvm/test/CodeGen/AMDGPU/ds-alignment.ll

Revision (<<< Hide revision tags) (Show revision tags >>>)	Date	Author	Comments
Revision tags: llvmorg-20.1.0, llvmorg-20.1.0-rc3, llvmorg-20.1.0-rc2, llvmorg-20.1.0-rc1, llvmorg-21-init, llvmorg-19.1.7, llvmorg-19.1.6, llvmorg-19.1.5, llvmorg-19.1.4, llvmorg-19.1.3, llvmorg-19.1.2, llvmorg-19.1.1, llvmorg-19.1.0, llvmorg-19.1.0-rc4, llvmorg-19.1.0-rc3, llvmorg-19.1.0-rc2, llvmorg-19.1.0-rc1, llvmorg-20-init, llvmorg-18.1.8, llvmorg-18.1.7, llvmorg-18.1.6, llvmorg-18.1.5, llvmorg-18.1.4, llvmorg-18.1.3, llvmorg-18.1.2, llvmorg-18.1.1, llvmorg-18.1.0, llvmorg-18.1.0-rc4, llvmorg-18.1.0-rc3, llvmorg-18.1.0-rc2, llvmorg-18.1.0-rc1, llvmorg-19-init, llvmorg-17.0.6, llvmorg-17.0.5, llvmorg-17.0.4, llvmorg-17.0.3, llvmorg-17.0.2, llvmorg-17.0.1, llvmorg-17.0.0, llvmorg-17.0.0-rc4, llvmorg-17.0.0-rc3, llvmorg-17.0.0-rc2, llvmorg-17.0.0-rc1, llvmorg-18-init, llvmorg-16.0.6, llvmorg-16.0.5, llvmorg-16.0.4, llvmorg-16.0.3, llvmorg-16.0.2, llvmorg-16.0.1, llvmorg-16.0.0, llvmorg-16.0.0-rc4, llvmorg-16.0.0-rc3, llvmorg-16.0.0-rc2, llvmorg-16.0.0-rc1, llvmorg-17-init, llvmorg-15.0.7, llvmorg-15.0.6, llvmorg-15.0.5, llvmorg-15.0.4, llvmorg-15.0.3, llvmorg-15.0.2, llvmorg-15.0.1, llvmorg-15.0.0, llvmorg-15.0.0-rc3, llvmorg-15.0.0-rc2, llvmorg-15.0.0-rc1, llvmorg-16-init, llvmorg-14.0.6, llvmorg-14.0.5, llvmorg-14.0.4, llvmorg-14.0.3, llvmorg-14.0.2
# ac94073d	12-Apr-2022	Stanislav Mekhanoshin <[email protected]>	[AMDGPU] Refine 64 bit misaligned LDS ops selection Here is the performance data: ``` Using platform: AMD Accelerated Parallel Processing Using device: gfx900:xnack- ds_write_b64 [AMDGPU] Refine 64 bit misaligned LDS ops selection Here is the performance data: ``` Using platform: AMD Accelerated Parallel Processing Using device: gfx900:xnack- ds_write_b64 aligned by 8: 3.2 sec ds_write2_b32 aligned by 8: 3.2 sec ds_write_b16 * 4 aligned by 8: 7.0 sec ds_write_b8 * 8 aligned by 8: 13.2 sec ds_write_b64 aligned by 1: 7.3 sec ds_write2_b32 aligned by 1: 7.5 sec ds_write_b16 * 4 aligned by 1: 14.0 sec ds_write_b8 * 8 aligned by 1: 13.2 sec ds_write_b64 aligned by 2: 7.3 sec ds_write2_b32 aligned by 2: 7.5 sec ds_write_b16 * 4 aligned by 2: 7.1 sec ds_write_b8 * 8 aligned by 2: 13.3 sec ds_write_b64 aligned by 4: 4.6 sec ds_write2_b32 aligned by 4: 3.2 sec ds_write_b16 * 4 aligned by 4: 7.1 sec ds_write_b8 * 8 aligned by 4: 13.3 sec ds_read_b64 aligned by 8: 2.3 sec ds_read2_b32 aligned by 8: 2.2 sec ds_read_u16 * 4 aligned by 8: 4.8 sec ds_read_u8 * 8 aligned by 8: 8.6 sec ds_read_b64 aligned by 1: 4.4 sec ds_read2_b32 aligned by 1: 7.3 sec ds_read_u16 * 4 aligned by 1: 14.0 sec ds_read_u8 * 8 aligned by 1: 8.7 sec ds_read_b64 aligned by 2: 4.4 sec ds_read2_b32 aligned by 2: 7.3 sec ds_read_u16 * 4 aligned by 2: 4.8 sec ds_read_u8 * 8 aligned by 2: 8.7 sec ds_read_b64 aligned by 4: 4.4 sec ds_read2_b32 aligned by 4: 2.3 sec ds_read_u16 * 4 aligned by 4: 4.8 sec ds_read_u8 * 8 aligned by 4: 8.7 sec Using platform: AMD Accelerated Parallel Processing Using device: gfx1030 ds_write_b64 aligned by 8: 4.4 sec ds_write2_b32 aligned by 8: 4.3 sec ds_write_b16 * 4 aligned by 8: 7.9 sec ds_write_b8 * 8 aligned by 8: 13.0 sec ds_write_b64 aligned by 1: 23.2 sec ds_write2_b32 aligned by 1: 23.1 sec ds_write_b16 * 4 aligned by 1: 44.0 sec ds_write_b8 * 8 aligned by 1: 13.0 sec ds_write_b64 aligned by 2: 23.2 sec ds_write2_b32 aligned by 2: 23.1 sec ds_write_b16 * 4 aligned by 2: 7.9 sec ds_write_b8 * 8 aligned by 2: 13.1 sec ds_write_b64 aligned by 4: 13.5 sec ds_write2_b32 aligned by 4: 4.3 sec ds_write_b16 * 4 aligned by 4: 7.9 sec ds_write_b8 * 8 aligned by 4: 13.1 sec ds_read_b64 aligned by 8: 3.5 sec ds_read2_b32 aligned by 8: 3.4 sec ds_read_u16 * 4 aligned by 8: 5.3 sec ds_read_u8 * 8 aligned by 8: 8.5 sec ds_read_b64 aligned by 1: 13.1 sec ds_read2_b32 aligned by 1: 22.7 sec ds_read_u16 * 4 aligned by 1: 43.9 sec ds_read_u8 * 8 aligned by 1: 7.9 sec ds_read_b64 aligned by 2: 13.1 sec ds_read2_b32 aligned by 2: 22.7 sec ds_read_u16 * 4 aligned by 2: 5.6 sec ds_read_u8 * 8 aligned by 2: 7.9 sec ds_read_b64 aligned by 4: 13.1 sec ds_read2_b32 aligned by 4: 3.4 sec ds_read_u16 * 4 aligned by 4: 5.6 sec ds_read_u8 * 8 aligned by 4: 7.9 sec ``` GFX10 exposes a different pattern for sub-DWORD load/store performance than GFX9. On GFX9 it is faster to issue a single unaligned load or store than a fully split b8 access, where on GFX10 even a full split is better. However, this is a theoretical only gain because splitting an access to a sub-dword level will require more registers and packing/ unpacking logic, so ignoring this option it is better to use a single 64 bit instruction on a misaligned data with the exception of 4 byte aligned data where ds_read2_b32/ds_write2_b32 is better. Differential Revision: https://reviews.llvm.org/D123956 show more ...
Revision tags: llvmorg-14.0.1
# f6462a26	11-Apr-2022	Stanislav Mekhanoshin <[email protected]>	[AMDGPU] Split unaligned 4 DWORD DS operations Similarly to 3 DWORD operations it is better for performance to split unlaligned operations as long a these are at least DWORD alignmened. Performance [AMDGPU] Split unaligned 4 DWORD DS operations Similarly to 3 DWORD operations it is better for performance to split unlaligned operations as long a these are at least DWORD alignmened. Performance data: ``` Using platform: AMD Accelerated Parallel Processing Using device: gfx900:xnack- ds_write_b128 aligned by 16: 4.9 sec ds_write2_b64 aligned by 16: 5.1 sec ds_write2_b32 * 2 aligned by 16: 5.5 sec ds_write_b128 aligned by 1: 8.1 sec ds_write2_b64 aligned by 1: 8.7 sec ds_write2_b32 * 2 aligned by 1: 14.0 sec ds_write_b128 aligned by 2: 8.1 sec ds_write2_b64 aligned by 2: 8.7 sec ds_write2_b32 * 2 aligned by 2: 14.0 sec ds_write_b128 aligned by 4: 5.6 sec ds_write2_b64 aligned by 4: 8.7 sec ds_write2_b32 * 2 aligned by 4: 5.6 sec ds_write_b128 aligned by 8: 5.6 sec ds_write2_b64 aligned by 8: 5.1 sec ds_write2_b32 * 2 aligned by 8: 5.6 sec ds_read_b128 aligned by 16: 3.8 sec ds_read2_b64 aligned by 16: 3.8 sec ds_read2_b32 * 2 aligned by 16: 4.0 sec ds_read_b128 aligned by 1: 4.6 sec ds_read2_b64 aligned by 1: 8.1 sec ds_read2_b32 * 2 aligned by 1: 14.0 sec ds_read_b128 aligned by 2: 4.6 sec ds_read2_b64 aligned by 2: 8.1 sec ds_read2_b32 * 2 aligned by 2: 14.0 sec ds_read_b128 aligned by 4: 4.6 sec ds_read2_b64 aligned by 4: 8.1 sec ds_read2_b32 * 2 aligned by 4: 4.0 sec ds_read_b128 aligned by 8: 4.6 sec ds_read2_b64 aligned by 8: 3.8 sec ds_read2_b32 * 2 aligned by 8: 4.0 sec Using platform: AMD Accelerated Parallel Processing Using device: gfx1030 ds_write_b128 aligned by 16: 6.2 sec ds_write2_b64 aligned by 16: 7.1 sec ds_write2_b32 * 2 aligned by 16: 7.6 sec ds_write_b128 aligned by 1: 24.1 sec ds_write2_b64 aligned by 1: 25.2 sec ds_write2_b32 * 2 aligned by 1: 43.7 sec ds_write_b128 aligned by 2: 24.1 sec ds_write2_b64 aligned by 2: 25.1 sec ds_write2_b32 * 2 aligned by 2: 43.7 sec ds_write_b128 aligned by 4: 14.4 sec ds_write2_b64 aligned by 4: 25.1 sec ds_write2_b32 * 2 aligned by 4: 7.6 sec ds_write_b128 aligned by 8: 14.4 sec ds_write2_b64 aligned by 8: 7.1 sec ds_write2_b32 * 2 aligned by 8: 7.6 sec ds_read_b128 aligned by 16: 6.2 sec ds_read2_b64 aligned by 16: 6.3 sec ds_read2_b32 * 2 aligned by 16: 7.5 sec ds_read_b128 aligned by 1: 12.5 sec ds_read2_b64 aligned by 1: 24.0 sec ds_read2_b32 * 2 aligned by 1: 43.6 sec ds_read_b128 aligned by 2: 12.5 sec ds_read2_b64 aligned by 2: 24.0 sec ds_read2_b32 * 2 aligned by 2: 43.6 sec ds_read_b128 aligned by 4: 12.5 sec ds_read2_b64 aligned by 4: 24.0 sec ds_read2_b32 * 2 aligned by 4: 7.5 sec ds_read_b128 aligned by 8: 12.5 sec ds_read2_b64 aligned by 8: 6.3 sec ds_read2_b32 * 2 aligned by 8: 7.5 sec ``` Differential Revision: https://reviews.llvm.org/D123634 show more ...
# 65b8a432	12-Apr-2022	Stanislav Mekhanoshin <[email protected]>	[AMDGPU] Update ds-alignment.ll test checks. NFC.
# 3870b360	11-Apr-2022	Stanislav Mekhanoshin <[email protected]>	[AMDGPU] Split unaligned 3 DWORD DS operations I have written a minitest to check the performance. Overall the benefit of aligned b96 operations on data which is not known but happens to be aligned [AMDGPU] Split unaligned 3 DWORD DS operations I have written a minitest to check the performance. Overall the benefit of aligned b96 operations on data which is not known but happens to be aligned is small, while performance hit of using b96 operations on a really unaligned memory is high. The only exception is when data is not aligned even by 4, it is better to use b96 in this case. Here is the test output on Vega and Navi: ``` Using platform: AMD Accelerated Parallel Processing Using device: gfx900:xnack- ds_write_b96 aligned: 3.4 sec ds_write_b32 + ds_write_b64 aligned: 4.5 sec ds_write_b32 * 3 aligned: 4.8 sec ds_write_b96 misaligned by 1: 4.8 sec ds_write_b32 + ds_write_b64 misaligned by 1: 7.2 sec ds_write_b32 * 3 misaligned by 1: 10.0 sec ds_write_b96 misaligned by 2: 4.8 sec ds_write_b32 + ds_write_b64 misaligned by 2: 7.2 sec ds_write_b32 * 3 misaligned by 2: 10.1 sec ds_write_b96 misaligned by 4: 4.8 sec ds_write_b32 + ds_write_b64 misaligned by 4: 4.2 sec ds_write_b32 * 3 misaligned by 4: 4.9 sec ds_write_b96 misaligned by 8: 4.8 sec ds_write_b32 + ds_write_b64 misaligned by 8: 4.6 sec ds_write_b32 * 3 misaligned by 8: 4.9 sec ds_read_b96 aligned: 3.3 sec ds_read_b32 + ds_read_b64 aligned: 4.9 sec ds_read_b32 * 3 aligned: 2.6 sec ds_read_b96 misaligned by 1: 4.1 sec ds_read_b32 + ds_read_b64 misaligned by 1: 7.2 sec ds_read_b32 * 3 misaligned by 1: 10.1 sec ds_read_b96 misaligned by 2: 4.1 sec ds_read_b32 + ds_read_b64 misaligned by 2: 7.2 sec ds_read_b32 * 3 misaligned by 2: 10.1 sec ds_read_b96 misaligned by 4: 4.1 sec ds_read_b32 + ds_read_b64 misaligned by 4: 2.6 sec ds_read_b32 * 3 misaligned by 4: 2.6 sec ds_read_b96 misaligned by 8: 4.1 sec ds_read_b32 + ds_read_b64 misaligned by 8: 4.9 sec ds_read_b32 * 3 misaligned by 8: 2.6 sec Using platform: AMD Accelerated Parallel Processing Using device: gfx1030 ds_write_b96 aligned: 4.1 sec ds_write_b32 + ds_write_b64 aligned: 13.0 sec ds_write_b32 * 3 aligned: 4.5 sec ds_write_b96 misaligned by 1: 12.5 sec ds_write_b32 + ds_write_b64 misaligned by 1: 22.0 sec ds_write_b32 * 3 misaligned by 1: 31.5 sec ds_write_b96 misaligned by 2: 12.4 sec ds_write_b32 + ds_write_b64 misaligned by 2: 22.0 sec ds_write_b32 * 3 misaligned by 2: 31.5 sec ds_write_b96 misaligned by 4: 12.4 sec ds_write_b32 + ds_write_b64 misaligned by 4: 4.0 sec ds_write_b32 * 3 misaligned by 4: 4.5 sec ds_write_b96 misaligned by 8: 12.4 sec ds_write_b32 + ds_write_b64 misaligned by 8: 13.0 sec ds_write_b32 * 3 misaligned by 8: 4.5 sec ds_read_b96 aligned: 3.8 sec ds_read_b32 + ds_read_b64 aligned: 12.8 sec ds_read_b32 * 3 aligned: 4.4 sec ds_read_b96 misaligned by 1: 10.9 sec ds_read_b32 + ds_read_b64 misaligned by 1: 21.8 sec ds_read_b32 * 3 misaligned by 1: 31.5 sec ds_read_b96 misaligned by 2: 10.9 sec ds_read_b32 + ds_read_b64 misaligned by 2: 21.9 sec ds_read_b32 * 3 misaligned by 2: 31.5 sec ds_read_b96 misaligned by 4: 10.9 sec ds_read_b32 + ds_read_b64 misaligned by 4: 3.8 sec ds_read_b32 * 3 misaligned by 4: 4.5 sec ds_read_b96 misaligned by 8: 10.9 sec ds_read_b32 + ds_read_b64 misaligned by 8: 12.8 sec ds_read_b32 * 3 misaligned by 8: 4.5 sec ``` Fixes: SWDEV-330802 Differential Revision: https://reviews.llvm.org/D123524 show more ...
# e66f0edb	07-Apr-2022	Stanislav Mekhanoshin <[email protected]>	[AMDGPU] Split unaligned LDS access instead of scalarizing There is no need to fully scalarize an unaligned operation in some case, just split it to alignment. Differential Revision: https://review [AMDGPU] Split unaligned LDS access instead of scalarizing There is no need to fully scalarize an unaligned operation in some case, just split it to alignment. Differential Revision: https://reviews.llvm.org/D123330 show more ...
Revision tags: llvmorg-14.0.0, llvmorg-14.0.0-rc4, llvmorg-14.0.0-rc3, llvmorg-14.0.0-rc2, llvmorg-14.0.0-rc1, llvmorg-15-init, llvmorg-13.0.1, llvmorg-13.0.1-rc3
# d8b69040	20-Jan-2022	Abinav Puthan Purayil <[email protected]>	[AMDGPU] Set MemoryVT for truncstores in tblgen. GlobalISelEmitter was skipping these patterns when its predicates were checked. This patch should allow us to select d16_hi stores in GlobalISel. Di [AMDGPU] Set MemoryVT for truncstores in tblgen. GlobalISelEmitter was skipping these patterns when its predicates were checked. This patch should allow us to select d16_hi stores in GlobalISel. Differential Revision: https://reviews.llvm.org/D117762 show more ...
Revision tags: llvmorg-13.0.1-rc2, llvmorg-13.0.1-rc1, llvmorg-13.0.0, llvmorg-13.0.0-rc4, llvmorg-13.0.0-rc3, llvmorg-13.0.0-rc2, llvmorg-13.0.0-rc1, llvmorg-14-init
# c2229724	26-Jul-2021	Matt Arsenault <[email protected]>	AMDGPU/GlobalISel: Stop using NarrowScalar/FewerElements for unaligned splitting These actions should only be used for adjusting the register types (and the memory type as needed to satisfy the regi AMDGPU/GlobalISel: Stop using NarrowScalar/FewerElements for unaligned splitting These actions should only be used for adjusting the register types (and the memory type as needed to satisfy the register type). Unaligned accesses should be split as a type of lowering. This has the effect of improving the code in many cases since now we produce zextloads instead of separate loads with ands. The load/store legality rules still seem far more complicated than necessary though. show more ...
# da067ed5	10-Nov-2021	Austin Kerbow <[email protected]>	[AMDGPU] Set most sched model resource's BufferSize to one Using a BufferSize of one for memory ProcResources will result in better ILP since it more accurately models the dependencies between memor [AMDGPU] Set most sched model resource's BufferSize to one Using a BufferSize of one for memory ProcResources will result in better ILP since it more accurately models the dependencies between memory ops and their consumers on an in-order processor. After this change, the scheduler will treat the data edges from loads as blocking so that stalls are guaranteed when waiting for data to be retreaved from memory. Since we don't actually track waitcnt here, this should do a better job at modeling their behavior. Practically, this means that the scheduler will trigger the 'STALL' heuristic more often. This type of change needs to be evaluated experimentally. Preliminary results are positive. Fixes: SWDEV-282962 Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D114777 show more ...
# d2e66d7f	06-Sep-2021	Konstantin Schwarz <[email protected]>	[GlobalISel] Add a combine for and(load , mask) -> zextload This only handles simple masks, not shifted masks, for now. Reviewed By: aemerson Differential Revision: https://reviews.llvm.org/D109357
# 3ce1b963	08-Sep-2021	Joe Nash <[email protected]>	[AMDGPU] Switch PostRA sched to MachineSched Use GCNHazardRecognizer in postra sched. Updated tests for the new schedules. Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D1095 [AMDGPU] Switch PostRA sched to MachineSched Use GCNHazardRecognizer in postra sched. Updated tests for the new schedules. Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D109536 Change-Id: Ia86ba2ae168f12fb34b4d8efdab491f84d936cde show more ...
Revision tags: llvmorg-12.0.1, llvmorg-12.0.1-rc4, llvmorg-12.0.1-rc3, llvmorg-12.0.1-rc2
# 31a9659d	07-Jun-2021	Matt Arsenault <[email protected]>	GlobalISel: Avoid use of G_INSERT in insertParts G_INSERT legalization is incomplete and doesn't work very well. Instead try to use sequences of G_MERGE_VALUES/G_UNMERGE_VALUES padding with undef va GlobalISel: Avoid use of G_INSERT in insertParts G_INSERT legalization is incomplete and doesn't work very well. Instead try to use sequences of G_MERGE_VALUES/G_UNMERGE_VALUES padding with undef values (although this can get pretty large). For the case of load/store narrowing, this is still performing the load/stores in irregularly sized pieces. It might be cleaner to split this down into equal sized pieces, and rely on load/store merging to optimize it. show more ...
Revision tags: llvmorg-12.0.1-rc1
# ac64995c	08-Apr-2021	hsmahesha <[email protected]>	[AMDGPU] Only use ds_read/write_b128 for alignment >= 16 PS: Submitting on behalf of Jay. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D100008
# d5fee599	08-Apr-2021	hsmahesha <[email protected]>	[AMDGPU] Add some exhaustive ds read/write alignment tests PS: Submitting on behalf of Jay. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D100007