|
Revision tags: llvmorg-20.1.0, llvmorg-20.1.0-rc3, llvmorg-20.1.0-rc2, llvmorg-20.1.0-rc1, llvmorg-21-init, llvmorg-19.1.7, llvmorg-19.1.6, llvmorg-19.1.5, llvmorg-19.1.4, llvmorg-19.1.3, llvmorg-19.1.2, llvmorg-19.1.1, llvmorg-19.1.0, llvmorg-19.1.0-rc4, llvmorg-19.1.0-rc3, llvmorg-19.1.0-rc2, llvmorg-19.1.0-rc1, llvmorg-20-init, llvmorg-18.1.8, llvmorg-18.1.7, llvmorg-18.1.6, llvmorg-18.1.5, llvmorg-18.1.4, llvmorg-18.1.3, llvmorg-18.1.2, llvmorg-18.1.1, llvmorg-18.1.0, llvmorg-18.1.0-rc4, llvmorg-18.1.0-rc3, llvmorg-18.1.0-rc2, llvmorg-18.1.0-rc1, llvmorg-19-init, llvmorg-17.0.6, llvmorg-17.0.5, llvmorg-17.0.4, llvmorg-17.0.3, llvmorg-17.0.2, llvmorg-17.0.1, llvmorg-17.0.0, llvmorg-17.0.0-rc4, llvmorg-17.0.0-rc3, llvmorg-17.0.0-rc2, llvmorg-17.0.0-rc1, llvmorg-18-init, llvmorg-16.0.6, llvmorg-16.0.5, llvmorg-16.0.4, llvmorg-16.0.3, llvmorg-16.0.2, llvmorg-16.0.1, llvmorg-16.0.0, llvmorg-16.0.0-rc4, llvmorg-16.0.0-rc3, llvmorg-16.0.0-rc2, llvmorg-16.0.0-rc1, llvmorg-17-init, llvmorg-15.0.7, llvmorg-15.0.6, llvmorg-15.0.5, llvmorg-15.0.4, llvmorg-15.0.3, llvmorg-15.0.2, llvmorg-15.0.1, llvmorg-15.0.0, llvmorg-15.0.0-rc3, llvmorg-15.0.0-rc2, llvmorg-15.0.0-rc1, llvmorg-16-init, llvmorg-14.0.6, llvmorg-14.0.5, llvmorg-14.0.4, llvmorg-14.0.3, llvmorg-14.0.2 |
|
| #
ac94073d |
| 12-Apr-2022 |
Stanislav Mekhanoshin <[email protected]> |
[AMDGPU] Refine 64 bit misaligned LDS ops selection
Here is the performance data:
```
Using platform: AMD Accelerated Parallel Processing
Using device: gfx900:xnack-

ds_write_b64 aligned by 8:     3.2 sec
ds_write2_b32 aligned by 8:    3.2 sec
ds_write_b16 * 4 aligned by 8: 7.0 sec
ds_write_b8 * 8 aligned by 8:  13.2 sec
ds_write_b64 aligned by 1:     7.3 sec
ds_write2_b32 aligned by 1:    7.5 sec
ds_write_b16 * 4 aligned by 1: 14.0 sec
ds_write_b8 * 8 aligned by 1:  13.2 sec
ds_write_b64 aligned by 2:     7.3 sec
ds_write2_b32 aligned by 2:    7.5 sec
ds_write_b16 * 4 aligned by 2: 7.1 sec
ds_write_b8 * 8 aligned by 2:  13.3 sec
ds_write_b64 aligned by 4:     4.6 sec
ds_write2_b32 aligned by 4:    3.2 sec
ds_write_b16 * 4 aligned by 4: 7.1 sec
ds_write_b8 * 8 aligned by 4:  13.3 sec
ds_read_b64 aligned by 8:      2.3 sec
ds_read2_b32 aligned by 8:     2.2 sec
ds_read_u16 * 4 aligned by 8:  4.8 sec
ds_read_u8 * 8 aligned by 8:   8.6 sec
ds_read_b64 aligned by 1:      4.4 sec
ds_read2_b32 aligned by 1:     7.3 sec
ds_read_u16 * 4 aligned by 1:  14.0 sec
ds_read_u8 * 8 aligned by 1:   8.7 sec
ds_read_b64 aligned by 2:      4.4 sec
ds_read2_b32 aligned by 2:     7.3 sec
ds_read_u16 * 4 aligned by 2:  4.8 sec
ds_read_u8 * 8 aligned by 2:   8.7 sec
ds_read_b64 aligned by 4:      4.4 sec
ds_read2_b32 aligned by 4:     2.3 sec
ds_read_u16 * 4 aligned by 4:  4.8 sec
ds_read_u8 * 8 aligned by 4:   8.7 sec

Using platform: AMD Accelerated Parallel Processing
Using device: gfx1030

ds_write_b64 aligned by 8:     4.4 sec
ds_write2_b32 aligned by 8:    4.3 sec
ds_write_b16 * 4 aligned by 8: 7.9 sec
ds_write_b8 * 8 aligned by 8:  13.0 sec
ds_write_b64 aligned by 1:     23.2 sec
ds_write2_b32 aligned by 1:    23.1 sec
ds_write_b16 * 4 aligned by 1: 44.0 sec
ds_write_b8 * 8 aligned by 1:  13.0 sec
ds_write_b64 aligned by 2:     23.2 sec
ds_write2_b32 aligned by 2:    23.1 sec
ds_write_b16 * 4 aligned by 2: 7.9 sec
ds_write_b8 * 8 aligned by 2:  13.1 sec
ds_write_b64 aligned by 4:     13.5 sec
ds_write2_b32 aligned by 4:    4.3 sec
ds_write_b16 * 4 aligned by 4: 7.9 sec
ds_write_b8 * 8 aligned by 4:  13.1 sec
ds_read_b64 aligned by 8:      3.5 sec
ds_read2_b32 aligned by 8:     3.4 sec
ds_read_u16 * 4 aligned by 8:  5.3 sec
ds_read_u8 * 8 aligned by 8:   8.5 sec
ds_read_b64 aligned by 1:      13.1 sec
ds_read2_b32 aligned by 1:     22.7 sec
ds_read_u16 * 4 aligned by 1:  43.9 sec
ds_read_u8 * 8 aligned by 1:   7.9 sec
ds_read_b64 aligned by 2:      13.1 sec
ds_read2_b32 aligned by 2:     22.7 sec
ds_read_u16 * 4 aligned by 2:  5.6 sec
ds_read_u8 * 8 aligned by 2:   7.9 sec
ds_read_b64 aligned by 4:      13.1 sec
ds_read2_b32 aligned by 4:     3.4 sec
ds_read_u16 * 4 aligned by 4:  5.6 sec
ds_read_u8 * 8 aligned by 4:   7.9 sec
```
GFX10 exposes a different pattern for sub-DWORD load/store performance than GFX9. On GFX9 it is faster to issue a single unaligned load or store than a fully split b8 access, whereas on GFX10 even a full split is better. However, this gain is only theoretical, because splitting an access down to the sub-dword level requires more registers and packing/unpacking logic. Ignoring that option, it is better to use a single 64-bit instruction on misaligned data, with the exception of 4-byte-aligned data, where ds_read2_b32/ds_write2_b32 is better.
Differential Revision: https://reviews.llvm.org/D123956
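The selection rule described above — prefer a single b64 access on misaligned data, but prefer ds_read2_b32/ds_write2_b32 when the data is dword-aligned — can be sketched as a small helper. This is a hypothetical model of the heuristic, not the actual SITargetLowering code; the function and return names are illustrative:

```python
def select_lds_op_64(align: int, is_read: bool) -> str:
    """Pick a 64-bit LDS access strategy from the alignment, following the
    timings above: read2/write2 wins at 4-byte alignment (and ties at 8),
    while a single b64 beats a split access at 1- or 2-byte alignment."""
    if align % 4 == 0:
        # Dword-aligned: two packed 32-bit accesses are at least as fast.
        return "ds_read2_b32" if is_read else "ds_write2_b32"
    # Misaligned (1- or 2-byte): a single unaligned b64 beats a full split.
    return "ds_read_b64" if is_read else "ds_write_b64"
```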
|
|
Revision tags: llvmorg-14.0.1, llvmorg-14.0.0, llvmorg-14.0.0-rc4, llvmorg-14.0.0-rc3, llvmorg-14.0.0-rc2, llvmorg-14.0.0-rc1, llvmorg-15-init |
|
| #
359a792f |
| 28-Jan-2022 |
Jay Foad <[email protected]> |
[AMDGPU] SILoadStoreOptimizer: avoid unbounded register pressure increases
Previously when combining two loads this pass would sink the first one down to the second one, putting the combined load where the second one was. It would also sink any intervening instructions which depended on the first load down to just after the combined load.
For example, if we started with this sequence of instructions (code flowing from left to right):
X A B C D E F Y
After combining loads X and Y into XY we might end up with:
A B C D E F XY
But if B, D, and F depended on X, we would get:
A C E XY B D F
Now if the original code had some short disjoint live ranges from A to B, C to D and E to F, in the transformed code these live ranges will be long and overlapping. In this way a single merge of two loads could cause an unbounded increase in register pressure.
To fix this, change the way that loads are moved in order to merge them, so that:
- The second load is moved up to the first one. (But when merging stores, we still move the first store down to the second one.)
- Intervening instructions are never moved.
- Instead, if we find an intervening instruction that would need to be moved, give up on the merge. But this case should now be pretty rare, because normal stores have no outputs, and normal loads only have address register inputs, which will be identical for any pair of loads that we try to merge.
As well as fixing the unbounded register pressure increase problem, moving loads up and stores down seems like it should usually be a win for memory latency reasons.
Differential Revision: https://reviews.llvm.org/D119006
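The new policy — hoist the second load up to the first, never move intervening instructions, and give up when one would have to move — can be sketched as a dependence check over the intervening window. This is a toy model, not the actual SILoadStoreOptimizer API; the dict-based instruction representation and register names are illustrative:

```python
def can_hoist_second_load(window, load):
    """Decide whether the later of two candidate loads can be hoisted past the
    intervening instructions to sit next to the earlier one.

    `window` is the list of intervening instructions; each instruction,
    including `load`, is a dict with 'defs' and 'uses' register-name sets."""
    for insn in window:
        # Hoisting is illegal if an intervening instruction writes a register
        # the load reads (its address operands would change)...
        if insn['defs'] & load['uses']:
            return False
        # ...or reads/writes a register the load defines.
        if (insn['defs'] | insn['uses']) & load['defs']:
            return False
    # Nothing in between conflicts, so the merge can proceed without moving
    # any intervening instruction (and without stretching live ranges).
    return True
```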
|
|
Revision tags: llvmorg-13.0.1, llvmorg-13.0.1-rc3 |
|
| #
89c447e4 |
| 16-Jan-2022 |
Matt Arsenault <[email protected]> |
AMDGPU: Stop reserving 36-bytes before kernel arguments for amdpal
This was inheriting the mesa behavior, and as far as I know nobody is using opencl kernels with amdpal. The isMesaKernel check was irrelevant because this property needs to hold for all functions.
|
|
Revision tags: llvmorg-13.0.1-rc2, llvmorg-13.0.1-rc1, llvmorg-13.0.0, llvmorg-13.0.0-rc4, llvmorg-13.0.0-rc3, llvmorg-13.0.0-rc2 |
|
| #
729bf9b2 |
| 14-Aug-2021 |
Matt Arsenault <[email protected]> |
AMDGPU: Enable fixed function ABI by default
Code using indirect calls is broken without this, and there isn't really much value in supporting the old attempt to vary the argument placement based on uses. This resulted in more argument shuffling code anyway.
Also, have the option stop implying that all inputs need to be passed. This will now rely on the amdgpu-no-* attributes to avoid passing unnecessary values.
|
| #
da067ed5 |
| 10-Nov-2021 |
Austin Kerbow <[email protected]> |
[AMDGPU] Set most sched model resource's BufferSize to one
Using a BufferSize of one for memory ProcResources will result in better ILP since it more accurately models the dependencies between memory ops and their consumers on an in-order processor. After this change, the scheduler will treat the data edges from loads as blocking so that stalls are guaranteed when waiting for data to be retrieved from memory. Since we don't actually track waitcnt here, this should do a better job at modeling their behavior.
Practically, this means that the scheduler will trigger the 'STALL' heuristic more often.
This type of change needs to be evaluated experimentally. Preliminary results are positive.
Fixes: SWDEV-282962
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D114777
|
| #
3ce1b963 |
| 08-Sep-2021 |
Joe Nash <[email protected]> |
[AMDGPU] Switch PostRA sched to MachineSched
Use GCNHazardRecognizer in postra sched. Updated tests for the new schedules.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D109536
Change-Id: Ia86ba2ae168f12fb34b4d8efdab491f84d936cde
|
| #
722b8e0e |
| 14-Aug-2021 |
Matt Arsenault <[email protected]> |
AMDGPU: Invert ABI attribute handling
Previously we assumed all callable functions did not need any implicitly passed inputs, and added attributes to functions to indicate when they were necessary. Requiring attributes for correctness is pretty ugly, and it makes supporting indirect and external calls more complicated.
This inverts the direction of the attributes, so an undecorated function is assumed to need all implicit inputs. This enables AMDGPUAttributor by default to mark when functions are proven to not need a given input. This strips the equivalent functionality from the legacy AMDGPUAnnotateKernelFeatures pass.
However, AMDGPUAnnotateKernelFeatures is not fully removed at this point although it should be in the future. It is still necessary for the two hacky amdgpu-calls and amdgpu-stack-objects attributes, which would be better served by a trivial analysis on the IR during selection. Additionally, AMDGPUAnnotateKernelFeatures still redundantly handles the uniform-work-group-size attribute to be removed in a future commit.
At this point when not using -amdgpu-fixed-function-abi, we are still modifying the ABI based on these newly negated attributes. In the future, this option will be removed and the locations for implicit inputs will always be fixed. We will then use the new attributes to avoid passing the values when unnecessary.
|
|
Revision tags: llvmorg-13.0.0-rc1, llvmorg-14-init, llvmorg-12.0.1, llvmorg-12.0.1-rc4, llvmorg-12.0.1-rc3, llvmorg-12.0.1-rc2 |
|
| #
2b43209e |
| 15-Jun-2021 |
Stanislav Mekhanoshin <[email protected]> |
[AMDGPU] Propagate LDS align into instructions
Differential Revision: https://reviews.llvm.org/D104316
|
| #
05289dfb |
| 07-Jun-2021 |
Stanislav Mekhanoshin <[email protected]> |
[AMDGPU] Handle constant LDS uses from different kernels
This allows lowering an LDS variable into a kernel structure even if it is used via a constant expression from different kernels.
Differential Revision: https://reviews.llvm.org/D103655
|
| #
52ffbfdf |
| 07-Jun-2021 |
hsmahesha <[email protected]> |
[AMDGPU] Increase alignment of LDS globals if necessary before LDS lowering.
Before packing LDS globals into a sorted structure, make sure that their alignment is properly updated based on their size. This will make sure that the members of the sorted structure are properly aligned, and hence it will further reduce the probability of unaligned LDS access.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D103261
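The alignment bump described above — raising each global's alignment toward a natural alignment for its size before packing — can be sketched as follows. This is a hypothetical policy mirroring the description, not the actual pass; the 16-byte cap is an assumption tied to the largest ds_read/write_b128 access:

```python
def bump_lds_alignment(size_in_bytes: int, current_align: int) -> int:
    """Raise a global's alignment to the smallest power of two covering its
    size, capped at 16 bytes, but never lower an existing alignment."""
    natural = 1
    while natural < size_in_bytes and natural < 16:
        natural *= 2
    return max(current_align, natural)
```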
|
| #
753437fc |
| 04-Jun-2021 |
hsmahesha <[email protected]> |
Revert "[AMDGPU] Increase alignment of LDS globals if necessary before LDS lowering."
This reverts commit d71ff907ef23eaef86ad66ba2d711e4986cd6cb2.
|
| #
d71ff907 |
| 04-Jun-2021 |
hsmahesha <[email protected]> |
[AMDGPU] Increase alignment of LDS globals if necessary before LDS lowering.
Before packing LDS globals into a sorted structure, make sure that their alignment is properly updated based on their size. This will make sure that the members of the sorted structure are properly aligned, and hence it will further reduce the probability of unaligned LDS access.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D103261
|
| #
5e2facb9 |
| 26-May-2021 |
Stanislav Mekhanoshin <[email protected]> |
[AMDGPU] Fix kernel LDS lowering for constants
There is a trivial but severe bug in the recent code collecting LDS globals used by a kernel: it aborts the scan on the first constant without scanning further uses. That leads to LDS overallocation with multiple kernels in certain cases.
Differential Revision: https://reviews.llvm.org/D103190
|
|
Revision tags: llvmorg-12.0.1-rc1 |
|
| #
8de4db69 |
| 19-May-2021 |
Stanislav Mekhanoshin <[email protected]> |
[AMDGPU] Lower kernel LDS into a sorted structure
Differential Revision: https://reviews.llvm.org/D102954
|
| #
ac64995c |
| 08-Apr-2021 |
hsmahesha <[email protected]> |
[AMDGPU] Only use ds_read/write_b128 for alignment >= 16
PS: Submitting on behalf of Jay.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D100008
|
|
Revision tags: llvmorg-12.0.0, llvmorg-12.0.0-rc5, llvmorg-12.0.0-rc4, llvmorg-12.0.0-rc3, llvmorg-12.0.0-rc2, llvmorg-11.1.0, llvmorg-11.1.0-rc3, llvmorg-12.0.0-rc1, llvmorg-13-init, llvmorg-11.1.0-rc2, llvmorg-11.1.0-rc1, llvmorg-11.0.1, llvmorg-11.0.1-rc2 |
|
| #
2291bd13 |
| 30-Nov-2020 |
Austin Kerbow <[email protected]> |
[AMDGPU] Update subtarget features for new target ID support
Support for XNACK and SRAMECC is not static on some GPUs. We must be able to differentiate between different scenarios for these dynamic subtarget features.
The possible settings are:
- Unsupported: The GPU has no support for XNACK/SRAMECC.
- Any: Preference is unspecified. Use conservative settings that can run anywhere.
- Off: Request support for XNACK/SRAMECC Off.
- On: Request support for XNACK/SRAMECC On.
GCNSubtarget will track the four options based on the following criteria. If the subtarget does not support XNACK/SRAMECC we say the setting is "Unsupported". If no subtarget features for XNACK/SRAMECC are requested we must support "Any" mode. If the subtarget features XNACK/SRAMECC exist in the feature string when initializing the subtarget, the settings are "On/Off".
The defaults are updated to be conservatively correct, meaning if no setting for XNACK or SRAMECC is explicitly requested, defaults will be used which generate code that can be run anywhere. This corresponds to the "Any" setting.
Differential Revision: https://reviews.llvm.org/D85882
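The resolution criteria above can be sketched as a small lookup. This is a toy model of the described logic, not the actual GCNSubtarget code; the function name and the `+feature`/`-feature` string convention are illustrative (though they follow the usual LLVM feature-string syntax):

```python
def resolve_target_id_setting(supported: bool, feature_string: str, name: str) -> str:
    """Resolve an XNACK/SRAMECC setting from the subtarget's capabilities and
    the requested feature string, per the criteria in the commit message."""
    if not supported:
        # The GPU cannot do it at all, regardless of what was requested.
        return "Unsupported"
    if f"+{name}" in feature_string:
        return "On"
    if f"-{name}" in feature_string:
        return "Off"
    # Nothing requested: use conservative settings that run anywhere.
    return "Any"
```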
|
| #
1ebe86ad |
| 05-Jan-2021 |
Mircea Trofin <[email protected]> |
[NFC] Removed unused prefixes in test/CodeGen/AMDGPU
More patches to follow.
Differential Revision: https://reviews.llvm.org/D94121
|
|
Revision tags: llvmorg-11.0.1-rc1 |
|
| #
d2e52eec |
| 10-Nov-2020 |
Matt Arsenault <[email protected]> |
AMDGPU: Select global saddr mode from SGPR pointer
Use the 64-bit SGPR base with a 0 offset, since it takes one fewer instruction to materialize the 0 than the 64-bit copy.
|
| #
040c5027 |
| 02-Nov-2020 |
Jay Foad <[email protected]> |
[AMDGPU] Fix ds_read2/write2 with unaligned offsets
These instructions use a scaled offset. We were wrongly selecting them even when the required offset was not a multiple of the scale factor.
Differential Revision: https://reviews.llvm.org/D90607
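The legality condition being fixed — a scaled offset must be an exact multiple of the scale factor, and the scaled value must fit the encoding — can be sketched as a simple check. This is an illustrative model, not the actual selection code; the 8-bit offset-field width is an assumption based on the ds_read2/write2 encoding:

```python
def can_use_ds_read2(byte_offset: int, element_size: int) -> bool:
    """ds_read2/write2 encode their offsets scaled by the element size, so the
    byte offset must be a multiple of the scale and the scaled value must fit
    in the (assumed) 8-bit offset field."""
    if byte_offset % element_size != 0:
        # The bug being fixed: selecting these ops for non-multiple offsets.
        return False
    return 0 <= byte_offset // element_size <= 255
```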
|
| #
32897c05 |
| 03-Nov-2020 |
Jay Foad <[email protected]> |
[AMDGPU] Specify a triple to avoid codegen changes depending on host OS
|
| #
0892d2a3 |
| 02-Nov-2020 |
Jay Foad <[email protected]> |
Revert "Fix ds_read2/write2 unaligned offsets"
This reverts commit 2e7e898c8f0b38dc11fbce2553fc715067aaf42f.
It was committed by mistake.
|
| #
2e7e898c |
| 02-Nov-2020 |
Jay Foad <[email protected]> |
Fix ds_read2/write2 unaligned offsets
|
| #
c8cbaa15 |
| 02-Nov-2020 |
Jay Foad <[email protected]> |
[AMDGPU] Precommit ds_read2/write2 with unaligned offset tests. NFC.
|
| #
f3881d65 |
| 02-Nov-2020 |
Jay Foad <[email protected]> |
[AMDGPU] Generate test checks. NFC.
|
| #
d3f13f3e |
| 02-Nov-2020 |
Jay Foad <[email protected]> |
[AMDGPU] Remove a comment. NFC.
This was obsoleted by f78687df9b7, which added gfx9 aligned/unaligned tests.
|