|
Revision tags: llvmorg-20.1.0, llvmorg-20.1.0-rc3, llvmorg-20.1.0-rc2, llvmorg-20.1.0-rc1, llvmorg-21-init, llvmorg-19.1.7, llvmorg-19.1.6, llvmorg-19.1.5, llvmorg-19.1.4, llvmorg-19.1.3, llvmorg-19.1.2, llvmorg-19.1.1, llvmorg-19.1.0, llvmorg-19.1.0-rc4, llvmorg-19.1.0-rc3, llvmorg-19.1.0-rc2, llvmorg-19.1.0-rc1, llvmorg-20-init, llvmorg-18.1.8, llvmorg-18.1.7, llvmorg-18.1.6, llvmorg-18.1.5, llvmorg-18.1.4, llvmorg-18.1.3, llvmorg-18.1.2, llvmorg-18.1.1, llvmorg-18.1.0, llvmorg-18.1.0-rc4, llvmorg-18.1.0-rc3, llvmorg-18.1.0-rc2, llvmorg-18.1.0-rc1, llvmorg-19-init, llvmorg-17.0.6, llvmorg-17.0.5, llvmorg-17.0.4, llvmorg-17.0.3, llvmorg-17.0.2, llvmorg-17.0.1, llvmorg-17.0.0, llvmorg-17.0.0-rc4, llvmorg-17.0.0-rc3, llvmorg-17.0.0-rc2, llvmorg-17.0.0-rc1, llvmorg-18-init, llvmorg-16.0.6, llvmorg-16.0.5, llvmorg-16.0.4, llvmorg-16.0.3, llvmorg-16.0.2, llvmorg-16.0.1, llvmorg-16.0.0, llvmorg-16.0.0-rc4, llvmorg-16.0.0-rc3, llvmorg-16.0.0-rc2, llvmorg-16.0.0-rc1, llvmorg-17-init, llvmorg-15.0.7, llvmorg-15.0.6, llvmorg-15.0.5, llvmorg-15.0.4, llvmorg-15.0.3, llvmorg-15.0.2, llvmorg-15.0.1, llvmorg-15.0.0, llvmorg-15.0.0-rc3, llvmorg-15.0.0-rc2, llvmorg-15.0.0-rc1, llvmorg-16-init, llvmorg-14.0.6, llvmorg-14.0.5, llvmorg-14.0.4, llvmorg-14.0.3, llvmorg-14.0.2 |
|
| #
ac94073d |
| 12-Apr-2022 |
Stanislav Mekhanoshin <[email protected]> |
[AMDGPU] Refine 64 bit misaligned LDS ops selection
Here is the performance data:
```
Using platform: AMD Accelerated Parallel Processing
Using device: gfx900:xnack-

ds_write_b64 aligned by 8:     3.2 sec
ds_write2_b32 aligned by 8:    3.2 sec
ds_write_b16 * 4 aligned by 8: 7.0 sec
ds_write_b8 * 8 aligned by 8:  13.2 sec
ds_write_b64 aligned by 1:     7.3 sec
ds_write2_b32 aligned by 1:    7.5 sec
ds_write_b16 * 4 aligned by 1: 14.0 sec
ds_write_b8 * 8 aligned by 1:  13.2 sec
ds_write_b64 aligned by 2:     7.3 sec
ds_write2_b32 aligned by 2:    7.5 sec
ds_write_b16 * 4 aligned by 2: 7.1 sec
ds_write_b8 * 8 aligned by 2:  13.3 sec
ds_write_b64 aligned by 4:     4.6 sec
ds_write2_b32 aligned by 4:    3.2 sec
ds_write_b16 * 4 aligned by 4: 7.1 sec
ds_write_b8 * 8 aligned by 4:  13.3 sec
ds_read_b64 aligned by 8:      2.3 sec
ds_read2_b32 aligned by 8:     2.2 sec
ds_read_u16 * 4 aligned by 8:  4.8 sec
ds_read_u8 * 8 aligned by 8:   8.6 sec
ds_read_b64 aligned by 1:      4.4 sec
ds_read2_b32 aligned by 1:     7.3 sec
ds_read_u16 * 4 aligned by 1:  14.0 sec
ds_read_u8 * 8 aligned by 1:   8.7 sec
ds_read_b64 aligned by 2:      4.4 sec
ds_read2_b32 aligned by 2:     7.3 sec
ds_read_u16 * 4 aligned by 2:  4.8 sec
ds_read_u8 * 8 aligned by 2:   8.7 sec
ds_read_b64 aligned by 4:      4.4 sec
ds_read2_b32 aligned by 4:     2.3 sec
ds_read_u16 * 4 aligned by 4:  4.8 sec
ds_read_u8 * 8 aligned by 4:   8.7 sec

Using platform: AMD Accelerated Parallel Processing
Using device: gfx1030

ds_write_b64 aligned by 8:     4.4 sec
ds_write2_b32 aligned by 8:    4.3 sec
ds_write_b16 * 4 aligned by 8: 7.9 sec
ds_write_b8 * 8 aligned by 8:  13.0 sec
ds_write_b64 aligned by 1:     23.2 sec
ds_write2_b32 aligned by 1:    23.1 sec
ds_write_b16 * 4 aligned by 1: 44.0 sec
ds_write_b8 * 8 aligned by 1:  13.0 sec
ds_write_b64 aligned by 2:     23.2 sec
ds_write2_b32 aligned by 2:    23.1 sec
ds_write_b16 * 4 aligned by 2: 7.9 sec
ds_write_b8 * 8 aligned by 2:  13.1 sec
ds_write_b64 aligned by 4:     13.5 sec
ds_write2_b32 aligned by 4:    4.3 sec
ds_write_b16 * 4 aligned by 4: 7.9 sec
ds_write_b8 * 8 aligned by 4:  13.1 sec
ds_read_b64 aligned by 8:      3.5 sec
ds_read2_b32 aligned by 8:     3.4 sec
ds_read_u16 * 4 aligned by 8:  5.3 sec
ds_read_u8 * 8 aligned by 8:   8.5 sec
ds_read_b64 aligned by 1:      13.1 sec
ds_read2_b32 aligned by 1:     22.7 sec
ds_read_u16 * 4 aligned by 1:  43.9 sec
ds_read_u8 * 8 aligned by 1:   7.9 sec
ds_read_b64 aligned by 2:      13.1 sec
ds_read2_b32 aligned by 2:     22.7 sec
ds_read_u16 * 4 aligned by 2:  5.6 sec
ds_read_u8 * 8 aligned by 2:   7.9 sec
ds_read_b64 aligned by 4:      13.1 sec
ds_read2_b32 aligned by 4:     3.4 sec
ds_read_u16 * 4 aligned by 4:  5.6 sec
ds_read_u8 * 8 aligned by 4:   7.9 sec
```
GFX10 exposes a different pattern for sub-DWORD load/store performance than GFX9. On GFX9 it is faster to issue a single unaligned load or store than a fully split b8 access, whereas on GFX10 even a full split is better. However, this gain is only theoretical, because splitting an access down to the sub-dword level requires more registers and packing/unpacking logic. Ignoring that option, it is better to use a single 64-bit instruction on misaligned data, with the exception of 4-byte-aligned data, where ds_read2_b32/ds_write2_b32 is better.
Differential Revision: https://reviews.llvm.org/D123956
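The selection rule described above — prefer a single b64 access on misaligned data, but prefer ds_read2_b32/ds_write2_b32 when the data is dword-aligned — can be sketched as a small helper. This is a hypothetical model of the heuristic, not the actual SITargetLowering code; the function and return names are illustrative:

```python
def select_lds_op_64(align: int, is_read: bool) -> str:
    """Pick a 64-bit LDS access strategy from the alignment, following the
    timings above: read2/write2 wins at 4-byte alignment (and ties at 8),
    while a single b64 beats a split access at 1- or 2-byte alignment."""
    if align % 4 == 0:
        # Dword-aligned: two packed 32-bit accesses are at least as fast.
        return "ds_read2_b32" if is_read else "ds_write2_b32"
    # Misaligned (1- or 2-byte): a single unaligned b64 beats a full split.
    return "ds_read_b64" if is_read else "ds_write_b64"
```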
|
|
Revision tags: llvmorg-14.0.1, llvmorg-14.0.0, llvmorg-14.0.0-rc4, llvmorg-14.0.0-rc3, llvmorg-14.0.0-rc2, llvmorg-14.0.0-rc1, llvmorg-15-init |
|
| #
359a792f |
| 28-Jan-2022 |
Jay Foad <[email protected]> |
[AMDGPU] SILoadStoreOptimizer: avoid unbounded register pressure increases
Previously when combining two loads this pass would sink the first one down to the second one, putting the combined load where the second one was. It would also sink any intervening instructions which depended on the first load down to just after the combined load.
For example, if we started with this sequence of instructions (code flowing from left to right):
X A B C D E F Y
After combining loads X and Y into XY we might end up with:
A B C D E F XY
But if B, D, and F depended on X, we would get:
A C E XY B D F
Now if the original code had some short disjoint live ranges from A to B, C to D and E to F, in the transformed code these live ranges will be long and overlapping. In this way a single merge of two loads could cause an unbounded increase in register pressure.
To fix this, change the way that loads are moved in order to merge them, so that:
- The second load is moved up to the first one. (But when merging stores, we still move the first store down to the second one.)
- Intervening instructions are never moved.
- Instead, if we find an intervening instruction that would need to be moved, give up on the merge. But this case should now be pretty rare, because normal stores have no outputs, and normal loads only have address register inputs, which will be identical for any pair of loads that we try to merge.
As well as fixing the unbounded register pressure increase problem, moving loads up and stores down seems like it should usually be a win for memory latency reasons.
Differential Revision: https://reviews.llvm.org/D119006
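The new policy — hoist the second load up to the first, never move intervening instructions, and give up when one would have to move — can be sketched as a dependence check over the intervening window. This is a toy model, not the actual SILoadStoreOptimizer API; the dict-based instruction representation and register names are illustrative:

```python
def can_hoist_second_load(window, load):
    """Decide whether the later of two candidate loads can be hoisted past the
    intervening instructions to sit next to the earlier one.

    `window` is the list of intervening instructions; each instruction,
    including `load`, is a dict with 'defs' and 'uses' register-name sets."""
    for insn in window:
        # Hoisting is illegal if an intervening instruction writes a register
        # the load reads (its address operands would change)...
        if insn['defs'] & load['uses']:
            return False
        # ...or reads/writes a register the load defines.
        if (insn['defs'] | insn['uses']) & load['defs']:
            return False
    # Nothing in between conflicts, so the merge can proceed without moving
    # any intervening instruction (and without stretching live ranges).
    return True
```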
|
|
Revision tags: llvmorg-13.0.1, llvmorg-13.0.1-rc3 |
|
| #
89c447e4 |
| 16-Jan-2022 |
Matt Arsenault <[email protected]> |
AMDGPU: Stop reserving 36-bytes before kernel arguments for amdpal
This was inheriting the mesa behavior, and as far as I know nobody is using opencl kernels with amdpal. The isMesaKernel check was irrelevant because this property needs to hold for all functions.
|
|
Revision tags: llvmorg-13.0.1-rc2, llvmorg-13.0.1-rc1, llvmorg-13.0.0, llvmorg-13.0.0-rc4, llvmorg-13.0.0-rc3, llvmorg-13.0.0-rc2 |
|
| #
729bf9b2 |
| 14-Aug-2021 |
Matt Arsenault <[email protected]> |
AMDGPU: Enable fixed function ABI by default
Code using indirect calls is broken without this, and there isn't really much value in supporting the old attempt to vary the argument placement based on uses. This resulted in more argument shuffling code anyway.
Also, have the option stop implying that all inputs need to be passed. This will now rely on the amdgpu-no-* attributes to avoid passing unnecessary values.
|
| #
da067ed5 |
| 10-Nov-2021 |
Austin Kerbow <[email protected]> |
[AMDGPU] Set most sched model resource's BufferSize to one
Using a BufferSize of one for memory ProcResources will result in better ILP since it more accurately models the dependencies between memory ops and their consumers on an in-order processor. After this change, the scheduler will treat the data edges from loads as blocking so that stalls are guaranteed when waiting for data to be retrieved from memory. Since we don't actually track waitcnt here, this should do a better job at modeling their behavior.
Practically, this means that the scheduler will trigger the 'STALL' heuristic more often.
This type of change needs to be evaluated experimentally. Preliminary results are positive.
Fixes: SWDEV-282962
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D114777
|
| #
3ce1b963 |
| 08-Sep-2021 |
Joe Nash <[email protected]> |
[AMDGPU] Switch PostRA sched to MachineSched
Use GCNHazardRecognizer in postra sched. Updated tests for the new schedules.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D109536
Change-Id: Ia86ba2ae168f12fb34b4d8efdab491f84d936cde
|
| #
722b8e0e |
| 14-Aug-2021 |
Matt Arsenault <[email protected]> |
AMDGPU: Invert ABI attribute handling
Previously we assumed all callable functions did not need any implicitly passed inputs, and added attributes to functions to indicate when they were necessary. Requiring attributes for correctness is pretty ugly, and it makes supporting indirect and external calls more complicated.
This inverts the direction of the attributes, so an undecorated function is assumed to need all implicit inputs. This enables AMDGPUAttributor by default to mark when functions are proven to not need a given input. This strips the equivalent functionality from the legacy AMDGPUAnnotateKernelFeatures pass.
However, AMDGPUAnnotateKernelFeatures is not fully removed at this point although it should be in the future. It is still necessary for the two hacky amdgpu-calls and amdgpu-stack-objects attributes, which would be better served by a trivial analysis on the IR during selection. Additionally, AMDGPUAnnotateKernelFeatures still redundantly handles the uniform-work-group-size attribute to be removed in a future commit.
At this point when not using -amdgpu-fixed-function-abi, we are still modifying the ABI based on these newly negated attributes. In the future, this option will be removed and the locations for implicit inputs will always be fixed. We will then use the new attributes to avoid passing the values when unnecessary.
|
|
Revision tags: llvmorg-13.0.0-rc1, llvmorg-14-init, llvmorg-12.0.1, llvmorg-12.0.1-rc4, llvmorg-12.0.1-rc3, llvmorg-12.0.1-rc2 |
|
| #
2b43209e |
| 15-Jun-2021 |
Stanislav Mekhanoshin <[email protected]> |
[AMDGPU] Propagate LDS align into instructions
Differential Revision: https://reviews.llvm.org/D104316
|
| #
05289dfb |
| 07-Jun-2021 |
Stanislav Mekhanoshin <[email protected]> |
[AMDGPU] Handle constant LDS uses from different kernels
This allows lowering an LDS variable into a kernel structure even if it is used via a constant expression from different kernels.
Differential Revision: https://reviews.llvm.org/D103655
|
| #
52ffbfdf |
| 07-Jun-2021 |
hsmahesha <[email protected]> |
[AMDGPU] Increase alignment of LDS globals if necessary before LDS lowering.
Before packing LDS globals into a sorted structure, make sure that their alignment is properly updated based on their size. This will make sure that the members of the sorted structure are properly aligned, and hence it will further reduce the probability of unaligned LDS access.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D103261
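The alignment bump described above — raising each global's alignment toward a natural alignment for its size before packing — can be sketched as follows. This is a hypothetical policy mirroring the description, not the actual pass; the 16-byte cap is an assumption tied to the largest ds_read/write_b128 access:

```python
def bump_lds_alignment(size_in_bytes: int, current_align: int) -> int:
    """Raise a global's alignment to the smallest power of two covering its
    size, capped at 16 bytes, but never lower an existing alignment."""
    natural = 1
    while natural < size_in_bytes and natural < 16:
        natural *= 2
    return max(current_align, natural)
```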
|
| #
753437fc |
| 04-Jun-2021 |
hsmahesha <[email protected]> |
Revert "[AMDGPU] Increase alignment of LDS globals if necessary before LDS lowering."
This reverts commit d71ff907ef23eaef86ad66ba2d711e4986cd6cb2.
|
| #
d71ff907 |
| 04-Jun-2021 |
hsmahesha <[email protected]> |
[AMDGPU] Increase alignment of LDS globals if necessary before LDS lowering.
Before packing LDS globals into a sorted structure, make sure that their alignment is properly updated based on their size. This will make sure that the members of the sorted structure are properly aligned, and hence it will further reduce the probability of unaligned LDS access.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D103261
|
| #
5e2facb9 |
| 26-May-2021 |
Stanislav Mekhanoshin <[email protected]> |
[AMDGPU] Fix kernel LDS lowering for constants
There is a trivial but severe bug in the recent code collecting LDS globals used by a kernel: it aborts the scan on the first constant without scanning further uses. That leads to LDS overallocation with multiple kernels in certain cases.
Differential Revision: https://reviews.llvm.org/D103190
|
|
Revision tags: llvmorg-12.0.1-rc1 |
|
| #
8de4db69 |
| 19-May-2021 |
Stanislav Mekhanoshin <[email protected]> |
[AMDGPU] Lower kernel LDS into a sorted structure
Differential Revision: https://reviews.llvm.org/D102954
|
| #
ac64995c |
| 08-Apr-2021 |
hsmahesha <[email protected]> |
[AMDGPU] Only use ds_read/write_b128 for alignment >= 16
PS: Submitting on behalf of Jay.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D100008
|
|
Revision tags: llvmorg-12.0.0, llvmorg-12.0.0-rc5, llvmorg-12.0.0-rc4, llvmorg-12.0.0-rc3, llvmorg-12.0.0-rc2, llvmorg-11.1.0, llvmorg-11.1.0-rc3, llvmorg-12.0.0-rc1, llvmorg-13-init, llvmorg-11.1.0-rc2, llvmorg-11.1.0-rc1, llvmorg-11.0.1, llvmorg-11.0.1-rc2 |
|
| #
2291bd13 |
| 30-Nov-2020 |
Austin Kerbow <[email protected]> |
[AMDGPU] Update subtarget features for new target ID support
Support for XNACK and SRAMECC is not static on some GPUs. We must be able to differentiate between different scenarios for these dynamic subtarget features.
The possible settings are:
- Unsupported: The GPU has no support for XNACK/SRAMECC.
- Any: Preference is unspecified. Use conservative settings that can run anywhere.
- Off: Request support for XNACK/SRAMECC Off.
- On: Request support for XNACK/SRAMECC On.
GCNSubtarget will track the four options based on the following criteria. If the subtarget does not support XNACK/SRAMECC we say the setting is "Unsupported". If no subtarget features for XNACK/SRAMECC are requested we must support "Any" mode. If the subtarget features XNACK/SRAMECC exist in the feature string when initializing the subtarget, the settings are "On/Off".
The defaults are updated to be conservatively correct, meaning if no setting for XNACK or SRAMECC is explicitly requested, defaults will be used which generate code that can be run anywhere. This corresponds to the "Any" setting.
Differential Revision: https://reviews.llvm.org/D85882
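The resolution criteria above can be sketched as a small lookup. This is a toy model of the described logic, not the actual GCNSubtarget code; the function name and the `+feature`/`-feature` string convention are illustrative (though they follow the usual LLVM feature-string syntax):

```python
def resolve_target_id_setting(supported: bool, feature_string: str, name: str) -> str:
    """Resolve an XNACK/SRAMECC setting from the subtarget's capabilities and
    the requested feature string, per the criteria in the commit message."""
    if not supported:
        # The GPU cannot do it at all, regardless of what was requested.
        return "Unsupported"
    if f"+{name}" in feature_string:
        return "On"
    if f"-{name}" in feature_string:
        return "Off"
    # Nothing requested: use conservative settings that run anywhere.
    return "Any"
```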
|
| #
1ebe86ad |
| 05-Jan-2021 |
Mircea Trofin <[email protected]> |
[NFC] Removed unused prefixes in test/CodeGen/AMDGPU
More patches to follow.
Differential Revision: https://reviews.llvm.org/D94121
|
|
Revision tags: llvmorg-11.0.1-rc1 |
|
| #
d2e52eec |
| 10-Nov-2020 |
Matt Arsenault <[email protected]> |
AMDGPU: Select global saddr mode from SGPR pointer
Use the 64-bit SGPR base with a 0 offset, since it takes one fewer instruction to materialize the 0 than the 64-bit copy.
|
| #
040c5027 |
| 02-Nov-2020 |
Jay Foad <[email protected]> |
[AMDGPU] Fix ds_read2/write2 with unaligned offsets
These instructions use a scaled offset. We were wrongly selecting them even when the required offset was not a multiple of the scale factor.
Differential Revision: https://reviews.llvm.org/D90607
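The legality condition being fixed — a scaled offset must be an exact multiple of the scale factor, and the scaled value must fit the encoding — can be sketched as a simple check. This is an illustrative model, not the actual selection code; the 8-bit offset-field width is an assumption based on the ds_read2/write2 encoding:

```python
def can_use_ds_read2(byte_offset: int, element_size: int) -> bool:
    """ds_read2/write2 encode their offsets scaled by the element size, so the
    byte offset must be a multiple of the scale and the scaled value must fit
    in the (assumed) 8-bit offset field."""
    if byte_offset % element_size != 0:
        # The bug being fixed: selecting these ops for non-multiple offsets.
        return False
    return 0 <= byte_offset // element_size <= 255
```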
|
| #
32897c05 |
| 03-Nov-2020 |
Jay Foad <[email protected]> |
[AMDGPU] Specify a triple to avoid codegen changes depending on host OS
|
| #
0892d2a3 |
| 02-Nov-2020 |
Jay Foad <[email protected]> |
Revert "Fix ds_read2/write2 unaligned offsets"
This reverts commit 2e7e898c8f0b38dc11fbce2553fc715067aaf42f.
It was committed by mistake.
|
| #
2e7e898c |
| 02-Nov-2020 |
Jay Foad <[email protected]> |
Fix ds_read2/write2 unaligned offsets
|
| #
c8cbaa15 |
| 02-Nov-2020 |
Jay Foad <[email protected]> |
[AMDGPU] Precommit ds_read2/write2 with unaligned offset tests. NFC.
|
| #
f3881d65 |
| 02-Nov-2020 |
Jay Foad <[email protected]> |
[AMDGPU] Generate test checks. NFC.
|
| #
d3f13f3e |
| 02-Nov-2020 |
Jay Foad <[email protected]> |
[AMDGPU] Remove a comment. NFC.
This was obsoleted by f78687df9b7, which added gfx9 aligned/unaligned tests.
|