amdgpu_hmm.c - OpenGrok history log for /linux-6.15/drivers/gpu/drm/amd/amdgpu/amdgpu

Revision (<<< Hide revision tags) (Show revision tags >>>)	Date	Author	Comments
Revision tags: v6.15, v6.15-rc7, v6.15-rc6, v6.15-rc5, v6.15-rc4, v6.15-rc3, v6.15-rc2, v6.15-rc1, v6.14, v6.14-rc7, v6.14-rc6, v6.14-rc5, v6.14-rc4, v6.14-rc3, v6.14-rc2, v6.14-rc1, v6.13, v6.13-rc7, v6.13-rc6, v6.13-rc5, v6.13-rc4, v6.13-rc3, v6.13-rc2, v6.13-rc1, v6.12, v6.12-rc7, v6.12-rc6, v6.12-rc5, v6.12-rc4, v6.12-rc3, v6.12-rc2, v6.12-rc1, v6.11, v6.11-rc7, v6.11-rc6, v6.11-rc5, v6.11-rc4, v6.11-rc3, v6.11-rc2, v6.11-rc1, v6.10, v6.10-rc7, v6.10-rc6, v6.10-rc5, v6.10-rc4, v6.10-rc3, v6.10-rc2, v6.10-rc1, v6.9, v6.9-rc7
# 9095e554	30-Apr-2024	Philip Yang <[email protected]>	drm/amdkfd: Remove arbitrary timeout for hmm_range_fault On system with khugepaged enabled and user cases with THP buffer, the hmm_range_fault may takes > 15 seconds to return -EBUSY, the arbitrary drm/amdkfd: Remove arbitrary timeout for hmm_range_fault On system with khugepaged enabled and user cases with THP buffer, the hmm_range_fault may takes > 15 seconds to return -EBUSY, the arbitrary timeout value is not accurate, cause memory allocation failure. Remove the arbitrary timeout value, return EAGAIN to application if hmm_range_fault return EBUSY, then userspace libdrm and Thunk will call ioctl again. Change EAGAIN to debug message as this is not error. Signed-off-by: Philip Yang <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]> show more ...
Revision tags: v6.9-rc6, v6.9-rc5, v6.9-rc4, v6.9-rc3, v6.9-rc2, v6.9-rc1, v6.8
# 22207fd5	06-Mar-2024	Vitaly Prosyak <[email protected]>	drm/amdgpu: fix use-after-free bug The bug can be triggered by sending a single amdgpu_gem_userptr_ioctl to the AMDGPU DRM driver on any ASICs with an invalid address and size. The bug was reported drm/amdgpu: fix use-after-free bug The bug can be triggered by sending a single amdgpu_gem_userptr_ioctl to the AMDGPU DRM driver on any ASICs with an invalid address and size. The bug was reported by Joonkyo Jung <[email protected]>. For example the following code: static void Syzkaller1(int fd) { struct drm_amdgpu_gem_userptr arg; int ret; arg.addr = 0xffffffffffff0000; arg.size = 0x80000000; /2 Gb/ arg.flags = 0x7; ret = drmIoctl(fd, 0xc1186451/amdgpu_gem_userptr_ioctl/, &arg); } Due to the address and size are not valid there is a failure in amdgpu_hmm_register->mmu_interval_notifier_insert->__mmu_interval_notifier_insert-> check_shl_overflow, but we even the amdgpu_hmm_register failure we still call amdgpu_hmm_unregister into amdgpu_gem_object_free which causes access to a bad address. The following stack is below when the issue is reproduced when Kazan is enabled: [ +0.000014] Hardware name: ASUS System Product Name/ROG STRIX B550-F GAMING (WI-FI), BIOS 1401 12/03/2020 [ +0.000009] RIP: 0010:mmu_interval_notifier_remove+0x327/0x340 [ +0.000017] Code: ff ff 49 89 44 24 08 48 b8 00 01 00 00 00 00 ad de 4c 89 f7 49 89 47 40 48 83 c0 22 49 89 47 48 e8 ce d1 2d 01 e9 32 ff ff ff <0f> 0b e9 16 ff ff ff 4c 89 ef e8 fa 14 b3 ff e9 36 ff ff ff e8 80 [ +0.000014] RSP: 0018:ffffc90002657988 EFLAGS: 00010246 [ +0.000013] RAX: 0000000000000000 RBX: 1ffff920004caf35 RCX: ffffffff8160565b [ +0.000011] RDX: dffffc0000000000 RSI: 0000000000000004 RDI: ffff8881a9f78260 [ +0.000010] RBP: ffffc90002657a70 R08: 0000000000000001 R09: fffff520004caf25 [ +0.000010] R10: 0000000000000003 R11: ffffffff8161d1d6 R12: ffff88810e988c00 [ +0.000010] R13: ffff888126fb5a00 R14: ffff88810e988c0c R15: ffff8881a9f78260 [ +0.000011] FS: 00007ff9ec848540(0000) GS:ffff8883cc880000(0000) knlGS:0000000000000000 [ +0.000012] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ +0.000010] CR2: 000055b3f7e14328 CR3: 00000001b5770000 CR4: 0000000000350ef0 [ +0.000010] Call Trace: [ +0.000006] <TASK> [ +0.000007] ? show_regs+0x6a/0x80 [ +0.000018] ? __warn+0xa5/0x1b0 [ +0.000019] ? mmu_interval_notifier_remove+0x327/0x340 [ +0.000018] ? report_bug+0x24a/0x290 [ +0.000022] ? handle_bug+0x46/0x90 [ +0.000015] ? exc_invalid_op+0x19/0x50 [ +0.000016] ? asm_exc_invalid_op+0x1b/0x20 [ +0.000017] ? kasan_save_stack+0x26/0x50 [ +0.000017] ? mmu_interval_notifier_remove+0x23b/0x340 [ +0.000019] ? mmu_interval_notifier_remove+0x327/0x340 [ +0.000019] ? mmu_interval_notifier_remove+0x23b/0x340 [ +0.000020] ? __pfx_mmu_interval_notifier_remove+0x10/0x10 [ +0.000017] ? kasan_save_alloc_info+0x1e/0x30 [ +0.000018] ? srso_return_thunk+0x5/0x5f [ +0.000014] ? __kasan_kmalloc+0xb1/0xc0 [ +0.000018] ? srso_return_thunk+0x5/0x5f [ +0.000013] ? __kasan_check_read+0x11/0x20 [ +0.000020] amdgpu_hmm_unregister+0x34/0x50 [amdgpu] [ +0.004695] amdgpu_gem_object_free+0x66/0xa0 [amdgpu] [ +0.004534] ? __pfx_amdgpu_gem_object_free+0x10/0x10 [amdgpu] [ +0.004291] ? do_syscall_64+0x5f/0xe0 [ +0.000023] ? srso_return_thunk+0x5/0x5f [ +0.000017] drm_gem_object_free+0x3b/0x50 [drm] [ +0.000489] amdgpu_gem_userptr_ioctl+0x306/0x500 [amdgpu] [ +0.004295] ? __pfx_amdgpu_gem_userptr_ioctl+0x10/0x10 [amdgpu] [ +0.004270] ? srso_return_thunk+0x5/0x5f [ +0.000014] ? __this_cpu_preempt_check+0x13/0x20 [ +0.000015] ? srso_return_thunk+0x5/0x5f [ +0.000013] ? sysvec_apic_timer_interrupt+0x57/0xc0 [ +0.000020] ? srso_return_thunk+0x5/0x5f [ +0.000014] ? asm_sysvec_apic_timer_interrupt+0x1b/0x20 [ +0.000022] ? drm_ioctl_kernel+0x17b/0x1f0 [drm] [ +0.000496] ? __pfx_amdgpu_gem_userptr_ioctl+0x10/0x10 [amdgpu] [ +0.004272] ? drm_ioctl_kernel+0x190/0x1f0 [drm] [ +0.000492] drm_ioctl_kernel+0x140/0x1f0 [drm] [ +0.000497] ? __pfx_amdgpu_gem_userptr_ioctl+0x10/0x10 [amdgpu] [ +0.004297] ? __pfx_drm_ioctl_kernel+0x10/0x10 [drm] [ +0.000489] ? srso_return_thunk+0x5/0x5f [ +0.000011] ? __kasan_check_write+0x14/0x20 [ +0.000016] drm_ioctl+0x3da/0x730 [drm] [ +0.000475] ? __pfx_amdgpu_gem_userptr_ioctl+0x10/0x10 [amdgpu] [ +0.004293] ? __pfx_drm_ioctl+0x10/0x10 [drm] [ +0.000506] ? __pfx_rpm_resume+0x10/0x10 [ +0.000016] ? srso_return_thunk+0x5/0x5f [ +0.000011] ? __kasan_check_write+0x14/0x20 [ +0.000010] ? srso_return_thunk+0x5/0x5f [ +0.000011] ? _raw_spin_lock_irqsave+0x99/0x100 [ +0.000015] ? __pfx__raw_spin_lock_irqsave+0x10/0x10 [ +0.000014] ? srso_return_thunk+0x5/0x5f [ +0.000013] ? srso_return_thunk+0x5/0x5f [ +0.000011] ? srso_return_thunk+0x5/0x5f [ +0.000011] ? preempt_count_sub+0x18/0xc0 [ +0.000013] ? srso_return_thunk+0x5/0x5f [ +0.000010] ? _raw_spin_unlock_irqrestore+0x27/0x50 [ +0.000019] amdgpu_drm_ioctl+0x7e/0xe0 [amdgpu] [ +0.004272] __x64_sys_ioctl+0xcd/0x110 [ +0.000020] do_syscall_64+0x5f/0xe0 [ +0.000021] entry_SYSCALL_64_after_hwframe+0x6e/0x76 [ +0.000015] RIP: 0033:0x7ff9ed31a94f [ +0.000012] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <41> 89 c0 3d 00 f0 ff ff 77 1f 48 8b 44 24 18 64 48 2b 04 25 28 00 [ +0.000013] RSP: 002b:00007fff25f66790 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [ +0.000016] RAX: ffffffffffffffda RBX: 000055b3f7e133e0 RCX: 00007ff9ed31a94f [ +0.000012] RDX: 000055b3f7e133e0 RSI: 00000000c1186451 RDI: 0000000000000003 [ +0.000010] RBP: 00000000c1186451 R08: 0000000000000000 R09: 0000000000000000 [ +0.000009] R10: 0000000000000008 R11: 0000000000000246 R12: 00007fff25f66ca8 [ +0.000009] R13: 0000000000000003 R14: 000055b3f7021ba8 R15: 00007ff9ed7af040 [ +0.000024] </TASK> [ +0.000007] ---[ end trace 0000000000000000 ]--- v2: Consolidate any error handling into amdgpu_hmm_register which applied to kfd_bo also. (Christian) v3: Improve syntax and comment (Christian) Cc: Christian Koenig <[email protected]> Cc: Alex Deucher <[email protected]> Cc: Felix Kuehling <[email protected]> Cc: Joonkyo Jung <[email protected]> Cc: Dokyung Song <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Signed-off-by: Vitaly Prosyak <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Alex Deucher <[email protected]> show more ...
Revision tags: v6.8-rc7, v6.8-rc6, v6.8-rc5, v6.8-rc4, v6.8-rc3, v6.8-rc2, v6.8-rc1, v6.7, v6.7-rc8, v6.7-rc7, v6.7-rc6, v6.7-rc5
# 0c8c0e7a	08-Dec-2023	James Zhu <[email protected]>	drm/amdgpu: make an improvement on amdgpu_hmm_range_get_pages Only schedule when hmm_range_fault returns error. Signed-off-by: James Zhu <[email protected]> Acked-by: Felix Kuehling <Felix.Kuehling drm/amdgpu: make an improvement on amdgpu_hmm_range_get_pages Only schedule when hmm_range_fault returns error. Signed-off-by: James Zhu <[email protected]> Acked-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]> show more ...
# 78b4dfd3	08-Dec-2023	James Zhu <[email protected]>	drm/amdgpu: increase hmm range get pages timeout When application tries to allocate all system memory and cause memory to swap out. Needs more time for hmm_range_fault to validate the remaining page drm/amdgpu: increase hmm range get pages timeout When application tries to allocate all system memory and cause memory to swap out. Needs more time for hmm_range_fault to validate the remaining page for allocation. To be safe, increase timeout value to 1 second for 64MB range. Signed-off-by: James Zhu <[email protected]> Acked-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]> show more ...
Revision tags: v6.7-rc4, v6.7-rc3, v6.7-rc2, v6.7-rc1, v6.6, v6.6-rc7, v6.6-rc6, v6.6-rc5, v6.6-rc4, v6.6-rc3, v6.6-rc2, v6.6-rc1, v6.5, v6.5-rc7, v6.5-rc6, v6.5-rc5, v6.5-rc4, v6.5-rc3, v6.5-rc2, v6.5-rc1, v6.4, v6.4-rc7, v6.4-rc6
# 5d1c70bb	06-Jun-2023	Philip Yang <[email protected]>	drm/amdgpu: Increase hmm range get pages timeout If hmm_range_fault returns -EBUSY, we should call hmm_range_fault again to validate the remaining pages. On one system with NUMA auto balancing enabl drm/amdgpu: Increase hmm range get pages timeout If hmm_range_fault returns -EBUSY, we should call hmm_range_fault again to validate the remaining pages. On one system with NUMA auto balancing enabled, hmm_range_fault takes 6 seconds for 1GB range because CPU migrate the range one page at a time. To be safe, increase timeout value to 1 second for 128MB range. Signed-off-by: Philip Yang <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]> show more ...
Revision tags: v6.4-rc5, v6.4-rc4, v6.4-rc3, v6.4-rc2, v6.4-rc1, v6.3, v6.3-rc7, v6.3-rc6, v6.3-rc5, v6.3-rc4, v6.3-rc3, v6.3-rc2, v6.3-rc1, v6.2, v6.2-rc8, v6.2-rc7, v6.2-rc6, v6.2-rc5, v6.2-rc4, v6.2-rc3, v6.2-rc2, v6.2-rc1, v6.1, v6.1-rc8, v6.1-rc7, v6.1-rc6, v6.1-rc5, v6.1-rc4, v6.1-rc3, v6.1-rc2, v6.1-rc1, v6.0, v6.0-rc7, v6.0-rc6, v6.0-rc5, v6.0-rc4, v6.0-rc3, v6.0-rc2, v6.0-rc1, v5.19, v5.19-rc8, v5.19-rc7, v5.19-rc6, v5.19-rc5, v5.19-rc4, v5.19-rc3, v5.19-rc2, v5.19-rc1, v5.18, v5.18-rc7, v5.18-rc6, v5.18-rc5, v5.18-rc4, v5.18-rc3, v5.18-rc2, v5.18-rc1, v5.17, v5.17-rc8, v5.17-rc7, v5.17-rc6, v5.17-rc5, v5.17-rc4, v5.17-rc3, v5.17-rc2, v5.17-rc1, v5.16, v5.16-rc8, v5.16-rc7, v5.16-rc6, v5.16-rc5, v5.16-rc4, v5.16-rc3, v5.16-rc2, v5.16-rc1, v5.15, v5.15-rc7, v5.15-rc6, v5.15-rc5, v5.15-rc4, v5.15-rc3, v5.15-rc2, v5.15-rc1, v5.14, v5.14-rc7, v5.14-rc6, v5.14-rc5, v5.14-rc4, v5.14-rc3, v5.14-rc2, v5.14-rc1, v5.13, v5.13-rc7, v5.13-rc6, v5.13-rc5, v5.13-rc4, v5.13-rc3, v5.13-rc2, v5.13-rc1, v5.12
# f95f51a4	22-Apr-2021	Felix Kuehling <[email protected]>	drm/amdgpu: Add notifier lock for KFD userptrs Add a per-process MMU notifier lock for processing notifiers from userptrs. Use that lock to properly synchronize page table updates with MMU notifiers drm/amdgpu: Add notifier lock for KFD userptrs Add a per-process MMU notifier lock for processing notifiers from userptrs. Use that lock to properly synchronize page table updates with MMU notifiers. Signed-off-by: Felix Kuehling <[email protected]> Reviewed-by: Xiaogang Chen<[email protected]> Signed-off-by: Alex Deucher <[email protected]> show more ...
# c1420a5d	17-Nov-2022	James Zhu <[email protected]>	drm/amdgpu: fix stall on CPU when allocate large system memory -v2: 1. rename variable to redue confuse 2. optimize the code -v3: move new define out of the middle of the code -v4: squash in mi drm/amdgpu: fix stall on CPU when allocate large system memory -v2: 1. rename variable to redue confuse 2. optimize the code -v3: move new define out of the middle of the code -v4: squash in minmax error fix (Luben) When applications try to allocate large system (more than > 128GB), "stall cpu" is reported. for such large system memory, walk_page_range takes more than 20s usually. The warning message can be removed when splitting hmm range into smaller ones which is not more 64GB for each walk_page_range. [ 164.437617] amdgpu:amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu:1753: amdgpu: create BO VA 0x7f63c7a00000 size 0x2f16000000 domain CPU [ 164.488847] amdgpu:amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu:1785: amdgpu: creating userptr BO for user_addr = 7f63c7a00000 [ 185.439116] rcu: INFO: rcu_sched self-detected stall on CPU [ 185.439125] rcu: 8-....: (20999 ticks this GP) idle=e22/1/0x4000000000000000 softirq=2242/2242 fqs=5249 [ 185.439137] (t=21000 jiffies g=6325 q=1215) [ 185.439141] NMI backtrace for cpu 8 [ 185.439143] CPU: 8 PID: 3470 Comm: kfdtest Kdump: loaded Tainted: G O 5.12.0-0_fbk5_zion_rc1_5697_g2c723fb88626 #1 [ 185.439147] Hardware name: HPE ProLiant XL675d Gen10 Plus/ProLiant XL675d Gen10 Plus, BIOS A47 11/06/2020 [ 185.439150] Call Trace: [ 185.439153] <IRQ> [ 185.439157] dump_stack+0x64/0x7c [ 185.439163] nmi_cpu_backtrace.cold.7+0x30/0x65 [ 185.439165] ? lapic_can_unplug_cpu+0x70/0x70 [ 185.439170] nmi_trigger_cpumask_backtrace+0xf9/0x100 [ 185.439174] rcu_dump_cpu_stacks+0xc5/0xf5 [ 185.439178] rcu_sched_clock_irq.cold.97+0x112/0x38c [ 185.439182] ? tick_sched_handle.isra.21+0x50/0x50 [ 185.439185] update_process_times+0x8c/0xc0 [ 185.439189] tick_sched_timer+0x63/0x70 [ 185.439192] __hrtimer_run_queues+0xff/0x250 [ 185.439195] hrtimer_interrupt+0xf4/0x200 [ 185.439199] __sysvec_apic_timer_interrupt+0x51/0xd0 [ 185.439201] sysvec_apic_timer_interrupt+0x69/0x90 [ 185.439206] </IRQ> [ 185.439207] asm_sysvec_apic_timer_interrupt+0x12/0x20 [ 185.439211] RIP: 0010:clear_page_rep+0x7/0x10 [ 185.439214] Code: e8 fe 7c 51 00 44 89 e2 48 89 ee 48 89 df e8 60 ff ff ff c6 03 00 5b 5d 41 5c c3 cc cc cc cc cc cc cc cc b9 00 02 00 00 31 c0 <f3> 48 ab c3 0f 1f 44 00 00 31 c0 b9 40 00 00 00 66 0f 1f 84 00 00 [ 185.439218] RSP: 0018:ffffc9000f58f818 EFLAGS: 00000246 [ 185.439220] RAX: 0000000000000000 RBX: 0000000000000881 RCX: 000000000000005c [ 185.439223] RDX: 0000000000100dca RSI: 0000000000000000 RDI: ffff88a59e0e5d20 [ 185.439225] RBP: ffffea0096783940 R08: ffff888118c35280 R09: ffffea0096783940 [ 185.439227] R10: ffff888000000000 R11: 0000160000000000 R12: ffffea0096783980 [ 185.439228] R13: ffffea0096783940 R14: ffff88b07fdfdd00 R15: 0000000000000000 [ 185.439232] prep_new_page+0x81/0xc0 [ 185.439236] get_page_from_freelist+0x13be/0x16f0 [ 185.439240] ? release_pages+0x16a/0x4a0 [ 185.439244] __alloc_pages_nodemask+0x1ae/0x340 [ 185.439247] alloc_pages_vma+0x74/0x1e0 [ 185.439251] __handle_mm_fault+0xafe/0x1360 [ 185.439255] handle_mm_fault+0xc3/0x280 [ 185.439257] hmm_vma_fault.isra.22+0x49/0x90 [ 185.439261] __walk_page_range+0x692/0x9b0 [ 185.439265] walk_page_range+0x9b/0x120 [ 185.439269] hmm_range_fault+0x4f/0x90 [ 185.439274] amdgpu_hmm_range_get_pages+0x24f/0x260 [amdgpu] [ 185.439463] amdgpu_ttm_tt_get_user_pages+0xc2/0x190 [amdgpu] [ 185.439603] amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu+0x49f/0x7a0 [amdgpu] [ 185.439774] kfd_ioctl_alloc_memory_of_gpu+0xfb/0x410 [amdgpu] Signed-off-by: James Zhu <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]> show more ...
# d4cbff46	09-Nov-2022	Christian König <[email protected]>	drm/amdgpu: cleanup amdgpu_hmm_range_get_pages Remove unused parameters and cleanup dead code. Signed-off-by: Christian König <[email protected]> Reviewed-by: Alex Deucher <alexander.deucher drm/amdgpu: cleanup amdgpu_hmm_range_get_pages Remove unused parameters and cleanup dead code. Signed-off-by: Christian König <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]> show more ...
# d9483ecd	09-Nov-2022	Christian König <[email protected]>	drm/amdgpu: rename the files for HMM handling Clean that up a bit, no functional change. Signed-off-by: Christian König <[email protected]> Reviewed-by: Alex Deucher <[email protected] drm/amdgpu: rename the files for HMM handling Clean that up a bit, no functional change. Signed-off-by: Christian König <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Signed-off-by: Alex Deucher <[email protected]> show more ...