sync kvm patch for loongarch #635

Open
wants to merge 67 commits into base: linux-6.6.y

Conversation

@lixianglai commented Feb 28, 2025

Summary by Sourcery

This pull request synchronizes the KVM patch for the LoongArch architecture. It introduces support for performance monitoring units (PMU), LBT, stolen time, and IPI. It also includes changes to the MMU, timer, and other core components to improve virtualization performance and stability.

New Features:

  • Adds support for performance monitoring units (PMU) to allow guest VMs to monitor their performance.
  • Introduces support for LBT (Loongson Binary Translation) state save/restore for guest VMs.
  • Implements stolen time accounting to provide more accurate timekeeping in guest VMs.
  • Adds support for Inter-Processor Interrupts (IPI) to enable communication between virtual CPUs.
  • Adds support for virtualized EXTIOI (External Interrupt Controller)

gaosong-loongson and others added 30 commits February 28, 2025 19:25
Upstream: no

On LoongArch, the host and guest have their own PMU CSRs and they share
the PMU hardware resources. A set of PMU CSRs consists of a CTRL register
and a CNTR register. Which PMU CSRs the guest uses is selected by writing
to the GCFG register bits [24:26].

On the KVM side:
- Save the host PMU CSRs into structure kvm_context.
- If the host supports the PMU feature:
  - When entering guest mode, save the host PMU CSRs and restore the guest PMU CSRs.
  - When exiting guest mode, save the guest PMU CSRs and restore the host PMU CSRs.
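The entry/exit swap can be sketched as below. This is an illustrative userspace model, not the kernel code: the struct layout, the number of counter pairs, and the function names are all assumptions.

```c
#include <stdint.h>

#define PMU_NUM 4  /* assumed number of CTRL/CNTR pairs */

/* One set of PMU CSRs: control + counter registers (hypothetical layout). */
struct pmu_csrs {
    uint64_t ctrl[PMU_NUM];
    uint64_t cntr[PMU_NUM];
};

struct kvm_context { struct pmu_csrs host_pmu; };
struct kvm_vcpu    { struct pmu_csrs guest_pmu; };

/* The hardware PMU CSRs, stood in for by a global in this sketch. */
static struct pmu_csrs hw_pmu;

/* On guest entry: stash the host PMU state, load the guest's. */
static void pmu_switch_to_guest(struct kvm_context *ctx, struct kvm_vcpu *vcpu)
{
    ctx->host_pmu = hw_pmu;    /* save host PMU CSRs into kvm_context */
    hw_pmu = vcpu->guest_pmu;  /* restore guest PMU CSRs */
}

/* On guest exit: stash the guest PMU state, reload the host's. */
static void pmu_switch_to_host(struct kvm_context *ctx, struct kvm_vcpu *vcpu)
{
    vcpu->guest_pmu = hw_pmu;  /* save guest PMU CSRs */
    hw_pmu = ctx->host_pmu;    /* restore host PMU CSRs */
}
```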

Signed-off-by: Song Gao <gaosong@loongson.cn>
Link: https://lore.kernel.org/all/20240613120539.41021-1-gaosong@loongson.cn/
Signed-off-by: Xianglai Li <lixianglai@loongson.cn>
Upstream: no

Add iocsr and mmio memory read/write emulation to the kernel. When the
VM accesses the device address space through iocsr instructions or mmio,
it no longer needs to return to qemu user mode; the access is completed
directly in kernel mode.

Signed-off-by: Tianrui Zhao <zhaotianrui@loongson.cn>
Signed-off-by: Xianglai Li <lixianglai@loongson.cn>
Upstream: no

Add a device model for the IPI interrupt controller, implement the
basic create/destroy interface, and register the device model in the
kvm device table.

Signed-off-by: Tianrui Zhao <zhaotianrui@loongson.cn>
Signed-off-by: Xianglai Li <lixianglai@loongson.cn>
Upstream: no

Implement read/write emulation for the IPI interrupt controller
address space.

Signed-off-by: Min Zhou <zhoumin@loongson.cn>
Signed-off-by: Tianrui Zhao <zhaotianrui@loongson.cn>
Signed-off-by: Xianglai Li <lixianglai@loongson.cn>
Upstream: no

Implement the communication interface between user-mode programs and
the kernel for the emulated IPI interrupt controller. It is used to
get or set the emulated interrupt controller state from the user-mode
process, for VM migration and VM save/restore.

Signed-off-by: Min Zhou <zhoumin@loongson.cn>
Signed-off-by: Tianrui Zhao <zhaotianrui@loongson.cn>
Signed-off-by: Xianglai Li <lixianglai@loongson.cn>
Upstream: no

Add a device model for the EXTIOI interrupt controller, implement the
basic create/destroy interface, and register the device model in the
kvm device table.

Signed-off-by: Tianrui Zhao <zhaotianrui@loongson.cn>
Signed-off-by: Xianglai Li <lixianglai@loongson.cn>
Upstream: no

Implement read/write emulation for the EXTIOI interrupt controller
address space.

Signed-off-by: Tianrui Zhao <zhaotianrui@loongson.cn>
Signed-off-by: Xianglai Li <lixianglai@loongson.cn>
Upstream: no

Implement the communication interface between user-mode programs and
the kernel for the emulated EXTIOI interrupt controller. It is used to
get or set the emulated interrupt controller state from the user-mode
process, for VM migration and VM save/restore.

Signed-off-by: Tianrui Zhao <zhaotianrui@loongson.cn>
Signed-off-by: Xianglai Li <lixianglai@loongson.cn>
Upstream: no

Add a device model for the PCHPIC interrupt controller, implement the
basic create/destroy interface, and register the device model in the
kvm device table.

Signed-off-by: Tianrui Zhao <zhaotianrui@loongson.cn>
Signed-off-by: Xianglai Li <lixianglai@loongson.cn>
Upstream: no

Implement read/write emulation for the PCHPIC interrupt controller
address space.

Implement the interrupt injection interface for LoongArch.

Signed-off-by: Tianrui Zhao <zhaotianrui@loongson.cn>
Signed-off-by: Xianglai Li <lixianglai@loongson.cn>
Upstream: no

Implement the communication interface between user-mode programs and
the kernel for the emulated PCHPIC interrupt controller. It is used to
get or set the emulated interrupt controller state from the user-mode
process, for VM migration and VM save/restore.

Signed-off-by: Tianrui Zhao <zhaotianrui@loongson.cn>
Signed-off-by: Xianglai Li <lixianglai@loongson.cn>
Upstream: no

Enable the KVM_IRQ_ROUTING, KVM_IRQCHIP and KVM_MSI configuration items,
add the KVM_CAP_IRQCHIP capability, and implement the query interface
of the in-kernel irqchip.

Signed-off-by: Xianglai Li <lixianglai@loongson.cn>
Upstream: no

  This fixes a build error introduced by the PR:
  https://gitee.com/anolis/cloud-kernel/pulls/3517

  [...]
  arch/loongarch/kvm/vcpu.c:1253:13: error: redefinition of 'kvm_lose_pmu'
   1253 | static void kvm_lose_pmu(struct kvm_vcpu *vcpu)
        |             ^~~~~~~~~~~~
  arch/loongarch/kvm/vcpu.c:202:13: note: previous definition of 'kvm_lose_pmu' with type 'void(struct kvm_vcpu *)'
    202 | static void kvm_lose_pmu(struct kvm_vcpu *vcpu)
        |             ^~~~~~~~~~~~
  arch/loongarch/kvm/vcpu.c: In function 'kvm_lose_pmu':
  arch/loongarch/kvm/vcpu.c:1257:38: error: 'KVM_LARCH_PERF' undeclared (first use in this function); did you mean 'KVM_LARCH_LSX'?
   1257 |         if (!(vcpu->arch.aux_inuse & KVM_LARCH_PERF))
        |                                      ^~~~~~~~~~~~~~
        |                                      KVM_LARCH_LSX
  arch/loongarch/kvm/vcpu.c:1280:35: error: 'KVM_PMU_PLV_ENABLE' undeclared (first use in this function); did you mean 'KVM_PV_ENABLE'?
   1280 |                                 & KVM_PMU_PLV_ENABLE) == 0)
        |                                   ^~~~~~~~~~~~~~~~~~
        |                                   KVM_PV_ENABLE
  [..]

Signed-off-by: Song Gao <gaosong@loongson.cn>
Signed-off-by: Xianglai Li <lixianglai@loongson.cn>
commit a2cd37863518 ("LoongArch: KVM: Add PV IPI support on host side")
Conflict: none
Backport-reason: Synchronize upstream linux loongarch kvm
patch to support loongarch virtualization.
Checkpatch: no, to be consistent with upstream commit.

On LoongArch systems, the IPI hardware uses iocsr registers. There is
one iocsr register access on IPI sending and two iocsr accesses on IPI
receiving in the IPI interrupt handler. In VM mode every iocsr access
traps into the hypervisor, so one IPI hardware notification costs three
traps.

This patch adds PV IPI for VMs: a hypercall instruction is used on the
IPI sender side, and the hypervisor injects an SWI into the destination
vcpu. In the SWI interrupt handler only the CSR.ESTAT register is
written to clear the irq. CSR.ESTAT accesses do not trap into the
hypervisor, so with PV IPI there is one trap on the sender side and
none on the receiver side: a single trap per IPI notification.

This patch also adds IPI multicast support, using a method similar to
x86. With multicast support, an IPI notification can be sent to at most
128 vcpus at a time, which greatly reduces the number of traps into the
hypervisor.
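The 128-vcpu multicast encoding can be sketched as packing the destinations into a minimum vcpu id plus two 64-bit bitmaps, the way the x86 PV IPI path does. This is an illustrative userspace model; the function name and calling convention are assumptions, not the kernel ABI.

```c
#include <stdint.h>

/* Pack a set of destination vCPU ids into the triple a single
 * hypercall can carry: vCPUs min..min+127 as two 64-bit masks.
 * Returns 0 on success, -1 if the span exceeds 128 vCPUs (a real
 * implementation would then issue another hypercall). */
static int pv_ipi_pack(const int *cpus, int n, int *min_out,
                       uint64_t *lo, uint64_t *hi)
{
    int i, min = cpus[0];

    for (i = 1; i < n; i++)
        if (cpus[i] < min)
            min = cpus[i];

    *lo = *hi = 0;
    for (i = 0; i < n; i++) {
        int off = cpus[i] - min;

        if (off >= 128)
            return -1;               /* span too wide for one call */
        if (off < 64)
            *lo |= 1ULL << off;      /* vCPUs min..min+63 */
        else
            *hi |= 1ULL << (off - 64); /* vCPUs min+64..min+127 */
    }
    *min_out = min;
    return 0;
}
```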

Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Signed-off-by: Xianglai Li <lixianglai@loongson.cn>
commit 30cf03a606b7 ("LoongArch: KVM: Add PV IPI support on guest side")
Conflict: none
Backport-reason: Synchronize upstream linux loongarch kvm
patch to support loongarch virtualization.
Checkpatch: no, to be consistent with upstream commit.

The PARAVIRT config option and PV IPI are added on the guest side.
Function pv_ipi_init() installs the IPI sending and receiving hooks.
It first checks whether the system runs in VM mode; if so, it calls
kvm_para_available() to detect the current hypervisor type (only KVM
detection is supported for now). The paravirt functions only take
effect when the hypervisor type is KVM, since KVM is the only
hypervisor supported on LoongArch at present.

PV IPI uses virtual IPI sender and receiver functions. With the virtual
IPI sender, the IPI message is stored in memory rather than in emulated
hardware. IPI multicast is also supported, and 128 vcpus can receive
IPIs at the same time, as in the x86 KVM method. The hypercall method
is used for IPI sending.

With virtual IPI receiver, HW SWI0 is used rather than real IPI HW.
Since VCPU has separate HW SWI0 like HW timer, there is no trap in IPI
interrupt acknowledge. Since IPI message is stored in memory, there is
no trap in getting IPI message.

Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Signed-off-by: Xianglai Li <lixianglai@loongson.cn>
commit bfd2ecade039 ("LoongArch: KVM: Add software breakpoint support")
Conflict: none
Backport-reason: Synchronize upstream linux loongarch kvm
patch to support loongarch virtualization.
Checkpatch: no, to be consistent with upstream commit.

When a VM runs in kvm mode, the system does not exit to host mode when
executing a general software breakpoint instruction such as INSN_BREAK;
the trap exception happens in guest mode rather than host mode. In
order to debug the guest kernel on the host side, a mechanism is needed
to let the VM exit to host mode.

Here a hypercall instruction with a special code is used for software
breakpoints. The VM exits to host mode, the kvm hypervisor identifies
the special hypercall code, sets exit_reason to KVM_EXIT_DEBUG, and
then lets qemu handle it.

The idea comes from ppc kvm: an API, KVM_REG_LOONGARCH_DEBUG_INST, is
added to retrieve the hypercall code. The VMM needs to fetch the sw
breakpoint instruction through this API and set the corresponding sw
breakpoint for the guest kernel.
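A minimal model of that dispatch is sketched below. The constant values and the struct are stand-ins (the real definitions live in the KVM and LoongArch KVM headers); only the control flow follows the description above.

```c
#include <stdint.h>

/* Illustrative values; real codes come from the kernel headers. */
#define KVM_HCALL_SWDBG  0x0ULL  /* assumed: sw-breakpoint hypercall code */
#define KVM_EXIT_DEBUG   4       /* assumed value for this sketch */

#define RET_TO_GUEST 0
#define RET_TO_USER  1

struct kvm_run { int exit_reason; };

/* On a hypercall VM exit: the special breakpoint code is routed to
 * userspace so the VMM (e.g. qemu with a gdbstub) can handle it as a
 * software breakpoint; other hypercalls stay in the kernel. */
static int handle_hypercall(uint64_t code, struct kvm_run *run)
{
    if (code == KVM_HCALL_SWDBG) {
        run->exit_reason = KVM_EXIT_DEBUG;
        return RET_TO_USER;
    }
    /* service hypercalls (PV IPI, steal time, ...) handled in kernel */
    return RET_TO_GUEST;
}
```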

Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Signed-off-by: Xianglai Li <lixianglai@loongson.cn>
commit 4bf3b972cad4 ("LoongArch: KVM: Add mmio trace events support")
Conflict: none
Backport-reason: Synchronize upstream linux loongarch kvm
patch to support loongarch virtualization.
Checkpatch: no, to be consistent with upstream commit.

Add mmio trace event support; currently the generic mmio events
KVM_TRACE_MMIO_WRITE, KVM_TRACE_MMIO_READ and
KVM_TRACE_MMIO_READ_UNSATISFIED are added here.

A vcpu id field is also added to all kvm trace events, since the perf
kvm tool parses vcpu id information from the kvm entry event.

Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Signed-off-by: Xianglai Li <lixianglai@loongson.cn>
commit 2584b0859859 ("LoongArch: KVM: Sync pending interrupt when getting ESTAT from user mode")
Conflict: none
Backport-reason: Synchronize upstream linux loongarch kvm
patch to support loongarch virtualization.
Checkpatch: no, to be consistent with upstream commit.

Currently interrupts are posted and cleared asynchronously, and they
are saved in the SW state vcpu::arch::irq_pending and
vcpu::arch::irq_clear. When a vcpu is ready to run, the pending
interrupts are written back to the CSR.ESTAT register from
vcpu::arch::irq_pending at guest entry.

During the VM migration stage the vcpu is put into stopped state, but
pending interrupts are not synced to the CSR.ESTAT register, so
interrupts can be lost when the VCPU is migrated to another host
machine.

With this patch, when the ESTAT CSR register is read from VMM user
mode, pending interrupts are synchronized into ESTAT as well, so the
VMM gets the correct pending interrupts.
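The fold-in on read can be sketched as below; the struct fields mirror the SW state named above, but the function name and bit semantics are simplifications for illustration.

```c
#include <stdint.h>

struct vcpu_irq_state {
    uint64_t irq_pending;  /* SW-posted interrupts not yet in ESTAT */
    uint64_t irq_clear;    /* SW-cleared interrupts */
    uint64_t estat;        /* emulated CSR.ESTAT */
};

/* Normally the SW irq state is folded into ESTAT at guest entry; when
 * the VMM reads ESTAT from a stopped vCPU, fold it in here too so no
 * pending bit is lost across migration. */
static uint64_t read_estat(struct vcpu_irq_state *v)
{
    v->estat |= v->irq_pending;   /* apply posted interrupts */
    v->estat &= ~v->irq_clear;    /* apply cleared interrupts */
    v->irq_pending = 0;
    v->irq_clear = 0;
    return v->estat;
}
```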

Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Signed-off-by: Xianglai Li <lixianglai@loongson.cn>
commit 846f1299b2e2 ("LoongArch: KVM: Delay secondary mmu tlb flush until guest entry")
Conflict: none
Backport-reason: Synchronize upstream linux loongarch kvm
patch to support loongarch virtualization.
Checkpatch: no, to be consistent with upstream commit.

With hardware assisted virtualization there are two levels of HW mmu:
one is the GVA to GPA mapping, the other is the GPA to HPA mapping,
generically called the secondary mmu. On a page fault in the secondary
mmu, a tlb flush indexed by the faulting GPA address and the VMID is
needed. The VMID is stored in the register CSR.GSTAT and is reloaded or
recalculated before guest entry.

Currently CSR.GSTAT is not saved and restored during VCPU context
switch; instead it is recalculated during guest entry. So CSR.GSTAT is
only valid while a VCPU runs in guest mode and may be stale once the
VCPU exits to host mode: it may record the VMID of the last
scheduled-out VCPU rather than the current one.

Function kvm_flush_tlb_gpa() should be called with the real VMID, so it
is moved to guest entry. An arch-specific request id
KVM_REQ_TLB_FLUSH_GPA is added to flush the tlb for the secondary mmu,
and the flush can be skipped when the VMID is updated, since a VMID
update already invalidates all guest tlb entries.
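A sketch of the deferred flush, under assumed names (the request-bit mechanics and the `vmid_stale` flag are illustrative stand-ins for the kernel's request machinery):

```c
#include <stdbool.h>
#include <stdint.h>

#define KVM_REQ_TLB_FLUSH_GPA (1u << 0)  /* assumed request bit */

struct vcpu_req {
    uint32_t requests;
    bool vmid_updated;    /* VMID will be recalculated at entry */
    int  flushes_issued;  /* instrumentation for this sketch */
};

/* Page-fault path: don't flush with a possibly stale VMID, just post
 * a request that guest entry will honour. */
static void mark_flush_gpa(struct vcpu_req *v)
{
    v->requests |= KVM_REQ_TLB_FLUSH_GPA;
}

/* Guest entry: the VMID is correct here. A VMID update invalidates
 * every guest tlb entry anyway, so the explicit flush can be elided. */
static void guest_entry(struct vcpu_req *v)
{
    if (v->requests & KVM_REQ_TLB_FLUSH_GPA) {
        v->requests &= ~KVM_REQ_TLB_FLUSH_GPA;
        if (!v->vmid_updated)
            v->flushes_issued++;  /* kvm_flush_tlb_gpa() would run here */
    }
    v->vmid_updated = false;
}
```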

Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Signed-off-by: Xianglai Li <lixianglai@loongson.cn>
commit 038f365107ed ("LoongArch: KVM: Select huge page only if secondary mmu supports it")
Conflict: none
Backport-reason: Synchronize upstream linux loongarch kvm
patch to support loongarch virtualization.
Checkpatch: no, to be consistent with upstream commit.

Currently the page level selection for the secondary mmu depends only
on the memory slot and the page level of the host mmu. There are
problems if the page level of the secondary mmu is already zero: a huge
page cannot be selected if a normal page is already mapped in the
secondary mmu, since merging normal pages into huge pages is not
supported yet.

So page level selection should depend on the following three conditions:
 1. The memslot is aligned for huge pages and the vm is not migrating.
 2. The page level of the host mmu is also a huge page.
 3. The page level of the secondary mmu is suitable for a huge page.
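The three conditions collapse into one predicate, sketched below with illustrative parameter names (the real code derives each flag from the memslot, the host page tables, and the secondary mmu walk):

```c
#include <stdbool.h>

/* Huge page is selected only when all three conditions above hold:
 * aligned non-migrating memslot, huge host mapping, and a secondary
 * mmu level that can still take a huge mapping. */
static bool can_use_huge_page(bool slot_aligned, bool migrating,
                              bool host_huge, bool secondary_level_ok)
{
    return slot_aligned && !migrating && host_huge && secondary_level_ok;
}
```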

Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Signed-off-by: Xianglai Li <lixianglai@loongson.cn>
commit 1706dcacd354 ("LoongArch: KVM: Discard dirty page tracking on readonly memslot")
Conflict: none
Backport-reason: Synchronize upstream linux loongarch kvm
patch to support loongarch virtualization.
Checkpatch: no, to be consistent with upstream commit.

For a readonly memslot such as UEFI BIOS or UEFI variable space, the
guest cannot write to this memory space directly, so it is not
necessary to track dirty pages for a readonly memslot. Make this
optimization in function kvm_arch_commit_memory_region().

Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Signed-off-by: Xianglai Li <lixianglai@loongson.cn>
commit 21d29364e1d1 ("LoongArch: KVM: Add memory barrier before update pmd entry")
Conflict: none
Backport-reason: Synchronize upstream linux loongarch kvm
patch to support loongarch virtualization.
Checkpatch: no, to be consistent with upstream commit.

When updating a pmd entry, such as when allocating a new pmd page or
splitting a huge page into normal pages, it is necessary to update all
pte entries first and then update the pmd entry.

LoongArch systems are weakly ordered, so there will be problems if
other VCPUs observe the pmd update before the pte updates. smp_wmb()
is added to ensure this ordering.
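In userspace C11 terms the required ordering looks like this, with `atomic_thread_fence(memory_order_release)` standing in for smp_wmb(); the pmd/pte variables are simplified stand-ins for the page tables:

```c
#include <stdatomic.h>
#include <stdint.h>

static uint64_t pte[4];                 /* the new pte page */
static _Atomic(uint64_t *) pmd;         /* the pmd slot pointing at it */

static void install_pte_page(void)
{
    /* 1. Populate every pte entry first. */
    for (int i = 0; i < 4; i++)
        pte[i] = 0x1000u * (uint64_t)(i + 1);

    /* 2. smp_wmb(): order the pte stores before the pmd publish, so a
     *    weakly ordered reader that sees the new pmd also sees the ptes. */
    atomic_thread_fence(memory_order_release);

    /* 3. Only now publish the pmd entry. */
    atomic_store_explicit(&pmd, pte, memory_order_relaxed);
}
```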

Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Signed-off-by: Xianglai Li <lixianglai@loongson.cn>
commit e31e55145f76 ("LoongArch: KVM: Add dirty bitmap initially all set support")
Conflict: none
Backport-reason: Synchronize upstream linux loongarch kvm
patch to support loongarch virtualization.
Checkpatch: no, to be consistent with upstream commit.

Add KVM_DIRTY_LOG_INITIALLY_SET support on LoongArch system, this
feature comes from other architectures like x86 and arm64.

Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Signed-off-by: Xianglai Li <lixianglai@loongson.cn>
commit dfd5f11227e6 ("LoongArch: KVM: Mark page accessed and dirty with page ref added")
Conflict: none
Backport-reason: Synchronize upstream linux loongarch kvm
patch to support loongarch virtualization.
Checkpatch: no, to be consistent with upstream commit.

Function kvm_map_page_fast() is the fast path of the secondary mmu page
fault flow; the pfn is parsed by the secondary mmu page table walker.
However, no reference is taken on the corresponding page, so it is
dangerous to access the page outside of mmu_lock.

Here a page reference is taken inside mmu_lock, and
kvm_set_pfn_accessed() and kvm_set_pfn_dirty() are called with the
reference held, so that the page cannot be freed by others.

Also, the explicit kvm_set_pfn_accessed() call is removed, since it is
already done by the following kvm_release_pfn_clean().

Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Signed-off-by: Xianglai Li <lixianglai@loongson.cn>
commit d808899636ea ("LoongArch: KVM: always make pte young in page map's fast path")
Conflict: none
Backport-reason: Synchronize upstream linux loongarch kvm
patch to support loongarch virtualization.
Checkpatch: no, to be consistent with upstream commit.

It seems redundant to check if pte is young before the call to
kvm_pte_mkyoung() in kvm_map_page_fast(). Just remove the check.

Reviewed-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Jia Qingtong <jiaqingtong97@gmail.com>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Signed-off-by: Xianglai Li <lixianglai@loongson.cn>
commit 40e12dbc794b ("LoongArch: KVM: Add PV steal time support in host side")
Conflict: none
Backport-reason: Synchronize upstream linux loongarch kvm
patch to support loongarch virtualization.
Checkpatch: no, to be consistent with upstream commit.

Add ParaVirt steal time feature in host side, VM can search supported
features provided by KVM hypervisor, a feature KVM_FEATURE_STEAL_TIME
is added here. Like x86, steal time structure is saved in guest memory,
one hypercall function KVM_HCALL_FUNC_NOTIFY is added to notify KVM to
enable this feature.

One CPU attr ioctl command KVM_LOONGARCH_VCPU_PVTIME_CTRL is added to
save and restore the base address of steal time structure when a VM is
migrated.

Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Signed-off-by: Xianglai Li <lixianglai@loongson.cn>
commit 00c8fa6c61c5 ("LoongArch: KVM: Add PV steal time support in guest side")
Conflict: none
Backport-reason: Synchronize upstream linux loongarch kvm
patch to support loongarch virtualization.
Checkpatch: no, to be consistent with upstream commit.

A per-cpu struct kvm_steal_time is added here; its size is 64 bytes and
it is also aligned to 64 bytes, so the whole structure stays within one
physical page.

When a VCPU comes online, function pv_enable_steal_time() is called. It
passes the guest physical address of struct kvm_steal_time and tells
the hypervisor to enable steal time. When a vcpu goes offline, the
physical address is set to 0 to tell the hypervisor to disable steal
time.
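The size/alignment contract can be sketched as follows. The field layout here is illustrative (only `steal` and a version counter are typical of such records); the point is the 64-byte size and alignment, which keep the structure inside a single cache line and a single physical page wherever the percpu area places it.

```c
#include <stdint.h>

/* Illustrative layout of a 64-byte, 64-byte-aligned steal time record. */
struct kvm_steal_time {
    uint64_t steal;    /* ns of CPU time stolen by the host */
    uint32_t version;  /* even = stable, odd = update in progress */
    uint32_t flags;
    uint8_t  pad[48];  /* pad out to exactly 64 bytes */
} __attribute__((aligned(64)));
```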

Here is an output of vmstat on guest when there is workload on both host
and guest. It shows steal time stat information.

procs -----------memory---------- -----io---- -system-- ------cpu-----
 r  b   swpd   free  inact active   bi    bo   in   cs us sy id wa st
15  1      0 7583616 184112  72208    20    0  162   52 31  6 43  0 20
17  0      0 7583616 184704  72192    0     0 6318 6885  5 60  8  5 22
16  0      0 7583616 185392  72144    0     0 1766 1081  0 49  0  1 50
16  0      0 7583616 184816  72304    0     0 6300 6166  4 62 12  2 20
18  0      0 7583632 184480  72240    0     0 2814 1754  2 58  4  1 35

Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Signed-off-by: Xianglai Li <lixianglai@loongson.cn>
commit 0063265ee183 ("perf kvm: Add kvm-stat for loongarch64")
Conflict: none
Backport-reason: Synchronize upstream linux loongarch kvm
patch to support loongarch virtualization.
Checkpatch: no, to be consistent with upstream commit.

Add support for 'perf kvm stat' on the loongarch64 platform; currently
only the kvm exit event is supported.

Here is example output of the "perf kvm --host stat report" command:

   Event name   Samples   Sample%     Time (ns)   Time%   Mean Time (ns)
    Mem Store     83969    51.00%     625697070   8.00%             7451
     Mem Read     37641    22.00%     112485730   1.00%             2988
    Interrupt     15542     9.00%      20620190   0.00%             1326
        IOCSR     15207     9.00%      94296190   1.00%             6200
    Hypercall      4873     2.00%      12265280   0.00%             2516
         Idle      3713     2.00%    6322055860  87.00%          1702681
          FPU      1819     1.00%       2750300   0.00%             1511
   Inst Fetch       502     0.00%       1341740   0.00%             2672
   Mem Modify       324     0.00%        602240   0.00%             1858
       CPUCFG        55     0.00%         77610   0.00%             1411
          CSR        12     0.00%         19690   0.00%             1640
         LASX         3     0.00%          4870   0.00%             1623
          LSX         2     0.00%          2100   0.00%             1050

Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Signed-off-by: Xianglai Li <lixianglai@loongson.cn>
commit 05086585ea4f ("KVM: Discard zero mask with function kvm_dirty_ring_reset")
Conflict: none
Backport-reason: Synchronize upstream linux loongarch kvm
patch to support loongarch virtualization.
Checkpatch: no, to be consistent with upstream commit.

Function kvm_reset_dirty_gfn() may be called with its parameters
cur_slot / cur_offset / mask all zero, which does not represent a real
dirty page, so there is no need to clear dirty bits in that case. Also,
the return value of __fls(), which kvm_reset_dirty_gfn() calls, is
undefined if mask is zero. Just return in that case.
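The guard can be sketched as below; the body is reduced to counting the bits it would clear, and `__builtin_popcountll` stands in for the real bitmap-clearing work (the kernel's `__fls()` is what is undefined for a zero argument).

```c
#include <stdint.h>

/* All-zero (slot, offset, mask) is not a real dirty record, and
 * __fls(0) would be undefined, so return early. Otherwise return how
 * many dirty bits the mask selects (stand-in for the real clearing). */
static int reset_dirty_gfn(uint32_t cur_slot, uint64_t cur_offset,
                           uint64_t mask)
{
    if (!cur_slot && !cur_offset && !mask)
        return 0;
    /* real code: find the memslot, clear the dirty bits in `mask`
     * starting at cur_offset */
    return __builtin_popcountll(mask);
}
```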

Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Message-ID: <20240613122803.1031511-1-maobibo@loongson.cn>
[Move the conditional inside kvm_reset_dirty_gfn; suggested by
 Sean Christopherson. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Xianglai Li <lixianglai@loongson.cn>
commit 0c9fa3e92629 ("LoongArch: KVM: Invalidate guest steal time address on vCPU reset")
Conflict: none
Backport-reason: Synchronize upstream linux loongarch kvm
patch to support loongarch virtualization.
Checkpatch: no, to be consistent with upstream commit.

If the ParaVirt steal time feature is enabled, a percpu gpa address is
passed from the guest vCPU and the host modifies guest memory at this
gpa. When a vCPU is reset normally, it notifies the host and
invalidates the gpa address.

However, if the VM crashes and the VMM reboots it forcibly, the vCPU
reboot notification callback is not called in the VM. The host needs to
invalidate the gpa address itself, otherwise it will keep modifying
guest memory while the VM reboots. Here it is invalidated from the vCPU
KVM_REG_LOONGARCH_VCPU_RESET ioctl interface.

Also, function kvm_reset_timer() is removed from the vCPU reset stage,
since the SW emulated timer is only used in the vCPU blocked state.
When a vCPU is removed from the block waiting queue,
kvm_restore_timer() is called and the SW timer is cancelled. The timer
register is also cleared by the VMM when a vCPU is reset.

Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Signed-off-by: Xianglai Li <lixianglai@loongson.cn>
lixianglai and others added 13 commits February 28, 2025 19:33
Upstream: no

Signed-off-by: Xianglai Li <lixianglai@loongson.cn>
Upstream: no

On LoongArch systems there are two places where the cpu numa node is
set: one is the arch-specific function smp_prepare_boot_cpu(), the
other is the generic function early_numa_node_init(). The latter
overwrites the numa node information.

For a hot-added cpu without numa information, cpu_logical_map() fails
to find its physical cpuid at first, since the cpu is not enabled in
the ACPI MADT table. So early_cpu_to_node() also fails to get the numa
node for the hot-added cpu, and the generic early_numa_node_init()
overwrites it with an incorrect numa node.

APIs topo_get_cpu() and topo_add_cpu() are added here; as on other
architectures, the logical cpu is allocated when parsing the MADT
table. When parsing the SRAT table or hot-adding a cpu, the logical cpu
is acquired by searching all allocated logical cpus for a matching
physical id. This solves problems such as:
  1. The boot cpu is not the first entry in the MADT table; the first
entry would otherwise be overwritten by the later boot cpu.
  2. A physical cpu id not present in the MADT table is invalid; in
later SRAT/hot-add cpu parsing, such an invalid physical cpu is
detected.
  3. For a hot-added cpu, the logical cpu is allocated during MADT
table parsing, so early_cpu_to_node() can be used for the hot-added cpu
and cpu_to_node() is correct for it.

Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Xianglai Li <lixianglai@loongson.cn>
Upstream: no

Fix pch pic spinlock deadlock

Signed-off-by: Xianglai Li <lixianglai@loongson.cn>
Upstream: no

Add iommu support for LoongArch.

Signed-off-by: Xianglai Li <lixianglai@loongson.cn>
Upstream: no

When the virtual machine is restarted, the data in extioi is not
zeroed; residual set interrupt bits remain and result in a hang.

Signed-off-by: Xianglai Li <lixianglai@loongson.cn>
Upstream: no

Fix an issue where 16 vfio devices could not be passed through to VMs.

Signed-off-by: Xianglai Li <lixianglai@loongson.cn>
Upstream: no

Register LLBCTL is a CSR register kept separate between host and guest:
a host exception eret instruction clears the host LLBCTL CSR register,
and a guest exception clears the guest LLBCTL CSR register.

VCPU0 atomic64_fetch_add_unless      VCPU1 atomic64_fetch_add_unless
  0:   ll.d    %[p],  %[c]
       beq     %[p],  %[u], 1f
Here the secondary mmu mapping is changed and the hpa page is replaced
with a new page. VCPU1 then executes the atomic instruction on the
new page.
                                     0:   ll.d    %[p],  %[c]
                                          beq     %[p],  %[u], 1f
                                          add.d   %[rc], %[p], %[a]
                                          sc.d    %[rc], %[c]
       add.d   %[rc], %[p], %[a]
       sc.d    %[rc], %[c]
LLBCTL is still on, indicating the memory has not been modified, so
sc.d will modify the memory directly.

Clear the guest LLBCTL_WCLLB register when the mapping is changed.

Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Xianglai Li <lixianglai@loongson.cn>
Upstream: no

Add the ptw feature bit to cpucfg

Signed-off-by: Xianglai Li <lixianglai@loongson.cn>
Upstream: no

Delete duplicate function definitions

Signed-off-by: Xianglai Li <lixianglai@loongson.cn>
Upstream: no

We need to empty the data in irqchip when the VM restarts.

Signed-off-by: Xianglai Li <lixianglai@loongson.cn>
Upstream: no

Fix an occasional physical machine crash during repeated restarts of a
virtual machine with device passthrough.

Signed-off-by: Xianglai Li <lixianglai@loongson.cn>
Upstream: no

Before ptw is enabled, when a virtual machine writes data to a physical
page, a page modification exception is triggered. The exception handler
sets the dirty bit of the pte and sets kvm's page dirty bitmap. During
migration, the page dirty bitmap drives dirty page migration. After ptw
is enabled, when the virtual machine writes data to a physical page,
the ptw hardware writes the dirty bit of the pte directly without
triggering a page modification exception. kvm then cannot set the page
dirty bitmap correctly, resulting in partial data loss and migration
failure.

To solve this problem, we use the write bit and dirty bit of the pte to
mark whether the current page needs to be migrated: the write bit and
dirty bit of the pte are cleared to zero at the beginning of the
migration, so that even with ptw enabled a page modification exception
is triggered and the correct dirty page marking path is entered,
completing a correct migration of memory. Bit 50 of the pte is used to
record the original write permission, so that the correct write
permission of the pte can be restored after the dirty page marking is
completed.
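The pte manipulation described above can be sketched as below. The bit positions for the write and dirty bits are illustrative assumptions; only the use of bit 50 as the save slot follows the description.

```c
#include <stdint.h>

#define PTE_WRITE   (1ULL << 8)   /* illustrative write-permission bit */
#define PTE_DIRTY   (1ULL << 9)   /* illustrative dirty bit */
#define PTE_WR_SAVE (1ULL << 50)  /* bit 50: stashed original write bit */

/* Migration start: clear write+dirty so even the ptw hardware faults
 * on the next store, stashing the original write permission in bit 50. */
static uint64_t pte_mkclean_for_migration(uint64_t pte)
{
    if (pte & PTE_WRITE)
        pte |= PTE_WR_SAVE;
    return pte & ~(PTE_WRITE | PTE_DIRTY);
}

/* Fault path during migration: after marking the page dirty in kvm's
 * bitmap (elided), restore the saved write permission. */
static uint64_t pte_mkdirty_restore(uint64_t pte)
{
    if (pte & PTE_WR_SAVE)
        pte |= PTE_WRITE | PTE_DIRTY;
    return pte & ~PTE_WR_SAVE;
}
```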

Signed-off-by: Xianglai Li <lixianglai@loongson.cn>
Upstream: no

The extioi controller could only send interrupts to four cpus and could
not reach the other cpus. This patch enables extioi to send interrupts
to up to 256 cpus.

Signed-off-by: Xianglai Li <lixianglai@loongson.cn>

sourcery-ai bot commented Feb 28, 2025

🧙 Sourcery has finished reviewing your pull request!


@deepin-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign mingcongbai for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@deepin-ci-robot

Hi @lixianglai. Thanks for your PR.

I'm waiting for a deepin-community member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.


@sourcery-ai sourcery-ai bot left a comment


Hey @lixianglai - I've reviewed your changes - here's some feedback:

Overall Comments:

  • The PMU save/restore functions could be simplified by using a loop instead of unrolling the registers.
  • Consider using helper functions to improve the readability of the MMIO read/write emulation code.
Here's what I looked at during the review
  • 🟡 General issues: 1 issue found
  • 🟢 Security: all looks good
  • 🟢 Testing: all looks good
  • 🟡 Complexity: 5 issues found
  • 🟢 Documentation: all looks good

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

return ret;

/* Register loongarch pch pic interrupt controller interface. */
ret = kvm_loongarch_register_pch_pic_device();

suggestion (bug_risk): Device registration return value handling.

Since the registration of pch_pic (and the other devices) is performed sequentially, ensure that on failure, any previously registered devices are correctly cleaned up to avoid resource leaks or inconsistent state.

Suggested implementation:

```c
	ret = kvm_loongarch_register_ipi_device();
	if (ret)
		goto fail_ipi;

	/* Register loongarch extioi interrupt controller interface. */
	ret = kvm_loongarch_register_extioi_device();
	if (ret)
		goto fail_extioi;

	/* Register loongarch pch pic interrupt controller interface. */
	ret = kvm_loongarch_register_pch_pic_device();
	if (ret)
		goto fail_pch_pic;

	return 0;

fail_pch_pic:
	/* Clean up previously registered extioi device */
	kvm_loongarch_unregister_extioi_device();
fail_extioi:
	/* Clean up previously registered ipi device */
	kvm_loongarch_unregister_ipi_device();
fail_ipi:
	return ret;
}
```

Ensure that the functions kvm_loongarch_unregister_ipi_device() and kvm_loongarch_unregister_extioi_device() exist and correctly reverse the initialization performed during registration.

}

#define KVM_IPI_CLUSTER_SIZE (2 * BITS_PER_LONG)

static void pv_send_ipi_mask(const struct cpumask *mask, unsigned int action)

issue (complexity): Consider extracting the bit-twiddling and clustering logic in pv_send_ipi_mask() into a helper function to improve readability and reduce nested conditionals within the loop.

Consider extracting the bit‐twiddling and clustering logic in
`pv_send_ipi_mask()` into a helper function to isolate the “cluster update”
logic. This would reduce the number of nested conditionals in the loop and
make the code easier to follow. For example, you can create a helper that
updates the cluster variables and the bitmap:

```c
static void update_cluster(int cpu, int *min, int *max, __uint128_t *bitmap)
{
    if (!*bitmap) {
        *min = *max = cpu;
    } else if (cpu < *min && cpu > (*max - KVM_IPI_CLUSTER_SIZE)) {
        *bitmap <<= (*min - cpu);
        *min = cpu;
    } else if (cpu > *min && cpu < (*min + KVM_IPI_CLUSTER_SIZE)) {
        *max = cpu > *max ? cpu : *max;
    } else {
        /* Cluster full; send IPI with current bitmap and start new cluster */
        kvm_hypercall3(KVM_HCALL_FUNC_IPI, (unsigned long)*bitmap,
                        (unsigned long)(*bitmap >> BITS_PER_LONG), *min);
        *min = *max = cpu;
        *bitmap = 0;
    }
    __set_bit(cpu - *min, (unsigned long *)bitmap);
}
```

Then in your loop in pv_send_ipi_mask(), simply call:

```c
for_each_cpu(i, mask) {
    info = &per_cpu(irq_stat, i);
    old = atomic_fetch_or(action, &info->message);
    if (old)
        continue;
    cpu = cpu_logical_map(i);
    update_cluster(cpu, &min, &max, &bitmap);
}
```

This refactoring isolates the complex decision‐logic and keeps the loop more
readable while retaining all functionality.

@@ -116,6 +116,26 @@ void pud_init(void *addr)
EXPORT_SYMBOL_GPL(pud_init);
#endif

void kernel_pte_init(void *addr)

issue (complexity): Consider using a simple loop or memset instead of an unrolled loop with manual assignments to improve code clarity and maintainability

Consider replacing the unrolled loop with a simple, clear loop. For example, instead of mixing pointer arithmetic like:

```c
    p[-3] = _PAGE_GLOBAL;
    p[-2] = _PAGE_GLOBAL;
    p[-1] = _PAGE_GLOBAL;
```

you could write:

```c
    for (int i = 0; i < NUM_ENTRIES; i++) {
        p[i] = _PAGE_GLOBAL;
    }
```

If all entries share the same value in a contiguous memory block, even a `memset()` (or a helper wrapping it) could make the intent clearer. This change preserves functionality without the non-intuitive backward indexing, improving readability and maintainability.

}
}

static int loongarch_extioi_writeb(struct kvm_vcpu *vcpu,

issue (complexity): Consider extracting common bit-manipulation loops and update routines into helper functions to improve readability and reduce code duplication, such as the IRQ update loop using ffs().

Consider extracting common bit-manipulation loops and update routines as helper functions to reduce nested inline functions and duplicated control flow. For instance, the loops that iterate with ffs() (or equivalent) to update IRQ state appear in multiple places. Consolidating this code into a well—named helper function will make the intent clearer and decouple the locking/update logic.

For example, you could refactor the IRQ update loop into a helper function:

```c
static void update_pending_irq(struct loongarch_extioi *s, int base_irq, u8 bitmask, int level)
{
    int irq = ffs(bitmask);
    while (irq) {
        extioi_update_irq(s, base_irq + irq - 1, level);
        bitmask &= ~(1 << (irq - 1));
        irq = ffs(bitmask);
    }
}
```

Then in your write functions you would replace patterns like:

```c
irq = ffs(coreisr);
while (irq != 0) {
    extioi_update_irq(s, irq - 1 + index * 8, 0);
    coreisr &= ~(1 << (irq - 1));
    irq = ffs(coreisr);
}
```

with:

```c
update_pending_irq(s, index * 8, coreisr, 0);
```

Similarly, extract routines for common bit fiddling when handling enable/disable transitions. This keeps functionality intact while improving readability and reducing nesting.

#include <linux/count_zeros.h>

/* update the isr according to irq level and route irq to extioi */
static void pch_pic_update_irq(struct loongarch_pch_pic *s, int irq, int level)

issue (complexity): Consider simplifying the 'pch_pic_update_irq' function by using early returns and using the 'for_each_set_bit()' macro in 'pch_pic_update_batch_irqs' to reduce nested logic and repeated code.

It would help to simplify the duplicated branches by “failing‐fast” in the update function. For example, you can remove the duplicated code by using early returns rather than nested if/else. Also consider using the kernel helper macros (e.g. for_each_set_bit()) in your batch update.

Suggested refactoring for pch_pic_update_irq:

```c
static void pch_pic_update_irq(struct loongarch_pch_pic *s, int irq, int level)
{
    u64 mask = 1ULL << irq;

    if (level) {
        if (!(mask & s->irr & ~s->mask))
            return;
        s->isr |= mask;
    } else {
        if (!(mask & s->isr & ~s->irr))
            return;
        s->isr &= ~mask;
    }

    irq = s->htmsi_vector[irq];
    extioi_set_irq(s->kvm->arch.extioi, irq, level);
}
```

And for pch_pic_update_batch_irqs:

```c
static void pch_pic_update_batch_irqs(struct loongarch_pch_pic *s, u64 irq_mask, int level)
{
    unsigned int irq;
    for_each_set_bit(irq, &irq_mask, sizeof(irq_mask) * 8)
        pch_pic_update_irq(s, irq, level);
}
```

These changes reduce nested logic and repeated code while preserving the original behavior.

spin_unlock(&vcpu->arch.ipi_state.lock);
}

static uint64_t read_mailbox(struct kvm_vcpu *vcpu, int offset, int len)

issue (complexity): Consider refactoring the common lock-acquire/release and mailbox access patterns to reduce duplication: extract helper functions, consolidate the read/write logic into a single function with a boolean write parameter, add thin wrapper functions for read and write, and apply the same factoring to other areas with repeated lock patterns or switch statements.

Consider refactoring common lock‐acquire/release and mailbox access patterns. For example, instead of duplicating similar locking and pointer arithmetic in both read_mailbox and write_mailbox, you can extract a helper:

```c
static inline void *get_mailbox_ptr(struct kvm_vcpu *vcpu, int offset)
{
    return (void *)vcpu->arch.ipi_state.buf + (offset - 0x20);
}
```

Then rewrite the two functions to only differ in the read/write part:

```c
static uint64_t mailbox_access(struct kvm_vcpu *vcpu, int offset, int len,
                               bool write, uint64_t data)
{
    void *pbuf;
    uint64_t ret = 0;

    spin_lock(&vcpu->arch.ipi_state.lock);
    pbuf = get_mailbox_ptr(vcpu, offset);
    if (write) {
        if (len == 1)
            *(unsigned char *)pbuf = (unsigned char)data;
        else if (len == 2)
            *(unsigned short *)pbuf = (unsigned short)data;
        else if (len == 4)
            *(unsigned int *)pbuf = (unsigned int)data;
        else if (len == 8)
            *(unsigned long *)pbuf = (unsigned long)data;
        else
            kvm_err("%s: unknown data len: %d\n", __func__, len);
    } else {
        if (len == 1)
            ret = *(unsigned char *)pbuf;
        else if (len == 2)
            ret = *(unsigned short *)pbuf;
        else if (len == 4)
            ret = *(unsigned int *)pbuf;
        else if (len == 8)
            ret = *(unsigned long *)pbuf;
        else
            kvm_err("%s: unknown data len: %d\n", __func__, len);
    }
    spin_unlock(&vcpu->arch.ipi_state.lock);

    return ret;
}
```

Then redefine the read and write versions as simple wrappers:

```c
static uint64_t read_mailbox(struct kvm_vcpu *vcpu, int offset, int len)
{
    return mailbox_access(vcpu, offset, len, false, 0);
}

static void write_mailbox(struct kvm_vcpu *vcpu, int offset,
                          uint64_t data, int len)
{
    (void)mailbox_access(vcpu, offset, len, true, data);
}
```

Similarly, you could factor out the repeated lock patterns in ipi_send/ipi_clear or in the switch statements (e.g. by mapping register offsets to member pointers) to reduce duplicate code paths.

These are non‐breaking changes that consolidate duplicated logic and lower cognitive complexity.

case 8:
ret = loongarch_extioi_writel(vcpu, extioi, addr, len, val);
break;
default:
Collaborator

`ret` is not assigned in the `default` case.

}

loongarch_ext_irq_unlock(extioi, flags);

Collaborator

Extra blank line?

case 8:
ret = loongarch_extioi_readl(vcpu, extioi, addr, len, val);
break;
default:
Collaborator

`ret` is not assigned in the `default` case.

__setup("loongarch_iommu=", la_iommu_setup);

static const struct pci_device_id loongson_iommu_pci_tbl[] = {
{ PCI_DEVICE(0x14, 0x3c0f) },
Member

Should `PCI_VENDOR_ID_LOONGSON` be used here?

return ret;
}

if (h->length == 0)
Member

Should the `if (h->type == la_iommu_target_ivhd_type)` check be placed before `if (h->length == 0)` here?

if (h->length == 0)
break;

if (*p == la_iommu_target_ivhd_type) {
Member

The other place uses `if (h->type == la_iommu_target_ivhd_type)`; can the two be made consistent?

10 participants