v2.6.0

June 7, 2025

🚀 Major features

Optimize scheduler log
Support Enflame gcu-share
Support Metax GPU and Metax sGPU
Helm chart: add checksum annotation to restart HAMi components after ConfigMap modification
Support for using RuntimeClass with NVIDIA devices
Add support for profiling via net/http/pprof package
Add NVIDIA GPU topology score registry to node
vGPUmonitor: support MigInfo metrics

🐛 Major bug fixes

Fix hang on driver 570+
Fix device memory not counted properly in ComfyUI tasks
Fix Cambricon devices not allocated properly
Fix wrong log and container request device count error
Fix inconsistent vgpu-devices-allocated annotations
Fix removing node devices from node manager
Fix dynamic GPU partitioning lacking single-GPU-level granularity
Fix device memory count error on cuMallocAsync
Fix scheduler crash when a 'mig' task accidentally runs on a 'hami-core' GPU
Fix multi-process device memory count

📝 What's Changed

⬆️ Dependencies

Bump docker/build-push-action from 6.11.0 to 6.13.0 by (@dependabot) in #837
Bump golang.org/x/net from 0.26.0 to 0.35.0 by (@dependabot) in #859
Bump aquasecurity/trivy-action from 0.29.0 to 0.30.0 by (@dependabot) in #941
Bump docker/login-action from 3.3.0 to 3.4.0 by (@dependabot) in #942
Bump docker/build-push-action from 6.13.0 to 6.15.0 by (@dependabot) in #899
build(deps): bump docker/build-push-action from 6.15.0 to 6.16.0 by (@dependabot) in #1024
build(deps): bump docker/build-push-action from 6.16.0 to 6.17.0 by (@dependabot) in #1052
build(deps): bump docker/build-push-action from 6.17.0 to 6.18.0 by (@dependabot) in #1091

🔨 Other Changes

fix: Enhance GPU metrics collection and error handling in vGPU monitor by (@haitwang-cloud) in #827
refactor: update service configurations for device plugin and scheduler by (@haitwang-cloud) in #799
add ut for scheduler/score by (@shijinye) in #853
add ut for device/metax by (@shijinye) in #850
Remove duplicate log fields by (@learner0810) in #860
[docs] Fix default nvidia.resourceCoreName value in config.md by (@chinaran) in #842
Update libvgpu.so by (@archlitchi) in #876
update example.png by (@rockpanda) in #874
support ascend 910B2 by (@ouyangluwei163) in #885
fix docs typos by (@JinVei) in #869
Accelerate node score calculations using multiple goroutines by (@learner0810) in #824
Support Metax sGPU for GPU sharing by (@Kyrie336) in #895
docs: fix broken community links by (@agilgur5) in #907
add config gpu core isolation policy for webhook by (@lengrongfu) in #901
feat: support scheduler replicas > 1 by (@Azusa-Yuan) in #898
docs: add syntax highlighting to various code blocks by (@agilgur5) in #906
Fix UT not being properly executed during CI phase by (@archlitchi) in #911
typo: fix typos in log and comment by (@popsiclexu) in #917
feat: Add kube-qps and kube-burst parameters. by (@chaunceyjiang) in #769
docs: Update MAINTAINERS file with current contributor information by (@Nimbus318) in #918
Nominate chaunceyjiang to reviewer by (@chaunceyjiang) in #926
build: update dependencies and remove unused cdiapi by (@yxxhero) in #903
add lengrongfu to reviewers by (@lengrongfu) in #937
chore: add namespace override for multi-namespace deployments by (@chinaran) in #924
fix: hygon dcu concurrent creation conflict by (@joy717) in #921
Fix the wrong description of device registry in protocol.md by (@hurricane1988) in #910
chore: helm chart support scheduler webhook cert-manager by (@chinaran) in #951
refactor(scheduler): replace init methods with constructor functions by (@yxxhero) in #905
add Dependencies policy and Security policy by (@yangshiqi) in #934
scheduler: fix blocking of the nodeNotify channel when node changes by (@Iceber) in #964
docs: Update Ascend910 support documentation by (@zhaikangqi331) in #988
update iluvatar's docs by (@yangshiqi) in #995
refactor: replace interface{} with any in various files by (@yxxhero) in #1000
scheduler: fix duplicate handling of the node label selector by (@Iceber) in #965
refactor(.github/workflows/ci.yaml): Update golangci-lint to v2.0 and modify .golangci.yaml by (@yxxhero) in #1002
update hami arch by (@wawa0210) in #1007
Update README.md by (@yowenter) in #1005
refactor: simplify code by using modern constructs by (@Shouren) in #978
scheduler: fix removing node devices from node manager by (@Iceber) in #966
feat: Add support for profiling via net/http/pprof package by (@Shouren) in #963
Support Enflame gcushare for enflame devices by (@archlitchi) in #1013
docs: Remove ACTIVE_OOM_KILLER environment variable description by (@chinaran) in #1015
refactor(vGPUmonitor): change Run to RunE and return errors by (@yxxhero) in #999
Refactor the filter logs and event messages to enhance their clarity by (@Wangmin362) in #1023
feat: Support for using RuntimeClass with nvidia devices by (@chinaran) in #1021
fix wrong log and container request device count error by (@Wangmin362) in #1020
feat: helm chart add checksum annotation for restarting hami component after ConfigMap modification by (@chinaran) in #1022
fix inconsistent vgpu-devices-allocated annotations (#991) by (@ouyangluwei163) in #1012
add Enflame GCU S60 into roadmap. by (@winston-zhang-orz) in #1030
add nvidia-smi command show cuda version info by (@lengrongfu) in #953
Separate options from client to make the responsibility more clear. by (@yangshiqi) in #938
Add nvidia gpu topology score registry to node by (@lengrongfu) in #1018
fix(cicd): update ci.yaml to upload coverage to Codecov by (@Shouren) in #1056
feat(Actions): Add an action to label pr automatically by (@Shouren) in #1053
fix: Improve Metax GPU usability and fix related issues by (@Kyrie336) in #1063
fix(chart): support GKE pre-release versions via kubeVersion '-0' by (@Nimbus318) in #1072
fix: Dynamic GPU partitioning lacks single-GPU-level granularity. (#1… by (@Goend) in #1061
update maintainer information by (@wawa0210) in #1079
add LIBCUDA_LOG_LEVEL env to device-plugin by (@lengrongfu) in #1087
fix: missing apiVersion in serviceMonitor dashboard docs by (@ntheanh201) in #1077
test(pkg/util): Add some unit tests for pkg/util by (@Shouren) in #1067
feat: vGPUmonitor support MigInfo metrics by (@ouyangluwei163) in #1048
update hami-core version by (@lengrongfu) in #1082

Committers: 🆕 New Contributors

rockpanda (@rockpanda)
ouyangluwei163 (@ouyangluwei163)
JinVei (@JinVei)
Shouren (@Shouren)
Kyrie336 (@Kyrie336)
agilgur5 (@agilgur5)
Azusa-Yuan (@Azusa-Yuan)
popsiclexu (@popsiclexu)
hurricane1988 (@hurricane1988)
Iceber (@Iceber)
zhaikangqi331 (@zhaikangqi331)
yowenter (@yowenter)
Wangmin362 (@Wangmin362)
winston-zhang-orz (@winston-zhang-orz)
Goend (@Goend)
ntheanh201 (@ntheanh201)

Full Changelog: https://github.com/Project-HAMi/HAMi/compare/v2.5.3...v2.6.0
