v2.6.0
🚀 Major features
- Optimize scheduler log
- Support Enflame gcu-share
- Support Metax GPU and Metax sGPU
- Helm chart adds a checksum annotation to restart HAMi components after ConfigMap modification
- Support using RuntimeClass with NVIDIA devices
- Add support for profiling via the net/http/pprof package
- Add NVIDIA GPU topology score registry to node
- vGPUmonitor supports MigInfo metrics
🐛 Major bug fixes
- Fix getting stuck with driver 570+
- Fix device memory not counted properly in ComfyUI tasks
- Fix Cambricon devices not allocated properly
- Fix wrong log and container request device count error
- Fix inconsistent vgpu-devices-allocated annotations
- Fix removing node devices from node manager
- Fix dynamic GPU partitioning lacking single-GPU-level granularity
- Fix device memory count error on cuMallocAsync
- Fix scheduler crash when a 'mig' task accidentally runs on a 'hami-core' GPU
- Fix multi-process device memory count
📝 What's changed
⬆️ Dependencies
- Bump docker/build-push-action from 6.11.0 to 6.13.0, by @dependabot, PR #837
- Bump golang.org/x/net from 0.26.0 to 0.35.0, by @dependabot, PR #859
- Bump aquasecurity/trivy-action from 0.29.0 to 0.30.0, by @dependabot, PR #941
- Bump docker/login-action from 3.3.0 to 3.4.0, by @dependabot, PR #942
- Bump docker/build-push-action from 6.13.0 to 6.15.0, by @dependabot, PR #899
- build(deps): bump docker/build-push-action from 6.15.0 to 6.16.0, by @dependabot, PR #1024
- build(deps): bump docker/build-push-action from 6.16.0 to 6.17.0, by @dependabot, PR #1052
- build(deps): bump docker/build-push-action from 6.17.0 to 6.18.0, by @dependabot, PR #1091
🔨 Other changes
- fix: Enhance GPU metrics collection and error handling in vGPU monitor, by @haitwang-cloud, PR #827
- refactor: update service configurations for device plugin and scheduler, by @haitwang-cloud, PR #799
- add ut for scheduler/score, by @shijinye, PR #853
- add ut for device/metax, by @shijinye, PR #850
- Remove duplicate log fields, by @learner0810, PR #860
- [docs] Fix default nvidia.resourceCoreName value in config.md, by @chinaran, PR #842
- Update libvgpu.so, by @archlitchi, PR #876
- update example.png, by @rockpanda, PR #874
- support ascend 910B2, by @ouyangluwei163, PR #885
- fix docs typos, by @JinVei, PR #869
- Accelerate node score calculations using multiple goroutines, by @learner0810, PR #824
- Support Metax SGPU to sharing GPU, by @Kyrie336, PR #895
- docs: fix broken commmunity links, by @agilgur5, PR #907
- add config gpu core isolation policy for webhook, by @lengrongfu, PR #901
- feat: support scheduler replicas > 1, by @Azusa-Yuan, PR #898
- docs: add syntax highlighting to various code blocks, by @agilgur5, PR #906
- Fix UT not be properly executed during CI phase, by @archlitchi, PR #911
- typo: fix typos in log and comment, by @popsiclexu, PR #917
- feat: Add kube-qps and kube-burst parameters., by @chaunceyjiang, PR #769
- docs: Update MAINTAINERS file with current contributor information, by @Nimbus318, PR #918
- Nominate chaunceyjiang to reviewer, by @chaunceyjiang, PR #926
- build: update dependencies and remove unused cdiapi, by @yxxhero, PR #903
- add lengrongfu to reviewers, by @lengrongfu, PR #937
- chore: add namespace override for multi-namespace deployments, by @chinaran, PR #924
- fix: hygon dcu concurrent creation conflict, by @joy717, PR #921
- Fix the wrong describe of device registry in protocol.md, by @hurricane1988, PR #910
- chore: helm chart support scheduler webhook cert-manager, by @chinaran, PR #951
- refactor(scheduler): replace init methods with constructor functions, by @yxxhero, PR #905
- add Dependencies policy and Security policy, by @yangshiqi, PR #934
- scheduler: fix blocked the nodeNotify channel when node changes, by @Iceber, PR #964
- docs: Update Ascend910 support documentation, by @zhaikangqi331, PR #988
- update iluvatar's docs, by @yangshiqi, PR #995
- refactor: replace interface{} with any in various files, by @yxxhero, PR #1000
- scheduler: fix duplicate handling of the node label selector, by @Iceber, PR #965
- refactor(.github/workflows/ci.yaml): Update golangci-lint to v2.0 and modify .golangci.yaml, by @yxxhero, PR #1002
- update hami arch, by @wawa0210, PR #1007
- Update README.md, by @yowenter, PR #1005
- refactor: simplify code by using modern constructs, by @Shouren, PR #978
- scheduler: fix removing node devices from node manager, by @Iceber, PR #966
- feat: Add support for profiling via net/http/pprof package, by @Shouren, PR #963
- Support Enflame gcushare for enflame devices, by @archlitchi, PR #1013
- docs: Remove ACTIVE_OOM_KILLER environment variable description, by @chinaran, PR #1015
- refactor(vGPUmonitor): change Run to RunE and return errors, by @yxxhero, PR #999
- refactored the filter logs and event messages to enhance their clarity, by @Wangmin362, PR #1023
- feat: Support for using RuntimeClass with nvidia devices, by @chinaran, PR #1021
- fix wrong log and container request device count error, by @Wangmin362, PR #1020
- feat: helm chart add checksum annotation for restarting hami component after ConfigMap modification, by @chinaran, PR #1022
- fix vgpu-devices-allocated annotations are inconsistent #991, by @ouyangluwei163, PR #1012
- add Enflame GCU S60 into roadmap., by @winston-zhang-orz, PR #1030
- add nvidia-smi command show cuda version info, by @lengrongfu, PR #953
- Separate options from client to make the responsibility more clear., by @yangshiqi, PR #938
- Add nvidia gpu topoloy score registry to node, by @lengrongfu, PR #1018
- fix(cicd): update ci.yaml to upload coverage to Codecov, by @Shouren, PR #1056
- feat(Actions): Add an action to label pr automatically, by @Shouren, PR #1053
- fix: Improve Metax GPU usability and fix related issues, by @Kyrie336, PR #1063
- fix(chart): support GKE pre-release versions via kubeVersion '-0'