
v2.6.0

agilgur5
Azusa-Yuan
Goend
hurricane1988
Iceber
JinVei
Kyrie336
ntheanh201
ouyangluwei163
popsiclexu

🚀 Major features​

  • Optimize scheduler log
  • Support Enflame gcu-share
  • Support Metax GPU and Metax sGPU
  • Helm chart: add checksum annotation to restart HAMi components after ConfigMap modification
  • Support using RuntimeClass with NVIDIA devices
  • Add support for profiling via the net/http/pprof package
  • Add NVIDIA GPU topology score registry to node
  • vGPUmonitor: support MigInfo metrics

🐛 Major bug fixes​

  • Fix hang on driver versions 570+
  • Fix device memory not counted properly in ComfyUI tasks
  • Fix Cambricon devices not allocated properly
  • Fix wrong log and container request device count error
  • Fix inconsistent vgpu-devices-allocated annotations
  • Fix removing node devices from node manager
  • Fix dynamic GPU partitioning lacking single-GPU-level granularity
  • Fix device memory count error on cuMallocAsync
  • Fix scheduler crash when a 'mig' task accidentally runs on a 'hami-core' GPU
  • Fix multi-process device memory count

📝 What's Changed​

⬆️ Dependencies

  • Bump docker/build-push-action from 6.11.0 to 6.13.0 by (@dependabot) in #837
  • Bump golang.org/x/net from 0.26.0 to 0.35.0 by (@dependabot) in #859
  • Bump aquasecurity/trivy-action from 0.29.0 to 0.30.0 by (@dependabot) in #941
  • Bump docker/login-action from 3.3.0 to 3.4.0 by (@dependabot) in #942
  • Bump docker/build-push-action from 6.13.0 to 6.15.0 by (@dependabot) in #899
  • build(deps): bump docker/build-push-action from 6.15.0 to 6.16.0 by (@dependabot) in #1024
  • build(deps): bump docker/build-push-action from 6.16.0 to 6.17.0 by (@dependabot) in #1052
  • build(deps): bump docker/build-push-action from 6.17.0 to 6.18.0 by (@dependabot) in #1091

🔨 Other Changes​

  • fix: Enhance GPU metrics collection and error handling in vGPU monitor by (@haitwang-cloud) in #827
  • refactor: update service configurations for device plugin and scheduler by (@haitwang-cloud) in #799
  • add ut for scheduler/score by (@shijinye) in #853
  • add ut for device/metax by (@shijinye) in #850
  • Remove duplicate log fields by (@learner0810) in #860
  • [docs] Fix default nvidia.resourceCoreName value in config.md by (@chinaran) in #842
  • Update libvgpu.so by (@archlitchi) in #876
  • update example.png by (@rockpanda) in #874
  • support ascend 910B2 by (@ouyangluwei163) in #885
  • fix docs typos by (@JinVei) in #869
  • Accelerate node score calculations using multiple goroutines by (@learner0810) in #824
  • Support Metax SGPU to sharing GPU by (@Kyrie336) in #895
  • docs: fix broken community links by (@agilgur5) in #907
  • add config gpu core isolation policy for webhook by (@lengrongfu) in #901
  • feat: support scheduler replicas > 1 by (@Azusa-Yuan) in #898
  • docs: add syntax highlighting to various code blocks by (@agilgur5) in #906
  • Fix UT not be properly executed during CI phase by (@archlitchi) in #911
  • typo: fix typos in log and comment by (@popsiclexu) in #917
  • feat: Add kube-qps and kube-burst parameters. by (@chaunceyjiang) in #769
  • docs: Update MAINTAINERS file with current contributor information by (@Nimbus318) in #918
  • Nominate chaunceyjiang to reviewer by (@chaunceyjiang) in #926
  • build: update dependencies and remove unused cdiapi by (@yxxhero) in #903
  • add lengrongfu to reviewers by (@lengrongfu) in #937
  • chore: add namespace override for multi-namespace deployments by (@chinaran) in #924
  • fix: hygon dcu concurrent creation conflict by (@joy717) in #921
  • Fix the wrong describe of device registry in protocol.md by (@hurricane1988) in #910
  • chore: helm chart support scheduler webhook cert-manager by (@chinaran) in #951
  • refactor(scheduler): replace init methods with constructor functions by (@yxxhero) in #905
  • add Dependencies policy and Security policy by (@yangshiqi) in #934
  • scheduler: fix blocked the nodeNotify channel when node changes by (@Iceber) in #964
  • docs: Update Ascend910 support documentation by (@zhaikangqi331) in #988
  • update iluvatar's docs by (@yangshiqi) in #995
  • refactor: replace interface{} with any in various files by (@yxxhero) in #1000
  • scheduler: fix duplicate handling of the node label selector by (@Iceber) in #965
  • refactor(.github/workflows/ci.yaml): Update golangci-lint to v2.0 and modify .golangci.yaml by (@yxxhero) in #1002
  • update hami arch by (@wawa0210) in #1007
  • Update README.md by (@yowenter) in #1005
  • refactor: simplify code by using modern constructs by (@Shouren) in #978
  • scheduler: fix removing node devices from node manager by (@Iceber) in #966
  • feat: Add support for profiling via net/http/pprof package by (@Shouren) in #963
  • Support Enflame gcushare for enflame devices by (@archlitchi) in #1013
  • docs: Remove ACTIVE_OOM_KILLER environment variable description by (@chinaran) in #1015
  • refactor(vGPUmonitor): change Run to RunE and return errors by (@yxxhero) in #999
  • refactored the filter logs and event messages to enhance their clarity, by (@Wangmin362) in #1023
  • feat: Support for using RuntimeClass with nvidia devices by (@chinaran) in #1021
  • fix wrong log and container request device count error by (@Wangmin362) in #1020
  • feat: helm chart add checksum annotation for restarting hami component after ConfigMap modification by (@chinaran) in #1022
  • fix vgpu-devices-allocated annotations are inconsistent #991 by (@ouyangluwei163) in #1012
  • add Enflame GCU S60 into roadmap. by (@winston-zhang-orz) in #1030
  • add nvidia-smi command show cuda version info by (@lengrongfu) in #953
  • Separate options from client to make the responsibility more clear. by (@yangshiqi) in #938
  • Add nvidia gpu topology score registry to node by (@lengrongfu) in #1018
  • fix(cicd): update ci.yaml to upload coverage to Codecov by (@Shouren) in #1056
  • feat(Actions): Add an action to label pr automatically by (@Shouren) in #1053
  • fix: Improve Metax GPU usability and fix related issues by (@Kyrie336) in #1063
  • fix(chart): support GKE pre-release versions via kubeVersion '-0' by (@Nimbus318) in #1072
  • fix: Dynamic GPU partitioning lacks single-GPU-level granularity. (#1… by (@Goend) in #1061
  • update maintainer information by (@wawa0210) in #1079
  • add LIBCUDA_LOG_LEVEL env to device-plugin by (@lengrongfu) in #1087
  • fix: missing apiVersion in serviceMonitor dashboard docs by (@ntheanh201) in #1077
  • test(pkg/util): Add some unit tests for pkg/util by (@Shouren) in #1067
  • feat: vGPUmonitor support MigInfo metrics by (@ouyangluwei163) in #1048
  • update hami-core version by (@lengrongfu) in #1082

Committers: 🆕 New Contributors​

Full Changelog: https://github.com/Project-HAMi/HAMi/compare/v2.5.3...v2.6.0

HAMi is a CNCF Sandbox project