v2.6.0

agilgur5
Azusa-Yuan
Goend
hurricane1988
Iceber
JinVei
Kyrie336
ntheanh201
ouyangluwei163
popsiclexu

🚀 Major features

  • Optimize scheduler log
  • Support enflame gcu-share
  • Support metax GPU and metax sGPU
  • Helm chart add checksum annotation for restarting hami component after ConfigMap modification
  • Support for using RuntimeClass with nvidia devices
  • Add support for profiling via net/http/pprof package
  • Add nvidia gpu topology score registry to node
  • Feat: vGPUmonitor support MigInfo metrics

🐛 Major bug fixes

  • Fix stuck in driver 570+
  • Fix device memory not counted properly in comfyUI task
  • Fix cambricon devices not allocated properly
  • Fix wrong log and container request device count error
  • Fix vgpu-devices-allocated annotations are inconsistent
  • Fix removing node devices from node manager
  • Fix: Dynamic GPU partitioning lacks single-GPU-level granularity
  • Fix device memory count error on cuMallocAsync
  • Fix scheduler crash if a 'mig' task accidentally runs on a 'hami-core' GPU
  • Fix multi-process device memory count

📝 What's changed

⬆️ Dependencies

  • Bump docker/build-push-action from 6.11.0 to 6.13.0 by @dependabot in PR #837
  • Bump golang.org/x/net from 0.26.0 to 0.35.0 by @dependabot in PR #859
  • Bump aquasecurity/trivy-action from 0.29.0 to 0.30.0 by @dependabot in PR #941
  • Bump docker/login-action from 3.3.0 to 3.4.0 by @dependabot in PR #942
  • Bump docker/build-push-action from 6.13.0 to 6.15.0 by @dependabot in PR #899
  • build(deps): bump docker/build-push-action from 6.15.0 to 6.16.0 by @dependabot in PR #1024
  • build(deps): bump docker/build-push-action from 6.16.0 to 6.17.0 by @dependabot in PR #1052
  • build(deps): bump docker/build-push-action from 6.17.0 to 6.18.0 by @dependabot in PR #1091

🔨 Other changes

  • fix: Enhance GPU metrics collection and error handling in vGPU monitor by @haitwang-cloud in PR #827
  • refactor: update service configurations for device plugin and scheduler by @haitwang-cloud in PR #799
  • add ut for scheduler/score by @shijinye in PR #853
  • add ut for device/metax by @shijinye in PR #850
  • Remove duplicate log fields by @learner0810 in PR #860
  • [docs] Fix default nvidia.resourceCoreName value in config.md by @chinaran in PR #842
  • Update libvgpu.so by @archlitchi in PR #876
  • update example.png by @rockpanda in PR #874
  • support ascend 910B2 by @ouyangluwei163 in PR #885
  • fix docs typos by @JinVei in PR #869
  • Accelerate node score calculations using multiple goroutines by @learner0810 in PR #824
  • Support Metax SGPU to sharing GPU by @Kyrie336 in PR #895
  • docs: fix broken commmunity links by @agilgur5 in PR #907
  • add config gpu core isolation policy for webhook by @lengrongfu in PR #901
  • feat: support scheduler replicas > 1 by @Azusa-Yuan in PR #898
  • docs: add syntax highlighting to various code blocks by @agilgur5 in PR #906
  • Fix UT not be properly executed during CI phase by @archlitchi in PR #911
  • typo: fix typos in log and comment by @popsiclexu in PR #917
  • feat: Add kube-qps and kube-burst parameters. by @chaunceyjiang in PR #769
  • docs: Update MAINTAINERS file with current contributor information by @Nimbus318 in PR #918
  • Nominate chaunceyjiang to reviewer by @chaunceyjiang in PR #926
  • build: update dependencies and remove unused cdiapi by @yxxhero in PR #903
  • add lengrongfu to reviewers by @lengrongfu in PR #937
  • chore: add namespace override for multi-namespace deployments by @chinaran in PR #924
  • fix: hygon dcu concurrent creation conflict by @joy717 in PR #921
  • Fix the wrong describe of device registry in protocol.md by @hurricane1988 in PR #910
  • chore: helm chart support scheduler webhook cert-manager by @chinaran in PR #951
  • refactor(scheduler): replace init methods with constructor functions by @yxxhero in PR #905
  • add Dependencies policy and Security policy by @yangshiqi in PR #934
  • scheduler: fix blocked the nodeNotify channel when node changes by @Iceber in PR #964
  • docs: Update Ascend910 support documentation by @zhaikangqi331 in PR #988
  • update iluvatar's docs by @yangshiqi in PR #995
  • refactor: replace interface{} with any in various files by @yxxhero in PR #1000
  • scheduler: fix duplicate handling of the node label selector by @Iceber in PR #965
  • refactor(.github/workflows/ci.yaml): Update golangci-lint to v2.0 and modify .golangci.yaml by @yxxhero in PR #1002
  • update hami arch by @wawa0210 in PR #1007
  • Update README.md by @yowenter in PR #1005
  • refactor: simplify code by using modern constructs by @Shouren in PR #978
  • scheduler: fix removing node devices from node manager by @Iceber in PR #966
  • feat: Add support for profiling via net/http/pprof package by @Shouren in PR #963
  • Support Enflame gcushare for enflame devices by @archlitchi in PR #1013
  • docs: Remove ACTIVE_OOM_KILLER environment variable description by @chinaran in PR #1015
  • refactor(vGPUmonitor): change Run to RunE and return errors by @yxxhero in PR #999
  • refactored the filter logs and event messages to enhance their clarity, by @Wangmin362 in PR #1023
  • feat: Support for using RuntimeClass with nvidia devices by @chinaran in PR #1021
  • fix wrong log and container request device count error by @Wangmin362 in PR #1020
  • feat: helm chart add checksum annotation for restarting hami component after ConfigMap modification by @chinaran in PR #1022
  • fix vgpu-devices-allocated annotations are inconsistent #991 by @ouyangluwei163 in PR #1012
  • add Enflame GCU S60 into roadmap. by @winston-zhang-orz in PR #1030
  • add nvidia-smi command show cuda version info by @lengrongfu in PR #953
  • Separate options from client to make the responsibility more clear. by @yangshiqi in PR #938
  • Add nvidia gpu topoloy score registry to node by @lengrongfu in PR #1018
  • fix(cicd): update ci.yaml to upload coverage to Codecov by @Shouren in PR #1056
  • feat(Actions): Add an action to label pr automatically by @Shouren in PR #1053
  • fix: Improve Metax GPU usability and fix related issues by @Kyrie336 in PR #1063
  • fix(chart): support GKE pre-release versions via kubeVersion '-0' by @Nimbus318 in PR #1072
  • fix: Dynamic GPU partitioning lacks single-GPU-level granularity. (#1… by @Goend in PR #1061
  • update maintainer information by @wawa0210 in PR #1079
  • add LIBCUDA_LOG_LEVEL env to device-plugin by @lengrongfu in PR #1087
  • fix: missing apiVersion in serviceMonitor dashboard docs by @ntheanh201 in PR #1077
  • test(pkg/util): Add some unit tests for pkg/util by @Shouren in PR #1067
  • feat: vGPUmonitor support MigInfo metrics by @ouyangluwei163 in PR #1048
  • update hami-core version by @lengrongfu in PR #1082

Contributors: 🆕 New contributors

Full changelog: https://github.com/Project-HAMi/HAMi/compare/v2.5.3...v2.6.0

HAMi is a CNCF Sandbox project