v2.5.0
🚀 Major features
- Support the dynamic MIG feature; please refer to this document
- Reinstalling HAMi will NOT crash GPU tasks
- Put all configurations into a ConfigMap; you can customize the HAMi installation by modifying its content: see details
🐛 Major bug fixes
- Fix an issue where hami-core gets stuck on tasks using 'cuMallocAsync'
- Fix hami-core getting stuck on images with a high glibc version, like 'tf-serving:latest'
📝 What's changed
⬆️ Dependencies
- Bump aquasecurity/trivy-action from 0.28.0 to 0.29.0, by @dependabot, PR #631
- Bump nvidia/cuda from 12.4.1-base-ubuntu22.04 to 12.6.3-base-ubuntu22.04 in /docker, by @dependabot, PR #676
- Bump actions/upload-artifact from 4.4.3 to 4.5.0, by @dependabot, PR #717
- Bump docker/build-push-action from 6.9.0 to 6.10.0, by @dependabot, PR #644
- Bump docker/build-push-action from 6.10.0 to 6.11.0, by @dependabot, PR #792
🔨 Other changes
- Fix Kubernetes version string handling by stripping metadata, by @Nimbus318, PR #623
- Update vGPUmonitor to add dynamic adjustment on core and memory limit, by @archlitchi, PR #624
- feat: support device plugin daemonset update strategy, by @devenami, PR #628
- add ut about schedule policy, by @yt-huang, PR #638
- Fix: Refactor the license based on the approaches used in OpenSearch and ElasticSearch., by @haitwang-cloud, PR #626
- add ut for the scheduler, by @shijinye, PR #645
- docs(issue-tmpl): add FAQ link to issue templates, by @Nimbus318, PR #647
- fix: filter device registry to node, by @lengrongfu, PR #639
- Add self-hosted runner, by @archlitchi, PR #659
- fix-example-yaml, by @WQL782795, PR #667
- update docs, by @yangshiqi, PR #668
- add ut for ascend, by @shijinye, PR #664
- optimization map init in test, by @lengrongfu, PR #678
- Optimize monitor, by @for800000, PR #683
- fix code lint failed, by @lengrongfu, PR #685
- fix(helm): Add NODE_NAME env var to the vgpu-monitor container from spec.nodeName, by @Nimbus318, PR #687
- fix vGPUmonitor deviceidx is always 0, by @lengrongfu, PR #684
- add ut for pkg/scheduler/event.go, by @Penguin-zlh, PR #688
- add ut for nodes, by @shijinye, PR #695
- add license for pkg/scheduler/event_test.go, by @Penguin-zlh, PR #706
- fix: exception happen when creating multiple ascend-gpu pods concurrently, by @lijm87, PR #575
- add ut for device/nvidia, by @shijinye, PR #657
- add ut for pkg/monitor/nvidia/v0/spec.go, by @yt-huang, PR #670
- Enable Dynamic-mig feature for HAMi, by @archlitchi, PR #708
- Fix chart can not be deployed properly, by @archlitchi, PR #711
- Fix NodeLock issue, by @archlitchi, PR #714
- fix example yaml, by @lixd, PR #709
- add ut for device/cambricon, by @shijinye, PR #712
- Update dynamic mig documents and examples, by @archlitchi, PR #718
- random time may be zero, by @shijinye, PR #697
- fix grafana dashboard and clarify dashboard usage more clearly., by @jiangsanyin, PR #543
- doc(README): add examples for GPU sharing and update-examples, by @xiaoyao, PR #665
- add ut for github.com/Project-HAMi/HAMi/pkg/scheduler/pod.go, by @yt-huang, PR #673
- Add design document to 'dynamic-mig' feature, by @archlitchi, PR #725
- fix(doc): fix a typo and resolve markdown warnings in the tasklist, by @elrondwong, PR #724
- add ut for pkg/util/nodelock/nodelock.go, by @learner0810, PR #719
- test: add ut for pkg/version/version.go, by @Penguin-zlh, PR #677
- Update on mig mode, by @archlitchi, PR #726
- Update documents for config & config_cn, by @archlitchi, PR #729
- set PASS_DEVICE_SPECS ENV to device-plugin, by @jingzhe6414, PR #690
- fix device-plugin-version, by @learner0810, PR #743
- feat: Return the nodes that failed to be scheduled back to the scheduler, by @chaunceyjiang, PR #746
- fix(log): fix missing log output in nvidiadeviceplugin server, by @elrondwong, PR #735
- support configuration resources limits and requests, by @flpanbin, PR #739
- feat(test): add TestMarshalNodeDevices scenarios, by @elrondwong, PR #747
- print flags for device-plugin and scheduler, by @flpanbin, PR #756
- Fix typos, add more contributors and maintainers., by @yangshiqi, PR #765
- Add a mind map(Chinese and English) to help understand this project, by @oceanweave, PR #764
- [Docs] update config pages, by @windsonsea, PR #760
- add ut for device-map, by @KubeKyrie, PR #762
- refactor(ci): use go.mod file for Go version in workflows, by @yxxhero, PR #766
- support set log level for device plugin, by @flpanbin, PR #771
- feat: Restart/Upgrade device-plugin will not affect services., by @chaunceyjiang, PR #767
- add ut nvml devices, by @KubeKyrie, PR #773
- add ut for device-map, by @KubeKyrie, PR #772
- Optimize the time format layout, by @learner0810, PR #741
- fix: nvidia-device-plugin no version info, by @chaunceyjiang, PR #779
- HAMi supports e2e, by @Rei1010, PR #775
- Proposal: enable E2E test, by @Rei1010, PR #633
- add ut for device/iluvatar, by @shijinye, PR #795
- add ut for device/hygon, by @shijinye, PR #787
- add ut for pkg/monitor/nvidia/v1, by @shijinye, PR #780
- refactor(logging): enhance log messages for device resource counting, by @haitwang-cloud, PR #778
- Enrich pod health check, by @Rei1010, PR #801
- docs: fix broken link, by @lixd, PR #802
- Optimize the E2E execution logic, by @Rei1010, PR #803
- optimize MetricsBindAddress to MetricsBindPort, by @phoenixwu0229, PR #796
- fix: handle the node nil issue & E2E test failure, by @haitwang-cloud, PR #804
- add ut for device/mthreads, by @shijinye, PR #808
- fix: Resolve formatting issue in ConfigMap causing display anomalies, by @lixd, PR #814
- [docs] Update ascend910b-support.md, by @windsonsea, PR #816
- Refine metrics logs, by @haitwang-cloud, PR #817
- Update mig-related logics and refine logs, by @archlitchi, PR #833
- Add 910B4 config to device-configmap for ascend, by @lijm87, PR #828
- [docs] fix: glibc version requirement in README, by @chinaran, PR #826
- Update HAMi-core for v2.5.0, by @archlitchi, PR #834
- FIx multi-process device memory count issue, by @archlitchi, PR #835
- bump version to v2.5.0, by @wawa0210, PR #836
- Fix CI, by @archlitchi, PR #838
- Fix CI release, by @archlitchi, PR #840
- Fix release ci, by @archlitchi, PR #841
- Fix Dockerfile to make CI pass, by @archlitchi, PR #846
- Fix E2E failure with pod status check, by @Rei1010, PR #847
- Fix scheduler crash if a 'mig' task running accidentally on a 'hami-core' GPU, by @archlitchi, PR #848
Contributors: 🆕 New contributors
- yt-huang (@yt-huang)
- shijinye (@shijinye)
- WQL782795 (@WQL782795)
- yangshiqi (@yangshiqi)
- for800000 (@for800000)
- Penguin-zlh (@Penguin-zlh)
- lixd (@lixd)
- jiangsanyin (@jiangsanyin)
- xiaoyao (@xiaoyao)
- elrondwong (@elrondwong)
- learner0810 (@learner0810)
- jingzhe6414 (@jingzhe6414)
- flpanbin (@flpanbin)
- oceanweave (@oceanweave)
- windsonsea (@windsonsea)
- KubeKyrie (@KubeKyrie)
- yxxhero (@yxxhero)
- Rei1010 (@Rei1010)
- phoenixwu0229 (@phoenixwu0229)
- chinaran (@chinaran)
Full changelog: https://github.com/Project-HAMi/HAMi/compare/v2.4.1...v2.5.0









