v2.5.0

chinaran
elrondwong
flpanbin
for800000
jiangsanyin
jingzhe6414
KubeKyrie
learner0810
lixd
oceanweave

🚀 Major features

  • Support the dynamic MIG feature; please refer to this document
  • Reinstalling HAMi will NOT crash GPU tasks
  • All configuration is now kept in a ConfigMap; you can customize the HAMi installation by modifying its content: see details

🐛 Major bug fixes

  • Fix an issue where hami-core gets stuck on tasks using 'cuMallocAsync'
  • Fix hami-core getting stuck on images with a newer glibc, like 'tf-serving:latest'

📝 What's changed

⬆️ Dependencies

  • Bump aquasecurity/trivy-action from 0.28.0 to 0.29.0 by @dependabot in PR #631
  • Bump nvidia/cuda from 12.4.1-base-ubuntu22.04 to 12.6.3-base-ubuntu22.04 in /docker by @dependabot in PR #676
  • Bump actions/upload-artifact from 4.4.3 to 4.5.0 by @dependabot in PR #717
  • Bump docker/build-push-action from 6.9.0 to 6.10.0 by @dependabot in PR #644
  • Bump docker/build-push-action from 6.10.0 to 6.11.0 by @dependabot in PR #792

🔨 Other changes

  • Fix Kubernetes version string handling by stripping metadata by @Nimbus318 in PR #623
  • Update vGPUmonitor to add dynamic adjustment on core and memory limit by @archlitchi in PR #624
  • feat: support device plugin daemonset update strategy by @devenami in PR #628
  • add ut about schedule policy by @yt-huang in PR #638
  • Fix: Refactor the license based on the approaches used in OpenSearch and ElasticSearch by @haitwang-cloud in PR #626
  • add ut for the scheduler by @shijinye in PR #645
  • docs(issue-tmpl): add FAQ link to issue templates by @Nimbus318 in PR #647
  • fix: filter device registry to node by @lengrongfu in PR #639
  • Add self-hosted runner by @archlitchi in PR #659
  • fix-example-yaml by @WQL782795 in PR #667
  • update docs by @yangshiqi in PR #668
  • add ut for ascend by @shijinye in PR #664
  • optimization map init in test by @lengrongfu in PR #678
  • Optimize monitor by @for800000 in PR #683
  • fix code lint failed by @lengrongfu in PR #685
  • fix(helm): Add NODE_NAME env var to the vgpu-monitor container from spec.nodeName by @Nimbus318 in PR #687
  • fix vGPUmonitor deviceidx is always 0 by @lengrongfu in PR #684
  • add ut for pkg/scheduler/event.go by @Penguin-zlh in PR #688
  • add ut for nodes by @shijinye in PR #695
  • add license for pkg/scheduler/event_test.go by @Penguin-zlh in PR #706
  • fix: exception happen when creating multiple ascend-gpu pods concurrently by @lijm87 in PR #575
  • add ut for device/nvidia by @shijinye in PR #657
  • add ut for pkg/monitor/nvidia/v0/spec.go by @yt-huang in PR #670
  • Enable Dynamic-mig feature for HAMi by @archlitchi in PR #708
  • Fix chart can not be deployed properly by @archlitchi in PR #711
  • Fix NodeLock issue by @archlitchi in PR #714
  • fix example yaml by @lixd in PR #709
  • add ut for device/cambricon by @shijinye in PR #712
  • Update dynamic mig documents and examples by @archlitchi in PR #718
  • random time may be zero by @shijinye in PR #697
  • fix grafana dashboard and clarify dashboard usage more clearly by @jiangsanyin in PR #543
  • doc(README): add examples for GPU sharing and update-examples by @xiaoyao in PR #665
  • add ut for github.com/Project-HAMi/HAMi/pkg/scheduler/pod.go by @yt-huang in PR #673
  • Add design document to 'dynamic-mig' feature by @archlitchi in PR #725
  • fix(doc): fix a typo and resolve markdown warnings in the tasklist by @elrondwong in PR #724
  • add ut for pkg/util/nodelock/nodelock.go by @learner0810 in PR #719
  • test: add ut for pkg/version/version.go by @Penguin-zlh in PR #677
  • Update on mig mode by @archlitchi in PR #726
  • Update documents for config & config_cn by @archlitchi in PR #729
  • set PASS_DEVICE_SPECS ENV to device-plugin by @jingzhe6414 in PR #690
  • fix device-plugin-version by @learner0810 in PR #743
  • feat: Return the nodes that failed to be scheduled back to the scheduler by @chaunceyjiang in PR #746
  • fix(log): fix missing log output in nvidiadeviceplugin server by @elrondwong in PR #735
  • support configuration resources limits and requests by @flpanbin in PR #739
  • feat(test): add TestMarshalNodeDevices scenarios by @elrondwong in PR #747
  • print flags for device-plugin and scheduler by @flpanbin in PR #756
  • Fix typos, add more contributors and maintainers by @yangshiqi in PR #765
  • Add a mind map (Chinese and English) to help understand this project by @oceanweave in PR #764
  • [Docs] update config pages by @windsonsea in PR #760
  • add ut for device-map by @KubeKyrie in PR #762
  • refactor(ci): use go.mod file for Go version in workflows by @yxxhero in PR #766
  • support set log level for device plugin by @flpanbin in PR #771
  • feat: Restart/Upgrade device-plugin will not affect services by @chaunceyjiang in PR #767
  • add ut nvml devices by @KubeKyrie in PR #773
  • add ut for device-map by @KubeKyrie in PR #772
  • Optimize the time format layout by @learner0810 in PR #741
  • fix: nvidia-device-plugin no version info by @chaunceyjiang in PR #779
  • HAMi supports e2e by @Rei1010 in PR #775
  • Proposal: enable E2E test by @Rei1010 in PR #633
  • add ut for device/iluvatar by @shijinye in PR #795
  • add ut for device/hygon by @shijinye in PR #787
  • add ut for pkg/monitor/nvidia/v1 by @shijinye in PR #780
  • refactor(logging): enhance log messages for device resource counting by @haitwang-cloud in PR #778
  • Enrich pod health check by @Rei1010 in PR #801
  • docs: fix broken link by @lixd in PR #802
  • Optimize the E2E execution logic by @Rei1010 in PR #803
  • optimize MetricsBindAddress to MetricsBindPort by @phoenixwu0229 in PR #796
  • fix: handle the node nil issue & E2E test failure by @haitwang-cloud in PR #804
  • add ut for device/mthreads by @shijinye in PR #808
  • fix: Resolve formatting issue in ConfigMap causing display anomalies by @lixd in PR #814
  • [docs] Update ascend910b-support.md by @windsonsea in PR #816
  • Refine metrics logs by @haitwang-cloud in PR #817
  • Update mig-related logics and refine logs by @archlitchi in PR #833
  • Add 910B4 config to device-configmap for ascend by @lijm87 in PR #828
  • [docs] fix: glibc version requirement in README by @chinaran in PR #826
  • Update HAMi-core for v2.5.0 by @archlitchi in PR #834
  • Fix multi-process device memory count issue by @archlitchi in PR #835
  • bump version to v2.5.0 by @wawa0210 in PR #836
  • Fix CI by @archlitchi in PR #838
  • Fix CI release by @archlitchi in PR #840
  • Fix release ci by @archlitchi in PR #841
  • Fix Dockerfile to make CI pass by @archlitchi in PR #846
  • Fix E2E failure with pod status check by @Rei1010 in PR #847
  • Fix scheduler crash if a 'mig' task running accidentally on a 'hami-core' GPU by @archlitchi in PR #848

Contributors: 🆕 New contributors

Full changelog: https://github.com/Project-HAMi/HAMi/compare/v2.4.1...v2.5.0

HAMi is a CNCF Sandbox project