Machine Learning/AMD GPU - Wikitech
What GPU model do we have? On what hosts?
For the most up-to-date info on which hosts have GPUs, search for a GPU-specific metric such as amd_rocm_gpu_usage_percent in Grafana Explorer (WMF NDA only). Make sure to set the data source to "Thanos" to cover all clusters.
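For example, pasting a query along these lines into the Explore view (a sketch; adjust the labels to whatever Thanos actually exposes) returns one series per host currently exporting the metric:

# One result per host that exports the ROCm GPU usage metric
count by (instance) (amd_rocm_gpu_usage_percent)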
If you have access to root on cumin hosts, you can also run sudo cumin C:prometheus::node_amd_rocm to get a list of nodes with GPUs.
As of August 2024, the following production hosts have GPUs:
Hosts with GPU

| Node               | Cluster    | GPU                       | VRAM  | Intended use                                  |
|--------------------|------------|---------------------------|-------|-----------------------------------------------|
| dse-k8s-worker1001 | DSE        | 2x AMD Radeon Pro WX 9100 | 16GB  | Kubernetes workloads                          |
| dse-k8s-worker1009 | DSE        | 2x AMD Instinct MI210     | 64GB  | Kubernetes workloads                          |
| ml-lab1001         | n/a        | 2x AMD Instinct MI210     | 64GB  | Will be re-made as a build host               |
| ml-lab1002         | n/a        | 2x AMD Instinct MI210     | 64GB  | * Experimentation with e.g. Jupyter notebooks |
| ml-serve1001       | ml-serve   | 2x AMD Radeon Pro WX 9100 | 16GB  | Kubernetes workloads                          |
| ml-serve1009       | ml-serve   | 2x AMD Instinct MI210     | 64GB  | Kubernetes workloads                          |
| ml-serve1010       | ml-serve   | 2x AMD Instinct MI210     | 64GB  | Kubernetes workloads                          |
| ml-serve1011       | ml-serve   | 2x AMD Instinct MI210     | 64GB  | Kubernetes workloads                          |
| ml-serve1012       | ml-serve   | 8x AMD Instinct MI300     | 192GB | * Kubernetes workloads                        |
| ml-serve1013       | ml-serve   | 8x AMD Instinct MI300     | 192GB | * Kubernetes workloads                        |
| ml-serve2009       | ml-serve   | 2x AMD Instinct MI210     | 64GB  | Kubernetes workloads                          |
| ml-serve2010       | ml-serve   | 2x AMD Instinct MI210     | 64GB  | Kubernetes workloads                          |
| ml-serve2011       | ml-serve   | 2x AMD Instinct MI210     | 64GB  | Kubernetes workloads                          |
| ml-staging2001     | ml-staging | AMD Instinct MI100        | 32GB  | Kubernetes workloads (staging)                |
| ml-staging2003     | ml-staging | 2x AMD Instinct MI210     | 64GB  | Kubernetes workloads (staging)                |
| stat1008           | n/a        | AMD Radeon Pro WX 9100    | 16GB  | Experimentation with e.g. Jupyter notebooks   |
| stat1010           | n/a        | AMD Radeon Pro WX 9100    | 16GB  | Experimentation with e.g. Jupyter notebooks   |
NOTE: Entries marked with * are machines that are planned to be installed soon, but are not available yet.
The WMF chose AMD because they are currently the only GPU vendor releasing their software stack as open source.
Do we have Nvidia GPUs?
The short answer is no: we are not planning to use Nvidia cards, now or in the future. Since this is a vendor-specific stance, it is worth explaining why: the Nvidia drivers and tools are not open source, and they represent a risk for the Foundation's policies. These are the main reasons:
Security
: they rely on binary-only blobs (running in the Linux kernel) to work properly. High-severity security vulnerabilities that require a kernel patch and rebuild may not be rolled out on nodes running Nvidia software, since the new kernel may not be compatible (so one has to wait for Nvidia upstream updates before proceeding any further).
Ethical
: the Wikimedia Foundation has a very firm policy on open-source software, and using proprietary-only hardware and software (when there is an alternative) is not contemplated. We risk not being compatible with certain libraries/tools/software that only work with Nvidia CUDA (and related), but we accept the risk. We also go further, trying to promote open-source stacks to solve emergent Data and ML challenges (including working with upstream projects to add support for platforms like
AMD ROCm
).
Cost and availability
: Nvidia cards tend to be more expensive than the available alternatives on the market, and due to their demand, there may be times when their supply is reduced (also to favor big players that demand far more hardware than we do).
Debugging and updates
: with open source projects, it is easier and more effective to track down bugs/incompatibilities/issues/etc. and report them to upstream (as we have already done multiple times with AMD). With proprietary software, it is not that easy, and it is a challenge to get updates when required (since we need to wait for upstream's releases and hope that they fix a specific issue).
At the time of writing (November 2023), Nvidia seems to be oriented towards releasing part of their stack under open-source licenses, but so far this seems more a rumor than a solid direction. In a future where Nvidia and AMD both provide open-source solutions, we'll surely revisit the choice.
Should we run Nvidia cards on cloud providers to bypass the above concerns?
Running in the cloud may be a solution for ad-hoc projects, but some considerations need to be made:
The SRE team and our infrastructure stack don't support any cloud provider at the moment. Wikimedia Enterprise is pioneering on AWS, but they run a completely separate stack from production, and with only public data. All automation and security boundaries that we built for production would need to be re-created somewhere else, or at least the bare minimum needed to consider a service running in the cloud maintainable and secure.
The cost of running ML services with GPUs in the cloud is not cheap nowadays, so it would be a big investment in terms of engineering resources and money. It is not an impossible project, but we have to be realistic and weigh pros and cons before taking any action; finally, we also need to evaluate whether the pros/cons justify the cost that it would take to implement the project.
Should we run Nvidia cards on a subset of hosts in Production with specific security rules and boundaries?
This is an option, but the SRE team wouldn't maintain the solution. This means that the team owning the Nvidia hardware would need to provide the same support that SRE provides in Production, most notably security-wise. For example, we mentioned above that high/critical security issues may require patching the Linux kernel and getting everything rebuilt and rolled out promptly to neutralize any attack surface. Without SRE support, the team owning this special cluster/hardware would need to take care of that extra workload too (and that would surely work against efficiency and transparency for the whole org/Foundation).
Use the Debian packages
See profile::statistics::gpu or the amd_rocm module in operations/puppet.
Use the GPU on the host
All users in analytics-privatedata-users are automatically granted access to the GPUs; otherwise, a user needs to be in the gpu-testers POSIX group in operations/puppet. This is a workaround to force the users in that group into the render POSIX group (available on Debian), which grants access to the GPU. Please keep in mind a few things:
Be careful when launching multiple parallel jobs on the same GPU; see the Outstanding issues section below.
Follow up with the ML team for guidance and best practices!
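As a quick sanity check before launching any workload, something along these lines (a minimal sketch; rocm-smi ships with the ROCm Debian packages described below) confirms that your shell account can actually reach the GPU:

# Verify membership in the render POSIX group, which grants GPU access
id -nG | grep -qw render && echo "GPU access OK"
# List the GPUs that ROCm can see
/opt/rocm/bin/rocm-smi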
Use tensorflow
The easiest solution is to create a Python 3 virtual environment on a stat host and then pip3 install tensorflow-rocm. Please remember that every version of the package is linked against a specific version of ROCm, so it may be possible that newer versions of tensorflow-rocm don't run on our hosts, since we don't have an up-to-date version of ROCm deployed yet. Upstream suggests checking every time which combination of tensorflow-rocm and ROCm is supported.
We have two versions of ROCm deployed:
4.2 on stat100[5,8] - Only tensorflow-rocm 2.5.0 is supported.
5.4.0 on the DSE K8s cluster and Lift Wing - Only tensorflow-rocm 2.11.0.540 is supported.
# Example for stat100x nodes
virtualenv -p python3 test_tf
source test_tf/bin/activate
pip3 install tensorflow-rocm==2.5.0
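Once the environment is ready, a quick check (a sketch, run inside the activated venv; tf.config.list_physical_devices is available from Tensorflow 2.1 onwards) confirms which tensorflow-rocm build was installed and whether it sees the GPU:

# Print the installed version and the GPUs visible to Tensorflow
python3 -c "import tensorflow as tf; print(tf.__version__); print(tf.config.list_physical_devices('GPU'))"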
Experimental
The fact that AMD forked the tensorflow PyPI package poses some challenges when using other Tensorflow-based packages in our infrastructure. Most of them in fact declare the tensorflow package dependency in their setup.py configurations, which means that pip will always try to install it (conflicting with tensorflow-rocm). We are testing a hack that aims to trick pip, installing an empty tensorflow package alongside tensorflow-rocm. This is the procedure:
Create an empty tensorflow package. The version that you use needs to be the same as tensorflow-rocm's. A possible solution is:
mkdir test
cd test
cat > setup.py << EOF
from setuptools import setup, find_packages
setup(
    name='tensorflow',
    version='2.5.0',
    packages=find_packages(),
)
EOF
python3 setup.py bdist_wheel
ls dist/tensorflow-2.5.0-py3-none-any.whl
Create your Python conda/venv environment as always.
pip install /path/to/dist/tensorflow-2.5.0-py3-none-any.whl
pip install tensorflow-rocm==2.5.0
pip install etc.. [namely all packages that require tensorflow, the ones that you are interested in]
You may need to solve some dependency issues when pip-installing this way, since installing packages separately may introduce conflicts that you wouldn't have in "regular" installs. Some useful tools:
pip install pipdeptree; pipdeptree (to check your dependency tree)
pip install pip --upgrade (to get the latest pip version)
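To verify that the stub trick worked (a sketch; pip output formats vary between versions), check that pip reports both packages and that the tensorflow module resolves to the code shipped by tensorflow-rocm:

# Both the empty stub and the real package should be listed, at the same version
pip list | grep -i tensorflow
# The imported module should come from the tensorflow-rocm install
python3 -c "import tensorflow as tf; print(tf.__file__)"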
Configure your Tensorflow script
By default, Tensorflow tasks take all available resources (both from the CPU and the GPU). In resource-sharing settings, this might cause resources to saturate quickly and some processes to block before execution. When using Tensorflow scripts on our GPU machines, please make sure you add the following snippet to your code:
For Tensorflow version 2.0 and 2.1:

import tensorflow as tf
gpu_devices = tf.config.experimental.list_physical_devices('GPU')
tf.config.experimental.set_memory_growth(gpu_devices[0], True)

or directly:

import tensorflow as tf
tf.config.gpu.set_per_process_memory_growth(True)

For prior versions:

import tensorflow as tf
tf_config = tf.ConfigProto()
tf_config.gpu_options.allow_growth = True
sess = tf.Session(config=tf_config)
Also, a good practice is to limit the number of threads used by your tensorflow code.
For Tensorflow version 2.0 and 2.1:

import tensorflow as tf
tf.config.threading.set_intra_op_parallelism_threads(10)  # or lower values
tf.config.threading.set_inter_op_parallelism_threads(10)  # or lower values

For prior versions:

import tensorflow as tf
tf_config = tf.ConfigProto(intra_op_parallelism_threads=10,
                           inter_op_parallelism_threads=10)
sess = tf.Session(config=tf_config)
Check the version of ROCm deployed on a host
elukey@stat1005:~/test$ dpkg -l rocm-dev
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name           Version      Architecture Description
+++-==============-============-============-=================================================
ii  rocm-dev       2.7.22       amd64        Radeon Open Compute (ROCm) Runtime software stack

elukey@stat1008:~$ dpkg -l rocm-dev
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name           Version      Architecture Description
+++-==============-============-============-=================================================
ii  rocm-dev       3.3.0-19     amd64        Radeon Open Compute (ROCm) Runtime software stack
Changelog in
Check usage of the GPU
On the host (limited to analytics-privatedata-users):
elukey@stat1005:~$ sudo radeontop
In Grafana:
Code available in:
Outstanding issues
GPUs are not correctly handling multi-tasking -
Reset the GPU state
If the GPU gets stuck for some reason (unclean job completion, etc.), the following may happen:
radeontop shows steady RAM usage (90%+, for example).
tensorflow gets stuck when trying to execute jobs.
Usually rebooting the host works, but the following procedure might help as well:
run sudo /opt/rocm/bin/rocm-smi and get the id of the GPU (usually 1)
run sudo /opt/rocm/bin/rocm-smi --gpureset -d X (with X equal to the id of the GPU)
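After the reset, it is worth confirming that the VRAM was actually released before relaunching any job (a sketch; rocm-smi options vary slightly between ROCm releases, so check sudo /opt/rocm/bin/rocm-smi --help on the host):

# VRAM usage should be back near zero after a successful reset
sudo /opt/rocm/bin/rocm-smi --showmeminfo vram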
Upgrade the Debian packages
We import the Debian packages released by AMD for Ubuntu Xenial (and a few subsequent versions like Jammy and Noble) to the amd-rocm component in wikimedia-buster and -bookworm. Unfortunately, some packages like rocm-gdb depend on Ubuntu-specific packages (namely libpython3.10), which have different names in Debian. We thus made a fake-libpython3.10 package that depends on the right packages in Debian, and has a Provides: field in its control file (see below).
Before starting to upgrade, please:
Check the upstream changelog; there is one for every version. Pay attention to breaking changes and supported OSes.
Check what version of ROCm is supported by what version of torch-rocm. As indicated in a previous section, upstream builds a fork of torch, building/linking every version against a specific ROCm library version. ROCm's version of the torch repo is at
At this point, it is better to involve the people that use Torch before targeting a specific release, to choose the best combination for our use cases.
Once you have a target release in mind:
1) Check if a new version is out. If so, create a new component in modules/aptrepo/files/updates:
Name: amd-rocmXX
Method: http://repo.radeon.com/rocm/apt/XX
Suite: xenial
Components: main>thirdparty/amd-rocmXX
UDebComponents:
Architectures: amd64
VerifyRelease: 9386B48A1A693C5C
ListShellHook: grep-dctrl '^([..cut..])$' || [ $? -eq 1 ]
Replace the XX wildcards with the version number, of course. Also make sure that the grep-dctrl in ListShellHook has a sane value. Between different ROCm versions, packages disappear or new dependencies are added. For new dependencies, also make sure that they are under an open license, since in the past there was at least one package under a proprietary license that we had to work around.
There is a second file that needs an entry for the new repo: modules/aptrepo/files/distributions-wikimedia. An example change where this was done for ROCm 6.3 is
2) ssh to apt1002, run puppet and check for updates (remember to replace the XX wildcards):
root@apt1001:/srv/wikimedia# reprepro --noskipold --ignore=forbiddenchar --component thirdparty/amd-rocmXX checkupdate buster-wikimedia
Calculating packages to get...
Updates needed for 'buster-wikimedia|thirdparty/amd-rocm|amd64':
..
'hsa-rocr-dev': newly installed as '1.1.9-87-g1566fdd' (from 'amd-rocm'):
 files needed: pool/thirdparty/amd-rocm/h/hsa-rocr-dev/hsa-rocr-dev_1.1.9-87-g1566fdd_amd64.deb
..
3) Check if the fake-libpython dependency has any changes (like rocm-gdb requiring a higher version number). If so, create a control file like this on build2xxx:
### Commented entries have reasonable defaults.
### Uncomment to edit them.
# Source:
Section: misc
Priority: optional
# Homepage:
Standards-Version: 3.9.2
Source: python3.10
Package: fake-libpython3.10
Version: 3.10.0
Maintainer: Tobias Klausmann
# Pre-Depends:
Depends: libpython3.11
# Recommends:
# Suggests:
Provides: libpython3.10 (= 3.10.0), libpython3.10-minimal, libpython3.10-stdlib
# Replaces:
Architecture: all
# Multi-Arch:
# Copyright:
# Changelog:
# Readme:
# Extra-Files:
# Files:
#
Description: Fake libpython3.10 package to satisfy dependencies
Fake libpython3.10 package to satisfy dependencies
4) Build the package with equivs-build control.
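For reference, a minimal sketch of this step (assuming the equivs Debian package is installed on the build host):

# Build the fake package from the control file in the current directory
equivs-build control
# Inspect name, version and Provides before uploading
dpkg -I fake-libpython3.10_3.10.0_all.deb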
5) Upload the package to reprepro (remember to replace the XX wildcards):

reprepro -C thirdparty/amd-rocmXX includedeb bookworm-wikimedia ~klausman/fake-libpython3.10/fake-libpython3.10_3.10.0_all.deb
6) Update the thirdparty/amd-rocmXX component (remember to replace the XX wildcards):

reprepro --noskipold --ignore=forbiddenchar --component thirdparty/amd-rocmXX update bookworm-wikimedia
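To double-check that the new packages actually landed, listing the component should show them (a sketch; reprepro's output format may differ slightly between versions):

# List everything now published in the new component
reprepro -C thirdparty/amd-rocmXX list bookworm-wikimedia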
7) Update the versions supported by the amd_rocm module in operations/puppet.
8) On the host that you want to upgrade (this package list is an example, it may change):
sudo apt autoremove -y rocm-smi-lib migraphx miopengemm rocminfo hsakmt-roct \
  rocrand hsa-rocr-dev rocm-cmake hsa-ext-rocr-dev rocm-device-libs hip_base \
  hip_samples llvm-amdgpu comgr rocm-gdb rocm-dbgapi mivisionx
And then run puppet to install the new packages. Some quick tests to see if the GPU is properly recognized:
elukey@stat1005:~$ /opt/rocm/bin/rocminfo
..
elukey@stat1005:~$ /opt/rocm/opencl/bin/clinfo
..
elukey@stat1005:~$ /opt/rocm/bin/hipconfig
..
elukey@stat1005:~$ export https_proxy=http://webproxy.eqiad.wmnet:8080
elukey@stat1005:~$ virtualenv -p python3 test
elukey@stat1005:~$ source test/bin/activate
(test) elukey@stat1005:~$ pip3 install tensorflow-rocm
elukey@stat1008:~$ cat gpu_test.py
import tensorflow as tf
# Creates a graph.
with tf.device('/device:GPU:0'):
    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
    b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
    c = tf.matmul(a, b)
# Runs the op.
print(c)

(test) elukey@stat1008:~$ python gpu_test.py
tf.Tensor(
[[22. 28.]
 [49. 64.]], shape=(2, 2), dtype=float32)
Retrieved from "
Machine Learning/AMD GPU
Add topic