How to Download the AfriSpeech-200 Dataset on Linux
So I needed to download the AfriSpeech-200 dataset for my speech recognition research — 200 hours of Pan-African accented English speech across 120 accents from 13 countries. Simple enough, right?
Not quite. Here's what I ran into and how I fixed it.
What is AfriSpeech-200?
AfriSpeech-200 is an open-source Pan-African speech dataset designed for clinical and general domain Automatic Speech Recognition (ASR). It contains:
67,577 audio clips from 2,463 unique speakers
120 African accents across 13 countries
200+ hours of speech data
Both clinical and general domain transcripts
It lives on Hugging Face at tobiolatunji/afrispeech-200.
My Setup
OS: Pop!_OS (Ubuntu-based Linux)
Starting point: Clean slate — no Python environment, nothing installed
Step 1: Check Python
Python 3 comes pre-installed on Pop!_OS. Verify with:
python3 --version
If it's missing:
sudo apt update && sudo apt install python3 python3-pip python3-venv -y
Step 2: Create a Virtual Environment
python3 -m venv ~/afrispeech_env
source ~/afrispeech_env/bin/activate
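Before installing anything, it's worth confirming the environment is actually active in the current shell. Here's a minimal sketch; the `in_venv` helper name is my own, and it relies only on the `VIRTUAL_ENV` variable that the venv `activate` script exports.

```shell
# Hypothetical helper: succeeds only when a virtual environment is active.
# Relies on VIRTUAL_ENV, which the venv activate script exports.
in_venv() {
    [ -n "${VIRTUAL_ENV:-}" ]
}

if in_venv; then
    echo "Active environment: $VIRTUAL_ENV"
else
    echo "No environment active. Run: source ~/afrispeech_env/bin/activate"
fi
```

If you open a new terminal later, the environment is not active there until you run the `source` line again.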
Step 3: Install Hugging Face Libraries
pip install datasets==2.14.6 huggingface_hub
Why datasets==2.14.6? AfriSpeech-200 uses an older-style loading script. Newer versions of the datasets library may throw compatibility errors, so pinning to 2.14.6 avoids that headache.
Step 4: Log in to Hugging Face
You need a free Hugging Face account. Get your token at huggingface.co/settings/tokens, then run:
huggingface-cli login
Paste your token when prompted.
Step 5: Check Your Disk Space
This dataset is ~120GB. Make sure you have enough space before starting:
df -h ~
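If you'd rather have a pass/fail answer than read the `df` table by eye, something like the sketch below works. The `has_free_space` helper is a name I made up for illustration, and the 120 GB threshold is just the rough size estimate above, not an exact figure.

```shell
# Hypothetical helper: succeeds if the filesystem holding PATH ($1)
# has at least REQUIRED_GB ($2) available.
has_free_space() {
    # POSIX df output: column 4 of the second line is available space in KiB.
    avail_kb=$(df -Pk "$1" | awk 'NR==2 {print $4}')
    [ "$avail_kb" -ge $(( $2 * 1024 * 1024 )) ]
}

if has_free_space "$HOME" 120; then
    echo "Enough space for the download."
else
    echo "Less than 120 GB free under $HOME. Free up space or pick another --local-dir."
fi
```

Leave yourself some headroom on top of the dataset size if you plan to extract or preprocess the audio afterwards.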
Step 6: Download the Dataset
Note: If you see a warning that huggingface-cli download is deprecated, use hf download instead.
hf download tobiolatunji/afrispeech-200 \
--repo-type dataset \
--local-dir ~/afrispeech-200
Grab a cup of tea — this will take a while depending on your internet speed. ☕
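If you only want to inspect the transcripts before committing to the full download, you can restrict the download to a file pattern. The sketch below assumes the `--include` option of `hf download` and a `transcripts/` folder at the top of the repo; the `download_subset` wrapper and its `DRY_RUN` switch are hypothetical names of mine, not part of the CLI.

```shell
# Hypothetical wrapper: fetch only the files matching PATTERN ($1) from the dataset repo.
# With DRY_RUN=1 it prints the command instead of running it.
download_subset() {
    pattern="$1"
    set -- hf download tobiolatunji/afrispeech-200 \
        --repo-type dataset \
        --include "$pattern" \
        --local-dir "$HOME/afrispeech-200"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Preview the command for a transcripts-only download (CSV files, no audio).
DRY_RUN=1 download_subset "transcripts/*"
```

Drop `DRY_RUN=1` to actually run it. Running the full download from Step 6 afterwards into the same folder should pick up the remaining files.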
Step 7: Move to Your Projects Folder
mkdir -p ~/projects
mv ~/afrispeech-200 ~/projects/afrispeech-200
Verify the Download
Check what you have:
ls ~/projects/afrispeech-200/
Count training samples (data rows across every train.csv, header rows skipped):
tail -q -n +2 ~/projects/afrispeech-200/transcripts/*/train.csv | wc -l
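To see how the samples are spread across accents, a per-folder count can be sketched like this. It assumes the same transcripts/<accent>/train.csv layout as the command above and one header row per file. The `count_per_accent` name is my own. It counts physical lines, so a transcript with a line break inside a quoted field would be counted more than once.

```shell
# Hypothetical helper: print "<rows> <accent>" for every
# transcripts/<accent>/train.csv under the dataset root ($1), largest first.
count_per_accent() {
    for csv in "$1"/transcripts/*/train.csv; do
        [ -f "$csv" ] || continue
        accent=$(basename "$(dirname "$csv")")
        # Subtract the header row; this counts lines, not parsed CSV records.
        rows=$(( $(wc -l < "$csv") - 1 ))
        echo "$rows $accent"
    done | sort -rn
}

# Show the ten accents with the most training rows.
count_per_accent ~/projects/afrispeech-200 | head -n 10
```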
Quick Recap
| Step | Command |
|---|---|
| Create environment | python3 -m venv ~/afrispeech_env |
| Activate | source ~/afrispeech_env/bin/activate |
| Install libs | pip install datasets==2.14.6 huggingface_hub |
| Login | huggingface-cli login |
| Download | hf download tobiolatunji/afrispeech-200 --repo-type dataset --local-dir ~/afrispeech-200 |
That's it! The dataset should now be on your machine and ready for experiments.
If you run into any issues, drop a comment below — happy to help.
Ridwan Bello is an ML researcher exploring African language technologies. Follow the journey at blog.ridwanbello.io
