How to Download the AfriSpeech-200 Dataset on Linux

So I needed to download the AfriSpeech-200 dataset for my speech recognition research — 200 hours of Pan-African accented English speech across 120 accents from 13 countries. Simple enough, right?

Not quite. Here's what I ran into and how I fixed it.

What is AfriSpeech-200?

AfriSpeech-200 is an open-source Pan-African speech dataset designed for clinical and general domain Automatic Speech Recognition (ASR). It contains:

67,577 audio clips from 2,463 unique speakers
120 African accents across 13 countries
200+ hours of speech data
Both clinical and general domain transcripts

It lives on Hugging Face at tobiolatunji/afrispeech-200.

My Setup

OS: Pop!_OS (Ubuntu-based Linux)
Starting point: Clean slate — no Python environment, nothing installed

Step 1: Check Python

Python 3 comes pre-installed on Pop!_OS. Verify with:

python3 --version

If it's missing:

sudo apt update && sudo apt install python3 python3-pip python3-venv -y

Step 2: Create a Virtual Environment

python3 -m venv ~/afrispeech_env
source ~/afrispeech_env/bin/activate
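
If you want to confirm the activation worked before installing anything, here's a minimal sketch using only the standard library. The function name `in_virtualenv` is just an illustrative choice, not part of any tool used in this post.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter is running inside a venv."""
    # Inside a venv, sys.prefix points at the environment directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
    print("interpreter prefix:", sys.prefix)
```

Run it with `python3` after activating; if it reports `False`, the `source` command probably ran in a different shell session.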

Step 3: Install Hugging Face Libraries

pip install datasets==2.14.6 huggingface_hub

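Why datasets==2.14.6? AfriSpeech-200 uses an older-style loading script (a Python file in the dataset repository that builds the dataset). Newer versions of the datasets library restrict or drop support for such scripts and may throw compatibility errors, so pinning to 2.14.6 avoids that headache. The pin only matters when you later load the data with the datasets library; the download command in Step 6 comes from huggingface_hub.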
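
To double-check that the pin actually took effect in the active environment, something like the following should do. It's a small sketch built on the standard library's `importlib.metadata`; the helper names `installed_version` and `matches_pin` are made up for this example.

```python
from importlib.metadata import PackageNotFoundError, version

EXPECTED_DATASETS_VERSION = "2.14.6"  # the pin used in the install command


def installed_version(package: str):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def matches_pin(found, expected: str = EXPECTED_DATASETS_VERSION) -> bool:
    """True only when a version was found and equals the pinned one exactly."""
    return found is not None and found == expected


if __name__ == "__main__":
    found = installed_version("datasets")
    if matches_pin(found):
        print(f"datasets {found} matches the pin")
    else:
        print(f"datasets is {found!r}, expected {EXPECTED_DATASETS_VERSION}")
```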

Step 4: Log in to Hugging Face

You need a free Hugging Face account. Get your token at huggingface.co/settings/tokens, then run:

huggingface-cli login

Paste your token when prompted.

Step 5: Check Your Disk Space

This dataset is ~120GB. Make sure you have enough space before starting:

df -h ~
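
If you'd like a yes/no answer instead of reading the `df` table, here's a rough sketch with `shutil.disk_usage`. The 120 GB figure is the estimate above; the 10 GB headroom is an arbitrary assumption, so adjust it to taste. Note that this uses decimal gigabytes, while `df -h` reports powers of 1024.

```python
import shutil
from pathlib import Path

REQUIRED_GB = 120  # approximate dataset size quoted above
HEADROOM_GB = 10   # hypothetical safety margin, not from the dataset docs


def free_gb(path) -> float:
    """Free space, in GB (10**9 bytes), on the filesystem holding `path`."""
    return shutil.disk_usage(path).free / 1e9


def has_room(free: float, required: float = REQUIRED_GB,
             headroom: float = HEADROOM_GB) -> bool:
    """True when free space covers the dataset plus the safety margin."""
    return free >= required + headroom


if __name__ == "__main__":
    free = free_gb(Path.home())
    verdict = "enough space" if has_room(free) else "NOT enough space"
    print(f"{free:.1f} GB free in {Path.home()}: {verdict}")
```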

Step 6: Download the Dataset

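Note: huggingface-cli download is deprecated in recent huggingface_hub releases in favor of hf download, which is what the command below uses. If hf is not found, your huggingface_hub is older; huggingface-cli download accepts the same arguments.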

hf download tobiolatunji/afrispeech-200 \
  --repo-type dataset \
  --local-dir ~/afrispeech-200

Grab a cup of tea — this will take a while depending on your internet speed. ☕
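
The same download can be driven from Python with `snapshot_download` from huggingface_hub, which also lets you fetch a subset first. The sketch below is one way to do it, not the only one: the helper names are hypothetical, and the `transcripts/*` pattern assumes the transcript CSVs sit under a top-level transcripts/ folder, as the verification command later in this post implies.

```python
from pathlib import Path

REPO_ID = "tobiolatunji/afrispeech-200"


def download_kwargs(local_dir, transcripts_only: bool = False) -> dict:
    """Build keyword arguments for huggingface_hub.snapshot_download."""
    kwargs = {
        "repo_id": REPO_ID,
        "repo_type": "dataset",
        "local_dir": str(Path(local_dir).expanduser()),
    }
    if transcripts_only:
        # Assumption: transcripts live under transcripts/ in the repo.
        kwargs["allow_patterns"] = ["transcripts/*"]
    return kwargs


def download(local_dir, transcripts_only: bool = False) -> str:
    """Download the repo (or only its transcripts) and return the local path."""
    # Imported lazily so download_kwargs works without the library installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(**download_kwargs(local_dir, transcripts_only))
```

Calling `download("~/afrispeech-200", transcripts_only=True)` should pull just the CSVs, which is a cheap way to test your login and connection before committing to the full download with `download("~/afrispeech-200")`.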

Step 7: Move to Your Projects Folder

mkdir -p ~/projects && mv ~/afrispeech-200 ~/projects/afrispeech-200

Verify the Download

Check what you have:

ls ~/projects/afrispeech-200/

Count training samples (this counts the lines after each file's header, so it assumes no transcript contains a line break):

awk 'FNR > 1' ~/projects/afrispeech-200/transcripts/*/train.csv | wc -l
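
A line count can drift from the true number of samples if any transcript contains a quoted line break. For a count of actual CSV records, here's a sketch using the standard `csv` module. It assumes the transcripts/<accent>/<split>.csv layout used in the command above and UTF-8 files; `count_rows` and `count_split` are illustrative names.

```python
import csv
from pathlib import Path


def count_rows(csv_path) -> int:
    """Count data records in one CSV file, excluding the header row."""
    # newline="" lets the csv module handle quoted fields that span lines.
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header, tolerate an empty file
        return sum(1 for row in reader if row)  # ignore blank lines


def count_split(root, split: str = "train") -> int:
    """Sum records over every transcripts/<accent>/<split>.csv under `root`."""
    files = sorted(Path(root).expanduser().glob(f"transcripts/*/{split}.csv"))
    return sum(count_rows(path) for path in files)


if __name__ == "__main__":
    print(count_split("~/projects/afrispeech-200"))
```

Passing a different `split` value counts another split the same way, provided a CSV with that name exists in each accent folder.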

Quick Recap

| Step | Command |
| --- | --- |
| Create environment | `python3 -m venv ~/afrispeech_env` |
| Activate | `source ~/afrispeech_env/bin/activate` |
| Install libs | `pip install datasets==2.14.6 huggingface_hub` |
| Login | `huggingface-cli login` |
| Download | `hf download tobiolatunji/afrispeech-200 --repo-type dataset --local-dir ~/afrispeech-200` |

That's it! The dataset should now be on your machine and ready for experiments.

If you run into any issues, drop a comment below — happy to help.

Ridwan Bello is an ML researcher exploring African language technologies. Follow the journey at blog.ridwanbello.io
