How to Download the AfriSpeech-200 Dataset on Linux


So I needed to download the AfriSpeech-200 dataset for my speech recognition research — 200 hours of Pan-African accented English speech across 120 accents from 13 countries. Simple enough, right?

Not quite. Here's what I ran into and how I fixed it.


What is AfriSpeech-200?

AfriSpeech-200 is an open-source Pan-African speech dataset designed for clinical and general domain Automatic Speech Recognition (ASR). It contains:

  • 67,577 audio clips from 2,463 unique speakers

  • 120 African accents across 13 countries

  • 200+ hours of speech data

  • Both clinical and general domain transcripts

It lives on Hugging Face at tobiolatunji/afrispeech-200.


My Setup

  • OS: Pop!_OS (Ubuntu-based Linux)

  • Starting point: Clean slate, with no virtual environment and no Hugging Face libraries installed


Step 1: Check Python

Python 3 comes pre-installed on Pop!_OS. Verify with:

python3 --version

If it's missing, or if creating the virtual environment in Step 2 fails because the venv module isn't installed:

sudo apt update && sudo apt install python3 python3-pip python3-venv -y

Step 2: Create a Virtual Environment

python3 -m venv ~/afrispeech_env
source ~/afrispeech_env/bin/activate
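If you want to confirm the activation actually took before installing anything, here is a minimal sketch. It relies on the VIRTUAL_ENV variable that the activate script exports; venv_active is a helper name made up for this post, not part of any tool.

```shell
# Succeeds only when a virtual environment is active in this shell.
# venv_active is a hypothetical helper name used for illustration.
venv_active() {
  # activate exports VIRTUAL_ENV; also confirm the env has its own interpreter
  [ -n "${VIRTUAL_ENV:-}" ] && [ -x "$VIRTUAL_ENV/bin/python" ]
}
```

Run it as venv_active && echo "venv OK: $VIRTUAL_ENV" right after the source line. If it prints nothing, the environment is not active in the current shell.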

Step 3: Install Hugging Face Libraries

pip install datasets==2.14.6 huggingface_hub

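Why datasets==2.14.6? AfriSpeech-200 uses an older-style loading script. Newer releases of the datasets library first made running such scripts opt-in and later dropped script support, so loading the dataset with them may fail. Pinning to 2.14.6 avoids that headache. The pin matters when you load the data in Python later; the download in Step 6 only needs huggingface_hub.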
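Since the whole point of the pin is to end up on one exact version, it can be worth checking what pip actually installed. The sketch below reads the Version field from pip show; pip_show_version and check_pin are hypothetical helper names, and it assumes python3 on your PATH is the virtual environment's interpreter.

```shell
# Print the "Version:" field from `pip show` output read on stdin.
pip_show_version() {
  awk '$1 == "Version:" { print $2; exit }'
}

# Succeed only if the package is installed at exactly the expected version.
# check_pin is a hypothetical helper; it assumes python3 is the venv's Python.
check_pin() {
  local pkg="$1" want="$2" have
  have=$(python3 -m pip show "$pkg" 2>/dev/null | pip_show_version)
  if [ "$have" = "$want" ]; then
    echo "$pkg $have OK"
  else
    echo "$pkg: expected $want, found ${have:-nothing}" >&2
    return 1
  fi
}
```

With the environment active, check_pin datasets 2.14.6 should report OK. A failure usually means pip ran outside the virtual environment.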


Step 4: Log in to Hugging Face

You need a free Hugging Face account. Get your token at huggingface.co/settings/tokens, then run:

huggingface-cli login

Paste your token when prompted.
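To check that a token was saved without hitting the network, here is a rough sketch. It assumes the huggingface_hub defaults as I understand them: the token file lives at ~/.cache/huggingface/token unless HF_HOME points elsewhere, and an HF_TOKEN environment variable also counts. Other overrides are not handled, and hf_token_present is a made-up name.

```shell
# Succeeds if a Hugging Face token appears to be configured locally.
# Assumption: default token file is $HF_HOME/token, with HF_HOME
# defaulting to ~/.cache/huggingface. hf_token_present is a hypothetical name.
hf_token_present() {
  local hf_home="${HF_HOME:-$HOME/.cache/huggingface}"
  # Either an exported token or a non-empty token file is enough
  [ -n "${HF_TOKEN:-}" ] || [ -s "$hf_home/token" ]
}
```

This only tells you a token exists, not that it is valid. For that, huggingface-cli whoami asks the Hub which account the token belongs to.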


Step 5: Check Your Disk Space

This dataset is ~120 GB, so make sure the filesystem holding your home directory has that much free, plus some headroom, before starting:

df -h ~
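Reading df output by eye works, but if you'd rather have the shell do the comparison, here is a small sketch. has_free_gb is a hypothetical helper, and any threshold you pass it is your own choice rather than an official requirement.

```shell
# Succeeds if the filesystem containing PATH has at least NEED_GB free.
# has_free_gb is a hypothetical helper name used for illustration.
has_free_gb() {
  local path="$1" need_gb="$2" avail_kb
  # -P forces one line per filesystem, -k reports 1024-byte blocks;
  # column 4 of the second line is the available space
  avail_kb=$(df -Pk "$path" | awk 'NR == 2 { print $4 }')
  [ "$avail_kb" -ge $(( need_gb * 1024 * 1024 )) ]
}
```

For example, has_free_gb ~ 150 || echo "Not enough free space" asks for 150 GB, which is an illustrative figure that leaves some room above the dataset's size.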

Step 6: Download the Dataset

Note: Recent huggingface_hub releases ship the hf command and mark huggingface-cli as deprecated. If hf isn't available on your install, huggingface-cli download takes the same arguments.

hf download tobiolatunji/afrispeech-200 \
  --repo-type dataset \
  --local-dir ~/afrispeech-200

Grab a cup of tea — this will take a while depending on your internet speed. ☕
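A download this size can get interrupted by a flaky connection. One way to cope is to wrap the command in a retry loop, sketched below. retry and the RETRY_DELAY variable are names invented for this example. My understanding is that re-running the same download command skips files that already finished, but treat that as an assumption and check how your huggingface_hub version behaves.

```shell
# Run a command up to ATTEMPTS times, pausing between failures.
# retry and RETRY_DELAY are hypothetical names used for illustration.
retry() {
  local attempts="$1"
  shift
  local delay="${RETRY_DELAY:-30}" try=1
  until "$@"; do
    if [ "$try" -ge "$attempts" ]; then
      echo "giving up after $try attempts" >&2
      return 1
    fi
    echo "attempt $try failed; retrying in ${delay}s" >&2
    sleep "$delay"
    try=$((try + 1))
  done
}
```

Usage would look like retry 5 hf download tobiolatunji/afrispeech-200 --repo-type dataset --local-dir ~/afrispeech-200, which tries up to five times with a 30-second pause between attempts.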


Step 7: Move to Your Projects Folder

mkdir -p ~/projects
mv ~/afrispeech-200 ~/projects/afrispeech-200

Verify the Download

Check what you have:

ls ~/projects/afrispeech-200/

Count training samples:

tail -q -n +2 ~/projects/afrispeech-200/transcripts/*/train.csv | wc -l
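For a breakdown instead of a single total, here is a sketch that prints the row count for each subdirectory under transcripts/. It assumes the transcripts/*/<split>.csv layout from the command above and that every file starts with one header line. rows_per_subdir is a made-up name. The count is line-based, so a transcript containing a quoted newline would be counted more than once.

```shell
# Print "<subdir><TAB><data rows>" for each transcripts/*/<split>.csv.
# rows_per_subdir is a hypothetical helper; it assumes one header line per file.
rows_per_subdir() {
  local root="$1" split="${2:-train}" f name rows
  for f in "$root"/transcripts/*/"$split".csv; do
    [ -e "$f" ] || continue   # glob matched nothing
    name=$(basename "$(dirname "$f")")
    # Skip the header (FNR > 1); awk also counts a last line with no newline
    rows=$(awk 'FNR > 1 { n++ } END { print n + 0 }' "$f")
    printf '%s\t%s\n' "$name" "$rows"
  done
}
```

Call it as rows_per_subdir ~/projects/afrispeech-200 for the training split, or pass a second argument to count a different split's file.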

Quick Recap

Step | Command
Create environment | python3 -m venv ~/afrispeech_env
Activate | source ~/afrispeech_env/bin/activate
Install libs | pip install datasets==2.14.6 huggingface_hub
Login | huggingface-cli login
Download | hf download tobiolatunji/afrispeech-200 --repo-type dataset --local-dir ~/afrispeech-200

That's it! The dataset should now be on your machine and ready for experiments.

If you run into any issues, drop a comment below — happy to help.


Ridwan Bello is an ML researcher exploring African language technologies. Follow the journey at blog.ridwanbello.io