EGRA-Xhosa-14.9k: Annotated Child Reading Audio Dataset
  • Description

    This project collected a dataset of children reading aloud in Xhosa, a Bantu language spoken in South Africa. The recordings were processed with the help of native speakers and used to train state-of-the-art machine learning models that assess whether a child has spoken a word correctly. The dataset contains 14,972 recordings averaging 4 seconds each. Each recording captures a child speaking a particular Xhosa word or letter in a classroom setting and is annotated by three independent markers.
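As a rough illustration of how three independent annotations per recording might be combined, and of what the stated counts imply for total audio length, here is a minimal Python sketch. The label strings (`"correct"`, `"incorrect"`), the function names, and the majority-vote rule are assumptions made for illustration; this record does not describe the dataset's actual label format or how the authors aggregate annotations.

```python
from collections import Counter


def majority_label(labels):
    """Return the label chosen by most markers, or None if the top labels tie.

    `labels` is a hypothetical list with one entry per marker, e.g.
    ["correct", "correct", "incorrect"]. The real label format is not
    specified in this record.
    """
    if not labels:
        raise ValueError("at least one label is required")
    counts = Counter(labels).most_common()
    # A tie between the two most frequent labels means there is no majority.
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]


def approx_total_hours(n_recordings=14_972, mean_seconds=4.0):
    """Estimate total audio duration in hours from the figures in the description."""
    return n_recordings * mean_seconds / 3600


if __name__ == "__main__":
    print(majority_label(["correct", "correct", "incorrect"]))
    print(f"{approx_total_hours():.1f} hours (approximate)")
```

With the figures quoted in the description, 14,972 recordings at an average of 4 seconds each come to roughly 16.6 hours of audio. This is an estimate derived from the stated average, not a measured total.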


    • Data publication title EGRA-Xhosa-14.9k: Annotated Child Reading Audio Dataset
    • Data type dataset
    • Keywords
      • EGRA-AI
      • EGRA
      • Children
      • Early Grade
      • Assessment
      • isiXhosa
      • Classroom
      • Annotated
    • Funding source
    • Grant number(s)
      • -
    • FoR codes
      • 461199 - Machine learning not elsewhere classified
      • 460599 - Data management and data science not elsewhere classified
      • 460299 - Artificial intelligence not elsewhere classified
      • 460199 - Applied computing not elsewhere classified
      • 470399 - Language studies not elsewhere classified
      • 490199 - Applied mathematics not elsewhere classified
      • 390399 - Education systems not elsewhere classified
    • SEO codes
      • 130299 - Communication not elsewhere classified
      • 160199 - Learner and learning not elsewhere classified
      • 160399 - Teaching and curriculum not elsewhere classified
      • 169999 - Other education and training not elsewhere classified
      • 220199 - Communication technologies, systems and services not elsewhere classified
      • 220499 - Information systems, technologies and services not elsewhere classified
      • 280116 - Expanding knowledge in language, communication and culture
    • Temporal (time) coverage
      • Start date 2024/02/01
      • End date 2024/11/30
    • Spatial (location, mapping) coverage
      • Locations
        • South Africa
    • Related publications
      • Name An End-to-End Approach for Child Reading Assessment in the Xhosa Language
      • URL
      • Notes submitted article
    • Citation Chevtchenko, Sergio; Navas, Nikhil; Vale, Rafaella; Ubaudi, Franco; Lucwaba, Sipumelele; Ardington, Cally; Afshar, Soheil; Antoniou, Mark; Afshar, Saeed (2025): EGRA-Xhosa-14.9k: Annotated Child Reading Audio Dataset. Western Sydney University. https://doi.org/10.26183/93x0-qy45