Data

"The ATC LID/ASR evaluation dataset is going to be published at Interspeech 2021. Stay tuned!"



Abstract: Detecting English Speech in the Air Traffic Control Voice Communication


ASR dataset (V1).

Name: ATCO2-ASRdataset-v1_beta

Description: This dataset was built for the development and evaluation of automatic speech recognition techniques for English ATC data. Note: The dataset is a beta version and will be updated in the near future (some fine-tuning of the transcripts may happen). The dataset consists of English speech from the LKTB, LKPR, LZIB, LSGS, LSZH, LSZB and YSSY airports. The audio totals 1.10 hours. We provide the audio (wav format), an English automatic transcript generated by an ASR system, and an info file with meta-information and nearby callsigns.

Link to file to download: https://www.replaywell.com/atco2/download/ATCO2-ASRdataset-v1_beta.tgz
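
The archive can be fetched and unpacked with a few lines of Python. The following is a minimal sketch, not an official tool: the helper names download and extract_and_list_wavs and the output directory are illustrative assumptions, and the internal layout of the archive is not described on this page, so the code simply searches for all .wav files after extraction.

```python
import tarfile
import urllib.request
from pathlib import Path

# URL taken from the download link above.
ASR_URL = "https://www.replaywell.com/atco2/download/ATCO2-ASRdataset-v1_beta.tgz"


def download(url: str, dest: Path) -> Path:
    """Download url to dest unless the file is already present (hypothetical helper)."""
    if not dest.exists():
        urllib.request.urlretrieve(url, dest)
    return dest


def extract_and_list_wavs(archive: Path, out_dir: Path) -> list[Path]:
    """Extract a gzip-compressed tar archive and return every .wav file under out_dir.

    The directory structure inside the archive is an unknown here,
    so the search is recursive rather than tied to a fixed path.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(out_dir)
    return sorted(out_dir.rglob("*.wav"))
```

For example, extract_and_list_wavs(download(ASR_URL, Path("ATCO2-ASRdataset-v1_beta.tgz")), Path("atco2_asr")) would download the archive once and return the paths of the extracted audio files.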

LID dataset (V1).

Name: ATCO2-LIDdataset-v1_beta

Description: This dataset was built for the development and evaluation of techniques for classifying English versus non-English speech in ATC data. Note: The dataset is a beta version and will be updated in the future (more language pairs will be added and some cleaning/debugging may happen). The dataset consists of the following language pairs:

CZEN - devel (6.11 hours),

CZEN - eval (6.21 hours),

FREN - devel (2.68 hours),

FREN - eval (3.27 hours),

GEEN - devel English only (5.61 hours),

GEEN - eval (2.41 hours),

EN-AU (Australian English) - eval English only (0.17 hours).

Where possible, we split each pair into development and evaluation subsets. We provide the audio (wav format), an English automatic transcript generated by an ASR system, and an info file with the estimated SNR, language and length.

Link to file to download: https://www.replaywell.com/atco2/download/ATCO2-LIDdataset-v1_beta.tgz

 

DATASETS COLLECTED IN ATCO2. The data have been collected at the following airports (all data sizes are in hours):

LKPR: Prague (Czech Republic)
LKTB: Brno (Czech Republic)
LSZH: Zurich (Switzerland)
LSZB: Bern (Switzerland)
LSGS: Sion (Switzerland)
YSSY: Sydney (Australia)
LZIB: Bratislava (Slovakia)
EETN: Tallinn (Estonia)

Total

All: 1517.171319
LKPR: 590.519025
LKTB: 341.325012
LSZH: 287.376496
LSZB: 138.460281
LSGS: 68.159527
YSSY: 65.163918
LZIB: 22.253698
EETN: 3.913363
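
As a quick consistency check, the per-airport totals can be summed and compared with the "All" figure. The sketch below copies the numbers from the list above; the function names total_hours and shares_percent are illustrative, and the tolerance is an assumption that accounts for rounding in the listed values.

```python
# Hours per airport, copied from the "Total" list above.
HOURS = {
    "LKPR": 590.519025,
    "LKTB": 341.325012,
    "LSZH": 287.376496,
    "LSZB": 138.460281,
    "LSGS": 68.159527,
    "YSSY": 65.163918,
    "LZIB": 22.253698,
    "EETN": 3.913363,
}
REPORTED_ALL = 1517.171319  # the "All" value listed above


def total_hours(per_airport: dict[str, float]) -> float:
    """Sum the collected hours over all airports."""
    return sum(per_airport.values())


def shares_percent(per_airport: dict[str, float]) -> dict[str, float]:
    """Return each airport's share of the total, in percent."""
    total = total_hours(per_airport)
    return {code: 100.0 * hours / total for code, hours in per_airport.items()}


total = total_hours(HOURS)
# Small tolerance, since the listed values are rounded to six decimals.
assert abs(total - REPORTED_ALL) < 1e-3

for code, pct in shares_percent(HOURS).items():
    print(f"{code}: {pct:.1f}% of {total:.1f} hours")
```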

Hours collected per month ("-" marks an empty cell):

Date           LKPR        LKTB        LSZH        LSZB        LSGS        YSSY        LZIB        EETN         All
10/2020     17.4759     50.6165           -           -           -           -           -           -     68.0924
11/2020     58.4633      36.348           -           -           -           -           -           -     94.8113
12/2020     69.2494     21.8845           -           -           -           -           -           -     91.1339
01/2021     37.4121     25.2206           -           -           -           -           -           -     62.6328
02/2021   18.059968   25.777703           -           -           -           -           -           -   43.837672
03/2021   25.916452   24.702203           -           -           -           -           -           -   50.618656
04/2021   57.559116   33.174521   48.028701   34.070596    6.478430    4.716394    3.958835    1.310111  189.296704
05/2021   35.659576   39.189885   78.005412   42.569556   15.479776   30.957518   10.953263    0.747391  253.562378
06/2021  109.098188   34.528389   88.936661   57.455538    1.121517   29.490006    7.341600    0.408442  328.380339
07/2021  163.637212   47.870546   72.405722    4.364591   45.079805           -           -    1.447419  334.805295

 

DATASET ANNOTATED IN ATCO2. The annotated data come from several airports; the table below gives the cumulative amount annotated by each date, in minutes.

Total of annotated English speech: 137 minutes (roughly 2h17)

Date            EN       CZ  LSGS_EN  LSGS_FR  LSZB_EN  LSZB_GE  LSZH_EN  LSZH_GE  LZIB_EN  YSSY_EN Total_EN
9.6.2021     14.83     2.71     2.96     0.78     9.64        0     6.52     0.38        0        -    33.95
23.6.2021    14.83     2.71    13.22     0.78    14.67        0     9.19     0.46      1.3     1.85    55.06
19.7.2021    14.83     2.71    21.29        2    19.66     1.98    11.56     0.46     6.99     5.58    79.91
30.7.2021    18.31     2.71    26.82        2    22.81     1.98    16.74     0.46    12.05      8.9   105.63
19.8.2021    18.31     2.71     44.5        2    31.74     1.98    16.74     0.46    12.05    14.08   137.42
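
The Total_EN column appears to be the sum of the English columns (EN plus every column ending in _EN), and the last row matches the 137-minute total quoted above. The sketch below checks this for the 19.8.2021 row and converts the result to hours and minutes. The summation rule is an inference from the numbers rather than something stated on this page, and the helper names english_minutes and as_hours_minutes are illustrative.

```python
# Last row of the table above (19.8.2021), values in minutes.
ROW = {
    "EN": 18.31, "CZ": 2.71,
    "LSGS_EN": 44.5, "LSGS_FR": 2.0,
    "LSZB_EN": 31.74, "LSZB_GE": 1.98,
    "LSZH_EN": 16.74, "LSZH_GE": 0.46,
    "LZIB_EN": 12.05, "YSSY_EN": 14.08,
    "Total_EN": 137.42,
}


def english_minutes(row: dict[str, float]) -> float:
    """Sum the English columns: "EN" and every "*_EN" column except the total.

    Assumption: this is how Total_EN is derived; the page does not say so explicitly.
    """
    return sum(
        value
        for name, value in row.items()
        if name != "Total_EN" and (name == "EN" or name.endswith("_EN"))
    )


def as_hours_minutes(minutes: float) -> str:
    """Format a duration in minutes as e.g. "2h17", dropping fractional minutes."""
    whole = int(minutes)
    return f"{whole // 60}h{whole % 60:02d}"


total_en = english_minutes(ROW)
# Tolerance covers floating-point error and rounding in the table.
assert abs(total_en - ROW["Total_EN"]) < 0.01
print(as_hours_minutes(total_en))  # prints "2h17"
```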