End-to-End Callsign Recognition System
In the last blog post, we introduced a way to lower the word error rate (WER) on callsigns in automatic speech recognition (ASR) output by incorporating surveillance information into the transcription process. In this post, we look at extracting the callsigns from the ASR output. Callsign recognition can be broken down into two stages:
1) Tagging the callsign in the sequence
2) Mapping the callsign word sequence to its ICAO format (ICAO stands for International Civil Aviation Organization)
Figure 1 illustrates the two-stage process. In the tagging step, the input transcript from our ASR system is tagged in the IOB format (short for inside, outside, beginning) to find the tokens that are part of a callsign. In the second step, the part of the ASR transcript that is tagged as a callsign (labeled with B/I-CALL) is mapped to the standard ICAO format for callsigns. This format consists of a three-character airline identifier followed by the flight ID, which is made up of several digits optionally followed by one or two characters (a list of airline identifiers can be found here: https://en.wikipedia.org/wiki/List_of_airline_codes).
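To make the two stages concrete, here is a minimal rule-based sketch in Python. It is not the neural approach described in this post: the lookup tables, the function names `extract_callsign_tokens` and `map_to_icao`, and the example transcript are all assumptions made for illustration. The regular expression simply encodes the format described above.

```python
import re

# Hypothetical, tiny lookup tables for illustration only.
AIRLINE_CODES = {"lufthansa": "DLH", "speedbird": "BAW"}
DIGITS = {"zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
          "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9"}
LETTERS = {"alfa": "A", "bravo": "B", "charlie": "C", "delta": "D"}

# Three-letter airline identifier, digits, then optionally one or two letters.
ICAO_PATTERN = re.compile(r"^[A-Z]{3}\d+[A-Z]{0,2}$")


def extract_callsign_tokens(tokens, tags):
    """Stage 1 result: return the first span tagged B-CALL / I-CALL."""
    span = []
    for token, tag in zip(tokens, tags):
        if tag == "B-CALL" and not span:
            span.append(token)      # beginning of the callsign
        elif tag == "I-CALL" and span:
            span.append(token)      # inside the callsign
        elif span:
            break                   # the first span has ended
    return span


def map_to_icao(span):
    """Stage 2: map the spoken words to an ICAO callsign, or None on failure."""
    if not span or span[0] not in AIRLINE_CODES:
        return None
    flight_id = ""
    for word in span[1:]:
        char = DIGITS.get(word) or LETTERS.get(word)
        if char is None:
            return None             # unknown word, give up
        flight_id += char
    icao = AIRLINE_CODES[span[0]] + flight_id
    return icao if ICAO_PATTERN.match(icao) else None


tokens = "lufthansa four five six descend flight level eight zero".split()
tags = ["B-CALL", "I-CALL", "I-CALL", "I-CALL", "O", "O", "O", "O", "O"]
span = extract_callsign_tokens(tokens, tags)
print(span, map_to_icao(span))
```

A hand-written mapping like this breaks as soon as the transcript contains recognition errors or abbreviated callsigns, which is one reason to learn the mapping instead.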
Since we have two processes, the obvious idea is to train two different networks for the task: one that specializes in tagging and one that maps the sequence tagged as a callsign into the ICAO format. In this setup, both processes can be tuned individually. The drawback of this architecture is that information lost in the first step cannot be recovered in the second step. The other possibility is to train an end-to-end network that directly outputs the ICAO callsign given the ASR transcript as input. This architecture has the benefit that no information is lost in between. Both architectures are visualized in Figure 2. Our experiments showed that the end-to-end approach performs better than the two-network solution in the majority of test cases.
A closer look at Figure 1 reveals that the predicted ICAO callsign contains information that is missing in the labels and in the transcript, namely the last two digits of the flight ID. This information comes from the surveillance data. Callsigns of planes near the location where the ATC communication is recorded are time-matched with the recordings and fed into the network as additional input, as seen in Figure 3. If the transcript contains only part of a callsign, the missing information can be recovered from the surveillance input. The end-to-end network reaches a callsign accuracy of over 90% on clean transcripts if surveillance information is available. On our ASR output with a WER of 28.7, an accuracy of over 80% is reached. The network also shows increased robustness to higher ASR WERs. The accuracy scores for two different datasets can be found in our Interspeech paper submission: “Boosting of contextual information in ASR for air-traffic call-sign recognition”.
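The idea of recovering a partial callsign from surveillance input can be illustrated with a simple prefix match. In the system described here, the network learns this behavior from data; the sketch below is only a hand-written stand-in. The `SurveillanceRecord` class, the function names, the 60-second window, and the sample data are assumptions, not details from our setup.

```python
from dataclasses import dataclass


@dataclass
class SurveillanceRecord:
    callsign: str      # ICAO callsign reported by surveillance
    timestamp: float   # seconds, on the same clock as the recording


def candidates_in_window(records, utterance_time, window_s=60.0):
    """Time-match: keep callsigns seen within window_s seconds of the utterance."""
    return sorted({r.callsign for r in records
                   if abs(r.timestamp - utterance_time) <= window_s})


def complete_callsign(partial, candidates):
    """Return the single candidate that starts with the partial callsign.

    Returns None when no candidate or more than one candidate matches,
    because the partial callsign is then missing or ambiguous.
    """
    matches = [c for c in candidates if c.startswith(partial)]
    return matches[0] if len(matches) == 1 else None


records = [
    SurveillanceRecord("DLH4567", 100.0),
    SurveillanceRecord("BAW12A", 110.0),
    SurveillanceRecord("DLH45", 900.0),   # too far away in time
]
candidates = candidates_in_window(records, utterance_time=120.0)
print(complete_callsign("DLH45", candidates))
```

Here the transcript only yields `DLH45`, and the time-matched surveillance list supplies the two missing digits. A learned model can go further than this sketch, for example by tolerating recognition errors inside the partial callsign.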