IITG-MV Phase IV


The fourth phase of data collection was aimed at facilitating the development of a GMM-UBM based online speaker verification system to be deployed on the telephone network. We used the Asterisk software and a telephony card to connect to the local PSTN (Public Switched Telephone Network). BSNL provided us with a PRI (Primary Rate Interface) line for making and receiving calls on the network.

A 24-port PCI telephony card was used to connect to the local BSNL telephone exchange through the PBX containing the PRI card. The telephony card was housed in a separate server, which connected to the telephone lines. Asterisk was installed on this server to use the card for building the IVR (Interactive Voice Response) system needed for data collection. Asterisk can be used to build a variety of telephony applications, including Private Branch Exchanges (PBXs), Automatic Call Distribution systems (ACDs), VoIP gateways and other related communication projects. Here, we used it to build an IVR system for collecting data from remote subjects over the telephone network. Prerecorded IVR dialogues guided the subjects in providing their speech data, and the IVR system responded to their telephone keypad inputs during recording. The data collection effort can be divided into three parts, as described below.

Part-I

In Part-I, the conference call facility of a mobile phone was used to record the speech data. The facilitator in the laboratory first dialed the number of the BSNL line connected to the server. A welcome message stored on the server was played, asking the caller to press a keypad digit to transfer the call to a particular extension number. A PHP script saved at the called extension was then executed. Before running the call flow, the script asked the facilitator to key in the speaker ID of the subject to be recorded. This speaker ID is unique and was chosen by the facilitator for each subject. The facilitator then called the subject and added him/her to the conference call. The script played a series of instructions to the subject and stored his/her responses in succession. The call flow for the entire recording is shown in Figure-1. A total of 8 instructions were played to the subject during the recording. In addition, 15 sentence files were played, each of which the subject had to repeat.


Figure-1: Call flow diagram for the Phase-IV recording of the speaker recognition database
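The prompt-and-record portion of the Part-I session can be pictured with a minimal Python sketch. It is only an illustration, not the PHP script that was actually used: the function name build_call_plan, all file names, the ordering (instructions first, then sentences) and the idea of one stored recording per prompt are assumptions. Only the counts of 8 instructions and 15 sentences come from the description above.

```python
from typing import List, Tuple

NUM_INSTRUCTIONS = 8   # instruction prompts played to the subject (from the text)
NUM_SENTENCES = 15     # sentence prompts the subject has to repeat (from the text)


def build_call_plan(speaker_id: str) -> List[Tuple[str, str]]:
    """Return (prompt_file, recording_file) pairs for one recording session.

    Hypothetical layout: file names and prompt ordering are assumptions,
    and one recording is assumed to be stored per prompt.
    """
    plan = []
    # Instruction prompts, each followed by a stored response.
    for i in range(1, NUM_INSTRUCTIONS + 1):
        plan.append((f"instruction_{i:02d}.wav", f"{speaker_id}_resp_{i:02d}.wav"))
    # Sentence prompts, each followed by the subject's repetition.
    for j in range(1, NUM_SENTENCES + 1):
        plan.append((f"sentence_{j:02d}.wav", f"{speaker_id}_sent_{j:02d}.wav"))
    return plan


if __name__ == "__main__":
    for prompt, recording in build_call_plan("1001"):
        print(f"play {prompt} -> record {recording}")
```

Keying every recording file on the speaker ID reflects the role of the ID entered by the facilitator at the start of the call: it is what ties the stored responses to one subject.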

Using the above procedure, we collected data from 55 subjects covering all the major regions of the country. The collected speech files were intended to help build the Universal Background Model (UBM). Thus, to obtain the desired variability, the protocol was to collect data from around 5 subjects in each region/state. The regions covered in this way include Assam (North-east), Delhi/Punjab (North), Uttar Pradesh/Madhya Pradesh (representing the Hindi-speaking heartland), Karnataka, Tamil Nadu, Kerala, Andhra Pradesh (all representing the South), and Maharashtra and Gujarat (representing the West).

Since the data was to be used for building the UBM, only one recording session was conducted for each subject. This data was combined with the Phase-III database to build the UBM. Of the 200 speakers in the Phase-III database, 153 were new, i.e., not present in any of the previously collected databases (Phase-I and Phase-II). We combined these 153 speakers with the 55 speakers collected in Phase-IV Part-I to build a 208-speaker UBM. The 153-speaker data set contained only a few females, so to improve the gender balance we tried to increase the proportion of females among the 55 new speakers. Our 55-speaker data set contains 37 females (67.27%). The total data amounts to around 3.75 hours.
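The speaker counts in this paragraph can be checked with a few lines of Python. The variable names are chosen here for illustration; the numbers are the ones stated above. The female share of the 55-speaker set evaluates to 37/55, i.e. about 67.27%.

```python
# Speaker counts as stated in the text.
phase3_new_speakers = 153      # Phase-III speakers not present in Phase-I/II
phase4_part1_speakers = 55     # speakers collected in Phase-IV Part-I
phase4_part1_females = 37      # females among the 55 Phase-IV Part-I speakers

# Size of the combined UBM speaker pool.
ubm_speakers = phase3_new_speakers + phase4_part1_speakers

# Share of females in the Phase-IV Part-I set, in percent.
female_pct = 100.0 * phase4_part1_females / phase4_part1_speakers

print(ubm_speakers)          # 208
print(round(female_pct, 2))  # 67.27
```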

Part-II

Part-II of the data collection effort consisted of collecting data using a different IVR system. We implemented an IVR system that assigned a unique speaker ID to each speaker and asked the subjects to provide 3 minutes of read speech for training the system. We collected data from 89 speakers in this way.

Part-III

Part-III of the database consists of the speakers' claims against speaker IDs for testing. The speakers were asked to read a text for 30 seconds, and this data is used for testing. Trials in which a speaker claimed his/her own speaker ID serve as genuine trials, and trials in which a speaker claimed another speaker's ID serve as impostor trials.
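The construction of genuine and impostor trials can be sketched as follows. This is a minimal illustration under stated assumptions, not the trial list actually used: the function name make_trials, the utterance and speaker names, and the choice of testing every utterance against every enrolled speaker ID are all hypothetical, since the pairing scheme is not specified here.

```python
from typing import Dict, List, Tuple


def make_trials(test_utts: Dict[str, str],
                enrolled_ids: List[str]) -> List[Tuple[str, str, str]]:
    """Build (utterance, claimed_id, label) trials.

    test_utts maps a test utterance name to its true speaker ID.
    Assumption: each utterance is paired with every enrolled ID.
    """
    trials = []
    for utt, true_id in sorted(test_utts.items()):
        for claimed_id in enrolled_ids:
            # A claim against the speaker's own ID is genuine, otherwise impostor.
            label = "genuine" if claimed_id == true_id else "impostor"
            trials.append((utt, claimed_id, label))
    return trials


if __name__ == "__main__":
    # Hypothetical toy data: three enrolled speakers, one test utterance each.
    utts = {"utt_a.wav": "spk1", "utt_b.wav": "spk2", "utt_c.wav": "spk3"}
    for trial in make_trials(utts, ["spk1", "spk2", "spk3"]):
        print(trial)
```

With N enrolled speakers and one test utterance each, this full cross pairing gives N genuine trials and N(N-1) impostor trials, so impostor trials dominate the list as N grows.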

In total, we collected around 7 hours of data across Part-I, Part-II and Part-III of Phase-IV of our data collection effort. Please refer to the IITG DIT MV database documentation for more details.