6f41f212-e96e-11e2-b0af-001517a36ca5: HEY h ey HEY hh ey
0d77a328-e427-11e2-b0af-001517a36ca5: ALLMAN aa l m ax n ALLMAN ao l m ax n
65defa18-e5de-11e2-a387-001517a3798d: OF ah v OF ax v; YOUR y ao r YOUR y ax r
39cbd594-e57e-11e2-87b7-001517a331f1: FOR f ax r FOR f ao r
b92e5823-ea7f-11e2-b869-001517a36ded: WHO h uw WHO hh uw
8eb46f0a-e9b5-11e2-a387-001517a3798d: WI d ah b ax l y uw ay WI w ay
eb993574-e9d3-11e2-b869-001517a36ded: WHAT w aa t WHAT w ah t; YEARS y ih r z YEARS y iy r z; WERE w er r WERE w ax r; THE dh ax THE dh ah; HOLOCAUST h aa l ah k ao s t HOLOCAUST hh aa l ax k ao s t
68c77618-e7ed-11e2-b0af-001517a36ca5: HUT h ah t HUT hh ah t; TEXAS t eh k s ax s TEXAS t eh k s ih s
dedf0820-e808-11e2-b0af-001517a36ca5: THE dh ax THE dh iy; OF ah v OF ax v; THE dh ax THE dh ah; WHAT w aa t WHAT hh w ah t
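The file above has the format ID: file1_word1 \t phone1 \t file2_word1 \t phone2;\s file1_word2 \t phone1 \t file2_word2 \t phone2, where the phones inside each phone field are separated by spaces.
I want to find the utt IDs that satisfy this condition: within every semicolon-separated group, the two words have the same number of phones. In the first entry, for example, h ey and hh ey both have two phones, so the first entry satisfies the condition.
In the last entry, w aa t has fewer phones than hh w ah t, so the last entry does not satisfy it.
The code I wrote keeps getting the loop wrong: the hash ends up with some entries that do not satisfy the condition. Please help me fix it.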
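my %samephCountUtt_hash;
my @diffphCountUtts;
my $diffcount = 0;
foreach my $dtline (@dtLines)
{
    chomp($dtline);
    my ($uttID, $otherstr) = split /:\t+/, $dtline, 2;
    my $allSame = 1;    # assume every pair matches until one does not
    foreach my $wdphCP (split /;\s*/, $otherstr)
    {
        my @parts = split /\t+/, $wdphCP;
        my @ttsPhones = split ' ', $parts[1];
        my @srPhones  = split ' ', $parts[3];
        # compare the phone counts numerically (== / !=, not eq)
        if (@ttsPhones != @srPhones)
        {
            $allSame = 0;
            last;
        }
    }
    # store the utterance only after all of its pairs have been checked;
    # assigning inside the inner loop added it as soon as the first pair matched
    if ($allSame)
    {
        $samephCountUtt_hash{$uttID} = $otherstr;
    }
    else
    {
        $diffcount++;
        push @diffphCountUtts, $dtline;
    }
}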
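For reference, here is a minimal self-contained sketch of the same check that can be run on its own. It assumes the fields really are tab-separated as described above, and the helper name same_phone_counts and the three sample lines (copied from the data, with tabs written as \t) are only for illustration. The key point is that an utterance is classified once, after all of its pairs have been examined.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if every ';'-separated word pair
# has the same number of phones on both sides, otherwise 0.
sub same_phone_counts {
    my ($rest) = @_;
    for my $pair (split /;\s*/, $rest) {
        my @parts = split /\t+/, $pair;
        return 0 unless @parts == 4;          # treat malformed pairs as a mismatch
        my @tts = split ' ', $parts[1];       # phones of the file1 word
        my @sr  = split ' ', $parts[3];       # phones of the file2 word
        return 0 unless @tts == @sr;          # numeric comparison of the counts
    }
    return 1;
}

# Sample lines taken from the data above (assumption: tab-separated fields).
my @lines = (
    "6f41f212-e96e-11e2-b0af-001517a36ca5: HEY\th ey\tHEY\thh ey",
    "65defa18-e5de-11e2-a387-001517a3798d: OF\tah v\tOF\tax v; YOUR\ty ao r\tYOUR\ty ax r",
    "dedf0820-e808-11e2-b0af-001517a36ca5: THE\tdh ax\tTHE\tdh iy; OF\tah v\tOF\tax v; THE\tdh ax\tTHE\tdh ah; WHAT\tw aa t\tWHAT\thh w ah t",
);

my (%same, @diff);
for my $line (@lines) {
    my ($id, $rest) = split /:\s+/, $line, 2;
    if (same_phone_counts($rest)) {
        $same{$id} = $rest;                   # all pairs matched
    } else {
        push @diff, $id;                      # at least one pair differed
    }
}

print "same: $_\n" for sort keys %same;
print "diff: $_\n" for @diff;
```

With these three lines, the first two IDs land in %same and the third (where w aa t has 3 phones but hh w ah t has 4) lands in @diff, even though its first three pairs match.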