Chinaunix
Title:
[Help] Find sets of rows in which certain pairs of columns are exactly swapped
Author:
elaine2017
Time:
2018-05-06 13:12
The data format is as follows:
1 E00548:177:HKH53CCXY:4:1101:10003:10029 TACAGACTGTGG CTCTCCTATAGC chr2 184106244 184106312
2 E00548:177:HKH53CCXY:4:1101:10003:10099 TGATACCGGACA GTGCCTCATCTA chr3 139790591 139790643
3 E00548:177:HKH53CCXY:4:1101:10003:10099 TGATACCGGACA GTGCCTCATCTA chr3 139790643 139790591
4 E00548:177:HKH53CCXY:4:1101:10003:10169 CTTCCATAGGCA AGAGTTCACGGA chr6 26713971 26713996
5 E00548:177:HKH53CCXY:4:1101:10003:10169 CTTCCATAGGCA AGAGTTCACGGA chr6 26713996 26713971
6 E00548:177:HKH53CCXY:4:1101:10003:10240 TAGACGTAGACG TCAAGGAGAACC chr14 37255539 37255588
7 E00548:177:HKH53CCXY:4:1101:10003:10240 TCAAGGAGAACC TAGACGTAGACG chr14 37255588 37255539
8 E00548:177:HKH53CCXY:4:1101:10003:15795 ACGACACTGCTA CTCTCCTATAGC chr14 96799778 96799778
9 E00548:177:HKH53CCXY:4:1101:10003:10029 CTCTCCTATAGC TACAGACTGTGG chr2 184106312 184106244
10 E00548:177:HKH53CCXY:4:1101:10003:15865 CTCTCCTATAGC TGATACCGGACA chr1 235974675 235974675
11 E00548:177:HKH53CCXY:4:1101:10003:15900 TACAGACTGTGG CAAGCAACCGAT chr5 112051485 112051747
12 E00548:177:HKH53CCXY:4:1101:10003:15900 TACAGACTGTGG CAAGCAACCGAT chr5 112051747 112051485
13 E00548:177:HKH53CCXY:4:1101:10003:17272 AGCGGATGAGTA AGCGGATGAGTA chr15 80282260 80282316
14 E00548:177:HKH53CCXY:4:1101:10003:17272 AGCGGATGAGTA AGCGGATGAGTA chr15 80282316 80282260
15 E00548:177:HKH53CCXY:4:1101:10003:17307 ACAGTGGCATGT ATGCGTACCACA chr1 243686846 699370
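(The row numbers in the first column were added by me; the original file does not contain them.)
For example, in rows 6 and 7 the values of columns 2 and 3 are swapped, the values of columns 5 and 6 are likewise swapped, and all other fields are identical (column numbers here exclude the added row number).
The same holds for rows 1 and 9: columns 2 and 3 are swapped, columns 5 and 6 are swapped, and the other fields are identical.
I want to write such paired rows to a file. For the data above, the output should be: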
E00548:177:HKH53CCXY:4:1101:10003:10029 TACAGACTGTGG CTCTCCTATAGC chr2 184106244 184106312
E00548:177:HKH53CCXY:4:1101:10003:10029 CTCTCCTATAGC TACAGACTGTGG chr2 184106312 184106244
E00548:177:HKH53CCXY:4:1101:10003:10240 TAGACGTAGACG TCAAGGAGAACC chr14 37255539 37255588
E00548:177:HKH53CCXY:4:1101:10003:10240 TCAAGGAGAACC TAGACGTAGACG chr14 37255588 37255539
How can I find the rows that satisfy this condition? Could someone suggest an approach?
Author:
1cpuer
Time:
2018-05-07 15:28
Reply to
1#
elaine2017
# [ /home/soio/1bs/awks ] {2018-05-07 12:15:59}
: 1525666559:0;➜ awk 'a[$6]==$7;$6 ~ /6312$|5539$/;$7 ~ /6312$|5539$/{print}' 5665-k
1 E00548:177:HKH53CCXY:4:1101:10003:10029 TACAGACTGTGG CTCTCCTATAGC chr2 184106244 184106312
6 E00548:177:HKH53CCXY:4:1101:10003:10240 TAGACGTAGACG TCAAGGAGAACC chr14 37255539 37255588
7 E00548:177:HKH53CCXY:4:1101:10003:10240 TCAAGGAGAACC TAGACGTAGACG chr14 37255588 37255539
9 E00548:177:HKH53CCXY:4:1101:10003:10029 CTCTCCTATAGC TACAGACTGTGG chr2 184106312 184106244
Author:
elaine2017
Time:
2018-05-07 16:10
Reply to
2#
1cpuer
The data set is very large; wouldn't awk be too slow for this?
Author:
elaine2017
Time:
2018-05-07 16:15
Last edited by elaine2017 on 2018-05-07 16:17
Reply to
2#
1cpuer
I tried your command, and it did not filter anything out.
Author:
1cpuer
Time:
2018-05-07 19:39
Reply to
4#
elaine2017
# [ /home/soio/1bs/awks ] {2018-05-07 19:37:03}
: 1525693023:0;➜ grep '184106\|372555' 5665-k
1 E00548:177:HKH53CCXY:4:1101:10003:10029 TACAGACTGTGG CTCTCCTATAGC chr2 184106244 184106312
6 E00548:177:HKH53CCXY:4:1101:10003:10240 TAGACGTAGACG TCAAGGAGAACC chr14 37255539 37255588
7 E00548:177:HKH53CCXY:4:1101:10003:10240 TCAAGGAGAACC TAGACGTAGACG chr14 37255588 37255539
9 E00548:177:HKH53CCXY:4:1101:10003:10029 CTCTCCTATAGC TACAGACTGTGG chr2 184106312 184106244
Author:
elaine2017
Time:
2018-05-07 20:02
Reply to
5#
1cpuer
grep on 78 GB of data? I doubt that would ever finish...
Author:
1cpuer
Time:
2018-05-08 14:17
Last edited by 1cpuer on 2018-05-09 14:33
Reply to
6#
elaine2017
It can be done.
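# take column 7 of one row (row 1 here) and collect every row that contains it
grep "$(awk 'NR==1 {print $7}' 5665-k)" 5665-k > anfile
# if exactly 2 rows matched, keep them as a candidate pair
[ "$(wc -l < anfile)" -eq 2 ] && cat anfile >> towfile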
# [ /home/soio/1bs/awks ] {2018-05-09 14:26:59}
: 1525847219:0;➜ grep `awk '{print $7}' 5665-k | sed -n 1p` 5665-k
1 E00548:177:HKH53CCXY:4:1101:10003:10029 TACAGACTGTGG CTCTCCTATAGC chr2 184106244 184106312
9 E00548:177:HKH53CCXY:4:1101:10003:10029 CTCTCCTATAGC TACAGACTGTGG chr2 184106312 184106244
# [ /home/soio/1bs/awks ] {2018-05-09 14:27:54}
: 1525847274:0;➜ grep `awk '{print $7}' 5665-k | sed -n 2p` 5665-k
2 E00548:177:HKH53CCXY:4:1101:10003:10099 TGATACCGGACA GTGCCTCATCTA chr3 139790591 139790643
3 E00548:177:HKH53CCXY:4:1101:10003:10099 TGATACCGGACA GTGCCTCATCTA chr3 139790643 139790591
# [ /home/soio/1bs/awks ] {2018-05-09 14:28:26}
: 1525847306:0;➜ grep `awk '{print $7}' 5665-k | sed -n 6p` 5665-k
6 E00548:177:HKH53CCXY:4:1101:10003:10240 TAGACGTAGACG TCAAGGAGAACC chr14 37255539 37255588
7 E00548:177:HKH53CCXY:4:1101:10003:10240 TCAAGGAGAACC TAGACGTAGACG chr14 37255588 37255539
Failed.
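Author:
flywithperl
Time:
2018-05-09 11:41
With 78 GB of data, it is not feasible to load everything into memory first and then compare.
I suggest processing the file in several passes instead:
1. First, gather the rows that share the same columns 1 and 4 into the same new file.
Purpose:
This reduces the number of comparisons needed in step 2 and lets the work fit in limited memory.
Caveat:
If only a few rows share the same columns 1 and 4, writing each group to a file would create too many files; in that case process the groups directly in an in-memory buffer.
Use intermediate files only when many rows share the same columns 1 and 4 and buffering them would take too much memory.
2. In each new file, starting from the first row, search the following rows for a partner in which columns 2 and 3 are swapped and columns 5 and 6 are also swapped.
3. Write the rows that meet the requirement to the result file.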
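The three steps above could be sketched roughly as follows. This is only an illustration, not flywithperl's actual code: instead of writing one file per group, it lets sort bring partner rows next to each other (GNU sort spills to temporary files when the data does not fit in memory). The function name find_pairs_sorted is made up for this example. It assumes the original 6-column layout without the added row number, and it skips rows whose columns 2 and 3 are identical, because the expected output in the first post leaves out rows 13 and 14; drop that check if such pairs are wanted.

```shell
# Sketch only. Assumes the original 6-column layout (no added row number):
#   $1 read ID, $2 and $3 sequences, $4 chromosome, $5 and $6 positions.
# find_pairs_sorted is a hypothetical helper name.
#
# Pass 1 (awk):  prefix every row with a key that is identical for a row and
#                its swapped partner, plus a flag (F or R) for its orientation.
# Pass 2 (sort): order by key, then flag, so partners become adjacent, F first.
# Pass 3 (awk):  in each key group, print the first F row with the first R row.
find_pairs_sorted() {
    awk '
    $2 == $3 { next }   # assumption: rows with identical sequences are not wanted
    {
        a = $2 " " $3 " " $5 " " $6
        b = $3 " " $2 " " $6 " " $5
        if (a < b) print $1 " " $4 " " a "\t" "F" "\t" $0
        else       print $1 " " $4 " " b "\t" "R" "\t" $0
    }' "$@" |
    LC_ALL=C sort -t "$(printf '\t')" -k1,1 -k2,2 |
    awk '{
        i = index($0, "\t")              # end of the key
        k = substr($0, 1, i - 1)
        flag = substr($0, i + 1, 1)      # F or R
        line = substr($0, i + 3)         # the original row, untouched
        if (k != key) { key = k; have_fwd = 0; done = 0 }
        if (done) next
        if (flag == "F") { if (!have_fwd) { fwd = line; have_fwd = 1 } }
        else if (have_fwd) { print fwd; print line; done = 1 }
    }'
}
```

It would be called as `find_pairs_sorted input.txt > pairs.txt` (both file names are placeholders). Within each pair, the row whose columns 2, 3, 5, 6 sort lower as a string is printed first, so the order can differ from the order in the input file.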
Author:
1cpuer
Time:
2018-05-23 17:18
Last edited by 1cpuer on 2018-05-23 17:37
Reply to
6#
elaine2017
# [ /home/soio/1bs/awks ] {2018-05-23 17:16:48}
: 1527067008:0;➜ awk '{print $3" "$6" "$1}{print $4" "$7" "$1}' 13.kf | awk '{a[$1" "$2]=a[$1" "$2]?a[$1" "$2]"!@#$"$3 : $0}END{for(i in a) print a[i]}' | grep -o '[0-9][0-9]*!@#$[0-9][0-9]*'
13!@#$14
1!@#$9
1!@#$9
13!@#$14
6!@#$7
6!@#$7
# [ /home/soio/1bs/awks ] {2018-05-23 17:16:58}
: 1527067018:0;➜ sed -n '1p;9p;6p;7p;13p;14p' 13.kf
1 E00548:177:HKH53CCXY:4:1101:10003:10029 TACAGACTGTGG CTCTCCTATAGC chr2 184106244 184106312
6 E00548:177:HKH53CCXY:4:1101:10003:10240 TAGACGTAGACG TCAAGGAGAACC chr14 37255539 37255588
7 E00548:177:HKH53CCXY:4:1101:10003:10240 TCAAGGAGAACC TAGACGTAGACG chr14 37255588 37255539
9 E00548:177:HKH53CCXY:4:1101:10003:10029 CTCTCCTATAGC TACAGACTGTGG chr2 184106312 184106244
13 E00548:177:HKH53CCXY:4:1101:10003:17272 AGCGGATGAGTA AGCGGATGAGTA chr15 80282260 80282316
14 E00548:177:HKH53CCXY:4:1101:10003:17272 AGCGGATGAGTA AGCGGATGAGTA chr15 80282316 80282260
# [ /home/soio/1bs/awks ] {2018-05-23 17:36:00}
: 1527068160:0;➜ awk '{print $3" "$6" "$1}{print $4" "$7" "$1}' 13.kf | awk '{a[$1" "$2]=a[$1" "$2]?a[$1" "$2]"!@#$"$3 : $0}END{for(i in a) print a[i]}' | grep -o '[0-9][0-9]*!@#$[0-9][0-9]*' | sort -nk 1 | sed 's/\!\@\#\$/p;/ ;s/$/p;/' | sed 's/;/&\n/' | awk '!a[$0]++' | sed ':1;N;s/\n//;t1'
1p;9p;6p;7p;13p;14p;
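For comparison, the same pairing can probably be done in a single awk pass, without generating a sed script from row numbers. The following is a minimal sketch under stated assumptions, not the method used above: find_pairs_onepass is a made-up name, the input is assumed to be the original 6-column file (strip the added row number first when testing on the numbered sample), and rows whose columns 2 and 3 are identical are skipped because the expected output in the first post omits rows 13 and 14. Every row still waiting for a partner stays in memory, so on a 78 GB file this may run out of RAM.

```shell
# Sketch only. Columns: $1 read ID, $2 and $3 sequences, $4 chromosome,
# $5 and $6 positions. find_pairs_onepass is a hypothetical helper name.
find_pairs_onepass() {
    awk '
    $2 == $3 { next }   # assumption: rows with identical sequences are not wanted
    {
        # key of this row, and the key its swapped partner would have
        fwd = $1 SUBSEP $4 SUBSEP $2 SUBSEP $3 SUBSEP $5 SUBSEP $6
        rev = $1 SUBSEP $4 SUBSEP $3 SUBSEP $2 SUBSEP $6 SUBSEP $5
        if (rev in seen) {
            # partner was seen earlier: print both rows, then forget the partner
            print seen[rev]
            print $0
            delete seen[rev]
        } else if (!(fwd in seen)) {
            seen[fwd] = $0
        }
    }' "$@"
}
```

Pairs are printed at the moment their second row is read, so the output follows the position of the later row of each pair rather than the earlier one.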