Script request: processing a file with Python

#1 | Posted 2009-11-10 10:32
The file format is:

192.168.0.181 - - [04/Nov/2009:14:35:18 +0800] "CONNECT mail.google.com:443 HTTP/1.1" 200 11163 TCP_MISS:DIRECT
192.168.0.181 - - [04/Nov/2009:14:35:18 +0800] "GET http://www.jingoal.com/favicon.ico HTTP/1.1" 302 662 TCP_MISS:DIRECT
192.168.0.103 - - [04/Nov/2009:14:35:19 +0800] "POST http://207.46.124.200/gateway/gateway.dll? HTTP/1.1" 200 342 TCP_MISS:DIRECT
192.168.0.181 - - [04/Nov/2009:14:35:19 +0800] "GET http://www.jingoal.com/portal/publicity/manage_bbs/news/main.html HTTP/1.1" 200 1976 TCP_MEM_HIT:NONE
192.168.0.181 - - [04/Nov/2009:14:35:20 +0800] "GET http://www.jingoal.com/favicon.ico HTTP/1.1" 302 662 TCP_MISS:DIRECT
192.168.0.181 - - [04/Nov/2009:14:35:21 +0800] "GET http://www.jingoal.com/portal/pu ... gmaterial/main.html HTTP/1.1" 200 3502 TCP_MEM_HIT:NONE
192.168.0.103 - - [04/Nov/2009:14:35:21 +0800] "POST http://207.46.124.200/gateway/gateway.dll? HTTP/1.1" 200 342 TCP_MISS:DIRECT
192.168.0.181 - - [04/Nov/2009:14:35:21 +0800] "GET http://www.jingoal.com/favicon.ico HTTP/1.1" 302 662 TCP_MISS:DIRECT
192.168.0.103 - - [04/Nov/2009:14:35:23 +0800] "POST http://207.46.124.200/gateway/gateway.dll? HTTP/1.1" 200 342 TCP_MISS:DIRECT
192.168.0.181 - - [04/Nov/2009:14:35:24 +0800] "GET http://www.jingoal.com/portal/cn/index.jsp HTTP/1.1" 200 5339 TCP_MISS:DIRECT
192.168.0.103 - - [04/Nov/2009:14:35:25 +0800] "POST http://207.46.124.200/gateway/gateway.dll? HTTP/1.1" 200 341 TCP_MISS:DIRECT
192.168.0.181 - - [04/Nov/2009:14:35:25 +0800] "GET http://www.jingoal.com/favicon.ico HTTP/1.1" 302 662 TCP_MISS:DIRECT
192.168.0.103 - - [04/Nov/2009:14:35:27 +0800] "POST http://207.46.124.200/gateway/gateway.dll? HTTP/1.1" 200 416 TCP_MISS:DIRECT
192.168.0.103 - - [04/Nov/2009:14:35:29 +0800] "POST http://207.46.124.200/gateway/gateway.dll? HTTP/1.1" 200 342 TCP_MISS:DIRECT
192.168.0.181 - - [04/Nov/2009:21:35:31 +0800] "CONNECT mail.google.com:443 HTTP/1.1" 200 1948 TCP_MISS:DIRECT
192.168.0.103 - - [04/Nov/2009:21:35:31 +0800] "POST http://207.46.124.200/gateway/gateway.dll? HTTP/1.1" 200 342 TCP_MISS:DIRECT
192.168.0.103 - - [04/Nov/2009:14:35:33 +0800] "POST http://207.46.124.200/gateway/gateway.dll? HTTP/1.1" 200 1198 TCP_MISS:DIRECT
192.168.0.103 - - [04/Nov/2009:14:35:35 +0800] "POST http://207.46.124.200/gateway/gateway.dll? HTTP/1.1" 200 341 TCP_MISS:DIRECT
192.168.0.181 - - [04/Nov/2009:14:35:36 +0800] "CONNECT mail.google.com:443 HTTP/1.1" 200 165 TCP_MISS:DIRECT
192.168.0.103 - - [04/Nov/2009:14:35:37 +0800] "POST http://207.46.124.200/gateway/gateway.dll? HTTP/1.1" 200 341 TCP_MISS:DIRECT
192.168.0.103 - - [04/Nov/2009:21:35:39 +0800] "POST http://207.46.124.200/gateway/gateway.dll? HTTP/1.1" 200 342 TCP_MISS:DIRECT
192.168.0.193 - - [04/Nov/2009:14:35:39 +0800] "POST http://www.jingoal.com/portal/pu ... stration/result.jsp HTTP/1.1" 200 333 TCP_MISS:DIRECT
192.168.0.103 - - [04/Nov/2009:21:35:41 +0800] "POST http://207.46.124.200/gateway/gateway.dll? HTTP/1.1" 200 342 TCP_MISS:DIRECT
192.168.0.193 - - [04/Nov/2009:21:35:41 +0800] "GET http://www.jingoal.com/favicon.ico HTTP/1.1" 302 662 TCP_MISS:DIRECT
192.168.0.127 - - [04/Nov/2009:14:35:42 +0800] "GET http://www.tpy100.com/product.aspx? HTTP/1.1" 200 11045 TCP_MISS:DIRECT
192.168.0.103 - - [04/Nov/2009:21:35:43 +0800] "POST http://207.46.124.200/gateway/gateway.dll? HTTP/1.1" 200 342 TCP_MISS:DIRECT
192.168.0.193 - - [04/Nov/2009:01:35:45 +0800] "GET http://www.jingoal.com/portal/cn/mobile/main.jsp HTTP/1.1" 200 3563 TCP_MISS:DIRECT


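Requirements:
       1. Sort by IP, and for each IP list the 20 most frequently accessed domains with their access counts.
       2. Filter by time: count only requests made during 9:00-12:00 and 13:00-18:00; ignore all other times.

[ Last edited by tony_413 on 2009-11-10 10:34 ]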

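#2 | Posted 2009-11-10 11:09
I wrote one in shell myself, but it is far too slow once the file gets large. It only sorts by IP and prints up to 20 domains for each IP; the domains are not sorted by count, and there is no time filtering.

I'm posting it here in the hope that someone more experienced can give me some pointers.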

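#!/bin/bash
# bash rather than sh: the script relies on echo -e.
SLog="/home/tony/shell/log/test.log"
# Unique client IPs: the first whitespace-separated field of each line.
IPs=$(awk '{ print $1 }' "$SLog" | sort -u)
# Unique host names: with "/" as the separator, the host is field 5 of GET/POST lines.
Doms=$(awk -F"/" '{ print $5 }' "$SLog" | sort -u | grep -E "^[a-zA-Z0-9][a-z0-9]{0,}.\W.{1,}(com|cn|com.cn|net)$")
#Doms=`awk '{ print $2 }' $SLog | sort | uniq`
#echo $Doms
echo -e "+-----------------+-----------------------------+-------+"
echo -e "|      IP         |      Site and Domain        | Count |"
echo -e "+-----------------+-----------------------------+-------+"
for ip in $IPs
do
        #ip_total=`grep "$ip" $SLog | wc -l`
        #echo -e "$ip\t$ip_total"
        i=0     # rows printed for this IP; reset once per IP, not once per domain
        for dom in $Doms
        do
                # Stop after 20 domains for this IP.
                [ "$i" -ge 20 ] && break
                # Lines from this IP that mention this domain (fixed-string match).
                count=$(grep -F "$ip " "$SLog" | grep -cF "$dom")
                if [ "$count" -gt 0 ]
                then
                        echo -e "|  $ip  |    $dom    |   $count   |"
                        echo -e "+-----------------+-----------------------------+-------+"
                        i=$((i + 1))
                fi
        done
done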

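#3 | Posted 2009-11-10 12:54
Here is the idea:

while True:
   # read one line
   # if the time falls within 9:00-12:00 or 13:00-18:00, carry on; otherwise continue
   # use the IP as the dict key and add the domain to its value (the value can itself be a nested dict keyed by domain)

Or just split each log line and store the fields in a database; then you can query them however you like.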
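
A minimal Python sketch of the idea outlined above, assuming Python 3 and the log layout shown in the first post (client IP in the first field, timestamp in the fourth, request target in the seventh). The names `parse`, `top_domains` and `WORK_HOURS` are made up for illustration, and the time filter works at whole-hour granularity only.

```python
from collections import Counter, defaultdict

# Working hours from the original request: 9:00-12:00 and 13:00-18:00.
WORK_HOURS = set(range(9, 12)) | set(range(13, 18))


def parse(line):
    """Return (ip, hour, host) for one log line, or None if it is malformed."""
    fields = line.split()
    if len(fields) < 7:
        return None
    ip = fields[0]
    # fields[3] looks like "[04/Nov/2009:14:35:18"; the hour follows the first colon.
    try:
        hour = int(fields[3].split(":")[1])
    except (IndexError, ValueError):
        return None
    target = fields[6]  # a URL for GET/POST, "host:port" for CONNECT
    if "://" in target:
        host = target.split("/")[2]
    else:
        host = target
    host = host.split(":")[0]  # drop any port number
    return ip, hour, host


def top_domains(lines, limit=20):
    """Map each IP to its `limit` most requested hosts during working hours."""
    counts = defaultdict(Counter)  # ip -> Counter mapping host -> hits
    for line in lines:
        parsed = parse(line)
        if parsed is None:
            continue
        ip, hour, host = parsed
        if hour not in WORK_HOURS:
            continue
        counts[ip][host] += 1
    # IPs are ordered as plain strings, not numerically.
    return {ip: counts[ip].most_common(limit) for ip in sorted(counts)}
```

Because a file object yields one line at a time, passing an open log file to `top_domains` would process it in a single pass without loading the whole file into memory, which is where the nested `grep` loops of the shell version lose most of their time.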

#4 | Posted 2009-11-10 13:21
Thanks to the poster above.

How do I extract the IP and the domain from a line once I've read it in? I keep getting errors when I run an awk command from Python.

while True:
        os.system("awk '{ print $1}'" + line)

#5 | Posted 2009-11-10 13:57

Reply to #4 by tony_413

You can handle this yourself with regular expressions.
Use the re module.
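
For illustration, here is one way the re module could pull the IP, the hour and the host out of a line. It is a sketch that assumes Python 3 and the log format from the first post; the patterns and the names `LOG_RE`, `HOST_RE` and `extract` are hypothetical, not anything the poster supplied.

```python
import re

# One pattern for the start of the line: client IP, hour, method, request target.
LOG_RE = re.compile(
    r'^(?P<ip>\d+\.\d+\.\d+\.\d+) \S+ \S+ '
    r'\[\d+/\w+/\d+:(?P<hour>\d+):\d+:\d+ [^\]]+\] '
    r'"(?P<method>\w+) (?P<target>\S+)'
)
# Host part of the target: skip an optional "scheme://", stop at "/", ":" or "?".
HOST_RE = re.compile(r'^(?:\w+://)?(?P<host>[^/:?]+)')


def extract(line):
    """Return (ip, hour, host) from one log line, or None if it does not match."""
    m = LOG_RE.match(line)
    if m is None:
        return None
    host = HOST_RE.match(m.group("target"))
    if host is None:
        return None
    return m.group("ip"), int(m.group("hour")), host.group("host")
```

Compiling the patterns once at module level avoids rebuilding them for every line, which matters on a log with more than a million lines.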

#6 | Posted 2009-11-10 14:06
I'm not familiar with Python's re module. Could the poster above give me an example?

#7 | Posted 2009-11-10 14:48
Any book will do,
for example Core Python Programming or Programming Python.
There should be ready-made examples online:
http://www.baidu.com/s?word=+pyt ... 3&wd=+python+re

#8 | Posted 2009-11-10 14:52
I looked at a few and couldn't make sense of any of them. It would be best if the poster above gave an example.

For instance: GET http://www.tpy100.com

extract www.tpy100.com

[ Last edited by tony_413 on 2009-11-10 15:00 ]
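
For the specific case asked about here, a regular expression is not strictly needed. The sketch below assumes Python 3, where the standard library's `urllib.parse.urlsplit` can return the host directly; the helper name `host_of` is made up for illustration.

```python
from urllib.parse import urlsplit


def host_of(target):
    """Return the host name from a request target: a URL or a bare "host:port"."""
    if "://" not in target:
        # CONNECT lines carry "host:port" with no scheme; prefixing "//" makes
        # urlsplit treat the text as a network location rather than a path.
        target = "//" + target
    return urlsplit(target).hostname


print(host_of("http://www.tpy100.com"))   # www.tpy100.com
print(host_of("mail.google.com:443"))     # mail.google.com
```

Note that `hostname` is returned in lower case and without the port, which is convenient when the result is used as a counting key.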

#9 | Posted 2009-11-10 14:54
awk '{w=substr($4,14,2)+0; if((w>=9&&w<12)||(w>=13&&w<18)){print $0}}' ww
awk '{a[$1]++}END{for (i in a) if(a[i]>20){print i,a[i]}}' ww
awk is a better fit for this kind of job; the two commands above can be combined.

[ Last edited by jiang_ocean on 2009-11-10 15:00 ]

#10 | Posted 2009-11-10 17:08
Running: awk  '{a[$1]++}END{for (i in a )if(a>20){print i,a}}'
fails with:
awk: (FILENAME=log/test.log FNR=1359859) fatal: attempt to use array `a' in a scalar context