[Text Processing] Help!!! Keyword search over a large source tree

#1 | Posted 2015-09-21 20:31
I want to search the Android source tree for a keyword (which may appear inside comments) and print the files that contain it.

Ideas so far:
1. Search with OpenGrok and redirect the results to a text file. // I don't understand how OpenGrok works internally, so I haven't managed to set this up.
2. Build a ctags index and search that. // I couldn't find a command that does this.

The tree is so large that grep takes far too long, so plain grep is not practical.

My questions:
1. Can this be done with ctags? Could someone point me in the right direction?
   If ctags is used, how are keywords inside comments handled?
2. Is there any other good approach?
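One note on idea 2: ctags records only definitions (functions, types, variables, macros), never comment text, so a keyword that appears only in a comment will not be in a tags file. A faster full-text route is sketched below, assuming a standard repo-managed AOSP checkout; KEYWORD is a placeholder:

# git grep is multi-threaded and only looks at tracked files, so it is
# much faster than a recursive grep over the whole work tree.
# REPO_PATH is set by 'repo forall' to each project's path, so the
# output becomes a tree-relative file list.
repo forall -c 'git grep -l "KEYWORD" | sed "s|^|$REPO_PATH/|"' 2>/dev/null | sort -u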


#2 | Posted 2015-09-21 23:25
If even grep -rl is too slow, ordinary commands are not going to cut it. I don't know whether a tool written in a more efficient language exists for this; let's wait for an expert to reply.
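If building an index is not an option, the wall-clock time of a one-shot grep can at least be cut by running it in parallel. A rough sketch, assuming GNU find, xargs, and coreutils; adjust the -name filters to the file types of interest:

# Feed batches of source files to grep on every CPU core;
# grep -l prints each matching file name once.
find . -type f \( -name '*.java' -o -name '*.c' -o -name '*.cpp' -o -name '*.h' \) -print0 \
  | xargs -0 -P "$(nproc)" -n 200 grep -l "KEYWORD" 2>/dev/null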

#3 | Posted 2015-09-22 09:04
I have a large electronic library (over 15,000 books) and I was looking for a way to cope with this mass of information. I didn't like the idea of a special catalog, since entering the metadata would take a lot of manual work. Besides, my books come in various formats: HTML, RTF, DOC, PDF, DjVu. These files lack metadata far too often, and I thought a local indexing service with full-text search might solve my problem. I knew there were more options to choose from than just Google, but I could not find a good modern comparison. Even the table in Wikinfo's Comparison of desktop search software contained too many errors, as I discovered.

I had to compare them myself.

My task imposed certain restrictions on the one hand, and made other criteria irrelevant on the other. I was especially interested in a wide gamut of supported file types, in the ability to add new ones (EPUB, FB2, html.zip), and in an extensive query language. All software, except for GDS and DocFetcher, was installed from the Ubuntu 9.10 repositories.

I have no special preferences regarding the backend; it may be a Xapian- or Lucene-based tool, or even one with a custom backend. On the other hand, Xapian usually requires more disk space, and there is never enough space on a desktop.
Beagle

beagle-project.org
The list of supported file types is quite large: Beagle handles typical office files, source code, LaTeX sources, images, audio and video files, RPM and DEB packages, e-mail from Evolution, Thunderbird and KMail, IM and IRC logs, RSS feeds and many more. Plus, you are free to extend it: I could add new file types by editing one file, /etc/beagle/external-filters.xml.

The indexing process can run in two modes, CPU-lenient and CPU-intensive (toggled with the EXERCISE_THE_DOG environment variable). The search engine is based on Lucene.Net. I have no idea why the developers chose this exotic platform to implement Beagle, but Beagle works, and it works well.
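For the record, the variable is meant to be set in the daemon's environment; a minimal sketch, assuming the daemon binary is called beagled as in the Ubuntu package:

# Assumption: with EXERCISE_THE_DOG set, beagled indexes at full speed
# instead of throttling itself in the background.
EXERCISE_THE_DOG=1 beagled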
Beagle understands limited (very limited, actually) regexps (*). You can search for phrases, exclude words (-word), use the OR operator, specify dates when the file was created (on, before, after and even between!), limit the search to a file type, and name the exact directory where to look for the files. Unfortunately, you cannot point at a directory under which Beagle should search recursively.

You can even use the metadata of audio and image files, as in the examples from the manual:
artist:Beatles ext:mp3 OR ext:ogg -album:"Abbey Road"
You can specify to search in mail attachments, to search by music genres, mailing lists, IM correspondents and much, MUCH more.
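The same query syntax is available from a terminal through beagle-query, the text-mode client shipped with Beagle; a small sketch (prefix support may vary between versions):

# Phrase search restricted to PDF files, excluding matches with "draft".
beagle-query '"jack london" ext:pdf -draft'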
Beagle tends to create huge log files in ~/.beagle/Logs.

Beagle has a web interface. It's very easy to start using it, but not so easy to make use of it, since the alleged links to the results are not exactly links.
The Beagle web site includes information on the query syntax and on extending Beagle, but finding that information is next to impossible unless you use Google.

The index for a 45 GB home partition was only about 700 MB.

Google Desktop Search

desktop.google.com/linux/
Google Desktop Search supports OpenOffice.org and MS Office files, PDF, HTML, TXT, audio and image files, and email from Thunderbird. Strangely enough, it does not index zipped archives.

I could not add new file types, not even plain text with a different extension. I was pretty sure that GDS supported stemming, though not regexps. To my surprise, stemming did not work in GDS. Nor did regexps. It does not even support the AND and OR keywords.

Otherwise, the query syntax is acceptable. You can point at the directory where the file you are looking for is located, or at a directory under which the file is supposed to be. You can search for phrases or exclude words. I used GDS for some years and it works great as long as you use it in the way Google intended. While suitable for an average office cubicle, it was next to useless for my purposes.
The index size was about 1.7 GB for 50 GB of data.

Recoll

www.lesbonscomptes.com/recoll/
A large number of file types is supported natively, including plain text, HTML, maildir and mailbox files, OpenOffice.org, MS Office 2007, AbiWord, LyX, KWord and Scribus files, and Gaim logs. Many more are supported with external helpers: DOC, XLS, PDF, DjVu, MP3, image files, and so on. Feel free to add to the list; it's easy: one file establishes associations between extensions and MIME types, another specifies how the data is extracted from a file of a certain MIME type, and a third defines the applications used to open each MIME type.
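To illustrate, teaching Recoll about a hypothetical .fb2 extension might look roughly like this. The file names (~/.recoll/mimemap, ~/.recoll/mimeconf) are real, but the exact entries and the rclfb2 filter name below are assumptions, so check the Recoll manual:

# In ~/.recoll/mimemap: associate the extension with a MIME type.
.fb2 = text/x-fb2

# In ~/.recoll/mimeconf, [index] section: name the filter that
# extracts indexable text from that MIME type (hypothetical script).
text/x-fb2 = exec rclfb2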

Recoll is built around the Xapian engine.
I had the impression that indexing takes much longer with Recoll than with the other tools. When indexing RTF with unrtf, Recoll created a heap of WMF files in my home directory. Recoll has no indexing daemon that runs in the background all the time; instead, recollindex is to be launched from time to time (with cron, for example).
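For example, a per-user crontab entry (added with crontab -e) that refreshes the index every night at 3:00:

# min hour day month weekday  command
0 3 * * * recollindex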

The manual mentions stemming support, but also points out that it is done the other way round: stemming is not baked into the database, as in other indexing engines; the query is stemmed instead. Unfortunately, my version gave different results when searching for the plural 'notebooks' and the singular 'notebook,' so I assume stemming does not work in my installation of Recoll. Recoll understands regexps pretty well, which to a certain degree compensates for the problems with stemming.
The query language is rich, modeled after the Xesam End User Search Language. As with Beagle, you can use the dir: prefix to limit the search path to one directory, but you cannot specify a directory tree. Alas! Other useful prefixes include title, author, ext (for file type), etc.

The search client, recoll, is a GUI program, but with the -t option it runs in text mode. This means that instead of specifying a directory tree, I can just grep the results for a string, like this:
recoll -t -q \"jack london\" | grep /library/fiction/adventure
Note that for the command-line client, you have to escape the quotation marks that denote a phrase search.
Recoll, unlike some other tools, has a decent user manual, containing information on query syntax and adding support for new file types.

The index size put a damper on my enthusiasm: for a 50 GB home directory it was more than 5 GB.
Strigi

strigi.sourceforge.net/
Strigi supports regular expressions. Theoretically, Strigi should support plain text files, PDF, DEB and RPM packages, OpenOffice.org documents, and zipped files. Strigi was also the only program that successfully indexed EPUB files without customization, interpreting them as plain ZIP archives with HTML, NCX, etc. inside.

There's little I can say about this program. The daemon kept crashing when I tested it, so I could not even finish building the index for my home directory. The client erroneously classified a lot of hits as "email."
The incomplete(?) index size was about 750 MB.

Tracker

projects.gnome.org/tracker
Tracker is a part of the GNOME Project and it tries to adhere to various useless technologies, like DBus. Tracker introduces the concept of file tags, thus overcomplicating the task of file management. I admit that the notion of file tags might be reasonable, but only if it is supported universally and tags can be freely backed up, copied, and so on. Fortunately, the tags are not obligatory for Tracker.

The full list of supported file types is unavailable, but the web site mentions image, audio, video and text files, source code, applications, playlists, IM conversations, and so on. No email, bookmarks, or contacts as yet, though. The indexing daemon would segfault occasionally, and I could not finish indexing.
As a matter of fact, Tracker was designed as a metadata search tool (its full name is MetaTracker), but the normal use case is plain full-text search. Tracker was reportedly written to work well even on machines with 128 or 256 MB of RAM. Judging by the slowness of indexing, that may well be true. And I was wrong earlier: Recoll was not the slowest indexer; Tracker was.

I could not find a good user manual.
DocFetcher

docfetcher.sourceforge.net/en/index.html
Supported file types: HTML, plain text, PDF, Microsoft Office (doc, xls, ppt), Microsoft Office 2007 (docx, xlsx, pptx), OpenOffice.org Writer, Calc, Draw, and Impress, RTF, AbiWord (abw, abw.gz, zabw), CHM, Visio, SVG.

DocFetcher is written in Java. Indexing is fast and CPU-sparing. DocFetcher comes in two flavors: a binary installable package and a "portable" version, which you can run right from your home directory.

DocFetcher supports regular expressions (at least * and ?), phrase search, the AND and OR keywords, and search in content or in metadata (author and title fields). It does not index zipped files. It is easy to add new filename extensions to be treated as yet another text or HTML file, but I could not add a new file type that needs special treatment. For me this means that I cannot process custom XML to convert the content to the proper charset. That's a problem.
An interesting query feature is term boosting: "You can assign custom weights to words, thus increasing or decreasing the level of matching for a particular document if the weighted word occurs in it. This allows you to influence the relevance sorting of the result page." For example, dog^4 cat will bring the documents containing "dog" to the top of the result page.

The manual can be found in the downloaded archive, but it is very brief.
Pinot

pinot.berlios.de/
Like Tracker and Strigi, Pinot is built around DBus. Its indexer uses the same Xapian engine as Recoll, so I could use Pinot's text-mode client to query the database built by the Recoll indexer. Pinot can use other databases too, but I was not interested in that option. The crawler takes a huge share of RAM and CPU: it ate up 70% of the RAM on my PC, causing some other programs to crash, so I had to leave it running overnight to complete indexing.
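That cross-querying looked roughly like the sketch below; pinot-search is Pinot's text-mode client and ~/.recoll/xapiandb is Recoll's default index location, but the exact argument order is an assumption here, so check pinot-search --help:

# Query the Xapian database that Recoll built, for a phrase.
pinot-search xapian ~/.recoll/xapiandb "jack london"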

The documentation consists of one Readme file and a couple of web pages. Quoting these web pages: "The following document types are supported internally:
plain text
HTML
XML
mbox, including attachments and embedded documents
MP3, Ogg Vorbis, FLAC
JPEG
common archive formats (tar, Z, gz, bzip2, deb)
ISO 9660 images"
"The following document types are supported through external programs:
PDF (pdftotext required)
RTF (unrtf required)
OpenDocument/StarOffice files (unzip required)
MS Word (antiword required)
PowerPoint (catppt required)
Excel (xls2csv required)
DVI (catdvi required)
DjVu (djvutext required)
RPM (rpm required)"
Indeed, new file types are defined in a file named external-filters.xml, very similar (but not identical, the Pinot developers warn) to the file of the same name used by Beagle.

I have to say that these external programs made indexing PDF, RTF, and other such files a slow affair: indexing a single PDF document took up to two minutes.
Conclusion

Recoll and Pinot may be considered good alternatives to Beagle, but the size of the Xapian index database leaves just one choice for me: Beagle.

#4 | Posted 2015-09-22 10:33
I got it done with recoll.