Posted on 2015-01-27 12:16

Some things to note
The solution in this post is written in Java.

As I understand it, the search approach hinges on two things:

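Segmentation

Break a string such as "women" into "wo'men". Once the pinyin is segmented this way, both full-pinyin matching and first-letter (initials) matching are easy to handle at search time.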
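
To make that concrete, here is a minimal sketch of how a segmented string could serve both kinds of match. The class and method names are my own for illustration, and treating a query as a prefix is an assumption, not something the approach requires.

```java
import java.util.Locale;

class SegmentedPinyinMatch {
    // Full pinyin: drop the separators, e.g. "wo'men" -> "women".
    static String fullPinyin(String segmented) {
        return segmented.replace("'", "");
    }

    // Initials: first letter of every syllable, e.g. "wo'men" -> "wm".
    static String initials(String segmented) {
        StringBuilder sb = new StringBuilder();
        for (String syllable : segmented.split("'")) {
            if (!syllable.isEmpty()) {
                sb.append(syllable.charAt(0));
            }
        }
        return sb.toString();
    }

    // A query matches if it is a prefix of either form (assumed policy).
    static boolean matches(String segmented, String query) {
        String q = query.toLowerCase(Locale.ROOT);
        if (q.isEmpty()) {
            return false;
        }
        return fullPinyin(segmented).startsWith(q) || initials(segmented).startsWith(q);
    }

    public static void main(String[] args) {
        System.out.println(matches("wo'men", "women")); // true
        System.out.println(matches("wo'men", "wm"));    // true
        System.out.println(matches("wo'men", "ni"));    // false
    }
}
```

Both derived forms come from the same stored string, so only the segmented pinyin needs to be kept.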

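Building the lexicon

I have run into two common search scenarios so far, and they affect how the lexicon is built as follows:

1. Contact search
The lexicon sees frequent additions and deletions.

2. Site search
Uses a lexicon that is maintained in advance.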

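For contact search, when a contact is created you can convert the contact's name to pinyin and store it in a separate pinyin column in the database, then match against that column at search time.
For site search, you can build a pinyin lexicon directly and match against the pinyin at search time.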
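
As a rough sketch of the contact-search flow, the class below uses an in-memory map in place of the database table, with the name as the key and the pinyin "column" as the value. Everything here is hypothetical: ContactIndex, its methods, and the hard-coded converter in main (the two escaped names are 张三 and 李四) are stand-ins for whatever table and pinyin converter you actually use.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

class ContactIndex {
    // Stands in for a table with a name column and a pinyin column.
    private final Map<String, String> pinyinByName = new LinkedHashMap<>();
    // Converts a name to segmented pinyin (for example, backed by pinyin4j).
    private final Function<String, String> toPinyin;

    ContactIndex(Function<String, String> toPinyin) {
        this.toPinyin = toPinyin;
    }

    // Creating or updating a contact recomputes the pinyin column.
    void save(String name) {
        pinyinByName.put(name, toPinyin.apply(name));
    }

    void delete(String name) {
        pinyinByName.remove(name);
    }

    // Matches the query against the stored pinyin, ignoring separators.
    List<String> search(String query) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> row : pinyinByName.entrySet()) {
            if (row.getValue().replace("'", "").startsWith(query)) {
                hits.add(row.getKey());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Hypothetical converter that only knows two names.
        Map<String, String> fake = new HashMap<>();
        fake.put("\u5f20\u4e09", "zhang'san");
        fake.put("\u674e\u56db", "li'si");
        ContactIndex index = new ContactIndex(fake::get);

        index.save("\u5f20\u4e09");
        index.save("\u674e\u56db");
        System.out.println(index.search("zhang").size()); // 1
        index.delete("\u5f20\u4e09");
        System.out.println(index.search("zhang").size()); // 0
    }
}
```

The point of the design is that conversion happens once, on write, so searching never has to touch the Chinese text.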

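Implementation
First, building the lexicon. We clearly need a table that holds at least the Chinese words and their corresponding pinyin.
For contact search, you would probably use a third-party library such as pinyin4j to convert Chinese characters to pinyin. You can of course implement the conversion yourself; in that case you need to define the character table and its pinyin in code, and pay special attention to polyphonic characters (characters with more than one reading).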
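
If you do roll your own conversion, one possible shape is a per-character table plus word-level overrides for polyphonic characters. The sketch below is only an illustration with a three-character table: 重 (zhong or chong), 庆 (qing), and 要 (yao), with an override so that 重庆 reads as chong'qing. The class name, the table contents, and the two-character override window are all my assumptions; a real table would be far larger.

```java
import java.util.HashMap;
import java.util.Map;

class TinyPinyinTable {
    // Per-character readings; the first entry is the default reading.
    private static final Map<Character, String[]> CHARS = new HashMap<>();
    // Word-level overrides that settle polyphonic characters by context.
    private static final Map<String, String> WORDS = new HashMap<>();

    static {
        CHARS.put('\u91cd', new String[] {"zhong", "chong"}); // polyphonic
        CHARS.put('\u5e86', new String[] {"qing"});
        CHARS.put('\u8981', new String[] {"yao"});
        WORDS.put("\u91cd\u5e86", "chong'qing");
    }

    static String convert(String text) {
        StringBuilder sb = new StringBuilder();
        boolean prevConverted = false;
        int i = 0;
        while (i < text.length()) {
            String piece = null;
            int used = 1;
            // Prefer a two-character word override when one applies.
            if (i + 2 <= text.length()) {
                piece = WORDS.get(text.substring(i, i + 2));
                if (piece != null) {
                    used = 2;
                }
            }
            // Otherwise fall back to the character's default reading.
            if (piece == null) {
                String[] readings = CHARS.get(text.charAt(i));
                if (readings != null) {
                    piece = readings[0];
                }
            }
            if (piece == null) {
                sb.append(text.charAt(i)); // unknown character: keep as-is
                prevConverted = false;
            } else {
                if (prevConverted) {
                    sb.append('\''); // separator between adjacent syllables
                }
                sb.append(piece);
                prevConverted = true;
            }
            i += used;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(convert("\u91cd\u5e86")); // chong'qing
        System.out.println(convert("\u91cd\u8981")); // zhong'yao
    }
}
```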

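Below is an example that uses pinyin4j. The library can convert both Simplified and Traditional Chinese to pinyin, and it supports formatted output with tones.
We add pinyin4j through Maven by putting the following in pom.xml: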

<dependencies>
    <dependency>
        <groupId>com.belerweb</groupId>
        <artifactId>pinyin4j</artifactId>
        <version>2.5.0</version>
    </dependency>
</dependencies>


Implementation class:

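import net.sourceforge.pinyin4j.PinyinHelper;
import net.sourceforge.pinyin4j.format.HanyuPinyinOutputFormat;
import net.sourceforge.pinyin4j.format.HanyuPinyinToneType;
import net.sourceforge.pinyin4j.format.exception.BadHanyuPinyinOutputFormatCombination;

public class Chinese {
    private final HanyuPinyinOutputFormat format;

    public Chinese() {
        format = new HanyuPinyinOutputFormat();
        format.setToneType(HanyuPinyinToneType.WITHOUT_TONE);
    }

    // Convert a single Chinese character
    public String getCharacterPinYin(char c) {
        // Local variable, so a failed call can never return a stale result
        String[] pinyin = null;
        try {
            pinyin = PinyinHelper.toHanyuPinyinStringArray(c, format);
        } catch (BadHanyuPinyinOutputFormatCombination e) {
            e.printStackTrace();
        }
        // If c is not a Chinese character, return null
        if (pinyin == null || pinyin.length == 0) {
            return null;
        }
        // For polyphonic characters, take the first reading
        return pinyin[0];
    }

    // Convert a whole string
    public String getStringPinYin(String str) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < str.length(); ++i) {
            String tmp = getCharacterPinYin(str.charAt(i));
            if (tmp == null) {
                // If str.charAt(i) is not a Chinese character, keep it as-is
                sb.append(str.charAt(i));
            } else {
                sb.append(tmp);
                // Segmentation: add a separator when the next character is also Chinese
                if (i < str.length() - 1 && getCharacterPinYin(str.charAt(i + 1)) != null) {
                    sb.append("'");
                }
            }
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        Chinese chinese = new Chinese();
        String str = "哈哈，我爱 Coding";
        String pinYin = chinese.getStringPinYin(str);
        System.out.println(pinYin);
    }
}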


Output:

ha'ha，wo'ai Coding


Whenever a contact is updated, use the method above to update the pinyin column in the database.

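For site search, the lexicon can come from sources such as the Sogou standard lexicon and the Sogou cell lexicons (细胞词库).
Sogou standard lexicon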

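A downloaded Sogou lexicon can be converted to a txt file, or to the standard format of another input method, with 深蓝词库转换器 (the Shenlan lexicon converter).
深蓝词库转换 2.0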

Once you have the txt file, process it into INSERT statements and load them into the database.
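
Here is a rough sketch of that txt-to-INSERT step. I am assuming each line looks like "<segmented pinyin> <word>" separated by a space, and that the target table is called lexicon with columns word and pinyin; check the actual converter output and your own schema before relying on either. The one detail worth copying is the quoting: segmented pinyin contains apostrophes, which must be doubled inside a SQL string literal.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class LexiconSqlWriter {
    // Doubles single quotes so the value is safe inside a SQL string literal.
    static String quote(String value) {
        return "'" + value.replace("'", "''") + "'";
    }

    // Assumed line format: "<segmented pinyin> <word>".
    // Returns null for blank or malformed lines.
    static String toInsert(String line) {
        String trimmed = line.trim();
        int space = trimmed.indexOf(' ');
        if (space < 0) {
            return null;
        }
        String pinyin = trimmed.substring(0, space);
        String word = trimmed.substring(space + 1).trim();
        return "INSERT INTO lexicon (word, pinyin) VALUES ("
                + quote(word) + ", " + quote(pinyin) + ");";
    }

    static List<String> toInserts(List<String> lines) {
        List<String> statements = new ArrayList<>();
        for (String line : lines) {
            String sql = toInsert(line);
            if (sql != null) {
                statements.add(sql);
            }
        }
        return statements;
    }

    public static void main(String[] args) {
        // "\u6211\u4eec" is the word for "we" (wo'men).
        List<String> lines = Arrays.asList("wo'men \u6211\u4eec", "", "malformed");
        for (String sql : toInserts(lines)) {
            System.out.println(sql);
        }
    }
}
```

If you load the rows from Java instead of generating a script, a PreparedStatement with bound parameters avoids the manual quoting altogether.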

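The final key step is to segment the pinyin the user types and then match it against the pinyin column in the database. The segmentation is done with a regular expression.

Segmentation implementation: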

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PinyinUtils {
    // Regular expression for one syllable: optional initial, optional medial, optional final
    public static final String regEx = "[^aoeiuv]?h?[iuv]?(ai|ei|ao|ou|er|ang?|eng?|ong|a|o|e|i|u|ng|n)?";
    // Compiled once instead of on every loop iteration
    private static final Pattern PATTERN = Pattern.compile(regEx);

    public static String split(String input) {
        StringBuilder sb = new StringBuilder();
        Matcher matcher = PATTERN.matcher(input);
        // Each find() resumes where the previous syllable ended
        while (matcher.find()) {
            String syllable = matcher.group();
            if (syllable.isEmpty()) {
                // The pattern can match the empty string at the end of the input
                continue;
            }
            if (sb.length() > 0) {
                sb.append("'");
            }
            sb.append(syllable);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String str = "koudingboke";
        System.out.println(PinyinUtils.split(str));
    }
}


Output:

kou'ding'bo'ke


Then match this output against the lexicon.
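
One way that lookup could work is a syllable-aligned prefix match: an entry matches when the segmented query equals its pinyin or is followed by a separator. This is my own sketch, not necessarily how the matching was done originally; the map stands in for the lexicon table, and the sample words are 方案 (fang'an) and 反感 (fan'gan), which are spelled the same once the separators are removed.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class LexiconLookup {
    // True when the query's syllables line up with the entry's leading syllables.
    static boolean syllablePrefix(String entryPinyin, String segmentedQuery) {
        return entryPinyin.equals(segmentedQuery)
                || entryPinyin.startsWith(segmentedQuery + "'");
    }

    // lexicon maps word -> segmented pinyin; returns matching words in insertion order.
    static List<String> lookup(Map<String, String> lexicon, String segmentedQuery) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> entry : lexicon.entrySet()) {
            if (syllablePrefix(entry.getValue(), segmentedQuery)) {
                hits.add(entry.getKey());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Map<String, String> lexicon = new LinkedHashMap<>();
        lexicon.put("\u65b9\u6848", "fang'an");
        lexicon.put("\u53cd\u611f", "fan'gan");
        // Only the first entry lines up with the query fang'an.
        System.out.println(lookup(lexicon, "fang'an").size()); // 1
    }
}
```

Note the flip side: the regex splitter turns "fangan" into "fang'an", so with this strict matching the second word can only be found by a query that segments as "fan'gan". Whether that trade-off is acceptable depends on your data.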

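The words returned by this approach are fairly static. If you need to rank results by how often they are searched, you can keep a query count for each word. I won't go into the implementation here.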
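
For anyone who wants a starting point, the counting idea might look roughly like this. It is only an in-memory sketch with names I made up; in practice the count would more likely live in a column next to each word.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class HitCounter {
    private final Map<String, Integer> counts = new HashMap<>();

    // Call this whenever a word is searched for or picked from the results.
    void record(String word) {
        counts.merge(word, 1, Integer::sum);
    }

    int count(String word) {
        return counts.getOrDefault(word, 0);
    }

    // Returns a sorted copy, highest count first.
    // List.sort is stable, so ties keep their original order.
    List<String> rank(List<String> results) {
        List<String> sorted = new ArrayList<>(results);
        sorted.sort((a, b) -> Integer.compare(count(b), count(a)));
        return sorted;
    }

    public static void main(String[] args) {
        HitCounter counter = new HitCounter();
        counter.record("b");
        counter.record("b");
        counter.record("a");
        System.out.println(counter.rank(Arrays.asList("a", "b", "c"))); // [b, a, c]
    }
}
```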

This is just my own limited experience; I'd welcome discussion with the experts here.

This article comes from the official Coding tech blog (blog.coding.net). If you repost it, please credit the source. Thanks.


A Few Lessons from My Work on Pinyin Search
