Chinaunix

Title: A UTF-8 problem

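Author: X-Bio    Time: 2009-07-01 18:27
Title: A UTF-8 problem
I have a question:
I have a file in UTF-8 format and want to remove every UTF-8-encoded line break from it. What regular expression should I write?
For example: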

Integrins play a central role in mediating lymphocyte
adhesion to a number of surfaces. LFA-1 interacts with ICAMs
1-3 that are typically expressed on other immune system cells.
ICAM-4 also interacts with LFA-1, and is known to be
expressed on telencepahlic neurons.<p><p>VCAM-1 regulates
lymphocyte adhesion to activated endothelial cells via Very

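They are really a single line, but because the text contains UTF-8 line breaks, Excel breaks it across lines.

I am using:
=~ m/\w+/g
which does filter out the line breaks, but isn't it too inefficient?

How can I remove all the line breaks with an s///g substitution?


Thanks, everyone.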
Author: cheese_lee    Time: 2009-07-01 18:38
First of all, the newline character in UTF-8 is compatible with the ASCII one: it is the same single byte.
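To illustrate the point, here is a minimal sketch using the core Encode module. The helper name utf8_hex is made up for this example and is not from the thread. It prints the UTF-8 bytes of LF and CR, with a non-ASCII character for contrast:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: return the UTF-8 encoding of a string as uppercase hex.
sub utf8_hex {
    my ($chars) = @_;
    return uc unpack('H*', encode('UTF-8', $chars));
}

print "LF: ", utf8_hex("\x{0A}"), "\n";        # LF: 0A
print "CR: ", utf8_hex("\x{0D}"), "\n";        # CR: 0D
print "e-acute: ", utf8_hex("\x{E9}"), "\n";   # e-acute: C3A9
```

LF and CR each encode to the same single byte they have in ASCII, so a plain \n or \r in a regex matches them whether or not the file is UTF-8.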
Author: X-Bio    Time: 2009-07-01 18:40
Originally posted by cheese_lee on 2009-7-1 18:38
First of all, the newline character in UTF-8 is compatible with the ASCII one


I already tried s/\n//g; but it did not work.
Author: cheese_lee    Time: 2009-07-01 18:43
I would like to see that file.

How did you process it?
perl -i~ -pe "s/\n//g" filename
Like this?

[ Last edited by cheese_lee on 2009-7-1 18:46 ]
Author: X-Bio    Time: 2009-07-01 18:48
Title: Reply to post #4 by cheese_lee
##################

<bp:Pathway rdf:ID="Regulation_of_Apoptosis">
    <bp:pathwayComponent rdf:resource="#Regulation_of_activated_PAK_2p34_by_proteasome_mediated_degradation1" />
    <bp:pathwayComponent rdf:resource="#Regulation_of_PAK_2p34_activity_by_PS_GAP_RHG10" />
    <bp:pathwayOrder rdf:resource="#Regulation_of_PAK_2p34_activity_by_PS_GAP_RHG10Step" />
    <bp:pathwayOrder rdf:resource="#Regulation_of_activated_PAK_2p34_by_proteasome_mediated_degradationStep" />
    <bp:organism rdf:resource="#Homo_sapiens" />
    <bp:name rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Regulation of Apoptosis</bp:name>
    <bp:displayName rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Regulation of Apoptosis</bp:displayName>
    <bp:xref rdf:resource="#Reactome169911" />
    <bp:xref rdf:resource="#REACT_13648.2" />
    <bp:comment rdf:datatype="http://www.w3.org/2001/XMLSchema#string">A regulated balance between cell survival and apoptosis is essential for normal&#xD;
development and homeostasis of multicellular organisms  (see Matsuzawa, 2001).  Defects in control of this balance may contribute  to autoimmune disease, neurodegeneration and cancer.  Protein ubiquitination and degradation is one of the major mechanisms that regulate apoptotic cell death (reviewed in Yang and Yu 2003).</bp:comment>
    <bp:xref rdf:resource="#Pubmed_11432772" />
    <bp:xref rdf:resource="#Pubmed_12724336" />
    <bp:xref rdf:resource="#regulation_of_apoptosis" />
    <bp:dataSource rdf:resource="#ReactomeDataSource" />
    <bp:comment rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Authored: Jakobi, R, 2008-02-05 11:04:14</bp:comment>
    <bp:comment rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Reviewed: Chang, E, 2008-05-21 00:05:41</bp:comment>
    <bp:comment rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Edited: Matthews, L, 2008-02-12 16:13:24</bp:comment>
    <bp:comment rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Edited: Matthews, L, 2008-06-12 00:23:53</bp:comment>
  </bp:Pathway>

####################
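It contains &#xD;, which I assume is the UTF-8 line break.

I used the DOM to get all the comment strings; the next step is to remove this kind of line break from them.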
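Author: X-Bio    Time: 2009-07-01 19:07
With control characters made visible I found a CR. So it was not a UTF-8 problem after all; the carriage return plus line feed was the culprit.

s/\r|\n//g;

This solves my problem. I had assumed all along that it was a UTF-8 encoding issue.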
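For anyone landing here later: &#xD; is the XML character reference for U+000D (carriage return), so after parsing the comment text contains "\r" followed by the literal "\n". Below is a minimal sketch of the fix as a reusable function. The names strip_line_breaks and join_lines are made up for this example, and the second variant (replacing the break with a space) is my own assumption about what may be wanted, not something from the posts above:

```perl
use strict;
use warnings;

# Same effect as s/\r|\n//g: delete every CR and LF character.
sub strip_line_breaks {
    my ($text) = @_;
    (my $out = $text) =~ tr/\r\n//d;
    return $out;
}

# Hypothetical variant: turn each run of CR/LF into one space so the
# words on either side of the break are not glued together.
sub join_lines {
    my ($text) = @_;
    (my $out = $text) =~ s/[\r\n]+/ /g;
    return $out;
}

# What a parser returns for "normal&#xD;<newline>development".
my $comment = "essential for normal\r\ndevelopment";

print strip_line_breaks($comment), "\n";   # essential for normaldevelopment
print join_lines($comment), "\n";          # essential for normal development
```

tr/\r\n//d only deletes single characters and avoids the regex alternation, which may help with the efficiency concern on large files; the space-inserting variant is worth considering when a break falls between two words, as in the sample comment.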



