
Thread starter: haoji


The Art of Unix Programming


#141, posted 2008-05-18 02:33

That 9% of putative inefficiency buys us a lot. It avoids putting an arbitrary limit on the range of the
numeric fields. It gives us the ability to modify the password file with any old text editor of our choice,
rather than having to build a specialized tool to edit a binary format (though in the case of the password file
itself, we have to be extra careful about concurrent edits). And it gives us the ability to do ad-hoc searches
and filters and reports on the user account information with text-stream tools such as grep(1).

The fact that structural information is conveyed by field position rather than an explicit tag makes this
format faster to read and write, but a bit rigid. If the set of properties associated with a key is expected to
change with any frequency, one of the tagged formats described below might be a better choice.
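
To make the positional-field point concrete, here is a minimal Python sketch that splits one password-file record into named fields. It assumes the conventional seven-field layout (name, password, UID, GID, GECOS, home directory, shell); the `PasswdEntry` type, the `parse_passwd_line` helper, and the sample line are illustrative and do not come from the book.

```python
from typing import NamedTuple


class PasswdEntry(NamedTuple):
    """One record; meaning is carried purely by field position."""
    name: str
    password: str
    uid: int
    gid: int
    gecos: str
    home: str
    shell: str


def parse_passwd_line(line: str) -> PasswdEntry:
    """Split one colon-separated passwd-style record into named fields."""
    fields = line.rstrip("\n").split(":")
    if len(fields) != 7:
        # The rigidity mentioned above: adding a property changes the field count.
        raise ValueError(f"expected 7 fields, got {len(fields)}")
    name, password, uid, gid, gecos, home, shell = fields
    # Python ints are unbounded, so the text format imposes no range limit here.
    return PasswdEntry(name, password, int(uid), int(gid), gecos, home, shell)


# Hypothetical sample record, made up for illustration.
sample = "games:*:12:100:games:/usr/games:/bin/false"
entry = parse_passwd_line(sample)
print(entry.name, entry.uid, entry.shell)
```

A tagged format would avoid the fixed field count at the price of repeating a field name in every record.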

Economy is not a major issue with password files to begin with, as they're normally read only once per user
session at login time and infrequently modified. Interoperability is not an issue, since various data in the
file (notably user and group numbers) are not portable off the originating machine. For password files, it's
therefore quite clear that going where the transparency criterion leads was the right thing.

Case study: .newsrc format

Usenet news is a worldwide distributed bulletin-board system that anticipated today's P2P networking by
two decades. It uses a message format very similar to that of RFC822 electronic-mail messages, except that
instead of personal recipients messages are sent to topic groups. Articles posted at any participating site are
broadcast to each site that it has registered as a neighbor, and eventually flood-fill to all news sites.

Almost all Usenet news readers understand the .newsrc file, which records which Usenet messages have
been seen by the calling user. Though it is named like a run-control file, it is not only read at startup but
typically updated at the end of the newsreader run. The .newsrc format has been fixed since the first
newsreaders around 1980. Example 5.2 is a representative section from a .newsrc file.

Example 5.2. A .newsrc example

rec.arts.sf.misc! 1-14774,14786,14789
rec.arts.sf.reviews! 1-2534
rec.arts.sf.written: 1-876513
news.answers! 1-199359,213516,215735
news.announce.newusers! 1-4399
news.newusers.questions! 1-645661
news.groups.questions! 1-32676
news.software.readers! 1-95504,137265,137268,137274,140059,140091,140117
alt.test! 1-1441498

Each line sets properties for the newsgroup named in the first field. The name is immediately followed by a
character which indicates whether the owning user has subscribed to the group or not; a colon indicates
subscription, and an exclamation mark indicates non-subscription. The remainder of the line is a sequence
of comma-separated article numbers or ranges of article numbers, indicating which articles the user has
seen.
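
The line layout just described can be parsed in a few lines. The following Python sketch is one possible reading of that description, not code from any actual newsreader; the function name `parse_newsrc_line` and the choice to represent a single article number as a one-element range are assumptions made for illustration.

```python
import re
from typing import List, Tuple

# Group name, then ':' (subscribed) or '!' (not subscribed), then the seen list.
LINE_RE = re.compile(r"^([^:!\s]+)([:!])\s*(.*)$")


def parse_newsrc_line(line: str) -> Tuple[str, bool, List[Tuple[int, int]]]:
    """Return (group, subscribed, [(first, last), ...]) for one .newsrc line."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a .newsrc line: {line!r}")
    group, flag, rest = match.groups()
    ranges = []
    for item in rest.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate an empty seen-list or a stray comma
        first, dash, last = item.partition("-")
        # A bare number such as 14786 is stored as the range (14786, 14786).
        ranges.append((int(first), int(last) if dash else int(first)))
    return group, flag == ":", ranges


print(parse_newsrc_line("rec.arts.sf.misc! 1-14774,14786,14789"))
```

Because the list of ranges is just a Python list, there is no built-in cap on how many ranges a group may carry, which is the property the text format is praised for in the discussion that follows.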

#142, posted 2008-05-18 02:33

Non-Unix programmers might have automatically tried to design a fast binary format in which each
newsgroup status was described by either a long but fixed-length binary record, or a sequence of self-
describing binary packets with internal length fields. The main point of such a binary representation would
be to express ranges with binary data in paired word-length fields, in order to avoid the overhead of parsing
all the range expressions at startup.

Such a layout could be read and written faster than a textual format, but it would have other problems. A
naive implementation in fixed-length records would have placed artificial length limits on newsgroup
names and (more seriously) on the maximum number of ranges of seen-article numbers. A more
sophisticated binary-packet format would avoid the length limits, but could not be edited with the user's
eyeballs and fingers ...

#143, posted 2008-05-18 02:34

[21] Confusingly, PNG supports a different kind of transparency ...

#144, posted 2008-05-18 02:35

Chapter 5. Textuality

Data file metaformats

A data file metaformat is a set of syntactic and lexical conventions that is either formally standardized or sufficiently
well established by practice that there are standard service libraries to handle marshalling and unmarshalling it.

Unix has evolved or adopted metaformats suitable for a wide range of applications. It is good practice to use one of
these (rather than an idiosyncratic custom format) wherever possible. The benefits begin with the amount of custom
parsing and generation code that you may be able to avoid writing by using a service library. But the most important
benefit is that developers and even many users will instantly recognize these formats and feel comfortable with
them, which reduces the friction costs of learning new programs.

In the following discussion, when we refer to "traditional Unix tools" ...

#145, posted 2008-05-18 02:35

... either of these names will turn up the relevant standards.)

In this metaformat, record attributes are stored one per line, named by tokens resembling mail header-field names
and terminated with a colon followed by whitespace. Field names do not contain whitespace; conventionally a dash
is substituted instead. The attribute value is the entire remainder of the line, exclusive of trailing whitespace and
newline. A physical line that begins with tab or whitespace is interpreted as a continuation of the current logical line.

A blank line may be interpreted either as a record terminator or as an indication that unstructured text follows.
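
The rules in the two paragraphs above (one attribute per line, a colon after the name, continuation lines that start with whitespace, a blank line ending the record) translate almost directly into code. This is a minimal Python sketch under those stated rules only; the name `parse_headers` is hypothetical, repeated field names simply overwrite earlier ones, and real mail or HTTP parsing has further requirements that are ignored here.

```python
from typing import Dict, Iterable, Optional


def parse_headers(lines: Iterable[str]) -> Dict[str, str]:
    """Parse 'Name: value' lines, stopping at the first blank line."""
    fields: Dict[str, str] = {}
    current: Optional[str] = None
    for raw in lines:
        line = raw.rstrip()  # drop trailing whitespace and the newline
        if not line:
            break  # blank line: record terminator or start of free text
        if raw[0] in " \t":
            # Physical line starting with whitespace continues the logical line.
            if current is None:
                raise ValueError("continuation line before any field")
            fields[current] += " " + line.strip()
            continue
        name, colon, value = line.partition(":")
        if not colon or not name or any(c.isspace() for c in name):
            raise ValueError(f"malformed field line: {line!r}")
        current = name
        fields[current] = value.strip()
    return fields


# Hypothetical input, made up for illustration.
sample = ["Subject: hello\n", "X-Note: first part\n", "\tsecond part\n", "\n", "body\n"]
print(parse_headers(sample))
```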

Under Unix, this is the traditional and preferred textual metaformat for attributed messages or anything that can be
closely analogized to electronic mail. Usenet news uses it; so do the HTTP 1.1 (and later) formats used by the World
Wide Web. It is very convenient for editing by humans. Traditional Unix search tools are still good for attribute
searches, though finding record boundaries will be a little more work than in a record-per-line format.

For examples of this format, look in your mailbox.

Fortune-cookie format

Fortune-cookie format is used by the fortune(1) program for its database of random quotes. It is appropriate for
records that are just bags of unstructured text. It simply uses % followed by newline (or sometimes %% followed by
newline) as a record separator. Example 5.3 is an example section from a file of email signature quotes:

Example 5.3. A fortune file example

"Among the many misdeeds of British rule in India, history will look
upon the Act depriving a whole nation of arms as the blackest."
        -- Mohandas Gandhi, "An Autobiography", pg 446
%
The people of the various provinces are strictly forbidden to have in their
possession any swords, short swords, bows, spears, firearms, or other types
of arms. The possession of unnecessary implements makes difficult the
collection of taxes and dues and tends to foment uprisings.
        -- Toyotomi Hideyoshi, dictator of Japan, August 1588
%
"One of the ordinary modes, by which tyrants accomplish their purposes
without resistance, is, by disarming the people, and making it an
offense to keep arms."
        -- Constitutional scholar and Supreme Court Justice Joseph Story, 1840

It is good practice to accept whitespace after % when looking for record delimiters. This helps cope with human
editing mistakes.
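
As a rough illustration of that advice, the Python sketch below splits fortune-style text into records and treats a line holding % or %% plus any trailing spaces or tabs as a delimiter. The function name `split_fortunes` and the decision to drop empty records are assumptions of this sketch, not behavior documented for fortune(1).

```python
import re
from typing import List

# A delimiter line is % or %%, optionally followed by stray spaces or tabs.
DELIM_RE = re.compile(r"^%%?[ \t]*$")


def split_fortunes(text: str) -> List[str]:
    """Split text into records separated by % (or %%) delimiter lines."""
    records: List[str] = []
    current: List[str] = []
    for line in text.splitlines():
        if DELIM_RE.match(line):
            if current:  # skip empty records, e.g. two delimiters in a row
                records.append("\n".join(current))
            current = []
        else:
            current.append(line)
    if current:  # the last record may lack a trailing delimiter
        records.append("\n".join(current))
    return records


# Hypothetical input; note the space after the first % sign.
sample = "first quote\n  -- A\n% \nsecond quote\n%%\nthird\n"
print(split_fortunes(sample))
```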

Fortune-cookie record separators combine well with the RFC-822 metaformat for records. If you need a textual
format that will support multiple records with a variable repertoire of explicit fieldnames, one of the least surprising
and human-friendliest ways to do it would look like Example 5.4.

Example 5.4. Three planets in an RFC822-like format

#146, posted 2008-05-18 02:36

Planet: Mercury
Orbital-Radius: 57,910,000 km
Diameter: 4,880 km
Mass: 3.30e23 kg
%
Planet: Venus
Orbital-Radius: 108,200,000 km
Diameter: 12,103.6 km
Mass: 4.869e24 kg
%
Planet: Earth
Orbital-Radius: 149,600,000 km
Diameter: 12,756.3 km
Mass: 5.972e24 kg
Moons: Luna

Of course, the record delimiter could be a blank line, but a line consisting of "%\n" is more explicit and less likely to
be introduced by accident during editing. In a format like this it is good practice to simply ignore blank lines.
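
A reader for records like those in Example 5.4 might look like the following Python sketch. It assumes one "Name: value" attribute per line, a lone % line between records, and no continuation lines; the helper name `parse_records` and the sample data layout are illustrative choices rather than anything prescribed by the text.

```python
from typing import Dict, List


def parse_records(text: str) -> List[Dict[str, str]]:
    """Parse %-delimited records of 'Name: value' lines into dictionaries."""
    records: List[Dict[str, str]] = []
    current: Dict[str, str] = {}
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # blank lines are simply ignored, as recommended above
        if stripped == "%":
            if current:
                records.append(current)
            current = {}
            continue
        name, colon, value = stripped.partition(":")
        if not colon:
            raise ValueError(f"expected 'Name: value', got {line!r}")
        current[name.strip()] = value.strip()
    if current:  # final record needs no trailing delimiter
        records.append(current)
    return records


sample = """Planet: Venus
Diameter: 12,103.6 km
%

Planet: Earth
Diameter: 12,756.3 km
Moons: Luna
"""
for record in parse_records(sample):
    print(record["Planet"], sorted(record))
```

Each record can carry a different set of field names (only Earth has Moons here), which is the "variable repertoire" the format is meant to support.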

XML

XML is well-suited for complex data formats (the sort of things that the old-school Unix tradition would use an RFC-
822-like stanza format for) though overkill for simpler ones. It is especially appropriate for formats that have a
complex nested or recursive structure of the sort that the RFC-822 metaformat does not handle well. For a good
introduction to the format, see XML In A Nutshell [Harold&Means].

XML has a very simple syntax resembling HTML's ...

#147, posted 2008-05-18 02:36

</filterarg>
<filterarg name="turn"
description="Image rotation"
format="-%value" type="list" default="auto">
<value name="auto" description="Automatic" />
<value name="noturn" description="None" />
<value name="turn" description="90 deg" />
</filterarg>
<filterarg name="scale"
description="Image scale"
format="-scale %value"
type="float" min="0.0" max="1.0" default="1.000" />
<filterarg name="dpi"
description="Image resolution"
format="-dpi %value"
type="int" min="72" max="1200" default="300" />
</filterargs>
<filterinput>
<filterarg name="file" format="%in" />
<filterarg name="pipe" format="" />
</filterinput>
<filteroutput>
<filterarg name="file" format="> %out" />
<filterarg name="pipe" format="" />
</filteroutput>
</kprintfilter>

One advantage of XML is that it is often possible to detect ill-formed, corrupted, or incorrectly generated data
through a syntax check, without knowing the semantics of the data.
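
A purely syntactic check of this kind needs no knowledge of what the elements mean. As a small sketch, the Python standard library's `xml.etree.ElementTree` raises `ParseError` on input that is not well-formed; the wrapper name `is_well_formed` and the two sample strings are hypothetical and only loosely modeled on the fragment above.

```python
import xml.etree.ElementTree as ET


def is_well_formed(document: str) -> bool:
    """Return True if the string parses as well-formed XML."""
    try:
        ET.fromstring(document)
    except ET.ParseError:
        return False
    return True


# Self-closing element: well-formed.
good = '<filterarg name="dpi" type="int" min="72" max="1200" default="300" />'
# Opening tag that is never closed: rejected without knowing what "dpi" means.
bad = '<filterarg name="dpi" type="int">'

print(is_well_formed(good), is_well_formed(bad))
```

Note that this only checks well-formedness; validating against a schema or DTD is a separate step that this sketch does not attempt.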

The most serious problem with XML is that it doesn't play well with traditional Unix tools. Software that wants to
read an XML format needs an XML parser; this means bulky, complicated programs, and may even restrict your
choice of language when you write programs that want to read or generate your format.

One application area where XML is clearly winning is in markup formats for document files (we'll have more to say
about this in Chapter 16 (Documentation)). Tagging in such documents tends to be relatively sparse among large
blocks of plain text; thus, traditional Unix tools still work fairly well for simple text searches and transformations.

One interesting bridge between these worlds is PYX format ...

#148, posted 2008-05-18 02:37

Many Microsoft Windows programs use a textual data format that looks like Example 5.6. This example associates
optional resources named 'account' ...

#149, posted 2008-05-18 02:38

backspace, \f for formfeed, \onn or \0nn for the octal character with value nn, \xnn for the hex character with
value nn, \\ for a literal backslash.
• In one-record-per-line formats, use colon as a field separator. This convention seems to have originated with
the Unix password file. If your fields must contain colons, use a backslash as the prefix to escape them.
• Do not allow the distinction between tab and whitespace to be significant. This is a recipe for serious
headaches when the tab settings on your users' editors are different; more generally, it's confusing to the eye.
Using tab as a field separator is especially likely to cause problems.
• Favor hex over octal. Hex-digit pairs and quads are easier to eyeball-map into bytes and words than octal
digits of three bits each; also marginally more efficient. This rule needs emphasizing because some older
Unix tools such as od(1) violate it; that's a legacy from the field sizes in PDP-11 machine language.
• For complex records, use a 'stanza' ...

#150, posted 2008-05-18 02:38

ranges).
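
The colon-separator and backslash-escape conventions from the list above can be sketched in Python as follows. This is an illustrative sketch only: it handles an escaped separator, an escaped backslash, and the \b and \f escapes mentioned above, plus \n and \t, which are assumed here from the usual C convention; octal and hex escapes are left out, and the names `split_fields` and `join_fields` are hypothetical.

```python
from typing import Iterable, List

# Single-character escapes handled by this sketch.
_DECODE = {"b": "\b", "f": "\f", "n": "\n", "t": "\t", "\\": "\\"}
_ENCODE = {char: "\\" + letter for letter, char in _DECODE.items()}


def split_fields(line: str, sep: str = ":") -> List[str]:
    """Split a record on unescaped separators, decoding backslash escapes."""
    fields: List[str] = []
    current: List[str] = []
    chars = iter(line)
    for ch in chars:
        if ch == "\\":
            nxt = next(chars, None)
            if nxt == sep:
                current.append(sep)  # escaped separator is literal data
            elif nxt in _DECODE:
                current.append(_DECODE[nxt])
            else:
                raise ValueError(f"unsupported escape: \\{nxt}")
        elif ch == sep:
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))
    return fields


def join_fields(fields: Iterable[str], sep: str = ":") -> str:
    """Inverse of split_fields: escape each field, then join with sep."""
    def escape(field: str) -> str:
        return "".join("\\" + c if c == sep else _ENCODE.get(c, c) for c in field)
    return sep.join(escape(field) for field in fields)


record = join_fields(["12:30", "a\\b", "two\nlines"])
print(record)
print(split_fields(record))
```

Escaping the newline on output keeps each record on a single physical line, so line-oriented tools still see one record per line.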

In Chapter 10 (Configuration) we will discuss a different set of conventions used for program run-control files.
