Posted on 2007-07-30 10:58

Abstract: Applications with many shared dependencies can suffer from a long startup time. This article provides a few pointers on how you can reduce the application startup time for ELF executables. It also explains the intricacies of where time is spent by the runtime linker when processing an application.
Introduction
Application startup time is measured from the moment a user clicks on an icon to launch an application, or executes it on the command line, to the time the application becomes visually available. Fast application startup is a desirable feature in a desktop environment. This article provides some insight on how you can lower your application startup time.
Applications come in two common flavors. They are generally shell scripts or ELF executables. Every application needs an interpreter to execute. Shell scripts begin with a line of the form:
#!pathname [arg]
where pathname is the full path to the interpreter. For ELF executables, the interpreter is the runtime linker ld.so.1(1).
(There are other kinds of applications such as Java programs. These kinds of applications are not discussed in this article.)
The term linker describes two kinds of linkers. The first kind of linker is the link-editor (also called linker or ld(1)), and you use it to link object files into shared objects or executables. The link-editor can either be directly executed by the user, or it can be invoked by the compiler. The second kind is the runtime linker (also called loader or ld.so.1(1)). The runtime linker plays an important role in the execution of the program.
This paper will focus on using the link-editor, runtime linker, and other tools and techniques to reduce startup time for ELF executables.
What happens when you execute an application?
Before delving into the techniques of reducing application startup time, let's examine how an application is loaded and processed when it is executed. When you execute an application on the command line, the shell (for example, /usr/bin/sh, /usr/bin/csh) calls one of the exec(2) family of system calls. The exec(2) system call does a number of things such as bookkeeping, setting up initial values for signal handlers, setting up the default locale, and more (see exec(2) for more details). It also loads the executable and locates the required interpreter. The kernel loads the correct interpreter into memory and passes control to this interpreter.
The actions carried out by the runtime linker (also known as ld.so.1(1)) can be categorized into the following steps.
1.    On getting control, the runtime linker first reads its configuration file (/var/ld/ld.config), if it exists. (Applications can have their own runtime linker configuration file, but typical programs don't.)
2.    The runtime linker then starts processing the executable by determining the dependencies of the executable.
You can find the libraries the executable depends on by examining the structures in the .dynamic section of the ELF executable. You can use the elfdump(1) command as follows to examine the dependencies:
          [pastwatch](28)> /usr/ccs/bin/elfdump -d /usr/lib/libX11.so | /usr/bin/grep NEEDED
          [0]  NEEDED          0x4cc0          libXext.so.0
          [1]  NEEDED          0x4ccd          libsocket.so.1
          [2]  NEEDED          0x4cdc          libnsl.so.1
          [3]  NEEDED          0x4ce8          libdl.so.1
          [4]  NEEDED          0x4cf3          libc.so.1
          [pastwatch](29)>
You can find a complete list of dependencies by using the ldd(1) command (/usr/bin/ldd myapplication).
3.    The runtime linker locates these libraries by using several rules. It first searches the path contained in the LD_LIBRARY_PATH environment variable. Then it searches in the RPATH value recorded in the ELF executable, and finally the default system library locations.
4.    The libraries are then loaded by ld.so.1(1).
5.    For each dependency loaded, the linker analyzes the dependency and loads additional dependencies for these libraries. ld.so.1(1) maintains a map of all the dependencies, and their associated symbols and other information, as a linked list (also known as link-maps).
6.    Once all the dependencies for the executable are loaded, the runtime linker updates the memory image of the executable and its dependencies to reflect the real addresses for data and function references. This is also known as relocation processing (see the section on relocation processing for more).
7.    The runtime linker then invokes any initialization functions and initialization sections specified for the shared objects. The .preinit_array, .init_array, and .init sections are processed to execute pre-initialization and initialization code. The .preinit_array and .init_array sections contain arrays of pointers to initialization functions. The Linkers and Libraries Guide on docs.sun.com has examples on writing your own init and fini functions.
8.    In the final step, the runtime linker passes control to the executable. The program entry point is a part of the ELF header and contains a pointer to the startup routine. For C programs, the startup routine may be "_start". For other languages or compilers, the start point may be _main, or main.
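The initialization step above can be observed from C. The following is a minimal sketch, assuming a GCC- or Clang-compatible compiler: __attribute__((constructor)) is an extension of those compilers, not something the article uses (the Sun compilers offer #pragma init for the same purpose). The names module_init and module_init_calls are hypothetical.

```c
/* Hypothetical counter, used only to show that the initializer ran. */
static int init_calls = 0;

/*
 * GCC/Clang extension: the compiler records this function among the
 * object's initialization functions, so the runtime linker calls it
 * before control reaches main().
 */
__attribute__((constructor))
static void module_init(void)
{
    init_calls++;
}

/* Returns how many times the initializer has run so far. */
int module_init_calls(void)
{
    return init_calls;
}
```

Any work done in such a function is paid for on every launch, which is why keeping initialization sections small helps load time.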
How does this apply to application startup time?
The application execution steps outlined above are carried out each time an application is executed. A considerable amount of startup time is spent performing symbolic relocations. Generally a lot more time is spent relocating symbols from dependency objects than relocating symbols from the executable itself. To gain noticeable reduction in startup time, you have to decrease the amount of relocation processing. The relocation processing section provides more details on how you can do this. A little tweaking on your part can optimize some of the steps carried out while executing an application. For example, having a lot of directories in the LD_LIBRARY_PATH environment variable will cause the runtime linker to search for dependencies in all of those directories. This can be time consuming, and is not recommended. Not having any .init sections may help in reducing the load time of the dependency, as there are no initialization functions to be executed.
When you want considerable improvement in startup time, consider the following suggestions or tips.
1.    Use dynamic object loading. You can load the minimal application and then use the dlopen(3DL) call to load all the additional libraries. For example, consider a text editor whose functionality is provided by a set of modules (one of which is syntax highlighting). The application can load the main window, and then dlopen the module that provides syntax highlighting. If your application is one big file, you can break it up into many libraries, and then load these libraries as needed. In this case, the same amount of time is required to load libraries, but you are distributing the time throughout the life cycle of the application.
2.    Use lazy loading of libraries. Using lazy loading defers the loading of the library. Thus the relocation of a library is delayed until the library is needed. Lazy loading is described in detail below.
3.    Reduce the number of relocations that the runtime linker performs. The section on Relocation Processing explains this in greater depth.
Loading libraries dynamically using dlopen
The Solaris Operating Environment provides a set of APIs collectively known as the Runtime Linker Programming Interface. Using the runtime linker programming interface, applications can load additional dependencies during the lifetime of the application. This is particularly useful when an application uses a dependency just once or in bursts.
A simple example to illustrate this is an application that runs for three hours and then displays the output in a window. The code to display the results in the window is not used until the calculations are done. So instead of suffering the penalty for loading the GUI library at the start of the application, you can load it only after performing the calculations. A similar type of functionality is provided with lazy loading, which is described below.
A successful dlopen(3DL) call locates and loads a shared object in the running application. Any dependencies of the shared object are located and loaded. The shared objects loaded are then relocated by the runtime linker and initialization sections are executed. You can locate symbols in the dlopened shared object using the dlsym(3DL) function call. You can then call functions and access data inside the shared object using these symbols. Then you can unload the dlopened shared object using the dlclose(3DL) function call. The memory occupied by the shared object is freed after it is unloaded.
dlopen(3DL), when not used with the mode RTLD_GLOBAL, creates a unique group that effectively reduces relocation costs. The group's symbol tables will not be looked at by other dlopen() groups.
One of the advantages of using the dlopen(3DL) call is that you can force an object to be unloaded after you use it, leading to better memory consumption. This is not possible with lazy loading. Refer to the Linkers and Libraries Guide for more information.
The downside to using dlopen(3DL) is that you can only access the object by using the dlsym(3DL) call. If you have many objects, or if the library is complex and has many entry points, this becomes cumbersome. The other major consideration is that this will introduce OS-specific code into the product. This problem goes away when you use lazy loading.
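The dlopen/dlsym/dlclose sequence described in this section might look like the sketch below. The helper name call_unary_from is hypothetical, and the library and symbol names passed to it are up to the caller; only dlopen, dlsym, dlclose, and dlerror come from the real interface.

```c
#include <dlfcn.h>
#include <stdio.h>

/*
 * Load `libname` on demand, look up `symname` as a double(double)
 * function, call it with x, store the result in *out, and unload the
 * library again. Returns 0 on success, -1 if the library or the symbol
 * cannot be found.
 */
int call_unary_from(const char *libname, const char *symname,
                    double x, double *out)
{
    /* RTLD_LOCAL keeps the object's symbols in its own group. */
    void *handle = dlopen(libname, RTLD_LAZY | RTLD_LOCAL);
    if (handle == NULL) {
        fprintf(stderr, "dlopen: %s\n", dlerror());
        return -1;
    }

    double (*fn)(double);
    /* Common idiom for converting the void * returned by dlsym(). */
    *(void **)(&fn) = dlsym(handle, symname);
    if (fn == NULL) {
        fprintf(stderr, "dlsym: %s\n", dlerror());
        dlclose(handle);
        return -1;
    }

    *out = fn(x);
    dlclose(handle);    /* the object may be unmapped after this call */
    return 0;
}
```

A caller could, for instance, invoke call_unary_from("libm.so.6", "cos", 0.0, &r) on a glibc-based Linux system; the library file name is platform-specific, and some systems need -ldl at link time. The sketch also shows the cost noted above: every entry point has to be fetched by name through dlsym.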
Lazy Loading
Lazy loading postpones the loading of libraries until the libraries are actually used. When a library is lazy loaded, it is not loaded into memory until a relocation can be satisfied by the library. This generally happens when an application calls a function in the library. If you don't reference a function from that library, it never gets loaded. Data relocations defeat the purpose of lazy loading as they need to be done immediately as part of object initialization. In the worst case, the application will take the same amount of time that it takes for normal loading. Because the libraries here are loaded on demand, you do not pay a penalty for a library you never use.
Let's assume that you are writing an editor that can read gzipped files along with normal text files, and that the functionality for reading gzipped files is provided in a library libreadgzip.so. If you use the default style of linking, libreadgzip.so gets loaded and relocated every time you execute the application. When using lazy loading, it doesn't get loaded until you actually try to read a gzipped file. Thus for sessions where you are not reading gzipped files, this module never gets loaded.
A majority of applications can benefit from lazy loading. Lazy loading does not yield any noticeable improvement in application startup time when the library is a core part of the application, or for small applications (for example, your hello-world program) that mainly depend on /usr/lib/libc.so.
You can mark libraries to be lazy loaded by using the -z lazyload flag of the link-editor. An application that lazy-loads libraries has a .SUNW_syminfo section. This section maps symbols to the object in which they were defined at link time. This mapping is used to lazy load the object when the runtime linker does a symbol lookup.
[pastwatch](37)> cc -o lazyload_ex driver.c -z lazyload -lfoo -z nolazyload -lbar
The above example lazy loads libfoo.so, but not libbar.so. You can find the libraries that are lazy loaded for an application by using the elfdump command. For the example shown below, libfoo.so and libc.so are lazy loaded.
[pastwatch](47)>/usr/ccs/bin/elfdump -d foo | head
Dynamic Section:  .dynamic
      index  tag             value
      [0]  POSFLAG_1       0x1             [ LAZY ]
      [1]  NEEDED          0x1d1          libfoo.so
      [2]  POSFLAG_1       0x1             [ LAZY ]
      [3]  NEEDED          0x1db          libc.so.1
      [4]  INIT          0x10910
      [5]  FINI          0x10994
      [6]  HASH          0x1016c
[pastwatch](48)>
You can combine several lazy loaded libraries into a group by using the -z groupperm flag of the link-editor. This option instructs the link-editor to group these libraries together into a unique group. The symbol tables of these libraries are available only to members of the group. Because fewer objects will be searched for symbol matches, there is a gain in performance.
Lazy loading is not recommended for poorly written applications that depend on the loading of libraries in a specific pattern, or those which assume that the .init sections are executed only at application startup. Lazy loading these applications may have unexpected results. Even though these kinds of applications are rare, the linker is conservative in not making lazy loading the default mode of loading libraries in the Solaris Operating Environment.
You should also take care that the lazy loaded library is actually available on disk when it gets referenced, because the runtime linker reports an error only when it actually tries to load the unavailable library (which may be well into the execution of the application and not at startup). This can also be considered an advantage, as it allows a product to ship without all of its dependencies.
Relocation Processing
Because libraries and other ELF objects can be loaded into any part of the system memory, access to data is offset from a base value. These offsets (also known as relocations) need to be converted at application startup to actual memory locations.
The list of relocations (also known as relocation records) is stored in the .rel[name] and .rela[name] sections of the ELF object and is processed and interpreted by the runtime linker during application startup. This process is known as relocation processing.
Relocations can be categorized into two basic types: symbolic and non-symbolic.
A symbolic relocation is a relocation that requires a lookup in the symbol table. The runtime linker optimizes symbol lookup by caching successive duplicate symbols. These cached relocations are called "cached symbolic" relocations, and are faster than plain symbolic relocations.
A non-symbolic relocation is a simple relative relocation that requires only the base address at which the object is mapped to perform the relocation. Non-symbolic relocations do not require a lookup in the symbol table.
Before we describe relocation processing further, it is important to know about the procedure linkage table. The procedure linkage table holds addresses of all functions accessed by the object. Each entry (known as a plt entry) in the table corresponds to a function, and is stored in the .plt section of the ELF file. Initial function calls are routed through the plt by calling the plt entry corresponding to the function. The plt entry then invokes the runtime linker, which does the actual binding. If the function is defined by a library that is lazy loaded, the runtime linker will load the library. When the library is loaded, the plt entry is rewritten to call the actual function. This process allows the linker to defer resolving function references until they are actually used, and is known as lazy binding.
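Lazy binding can be pictured with a toy model in plain C. This is only an illustration of the idea, not how ld.so.1 is implemented: a function pointer stands in for the plt entry, and all names here (plt_square, square_stub, real_square) are hypothetical.

```c
/* Toy model of one plt entry; the real mechanism lives in ld.so.1. */
typedef int (*func_t)(int);

static int resolver_runs = 0;   /* how often the "runtime linker" ran */

static int real_square(int x) { return x * x; }

static int square_stub(int x);  /* forward declaration of the stub */

/* The slot starts out pointing at the stub, like an unbound plt entry. */
static func_t plt_square = square_stub;

/* The first call lands here: bind the slot, then call the real function. */
static int square_stub(int x)
{
    resolver_runs++;            /* stands in for the symbol lookup */
    plt_square = real_square;   /* rewrite the slot */
    return plt_square(x);
}

/* Callers always go through the slot. */
int call_square(int x) { return plt_square(x); }

int resolver_run_count(void) { return resolver_runs; }
```

Calling call_square twice enters the stub only once; the second call goes straight to real_square, which is why function references cost nothing at startup and only a one-time lookup on first use.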
Why are relocations expensive?
Relocation processing must be performed before the application starts running. During relocation processing, the runtime linker examines all the relocations and rewrites the process image with the relocated addresses. During application startup, the runtime linker creates a link map for each object that it loads (refer to the "What happens when you execute an application?" section). These link maps are maintained as a linked list. For symbolic relocations, the runtime linker needs to traverse the whole link-map list, looking in each object's symbol table to find the required symbol definition. This is a time-consuming process, because there can be many link maps containing many symbols. The process of looking up symbol values needs to be done at startup only for symbolic relocations that reference data. Symbolic entries from the .plt section are not relocated at startup because they are relocated on demand (see lazy binding, above).
Non-symbolic relocations do not require a lookup and thus are not expensive and do not significantly affect the application startup time.
Because relocation processing can be the most expensive operation during application startup, it is desirable to have fewer symbols that need to be relocated.
Finding the number of relocations
You can find the number of relocations that the linker will perform by using the following command:
[pastwatch] (4)> /usr/ccs/bin/elfdump -r /usr/lib/libX11.so \
                           | /usr/bin/grep -v NONE \
                           | /usr/bin/grep -c R_
1931

To find the number of non-symbolic relocations, you can use:
[pastwatch] /tmp (6)> /usr/ccs/bin/elfdump -r /usr/lib/libX11.so \
                           | /usr/bin/grep -c RELATIVE
1265
The number of symbolic relocations is calculated by subtracting the number of non-symbolic relocations from the total number of relocations. For libX11.so above, that is 1931 - 1265 = 666. This number also includes the relocations in the procedure linkage table.
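The same counting can be done in one pass over the elfdump -r output. The sketch below mirrors the two grep pipelines above; it is not the Perl script from the Appendix, and the names reloc_tally, tally_relocs, and symbolic_relocs are hypothetical. It only assumes that each relocation is printed on its own line with its R_ type name.

```c
#include <string.h>

/* Counts gathered from `elfdump -r` style text. */
struct reloc_tally {
    long total;     /* lines containing "R_" but not "NONE" */
    long relative;  /* lines containing "RELATIVE" */
};

/* Returns nonzero if needle occurs within the first len bytes of line. */
static int line_has(const char *line, size_t len, const char *needle)
{
    size_t n = strlen(needle);
    for (size_t i = 0; i + n <= len; i++)
        if (memcmp(line + i, needle, n) == 0)
            return 1;
    return 0;
}

/* Scan the text line by line, applying the same filters as the greps. */
struct reloc_tally tally_relocs(const char *text)
{
    struct reloc_tally t = {0, 0};
    while (*text != '\0') {
        const char *nl = strchr(text, '\n');
        size_t len = nl ? (size_t)(nl - text) : strlen(text);
        if (line_has(text, len, "R_") && !line_has(text, len, "NONE"))
            t.total++;
        if (line_has(text, len, "RELATIVE"))
            t.relative++;
        text += len + (nl ? 1 : 0);
    }
    return t;
}

/* Symbolic relocations = total relocations - non-symbolic relocations. */
long symbolic_relocs(struct reloc_tally t)
{
    return t.total - t.relative;
}
```

Feeding it the captured output of elfdump -r for an object should reproduce the two counts shown above, with symbolic_relocs giving their difference.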
You can use the Perl script in the Appendix to determine the number of relocations for ELF objects. This script was used to determine the number of relocations for a typical X11 binary, /usr/X/bin/xterm.
The results are tabulated in this table:
Relocation type                 Count
Non-symbolic relocations        12,356
PLT symbolic relocations         3,444
Other symbolic relocations       2,232
Total relocations               18,032

Reducing the number of relocations
One way of reducing the relocations is to have fewer symbols visible outside your application or library. You should declare locally used functions and global data private to the application/library. For C programs, you can do this by declaring the function with the static storage class. This reduces the scope of the function to local and the symbol will not appear in the dynamic symbol table (.dynsym). You can use the nm(1) command to examine the symbol table of an object file. You will observe below that local_function has a Bind value of LOCL and is a local function.
[pastwatch] linkers > cat foo.c
static void local_function()
{
}
void global_function()
{
}
[pastwatch] linkers > /opt/SUNWspro/bin/cc -c foo.c
[pastwatch] linkers > /usr/ccs/bin/nm  foo.o
foo.o:
[Index] Value  Size Type  Bind  Other Shndx  Name
[2]    |    0|    0|OBJT |LOCL |0 |3    |Bbss.bss
[3]    |    0|    0|OBJT |LOCL |0 |4    |Ddata.data
[4]    |    0|    0|OBJT |LOCL |0 |5    |Drodata.rodata
[1]    |    0|    0|FILE |LOCL |0 |ABS |foo.c
[6]    |    56| 20|FUNC |GLOB |0 |2    |global_function
[5]    |    16| 20|FUNC |LOCL |0 |2    |local_function
Using mapfile to reduce symbol scope
You can also use the mapfile option to control the scope of functions and symbols. The link-editor uses a mapfile pointed to by the -M option. For example, /usr/ccs/bin/ld -M /tmp/Mapfile instructs the link-editor to use the mapfile called /tmp/Mapfile. The following example is taken from the Linkers and Libraries Guide:
$ cat mapfile
lib.so.1.1
{
      global:
            foo;
      local:
            *;
};
$ cc -o lib.so.1 -M mapfile -G foo.c bar.c
$ nm -x lib.so.1 | egrep "foo$|bar$|str$"
[30] |0x00000370|0x00000028|FUNC |LOCL |0x0  |6  |bar
[31] |0x00010428|0x00000004|OBJT |LOCL |0x0  |12 |str
[35] |0x00000348|0x00000028|FUNC |GLOB |0x0  |6  |foo
This mapfile makes all symbols other than foo local to the object, and hence foo is the only one of them with an entry in the dynamic symbol table.
-Bdirect
The -Bdirect flag of the link-editor was introduced in Solaris 8, and can be used to reduce relocation costs. You can use it with single or multiple libraries. When you use the -Bdirect flag, the link-editor records, for each symbolic relocation, the dependency (library) that defines the symbol. Because the runtime linker knows the object where the symbol is provided, it will search only the symbol table of that object. Because the symbol table of an individual object is always smaller than all the symbol tables combined, symbol lookups are much faster. You can also pre-load objects with a shared library built with the -Bdirect flag.
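The difference between the default lookup and a direct binding can be sketched with a toy link-map list. This is a deliberately simplified model (real symbol tables are hashed, and the structure below is not the runtime linker's own); every name in it is hypothetical, and string comparisons serve only as a rough proxy for lookup cost.

```c
#include <stddef.h>
#include <string.h>

/* Toy link-map entry: one loaded object and the symbols it defines. */
struct link_map_entry {
    const char *name;
    const char *const *symbols;          /* NULL-terminated list */
    const struct link_map_entry *next;
};

static long comparisons = 0;             /* cost proxy */

static int object_defines(const struct link_map_entry *obj, const char *sym)
{
    for (const char *const *s = obj->symbols; *s != NULL; s++) {
        comparisons++;
        if (strcmp(*s, sym) == 0)
            return 1;
    }
    return 0;
}

/* Default lookup: walk the whole list until some object defines sym. */
const struct link_map_entry *
lookup_global(const struct link_map_entry *head, const char *sym)
{
    for (; head != NULL; head = head->next)
        if (object_defines(head, sym))
            return head;
    return NULL;
}

/* Direct-binding style: search only the object recorded at link time. */
const struct link_map_entry *
lookup_direct(const struct link_map_entry *recorded, const char *sym)
{
    return object_defines(recorded, sym) ? recorded : NULL;
}

long comparison_count(void) { return comparisons; }
void reset_comparisons(void) { comparisons = 0; }
```

With three objects on the list and the symbol defined in the last one, lookup_global has to look through every preceding symbol table, while lookup_direct touches only the recorded object.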
-Bsymbolic
You may be tempted to save some relocation processing by building your shared object with the -Bsymbolic flag of the link-editor. Although some savings are possible, the risk of using -Bsymbolic outweighs the performance gain you achieve. When you use the -Bsymbolic flag, the link-editor first searches within the object itself for symbol addresses and pre-binds them at link time. At first glance, this seems like it will speed up object relocation because some of the symbols were already pre-bound in the link stage. But the -Bsymbolic flag is not recommended for general usage with shared libraries because it ensures that symbols within the object have greater scope than global symbols. This is non-intuitive, and has the side effect of not letting you pre-load functions.
For example, if you have your own implementation of malloc, and you build it with the -Bsymbolic flag, you cannot pre-load (interpose) another malloc such as libwatchmalloc.so for debugging purposes. In some cases, it may inhibit the use of C++ exceptions across library boundaries. The Linkers and Libraries Guide also warns you about using this flag. However, it is interesting to note that for an executable, -Bsymbolic is the default.
The combreloc flag
You can use the -z combreloc flag of the link-editor to instruct the link-editor to combine all relocation sections into one section called .SUNW_reloc. Because all relocations are now in one section, the runtime linker can quickly find the number of relocations to be performed. It does not need to go through the .rel[a] sections to find all relocations. The second advantage is that you can process relocations with the benefits of the symbol-caching mechanism of the runtime linker.
Even though the -z combreloc flag is beneficial to most programs, it is not the default. This can be attributed to the linker being conservative and the fact that the Generic Application Binary Interface (gABI) defines each section as requiring a relocation table. Using the -z combreloc flag means you're taking a slight deviation from the standard. It is very rare that the use of the -z combreloc flag breaks a program. In fact, we have never found a program that breaks just by using the -z combreloc flag.
Reducing relocations using position-independent code (PIC)
PIC stands for position-independent code, and is generally the kind of code used for shared objects. When the compiler generates PIC, it generates a table called the GOT or global offset table. The global offset table is stored in the .got section of the ELF object file. Data items are referenced indirectly through the GOT. You can generate PIC by using the -KPIC (to generate 32-bit addresses) or the -Kpic (to generate 13-bit addresses) options of the Forte Developer Compiler.
The major advantage of using PIC is that memory pages are sharable between processes. In an ELF object that contains PIC, all code is stored in the .text segment. The .data segment holds all the data used by the PIC object, and also holds the GOT. Because the runtime linker does not need to do any relocation processing on the code of a PIC object (it only relocates the GOT), .text segments are memory-mapped read-only. The .data sections are mapped read-write into memory. Because pages that are read-only can be shared between processes, and are only loaded on demand, this leads to better sharing of system memory.
For example, if a PIC shared library libfoo.so.1 is already loaded into memory, and another process that also uses libfoo.so.1 is executed, that process uses the already loaded .text segment of libfoo.so.1. The .data segments are private to each process.
Using PIC has the nice side effect of reducing the number of relocations in your shared object. The following example illustrates the gain.
Example: Consider the sample code fragment below. Assume that foo and gVal are defined in some other library (which implies that they are defined in the symbol table of the other library):
void bar() {
      gVal = 0;
      foo();
      gVal = 1;
      foo();
}
For normal types of code, the runtime linker should look up the value for the symbols gVal (three times: once to set the high-order bits of the register that will be used to access gVal and twice to assign values to gVal), and foo (two times) from the symbol table. The procedure linkage table will only be relocated when foo is first called. It is thus safe to ignore the cost of two relocations associated with relocating foo. Thus you'll need three symbolic relocations during startup. The runtime linker will then have to rewrite the executable image at three places in memory to represent the actual addresses. The function in memory will then look like:
void bar(){
      mov 0x98765AB, 0;
      call plt[1]; /* plt is procedure linkage table */
      mov 0x98765AB, 1;
      call plt[1];
}
If we used PIC, the code for function bar will look something like:
void bar() {
      mov *got[1], 0; /* Move 0 into address pointed to by got[1] */
      call plt[1]; /* plt is procedure linkage table */
      mov *got[1], 1;
      call plt[1];
}
Because the runtime linker relocates only the GOT for ELF objects with PIC, it will only rewrite the addresses once (got[1]). You are thus saving two relocation costs by using PIC in this small example. Typically shared libraries have more data relocations, and using PIC reduces the relocation costs.
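The saving can be modelled in ordinary C. The sketch below is a toy: a pointer array stands in for the GOT and relocate_got stands in for the runtime linker. All names other than bar, foo, and got are hypothetical, and shared_value plays the role of gVal.

```c
/* Toy model of GOT indirection; layout and names are illustrative only. */
static int shared_value;          /* stands in for gVal in another library */
static int foo_calls = 0;

static void foo(void) { foo_calls++; }

/* One GOT slot for gVal: the only location that has to be patched. */
static int *got[2];
static int got_writes = 0;

/* Stand-in for relocation processing: one store fixes every reference. */
void relocate_got(void)
{
    got[1] = &shared_value;
    got_writes++;
}

/* Equivalent of bar() compiled as PIC: both stores go through got[1]. */
void bar(void)
{
    *got[1] = 0;
    foo();
    *got[1] = 1;
    foo();
}

int current_value(void) { return shared_value; }
int foo_call_count(void) { return foo_calls; }
int got_write_count(void) { return got_writes; }
```

However many times bar refers to the variable, relocate_got writes a single slot, whereas the non-PIC version would need the address patched at each reference in the code.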
Shared libraries that contain objects that are not PIC are forced to use read-write pages for the text segment. This is considered bad, and you should avoid it at all costs. Shared libraries that are not PIC tend to have more relocations as they do not have a PLT. (The link-editor currently does not generate a PLT for shared objects with text relocations.) Thus each function call incurs a relocation cost. You can check whether a library has text relocations by using dump(1) and inspecting the output for a TEXTREL entry. For example:
dump -Lv libfoo.so.1 | grep TEXTREL
One way to prevent text relocations in shared objects is to use the -z text flag of the link-editor. This flag causes a fatal error while linking if any text relocations remain.
However, it should be noted that using PIC introduces a level of indirection while accessing data and functions. This may cause a slowdown in runtime performance in certain cases. But the combined effect of the gains from sharable memory pages and reduced relocation processing favors the use of PIC.
crle(1)
crle(1) (pronounced "curly") is a new tool introduced in the Solaris 8 Operating Environment. crle(1) is especially useful, because it provides the functionality to perform many of the optimizations we mentioned above. crle(1) has the ability to create runtime linker configuration files that allow the runtime linker to maintain default search paths, a directory cache, and alternate objects. You can find more information on the crle(1) man page (docs.sun.com).
Alternate objects are pre-relocated objects. You can instruct crle(1) to generate alternate objects for an ELF executable and all its dependencies. crle(1) creates alternate objects by actually loading the application (in a similar way that ldd(1) does), and then dumping the image using dldump(3DL). The dldump(3DL) call creates a new dynamic object from a loaded object belonging to the current application. Because the application was loaded into memory, and then dumped, all relocations use the base address at which the shared object was loaded.
You can also use crle(1) to pre-relocate shared libraries. These pre-relocated shared libraries can then be shared among multiple processes.
crle(1) provides you with fine-grained control over which relocations are carried out. When you dump the application using the RTLD_REL_ALL flag, no relocations remain in the dumped object. You can use the RTLD_REL_RELATIVE flag to pre-relocate only relative (non-symbolic) relocations. You can use other flags too, such as RTLD_MEMORY, to dump the alternate object using the memory image of the loaded object. This allows data modified in the loaded object to be captured in the alternate object. You can find a complete list of flags and their descriptions in the dldump(3DL) man page.
crle(1) does not process relocations in objects that cannot be dumped (for example, in filter libraries such as /usr/lib/libdl.so). These are generally few and inexpensive.
Because most of the relocations are already carried out when using alternate objects, the runtime linker does not need to relocate many symbols. This greatly reduces the relocation costs associated with application startup.
Because crle(1) processes the application to load at a fixed address, and the cache is specific to the platform it was crle'd on, it may not be possible to distribute crle'd applications.
The following steps illustrate how crle(1) (see docs.sun.com) was used with Mozilla (available from http://www.mozilla.org) to reduce the number of relocations, and thus reduce startup time.
1.    Ensure that all dependencies of the mozilla binary (mozilla-bin) are resolved. ldd mozilla-bin must not return "not found" for any of its dependencies.
2.    Execute the crle(1) command as shown below. Assume the mozilla distribution is installed in /export/home1/mozilla and we want to create the pre-relocated binary at /export/home1/mozilla-crle.
          crle -c \
          /export/home1/mozilla-crle/ld.config.mozilla-bin \
          -f RTLD_REL_ALL \
          -l /usr/lib:/opt/gnome/lib:/export/home1/mozilla \
          -G /export/home1/mozilla/mozilla-bin
/export/home1/mozilla-crle now contains mozilla-bin and its dependencies with many of the relocations already applied to them. The following table lists the number of relocations for the original mozilla-bin and the pre-relocated mozilla-bin and all of its dependencies.
Binary                                  Symbolic   Non-symbolic   Total
mozilla-bin                             9781       16745          26526
mozilla-bin (processed with crle(1))    135        0              135

The above table illustrates that, except for relocations in filter libraries (1% overall), all the other relocations (99% overall) of the application and its dependencies were computed and cached.
The reduction in application startup time is shown below:
Application                             Real(s)   User(s)   Sys(s)
mozilla-bin                             0.97      0.79      0.06
mozilla-bin (processed with crle(1))    0.72      0.57      0.09

The above experiment was carried out on an Ultra 60 (two UltraSPARC processors at 360 MHz and 1024MB of RAM) with the DISPLAY variable turned off.
It is clear that crle(1) helps reduce relocation costs by caching the relocations of the executable and its dependencies. crle(1) is particularly useful for applications with lots of dependencies and thousands of relocations. Normal applications will have minimal reduction in application startup time because they typically do not have many relocations to be carried out.
Other things to note
1.    Maximize the text segment and minimize the data segment. Refer to the Linkers and Libraries Guide for detailed information on doing this.
2.    Make sure all non-mutating strings are declared using the const keyword. Constant strings can then be stored in the read-only text segment.
3.    Use the strings(1) command on the object to find out whether it has duplicate strings. Eliminating duplicate strings will decrease the size of the symbol table, leading to faster lookups.
4.    Some compilers place non-mutating strings in the read-only text segment. The -xstrconst flag of the Forte Developer C compiler places all string literals into the read-only data section of the text segment. This option also combines duplicate strings into one string. The Forte Developer C++ compiler supports constant strings when you use the -features=conststrings flag, but it does not combine duplicate strings. The appendix contains an example of finding duplicate strings and using the -xstrconst flag to eliminate them.
5.    Make sure all your libraries are built with the -R/path/to/library flag. This narrows the search for each library, and means you do not have to rely on LD_LIBRARY_PATH.
6.    Make sure LD_LIBRARY_PATH is not set, and make sure that your application does not rely on LD_LIBRARY_PATH being set. The runtime linker searches every directory listed in the LD_LIBRARY_PATH environment variable when it locates a library, so each extra directory adds to the lookup cost.
7.    Forte Analyzer has an option to dump a mapfile after profiling the application. This mapfile will contain information that will help you create an executable with a smaller working set size or more effective instruction cache behavior, or both. You can use this mapfile to re-link your application.
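Item 6 in the list above is easy to check before an application is launched. The following is a minimal sketch of such a check; the function name check_ld_library_path and its messages are assumptions made for this illustration, not part of any Solaris tool.

```shell
#!/bin/sh
# Hypothetical pre-launch check: report whether LD_LIBRARY_PATH is set,
# and if so how many directories it adds to the library search.
check_ld_library_path() {
    if [ -n "${LD_LIBRARY_PATH:-}" ]; then
        # Count the non-empty, colon-separated entries.
        count=$(printf '%s\n' "$LD_LIBRARY_PATH" | tr ':' '\n' | grep -c .)
        echo "warning: LD_LIBRARY_PATH adds $count directories to the library search"
        return 1
    fi
    echo "ok: LD_LIBRARY_PATH is not set"
    return 0
}

# Report the state of the current environment; never abort the caller.
check_ld_library_path || true
```

A launcher script could call this function and unset LD_LIBRARY_PATH before starting the application, provided the application's libraries were built with -R as described in item 5.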
Conclusion
Even though application startup time is a small percentage of the overall application execution time, it plays a big role in how the user perceives the performance of the application. Getting significant improvements in startup time will often involve restructuring the application, and this may not be possible or desirable in many cases. Some of the ideas in this article can help you to reduce application startup time.
During application development, you should consider using mapfiles, and the -Bdirect and -z combreloc options of the link-editor. They help reduce the startup time of applications with no changes to the source code. We also highly recommend that you use position-independent code (PIC) to build your shared libraries.
Lazy loading is very useful in reducing the application startup time, and does not suffer the drawbacks of dynamically loading objects using dlopen(3DL). You can use lazy loading with minimal changes in the source code. However, because data references across libraries cause shared libraries to be loaded immediately, they should be avoided.
crle(1) is a very useful tool for reducing application startup time. You should carefully consider the overall effects of using crle(1) before using it.
The language that the application is written in also plays an important part in the startup time. C++ shared objects are known to have expensive .init functions, and a lot more relocations than shared objects written in other languages. Also, because C++ uses name mangling, it is difficult to use mapfiles to control the scope of the symbols. Although the methods described in this paper are valid for all languages, they may have little effect because of the language implementation. You can use link-editor flags, mapfiles, static declarations for functions that are not exported, PIC code, and crle(1) to offset some of the relocation costs, but you still have to deal with the costly .init functions.
This article also provides a brief introductory description of process execution and the runtime linker. It is important to know how the runtime linker processes the application during startup. This knowledge will help you minimize startup costs during application development.
References
1.    Making C++ Ready for the Desktop
Even though the issues and problems in Bastian's paper are somewhat well known in the community, the paper started a series of detailed email discussions between developers of the GNU project. You can find the complete discussion here.
Bastian's paper has a number of suggestions on reducing relocation costs, including pre-linking shared libraries so that no relocation processing needs to be carried out at startup time. This will require having a registry which allots memory addresses to the shared libraries. This approach comes with its own set of problems, such as libraries tending to outgrow their space, or changes in the mapping between different vendors, or conflicts between libraries for memory. It is interesting to note that every section header in an ELF object contains a field called sh_addr that is just there for the purpose of pre-linking. However, it is not used on most UNIX systems.
Some of the pre-linking functionality described in Bastian's paper is already provided by the crle(1) command on Solaris (although in a different way -- there is no registry).
2.    Linkers and Libraries Guide
The Linkers and Libraries Guide is Sun's recommended book for information about linkers and libraries. If you are concerned about application performance on the Solaris Operating Environment, you should read the Linkers and Libraries Guide. The following sections in the Linkers and Libraries Guide are especially useful.
1.    Link-Editor Quick Reference.
2.    Shared Objects - Performance Considerations.
3.    Runtime Linker - Relocation Processing.
4.    Runtime Linker - Runtime Linking Programming Interface.
5.    Lazy Loading of Dynamic Dependencies.
6.    Dynamic Linking and the Runtime Linker.
3.    Proceedings of the USENIX Summer 1993 Technical Conference, 1993
This paper from Sun's Michael Nelson and Graham Hamilton from the '93 USENIX conference contains useful information about how the Spring OS reduces relocation costs by pre-linking the shared libraries.
[I also want to thank Mike Walker and Rod Evans from the linkers group at Sun Microsystems. This article would not have been possible without their help. The email discussions I had with them served as the starting point for this article.]
Footnotes
1.    ldd(1) differs from examining the .dynamic section of the ELF object in that it lists the complete set of dependencies required, including any dependencies that these dependencies have, and so on.
2.    "Dependencies" and "libraries" are used interchangeably in this article.
3.    For a complete set of relocations, please refer to /usr/include/sys/elf_SPARC.h.
4.    You can find out if a library is non-dumpable by examining the .dynamic section for the NODUMP flag (/usr/ccs/bin/elfdump -dv libfoo.so | grep NODUMP).
5.    A popular browser with millions of lines of C++ code.
Appendix
1.    This example shows how to find duplicate strings and how to use the -xstrconst flag to eliminate them.

      $ cat str_const.c
      #include <stdio.h>

      int main(void) {
            printf("Hello World\n");
            printf("Hello World\n");
            printf("Hello World\n");
            printf("Hello World\n");
            printf("Hello World\n");
            return 0;
      }
      $ cc -o str_const str_const.c
      $ strings str_const | sort | uniq -c | sort -rn
         5 Hello World
      $ cc -o str_const -xstrconst str_const.c
      $ strings str_const | sort | uniq -c | sort -rn
         1 Hello World

2.    This script finds the number of relocations of an executable. It uses ldd(1) to find the dependencies of the executable, and counts their relocations as well. It does not calculate the number of cached symbolic relocations. For a script that calculates cached symbolic relocations, please see the next item.

      #!/bin/bash

      ## Print out the number of relocations the runtime linker
      ## will carry out for the executable specified in $1

      declare -i totalrel relrel pltrel symrel
      ELFDUMP="/usr/ccs/bin/elfdump -r"
      GREP="/usr/bin/grep"
      USAGE="usage: elfinfo filename"
      if [ $# -ne 1 ]
      then
               echo "$USAGE"
               exit 1
      fi

      # First count the relocations of the executable, then
      # those of all its dependencies

      totalrel=`$ELFDUMP "$1" | $GREP -v NONE | $GREP -c R_`
      relrel=`$ELFDUMP "$1" | $GREP -c RELATIVE`
      pltrel=`$ELFDUMP "$1" | $GREP -c JMP_SLOT`

      # ldd prints "name => /path" pairs; only the words that
      # start with "/" are library paths
      for j in `/usr/bin/ldd "$1"`; do
               IS_VALID_LIB=`echo "$j" | awk '/^\// {print "yes";}'`
               if [ "$IS_VALID_LIB" = "yes" ]; then
                        # declare -i makes these assignments arithmetic
                        totalrel=$totalrel+`$ELFDUMP "$j" | $GREP -v NONE | $GREP -c R_`
                        relrel=$relrel+`$ELFDUMP "$j" | $GREP -c RELATIVE`
                        pltrel=$pltrel+`$ELFDUMP "$j" | $GREP -c JMP_SLOT`
               fi
      done
      symrel=$totalrel-$relrel-$pltrel

      echo "Summary of relocations for" "$1"
      echo "======================="

      echo "Relative Relocations    " $relrel
      echo "PLT relocations(symbolic)  " $pltrel
      echo "Other Symbolic relocations " $symrel
      echo "                         -------"
      echo "Total                   " $totalrel

3.    This Perl script has similar functionality to the shell script above, but it also calculates cached symbolic relocations.

      #!/usr/bin/perl
      #
      # Copyright (c) 2000 by Sun Microsystems, Inc.
      # All rights reserved.
      #
      # ident "@(#)elfinfo.pl 1.2 00/07/26 SMI"
      #
      # This script lists the relocations and strings
      # of an ELF object.

      require 'getopts.pl';

      $doAll = 0;

      $Gsymrel = 0;
      $Gsymcrel = 0;
      $Gpltrel = 0;
      $Grelarel = 0;

      $Glocobj = 0;
      $Glocfunc = 0;
      $Glocsect = 0;
      $Glocnoty = 0;

      $Gglobobj = 0;
      $Gglobfunc = 0;
      $Gglobnoty = 0;

      $Gglobundefobj = 0;
      $Gglobundeffunc = 0;
      $Gglobundefnoty = 0;

      $Gsymcnt = 0;

      $Gglobsymcnt = 0;
      $Gminsymlen = 0;
      $Gmaxsymlen = 0;
      $Gtotsymlen = 0;


      sub print_reloctable {
          local($file, $symrel, $symcrel, $pltrel, $relarel) = @_;

          print("\n$file Relocations:\n");
          printf(" Symbolic Cached-Symbolic Non-symbolic Total\n");
          printf(" Start-up = %4d %4d %4d %4d\n",
              $symrel, $symcrel, $relarel, $symrel + $symcrel + $relarel);
          printf(" Plt = %4d %4d\n",
              $pltrel, $pltrel);
          printf(" %4d\n",
              $symrel + $symcrel + $relarel + $pltrel);
      }

      sub getRelocs {
          local($file) = @_;
          local($symrel, $symcrel, $pltrel, $relarel, $psym);

          $symrel = 0;
          $symcrel = 0;
          $pltrel = 0;
          $relarel = 0;

          open(reloc_list, "elfdump -r $file |");

          $psym = "";
          while (<reloc_list>) {
              chop;
              $reltype = $_;
              $sym = $_;
              # First word is the relocation type, last word the symbol
              $reltype =~ s/^\s*(\w+)\s+.*/$1/o;
              $sym =~ s/.*\s+(\w+)$/$1/o;
              if ((!($reltype =~ /^R_/)) ||
                  ($reltype =~ /NONE$/)) {
                  next;
              }
              if ($reltype =~ /RELATIVE$/) {
                  $relarel++;
              } elsif (($reltype =~ /JMP_SLOT/) ||
                  ($reltype =~ /IPLTLSB/)) {
                  $pltrel++;
              } else {
                  # A relocation against the same symbol as the
                  # previous one is counted as cached
                  if ($psym ne $sym) {
                      $psym = $sym;
                      $symrel++;
                  } else {
                      $symcrel++;
                  }
              }
          }
          close(reloc_list);

          print_reloctable($file, $symrel, $symcrel, $pltrel,
              $relarel);

          $Gsymrel += $symrel;
          $Gsymcrel += $symcrel;
          $Gpltrel += $pltrel;
          $Grelarel += $relarel;
      }

      sub print_symboltable {
          local($file, $locobj, $locfunc, $locnoty, $locsect,
              $globobj, $globfunc, $globnoty,
              $globundefobj, $globundeffunc, $globundefnoty,
              $minsymlen, $maxsymlen, $totsymlen,
              $symcnt, $globsymcnt) = @_;

          printf("\n$file Symbols:\n");
          printf(" OBJT FUNC SECT NOTY TOTAL\n");
          printf(" Locals %5d %5d %5d %5d %6d\n",
              $locobj, $locfunc, $locsect, $locnoty,
              $locobj + $locfunc + $locsect + $locnoty);
          printf(" GlobDef %5d %5d %5d %6d\n",
              $globobj, $globfunc, $globnoty,
              $globobj + $globfunc + $globnoty);
          printf(" GlobUDef %5d %5d %5d %6d\n",
              $globundefobj, $globundeffunc, $globundefnoty,
              $globundefobj + $globundeffunc + $globundefnoty);
          printf(" %6d\n",
              $symcnt);
          printf(" MinSymbolName Length: %4d characters\n",
              $minsymlen);
          printf(" MaxSymbolName Length: %4d characters\n",
              $maxsymlen);
          printf(" AvgSymbolName Length: %4d characters\n",
              $totsymlen / $globsymcnt);
      }

      sub getSymbols {
          local($file) = @_;
          local($burn, $type, $bind, $shndx, $sname);
          local($locobj, $locfunc, $locnoty, $locsect);
          local($globobj, $globfunc, $globnoty);
          local($globundefobj, $globundeffunc, $globundefnoty);
          local($minsymlen, $maxsymlen, $totsymlen, $slen);
          local($symcnt, $globsymcnt);

          $first_symtab = 1;

          $locobj = 0;
          $locfunc = 0;
          $locnoty = 0;
          $locsect = 0;
          $globobj = 0;
          $globfunc = 0;
          $globnoty = 0;
          $globundefobj = 0;
          $globundeffunc = 0;
          $globundefnoty = 0;
          $minsymlen = 0;
          $maxsymlen = 0;
          $totsymlen = 0;
          $symcnt = 0;
          $globsymcnt = 0;


          open(symbols, "elfdump -s $file |");
          while (<symbols>) {
              if (/^Symbol Table:/) {
                  if ($first_symtab) {
                      $first_symtab = 0;
                      next;
                  }
                  #
                  # Elfdump will display both the .dynsym &
                  # .symtab. We only gather information on the
                  # first one listed.
                  #
                  last;
              }
              if (! /^[ \t]+\[[0-9]+\]/) {
                  next;
              }
              chop;
              ($burn, $burn, $burn, $burn, $type, $bind,
                  $burn, $shndx, $sname) =
                  split(/[ \t]+/, $_, 9);

              $symcnt++;

              if ($bind =~ /LOCL/) {
                  if ($type =~ /OBJT/) {
                      $locobj++;
                  } elsif ($type =~ /FUNC/) {
                      $locfunc++;
                  } elsif ($type =~ /SECT/) {
                      $locsect++;
                  } else {
                      $locnoty++;
                  }
              } else { # GLOB & WEAK together
                  $slen = length($sname);
                  if (($minsymlen == 0) || ($minsymlen > $slen)) {
                      $minsymlen = $slen
                  }
                  if ($maxsymlen < ...

      # [Lines 262-312 of the original listing are missing from this
      # copy. The listing resumes where the global symbol-length
      # statistics are updated.]

                  ... > $minsymlen)) {
              $Gminsymlen = $minsymlen;
          }
          if ($Gmaxsymlen < ...

      # [Lines 315-328 of the original listing are missing from this
      # copy. The listing resumes inside the loop that prints the
      # strings of the object.]

                  ... ) {
              print $_;
          }
          close(string_list);
      }


      if ((Getopts('rsy') == 0) || ($#ARGV < 0)) {
          print("Usage: elfinfo [-r] [-s] [-y] filename ...\n");
          print("\t-r\tdisplay relocation information\n");
          print("\t-s\tdisplay string information\n");
          print("\t-y\tdisplay Symbol Table information\n");
          exit 1;
      }


      if (!$opt_r && !$opt_s && !$opt_y) {
          $doAll = 1;
      }


      $cnt = 0;

      while ($cnt <= $#ARGV) {
          $file = $ARGV[$cnt];
          $cnt++;

          if (! -f $file) {
              printf("$file does not exist\n");
              next;
          }

          $file_type = `file $file`;
          if (!($file_type =~ /ELF/)) {
              print("$file not an ELF file - skipping\n");
              print($file_type);
              next;
          }

          if ($doAll || $opt_r) {
              getRelocs($file);
          }

          if ($doAll || $opt_y) {
              getSymbols($file);
          }

          if ($doAll || $opt_s) {
              getStrings($file);
          }
      }


      # No summary is needed for a single object, or when only
      # strings were requested
      if (($#ARGV == 0) || (!$doAll && !$opt_r && !$opt_y)) {
          exit 0;
      }

      printf("\n\n======================================\n");
      printf("Summary Information:\n\n");

      printf("Number of objects examined = %d\n\n", $#ARGV + 1);

      if ($doAll || $opt_r) {
          print_reloctable("Summary", $Gsymrel, $Gsymcrel,
              $Gpltrel, $Grelarel);
      }

      if ($doAll || $opt_y) {
          print_symboltable("Summary", $Glocobj, $Glocfunc,
              $Glocnoty, $Glocsect,
              $Gglobobj, $Gglobfunc, $Gglobnoty,
              $Gglobundefobj, $Gglobundeffunc, $Gglobundefnoty,
              $Gminsymlen, $Gmaxsymlen, $Gtotsymlen,
              $Gsymcnt, $Gglobsymcnt);

      }

      printf("==========================================\n");

      exit 0;

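The classification rule used by the scripts above can be tried out without an ELF object at hand. The following is a minimal sketch that applies the same rule to canned input with awk. The function name classify_relocs and the sample records are made up for this illustration, and the sample only imitates the shape of relocation output (type first, symbol last); it is not real elfdump output.

```shell
#!/bin/sh
# Classify relocation records read from standard input:
#   *_RELATIVE -> non-symbolic
#   *_JMP_SLOT -> PLT
#   other      -> symbolic, or cached when the symbol repeats
#                 the one seen on the previous symbolic record
classify_relocs() {
    awk '
        $1 !~ /^R_/ || $1 ~ /NONE$/ { next }        # skip headers and R_*_NONE
        $1 ~ /RELATIVE$/            { rel++; next }
        $1 ~ /JMP_SLOT/             { plt++; next }
        {
            if ($NF == prev) cached++
            else { sym++; prev = $NF }
        }
        END {
            printf "symbolic=%d cached=%d plt=%d relative=%d\n",
                sym, cached, plt, rel
        }
    '
}

# Made-up sample records, for illustration only.
classify_relocs <<'EOF'
type offset addend section symbol
R_SPARC_RELATIVE 0x1000 0 .rela.data
R_SPARC_32 0x1004 0 .rela.data foo
R_SPARC_32 0x1008 0 .rela.data foo
R_SPARC_32 0x100c 0 .rela.data bar
R_SPARC_JMP_SLOT 0x2000 0 .rela.plt printf
R_SPARC_NONE 0x0 0 .rela.data
EOF
# prints: symbolic=2 cached=1 plt=1 relative=1
```

The second relocation against foo is counted as cached because it immediately follows another relocation against the same symbol, which is the test the Perl script makes with $psym.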
About the Author
Neelakanth Nadgir is a software engineer in Sun's Market Development Engineering organization. He works with tool vendors to develop "best of breed" applications on Sun systems. He also volunteers for the GNU project. In his spare time, he likes to go hiking in Big Basin State Park.

This article comes from the ChinaUnix blog. To view the original, see: http://blog.chinaunix.net/u/30686/showart_349832.html
