SPARC: -Kpic and -KPIC Options
For SPARC binaries, a subtle difference between the -K pic option and an alternative -K PIC option affects references to global offset table entries. See "Global Offset Table (Processor-Specific)".
The global offset table is an array of pointers. The size of each entry is constant: 4 bytes for 32-bit objects and 8 bytes for 64-bit objects. The code sequence that references an entry under -K pic is similar to the following:
ld    [%l7 + j], %o0    ! load &j into %o0
Here, %l7 holds the precomputed value of the symbol _GLOBAL_OFFSET_TABLE_ of the object making the reference.
This code sequence provides a 13-bit displacement constant for the global offset table entry, and thus provides for 2048 unique entries for 32-bit objects, and 1024 unique entries for 64-bit objects. If an object is built that requires more than the available number of entries, the link-editor produces a fatal error:
$ cc -Kpic -G -o libfoo.so.1 a.o b.o ... z.o
ld: fatal: too many symbols require `small' PIC references:
        have 2050, maximum 2048 -- recompile some modules -K PIC.
To overcome this error condition, compile some or all of the input relocatable objects with the -K PIC option. This option provides a 32-bit constant for the global offset table entry:
sethi %hi(j), %g1
or    %g1, %lo(j), %g1    ! get 32-bit constant GOT offset
ld    [%l7 + %g1], %o0    ! load &j into %o0
You can investigate the global offset table requirements of an object using elfdump(1) with the -G option. You can also examine the processing of these entries during a link-edit using the link-editor's debugging tokens -D got,detail.
Ideally, any frequently accessed data items will benefit from using the -K pic model. You can reference a single entry using both models. However, determining which relocatable objects should be compiled with either option can be time consuming, and the performance improvement realized is often small. Recompiling all relocatable objects with the -K PIC option is typically easier.
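The entry limits quoted earlier in this section follow directly from the width of the displacement field. A 13-bit signed displacement spans 8192 bytes, and dividing that span by the entry size gives the number of addressable entries. The following sketch only restates that arithmetic. The name got_small_pic_entries() is a hypothetical helper, not part of any compiler or link-editor interface.

```c
#include <stddef.h>

/* Width of the signed immediate displacement used by the -K pic load. */
#define SIMM13_BITS     13

/*
 * Number of global offset table entries that a single 13-bit
 * displacement can address, given the size of one entry in bytes.
 * got_small_pic_entries() is a hypothetical helper name.
 */
size_t
got_small_pic_entries(size_t entry_size)
{
        size_t reachable_bytes = (size_t)1 << SIMM13_BITS;  /* 8192 bytes */

        return (reachable_bytes / entry_size);
}
```

Calling the function with an entry size of 4 corresponds to the 32-bit case, and an entry size of 8 corresponds to the 64-bit case.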
Maximizing Shareability
As mentioned in "Underlying System", only a shared object's text segment is shared by all processes that use it. The object's data segment typically is not shared. Each process that uses a shared object usually generates a private memory copy of its entire data segment as data items within the segment are written to. Reduce the data segment, either by moving data elements that are never written into the text segment, or by removing the data items completely.
The following sections describe several mechanisms that can be used to reduce the size of the data segment.
Move Read-Only Data to Text
Data elements that are read-only should be moved into the text segment using const declarations. For example, the following character string will reside in the .data section, which is part of the writable data segment:
char * rdstr = "this is a read-only string";
In contrast, the following character string will reside in the .rodata section, which is the read-only data section contained within the text segment:
const char * rdstr = "this is a read-only string";
Reducing the data segment by moving read-only elements into the text segment is admirable. However, moving data elements that require relocations can be counterproductive. For example, examine the following array of strings:
char * rdstrs[] = {
        "this is a read-only string",
        "this is another read-only string"
};
A better definition might seem to be:
const char * const rdstrs[] = { ..... };
This definition ensures that the strings and the array of pointers to these strings are placed in a .rodata section. Unfortunately, although the user perceives the array of addresses as read-only, these addresses must be relocated at runtime. This definition therefore results in the creation of text relocations. Representing it as:
const char * rdstrs[] = { ..... };
ensures that the array of pointers is maintained in the writable data segment, where the pointers can be relocated. The strings themselves are maintained in the read-only text segment.
Note - Some compilers, when generating position-independent code, can detect read-only assignments that will result in runtime relocations. These compilers will arrange for placing such items in writable segments (for example, .picdata).
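One way to sidestep the trade-off described above is to store the strings inline in a fixed-width character array instead of an array of pointers. This alternative is not taken from the text and is offered only as a sketch. Because such a table contains no addresses, nothing in it needs to be relocated, so the whole table can be declared const. RDSTR_MAX and rdstr_get() are hypothetical names chosen for illustration.

```c
#include <stddef.h>

#define RDSTR_MAX       40      /* hypothetical: longest string plus terminator */

/*
 * The strings are stored inline, so the table holds characters
 * rather than pointers and contains no addresses to relocate.
 */
static const char rdstrs[][RDSTR_MAX] = {
        "this is a read-only string",
        "this is another read-only string"
};

/* Return string n, or NULL if n is out of range. */
const char *
rdstr_get(size_t n)
{
        if (n >= sizeof (rdstrs) / sizeof (rdstrs[0]))
                return (NULL);
        return (rdstrs[n]);
}
```

The cost of this layout is padding: every row occupies RDSTR_MAX bytes regardless of the string's length, so it suits tables whose strings are of similar size.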
Collapse Multiply-Defined Data
Data can be reduced by collapsing multiply-defined data. A program with multiple occurrences of the same error message can be improved by defining one global datum and having all other instances reference it. For example:
const char * Errmsg = "prog: error encountered: %d";

foo() {
        ......
        (void) fprintf(stderr, Errmsg, error);
        ......
The main candidates for this sort of data reduction are strings. String usage in a shared object can be investigated using strings(1). The following example will generate a sorted list of the data strings within the file libfoo.so.1. Each entry in the list is prefixed with the number of occurrences of the string.
$ strings -10 libfoo.so.1 | sort | uniq -c | sort -rn
Use Automatic Variables
Permanent storage for data items can be removed entirely if the associated functionality can be designed to use automatic (stack) variables. Any removal of permanent storage will usually result in a corresponding reduction in the number of runtime relocations required.
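As a minimal sketch of this idea, the following function formats a message in a scratch buffer that is an automatic variable rather than a static one. The function name report_status() and the buffer size are assumptions made for illustration.

```c
#include <stdio.h>

/*
 * Write "name: code" to fp. The scratch buffer is an automatic
 * variable, so it occupies stack space only while the call is
 * active and adds nothing to the object's data segment.
 * report_status() is a hypothetical function.
 */
int
report_status(FILE *fp, const char *name, int code)
{
        char    line[128];      /* automatic, not "static char line[128]" */
        int     len;

        len = snprintf(line, sizeof (line), "%s: %d\n", name, code);
        if (len < 0 || (size_t)len >= sizeof (line))
                return (-1);    /* encoding error or truncated output */
        return (fputs(line, fp) == EOF ? -1 : 0);
}
```

A side benefit of this form is that the function keeps no state between calls, so concurrent callers do not share the buffer.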
Allocate Buffers Dynamically
Large data buffers should usually be allocated dynamically rather than being defined using permanent storage. Often this will result in an overall saving in memory, as only those buffers needed by the present invocation of an application will be allocated. Dynamic allocation also provides greater flexibility by enabling the buffer's size to change without affecting compatibility.
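The following sketch illustrates one possible shape for this approach. The only permanent storage is a pointer and a size, and the buffer itself is allocated or grown on demand. The names workbuf and workbuf_get() are hypothetical, and the sketch assumes a single-threaded caller.

```c
#include <stdlib.h>

static char     *workbuf;       /* permanent storage: one pointer */
static size_t   workbuf_size;   /* current size of the buffer in bytes */

/*
 * Return a buffer of at least size bytes, allocating or growing it
 * on demand. Returns NULL if the allocation fails, in which case any
 * existing buffer is left untouched. workbuf_get() is hypothetical
 * and is not safe for concurrent use.
 */
char *
workbuf_get(size_t size)
{
        char    *p;

        if (size <= workbuf_size)
                return (workbuf);
        if ((p = realloc(workbuf, size)) == NULL)
                return (NULL);
        workbuf = p;
        workbuf_size = size;
        return (workbuf);
}
```

An invocation that never calls workbuf_get() pays for no buffer at all, and the buffer's size can change in a later release without altering any symbol size visible to callers.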
Minimizing Paging Activity
Any process that accesses a new page will cause a page fault, which is an expensive operation. Because shared objects can be used by many processes, any reduction in the number of page faults generated by accessing a shared object will benefit the process and the system as a whole.
Organizing frequently used routines and their data to an adjacent set of pages will frequently improve performance because it improves the locality of reference. When a process calls one of these functions, the function might already be in memory because of its proximity to the other frequently used functions. Similarly, grouping interrelated functions will improve locality of references. For example, if every call to the function foo() results in a call to the function bar(), place these functions on the same page. Tools like cflow(1), tcov(1), prof(1) and gprof(1) are useful in determining code coverage and profiling.
Isolate related functionality to its own shared object. The standard C library has historically been built containing many unrelated functions. Only rarely, for example, will any single executable use everything in this library. Because the library is so widely used, determining which functions are really the most frequently used is also somewhat difficult. In contrast, when designing a shared object from scratch, maintain only related functions within the shared object. This will improve locality of reference and has the side effect of reducing the object's overall size.
Relocations
"Relocation Processing" covered the mechanisms by which the runtime linker relocates dynamic executables and shared objects to create a runnable process. "Symbol Lookup" and "When Relocations Are Performed" categorized this relocation processing into two areas to simplify and help illustrate the mechanisms involved. These same two categorizations are also ideally suited for considering the performance impact of relocations.
Symbol Lookup
When the runtime linker needs to look up a symbol, by default it does so by searching in each object. The runtime linker starts with the dynamic executable, and progresses through each shared object in the same order that the objects are loaded. In many instances, the shared object that requires a symbolic relocation will turn out to be the provider of the symbol definition.
In this situation, if the symbol used for this relocation is not required as part of the shared object's interface, then this symbol is a strong candidate for conversion to a static or automatic variable. Symbol reduction can also be applied to remove symbols from a shared object's interface. See "Reducing Symbol Scope" for more details. By making these conversions, you cause the link-editor to incur the expense of processing any symbolic relocation against these symbols during the shared object's creation.
The only global data items that should be visible from a shared object are those that contribute to its user interface. Historically this has been a hard goal to accomplish, because global data are often defined to allow reference from two or more functions located in different source files. By applying symbol reduction, unnecessary global symbols can be removed. See "Reducing Symbol Scope". Any reduction in the number of global symbols exported from a shared object will result in lower relocation costs and an overall performance improvement.
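A minimal sketch of the static conversion follows. The names call_count and lib_next_id() are hypothetical. The data item is used by only one source file, so declaring it static keeps it out of the shared object's interface, and only the function remains a global symbol.

```c
/*
 * Declared without "static", call_count would be a global symbol
 * that is visible outside the shared object. With "static", the
 * symbol is local to this file.
 */
static int      call_count;

/* The only symbol in this sketch intended to be part of the interface. */
int
lib_next_id(void)
{
        return (++call_count);
}
```

When the data item must be shared between several source files of the same object, the static qualifier is not an option, which is the case that symbol reduction addresses.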
The use of direct bindings can also significantly reduce the symbol lookup overhead within a dynamic process that has many symbolic relocations and many dependencies. See "Direct Binding".
When Relocations Are Performed
All immediate reference relocations must be carried out during process initialization before the application gains control. However, any lazy reference relocations can be deferred until the first instance of a function being called. Immediate relocations typically result from data references. Therefore, reducing the number of data references also reduces the runtime initialization of a process.
Initialization relocation costs can also be deferred by converting data references into function references. For example, you can return data items by a functional interface. This conversion usually results in a perceived performance improvement because the initialization relocation costs are effectively spread throughout the process's execution. Some of the functional interfaces might never be called by a particular invocation of a process, thus removing their relocation overhead altogether.
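The conversion from a data reference to a function reference can be sketched as follows. Instead of exporting a data item directly, the shared object keeps the item private and exports a function that returns its values. The names lib_limits and lib_limit() are hypothetical and the values are arbitrary.

```c
/* Data item kept private to the shared object (hypothetical name). */
static const int lib_limits[] = { 16, 256, 4096 };

/*
 * Functional interface to the data. Callers reference lib_limit()
 * instead of the array, so they make a function reference rather
 * than a data reference. Returns -1 if level is out of range.
 */
int
lib_limit(unsigned int level)
{
        if (level >= sizeof (lib_limits) / sizeof (lib_limits[0]))
                return (-1);
        return (lib_limits[level]);
}
```

A caller that never invokes lib_limit() never triggers the binding of that function, whereas a direct reference to an exported array would be an immediate reference.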
The advantage of using a functional interface can be seen in the section, "Copy Relocations". This section examines a special, and somewhat expensive, relocation mechanism employed between dynamic executables and shared objects. It also provides an example of how this relocation overhead can be avoided.
Combined Relocation Sections
Relocations by default are grouped by the sections against which they are to be applied. However, when an object is built with the -z combreloc option, all but the procedure linkage table relocations are placed into a single common section named .SUNW_reloc. See "Procedure Linkage Table (Processor-Specific)".
Combining relocation records in this manner enables all RELATIVE relocations to be grouped together. All symbolic relocations are sorted by symbol name. The grouping of RELATIVE relocations permits optimized runtime processing using the DT_RELACOUNT/DT_RELCOUNT .dynamic entries. Sorted symbolic entries help reduce runtime symbol lookup.
Copy Relocations
Shared objects are usually built with position-independent code. References to external data items from code of this type employ indirect addressing through a set of tables. See "Position-Independent Code" for more details. These tables are updated at runtime with the real address of the data items. These updated tables enable access to the data without the code itself being modified.
Dynamic executables, however, are generally not created from position-independent code. Any references to external data they make can seemingly only be achieved at runtime by modifying the code that makes the reference. Modifying a read-only text segment is to be avoided. The copy relocation technique resolves such references without modifying the text segment.
Suppose the link-editor is used to create a dynamic executable, and a reference to a data item is found to reside in one of the dependent shared objects. Space is allocated in the dynamic executable's .bss, equivalent in size to the data item found in the shared object. This space is also assigned the same symbolic name as defined in the shared object. Along with this data allocation, the link-editor generates a special copy relocation record that will instruct the runtime linker to copy the data from the shared object to this allocated space within the dynamic executable.
Because the symbol assigned to this space is global, it will be used to satisfy any references from any shared objects. The dynamic executable inherits the data item. Any other objects within the process that make reference to this item will be bound to this copy. The original data from which the copy is made effectively becomes unused.
The following example of this mechanism uses an array of system error messages that is maintained within the standard C library. In previous SunOS operating system releases, the interface to this information was provided by two global variables, sys_errlist[], and sys_nerr. The first variable provided the array of error message strings, while the second conveyed the size of the array itself. These variables were commonly used within an application in the following manner:
$ cat foo.c
extern int      sys_nerr;
extern char *   sys_errlist[];

char *
error(int errnumb)
{
        if ((errnumb < 0) || (errnumb >= sys_nerr))
                return (0);
        return (sys_errlist[errnumb]);
}
The application uses the function error to provide a focal point to obtain the system error message associated with the number errnumb.
Examining a dynamic executable built using this code shows the implementation of the copy relocation in more detail:
$ cc -o prog main.c foo.c
$ nm -x prog | grep sys_
[36]  |0x00020910|0x00000260|OBJT |WEAK |0x0  |16 |sys_errlist
[37]  |0x0002090c|0x00000004|OBJT |WEAK |0x0  |16 |sys_nerr
$ dump -hv prog | grep bss
[16]    NOBI    WA-    0x20908    0x908    0x268    .bss
$ dump -rv prog

    **** RELOCATION INFORMATION ****

.rela.bss:
Offset      Symndx          Type            Addend

0x2090c     sys_nerr        R_SPARC_COPY    0
0x20910     sys_errlist     R_SPARC_COPY    0
..........