- The GNU Assembler manual
- Intel opcode list (note that GNU tools produce AT&T syntax, which adds operand-size suffixes and reverses the operand order, but the root names are the same)
The C code...
    char *myarr = "foo"; x = myarr[1];
... produces this:
    .LC0:
            .string "foo"
            .data
            .align 8
            .type   myarr, @object
            .size   myarr, 8
    myarr:
            .quad   .LC0
    ...
    (in main)
            movq    myarr(%rip), %rax
            addq    $1, %rax
            movzbl  (%rax), %eax
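The location myarr is used as an offset from the instruction pointer (RIP-relative addressing). It is a quad (8-byte) data item holding the address of .LC0 as its data, so the movq loads a pointer, and reading a character takes a second load through %rax.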
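As a rough illustration of those three instructions, here is a C sketch that performs the same steps one at a time. The names myarr_ptr and load_second_byte are hypothetical stand-ins, not part of the original program, and the comments only indicate which instruction each line is meant to mirror.

```c
#include <stdint.h>

/* Hypothetical stand-in for the global `myarr` in the text:
   an 8-byte object whose value is the address of the string. */
const char *myarr_ptr = "foo";

/* Mirrors: movq myarr(%rip),%rax ; addq $1,%rax ; movzbl (%rax),%eax */
uint32_t load_second_byte(void)
{
    const char *rax = myarr_ptr;           /* movq: load the stored address */
    rax = rax + 1;                         /* addq $1: step to element 1    */
    return (uint32_t)(unsigned char)*rax;  /* movzbl: load byte, zero-extend */
}
```

Calling load_second_byte() should return the character code of the second letter of "foo".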
How does char myarr[666] = "foo" compile?
            .globl  myarr
            .data
            .align 32
            .type   myarr, @object
            .size   myarr, 666
    myarr:
            .string "foo"
            .zero   662
            .text
    ...
    (in main)
            movzbl  myarr+1(%rip), %eax
            movsbl  %al,%eax
"zbl" is zero-extended byte to long. It takes one byte from myarr+1(%rip) and zero-extends it, storing the result in a long. "sbl" sign-extends the same byte from the a register. Note that "myarr(%rip)" here resolves to the data at the location "myarr:" which is the first bytes of the string. Since we are working with data and not an address, no further dereference is needed.
How about malloc? In this example, char *myarr is declared outside main and malloced inside main.
    (in main)
            subq    $16, %rsp           # Create space on the stack? It is not used.
            movl    $257, %edi          # Malloc is for 257 bytes.
            call    malloc              # Malloc takes its size argument in %rdi (the low half is %edi).
            movq    %rax, myarr(%rip)   # malloc returns its result in rax
    ...
    (after main)
    .LFE5:
            .size   main, .-main
            .comm   myarr,8,8
.comm defines the myarr symbol as a common symbol of length 8, which is the pointer size. Why are there two 8s? On ELF targets, .comm accepts an optional third argument giving the desired alignment as a byte boundary. An alignment of 8 means that the three least significant bits of the address should be zero, because 2^3 = 8.
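The relationship between a byte alignment and the number of zero low bits can be checked with a short C sketch. Both helper names are hypothetical, and the code assumes the alignment is a power of two.

```c
#include <stdint.h>

/* Number of low address bits that must be zero for a given
   power-of-two byte alignment (8 -> 3, 16 -> 4). */
unsigned low_zero_bits(uint64_t alignment)
{
    unsigned bits = 0;
    while (alignment > 1) {
        alignment >>= 1;
        bits++;
    }
    return bits;
}

/* Nonzero if addr sits on the given power-of-two byte boundary. */
int is_aligned(uintptr_t addr, uintptr_t alignment)
{
    return (addr & (alignment - 1)) == 0;
}
```

For example, low_zero_bits(8) gives 3, and an address such as 0x1008 passes is_aligned(addr, 8) while 0x1004 does not.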
How about using a second variable as an offset?
            movl    $1, -8(%rbp)        # int i = 1
            movl    -8(%rbp), %eax      # Copy i into eax
            cltq                        # Sign-extend eax to quadword rax
            movq    %rax, %rdx          # Copy rax to rdx
            movq    myarr(%rip), %rax   # Load the pointer stored in myarr into rax
            leaq    (%rdx,%rax), %rax   # Load effective address rax+rdx into rax
            movzbl  (%rax), %eax        # Load the byte at *rax into eax and zero-extend
            movsbl  %al,%eax            # Sign-extend al into eax
            movl    %eax, -4(%rbp)      # Store result in x
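Here is a C sketch that walks through the same sequence one step at a time, which also shows why the index gets widened before the add. The function name and the local variable names (chosen to match the registers) are my own; this is an illustration of the steps, not the source that produced the listing.

```c
#include <stdint.h>

/* Step-by-step mirror of x = myarr[i] with a 32-bit int index. */
int32_t index_with_offset(const char *base, int32_t i)
{
    /* cltq, then movq %rax,%rdx: widen the 32-bit index to 64 bits,
       keeping its sign so that negative offsets still work. */
    int64_t rdx = (int64_t)i;

    /* leaq (%rdx,%rax),%rax: pointer plus offset, no memory access yet. */
    const unsigned char *rax = (const unsigned char *)base + rdx;

    /* movzbl (%rax),%eax: the only load, one byte zero-extended. */
    uint32_t eax = *rax;

    /* movsbl %al,%eax: reinterpret the low byte as a signed value. */
    return (int32_t)(eax & 0xFF) - ((eax & 0x80) ? 256 : 0);
}
```

With base pointing at "foo" and i = 1, this returns the code for 'o'. A negative i reads before base, which is only valid if base points into the middle of a larger object.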
With all that in mind, I made a very simple program to just loop a lot, and I ran it through the time program as I built each part of it to see how long it took.
            .comm   myarr,8,8
            .globl  main
    main:
            pushq   %rbp                # All programs start with these two lines
            movq    %rsp, %rbp
            movl    $65536, %edi        # Malloc the string
            call    malloc
            movq    %rax, myarr(%rip)
            movl    $0, %ecx            # Initialize the counter.
            movq    $0, %rax
    outerloop:
            # A useless movl
            movl    %eax, %edx
            # Move the pointer around to discourage hardware caching.
            movzwl  %ax, %ebx
            # Dereference the pointer. Just because.
            movq    myarr(%rip), %rax
            # rbx's upper bytes are 0 here: a write to ebx clears the upper half of rbx.
            # (rbx is callee-saved, though, and this code never restores it.)
            addq    %rbx, %rax
            movzbl  (%rax), %eax
            # Run a comparison. Just because.
            cmpl    $42, %eax
            je      here
    here:
            # Loop logic. The loop starts at zero and is immediately decremented,
            # so it has to underflow and wrap around to get back to zero.
            subl    $1, %ecx
            jnz     outerloop
    done:
            leave
            ret
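The wraparound trick in the loop logic can be sketched in C. To keep the example quick to run, this version assumes a 16-bit counter instead of the 32-bit %ecx, so it loops 2^16 times rather than 2^32. The function name is hypothetical.

```c
#include <stdint.h>

/* Start at zero, decrement first, stop when the counter is zero again.
   The first decrement wraps to 0xFFFF, so the body runs 2^16 times. */
uint32_t count_wrapping_iterations16(void)
{
    uint16_t counter = 0;
    uint32_t iterations = 0;
    do {
        iterations++;                       /* stands in for the loop body */
        counter = (uint16_t)(counter - 1);  /* subl $1 (16-bit here)       */
    } while (counter != 0);                 /* jnz outerloop               */
    return iterations;
}
```

The same pattern with a uint32_t counter gives the 2^32 iterations used for the timings.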
Observed costs over a 2^32 loop (in seconds):
    1.438 -- A single subl and jnz loop
    1.371 -- A single je instruction
    0.066 -- A cmpl before the je
    0.065 -- movzwl
    1.635 -- movl between two registers
    1.471 -- A memory dereference (without moving the pointer)
    4.959 -- A memory dereference (with a moving pointer)
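To put these totals in per-iteration terms, a small C sketch can divide by the 2^32 iterations. The function names are hypothetical, and the clock rate is passed in as a parameter because the conversion to cycles depends on it.

```c
/* Iterations in the timed loop: the 32-bit counter wraps once, so 2^32. */
static const double LOOP_ITERATIONS = 4294967296.0;

/* Average cost of one iteration in nanoseconds. */
double ns_per_iteration(double total_seconds)
{
    return total_seconds * 1e9 / LOOP_ITERATIONS;
}

/* Average cost of one iteration in clock cycles.
   clock_ghz is an assumption supplied by the caller. */
double cycles_per_iteration(double total_seconds, double clock_ghz)
{
    return ns_per_iteration(total_seconds) * clock_ghz;
}
```

For the 1.438 second subl and jnz loop this works out to about a third of a nanosecond per iteration. If the chip runs at 3.0 GHz (my assumption for its nominal clock), that is roughly one cycle per iteration.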
There is a purpose to this playing around; I was hoping to determine the cause for a difference in speed between two versions of a program where the only difference in the assembly code is that a movl is replaced by a movzwl+movswl pair in two places.
    145c145,146
    <       movl    (%rax), %edx
    ---
    >       movzwl  (%rax), %eax
    >       movswl  %ax,%edx
The movzwl+movswl version runs in half the time.
It surprised me to find in my testing that movl is so expensive, but that correlates with what I observed with this program. The chip is a Xeon 5160. Conclusion: using shorts can be significantly faster than using longs (in the assembler's sense of 32-bit values).
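Note that the two sides of that diff do not load the same data: movl reads four bytes, while the movzwl+movswl pair reads two bytes and sign-extends them. A C sketch of the two loads, with hypothetical function names and assuming a little-endian machine such as x86:

```c
#include <stdint.h>
#include <string.h>

/* movl (%rax), %edx : load four bytes as a 32-bit value. */
int32_t load_long(const void *p)
{
    int32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* movzwl (%rax), %eax ; movswl %ax,%edx : load two bytes,
   then sign-extend the 16-bit value to 32 bits. */
int32_t load_short_sign_extended(const void *p)
{
    int16_t v;
    memcpy(&v, p, sizeof v);
    return (int32_t)v;
}
```

On the bytes FF FF 01 00 the first function returns 0x0001FFFF and the second returns -1, so the two programs agree only when the stored values fit in 16 bits.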