Warrior Tang (tangaroa) wrote,

Some simple assembly

Good resources:

The C code...

char * myarr = "foo"; 
x = myarr[1];

... produces this:

        .string "foo"
        .align 8
        .type   myarr, @object
        .size   myarr, 8
        .quad   .LC0

(in main)
        movq    myarr(%rip), %rax
        addq    $1, %rax
        movzbl  (%rax), %eax

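The location of myarr is addressed as an offset from the instruction pointer (%rip). myarr is a quad (8-byte) data item that holds the address of .LC0, the string, as its data. Reading myarr[1] therefore takes two steps: load the pointer, then load the byte it points to.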
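
As a rough illustration of those two steps, here is a minimal C sketch that mirrors the three instructions one at a time. The helper name read_second_char is made up for this example and is not part of the original program, and I declare the pointer const, which the original does not.

```c
#include <assert.h>
#include <stddef.h>

/* myarr is a pointer-sized object (the .quad); the bytes "foo" live elsewhere (.LC0). */
const char *myarr = "foo";

/* Hypothetical helper mirroring movq / addq / movzbl. */
unsigned int read_second_char(void)
{
    const char *p = myarr;      /* movq myarr(%rip), %rax : load the pointer value  */
    assert(p != NULL);
    p = p + 1;                  /* addq $1, %rax          : advance by one byte     */
    return (unsigned char)*p;   /* movzbl (%rax), %eax    : load byte, zero-extend  */
}
```

The point of the sketch is that sizeof myarr is the size of a pointer, not the length of the string.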

How does char myarr[666] = "foo" compile?

.globl myarr
        .align 32
        .type   myarr, @object
        .size   myarr, 666
        .string "foo"
        .zero   662

(in main)
        movzbl  myarr+1(%rip), %eax
        movsbl  %al,%eax

"zbl" is zero-extend byte to long: movzbl takes one byte from myarr+1(%rip) and zero-extends it into the 32-bit register %eax. "sbl" is sign-extend byte to long: movsbl then sign-extends that same byte from %al, the low byte of the a register. Note that "myarr(%rip)" here resolves to the data at the location "myarr:", which is the first bytes of the string. Since we are working with the data itself and not a pointer to it, no further dereference is needed.

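The difference between the two extensions, and the in-place layout of the array, can be sketched in C. This is only an illustration: the helper names zero_extend_byte and sign_extend_byte are invented here, and the casts rely on the usual two's-complement behavior of common compilers.

```c
#include <assert.h>

/* 666 bytes reserved in place: "foo", its NUL, then 662 zero bytes (.zero 662). */
char myarr[666] = "foo";

/* Like movzbl: treat the byte as unsigned, so the upper 24 bits become 0. */
int zero_extend_byte(const char *p)
{
    assert(p != 0);
    return (unsigned char)*p;
}

/* Like movsbl: treat the byte as signed, so bit 7 is copied into the upper 24 bits. */
int sign_extend_byte(const char *p)
{
    assert(p != 0);
    return (signed char)*p;
}
```

For a byte with the top bit set, such as 0xE9, the first helper gives 233 and the second gives -23. For plain ASCII like 'o' both give the same value.
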
How about malloc? In this example, char *myarr is declared outside main and malloced inside main.

(in main)
        subq    $16, %rsp	# Reserve 16 bytes of stack for locals; unused in this excerpt.
        movl    $257, %edi	# First argument: request 257 bytes.
        call    malloc		# malloc takes the byte count in %rdi; writing %edi zero-extends into it.
        movq    %rax, myarr(%rip)  # malloc returns the pointer in rax; store it in myarr.

(after main)
        .size   main, .-main
        .comm   myarr,8,8

.comm defines myarr as a common symbol of length 8, which is the pointer size. Why are there two 8s? On ELF targets, .comm accepts an optional third argument giving the desired alignment as a byte boundary, which fixes how many least significant bits of the address must be zero. The second 8 here means three zero bits because 2^3 = 8.
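
The relationship between a byte alignment and the number of zero bits can be checked with a small C sketch. The function names low_zero_bits and is_aligned are hypothetical helpers written for this note, and they assume the alignment is a power of two.

```c
#include <assert.h>
#include <stdint.h>

/* Number of low address bits forced to zero by a power-of-two byte alignment. */
unsigned low_zero_bits(uintptr_t alignment)
{
    /* Assumption: alignment is a nonzero power of two. */
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    unsigned bits = 0;
    while (alignment > 1) {
        alignment >>= 1;
        bits++;
    }
    return bits;
}

/* An address is aligned when masking with (alignment - 1) leaves nothing. */
int is_aligned(uintptr_t address, uintptr_t alignment)
{
    return (address & (alignment - 1)) == 0;
}
```

An alignment of 8 gives 3 zero bits, and the .align 32 seen earlier gives 5.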

How about using a second variable as an offset?

        movl    $1, -8(%rbp)		# int i = 1
        movl    -8(%rbp), %eax		# Copy i into eax
        cltq				# Sign-extend eax to quadword rax
        movq    %rax, %rdx		# Copy rax to rdx
        movq    myarr(%rip), %rax	# Load the pointer stored in myarr into rax
        leaq    (%rdx,%rax), %rax	# Load effective address rax+rdx into rax
        movzbl  (%rax), %eax		# Load the byte at *rax into eax, zero-extended
        movsbl  %al,%eax		# Sign-extend al into eax
        movl    %eax, -4(%rbp)		# Store result in x
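
Read back into C, the sequence above might look like the sketch below, one statement per instruction. The helper name load_indexed and its parameters are assumptions made for illustration; the original source is presumably just x = myarr[i].

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: x = base[i], spelled out one instruction at a time. */
int load_indexed(const char *base, int i)
{
    assert(base != 0);
    int64_t index = (int64_t)i;              /* cltq: sign-extend the 32-bit index   */
    const char *addr = base + index;         /* leaq (%rdx,%rax), %rax: base + index */
    unsigned int zx = (unsigned char)*addr;  /* movzbl (%rax), %eax: zero-extend     */
    return (signed char)zx;                  /* movsbl %al, %eax: sign-extend low byte */
}
```

The sign extension of the index matters for negative offsets: with a 64-bit pointer, an index of -1 has to become a 64-bit -1 before the addition.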

With all that in mind, I made a very simple program to just loop a lot, and I ran it through the time program as I built each part of it to see how long it took.

      .comm   myarr,8,8

.globl main
main:
        pushq   %rbp	# Standard function prologue: save the caller's frame pointer
        movq    %rsp, %rbp	# and set up a new one

        movl $65536, %edi	# Malloc the string
        call malloc
        movq %rax, myarr(%rip)

        movl $0, %ecx	# Initialize the counter.
        movq $0, %rax

outerloop:
# A useless movl
        movl %eax, %edx

# Move the pointer around to discourage hardware caching.
        movzwl %ax, %ebx

# Dereference the pointer. Just because. 
        movq myarr(%rip), %rax
        addq %rbx, %rax	# Safe: writing %ebx with movzwl also zeroes the upper 32 bits of %rbx
        movzbl (%rax), %eax

# Run a comparison. Just because.
        cmpl $42, %eax
        je here
here:			# Assumed placement; the listing never shows where "here" is defined

# Loop logic. The loop starts at zero and is immediately decremented,
# so it has to underflow and wrap around to get back to zero. 
        subl $1, %ecx
        jnz outerloop
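
The wrap-around loop logic can be sketched in C. To keep the demonstration instant, this version assumes a 16-bit counter standing in for the 32-bit %ecx, so it runs 2^16 times instead of 2^32; the function name is made up for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Count the iterations of a decrement-then-test loop whose counter starts at 0.
   Assumption: a 16-bit counter stands in for %ecx so the demo finishes at once. */
unsigned long count_wrapping_iterations(void)
{
    uint16_t counter = 0;          /* movl $0, %ecx */
    unsigned long iterations = 0;
    do {
        iterations++;              /* the loop body would go here */
        counter--;                 /* subl $1, %ecx: 0 wraps to 0xFFFF on the first pass */
    } while (counter != 0);        /* jnz outerloop */
    assert(iterations == 65536UL); /* one full trip around the counter's range */
    return iterations;
}
```

With a 32-bit counter the same structure gives 2^32 iterations, which is the loop count used in the timings.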

Observed costs over a 2^32 loop (in seconds):

1.438 -- A single subl and jnz loop
1.371 -- A single je instruction
0.066 -- A cmpl before the je
0.065 -- A movzwl
1.635 -- movl between two registers
1.471 -- A memory dereference (without moving the pointer) 
4.959 -- A memory dereference (with a moving pointer)
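
To put these figures on a per-iteration scale, the arithmetic can be sketched in C. The clock rate is an assumption passed in by the caller; I use 3.0 GHz below, which I believe is the nominal clock of the Xeon 5160, and the function names are invented for this note.

```c
#include <assert.h>

/* 2^32 iterations, as in the measurements above. */
#define ITERATIONS 4294967296.0

/* Convert a total run time in seconds to nanoseconds per iteration. */
double ns_per_iteration(double seconds)
{
    assert(seconds >= 0.0);
    return seconds * 1e9 / ITERATIONS;
}

/* Convert to clock cycles per iteration. clock_ghz is an assumption supplied
   by the caller (for example, the CPU's nominal clock rate). */
double cycles_per_iteration(double seconds, double clock_ghz)
{
    return ns_per_iteration(seconds) * clock_ghz;
}
```

Under that assumption, 1.438 seconds works out to roughly 0.33 ns, or about one cycle, per iteration, and 4.959 seconds to roughly 1.15 ns, or about 3.5 cycles.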

There is a purpose to this playing around: I was hoping to determine the cause of a difference in speed between two versions of a program whose assembly code differs only in that a movl is replaced by a movzwl+movswl pair in two places.

<       movl    (%rax), %edx
>       movzwl  (%rax), %eax
>       movswl  %ax,%edx

The movzwl+movswl version runs in half the time.
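
In C terms, the two versions differ in the width of the value being read. The sketch below is a guess at the shape of that difference, not the original program's source: the function names and the fixed-width types are assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* A plain 32-bit load, as movl (%rax), %edx would do. */
int32_t load_long(const int32_t *p)
{
    assert(p != 0);
    return *p;
}

/* A 16-bit load widened to 32 bits, as the movzwl + movswl pair does. */
int32_t load_short(const int16_t *p)
{
    assert(p != 0);
    uint32_t zx = (uint16_t)*p;   /* movzwl (%rax), %eax: load 16 bits, zero-extend */
    return (int16_t)zx;           /* movswl %ax, %edx: sign-extend the low 16 bits  */
}
```

Both return a 32-bit result; the second reads half as many bytes from memory and needs one extra register-to-register instruction.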

It surprised me to find in my testing that movl is so expensive, but that is consistent with what I observed with this program. The chip is a Xeon 5160. Conclusion: using shorts can be significantly faster than using longs.
