Warrior Tang (tangaroa) wrote,

Some simple assembly

The C code...

char * myarr = "foo";
x = myarr[1];

... produces this:

.LC0:
        .string "foo"
        .align 8
        .type   myarr, @object
        .size   myarr, 8
myarr:
        .quad   .LC0

(in main)
        movq    myarr(%rip), %rax
        addq    $1, %rax
        movzbl  (%rax), %eax

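The location of myarr is given as an offset from the instruction pointer (%rip). It is a quad (8-byte) data item whose value is the address of .LC0. So main loads that pointer into %rax, adds 1 to it, and only then dereferences it to read the byte.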
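
A rough C sketch of what those three instructions do, step by step. The function name second_char is made up for illustration; only myarr comes from the example above.

```c
/* myarr is a pointer variable: 8 bytes of data holding the address of "foo". */
char *myarr = "foo";

/* Hypothetical helper mirroring the three instructions in main. */
char second_char(void)
{
    char *p = myarr;   /* movq   myarr(%rip), %rax : load the pointer itself */
    p = p + 1;         /* addq   $1, %rax          : step to element 1       */
    return *p;         /* movzbl (%rax), %eax      : read one byte           */
}
```

The point is that there are two memory reads: one to fetch the pointer, and one to fetch the character it points at.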

How does char myarr[666] = "foo" compile?

.globl myarr
        .align 32
        .type   myarr, @object
        .size   myarr, 666
myarr:
        .string "foo"
        .zero   662

(in main)
        movzbl  myarr+1(%rip), %eax
        movsbl  %al,%eax

The "zbl" suffix means zero-extend byte to long: movzbl reads one byte from myarr+1(%rip) and zero-extends it into the 32-bit %eax. The "sbl" suffix means sign-extend byte to long: movsbl then sign-extends that same byte, now sitting in %al, back into %eax. Note that "myarr(%rip)" here resolves to the data at the location "myarr:", which is the first byte of the string itself. Since we are working with the data and not with a pointer to it, no extra load of an address is needed.

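The difference between the two extensions only shows up for bytes with the top bit set. A minimal C sketch, assuming a compiler such as gcc where converting an out-of-range value to int8_t wraps around; the function names are made up for illustration.

```c
#include <stdint.h>

/* Like movzbl: widen a byte to 32 bits, filling the upper 24 bits with zeros. */
uint32_t zero_extend_byte(uint8_t b)
{
    return (uint32_t)b;
}

/* Like movsbl: widen a byte to 32 bits, copying its top bit into the upper 24 bits. */
int32_t sign_extend_byte(uint8_t b)
{
    return (int32_t)(int8_t)b;
}
```

For a plain ASCII character such as 'o' both give the same value. For the byte 0xF0, zero extension gives 240 and sign extension gives -16.
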
How about malloc? In this example, char *myarr is declared outside main and malloced inside main.

(in main)
        subq    $16, %rsp	# Create space on the stack? It is not used.
        movl    $257, %edi	# Malloc is for 257 bytes. 
        call    malloc		# malloc takes the byte count in %rdi; writing %edi zeroes the upper half
        movq    %rax, myarr(%rip)  # malloc returns the pointer in %rax; store it in myarr

(after main)
        .size   main, .-main
        .comm   myarr,8,8

.comm defines the myarr symbol as having length 8, which is the pointer size. Why are there two 8s? On ELF targets, .comm accepts an optional third argument giving the alignment of the symbol in bytes. An alignment of 8 means the three least significant bits of the address must be zero, because 2^3 = 8.
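
The relation between a power-of-two alignment and the zeroed low bits can be checked with a few lines of C. This is only a sketch; is_aligned and zero_low_bits are names made up for illustration.

```c
#include <stdint.h>

/* Nonzero when addr is a multiple of align; align must be a power of two. */
int is_aligned(uintptr_t addr, uintptr_t align)
{
    return (addr & (align - 1)) == 0;
}

/* Number of low address bits forced to zero by a power-of-two alignment. */
unsigned zero_low_bits(uintptr_t align)
{
    unsigned bits = 0;
    while (align > 1) {
        align >>= 1;
        bits++;
    }
    return bits;
}
```

With an alignment of 8 this gives three zero bits, and with the .align 32 used for the 666-byte array it gives five.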

How about using a second variable as an offset?

        movl    $1, -8(%rbp)		# int i = 1
        movl    -8(%rbp), %eax		# Copy i into eax
        cltq				# Sign-extend eax to quadword rax
        movq    %rax, %rdx		# Copy rax to rdx
        movq    myarr(%rip), %rax	# Load the pointer stored in myarr into rax
        leaq    (%rdx,%rax), %rax	# Load effective address of rax+rdx into rax
        movzbl  (%rax), %eax		# Load the byte at *rax into eax and zero-extend
        movsbl  %al,%eax		# Sign-extend al into eax
        movl    %eax, -4(%rbp)		# Store result in x
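
Written back out as C, one statement per group of instructions, that sequence looks roughly like this. It is a sketch, not the original source: load_indexed is a made-up name, and myarr is assumed to point at a valid buffer before the call.

```c
#include <stdint.h>

char *myarr;   /* assumed to point at a valid buffer before load_indexed runs */

/* Hypothetical helper: x = myarr[i], spelled out step by step. */
int load_indexed(int i)
{
    int64_t offset = (int64_t)i;                 /* cltq: sign-extend the 32-bit index */
    char *addr = myarr + offset;                 /* movq + leaq: pointer plus offset   */
    unsigned char raw = *(unsigned char *)addr;  /* movzbl: load one byte              */
    return (int)(signed char)raw;                /* movsbl: sign-extend it into an int */
}
```

The sign extension of the index matters because i is a signed int, so a negative index has to stay negative when it is widened to 64 bits.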

With all that in mind, I made a very simple program to just loop a lot, and I ran it through the time program as I built each part of it to see how long it took.

      .comm   myarr,8,8

.globl main
main:
        pushq   %rbp	# Standard function prologue: save the old frame pointer
        movq    %rsp, %rbp	# and set up the new one

        movl $65536, %edi	# Malloc the string
        call malloc
        movq %rax, myarr(%rip)

        movl $0, %ecx	# Initialize the counter.
        movq $0, %rax


outerloop:
# A useless movl
        movl %eax, %edx

# Move the pointer around to discourage hardware caching.
        movzwl %ax, %ebx

# Dereference the pointer. Just because. 
        movq myarr(%rip), %rax
        addq %rbx, %rax	# Safe: the movzwl into %ebx above also zeroed the upper 32 bits of %rbx
        movzbl (%rax), %eax

# Run a comparison. Just because.
        cmpl $42, %eax
        je here
here:

# Loop logic. The loop starts at zero and is immediately decremented,
# so it has to underflow and wrap around to get back to zero. 
        subl $1, %ecx
        jnz outerloop
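
The wraparound trick in the loop logic can be sketched in C with unsigned counters. The function names are made up for illustration, and the second function uses an 8-bit counter so that it finishes instantly.

```c
#include <stdint.h>

/* Value of a 32-bit counter after the first decrement, starting from 0. */
uint32_t first_decrement(void)
{
    uint32_t counter = 0;
    counter -= 1;          /* wraps to 0xFFFFFFFF, which is nonzero, so jnz jumps */
    return counter;
}

/* Same do/while shape as the assembly loop, but with an 8-bit counter. */
unsigned count_iterations_8bit(void)
{
    uint8_t counter = 0;
    unsigned iterations = 0;
    do {
        iterations++;
        counter -= 1;      /* 0 -> 255 -> 254 -> ... -> 0 */
    } while (counter != 0);
    return iterations;
}
```

The 8-bit version runs 256 times, which is 2^8. By the same reasoning the 32-bit counter in %ecx runs the loop body 2^32 times.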

Observed costs over a 2^32 loop (in seconds):

1.438 -- A single subl and jnz loop
1.371 -- A single je instruction
0.066 -- A cmpl before the je
0.065 -- movzwl.
1.635 -- movl between two registers
1.471 -- A memory dereference (without moving the pointer) 
4.959 -- A memory dereference (with a moving pointer)
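
To put those figures on a per-iteration scale, divide by 2^32. A small C sketch of the arithmetic follows; the function names are made up, and the clock rate passed to cycles_per_iteration is an assumption to be supplied by the reader, not something measured here.

```c
/* Average nanoseconds per iteration for a loop of 2^32 iterations. */
double ns_per_iteration(double seconds)
{
    const double iterations = 4294967296.0;   /* 2^32 */
    return seconds * 1e9 / iterations;
}

/* Average cycles per iteration at an assumed clock rate in GHz. */
double cycles_per_iteration(double seconds, double assumed_ghz)
{
    return ns_per_iteration(seconds) * assumed_ghz;
}
```

For example, 1.438 seconds works out to about a third of a nanosecond per iteration, which would be roughly one cycle if the clock were 3 GHz.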

There is a purpose to this playing around: I was hoping to determine the cause of a difference in speed between two versions of a program whose assembly code differs only in that a movl is replaced by a movzwl+movswl pair in two places.

<       movl    (%rax), %edx
>       movzwl  (%rax), %eax
>       movswl  %ax,%edx

The movzwl+movswl version runs in half the time.
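
In C terms the two variants correspond roughly to loading through an int pointer versus loading through a short pointer and widening the result. This is a sketch of that correspondence, not the original program; the function names are made up, and the two only agree when the stored values fit in 16 bits.

```c
#include <stdint.h>

/* Like movl (%rax), %edx : a single 32-bit load. */
int32_t load_long(const int32_t *p)
{
    return *p;
}

/* Like movzwl (%rax), %eax ; movswl %ax, %edx : 16-bit load, then sign-extend. */
int32_t load_short(const int16_t *p)
{
    return (int32_t)*p;
}
```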

It surprised me to find in my testing that movl is so expensive, but that agrees with what I observed with this program. The chip is a Xeon 5160. Conclusion: using shorts can be significantly faster than using longs.

