Ehren's Blog

Mystery bytes

Posted in Seneca by ehren on December 15, 2009

As previously reported, I’ve been trying to determine why my GCC plugin to optimize away virtual functions that always return zero has resulted in a rather mediocre code size reduction. In fact, the problem is far worse than I previously realized.

It turns out the plugin has been adding 12240 bytes to libxul.so! This was truly horrible, but I have found the cause. As I mentioned before, the vast majority of the ‘optimizations’ performed by the plugin were quite useless. Any code of the form return call(); is considered by GCC at the GIMPLE level to be a ‘value returning gimple call’, which ends up with a three-address-code representation like so:
    t1 = call();
    return t1;

The plugin then uselessly converts this to:
    dummy_var = call();
    t1 = 0;
    return t1;
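
As a rough illustration of what that rewrite involves, here is a hedged sketch in the GCC 4.5-era GIMPLE API, not the plugin’s actual code. I’m assuming dummy_var is a VAR_DECL the plugin has already created and that gsi points at the call:

/* Turn "t1 = call ()" into "dummy_var = call (); t1 = 0;":
   redirect the call's result into the dummy variable, then assign
   the original LHS a literal zero right after the call. */
static void
zero_out_call (gimple_stmt_iterator *gsi, gimple call_stmt, tree dummy_var)
{
  tree lhs = gimple_call_lhs (call_stmt);  /* t1 in the example above */
  gimple zero_assign;

  gimple_call_set_lhs (call_stmt, dummy_var);
  update_stmt (call_stmt);

  zero_assign = gimple_build_assign (lhs, build_int_cst (TREE_TYPE (lhs), 0));
  gsi_insert_after (gsi, zero_assign, GSI_SAME_STMT);
}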

The problem is that this adds a significant amount of cruft to the generated code. It turns out that GCC, and probably most compilers, when encountering return call();, will simply jump to the callee (a tail call), letting the callee’s epilogue and return stand in for the caller’s. For example, I previously used the example of nsHTMLModElement::GetAttributeNS, which looks like this:

 
NS_SCRIPTABLE NS_IMETHOD
GetAttributeNS(const nsAString & namespaceURI, const nsAString & localName,
	       nsAString & _retval NS_OUTPARAM)
{
  return _to GetAttributeNS(namespaceURI, localName, _retval);
}

(See this post for a GIMPLE breakdown)

Without my plugin, GCC generates this code for the function:

0000000000000000 <_ZN16nsHTMLModElement14GetAttributeNSERK18nsAString_internalS2_RS0_>:
   0: e9 00 00 00 00        jmpq   5 <_ZThn56_N16nsHTMLModElement14GetAttributeNSERK18nsAString_internalS2_RS0_>

Here the prologue is left out as well, since all behavior is deferred to the callee. With the ‘optimization’, however, the return value is explicitly zeroed out, so GCC must generate code for a complete procedure:

0000000000000000 <_ZN16nsHTMLModElement14GetAttributeNSERK18nsAString_internalS2_RS0_>:
   0: 48 83 ec 08           sub    $0x8,%rsp
   4: e8 00 00 00 00        callq  9 <_ZN16nsHTMLModElement14GetAttributeNSERK18nsAString_internalS2_RS0_+0x9>
   9: 31 c0                 xor    %eax,%eax
   b: 48 83 c4 08           add    $0x8,%rsp
   f: c3                    retq
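
The effect is easy to reproduce outside of Mozilla. Here is a minimal standalone C sketch (the file and function names are mine); compiling with gcc -O2 -S shows a bare jmp for the forwarding version and a full prologue/epilogue once the return value is forced to zero:

/* forward.c (hypothetical): gcc -O2 -S forward.c */
extern int callee (void);

/* Compiles to a single "jmp callee": the callee's return goes straight
   back to our caller, so no frame is needed here. */
int
forwards (void)
{
  return callee ();
}

/* Zeroing the result after the call forces a real call plus a
   prologue/epilogue, just like the plugin-transformed code above. */
int
zeroes (void)
{
  callee ();
  return 0;
}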

So what is to be done?

My first thought was to abandon this approach to the optimization entirely and go back to something like my 0.1 release. I was even able to build a do-nothing implementation of the propagation engine by copying tree-ssa-propagate.h into the GCC plugin include directory. Unfortunately, writing one of these passes requires a lot of code because you have to do all of the folding yourself. I also don’t think it’s possible to just copy all of the code from the conditional constant propagation pass into a plugin (I’ve tried… GCC doesn’t export enough to the plugin framework).

Another approach was to add a couple of ad hoc conditions to the plugin so that statements of the form return call(); are not ‘optimized’. One thing I noticed was that if a function has only one return, and it is of the form return call();, that return will be in the same basic block as the call. Therefore I could use code like this to weed out the boring stuff:

/* Return true if BB ends in a return statement, i.e. the call in BB
   feeds straight into a "return call();" with no further use of the
   value: the shape that is not worth rewriting. */
static bool
boring_block (basic_block bb)
{
  gimple ret_stmt = gimple_seq_last_stmt (bb_seq (bb));
  return gimple_code (ret_stmt) == GIMPLE_RETURN;
}
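
For context, here is a hedged sketch of where such a check could sit in the pass; the wrapper and its name are my invention, not the plugin’s actual wiring:

/* Hypothetical gate: only bother rewriting a call site when its own
   basic block does not simply fall through to a return. */
static bool
worth_rewriting (gimple call_stmt)
{
  return !boring_block (gimple_bb (call_stmt));
}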

This worked pretty well. The number of ‘optimized’ call sites went down to 55 (from 1305), which was just what I wanted, given that the majority of those call sites are not optimization worthy. Unfortunately, I was still increasing the size of libxul.so, except this time by 14 bytes.

One possibility was functions with multiple returns, which end up looking like this (this is after the ‘optimization’ has taken place):

<bb 6>:
  D.74138_10 = &this_6(D)->D.72191;
  dummy_var.24_17 = nsSVGGFrame::AttributeChanged (D.74138_10, aNameSpaceID_2(D), aAttribute_4(D), aModType_11(D));
  D.74137_12 = 0;

<bb 7>:
  # D.74137_1 = PHI <0(5), D.74137_12(6)>
  return D.74137_1;
}

(See the code for this function here. Note that it has one 0-constant return and one ‘alwayszero call’ return. It doesn’t fit my boring_block condition above, but this is still a useless optimization.)
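
For illustration, here is a hypothetical C-level shape (my own example, not the Mozilla code) that produces this kind of GIMPLE: one constant-zero return and one return of an always-zero call, merged through a PHI node, so the return no longer sits in the call’s own block:

extern int always_zero_call (void);

int
multi_return (int flag)
{
  if (flag)
    return 0;                   /* the 0-constant return */
  return always_zero_call ();   /* the 'alwayszero call' return */
}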

So I moved on to the next approach. It appears that any time the return value is checked, the basic block of the call will contain a conditional statement of some form (see the GIMPLE code for nsSplittableFrame::AttributeChanged in this post for an example). This led me to a new condition for only performing useful optimizations:

/* Return true if the block walked by GSI contains a conditional or a
   switch, i.e. the value produced earlier in the block actually gets
   inspected. */
static bool
is_interesting_block (gimple_stmt_iterator gsi)
{
  for (; !gsi_end_p (gsi); gsi_next (&gsi))
    {
      gimple stmt = gsi_stmt (gsi);
      enum gimple_code code = gimple_code (stmt);
      if (code == GIMPLE_COND || code == GIMPLE_SWITCH)
        return true;
    }
  return false;
}
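
Again as a hedged sketch of the wiring (my assumption, not the plugin’s actual code), the predicate would be handed an iterator positioned at the head of the call’s block:

/* Hypothetical gate for the new condition: rewrite only when the
   call's block goes on to test a value. */
static bool
result_is_tested (gimple call_stmt)
{
  return is_interesting_block (gsi_start_bb (gimple_bb (call_stmt)));
}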

With this, the number of optimized call sites is reduced to 26, and I’m certain all of these optimizations result in useful code pruning (although I haven’t run a tree-dump build yet). Here’s the weird thing, though: the size results are exactly the same as when I was optimizing 55 call sites. That is, the plugin is still adding 14 bytes to libxul.so! Argh.

Here’s an even bigger mystery: I ran a build using a ‘disabled’ version of the plugin. It registers a new pass, but no optimizations are performed. The results are identical to when I was running the plugin with 55 call sites optimized and with 26 call sites optimized: 14 bytes are added to libxul.so. My first thought is that adding a new pass somehow interferes with existing GCC optimizations, especially since the plugin must set the TODO_update_ssa flag (since a new statement is added). Why do my 26 useful optimizations not have any effect, though?
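
For reference, here is a hedged sketch of the kind of pass registration I mean, in the GCC 4.5-era plugin API. The pass name, the execute stub, the anchor pass, and the placement of TODO_update_ssa in todo_flags_finish are all my placeholders and guesses, not the plugin’s actual values:

#include "gcc-plugin.h"
#include "coretypes.h"
#include "tree.h"
#include "tree-pass.h"

int plugin_is_GPL_compatible;

static unsigned int
execute_zero_returns (void)
{
  return 0;   /* stub: the real pass rewrites always-zero call sites */
}

static struct gimple_opt_pass pass_zero_returns =
{
  {
    GIMPLE_PASS,
    "zero-returns",          /* name (placeholder) */
    NULL,                    /* gate */
    execute_zero_returns,    /* execute */
    NULL, NULL,              /* sub, next */
    0,                       /* static_pass_number */
    TV_NONE,                 /* tv_id */
    PROP_cfg | PROP_ssa,     /* properties_required */
    0, 0, 0,                 /* provided, destroyed, todo_flags_start */
    TODO_update_ssa          /* todo_flags_finish: runs after every
                                invocation, even if nothing was changed */
  }
};

int
plugin_init (struct plugin_name_args *plugin_info,
             struct plugin_gcc_version *version)
{
  struct register_pass_info pass_info;   /* version check omitted */

  pass_info.pass = &pass_zero_returns.pass;
  pass_info.reference_pass_name = "ccp";   /* placeholder anchor point */
  pass_info.ref_pass_instance_number = 1;
  pass_info.pos_op = PASS_POS_INSERT_AFTER;

  register_callback (plugin_info->base_name, PLUGIN_PASS_MANAGER_SETUP,
                     NULL, &pass_info);
  return 0;
}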

I’m going to have to run another tree-dump build of mozilla-central and then play around a bit more with objdump. Something is very strange here.


3 Responses


  1. Taras said, on December 15, 2009 at 2:38 pm

    That’s actually very interesting. Do you mean that gcc generates crappier code for
    nsresult rv = call();
    return rv;

    than for return call()?

  2. ehren said, on December 15, 2009 at 7:41 pm

    I would guess that if the nsresult is modified after the call then the code will be crappier, but if the return value depends solely on the value of the call then the epilogue can be shared between functions.

    edit: if the return value is used after the call before being returned, then the epilogue can’t be shared

    eg

    int foo() { int rv = call(); if (rv) printf("blah"); else printf("bleh"); return rv; } won’t get the optimization but

    int foo() { int rv = call(); return rv; } will


