Ehren's Blog

Mystery bytes

Posted in Seneca by ehren on December 15, 2009

As previously reported, I’ve been trying to determine why my gcc plugin to optimize away virtual functions that always return zero has resulted in a rather mediocre code size reduction. In fact, the problem is far worse that I previously realized.

It turns out the plugin has been adding 12240 bytes to libxul.so! This was truly horrible, but I have found the cause. As I mentioned before, the vast majority of the ‘optimizations’ performed by the plugin were quite useless. Any code of the form return call(); is considered by GCC at the GIMPLE level to be a ‘value returning gimple call’ which will end up with a 3 address code representation like so:
    t1 = call();
    return t1;

The plugin then uselessly converts this to:
    dummy_var = call();
    t1 = 0;
    return t1;

The problem is that this adds a significant amount of cruft to the generated code. It turns out GCC, and probably most compilers, when encountering return call();, will simply jump to the callee, letting the epilogue of the callee do the work of the caller. For example, I previously used the example of nsHTMLModElement::GetAttributeNS which looks like this:

 
NS_SCRIPTABLE NS_IMETHOD
GetAttributeNS(const nsAString & namespaceURI, const nsAString & localName,
	       nsAString & _retval NS_OUTPARAM)
{
  return _to GetAttributeNS(namespaceURI, localName, _retval);
}

(See this post for a GIMPLE breakdown)

Without my plugin, GCC generates this code for the function:

0000000000000000 <_ZN16nsHTMLModElement14GetAttributeNSERK18nsAString_internalS2_RS0_>:
   0: e9 00 00 00 00        jmpq   5 <_ZThn56_N16nsHTMLModElement14GetAttributeNSERK18nsAString_internalS2_RS0_>

Here the prologue is left off as well since all behavior is deferred to the callee. With the ‘optimization’ however the return value is explicitly zeroed out, and so GCC must generate code for a complete procedure:

0000000000000000 <_ZN16nsHTMLModElement14GetAttributeNSERK18nsAString_internalS2_RS0_>:
   0: 48 83 ec 08           sub    $0x8,%rsp
   4: e8 00 00 00 00        callq  9 <_ZN16nsHTMLModElement14GetAttributeNSERK18nsAString_internalS2_RS0_+0x9>
   9: 31 c0                 xor    %eax,%eax
   b: 48 83 c4 08           add    $0x8,%rsp
   f: c3                    retq

So what is to be done?

My first thought was to abandon this approach to the optimization entirely and go back to something like my 0.1 release. I was even able to build a do nothing implementation of the propagation engine by copying tree-ssa-propagate.h into the GCC plugin include directory. Unfortunately, writing one of these passes requires a lot of code because you have to do all of the folding yourself. I also don’t think it’s possible to just copy all of the code from the conditional constant propagation pass into a plugin (I’ve tried… GCC doesn’t export enough to the plugin framework).

Another approach was to add a couple of ad hoc conditions to the plugin so that statements of the form return call(); are not ‘optimized’. One thing I noticed was that if a function has only one return of the form return call(); , that return will be in the same basic block as the call. Therefore I could use code like this to weed out the boring stuff:

static bool
boring_block (basic_block bb)
{
  gimple ret_stmt = gimple_seq_last_stmt (bb_seq (bb));
  return gimple_code (ret_stmt) == GIMPLE_RETURN;
}

This worked pretty well. The number of ‘optimized’ call sites went down to 55 (from 1305) which was just what I wanted given that the majority of those call sites are not optimization worthy. Unfortunately, I was still increasing increasing the size of libxul.so, except this time by 14 bytes.

One possibility was that functions with multiple returns end up looking like this (this is after the ‘optimization’ has taken place):

<bb 6>:
  D.74138_10 = &this_6(D)->D.72191;
  dummy_var.24_17 = nsSVGGFrame::AttributeChanged (D.74138_10, aNameSpaceID_2(D), aAttribute_4(D), aModType_11(D));
  D.74137_12 = 0;

<bb 7>:
  # D.74137_1 = PHI <0(5), D.74137_12(6)>
  return D.74137_1;
}

(See the code for this function here. Note that it has one 0-constant return and one ‘alwayszero call’ return. It doesn’t fit my boring_block condition above, but this is still a useless optimization.)

So I moved onto the next approach. It appears that any time the return value is checked the basic block of the call will contain a conditional statement of some form (see the GIMPLE code for nsSplittableFrame::AttributeChanged in this post, for an example). This led me to a new condition for only performing useful optimizations:

static bool
is_interesting_block (gimple_stmt_iterator gsi)
{
  for (; !gsi_end_p (gsi); gsi_next (&gsi))
    {
      gimple stmt = gsi_stmt (gsi);
      enum gimple_code code = gimple_code (stmt);
      if (code == GIMPLE_COND || code == GIMPLE_SWITCH)
        return true;
    }
  return false;
}

With this, the number of optimized call sites is reduced to 26, and I’m certain all of these optimizations result in useful code pruning (although I haven’t run a tree-dump build yet). Here’s the weird thing though: the size results are exactly the same as when I was optimizing 55 callsites. That is, the plugin is still adding 14 bytes to libxul.so! Argh.

Here’s an even bigger mystery: I ran a build using a ‘disabled’ version of the plugin. It registers a new pass, but no optimizations are performed. The results are identical to when I was running the plugin with 55 callsites optimized and with 26 callsites optimized: 14 bytes are added to libxul.so. My first thought is that adding a new pass somehow interferes with existing GCC optimizations, especially since the plugin must set the TODO_update_ssa flag (since a new statement is added). Why do my 26 useful optimizations not have any affect though?

I’m going to have to run another tree-dump build of mozilla-central and then play around a bit more with objdump. Something is very strange here.

More analysis troubleshooting

Posted in Seneca by ehren on December 9, 2009

I wrote a little Dehydra script to determine why my optimization has resulted in a mediocre size reduction. As I suspected, most of the call sites are of the form return call(); but I can now confirm this with this script:

function isAlwaysZero(c)
{
  if (!c.attributes)
    return false;
  for each (let a in c.attributes)
    if (a.name == 'user' && a.value == 'NS_alwayszero')
      return true;
  return false;
}

function process_function(decl, body)
{
  // for calls that are simply returned
  let boringCalls = 0;
  // for calls where the return value is assigned to a variable
  let interestingCalls = 0;

  for (v in iterate_vars(body)) {
    if (v.isFcall && v.isReturn && isAlwaysZero(v))
      // the function call is returned
      ++boringCalls;
    else if (v.assign && v.assign[0].isFcall && isAlwaysZero(v.assign[0]))
      // the function call is assigned to a variable
      ++interestingCalls;
  }

  if (interestingCalls == 0 && boringCalls > 0)
    warning("boring alwayszero function (boring calls: " + boringCalls +
            "). name: " + decl.name, decl.loc);
  else if (interestingCalls > 0)
    warning("interesting alwayszero function (interesting calls: " +
            interestingCalls + ", boring calls: " + boringCalls + "). name: " +
            decl.name, decl.loc);
}

There may be a couple of corner cases I’m excluding here eg something like if (call()) but this should take of every instance when NS_FAILED is involved.

Anyway, here are the results:

The number of interesting call sites ie places where the return value of the call is assigned to a variable is only 67. Surprisingly, the number of boring call sites ie places where the return value of the call is simply returned is a whopping 2495. Here’s the results for all the boring callsites and all the interesting callsites, if you’re interested.

One thing I’m a bit confused about is that although there are 2495+67 = 2562 call sites where my plugin should be invoked, it is only being invoked 1305 times, which I have previously reported here (I can also confirm this from running a Mozilla build with -fdump-tree-all). Whatever the answer though, I’m pretty certain that the poor results are due to a small number of checks made on calls to functions I have patched.

Possible reasons for poor binary size reduction

Posted in Seneca by ehren on December 4, 2009

Just for a bit of context here’s the basic reason why I was only able to achieve 24 bytes worth of optimization with the current number of functions patched.

Consider GetAttributeNS which is defined in a macro here

It looks like this:

NS_SCRIPTABLE NS_IMETHOD
GetAttributeNS(const nsAString & namespaceURI, const nsAString & localName,
	       nsAString & _retval NS_OUTPARAM)
{
  return _to GetAttributeNS(namespaceURI, localName, _retval);
}

Here’s what it looks like (in gimple ssa form) after being run through the plugin:

nsHTMLModElement::GetAttributeNS(nsAString_internal const&, nsAString_internal const&, nsAString_internal&) (struct nsHTMLModElement * const this, const struct nsAString_internal & namespaceURI, const struct nsAString_internal & localName, struct nsAString_internal & _retval)
{
  nsresult dummy_var.33;
  struct nsGenericElement * D.44812;
  nsresult D.44811;

<bb 2>:
  D.44812_2 = &this_1(D)->D.43447.D.33567.D.30722.D.30660;
  dummy_var.33_9 = nsGenericElement::GetAttributeNS (D.44812_2, namespaceURI_3(D), localName_4(D), _retval_5(D));
  D.44811_6 = 0;
  return D.44811_6;
}

The optimization’s performed (because it’s a value returning gimple call) but there’s no real code elimination. Even though this function now knows that the call to an alwayszero function is 0 it doesn’t really help because it just returns the value. And the function returning the value is itself virtual!

(Note: It’s somewhat shameless to post the above because I just ripped it out of a pastebin from an irc conversation, but there it is.)

Are there any significant optimizations?

The next question is whether or not my plugin is able to cut out any unnecessary NS_FAILED calls since that’s kind of the point. In fact, it has in some cases.

Consider nsImageFrame::AttributeChanged

It looks like this:

NS_IMETHODIMP
nsImageFrame::AttributeChanged(PRInt32 aNameSpaceID,
                               nsIAtom* aAttribute,
                               PRInt32 aModType)
{
  nsresult rv = nsSplittableFrame::AttributeChanged(aNameSpaceID,
                                                    aAttribute, aModType);
  if (NS_FAILED(rv)) {
    return rv;
  }
  if (nsGkAtoms::alt == aAttribute)
  {
    PresContext()->PresShell()->FrameNeedsReflow(this,
                                                 nsIPresShell::eStyleChange,
                                                 NS_FRAME_IS_DIRTY);
  }

  return NS_OK;
}

nsSplittableFrame::AttributeChanged is an alwayszero function so nsImageFrame::AttributeChanged will be processed by the plugin. Here’s what that processing will do:

nsImageFrame::AttributeChanged(int, nsIAtom*, int) (struct nsImageFrame * const this, PRInt32 aNameSpaceID, struct nsIAtom * aAttribute, PRInt32 aModType)
{
  nsresult dummy_var.291;
  struct nsIPresShell * D.96275;
  struct nsStyleContext * D.96272;
  struct nsRuleNode * D.96271;
  struct nsPresContext * D.96269;
  struct nsPresContext * D.96269;
  nsresult rv;
  int (*__vtbl_ptr_type) (void) D.93660;
  int (*__vtbl_ptr_type) (void) * D.93659;
  int (*__vtbl_ptr_type) (void) * D.93658;
  struct nsIFrame * D.93655;
  struct nsIAtom * alt.65;
  long int D.93648;
  long int D.93647;
  int rv.64;
  struct nsFrame * D.93645;

<bb 2>:
  D.93645_3 = &this_2(D)->D.53019.D.51667;
  dummy_var.291_32 = nsFrame::AttributeChanged (D.93645_3, aNameSpaceID_4(D), aAttribute_5(D), aModType_6(D));
  rv_7 = 0;
  rv.64_8 = (int) rv_7;
  D.93647_9 = rv.64_8 < 0;
  D.93648_10 = __builtin_expect (D.93647_9, 0);
  if (D.93648_10 != 0)
    goto <bb 3>;
  else
    goto <bb 4>;

<bb 3>:
  rv_11 = rv_7;
  goto <bb 7>;

<bb 4>:
  alt.65_12 = alt;
  if (alt.65_12 == aAttribute_5(D))
    goto <bb 5>;
  else
    goto <bb 6>;

<bb 5>:
  D.93655_13 = &this_2(D)->D.53019.D.51667.D.51581.D.46508;
  D.96272_14 = D.93655_13->mStyleContext;
  D.96272_26 = D.96272_14;
  D.96271_27 = D.96272_26->mRuleNode;
  D.96269_28 = D.96271_27->mPresContext;
  D.96269_30 = D.96269_28;
  D.96269_15 = D.96269_30;
  D.96275_16 = D.96269_15->mShell;
  D.96275_29 = D.96275_16;
  D.96275_31 = D.96275_29;
  D.96275_17 = D.96275_31;
  D.93658_18 = D.96275_17->D.15138.D.14613._vptr.nsISupports;
  D.93659_19 = D.93658_18 + 200;
  D.93660_20 = *D.93659_19;
  D.93655_21 = &this_2(D)->D.53019.D.51667.D.51581.D.46508;
  OBJ_TYPE_REF(D.93660_20;D.96275_17->25) (D.96275_17, D.93655_21, 2, 1024);

<bb 6>:
  rv_22 = 0;

<bb 7>:
  # rv_1 = PHI <rv_11(3), 0(6)>
  return rv_1;
}

Clearly the optimization has been performed but has GCC propagated the 0, cutting out the NS_FAILED?

It has, in fact (here’s the IR after a later pass):

nsImageFrame::AttributeChanged(int, nsIAtom*, int) (struct nsImageFrame * const this, PRInt32 aNameSpaceID, struct nsIAtom * aAttribute, PRInt32 aModType)
{
  struct nsIPresShell * D.96275;
  struct nsStyleContext * D.96272;
  struct nsRuleNode * D.96271;
  struct nsPresContext * D.96269;
  struct nsPresContext * D.96269;
  int (*__vtbl_ptr_type) (void) D.93660;
  int (*__vtbl_ptr_type) (void) * D.93659;
  int (*__vtbl_ptr_type) (void) * D.93658;
  struct nsIFrame * D.93655;
  struct nsIAtom * alt.65;
  struct nsFrame * D.93645;

<bb 2>:
  D.93645_3 = &this_2(D)->D.53019.D.51667;
  nsFrame::AttributeChanged (D.93645_3, aNameSpaceID_4(D), aAttribute_5(D), aModType_6(D));
  alt.65_12 = alt;
  if (alt.65_12 == aAttribute_5(D))
    goto <bb 3>;
  else
    goto <bb 4>;

<bb 3>:
  D.96272_14 = this_2(D)->D.53019.D.51667.D.51581.D.46508.mStyleContext;
  D.96271_27 = D.96272_14->mRuleNode;
  D.96269_28 = D.96271_27->mPresContext;
  D.96275_16 = D.96269_28->mShell;
  D.93658_18 = D.96275_16->D.15138.D.14613._vptr.nsISupports;
  D.93659_19 = D.93658_18 + 200;
  D.93660_20 = *D.93659_19;
  D.93655_21 = &this_2(D)->D.53019.D.51667.D.51581.D.46508;
  OBJ_TYPE_REF(D.93660_20;D.96275_16->25) (D.96275_16, D.93655_21, 2, 1024);

<bb 4>:
  return 0;
}

Note that dummy_var is gone too.

I would conclude that the optimization works for Firefox. It’s just that so far I’ve patched some really boring functions!

Binary size results after optimization

Posted in Seneca by ehren on December 4, 2009

Here’s a quick rundown of the binary size reduction from running my plugin on mozilla-central:

These libraries had a minor code size decrease:

without-plugin 93 ./dist/lib/libfreebl3.so -> /home/ehren/mozilla-central-without-plugin/objdir/nss/freebl/Linux_SIN
with-plugin 90 ./dist/lib/libfreebl3.so -> /home/ehren/mozilla-central-with-plugin/objdir/nss/freebl/Linux_SINGLE_SH
without-plugin 68 ./dist/lib/libnss3.so -> /home/ehren/mozilla-central-without-plugin/objdir/nss/nss/libnss3.so
with-plugin 65 ./dist/lib/libnss3.so -> /home/ehren/mozilla-central-with-plugin/objdir/nss/nss/libnss3.so
without-plugin 75 ./dist/lib/libnssckbi.so -> /home/ehren/mozilla-central-without-plugin/objdir/nss/nssckbi/libnssck
with-plugin 72 ./dist/lib/libnssckbi.so -> /home/ehren/mozilla-central-with-plugin/objdir/nss/nssckbi/libnssckbi.so
without-plugin 74 ./dist/lib/libnssdbm3.so -> /home/ehren/mozilla-central-without-plugin/objdir/nss/nssdbm/libnssdbm
with-plugin 71 ./dist/lib/libnssdbm3.so -> /home/ehren/mozilla-central-with-plugin/objdir/nss/nssdbm/libnssdbm3.so
without-plugin 76 ./dist/lib/libnssutil3.so -> /home/ehren/mozilla-central-without-plugin/objdir/nss/nssutil/libnssu
with-plugin 73 ./dist/lib/libnssutil3.so -> /home/ehren/mozilla-central-with-plugin/objdir/nss/nssutil/libnssutil3.s
without-plugin 72 ./dist/lib/libsmime3.so -> /home/ehren/mozilla-central-without-plugin/objdir/nss/smime/libsmime3.s
with-plugin 69 ./dist/lib/libsmime3.so -> /home/ehren/mozilla-central-with-plugin/objdir/nss/smime/libsmime3.so
without-plugin 76 ./dist/lib/libsoftokn3.so -> /home/ehren/mozilla-central-without-plugin/objdir/nss/softokn/libsoft
with-plugin 73 ./dist/lib/libsoftokn3.so -> /home/ehren/mozilla-central-with-plugin/objdir/nss/softokn/libsoftokn3.s
without-plugin 68 ./dist/lib/libssl3.so -> /home/ehren/mozilla-central-without-plugin/objdir/nss/ssl/libssl3.so
with-plugin 65 ./dist/lib/libssl3.so -> /home/ehren/mozilla-central-with-plugin/objdir/nss/ssl/libssl3.so

Total savings: 24 bytes across 8 libraries.

Everything else was unchanged. The full results can be viewed here.

Note: this is after patching 1677 member functions which resulted in 1305 optimizations performed (this means there are about 1305 value returning calls to these 1677 functions).

Btw I obtained this data running find . -name '*.so' | xargs ls -l > ~/with-plugin.txt on both build directories, and then by sorting and merging the results ie cat without-plugin.txt with-plugin.txt | sort -k 8,15 > merged-size-results.txt

Incidentally, I think the the code size increase I reported previously was just me reading some diff results wrong.

A Treehydra + GCC plugin combo for alwayszero functions

Posted in Seneca by ehren on December 4, 2009

Yesterday (err technically two days ago), I was able to cleanly add my user(("alwayszero")) attribute to about 1780 of the ~6300 functions I have identified as alwayszero within mozilla-central. I was even able to get through a build using my pluginified GCC 4.5. Further, by adding a bit of debug info right into the plugin I was able to count the number of times the optimization was executed ie every time the return value of a call to an alwayszero function is stored. Unfortunately, the results were somewhat disappointing in that the optimization was only executed 1300 times. This doesn’t take into account any additional constant propagation that resulted from these optimizations but, by comparing the sizes of the .so files in the object directory to the .so files of an equivalent non-plugin enabled GCC 4.5 build, there was no code size advantage to be had. In fact a handful of the files increased by 5 or so bytes!

I suppose it’s premature to derive any conclusions at this point, though, with less that a third of the functions patched with my attribute.

Anyway, the next step in completing the patching process was to clean up my analysis to:
1) remove all of the outparams.js code that’s unnecessary for determining the final return value
and
2) ensure that the attribute has not been misapplied (this will result in very bad/interesting things happening at runtime)

During this step, I somehow introduced some very bizarre errors into my Treehydra script that I’ve only now been able to resolve. At some point I’d like to examine the source of these errors but at the moment there are bigger issues.

Perhaps foolishly, I’ve now placed all of my code into bugzilla. The treehydra analysis certainly has a few issues, since, for one thing, if it’s placed into trunk as is (with FIND_NEW_FUNCTIONS = true) static-checking builds will contain an overwhelming amount of alwayszero identifying output. However, at this point, I think that if there are any major issues left (other than writing a Pork patch) they can probably best be discovered in the bug.

One thing I didn’t put in bugzilla was the insane technique I’ve used to identify and patch as many alwayszero functions as possible. For the non-squeamish, I’ll describe the process now.

I first create a typescript of a build running my analysis script. It spits out output like this:

alwayszero function found! location: /home/ehren/mozilla-central/xpcom/glue/nsCategoryCache.cpp:126:51 name: Observe class: /home/ehren/mozilla-central/xpcom/glue/nsCategoryCache.h:65:59 overloadCount: 0 args: nsCategoryObserver*, nsISupports*, const char*, const PRUnichar*

Next I grep for all of these lines removing any duplicates:
grep 'alwayszero function found!' typescript | sort | uniq > uniq.txt

Rather than patch my main mozilla-central repository, I typically create a copy and then rename all the directories in this file. For example with vim:
:%s/mozilla-central/mozilla-central-Copy/g

Overloaded methods are a big headache for the techniques I’m using because, as you’ll see, the matches I’m making are only by function name. So, after identifying them in my treehydra script, I simply remove them (there are only about 130 anyway):

grep 'overrideCount: 0' uniq.txt > no-overrides.txt

Now it’s time for the first shell script. I use the following to print out the (possible) location of the declaration of the function within its class:

# finddecl.sh
while read line ; do
  name=$(echo $line | cut -d' ' -f 7)
  path=$(echo $line | cut -d' ' -f 5)
  classpath=$(echo $line | cut -d' ' -f 9)
  lineno=$(echo $classpath | cut -d':' -f 2)
  classpath=$(echo $classpath | cut -d':' -f 1) # strip off line info
  # start at the beginning of the class and returns the first line containing the 
  # function's name. (actually it starts at line 1 but the d actions skips these lines)
  nameinfo=$(sed -n "1,$lineno d; /$name/{p;=;q;}" $classpath)
  echo $name path: $path classpath: $classpath nameinfo: $nameinfo
done < $1

I then run it like so:

./finddecl.sh no-overrides.txt > found-decls.txt

This output will contain a tonne of blank nameinfo entries because most of the functions I’ve found are behind macros. There are also just lots of phony results, mainly because my search is extremely imprecise. To cut down on the junk, therefore, I do this:

grep '\(virtual\)\|\(NS_IMETHOD \)' found-decls.txt > found-decls-trimmed.txt

All of this will result in output like so:

GetRoleInternal path: /home/ehren/mozilla-central-Copy/accessible/src/html/nsHTMLSelectAccessible.cpp:963:58 classpath: /home/ehren/mozilla-central-Copy/accessible/src/html/nsHTMLSelectAccessible.h nameinfo: virtual nsresult GetRoleInternal(PRUint32 *aRole); 235

Now we’re ready for the second shell script:

# patchdecls.sh
while read line ; do
  fnname=$(echo $line | cut -d' ' -f 1)
  classpath=$(echo $line | cut -d' ' -f 5)
  lineno=$(echo $line | sed -n "s/.*[^0-9]\([0-9]\+\)$/\1/p") # thanks chris!
  # make sure we haven't already applied the attribute
  match=$(sed -n "$lineno s/NS_ALWAYSZERO/&/p" $classpath)
  if [ "$match" == "" ]
  then
    # in-place replace
    sed -in "$lineno s/ $fnname/ NS_ALWAYSZERO&/" $classpath
  fi
done < $1

Running it will result in a patched mozilla-central-Copy:

./patchdecls.sh found-decls-trimmed.txt

(All of this could be done with one script but there are certain advantages to breaking it up)

Earlier, as I mentioned, I did run such a patched build with GCC 4.5 using my optimization plugin. However, after finalizing my treehydra analysis, I know that a number of the functions patched should not have been. The fact that I encountered no compile time errors (naturally I wouldn’t) is a reminder of the danger of this plugin without a complementary static analysis. I still have my crazy unsafe build which I am tempted to try in the CDOT tomorrow. Could be a new FOX special: “when functions return zero”.

Now that I recently enabled a ‘safety checking’ feature in the analysis, where an error raised if a function is inappropriately tagged as alwayszero, the above is not an issue though. In fact, I am more confident in my analysis now that it has identified a number of misapplied attributes than when it simply was only identifying zero-return functions. For those interested, here’s a ramshackle list of functions which my shellscript’s inappropriately patched. Notice that there’s only one syntax error. Note that even though there are only 22 of them, I had a miserable time manually fix them all/waiting for the compiler to identify them!

To wrap this up, I should mention the results of running a final static-checking build after all the patching had been finalized. As you can see here, there are only 4514 left to go. Woohoo!