Ehren's Blog

Dead code update (and problems finding uninitialized class members)

Posted in Seneca by ehren on February 25, 2010

Well, I’ve fallen way behind with this blog again. Unfortunately, what I’ve done so far hasn’t worked out exactly as expected either. Anyway, here’s an update.

Finding Dead Code
I’ve refined the script I posted previously to account for the fact that indirect calls to virtual functions register in g++ as calls to the base method (among a few other things). It can be viewed here. Running this new analysis, I’ve learned that finding dead methods is even harder than finding dead functions. Why? A ton of stuff is exported as part of the Mozilla public API. A function, or even an entire class, may look dead but still be used via JavaScript/XPCOM inside the Mozilla tree (or perhaps in an extension).

As far as I know, anything that derives from a class defined in an IDL file probably can’t be marked dead, and I suspect detecting this case will be the most difficult part. One thing I could take into account, however, is methods marked as NS_SCRIPTABLE, i.e. __attribute__((user("NS_script"))). It would be easy to add this info to the node table in callgraph. Also, with regard to plain functions, I think it would be prudent to record when a function’s address is taken inside another function.
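For example, a hypothetical scriptable column on the node table would make it easy to keep NS_SCRIPTABLE methods out of the dead-method candidates. A rough sqlite3 sketch (the column name and database filename are placeholders, not the current callgraph schema):

import sqlite3

# Sketch only: assumes the node table gains a 'scriptable' column (0/1)
# recorded whenever a method carries the NS_SCRIPTABLE attribute.
conn = sqlite3.connect('callgraph.sqlite')   # filename is illustrative
cur = conn.cursor()

# Add the column if it is not already there (ignore the error if it is).
try:
    cur.execute("ALTER TABLE node ADD COLUMN scriptable INTEGER DEFAULT 0")
except sqlite3.OperationalError:
    pass

# When reporting dead-method candidates, skip anything reachable from script.
cur.execute("SELECT name FROM node WHERE scriptable = 0")
candidates = [row[0] for row in cur.fetchall()]
conn.commit()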

On the bright side, I think I’ve found a handful of unused functions (view sample output of the script showing these here). I may be able to glean more info once I’m able to actually query my paths table, though. Unfortunately, the latest incarnation of the script is only about 85% finished:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                         
28977 ehren     20   0 17804  12m 2132 R 100.1  0.4   2331:07 python

That’s CPU time!

Uninitialized class members
I’ve also put a bit of work into a more immediate analysis. Bug 525063 proposes an error for any class with uninitialized data members. It seemed like an interesting project, perhaps as an introduction to ESP analysis (especially since the only state transitions are assignments and passing by reference). In fact, dmandelin has already done work on an analysis for uninitialized variables in general, which I originally attempted to adapt. That turned out to be harder than expected, although I have just recently learned why I wasn’t successful (I think it relates to a malfunction in create_decl_set which I have yet to diagnose).

Giving up on ESP for the moment, I tried my hand at doing things the simple way. That is, if there’s any code at all in the constructor that initializes a variable, then I consider that variable initialized.
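In other words, something like this toy Python sketch (the real script walks the constructor’s body in Treehydra rather than comparing lists, and mRules is a made-up field name):

def uninitialized_members(declared_fields, ctor_assigned_fields):
    """The 'simple way': a member counts as initialized if the constructor
    assigns it anywhere at all, with no path sensitivity whatsoever."""
    return set(declared_fields) - set(ctor_assigned_fields)

# Toy data: mCSSAware comes up later in the post; mRules is invented.
print(uninitialized_members(['mCSSAware', 'mRules'], ['mRules']))
# -> {'mCSSAware'}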

To accomplish this, I first wrote a rather ugly and unreadable version of the script, then polished it up a bit. My days of using JS arrays for maps and sets are hopefully over and, in fact, I think I’d be able to take another run at an ESP analysis now. Unfortunately, though, running my simple script has uncovered more fundamental problems.

The bug comments mention classes with an overridden operator new (containing a memset(0) to initialize the data members) as one special case, but this is nowhere near the largest problem. Consider this MDC page on XPCOM:

You can provide an initialization function for your class. This will be called immediately after your class is allocated and the constructor is called. The init function takes 0 arguments, returns an nsresult, and must be public. You can call it anything you like, just reference it from NS_GENERIC_FACTORY_CONSTRUCTOR_INIT (discussed below).

Now we have the situation of variable initialization split across multiple functions. In fact, it’s even more complicated! Here’s an example:

mCSSAware (part of nsHTMLEditor) is one ‘uninitialized’ member turned up by the script. Rather than being initialized directly in Init() (or in the constructor), the initialization takes place in UpdateForFlags, which is called from Init() here. Luckily, in this case the initializations are all within the same translation unit, although I suspect that is not always the case.

So, what would the solution be? If it’s worth doing, some kind of interprocedural analysis will have to be performed. I was thinking of collecting all the ‘uninitialized’ fields during compilation of both the constructor and any other function identified as an initializer. Initializers must include any member function called by an initializer (note that passing a member by reference is treated as simple initialization). If the set of uninitialized fields in a particular constructor, intersected with the sets of fields from the other initializers, is non-empty, you’ve found a genuine uninitialized field.

I was thinking that if every initialization takes place in the same translation unit, it might be possible to do this within the current static-checking framework, i.e. by compiling only once. I’d basically just have to identify the set of data members not initialized by each member function while building a mini call graph, in order to eventually determine which of those member functions is really an initializer. It would then be a simple matter to take the intersection of the relevant sets of uninitialized fields.
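Sketching that in Python (toy data only: mRules is a made-up second field, and the real per-function sets would come from walking each member function’s body):

def truly_uninitialized(initializer_roots, uninit_by_function, calls):
    """Intersect the sets of fields left uninitialized by the constructor,
    by any other identified initializer (e.g. an XPCOM Init()), and by every
    member function they transitively call.  Whatever survives all of the
    intersections was initialized nowhere."""
    result = None
    worklist = list(initializer_roots)
    seen = set()
    while worklist:
        fn = worklist.pop()
        if fn in seen:
            continue
        seen.add(fn)
        uninit = uninit_by_function.get(fn, set())
        result = set(uninit) if result is None else result & uninit
        worklist.extend(calls.get(fn, []))
    return result or set()

# Data loosely modelled on the nsHTMLEditor example above; mRules is invented.
uninit_by_function = {
    'nsHTMLEditor::nsHTMLEditor':   {'mCSSAware'},            # ctor sets mRules only
    'nsHTMLEditor::Init':           {'mCSSAware', 'mRules'},  # delegates the work
    'nsHTMLEditor::UpdateForFlags': {'mRules'},               # sets mCSSAware
}
calls = {'nsHTMLEditor::Init': ['nsHTMLEditor::UpdateForFlags']}
roots = ['nsHTMLEditor::nsHTMLEditor', 'nsHTMLEditor::Init']
print(truly_uninitialized(roots, uninit_by_function, calls))  # -> set()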

However, I suspect there are more than a few cases where member function definitions are spread across translation units (actually, that might be a neat Dehydra analysis if this behaviour is unwanted). To deal with this case I’d have to store the sets of ‘uninitialized’ data members until the entire build is complete, while also determining the set of initializing functions. I’d then have to determine which data members are really initialized in a post-compilation analysis.

It might also be possible to produce the errors while the build is still under way if I could query an SQLite database from within a Treehydra script. I’m under the impression that the only way to use SQLite from JS is via xpcshell, though, so I don’t know whether this suggestion would work. You’d still have to run a callgraph build prior to compilation anyway, which takes quite some time.

There’s also the issue of the accuracy of any analysis when the call graph will likely have to be pretty dumb, i.e. infeasible code in a particular function that contains a call will still end up registering as an edge. I suppose it would also be necessary to relate a function’s parameters (and any global data) to whether or not a particular call within the function is feasible (and so on). The ESP paper does discuss a framework for interprocedural analysis, but that’s way out of my scope. The analysis would probably still be worthwhile without taking this kind of thing into account; it just somewhat defeats the purpose of performing an ESP analysis to find which members are uninitialized.

Anyway, I’ll report back (hopefully without such a long delay) when I’ve determined the next course of action.

A static analysis for fallthrough switch cases in C++

Posted in Seneca by ehren on February 8, 2010

For my 0.5 release in DPS911 I’d like to discuss some work on a Treehydra script for detecting fallthrough cases in a switch statement. I’ve actually been working on this off and on for some time, but it’s only been within the last few hours that I’ve arrived at something airtight.

I actually began work on such a script quite a while ago, after completing similar work on finding unreachable blocks in the control flow graph. Most of the serious bugs turned up by that analysis were caused by a forgotten break statement (see bug 536438 and bug 536646), so a warning/error for this seemed natural. Also, I assumed this would be easy: just find all the basic blocks corresponding to the case labels of the switch and, if the successor of a block corresponds to the next case, you’ve got a fallthrough.
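In rough Python terms (a toy sketch over plain block numbers and successor lists, not Treehydra’s actual CFG objects), that naive check is just:

def naive_fallthroughs(case_blocks, successors):
    """The first, naive idea: case_blocks lists the basic block of each case
    label in source order, successors maps a block to the blocks it can flow
    into.  If a case's block flows straight into the next case's block, we
    call it a fallthrough."""
    falls = []
    for this_block, next_block in zip(case_blocks, case_blocks[1:]):
        if next_block in successors.get(this_block, ()):
            falls.append((this_block, next_block))
    return falls

# Block numbers taken from the foo() CFG dump below: case 0 is block 3,
# case 1 is block 4.
print(naive_fallthroughs([3, 4], {2: [3, 4, 5], 3: [4], 4: [5]}))
# -> [(3, 4)]   i.e. case 0 falls through into case 1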

To make this more clear, here’s a simple switch statement that falls through:

int foo(int x) {
  switch(x) {
    case 0:
      x++;
    case 1:
      x--;
  }
}

And here’s a -fdump-tree-all printing of the CFG:

;; Function int foo(int) (_Z3fooi)

int foo(int) (x)
{
  # BLOCK 2
  # PRED: ENTRY (fallthru)
  switch (x)
    {
      case 0: goto <L0>;
      case 1: goto <L1>;
      default : goto <L2>;
    }
  # SUCC: 3 4 5

  # BLOCK 3
  # PRED: 2
<L0>:;
  x = x + 1;
  # SUCC: 4 (fallthru)

  # BLOCK 4
  # PRED: 2 3 (fallthru)
<L1>:;
  x = x + -1;
  # SUCC: 5 (fallthru)

  # BLOCK 5
  # PRED: 2 4 (fallthru)
<L2>:;
  return;
  # SUCC: EXIT
}

Notice something funny? There’s now a default case label! GCC always inserts this dummy default by the time you’ve reached the CFG and, even worse, at this level there’s no way to distinguish a switch statement with a real default from one without. Unfortunately, I spent a great deal of time trying to do just that, at various points thinking I finally had the solution only to discover a new test case that would totally blow apart the analysis. I think I’ve tried maybe a dozen different approaches, only to give up in disgust at various points. See here for one totally invalid early attempt.

So what’s the solution? I needed information from two separate passes of the compiler: the CFG, but also the initial C++ AST representation, to find out whether the default was really a default. To this end, I knew walk_tree would be needed. I was even able to find a bug in Treehydra during some initial experiments. However, as before, once I thought I had the solution, a more complex test case would ruin everything.
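Combining the two, the check amounts to something like this (again a toy Python sketch rather than the actual Treehydra code; has_real_default stands for the bit of information recovered from the AST walk):

def fallthroughs(case_blocks, successors, default_block, has_real_default):
    """Extend the naive check from earlier: the CFG alone can't tell a real
    default from GCC's synthesized one, so a separate AST walk supplies
    has_real_default.  Flowing into the synthesized default is then ignored."""
    labels = list(case_blocks)
    if has_real_default:
        labels.append(default_block)
    falls = []
    for this_block, next_block in zip(labels, labels[1:]):
        if next_block in successors.get(this_block, ()):
            falls.append((this_block, next_block))
    return falls

# With foo() above: case 1 (block 4) flows into the synthesized default
# (block 5), but since has_real_default is False only case 0 -> case 1
# is reported.
print(fallthroughs([3, 4], {3: [4], 4: [5]}, 5, has_real_default=False))
# -> [(3, 4)]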

Anyway, here’s the final-ish script, which also employs a fallthrough() function/annotation to suppress the warnings. Once I’ve had a bit of time to sleep on it, I’m going to refactor a bit and then post to the bug. I also need to consider how to integrate this into a --with-static-checking build. There are a number of issues I had to take into account that I haven’t mentioned as well, such as when the order of the cases is mixed up (e.g. case 1 before case 0) and when labels and gotos come into the mix. (Properly dealing with switches embedded within switches … was the last hurdle.)

For a demonstration, here’s a nasty bit of code together with a sample run of the script.

So, for my 0.6 release, I’ll have to get back to dead code, hopefully being able to exploit callgraph to the fullest. Perhaps I can find another analysis to start work on as well.

Computing connected components with SQLite (and other attempts finding dead code)

Posted in Seneca by ehren on February 4, 2010

In the last couple of weeks I’ve put a bit of effort into identifying unused functions in mozilla-central. Unfortunately, I have not been keeping up with this blog, which is a major requirement in DPS911.

As I previously reported, I’ve been using a Treehydra-generated call graph (callgraph), which records caller-callee relationships in a giant sqlite3 database. As mentioned, my first attempt was to ignore the directedness of each edge in this graph and then compute the connected components. I believe this is equivalent to computing the weakly connected components of a directed graph.

I have based my code almost entirely on work done by David MacIver, who has an interesting post here where you can read more about his algorithm. It works by initially placing each vertex into its own component and then iteratively merging distinct components wherever there is an edge between two vertices in the respective components. Unfortunately, his Ruby, written for MySQL, is not directly usable for my purposes since SQLite, which I’m working with, does not support the update-join syntax. Here’s a bit of MacIver’s code, for example:

    update items
    join (
      select component1 source, min(component2) target
      from components_to_merge
      group by source
    ) new_components
    on new_components.source = component_id
    set items.component_id = least(items.component_id, target)

To create the equivalent code, I used a temporary table and a bit of a kludge:

    INSERT INTO components_to_merge
    SELECT component2, component1 FROM components_to_merge;

    DROP TABLE IF EXISTS new_components;
    CREATE TABLE new_components
    AS 
    SELECT component1 source, min(component2) target
    FROM components_to_merge
    GROUP BY source;

    UPDATE node
    SET componentID = min((SELECT min(componentID, target)
                           FROM new_components
                           WHERE source = componentID),
                          componentID)
    WHERE componentID in
        (SELECT source FROM new_components);

That update statement is horribly inefficient, but ultimately the job gets done. All told, it takes about an hour on one of the CDOT’s development machines. As MacIver noted in his original blog post, you’re basically done once the first merge has completed. The complete script, written in Python, can be viewed here.
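For reference, the heart of the driver loop looks roughly like this (a simplified sketch only: the edge table and its caller/callee column names are stand-ins rather than the exact callgraph schema, and the database filename is illustrative):

import sqlite3

conn = sqlite3.connect('callgraph.sqlite')   # filename is illustrative
cur = conn.cursor()

cur.execute("UPDATE node SET componentID = id")   # each node starts alone
while True:
    # Pairs of distinct components joined by at least one (undirected) edge.
    cur.execute("DROP TABLE IF EXISTS components_to_merge")
    cur.execute("""CREATE TABLE components_to_merge AS
                   SELECT DISTINCT n1.componentID component1,
                                   n2.componentID component2
                   FROM edge e
                   JOIN node n1 ON n1.id = e.caller
                   JOIN node n2 ON n2.id = e.callee
                   WHERE n1.componentID != n2.componentID""")
    if cur.execute("SELECT count(*) FROM components_to_merge").fetchone()[0] == 0:
        break
    # Same statements as shown above: symmetrize, pick merge targets, update.
    cur.execute("""INSERT INTO components_to_merge
                   SELECT component2, component1 FROM components_to_merge""")
    cur.execute("DROP TABLE IF EXISTS new_components")
    cur.execute("""CREATE TABLE new_components AS
                   SELECT component1 source, min(component2) target
                   FROM components_to_merge GROUP BY source""")
    cur.execute("""UPDATE node
                   SET componentID = min((SELECT min(componentID, target)
                                          FROM new_components
                                          WHERE source = componentID),
                                         componentID)
                   WHERE componentID IN (SELECT source FROM new_components)""")
    conn.commit()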

Note that a few domain-specific refinements had to take place here. For example, any indirect call to a virtual function is registered as a call to the function in the base class. Callgraph already deals with this using an implementors table, but I had to add an extra column for the return type of the function in order to accurately construct the names of both member functions. I then merge their components:

# set interface component to implementor component
cursor.execute("SELECT implementor, interface, method, type \
                FROM implementors")
for implementor, interface, method, rettype in cursor.fetchall():
    # format: type interface::method
    basename = "%s %s::%s" % (rettype, interface, method)
    # format: type implementor::method
    derivedname = "%s %s::%s" % (rettype, implementor, method)
    # parameterized query avoids quoting problems with C++ names
    cursor.execute("UPDATE node \
                    SET componentID = (SELECT componentID \
                                       FROM node \
                                       WHERE name = ?) \
                    WHERE name = ?", (basename, derivedname))

Also, whenever a function is called outside of the compilation unit in which it’s defined, the call points to the location of the function’s forward declaration (e.g. the .h file). To get around this, I simply merge all components that contain nodes with the same name:

SELECT n.name, min(nn.componentID)
FROM node n, node nn
WHERE n.name = nn.name AND
      n.componentID != nn.componentID
GROUP BY n.componentID

Of course, some nuance is required to interpret the data produced by this script. As I mentioned previously, the analysis is complicated by the fact that many functions which are “never called” are in fact called indirectly via a function pointer. This means that most connected components of size one aren’t particularly interesting. In fact, since I’m merging declarations and forward declarations, most components of size two aren’t particularly interesting either (at least those whose elements share the same name). This script takes the above into account and can be used to produce these results:

A list of connected components (that aren’t the largest component) with size greater than 3
and
A list of components of size 2 (that aren’t the largest component) that have members with distinct names.

The vast majority of these are false positives, mostly for the reasons mentioned above, but in the last week I finally bit the bullet and filed a bug, along with a patch, to remove a bit of the dead code identified. Unfortunately, these functions are all in the cairo library, which means that applying such a patch would make tracking upstream changes unnecessarily difficult.

A number of issues have been identified here, though. Previously, in connection with an analysis to find non-static functions called only within a particular compilation unit, Taras filed this bug to add -fdata-sections -ffunction-sections -Wl,-gc-sections to the build config in order to strip out dead symbols. Interestingly enough, running a few of my own tests with this configuration, I was not able to strip out all of these dead cairo functions. Even more bizarrely, the symbols of a number of dead static (!) functions seemed to work themselves into libxul?!

As an aside, the nm utility would have been quite useful when working on the alwayszero project. (To get similar results I was previously running debug builds and then grepping the output of objdump.)

A path based algorithm
There are issues with the above analysis, though. For example, a function could be unused yet call functions that are used. This script begins at a particular node and iterates backwards, transitively adding all callers. If the callers to be added are exhausted within a reasonable number of iterations, we’ve found a dead path. Note that to narrow down this analysis I’ve disregarded everything already identified by the component-based analysis, i.e. I do not start with any node that is outside the largest connected component.
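In rough Python terms, the idea looks like this (callers_of stands in for the database query, and the function names in the example are made up):

def dead_path_from(start, callers_of, max_rounds=20):
    """Walk backwards from a candidate function, transitively adding its
    callers.  If the callers are exhausted within a reasonable number of
    rounds, everything collected is a candidate dead path; if the set keeps
    growing, give up and assume the function is live."""
    path = {start}
    frontier = {start}
    for _ in range(max_rounds):
        new = set()
        for fn in frontier:
            new |= set(callers_of.get(fn, ())) - path
        if not new:
            return path              # no more callers: a dead-path candidate
        path |= new
        frontier = new
    return None                      # too many callers reached; treat as live

# Hypothetical two-function chain where nothing else calls helperA:
print(dead_path_from('helperB', {'helperB': ['helperA'], 'helperA': []}))
# -> {'helperA', 'helperB'}   (set order may vary)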

One thing I should note is that this script requires explicit transaction management (using cursor.execute("BEGIN TRANSACTION") and connection.commit()). For anyone faced with mysterious insert/update anomalies in SQLite, this may be the answer.
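With sqlite3 that looks something like this (a minimal sketch; the inserted values are placeholders, and isolation_level=None keeps the module from issuing its own implicit BEGINs):

import sqlite3

conn = sqlite3.connect('callgraph.sqlite', isolation_level=None)
cur = conn.cursor()

cur.execute("BEGIN TRANSACTION")
try:
    # placeholder row; the real script batches many inserts per transaction
    cur.execute("INSERT INTO path (pathID, id) VALUES (?, ?)", (1, 42))
    conn.commit()
except sqlite3.Error:
    conn.rollback()
    raise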

Anyway, getting useful data out of this is a little bit more complicated. For example, to see info about every function in all paths that contain only functions, the following query can be used:

SELECT * FROM node n
JOIN path p ON (n.id = p.id)
WHERE pathID NOT IN 
(SELECT pathID FROM node n
 JOIN path p ON (p.id = n.id)
 WHERE isMethod = 0) 
ORDER BY pathID;

Has this analysis yielded any more candidates for removal, though? Unfortunately, I have not taken indirect calls to overridden member functions into account (e.g. Base* b = new Derived(); b->foo();), so an analysis for dead methods is currently out. There is an easy fix; however, it will require another 18-hour run of the script (building a path backward from a start set of 100000+ nodes = a boatload of time). Analyzing just functions, though, i.e. using the query above, I have found some interesting possibilities. For example, there’s quite a bit of dead stuff in nameprep.c, but I suspect this is an even bigger nightmare to patch than cairo.

I will report back soon, however, when I have more concrete results.