Ehren's Blog

Computing connected components with SQLite (and other attempts finding dead code)

Posted in Seneca by ehren on February 4, 2010

In the last couple of weeks I’ve put a bit of effort into identifying unused functions in mozilla-central. Unfortunately, I have not been keeping up with this blog, which is a major requirement in DPS911.

As I previously reported, I’ve been using a Treehydra generated call graph (callgraph) which records caller-callee relationships in a giant sqlite3 database. As mentioned, my first attempt was to ignore the directedness of each edge in this graph, and then compute the connected components. I believe this is equivalent to computing the weakly connected components in a directed graph.

I have based my code almost entirely off work done David MacIver who has an interesting post here where you can read more about his algorithm. It works by initially placing each verticex into its own component and then iteratively merging distinct components where there is an edge between two vertices in the respective components. Unfortunately, his Ruby written for Mysql is not directly usable for my purposes since SQLite, which I’m working with, does not support the update join syntax. Here’s a bit of MacIver’s code for example:

    update items
    join (
      select component1 source, min(component2) target
      from components_to_merge
      group by source
    ) new_components
    on new_components.source = component_id
    set items.component_id = least(items.component_id, target)

To create the equivalent code, I used a temporary table and a bit of kludge:

    INSERT INTO components_to_merge
    SELECT component2, component1 FROM components_to_merge;

    DROP TABLE IF EXISTS new_components;
    CREATE TABLE new_components
    AS 
    SELECT component1 source, min(component2) target
    FROM components_to_merge
    GROUP BY source;

    UPDATE node
    SET componentID = min((SELECT min(componentID, target)
                           FROM new_components
                           WHERE source = componentID),
                          componentID)
    WHERE componentID in
        (SELECT source FROM new_components);

That update statement is horribly inefficient but ultimately the job gets done. All told, it takes about an hour on one of the CDOT’s development machines. As MacIver noted in his original blog post, you’re basically done once the first merge has completed. The complete script, written in python, can be viewed here.

Note that a few domain specific refinements had to take place here. For example, any indirect calls to virtual functions will be registered as a call to the function in the base class. Callgraph already deals with this using an implementors table but I had to add an extra column for the return type of the function in order to accurately construct the name of both member functions. I then merge their components:

# set interface component to implementor component 
cursor.execute('SELECT implementor, interface, method, type \
                FROM implementors')
list = cursor.fetchall()
for row in list:
    # format: type interface::method
    basename = "%s %s::%s" %(row[3], row[1], row[2])
    # format: type implementor::method
    derivedname = "%s %s::%s" %(row[3], row[0], row[2])
    cursor.execute("UPDATE node \
                    SET componentID = (SELECT componentID \
                                       FROM node \
                                       WHERE name = '%s') \
                    WHERE name = '%s'" %(basename, derivedname))

Also, whenever a function is called outside of the compilation unit in which it’s defined, the call will point to the location of the forward declaration of the function (eg the .h file). To get around this, I simply merge all components that contain nodes with the same name:

SELECT n.name, min(nn.componentID)
FROM node n, node nn
WHERE n.name = nn.name AND
n.componentID != nn.componentID
GROUP by n.componentID

Of course, some nuance is required to interpret the data produced by this script. As I mentioned previously, the analysis is complicated by the fact that many functions “never called” are in fact called indirectly via a function pointer. This means that most connected components of size one aren’t particularly interesting. In fact, since I’m merging declarations and forward declarations, most components of size two also aren’t particularly interesting (at least those with elements containing the same name). This script takes the above into account and can be used to produce these results:

A list of connected components (that aren’t the largest component) with size greater than 3
and
A list of components of size 2 (that aren’t the largest component) that have members with distinct names.

The vast majority of these are false positives mostly for the reasons mentioned above, but In the last week I finally bit the bullet and filed a bug, along with a patch, to remove a bit of the dead code identified. Unfortunately, these functions are all in the cairo library, which means that applying such a patch would make tracking upstream changes unnecessarily difficult.

A number of issues have been identified here though. Previously, in connection with an analysis to find non-static functions called only within a particular compilation unit, Taras filed this bug to add -fdata-sections
-ffunction-sections -Wl,-gc-sections
to the build config in order to strip out dead symbols. Interestingly enough, running a few of my own tests with this configuration, I was not able to strip out all of these dead cairo functions. Even more bizarrely, the symbols of a number of dead static (!) functions seemed to work themselves into libxul?!

As an aside, the nm utilty would have been quite useful when working on the alwayszero project. (To get similar results I was previously running debug builds and then grepping the output of objdump).

A path based algorithm
There are issues with the above analysis though. For example, a function could be unused yet call functions that are used. This script begins at a particular node and iterates backwards, transitively adding all callers. If the callers to be added are exhausted within a reasonable number of iterations, we’ve found a dead path. Note that to narrow down this analysis, I’ve disregarded everything already identified by the component based analysis ie I do not start with any node not in the largest connected component.

One thing I should note is that this script requires explicit transaction management to run (using cursor.execute("BEGIN TRANSACTION") and connection.commit()). For anyone faced with mysterious insert/update anomalies in sqlite, this may be the answer.

Anyway, getting useful data out of this is a little bit more complicated. For example, to see info about every function in all paths that contain only functions, the following query can be used:

SELECT * FROM node n
JOIN path p ON (n.id = p.id)
WHERE pathID NOT IN 
(SELECT pathID FROM node n
 JOIN path p ON (p.id = n.id)
 WHERE isMethod = 0) 
ORDER BY pathID;

Has this analysis yielded any more candidates for removal though? Unfortunately, I have not taken indirect calls to overridden member functions into account eg Base* b = new Derived(); b->foo();, so an analysis for dead methods is currently out. There is an easy fix, however, it will require another 18 hr run of the script (building a path backward from a start set of 100000+ nodes = a boatload of time). Analyzing just functions though, ie using the query above, I have found some interesting possibilities. For example, there’s quite a bit of dead stuff in nameprep.c, but I suspect this is an even bigger nightmare to patch than cairo.

I will report back soon, however, when I have more concrete results.

Advertisements

One Response

Subscribe to comments with RSS.

  1. Ceiling fans come in various designs, designs, form, and sizes.
    Wobbling and vibrating place significant stress on many parts of the fan and will ultimately shorten its life.
    Also check the minimum clearance from the mounting point to the closest wall surface.


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: