On The Passing Of Richard “Cyberlot” Thomas

Jeff Moore makes a very nice post here about Richard’s passing. He is survived by his wife and daughter, among others. Please consider donating to their assistance fund.

I was first acquainted with Cyberlot by email and blog posts, and met him in person more than once at various conferences. He was always kind and congenial, and never showed any arrogance or bad attitude. He was a considerate and thorough contributor to the Solar project, and I have him to thank for the “percent-of-PHP” measurement statistic on the framework benchmarking project. I always give him credit for that in my talks, and so his name will continue to live on there.

And now, a practical note: A lot of PHP folk out there are freelancers or independent consultants, or are in other kinds of unstable job situations. If you are one of these, and you have a family, *please* consider purchasing term life insurance to take care of your loved ones if you pass suddenly. Get it even if you are very young. It is not expensive. It’s not the only thing you should do to prepare, but it’s an important thing.

Comparing Benchmark Tools

As I noted last week, I have moved my framework benchmarking project to GitHub. As part of the move, I updated the project to allow benchmarking using any of three tools: Acme http_load, Apache ab, or JoeDog siege. (For reference, the old project will remain at Google Code.)

I thought it might be interesting to see what each of them reports for the baseline “index.html” and “index.php” cases on the new Amazon EC2 setup (using a 64-bit OS on an m1.large instance). The results follow (all are at 10 concurrent users, averaged over 5 one-minute runs):

ab                       |      rel | avg req/sec |
------------------------ | -------- | ----------- |
baseline-html            |   1.2660 |     3581.54 |
baseline-php             |   1.0000 |     2829.11 |

http_load                |      rel | avg req/sec |
------------------------ | -------- | ----------- |
baseline-html            |   1.2718 |     4036.24 |
baseline-php             |   1.0000 |     3173.56 |

siege                    |      rel | avg req/sec |
------------------------ | -------- | ----------- |
baseline-html            |   1.2139 |     5060.25 |
baseline-php             |   1.0000 |     4168.76 |
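
For reference, roughly equivalent raw invocations of the three tools (10 concurrent clients, one-minute runs) would look something like this; the exact flags, paths, and URLs below are illustrative guesses, not what the project's bench scripts actually run:

ab -c 10 -t 60 http://localhost/baseline-html/index.html
http_load -parallel 10 -seconds 60 urls.txt
siege -b -c 10 -t 1M http://localhost/baseline-html/index.html

(For http_load, urls.txt is a hypothetical file containing the target URL, since http_load reads its URLs from a file; siege's -b flag disables its default per-request delay so the numbers are comparable.)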

They all show very different “absolute” numbers of requests/second: ab thinks the server delivers about 3600 req/sec, http_load reports about 4000, and siege says about 5000.

Note that the ab and http_load relative scores are in line with each other: both report the static HTML case as roughly 26-27% faster than the PHP case. Siege thinks PHP is more responsive than that, showing a gap of only about 21%.

Which of these is the most accurate? I don’t know. I ran each benchmarking tool on the same server being benchmarked, so the differences may result from how much processing power the benchmarking tools themselves consumed.

One interesting point is that ab no longer appears to be over-reporting the baseline cases, as I noted in an earlier benchmark posting. There are two major changes between then and now: (1) the updated project uses Ubuntu 10.10 instead of 8.10, so the packaged ab binary may have been flawed earlier, or the newer OS may correct some other issue; (2) the updated project uses an m1.large 64-bit instance instead of an m1.small 32-bit instance. Either of those differences might account for ab’s earlier over-reporting.

PHP Framework Benchmarks on Github

As part of “trying new things,” I have moved my web frameworks benchmark project over to Git on Github and away from Subversion on Google Code.

This project is often imitated and occasionally adopted. For all you framework fans who want to compare your preferred systems to the ones officially included in the project, you can now fork the repo and add your favorites. Who knows, some may make their way back onto the officially-included list.

Additionally, I have modified the project so that you can use any one of three different benchmarking tools: ApacheBench (ab), JoeDog siege, or ACME http_load. After you follow the setup instructions, you can run benchmarks using each of the different tools against the same benchmark targets:

./bench/ab.php targets/baseline.ini
./bench/siege.php targets/baseline.ini
./bench/httpload.php targets/baseline.ini

Comments or questions? Leave a note below.

Regarding Underscores

Today, PHPDeveloper.org referred to a post by Leszek Stachowski about underscore prefixes on non-public class elements.

The question which comes instantly to my mind is: why? Is there any reason why this convention should be kept when PHP object oriented programming has gone a long way since PHP 4 (when there was no access modifiers and such underscore was the only fast way to distinguish public from, hmm, not public methods and properties) ? Are, for instance (as one of major OOP languages), Java coding standards pushing towards such naming convention? No!

I think that we, as developers, should not stick to this silly convention. For the sake of progress, stop looking back (because that what in fact this convention is) and stop supporting this one, particular naming convention.

I think the underscore prefix for protected and private is a good convention to keep. As with many things about programming, this convention is not for the program; it is for the programmer. For example, limiting your line length to about 80 characters is a good idea, not for reasons of “tradition”, but because of the cognitive limitations of the human beings who have to read the code.

Likewise, using the underscore prefix is an immediate and continuous reminder that some portions of the code are not public; its purpose is to aid the programmer. The underscores make it obvious which parts of the program are internal and which parts are externally available. (Note that I do not extend this argument to support the use of Hungarian Notation in PHP; if something like the underscore prefix is overused, it loses its obviousness and thus becomes less powerful.)

As an example, look at the following code:

<?php
class NoUnderscores
{
    protected $data = array(
        'item' => 'magic-data',
    );

    protected $item = 'property-value';

    public function __get($key)
    {
        return $this->data[$key];
    }

    protected function doSomething()
    {
        // do we want the magic public item,
        // or the internal protected item?
        return $this->item;
    }
}

Here we have a magic __get() method that reads from the protected $data property. Any time you try to access a property that does not exist, PHP will call __get(), which reads from the protected $data. Now look at the doSomething() method. Because the code executes inside the class, it has access to the protected $item, so it is not obvious whether the programmer wanted the value of the protected $item or the magic $data['item'].

By way of comparison, take a look at the following modification to use the underscore prefix on private and protected elements:

<?php
class Underscores
{
    protected $_data = array(
        'item' => 'magic-data',
    );

    protected $_item = 'property-value';

    public function __get($key)
    {
        return $this->_data[$key];
    }

    protected function _doSomething()
    {
        // it is clear we want the internal protected item
        return $this->_item;
    }
}

Now the _doSomething() method is perfectly clear: the programmer wants the value of the internal protected property.

The Miserable Mathematics of the Man-Month

We’ve all heard of Fred Brooks’ law regarding the mythical man-month. (If you have not heard of it, stop reading this and go read The Mythical Man-Month.) The rule is this:

Adding manpower to a late software project makes it later.

He concludes his essay with these words:

The number of months of a project depends on its sequential constraints. The maximum number of men depends on the number of independent subtasks. From these two quantities one can derive schedules using fewer men and more months. (The only risk is product obsolescence.) One cannot, however, get workable schedules using more men and fewer months.

Once a development project has started, and appears to be taking longer than desired, adding more developers will make the project even later. Brooks explains that there are two reasons for this:

  1. It takes at least one developer, often more, to brief, train, educate, and otherwise inculcate the new developers into the project. That developer is unproductive during this time, and so are the new developers.

  2. The amount of communication between everyone on the project grows combinatorially with each new developer added (each new developer adds a line of communication to every existing developer), but the amount of productive capacity increases only linearly.

But exactly how bad is the falloff in productivity? “Adding people will make it more late”, but just how much later will it be? Is there a way to predict the schedule effects of adding developers?

Point 1 (the training of the new developers by one or more existing developers) is relatively easy to model. If you add one new developer, and it takes one existing developer a week to train him, then you just added at least a week to the schedule when neither of them is productive. The existing developer got no productive work done, and neither did the new one.

After that, the developers have to make up the lost time. To discover the effect of point 2 above (the communication costs involved once a new developer is in place), I plugged some numbers into a spreadsheet. The communication costs are quite dramatic.

Below is a table representing a one-developer project, along with the communication costs and resulting schedule compression when adding developers. I’ll give some narrative about the table afterwards, and then close with some very strong caveats.

Do not interpret this as science; at best, it is an initial guide to expectations, subject to further revision as I explore the topic more deeply.

working devs | comm. links | dev. + comm. | prod. rate | prod. output | time factor |
------------ | ----------- | ------------ | ---------- | ------------ | ----------- |
           1 |           0 |            1 |       1.00 |         1.00 |        1.00 |
           2 |           1 |            3 |       0.67 |         1.33 |        0.75 |
           3 |           3 |            6 |       0.50 |         1.50 |        0.67 |
           4 |           6 |           10 |       0.40 |         1.60 |        0.63 |
           5 |          10 |           15 |       0.33 |         1.67 |        0.60 |
           6 |          15 |           21 |       0.29 |         1.71 |        0.58 |
           7 |          21 |           28 |       0.25 |         1.75 |        0.57 |
           8 |          28 |           36 |       0.22 |         1.78 |        0.56 |
           9 |          36 |           45 |       0.20 |         1.80 |        0.56 |
          10 |          45 |           55 |       0.18 |         1.82 |        0.55 |

In the first row, we have the baseline case. One developer doesn’t have to talk with anyone else, so there are no communication costs. By default, his productivity rate is 100% of whatever he would normally do, so his output is also at 100%. If the one-man project is estimated to take 10 weeks, then it takes 10 weeks (a time factor of 100%).

Now, let’s say we add another developer. The first thought is that the project should now take half as long (5 weeks). But! Although we have now two developers, they have to spend time talking to each other and coordinating their activity. That means we have 2 “units” of development happening, but an additional “unit” of communication. Some of the combined effort of the two developers is spent talking to each other. Instead of getting 100% more productive output (two developers instead of one), they are together only about 33% more productive. That means the remaining project time won’t be cut to 50%; instead, it will be reduced to 75% at best. (And to boot, if we spent a week training the new developer, it means we need to add that week to the schedule as well.)

It only gets worse from there. 3 developers means 3 lines of communication (i.e., each one has to talk to the other two developers). That will cut the remaining project duration down to 67% of the previously scheduled time. Adding the third developer only gained us 8 percentage points of duration savings over having two developers. At this point we have tripled the cost of development, but only save 33% of remaining project time, not counting the time lost in training the new developers.

4 developers is 6 lines of communication, and a schedule compression of 63% (i.e., only 4 percentage points better than 3 developers). 5 developers is 10 lines of communication, for a schedule compression of 60%. No matter how many developers we add to the single-developer programming job, it looks like we will never cut the remaining project time in half.

I find this rather depressing.
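
For anyone who wants to check the arithmetic, here is a minimal PHP sketch that reproduces the numbers in the table. It assumes, per caveat #2 below, that each communication link costs as much as one unit of development work.

<?php
printf("%5s %7s %7s %6s %8s %6s\n",
    'devs', 'links', 'd+c', 'rate', 'output', 'time');

for ($devs = 1; $devs <= 10; $devs++) {
    $links  = $devs * ($devs - 1) / 2; // pairwise communication links
    $total  = $devs + $links;          // development + communication "units"
    $rate   = $devs / $total;          // productive fraction of total effort
    $output = $devs * $rate;           // total productive output
    $time   = 1 / $output;             // fraction of the one-developer schedule
    printf("%5d %7d %7d %6.2f %8.2f %6.2f\n",
        $devs, $links, $total, $rate, $output, $time);
}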

Now, some very strong caveats:

  1. We presume the work being performed in the situation modeled above has to be done sequentially. If the work can be done concurrently or in parallel without additional planning, training, or communication, we can probably ignore the diminishing returns indicated by the table, since the developers don’t have to talk to each other.

  2. The amount of time spent in communication is assumed to be equal to the amount of time spent developing work product. We can game this a little bit by changing the “communication” column to some percentage of the number of communication links. This would show much better numbers, but my intuition tells me that it would not map to reality; it is my experience that just as much time is spent coordinating (and then regaining “flow”) as is spent developing work product.

  3. Each developer is presumed to be roughly as productive as each other developer. However, even the most productive developers will eventually be overwhelmed by the volume of communications, and if they spend time training new developers, their productivity drops to zero for that period.

  4. The developers being added are presumed to already be familiar with the project, its history, and its idiosyncrasies. If they are not, then one or more of the developers already working on the project must stop working and spend time bringing the new developers up to speed. That makes the productivity reduction even more dramatic, and is a key factor in making the late project later.

Finally, some conclusions:

  • I think caveat #4 above is really important in relation to Brooks’ Law. It makes me think that the above table probably describes the best case scenario when adding developers; i.e., if nobody has to train the new developer, then you don’t lose the training time, but you still don’t get twice as much productivity from adding a second developer.

  • Caveat #4 also implies to me that, if we want to compress the development schedule, the time to add developers is at the beginning of a project, not later on, so that they all learn the project as they go. But even then, it won’t result in enough schedule compression to warrant adding more than one or two extra developers to a single-developer project.

I said earlier, and I’ll say again now: This is not science. It is the beginning of an exploration of how to reliably manage resources and make schedule predictions. If you have questions, concerns, insights, alternative analysis, or experiences you would like to share on this topic, please leave a comment below.

The Central Tension Of Programming

The central tension in the software process comes from the fact that we must go from an informally identified need that exists in-the-world to a formal model that operates in-the-computer.

From “Beyond Programming” by Bruce Blum, as quoted in “The Design of Design” by Frederick P. Brooks Jr.

The Perils of Error Reduction; or, Starbucks for Programmers

By taking advantage of an asynchronous approach Starbucks also has to deal with the same challenges that asynchrony inherently brings. Take for example, correlation. Drink orders are not necessarily completed in the order they were placed. This can happen for two reasons. First, multiple baristas may be processing orders using different equipment. Blended drinks may take longer than a drip coffee. Second, baristas may make multiple drinks in one batch to optimize processing time. As a result, Starbucks has a correlation problem. Drinks are delivered out of sequence and need to be matched up to the correct customer. Starbucks solves the problem with the same "pattern" we use in messaging architectures — they use a Correlation Identifier. In the US, most Starbucks use an explicit correlation identifier by writing your name on the cup and calling it out when the drink is complete. In other countries, you have to correlate by the type of drink.

via The Perils of Error Reduction – Business – The Atlantic.
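
As a rough PHP sketch of the same idea (the class and method names here are illustrative, not from any particular messaging library): tag each outgoing order with an identifier, then use that identifier to match replies that arrive out of order.

<?php
class OrderCounter
{
    // orders waiting for a "drink ready" reply, keyed by correlation id
    protected $_pending = array();

    public function placeOrder($customer, $drink)
    {
        $correlationId = uniqid('order_', true);
        $this->_pending[$correlationId] = $customer;
        // ... publish the $drink order message with $correlationId attached ...
        return $correlationId;
    }

    public function onDrinkReady($correlationId, $drink)
    {
        // replies may arrive in any order; the id tells us whose drink this is
        $customer = $this->_pending[$correlationId];
        unset($this->_pending[$correlationId]);
        echo "$drink for $customer!\n";
    }
}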

Universal Constructor Sighting “In The Wild”

For those of you who don’t know, “universal constructor” is the name I give to PHP constructors that always and only take a single parameter. The parameter is an array of key-value pairs, which is then merged with a set of default keys and values. Finally, the array is unmarshalled, usually into object properties.

One benefit of the universal constructor is that it allows you to quickly and easily pass in configuration values from a config file (or other source) when building an object. You don’t have to remember the order of parameters, and you only need to specify the values that override the defaults.
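
Here is a minimal sketch of the pattern; the class name and config keys are illustrative, not taken from Solar itself:

<?php
class Example_Widget
{
    // default config values
    protected $_config = array(
        'host'    => 'localhost',
        'port'    => 11211,
        'timeout' => 30,
    );

    protected $_timeout;

    // the "universal constructor": always and only one array parameter
    public function __construct($config = array())
    {
        // merge the incoming key-value pairs over the defaults ...
        $this->_config = array_merge($this->_config, (array) $config);

        // ... then unmarshal into properties as needed
        $this->_timeout = $this->_config['timeout'];
    }
}

// override only the values you care about, in any order
$widget = new Example_Widget(array('port' => 11212));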

I standardized on a universal constructor in the Solar framework for PHP. As far as I know, Solar was the first to standardize on this pattern and give it a name, and other PHP projects appear to be adopting the idea based on my advocacy. I saw a link today to a universal constructor “in the wild”, not the result of my direct advocacy, here: http://www.jqueryin.com/projects/mongo-session/.

It’s nice to see the idea is getting around.

Solar 1.1.1 Stable Released

On Thursday, I released version 1.1.0 of the Solar Framework for PHP. Due to a small but critical bug in the PostgreSQL adapter, I released version 1.1.1 with the necessary fix earlier today. Change notes are here for 1.1.0, and here for 1.1.1.

The single biggest new feature in this release of Solar is a Markdown plugin set for DocBook, along with a new make-docbook command to convert API documentation to DocBook files. Previously, the Solar API documentation was wiki-like; now, we take the Markdown-based comments in the codebase, convert them to DocBook, and render the DocBook files into HTML using PhD. (Incidentally, I tried rendering with xsltproc; after three hours, the processing was less than one-third complete. With PhD, rendering the entire API documentation set takes under five minutes.)

Also, the make-model command now recognizes a star at the end of the model name, indicating it should make one model class set for each table in the database. For example, this will make one model class from the table "foo_bar" …

./script/solar make-model Vendor_Model_FooBar

… but this will make a Vendor_Model class for each table in the database:

./script/solar make-model Vendor_Model_*

That kind of thing is helpful when getting started with an existing set of tables, or when you’re updating your models after schema changes.

Other highlights include a series of small fixes, better CLI output in non-TTY environments, and improved automation of CSRF form elements.

Finally, we’ve added a new manual chapter on user authentication, roles, and access control. Find out how, with some config settings, you can instantiate a single object and let it automatically handle user login/logout, role discovery, and access permissions for you! And if you want more direct control over the process, browse on over to these blog entries from CoolGoose:

If you haven’t tried Solar yet, maybe now is the time: run through the Getting Started documentation and see how you like it!

(Cross-posted from the Solar blog.)