Hacking, Refactoring, Rewriting, and Technical Debt

(Not so much “lessons” here, as “observations and recollections.”)

I.

From ESR, How To Learn Hacking:

  • Hacking is done on open source. Today, hacking skills are the individual micro-level of what is called “open source development” at the social macro-level. [2] A programmer working in the hacking style expects and readily uses peer review of source code by others to supplement and amplify his or her individual ability.
  • Hacking is lightweight and exploratory. Rigid procedures and elaborate a-priori specifications have no place in hacking; instead, the tendency is try-it-and-find-out with a rapid release tempo.
  • Hacking places a high value on modularity and reuse. In the hacking style, you try hard never to write a piece of code that can only be used once. You bias towards making general tools or libraries that can be specialized into what you want by freezing some arguments/variables or supplying a context.
  • Hacking favors scrap-and-rebuild over patch-and-extend. An essential part of hacking is ruthlessly throwing away code that has become overcomplicated or crufty, no matter how much time you have invested in it.

This strikes me as the opposite of, or at best orthogonal to, the practice of refactoring a large and long-existing codebase, especially that last. “Scrap and rebuild” is essentially “rewrite from scratch” and that is almost always a losing proposition in terms of time and budget.

But then, the description also says hacking is “lightweight, exploratory, modular” — so perhaps it is possible to hack at components within a larger or critical system, and then refactor the system using that result. Cf. Hacking and Refactoring, also from ESR.

II.

I used to work for a consulting company with One Big Client (and a few smaller ones). This One Big Client generated a staggering amount of money; its business model was such that even a few minutes of downtime would result in large losses (technically “lost revenue” but the effect is the same).

The consultancy’s director of development for the One Big Company didn’t “believe” in technical debt. Or rather, he thought technical debt was a non-issue. He would write, or demand writing, some of the most horrific code, as long as it was written *quickly*, because the time constraints for the One Big Client were so overwhelming. There was no refactoring; he and his team would iterate fast, then wipe out the entire codebase and start all over again on a regular basis. (By “regular” I mean every 3-6 months). Any technical debt that was incurred was tiny in comparison to the monetary returns — although it did take a toll on the programmers themselves.

Given ESR above, that’s clearly a “hacking” approach to a critical codebase. Also, it’s clear that *sometimes* a full-on rewrite might be the right thing to do, provided the right context. But I’m betting the vast majority of developers (I’m talking “five nines” here) are not operating in the context of that One Big Client and that particular consultancy. These were “skilled developers knowingly writing low-quality code” — not to be confused with “*un*skilled developers *un*knowingly writing low-quality code.” (Cf. Mehdi Khalili: “On Bad Code.”)

How Terrible Code Gets Written By Perfectly Sane People

From Christian Maioli Mackeprang, the following:

Legacy code can be nasty, but I’ve been programming for 15 years and only a couple of times had I seen something like this. The authors had created their own framework, and it was a perfect storm of anti-patterns. …

And yet, it was not the code’s dismal quality that piqued my interest and led me to write this article. What I discovered after some months working there, was that the authors were actually an experienced group of senior developers with good technical skills. What could lead a team of competent developers to produce and actually deliver something like this?

I don’t necessarily buy all of these, but they’re worth thinking about:

  • Giving excessive importance to estimates: “Your developers might choose to over-promise instead, and then skip important work such as thinking about architectural problems, or how to automate tasks, in order to meet an unrealistic deadline.”

  • Giving no importance to project knowledge: “If you’re in a large project and there are modules for which you have no expert, no-one to ask about, that’s a big red flag.”

  • Focusing on poor metrics such as “issues closed” or “commits per day”: “When a measure becomes a target, it ceases to be a good measure. (Goodhart’s law)”

  • Assuming that good process fixes bad people: “[Fix your hiring.] Talent makes up for any other inefficiency in your team.”

  • Ignoring proven practices such as code reviews and unit testing: “Refraining from using tools such as modern IDEs, version control and code inspection will set you back tremendously.”

  • Hiring developers with no “people” skills: “It doesn’t matter that the other guy might be ten thousand miles away. One Skype call can turn a long coding marathon into a five-minute fix.”

Conclusion? “What you can’t assume is that just because you’ve signed up to apply Agile or some other tool, that nothing else matters and things will sort themselves out.”

Efficient use of mysqli_result::$num_rows

I frequently see this pattern in legacy applications using mysqli:

$result = $mysqli->query("
    SELECT *
    FROM table_name
    WHERE conditions = 'whatever'
");
if ($result && $result->num_rows > 0) {
    return;
}

The developer’s intent here is to see if there are any rows at all in the database that match a certain condition. He does so by issuing a query, then asking the result object how many rows it has. The developer doesn’t actually want any data from the result, and doesn’t care about the actual row-count itself; this is just a check to see if at least one row exists in the database.

This is a poor conservation of resources. The database does the work needed to select all the columns for all the rows matching the conditions, allocates memory for them, and returns them. But the developer discards all that immediately.

To accomplish the same end, it is less resource-intensive and just as effective to query for a single column and limit the results to a single row:

$result = $mysqli->query("
    SELECT col_name
    FROM table_name
    WHERE conditions = 'whatever'
    LIMIT 1
");
if ($result && $result->num_rows > 0) {
    return;
}

Now the database only does the work needed for a single column and a single row.

(As a side note, I find it interesting that I have not seen this pattern at all in projects using PDO. I’m not sure why this would be. Perhaps there is some originating example code for mysqli somewhere that has gained a life of its own through copying and reuse.)

UPDATE: Perhaps a better way to conserve resources, courtesy of Reddit user marcjschmidt, is to use a COUNT() in the query, then fetch the count of rows, something more like this …

$result = $mysqli->query("
    SELECT COUNT(*)
    FROM table_name
    WHERE conditions = 'whatever'
");
if ($result && $result->fetch_array()[0] > 0) {
    return;
}

… thereby avoiding the use of mysqli_result::$num_rows completely.

UPDATE 2: Lots of commentary in the above Reddit thread. To summarize this blog post: selecting all columns of all rows, then examining $num_rows, and then discarding the result set, is a terrible way of determining if there are any matching rows at all in the database. It is trivially easy to something much better, whether by using a LIMIT 1 and $num_rows, or some form of COUNT(), or perhaps some other approach.

Exporting Globals in PHP

I am currently modernizing a legacy PHP application for a client. (The codebase was written earlier this year, in fact; new code can be “legacy” from the outset.) The original developer pulled a dirty trick with global that I had not seen before, and I thought I had seen everything.

Legacy codebases often use global to import a variable into the local scope, usually a global function. For example, they might drag in a database connection:

<?php
// define $db in a config file somewhere
$db = new DatabaseConnection(...);

// this function uses the $db connection via global
function fetch_user_by_id($id)
{
    global $db;
    return $db->fetchAssoc("SELECT * FROM users WHERE id = ?", $id);
}
?>

I see that kind of thing all the time in legacy PHP. However, what I have not seen before is a function exporting a global.

Take a look at the following code. If the $bar variable is not already defined in the global scope, PHP will define it in the global scope for you automatically when you call foo().

<?php
error_reporting(E_ALL);

function foo()
{
    global $bar;
    $bar ++;
}

// $bar is not defined yet, so PHP will show an
// "undefined variable" notice
echo $bar. PHP_EOL;

// calling foo() defines $bar in the global scope,
// and increments it
foo();

// $bar is now available in the global scope, having
// been exported from function foo()
echo $bar. PHP_EOL;

The legacy developer did that because he wanted to keep the variable initialization outside of the global scope for some reason, even though he used the variable in the global scope elsewhere.

It is exceptionally difficult to track down where an exported global is coming from when refactoring a legacy application. If you must write legacy code using globals, initialize them in the global scope. Better yet, don’t use globals at all: pass values as function arguments, or use dependency injection techniques.

Why Do PHP Developers Think MVC Is An Application Architecture?

I’ve pointed out before that Model-View-Controller is a user interface pattern, not an application architecture. But why would PHP developers get the idea that MVC is an application architecture in the first place? (This may apply to all server-side developers, not just PHP folks.)

I used to think that MVC was an application architecture. Even after reading Fowler’s POEAA and seeing that MVC was for the user interface, I figured that meant I was doing “user interface applications.” But that was not quite right; it would have been more accurate to say that I had been mixing the concerns of user interface with the underlying core application.

Based on that experience, I think the reason MVC is mistaken for an application architecture is because PHP developers generally start with page scripts. In our first page scripts, we combine all the concerns in a single ball of mud: SQL queries are intermingled with HTML, and the business logic is scattered throughout. As far as we are concerned, the page script itself is “the application.” Later, as more functionality is required, we add another page script, and another, and another. We continue to see that collection of page scripts as “the application.”

Then, as we progress in our profession, and begin to learn how to better organize our work, we start separating the different concerns of “the application.” When we begin separating concerns, we separate them out of the page script, which we see as “the application.” Extracting the view concerns from the page script means extracting them from “the application.” Separating the model from the page script into its own layer means separating it from “the application.” Pulling the controller logic out of the page script means pulling it out of “the application.” So of course Model-View-Controller is seen as an application architecture – we separated the concerns of our application according to that pattern, right?

Except, in retrospect, it’s not. One of the big leaps we have to make is to realize that MVC is for the user interface portion of our systems, just like Fowler notes. We on the server side think the user interface is HTML, CSS, and Javascript, but it’s not. Instead, the user interface is the HTTP Request and Response. In other words, the template is not the View.

Once we make that conceptual leap, we begin to realize that the Model layer is the entry point to “the application”. That is where “the application” should live, or at least where it should look like it lives. The Controller should have little to do but take input from the Request and pass it to the Model; likewise, the View should do nothing but put together the Response using the output from the Model.

With that idea of server-side MVC, we then begin to see that a lot of what’s in server-side MVC frameworks is “too much.” Framework functionality that is not related to merely taking input from a Request and presenting output through a Response becomes entirely secondary to, perhaps even actively harmful to, building well-structured applications – applications that are independent of any particular user interface.

Afterword

The server-side MVC pattern, as descended through Sun’s Model 2 architecture, is so distinctive from the client-side MVC pattern as to be a completely different thing. I realized this as I was writing my book on modernizing legacy applications, and further research led me to write up Action-Domain-Responder as a web-specific alternative to MVC. Action-Domain-Responder places a much stronger emphasis on the HTTP Request and Response as the user interface elements. If you are interested in building better applications, you may wish to read the ADR essay, and try it out in your next project.

Configuration Values Are Dependencies, Too

As part of my consulting work, I get the opportunity to review lots of different codebases of varying modernity. One thing I’ve noticed with some otherwise-modern codebases is that they often “reach out” from inside a class to retrieve configuration values, instead of injecting those values into the class from the outside. That is, they use an equivalent of globals or service-location to read configuration, instead of using dependency injection.

Here is one generic example:

<?php
class Db
{
    // backend type, hostname, username, password, and database name
    protected $type, $host, $user, $pass, $name;

    public function __construct()
    {
        $this->type = getenv('DB_TYPE');
        $this->host = getenv('DB_HOST');
        $this->user = getenv('DB_USER');
        $this->pass = getenv('DB_PASS');
        $this->name = getenv('DB_NAME');
    }

    public function newConnection()
    {
        return new PDO(
            "{$this->type}:host={$this->host};dbname={$this->name}",
            $this->user,
            $this->pass
        );
    }
}
?>

Granted, the example follows the modern practice of keeping sensitive information as environment variables. Similar examples use $_ENV or $_SERVER keys instead of getenv(). The effect, though, is global-ish or service-locator-ish in nature: the class is reaching outside its own scope to retrieve values it needs for its own operation. Likewise, one cannot tell from the outside the class what configuration values it depends on.

Is the following any better?

<?php
class Db
{
    public function __construct()
    {
        $this->type = Config::get('db.type');
        $this->host = Config::get('db.host');
        $this->user = Config::get('db.user');
        $this->pass = Config::get('db.pass');
        $this->name = Config::get('db.name');
    }
}
?>

As far as I can tell, that’s a variation on the same theme. The generic Config object acts as a global singleton to carry configuration for every possible need; it is acting as a static service locator. While service location is inversion-of-control, it is in many ways inferior to dependency injection. As before, the class is reaching outside its own scope to retrieve values it depends on.

What if we inject the generic Config object like this?

<?php
class Db
{
    public function __construct(Config $config)
    {
        $this->type = $config->get('db.type');
        $this->host = $config->get('db.host');
        $this->user = $config->get('db.user');
        $this->pass = $config->get('db.pass');
        $this->name = $config->get('db.name');
    }
}
?>

This is a little better; at least now we can tell that the Db class needs configuration of some sort, though we still cannot tell exactly which values it needs. This is the same as injecting a service locator.

Having seen all these examples, and other similar ones, in real codebases, I conclude that configuration values should be treated as any other dependency, and injected via the constructor. I suggest this approach:

<?php
class Db
{
    public function __construct($type, $host, $user, $pass, $name)
    {
        $this->type = $type;
        $this->host = $host;
        $this->user = $user;
        $this->pass = $pass;
        $this->name = $name;
    }
}
?>

Simple, clear, obvious, and easy to test. If you use a dependency injection container of some sort, it should be trivial to have it read environment variables and pass them to the Db class at construction time. (If your DI container does not support that kind of thing, you may wish to consider using a more powerful container system.)

Alternatively, I think the following may be reasonable in some cases:

<?php
class DbConfig
{
    // backend type, hostname, username, password, and database name
    protected $type, $host, $user, $pass, $name;

    public function __construct($type, $host, $user, $pass, $name)
    {
        $this->type = $type;
        $this->host = $host;
        $this->user = $user;
        $this->pass = $pass;
        $this->name = $name;
    }

    public function getDsn()
    {
        return "{$this->type}:host={$this->host};dbname={$this->name}";
    }

    public function getUser()
    {
        return $this->user;
    }

    public function getPass()
    {
        return $this->pass;
    }
}

class Db
{
    protected $dbConfig;

    public function __construct(DbConfig $dbConfig)
    {
        $this->dbConfig = $dbConfig;
    }

    public function newConnection()
    {
        return new PDO(
            $this->dbConfig->getDsn(),
            $this->dbConfig->getUser(),
            $this->dbConfig->getPass()
        );
    }
}
?>

In that example, the DbConfig manages a set of injected configuration values so that the Db object treats its own configuration as a separate concern. However, that approach is just a little too indirect and open-to-abuse for my taste most of the time. The temptation is to start putting more and more inside the DbConfig object, and you end up with a mini-service-locator.

To sum up: Configuration values are dependencies; therefore, inject configuration values the way you would any other dependency.

UPDATE: Stephan Hochdörfer notes on Twitter: “I would probably re-phrase a bit: Configuration values should be treated like deps. Not sure if u can say that they are deps ;).” The point is well-taken, though it may be a distinction without a difference. If the class cannot operate properly without a particular value, whether that value is a scalar or an object, I think it’s fair to say the class is dependent on that value.

Modernizing Serialized PHP Objects with class_alias()

Several weeks ago, a correspondent presented a legacy situation that I’ve never had to deal with. He was working his way through Modernizing Legacy Applications in PHP, and realized the codebase was storing serialized PHP objects in a database. He couldn’t refactor the class names without seriously breaking the application.

I was carefully moving my classes to a PSR-0/4 structure when I found this. The application saves the output of serialize($shoppingCart) in a BLOB column, and unserializes it later to recreate the ShoppingCart object. The serialized object string looks like this:

O:12:"ShoppingCart":17:{s:13:"\00Basket\00items";a:25:{...}; ...}

See how the class name is embedded in the serialized string? If I rename the ShopppingCart class to PSR-0/4 namespace, then the old class won’t be found when the application tries to unserialize() the serialized representation. How can I begin refactoring this without breaking the whole application?

Before I was able to reply, my correspondent ended up changing the serialization strategy to use JSON, which was a rather large change. It ended up well, but it turns out there is a less intrusive solution: class_alias().

If you’re in a serialized situation, and you need to change a class name, you can use a class_alias() to point the old class name to the new one. (Call class_alias() early in execution, probably before you register your autoloader.) You can then rename and move the class according to PSR-0/4 rules, and PHP will handle the rest for you.

For example, if you renamed ShoppingCart to Domain\Orders\Cart, you might do this:

class_alias('Domain\Orders\Cart', 'ShoppingCart');

Now when you call unserialize($shoppingCart) to create an object, PHP will create it as an instance of Domain\Orders\Cart instead of ShoppingCart. Re-serialized representations of the recreated object will retain the new class name, not the old one: O:18:"Domain\Orders\Cart":....

As soon as there are no more serialized representations using the old class name, you can remove the class_alias() call entirely.

MLAPHP Boot Camp!

You may have seen my introductory talk on steps toward modernizing a legacy application at Nashville PHP or at php[world]. That was the presentation that formed the first few chapters of Modernizing Legacy Applications in PHP.

The MLAPHP book has been a great help to a lot of people. But there are a lot of other developers who learn best by watching-and-listening first, instead of reading. They know it can be a great help to have someone talk them through the whole moderizing process, and see the steps being performed.

Usually you have to go to a conference for that sort of thing. You have to pay for the ticket, and for the hotel, and for travel, and meals, and everything else. Even if your employer helps with the money, there’s still the time involved: you have to be away for several days at a time.

Well, now you can have a conference-level training experience in the comfort of your own home, without the travel hassle and hotel expense. I have put together an online weekend “boot camp” where we go through the entire modernization process, from basic prerequisities at the beginning to setting up a DI container at the end. After that end-to-end study of the steps involved, you should be able to jump-start your own modernizing project much more easily.

The 8-hour camp will combine lecture and “live” coding with examples. In particular, there will be plenty of opportunity for you to get answers specific to your own modernizing situation, and to chat with other attendees.

The two-day camp will be on Sat 11 July and Sun 12 July. Each day will run from 12 noon to 4pm (US Central Daylight), with time for breaks built in. That makes it 10am-2pm for Pacific time zone attendees, and 1-5pm for Eastern, which should be convenient for almost everyone.

With your ticket to attend the boot camp, you get:

  • Two days of online boot camp, four hours each
  • A review of the entire modernization process from beginning to end
  • A bonus recording of the camp
  • For $399

This might be the only time I lead this kind of boot camp, so get your ticket while there are still seats available!

Modernizing Legacy PHP: From Service Locator To Dependency Injection

In an earlier article I described how to start moving away from singletons in favor of dependency injection. It occurs to me that the process for moving away from Service Locator is almost exactly the same, except that we use the container outside the class instead of inside it.

Let’s say we have a class that uses a Service Locator. First we examine the class for all uses of the locator. Then, we create constructor parameters for the dependencies it extracts from the locator, and add setter code for those dependencies in the constructor body. For example, we can convert the above Service Locator example classes to these dependency-injected variations:

<?php
class FooClass
{
    protected $db;
    public function __construct(Database $db)
    {
        $this->db = $db;
    }
}

class BarClass
{
    protected $db;
    public function __construct(Database $db)
    {
        $this->db = $db;
    }
}
?>

Finally, any time we instantiate one of these dependency-injected classes, we use the locator outside the class to retrieve the dependencies. We then pass them to the new call for the class. For example:

<?php
// for FooClass
$db = $container->get('db');
$foo = new FooClass($db);

// for BarClass
$db = StaticContainer::get('db');
$bar = new BarClass($db);
?>

Now the class dependencies are explicit and predictable, instead of implicit and unpredictable (i.e., the class might depend on any combination of dependencies hidden inside the container). It is also somewhat easier to build a test, since we only have to build the dependencies themselves, not the container that holds the dependencies.

Afterword

Are you overwhelmed by a legacy PHP application? Have you inherited a spaghetti mess of code? Does it use globals everywhere, so that a fix in one place causes a bug somewhere else? Does every feature addition feel like slogging through a swamp of includes?

It doesn’t have to be that way. “Modernizing Legacy Applications in PHP” gives you step-by-step instructions on how to get your legacy code under control by eliminating globals and separating concerns. Each chapter shows you exactly one task and how to accomplish it, along with common questions related to that task.

When you are done, you will come and go through your code like the wind. Your application will have become autoloaded, dependency injected, unit tested, layer separated, and front controlled. And you will have kept it running the whole time.

Buy the book today!

Legacy Refactor Question: What About Singletons?

In my talk and my book on refactoring legacy PHP code, one of the early steps is to start removing globals and replace them with dependency injection (not a container, just injecting dependencies by hand.) I addressed the use of the global keyword and the $_GET (et al) superglobals, but I did not address singletons directly. I had a Reddit correspondent ask about singletons recently, and how to refactor away from them; I answered on Reddit, where you can read the full conversation.