Sanitation with PHP filter_var()

By pmjones | January 17, 2007

I’m adding a combined validate-and-sanitize class to Solar, Solar_DataFilter. It uses some of the new filter extension functions internally.

However, I found a problem with the “float” sanitizing function in the 5.2.0 release, and thought others might want to be aware of it. In short, if you allow decimal places, the sanitizer allows any number of decimal points, not just one, and it returns an un-sanitary float.

I entered a bug on it, the text of which follows:

Description:
------------
When using FILTER_SANITIZE_NUMBER_FLOAT with FILTER_FLAG_ALLOW_FRACTION,
it seems to allow any number of decimal points, not just a single
decimal point.  This results in an invalid value being reported as
sanitized.

Reproduce code:
---------------
<?php
$val = 'abc ... 123.45 ,.../';
$san = filter_var($val, FILTER_SANITIZE_NUMBER_FLOAT,
    FILTER_FLAG_ALLOW_FRACTION);
var_dump($san);
?>

Expected result:
----------------
float 123.45

Actual result:
--------------
string(12) "...123.45..."

The bug has been marked as bogus, with various reasons and explanations that all make sense to the developers. “You misunderstand its use” and “it behaves the way we intended it to” seem to be the summary responses.

However, I would argue that the intended behavior is at best naive and of only minimal value. If I’m sanitizing a value to be a float, I expect to get back a float, or a notification that the value cannot be sanitized to become a float … but maybe that’s just me.

Regardless, I’m not going to belabor the point any further; I’ll just avoid that particular sanitizing filter.

Update: Pierre responds with, essentially, “RTFM.” I agree that the manual describes exactly what FILTER_SANITIZE_NUMBER_FLOAT does. My point is that what it does is not very useful. I think it’s reasonable to expect that a value, once sanitized, should pass its related validation, and the situation described in the above bug report indicates that it will not. My opinion is that the filter should either (1) attempt to extract a float value, or (2) indicate in some way that the value cannot be reasonably sanitized (in the sense that the returned value is not “sane”). Since it does not, and since the developers seem unwilling to accept that approach, I’ll just avoid using that filter and write my own.
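
As a rough illustration of the behavior I'm describing in (1) and (2), here is a minimal sketch, assuming current PHP (7.1 or later, for the nullable return type). The name `extract_float()` is hypothetical; it is not part of the filter extension or of Solar. Taking the first float-looking substring is only one possible rule, and the pattern deliberately ignores exponents and thousands separators.

```php
<?php
// Hypothetical helper: return the first float-looking substring as a float,
// or null when the value contains nothing that looks like a float.
function extract_float(string $value): ?float
{
    // Optional sign, then digits with an optional fraction, or a bare fraction.
    $pattern = '/[+-]?(?:\d+(?:\.\d+)?|\.\d+)/';

    if (preg_match($pattern, $value, $matches) !== 1) {
        return null; // signal "cannot be sanitized to a float"
    }

    return (float) $matches[0];
}

var_dump(extract_float('abc ... 123.45 ,.../')); // float(123.45)
var_dump(extract_float('no digits here'));       // NULL
```

Returning null instead of false is an arbitrary choice here; the point is only that the caller gets either a float or an unambiguous "no".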

Update 2: Something just occurred to me. Pierre says in the comments that accepting “abc ... 123.45 ,.../” to create a float is a bad idea. Yet the PHP float sanitizer will happily accept “123.abc,/45” and return a value that will validate as a float. Is *that* a good idea? If so, why?
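
To make the comparison concrete, this short sketch runs both strings through the sanitizer and then through the validator. It uses only `filter_var()` with the constants already named in this post; the `$results` array is just scaffolding for the example.

```php
<?php
$inputs  = ['abc ... 123.45 ,.../', '123.abc,/45'];
$results = [];

foreach ($inputs as $input) {
    // Strip everything except digits, + and -, and (because of the flag) the dot.
    $clean = filter_var($input, FILTER_SANITIZE_NUMBER_FLOAT, FILTER_FLAG_ALLOW_FRACTION);

    // Ask whether the sanitized string is actually a float.
    $valid = filter_var($clean, FILTER_VALIDATE_FLOAT);

    $results[$input] = [$clean, $valid];
    var_dump($clean, $valid);
}
```

The first input comes back as the string "...123.45..." and fails validation, while the second collapses to "123.45" and validates, even though it looks no more like a number than the first one did.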

39 thoughts on “Sanitation with PHP filter_var()”

  1. Pingback: PHPDeveloper.org

  2. Ian Eure

    PHP’s developers seem to have an attitude problem. Every single bug I have ever reported on PHP has been marked bogus, even when it’s clearly not.

    Welcome to the club.

  3. Ambush Commander

    I don’t know… their behavior seems reasonable to me, although it might confuse people who don’t recognize the subtle difference between sanitation and validation.

    Sanitate – to free from undesirable elements by cleaning or sterilizing
    Validate – to make valid

    You could argue the case that the name FILTER_SANITIZE_NUMBER_FLOAT is misleading because it incorrectly causes people to think of validating the float, when all it is doing is removing characters that are not valid in floats. After all, the behavior is not very useful (I can’t think of many times when I’d want to sanitize a float but not validate it), and this may confuse people.

  4. pmjones Post author

    @Ambush Commander — it’s reasonable in that it “behaves as intended.” I’m arguing that the intended behavior is not that useful. Sanitizing to integer gets you a valid integer, sanitizing to string gets you a valid string … and I say that sanitizing to float should get you a valid float, or an error.

  5. Pierre

    What do you not understand in my comment in this blog?

    The _sanitize_ filters do *not* validate the format but only clean a given string against a case (here what is allowed in a float value). If you want a float format validation and get a float out of it, you have to use FILTER_VALIDATE_FLOAT.

    This behavior is well documented. For example, FILTER_SANITIZE_NUMBER_FLOAT docs say: “Remove all characters except digits, +- and optionally .,eE.”.

    The time you needed to complain here would have been better used to read the documentation.

    Now about the bogus state: yes, the word “bogus” is a bit harsh and can easily be misunderstood. But your issue is definitely not a bug. Reading the

    Ian, I have no idea what makes you say that, but I always provide clear explanations before marking a bug as “Bogus” (meaning “not a bug”).

  6. pmjones Post author

    @Pierre –

    I understand your response perfectly, I think. My point is that the intended behavior is not that useful, and so I am going to “route around” it with a different implementation.

    Additionally, the documentation is very literal about the behavior, but does not explain well the ramifications of the behavior as implemented. Feel free to use my example from the bug entry in the documentation, if you wish.

  7. Pierre

    Route around what? You are using the wrong filter, period.

    What do you not understand in this exact text: “Remove all characters except digits, +- and optionally .,eE.”? The documentation (and the tutorials linked in the notes) explain the difference very well.

    If you consider the documentation unclear, badly written or incomplete, feel free to provide patches to improve it.

  8. pmjones Post author

    @Pierre:

    You said, “You are using the wrong filter, period.”

    You are correct: that filter does not return a proper float value, so it cannot be the right one. In fact, there is no filter that I have seen that will provide the behavior I expect.

    While “route around” is perhaps too harsh, that is what it feels like as a user of the function. As you note, the filter does exactly what the documentation says it does … but what it does is not very useful to me.

    Also, “If you consider the documentation unclear, badly written or incomplete, feel free to provide patches to improve it.” I have provided same in the bug report, feel free to copy and paste at will.

  9. pmjones Post author

    @Ambush Commander — In the bug report text above, I would expect either the sanitizing function to attempt to extract a float from the given value, or return an error (or boolean false, or null) to indicate it cannot be sanitized to a float value.

  10. Pierre

    @Ambush

    Paul expects ‘abc ... 123.45 ,.../’ to return float(123.45). And I think one should not accept such input to create a float, really not.

    @Paul
    As I see, things never change. Good at complaining, but action hardly follows. Time to move on, thanks for the fun :)

  11. pmjones Post author

    @Pierre –

    You said, “I think one should not accept such input to create a float” and I agree. If one wants to *reject* that value, validate it and check the response to see if it should be used.

    However, I think it’s reasonable to expect that a sanitized value should pass its related validation.

    You’re right again: things never do change. I suppose that goes both ways.

    And in point of fact, I am working on a class method to do exactly what I need even now, so the act does indeed follow.

  12. pmjones Post author

    @Pierre –

    Something just occurred to me. You said that accepting “abc ... 123.45 ,.../” to create a float is a bad idea. And yet the PHP float sanitizer will happily accept “123.abc,/45” and return a valid float. Is *that* a good idea? If so, why?

  13. Pierre

    Paul,

    Sit down five mins, take a large breath and then __read__:

    It does not return a __valid__ float. It strips all characters not allowed in a float: “Remove all characters except digits, +- and optionally .,eE.”

    So please, take a step back and try to understand what you are using.

  14. Chris D

    I see both sides, but frankly it makes sense that, in some cases, you can’t extract a single valid floating point number from FILTER_SANITIZE_NUMBER_FLOAT.

    Let’s take “abc ... 123.45 ,.../” for example. There are numerous valid floating point numbers within that string:
    12345
    .12345
    1.2345
    etc

    How is the function supposed to know WHICH one to return?

  15. pmjones Post author

    @Pierre — Sit down 2 mins, take a large breath and then __read__ what I am saying. I am saying that the behavior, as implemented, is mostly useless to me. Thus, I am avoiding it in the future, unless its behavior changes. So please, take a step back and try to understand my point of view.

  16. Ambush Commander

    Paul: The float sanitizer will accept any value, and will never return FALSE. It does what the description says, no more, no less. It will ALWAYS return a STRING. Whether or not PHP can then cast it into a valid float depends on its syntax: remember that PHP gobbles up as many valid characters as it can, then ignores the rest. Examples:

    print_r((float) '6.7'); // 6.7
    print_r((float) '6.7cm'); // 6.7
    print_r((float) '.01.'); // 0.01
    print_r((float) 'foobar'); // 0

  17. pmjones Post author

    @Chris D:

    I get what you’re saying. My point is that if it doesn’t provide a valid (sane) float, then I can’t consider it sanitized in a way that is useful to me. At the very least, a value sanitized as a float should pass validation as a float.

    Re: which number to return, again, I get what you’re saying. It is a hard problem. My response would be as follows:

    We are asking specifically for float sanitizing, not int sanitizing, so it’s safe to assume we should pay attention to decimal points if they exist. With that in mind, there appears to be only one reasonable result from that string (123.45).

    More complicated strings with more weirdness in them are of course more trouble, but a reasonable algorithm could at least return an indication that the value could not be sanitized to a float that would pass validation.

    Int sanitizers return values that pass int validation, string sanitizers return strings that pass string validation, float sanitizers should return a value that passes float validation.

  18. HR

    Pierre, I’m sure that everyone understands your argument, but repeating it doesn’t make it stronger.

    Unfortunately, it seems the term “sanitize” (or “sanitization”) is interpreted differently by a vast majority of users. In most people’s minds, a “sanitized” value is all cleaned-up and ready to use. In short, they expect it to be valid. Now I know that’s the job of the “validate” filters, but unless a very clear distinction is made in the manual I’m afraid that people will keep confusing them, a lot of bug reports will be filed and security flaws will be created, and I’m sure that’s the opposite of what you’re trying to do.

    WBR

  19. Ambush Commander

    @Pierre: Paul is simply saying that he is not going to use the function. That’s fine. Paul is also saying that there could be a more useful (in his opinion anyway) variant. That’s also fine. Whether you think that functionality is useful enough (or even possible to implement) to merit inclusion into the filter library is another issue.

    @Chris: While for this particular example, your objections don’t apply (the float is delimited by spaces, so there’s no ambiguity), you have a good point. I imagine that, if Paul got his way, the first valid float substring would take precedence, like any good regex is trained to do.

  20. pmjones Post author

    @HR — You have captured exactly the spirit of what I am trying to say. :-) I am one of those who think of a sanitized value as “ready for use”, not “may-be-ready,-maybe-not”.

  21. Pierre

    “I am one of those who think of a sanitized value as “ready for use”, not”

    Use the logical filters, they do what you want. Can you not understand that and be done??

  22. Pierre

    @HR

    I do repeat it; I try to formulate it in various ways to show Paul which function to use. And as far as I can tell he is the only one not to get it here (maybe you too ;).

    There were many bug reports about filter (bugs or requests); only one or two were about this so-hard-to-understand problem, and it was solved in one reply. Only Paul really seems to have a problem accepting the difference, and refuses to use the logical filters. Go figure.

    If you would like to provide a better explanation, please send it over. I will be happy to commit it.

  23. pmjones Post author

    @Pierre — are you saying that using FILTER_VALIDATE_FLOAT will extract a valid float value from the string I provided? I have not seen it do so as yet. None of the (repeated, identical) suggestions you have provided implement the behavior I am expecting as described above.

  24. Luis

    Paul, I do believe that FILTER_VALIDATE_FLOAT behaves as you could expect. It will return a float if it can extract a valid float or else it will return false. The string you provided as an example cannot be validated as a float number, so it returns false.

    I do agree with you that the FILTER_SANITIZE_NUMBER_FLOAT is not very useful for common situations, but it works exactly as advertised. I don’t think there is any bug there.

    I also understand you might want the additional functionality of “forcing” a float out of an invalid and strange string, but that’s not so useful for common situations either.

  25. pmjones Post author

    @Luis — you are correct, the FILTER_VALIDATE_* functions work exactly as I would expect (or close enough). If we attempt to validate, you are correct; it works just as you say.

    However, this entry is more about the FILTER_SANITIZE_* behavior, not the validation behaviors. It does work as advertised, but that advertised behavior is not very useful to me.

    You (and the developers) are technically correct: there is no bug, it works as intended. My point is that the behavior provided by that intention is mostly useless. If there was a bug-report code that said “maybe you should re-think your approach” I would report it as such. ;-)

    Having said all that, the developers are unlikely to change their approach, so I must avoid those particular functions and implement my own.

  26. Luis

    But you must realize that the behavior you’re asking for is not very useful either (at least in common situations). Why would you want to turn the string “..abc.123.45..” into a valid float? And how will you do it when you find something like “23.43.32”?

    Those kinds of strings can’t be reasonably turned into floats. All you can do with them is sanitize them (make sure they don’t have “dangerous” characters) or validate them (return a float or false). I don’t see anything wrong in this approach (except that the sanitize function is not very useful; on that I do agree with you). In the case of strings and integers it just happens that a sanitized value is also a valid one, but that’s not the result they were looking for with those sanitizing functions, it’s just an unavoidable side effect – that comes in handy, that’s true :)

  27. pmjones Post author

    @Luis –

    Well, it’s useful for me … if I call “sanitize float” I want to know that what I’m getting back has been forced to be a sane float value, or that it will indicate it cannot be made into a sane float value.

    I can agree that the specific method of forcing that value is up for debate; reasonable people can see different algorithms for extracting a sane float.

    But my basic point about sanitizing values remains: either it should return a sane float value, or it should indicate that it cannot return a sane float value. In no case should it return a float value that *might* be sane, but cannot in fact validate as a float.

    In short, it is my opinion that if a value is passed through a “sanitize” function, the result should be able to pass its related “validate” function, *or* it should indicate that the value cannot be made sane.

    I hope my description is understandable, even if you disagree. :-)

  28. pmjones Post author

    @Luis — To be clear, I generally validate instead of sanitize, so that the user has a chance to re-enter values. But when I *do* sanitize, I want to know for certain that it will come back as a sane value. Hope that makes sense.

  29. Luis

    Paul, if you re-read your answer carefully you’ll realize that what you want to do is *validate*, though you call it *sanitize*. It’s just a naming problem.

    For me:
    Sanitize: remove unwanted characters and return the value without validating.
    Validate: remove unwanted characters, and check if value is valid. If so, return value, else return false.

    The FILTER_VALIDATE_FLOAT should do what you want to do, unless you just want to create a similar filter that will try harder to return a valid float before returning false (for cases as the one in your example). If that makes sense to you, fine. But personally I don’t see it so useful.

  30. pmjones Post author

    @Luis –

    For me:

    Sanitize: force the value to be a sane version of what you’re asking for, manipulating it as necessary.

    Validate: do not manipulate the value at all, merely say whether or not it conforms to a particular format.

  31. nick

    Paul said:

    Sanitizing to integer gets you a valid integer

    But Paul, that is not true. _Validating_ as an integer will get you a valid integer, or an error. _Sanitizing_ “to an integer”, as you put it, does not exist because sanitizing != validating.

    The semantic distinction between the sanitizing and validating filters is very important here, and I haven’t seen it explained clearly anywhere, not here nor in the manual, which I guess is unfortunate.

  32. pmjones Post author

    @nick –

    I agree that it isn’t explained very clearly.

    Also agree that the semantics are very important; obviously the developers and I (and perhaps others) have very different ideas about what “sanitize” and “validate” should mean and do.

    To me, “validating” should not modify the value in any way, it should merely state whether or not the value is valid as it stands, and “sanitize” means that the value should be “made sane” and manipulated as necessary to make it sane.

  33. rodrigo moraes

    I agree with Paul that the filter is not very useful. However, if I had ‘abc ... 123.45 ,.../’ in the start of a sanitize process, and since the filter always returns a string, I would expect an empty string in the end, but never ‘123.45’ or ‘...123.45...’. Just my 0.02. :-)

  34. Luis

    Paul, sorry, I was wrong. For whatever reason I was sure that if you applied “FILTER_VALIDATE_FLOAT” to the string “abc12.34” it would return “12.34”. Just tested it to find out it returns false.

    Now I understand that you want to write your own filter to do that.

    Sorry for the mistake and for wasting your time. I really thought I would “save” you time because I thought the validating filter would do what you expected.

  35. Luis

    Thinking about it, you could just use *both* filters. First sanitize, then validate. It should work in normal situations (though admittedly, it won’t return a valid float out of “abc..23.34..dv” or “34.43.45”).
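
The sanitize-then-validate chain suggested here can be sketched as a small wrapper, assuming current PHP (7.1 or later). The function name `sanitize_then_validate_float()` is hypothetical, and mapping the validator's false to null is just one way to report failure.

```php
<?php
// Hypothetical wrapper: strip disallowed characters, then validate the result.
// Returns a float on success, or null when the cleaned string is not a float.
function sanitize_then_validate_float(string $value): ?float
{
    $clean = filter_var($value, FILTER_SANITIZE_NUMBER_FLOAT, FILTER_FLAG_ALLOW_FRACTION);
    $float = filter_var($clean, FILTER_VALIDATE_FLOAT);

    return $float === false ? null : $float;
}

var_dump(sanitize_then_validate_float('$ 23.34'));        // float(23.34)
var_dump(sanitize_then_validate_float('abc..23.34..dv')); // NULL
var_dump(sanitize_then_validate_float('34.43.45'));       // NULL
```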

  36. jonovic

    Good discussion, I think. Here is my opinion on it:

    validation:
    if (format valid) return value
    else throw error

    sanitation:
    if (format valid) return value
    else return default (nonconflicting) value

    I agree that sanitation shouldn’t throw an error in any case (it should be something like normalization of the input).

    So the result for me is that it is definitely not a bug, because it does its job. THE ONLY THING I WOULD CHANGE IS TO ENCAPSULATE THE TYPE CAST INTO FILTER. Because I expect the desired type on the output.
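
The “return a default” flavor of sanitation described above can be approximated in current PHP with the `default` option of the validation filter. This is only a sketch: `float_or_default()` is a hypothetical wrapper name, and 0.0 is an arbitrary choice of default.

```php
<?php
// Hypothetical wrapper: return the value as a float when it validates,
// otherwise fall back to a caller-supplied default instead of false.
function float_or_default(string $value, float $default = 0.0): float
{
    return filter_var($value, FILTER_VALIDATE_FLOAT, [
        'options' => ['default' => $default],
    ]);
}

var_dump(float_or_default('12.5'));         // float(12.5)
var_dump(float_or_default('...123.45...')); // float(0)
```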

    INT -> (int)
    FLOAT -> (float)
    STRING -> (string)

    My opinion – add it to your calculation :).

  37. nick

    Paul said:

    To me… “sanitize” means that the value should be “made sane” and manipulated as necessary to make it sane.

    From what I can see of the filter API, sanitize means “make safe”, not “make sane”. Huge difference.

    Now, “make safe” does not make a lot of sense to me in terms of floats, integers, email addresses, and the like. You can’t make an arbitrary value “safe” in terms of being a float; you can only say it is a float, or isn’t one. Same goes for an email address. And that process is called validation – a value is either a valid email address, or it is not.

    The sanitizing filters that appear to me to be actually useful are STRING, ENCODED, and MAGIC_QUOTES, because there is definitely a useful and important distinction between arbitrary strings and safe strings when speaking in terms of encoding and special characters.

  38. dotvoid

    Well, wasn’t this a fun pseudo-discussion. In my opinion you are not talking about sanitizing here, but rather about forcing an arbitrary value to become a float. This is not the same thing as sanitizing. Enforcing a specific format on an arbitrary value is, in my opinion, mostly not a good idea.

    Please let sanitize be sanitize and validate must (of course!) validate the value and not try to make sense of arbitrary input.

    So to be a bit constructive you could always suggest a FILTER_REQUIRE_FLOAT or something… (Again – not that I think it is a good idea…)
