Discussion:
[Tutor] value range checker
Albert-Jan Roskam
2015-08-26 13:19:16 UTC
Hello,


I have written a function that checks the validity of values. The ranges of valid values are stored in a database table.

Such a table contains three columns: category, min and max. One record of such a table specifies the range for a certain category, but a category may be spread out over multiple records.


For example, the category-min-max tuples

("cat a", 1, 1),
("cat a", 3, 3),
("cat a", 6, 10),

correspond to a range for category A of 1-1, 3-3, 6-10, which is the same as the values 1, 3, and 6, 7, 8, 9, 10.


The code below does exactly what I want:


import collections
import bisect
import math
import pprint


def get_valid_value_lookup(records):
    """
    Translates value range information (from a database table)
    into a dictionary of the form {<category>: [<range of accepted values>]}
    """
    Boundaries = collections.namedtuple("Boundaries", "category min max")
    records = [Boundaries(*record) for record in records]
    boundaries = collections.defaultdict(list)
    crap = [boundaries[record.category].__iadd__(range(record.min, record.max + 1))
            for record in records]
    return dict(boundaries)


def is_valid(lookup, category, value):
    """Return True if value is member of a list of a given category, False otherwise."""
    try:
        return value in lookup[category]
    except KeyError:
        raise KeyError("Invalid category: %r" % category)


def is_valid2(lookup, category, value):
    """Return True if value is member of a list of a given category, False otherwise."""
    # this version also knows how to deal with floats.
    try:
        L = lookup[category]
    except KeyError:
        raise KeyError("Invalid category: %r" % category)

    adjusted_value = value if int(value) in (L[0], 0, L[-1]) else math.ceil(value)
    try:
        chopfunc = bisect.bisect_right if value < L[0] else bisect.bisect_left
        return L[chopfunc(L, value)] == adjusted_value
    except IndexError:
        return False


if __name__ == '__main__':
    L = [("cat a", 1, 1), ("cat a", 3, 3), ("cat a", 6, 10),
         ("cat b", 1, 9), ("cat c", 1, 2), ("cat c", 5, 9)]
    lookup = get_valid_value_lookup(L)
    assert not is_valid(lookup, "cat a", 999)  # value 999 is invalid for "cat a"
    assert is_valid(lookup, "cat a", 10)
    assert not is_valid2(lookup, "cat a", 0.1)
    assert not is_valid2(lookup, "cat a", -1)
    assert is_valid2(lookup, "cat a", 6.1)

    L2 = [(1, -5, 1), (1, 3, 3), (1, 6, 10), (2, 1, 9), (3, 1, 2), (3, 5, 9)]
    lookup = get_valid_value_lookup(L2)
    assert is_valid2(lookup, 1, -4.99)
    assert is_valid2(lookup, 1, -5)


My questions:

[1] @ get_valid_value_lookup: is there a better way to do this? I mostly don't like the use of the __iadd__ dunder method.

[2] @ is_valid2: Perhaps an answer to my previous question. Is this a better approach?

[3] I am inheriting this system. It feels a bit strange that these range-check values are stored in a database. Would yaml be a better choice? Some of the tables are close to 200 records.


Thank you in advance!


Albert-Jan
Alan Gauld
2015-08-26 16:29:08 UTC
On 26/08/15 14:19, Albert-Jan Roskam wrote:

> I have written a function that checks the validity of values.
> The ranges of valid values are stored in a database table.

That's an unusual choice because:

1) using a database normally only makes sense in the case
where you are already using the database to store the
other data. But in that case you would normally get
validation done using a database constraint.

2) For small amounts of data the database introduces
a significant overhead. Databases are good for handling
large amounts of data.

3) A database is rather inflexible since you need to
initialise it, create it, etc. Which limits the number
of environments where it can be used.

> Such a table contains three columns: category, min and max. ...
> a category may be spread out over multiple records.

And searching multiple rows is even less efficient.

> Would yaml be a better choice? Some of the tables are close to 200 records.

Mostly I wouldn't use a data format per-se (except for
persistence between sessions). I'd load the limits into
a Python set and let the validation be a simple member-of check.

Unless you are dealing with large ranges rather than sets
of small ranges. Even with complex options I'd still
opt for a two tier data structure. But mostly I'd query
any design that requires a lot of standalone data validation.
(Unless its function is to be a bulk data loader or similar.)
I'd probably be looking to having the data stored as
objects that did their own validation at creation/modification
time.
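
For example, a minimal sketch of the member-of-a-set idea (the category
data below is made up for illustration):

# Load each category's limits into a set once; validation is then a
# plain membership test.
valid_values = {
    "cat a": {1, 3} | set(range(6, 11)),
    "cat b": set(range(1, 10)),
}

def is_valid(category, value):
    return value in valid_values[category]   # KeyError = unknown category

assert is_valid("cat a", 7)
assert not is_valid("cat a", 2)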

If I was doing a bulk data loader/checker I'd probably create
a validation function for each category and add it to a
dictionary. So I'd write a make_validator() function that
took the validation data and created a specific validator
function for that category. Very simple example:

def make_validator(min, max, *values):
    def validate(value):
        return (min <= value <= max) or (value in values)
    return validator
...
for category in categories:
    lookup[category] = make_validator(min, max, valueList)
...
if lookup[category](my_value):
    # process valid value
else:
    raise ValueError

--
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.amazon.com/author/alan_gauld
Follow my photo-blog on Flickr at:
http://www.flickr.com/photos/alangauldphotos


Albert-Jan Roskam
2015-08-28 10:03:21 UTC
----------------------------------------
> To: ***@python.org
> From: ***@btinternet.com
> Date: Wed, 26 Aug 2015 17:29:08 +0100
> Subject: Re: [Tutor] value range checker
>
> On 26/08/15 14:19, Albert-Jan Roskam wrote:
>
>> I have written a function that checks the validity of values.
>> The ranges of valid values are stored in a database table.
>
> That's an unusual choice because:
>
> 1) using a database normally only makes sense in the case
> where you are already using the database to store the
> other data. But in that case you would normally get
> validation done using a database constraint.


The other data are indeed also stored in the database. But CHECK constraints seem an attractive mechanism to get validation done. It should be possible to define a CHECK constraint with CASE statements in the CREATE TABLE definition. This page even describes how to do a modulus-11 check, which I also need:

https://www.simple-talk.com/sql/learn-sql-server/check-your-digits/

Will the Python exceptions be clear enough if a record is rejected because one or more constraints are not met? I mean, if I only get a ValueError or a TypeError because *one* of the 100 or so columns is invalid, this would be annoying.
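
With sqlite3, at least, a violated CHECK raises sqlite3.IntegrityError, and the message names the failed constraint if you give the constraint a name (the exact wording varies with the SQLite version). A rough sketch, with table, column and constraint names invented:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE measurement (
        category TEXT,
        value    INTEGER,
        CONSTRAINT value_in_range CHECK (
            CASE category
                WHEN 'cat a' THEN value IN (1, 3) OR value BETWEEN 6 AND 10
                WHEN 'cat b' THEN value BETWEEN 1 AND 9
                ELSE 0
            END
        )
    )
""")
try:
    conn.execute("INSERT INTO measurement VALUES ('cat a', 999)")
except sqlite3.IntegrityError as exc:
    print(exc)   # e.g. "CHECK constraint failed: value_in_range"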


> 2) For small amounts of data the database introduces
> a significant overhead. Databases are good for handling
> large amounts of data.
>
> 3) A database is rather inflexible since you need to
> initialise it, create it, etc. Which limits the number
> of environments where it can be used.
>
>> Such a table contains three columns: category, min and max. ...
>> a category may be spread out over multiple records.
>
> And searching multiple rows is even less efficient.
>
>> Would yaml be a better choice? Some of the tables are close to 200 records.
>
> Mostly I wouldn't use a data format per-se (except for
> persistence between sessions). I'd load the limits into
> a Python set and let the validation be a simple member-of check.
>
> Unless you are dealing with large ranges rather than sets
> of small ranges. Even with complex options I'd still
> opt for a two tier data structure. But mostly I'd query
> any design that requires a lot of standalone data validation.
> (Unless its function is to be a bulk data loader or similar.)
> I'd probably be looking to having the data stored as
> objects that did their own validation at creation/modification
> time.


The data are collected electronically, but also by paper-and-pencil. With a web page you can check all kinds of things right at the beginning. But that's not true for paper-and-pencil data collection.

Bottom line is that *only* electronic data collection would make things easier.



> If I was doing a bulk data loader/checker I'd probably create
> a validation function for each category and add it to a
> dictionary. So I'd write a make_validator() function that
> took the validation data and created a specific validator
> function for that category. Very simple example:
>
> def make_validator(min, max, *values):
>     def validate(value):
>         return (min <= value <= max) or (value in values)
>     return validator


This looks simple and therefore attractive! Thank you!


> for category in categories:
>     lookup[category] = make_validator(min, max, valueList)
> ...
> if lookup[category](my_value):
>     # process valid value
> else:
>     raise ValueError
>
Alan Gauld
2015-08-28 10:28:15 UTC
On 26/08/15 17:29, Alan Gauld wrote:

> def make_validator(min, max, *values):
>     def validate(value):
>         return (min <= value <= max) or (value in values)
>     return validator

Oops!
Should of course be

return validate

Sorry,
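
For reference, the corrected factory in full (a sketch: min/max renamed to lo/hi to avoid shadowing the builtins, and standalone extra values passed positionally):

def make_validator(lo, hi, *values):
    def validate(value):
        # valid if inside the range or equal to one of the extra values
        return lo <= value <= hi or value in values
    return validate

lookup = {"cat a": make_validator(6, 10, 1, 3)}   # range 6-10 plus 1 and 3
assert lookup["cat a"](7)
assert lookup["cat a"](3)
assert not lookup["cat a"](2)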

--
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.amazon.com/author/alan_gauld
Follow my photo-blog on Flickr at:
http://www.flickr.com/photos/alangauldphotos


Peter Otten
2015-08-27 07:48:46 UTC
Albert-Jan Roskam wrote:

> I have written a function that checks the validity of values. The ranges of
> valid values are stored in a database table.
>
> Such a table contains three columns: category, min and max. One record of
> such a table specifies the range for a certain category, but a category
> may be spread out over multiple records.

> My questions:

> def get_valid_value_lookup(records):
>     """
>     Translates value range information (from a database table)
>     into a dictionary of the form {<category>: [<range of accepted values>]}
>     """
>     Boundaries = collections.namedtuple("Boundaries", "category min max")
>     records = [Boundaries(*record) for record in records]
>     boundaries = collections.defaultdict(list)
>     crap = [boundaries[record.category].__iadd__(range(record.min, record.max + 1))
>             for record in records]
>     return dict(boundaries)


> [1] @ get_valid_value_lookup: is there a better way to do this? I mostly
> don't like the use of the __iadd__ dunder method.

When you find yourself using a list comprehension for its side effect it is
high time to switch to a for loop. So

def get_valid_value_lookup(records):
    boundaries = {}
    for category, minval, maxval in records:
        boundaries.setdefault(category, []).extend(range(minval, maxval + 1))
    return boundaries

> [2] @ is_valid2: Perhaps an answer to my previous question. Is this a
> better approach?

As long as the size of the ranges is manageable and you are not actually
seeing float values I'd stick with the dead simple first version.
Replace the list in get_valid_value_lookup() with a set

boundaries.setdefault(category, set()).update(range(minval, maxval+1))

and you get O(1) lookup for free.
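
Spelled out, combining the two snippets above:

def get_valid_value_lookup(records):
    boundaries = {}
    for category, minval, maxval in records:
        # sets give O(1) average-case membership tests
        boundaries.setdefault(category, set()).update(range(minval, maxval + 1))
    return boundaries

lookup = get_valid_value_lookup([("cat a", 1, 1), ("cat a", 3, 3), ("cat a", 6, 10)])
assert 7 in lookup["cat a"]
assert 2 not in lookup["cat a"]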

> def is_valid2(lookup, category, value):
>     """Return True if value is member of a list of a given category, False otherwise."""
>     # this version also knows how to deal with floats.
>     try:
>         L = lookup[category]
>     except KeyError:
>         raise KeyError("Invalid category: %r" % category)
>
>     adjusted_value = value if int(value) in (L[0], 0, L[-1]) else math.ceil(value)
>     try:
>         chopfunc = bisect.bisect_right if value < L[0] else bisect.bisect_left
>         return L[chopfunc(L, value)] == adjusted_value
>     except IndexError:
>         return False

I don't understand what this is supposed to do:

(1) for bisect to work correctly the list has to be sorted. Your
get_valid_value_lookup() doesn't do that

(2) Why/how are L[0], 0, and L[-1] special?

(3) Given a sorted list L there should be no need to bisect_whatever for
value < L[0]
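
(For what it's worth, a sketch of a bisect-based check that keeps sorted, non-overlapping (min, max) pairs instead of expanding every value; this is not code from the thread, just one correct way to handle floats:)

import bisect

RANGES = {"cat a": [(1, 1), (3, 3), (6, 10)]}   # sorted, non-overlapping

def is_valid(category, value):
    pairs = RANGES[category]
    mins = [lo for lo, hi in pairs]
    # index of the rightmost interval whose lower bound is <= value
    i = bisect.bisect_right(mins, value) - 1
    return i >= 0 and pairs[i][0] <= value <= pairs[i][1]

assert is_valid("cat a", 6.5)
assert not is_valid("cat a", 2)
assert not is_valid("cat a", 10.5)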


> [3] I am inheriting this system. It feels a bit strange that these
> range-check values are stored in a database.
>
> Would yaml be a better choice? Some of the tables are close to 200
> records.

Unless you're encountering an actual inconvenience just keep it like it is.
Yea, I'm a lazy bastard ;)

Albert-Jan Roskam
2015-08-28 15:53:00 UTC
(sorry, Peter, Alan, I sent two mails to you privately. I just switched from Yahoo to Hotmail for DMARC reasons. Still getting used to Hotmail.)



> Subject: RE: [Tutor] value range checker
> Date: Fri, 28 Aug 2015 10:14:49 +0000
>
> ----------------------------------------
>> To: ***@python.org
>> From: ***@web.de
>> Date: Thu, 27 Aug 2015 09:48:46 +0200
>> Subject: Re: [Tutor] value range checker
>>
>> Albert-Jan Roskam wrote:
>>
>>> I have written a function that checks the validity of values. The ranges of
>>> valid values are stored in a database table.
>>>
>>> Such a table contains three columns: category, min and max. One record of
>>> such a table specifies the range for a certain category, but a category
>>> may be spread out over multiple records.
>>
>>> My questions:
>>
>>> def get_valid_value_lookup(records):
>>>     """
>>>     Translates value range information (from a database table)
>>>     into a dictionary of the form {<category>: [<range of accepted values>]}
>>>     """
>>>     Boundaries = collections.namedtuple("Boundaries", "category min max")
>>>     records = [Boundaries(*record) for record in records]
>>>     boundaries = collections.defaultdict(list)
>>>     crap = [boundaries[record.category].__iadd__(range(record.min, record.max + 1))
>>>             for record in records]
>>>     return dict(boundaries)
>>
>>
>>> [1] @ get_valid_value_lookup: is there a better way to do this? I mostly
>>> don't like the use of the __iadd__ dunder method.
>>
>> When you find yourself using a list comprehension for its side effect it is
>> high time to switch to a for loop. So
>>
>> def get_valid_value_lookup(records):
>>     boundaries = {}
>>     for category, minval, maxval in records:
>>         boundaries.setdefault(category, []).extend(range(minval, maxval + 1))
>>     return boundaries
>
>
> ah, this indeed looks better. Thank you!
>
>
>>> [2] @ is_valid2: Perhaps an answer to my previous question. Is this a
>>> better approach?
>>
>> As long as the size of the ranges is manageable and you are not actually
>> seeing float values I'd stick with the dead simple first version.
>
>
> The acronym "YAGNI" came to mind when I wrote that function. :-) But I was just curious how it could be generalized and this was a first attempt.
>
>
>> Replace the list in get_valid_value_lookup() with a set
>>
>> boundaries.setdefault(category, set()).update(range(minval, maxval+1))
>>
>> and you get O(1) lookup for free.
>>
>>> def is_valid2(lookup, category, value):
>>>     """Return True if value is member of a list of a given category, False otherwise."""
>>>     # this version also knows how to deal with floats.
>>>     try:
>>>         L = lookup[category]
>>>     except KeyError:
>>>         raise KeyError("Invalid category: %r" % category)
>>>
>>>     adjusted_value = value if int(value) in (L[0], 0, L[-1]) else math.ceil(value)
>>>     try:
>>>         chopfunc = bisect.bisect_right if value < L[0] else bisect.bisect_left
>>>         return L[chopfunc(L, value)] == adjusted_value
>>>     except IndexError:
>>>         return False
>>
>> I don't understand what this is supposed to do:
>>
>> (1) for bisect to work correctly the list has to be sorted. Your
>> get_valid_value_lookup() doesn't do that
>
>
> awww, good point. The rows are currently sorted in the database, but it is still a simple "just in case" thing to sort the data.
>
>
>> (2) Why/how are L[0], 0, and L[-1] special?
>
>
> No idea actually. You've spotted the bug I *deliberately* added? :-))
>
>
>> (3) Given a sorted list L there should be no need to bisect_whatever for
>> value < L[0]
>>
>>
>>> [3] I am inheriting this system. It feels a bit strange that these
>>> range-check values are stored in a database.
>>>
>>> Would yaml be a better choice? Some of the tables are close to 200
>>> records.
>>
>> Unless you're encountering an actual inconvenience just keep it like it is.
>> Yea, I'm a lazy bastard ;)
>
>
> :-) I would like to store the validation and check functions in a central
> location where they can easily be (re)used and maintained. For example, we
> also have to do a modulus-11 check, and I am really reluctant to have
> multiple functions for the same check (e.g. one as an SQL CHECK constraint,
> one in Python, perhaps even one in JavaScript).
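
(For reference, a minimal weighted modulus-11 sketch; the weights below follow the common "elfproef" scheme, but the exact weighting is application-specific, so treat it as an assumption:)

def mod11_ok(number):
    # hypothetical weights len(digits)..1; the real scheme may differ
    digits = [int(d) for d in str(number)]
    weights = range(len(digits), 0, -1)
    return sum(w * d for w, d in zip(weights, digits)) % 11 == 0

assert mod11_ok(123456789)       # 9*1 + 8*2 + ... + 1*9 = 165 = 15 * 11
assert not mod11_ok(123456788)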
Albert-Jan Roskam
2015-08-28 10:17:04 UTC
----------------------------------------
> Subject: Re: value range checker
> To: ***@hotmail.com
> CC: ***@python.org
> From: ***@gmail.com
> Date: Thu, 27 Aug 2015 12:41:45 -0400
>
> On 2015-08-26 09:19, Albert-Jan Roskam wrote:
>> For example, the category-min-max tuples
>>
>> ("cat a", 1, 1),
>>
>> ("cat a", 3, 3),
>>
>> ("cat a", 6, 10),
>>
>> correspond to a range of category A of 1-1, 3-3, 6-10, which is the same as 1, and 3, and 6, 7, 8, 9, 10.
>
> Hi-
>
> An interval tree ( https://en.wikipedia.org/wiki/Interval_tree ) is the
> typical choice of data structure for storing and searching sets of
> intervals, and the bx-python package (
> https://bitbucket.org/james_taylor/bx-python ) has a high-quality Cython
> implementation of one. I have used that interval tree implementation for
> storing intervals of genome coordinates, and it worked very well. I
> don't remember whether that implementation deals with single points very
> well (i.e. start and end are the same value) -- some interval tree
> implementations handle that well and some do not. It shouldn't be much
> of a tweak if not, though.
>
> Like almost any tree-based index structure with O(log n) performance, it
> might only be worth using this when the number of intervals grows large
> enough for a linear search to become a bottleneck.


Hi Matt -- thanks, I did not know this. Interesting. It looks a bit like overkill for what I am trying to do (performance is not a bottleneck), but I will certainly look at it in more detail!
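
(For concreteness, the point-lookup idea as a naive linear scan; an interval tree answers the same query in O(log n) instead of O(n):)

# Naive point lookup over (start, end, data) triples; an interval tree
# replaces the scan with a logarithmic search.
intervals = [(1, 1, "cat a"), (3, 3, "cat a"), (6, 10, "cat a")]

def overlapping(point):
    return [data for start, end, data in intervals if start <= point <= end]

assert overlapping(7) == ["cat a"]
assert overlapping(2) == []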


regards,

Albert-Jan