New User? Register here - Existing Users: Username: Password: [Advanced Login]

 

 

Current Poll

Your preferred Interactive shell?









( 1388 votes ~ 15 comments )

 

Weblog entry #21 for e5z8652

MySQL datetime from date, time
Posted by e5z8652 on Tue 16 Feb 2010 at 22:16
Tags: none.
Say you have a MySQL table called "traffic" representing http requests through a web proxy with a date field (i.e. 2010-02-13) and a time field (i.e. 22:40:23) for each request. Then say you want to select records with a range using a datetime (i.e. 2010-02-13 22:40:23).

Let's say we want to see records for all traffic between 4:30 in the afternoon of January 1st and 5 in the afternoon on January 2nd, a 24 1/2 hour period.

If you write a select statement like this:

SELECT * FROM traffic WHERE date BETWEEN '2010-01-01' AND '2010-01-02' AND time BETWEEN '16:30:00' AND '17:00:00';
7573 rows in set (1.63 sec)



you will get all of the records from January 1st for the half hour between 16:30 and 17:00, and all of the records from January 2nd for the half hour between 16:30 and 17:00.

That's not what we wanted.

However, we can use a function to combine the date and time:

SELECT * FROM traffic WHERE DATE_ADD(date, INTERVAL time HOUR_SECOND) BETWEEN '2010-01-01 16:30:00' AND '2010-01-02 17:00:00';



The DATE_ADD function lets us turn the separate date and time fields into a single datetime field. Yay! It works great in testing with a small sample database, but there is one small caveat:

mysql> SELECT COUNT(*) FROM traffic;
+-----------+
| COUNT(*) |
+-----------+
| 186208916 |
+-----------+
1 row in set (0.00 sec)



That is 186 million rows in the table, representing proxy traffic for the past ten months or so. Going the DATE_ADD route means that MySQL will attempt the conversion on all 186 million rows of the table, and then compare the results to the BETWEEN statement. This takes a LONG TIME. Like a half hour on my hardware. Given that this SELECT statement is used in a cgi script that feeds a web page, a half hour is a bit long to ask the user to wait.

But we can fix this a little:

SELECT * FROM traffic WHERE date BETWEEN '2010-01-01' AND '2010-01-02' AND DATE_ADD(date, INTERVAL time HOUR_SECOND) BETWEEN '2010-01-01 16:30:00' AND '2010-01-02 17:30:00';
146948 rows in set (2.72 sec)



Now MySQL limits the results using the date value, and only does the calculation on the records for January 1st and January 2nd. Much faster at just under three seconds!

But is this really the most efficient way to do this? I can't change the database schema so the individual date and time fields must remain the way they are. What I'm looking for is a way to request a date/time range in a more efficient way than the one above. Any suggestions?

 

Comments on this Entry

Posted by treeboy (78.86.xx.xx) on Wed 17 Feb 2010 at 12:37
[ Send Message ]
Naively, I thought that you would just be able to create a view and index that, but I didn't realise that you cannot create indexes on views in MySQL - d'oh!

Are you allowed to add triggers that populate data in a new database with the values you want ? You could then add tables to the new database that reference this table and have the columns you need to index.

Given that you are not allowed to modify the schema, are you allowed to make a read-only replica ? You could add a new table that just has a foreign key to this table with just index the NEW columns that you need.

The problem would seem that you are trying to perform OLAP on a DB that you are not allowed to modify for the compromise of performing both transaction processing and reporting: this is not practical.

I once worked at a place where their two 40GB databases (in the early 90's) were created for transaction processing and so their overnight processes were taking almost 10 hours! Backups weere getting very difficult to fit in.

I mentioned this to the main dev team and their management and stakeholders and eventually got the go ahead to modify these schema: the result was that the transaction processing saw no negativie change in performance and the overnight jobs went down to less than an houor.

[ Parent | Reply to this comment ]

Posted by treeboy (78.86.xx.xx) on Wed 17 Feb 2010 at 12:49
[ Send Message ]

I hadn't read your query particularly well either:

SELECT
	*
FROM
	traffic
WHERE
	(
		date = '2010-01-01'
		AND time > '16:30:00'
	) OR (
		date = '2010-01-02'
		AND time < '17:00:00'
	); 

[ Parent | Reply to this comment ]

Posted by tinus (82.93.xx.xx) on Wed 17 Feb 2010 at 13:36
[ Send Message ]
the "OR" needs to be an "AND".
I'm curious how fast this query is.

[ Parent | Reply to this comment ]

Posted by tinus (82.93.xx.xx) on Wed 17 Feb 2010 at 13:40
[ Send Message ]
noooo. now I'm not reading this query well either.
I was thinking of this:
SELECT * FROM traffic WHERE (date > '2010-01-01' AND time > '16:30:00') AND (date < '2010-01-02' AND time < '17:00:00');
curious as to which one is faster.
I bet your one is.

[ Parent | Reply to this comment ]

Posted by treeboy (78.86.xx.xx) on Wed 17 Feb 2010 at 13:49
[ Send Message ]
Sorry: that query will alwaays return no rows: the date cannot be > '2010-01-01' and < '2010-01-02'.

[ Parent | Reply to this comment ]

Posted by tinus (82.93.xx.xx) on Wed 17 Feb 2010 at 14:03
[ Send Message ]
SELECT * FROM traffic WHERE (date >= '2010-01-01' AND time >= '16:30:00') AND (date <= '2010-01-02' AND time <= '17:00:00');

anyway, it's a stupid idea and will be slower than yours.
(unless the interval spans several days perhaps)

[ Parent | Reply to this comment ]

Posted by treeboy (78.86.xx.xx) on Wed 17 Feb 2010 at 14:14
[ Send Message ]
This is the query that the OP originally had, it will only get the traffic between those times on both days: i.e. only ships between 16:30 and 17:00 on each the two days.

[ Parent | Reply to this comment ]

Posted by tinus (82.93.xx.xx) on Wed 17 Feb 2010 at 14:20
[ Send Message ]
right. I must have lost my marbles.

[ Parent | Reply to this comment ]

Posted by treeboy (78.86.xx.xx) on Wed 17 Feb 2010 at 14:05
[ Send Message ]
I should add, to extrapolate, if you wanted to include everthing from 16:30 on the 1st Jan to 17:00 on 5th Jan:
SELECT
	*
FROM
	traffic
WHERE
	(
		date = '2010-01-01'
		AND time > '16:30:00'
	) OR (
		date IN ( '2010-01-02', '2010-01-03', '2010-01-04' )
	) OR (
		date = '2010-01-05'
		AND time < '17:00:00'
	); 

[ Parent | Reply to this comment ]

Posted by e5z8652 (209.112.xx.xx) on Wed 17 Feb 2010 at 19:39
[ Send Message | View Weblogs ]

Working with the live data, I don't really want the * returned. So here's what I used, moving to February to get Monday-Friday data:

SELECT 
     date, time, resultCode, url, authuser 
FROM 
     traffic 
WHERE 
     authuser = 'john_doe' 
     AND 
          (( date = '2010-02-01' AND time > '16:30:00' ) 
     OR 
          ( date IN ('2010-02-02', '2010-02-03', '2010-02-04' )) 
     OR 
          ( date = '2010-02-05' AND time < '17:00:00' ));

3657 rows in set (18.90 sec)

And:

SELECT 
     date, time, resultCode, url, authuser 
FROM 
     traffic 
WHERE 
     authuser = 'john_doe' 
     AND 
          date BETWEEN '2010-02-01' AND '2010-02-05' 
     AND 
          DATE_ADD(date, INTERVAL time HOUR_SECOND) BETWEEN '2010-02-01 16:30:00' AND '2010-02-05 17:00:00';

3657 rows in set (21.93 sec)

So the non-computational way is faster. Thank you! I have a very basic understanding of SQL and it never occurred to me to group the date & time with parenthesis.

But this is a cgi script written with Perl CGI, so the trick will be to create the SQL query string where there will be arbitrary numbers of intermediate dates. That makes the query string change in the script.

But that's a whole other ball of wax, and will be easy enough to just do a date diff between the start & end dates.

Thanks again to both of you!

[ Parent | Reply to this comment ]

Posted by treeboy (78.86.xx.xx) on Wed 17 Feb 2010 at 23:03
[ Send Message ]
Hi, Mr/Ms e.

That difference is really very disappointing. I would expect much more as one query is doing a table scan and the other should be using index(es).

Are you running them one after the other ? In which case, it may be a caching thing making the second one seem faster. When you are testing these things, you should run them three times each (and ideally interleaved or after a restart) so that you do not get caching issues.

If that is not the problem, then you may not have all the indexes that you require: are you allowed to add indexes ?

To check to see what indexes (if any) you are using in your query run exactly the same query, but prefix it with the keyword EXPLAIN.

If they are all there but they are not being used, then it may be that the query optimiser is not using all of the ones that it can and you may be able to force it by including criteria that do not really mean anything. For example: you could add an arbitrary "and time >= '00:00;00'" to the "date IN" clause which may hint the optimiser to use the same columns as the other two date/time parts.

Or you could look at explicitly using optimiser hints which are an even greater in depth tool.

Finally: I really do recommend that you do not hide brackets in the style of SQL that you write: but that really is an entirely different kettle of fish ;-) And probably quite rude in the way that I have just phrased it... Sorry.

[ Parent | Reply to this comment ]

 

 

Flattr