On 2013/11/13 10:04, Stephane Bortzmeyer wrote:
> A colleague asks me if it is possible to have bulk access to public measurement data. (Properly: otherwise, he could do a for loop over all measurement IDs but I suspect it is not nice.)
> I find nothing in the Atlas documentation. Is it because I did not search hard enough or because this possibility does not exist?
No, there is no API for that. I wonder what it should look like.

Assuming that one bulk data download gets 100 Mbps (just a guess; you don't want bulk data downloads to slow down the rest of the system), downloading 1 Tbyte of data takes about a day. So downloading the entire Atlas data set may easily take a week or more. Connections that last for more than a week are not nice, so I guess the client should maintain a cursor and issue multiple requests. (That is very close to just iterating over all measurements.)

One other issue is whether the data gets stored or not. If you are going to store the data, then an API that can be used to update the stored data set makes the most sense. If you are not going to store the data, then running map/reduce on the server side makes the most sense, but I'm not sure whether proper sandbox environments exist to allow that to be done on the Hadoop cluster directly.

Philip
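
P.S. A rough Python sketch of the two points above: the transfer-time estimate and a client that keeps a cursor across several short requests. This is only an illustration, not an existing Atlas API. The names fetch_page and fake_fetch_page are hypothetical, the (records, next_cursor) return shape is an assumption, and 1 Tbyte is taken as 10^12 bytes.

```python
def transfer_days(size_bytes, rate_bits_per_s):
    """Days needed to move size_bytes at a sustained rate_bits_per_s."""
    seconds = size_bytes * 8 / rate_bits_per_s
    return seconds / 86400


def fetch_all(fetch_page, cursor=None):
    """Yield every record by calling fetch_page(cursor) repeatedly.

    fetch_page is a hypothetical callable returning (records, next_cursor);
    next_cursor is None once the data set is exhausted. Each call is a
    separate short request, so no connection has to stay open for days.
    """
    while True:
        records, cursor = fetch_page(cursor)
        yield from records
        if cursor is None:
            break


def fake_fetch_page(cursor, data=list(range(10)), page_size=4):
    """Stand-in for a server: pages through a small in-memory list."""
    start = cursor or 0
    end = start + page_size
    next_cursor = end if end < len(data) else None
    return data[start:end], next_cursor


if __name__ == "__main__":
    # 1 Tbyte at 100 Mbps: a bit under one day.
    print(round(transfer_days(1e12, 100e6), 2))
    # Three requests of up to 4 records each return all 10 records.
    print(list(fetch_all(fake_fetch_page)))
```

If the client saves the last cursor it received, it can resume after an interruption, and the same loop could later be used to fetch only the data added since the previous run.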