Written by Yuri Shmorgun
Yuri is WiseStamp’s head of development. He loves jogs on the beach and server-side coding.
The other day, I wanted to iterate over all entities in our App Engine datastore. Naturally, MapReduce came to mind, so I googled “app engine python mapreduce”. And I got the same things you did: the project source code on GitHub with convoluted examples and links to an outdated, gazillion-page-long tutorial.
I started following the tutorial, but the first link in it pointed to a non-existent SVN repository :facepalm: This shouldn’t be so agonizing, right? I mean, someone took the time to write the library and already did all the heavy lifting, so this should be easier. And indeed, easier it is. I am here to tell you all about it.
This post deals with the mapper aspect of MapReduce. I might write a follow-up dealing with filters, callbacks and reduce if there’s demand 🙂 I assume you’re familiar with Python, App Engine and some basic MapReduce concepts. It should take around 15-25 mins to complete from scratch. Let’s get to it.
Grab the source for the mapper example here.
Take a look at the app contents. There are (only!) 4 items that don’t come with a default App Engine app. Let’s look at them and see how to run a MapReduce job:
First things first: let’s look at app.yaml. There are three entries you need to add in order to make the example work. In your production environment, it could be as few as one.
Items 1 & 2 are custom (i.e. app-dependent). The first populates the datastore (which you won’t need if the entities are already in your datastore). The second kicks off the MapReduce job, which is also custom. You could do that with cron or trigger it via some business logic. The third is critical and should be copy-pasted as is, because it handles all the inner calls that MapReduce generates while running a job.
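To see why the third entry matters, here is a small sketch of how the three handler entries would divide up incoming requests. It is only an illustration: the script names `tasks.app` and `mapreduce.main.APP`, the `/mapreduce(/.*)?` pattern, and the `/mapreduce/worker_callback` path are assumptions based on common library setups, and the `route` helper is hypothetical. It also assumes each pattern has to match the whole request path.

```python
import re

# Hypothetical app.yaml handler entries, in order: (url pattern, script).
# The first two are the custom, app-dependent entries; the third is the
# catch-all assumed to serve the MapReduce library's own internal requests.
HANDLERS = [
    (r"/populate_db", "tasks.app"),
    (r"/start_job", "tasks.app"),
    (r"/mapreduce(/.*)?", "mapreduce.main.APP"),
]


def route(path):
    """Return the script of the first handler whose pattern matches the whole path."""
    for pattern, script in HANDLERS:
        if re.match("^(?:%s)$" % pattern, path):
            return script
    return None


for path in ["/populate_db", "/start_job", "/mapreduce/worker_callback"]:
    print("%s -> %s" % (path, route(path)))
```

Without the third entry, every internal request the job makes to itself would fall through and find no handler, so the job would never progress.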
So that MapReduce will have data to map (i.e. iterate over), we’re going to populate the datastore with entities of the Dummy model. The Dummy model is very simple: it contains only one field (counter). We set it to 0 by default, and the mapper will increase it every time an entity is processed. More on that later. This is the Dummy model and it resides in dummy.py:
Simple, elegant, dumb. Exactly what we need.
Now for the PopulateHandler (in tasks.py), which populates the datastore with entities we can map (i.e. iterate over):
No Nobel Prize for mathematics here either. The handler simply takes a number and creates that many Dummy entities in the datastore. Note: if you provide a very large number you will get a request timeout, as App Engine limits request times to 60 seconds.
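Here is a minimal sketch of the populate step, written against an in-memory list so it runs outside App Engine. `FAKE_DATASTORE`, `populate` and `batch_size` are hypothetical names, and the `Dummy` class is a plain-Python stand-in for the real datastore model. In the actual handler, each chunk would go to the datastore as one batch write instead of being appended to a list.

```python
class Dummy(object):
    """Stand-in for the Dummy model: a single counter field, 0 by default."""

    def __init__(self, counter=0):
        self.counter = counter


# Hypothetical in-memory replacement for the datastore.
FAKE_DATASTORE = []


def populate(count, batch_size=50):
    """Create `count` Dummy entities, writing them in chunks of `batch_size`."""
    created = 0
    while created < count:
        chunk_size = min(batch_size, count - created)
        chunk = [Dummy() for _ in range(chunk_size)]
        FAKE_DATASTORE.extend(chunk)  # in the app: one batch write per chunk
        created += chunk_size
    return created


populate(100)
```

Keeping the number per request modest, and calling the URL several times if you need more entities, is the simplest way to stay under the request deadline.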
You’ll notice at the bottom of the file the url->handler configuration. Specifically ‘/populate_db’. So, to populate your datastore, run the App Engine app (locally or in the cloud) and navigate to: http://you_example_app_url/populate_db
Each time you open this URL, 100 Dummy entities will be added to your datastore. After you’ve populated your datastore using “populate_db”, take a look at the Dummy entities. Note that their counter is zero.
The mapper function gets called for every datastore entity that the MapReduce job finds. In our case the mapper is very simple: it increments the Dummy entity’s counter by 1, saves the entity, and logs the new counter value.
Mapper (found in tasks.py):
Again, very simple. Note that it is not part of a class. It could be, but for the sake of simplicity it’s a standalone function.
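For reference, a mapper along these lines can be sketched as follows. To keep it runnable anywhere, `Dummy` is again a plain-Python stand-in whose `put()` only counts saves (the `saves` attribute is hypothetical); in the app, the entity would be a real datastore model and `put()` would write it back. The loop at the bottom imitates what the MapReduce job does for you across shards.

```python
import logging


class Dummy(object):
    """Stand-in for the datastore model; put() only records that a save happened."""

    def __init__(self, counter=0):
        self.counter = counter
        self.saves = 0

    def put(self):
        # In the app, put() would write the entity back to the datastore.
        self.saves += 1


def mapper(entity):
    """Called once per entity: bump the counter, save, log the new value."""
    entity.counter += 1
    entity.put()
    logging.info("Dummy counter is now %d", entity.counter)


entities = [Dummy() for _ in range(3)]
for entity in entities:  # the MapReduce job performs this iteration for you
    mapper(entity)
```

Run the job a second time and every counter goes to 2, which makes it easy to check that each entity was visited exactly once per run.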
Now all that’s left is to glue our puzzle pieces together and actually run the MapReduce job (which in our case only maps and doesn’t reduce). Let’s review the function that configs and starts a job. You can find it in tasks.py under StartHandler:
control.start_map() kicks off the MapReduce job (don’t forget its import: from mapreduce import control). This is the beefiest function so far and it’s also the last. It really is much simpler than it looks. Let’s review the arguments:
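As a rough guide, a configuration for this job might look like the sketch below. Treat it as an assumption: the keyword names (`name`, `handler_spec`, `reader_spec`, `mapper_parameters`, `shard_count`) reflect my understanding of the `control.start_map` signature, so check them against the library version you install. The job name and shard count are made-up values, and `build_job_config`, `start_job` and `fake_start_map` are hypothetical helpers that let the snippet run without the library.

```python
def build_job_config():
    """Keyword arguments for control.start_map(); all values are assumptions."""
    return {
        # Human-readable job name, shown wherever the job is listed.
        "name": "Increment Dummy counters",
        # Dotted path to the mapper function (module tasks, function mapper).
        "handler_spec": "tasks.mapper",
        # Input reader that feeds datastore entities to the mapper.
        "reader_spec": "mapreduce.input_readers.DatastoreInputReader",
        # Tells the reader which model to iterate over (module dummy, class Dummy).
        "mapper_parameters": {"entity_kind": "dummy.Dummy"},
        # How many parallel shards to split the work into.
        "shard_count": 4,
    }


def start_job(start_map):
    """Start the job with the given start function (control.start_map in the app)."""
    return start_map(**build_job_config())


def fake_start_map(name, handler_spec, reader_spec, mapper_parameters,
                   shard_count=None):
    """Stand-in for control.start_map so the sketch runs anywhere."""
    return "fake-job-id"


job_id = start_job(fake_start_map)
```

Inside StartHandler you would pass the real `control.start_map` instead of the stand-in and keep the returned job ID if you want to look the job up later.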
To run the MapReduce job, go to http://you_example_app_url/start_job
You can monitor the progress of the MapReduce job in the logs (if you’re on a local machine) or in the Task queues tab of your App Engine project dashboard in the dev console.
Was this tutorial easy to follow? Did you deploy your first mapper job successfully? I love feedback and especially the kind that I can act on 🙂