The other day, I wanted to iterate over all entities in our App Engine datastore. Naturally, MapReduce came to mind, so I googled “app engine python mapreduce”. And I got the same things you did: the project source code on GitHub with convoluted examples and links to an outdated, gazillion-page-long tutorial.

I started following the tutorial, but the first link in it pointed to a non-existent SVN repository :facepalm: This shouldn’t be so agonizing, right? I mean, someone took the time to write the library and already did all the heavy lifting: this should be easier. And indeed, easier it is. I’m here to tell you all about it.

This post deals with the mapper aspect of MapReduce. I might write a follow-up dealing with filters, callbacks and reduce if there’s demand :) I assume you’re familiar with Python, App Engine and some basic MapReduce concepts. It should take around 15-25 mins to complete from scratch. Let’s get to it.

 

<TL;DR>

  1. Download the example app source.
  2. Run the example app with App Engine SDK.
  3. Open http://your_example_app_url/populate_db to put some Dummy entities into the datastore.
  4. Open the datastore viewer and note that all Dummy entities have counter values of 0.
  5. Open http://your_example_app_url/start_job to kick off the MapReduce job.
    • You can monitor the MapReduce progress via the logs
  6. Open the datastore viewer again and note that all Dummy entities now have counter values of 1.
  7. Be amazed and humbled.

</TL;DR>

 

Step 0 – The Infrastructure

Grab the source for the mapper example here.

Notes:

  • You will need to run the PopulateHandler (see Step 2) to have data to map (i.e. iterate over).
  • There’s a graphy lib in the app’s root. Keep it there: MapReduce ships with a built-in UI that depends on graphy, and your app won’t run without this dependency.

Take a look at the app contents. There are (only!) 4 items that don’t come with a default App Engine app. Let’s look at them and see how to run a MapReduce job:

  • mapreduce folder (duh) – Copied from the python/src folder of the original MapReduce project. It should be in the app’s root.
  • graphy folder – MapReduce depends on it for its built-in UI. It should be in the app’s root.
  • dummy.py – The Dummy model we’re going to iterate over. It could be any of your own models.
  • tasks.py – All the logic of running a MapReduce job and processing the dummy entities is here.
  • app.yaml – I know, this one comes by default. But, it needs tweaking for MapReduce to work, so it’s in the list too 😛

 

Step 1 – The Handlers

First things first: let’s look at app.yaml. There are three entries you need to add in order to make the example work. In your production environment, it could be as few as one.

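The original post showed the file as a screenshot. Here’s a minimal sketch of app.yaml with the MapReduce alterations; the tasks.app script references are my assumption (a webapp2 WSGI app living in tasks.py, see Step 2), and the application name is a placeholder:

```yaml
application: mapper-example  # placeholder; use your own app id
version: 1
runtime: python27
api_version: 1
threadsafe: true

handlers:
# 1. Custom: fills the datastore with Dummy entities (see Step 2)
- url: /populate_db
  script: tasks.app

# 2. Custom: kicks off the MapReduce job (see Step 4)
- url: /start_job
  script: tasks.app

# 3. Copy-paste as is: handles the inner calls MapReduce generates
#    while running a job (and serves its built-in UI)
- url: /mapreduce(/.*)?
  script: mapreduce.main.APP
  login: admin
```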

 

Items 1 & 2 are custom (i.e. app-dependent). The first populates the datastore; you won’t need it if the entities are already there. The second kicks off the MapReduce job; you could do that via cron or trigger it from some business logic instead. The third is critical and should be copy-pasted as is, because it handles all the inner calls that MapReduce generates while running a job.

 

Step 2 – The Data

So that MapReduce will have data to map (i.e. iterate over), we’re going to populate the datastore with entities of the Dummy model. The Dummy model is very simple: it contains only one field (counter). We set it to 0 by default, and the mapper will increase it every time an entity is processed. More on that later. This is the Dummy model and it resides in dummy.py:

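The code screenshot didn’t survive, so here’s a reconstruction. The single counter field and its default of 0 come straight from the post; I’m assuming ndb here (the classic db.Model would work just as well):

```python
from google.appengine.ext import ndb


class Dummy(ndb.Model):
    """A deliberately dumb model: a single counter, 0 by default."""
    counter = ndb.IntegerProperty(default=0)
```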

Simple, elegant, dumb. Exactly what we need.

Now for the PopulateHandler (in tasks.py), which populates the datastore with entities we can map (i.e. iterate over):

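This screenshot is gone too, so here’s a sketch that matches the behavior described below. The count query parameter (defaulting to 100) is my assumption; the post only says the handler takes a number and that each visit adds 100 entities:

```python
import webapp2
from google.appengine.ext import ndb

from dummy import Dummy


class PopulateHandler(webapp2.RequestHandler):
    """Creates a batch of Dummy entities for the mapper to iterate over."""

    def get(self):
        # 100 per request by default; a huge count would hit
        # App Engine's 60-second request deadline
        count = int(self.request.get('count', 100))
        ndb.put_multi([Dummy() for _ in range(count)])
        self.response.write('Created %d Dummy entities.' % count)


# The url->handler configuration mentioned below;
# '/start_job' joins this list in Step 4
app = webapp2.WSGIApplication([
    ('/populate_db', PopulateHandler),
])
```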

No Nobel Prize for mathematics here either. Simply takes a number and creates as many Dummy entities in the datastore as requested. Note: if you provide a very large number you will get a request timeout, as App Engine limits request times to 60 seconds.

You’ll notice the url->handler configuration at the bottom of the file, specifically ‘/populate_db’. So, to populate your datastore, run the App Engine app (locally or in the cloud) and navigate to: http://your_example_app_url/populate_db

Each time you open this URL, 100 Dummy entities will be added to your datastore. After you’ve populated your datastore via “populate_db”, take a look at the Dummy entities in the datastore viewer and note that their counters are all 0.


 

Step 3 – The Mapper Function

The mapper function gets called for every datastore entity that the MapReduce job finds. In our case the mapper is very simple: it increments the Dummy entity’s counter by 1, saves it, and logs the new counter value.

Mapper (found in tasks.py):

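Since the screenshot is lost, here’s a sketch that does exactly the three things described above. The function name process is my choice; whatever you call it has to match the handler_spec passed to the job in Step 4. (The library also lets you yield datastore mutation operations instead of calling put() directly, which batches the writes, but a plain put() keeps the example simple.)

```python
import logging


def process(entity):
    """Called once for every Dummy entity the MapReduce job finds."""
    entity.counter += 1
    entity.put()
    logging.info('Dummy %s counter is now %d',
                 entity.key.id(), entity.counter)
```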

Again, very simple. Note that it’s not part of a class. It could be, but for the sake of simplicity it’s a standalone function.

 

Step 4 – The Glue

Now all that’s left is to glue our puzzle pieces together and actually run the MapReduce job (which in our case only maps and doesn’t reduce). Let’s review the function that configures and starts the job. You can find it in tasks.py under StartHandler:

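Reconstructed from the argument walkthrough below; the job name, shard count and processing rate are illustrative values, and ‘tasks.process’ must point at whatever you named the mapper in Step 3. The numbered comments match the list that follows:

```python
import webapp2

from mapreduce import control


class StartHandler(webapp2.RequestHandler):
    """Configures and kicks off the mapper-only MapReduce job."""

    def get(self):
        job_id = control.start_map(
            name='Increment all Dummy counters',  # 1
            handler_spec='tasks.process',         # 2
            reader_spec='mapreduce.input_readers.DatastoreInputReader',  # 3
            mapper_parameters={                   # 4
                'entity_kind': 'dummy.Dummy',
                'processing_rate': 100,
            },
            shard_count=8,                        # 5
            queue_name='default',                 # 6
        )
        self.response.write('Started MapReduce job %s' % job_id)
```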

control.start_map() kicks off the MapReduce job (don’t forget its import: from mapreduce import control). This is the beefiest function so far and it’s also the last one. It really is much simpler than it looks. Let’s review the arguments:

  1. name – An arbitrary string that describes the job to the coder.
  2. handler_spec – This is important: it defines the mapper function (see above) that processes the entities.
  3. reader_spec – The type of the input reader. As long as you’re mapping over datastore entities, stick with the datastore input reader.
  4. mapper_parameters – A dictionary with more params for the mapping process:
    • entity_kind – This is important: the path to the entity kind you’re mapping.
    • processing_rate – Caps how many entities the job will process per second, spread across all the shards.
  5. shard_count – How many shards will run simultaneously.
  6. queue_name – You can define a queue for any of your mapreduce jobs. For simplicity’s sake, I use the default.

To run the MapReduce job, go to http://your_example_app_url/start_job

You can monitor the progress of the MapReduce job in the logs (if you’re running locally) or in the Task queues tab of your App Engine project dashboard in the dev console. The library also ships with a built-in status UI (the one graphy powers), served at /mapreduce/status.

Was this tutorial easy to follow? Did you deploy your first mapper job successfully? I love feedback and especially the kind that I can act on :)

Written by Yuri Shmorgun
Yuri is WiseStamp’s head of development. He loves jogs on the beach and server-side coding.
