Percona Live: Do Not Press That Button

If I had to mention a single technical blog that I always find informative and have followed for many years, I would say without a doubt the Database Performance Blog from Percona. That is why I am so keen to attend this year’s Percona Live Open Source Database Conference in Dublin and present a lightning talk, “Do Not Press That Button”, on September 26th. You can find more about my short session on RDS here. Looking forward to Dublin!

The moment a project fails

The moment you realize a project you have been developing does not deliver can take different forms. It might even be accidental. For me the reality check was a tall de carrers sign in the streets of Barcelona.

A keen and slow runner, in the last few months I have been developing a tool to crawl and keep up to date running events for RaceBase World. The project was a mix of AWS Lambda, Scrapy and Python, able to collect over 25K races around the world and keep them up to date. Not an easy task.

The goal was simple: sometimes you are lucky enough to plan your holidays around a marathon abroad, possibly one of the largest events in the world. More often you plan your vacations or business trips and then simply wonder if there is a running event in the area.

I have now been in Barcelona for a few weeks. I have been looking for what I believed was an unlikely road race in summer, and my lovely crawler could not find any. Runner’s World Spain could not find one either.

Still a sign on the door of the building where I live is telling me that up to 5 thousand runners are going to run in Barcelona next Sunday for la Cursa Barça. No better way to prove that my global database is inefficient.

You can of course try to collect a million races worldwide, something that is very hard to achieve and will anyway generate too much noise for the end user. But having only a few thousand events globally will include only the large races (the ones a runner can find without any help from RaceBase World or any other website) and a few random local ones.

And the moment I cannot trust my own project to find a race, I can consider it a failure. But before working on a new idea or a new (local) approach to discover new races, it is time to join la Cursa Barça and forget Python.

From Macedonia to Codemotion

As the BBC recently reported, Matthew Nimetz has spent the last 23 years trying to find a name for the republic of Macedonia that can be accepted both in Skopje and Athens. But a solution for the Macedonia naming dispute has not been agreed yet.


What name should a developer use today when working on location-based services? The user-friendly Macedonia or the formal but longer The Former Yugoslav Republic of Macedonia? How can you make your users in Skopje and Athens both happy?

This is one of the examples I might use at the next Codemotion in Berlin. I do not expect to discuss in half an hour all the geopolitical challenges of targeting an international audience and their workarounds, but I am very excited to present the talk “The (accidental) political developer”.


You can find the abstract here and an introduction to the topic in my previous posts, The (accidental) political software developer and Location-based services and countries on AWS.

See you on October 12 & 13 at Kulturbrauerei!

Run Alexa Run

I have always been a slow but keen runner. And I have always loved joining running events as an excuse to travel and to have one more weekend on the road. What is now called a runcation.

While looking for one more reason to play with Alexa and AWS Lambda, developing a simple skill to find the next race in a country was an obvious choice.

Thanks to Matt, you can read more about the experiment on RaceBase World.

Alexa and RaceBase World

My next challenge?

Rely on Alexa’s results to choose my next race. With over 16,000 events, from 5K to 100 miles, in 168 countries on RaceBase World, a bug in the still very basic Lambda function might be costly: I am getting ready for a less ordinary race and an obscure destination.

Below is a short audio demo; the code is available on GitHub. The skill is currently available on the UK store only.

Location-based services and countries on AWS

A few weeks ago I covered my experience and challenges working with geolocation technologies for Funambol and RaceBase World, as the (accidental) political software developer.

How does it work using AWS services?

As the focus of this blog is AWS technologies, what about Amazon and their location-based services? Any way to keep those geolocation issues under control?

First of all let’s see which services Amazon offers that include geolocation capabilities.

AWS offers almost nothing. If we compare the products from AWS with Google Maps API or GeoNames, there is nothing yet that has geolocation capabilities. Of course you can run a third party AMI from the Marketplace, like MaxMind – GeoIP or IP2Location Geolocation. But that is just taking advantage of an EC2 instance; it is not a service directly provided by AWS.

Even Amazon Rekognition, a deep learning-based image analysis service, does not offer the full set of features that Google Vision has and that could potentially trigger location-based issues. If you upload a picture to Google Vision, thanks to Landmark Detection, you might end up with a location or a questionable place on the Earth. But if you upload a picture of the Eiffel Tower to Amazon Rekognition, the result is a disappointing but very safe 99.2% tower.


And you have something similar with a picture of the Western Wall in the Old City of Jerusalem. Definitely not so accurate, but not a result that can create many controversies.


What about Amazon user interfaces and products for end users? Many photo sharing and storage services like Google Photos, Apple Photos or OneMediaHub (a white label solution provided by Funambol) create tags according to the GPS coordinates (EXIF data) of the pictures uploaded by a user, with the challenges of defining the best tag for Hong Kong or determining whether Crimea is a Russian or Ukrainian territory. Prime Photos from Amazon does not. No feature, no issues.

Really nothing on Amazon or AWS?

Amazon of course still has to provide a user interface and a chance for the user to add and validate an address. They have their own interesting choices (no Kosovo, for example) and there might be some entries in the list that some users would dispute as countries; somehow Google’s approach of calling the drop-down “location” and not “country” feels safer. But that is hardly interesting for a developer.


To summarize, I never had to deal in the past with geolocation issues or controversies while working directly with AWS services or the Amazon platform.

But there is something new that might pose a potential challenge: the Alexa SDK and the built-in slot types that define how data in the slot is recognized and handled. And Amazon Lex, the “conversational interfaces for your applications, powered by the same deep learning technologies as Alexa”.

Alexa relies on slots, which are lists of values, many of them predefined by Amazon. For example, the slot AMAZON.Country is a ready-made list of (English) names of countries around the world, and AMAZON.DE_CITY provides recognition for German and world cities commonly used by speakers in Germany and Austria.
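To make the idea of a slot more concrete, here is a minimal sketch of how a Lambda function might read the country a user said from the incoming request. The intent name PopulationIntent and the slot name Country are hypothetical names chosen for the example, and the event is a trimmed-down version of the JSON a skill receives.

```javascript
// Read the value of a slot from an Alexa IntentRequest.
// Returns null when the request has no such slot or the slot is empty.
function getSlotValue(event, slotName) {
  const request = event && event.request;
  const intent = request && request.intent;
  const slots = intent && intent.slots;
  const slot = slots && slots[slotName];
  return slot && typeof slot.value === 'string' ? slot.value : null;
}

// Trimmed-down example request (intent and slot names are hypothetical).
const sampleEvent = {
  request: {
    type: 'IntentRequest',
    intent: {
      name: 'PopulationIntent',
      slots: { Country: { name: 'Country', value: 'Palestine' } },
    },
  },
};

console.log(getSlotValue(sampleEvent, 'Country')); // Palestine
```

Returning null instead of throwing lets the skill ask the user to repeat the country when Alexa could not fill the slot.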

Anything to be worried about? Let’s test it first.

Big World: Alexa and geolocation

After attending a presentation at Factory Berlin and a talk at the AWS Summit, both from Memo Döring and both very inspiring, I decided to build a simple dynamic skill for Alexa. The skill, called Big World, relies on data from Population.io, a project of the World Data Lab that aims to make demography accessible to a wider audience. Given a country name, the very simple skill returns today’s and tomorrow’s population using the values from the World Population API. Below are a screenshot and a short audio demo; the code is available on GitHub.

Screenshot_2017-05-25-21-37-27

 

Any problem?

Going back to the topic of this post, the only challenge was to match the names from the built-in Alexa slot to the countries of the Population.io API. The backend supports only values such as Arab Rep of Egypt, Islamic Republic of Iran, West Bank and Gaza or Hong Kong SAR-China. Names that it’s very unlikely a user is going to say while talking to a voice assistant like Amazon Echo and that require mapping.

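// Country name expected by the Population.io backend
var PS = 'West Bank and Gaza';
(...)
// Map the name a user is likely to say to the backend name
else if (countryName.toUpperCase() === 'PALESTINE') {
  countryName = PS;
}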
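As the number of special cases grows, the chain of else if branches could be replaced by a lookup table. The following is only a sketch of that alternative, not the code of the skill: the backend names are the ones mentioned in this post, while the spoken names used as keys and the function name are assumptions.

```javascript
// Hypothetical lookup table: spoken name (upper case) -> Population.io name.
const BACKEND_NAMES = {
  'PALESTINE': 'West Bank and Gaza',
  'EGYPT': 'Arab Rep of Egypt',
  'IRAN': 'Islamic Republic of Iran',
  'HONG KONG': 'Hong Kong SAR-China',
  'ENGLAND': 'United Kingdom',
};

// Return the backend name for a spoken country name,
// or the trimmed spoken name itself when no mapping is needed.
function toBackendCountry(spokenName) {
  const trimmed = String(spokenName).trim();
  const key = trimmed.toUpperCase();
  return Object.prototype.hasOwnProperty.call(BACKEND_NAMES, key)
    ? BACKEND_NAMES[key]
    : trimmed;
}

console.log(toBackendCountry('Palestine')); // West Bank and Gaza
```

Names without an entry in the table pass through unchanged, so the table only has to list the exceptions.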

But unless you type the name entirely wrong, there is really no big challenge and the answer still is not controversial as you are dealing only with the name and not the location itself. For example:

Q. Alexa ask Big World the population of Palestine.

A. You are not alone in this world. The population of West Bank and Gaza today is 4916233, tomorrow there will be 368 people more.

or in a simpler scenario:

Q. Alexa ask Big World how many people live in England.
A. The world population is growing as we speak. The population of United Kingdom today is 65473338, tomorrow there will be 1101 people more.

The answer might not be 100% accurate but it is the best approximation available using the data from Population.io. As for any vocal conversation, an audio interaction with a smart speaker is more forgiving than an incorrect point or country name on a website.

Of course some users might still not be able to find results for specific and perfectly valid country names but that is down to poor coding and logic in the Lambda function and not to the specific Alexa slot.
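The spoken answers in the two exchanges above boil down to two numbers: the population today and the population tomorrow. A minimal sketch of how such a sentence could be assembled follows; the function name is hypothetical, and the figure for tomorrow is simply today’s figure from the first example plus the 368 people mentioned in the answer.

```javascript
// Build the spoken sentence from today's and tomorrow's population.
function buildPopulationSpeech(country, today, tomorrow) {
  const diff = tomorrow - today;
  const change = diff >= 0
    ? `tomorrow there will be ${diff} people more`
    : `tomorrow there will be ${-diff} people fewer`;
  return `The population of ${country} today is ${today}, ${change}.`;
}

// 4916233 today, 4916233 + 368 = 4916601 tomorrow.
console.log(buildPopulationSpeech('West Bank and Gaza', 4916233, 4916601));
// The population of West Bank and Gaza today is 4916233, tomorrow there will be 368 people more.
```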

At the end…

Due to the lack of real location-based features, the services currently available on AWS do not present most of the challenges covered in the previous post. But for the very same reason they do not provide a solution or any help to the developer to address or mitigate them.

 

Finding marathon cheats using Amazon Rekognition

In the last few weeks many articles have been published about the problem of cheating in the biggest running events, for example Beijing marathon to use facial recognition in cheating crackdown.

There is apparently even a former marathoner and business analyst, Derek Murphy, who devotes his time to catching the cheats, as the BBC recently reported: The man who catches marathon cheats – from his home. The boom of the biggest international marathons and the growth of qualifying events for the most prestigious ones make cheating more likely, with bib-swappers who give their chips to a faster runner and bib-mules who carry more than one chip during the entire race.

Is it really cheating?

Let me start by saying that I am not a big fan of blaming “marathon cheats” in public forums. There are scenarios in which a runner might decide to take part in a race using someone else’s number, and most of them do not hurt the community or other runners. Qualifying for the UTMB or for the Boston Marathon at the expense of other runners is of course not one of them. There are a few hundred trail events that allow runners to collect points for the UTMB, and even more road marathons that can give you an official time good enough to go to Boston. You can find most of them on RaceBase World.


I have run almost 100 races in different countries in the last 15 years, I have volunteered at a dozen running events in Berlin alone, and I do not need face recognition to figure out that there are indeed a few runners in every race who are running with someone else’s number. And I never did anything to stop them.

Data privacy and marathon running

Before testing Amazon Rekognition as a tool to find cheats in marathons, we might want to discuss whether facial recognition technology is really a threat to privacy and how we can get the data without contacting the race organizer.

Unfortunately runners have been used to this for some time. While you might take good care of your data on-line, making sure not to post to social networks or removing geolocation information from your pictures, there is nothing you can really do if you are a runner to avoid your personal data being shared everywhere.

You run 42 kilometers with a number on your shirt and everyone can find all the pictures of a given runner and his personal data (often including date of birth and residence) on public websites. Last year I was able to figure out personal information about the girlfriend of an old schoolmate I have not met in 20 years simply by looking at the pictures and results of an old Berlin Marathon.

Is face recognition during a marathon actually possible?

Using facial recognition to crack down on cheating in a recreational event feels like using a machete to cut the salad, but does it actually deliver? Can Amazon Rekognition help in validating the results of qualifying races for UTMB or Boston?

As a first test, I took the images of the latest marathon I ran (OK, I barely finished walking), the Berlin Marathon 2016. And I of course compared the very first picture in the set associated with my number (the one before the start) with the last one, after crossing the finish line.

rekognition-marathon

Amazon Rekognition simply confirms something that runners have always known, a marathon changes you. After completing a 42K you are not the same person anymore.

Jokes aside, this is just a one picture test on one runner but highlights the challenges of tracking the runner along the race using face recognition alone. It might help combined with other technologies but it is likely to generate a significant number of false positives.

Testing a few more pictures from my previous races still available on-line (the non-existent data privacy for runners), I had mixed results. Some pictures match easily, others do not. Some are false positives, others are not. And I was definitely younger.

A random test with the London Marathon

Let’s instead limit the goal to confirm that a runner taking part in an event matches by gender and age with the category she was registered for. That’s the most common scenario of cheating that is impossible to cover using split times along the course.

Think about your younger and fitter cousin running your PB so you can qualify for the Boston Marathon, or your long-term running pal who collects a few points for you so you can qualify for the UTMB next year.

The London Marathon took place last Sunday, the largest spring marathon in Europe and one of the biggest in the world with New York, Berlin and Paris.

All the results of the race are of course available online, and there is a simple GET request that returns the data for a given bib number:

http://results-2017.virginmoneylondonmarathon.com/2017/?event=MAS&pid=search&search%5Bstart_no%5D=*****&search%5Bsex%5D=%25&search%5Bnation%5D=%25&search_sort=name

In the same way you can access all the pictures of all the runners on MarathonFoto and retrieve the pictures of a runner using again the start number and the last name you got from the previous request (the RaceOID is the one of the London Marathon 2017).

http://www.marathonfoto.com/index.cfm?RaceOID=19802017S3&LastName=****&BibNumber=*****
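The two URLs above can be generated for any runner with a couple of small helpers. This is a sketch that only builds the strings and does not send any request; the function names are made up for the example, and the parameter names and values are copied from the URLs shown above.

```javascript
// Build the results search URL for a given bib (start) number.
function resultsUrl(bibNumber) {
  const params = new URLSearchParams();
  params.append('event', 'MAS');
  params.append('pid', 'search');
  params.append('search[start_no]', String(bibNumber));
  params.append('search[sex]', '%');    // serialized as %25, as in the URL above
  params.append('search[nation]', '%'); // serialized as %25, as in the URL above
  params.append('search_sort', 'name');
  return 'http://results-2017.virginmoneylondonmarathon.com/2017/?' + params.toString();
}

// Build the MarathonFoto URL from the last name and the bib number.
function photosUrl(lastName, bibNumber) {
  const params = new URLSearchParams();
  params.append('RaceOID', '19802017S3'); // London Marathon 2017
  params.append('LastName', lastName);
  params.append('BibNumber', String(bibNumber));
  return 'http://www.marathonfoto.com/index.cfm?' + params.toString();
}

console.log(resultsUrl(10297));
```

URLSearchParams takes care of the percent-encoding of the square brackets and of the % sign, so the generated query string has the same form as the one shown above.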

We do not want (yet) to process 30K or 40K runners, so let’s use a very small sample to see how Amazon Rekognition works. Let’s use Random.org to get 10 numbers we can test.

10297

Three of them did not match any runner (not all numbers are assigned for the race) and one runner did not start at the event. What about the other 6 runners? We know the category (age range) and the gender for all of them.

- 31724 (18-39, male) 
- 10297 (18-39, male)
- 12471 (18-39, male) 
- 19412 (18-39, female) 
- 17970 (45-49, female) 
- 21095 (18-39, female)

Retrieving the first image of each runner’s set using the MarathonFoto URL, we can double check the runners by comparing the above data with the results of face recognition from Amazon Rekognition.

How did Amazon Rekognition score?

- 31724 (35-52, male 99.9%)
- 10297 (29-45, male 99.9%)
- 12471 (14-23, male 99.9%)
- 19412 (23-38, female 100%)
- 17970 (20-38, female 100%)
- 21095 (30-47, female 100%)

It did very well. All runners matched the expected gender.

There was a bit of luck, though: in one of the pictures Amazon Rekognition selected a different and incorrect runner in the corner of the photo. This will be the biggest challenge for an automatic bot: as there might be multiple runners in the same shot, we first need to match the bib number in the image with the race number, combining face recognition with the technology already used to map bib numbers to photos.
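One simple way to pick the right face when several runners share a shot is pure geometry: take the face that sits above the bib and is horizontally closest to it. The sketch below assumes that both the faces and the bib come with a bounding box expressed as ratios of the image size (Left, Top, Width, Height); how the bib box is obtained is left open, and the function names are hypothetical.

```javascript
// Center point of a bounding box given as ratios of the image size.
function centerOf(box) {
  return { x: box.Left + box.Width / 2, y: box.Top + box.Height / 2 };
}

// Pick the face above the bib whose center is horizontally closest to it.
// Returns null when no face is located above the bib.
function faceForBib(faces, bibBox) {
  const bib = centerOf(bibBox);
  let best = null;
  let bestDistance = Infinity;
  for (const face of faces) {
    const c = centerOf(face.BoundingBox);
    if (c.y >= bib.y) continue; // y grows downwards: skip faces below the bib
    const distance = Math.abs(c.x - bib.x);
    if (distance < bestDistance) {
      best = face;
      bestDistance = distance;
    }
  }
  return best;
}

// Two faces in the same shot: one in the corner, one right above the bib.
const faces = [
  { id: 'corner', BoundingBox: { Left: 0.85, Top: 0.1, Width: 0.1, Height: 0.1 } },
  { id: 'main', BoundingBox: { Left: 0.4, Top: 0.15, Width: 0.15, Height: 0.2 } },
];
const bibBox = { Left: 0.42, Top: 0.5, Width: 0.12, Height: 0.1 };

console.log(faceForBib(faces, bibBox).id); // main
```

This heuristic would still fail when runners overlap or the bib is folded, which is one more reason to parse several photos per runner.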

Did all runners match the expected age range as well? In short, yes.

The only failure (12471) is due to picking up the wrong face in the picture; once that is addressed, the result is correct too. Note as well that for runner 12471 the overlap of the age ranges is minimal. But even when the overlap is minimal, the correct age is in the range (you can find that information by searching for the athlete’s name on-line: 31724 is 39 and 21095 is 33).
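The check itself is little more than an interval comparison. Here is a minimal sketch, assuming the face analysis result carries an estimated age range and a gender in an object shaped like { AgeRange: { Low, High }, Gender: { Value, Confidence } }; the function names and the shape of the registered category are hypothetical.

```javascript
// Two closed intervals overlap when each one starts before the other ends.
function rangesOverlap(a, b) {
  return a.low <= b.high && b.low <= a.high;
}

// Compare the registered category with the estimate from face analysis.
function matchesCategory(registered, faceDetail) {
  const estimatedAge = { low: faceDetail.AgeRange.Low, high: faceDetail.AgeRange.High };
  return {
    gender: registered.gender.toLowerCase() === faceDetail.Gender.Value.toLowerCase(),
    age: rangesOverlap(registered.age, estimatedAge),
  };
}

// Runner 31724 as listed above: registered 18-39 male, estimated 35-52 male.
const registered31724 = { gender: 'male', age: { low: 18, high: 39 } };
const face31724 = { Gender: { Value: 'Male', Confidence: 99.9 }, AgeRange: { Low: 35, High: 52 } };

console.log(matchesCategory(registered31724, face31724)); // { gender: true, age: true }
```

An overlap of the two ranges is a weak test; with the date of birth from the organizer, the check could instead verify that the actual age falls inside the estimated range.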

Limitations

So we have a few limitations here that working with a race organizer can easily address:

  1. The London Marathon is sensible enough to publish the age category but not the year or date of birth. Data they of course have, and that many race organizers (wrongly) make public.
  2. The pictures are screenshots from the MarathonFoto site and are not the best quality (I did not pay for them).
  3. We need to parse multiple photos of each runner to have a significant confidence, and that might increase the cost of the solution.

Conclusions

Of course a very small sample of a few runners is not expected to catch cheaters (and I would not publish their data anyway), but it confirms that using face recognition to catch cheats in marathons is feasible, even if it most likely needs other tools too to reach a reasonable level of accuracy.

Running a full data validation for a big event requires just the collaboration of the race organizer, a few dollars, maybe an instance running Scrapy and a couple of AWS Lambda functions. But anyone today can create profiles of thousands of marathon runners around the world and verify their data. Whether that is good or not.

But I still believe that at the moment the claim of the Beijing Marathon is more a PR stunt than the real way they are going to address the issue.