Architectures on the cloud: from traditional deployments to serverless architectures is the title of my talk at the SFI in Krakow. Looking forward to April 5th and a great conference in Poland! The full abstract is available here.
For the past three years, RDS has offered the option of running a MySQL database on the T2 instance type, currently from db.t2.micro to db.t2.large. These are low-cost standard instances that provide a baseline level of CPU performance, which depends on the instance size, with the ability to burst above the baseline using a credit approach.
What happens when you run out of credits?
There are two metrics you should monitor in CloudWatch: CPUCreditUsage and CPUCreditBalance. When your CPUCreditBalance approaches zero, your CPU usage is going to be capped and you will start having issues on your database.
It is usually better to have some alarms in place to prevent that, either checking for a minimum number of credits or spotting a significant drop in a time interval. But what can you do when you hit the bottom and your instance remains at the baseline performance level?
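As a rough sketch of the two alarm styles mentioned above (a minimum number of credits, or a significant drop in a time interval), the following builds the parameters for a CloudWatch alarm and checks a series of balance samples for a sharp fall. The function names, the alarm name, the 5-minute period and the thresholds are my own assumptions for illustration; CloudWatch publishes the metric as CPUCreditBalance in the AWS/RDS namespace.

```python
from typing import Any, Dict, List


def credit_balance_alarm(instance_id: str, min_credits: float) -> Dict[str, Any]:
    """Build keyword arguments for a CloudWatch alarm on CPUCreditBalance.

    The alarm name, period and statistic are illustrative choices.
    """
    return {
        "AlarmName": f"{instance_id}-low-cpu-credits",
        "Namespace": "AWS/RDS",
        "MetricName": "CPUCreditBalance",
        "Dimensions": [{"Name": "DBInstanceIdentifier", "Value": instance_id}],
        "Statistic": "Minimum",
        "Period": 300,  # seconds
        "EvaluationPeriods": 1,
        "Threshold": min_credits,
        "ComparisonOperator": "LessThanThreshold",
    }


def significant_drop(balances: List[float], max_drop: float) -> bool:
    """Return True if the balance fell by more than max_drop over the window.

    balances is ordered from oldest to newest sample.
    """
    if len(balances) < 2:
        return False
    return balances[0] - balances[-1] > max_drop


# With boto3 installed and credentials configured, the alarm could be created with:
#   boto3.client("cloudwatch").put_metric_alarm(**credit_balance_alarm("mydb", 20))
```

The drop check is deliberately naive: it only compares the first and last sample, so it suits a short window of recent datapoints rather than a full day of data.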
How to recover from a zero credit scenario?
The most obvious approach is to increase the instance class (for example from db.t2.medium to db.t2.large) or switch to a different instance type, for example a db.m4 or db.c3 instance. But it might not be the best approach when your end users are suffering: if you are running a Multi-AZ database in production, this is likely not the fastest option to recover, as it first requires a change of instance type on the passive master and then a failover to the new master.
You can instead try a simple reboot with failover: the credit balance you see on CloudWatch is based on the currently active host, and your passive master usually still has credits available as it is less loaded than the active one. As in the screenshot above, you might gain 80 credits without any cost and with a simple DNS change that minimizes the downtime.
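A minimal sketch of triggering the reboot with failover from code, assuming you use boto3 and pass in an RDS client. The helper name and the instance identifier in the usage comment are hypothetical; the call itself is the standard RebootDBInstance operation with the ForceFailover flag, which only applies to Multi-AZ instances.

```python
from typing import Any


def reboot_with_failover(rds_client: Any, instance_id: str) -> Any:
    """Reboot a Multi-AZ RDS instance and force a failover to the standby.

    rds_client is expected to behave like boto3.client("rds").
    """
    return rds_client.reboot_db_instance(
        DBInstanceIdentifier=instance_id,
        ForceFailover=True,
    )


# With boto3 installed and credentials configured:
#   import boto3
#   reboot_with_failover(boto3.client("rds"), "my-production-db")
```

Passing the client in as a parameter keeps the helper easy to try against a stub before pointing it at a production database.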
Do I still need to change the instance class?
Yes. Performing a reboot with failover is simply a way to reduce your recovery time when having issues related to the capped CPUs and gain some time. It is not a long term solution as you will most likely run out of credits again if you do not change your instance class soon.
To summarize, triggering a failover on a Multi-AZ RDS instance running on T2 is usually a faster way to gain credits than immediately modifying the instance class.
In the last few weeks many articles have been published about the problem of cheating in the biggest running events, for example “Beijing marathon to use facial recognition in cheating crackdown”.
There is apparently even a former marathoner and business analyst, Derek Murphy, who devotes his time to catching the cheats, as the BBC recently reported: “The man who catches marathon cheats – from his home”. The boom of the biggest international marathons and the growth of qualifying events for the most prestigious ones make cheating more likely, with bib-swappers who give their chips to a faster runner and bib-mules who carry more than one chip during the entire race.
Is it really cheating?
Let me start by saying that I am not a big fan of blaming “marathon cheats” in public forums. There are scenarios where a runner might decide to take part in a race using someone else’s number, and most of them do not hurt the community or other runners. Qualifying for the UTMB or for the Boston Marathon at the expense of other runners is of course not one of them. There are a few hundred trail events that allow runners to collect points for the UTMB, and even more road marathons that can give you an official time good enough to go to Boston. You can find most of them on RaceBase World.
I have run almost 100 races in different countries in the last 15 years, I have volunteered at a dozen running events in Berlin alone, and I do not need face recognition to figure out that there are indeed a few runners in every race who are running with someone else’s number. And I never did anything to stop them.
Data privacy and marathon running
Before testing Amazon Rekognition as a tool to find cheats in marathons, we might want to discuss whether facial recognition technology is really a threat to privacy and how we can get the data without contacting the race organizer.
Unfortunately, runners have been used to this for some time. While you might take good care of your data on-line, making sure not to post to social networks or removing geolocation information from your pictures, there is nothing you can really do as a runner to avoid your personal data being shared everywhere.
You run 42 kilometers with a number on your shirt and everyone can find all the pictures of a given runner and his personal data (often including date of birth and residence) on public websites. Last year I was able to figure out personal information about the girlfriend of an old schoolmate I have not met in 20 years simply by looking at the pictures and results of an old Berlin Marathon.
Is face recognition during a marathon actually possible?
Using facial recognition to crack down on cheating in a recreational event feels like using a machete to cut the salad, but does it actually deliver? Can Amazon Rekognition help in validating the results of qualifying races for UTMB or Boston?
As a first test, I took the images of the latest marathon I ran (OK, I barely finished walking), the Berlin Marathon 2016. And of course I compared the very first picture in the set associated with my number (the one before the start) with the last one, after crossing the finish line.
Amazon Rekognition simply confirms something that runners have always known: a marathon changes you. After completing a 42K you are not the same person anymore.
Jokes aside, this is just a one picture test on one runner but highlights the challenges of tracking the runner along the race using face recognition alone. It might help combined with other technologies but it is likely to generate a significant number of false positives.
Testing a few more pictures from my previous races still available on-line (the non-existent data privacy for runners), I had mixed results. Some pictures match easily, others do not. Some are false positives, others are not. And I was definitely younger.
A random test with the London Marathon
Let’s instead limit the goal to confirming that a runner taking part in an event matches, by gender and age, the category she was registered for. That’s the most common cheating scenario and one that is impossible to cover using split times along the course.
Think about your younger and fitter cousin running your PB so you can qualify for the Boston Marathon, or your long-term running pal who collects a few points for you so you can qualify for the UTMB next year.
The London Marathon took place last Sunday. It is the largest spring marathon in Europe and one of the biggest in the world, along with New York, Berlin and Paris.
All the results of the race are of course available online, and there is a simple GET request that returns the data for a given bib number.
In the same way you can access all the pictures of all the runners on MarathonFoto and retrieve the pictures of a runner using again the start number and the last name you got from the previous request (the RaceOID is the one of the London Marathon 2017).
We do not want (yet) to process 30K or 40K runners, so let’s use a very small sample to see how Amazon Rekognition works. Let’s use Random.org to get 10 numbers we can test.
Three of them did not match any runner (not all numbers are assigned for the race) and one runner did not start at the event. What about the other 6 runners? We know the category (age range) and the gender for all of them.
- 31724 (18-39, male)
- 10297 (18-39, male)
- 12471 (18-39, male)
- 19412 (18-39, female)
- 17970 (45-49, female)
- 21095 (18-39, female)
Retrieving the first image in the set for each runner using the MarathonFoto URL, we can use Amazon Rekognition to double check the data above against what face recognition detects.
How did Amazon Rekognition score?
- 31724 (35-52, male 99.9%)
- 10297 (29-45, male 99.9%)
- 12471 (14-23, male 99.9%)
- 19412 (23-38, female 100%)
- 17970 (20-38, female 100%)
- 21095 (30-47, female 100%)
It did very well. All runners matched the expected gender.
There was a bit of luck involved, though, as in one of the pictures Amazon Rekognition selected a different and incorrect runner in the corner of the photo. And this will be the biggest challenge for an automatic bot: we first need to match the bib number in the image with the race number, as there might be multiple runners in the same shot, and so combine face recognition with the technology already used to map bib numbers to photos.
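Matching the bib number to the right face needs more than face detection. As a much cruder stand-in, and not what I used for the test above, one heuristic is to keep only the largest face in the shot, on the assumption that the photographed runner is closer to the camera than someone in the corner. The sketch assumes the list of face details has the shape returned by the Rekognition DetectFaces call, where each face carries a BoundingBox with Width and Height expressed as ratios of the image size; the helper name is hypothetical.

```python
from typing import Any, Dict, List, Optional


def largest_face(face_details: List[Dict[str, Any]]) -> Optional[Dict[str, Any]]:
    """Return the face with the largest bounding box area, or None if empty."""
    if not face_details:
        return None
    return max(
        face_details,
        key=lambda face: face["BoundingBox"]["Width"] * face["BoundingBox"]["Height"],
    )


# Illustrative values only: a runner in the foreground and a small face in the corner.
faces = [
    {"BoundingBox": {"Width": 0.05, "Height": 0.08, "Left": 0.90, "Top": 0.10}},
    {"BoundingBox": {"Width": 0.20, "Height": 0.30, "Left": 0.40, "Top": 0.20}},
]
main_face = largest_face(faces)
```

This will still fail on crowded start-line pictures, which is why combining it with bib detection remains the real fix.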
Did all runners match the expected age range as well? In short, yes.
The only failure (12471) is due to picking up the wrong face in the picture, but once that is addressed the result is correct too. Note as well that for runner 12471 the overlap of the age ranges is minimal. But even when the overlap is minimal, the correct age is in the range (you can find that information by searching for the athlete’s name on-line: 31724 is 39 and 21095 is 33).
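The comparison described above boils down to two checks: the detected gender equals the registered one, and the estimated age range overlaps the registered category. Here is a minimal sketch, assuming a face dictionary shaped like a Rekognition face detail (AgeRange with Low and High, Gender with Value); the function names are mine and the sample uses the values listed above for runner 31724.

```python
from typing import Any, Dict, Tuple


def ranges_overlap(category: Tuple[int, int], estimate: Tuple[int, int]) -> bool:
    """Return True if two inclusive age ranges share at least one year."""
    return category[0] <= estimate[1] and estimate[0] <= category[1]


def matches_registration(
    face: Dict[str, Any], category: Tuple[int, int], gender: str
) -> bool:
    """Check a detected face against the registered age category and gender."""
    estimate = (face["AgeRange"]["Low"], face["AgeRange"]["High"])
    same_gender = face["Gender"]["Value"].lower() == gender.lower()
    return same_gender and ranges_overlap(category, estimate)


# Runner 31724: registered as 18-39 male, estimated as 35-52 male.
face_31724 = {
    "AgeRange": {"Low": 35, "High": 52},
    "Gender": {"Value": "Male", "Confidence": 99.9},
}
print(matches_registration(face_31724, (18, 39), "male"))  # True
```

An overlap test is intentionally lenient, since the estimated ranges are wide; a stricter check would need the actual date of birth from the organizer.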
So we have a few limitations here that working with a race organizer can easily address:
Of course a very small set of a few runners is not expected to catch cheaters (and I would not publish their data anyway), but it confirms that using face recognition to catch cheats in marathons is feasible, even if it most likely needs other tools too to reach a reasonable level of accuracy.
But running a full data validation for a big event requires just the collaboration of the race organizer, a few dollars, maybe an instance running Scrapy and a couple of AWS Lambda functions. But everyone today can create profiles of thousands of marathon runners around the world and verify their data. Whether that is good or not.
But I still believe that at the moment the claim of the Beijing Marathon is more a PR stunt than the way they are really going to address the issue.
As a software developer, the chance to discuss politics is high at the coffee machine or after a couple of beers in the evening, but not while writing code. Somehow the last few weeks proved me wrong: I managed to discuss controversial borders and disputed countries on more than one occasion. And all thanks to the new ubiquitous geolocation and image recognition services.
The Hong Kong user
The founders of RaceBase World, a service to discover and rate running events around the world, are based in Hong Kong. And they were not too impressed when their profile page stated China as their home country. The geolocation labeling was provided by Mapbox, one of the largest providers of custom on-line maps. While the labeling might be justified – most of their users consider Hong Kong to be part of China – the choice was controversial for many runners based in the territory. And using the official name, the Hong Kong Special Administrative Region of the People’s Republic of China, is not really a feasible option.
And it’s not the only service affected.
When I then uploaded a trail picture taken in Hong Kong but not far from mainland China on OneMediaHub (a cloud solution provided by Funambol) the result was more bizarre. The picture was labeled with the location “Shenzhen, China”. Even if downtown Shenzhen is not exactly a paradise for trail running and it is quite far away.
In this scenario the problem was in the algorithm used to match EXIF data and the accuracy of the open source geolocation database used, GeoNames.
In the same way, a picture taken in the West Bank, not far from Jerusalem, has on OneMediaHub.com the location “Jerusalem, Israel”. Again, the author was not too happy.
Is that really so bad?
Most of the geolocation services are pretty accurate and the error margin is very low. The vast majority of the users are hardly affected by the issues above, something we call corner cases. And even if one of your summer pictures gets tagged with the next town on the Costa Brava, you are hardly going to complain. Or be offended. You might not even notice the bug.
But the problem is that a significant percentage of the scenarios where the algorithm fails, or where the decoding is controversial, are in disputed territories or partially recognized states. And that introduces some challenges for the developer who does not want to deal with politics while writing code.
It’s only geolocation!
Actually even a simple signup form where the user has to choose the country might be controversial. Not everyone in the world sadly agrees on the status of Kosovo. Or Palestine. Or even their names.
Google uses “Palestine” (but labels the field location) while Amazon goes for a neutral “Palestinian territories”.
Relying on the official UN status might be a safer option, but it does not make local users (or your web designer) very happy either. Let’s go back to the Mapbox example with RaceBase World.
Mapbox works for the Palestine Marathon and makes most (if not all) of the runners attending the event happy, but let’s assume a marathon is taking place in Simferopol, the largest city in the Crimean peninsula. Would most locals be OK with Ukraine as the country? Runners in Germany and runners in Russia usually have a different opinion about the status of Crimea. And there are many similar examples without even considering war zones.
How to fix those issues?
As a developer, if you have only a local audience it’s relatively easy. And you can minimize the controversies. If not, you can have some workarounds or hacks for challenging names or simply hide them (pretend that automatic decoding did not work or just show the city name). RaceBase World for example now shows Hong Kong for new registrations in the autonomous territory.
Better, but with a significantly higher development cost, you could show localized names according to where the audience is. Or rely on localization to mitigate the issue (different names in different languages).
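The workarounds above (override a label, hide it and show only the city, or vary it by audience) can be sketched as a small lookup on top of whatever the geocoder returns. Everything in the table is hypothetical: the audience keys are placeholders, the territory codes are ISO 3166-1 alpha-2, and the labels are simply the ones mentioned in this post, not a recommendation.

```python
from typing import Dict, Optional

# Hypothetical overrides keyed by audience, then by ISO 3166-1 alpha-2 code.
# A value of None means: hide the country and show only the city.
OVERRIDES: Dict[str, Dict[str, Optional[str]]] = {
    "default": {"HK": "Hong Kong", "PS": None},
    "audience_a": {"PS": "Palestine"},
    "audience_b": {"PS": "Palestinian territories"},
}


def display_location(
    city: str, country_code: str, decoded_country: str, audience: str = "default"
) -> str:
    """Return the label to show, preferring audience overrides over the geocoder."""
    for table in (OVERRIDES.get(audience, {}), OVERRIDES["default"]):
        if country_code in table:
            label = table[country_code]
            break
    else:
        # No override anywhere: trust the name decoded by the geolocation service.
        label = decoded_country
    return city if label is None else f"{city}, {label}"


print(display_location("Kowloon", "HK", "China"))  # Kowloon, Hong Kong
print(display_location("Krakow", "PL", "Poland"))  # Krakow, Poland
```

Keeping the overrides in data rather than in code means the list can be reviewed and changed without a new release, which helps when the complaints arrive.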
But at the end of the day the big players drive the geolocation databases and they care more about where most of their users are. When “2.8 million people took part in marathons in China in 2016, almost twice the number from the previous year”, as the Telegraph recently reported, it’s hard to argue with Mapbox’s approach on what China is and what China is not. Runners in Hong Kong might not be their first audience or growing market.
How can I test my application?
If you want to test how your website performs in critical areas, you do not even need real pictures: just edit the EXIF data of a random picture using Photo Exif Editor or similar applications and enjoy the challenge. And you are ready to go.
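If you prefer to prepare the test values yourself, it helps to know that EXIF stores a GPS position as a reference letter (N/S or E/W) plus degrees, minutes and seconds as rational numbers, not as a decimal value. The sketch below converts decimal coordinates into that form; the function name is mine, the coordinates are arbitrary test values, and writing the tags into the file is left to the editor or library of your choice.

```python
from fractions import Fraction
from typing import Tuple

Dms = Tuple[Fraction, Fraction, Fraction]


def to_exif_gps(value: float, is_latitude: bool) -> Tuple[str, Dms]:
    """Convert a decimal coordinate to an EXIF-style (ref, (deg, min, sec)) pair."""
    if is_latitude:
        ref = "N" if value >= 0 else "S"
    else:
        ref = "E" if value >= 0 else "W"
    # Work in exact seconds to avoid floating point drift.
    total_seconds = Fraction(abs(value)).limit_denominator(10**7) * 3600
    degrees = total_seconds // 3600
    remainder = total_seconds % 3600
    minutes = remainder // 60
    seconds = remainder % 60
    return ref, (Fraction(degrees), Fraction(minutes), seconds)


# Arbitrary test coordinates, not taken from a real picture.
lat_ref, lat = to_exif_gps(22.5, is_latitude=True)
lon_ref, lon = to_exif_gps(114.25, is_latitude=False)
print(lat_ref, lat)
print(lon_ref, lon)
```

Moving a test picture a few hundred meters across a disputed border is then just a matter of changing the decimal input.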
How does it work with AWS services?
What about Amazon and AWS services? Any way to limit or keep the above issues under control? This will be covered soon in the second part of this post.