Mobile App Data Extraction at Scale
Introduction

Scraping data from mobile data sources isn’t a new concept; however, the different ways of doing it haven’t scaled easily. At X-Byte Enterprise Crawling, we have worked hard on mobile app data extraction at scale, and we wrote this blog to share useful information about the subject.

Reverse Engineering

Now, the question is: how can we do that? Suppose we have the APK of an Android app and want to scrape 500,000 data points (UI screens) every day. How can we do that, and what will it cost?

The first step is to find out how the client communicates with its servers: which protocol it uses and how messages are transferred between them.

Although this might look like the most scalable and affordable solution, it solves the problem for only one application. What do we do if we want to repeat the process for other applications? What if the API changes? As you can see, it’s hard to estimate the effort required.

To do this, we used an Android emulator, installed the APK, connected the emulator to a proxy, and observed the data.

All communication was done over HTTPS. After a few hours, we were able to monitor traffic from the client to the server and even to simulate calls to the server.
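Once a call has been observed at the proxy, simulating it can be sketched roughly as below. This is a minimal illustration using only the Python standard library, not the tooling we actually used; the record format, URL, headers and body are hypothetical placeholders and do not describe any real app's API.

```python
import json
import urllib.request


def build_replay_request(captured):
    """Rebuild an HTTP request from a record captured at the proxy."""
    body = captured.get("body")
    # Assumption: the app sends JSON bodies; other encodings need other handling.
    data = json.dumps(body).encode("utf-8") if body is not None else None
    request = urllib.request.Request(
        url=captured["url"],
        data=data,
        method=captured["method"],
    )
    for name, value in captured.get("headers", {}).items():
        request.add_header(name, value)
    return request


# Hypothetical captured call: every value here is a placeholder.
captured_call = {
    "method": "POST",
    "url": "https://api.example.com/v1/items/search",
    "headers": {"Content-Type": "application/json"},
    "body": {"query": "shoes", "page": 1},
}

request = build_replay_request(captured_call)
# To actually send it: urllib.request.urlopen(request, timeout=10)
```

Building the request separately from sending it makes it easy to inspect and adjust the replayed call before any traffic reaches the server.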

Outcome

Reverse engineering is very easy to start with and looks like the most affordable and most scalable approach. However, it can take many days, the development costs are unpredictable, and you don’t always get the end result.

Selendroid or Appium

With tools like Selendroid or Appium, the approach is completely different. You write the scenario you need as a test script and run that script automatically, again and again. We decided to use Appium with an Android emulator.
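As a rough illustration of what such a scripted run asks for, the sketch below builds the JSON body of a W3C WebDriver "new session" request with a few standard Appium capabilities. It only constructs the payload and does not contact a server; the APK path and device name are hypothetical, and the choice of the UiAutomator2 driver is an assumption rather than a record of our setup.

```python
import json


def new_session_payload(apk_path, device_name):
    """Build the body of a W3C WebDriver new-session request for Appium."""
    capabilities = {
        "platformName": "Android",
        # Assumption: UiAutomator2 driver; other drivers use another name here.
        "appium:automationName": "UiAutomator2",
        "appium:deviceName": device_name,
        "appium:app": apk_path,
    }
    return {"capabilities": {"alwaysMatch": capabilities, "firstMatch": [{}]}}


# Hypothetical values for illustration only.
payload = new_session_payload("/apps/example.apk", "emulator-5554")
print(json.dumps(payload, indent=2))
```

An Appium client library normally builds and sends this payload for you; seeing it spelled out shows that one session maps to one app on one device, which is why the number of emulators drives the cost of this approach.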

Android emulators have a reputation as painful tools to deal with in mobile development. With the release of x86 emulators, though, things have started to work more smoothly, and applications running on a laptop can feel quicker than on the physical devices themselves.

Next, we created a Docker container with Ubuntu 16.04, Appium, and an Android x86 emulator to test how many of these we could run at the same time.

So, assuming that we can use 1 CPU per emulator, we will require 700 CPUs for 700 emulators! That is a huge requirement, and a very expensive one too!
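To put the numbers in perspective, here is a small back-of-the-envelope calculation. It assumes the 700 emulators are meant to cover the 500,000 screens per day mentioned earlier and that each emulator runs around the clock; the article does not state this link explicitly, so treat it as an illustration only.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def screens_per_emulator_per_day(total_screens, emulators):
    """Daily workload each emulator must handle if work is split evenly."""
    return total_screens / emulators


def seconds_per_screen(total_screens, emulators):
    """Time budget per screen if every emulator runs 24 hours a day."""
    return SECONDS_PER_DAY * emulators / total_screens


per_emulator = screens_per_emulator_per_day(500_000, 700)
budget = seconds_per_screen(500_000, 700)
print(f"{per_emulator:.0f} screens per emulator per day")
print(f"{budget:.1f} seconds available per screen")
```

Under these assumptions each emulator handles about 714 screens a day, which leaves roughly two minutes per screen.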

Outcome

Physical hardware always delivers good performance; however, it’s very hard to manage on a large scale.

So what can we do to avoid managing physical hardware?

Linux, Docker, AWS and Android

We can use a public cloud like AWS. However, when we took this approach to the cloud, things worked completely differently. Linux, Docker, AWS, and Android work really well together, but with an emulator they do not. AWS EC2 gives you a virtual machine, and the Android emulator is a virtual machine on top of that. To benefit from hardware acceleration with the x86 Android emulator, the host machine needs to expose this capability. Amazon, like other public clouds, doesn’t expose it; they use it themselves to serve us virtual machines. As a result, we were unable to even start an Android x86 emulator!
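A quick way to see whether a Linux host exposes hardware virtualization is to look for the Intel (vmx) or AMD (svm) CPU flags and for the /dev/kvm device. The sketch below is a minimal check under those assumptions; the function names are ours for illustration, and it is not an official diagnostic tool.

```python
import os


def cpu_has_virtualization_flags(cpuinfo_text):
    """Return True if any 'flags' line lists vmx (Intel) or svm (AMD)."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            _, _, value = line.partition(":")
            flags = value.split()
            if "vmx" in flags or "svm" in flags:
                return True
    return False


def host_can_accelerate(cpuinfo_path="/proc/cpuinfo", kvm_device="/dev/kvm"):
    """Return True if the CPU flags and the KVM device are both present."""
    try:
        with open(cpuinfo_path) as handle:
            has_flags = cpu_has_virtualization_flags(handle.read())
    except OSError:
        return False
    return has_flags and os.path.exists(kvm_device)


if __name__ == "__main__":
    print("hardware acceleration available:", host_can_accelerate())
```

On a cloud instance that hides these capabilities from the guest, a check like this returns False, which matches the behaviour we ran into.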

So how did we do it? We used Ravello.

Ravello Cloud

The Ravello solution provides nested virtualization, or Kernel-based Virtual Machine (KVM) support, on the host machine when running on a public cloud.

It gave us the ability to run x86 Android emulators in the cloud. We tried it and it worked; however, in terms of performance, the scraping took three times longer than on physical machines, and things got worse as we added more emulators.

Outcome

The Ravello Cloud solution works; however, its performance is not up to the mark.

Genymotion Cloud

Another solution is Genymotion Cloud, which offers an Android image as an Amazon Machine Image (AMI) for Amazon EC2.

So rather than getting a Windows or Ubuntu VM, you get an Android VM! It looked like the best solution that runs on a public cloud. With the AMI, we were able to run on a t2.small instance (1 core, 2 GB of memory), and the scraping script ran as well as it did on physical hardware.

The problem with this solution is the cost: every instance, together with the image, costs $0.148 per hour, and on a large scale with 700 Android emulators it becomes really expensive.
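The scale of that cost is easy to work out. The sketch below assumes all 700 instances run continuously at the quoted $0.148 per hour and ignores storage, traffic, and any discounts, none of which are covered in this article.

```python
from decimal import Decimal

HOURLY_RATE_USD = Decimal("0.148")  # per instance, as quoted above


def fleet_cost(instances, hours):
    """Total cost in USD for a number of instances running for some hours."""
    return HOURLY_RATE_USD * instances * hours


per_hour = fleet_cost(700, 1)
per_day = fleet_cost(700, 24)
per_30_days = fleet_cost(700, 24 * 30)

print(f"per hour:    ${per_hour:,.2f}")
print(f"per day:     ${per_day:,.2f}")
print(f"per 30 days: ${per_30_days:,.2f}")
```

Under these assumptions the fleet costs about $103.60 per hour, which is roughly $2,486 per day and close to $75,000 over 30 days.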

Outcome

Genymotion works really well in the cloud and provides nearly the same performance as running on a physical machine; however, it is really expensive to use on a large scale.

Bluestacks and Nox

These products were developed specifically for gamers; however, that doesn’t mean we can’t use them. So, we spun up a t2.medium Windows VM on AWS EC2 to try them.

With Nox, the installation failed because the graphics card driver was outdated. Even after solving this, more obstacles kept coming up, so ultimately we gave up and decided to try Bluestacks.

The Bluestacks installation went well, and its performance was also excellent.

However, we did not find a way to run multiple separate Bluestacks instances in the cloud within our virtual machine, and the APK we tested also didn’t work well on it, perhaps because Bluestacks runs in some kind of tablet mode.

Outcome

Bluestacks worked well on virtual machines; it is reachable and visible through ADB, which means that we can run Appium tests on it. The downsides are that it runs only on Mac or Windows, you can run only one instance at a time, and it works only in tablet mode.
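Checking that an emulator is visible through ADB can be automated by parsing the output of `adb devices`. The sketch below parses a sample of that output; the serials in the sample are illustrative, not taken from our setup.

```python
def parse_adb_devices(output):
    """Return serials of devices that `adb devices` reports as ready."""
    serials = []
    for line in output.splitlines():
        line = line.strip()
        # Skip blanks, the header line, and daemon start-up messages.
        if not line or line.startswith("List of devices") or line.startswith("*"):
            continue
        parts = line.split()
        if len(parts) >= 2 and parts[1] == "device":
            serials.append(parts[0])
    return serials


# Sample output for illustration; serials are hypothetical.
sample_output = (
    "List of devices attached\n"
    "emulator-5554\tdevice\n"
    "127.0.0.1:5555\toffline\n"
    "\n"
)
print(parse_adb_devices(sample_output))
```

In a real run the text would come from something like `subprocess.run(["adb", "devices"], capture_output=True, text=True).stdout`, and only serials in the `device` state would be handed to the automation scripts.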

Whichever emulator solution you select, some optimizations can help reduce the time needed to scrape the data. To name a few: where possible, use landing URLs and deep links; and when running on powerful machines, the application can run faster than it does on a real device.
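A deep link can be opened directly from the command line with the Android activity manager, which skips the taps needed to navigate to a screen. The helper below only builds the `adb` command; the URI scheme, package name, and serial are hypothetical, and it assumes the target app actually registers a handler for the link.

```python
def deep_link_command(uri, package, serial=None):
    """Build an adb command that opens a deep link in the given app."""
    command = ["adb"]
    if serial:
        # Target one specific device or emulator when several are attached.
        command += ["-s", serial]
    command += [
        "shell", "am", "start",
        "-W",                                  # wait for the launch to complete
        "-a", "android.intent.action.VIEW",
        "-d", uri,
        package,
    ]
    return command


# Hypothetical deep link and package name, for illustration only.
command = deep_link_command(
    "exampleapp://product/12345", "com.example.app", serial="emulator-5554"
)
print(" ".join(command))
```

The resulting list can be passed to `subprocess.run`. URIs containing characters such as `&` need extra quoting, because the remote device shell interprets them.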

Summary

To summarize: when you need to scrape data from mobile apps at scale, if reverse engineering works and fits your requirements, then take it, because in X-Byte Enterprise Crawling’s experience it is the most affordable and scalable solution.

The other available solutions, which rely on an Android emulator, are harder to work with and really expensive to run. If you know other solutions for scraping mobile applications at scale, you are most welcome to share them with us!