Reversing Google Play and Micro-Protobuf applications

September 19, 2012 – 22:11

I recently released a Google Play Unofficial Python API, which aims to give developers a way to query Google’s official Android application store. Such projects already exist, but they are all based on the previous version (“Android Market”), and are therefore limited. My goal was to adapt those projects and port them to the latest version of Google Play.

This article first highlights the limitations of existing projects. Then it focuses on the official Android client for Google Play and its internals, based on a Protobuf variant. Thanks to Androguard and its awesome static analysis features, I show how to automatically recover the .proto file of Google Play, enabling us to generate stubs for querying Google’s servers. Finally, I quickly introduce the unofficial API.

Existing projects

Google Play can be queried in two ways: using the official website or the Android client. The website contains pretty much all the useful information, such as app name and developer name, comments, latest version number and release date, permissions required by the app, statistics, etc. I guess one could build a simple program that queries this website and parses the pages, but it would still have one limitation: you simply cannot download apps. Well, you can, but for this you will need an actual compatible phone, and as soon as you perform the install request, the application will get downloaded and installed on your phone. Then if you want to retrieve it in order to analyse it, you must plug in your phone and use adb pull. Some have managed to get Google Play to run within the emulator, but this is still a bit complicated and not straightforward: you need Java and the Android SDK, you must customize your emulator ROM to embed Google Play, and you must script everything yourself.

The main project I have been looking at is android-market-api, written in Java. Actually, I am a Python fan, and played much more with its Python equivalent. The goal of those projects is to simulate the network activity generated by the Android client, query Google Play servers, and parse the result. The underlying protocol used by Google Play is based on Google’s Protocol Buffers, aka Protobuf. For those who do not know, this library provides a way to encode messages in compact binary blobs before sending them on the network, and decode them on the other side. The documentation contains plenty of details on the actual encoding format, so I won’t cover it. The only important thing to know about Protobuf is that it is much easier to decode messages if you know the structure of exchanged messages. Messages are composed of fields, each one having a tag, a name and a type. When encoded, a message embeds the tag, value and type (only basic types, or a generic “message” type) of each field, but not their names. Therefore, the semantics of each field must be guessed, and that is not always easy.
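
To make the point about missing names concrete, here is a minimal Python 3 sketch of a schema-less decoder. It is not taken from any of the projects mentioned; the function names and the sample message are made up for illustration, and nested messages and groups are deliberately left out. It only recovers what is really on the wire: the tag, the type and the raw value of each field.

```python
def read_varint(buf, pos):
    """Decode a base-128 varint starting at buf[pos]; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:          # high bit clear: last byte of the varint
            return result, pos
        shift += 7


def decode_raw(buf):
    """Yield (tag, wire_type, value) for each top-level field of a message."""
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        tag, wire_type = key >> 3, key & 0x7
        if wire_type == 0:           # varint
            value, pos = read_varint(buf, pos)
        elif wire_type == 1:         # fixed 64-bit, kept as raw bytes
            value, pos = buf[pos:pos + 8], pos + 8
        elif wire_type == 2:         # length-delimited (string, bytes, message)
            length, pos = read_varint(buf, pos)
            value, pos = buf[pos:pos + length], pos + length
        elif wire_type == 5:         # fixed 32-bit, kept as raw bytes
            value, pos = buf[pos:pos + 4], pos + 4
        else:
            raise ValueError("unsupported wire type %d" % wire_type)
        yield tag, wire_type, value


# Hand-encoded sample (hypothetical): field 1 = "lang", field 2 = "fr".
blob = bytes([0x0A, 0x04]) + b"lang" + bytes([0x12, 0x02]) + b"fr"
fields = list(decode_raw(blob))
print(fields)  # [(1, 2, b'lang'), (2, 2, b'fr')]
```

Nothing in the output says what fields 1 and 2 mean, which is exactly why a .proto file is so valuable.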

When the Google Play Android client queries Google’s servers and downloads APKs, all network communication is done with Protobuf over HTTP(S). The underlying Protobuf file used by the unofficial API projects (and based on Android Market) has been published as a .proto file. The unofficial API can forge some of those requests and interpret results. While playing with them, I have managed to search Android apps, but I could not always download them. Indeed, this version of the API requires a numeric “assetId” corresponding to the app you want to download. When trying to get appropriate assetIds using other API methods such as search(), I got non-numeric values, such as: v2:com.fankewong.angrybirdsbackup2sd:1:4. This type of value is rejected by Google’s servers when trying to download the app. Too bad…

A first look at Google Play Android client

The weird thing is that the non-numeric assetId problem occurs quite often, but not on all apps. I guess this is because Google updated their API when they switched to Google Play; those projects are using the old version of the API. The only way to have up-to-date information and be able to download any app would then be to analyse the updated Android client, and adapt existing projects.

Here we go! We retrieve com.android.vending-1.apk from an up-to-date Android phone using adb, and we use our favorite Android RE tools. A first look at class names highlights a pretty explicit VendingProtos class, under the com.google.android.vending.remoting.protos package. It contains references to a package named com.google.protobuf.micro, embedded within the app. This package contains classes used to encode and decode messages. It is actually part of a public project, named micro-protobuf, which is a lightweight version of Protobuf. However, the underlying protocol remains the same.

Most of the network traffic is sent using HTTPS. After installing our own CA onto the phone and setting up an interception proxy like Burp, we can sniff traffic. From a black-box approach, the exchanged data looks like a binary stream:

Capturing a Protobuf response with Burp

All we need now is the .proto file of Google Play to be able to decode it. But how can we get this file? It is unfortunately not embedded within the app, so we have to find another way. A paper and a tool have been published on the subject, but work only when the studied app or program embeds some kind of metadata, used by reflection features of Protobuf. This metadata is generally embedded in regular stubs generated with Google’s standard protobuf compiler called protoc. However, this is not the case here since the Protobuf stubs embedded within Google Play Android client were not compiled with standard protoc. Micro-protobuf seems to remove this metadata, probably to make protocol reversing harder.

Anyway, is there a way to guess the structure of exchanged messages, just by having a look at the decompiled Java code of the app? Let’s go back to the VendingProtos class. It contains many nested classes, among which one named AppDataProto:

public static final class AppDataProto extends MessageMicro
{
  private int cachedSize = -1;
  private boolean hasKey;
  private boolean hasValue;
  private String key_ = "";
  private String value_ = "";

  [...]

  public AppDataProto mergeFrom(CodedInputStreamMicro 
                                paramCodedInputStreamMicro)
    throws IOException
  {
    while (true)
    {
      int i = paramCodedInputStreamMicro.readTag();
      switch (i)
      {
      default:
        if (parseUnknownField(paramCodedInputStreamMicro, i))
          continue;
      case 0:
        return this;
      case 10:
        String str1 = paramCodedInputStreamMicro.readString();
        AppDataProto localAppDataProto1 = setKey(str1);
        break;
      case 18:
      }
      String str2 = paramCodedInputStreamMicro.readString();
      AppDataProto localAppDataProto2 = setValue(str2);
    }
  }

  public AppDataProto setKey(String paramString)
  {
    this.hasKey = 1;
    this.key_ = paramString;
    return this;
  }

  public AppDataProto setValue(String paramString)
  {
    this.hasValue = 1;
    this.value_ = paramString;
    return this;
  }

  [...]
}

We can guess that this class represents a Micro-Protobuf message (the extends MessageMicro part) and that it has two string fields: key and value. Their tags can be extracted from the mergeFrom() method, which decodes incoming binary messages. It is composed of a main loop (while(true)) and a switch statement. Each case, except the first two (default and 0), corresponds to a field. The value of each case is actually the binary representation of the tag and type of the field. Everything is in the documentation; to skip the details, the actual value of each case is equal to (tag << 3) | type. For instance, 10 stands for tag 1, type 2 (string), and 18 means tag 2, type 2. Thus, the actual .proto file looks as follows:

message AppDataProto {
  optional string key = 1;
  optional string value = 2;
}
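
As a quick check of the (tag << 3) | type formula, the following Python 3 sketch splits a case value back into its two parts. The helper names are hypothetical; the type numbers are the standard Protobuf wire types.

```python
# Standard Protobuf wire types, indexed by the low three bits of the key.
WIRE_TYPES = {
    0: "varint",
    1: "64-bit",
    2: "length-delimited",
    3: "start group",
    4: "end group",
    5: "32-bit",
}


def split_case_value(value):
    """Split a switch case value into (tag, wire_type)."""
    return value >> 3, value & 0x7


def make_case_value(tag, wire_type):
    """Inverse operation: build the case value from a tag and a wire type."""
    return (tag << 3) | wire_type


for case in (10, 18, 26):
    tag, wire_type = split_case_value(case)
    print(f"{case} -> tag {tag}, type {wire_type} ({WIRE_TYPES[wire_type]})")
```

The three values used in this article (10, 18 and 26) all come out as type 2, with tags 1, 2 and 3 respectively.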

Actually type 2 is not exactly “string”, but any length-delimited field. It could be a string, a series of bytes, or an embedded message itself. For an embedded message, the code looks like this:

case 26:
  VendingProtos.AppDataProto localAppDataProto = new VendingProtos.AppDataProto();
  paramCodedInputStreamMicro.readMessage(localAppDataProto);
  DataMessageProto localDataMessageProto2 = addAppData(localAppDataProto);
  break;

This field has a tag equal to 3 (26 >> 3) and is a message whose name is AppDataProto. In order to get the structure of this sub-message, we would have to repeat the analysis process on the corresponding class, and so on.

Automatic analysis

We now have a way of recovering a message structure by analyzing the generated code. All we need now is to automate the process. For this, we can use Androguard, a multi-purpose framework intended to make Android reversing easier. With Androguard, we can simply open an APK, decompile it, parse its Dalvik code, and do all sorts of things. Once installed, one can use the provided androlyze tool to dynamically interact with the framework, and then write a script to automate everything.

Androguard lets us easily browse the available classes and find those that extend MessageMicro.

In [1]: apk = APK('com.android.vending-1.apk')
In [2]: dvm = DalvikVMFormat(apk.get_dex())
In [3]: vma = uVMAnalysis(dvm)
In [4]: proto_classes = filter(lambda c: "MessageMicro;" in c.get_superclassname(), dvm.get_classes())
In [5]: proto_class_names = map(lambda c: c.get_name(), proto_classes)

Then we extract the mergeFrom() method of each class by filtering the method list generated by dvm.get_methods_class(class_name). The basic block list of each method can be obtained with vma.get_method(m).basic_blocks.gets(). The first block is usually the one that implements the switch instruction. In Dalvik, a switch is often represented as a sparse-switch instruction, whose operand is a table composed of a list of values and offsets, called sparse-switch-payload. Here is an example:

invoke-virtual v3, Lcom/google/protobuf/micro/CodedInputStreamMicro;->readTag()I
move-result v0
sparse-switch v0, +52 (0xa4)
[...]
sparse-switch-payload sparse-switch-payload 0:9 a:a 12:12 1a:1a 22:22 2a:2a 32:32 3a:3a 42:42 4a:4a

Each (value, offset) tuple corresponds to a case of the switch; if the value matches the compared register, then the execution continues at the corresponding offset. Once we are able to browse each case of the switch (and its target basic block), we can determine the name of each field and its type by examining the names of the corresponding accessors. For instance, here is a typical basic block:

invoke-virtual v3, Lcom/google/protobuf/micro/CodedInputStreamMicro;->readString()Ljava/lang/String;
move-result-object v1
invoke-virtual v2, v1, L[...]AddressProto;->setCity(Ljava/lang/String;)L[...]AddressProto;
goto -25

Each basic block contains two accessor calls: readXXX() and setYYY(). Their goal is to read an incoming series of bytes and initialize one field of the message. XXX corresponds to the type of the field (here, string), and YYY to its name (city).
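
The accessor matching can be sketched with two regular expressions applied to the textual form of the instructions shown above. This is a simplified Python 3 illustration, not the code of the actual script: the function name is hypothetical, and treating addYYY() setters as repeated fields is an assumption based on the addAppData() call seen earlier.

```python
import re

# readXXX() gives the field type, setYYY()/addYYY() gives the field name.
READ_RE = re.compile(r"->read([A-Z]\w*)\(")
SET_RE = re.compile(r"->(set|add)([A-Z]\w*)\(")


def parse_basic_block(instructions):
    """Return (label, type, name) recovered from one block's accessor calls."""
    label = field_type = name = None
    for ins in instructions:
        match = READ_RE.search(ins)
        if match and match.group(1) != "Tag":    # readTag() is not a field
            field_type = match.group(1).lower()  # "String" -> "string"
        match = SET_RE.search(ins)
        if match:
            # Assumption: addYYY() is only generated for repeated fields.
            label = "repeated" if match.group(1) == "add" else "optional"
            suffix = match.group(2)
            name = suffix[0].lower() + suffix[1:]  # "City" -> "city"
    return label, field_type, name


block = [
    "invoke-virtual v3, Lcom/google/protobuf/micro/CodedInputStreamMicro;"
    "->readString()Ljava/lang/String;",
    "move-result-object v1",
    "invoke-virtual v2, v1, L[...]AddressProto;"
    "->setCity(Ljava/lang/String;)L[...]AddressProto;",
    "goto -25",
]
print(parse_basic_block(block))  # ('optional', 'string', 'city')
```

For readMessage() blocks this only yields the generic type "message"; the concrete message name would have to be taken from the setter’s parameter type.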

The simplified analysis algorithm looks like:

for each class that extends MessageMicro:
  get its mergeFrom() method
    find the sparse-switch instruction
    get the corresponding sparse-switch-payload
    index all values and offsets in a dict
    for each value, offset:
      tag = value >> 3
      get the target basic block using the offset
      find readXXX() and setYYY() calls
      type = XXX
      name = YYY
      index the tuple (tag, type, name)
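
The loop above can be exercised without Androguard by feeding it hand-written data. The Python 3 sketch below is only an illustration under stated assumptions: the input dictionary (case value mapped to the instructions of its target block) stands in for what would really be extracted from the sparse-switch-payload, the function names are hypothetical, and nested messages and groups are ignored.

```python
import re

ACCESSOR_RE = re.compile(r"->(read|set|add)([A-Z]\w*)\(")


def recover_fields(switch_cases):
    """Map {case value: target block instructions} to (tag, label, type, name)."""
    fields = []
    for value, instructions in sorted(switch_cases.items()):
        if value == 0:               # case 0 ends the message, it is not a field
            continue
        tag = value >> 3
        label, field_type, name = "optional", None, None
        for ins in instructions:
            for kind, suffix in ACCESSOR_RE.findall(ins):
                if kind == "read":
                    field_type = suffix.lower()
                else:
                    name = suffix[0].lower() + suffix[1:]
                    if kind == "add":    # assumption: addYYY() means repeated
                        label = "repeated"
        fields.append((tag, label, field_type, name))
    return fields


def format_message(message_name, fields):
    """Render the recovered fields as a .proto message definition."""
    lines = ["message %s {" % message_name]
    for tag, label, field_type, name in fields:
        lines.append("  %s %s %s = %d;" % (label, field_type, name, tag))
    lines.append("}")
    return "\n".join(lines)


READ_STRING = ("invoke-virtual v3, Lcom/google/protobuf/micro/"
               "CodedInputStreamMicro;->readString()Ljava/lang/String;")
cases = {
    0: [],
    10: [READ_STRING,
         "invoke-virtual v2, v1, L[...]AppDataProto;"
         "->setKey(Ljava/lang/String;)L[...]AppDataProto;"],
    18: [READ_STRING,
         "invoke-virtual v2, v1, L[...]AppDataProto;"
         "->setValue(Ljava/lang/String;)L[...]AppDataProto;"],
}
proto_text = format_message("AppDataProto", recover_fields(cases))
print(proto_text)
```

With this input, the printed definition is the same AppDataProto message that was recovered by hand earlier.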

Then we only need to format the output in order to generate a parsable .proto file, dealing with nested messages and groups among other things.

I called the resulting script androproto.py. It is released with the API code; feel free to play with it. It is able to analyze the target app and print the recovered Protobuf file. I apologize for the dirty code; since Google Play is the only app using Micro-Protobuf that I’ve analyzed, this script is pretty specific. But it should work with any app using this library, with a few changes. Its output on the Google Play app looks like this:

message AckNotificationResponse {
}
message AndroidAppDeliveryData {
  optional int64 downloadSize = 1;
  optional string signature = 2;
  optional string downloadUrl = 3;
  repeated AppFileMetadata additionalFile = 4;
  repeated HttpCookie downloadAuthCookie = 5;
  optional bool forwardLocked = 6;
  optional int64 refundTimeout = 7;
  optional bool serverInitiated = 8;
  optional int64 postInstallRefundWindowMillis = 9;
  optional bool immediateStartNeeded = 10;
  optional AndroidAppPatchData patchData = 11;
  optional EncryptionParams encryptionParams = 12;
}
message AndroidAppPatchData {
  optional int32 baseVersionCode = 1;
  optional string baseSignature = 2;
  optional string downloadUrl = 3;
  optional int32 patchFormat = 4;
  optional int64 maxPatchSize = 5;
}
[...]

The resulting output is almost usable with protoc. Almost, because there is a duplicate message that you need to manually remove in order to make protoc happy. But after taking care of that detail, you have a working googleplay.proto that you can use to generate C++, Java and Python stubs for querying Google Play API!

Building Google Play Unofficial Python API

In order to parse Google Play protobuf messages, we dump each server response intercepted with Burp into a file, and use:

protoc --decode=ResponseWrapper googleplay.proto < dump.bin

ResponseWrapper is the root message type; it can be easily guessed by looking at the message names. Once we have a clue of what’s received by the application, we can start building our own API. Since we need a valid auth token from Google’s servers, we first need to authenticate. I simply reused the code from android-market-api-py. Once logged in, we need to deal with protobuf traffic. For most API requests, the Android client does not send protobuf messages, but only simple GET or POST requests, such as search?c=3&q=%s. In order to parse Protobuf responses, we use the generated Python module (googleplay_pb2):

message = googleplay_pb2.ResponseWrapper.FromString(data)

The resulting message can be browsed like a regular Python object. For some API methods, Google servers also return some prefetch data. A prefetch element contains a URL and raw data. It acts like a cache and can be dealt with pretty easily with a few lines of code.
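
One possible way to handle prefetch data is sketched below in Python 3. It is only an illustration of the caching idea, not the code of the API: the class and method names are hypothetical, and prefetch elements are represented as plain (url, raw_data) pairs rather than as Protobuf objects.

```python
class PrefetchCache:
    """Keep prefetched responses so later requests can skip the network."""

    def __init__(self):
        self._responses = {}

    def store(self, prefetch_items):
        """Remember each (url, raw_data) pair announced by the server."""
        for url, raw_data in prefetch_items:
            self._responses[url] = raw_data

    def fetch(self, url, do_request):
        """Return the cached bytes for url, or fall back to do_request(url)."""
        if url in self._responses:
            return self._responses[url]
        return do_request(url)


cache = PrefetchCache()
cache.store([("search?c=3&q=earth", b"\x0a\x00")])

# The first URL is served from the cache, the second one hits the fallback.
cached = cache.fetch("search?c=3&q=earth", lambda url: b"from network")
missed = cache.fetch("search?c=3&q=moon", lambda url: b"from network")
print(cached, missed)
```

Passing the network call as a function keeps the sketch independent of any HTTP library.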

The final API is pretty straightforward to use. Just follow the README. First make sure to edit googleplay.py and insert your phone’s androidID, then supply your Google credentials in config.py. You can use the provided scripts, producing CSV output, and prettify them with pp. Sorry for the following truncated output due to this blog…

$ alias pp="column -s ';' -t"  # pretty-print CSV

$ python search.py earth | pp
Title                           Package name                            Creator                  Super Dev  Price    Offer Type  Version Code  Size     Rating  Num Downloads
Google Earth                    com.google.earth                        Google Inc.              1          Gratuit  1           53            8.6MB    4.46    10 000 000+
Terre HD Free Edition           ru.gonorovsky.kv.livewall.earthhd       Stanislav Gonorovsky     0          Gratuit  1           33            4.7MB    4.47    1 000 000+
Earth Live Wallpaper            com.seb.SLWP                            unixseb                  0          Gratuit  1           60            687.4KB  4.06    5 000 000+
Super Earth Wallpaper Free      com.mx.spacelwpfree                     Mariux                   0          Gratuit  1           2             1.8MB    4.41    100 000+
Earth And Legend                com.dvidearts.earthandlegend            DVide Arts Incorporated  0          5,99 €   1           6             6.8MB    4.82    50 000+
Earth 3D                        com.jmsys.earth3d                       Dokon Jang               0          Gratuit  1           12            3.4MB    4.05    500 000+
[...]

$ python categories.py | pp
ID                   Name
GAME                 Jeux
NEWS_AND_MAGAZINES   Actualités et magazines
COMICS               BD
LIBRARIES_AND_DEMO   Bibliothèques et démos
COMMUNICATION        Communication
ENTERTAINMENT        Divertissement
EDUCATION            Enseignement
FINANCE              Finance

$ python list.py 
Usage: list.py category [subcategory] [nb_results] [offset]
List subcategories and apps within them.
category: To obtain a list of supported catagories, use categories.py
subcategory: You can get a list of all subcategories available, by supplying a valid category

$ python list.py WEATHER | pp
Subcategory ID            Name
apps_topselling_paid      Top payant
apps_topselling_free      Top gratuit
apps_topgrossing          Les plus rentables
apps_topselling_new_paid  Top des nouveautés payantes
apps_topselling_new_free  Top des nouveautés gratuites

$ python list.py WEATHER apps_topselling_free | pp
Title                  Package name                                  Creator          Super Dev  Price    Offer Type  Version Code  Size    Rating  Num Downloads
La chaine météo        com.lachainemeteo.androidapp                  METEO CONSULT    0          Gratuit  1           8             4.6MB   4.38    1 000 000+
Météo-France           fr.meteo                                      Météo-France     0          Gratuit  1           11            2.4MB   3.63    1 000 000+
GO Weather EX          com.gau.go.launcherex.gowidget.weatherwidget  GO Launcher EX   0          Gratuit  1           25            6.5MB   4.40    10 000 000+
Thermomètre (Gratuit)  com.xiaad.android.thermometertrial            Mobiquité        0          Gratuit  1           60            3.6MB   3.78    1 000 000+

$ python permissions.py com.google.android.gm
android.permission.ACCESS_NETWORK_STATE
android.permission.GET_ACCOUNTS
android.permission.MANAGE_ACCOUNTS
android.permission.INTERNET
android.permission.READ_CONTACTS
android.permission.WRITE_CONTACTS
android.permission.READ_SYNC_SETTINGS
android.permission.READ_SYNC_STATS
android.permission.RECEIVE_BOOT_COMPLETED
[...]

$ python download.py com.google.android.gm
Downloading 2.7MB... Done

$ file com.google.android.gm.apk 
com.google.android.gm.apk: Zip archive data, at least v2.0 to extract

Conclusion

Although there is no metadata within Micro-Protobuf applications, recovering .proto files is still doable and it can still be done automatically. The lack of obfuscation is clearly an advantage for an attacker, since all class and method names are easy to understand. Having a non-official Google Play API is handy for many reasons: performing statistics that aren’t available on the official front-end, looking for plagiarism, automatic malware search / downloading / analysis (Androguard to the rescue)… Feel free to browse the source, fork the project, and improve it!

  1. 10 responses to “Reversing Google Play and Micro-Protobuf applications”

  2. Good job! I was hoping someone would provide us with a more recent .proto for Google Play. I’m trying to port this to PHP now – my monologue is available at https://github.com/splitfeed/android-market-api-php/issues/12

    By Marko on December 5, 2012

  3. Excellent post… I tried scraping the web-based Google Play, but recently, after 1 or 2 minutes of scraping, the HTTP responses come back with an error (code != 200) and redirect to a captcha, even if I use proxies and a process that switches between them at random. Do you know a way to bypass this restriction? Perhaps by creating a Google account inside the app process? I already pick user-agents at random (my code runs many processes in parallel).
    Thanks in advance, Paolo

    By paolo on April 19, 2013

  4. Hi, thanks for the feedback. I guess Google implements some kind of throttling in order to prevent (or slow down) the crawling process. I didn’t try, so I don’t know if there is a way to bypass it. But I would be careful; if you’re always using the same account, Google could track all your requests and decide to block it. You should maybe throttle your requests by sleeping between each method call.

    By Emilien Girault on April 22, 2013

  5. Just thought I’d stop by here and mention I’ve written a protobuf decoder for Burp. My extension supports loading a .proto or compiled python proto module (_pb2.py) for automatic deserialization and ability to tamper messages right from Burp.

    - http://www.tssci-security.com/archives/2013/05/30/decoding-and-tampering-protobuf-serialized-messages-in-burp/
    - https://github.com/mwielgoszewski/burp-protobuf-decoder

    By Marcin on June 3, 2013

  6. Hi, I just discovered your code and it is a real pleasure to use. Many thanks for this work!
    I was using RealApkLauncher before.

    One missing feature is automatic handling of updates for downloaded APKs.
    For that, one would either need to store a list of the APKs to track between two runs of the software (that I can do), or retrieve the information (name and version) from the APKs already downloaded to disk (that I don’t know how to do). What do you think?

    I am thinking of developing a wxPython interface for your API to download APKs the way RealApkLauncher does. I will do it tomorrow if I have time.

    By Tuxicoman on August 11, 2013

  7. Hi,
    I did not know RealAPKLeecher. Version tracking is indeed entirely possible, since the available version is returned by the API (call details(), then read the doc.details.appDetails.versionCode field in the response).
    If you choose the other option, that is also feasible by inspecting the application’s manifest. You can do it by hand, but you have to extract the APK, convert the manifest (binary XML -> plain XML), then parse it. Or you can use Androguard, which does it very well. There is even a method to retrieve the version. The downside is that the tool relies on quite a few dependencies.
    I have not dived back into this project for months; I will have to update it one of these days to support the latest Protobuf message formats included in recent versions of Google Play…

    By Emilien Girault on August 11, 2013

  8. I have made progress. Here is the interface so far: http://jesuislibre.net/download/wip.png

    For updates, I use Androguard to retrieve the version number; apparently this function has no dependencies other than Python.

    I would like to display the versionString instead of the versionCode because it seems more meaningful to users, but it does not seem to be present in the results returned by your API. Is that expected, or did I miss something?

    By Tuxicoman on August 12, 2013

  9. Nice GUI :) . Regarding the API, I actually only retrieve what Google’s servers return, and indeed the versionString does not always seem to be present. I have not figured out why.
    Feel free to fork the project on Github. Several people have already sent me pull requests, but I admit I do not have much time to integrate them. I prefer to let those interested choose to integrate these changes according to their needs.

    By Emilien Girault on August 12, 2013

  10. I have published the software if you want to see what it looks like: http://tuxicoman.jesuislibre.net/2013/08/googleplaydownloader-telecharger-les-apk-sans-rien-demander-a-google.html

    By Tuxicoman on August 19, 2013

  11. Hi!
    I am trying to add a feature to my script: when downloading a game, it sometimes comes with an .obb file that contains all the game data (often close to 1 GB). However, I cannot find a way to retrieve this file (https://github.com/matlink/gplaycli/issues/6). Do you have any information about this?
    Thanks!

    By Matlink on August 22, 2015
