Does the site you want to scrape also have a mobile application? It's time to take a closer look at it...
You might get a nice surprise and discover a private API.
Private APIs for mobile applications
To listen to the traffic of the application you have to stand between it and the destination server.
We're going to set up a proxy that, once configured on the phone, will intercept all the data it sends.
All phone communications will pass through the proxy hosted on our computer.
We will be able to listen to everything and even modify requests before they reach the server.
Setting up the proxy server
I personally use Burp Suite, a suite of tools to test the security of a web or mobile application.
The free version has some limitations, such as not being able to save a project, but we can do very well without that.
Download and install the latest version of Burp Suite: https://portswigger.net/burp/
Start a new project using the default configuration. You should then find yourself in front of the Burp Suite interface with tabs for all features.
The feature we are interested in is in the Proxy tab.
The proxy must be configured to listen on the right network interface. The configuration is done in the Options tab.
By default it only listens on the local address (127.0.0.1, also called localhost).
We need to add a network interface so that the proxy can be contacted throughout our local network.
In my case the IP address of my computer on my network is 192.168.1.73, so the proxy must listen on this interface.
You need to add a listener on your own network interface.
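If you are not sure which address your computer uses on the local network, a minimal Python sketch like the one below can help. The function name `local_ip` and the probe address are assumptions made for this illustration; connecting a UDP socket sends no packet, it only asks the operating system which local address it would use for outbound traffic.

```python
import socket


def local_ip(probe_host: str = "192.0.2.1", probe_port: int = 80) -> str:
    """Return the IPv4 address of the interface used for outbound traffic.

    probe_host is a documentation-only address (an assumption for this
    sketch); nothing is actually sent to it.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # On a UDP socket, connect() only selects a route and a local address.
        sock.connect((probe_host, probe_port))
        return sock.getsockname()[0]
    except OSError:
        # No usable route (for example an offline machine): fall back to loopback.
        return "127.0.0.1"
    finally:
        sock.close()


if __name__ == "__main__":
    print(local_ip())
```

On a machine with several network interfaces this returns the one used by the default route, which may not be the Wi-Fi network shared with the phone, so compare the result with the list of interfaces that Burp offers.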
Two things to select: the port and the interface. For the port, just enter an available network port; 8080 is Burp's default and a port commonly used for proxies.
For the interface, choose the one that corresponds to the network your computer shares with your phone.
Once validated, check that it is active in the "running" column. Just check/uncheck the box to activate or deactivate it.
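To double-check from the computer that the listener really accepts connections, a small TCP probe is enough. This is only a sketch: the helper name `is_listening` is hypothetical, and the address and port in the usage example are the ones from my setup.

```python
import socket


def is_listening(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timeout, unreachable network, etc.
        return False


if __name__ == "__main__":
    # Replace with the address and port of your own listener.
    print(is_listening("192.168.1.73", 8080))
```

A `True` result only shows that something accepts connections on that port; a firewall on the computer can still block the phone, so the real test remains a request sent from the phone itself.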
Burp's proxy server is now operational! But if you try to configure it on your mobile you will see that all HTTPS requests fail...
This is because, by adding an intermediary between the mobile application and the destination server, the certificates presented to the phone are no longer the server's: they are generated by the proxy, and the phone does not trust them.
That's what HTTPS is all about: encrypting and authenticating communications.
The solution: import the Certification Authority (CA) of our proxy into the phone. The phone will then accept all the certificates generated by the proxy.
Once the certification authority is installed, you just have to configure the Wi-Fi network of your phone to use the proxy.
Once the proxy is configured on your phone, all its communications will go through the proxy server hosted on your computer. You can intercept and modify each request before it is received by the destination server.
Test the proxy while browsing a site in HTTPS
Let's check that the proxy works properly by running a simple Google search over HTTPS.
In the "Intercept" tab you can view in real time all requests before they reach the server.
For each request received by the proxy you can either delete it ("Drop"), in which case it never reaches the server, or send it on ("Forward").
To let all requests go through without stopping them, click the "Intercept is on" button so that it switches to "Intercept is off".
In the HTTP History tab you have the history of all requests. A request is added to the history even if the "Intercept" option is disabled.
Here you can see my phone's requests when doing a Google search. You can easily find the request that generates Chrome's autocompletion results.
By cleaning the URL as much as possible to remove all the unnecessary parameters, we get a clean URL that still works and returns the Google Suggest results as an array that is easy to exploit afterwards.
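Cleaning a URL by hand gets tedious when it carries a dozen parameters. Here is a minimal sketch of the idea in Python, using only the standard library. The URL and the parameter names in the example are hypothetical placeholders, not the real Google Suggest URL; in practice you find out which parameters matter by removing them one at a time and replaying the request.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def keep_params(url: str, wanted: set[str]) -> str:
    """Return url with every query parameter dropped except those in wanted."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key in wanted
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


if __name__ == "__main__":
    # Hypothetical captured URL, for illustration only.
    captured = "https://api.example.com/suggest?q=hair+salon&client=app&sessionid=abc123&hl=en"
    print(keep_params(captured, {"q", "hl"}))
    # prints https://api.example.com/suggest?q=hair+salon&hl=en
```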
To test the requests you can use Postman, an extension for Chrome, or test them directly in Burp Suite using the Repeater tool.
Right click on the request and then "Send to Repeater".
All the information of the request will be sent to the Repeater tool, accessible through the tab of the same name. You will then be able to test the request by modifying a few parameters.
Scraping data from a French directory
What if we could scrape mobile directory applications more simply than their website?
The goal is to retrieve all the contact details of hairdressers in the 12th arrondissement of Paris using the application's private API.
After a first search for "Coiffeurs 75012" in the mobile application, we find several requests that are mainly used for autocompletion.
And then the following request, whose result is a big JSON object.
To test the request I transfer it to the "Repeater" tab of Burp (right click on the request, then "Send to Repeater").
Notice that the request headers contain a key. If we delete this header or modify the key, the server returns a 401 Unauthorized error.
But if we keep this authentication header, the API will return the results whatever the request.
Just change the parameter in the URL to find out:
Hairdressers 15th district:
Hairdressers 10th district:
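The same replay can be scripted outside of Burp. The sketch below only illustrates the principle: the endpoint, the header name `X-Api-Key` and the `what` / `where` parameters are all hypothetical, so replace them with the values you see in your own captured request.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://api.example.com/search"  # hypothetical endpoint
KEY_HEADER = "X-Api-Key"  # hypothetical header name, copy the real one from Burp


def build_search_request(what: str, where: str, api_key: str) -> Request:
    """Build a GET request that carries the key copied from the captured request."""
    query = urlencode({"what": what, "where": where})  # hypothetical parameter names
    headers = {KEY_HEADER: api_key, "Accept": "application/json"}
    return Request(f"{API_URL}?{query}", headers=headers)


def fetch_json(request: Request, timeout: float = 10.0):
    """Send the request and decode the JSON body; HTTP errors such as 401 raise HTTPError."""
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    request = build_search_request("coiffeurs", "75015", "key-copied-from-burp")
    print(request.full_url)
```

Only the `where` value changes from one arrondissement to the next; the key stays the same for every call.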
The returned JSON does not contain all the results announced at the beginning of the response.
But a field in the JSON gives us the URL to call to get the next page.
It only differs by a simple GET parameter: jumpPage=X.
10 lines of Python are enough to retrieve all the pages and extract companies and their information.
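As an illustration, here is a minimal sketch of such a loop. Only the `jumpPage` parameter comes from the API described above; the `results` key, the page numbering starting at 1 and the `fetch` callable are assumptions made for this example. `fetch` is any function that takes a URL and returns the decoded JSON, for instance one that sends the request with the authentication header.

```python
from typing import Any, Callable, Iterator
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_page(url: str, page: int) -> str:
    """Return url with its jumpPage parameter set to page."""
    parts = urlsplit(url)
    params = [(k, v) for k, v in parse_qsl(parts.query) if k != "jumpPage"]
    params.append(("jumpPage", str(page)))
    return urlunsplit(parts._replace(query=urlencode(params)))


def iter_results(
    first_url: str,
    fetch: Callable[[str], dict],
    items_key: str = "results",  # hypothetical key, check the real JSON
    max_pages: int = 100,  # safety limit so the loop always ends
) -> Iterator[Any]:
    """Yield every item, page after page, until a page comes back empty."""
    for page in range(1, max_pages + 1):  # assumes pages are numbered from 1
        data = fetch(with_page(first_url, page))
        items = data.get(items_key) or []
        if not items:
            break
        yield from items
```

Passing `fetch` as a parameter keeps the paging logic independent of the HTTP library, and makes it easy to try the loop on saved JSON responses before calling the real API.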
Most mobile applications rely on an API, often a private one, that can be used to retrieve data more easily.
Don't hesitate to test: with a bit of luck you'll find some real nuggets!