How to query and report on data from an API

TLDR: today’s business data is spread across multiple cloud platforms, accessible via API. However reporting directly from an API is not feasible - for technical and performance reasons, you need to stage your data first in a more “report-friendly” location

Today, the majority of software used by small-medium businesses is provided by third-parties. Cloud applications like Xero, LIghtspeed, Trello - and tens of thousands more - provide immense power at the fraction of the cost of bespoke software.

This habit (spreading your business across multiple cloud services) means that your data becomes silo’d. Your inventory may be with Unleashed or Cin7, your sales are with Vend or Revel and your staff costs are in Deputy or Tanda. And as you onboard new cloud services, your data increasingly grows further and further apart.

Because I’m a bit of a space-nerd, I often imagine this a little like the universe expanding - just as the space between galaxies increases as the universe expands, so too do the gaps between cloud applications as people integrate more and more of them into their business.

The democratization of data

Not coincidentally, on the other side of the internet, is this growing awareness of Big Data, with its corresponding democratization of reporting and analysis tools - such as Power BI, Tableau, Databox - and hundreds more.

Thus, as a business grows, it is natural that these two worlds collide. Owners with a modicum of technical savvy can easily familiarize themselves with tools which were once the realm of data specialists. And naturally, with this newfound power, their mind quickly turns to analyzing their growing silo’s of cloud data.

Reporting from your cloud applications

So, how exactly do you connect these two paradigms and create meaningful reports from your cloud application data? Well, let’s examine a few ways in order of effectiveness:

Native reports
CSV exports
Reporting against the API
Staging your API data
A hosted staging service

Native reports

Geek level	0/10
Benefits	free; easy to use; real-time (usually)
Cons	one-size fits all; data is not granular

Any cloud service worth its salt probably already has a bunch of reports pre-built into the application. If you run a POS, you probably have something like Sales by product. Or if you are running an accounting application, it probably offers Balance Sheet and Depreciation reports.

Sales by product

If these are all you need, great - read no further!

However, most businesses outgrow these reports pretty quickly. Perhaps you want to view your sales by hour? Or maybe your accountant needs a depreciation report that excludes office equipment. As soon as your needs begin to differentiate from the masses, you’re going to need to start building your own reports. So, let’s look at the next option - CSV exports.

CSV exports

Geek level	2/10
Benefits	free; data is usually more granular
Cons	requires Excel/G-Sheets/etc skills; high manual overhead; multiple exports required for complex or long-ranging queries; the cloud application decides what you need

If you need a bit more customization from your reports, your cloud application might offer CSV exports of their data. A CSV export is simply a single file that you can open in a spreadsheet (e.g. Microsoft Excel or Google Sheets).

The data in these files is usually a lot more granular than in-built reports - often even providing raw data from the database. And this level of granularity allows you to group/sort/filter in Excel/Sheets in any way you want,so your report options are almost unlimited.

Of course, you are still dependent on the cloud platform to decide exactly how raw/granular these exports are - so let’s not get too excited here. But even assuming you can get the data you need, there is still a very high manual component to this type of reporting. Consider our Sales by product report we discussed earlier - this report crosses over a few “domains”:

Orders - the dollar value that customers paid for your products, as well as order status (e.g. if the order was refunded)
Order Items - each order can have multiple orders. A single order can have multiple products (e.g. a coffee and a muffin), so it is this order-item construct which shows precisely how many of each product was purchased within a particular order
Products - the name of each product, whether it is active etc

There are plenty of ways to skin a cat, but for arguments sake let’s say that your cloud application is very good, offering a high level of granularity in your CSV exports. Therefore, to popular this report you would need to download the following CSV files:

Orders.csv

OrderID	Total	WhenCreated	etc...
101	56.25	2020-10-01T03:45:00Z	...
102	51.30	2020-10-01T03:49:00Z	...
103	122.00	2020-10-01T03:58:00Z	...
...	...	...	...

OrderItems.csv

OrderItemID	OrderID	ProductID	Quantity	etc...
289	101	13	1	...
290	101	4	1	...
291	102	456	12	...
...	...	...	...	...

Products.csv

ProductID	Name	etc...
13	Coffee	...
14	Orange juice	...
130	Muffins	...
...	...	...

Thus, very time you wanted to update your Excel/Sheets report with the latest information, you would need to:

Log in to your cloud application
Navigate to the various order/product pages
Set your download filters (e.g. orders from the last month)
Download each file in CSV format
Import each file into Excel/Sheets

And that’s just for Sales by product. What if you want to differentiate by site? What if you have product modifiers? How about restricting to VIP customers? Reporting against CSV files becomes very tiresome and is prone to human error. But don’t worry - help is on the way....API access!

Reporting against the API

Geek level	6/10
Benefits	real-time data; no human/manual intervention
Cons	API throttling; performance; requires software development skills; dependent on API uptime

Most cloud applications offer some kind of API (Application Programming Interface), that allows your data to be accessed programmatically and automatically by another piece of software.

Furthermore, many reporting applications have “connectors” which let you connect directly to an API and download data on-the-fly. Essentially, this means your reports can run in real-time, directly from your cloud application. This is pretty cool - so let’s see it in action using Power BI’s Custom Connectors as an example…

Yeah, nah, I’m not going to provide a tutorial for this because it is massive and plenty of others have done this for us. Here is an example for grabbing data from Youtube. I encourage you in particular to make a coffee then settle down with the riveting code that is the data definition file.

Seriously - custom connectors are cool, but now you’re getting into serious geek territory - authentication, JSON, web requests, paging. In fact, now that I think about it, I wrote an article on this too - if you’re accessing an API, chances are you need to deal with everything mentioned in this post.

But, let’s assume for the moment that you do have the skills and time to develop these kinds of connectors. Don’t get too excited, because I guarantee that your reports are still going to be, well, terrible:

API throttling

Most APIs limit the number of queries you can do in a particular minute or day. A typical example is 150 requests/minute or 5000/day. Consider this in light of our Sales by product report we discussed earlier, against a mid-size franchise - 5000 sounds like a lot, but if you have 1,000 products and your API pages over summary records, then you’re already limiting yourself to just 5 renders/day. And that’s assuming zero sales!

API throttling alone is usually a deal-breaker for anything but the most trivial of reports.

Performance

Again, let’s consider Sales by product. As discussed in our CSV section, this requires multiple domains - Orders, Order Items & Products. And as with the CSV report, you’re going to need to download information from each of these domains separately. Only three domains you say - well, I can wait a few seconds…

…but not so fast. If your cafe made 1,000 sales yesterday, and the Orders endpoint pages in batches of 25 (a la Revel's POS), then that is forty consecutive API calls. This is not scalable.

I know what you’re thinking - what about Graph APIs Ben?! If you’re constructing a model to service, for example, a web page - Graph is great. But if you’re reporting across millions of records and dozens of years, you’re still going to hit your “complexity” throttling limits and network latency issues downloading and parsing data. Nope - Graph isn’t going to help us here.

API downtime

Nobody’s perfect and even the best and biggest cloud applications go down from time to time. So if your reports have a live API connector, well...

Staging your API data

Geek level	9/10
Benefits	near real-time data; resilient to API downtime; highly performant; no manual/human intervention
Cons	requires even more software development skills; an additional "link in the chain"; ongoing hosting costs

All of the drawbacks with direct API access can be addressed with a single solution - data warehousing. Quite simply this means writing your API connectors as before - but instead of using them directly from your reporting tool, you run them continuously in the background, dumping the data into a dedicated database (“warehouse”).

Now, instead of connecting directly to the API, your reports query directly from the database. Let’s see how this addresses the pain points above...

API throttling

Because your background processes are running continuously, data is downloaded as a constant trickle, instead of a flood. This means it is easy to keep under the API limits, as millions of records are gradually downloaded.

Performance

Firstly, reporting from a database means you forego the overhead of web requests. This alone will reduce your query time by a factor of 1000 or so (yes, one thousand). Furthermore, databases are designed to be queried and have features like indexing and foreign keys to speed up performance and facilitate cross-domain reports (i.e. orders, order items & products).

API downtime

If the API goes down, your background process simply pauses and resumes when the API is back up again. In most cases, you’ll never even notice.

However, staging your data isn’t without its own problems...

Additional software development skills

As well as being able to connect to APIs, you now need to be able to write/read to databases, spawn background tasks, deal with security and deployments...

Hosting costs

To run in the background, your app needs to be “hosted” somewhere - and unless you want to leave your personal computer on 24x7, this usually means paying for hosting services such as AWS or Azure. At time of writing, the cheapest app/database plan I could find was USD$58.75/mth on Azure (pricing calculator here).

An additional link in the chain

Every link in your data-reporting chain is an opportunity for the system to break. Perhaps your database gets full? Perhaps the database host goes down? Whenever this happens, you need to drop tools and get this working again before you can get back to your real job.

A hosted staging service

Geek level	0/10 (get some other geek to do it for you!)
Benefits	no software development skills required; near real-time data;resilient to API downtime; highly performant; no manual/human intervention
Cons	an additional "link in the chain"; ongoing hosting costs

This is exactly the same as the staging solution mentioned above, except that somebody else does it for you, foregoing the requirement of needing software development skills.

You still have to pay for hosting, although it is generally barely more expensive than hosting it yourself. And of course it is still another link in the chain - but at least there is a team dedicated to making that link as strong as possible.

And hey - what do you know? It turns out this is exactly what we offer at SyncHub. Wow - what are the chances....?

How to query and report on data from an API

The democratization of data

Reporting from your cloud applications

Native reports

Sales by product

CSV exports

Orders.csv

OrderItems.csv

Products.csv

Reporting against the API

API throttling

Performance

API downtime

Staging your API data

API throttling

Performance

API downtime

Additional software development skills

Hosting costs

An additional link in the chain

A hosted staging service

Compounding API calls

The Data Accessibility Checklist