Keeping your data up-to-date isn’t as easy as you think

I had a chat with a client the other day about how they might build some integrations of their own, syncing between an API and a data warehouse – essentially what SyncHub does. As part of this, I listed a handful of “gotchas” and other considerations, and afterwards I sat back and was struck by how much we have put into SyncHub in just the last 18 months. So, I thought I’d share…

Remember, there are two main parts to a data integration - back-loading historical data, then keeping it up to date in real time. The list below covers both, but the majority of problems occur in the latter phase.

[Image: The Danger of Live Data]

1. The API goes offline

When one of your systems is down or unavailable, you obviously can’t export your data at that time, and systems can be down for a number of reasons. With cloud-based software, the API may not be working, or the system may have changed the way the API works – in which case you need to fix this before you can get your data out. For more traditional systems, this could mean anything from connectivity issues to the export functionality not working as expected.

Unavailability can also be the result of an API imposing a throttling limit. SyncHub proactively avoids this limit by spacing out its API queries according to the limit the platform publishes. However, some systems share their limits across all apps using the same API key. This means that if a third party consumes all of the available quota, SyncHub will start hitting 429 (Too Many Requests) errors and the data sync will pause until the time recommended by the cloud platform.

Regardless of why an API is offline, you need to know when your data hasn’t been exported, so you can fix the problem.

SyncHub is coded defensively, meaning that we expect APIs to become unavailable. Our task system notes the unavailability and begins a series of rolling retries, finally escalating to notifications to our development team if the system still isn’t responding.
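
To make that concrete, here is a simplified sketch of the pattern in Python. The endpoint URL is a made-up placeholder, and this is an illustration of the idea rather than our actual engine:

```python
import time
import requests

# Hypothetical endpoint, purely for illustration.
URL = "https://api.example.com/v1/invoices"

def fetch_with_retries(url, max_attempts=5):
    """Fetch defensively: honour 429 Retry-After headers and back off
    exponentially on outages before escalating."""
    delay = 2  # seconds; doubles after each failed attempt
    for attempt in range(1, max_attempts + 1):
        try:
            response = requests.get(url, timeout=30)
            if response.status_code == 429:
                # Throttled: pause for as long as the platform recommends
                # (this assumes Retry-After is given in seconds).
                time.sleep(int(response.headers.get("Retry-After", delay)))
                continue
            response.raise_for_status()
            return response.json()
        except requests.RequestException:
            if attempt == max_attempts:
                raise  # out of retries: escalate, e.g. notify the dev team
            time.sleep(delay)
            delay *= 2  # rolling retries: 2s, 4s, 8s, ...
    raise RuntimeError("API still throttled after repeated retries")
```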



2. Data structures change

This happens when a system introduces changes that break the data structure we expect. It is more of a risk with cloud-based software, where updates to the source system are continual.

Often these issues are only identified at the last moment, when you’re compiling reports – or sometimes not at all. If that happens, any reporting will be inaccurate.

Getting a notification when this happens is vital to ensure the accuracy of your data. SyncHub’s logging system alerts our developers once a failure pattern is detected, and we investigate why it happened. We are usually able to fix the issue and have things up and running within half a day.
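
As a rough illustration of the detection idea (the field names here are invented, and a real pipeline would log and alert rather than print):

```python
# Illustrative only: a simple guard that surfaces schema drift before loading.
EXPECTED_FIELDS = {"id", "modified_at", "total"}  # the shape we assume

def missing_fields(record: dict) -> set:
    """Return the expected fields absent from an incoming record, so a
    logging system can spot a failure pattern and alert developers."""
    return EXPECTED_FIELDS - set(record)

gaps = missing_fields({"id": 42, "total": 9.99})
if gaps:
    print(f"Possible upstream schema change; missing fields: {gaps}")
```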

3. Dynamic data structures

Data can be exported from an API in different formats – the two most common these days are JSON and XML. Usually an API either lets the developer choose or commits to just one format throughout.

However, we deal with a couple of cloud platforms that deliver the results in JSON, unless there is an error - in which case the response is sent in XML. Not helpful.

Speaking of not helpful, one of the platforms we integrate with returns lists of People records in JSON. If there is more than one person, we get a properly formatted JSON array. If there are no records, we get an empty array. But if there is exactly one item in the list, the JSON is a single Person object – not an array. Thus, our JSON parser fails (well, failed – we sorted it out, obviously, but…sigh).

SyncHub has built dynamic parsing engines which can adapt on-the-fly to this type of behaviour.
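
To give a flavour of what that means in practice, here is a simplified sketch of the single-object-versus-array quirk in Python – not our actual parsing engine:

```python
import json

def as_list(value):
    """Normalise a field that may arrive as a single object, a proper
    list, or be missing entirely – a quirk some APIs exhibit."""
    if value is None:
        return []
    if isinstance(value, list):
        return value
    return [value]  # single object: wrap it so downstream code sees a list

# One person comes back as an object rather than a one-element array:
payload = json.loads('{"people": {"id": 1, "name": "Ada"}}')
people = as_list(payload.get("people"))
assert people == [{"id": 1, "name": "Ada"}]
```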

4. Timezones and date compatibility

If you are reporting across multiple data sources, you need to understand the relative offsets at which your dates are stored.

For example, your SaaS is hosted in Sydney and stores times at GMT+10. You are in New Zealand, so you add two hours to render your reports in New Zealand time. But what if New Zealand is in daylight saving time and Australia is not? You’ll then need to adjust by three hours instead of two.

Now add another data source - this time a SaaS that stores its dates in US Eastern Time - and you need to repeat the above process. If you also have businesses in different time zones and want to report across all of them, you add another layer of complexity.

SyncHub addresses this issue in one of the extremely rare cases where we modify data: we convert every recorded date to UTC by cross-referencing the incoming date with the locality of the platform. This makes all SyncHub dates directly comparable between data sources, without requiring expensive mappings.
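
For illustration, here is that conversion step using Python’s standard zoneinfo module (the platform, timestamp and timezone are invented examples):

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # Python 3.9+

def to_utc(naive_local: datetime, platform_tz: str) -> datetime:
    """Interpret a platform-local timestamp in the platform's own zone,
    then convert it to UTC. DST rules come from the zone database."""
    aware = naive_local.replace(tzinfo=ZoneInfo(platform_tz))
    return aware.astimezone(ZoneInfo("UTC"))

# A timestamp from a Sydney-hosted SaaS, as reported by its API:
local = datetime(2020, 6, 15, 9, 30)
print(to_utc(local, "Australia/Sydney"))  # 2020-06-14 23:30:00+00:00
```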

5. Detailed vs summary records

Some APIs require a two-hop process to bring down data, because their recently-modified lists contain only summary records of the entities. This means SyncHub calls the API multiple times – first to get the list of modified records, then, for each record, calling back to a different endpoint to retrieve the detailed version.

This requirement increases the API load by a factor of the paging size, which is typically 100 records. 
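
The pattern looks something like this – a simplified sketch with made-up endpoints and an assumed response shape:

```python
import requests

# Hypothetical API, purely for illustration.
BASE = "https://api.example.com/v1"

def fetch_modified_details(since):
    """Two-hop pattern: page through the summary list, then fetch each
    record's detail individually. A 100-record page costs 101 calls."""
    details = []
    page = requests.get(f"{BASE}/orders",
                        params={"modified_since": since}, timeout=30).json()
    for summary in page["items"]:  # "items" is an assumed response key
        detail = requests.get(f"{BASE}/orders/{summary['id']}",
                              timeout=30).json()  # one extra call per row
        details.append(detail)
    return details
```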


6. Nested data structures

Power BI developers find it easy and intuitive to report off a relational database, and this is what SyncHub provides.  However, in most cases, the data we receive from cloud platforms is not relational.  Specifically, many endpoints return nested data. 

Nested data structures are problematic because the child entity’s relationship to the parent is often inferred simply from the fact that it is nested. This means there is usually no explicit relationship defined, so you don’t know which data is connected to which. To mitigate this, SyncHub derives and stores calculated columns to ensure our data structure has proper referential integrity.

Nested data structures also place a higher burden on the syncing engine because it is forced to sync sometimes dozens of sub-entities all at once. The risk here is that if one fails, they all fail.

SyncHub addresses the problem of nested data structures manually - our developers explore the entity structures and use metadata to annotate these relationships programmatically.
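
A toy example of the underlying idea – deriving the foreign key that nesting only implies (the invoice shape is invented):

```python
# Illustrative: flatten a nested invoice into relational rows, deriving
# the parent link that the source payload only implies by nesting.
invoice = {
    "id": "INV-100",
    "total": 250.0,
    "lines": [  # children with no explicit link back to the parent
        {"sku": "A1", "qty": 2},
        {"sku": "B2", "qty": 1},
    ],
}

invoice_row = {"id": invoice["id"], "total": invoice["total"]}
line_rows = [
    {"invoice_id": invoice["id"], **line}  # calculated column = explicit FK
    for line in invoice["lines"]
]
# line_rows can now load into a child table with referential integrity.
```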

7. Missing data periods

There are cases where a period of data (e.g. February 2020) is missing from a table’s records (see the example below regarding the third-party plugin).  In these cases, the sync needs to be able to retrospectively “back fill” this period, while also retaining its real-time capabilities.

SyncHub resolves this with its Segment functionality. This allows different periods of the same data endpoint to be synced concurrently.
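
This isn’t how Segments are implemented internally, but one way to picture the idea is to treat each missing period as an independent sync window:

```python
from datetime import date, timedelta

def day_segments(start: date, end: date):
    """Split a missing period into day-sized windows that can be queued
    and synced concurrently with the real-time sync."""
    current = start
    while current < end:
        yield (current, min(current + timedelta(days=1), end))
        current += timedelta(days=1)

# Back-fill a missing February 2020 without touching the live sync:
for window in day_segments(date(2020, 2, 1), date(2020, 3, 1)):
    pass  # enqueue each window for the sync engine to process
```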


8. Modification filters

When you’re getting the latest data from your system, you need to be able to access only the relevant data, not all your data.  For example, you may want to export your sales figures daily, and to do that you need data filters to narrow the search window.

Some APIs provide excellent modification filters which allow you to get data that was modified between X and Y dates. Some only provide a from date – making historical data-loading quite challenging. Others provide no modification filters at all, while still others filter by creation date, not modification date.

The granularity on offer also varies. Some systems provide timestamps down to the millisecond, others only down to the day. Some cache their data, so modifications aren’t available via the API for a minute or two – if you query that period in the interim, you’ll miss the data and move on to the next sync window.

SyncHub addresses these issues with a wide variety of tools. Some are in-code, such as our fuzzy-matching algorithms, and some allow the user to tweak the sync themselves, such as our Trails module.
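
As one example of the kind of in-code mitigation that helps with the caching problem, a sync cursor can deliberately overlap the previous window – a simplified sketch, not our actual algorithm:

```python
from datetime import datetime, timedelta, timezone

# Re-query slightly behind the last cursor, assuming the API caches data
# so very recent modifications may not be visible yet.
OVERLAP = timedelta(minutes=5)

def next_sync_window(last_synced_to: datetime):
    """Start the next window a little before the previous cursor, so
    records that were cached (invisible) last time get picked up now."""
    start = last_synced_to - OVERLAP
    end = datetime.now(timezone.utc)
    return start, end

start, end = next_sync_window(datetime(2020, 4, 1, 12, 0, tzinfo=timezone.utc))
# Query with modified >= start and modified < end, then de-duplicate
# against records already stored (e.g. upsert by primary key).
```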


9…10…11…Then, just when you think you’ve solved everything…

These issues can take many different forms, are often peculiar and are always unexpected! Recent real-world examples include:

  • one of our POS integrations (not naming any names) slows down massively if you query data further than two years back. This is because their system moves older data into cheaper, slower storage, so information from that period takes much longer to deliver via the API.

  • another system’s update resulted in sales orders for April 2020 disappearing for some clients if they were using a particular third-party plug-in. It took a week for them to fix the bug, after which we simply used a Segment to back-fill the data.

By knowing when these failures are happening, SyncHub can proactively fix these issues as soon as they occur, keeping your reporting as accurate as possible. We can also retrospectively resync data for specific timeframes once any downtime issues have been resolved. This means that, for you, keeping your data up-to-date across all your systems is an automated process that just works.
