In this industry it can feel challenging to tell others you've made mistakes, but we believe it's important to show not just the answer, but also the process of finding that answer, so others can learn from mistakes made along the way. Mistakes are sometimes even more valuable than the solution itself.
The technical team here at Cygenta loves to learn new technology stacks and find new and fancy ways of processing the data we rely on to provide some of the services to our clients. You can read about some of our past experiences in our guides to using AWS Athena and AWS Redshift. In this post I will take you through an experience we recently had with technical debt.
Several months ago our team had to put together a proposal for a large piece of work for a multinational bank, and to do so we needed very specific data, and lots of it. In the back of our minds, we knew that once we had that data it might be leveraged for some of the other services we offer, such as our OSINT and pentesting services. We started to look at how best to serve this data to the client and also how to manage it internally with the team. We asked ourselves a few obvious questions:
- How quickly can we ingest the data, with minimal effort spent on manipulation?
- How expensive is it going to be?
- Is it a tried and trusted method?
- Have we got existing experience or will training be needed?
- Can we get the data out easily?
- How much technical debt will this bring?
The obvious answer was a database, but which one? We have experience with most of the systems out there, but we had a specific use case, and the very specific data, timelines and budgets also came into play. The winner in our minds was AWS DynamoDB: a NoSQL database engine with easy-to-use features and low cost, and we knew we could be up and running with data in hours.
We would love to say we answered each of those questions in order to make that decision; however, we wouldn't be writing this if we had. We missed the last two! Well, technically, we did ask the penultimate one, but we only thought about it in the short term.
Let's look at those last two questions. Firstly, "can we get the data out easily?". The answer was yes: for the immediate job at hand we created a DynamoDB system that could pull out the specific data we needed in a time- and cost-effective manner. But several months down the line, when we wanted to start integrating that data into other services, we needed to pull data out in different and unexpected ways. That's when the problems started to appear.
Because DynamoDB is a NoSQL engine based on key:value pairs, it doesn't support SQL-like queries. Unless you know specifically what you need out of it and can provide the key, you are left writing programmatic solutions that iterate through the data; these quickly become convoluted at best and impossible to change or maintain at worst, and they create technical debt in the process, too. What we were left with is an operation called a 'Scan', which returns all of the data in the table, after which you apply your filters. What is the point of having a database if you have to return all of the data for every single query? When dealing with large datasets, this increases both the cost and the latency of each query: the worst of both worlds!
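To make the Scan problem concrete, here is a minimal sketch of the pagination loop a Scan forces on you. The `FakeTable` class below is a stand-in for a real boto3 DynamoDB table resource (so the example runs without AWS); the table and attribute names are illustrative, not from our actual dataset.

```python
# Sketch of why Scan-based access hurts: every page of the table has to be
# pulled before any client-side filtering can happen.
class FakeTable:
    """Stand-in for a boto3 Table: mimics scan() paging via LastEvaluatedKey."""
    def __init__(self, items, page_size=2):
        self._items = items
        self._page = page_size

    def scan(self, ExclusiveStartKey=0, **kwargs):
        start = ExclusiveStartKey
        resp = {"Items": self._items[start:start + self._page]}
        if start + self._page < len(self._items):
            # Real DynamoDB returns this when a Scan page is truncated (~1MB cap)
            resp["LastEvaluatedKey"] = start + self._page
        return resp

def scan_all(table):
    """Drain every page of a Scan -- all of it billed, wanted or not."""
    items, kwargs = [], {}
    while True:
        resp = table.scan(**kwargs)
        items.extend(resp["Items"])
        if "LastEvaluatedKey" not in resp:
            return items
        kwargs["ExclusiveStartKey"] = resp["LastEvaluatedKey"]

table = FakeTable([{"domain": f"host{i}.example.com", "port": 443} for i in range(5)])
rows = scan_all(table)
# Filtering only happens after the full (already paid-for) read:
https_only = [r for r in rows if r["port"] == 443]
```

The loop itself is the point: you pay to read the whole table on every query, and the "query" logic lives in your application code rather than the database.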
We have already started to touch on the last question, "how much technical debt will this bring?". Technical debt is the work you will need to pay back in future to make any changes to a technical solution you put in place (and it's worse when business processes depend on that solution, too). Imagine a spreadsheet that starts out simple but grows very complex over the years; the original people who made it are long gone and everyone is now scared to change it in case it breaks. The total hours and monetary cost of the effort needed to change it is the debt being paid back.
AWS promise a simple and easy-to-use database engine, but the technical debt we accrued in just a few short months was the very opposite of that promise. We were already paying back hours of effort in technical debt we didn't even know we had. The major reason is that whilst it's almost effortless to ingest data (it smashed our expectations when we were answering questions one and two), getting data out proved to be exhausting. Even when we decided to pay back the technical debt and move away from DynamoDB entirely, we found leaving even harder. Apparently one of the hardest things you can try to do with DynamoDB is export your entire table.
Take, for instance, the smallest table we wanted to move from DynamoDB. At just under 100MB of data spread over 140,000 rows, it could fit happily in an Excel spreadsheet. To extract all of the data from that one table, AWS provide two out-of-the-box solutions. The first is to use the DynamoDB console, where you can select only 100 records at a time and export them to a CSV. We could recombine the resulting 1,400 CSV files easily enough afterwards, but only if we were willing to spend hours upfront selecting and exporting (no, seriously, they only allow 100 at a time!).
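The recombining step, at least, is the easy part. A quick sketch of how those 100-record chunks could be stitched back together (in-memory file objects stand in for the 1,400 exported files; column names are made up for illustration):

```python
import csv
import io

def merge_csv_chunks(chunks, out):
    """Combine console-export chunks (each with its own header row) into one
    CSV, keeping only the first header. chunks/out are file-like objects."""
    writer = None
    for chunk in chunks:
        reader = csv.reader(chunk)
        header = next(reader)
        if writer is None:
            writer = csv.writer(out)
            writer.writerow(header)
        writer.writerows(reader)

# Two stand-in 100-row exports (trimmed to 2 rows each here)
chunk_a = io.StringIO("domain,port\r\na.example.com,443\r\nb.example.com,80\r\n")
chunk_b = io.StringIO("domain,port\r\nc.example.com,22\r\nd.example.com,443\r\n")
merged = io.StringIO()
merge_csv_chunks([chunk_a, chunk_b], merged)
```

In practice you would pass `open(path)` handles from a `glob` over the export directory, but the hours of manual clicking to produce those files in the first place is the part no script can save you from.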
AWS recommend a totally different approach to getting around the 100 record limit, which is to use the AWS DataPipeline system to export the table directly into an S3 Bucket.
But this solution won't work for us and we want to explain why it might not work for you either.
Firstly, AWS DataPipeline spins up an m3.xlarge EC2 instance with 16GB of RAM to do the heavy lifting in the background.
Depending on the size and number of tables to extract, this can cost you a small fortune.
The second and more annoying issue for us was that AWS DataPipeline is not available in all regions.
Even when we poured more hours into trying to hack together a solution by using different regions we couldn't get it to work. The technical debt was building quicker and quicker! We soon abandoned both of the solutions that AWS provided to our problem.
AWS DynamoDB is fantastic; let me be clear that the system, and NoSQL in general, are not at fault here. Yes, AWS could have made a better solution for getting data out, but that is not the fault of the database engine itself: that's a service availability issue (DataPipeline) and a UX issue (100 records). Had we known all this before we deployed the database, we might not have gone that route at all. The failure here was on us, or specifically on me personally. I made the final call on which system to use based only on the first few questions we asked.
So how did we finally solve the extraction issue for no extra cost and very little time? Using an existing AWS EC2 machine we installed this small piece of NodeJS code, and because of the way AWS handles IAM role-based authentication, we didn't need to supply AWS credentials. Minutes later, all of the tables we needed were extracted and in CSV format, ready to move to an RDS database of our choice. Multiple days of technical debt eliminated in under 30 minutes.
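Our actual export ran as NodeJS, but the shape of the approach can be sketched in a few lines of Python. The flattening step is the interesting bit: NoSQL items don't share a fixed schema, so you have to compute the union of attribute names before you can write a CSV header. The item data below is invented for illustration; in a real run the items would come from a boto3 `Table.scan()` loop, and on an EC2 instance boto3 picks up the instance's IAM role credentials automatically, which is why no keys need to be supplied.

```python
import csv
import io

def items_to_csv(items, out):
    """Flatten heterogeneous DynamoDB-style items into one CSV.
    Items don't share a fixed schema, so collect the union of keys first."""
    fields = sorted({key for item in items for key in item})
    writer = csv.DictWriter(out, fieldnames=fields, restval="")
    writer.writeheader()
    writer.writerows(items)  # missing attributes become empty cells

# Illustrative items with differing attributes, as NoSQL rows often have
items = [
    {"domain": "a.example.com", "port": 443},
    {"domain": "b.example.com", "tls": "1.2"},
]
buf = io.StringIO()
items_to_csv(items, buf)
```

From there the CSV can be bulk-loaded into an RDS engine with its native import tooling, which is exactly the "ready to move" state described above.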
Moving forward, we have decided to ask more in-depth and long-term questions about supplying data internally across our services, and I hope this blog post helps you consider the questions above in a similar manner. We will continue to use DynamoDB for other projects, but only when it is the best fit for the data and its use.