Last week, a subset of our customers experienced an incident that removed some email content they had saved to Pipedrive using the Smart Email BCC feature.
The people affected rightfully raised concerns about our ability to protect customers from any kind of data loss.
I would like to start rebuilding the level of customer trust we have enjoyed until now.
Yes, actions speak louder than words, but first allow me to detail our architecture and practices around customer data storage and protection, shed some light on the incident, and then share additional safeguards we are planning to adopt.
A secure, proven database technology
Most of our customers’ data is stored in MySQL databases.
From the very beginning, there was a decision to keep each of our customers’ data in a separate database.
MySQL technology provides a fairly easy and efficient way to keep multiple databases on the same server, which allows us to keep each customer’s data separate from the data of other customers, but still use server resources efficiently.
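To make the one-database-per-customer idea concrete, here is a minimal sketch in Python. It is an illustration only, not our production code: the `customer_` prefix, the function names, and the simple modulo placement rule are all assumptions made for the example.

```python
def database_name_for(customer_id: int, prefix: str = "customer_") -> str:
    """Return the name of the database holding one customer's data.

    The naming scheme is hypothetical; the point is that every customer
    maps to exactly one database of their own.
    """
    if customer_id <= 0:
        raise ValueError("customer_id must be a positive integer")
    return f"{prefix}{customer_id}"


def server_for(customer_id: int, servers: list[str]) -> str:
    """Pick the server that hosts a customer's database.

    Many customer databases share one server; modulo placement is an
    assumption used here only to keep the example short.
    """
    if not servers:
        raise ValueError("at least one server is required")
    return servers[customer_id % len(servers)]
```

With two servers, customers 4 and 6 would land on the same machine but still in separate databases (`customer_4` and `customer_6`), which is the separation-with-shared-resources trade-off described above.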
Automated failover mechanism
Servers can stop working unexpectedly.
Our MySQL databases have real-time replicas and an automated failover mechanism in case something happens to the master database.
To keep disruption and data loss to a minimum, the system automatically detects problems with the master database, takes the affected machine offline, and replaces it with another machine that has been receiving the data in real time. This all happens with no manual input needed from our engineers.
Some of you may have seen “maintenance” notices at various times, and in most cases such messages appear during a failover process to prevent users from changing data on a database experiencing issues.
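The failover flow described above can be sketched roughly as follows. This is a simplified model under stated assumptions, not how our tooling is actually implemented: the `Node` and `Cluster` classes, the `healthy` flag, and the rule of promoting the first healthy replica are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    healthy: bool = True


class Cluster:
    """Toy model of one master database with real-time replicas."""

    def __init__(self, master: Node, replicas: list[Node]) -> None:
        self.master = master
        self.replicas = list(replicas)
        self.maintenance = False  # True while writes are blocked

    def check_and_failover(self) -> Node:
        """Promote a replica if the master is unhealthy; return the master."""
        if self.master.healthy:
            return self.master
        # Block writes so nobody changes data on the failing machine.
        self.maintenance = True
        healthy = [r for r in self.replicas if r.healthy]
        if not healthy:
            # Stay in maintenance mode; this case needs a human.
            raise RuntimeError("no healthy replica to promote")
        promoted = healthy[0]
        self.replicas.remove(promoted)
        self.master = promoted  # the old master is now offline
        self.maintenance = False
        return promoted
```

The `maintenance` flag stands in for the notice users sometimes see: it is raised only for the duration of the switch and cleared once a replica has taken over.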
Encrypted database snapshots
In addition to replicas and failover mechanisms, we have a nightly backup process that takes snapshots of all customer databases, encrypts them, and stores them securely in a separate datacenter.
Taking snapshots is a good idea for a couple of reasons, the first being surgical data recovery.
By contrast, real-time backups only protect against server failures, and even then the backup machine(s) can fail right after the main server does. They also offer no protection against mistakes: if a user accidentally bulk edits some data, those changes are replicated to the backup machine as well.
But the snapshots taken on previous days still have the correct data from those days before the incident, so it's possible to recover past data up to a recent version that is less than a day old – whatever happens to the master and backup databases – and minimize the data loss.
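Choosing which snapshot to recover from comes down to a simple rule: take the most recent snapshot made before the incident. A minimal sketch, assuming snapshots are identified by the time they were taken (the function name and data layout are hypothetical):

```python
from datetime import datetime
from typing import Optional


def latest_snapshot_before(
    snapshots: list[datetime], incident: datetime
) -> Optional[datetime]:
    """Return the newest snapshot taken strictly before the incident.

    Returns None if every snapshot was taken at or after the incident,
    in which case none of them holds the pre-incident data.
    """
    earlier = [taken_at for taken_at in snapshots if taken_at < incident]
    return max(earlier) if earlier else None
```

For example, with nightly snapshots taken at 02:00 on 1, 2, and 3 March and a bulk edit made at 15:00 on 2 March, the function returns the snapshot from 02:00 on 2 March. The snapshot from 3 March is skipped because it already contains the unwanted changes.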
The second reason is location. We keep these snapshots in a separate location so that a natural disaster or other event that takes out an entire data center in one geographical location cannot destroy both the live data and the backups. Our hosting partners are of course prepared to handle risks like that, but it makes sense not to rely on the stability of a single location.
In addition to customer databases, we have several other classes of derived data.
One of them is the data search index.
This index is updated each time data is added or modified in the customer database. Our search indexes are powered by Elasticsearch, a technology that is distributed, scalable and highly available.
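What makes a search index "derived" data is that it can always be rebuilt from the primary records. The sketch below illustrates that relationship with a plain in-memory word index. It deliberately does not use Elasticsearch or its API, and the `Store` class and its methods are hypothetical names invented for the example.

```python
class Store:
    """Primary records plus a derived word index that is kept in sync."""

    def __init__(self) -> None:
        self.records: dict[int, str] = {}     # primary data
        self.index: dict[str, set[int]] = {}  # derived: word -> record ids

    @staticmethod
    def _words(text: str) -> set[str]:
        return set(text.lower().split())

    def save(self, record_id: int, text: str) -> None:
        """Add or modify a record and update the index in the same step."""
        old = self.records.get(record_id)
        if old is not None:
            # Drop index entries that pointed at the previous version.
            for word in self._words(old):
                self.index[word].discard(record_id)
        self.records[record_id] = text
        for word in self._words(text):
            self.index.setdefault(word, set()).add(record_id)

    def search(self, word: str) -> set[int]:
        return set(self.index.get(word.lower(), set()))

    def rebuild_index(self) -> None:
        """Recreate the whole index from the primary records alone."""
        self.index = {}
        for record_id, text in self.records.items():
            for word in self._words(text):
                self.index.setdefault(word, set()).add(record_id)
```

The important property is the last method: as long as the primary records survive, a lost or damaged index can be regenerated from them.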
Code related precautions
Our engineering organization follows a structured process for making code changes to prevent accidents. The process includes a group review of each task before development and a peer review before anything is released live.
We treat our code with the same care as the database backups outlined above: it is versioned and backed up many times over.
Reliability of current Pipedrive systems and personnel
To illustrate the reliability of the system: if a customer accidentally damages their data, we can go back to a snapshot taken 1, 2, 10 or up to 180 days ago and restore the data.
We do hundreds of these data recoveries for our customers every year – and haven’t lost any data in six years of operation.
I myself have worked at Skype and banks that operate internationally and understand the importance of communications, data, security, and storage, which partly explains why I decided to write about the incident publicly on our blog.
More to the point, user data has been and will continue to be handled by a core team of five dedicated professionals. Together they have 51 years of data infrastructure experience, including extensive experience managing critical production systems in the banking, government, and telecommunications sectors, and they are led by a PhD.
How the recent incident happened
Long ago, we made an architectural decision to start using Elasticsearch for storing incoming email bodies. At the time, it looked like a wise decision. Since then, we have learned that the solution doesn’t scale to the extent we expected for email storage.
We have been focused on rebuilding both of our email features – Smart BCC and Full Email Sync – for quite some time, and were happy to release the latter to the public in late July.
After rebuilding and adding new features, we had to clean up our Elasticsearch storage and migrate old data into a new solution. Unfortunately, there was a mistake in our cleanup procedure that affected old, unmigrated data of our customers, as well as search indexes of those customers.
I would like to sincerely apologize for the mistake made, and personally take all responsibility.
I also wish to reassure Pipedrive users that:
- This was a one-time event, unrelated to the usual day-to-day functionality of the product,
- We have made and will continue to make every effort to restore all the lost data we can, and
- We have learned our lesson and will work to avoid any such incident in the future.
Improvements to reliability already in progress
- We are reviewing all of our data classifications and running through disaster recovery scenarios for each of them - the aim of these exercises is to prevent any future data loss and, should an incident be unavoidable, to significantly decrease the time it takes to recover data for each of the data classes
- We are preparing to migrate to multi-datacenter hosting to reduce location-based risk, and for the additional benefit of running accounts from a hosting location geographically close to customers
- We are upgrading our storage system to Ceph, which is designed to provide excellent performance, reliability, and scalability
- Further, we will be growing the Infra team by one or two experienced data professionals, and are planning to add a central monitoring function, and in time, a full team, to boost our capacity to respond to technical and other potentially disruptive incidents
Lessons learned in responding to this incident
We have learned a lot during this time, and want to share the following lessons:
- Trust takes a long time to build and just one incident to break - we're very aware of this, and we want to make things right
- We have a solid approach to reliability, but this wasn't enough - we've learned an important lesson and are busy making our systems and processes even more robust
- Our people are the best asset we have, and we are very proud of how professional, rational, and responsive everyone at Pipedrive has been under trying circumstances
- We’re lucky to have customers like you - we draw an incredible amount of energy, drive, and strength from the salespeople, entrepreneurs and dealmakers who use, promote and evangelize the product and the company
We are especially grateful to customers who got in touch to share messages like these:
“Kudos on your customer transparency. I know this sort of email is never easy to write.” - Kevin M
“That is exactly how I would handle the situation. Too many people (and companies) think it’s ok to just say “I’m sorry” repetitively and do nothing. I’m impressed that Pipedrive is owning up to the responsibility and making it right. I’ll find a way to get past this but you’ve just earned my loyalty by that action. No excuses, no excessive apologies, just decisive action. ” - Ted L
“Even with the issues this week, I would say there is one strength in your company that never seems to waver, the communication with your customers. I really appreciate that and the overall customer service you provide.” - Melissa M
“Sh*t happens when you party naked… Still love Pipedrive” - Emil M
And we’re here if you ever need to get in touch...
Our customer support is available 9am-5pm in the US and Europe, Monday to Friday - feel free to get in touch regarding this issue or anything else.
All of our customer support specialists have been briefed on the incident, should you have any unanswered questions. We will communicate any updates to admins of affected customer accounts as they are confirmed.
Finally, on a personal note, if you have any feedback on this article, my response, or that of Pipedrive as a team, please do not hesitate to email me directly at email@example.com, and I will do my best to answer you and address any and all concerns you may have.
In the meantime, thank you for accepting our sincere apologies, for your collective patience and understanding, and for the time you have taken to read this post in full.