When you're building an application, a situation often comes up where your users want to delete some content but you still need to know that the data once existed.
Let's say you're building a messaging app. You might want to give your users the chance to restore deleted conversations.
So, when someone comes to delete a conversation you need to do something other than simply deleting the document from Couchbase. Instead, you could use a flag in the JSON body that identifies the document as somehow non-live. For example:
Underneath that you'd create a GSI index to give you efficient access to all of those deleted conversations:
CREATE INDEX deletedmsgs ON `conversations`(deleted);
Then, to populate the view of deleted conversations in your app's UI you'd run a N1QL query such as:
SELECT * FROM `conversations` WHERE deleted = true;
So far, everything seems pretty obvious.
Adding this to an existing data set
Now, what happens if you're only just introducing this feature to your application? The likelihood is that you'd already have many thousands of conversations whose underlying documents don't have the deleted flag.
You might think of this as a great opportunity to use N1QL's IS MISSING. You'd be right but you might be thinking along slightly different lines to what I'm going to suggest.
You coulc account for documents that don't have the deleted flag by changing your N1QL query to something like this:
SELECT * FROM `conversations` WHERE deleted = true OR deleted IS MISSING;
To make it a bit more realistic, we might also filter our query based on the person involved:
SELECT * FROM `conversations` WHERE sender = user123 AND deleted = true OR deleted IS MISSING;
This would do the job but it wouldn't be the most efficient way to do it.
Once we introduce N1QL's MISSING keyword in an OR situation, then we set N1QL off on a full document scan: our query is asking N1QL to check every document for the presence of the deleted JSON key. Our index helps only with those documents where deleted exists, because it doesn't index documents where the key isn't present.
Instead, in this case, we'd be better running a one-off query that would return all of the documents that are missing deleted. We could then work through each of them and update them to include deleted = false and so all those documents would now be in our efficient index.
If the query takes a little longer, that's okay: we'd be running it as part of a background data janitoring task, so none of our users would even be aware of it. Once we'd run this task over all the documents, we'd no longer need MISSING in our user-facing query because we'd know that all our conversation documents would be either deleted: true or deleted: false.
Using schema versions instead
We could avoid using MISSING altogether by maintaing schema versioning:
If we imagine that our new schema version, featuring the deleted flag, was 2.0.0 then our background janitor task could look for all the documents that have a schema version older than 2.0.0 and then run them through an upgrade process, part of which would be adding the deleted flag.