What is considered a backup?
Now it might seem like a really simple answer and almost like common sense, but in information technology (and almost any other industry) it's important to consider what ACTUALLY counts as a backup solution. Nowadays, with "cloud this" and "cloud that", the term "backup" has gotten corrupted and misconstrued as any method of putting something away in storage. In a sense I can agree with that, but I believe it's necessary to narrow that definition down a bit and add in things like version control and redundant data stores.
The classic rule of thumb: 3 copies, on 2 different types of media, and at least 1 off-site.
Surprisingly, Wikipedia has exactly the definition of "backup" that I'm looking for:
In information technology, a backup, or the process of backing up, refers to the copying and archiving of computer data so it may be used to restore the original after a data loss event. The verb form is to back up in two words, whereas the noun is backup.
Backups have two distinct purposes. The primary purpose is to recover data after its loss, be it by data deletion or corruption. Data loss can be a common experience of computer users; a 2008 survey found that 66% of respondents had lost files on their home PC. The secondary purpose of backups is to recover data from an earlier time, according to a user-defined data retention policy, typically configured within a backup application for how long copies of data are required. Though backups represent a simple form of disaster recovery, and should be part of any disaster recovery plan, backups by themselves should not be considered a complete disaster recovery plan. One reason for this is that not all backup systems are able to reconstitute a computer system or other complex configuration such as a computer cluster, active directory server, or database server by simply restoring data from a backup.
Syncing and File Storage =/= Backup Solutions
To many, services like Google Drive, Dropbox, and OneDrive are all considered backup methods, but I beg to differ: these are cloud storage services without any automated or intuitive version control. If you're working in Google Drive, specifically with Google Docs or Google Sheets, any changes you make are saved immediately, and any prior version of what you were working on is gone. Having a backup means you have a version you can revert to in case sh*t hits the fan. So if you worked on something and now what's on that "backup" server is the latest version, how can you roll back to a point in time that was working just fine?
In fact, the better term for those is syncing utilities, not backup solutions. These syncing utilities don't keep snapshots of an old file in case you want to revert it, nor do they really have version control in case you want to restore completely back to an earlier point in time. Although certain syncing utilities can back everything up at a set time each day, those are additional features, not the original point of the service. These file storage solutions can't help if you're trying to revert to an old version of a project file, since you've already worked on it and synced the newest changes.
Now there are plenty of exceptions and caveats, like changing the filename each time so that older versions persist, but that defeats the point of backing up files if you have to rename them every time. In the end, a backup solution answers this question: "If your hard drive totally dies, can you restore yourself from that service?" If the answer is yes, then you have a backup; if the answer is no, then it's really more of a syncing file storage service. Now you could probably do a full backup onto those services, but then the bottleneck becomes the upload and download speeds. Let's also consider the case of viruses and malware, specifically ransomware or anything that encrypts files: a backup solution should be able to roll back to a time when everything was fine, but with these syncing utilities the "backups" will be replaced with the encrypted data and thus all is moot.
Example: You have a 10 page paper you’re writing and you accidentally overwrite it. With those cloud-storage options above, there’s a good chance the data will be gone as soon as it syncs to the cloud. If it had versioning, you’d “browse through the calendar” so to speak and choose the version you want to restore.
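The difference between syncing and versioned backup can be sketched in a few lines. This is a toy illustration, not any real product's API — the `snapshot`/`restore` helpers and the timestamped-filename scheme are made up for the example. A syncing service effectively only keeps the latest copy; a versioned backup keeps every snapshot so you can "browse through the calendar":

```python
import shutil
from datetime import datetime
from pathlib import Path

def snapshot(source: Path, backup_dir: Path) -> Path:
    """Copy the file into a new timestamped snapshot instead of overwriting."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S-%f")
    dest = backup_dir / f"{source.name}.{stamp}"
    shutil.copy2(source, dest)
    return dest

def restore(backup_dir: Path, name: str, target: Path, version: int = -1) -> None:
    """Copy a chosen snapshot back over the live file; -1 = most recent."""
    versions = sorted(backup_dir.glob(f"{name}.*"))  # timestamps sort in order
    shutil.copy2(versions[version], target)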
Depending on the purpose, you have many different options, ranging from individual machines to enterprise organizations. For an individual machine, or just a handful, typically the best option is external hard drives and maybe a cloud backup solution. There are various services that can provide backup solutions from personal to enterprise scale; I believe ShadowProtect also backs up to an image and allows the user to boot directly into a VM if a computer dies, to help with the restore process. Other options can be to back up onto hard drives, USB drives, network locations, etc. All of these solutions essentially fall under Disaster Recovery as a Service (DRaaS), indicating how they are primarily meant to provide disaster recovery (pretty self-explanatory, right?).
NAS. No, not the rapper.
Now if you don't want to use a cloud-based solution or DRaaS, you also have smaller-scale choices, such as setting up your own file server and NAS to contain your backups. Essentially it's the same as what one of those DRaaS solutions provides, but done "locally", in the sense that it's a personal backup on a personal machine. Per the rule of thumb above, at least one of these backup copies should live somewhere other than your home to ensure its security and availability when needed. If you need a refresher on NAS devices, I do have a post about SANs and NASes for you to read back on! Anyways, in smaller situations a NAS device should suffice (SANs are more expensive than a diamond-plated gold bar), and with multiple NAS devices and hard drives you can create a cluster for high availability! Below are two different NAS configurations for high availability. Obviously active-active is the best, but not many budget machines really provide it; the main difference is a short downtime while the configuration manager automatically spins up the passive NAS machine.
Active-Passive NAS – Only one server is active at any time, so users are directed to that one server. The active server receives all of the traffic while the passive server synchronizes as needed; if the active server fails, the passive server reconfigures itself to become the active server. As a result, there can be some downtime while it reconfigures itself.
Active-Active NAS – Both NAS units are active and synchronizing, so there is no downtime if either server fails. Traffic travels to both, and if one server fails, traffic automatically gets redirected to the remaining active server.
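The active-passive logic above boils down to a health check plus a promotion step. Here's a minimal sketch (the `Node` class and `route_request` function are hypothetical, not any real cluster manager's API) showing where the short failover downtime comes from:

```python
class Node:
    """A toy stand-in for one NAS unit in the cluster."""
    def __init__(self, name: str):
        self.name = name
        self.healthy = True

def route_request(active: Node, passive: Node) -> Node:
    """Send traffic to the active node; fail over if it's down."""
    if active.healthy:
        return active
    # Promotion: in a real cluster this step (re-mounting shares,
    # claiming the virtual IP, replaying pending syncs) is exactly
    # where the active-passive downtime window comes from.
    return passive
```

In an active-active setup there's no promotion step to wait on: both nodes already serve traffic, so a failure just removes one destination.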
Best practice is to turn the NAS setup into a High Availability Cluster, constantly synchronizing data to ensure it is the same on all cluster nodes. A standard cluster needs a minimum of 3 hosts contributing storage, each with at least one solid-state disk (used for caching) and one hard disk drive for permanent storage. The capacity of all three (or more) hosts is used to create a shared datastore. Capacity and availability can be customized by way of a 'failures to tolerate' setting. It can do some pretty cool stuff, like automatically rebuilding data onto the remaining hosts if one fails and doesn't come back within x amount of time.
Now another option can be RAID. Remember those levels? 0, 1, 5, 10, etc.? Remember data redundancy, parity-bit overhead, and so on? Those are also an option; however, I believe RAID drives generally need to be installed and formatted at the same time, meaning that on a new computer it's all fine and dandy, but on an up-and-running machine it's best to back up the information, reformat both drives, reinstall them in RAID, and then restore the data onto them. HOWEVER, RAID IS NOT A BACKUP SOLUTION, IT'S DATA REDUNDANCY. It's really there for working through a drive failure; it just ensures data is still there in case a drive dies. So: Continuity Measure, Not Backup Measure.
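The parity idea behind RAID 5 is just XOR, and a toy version makes the "redundancy, not backup" point concrete: parity can rebuild a dead drive's block, but it can't undo a bad write, because the bad write updates the parity too. A minimal sketch (not a real RAID implementation — real arrays stripe and rotate parity across drives):

```python
from functools import reduce

def parity(blocks: list[bytes]) -> bytes:
    """XOR equal-sized data blocks together to form the parity block."""
    return bytes(reduce(lambda a, b: a ^ b, chunk) for chunk in zip(*blocks))

def rebuild(surviving: list[bytes], parity_block: bytes) -> bytes:
    """A failed drive's block is the XOR of the survivors and the parity."""
    return parity(surviving + [parity_block])
```

Lose a drive and `rebuild` recovers its contents from the others; overwrite a file and every drive (parity included) faithfully records the overwrite. That's why RAID alone can't save you from deletion or ransomware.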
After RAID has been configured on a NAS, another facet to take into account is Error Correcting Code memory (ECC memory). It matters because ECC memory can detect and correct common forms of internal data corruption; as a result, it is commonly used in servers and machines where data corruption cannot be tolerated, such as large computation workloads and, in this case, data restoration and storage devices. ECC memory can detect and fix single-bit errors on its own, so the data read from each word is always the same as the data that was written to it (kind of confusing, but just understand it as bit-by-bit correction). With servers, you want ECC RAM simply because you never want to corrupt what you're serving to an actual user, and at larger scales, you don't want an accidental bit flip to cause you to lose a pile of work or report an incorrect result somewhere.
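To demystify "single-bit correction", here's the classic Hamming(7,4) code, which is the textbook ancestor of the codes ECC memory uses (real ECC modules use wider SECDED codes over 64-bit words, so treat this purely as an illustration of the principle):

```python
def hamming_encode(d: list[int]) -> list[int]:
    """Encode 4 data bits into a 7-bit codeword with 3 parity bits."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4          # covers positions 1,3,5,7
    p2 = d1 ^ d3 ^ d4          # covers positions 2,3,6,7
    p3 = d2 ^ d3 ^ d4          # covers positions 4,5,6,7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming_correct(code: list[int]) -> list[int]:
    """Detect and fix a single flipped bit, returning the 4 data bits."""
    c = code[:]
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    error_pos = s1 * 1 + s2 * 2 + s3 * 4   # 0 means no error
    if error_pos:
        c[error_pos - 1] ^= 1              # flip the bad bit back
    return [c[2], c[4], c[5], c[6]]
```

The three parity checks together point directly at the flipped bit's position, which is how the hardware can repair a cosmic-ray bit flip transparently on every read.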
The reason I bring up ECC is because FreeNAS uses an advanced, modern (optional) filesystem called ZFS (Zettabyte File System) that benefits greatly from ECC memory! There is a lot more overhead due to file and bit checking, but ZFS is the "bee's knees", with lots of robust functionality; something to consider is that it requires more RAM. That extra overhead in ZFS goes towards pooling storage blocks together to divide the available space into file systems, along with high and transparent compression, capacity reservations, and clonable snapshots (useful for spinning up different instances). Alongside the flexibility is the dependability regarding integrity and redundancy: ZFS creates a chain of trust by checksumming data and metadata, and by periodically checking whether data or backups are suffering from "bit rot" (which is why ECC is recommended).
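That "chain of trust" amounts to storing each block's checksum next to the pointer to it, not next to the block itself, so corrupted data can never vouch for itself. A toy model of the idea (the `write_block`/`read_block` functions are invented for illustration; real ZFS uses block pointers, fletcher/SHA-256 checksums, and a Merkle tree up to the uberblock):

```python
import hashlib

def write_block(store: dict, data: bytes) -> tuple[str, str]:
    """Store a block; return (address, checksum) — the 'block pointer'."""
    checksum = hashlib.sha256(data).hexdigest()
    addr = checksum[:16]               # toy addressing scheme
    store[addr] = data
    return addr, checksum

def read_block(store: dict, addr: str, checksum: str) -> bytes:
    """Verify the block against the checksum held by its parent."""
    data = store[addr]
    if hashlib.sha256(data).hexdigest() != checksum:
        raise IOError("checksum mismatch: bit rot detected")
    return data
```

A scrub is essentially `read_block` run over everything on a schedule: silent corruption gets caught (and, with redundancy, repaired from a good copy) instead of quietly propagating into your backups.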
***ECC Memory depends on the Motherboard!***
ZFS works fine without ECC. There is an "urban myth" based on the idea that an in-memory checksum error might lead to a "scrub of death" (a scrub scans the whole surface and basically rewrites damaged blocks, so it could in theory be the worst-case scenario). But the thing is: in case of a checksum error, ZFS will try to recover the block from a RAID copy, and that block will not match the bad checksum either. On a single drive, if the original write's checksum is bad, then you've just lost your file until you recover it from your backup. This point is moot once you get into multi-drive setups with proper redundancy. I'd just advise against running ZFS in single-drive mode with non-ECC RAM; that is a particular case only. With a multi-drive setup, ECC is nice, but not necessary.
Replication and Deduplication
With all that in mind, let’s start to finish with some data replication and deduplication!
Synchronous replication does not acknowledge the write to the primary application until the block has been replicated to the target site. Asynchronous replication acknowledges the write immediately and then replicates the block over time.
With synchronous replication, data is continually up to date at the target site, and you know the source site is keeping the target site in step through the same process; the trade-off is that every write has to wait for replication to complete, so latency and bandwidth to the target site directly slow down the application. With asynchronous replication, the write is confirmed right away and the replication proceeds regardless of bandwidth or latency, but that also means the target can fall out of sync and never catch up, since the acknowledgement has already happened and nothing confirms the replication's completion or correctness.
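The two modes can be contrasted in a few lines. This is a deliberately simplified sketch (in-memory dicts standing in for the two sites, a queue standing in for the replication link; all names are made up):

```python
import queue

primary: dict = {}          # source site
replica: dict = {}          # target site
pending: "queue.Queue" = queue.Queue()   # the replication link

def write_sync(key, value) -> str:
    """Synchronous: no ack until the replica has the block."""
    primary[key] = value
    replica[key] = value     # the caller waits on this remote round-trip
    return "ack"

def write_async(key, value) -> str:
    """Asynchronous: ack immediately, replicate in the background."""
    primary[key] = value
    pending.put((key, value))  # replica catches up later — or falls behind
    return "ack"

def drain() -> None:
    """Background replication catching up on the queued writes."""
    while not pending.empty():
        k, v = pending.get()
        replica[k] = v
```

The window where `pending` holds unreplicated writes is exactly the data you lose if the primary site dies before the queue drains.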
Finally, let's talk about data deduplication and how it can help the entire backup process go so much quicker! When identical files are found, a single copy is kept and the other copies are replaced with pointers to the original. Users can interact with the pointers as if they were the original file, and when a user modifies the file, a copy with their modifications is created for their use. Essentially, data deduplication looks through the binary representation of the data and only stores and backs up what has changed; it skips the redundant information and continues on its merry way until it runs into a difference. Especially with these large storage devices, the challenge is storing data efficiently and quickly, so data deduplication shrinks the amount of data being stored while still reading back exactly as expected.
Dedupe allows you to back that data up to disk using dedupe methodologies, and because dedupe actually eliminates the redundant blocks, it then allows you to replicate that backup to another location; until recently, that was only feasible in the smallest environments.
Here's an example: say there is a file that a project team is working on and revising, and they all save it back onto the local file server and, by extension, the NAS device. Each user would have their own unique copy, but also a pointer back to the original, unchanged file. File-level data deduplication only works on IDENTICAL files, down to the bit stream, so one copy is actually stored and pointers are distributed to everybody else with the right permissions. Downloading a music file and putting it in the shared folder means it shows up in everybody's shared folder as well, but the system only keeps one copy and creates a pointer (think symbolic link) for everybody else to reference; it appears the file is there, but it's really just a pointer back to the one original music file. As soon as someone tries to erase the artist information, though, that becomes a unique copy specific to that one user, while everybody else still has the pointer to the original. There are different forms of data deduplication that essentially set the threshold on how data is scanned before creating that pointer: File-Level and Block-Level Deduplication.
File-level deduplication works at the file level by eliminating duplicate files; block-level deduplication works at the block level (with fixed-size or variable-size blocks) by eliminating duplicate blocks. The advantage of file-level deduplication is that it doesn't require as much overhead: if there is a change, just save the file again. However, that means for a large 25 MB PowerPoint, it will resave the entire file even if only the author's name was slightly changed. Block-level deduplication can eliminate the redundant chunks of data on a smaller scale than a whole file, but it also means more resource utilization due to the increased overhead of scanning more thoroughly. With block-level deduplication there is no standardized definition of a "block": it could be a file, a chunk, a block, a byte, or even a bit. All that matters is that the deduplication uses the best algorithm to determine what has not changed as quickly as possible, and to deduplicate what has changed in the most efficient manner.
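A minimal fixed-size block-level dedupe can be sketched by hashing each block and storing each unique block once — the hashes act as the "pointers" from the example above. This is a toy (real systems handle variable-size chunking, reference counting, and hash-collision policy; the `dedupe`/`rehydrate` names are invented):

```python
import hashlib

def dedupe(data: bytes, store: dict, block_size: int = 4096) -> list[str]:
    """Split data into fixed-size blocks, store each unique block once,
    and return the list of block hashes (the 'pointers') for this file."""
    pointers = []
    for i in range(0, len(data), block_size):
        block = data[i:i + block_size]
        digest = hashlib.sha256(block).hexdigest()
        store.setdefault(digest, block)   # a duplicate block costs nothing new
        pointers.append(digest)
    return pointers

def rehydrate(pointers: list[str], store: dict) -> bytes:
    """Reassemble the original file from its block pointers."""
    return b"".join(store[p] for p in pointers)
```

Two files that share a block (say, the same music file header, or the unchanged slides of that PowerPoint) store that block exactly once; only the differing blocks consume new space, which is the whole trade described above.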