Anyone know of a SW solution for encrypted multi-HDD storage with deduplication?

postcd
Hello,

I want to describe a problem I have been thinking about for a long time and have been unable to solve despite searching and asking on other sites. A summary is in the title of this page.

Currently I am on Windows, and I do not want to leave it.
I need to store around 7 TB of data, growing by maybe 150 GB/month: mainly movies, plus the data of an application that uses roughly a million files (mostly small files, maybe 500 GB in total). Currently I am on HDDs (the app files on one external HDD, the movies etc. on another external one, plus the system HDD), but I would welcome speeding the storage up by at least 100% (-> read striping), and I am also running out of disk space.

ISSUE: I need more space than SSDs can offer and more speed than a single HDD can offer.

I think I may need to create some storage "pool" out of my HDDs. I want this storage pool to be encrypted once I shut down the computer. I think I would need to use USB enclosures, as I do not want to buy an expensive, power-hungry NAS, and I do not trust a network-attached filesystem to be fully compatible with my Windows 10 PC (i.e. that all apps would have no problem using it).

I can use the Windows app DrivePool to combine multiple USB drives, but it does not offer deduplication: if the virtual storage pool (an NTFS filesystem) contains the same file in multiple directories, the software cannot save physical disk space by keeping only one physical copy. But no problem; maybe I should find some software that does this for me regularly by replacing duplicates with hardlinks?
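
(For illustration, here is a minimal Python sketch of what such a tool could do; it is my own example, not a DrivePool feature or any existing product. It groups files by size, confirms duplicates with a full SHA-256 hash, and replaces extra copies with hardlinks via os.link, which on Windows works only within a single NTFS volume; the function names, root path, and dry-run default are assumptions for the example.)

import hashlib
import os
import sys
from collections import defaultdict

def file_hash(path, bufsize=1 << 20):
    # Full-content SHA-256; equal size alone is not proof of duplication.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(bufsize):
            h.update(chunk)
    return h.hexdigest()

def dedup_to_hardlinks(root, dry_run=True):
    by_size = defaultdict(list)
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path):
                by_size[os.path.getsize(path)].append(path)

    saved = 0
    for size, paths in by_size.items():
        if size == 0 or len(paths) < 2:
            continue  # a unique size cannot have a duplicate
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[file_hash(path)].append(path)
        for copies in by_hash.values():
            keep = copies[0]
            for path in copies[1:]:
                if os.path.samefile(keep, path):
                    continue  # already hardlinked together
                if not dry_run:
                    os.remove(path)
                    os.link(keep, path)  # hardlink; same volume only
                saved += size
    print(f"{'Would reclaim' if dry_run else 'Reclaimed'} about {saved / 1e9:.1f} GB")

if __name__ == "__main__":
    dedup_to_hardlinks(sys.argv[1], dry_run=True)

(One caveat of the hardlink approach: all linked "copies" share the same data, so modifying one changes them all. That is fine for read-only media, but not for files an app rewrites in place.)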

I do not need real-time deduplication: I have read that the deduplication feature of the ZFS filesystem (and maybe of Btrfs) requires something like 6 GB of RAM plus 1 GB of RAM per 1 TB of storage, plus additional CPU time (https://hardforum.com/threads/zfs-dedupe-is-fixed-soon.1854062/). That is too many resources for me. I am wondering if there is a software solution that does not require buying an expensive mini PC with lots of RAM.
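
(A rough back-of-the-envelope for why ZFS dedup needs so much RAM, using figures that are my assumptions rather than exact quotes: roughly 320 bytes of in-core dedup-table entry per unique block is a commonly cited ZFS estimate, and 128 KiB is the default ZFS recordsize; real average block sizes are often smaller, which makes the numbers worse.)

# Rough ZFS dedup-table (DDT) RAM estimate; all figures are assumptions.
TIB = 2 ** 40
pool_bytes = 7 * TIB            # my current ~7 TB data set
avg_block = 128 * 1024          # ZFS default recordsize
bytes_per_entry = 320           # commonly cited in-core DDT entry size

blocks = pool_bytes // avg_block
ddt_ram = blocks * bytes_per_entry
print(f"{blocks / 1e6:.1f} M unique blocks -> ~{ddt_ram / 2**30:.1f} GiB for the DDT")
# prints: 58.7 M unique blocks -> ~17.5 GiB for the DDT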

Maybe I am overcomplicating it; maybe I should just buy two large external HDDs and continue like now. But I am starting to need more space than the largest affordable SSDs offer, and also more speed (IOPS) than one HDD provides; at least certain directories/files, around 500 GB in size, need faster reads than my current single HDD offers. I cannot move them to an SSD, because they are part of the app, which also contains very large files needing terabytes of storage. I tried using symlinks (symbolic links) to a different HDD, but the app did not work with symlinks (and hardlinks on Windows work only within a single drive).

I do not know what to do.
Note that I prefer external 2.5" USB drives because of their lower noise, power consumption, and price.
I do not wish to run some old, noisy computer/NAS just to connect 2-3 HDDs together, and I would like to avoid buying additional expensive or power-hungry hardware. So I prefer (but do not insist on) a software solution, or one using simple static hardware. The DrivePool software can do everything I want, except that it will waste my disk space (in my case terabytes of it), because it cannot handle duplicates on the storage; maybe I should find a Windows program that regularly scans for dupes and replaces them with hardlinks? Symlinks do not help, as mentioned.
Maybe I should buy some mini PC like a Raspberry Pi (<10 W) to offload my HDD IOPS and CPU cycles (which are also becoming a problem). I just do not know how I would reliably and simply connect my Windows PC's storage and the Linux PC's storage; from this point of view, a single storage looks better.

Advice on a good setup is very welcome. Thank you in advance, and sorry for the hard reading (I am not a native speaker).

---------------------

Interesting comments found:
HAMMER (the first version, not HAMMER2) has offline deduplication that is usually scheduled to run at night. This allows regular use to be quite performant with a low memory footprint. Of course there are tradeoffs, in particular heavy disk usage at certain hours (which can be an issue depending on the workload) and the fact that space is reclaimed only some time after it has been wasted.
https://news.ycombinator.com/item?id=15070542

dedup filesystem
    (HAMMER VERSION 5+) Perform offline (post-process) deduplication. Deduplication occurs at the block level; currently only data blocks of the same size can be deduped, metadata blocks can not. The hash function used for comparing data blocks is CRC-32 (CRCs are computed anyways as part of HAMMER data integrity features, so there's no additional overhead). Since CRC is a weak hash function, a byte-by-byte comparison is done before actual deduping. In case of a CRC collision (two data blocks have the same CRC but different contents) the checksum is upgraded to SHA-256.
    ...
    The -m memlimit option should be used to limit memory use during the dedup run if the default 1G limit is too much for the machine.
https://leaf.dragonflybsd.org/cgi/web-man?command=hammer&section=8
(I am unsure whether the quoted text shows it could somehow work in my case; also, the DragonFly download page shows no ARM CPU support and says "DragonFly BSD is 64-bit only".)
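
(The two-stage pattern the man page describes, a cheap weak hash to nominate candidates followed by a byte-by-byte comparison to confirm them, is easy to illustrate. Below is a toy Python sketch of that pattern; it is my own illustration, not HAMMER code, and the function names are made up.)

import filecmp
import os
import zlib
from collections import defaultdict

def weak_hash(path, bufsize=1 << 20):
    # CRC-32 is cheap but weak: an equal CRC only *suggests* equal content.
    crc = 0
    with open(path, "rb") as f:
        while chunk := f.read(bufsize):
            crc = zlib.crc32(chunk, crc)
    return crc

def confirmed_duplicates(paths):
    # Stage 1: group candidate duplicates by (size, CRC-32).
    candidates = defaultdict(list)
    for path in paths:
        candidates[(os.path.getsize(path), weak_hash(path))].append(path)
    # Stage 2: byte-by-byte verification before any deduping, as HAMMER does.
    for group in candidates.values():
        ref = group[0]
        dupes = [p for p in group[1:] if filecmp.cmp(ref, p, shallow=False)]
        if dupes:
            yield ref, dupes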
 
Your requirements for what you are trying to do are not reasonable. You'll have to flex on budget, performance, or other needs. Also, encrypted data does not dedup or compress well. Why? Encrypted data is almost all unique, with very little commonality. In other words, encrypted data is a lot like any compressed media file in that it doesn't dedupe at all. You may get 1:1.2 dedupe if you are lucky, but you are going to waste a ton of CPU cycles doing it.

You are overcomplicating it, and what you posted would be an animal to manage. Don't reinvent the wheel, because all you are going to end up with is some clunky, convoluted, complicated system that doesn't work and locks you into a hardware model you can't get out of. Buy a small 4-drive NAS with encryption built in that allows you to easily move your data in and out of it down the road. It's a lot cheaper than you think.
 
You are right in your arguments.
But you can dedupe before you encrypt.
 
Yes, you can, depending on the data. But going back and layering encryption on top of it will be ugly. Very hacky.

If you are absolutely serious about data encryption on the disk, you'll spend the money on drives that are self-encrypting. Meaning, they do encryption at rest without you, the end user, having to do anything, as it's baked into the firmware. You interact with the drive to get its security key so that whatever disk management system you use can then interact with it. All it means is that if someone takes a drive out of your hardware stack, the drive just looks like it has garbage on it. However, if the drive is online and mounted using the security key, then anyone who has access, permission-wise, to the system can access the data. A high-level view of it: it is like using secure keys in SSH for a no-password login to a server, except that your storage controller uses the key provided by the drive to access the data.
 
Oracle Solaris 11.4 and a RAID-Z2 pool, e.g. 6 x 4 TB = 16 TB usable, where two disks can fail without data loss?

All the unique ZFS security features: Windows NTFS-like ACLs, ZFS snaps as "previous versions", SMB 3.1, realtime dedup2 with reduced memory needs, and encryption per ZFS filesystem. Despite dedup2, I would use at least 16 GB RAM, but RAM is not that expensive nowadays.
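
(The capacity arithmetic, simplified; actual usable space is a bit less once ZFS metadata and reserved slop space are subtracted:)

# RAID-Z2 usable capacity: (disks - 2 parity) * disk size, simplified.
disks, parity, disk_tb = 6, 2, 4
usable_tb = (disks - parity) * disk_tb
print(f"{disks} x {disk_tb} TB RAID-Z2 -> {usable_tb} TB usable, any {parity} disks may fail")
# prints: 6 x 4 TB RAID-Z2 -> 16 TB usable, any 2 disks may fail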
 
This would work, but what's the OP's budget?

@OP, what's your budget to work with? Knowing that may help us point you in the right direction.
 
Solaris 11.4 for a commercial use case costs at least 800 USD/year with support and updates.
You can download it for free for test and development, but without support and updates prior to Solaris 11.5.
 
Unless someone is familiar with Linux, taking on a Solaris project might be a little advanced. There are tons of tutorials for home-grown ZFS-based solutions, but they can be tough to follow for the average user. And building a ZFS solution is going to run around $800.

Banging around Newegg, I saw this: https://www.newegg.com/Product/Prod...=1&cm_re=encrypted_nas-_-22-108-687-_-Product - a 4-bay NAS with encryption built in; the OP just has to buy some drives. A solid platform, vendor-backed, on a known-good stack of hardware, and with a warranty? Yeah. It would be sub-$800, and with a few 4 TB spinners (https://www.newegg.com/Product/Product.aspx?Item=N82E16822149627&ignorebbr=1) it could keep the OP under $800.
 
Is there a reason you are looking at USB drives other than not wanting a NAS?
Wouldn't it be better to fit the drives directly into your PC?
That would remove many of the problems that can occur when accessing drives through a USB interface over the long term; reliability would improve substantially, and data recovery would be much simpler if there is a problem.
It's also cheaper and guarantees you will get maximum performance.
 
OP here!

_Gea and a few others did not gather from the following statements of mine that my budget is roughly the cost of a mini PC or less (which means maybe up to $350):
"I do not want to buy an expensive, power-hungry NAS"
"That is too many resources for me. I am wondering if there is a software solution that does not require buying an expensive mini PC with lots of RAM."

My current PC's motherboard supports up to 16 GB RAM, which I already have and am using to the maximum. My computer can hold only a single 2.5" drive; that is why I came up with the idea of external USB HDDs (Nenu). If I do not use a purely software solution to build the storage pool, I would have to buy either a NAS, as you suggest, or a mini PC (energy-efficient and silent, since it will be in my room; I would prefer less than 10 W as mentioned, which means a Raspberry Pi or similar, see "Build a Raspberry Pi NAS"), and I think those do not support 16+ GB RAM.

PS: in the meantime I have realized that I may not need the deduplication feature built into the HDD-pooling solution itself, because maybe I can find some external app to regularly replace duplicates with hardlinks within the resulting pool.
So the requirement now would be "only" this:

a budget/silent/energy-efficient solution for
- joining multiple HDDs into one storage pool, preferably with read striping
- a resulting storage "pool" that my Windows apps will support (I have no experience with NAS, so I do not know whether some apps fail to understand network-attached storage)
- data readable on Linux (in case my Windows PC is out of order)
- a storage pool that is encrypted once the PC is shut down

Thanks to all for trying to help. I welcome and appreciate all on-topic ideas that do not go wildly over the mentioned budget. And sorry for my too-broad and maybe hard-to-understand "request" o_O
 
FreeNAS will do this.

But you will need 8 GB of RAM as a minimum, and you will need to skip the dedup option.

FreeNAS will work on most reasonable hardware, maybe a used HP or Lenovo desktop, and it will encrypt fine.

As you're managing lots of small files, you could add an Optane NVMe to use as a cache.

Done and dusted.
 
And where is the problem?

If you can build your own and skip professional server features like ECC and IPMI, but insist on Intel-quality NICs
(I note euro prices including EU VAT, but USD prices are supposedly quite similar):

- Silverstone CS380 case with 8 hot-plug bays (SATA, dual-path SAS): 130 Euro
- PSU: 50 Euro
- mainboard, socket 1151, e.g. ASRock H270 Pro4 or B360M Pro4 (1151v2): 70 Euro
- the cheapest Celeron: 40 Euro
- 8 GB RAM: 50 Euro

Sum: less than 350 Euro without disks, where the case is the most expensive part.
Without a backplane it can be much cheaper.

Look for a free storage OS with web management, like my napp-it for Solaris or its free forks, or FreeNAS/XigmaNAS based on FreeBSD; they are all based on ZFS (either native ZFS on Solaris or OpenZFS).

From the outside (a Windows client), such a ZFS appliance looks much like a Windows machine, as it just offers SMB shares; in the Solarish case even with Windows NTFS-like ACLs and ZFS snaps as Windows "previous versions" out of the box.
 
Budget vs requirements will not work for what you are trying to do. Hopefully you'll find a solution that works for you.
 
Thanks to you both for the good tips.

Method A:

_Gea's suggested setup has the downside of higher power consumption (possibly over 45 W), I think, and maybe the noise of the PSU fan. Here is an idea for reducing power consumption by at least 20 W, possibly on a lower budget:

- a budget second-hand computer case without a PSU as housing for the HDDs, plus an AC power adapter that provides several IDE power connectors for the HDDs (I already have one for 2 HDDs; the price was under $10), and also external HDDs powered via the USB port. (This assumes FreeNAS or ZFS supports mixing external and internal HDDs; most of my current capacity is in external HDDs, and I know I should use a reliable power supply for the HDDs.)

- then use a low-power budget mini PC with at least 8 GB RAM (I thought ZFS without dedup can run on even 1 GB RAM; when you google, you can find articles like reddit.com/r/DataHoarder/comments/3s7vrd/): aliexpress.com/category/70803003/mini-pc.html, then refine the search by typing 8GB or 16GB

=> this whole setup would cut the computer's power consumption from roughly 40-50 W to roughly 15-20 W (not counting the HDDs).
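
(What that difference is worth per year, with an electricity price I am assuming purely for the example:)

# Yearly savings from cutting ~45 W to ~17.5 W running 24/7; price assumed.
watts_old, watts_new = 45.0, 17.5
eur_per_kwh = 0.20                      # assumed rate; substitute your own
saved_kwh = (watts_old - watts_new) * 24 * 365 / 1000
print(f"~{saved_kwh:.0f} kWh/year -> ~{saved_kwh * eur_per_kwh:.0f} EUR/year saved")
# prints: ~241 kWh/year -> ~48 EUR/year saved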

There are also mobile devices (phones) with 16 GB RAM, though it may be hard or impossible to run an OS supporting FreeNAS/ZFS on them.

Method B - I use Windows software like DrivePool + DiskCryptor. The initial software cost is under $50. I have already mentioned this setup; it would add no noise and no additional power consumption.

Method C - I wait a few years for manufacturers to build better mini PCs with more RAM and CPUs with AV1 support, and in the meantime I continue with a non-ideal setup like now (stand-alone HDDs, without any read striping); I would only buy a new, bigger HDD and migrate the small one onto it.

I have also checked various NAS units, but the ones I like (8 GB RAM, encryption, ...) cost too much.

It is sad that there is no software pooling solution, with encryption, for a low-RAM mini PC like the Raspberry Pi (1 GB RAM); FreeNAS was suggested here with 8 GB RAM, and ZFS with 8 GB RAM, even though the Reddit thread linked above says ZFS can run with less RAM. Dedup would not be used (its RAM usage is too high).

Another question: I am wondering whether, on a ZFS pool built mainly from HDDs, I can arrange for some file types to be accelerated by placing them on an SSD that is part of the pool.
 
> Another question: I am wondering whether, on a ZFS pool built mainly from HDDs, I can arrange for some file types to be accelerated by placing them on an SSD that is part of the pool.

You can't do that in ZFS.

The idea behind ZFS is different.
For a sequential workload, disks are fast enough. In a ZFS raid, sequential performance scales with the number of data disks, so a few disks can deliver sequential performance good enough for a 10G network. For random read/write workloads, where SSDs are much faster, ZFS uses its superior RAM-based read/write caches (block-based, not file-based, with a read-most/read-last strategy on reads). It is quite usual to have a read hit rate from cache of over 80%. This is why you see ZFS storage systems with > 100 GB RAM.

Additionally, you can extend the RAM-based read cache (ARC) with an SSD or NVMe read cache, the L2ARC (max 10 x RAM), where you can additionally enable read-ahead, which can improve some sequential workloads.
 
> Another question: I am wondering whether, on a ZFS pool built mainly from HDDs, I can arrange for some file types to be accelerated by placing them on an SSD that is part of the pool.

What you are looking for is tiered storage, a form of hierarchical storage management. It is often used in enterprise SAN systems, but I don't know of any actively maintained FOSS HSM system being developed today. The last one I remember was lvmts, but that hasn't been updated in a very long time.
 
postcd, how much duplication do you have in your current 7TB? If you don't actually know, run a scan overnight to help make an informed decision.
 
The current duplication level does not matter, because I am planning to raise it to very roughly 40% of the used disk space. The duplicates will be mainly big media files.
 
> The current duplication level does not matter, because I am planning to raise it to very roughly 40% of the used disk space. The duplicates will be mainly big media files.

Media files do not de-duplicate or compress well. Unless you have multiple copies of the exact same files, dedup will not do anything for you.

Why? There is virtually zero commonality in media files that aids de-duplication. Media files, say mp3 audio or mp4 with AAC and H.264, are already hyper-compressed and squashed down by the codec used to create them. Even if your files are straight-up raw video/audio data, it is like reading an encrypted data stream: there is virtually zero commonality in the data itself. It's like zipping a zip file a few times; the compressed file, re-compressed, just ends up larger.

At best you'll see maybe 1:1.2 to 1:1.4 dedupe if you are lucky.

HammerSandwich nailed it when he suggested running a scan on your content. You could absolutely be wasting your time building your perfect solution only for it to not actually be a good fit for your data.

Your requirements vs. your budget will never line up. The old saying applies here: fast, cheap, and reliable; you can't have all three, pick two.
 
@kdh

> Unless you have multiple copies of the exact same files, dedup will not do anything for you.

Yes, I have identical media files (made by copying one to another location), so dedup is for me.

> Your requirements vs. your budget will never line up. The old saying applies here: fast, cheap, and reliable; you can't have all three, pick two.

Thank you for sharing your experience; I would pick cheap and reliable.
 