Sorry, I meant to talk to you all about this at some point. Jeremy, Jim, and I bought into Discovery a few years ago. The process and pricing have changed a little over the years. It definitely depends on what you need, but I highly recommend buying into Discovery, and I recommend joining the DBIC group rather than doing something on your own, because pooling resources is highly advantageous.
The standard nodes are about $5k and typically have 16 cores with 8 GB of RAM per core (128 GB total). Research Computing is now continually buying them and negotiating package deals, which allows them to provision new resources to us immediately when we need them. The more of us that buy in on the same group, the more cores we can pool together, and we get a 4x multiplier on the cores we buy. I think we have at least 10 at the moment, but I can't remember exactly. We get them for about 5 years, until the warranty expires; then they are cycled out and we have to buy more, which is why I'm recommending that you not rush to buy a bunch right away, but add them as needed.

We also bought a high-RAM node (1.5 TB) with 24 cores called ndoli. These have actually gotten more expensive since we originally purchased it, so we decided to buy an extended warranty for a few years rather than a new node. The CPU nodes are accessed through the main scheduler, while ndoli is accessed by connecting to it directly, though we may revisit this in the future. Anyone who is part of the DBIC group, which is basically the entire department, has access to these resources. We are hoping to continue to augment this system with funds from the imaging center, individual PIs' startup funds, and Tor's funds. There are GPU nodes on Discovery that we can use, but to my knowledge no lab has bought their own on Discovery; people typically just build local workstations for this.
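To make the pooling math concrete, here's a back-of-envelope sketch in Python. The node specs and the 4x multiplier come from the description above; the example group size and the function itself are just illustrative, not actual DBIC accounting.

```python
# Back-of-envelope sketch of the group buy-in math described above.
# Specs (cost, cores, RAM, multiplier) are the approximate figures
# from this note; everything else is illustrative.
NODE_COST_USD = 5000     # approximate cost per standard node
CORES_PER_NODE = 16
RAM_PER_CORE_GB = 8
CORE_MULTIPLIER = 4      # schedulable cores per purchased core

def pooled_allocation(nodes_bought: int) -> dict:
    """Return the effective shared allocation for a group purchase."""
    cores = nodes_bought * CORES_PER_NODE
    return {
        "purchased_cores": cores,
        "schedulable_cores": cores * CORE_MULTIPLIER,
        "ram_gb": cores * RAM_PER_CORE_GB,
        "cost_usd": nodes_bought * NODE_COST_USD,
    }

# e.g. a group that has pooled 10 standard nodes
print(pooled_allocation(10))
# {'purchased_cores': 160, 'schedulable_cores': 640, 'ram_gb': 1280, 'cost_usd': 50000}
```

The point of the multiplier is visible in the numbers: 10 nodes bought together gives everyone in the group access to 640 schedulable cores, far more than any one lab would get buying alone.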
Storage is a little more complicated. There are several tiers of storage on DartFS; I can't remember the exact differences or prices, but it is roughly $100 per TB. There are slower and faster options, each with or without snapshots. We originally bought about 40 TB for the imaging center, but now we recommend that every lab just buy what they need; it was too hard to keep track, and some labs (like ours) were using a lot more than others. I think our lab probably has about 30-40 TB currently, and we recently switched to the faster storage with snapshots. I'm personally not a huge fan of DartFS: it is insanely slow, and it does not use standard Unix permissions. Instead it uses Access Control Lists (ACLs), which are way more complicated and need to be managed from a Windows computer. I highly recommend you request to have your storage use standard POSIX permissions, as the ACL setup is a nightmare and conflicts with a lot of software we use. It's a long story why storage ended up this way, but it will likely change in the future. The short answer is that they decided to make it possible to mount this storage directly on any computer, so you could mount your Discovery storage on your laptop and work with your data anywhere. Cool idea; in practice, however, it creates a host of problems, and I strongly recommend that you not do this, as it automatically converts your storage to ACL permissions rather than POSIX, and it is a huge pain to switch it back. None of this storage is currently backed up other than the RAID 5/6 they are using and the snapshots, so definitely make sure you have anything important (like data) backed up elsewhere. DBIC storage is currently not HIPAA compliant, so make sure you aren't storing anything that is PHI, though I understand this will change soon.
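For what it's worth, the standard POSIX setup I'm recommending is just the familiar owner/group/other mode bits, which you can manage yourself with `chmod`. A minimal Python sketch (using a throwaway temp directory as a stand-in, not an actual DartFS path):

```python
import os
import stat
import tempfile

# Stand-in for a shared lab directory (a temp dir, not a real DartFS path).
shared_dir = tempfile.mkdtemp()

# Typical shared-lab POSIX setup: read/write/execute for owner and group,
# nothing for others. On ACL-managed DartFS shares, chmod alone wouldn't
# control access -- it would be governed by NFSv4 ACL entries instead,
# which is exactly the complication I'm suggesting you avoid.
os.chmod(shared_dir, stat.S_IRWXU | stat.S_IRWXG)

mode = stat.S_IMODE(os.stat(shared_dir).st_mode)
print(oct(mode))  # 0o770
```

With POSIX permissions this is the whole story; group membership plus three sets of bits. With ACLs, every file can carry its own list of access-control entries, and most of the Unix tooling we use day to day doesn't understand or preserve them.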
Additional Computing Resources
In addition to everything on Discovery, our lab has purchased a few more things. We have a ton of laptops for students and RAs to work on and collect data with, a few Windows and Linux computers for various tasks, and a workstation for RAs. We have about 80 TB of direct-attached storage configured as RAID 5. We also have a webserver in the Moore server room, which hosts a number of web apps that we have developed, including http://neuro-learn.org/.
There are some legacy servers in the Moore server room that were used by other labs before many of us started migrating over to Discovery. I believe these are still active, and Andy Connolly is maintaining and managing them. There isn't much storage (maybe 30 TB), plus a few rack-mounted compute nodes from Microway (probably at least two, with about 64 cores), but they are very old and pretty slow (I think Dr. Zeus and Hydra). Research Computing at Dartmouth was historically not great, which is why many departments, including PBS, built and managed everything on their own; I think CS still does. However, about 5 years ago, they started to make some substantial investments, hiring a bunch of new people and growing the infrastructure. They've been trying to get departments like ours to move back to centralized computing. We've been very happy working with them and encourage all of the new labs to do the same rather than manage your own resources.