Overparameterization in deep learning has led to many breakthroughs in the field. However, overparameterized models also have limitations: they incur high computational and storage costs and are prone to memorization. To address these limitations, the field of sparse neural networks has gained renewed focus. Training sparse neural networks to reach the same performance as dense architectures has proved elusive. Recent work suggests that initialization is the key. However, while this research direction has had some success, focusing on initialization alone appears to be inadequate. In this work, we take a broader view of training sparse networks and consider the role of regularization, optimization, and architecture choices in sparse models. We propose a simple experimental framework, Same Capacity Sparse vs Dense Comparison (SC-SDC), that allows for a fair comparison of sparse and dense networks. Furthermore, we propose a new measure of gradient flow, Effective Gradient Flow (EGF), that correlates better with performance in sparse networks. Using top-line metrics, SC-SDC, and EGF, we show that the default choices of optimizers, activation functions, and regularizers used for dense networks can disadvantage sparse networks. Another issue with sparse networks is the lack of efficient, flexible methods for learning their architectures. Most current approaches focus only on learning convolutional architectures. This limits their application to Convolutional Neural Networks (CNNs) and results in a large search space, since each convolutional layer requires learning hyperparameters such as the padding, kernel size, and stride. To address this, we leverage Neural Architecture Search (NAS) methods to learn sparse architectures in a simple, flexible, and efficient manner.
We propose a simple NAS algorithm, Sparse Neural Architecture Search (SNAS), and a flexible NAS search space that we use to learn layer-wise density levels (the percentage of active weights in each layer). Because the approach is simple, it can learn most architecture types while keeping the search space small. Our results show that we can consistently learn sparse Multilayer Perceptrons (MLPs) and sparse CNNs that outperform their dense counterparts with considerably fewer weights. We also show that the learned architectures are competitive with state-of-the-art architectures and pruning methods. Taken together, these findings show that reconsidering aspects of sparse architecture design and the training regime, combined with simple search methods, yields promising results.