Singular value decomposition (SVD) is a fundamental computational kernel and tool wildly used in data analytics such as least squares regression, principle components analysis (PCA), and pattern recognition. While a number of dedicated hardware processors have been proposed to accelerate the computationally intensive SVD computation, these designs suffer from poor flexibly and scalability, and/or lack full consideration of compute and data movement challenges associated with SVD. This paper presents a scalable parallel SVD FPGA engine based on the Hestenes-Jacobi method. We propose Maximum Data Sharing ordering (MDS ordering), which maximizes on-chip data reuse, and significantly reduce the expensive off-chip data movements and bandwidth requirement. Our SVD engine can flexibly decompose rectangular matrices with variable sizes, speed up SVD computation by 80x to 300x when compared with software SVD solvers such as the Eigen package running on high-performance CPUs, and process much larger matrices than the previously reported FPGA designs.