Dataset Serialize OutOfMemoryException Java
I am reading a large xlsx file (100 MB) with 28 sheets (10,000 rows per sheet) and building a single DataFrame from it. I get an out-of-memory exception when running in cluster mode. My code looks like this:
import scala.collection.JavaConverters._
import com.amazonaws.services.s3.{AmazonS3, AmazonS3Client}
import com.amazonaws.services.s3.model.{GetObjectRequest, S3Object}
import org.apache.poi.xssf.usermodel.XSSFWorkbook
import org.apache.spark.sql.{DataFrame, SparkSession}

def buildDataframe(spark: SparkSession, filePath: String, requiresHeader: Boolean): DataFrame = {
  // Get an input stream from S3 so the sheet names can be listed
  val s3FilePath = filePath.replaceAll(filePath.substring(0, filePath.indexOf("//") + 2), "")
  val s3Bucket = s3FilePath.substring(0, s3FilePath.indexOf('/'))
  val s3key = s3FilePath.substring(s3FilePath.indexOf('/') + 1, s3FilePath.length())
  val s3Client: AmazonS3 = new AmazonS3Client()
  val s3object: S3Object = s3Client.getObject(new GetObjectRequest(s3Bucket, s3key))
  val inputStream = s3object.getObjectContent()
  // Open the workbook to enumerate its sheets
  val workbook = new XSSFWorkbook(inputStream)
  var sno = workbook.getNumberOfSheets
  var sheetName: Array[String] = new Array[String](sno)
  // Read each sheet through spark-excel
  for ((sheet, i) <- workbook.asScala.zipWithIndex) {
    val initialDF = spark.read.format("com.crealytics.spark.excel")
      .option("useHeader", requiresHeader)
      .option("treatEmptyValuesAsNulls", "true")
      .option("inferSchema", true)
      .option("addColorColumns", false)
      .option("sheetName", sheet.getSheetName)
      .option("path", filePath)
Any help is much appreciated.
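For reference, the bucket and key extraction at the top of buildDataframe can be isolated into a small helper, which makes it easier to check separately from the Spark and POI parts. This is only a minimal sketch: the names S3PathSketch and splitS3Path are hypothetical, the bucket name in the usage example is made up, and it assumes paths of the form scheme://bucket/key.

```scala
object S3PathSketch {
  // Hypothetical helper, not part of the code above.
  // Splits "scheme://bucket/some/key" into ("bucket", "some/key").
  def splitS3Path(filePath: String): (String, String) = {
    val schemeEnd = filePath.indexOf("//")
    require(schemeEnd >= 0, s"expected a scheme such as s3://, got: $filePath")
    // Drop everything up to and including the "//"
    val withoutScheme = filePath.substring(schemeEnd + 2)
    val slash = withoutScheme.indexOf('/')
    require(slash > 0, s"expected bucket/key after the scheme, got: $filePath")
    // Bucket is the text before the first '/', key is everything after it
    (withoutScheme.substring(0, slash), withoutScheme.substring(slash + 1))
  }

  def main(args: Array[String]): Unit = {
    // Made-up path, for illustration only
    val (bucket, key) = splitS3Path("s3://my-bucket/reports/big-file.xlsx")
    println(s"bucket=$bucket key=$key")
  }
}
```

Unlike the replaceAll call above, this does not treat the scheme prefix as a regular expression, and it fails with a clear message when the path has no bucket or key.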