
OutOfMemoryException when building a Spark DataFrame from a large xlsx file



I am reading a big xlsx file of 100 MB with 28 sheets (10,000 rows per sheet) and creating a single DataFrame out of it. I am hitting an out-of-memory exception when running in cluster mode. My code looks like this:


import com.amazonaws.services.s3.model.{GetObjectRequest, S3Object}
import com.amazonaws.services.s3.{AmazonS3, AmazonS3Client}
import org.apache.poi.xssf.usermodel.XSSFWorkbook
import org.apache.spark.sql.{DataFrame, SparkSession}
import scala.collection.JavaConverters._

def buildDataframe(spark: SparkSession, filePath: String, requiresHeader: Boolean): DataFrame = {

  // Strip the "s3://" scheme prefix, then split the rest into bucket and key
  val s3FilePath = filePath.substring(filePath.indexOf("//") + 2)
  val s3Bucket = s3FilePath.substring(0, s3FilePath.indexOf('/'))
  val s3Key = s3FilePath.substring(s3FilePath.indexOf('/') + 1)

  // Open the workbook from S3 only to discover the sheet names
  val s3Client: AmazonS3 = new AmazonS3Client()
  val s3Object: S3Object = s3Client.getObject(new GetObjectRequest(s3Bucket, s3Key))
  val inputStream: java.io.InputStream = s3Object.getObjectContent()
  val workbook = new XSSFWorkbook(inputStream)

  // Read each sheet with spark-excel and union the per-sheet DataFrames
  val sheetDFs = workbook.asScala.map { sheet =>
    spark.read.format("com.crealytics.spark.excel")
      .option("useHeader", requiresHeader)
      .option("treatEmptyValuesAsNulls", "true")
      .option("inferSchema", "true")
      .option("addColorColumns", "false")
      .option("sheetName", sheet.getSheetName)
      .load(filePath)
  }
  sheetDFs.reduce(_ union _)
}
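One thing worth noting about the loop above: each spark-excel .load() re-reads the entire 100 MB file, and by default each sheet is materialized fully in memory. The crealytics/spark-excel library documents a maxRowsInMemory option that switches it to a streaming reader, keeping only a window of rows in the heap at a time. A minimal sketch, assuming a spark-excel version that supports this option (the sheet name and path below are placeholders):

// Sketch of spark-excel's streaming mode; maxRowsInMemory is documented by
// crealytics/spark-excel, and the sheet name and path are placeholders.
val streamedDF = spark.read.format("com.crealytics.spark.excel")
  .option("useHeader", "true")
  .option("sheetName", "Sheet1")          // placeholder sheet name
  .option("maxRowsInMemory", "1000")      // keep at most ~1000 rows in heap while reading
  .load("s3://my-bucket/my-file.xlsx")    // placeholder path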

Any help is much appreciated.
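
A note on the workbook load itself: new XSSFWorkbook(inputStream) parses the whole 100 MB file into driver memory just to enumerate sheet names, which can trigger the OOM on its own. POI's event-model XSSFReader can list sheet names without materializing the sheet contents. A minimal sketch, assuming the same inputStream as above:

import org.apache.poi.openxml4j.opc.OPCPackage
import org.apache.poi.xssf.eventusermodel.XSSFReader

// Enumerate sheet names via POI's streaming reader instead of loading the
// whole workbook; getSheetsData returns an iterator of per-sheet streams.
val pkg = OPCPackage.open(inputStream)
val sheets = new XSSFReader(pkg).getSheetsData.asInstanceOf[XSSFReader.SheetIterator]
val sheetNames = scala.collection.mutable.ArrayBuffer.empty[String]
while (sheets.hasNext) {
  sheets.next().close()               // skip the sheet XML itself; only the name is needed
  sheetNames += sheets.getSheetName   // the name is available after next()
}
pkg.close()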