I'm new to stackoverflow (regular reader, but I want to participate now). I'm also new to Scala and Spark and functional programming. Looking forward to contributing and learning on all fronts.
my question:
I am working with a variable record length (multiple sections in file) with fixed position fields (aka fixed width - where the format is specified by column widths). For example myfile.txt layout (starting at 1) is: 1-5 = column 1, 5-6 = column 2, 6-20 = column 3 and 20-28 = column 4; whereas sub-header-a2 to sub-footer-z2 has an entirely different layout 1-3 = column 1, 3-6 = column 2 and 6-11 = column 3
myfile.txt example:
header
sub-header-a1
1234a Mr. John Doe 19770101
4321a Mrs. Jane Doe19770101
sub-footer-z1
sub-header-a2
1203400001
4302100001
sub-footer-z2
footer
Using Spark/Scala I want to select sub-header-a1 to sub-footer-z1 section in one RDD and the other section into a second RDD for further processing (minus the sub-header/footer). Two separate RDD's should be created from baseRDDInput.
First RDD
1234a Mr. John Doe 19770101
4321a Mrs. Jane Doe19770101
Second RDD
1203400001
4302100001
I have searched high and low for code examples for selecting a range from a base RDD and transform into another RDD. Found this, but I have a StringRDD and I don't get the RangePartitioner part. All the other file reading examples I found are always csv and don't have nested sections.
Here's what I have so far:
// created a base RDD from raw file, I assumed that I need an index
val baseRDDinput = sc.textFile("myfile.txt") zipWithIndex ()
// get the start and end point of my range
val (start, end) = ("sub-header-a1", "sub-footer-z1")
// get the index of start and end point
???
// iterator over index in order (index is stable based on comments https://stackoverflow.com/questions/26828815/how-to-get-element-by-index-in-spark-rdd-java) and select elements between start and end index and create RDD-1 then do the same with next section.
???
// next based on code examples from (https://stackoverflow.com/questions/8299885/how-to-split-a-string-given-a-list-of-positions-in-scala) I will parse the element and make k/v using the first column of file as the key
Any suggestions on approach and/or code would greatly be appreciated. I just need a nudge in the right direction. Thanks in advance.
UPDATE: fixed links