
I'm new to Stack Overflow (a regular reader, but I want to participate now). I'm also new to Scala, Spark, and functional programming. Looking forward to contributing and learning on all fronts.

My question:

I am working with a file that has variable record layouts (multiple sections in one file) with fixed-position fields (aka fixed width, where the format is specified by column positions). For example, the myfile.txt layout (positions starting at 1) for the first section is: 1-5 = column 1, 5-6 = column 2, 6-20 = column 3, and 20-28 = column 4; whereas the sub-header-a2 to sub-footer-z2 section has an entirely different layout: 1-3 = column 1, 3-6 = column 2, and 6-11 = column 3.

myfile.txt example:

header
sub-header-a1
1234a Mr. John Doe 19770101
4321a Mrs. Jane Doe19770101
sub-footer-z1
sub-header-a2
1203400001
4302100001
sub-footer-z2
footer
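
To make the layouts concrete: assuming the positions are 1-based with an exclusive end (so 1-5 means the first four characters), slicing one line from each section would look like this in plain Scala (the slice helper is only an illustration, not existing code):

// hypothetical helper: cut a line at 1-based, end-exclusive positions
def slice(line: String, bounds: Seq[(Int, Int)]): Seq[String] =
  bounds.map { case (from, to) => line.substring(from - 1, to - 1) }

// first section layout: 1-5, 5-6, 6-20, 20-28
slice("1234a Mr. John Doe 19770101", Seq((1, 5), (5, 6), (6, 20), (20, 28)))
// => Seq("1234", "a", " Mr. John Doe ", "19770101")

// second section layout: 1-3, 3-6, 6-11
slice("1203400001", Seq((1, 3), (3, 6), (6, 11)))
// => Seq("12", "034", "00001")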

Using Spark/Scala, I want to select the sub-header-a1 to sub-footer-z1 section into one RDD and the other section into a second RDD for further processing (minus the sub-headers/footers). Two separate RDDs should be created from baseRDDinput.

First RDD

1234a Mr. John Doe 19770101
4321a Mrs. Jane Doe19770101

Second RDD

1203400001
4302100001

I have searched high and low for code examples of selecting a range from a base RDD and transforming it into another RDD. Found this, but I have an RDD of strings and I don't understand the RangePartitioner part. All the other file-reading examples I found use CSV and don't have nested sections.

Here's what I have so far:

// create a base RDD from the raw file; I assumed that I need an index
val baseRDDinput = sc.textFile("myfile.txt").zipWithIndex()

// get the start and end point of my range
val (start, end) = ("sub-header-a1", "sub-footer-z1")

// get the index of start and end point
??? 

// iterate over the index in order (the index is stable, based on the comments at https://stackoverflow.com/questions/26828815/how-to-get-element-by-index-in-spark-rdd-java), select the elements between the start and end index to create RDD-1, then do the same for the next section.
???

// next, based on the code examples at https://stackoverflow.com/questions/8299885/how-to-split-a-string-given-a-list-of-positions-in-scala, I will parse each element and make k/v pairs using the first column of the file as the key
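
For what it's worth, here is one possible way to fill in the ??? steps above, sticking with the zipWithIndex approach (just a sketch: the helper names indexOf and sectionBetween are mine, and it assumes each marker line appears exactly once in the file):

// find the index of a marker line (assumes the marker occurs exactly once)
def indexOf(marker: String): Long =
  baseRDDinput.filter { case (line, _) => line.trim == marker }.first()._2

// keep only the lines strictly between the two markers and drop the index again
def sectionBetween(startMarker: String, endMarker: String) = {
  val (startIdx, endIdx) = (indexOf(startMarker), indexOf(endMarker))
  baseRDDinput
    .filter { case (_, idx) => idx > startIdx && idx < endIdx }
    .map { case (line, _) => line }
}

val section1 = sectionBetween("sub-header-a1", "sub-footer-z1")
val section2 = sectionBetween("sub-header-a2", "sub-footer-z2")

// keying the first section by its first column (positions 1-5) could then look like
val section1ByKey = section1.map(line => (line.substring(0, 4), line))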

Any suggestions on approach and/or code would be greatly appreciated. I just need a nudge in the right direction. Thanks in advance.

UPDATE: fixed links

rburg
  • Hey @rburg, I know it has been a while now, but how did you actually manage to load such files? I'm having the same challenge today... Thanks! – Bruno Moreira Sep 27 '21 at 17:42

1 Answer


You are taking the wrong approach by storing the data this way; you should have 2 separate files instead of putting everything into a single file.

In this article (http://0x0fff.com/spark-hdfs-integration/) I have an example of using Hadoop InputFormats to read data from HDFS in Spark. You need to use org.apache.hadoop.mapreduce.lib.input.TextInputFormat; it returns a pair of values, the first being a Long representing the offset from the beginning of the file, and the second being the line from the file itself. This way you can find the lines with the "sub-header-a1", "sub-footer-z1", etc. values, remember their offsets, and filter on top of them to form 2 separate RDDs.
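
To give a rough idea of what that could look like (a sketch only; the marker values come from the question, and the offsets of the marker lines are collected to the driver):

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// read the file as (byte offset from the start of the file, line) pairs
val withOffsets = sc.newAPIHadoopFile(
    "myfile.txt",
    classOf[TextInputFormat],
    classOf[LongWritable],
    classOf[Text]
  ).map { case (offset, line) => (offset.get, line.toString) }

// remember the offsets of the sub-header/sub-footer lines on the driver
val markers = withOffsets
  .filter { case (_, line) => line.startsWith("sub-header") || line.startsWith("sub-footer") }
  .map { case (offset, line) => (line, offset) }
  .collectAsMap()

// filter on top of the offsets to form the 2 separate RDDs
val rdd1 = withOffsets
  .filter { case (offset, _) => offset > markers("sub-header-a1") && offset < markers("sub-footer-z1") }
  .map(_._2)

val rdd2 = withOffsets
  .filter { case (offset, _) => offset > markers("sub-header-a2") && offset < markers("sub-footer-z2") }
  .map(_._2)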

0x0FFF
  • Thanks @0x0FFF. I should have been more clear in my problem descriptions. I will update. The "RDD-1" and "RDD-2" do represent two separate RDD's created from baseRDDinput. I will review the article. Thanks. – rburg Aug 06 '15 at 13:10