Spark DataFrame ORC Hive table reading issue

Question

I am trying to read a Hive table in Spark. Below is the Hive Table format:

# Storage Information       
SerDe Library:  org.apache.hadoop.hive.ql.io.orc.OrcSerde   
InputFormat:    org.apache.hadoop.hive.ql.io.orc.OrcInputFormat 
OutputFormat:   org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat    
Compressed: No  
Num Buckets:    -1  
Bucket Columns: []  
Sort Columns:   []  
Storage Desc Params:        
    field.delim \u0001
    serialization.format    \u0001

When I try to read it using Spark SQL with the command below:

val c = hiveContext.sql("""select  
        a
    from c_db.c cs 
    where dt >=  '2016-05-12' """)
c.show()

I get the following warning:

18/07/02 18:02:02 WARN ReaderImpl: Cannot find field for: a in _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12, _col13, _col14, _col15, _col16, _col17, _col18, _col19, _col20, _col21, _col22, _col23, _col24, _col25, _col26, _col27, _col28, _col29, _col30, _col31, _col32, _col33, _col34, _col35, _col36, _col37, _col38, _col39, _col40, _col41, _col42, _col43, _col44, _col45, _col46, _col47, _col48, _col49, _col50, _col51, _col52, _col53, _col54, _col55, _col56, _col57, _col58, _col59, _col60, _col61, _col62, _col63, _col64, _col65, _col66, _col67,

The read starts, but it is very slow and ends in a network timeout.

When I try to read the Hive table directory directly, I get the error below.

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.setConf("spark.sql.orc.filterPushdown", "true") 
val c = hiveContext.read.format("orc").load("/a/warehouse/c_db.db/c")
c.select("a").show()

org.apache.spark.sql.AnalysisException: cannot resolve 'a' given input columns: [_col18, _col3, _col8, _col66, _col45, _col42, _col31, _col17, _col52, _col58, _col50, _col26, _col63, _col12, _col27, _col23, _col6, _col28, _col54, _col48, _col33, _col56, _col22, _col35, _col44, _col67, _col15, _col32, _col9, _col11, _col41, _col20, _col2, _col25, _col24, _col64, _col40, _col34, _col61, _col49, _col14, _col13, _col19, _col43, _col65, _col29, _col10, _col7, _col21, _col39, _col46, _col4, _col5, _col62, _col0, _col30, _col47, trans_dt, _col57, _col16, _col36, _col38, _col59, _col1, _col37, _col55, _col51, _col60, _col53]; at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)

I could convert the Hive table to TextInputFormat, but that is my last resort, as I would like to keep the benefit of OrcInputFormat for reducing the table size.

I would really appreciate any suggestions.

Can you execute `show create table c_db.c` and provide us with the output? — Abdulhafeth Sartawi, Jul 03 '18 at 07:35

score 2 · Answer 1 · answered Mar 12 '19 at 09:45

I found a workaround by reading the table this way:

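// Take the column names and types from the Hive metastore ("db.name" means "database.tableName")
val schema = spark.table("db.name").schema

// Read the ORC files directly, applying the metastore schema instead of the _colN names stored in the files
val df = spark.read.schema(schema).orc("/path/to/table")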

K. Kostikov

what's "db.name" param supposed to be? – Vivek Sethi Mar 15 '19 at 12:47
I mean "database.tableName" – K. Kostikov Mar 17 '19 at 08:52

score 2 · Answer 2 · edited Oct 22 '20 at 10:36

The issue generally occurs with large tables, where the read fails on the maximum field length. I enabled the metastore ORC conversion (set spark.sql.hive.convertMetastoreOrc=true;) and it worked for me.
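As a rough illustration, the same setting can also be applied from Scala code before running the query. This is only a sketch: it assumes Spark 2.x or later with Hive support on the classpath, and the object name and application name are made up for the example. The query is the one from the question. On Spark 1.6 the equivalent call would be hiveContext.setConf, as the question already uses for another option.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical wrapper object, used only to make the sketch self-contained.
object ConvertMetastoreOrcExample {
  def main(args: Array[String]): Unit = {
    // Assumes a cluster where the Hive metastore is reachable from Spark.
    val spark = SparkSession.builder()
      .appName("orc-hive-read") // illustrative name
      .enableHiveSupport()
      .getOrCreate()

    // Session-level setting: read metastore ORC tables with Spark's own ORC reader
    // instead of the Hive SerDe.
    spark.conf.set("spark.sql.hive.convertMetastoreOrc", "true")

    // Query taken from the question.
    val c = spark.sql("select a from c_db.c cs where dt >= '2016-05-12'")
    c.show()

    spark.stop()
  }
}
```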

edited Oct 22 '20 at 10:36

logi-kal

answered Dec 24 '19 at 13:02

V.B

score 0 · Answer 3 · answered Jul 03 '18 at 06:17

I think the table doesn't have named columns or, if it does, Spark probably isn't able to read the names. You can use the default column names that Spark has assigned, as listed in the error, or set the column names in the Spark code: use printSchema to inspect the columns and toDF to rename them. You will need the mappings, though, which might require selecting and showing columns individually.
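The renaming idea can be sketched without typing out every mapping by hand, assuming that the placeholder name _col<i> in the files corresponds to the i-th column of the table definition. That positional assumption, and the helper name PositionalRename, are not from the thread and should be checked against the actual table before relying on it.

```scala
// Hypothetical helper: builds the new column names for a DataFrame read from Hive-written ORC files.
object PositionalRename {
  private val Placeholder = "_col(\\d+)".r

  // Replace each placeholder name `_col<i>` with the i-th name from the table definition.
  // Names that are not placeholders (for example a partition column), or whose index
  // is out of range, are kept unchanged.
  def renamed(fileCols: Seq[String], tableCols: Seq[String]): Seq[String] = {
    val names = tableCols.toIndexedSeq
    fileCols.map {
      case Placeholder(i) if i.toInt < names.length => names(i.toInt)
      case other => other
    }
  }
}
```

With a DataFrame df loaded from the ORC directory and tableCols taken from the metastore (for example the columns of the table as seen through Spark), the result could be applied as df.toDF(PositionalRename.renamed(df.columns, tableCols): _*).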

Vihit Shah

Thanks for replying. It's a huge table, around 3 TB, with 60+ columns. I am not sure I will be able to map all the columns individually. Surprisingly, when I describe the df it shows all the column names correctly. – Subhasis Jul 03 '18 at 07:34

score 0 · Answer 4 · answered Nov 17 '20 at 04:13

Setting the conf (set spark.sql.hive.convertMetastoreOrc=true;) works, but it is trying to modify the metadata of the Hive table. Can you please explain what it is going to modify and whether it affects the table? Thanks.

Sreenath Vemireddy
