
I use Apache Hive 2.1.1-cdh6.2.1 (the Cloudera distribution) with MR as the execution engine and YARN's ResourceManager using the Capacity Scheduler.

I'd like to try Spark as an execution engine for Hive. While going through the docs, I found a strange limitation:

> Instead of the capacity scheduler, the fair scheduler is required. This fairly distributes an equal share of resources for jobs in the YARN cluster.

Since I already have all the queues set up properly, that requirement is very undesirable for me.

Is it possible to run Hive on Spark with the YARN capacity scheduler? If not, why not?
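
For reference, this is roughly how our queries are routed to a Capacity Scheduler queue today (a minimal sketch; the queue and table names are just placeholders):

-- current setup: MR engine, job sent to a specific capacity queue (names are hypothetical)
set hive.execution.engine=mr;
set mapreduce.job.queuename=analytics;
select count(*) from some_table;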

GoodDok

2 Answers


I'm not sure you can execute Hive using the Spark engine. I highly recommend configuring Hive to use Tez (https://cwiki.apache.org/confluence/display/Hive/Hive+on+Tez), which is faster than MR and fairly similar to Spark, since it also uses a DAG for task execution.
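
If Tez is an option, switching a session over is just a couple of set statements. A minimal sketch, assuming the Tez libraries are already deployed on the cluster (queue and table names are placeholders):

-- switch the session to Tez and keep routing it to a specific YARN queue (names are hypothetical)
set hive.execution.engine=tez;
set tez.queue.name=analytics;
select count(*) from some_table;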

Kenry Sanchez
  • Thanks for the suggestion; here's why I prefer to consider Hive on Spark, even with a possible move from the capacity to the fair scheduler: [tez commit activity](https://github.com/apache/tez/graphs/commit-activity) vs [spark commit activity](https://github.com/apache/spark/graphs/commit-activity). From my perspective, Tez is not the best choice to introduce into the technology stack right now. – GoodDok Jun 30 '20 at 14:36
  • Thanks for your clarification!! Yeah, Spark has a bigger community than Tez. – Kenry Sanchez Jun 30 '20 at 15:45

We are running it at work through Beeline, as described in https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started, simply by writing the following at the beginning of the SQL file to run:

set hive.execution.engine=spark;
select ... from table....

We are not using the capacity scheduler because hundreds of jobs run per YARN queue, and when jobs are resource-hungry we have other queues to let them run in. That also lets us design the per-queue configuration more realistically, based on the actual needs of each group of jobs.

Hope this helps

Oscar Lopez M.
  • How would you run Hive scripts this way? It seems like you actually execute a Spark job without any interaction with the Hive engine. Also, I didn't find a `spark.hive.execution.engine` setting in the docs (neither Spark's nor Hive's). – GoodDok Jun 30 '20 at 14:30
  • Sorry @GoodDok. The reason I suggested the other approach was that I thought you wanted to use Hive parameters from Spark, as described in https://stackoverflow.com/a/61930270/13231481, but I can see that you are running it from a client like Beeline. Am I correct? Please let me know if I'm wrong – Oscar Lopez M. Jul 01 '20 at 17:20
  • yep, exactly, from beeline – GoodDok Jul 01 '20 at 21:04
  • Ok, so according to https://stackoverflow.com/questions/36167378/hadoop-capacity-scheduler-and-spark you can point a Spark application at a YARN queue that has the capacity scheduler configured. So you can run Beeline, add `set hive.execution.engine=spark;` to the query, and have Beeline run on that particular YARN queue (rough sketch below). But please be aware of https://issues.apache.org/jira/browse/HIVE-12611 – Oscar Lopez M. Jul 04 '20 at 23:21
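
A minimal sketch of what that could look like at the top of the SQL file submitted through Beeline (queue and table names are placeholders, and I haven't verified this against a Capacity Scheduler cluster myself):

-- sketch only: run the query with Spark as the engine on a specific YARN queue (names are hypothetical)
set hive.execution.engine=spark;
set spark.yarn.queue=root.analytics;
select count(*) from some_table;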