I have a Flask application where I use the flask_executor library to run code in the background while returning an immediate response to the client. One of the endpoints in my application encounters a RuntimeError with the following message:

RuntimeError: cannot schedule new futures after interpreter shutdown.

Here are the key details of my situation:

The application is built on Flask, and the endpoint in question uses the flask_executor library to run code asynchronously. The error is raised from a submit call on a thread pool during the background execution: it is triggered when PyArrow converts the DataFrame while writing a Parquet file inside the background thread. The file is written with pandas' to_parquet, which uses PyArrow as its engine. The relevant code snippet is as follows:

from flask import Flask
from flask_executor import Executor

app = Flask(__name__)
executor = Executor(app)

@app.route('/my_endpoint', methods=['POST'])
def my_endpoint():
    # ... other code ...

    # Run the job asynchronously via flask_executor and return immediately
    future = executor.submit(my_background_function, args)

    # ... other code ...

    return 'Immediate response to the client'

def my_background_function(args):
    # accounts is derived from args elsewhere in the real code
    for account in accounts:
        my_background_save_function(account)

    # ... more background code ...

def my_background_save_function(args):
    # ... code running in the background ...

    # Writing `data` (a pandas DataFrame built earlier in the function)
    # to a Parquet file; pandas uses PyArrow as the engine here
    data.to_parquet(parquet_file)

    # ... more background code ...

I have read that this error can be related to how PyArrow's internal thread pool interacts with asynchronous execution and interpreter shutdown. I have tried various solutions, but I am still encountering the RuntimeError.
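For what it's worth, here is a stripped-down sketch (illustrative names and paths only, not my real code) of the pattern that, as I understand it, can raise the same RuntimeError: a background thread is still converting a DataFrame when the main thread exits and interpreter shutdown begins:

import threading

import pandas as pd

def write_in_background():
    # Large enough that the PyArrow conversion is still running when the
    # interpreter starts shutting down
    df = pd.DataFrame({"a": range(1_000_000), "b": range(1_000_000)})
    # pandas delegates to PyArrow, which fans the per-column conversion out
    # to an internal concurrent.futures thread pool
    df.to_parquet("/tmp/out.parquet")

# Daemon threads are not joined at exit; if this one reaches PyArrow's
# submit() call after shutdown has begun, concurrent.futures raises
# "cannot schedule new futures after interpreter shutdown"
threading.Thread(target=write_in_background, daemon=True).start()
# The main thread returns immediately, much like the Flask endpoint does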

Error trace:

Traceback (most recent call last):
  ...
    data.to_parquet(parquet_buffer)
  File "/usr/local/lib/python3.9/site-packages/pandas/core/frame.py", line 2889, in to_parquet
    return to_parquet(
  File "/usr/local/lib/python3.9/site-packages/pandas/io/parquet.py", line 411, in to_parquet
    impl.write(
  File "/usr/local/lib/python3.9/site-packages/pandas/io/parquet.py", line 159, in write
    table = self.api.Table.from_pandas(df, **from_pandas_kwargs)
  File "pyarrow/table.pxi", line 3681, in pyarrow.lib.Table.from_pandas
  File "/usr/local/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 620, in dataframe_to_arrays
    arrays.append(executor.submit(convert_column, c, f))
  File "/usr/local/lib/python3.9/concurrent/futures/thread.py", line 169, in submit
RuntimeError: cannot schedule new futures after interpreter shutdown
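Reading the trace, the failure happens inside pyarrow's pandas_compat.dataframe_to_arrays, which appears to fan the per-column conversion out to a concurrent.futures thread pool, roughly like this (my paraphrase of that code path, not the exact PyArrow source):

from concurrent.futures import ThreadPoolExecutor

def dataframe_to_arrays_sketch(columns, convert_column, nthreads):
    # Serial path: no futures are created, so the shutdown check inside
    # ThreadPoolExecutor.submit() can never be hit
    if nthreads == 1:
        return [convert_column(c, f) for c, f in columns]
    # Parallel path: each column conversion goes through executor.submit(),
    # which is the call that raises once interpreter shutdown has begun
    with ThreadPoolExecutor(nthreads) as executor:
        futures = [executor.submit(convert_column, c, f) for c, f in columns]
        return [fut.result() for fut in futures]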

Is there a recommended approach to avoid the "cannot schedule new futures after interpreter shutdown" error when using Flask, flask_executor, and PyArrow's Parquet file writing together? How can I ensure the smooth execution of the Parquet writing process in the background while maintaining an immediate response to the client?

Any insights, suggestions, or alternative solutions would be greatly appreciated.

Thank you.

TheTwo
  • Hmm, pyarrow typically manages its own thread pool in C++. It doesn't normally interact with Python threading or futures, so I wouldn't expect this to be an artifact of pyarrow's own parallelism. Is it possible to use `pdb` to get a traceback of the exception? https://stackoverflow.com/questions/18960242/is-it-possible-to-automatically-break-into-the-debugger-when-a-exception-is-thro – Pace Jun 12 '23 at 17:59
  • Thank you for your response. Unfortunately, I am unable to use the pdb library you mentioned. However, I can add the error trace. – TheTwo Jun 12 '23 at 19:43
  • It looks like the conversion from pandas to Arrow is failing. You could try to disable multithreading in pyarrow, but as @Pace mentioned it's not likely to be the problem. What happens if you call `table = pyarrow.Table.from_pandas(data, nthreads=1)`? (See the sketch after these comments.) – 0x26res Jun 13 '23 at 07:59
  • @0x26res, thanks for trying; unfortunately it didn't make any difference. – TheTwo Jun 14 '23 at 08:11
  • Do you get the same error? The traceback should be different, as it should use a different code path (https://github.com/apache/arrow/blob/4653918cf23067e540e05e71799e8004fab8c7a2/python/pyarrow/pandas_compat.py#L611). @Pace, it looks like pandas_compat.py manages its own thread pool (in Python, not C++): https://github.com/apache/arrow/blob/main/python/pyarrow/pandas_compat.py#L615 – 0x26res Jun 14 '23 at 08:19
  • Yes, I see that now :(. But it does seem like it should be disabled if `nthreads=1`. @TheTwo, can you share what your `my_background_save_function` looks like with the suggestion from @0x26res? – Pace Jun 14 '23 at 16:50
  • Hi, a little update. I changed the saving engine in pandas to `fastparquet`, like this: `data.to_parquet(parquet_buffer, engine="fastparquet")`, and that solved the problem. Thanks everyone for the help! – TheTwo Jun 18 '23 at 08:19
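For reference, a minimal sketch of the two workarounds discussed in these comments, with a stand-in DataFrame and output path (the real data and parquet_buffer come from the application):

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

data = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})  # stand-in frame

# Workaround 1 (0x26res's suggestion, which did not help in this case):
# convert to Arrow single-threaded so PyArrow's internal thread pool is
# never used, then write the table directly
table = pa.Table.from_pandas(data, nthreads=1)
pq.write_table(table, "out.parquet")

# Workaround 2 (what ultimately solved it): switch the pandas engine to
# fastparquet, bypassing PyArrow's conversion code path entirely
data.to_parquet("out.parquet", engine="fastparquet")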

0 Answers