
python – Taking advantage of fork system call to avoid read/writing or serializing altogether?

Posted by: admin February 24, 2020

Questions:

I am working on a Mac, so multiprocessing starts child processes with the fork system call rather than by spawning a fresh interpreter. I am using Python, with either multiprocessing or Dask.

I have a very big pandas dataframe, and many parallel subprocesses each need to work on a portion of it. Say I have 100 partitions of this table that need to be processed in parallel. I want to avoid making 100 copies of the big dataframe, since that would overwhelm memory. My current approach is to partition it, save each partition to disk, and have each process read in the portion it is responsible for. But this read/write is very expensive for me, and I would like to avoid it.

But if I make this dataframe one global variable, then thanks to copy-on-write (COW) behavior, each process will be able to read from it without making an actual physical copy (as long as it does not modify it). Now the question I have is: if I make this one global dataframe and name it:

global my_global_df
my_global_df = one_big_df

and then in one of the subprocesses I do:

a_portion_of_global_df_readonly = my_global_df.iloc[0:10]
a_portion_of_global_df_copied = a_portion_of_global_df_readonly.reset_index(drop=True)
# reset_index(drop=True) makes a copy of a_portion_of_global_df_readonly

# ... do something with a_portion_of_global_df_copied ...

If I do the above, will I have created a copy of the entire my_global_df or just a copy of a_portion_of_global_df_readonly, and thereby, by extension, avoided making 100 copies of one_big_df?
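A quick single-process check suggests the answer for the pandas side of the question: the slice-then-reset_index pattern copies only the selected rows, not the whole frame. (This is a sketch; the variable names and sizes below are illustrative, not from the thread.)

```python
import numpy as np
import pandas as pd

# Stand-in for the big dataframe: 1 million int64 rows, ~8 MB of data.
one_big_df = pd.DataFrame({"x": np.arange(1_000_000)})

a_portion_readonly = one_big_df.iloc[0:10]                    # small slice
a_portion_copied = a_portion_readonly.reset_index(drop=True)  # copies 10 rows

small = a_portion_copied.memory_usage(deep=True).sum()
big = one_big_df.memory_usage(deep=True).sum()
print(small < big / 1000)  # prints True: the copy is tiny relative to the original
```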

One additional, more general question: why do people have to deal with pickle serialization and/or reading/writing to disk to transfer data across processes, when (assuming a UNIX system) setting the data as a global variable effectively makes it available to all child processes so easily? Is there any danger in using COW as a means of making data available to subprocesses in general?

[Reproducible code from the thread below]

from multiprocessing import Pool
import contextlib

import numpy as np
import pandas as pd

def my_function(elem):
    return id(elem)

num_proc = 4
num_iter = 10
df = pd.DataFrame(np.asarray([1]))
print(id(df))

with contextlib.closing(Pool(processes=num_proc)) as p:
    procs = [p.apply_async(my_function, args=(df,)) for _ in range(num_iter)]
    results = [proc.get() for proc in procs]
    p.close()
    p.join()

print(results)
Answers:

Summarizing the comments: on a forking system such as macOS or Linux, a child process gets a copy-on-write (COW) view of the parent's address space, including any DataFrames it may hold. It is safe to use and even modify the dataframe in a child process without changing the data in the parent or in sibling child processes.
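A minimal sketch of that claim (assuming the fork start method is available; note that on macOS, Python 3.8+ defaults to spawn, so the fork context is requested explicitly here):

```python
import multiprocessing as mp
import pandas as pd

# Created in the parent before forking; the child inherits a COW view of it.
df = pd.DataFrame({"x": [1, 2, 3]})

def child():
    # Writing only copies pages inside the child's address space;
    # the parent's data is never touched.
    df.loc[0, "x"] = 999

if __name__ == "__main__":
    ctx = mp.get_context("fork")  # be explicit: macOS defaults to spawn on 3.8+
    p = ctx.Process(target=child)
    p.start()
    p.join()
    print(df.loc[0, "x"])  # prints 1: the parent's copy is unchanged
```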

That means it is unnecessary to serialize the dataframe in order to pass it to the child; all the child needs is a reference. For a Process, you can pass the reference directly:

p = multiprocessing.Process(target=worker_fctn, args=(my_dataframe,))
p.start()
p.join()

If you use a Queue or another tool such as a Pool, then the data will likely be serialized (pickled). You can get around that by using a global variable that the workers know about but that is never actually passed to them.

What remains is the return data. It exists only in the child and still needs to be serialized to be sent back to the parent.
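One illustration of why that matters (the exact byte counts depend on the pandas and pickle versions, so only the ratio is asserted): returning a whole partition from a worker pickles every row, while reducing to a compact summary inside the worker costs only a few dozen bytes on the way back.

```python
import pickle

import numpy as np
import pandas as pd

partition = pd.DataFrame({"x": np.arange(10_000)})

full_cost = len(pickle.dumps(partition))  # every row comes back to the parent
summary_cost = len(pickle.dumps(float(partition["x"].mean())))  # one scalar

print(full_cost > 100 * summary_cost)  # prints True
```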