
Use Docker For Google Cloud Data Flow Dependencies

I am interested in using Google Cloud Dataflow to process videos in parallel. My job uses both OpenCV and TensorFlow. Is it possible to just run the workers inside a Docker instance?

Solution 1:

2021 update

Dataflow now supports custom docker containers. You can create your own container by following these instructions:

https://cloud.google.com/dataflow/docs/guides/using-custom-containers

The short answer is that Beam publishes containers on Docker Hub under apache/beam_${language}_sdk:${version}.

In your Dockerfile, you would use one of them as the base image:

FROM apache/beam_python3.8_sdk:2.30.0
# Add your customizations and dependencies
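# For the OpenCV and TensorFlow dependencies mentioned in the question, a
# line like the following may be enough; the package names below are
# assumptions for illustration, not part of the original answer:
RUN pip install opencv-python-headless tensorflow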

Then you would push this image to a container registry such as Google Container Registry (GCR) or Docker Hub, and specify the following pipeline option: --worker_harness_container_image=$IMAGE_URI
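
As a minimal sketch, here is how the same option could be passed when launching a job from Python; the project, region, bucket, image URI, and pipeline body are placeholders I am assuming for illustration:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Pipeline options equivalent to passing the flags on the command line.
options = PipelineOptions(
    runner='DataflowRunner',
    project='my-project',
    region='us-central1',
    temp_location='gs://my-bucket/temp',
    # The custom image pushed to your registry; newer SDK versions expose
    # the same setting as --sdk_container_image.
    worker_harness_container_image='gcr.io/my-project/my-beam-image:latest',
)

# A trivial pipeline just to show where the options are used.
with beam.Pipeline(options=options) as p:
    (p
     | beam.Create(['gs://my-bucket/videos/a.mp4'])
     | beam.Map(print))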

And bingo! You have a custom container.


Original answer: it was not possible to modify or switch the default Dataflow worker container, so you had to install the dependencies according to the documentation.


Solution 2:

If you have a large number of videos, you will have to incur the large startup cost regardless; such is the nature of grid computing in general.

The other side of this is that you could run the job on larger machines than the default n1-standard-1 workers, amortizing the download cost across fewer machines that could potentially process more videos at once, provided the processing is coded to take advantage of them.
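
As a rough sketch, assuming the Python SDK's machine_type worker option (exposed on the command line as --machine_type / --worker_machine_type); the project, bucket, machine type, and worker count below are placeholders:

from apache_beam.options.pipeline_options import PipelineOptions

# Request larger workers so each machine downloads the dependencies once
# and can process several videos at a time (all values are placeholders).
options = PipelineOptions(
    runner='DataflowRunner',
    project='my-project',
    region='us-central1',
    temp_location='gs://my-bucket/temp',
    machine_type='n1-standard-8',  # larger than the default n1-standard-1
    max_num_workers=10,
)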


Solution 3:

One solution is to issue the pip install commands through the setup.py option described in the Beam documentation for non-Python dependencies.

Doing this downloads the manylinux wheels instead of the source distributions that requirements-file processing would otherwise stage.
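
For reference, a minimal setup.py sketch in the style of the pattern the Beam documentation uses for non-Python dependencies; the package names and project metadata are illustrative assumptions:

import subprocess
from distutils.command.build import build as _build
import setuptools

# pip install commands to run on each worker; binary wheels such as
# opencv-python-headless and tensorflow are pulled directly from PyPI.
CUSTOM_COMMANDS = [
    ['pip', 'install', 'opencv-python-headless'],
    ['pip', 'install', 'tensorflow'],
]

class CustomCommands(setuptools.Command):
    """Runs the pip install commands above during the build step."""
    user_options = []

    def initialize_options(self):
        pass

    def finalize_options(self):
        pass

    def run(self):
        for command in CUSTOM_COMMANDS:
            subprocess.check_call(command)

class build(_build):
    # Chain the custom commands into the normal build.
    sub_commands = _build.sub_commands + [('CustomCommands', None)]

setuptools.setup(
    name='video-processing-job',
    version='0.0.1',
    packages=setuptools.find_packages(),
    cmdclass={'build': build, 'CustomCommands': CustomCommands},
)

You would then pass --setup_file=./setup.py as a pipeline option so the file is staged and executed on the workers.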

