This guide shows you how to set up your Python development environment, get the Apache Beam SDK for Python, and run an example pipeline.

If you’re interested in contributing to the Apache Beam Python codebase, see the Contribution Guide.

The Python SDK supports Python 3.6, 3.7, and 3.8. Beam 2.24.0 was the last release with support for Python 2.7 and 3.5.

Set up your environment

Check your Python version

The Beam SDK requires Python 3.6 or higher. Check your version by running:

    python --version
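The version requirement can also be checked from inside Python, for instance at the top of a setup script. The following is a minimal sketch; the helper name `python_is_supported` is made up for illustration and is not part of the Beam SDK.

```python
import sys

# Minimum version stated in this guide.
MINIMUM_PYTHON = (3, 6)


def python_is_supported(version_info=None, minimum=MINIMUM_PYTHON):
    """Return True if the given (or running) interpreter meets the minimum."""
    if version_info is None:
        version_info = sys.version_info
    # Compare only (major, minor); slicing works on sys.version_info and tuples.
    return tuple(version_info[:2]) >= tuple(minimum)


print("Python", sys.version.split()[0], "supported:", python_is_supported())
```

Comparing tuples rather than strings keeps the check numeric, so 3.10 is correctly treated as newer than 3.6.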

Install pip

Install pip, Python’s package manager. Check that you have version 7.0.0 or newer by running:

    pip --version

If you do not have pip version 7.0.0 or newer, run the following command to install it. This command might require administrative privileges.

    pip install --upgrade pip

Install Python virtual environment

We recommend that you install a Python virtual environment for initial experiments. If you do not have virtualenv version 13.1.0 or newer, run the following command to install it. This command might require administrative privileges.

    pip install --upgrade virtualenv

If you do not want to use a Python virtual environment (not recommended), ensure setuptools is installed on your machine. If you do not have setuptools version 17.1 or newer, run the following command to install it.

    pip install --upgrade setuptools
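The minimum versions mentioned in this section (pip 7.0.0, virtualenv 13.1.0, setuptools 17.1) can be checked together from Python. The sketch below assumes Python 3.8 or newer, where `importlib.metadata` is in the standard library; the helper names are illustrative only.

```python
import re
from importlib import metadata  # standard library in Python 3.8+

# Minimum versions quoted in this guide.
MINIMUMS = {"pip": "7.0.0", "virtualenv": "13.1.0", "setuptools": "17.1"}


def version_tuple(text):
    """Turn '20.1.1' into (20, 1, 1), keeping only the leading numeric parts."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Compare dotted versions numerically, padding the shorter one with zeros."""
    a, b = version_tuple(installed), version_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def report(minimums=MINIMUMS):
    """Print each tool's installed version and whether it is new enough."""
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            print(f"{name}: not installed (need {minimum} or newer)")
            continue
        status = "ok" if meets_minimum(installed, minimum) else "too old"
        print(f"{name}: {installed} ({status}, need {minimum} or newer)")


if __name__ == "__main__":
    report()
```

The comparison is numeric on purpose: compared as plain strings, "9.0.3" would sort after "13.1.0" and an outdated virtualenv would pass the check.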


Get Apache Beam


Create and activate a virtual environment

A virtual environment is a directory tree containing its own Python distribution. To create a virtual environment, create a directory and run:

    virtualenv /path/to/directory
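An environment can also be created programmatically. Note that the sketch below swaps in the standard-library `venv` module rather than the virtualenv tool this guide uses; the two produce similar directory layouts, but they are separate tools. The function name `create_environment` is hypothetical.

```python
import os
import venv  # standard-library alternative to the virtualenv tool


def create_environment(directory, with_pip=True):
    """Create a virtual environment in `directory` and return its config path."""
    # with_pip=True bootstraps pip into the new environment via ensurepip.
    venv.create(directory, with_pip=with_pip)
    # Every environment created by venv has a pyvenv.cfg file at its root.
    return os.path.join(directory, "pyvenv.cfg")
```

For example, `create_environment("/path/to/directory")` would populate that directory with an interpreter, a site-packages folder, and the activation scripts.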

A virtual environment needs to be activated for each shell that is to use it. Activating it sets some environment variables that point to the virtual environment’s directories.

To activate a virtual environment in Bash, run:

    . /path/to/directory/bin/activate

That is, execute the activate script under the virtual environment directory you created.
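To confirm that activation took effect, you can inspect the interpreter and the environment from Python. This is a small sketch with illustrative function names; it relies on the `VIRTUAL_ENV` variable that the activate script exports and on the `sys.prefix` and `sys.base_prefix` attributes.

```python
import os
import sys


def in_virtual_environment():
    """Return True if this interpreter is running inside a virtual environment."""
    # venv and virtualenv 20+ leave sys.base_prefix pointing at the original
    # installation; older virtualenv releases set sys.real_prefix instead.
    base = getattr(sys, "base_prefix", sys.prefix)
    return hasattr(sys, "real_prefix") or base != sys.prefix


def active_environment_path():
    """Return the path exported by the activate script, or None if unset."""
    return os.environ.get("VIRTUAL_ENV")


print("inside a virtual environment:", in_virtual_environment())
print("VIRTUAL_ENV:", active_environment_path())
```

The two checks can disagree: running the environment's python binary directly, without activating, makes the first check true while `VIRTUAL_ENV` stays unset.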

For instructions using other shells, see the virtualenv documentation.

Download and install

Install the latest Python SDK from PyPI:

    pip install apache-beam

Extra requirements

The command above does not install the extra dependencies needed for features such as the Google Cloud Dataflow runner. The extra packages required for each feature are listed below. To install several extras at once, separate them with commas, for example pip install apache-beam[feature1,feature2].

  • Google Cloud Platform
    • Installation Command: pip install apache-beam[gcp]
    • Required for:
      • Google Cloud Dataflow Runner
      • GCS IO
      • Datastore IO
      • BigQuery IO
  • Amazon Web Services
    • Installation Command: pip install apache-beam[aws]
    • Required for I/O connectors interfacing with AWS
  • Tests
    • Installation Command: pip install apache-beam[test]
    • Required for developing on Beam and running unit tests
  • Docs
    • Installation Command: pip install apache-beam[docs]
    • Required for generating API documentation using Sphinx
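When installation is scripted, the requirement string with extras can be assembled in code. The sketch below is one way to do it, not an official Beam utility; `requirement` and `pip_install_command` are hypothetical helper names, and the extras shown are the ones listed above.

```python
import sys


def requirement(package="apache-beam", extras=()):
    """Build a pip requirement string such as 'apache-beam[gcp,aws]'."""
    if not extras:
        return package
    return f"{package}[{','.join(extras)}]"


def pip_install_command(package="apache-beam", extras=()):
    """Return the argument list for installing the package with this interpreter."""
    # Using `sys.executable -m pip` targets the environment that runs this code.
    return [sys.executable, "-m", "pip", "install", requirement(package, extras)]


print(" ".join(pip_install_command(extras=("gcp", "aws"))))
```

Passing the returned list to `subprocess.check_call` runs the installation without a shell, which avoids quoting problems in shells such as zsh that treat square brackets as glob characters.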

Execute a pipeline

The Apache Beam examples directory has many examples. All examples can be run locally by passing the required arguments described in the example script.

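For example, run wordcount.py with the following command:

    python -m apache_beam.examples.wordcount --input /path/to/inputfile --output /path/to/write/counts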
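The WordCount example can also be launched from a Python script, which is convenient in automated checks. The sketch below assumes the SDK is installed in the current environment and that the example is invoked as the module `apache_beam.examples.wordcount` with `--input` and `--output` arguments; the function names and any paths you pass in are placeholders.

```python
import subprocess
import sys


def wordcount_command(input_path, output_prefix):
    """Return the argument list that runs the bundled WordCount example."""
    return [
        sys.executable, "-m", "apache_beam.examples.wordcount",
        "--input", input_path,
        "--output", output_prefix,
    ]


def run_wordcount(input_path, output_prefix):
    """Run the example and raise CalledProcessError if it exits with an error."""
    subprocess.run(wordcount_command(input_path, output_prefix), check=True)
```

Calling `run_wordcount("/path/to/inputfile", "/path/to/write/counts")` blocks until the pipeline finishes on the local runner.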

After the pipeline completes, you can view the output files at your specified output path. For example, if you specify /dir1/counts for the --output parameter, the pipeline writes the files to /dir1/ and names the files sequentially in the format counts-0000-of-0001.
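Because the output is split into numbered files, reading the results usually means collecting every file that shares the output prefix. A minimal sketch, assuming the prefix-index-of-count naming described above; `output_shards` and `read_counts` are illustrative names.

```python
import glob


def output_shards(output_prefix):
    """Return the sorted shard files written for the given --output prefix."""
    # Shards are named <prefix>-<shard index>-of-<shard count>.
    return sorted(glob.glob(glob.escape(output_prefix) + "-*-of-*"))


def read_counts(output_prefix):
    """Yield every line from every shard, in shard order."""
    for path in output_shards(output_prefix):
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                yield line.rstrip("\n")
```

For instance, `list(read_counts("/dir1/counts"))` gathers the lines from all files under /dir1/ whose names begin with counts-.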

Next Steps

  • Learn more about the Beam SDK for Python and look through the Python SDK API reference.
  • Walk through these WordCount examples in the WordCount Example Walkthrough.
  • Take a self-paced tour through our Learning Resources.
  • Dive into some of our favorite Videos and Podcasts.
  • Join the Beam users@ mailing list.

Please don’t hesitate to reach out if you encounter any issues!

Last updated on 2021/07/20

