Rachel Silver

Python Environments for PySpark, Part 1: Using Condas

6/29/2017

Are you a data scientist, engineer, or researcher who is just getting into distributed processing and PySpark, and you want to run some of the Python libraries you've heard about, like Matplotlib?
If so, you may have noticed that it's not as simple as installing them on your local machine and submitting jobs to the cluster. For the Spark executors to access these libraries, the libraries have to live on each of the Spark worker nodes.
You could manually install each library on every node with pip, but maybe you also want the ability to use multiple versions of Python, or of libraries like pandas? Maybe you also want to let colleagues specify their own environments and combinations?
If this is the case, you should look toward conda environments ("condas") to provide specialized, personalized Python configurations that are accessible to your PySpark programs. Conda is a tool that manages conda packages (tarballs containing Python or other libraries) and the dependencies between those packages and the platform.
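To make the idea concrete, here is a minimal sketch of what this looks like from the PySpark side. It is an illustration rather than the post's exact recipe: it assumes a conda environment named py35_plot has already been created at the same path on every worker node (for example with conda create -n py35_plot python=3.5 matplotlib pandas), and the /opt/conda install path is hypothetical.

    # Minimal sketch: run PySpark against an existing conda environment.
    # Assumes "py35_plot" exists at the same (hypothetical) path on every worker node.
    import os
    from pyspark.sql import SparkSession

    # Must be set before the SparkContext is created; it tells Spark which
    # interpreter to launch for the Python workers on the executors.
    os.environ["PYSPARK_PYTHON"] = "/opt/conda/envs/py35_plot/bin/python"

    spark = SparkSession.builder.appName("conda-env-demo").getOrCreate()

    # Confirm that the executors can import libraries from the conda environment.
    def check_imports(_):
        import matplotlib
        import pandas
        return matplotlib.__version__, pandas.__version__

    print(spark.sparkContext.parallelize([0], 1).map(check_imports).collect())
    spark.stop()

If you launch jobs with spark-submit on Spark 2.1 or later, passing --conf spark.pyspark.python=/opt/conda/envs/py35_plot/bin/python should achieve the same thing without touching the environment variable in the script.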

    Author

    Rachel Silver is the Product Management Lead for Machine Learning & AI @ MapR Data Technologies. 
    This blog is unaffiliated with her employer and does not represent their views.
