AWS released its new IDE, EMR Studio, designed to help data scientists and data engineers develop, visualize and debug applications written in R, Python, Scala and PySpark.
The IDE was first previewed at AWS re:Invent 2020. Since then, new features have been added, such as the ability to use the Amazon EMR console and AWS CloudFormation to create and configure a new EMR Studio for teams.
To help with debugging, the IDE provides fully managed Jupyter notebooks and tools like Spark UI (which can now be launched directly from an EMR Studio notebook) and YARN Timeline Service.
The IDE is also suitable for developers who want to install custom kernels and libraries and run parameterized notebooks as part of scheduled workflows using orchestration services.
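As a rough illustration of the parameterized notebook runs mentioned above, the sketch below assembles the request an orchestration task might pass to the Amazon EMR `StartNotebookExecution` API (exposed in boto3 as `start_notebook_execution`). The helper name, the Workspace ID, cluster ID, role name, notebook path and parameter names are all hypothetical placeholders, and the actual API call is left commented out because it needs AWS credentials.

```python
import json


def build_notebook_execution_request(workspace_id, notebook_path, cluster_id,
                                     service_role, params, run_label):
    """Assemble keyword arguments for one parameterized notebook run.

    Hypothetical helper: the field names follow the EMR StartNotebookExecution
    API, but every value passed in below is a placeholder.
    """
    return {
        "EditorId": workspace_id,            # the EMR Studio Workspace ID
        "RelativePath": notebook_path,       # notebook file inside the Workspace
        "NotebookExecutionName": f"scheduled-{run_label}",
        # Parameters are sent as a JSON string and override notebook defaults.
        "NotebookParams": json.dumps(params, sort_keys=True),
        "ExecutionEngine": {"Id": cluster_id, "Type": "EMR"},
        "ServiceRole": service_role,
    }


request = build_notebook_execution_request(
    workspace_id="e-EXAMPLEWORKSPACEID",     # placeholder
    notebook_path="reports/daily_summary.ipynb",
    cluster_id="j-EXAMPLECLUSTERID",         # placeholder
    service_role="EMR_Notebooks_DefaultRole",
    params={"run_date": "2021-04-01", "region": "eu"},
    run_label="2021-04-01",
)
print(json.dumps(request, indent=2))

# With credentials configured, a scheduler task could then call (assumption):
#   import boto3
#   boto3.client("emr").start_notebook_execution(**request)
```

Keeping the request construction separate from the API call makes it easy to check the parameters in a scheduler's dry run before anything is submitted.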
Developers can set up the IDE to run on existing EMR clusters, or create new clusters using AWS CloudFormation templates or the AWS CLI for Amazon EMR.
The guided steps on the Amazon EMR console help with setting up security features and access controls, which can then be used to assign users or groups to the IDE, according to Shuang Li, a senior product manager for Amazon EMR. Li's blog post contains additional details on all of the IDE’s features.
Microsoft Active Directory (AD) is now also supported as an identity source that can be used with EMR Studio via AWS SSO.
The new CLI also helps administrators, who can create cluster templates and specify the parameters available to anyone using those templates.
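To make the idea of a parameterized cluster template more concrete, here is a minimal sketch that builds an AWS CloudFormation template for an `AWS::EMR::Cluster` resource as a Python dictionary. The helper name, the release label, the role names, the default values and the allowed instance types are assumptions chosen for illustration, not settings taken from the announcement, and how the template is registered with EMR Studio is not covered here.

```python
import json


def build_cluster_template(allowed_instance_types, default_instance_type):
    """Return a CloudFormation template (as a dict) for a small EMR cluster.

    Hypothetical helper: administrators fix most settings and expose only a
    few parameters, each restricted to values they consider acceptable.
    """
    if default_instance_type not in allowed_instance_types:
        raise ValueError("default instance type must be one of the allowed types")
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Example EMR cluster template with restricted parameters",
        "Parameters": {
            "ClusterName": {"Type": "String", "Default": "emr-studio-cluster"},
            "InstanceType": {
                "Type": "String",
                "Default": default_instance_type,
                "AllowedValues": list(allowed_instance_types),
            },
            "CoreInstanceCount": {
                "Type": "Number", "Default": 2, "MinValue": 1, "MaxValue": 10,
            },
        },
        "Resources": {
            "EmrCluster": {
                "Type": "AWS::EMR::Cluster",
                "Properties": {
                    "Name": {"Ref": "ClusterName"},
                    "ReleaseLabel": "emr-6.2.0",           # assumed release
                    "Applications": [{"Name": "Spark"}, {"Name": "Hive"}],
                    "JobFlowRole": "EMR_EC2_DefaultRole",  # assumed default roles
                    "ServiceRole": "EMR_DefaultRole",
                    "Instances": {
                        "MasterInstanceGroup": {
                            "InstanceCount": 1,
                            "InstanceType": {"Ref": "InstanceType"},
                        },
                        "CoreInstanceGroup": {
                            "InstanceCount": {"Ref": "CoreInstanceCount"},
                            "InstanceType": {"Ref": "InstanceType"},
                        },
                    },
                },
            }
        },
        "Outputs": {"ClusterId": {"Value": {"Ref": "EmrCluster"}}},
    }


template = build_cluster_template(["m5.xlarge", "m5.2xlarge"], "m5.xlarge")
print(json.dumps(template, indent=2))
```

The `AllowedValues`, `MinValue` and `MaxValue` constraints are what let an administrator limit the choices offered to template users while leaving the rest of the cluster definition fixed.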
New sample notebooks make it easier to start building data science applications in EMR Studio. The samples include, for example, PySpark code for querying a Hive metastore and Python code for visualization.
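The sample notebooks themselves are not reproduced here, but a query against a Hive metastore from PySpark might look roughly like the sketch below. It assumes a Hive-enabled `SparkSession` is available as `spark`, as is typical in a PySpark notebook kernel; the function names and the `default.trips` table are hypothetical.

```python
def build_preview_query(database, table, limit=10):
    """Build a simple SELECT for a table registered in the Hive metastore."""
    for name in (database, table):
        # Conservative check so arbitrary text cannot be spliced into the SQL.
        if not name.isidentifier():
            raise ValueError(f"unexpected identifier: {name!r}")
    if limit < 1:
        raise ValueError("limit must be at least 1")
    return f"SELECT * FROM `{database}`.`{table}` LIMIT {int(limit)}"


def preview_table(spark, database, table, limit=10):
    """Run the preview query through a Hive-enabled SparkSession.

    `spark` is assumed to be supplied by the notebook kernel.
    """
    return spark.sql(build_preview_query(database, table, limit))


print(build_preview_query("default", "trips", 5))
```

In a notebook, `preview_table(spark, "default", "trips")` would return a Spark DataFrame; calling `.toPandas()` on a small result is one common way to hand the rows to a Python plotting library for visualization.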
Notebooks in EMR Studio can then be connected to GitHub, Bitbucket, GitLab, and AWS CodeCommit repositories on private networks.