Prologue

What motivated MakinaRocks to create Link?

MakinaRocks uses industrial AI to solve pressing problems in the manufacturing and energy industries.

Our approach involves close cooperation among committed data scientists. Link was born of our continued efforts to provide the best R&D environment for our data scientists. We then realized that data scientists everywhere could use Link's advanced features to improve their own work processes.

How can we improve our work process and increase productivity?

To answer this question, we began examining the work habits of data scientists. Eventually, we focused on the usability of JupyterLab, the most widely used development tool for data science. During the initial exploratory data analysis stage, data scientists can easily check intermediate results with JupyterLab by executing only part of the Python code in a Jupyter notebook. However, this same flexibility creates limitations when JupyterLab is used as the main development tool for a machine learning project. First, a Jupyter notebook does not guarantee reproducible results: the results change if the cells are executed in a different order. Second, collaboration centered around a Jupyter notebook is extremely challenging because the dependency structure among the code cells is difficult to figure out.
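The first limitation can be illustrated with a minimal sketch. It is not taken from Link or JupyterLab: the cell names and contents are hypothetical, and a plain dictionary stands in for the kernel state that a notebook keeps between cell runs.

```python
# Each "cell" is a code snippet executed against one shared namespace,
# which mimics how a notebook kernel keeps state between cell runs.
# The cell names and contents are hypothetical.
cells = {
    "load": "values = [1, 2, 3]",
    "scale": "values = [v * 10 for v in values]",
    "total": "total = sum(values)",
}


def run_cells(order):
    """Execute the named cells in the given order and return the final total."""
    namespace = {}
    for name in order:
        exec(cells[name], namespace)
    return namespace["total"]


in_order = run_cells(["load", "scale", "total"])             # 60
out_of_order = run_cells(["load", "total", "scale"])         # 6
rerun_twice = run_cells(["load", "scale", "scale", "total"])  # 600
```

The same three cells produce three different totals depending only on the order and number of executions, which is why a notebook alone cannot tell a collaborator how a result was obtained.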

We noted several common adaptations that data scientists use to work around these limitations. First, JupyterLab users write a title or description in a Markdown cell above each code cell to state what the cell does and make the notebook more readable. Next, they reorder the cells so that running them sequentially produces the desired results. Finally, they agree on the convention that the code cells must always be run sequentially from top to bottom.

Although this ‘linearization’ approach works around some of the limitations, it falls short in two ways. First, it fails to represent the tree-like dependency structure of the code cells. For example, in constructing a training pipeline, loading data and defining the model belong on different branches of the dependency tree, but a linear notebook hides that separation, so linearization does little to help collaboration among data scientists. Second, the execution of a cell still depends on the fragile ‘run-all-cells’ convention, which can easily go wrong. As a result, JupyterLab users face repeated kernel restarts and reruns, wasting a significant amount of time and effort.
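To make the dependency structure concrete, here is a small sketch of a training pipeline whose data loading and model definition sit on separate branches. It only illustrates the idea and does not describe how Link is implemented: the cell names are hypothetical, and the standard-library `graphlib` module (Python 3.9+) is used to order the cells.

```python
from graphlib import TopologicalSorter

# Hypothetical cell names; each cell maps to the cells it depends on.
dependencies = {
    "load_data": set(),
    "preprocess": {"load_data"},
    "define_model": set(),  # independent of the data-loading branch
    "train": {"preprocess", "define_model"},
    "plot_data": {"load_data"},
}


def cells_required_for(target):
    """Return the target cell and its ancestors in a valid execution order."""
    required = set()
    stack = [target]
    while stack:
        cell = stack.pop()
        if cell not in required:
            required.add(cell)
            stack.extend(dependencies[cell])
    # Sort only the required cells so that dependencies come first.
    subgraph = {cell: dependencies[cell] for cell in required}
    return list(TopologicalSorter(subgraph).static_order())


train_plan = cells_required_for("train")
plot_plan = cells_required_for("plot_data")
```

With the dependencies stated explicitly, running `train` needs exactly four cells and never touches `plot_data`, while `plot_data` needs only `load_data`. A top-to-bottom convention cannot express either fact, so it reruns everything.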

We explored several possibilities to alleviate JupyterLab's limitations without sacrificing its usability, flexibility and approachability. Link is our answer.