Why is DVC Better Than Git and Git-LFS in Machine Learning Reproducibility

In recent years, machine learning has been a recurring theme at many AI conferences and in the popular press. Yet for a topic so widely discussed and hyped, surprisingly little attention goes to how results are actually produced under the hood. How can we be certain that a particular model will perform as expected, or even work at all? How can we be sure our models are reproducible?

What is Machine Learning Reproducibility?

Reproducibility means that the framework you used to produce your results is documented and made available to others with access to similar data and tools. Running the same algorithm on the same data with the same settings should then give the same (or very similar) results each time. Most machine learning orchestrations are end-to-end, which means they cover everything from data processing to model design, reporting, model analysis, and evaluation, all the way to successful deployment, so reproducibility has to cover every one of those steps.

Reproducibility benefits any continuous integration or continuous delivery cycle. It lets these operations run smoothly, so that in-house adjustments and client deployments become routine rather than a nightmare.

Reproducibility in machine learning is dependent on four key components of every model:

Data: In today’s environment, data is always changing, and when the data changes, so does the outcome. Adding new datasets, changing the data distribution, and changing the sample size all affect the model’s output. To preserve reproducibility, dataset versions must be tracked and thoroughly documented.

Recommended Reading: Importance of Version Control in ML

Code: To achieve reproducibility, you must track and record changes to code and algorithms throughout the experiment. Versioned code also lets you return to the exact revision that produced a result after the code has been updated.

Environment: The environment in which a project was produced must be captured to be reproduced. The environment includes library dependencies, versions, parameters and a lot more. These details must be properly tracked to ensure that the model goes into production without a problem.

Compute: Keep track of the physical hardware used to train the model, including the GPUs and server configurations.
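The four components above can be captured in a single record saved next to each trained model. The following is a minimal sketch using only the Python standard library; the function names (fingerprint_bytes, build_manifest), the manifest fields, and the sample inputs are assumptions made for illustration, not part of any tool discussed in this article.

```python
import hashlib
import json
import platform
import sys


def fingerprint_bytes(data: bytes) -> str:
    """Return a SHA-256 hex digest that changes whenever the data changes."""
    return hashlib.sha256(data).hexdigest()


def build_manifest(dataset: bytes, code: bytes, params: dict) -> dict:
    """Collect data, code, environment, and compute details for one run."""
    return {
        "data_sha256": fingerprint_bytes(dataset),
        "code_sha256": fingerprint_bytes(code),
        "params": params,
        "environment": {
            "python": sys.version.split()[0],
            "implementation": platform.python_implementation(),
        },
        "compute": {
            "machine": platform.machine(),
            "system": platform.system(),
        },
    }


# Hypothetical inputs: in a real project these would be read from files.
manifest = build_manifest(
    dataset=b"age,label\n31,1\n45,0\n",
    code=b"def train(): ...",
    params={"learning_rate": 0.01, "seed": 42},
)
print(json.dumps(manifest, indent=2, sort_keys=True))
```

Two runs whose manifests match on every field started from the same data, code, and settings. GPU details are not covered by the standard library and would need a vendor-specific tool.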

Now let’s see why reproducibility is important in our ML journey.

Why is Reproducibility Important in Machine Learning?

A reproducible machine learning application is also built to scale with your company’s growth. The attention paid to ensuring that the pipeline is properly architected and coded helps make its results reproducible.

In our daily lives, we are largely reliant on machine learning. When we rely on these models in production systems, we run into difficulty if we can’t rebuild or explain them. End users want these systems to be objective, dependable, and transparent, and it is hard to explain a model we can’t reproduce. As a result, the model’s veracity is questioned. In simple terms, reproducibility can help us with:

Faster Improvements: Keeping a record of every code change lets you recover prior versions and gives you time to work on bugs while the production code stays bug-free.

Complexity: Machine learning is more complicated than ordinary software because you have to keep track of datasets, models, and parameters as well as code. The more there is to track, the harder the work becomes, so a reproducible machine learning application eases your workload.

Continuous Integration: This is a method of integrating feature branches into the main code line, where they can be built and tested automatically. It helps developers identify problems and resolve them promptly. Continuous integration allows you to fail quickly and improve quickly, which enhances the quality of your process.

How to Achieve Machine Learning Reproducibility?

Machine learning reproducibility can be achieved with various techniques and tools; we’ll go through a few of them here.

Versioning

The process of organizing controls, recording changes in the model or data, and implementing policies for the model is known as versioning. A few versioning tools that machine learning engineers and data scientists use:

DVC: Data Version Control (DVC) creates metafiles that act as pointers to saved datasets and models, while the datasets and models themselves are stored on-premises or in the cloud. This allows for data and model versioning.

MLflow: The open-source tool MLflow is used to manage the machine learning lifecycle. Its different features make work easier for data scientists and developers, and it can be used with a variety of machine learning libraries and tools.

Recommended Reading: MLflow vs DVC

WandB: Weights & Biases (WandB) is a platform that helps you keep track of your experiments, manage your datasets and models. WandB makes tracking, comparing, and versioning machine learning and deep learning experiments easy.

Drift / Continuous training

Drift is a common issue in production models. A slight change in the environment or the data can affect the model’s results. There are two common ways to reduce drift in production models: constant monitoring and continuous training on new data. These are a few common tools used for monitoring ML systems.

Amazon SageMaker: Amazon SageMaker Model Monitor helps you maintain high-quality models by automatically detecting and alerting on inaccurate predictions from models deployed in production. SageMaker is suitable for enterprise-level businesses.

Evidently: Evidently helps in evaluating and monitoring machine learning models during validation and production. It provides a lot of visualization features to detect the issue easily.
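As a rough illustration of what monitoring for drift involves, the sketch below flags a numeric feature whose mean in production has moved away from its mean in the training data. The function names, the threshold of 3 standard errors, and the sample values are assumptions made for this example; monitoring tools such as those above apply a much wider range of statistical checks.

```python
import math
import statistics
from typing import Sequence


def mean_shift_score(reference: Sequence[float], current: Sequence[float]) -> float:
    """Distance between the two sample means, measured in standard errors.

    The standard error is estimated from the reference sample's standard
    deviation and the size of the current sample.
    """
    ref_std = statistics.stdev(reference)
    if ref_std == 0:
        raise ValueError("reference sample has no variance")
    standard_error = ref_std / math.sqrt(len(current))
    return abs(statistics.mean(current) - statistics.mean(reference)) / standard_error


def has_drifted(reference: Sequence[float], current: Sequence[float],
                threshold: float = 3.0) -> bool:
    """Flag drift when the mean moved more than `threshold` standard errors."""
    return mean_shift_score(reference, current) > threshold


# Hypothetical feature values seen at training time and in production.
training = [10.0, 11.0, 9.0, 10.5, 9.5, 10.0, 10.2, 9.8]
stable_batch = [10.1, 9.9, 10.3, 9.7]
shifted_batch = [12.0, 12.5, 11.8, 12.2]

print(has_drifted(training, stable_batch))   # False
print(has_drifted(training, shifted_batch))  # True
```

A flagged feature would typically trigger an alert or a retraining job, which is where continuous training comes in.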

Experiments tracking and logging

Model training is a continuous process that entails altering parameter values, evaluating each algorithm’s performance, and fine-tuning it to achieve the best results possible, among other things. Here are a few tools which can help you track your experiments.

DVC: DVC tracks the metrics associated with a project by storing them in plain-text files (such as JSON or YAML) that are versioned with the project, so metric values can be listed and compared across versions to track progress.

MLflow Tracking: The tracking feature in MLflow lets you automatically track and log parameters, metrics, and code versions for each model run and deployment.
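The idea behind experiment tracking can be sketched without any of these tools: record the parameters and metrics of every run, then query the records. The ExperimentLog class below is a hypothetical, in-memory stand-in written for illustration; it is not the DVC or MLflow API, and the run names and values are made up.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Run:
    name: str
    params: dict
    metrics: dict


@dataclass
class ExperimentLog:
    runs: list = field(default_factory=list)

    def log(self, name: str, params: dict, metrics: dict) -> Run:
        """Store a copy of the parameters and metrics for one training run."""
        run = Run(name, dict(params), dict(metrics))
        self.runs.append(run)
        return run

    def best(self, metric: str, higher_is_better: bool = True) -> Run:
        """Return the run with the best value of the given metric."""
        candidates = [r for r in self.runs if metric in r.metrics]
        if not candidates:
            raise ValueError(f"no run recorded the metric {metric!r}")
        choose = max if higher_is_better else min
        return choose(candidates, key=lambda r: r.metrics[metric])

    def to_json_lines(self) -> str:
        """Serialize every run as one JSON object per line."""
        return "\n".join(json.dumps(asdict(r), sort_keys=True) for r in self.runs)


log = ExperimentLog()
log.log("baseline", {"lr": 0.1}, {"accuracy": 0.81})
log.log("lower-lr", {"lr": 0.01}, {"accuracy": 0.86})
print(log.best("accuracy").name)  # lower-lr
print(log.to_json_lines())
```

Writing the JSON lines to a file under version control gives a simple, diffable history of experiments.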

Model registry

A model registry is one component of the MLOps (machine learning operations) lifecycle. It’s a service that keeps track of and supervises models at various phases of the machine learning lifecycle. Here are a few tools which can help you with the model registry.

MLflow Model Registry: The registry feature in MLflow lets you keep track of everything from proof of concept to production.

Azure Machine Learning: Azure Machine Learning is a cloud-based solution for training, deploying, and monitoring ML applications. You can either use the UI or an API to create and register a model in Azure.
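To make the role of a registry concrete, here is a small in-memory sketch that assigns version numbers to registered models and moves them between lifecycle stages. The ModelRegistry class, the stage names, and the artifact URIs are hypothetical and chosen for this example; they do not reproduce the MLflow or Azure Machine Learning APIs.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

STAGES = ("none", "staging", "production", "archived")


@dataclass
class ModelVersion:
    name: str
    version: int
    artifact_uri: str
    stage: str = "none"


class ModelRegistry:
    def __init__(self) -> None:
        self._versions: Dict[str, List[ModelVersion]] = {}

    def register(self, name: str, artifact_uri: str) -> ModelVersion:
        """Add a new version of a model; versions are numbered from 1."""
        versions = self._versions.setdefault(name, [])
        model_version = ModelVersion(name, len(versions) + 1, artifact_uri)
        versions.append(model_version)
        return model_version

    def transition(self, name: str, version: int, stage: str) -> ModelVersion:
        """Move a version to a new stage; only one version stays in production."""
        if stage not in STAGES:
            raise ValueError(f"unknown stage: {stage}")
        target = self._versions[name][version - 1]
        if stage == "production":
            for other in self._versions[name]:
                if other.stage == "production":
                    other.stage = "archived"
        target.stage = stage
        return target

    def production_version(self, name: str) -> Optional[ModelVersion]:
        """Return the version currently in production, if any."""
        for model_version in self._versions.get(name, []):
            if model_version.stage == "production":
                return model_version
        return None


registry = ModelRegistry()
registry.register("churn", "s3://example-bucket/churn/1")  # hypothetical URI
registry.register("churn", "s3://example-bucket/churn/2")  # hypothetical URI
registry.transition("churn", 1, "production")
registry.transition("churn", 2, "production")
print(registry.production_version("churn").version)  # 2
```

Promoting version 2 archives version 1 automatically, so there is always a clear answer to which model is live.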

Collaboration and communication

Building models or doing research involves teamwork from the data scientist to the researcher. A lack of communication during the construction process can quickly lead to problems. There are a few tools and platforms that can help you collaborate more effectively.

Pachyderm: It allows users to collaborate on machine learning projects and workflows.

WandB: It facilitates collaboration by allowing you to invite others to edit and comment on your project.

Comet: It allows you to collaborate and share creations with others.

You can achieve machine learning reproducibility using various tools and platforms, but Git is one of the most common. So, let’s have a look at what Git is and how it works.

Git and Git LFS

Git is the most popular version control system. Git maintains a record of your file modifications so you can see what you’ve done and revert to prior versions if necessary. Git also makes collaboration easier by allowing several people’s changes to be merged into a single source. We’ve all heard of Git, but what exactly is Git LFS, and how does it work?

What is Git LFS?

Git Large File Storage (LFS) replaces large files, such as audio samples, videos, and large datasets, with text pointers inside Git, and stores the file contents on a remote server such as GitHub.com or GitHub Enterprise. It lets you version large files, and because the repository itself stays small, cloning and fetching are faster.

How Git LFS works (GIF source: https://Git-LFS.Github.com/)

How to use Git LFS?

Git LFS uses a pointer system: instead of writing large files (binary large objects) to the Git repository, you write a small pointer file, and the file contents are written to a separate server. Git LFS can also be used with more than one server. It’s quite simple to get started: you install the extension and select the file types you want it to track.

git lfs install

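  • Run the command above once per user account to set up Git LFS, then select the file types you want Git LFS to manage in each Git repository and make sure .gitattributes is tracked:

git lfs track "*.psd"

git lfs track "*.csv"

git add .gitattributes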

  • There is no step three. Just commit and push to GitHub as you normally would; for instance, if your current branch is named main:

git add file.psd

git commit -m "Add design file"

git push origin main
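The pointer file that Git LFS commits in place of a large file is a short text file recording a spec version, the SHA-256 of the content, and its size in bytes. The sketch below builds such a pointer in Python to show how little ends up in the Git repository. The function name make_lfs_pointer and the sample file are assumptions for this example; in practice the git-lfs client generates pointers for you.

```python
import hashlib


def make_lfs_pointer(content: bytes) -> str:
    """Build the text of a Git LFS pointer for the given file content."""
    oid = hashlib.sha256(content).hexdigest()
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{oid}\n"
        f"size {len(content)}\n"
    )


# Stand-in for a large binary asset (1 MB of zero bytes).
large_file = b"\x00" * 1_000_000
pointer = make_lfs_pointer(large_file)

print(pointer)
print(len(pointer.encode()), "bytes go into Git instead of", len(large_file))
```

The pointer stays well under 200 bytes regardless of how large the tracked file is, which is why repositories using Git LFS remain quick to clone.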

Git-LFS Limitations

  • Git LFS is complex for new users.
  • Adopting or removing Git LFS from a repository is a permanent action that requires rewriting history and losing your original commit SHAs.
  • It lacks configuration management as the term is currently understood.
  • It needs an LFS server, which is not provided by every Git hosting service.
  • It is not well suited to machine learning datasets, because hosted Git LFS storage imposes file-size and quota limits.
  • Training data is kept on a remote server and must be retrieved over the Internet, which poses bandwidth difficulties when using a hosted Git LFS solution.
  • Storing data files in your own cloud storage is not straightforward, because the main Git LFS offerings from the big three Git providers store your LFS files on their own servers.

What is DVC?

Data Version Control (DVC) is an open-source platform for machine learning projects. It assists data scientists and developers with data versioning, workflow management, and experiment management. Because DVC is easily customizable, users can adopt new features while continuing to reuse existing ones.

  • Support for several languages and frameworks.
  • One of the best MLOps tools.
  • Large volumes of data can be versioned.
  • Easy to install, with few commands.
  • DVC remembers the precise command sequence used at any given time.
  • The DVC files keep track of not only the files used in each execution stage but also the commands that are run during that stage.
  • DVC makes it simple for team members to share data and code.
The graphic illustrates how the Git repository interacts with the DVC-defined remote repository (Image from dvc.org)

Why is DVC better?

Data Version Control (DVC) is designed to make machine learning models shareable and reproducible. Managing, storing, and reusing models and algorithms is a major difficulty in deep learning and machine learning projects. Let’s have a look at some of the benefits of using DVC.

Reproducibility

DVC data registries are useful for reusing ML models and datasets across projects. They work much like a package management system, which improves reproducibility and reusability. Changes to a registry can be reviewed through pull requests and applied with a single commit, and the registry stores the history of all artifacts, including what was modified and when. With the dvc get and dvc import commands, users can reuse artifacts and organize feature stores through a simple command-line interface.

Organized Data

We know how important data is for ML engineers and data scientists. Adequate data management is required to train models efficiently. To version data alongside Git, DVC uses the concept of a data pipeline. These lightweight pipelines let you organize and reproduce your workflows. For machine learning, dataset versioning improves automation, reproducibility, and CI/CD.
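One way to picture a lightweight pipeline is as a list of stages, each with dependencies, a command, and an output, where a stage is re-run only if its dependencies have changed since the last run. The sketch below imitates that idea in memory. The Stage class, the run_pipeline helper, and the use of Python callables in place of shell commands are all assumptions made for illustration and do not reflect how DVC is implemented.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Stage:
    name: str
    deps: List[str]                 # workspace keys this stage reads
    out: str                        # workspace key this stage writes
    command: Callable[..., object]  # stands in for a shell command


def _deps_hash(stage: Stage, workspace: Dict[str, object]) -> str:
    """Hash the current values of a stage's dependencies."""
    payload = json.dumps([workspace[d] for d in stage.deps], sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()


def run_pipeline(stages: List[Stage], workspace: Dict[str, object],
                 lock: Dict[str, str]) -> List[str]:
    """Run stages in order, skipping those whose inputs are unchanged.

    `lock` remembers the dependency hash of each stage's last run.
    Returns the names of the stages that were actually executed.
    """
    executed = []
    for stage in stages:
        current = _deps_hash(stage, workspace)
        if lock.get(stage.name) == current and stage.out in workspace:
            continue  # inputs unchanged, reuse the existing output
        workspace[stage.out] = stage.command(*[workspace[d] for d in stage.deps])
        lock[stage.name] = current
        executed.append(stage.name)
    return executed


# Hypothetical two-stage workflow: clean the data, then "train" on it.
stages = [
    Stage("clean", ["raw"], "clean", lambda raw: [x for x in raw if x is not None]),
    Stage("train", ["clean", "lr"], "model", lambda clean, lr: sum(clean) * lr),
]
workspace = {"raw": [1, None, 3], "lr": 0.5}
lock = {}

print(run_pipeline(stages, workspace, lock))  # ['clean', 'train']
print(run_pipeline(stages, workspace, lock))  # []
workspace["lr"] = 0.1
print(run_pipeline(stages, workspace, lock))  # ['train']
```

Only the stages affected by a change run again, which is what keeps repeated experiments cheap while the recorded hashes keep them reproducible.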

Share Models via Cloud Storage

With DVC, teams that centralize their data storage find it easier to run experiments on a single shared machine, which leads to better resource use. DVC lets groups maintain a development server for working with shared data.

Track & Visualize ML Models

In DVC, data science features are versioned and stored in data repositories. Versioning is achieved through regular Git workflows, such as pull requests. DVC uses a built-in cache to store all ML artifacts, and the cache is then synchronized with remote cloud storage. In this way, DVC tracks data and models so that any version can be retrieved later. Writing a dvc.yaml file is a basic first step toward tracking ML models and the artifacts they produce.
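The cache-plus-metafile arrangement can be pictured as content-addressed storage: the artifact is stored under its hash, and a small text record pointing at that hash is what gets committed to Git. The following sketch mimics the idea with a temporary directory. The function names, the JSON metafile layout, and the directory structure are assumptions for illustration rather than DVC's actual on-disk format.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def add_to_cache(artifact: Path, cache_dir: Path) -> Path:
    """Copy the artifact into the cache under its hash and write a metafile."""
    data = artifact.read_bytes()
    digest = hashlib.sha256(data).hexdigest()
    cached = cache_dir / digest[:2] / digest[2:]
    cached.parent.mkdir(parents=True, exist_ok=True)
    cached.write_bytes(data)
    metafile = artifact.with_name(artifact.name + ".meta.json")
    metafile.write_text(
        json.dumps({"path": artifact.name, "sha256": digest, "size": len(data)})
    )
    return metafile


def checkout(metafile: Path, cache_dir: Path) -> Path:
    """Restore the artifact version that the metafile points at."""
    meta = json.loads(metafile.read_text())
    digest = meta["sha256"]
    target = metafile.with_name(meta["path"])
    target.write_bytes((cache_dir / digest[:2] / digest[2:]).read_bytes())
    return target


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    cache = root / "cache"
    model = root / "model.bin"

    model.write_bytes(b"weights-v1")
    metafile = add_to_cache(model, cache)
    meta_v1 = metafile.read_text()  # what Git would store for version 1

    model.write_bytes(b"weights-v2")
    add_to_cache(model, cache)

    # Putting the old metafile back (as a Git checkout would) restores v1.
    metafile.write_text(meta_v1)
    print(checkout(metafile, cache).read_bytes())  # b'weights-v1'
```

Because Git only ever sees the small metafile, switching versions is a matter of swapping a few bytes of text and copying the matching content out of the cache.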

Boost in productivity

Consider switching a 100 GB dataset and its model to another version in seconds with a simple checkout command (git checkout for the metafiles, followed by dvc checkout for the data), or combining similar commands to train systems in less time and get results faster.

A set of upgraded features enables fast-paced machine learning innovation. The features include versioning metafiles, simple text-based metrics tracking, switching, data sharing via a centralized development server, lightweight pipelines, and data-driven directory navigation.

Image courtesy dvc.org

A machine learning research team can use DVC to verify that their data, settings, and code are all in sync. It’s a simple system that efficiently handles shared data repositories while storing configuration and code in an SCM system (like Git).

Conclusion

Adopting good ML practices, properly versioning data, code, and configuration files, and automating processing steps to achieve reproducible outcomes will make your machine learning journey easier and faster. We saw the importance of reproducibility and why DVC is better than Git LFS. Hope you liked the article.

Originally Published On : https://censius.ai/blogs/dvc-vs-git-and-git-lfs-in-machine-learning-reproducibility
