Why DVC Is Better Than Git and Git-LFS for Machine Learning Reproducibility

What is Machine Learning Reproducibility?

Reproducibility means that the process you used to produce your results is documented and made available, so that others with access to similar data and tools can run your algorithm and get the same (or similar) results each time, even on different datasets. Most machine learning pipelines are end-to-end: they cover everything from data processing to model design, model analysis, and evaluation, through reporting and deployment.

Why is Reproducibility Important in Machine Learning?

A reproducible machine learning application is also built to scale with your company’s growth: the care taken to architect and code the pipeline properly pays off in results that can be reproduced reliably.

How to Achieve Machine Learning Reproducibility?

Machine learning reproducibility can be achieved with a variety of techniques and tools; we’ll go through a few of them here.


Versioning

The process of organizing controls, recording changes to the model or data, and implementing policies for the model is known as versioning. A number of versioning tools are used by machine learning engineers and data scientists; DVC, covered below, is one of them.

Drift / Continuous training

Drift is a common issue in production models: a slight change in the environment or the data distribution will affect the model’s results. There are two common ways to reduce drift in production models: constant monitoring and continuous retraining on new data. A number of tools exist for monitoring ML systems.

Experiments tracking and logging

Model training is a continuous process that entails altering parameter values, evaluating each algorithm’s performance, and fine-tuning to achieve the best possible results, among other things. A number of tools can help you track and log your experiments.

Model registry

A model registry is a component of MLOps (the machine learning lifecycle). It is a service that keeps track of and supervises models at various phases of the machine learning lifecycle. Several tools provide model-registry functionality.

Collaboration and communication

Building models or doing research involves teamwork, from data scientists to researchers, and a lack of communication during the build process can quickly lead to problems. A number of tools and platforms can help teams collaborate more effectively.

Git and Git LFS

Git is the most popular version control system. Git maintains a record of your file modifications so you can see what you’ve done and revert to prior versions if necessary. Git also makes collaboration easier by allowing several people’s changes to be merged into a single source. We’ve all heard of Git, but what exactly is Git LFS, and how does it work?
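As a minimal sketch of what this versioning looks like (the repository, file name, and commit messages below are hypothetical, created in a throwaway temporary directory):

```shell
# Create a throwaway repository and version a small file with plain Git.
repo=$(mktemp -d)
git -C "$repo" init -q
git -C "$repo" config user.email "you@example.com"
git -C "$repo" config user.name "Example User"

# Commit a first version of a (tiny) dataset.
echo "id,label" > "$repo/data.csv"
git -C "$repo" add data.csv
git -C "$repo" commit -q -m "Track dataset v1"

# Change the file and commit again; Git records both versions.
echo "1,cat" >> "$repo/data.csv"
git -C "$repo" commit -q -am "Track dataset v2"

# The history lets you inspect, diff, or revert to any prior version.
git -C "$repo" log --oneline -- data.csv
```

This works well for source code and small files, but Git stores every version of a file’s full contents, which is why large binary files need a different mechanism.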

What is Git LFS?

Git Large File Storage (LFS) handles large files, such as audio samples, videos, and large datasets, by replacing them in the repository with small text pointers while storing the actual contents on a remote server, such as GitHub.com or GitHub Enterprise. It lets you version large files while keeping the repository itself small, which makes cloning and fetching faster.

Git-LFS in action (GIF source: https://Git-LFS.Github.com/)

How to use Git LFS?

Git LFS uses a pointer system instead of storing the real files (binary large objects) in the repository: a small pointer file is written to the Git repository, while the actual contents are written to a separate server. Git LFS lets you use several servers. Getting started is quite simple: you install the extension and configure the file types you want it to handle.
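To make this concrete (a sketch, assuming the `git-lfs` extension is installed; the `*.psd` pattern is just an example): after `git lfs install` enables the hooks, running `git lfs track "*.psd"` writes an entry like this to `.gitattributes`:

```
*.psd filter=lfs diff=lfs merge=lfs -text
```

From then on, committing a matching file stores only a small pointer in Git history, roughly of the form:

```
version https://git-lfs.github.com/spec/v1
oid sha256:<64-character hash of the file contents>
size <file size in bytes>
```

The real contents live on the LFS server and are downloaded on checkout.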

Git-LFS Limitations

  • Git-LFS is complex for new users.
  • Adopting or removing Git-LFS in a repository is a permanent action that requires rewriting history and erases your original commit SHAs.
  • It currently lacks configuration management.
  • It needs an LFS server, which not every Git hosting service provides.
  • Git-LFS storage has limits that make it unsuitable for large machine learning datasets.
  • Training data is kept on a remote server and must be retrieved over the Internet, which creates bandwidth difficulties when using a hosted Git-LFS solution.
  • Uploading data files to your own cloud storage system is awkward, because the main Git-LFS products from the big three Git providers only let you store your LFS files on their servers.

What is DVC?

Data Version Control (DVC) is an open-source tool for machine learning projects. DVC assists data scientists and developers with data versioning, workflow management, and experiment management. Because DVC is easily customizable, users can make use of new features while reusing existing ones.

  • One of the best-known MLOps tools.
  • Large volumes of data can be versioned.
  • Easy to install, with only a few commands.
  • DVC remembers the precise command sequence used at any given time.
  • DVC files keep track of not only the files used in each execution stage but also the commands run during that stage.
  • DVC makes it simple for team members to share data and code.

The graphic illustrates how the Git repository interacts with the DVC-defined remote storage (image from dvc.org)
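As a hedged sketch of the basic workflow (the file names below are hypothetical, and the hash and size are placeholders): after `dvc init`, running `dvc add data/train.csv` moves the file’s contents into DVC’s cache and generates a small metafile, `data/train.csv.dvc`, which you commit to Git instead of the data itself. The metafile looks roughly like:

```
outs:
- md5: <content hash computed by DVC>
  size: <file size in bytes>
  path: train.csv
```

`git add data/train.csv.dvc .gitignore` followed by `git commit` then versions the pointer, while `dvc push` uploads the cached contents to any remote storage you configure (S3, GCS, SSH, and so on), and `dvc pull` retrieves them.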

Why is DVC better?

Data Version Control (DVC) is designed to make machine learning models shareable and reproducible. Managing, storing, and reusing models and algorithms is a major difficulty in deep learning and machine learning projects. Let’s look at some of the benefits of using DVC.

Image courtesy dvc.org
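One reason DVC helps with reproducibility is its pipeline definition. As a sketch (the stage name, script, and paths here are hypothetical), a `dvc.yaml` file records each stage’s command, dependencies, and outputs:

```yaml
stages:
  train:
    cmd: python train.py data/train.csv model.pkl
    deps:
      - train.py
      - data/train.csv
    outs:
      - model.pkl
```

Running `dvc repro` re-executes only the stages whose dependencies have changed, so anyone with the repository and access to the data can regenerate the same model.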


Adopting good ML practices, properly versioning data, code, and configuration files, and automating processing steps to achieve reproducible outcomes will make your machine learning journey easier and faster. We have seen why reproducibility matters and why DVC is better than Git-LFS. I hope you liked the article.



Harshil Patel

Software Developer and Technical Writer.