Automatic generation of documentation and comments for source code using Codist AI

Machine learning is now widely applied to source code. Various applications are being built to help developers write better code, such as autocompletion, type checking and inference, unit test generation, code summarization, plagiarism detection for code written in different programming languages but performing the same task, bug finding, program repair and induction, and finally docstring or comment generation that explains what the code is doing.


For large code bases, say 1 million lines of code, providing comments and documentation can be very difficult. Comments cannot be skipped, because the next group of developers and maintainers will rely on them to understand the existing work and build on top of it. A well-structured code base with consistent comments helps developers understand and maintain code easily. Another aspect is reaching users: software vendors should provide good documentation for their products so that customers can understand how they work.

It’s a big challenge for developers to handle the pressure of writing code and testing modules while also keeping up with documentation.


Codist is a Platform as a Service (PaaS) that uses MLonCode (machine learning on source code) and programming language theory to interpret source code automatically, helping developers understand and maintain it faster. It facilitates source code documentation at scale. It analyzes both new and legacy code by auditing the documentation, and it automatically updates any missing or obsolete documentation.

Docly by Codist

Docly is a CLI tool that examines code and adds the required documentation with a single command. This can be useful right before pushing code to a version control host such as GitHub.


pip install docly


docly-gen /path/to/file_or_folder_with_python_files

This command prints an interactive prompt asking whether you want to see and apply the changes [y/n]. By default, it generates a comment for each function and lists all the declared arguments. Generation of the argument list can also be turned off.

To use docly in jupyter notebooks:

pip install 'docly[jupyter]'

To run docly on the .ipynb file from the CLI:

docly-gen --run_on_notebooks /path/to/file_or_folder_with_python_files 

To skip the diff and print a report of the generated comments:

docly-gen --no_generate_diff --print_report /path/to/file_or_folder_with_python_files

Revert the applied changes:

docly-restore

Currently, Codist provides access to a beta version of Docly. At the moment, it is available only on macOS and Linux; a Windows version is expected soon. Docly uses vector embeddings of source code, programming language theory to parse and interpret the code automatically, and natural language processing to build semantic understanding between computers and humans. Docly will soon be open source.

Docly uses the source code graphical navigation

Tree-Hugger

To build a large-scale commenting system, one has to mine a large amount of code from various sources and model it to give accurate predictions. This is why Codist developed Tree-Hugger to mine source code repositories. Tree-Hugger is a lightweight, high-level library that provides Python APIs to mine Git repositories, with universal code parsers built on top of tree-sitter. So far, it supports parsers for Python, Java, JavaScript, PHP, and C++. With the advent of the Hugging Face Transformers library and other open-source libraries, working with NLP has become easier. Codist has open-sourced Tree-Hugger.

Installation: pip install -U tree-hugger PyYAML


from tree_hugger.core import PythonParser
pp = PythonParser()

Once a Python file has been parsed, listing its function names returns:

['first_child', 'second_child', 'say_whee', 'wrapper', 'my_decorator', 'parent']

The docstring of each documented function can be retrieved as well:

{'parent': '"""This is the parent function\n    \n    There are other lines in the doc string\n    This is the third line\n\n    And this is the fourth\n    """',
 'first_child': "'''\n        This is first child\n        '''",
 'second_child': '"""\n        This is second child\n        """',
 'my_decorator': '"""\n    Outer decorator function\n    """',
 'say_whee': '"""\n    Hellooooooooo\n\n    This is a function with decorators\n    """'}
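To make the shape of that output concrete, here is a minimal sketch that produces a similar name list and name-to-docstring mapping using only Python's standard `ast` module. It is not Tree-Hugger and does not use tree-sitter; the helper names `function_names` and `function_docstrings` and the `SOURCE` sample are made up for this illustration.

```python
import ast

# Hypothetical sample input, loosely modelled on the example above.
SOURCE = '''
def parent():
    """This is the parent function"""
    def first_child():
        """This is first child"""
    def second_child():
        pass
'''

FUNCTION_NODES = (ast.FunctionDef, ast.AsyncFunctionDef)


def function_names(source):
    """Return the names of all functions in the source, nested ones included."""
    tree = ast.parse(source)
    return [node.name for node in ast.walk(tree) if isinstance(node, FUNCTION_NODES)]


def function_docstrings(source):
    """Map each documented function name to its cleaned docstring."""
    docs = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, FUNCTION_NODES):
            doc = ast.get_docstring(node)  # None when the function has no docstring
            if doc is not None:
                docs[node.name] = doc
    return docs


if __name__ == "__main__":
    print(function_names(SOURCE))
    print(function_docstrings(SOURCE))
```

Unlike the output shown above, `ast.get_docstring` strips the surrounding quotes and indentation, and this approach only works for Python, whereas a tree-sitter based parser can cover several languages.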

An end-to-end pipeline implementation of Tree Hugger is featured in this notebook.


Codist has built a package, code-bert, that automatically checks whether source code documentation is up to date. code-bert currently works for Python code.
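As a rough idea of what "documentation that is not up to date" can mean, the sketch below flags functions whose docstring never mentions one of their parameters. This is a simple rule-based stand-in, not the learned model that code-bert uses, and the function name `undocumented_params` and the `SAMPLE` input are assumptions made for this example.

```python
import ast
import re

# Hypothetical sample: `clip` was added to the signature but not to the docstring.
SAMPLE = '''
def scale(values, factor, clip=None):
    """Multiply values by factor."""
    return [v * factor for v in values]
'''


def undocumented_params(source):
    """Return {function name: [parameters that its docstring never mentions]}."""
    report = {}
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        doc = ast.get_docstring(node) or ""
        args = node.args.posonlyargs + node.args.args + node.args.kwonlyargs
        # *args and **kwargs are ignored here to keep the sketch short.
        params = [a.arg for a in args if a.arg not in ("self", "cls")]
        # Whole-word match, so a parameter `a` is not "found" inside another word.
        missing = [p for p in params if not re.search(rf"\b{re.escape(p)}\b", doc)]
        if missing:
            report[node.name] = missing
    return report


if __name__ == "__main__":
    print(undocumented_params(SAMPLE))
```

A check like this only catches missing parameter names; it cannot tell whether the description still matches what the code does, which is the kind of gap a learned model is meant to address.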

Recently, Microsoft also released CodeBERT. The difference is that Codist’s model is trained with MLM (masked language modeling) and next word prediction, while Microsoft’s uses MLM and replaced token detection.
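The two objectives named here, MLM and replaced token detection, differ mainly in how the training inputs and labels are built. The toy sketch below shows that difference on a list of code tokens. It is a simplification under stated assumptions: real models work on subword ids, and in replaced token detection the replacements normally come from a small generator model rather than random choice. The names `mlm_example`, `rtd_example`, and `MASK` are illustrative only.

```python
import random

MASK = "<mask>"  # placeholder token; the actual mask symbol depends on the tokenizer


def mlm_example(tokens, mask_prob=0.15, seed=0):
    """Masked language modelling: hide some tokens and ask the model to recover them.

    Returns (inputs, labels); a label is the original token at masked
    positions and None everywhere else.
    """
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            labels.append(tok)
        else:
            inputs.append(tok)
            labels.append(None)
    return inputs, labels


def rtd_example(tokens, vocab, replace_prob=0.15, seed=0):
    """Replaced token detection: swap some tokens and label every position.

    Returns (inputs, labels); a label is 1 where the token was replaced
    and 0 where it is the original.
    """
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        candidates = [v for v in vocab if v != tok]
        if candidates and rng.random() < replace_prob:
            inputs.append(rng.choice(candidates))  # random stand-in for a generator model
            labels.append(1)
        else:
            inputs.append(tok)
            labels.append(0)
    return inputs, labels


if __name__ == "__main__":
    code_tokens = "def add ( a , b ) : return a + b".split()
    print(mlm_example(code_tokens, mask_prob=0.3, seed=1))
    print(rtd_example(code_tokens, sorted(set(code_tokens)), replace_prob=0.3, seed=1))
```

In MLM only the masked positions carry a training signal, while replaced token detection produces a label for every position, which is one reason the two objectives are often combined.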

CodistAI's open-source release makes it easy to use a fine-tuned model based on the open-source MLM code model codeBERT-small-v2, a RoBERTa model that was trained using the Hugging Face Transformers library and then fine-tuned.


While these tools and packages do an impressive job of analyzing code and providing useful information on documentation and code comments, there are some drawbacks that researchers are still working to address. These language models are highly capable but can still fail. In addition, using them in practice is quite difficult because many prerequisites are required.


Jayita Bhattacharyya

Passionate about machine learning and data science. Eager to learn new technological advances. A self-taught technician who enjoys doing cool stuff using technology for fun and worth.


Sam D. Gomez