Microsoft’s CodeBERT ingests public GitHub repositories to help you find code

Large pretrained language models have improved the state of the art on a range of natural language processing tasks, chiefly because they can learn contextual representations from text without supervision. In a preprint paper, a team of researchers at Microsoft Research Asia put this capability to work in CodeBERT, a model for programming languages like Python, Java, JavaScript, and more that supports natural language understanding tasks (like code search) as well as generation tasks (like code documentation generation).
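To make the code search use case concrete, the sketch below embeds a natural language query and a handful of code snippets with a pretrained CodeBERT checkpoint and ranks the snippets by cosine similarity. It assumes the model is published as "microsoft/codebert-base" on the Hugging Face Hub and that the transformers and torch packages are installed; the mean-pooling-plus-cosine ranking is an illustrative heuristic, not the fine-tuned code-search setup described in the paper.

```python
# Minimal sketch: natural-language code search with a pretrained CodeBERT checkpoint.
# Assumption: the checkpoint is available as "microsoft/codebert-base" on the Hugging Face Hub.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")
model.eval()

def embed(text: str) -> torch.Tensor:
    """Return a single mean-pooled vector for a query or code snippet."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state   # (1, seq_len, hidden_size)
    return hidden.mean(dim=1).squeeze(0)             # (hidden_size,)

query = "read a file and return its lines"
snippets = [
    "def read_lines(path):\n    with open(path) as f:\n        return f.readlines()",
    "def add(a, b):\n    return a + b",
]

q = embed(query)
scores = [torch.cosine_similarity(q, embed(s), dim=0).item() for s in snippets]
best = max(range(len(snippets)), key=scores.__getitem__)
print(f"best match (score={scores[best]:.3f}):\n{snippets[best]}")
```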

CodeBERT (the "BERT" in the name refers to Google's BERT architecture for natural language processing) builds on a multi-layer, bidirectional Transformer. Like all deep neural networks, Transformers contain neurons (mathematical functions) arranged in interconnected layers; during training, the network gradually adjusts the strength (weight) of each connection as it learns to extract features from input data and make predictions. What sets Transformers apart is attention: every output element is connected to every input element, and the weightings between them are computed dynamically rather than fixed in advance.
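The toy example below illustrates that attention idea in isolation: each output position attends to every input position, and the connection weights are computed on the fly from the data via a softmax over dot products. This is generic scaled dot-product self-attention, not CodeBERT-specific code, and the tensor sizes are arbitrary.

```python
# Toy scaled dot-product self-attention: all-to-all connections with
# dynamically computed weights (illustrative only, not CodeBERT's code).
import math
import torch

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (seq_len, d_model)
    scores = q @ k.T / math.sqrt(q.size(-1))   # (seq_len, seq_len): one score per input-output pair
    weights = torch.softmax(scores, dim=-1)    # rows sum to 1: dynamic weighting of all inputs
    return weights @ v, weights                # each output is a weighted mix of every input

x = torch.randn(5, 8)                          # 5 tokens, 8-dimensional embeddings
output, weights = scaled_dot_product_attention(x, x, x)  # self-attention: q = k = v
print(weights.shape)                           # torch.Size([5, 5]) -- every output sees every input
```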
