Deobfuscating Name Scrambling as a Natural Language Generation Task

This paper wraps up the keywords4bytecodes project. It was presented at the Argentinian Symposium on Artificial Intelligence (ASAI) in 2018. An extended version was published in 2019 in the electronic Journal of the SADIO (linked below).

The research question is whether the bytecodes that implement a given method can represent the semantics behind the name of the method. If that is the case, this is valuable for practitioners doing reverse engineering of obfuscated code (which lacks meaningful names). The experiments use 5 million methods and train a Random Forest model to predict the first term in the method name.
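To make the prediction target concrete, here is a minimal sketch of how the first term of a method name could be extracted. The splitting rule (camelCase and snake_case boundaries, lowercased) and the helper name `first_term` are assumptions for illustration; the paper may tokenize names differently.

```python
import re

# Assumed tokenization rule, not taken from the paper:
# an acronym (run of capitals), a capitalized or lowercase word, or a run of digits.
_TERM_RE = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+")


def first_term(method_name: str) -> str:
    """Return the lowercased first term of a method name, or "" if none is found."""
    match = _TERM_RE.search(method_name)
    return match.group(0).lower() if match else ""


if __name__ == "__main__":
    for name in ["getUserName", "toString", "HTMLParser", "_init_cache"]:
        print(name, "->", first_term(name))
```

Under this rule, a method such as `getUserName` would be labelled with the term `get`, which is the kind of label a classifier would then be trained to predict.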

As features, the model uses the opcodes of the bytecodes (that is, the bytecodes without any of their parameters).
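A rough sketch of what "opcodes without parameters" can look like in practice is shown below. It assumes each instruction arrives as a simplified text line of the form `mnemonic operands` (an invented input format), and it counts opcodes as a bag of features. Whether the paper uses counts, sequences, or another encoding is not stated here, so the counting step is an assumption.

```python
from collections import Counter


def opcodes_only(instructions):
    """Keep the mnemonic of each instruction line and drop its operands."""
    return [line.split()[0] for line in instructions if line.strip()]


def opcode_counts(instructions):
    """Assumed feature encoding: how many times each opcode occurs in a method."""
    return Counter(opcodes_only(instructions))


# Hypothetical body of a simple getter, written in a simplified listing format.
example = [
    "aload_0",
    "getfield #2",
    "invokevirtual #3",
    "aload_0",
    "areturn",
]

if __name__ == "__main__":
    print(opcodes_only(example))
    print(opcode_counts(example))
```

Dropping the operands discards constant-pool references such as field and method names, which is what keeps the features usable on obfuscated code where those names are scrambled.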

The results showed that the model can distinguish the 15 most popular terms from the others at 78% recall, which helps a programmer performing reverse engineering to cut in half the number of methods in a program that they should investigate further.
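For readers unfamiliar with the metric, recall is the fraction of true members of a class that the classifier actually finds. The sketch below computes it on made-up toy labels. Treating "popular" as the positive class is an assumption, since this summary does not say which class the 78% figure refers to.

```python
def recall(true_labels, predicted_labels, positive):
    """Recall = true positives / (true positives + false negatives)."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Toy labels for illustration only; these are not results from the paper.
y_true = ["popular", "popular", "other", "popular", "popular"]
y_pred = ["popular", "other", "other", "popular", "popular"]

if __name__ == "__main__":
    print(recall(y_true, y_pred, positive="popular"))
```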

The code is available on GitHub and the paper itself is also available on-line.