
AI - Technological Singularity

 




The emergence of technologies that could fundamentally change humans' role in society, challenge human epistemic agency and ontological status, and trigger unprecedented and unforeseen developments in all aspects of life, whether biological, social, cultural, or technological, is referred to as the Technological Singularity.

The Technological Singularity is most often associated with artificial intelligence, particularly artificial general intelligence (AGI).

As a result, it is frequently depicted as an intelligence explosion that pushes advancements in fields such as biotechnology, nanotechnology, and information technology, as well as spawning entirely new inventions.

The Technological Singularity is sometimes referred to simply as the Singularity, but it should not be confused with a mathematical singularity, to which it bears only a passing resemblance.

Rather, this singularity is a loosely defined term that may be interpreted in a variety of ways, each highlighting distinct elements of the anticipated technological advances.

The thoughts and writings of John von Neumann (1903–1957), Irving John Good (1916–2009), and Vernor Vinge (1944–) are commonly connected with the Technological Singularity notion, which dates back to the second half of the twentieth century.

Several universities, as well as governmental and corporate research institutes, have financed current Technological Singularity research in order to better understand the future of technology and society.

Despite being the topic of profound philosophical and technical debate, the Technological Singularity remains a hypothesis: a conjecture and a largely open speculative idea.

While numerous scholars think that the Technological Singularity is unavoidable, the date of its occurrence is continuously pushed back.

Nonetheless, many studies agree that the issue is not whether or not the Technological Singularity will occur, but rather when and how it will occur.

Ray Kurzweil proposed a more precise timeline, placing the emergence of the Technological Singularity around the middle of the twenty-first century.

Others have sought to give a date to this event, but there are no well-founded grounds in support of any such proposal.

Furthermore, without applicable measures or signs, mankind would have no way of knowing when the Technological Singularity has occurred.

The history of artificial intelligence's unmet promises exemplifies the dangers of attempting to predict the future of technology.

The themes of superintelligence, acceleration, and discontinuity are often used to describe the Technological Singularity.

The term "superintelligence" refers to a quantitative jump in artificial systems' cognitive abilities, putting them much beyond the capabilities of typical human cognition (as measured by standard IQ tests).

Superintelligence, on the other hand, may not be restricted to AI and computer technology.

Through genetic engineering, biological computing systems, or hybrid artificial–natural systems, it may manifest in human agents.

Superintelligence, according to some academics, has boundless intellectual capabilities.

Acceleration refers to the increasing steepness of the curve along which key technological events arrive over time.

Stone tools, the potter's wheel, the steam engine, electricity, atomic power, computers, and the internet are all examples of technological advancement portrayed as a curve across time that emphasizes the discovery of major innovations.

Moore's law, which is more precisely an observation that has come to be treated as a law, describes the growth of computing capacity.

It states that the number of transistors in a dense integrated circuit doubles approximately every two years.

In the event of the Technological Singularity, the emergence of key technical advances and of new technological and scientific paradigms is expected to follow a super-exponential curve.

One prediction regarding the Technological Singularity, for example, is that superintelligent systems would be able to self-improve (and self-replicate) in previously unimaginable ways at an unprecedented pace, pushing the technological development curve far beyond what has ever been witnessed.
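
To make the contrast concrete, here is a small illustrative Python sketch (toy numbers of my own, not a forecast) comparing Moore's-law-style exponential growth, where the doubling period stays fixed, with a super-exponential trajectory, where each doubling arrives sooner than the last:

```python
# Exponential growth (Moore's law): capacity doubles every fixed period.
def moore(n0, years, doubling_period=2.0):
    return n0 * 2 ** (years / doubling_period)

# Super-exponential growth: each doubling arrives sooner than the last,
# so doublings pile up toward a finite point in time; the min_period
# guard keeps this toy model from looping forever at that point.
def super_exponential(n0, years, first_period=2.0, shrink=0.8,
                      min_period=0.01):
    n, t, period = n0, 0.0, first_period
    while t + period <= years and period >= min_period:
        n *= 2                 # one doubling
        t += period            # time it took
        period *= shrink       # the next doubling comes sooner
    return n

for y in (4, 8, 9.5):
    print(f"year {y}: exponential {moore(1, y):.0f}, "
          f"super-exponential {super_exponential(1, y):.0f}")
```

In the super-exponential case the doublings crowd toward a finite point in time, which is exactly the intuition that singularity arguments invoke.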

The discontinuity of the Technological Singularity is referred to as an event horizon, by analogy with the physical concept associated with black holes.

This analogy to a physical phenomenon, however, should be used with care, rather than being taken to credit the Technological Singularity with the regularity and predictability of the physical world.

The limit of our knowledge about physical occurrences beyond a specific point in time is defined by an event horizon (also known as a prediction horizon).

It signifies that there is no way of knowing what will happen beyond the event horizon.

The discontinuity or event horizon in the context of technological singularity suggests that the technologies that precipitate technological singularity would cause disruptive changes in all areas of human life, developments about which experts cannot even conjecture.

The end of humanity and the end of human civilization are often associated with the Technological Singularity.

According to some research, social order will collapse, people will cease to be major actors, and epistemic agency and primacy will be lost.

Humans, it seems, will not be required by superintelligent systems.

These systems will be able to self-replicate, evolve, and build their own habitats, and humans will be seen either as barriers or as unimportant, outdated things, much as humans now regard lesser species.

One such scenario is represented by Nick Bostrom's Paperclip Maximizer, a thought experiment in which a superintelligent AI given the innocuous goal of manufacturing paperclips converts all available resources, humanity included, toward that goal.

AI is included as a possible danger to humanity's existence in the Global Catastrophic Risks Survey, with a reasonably high likelihood of human extinction, placing it on par with global pandemics, nuclear war, and global nanotech catastrophes.

However, the AI-related apocalyptic scenario is not a foregone conclusion of the Technological Singularity.

In other, more utopian scenarios, the Technological Singularity would usher in a new period of endless bliss by opening up new opportunities for humanity's infinite expansion.

Another element of technological singularity that requires serious consideration is how the arrival of superintelligence may imply the emergence of superethical capabilities in an all-knowing ethical agent.

Nobody knows, however, what superethical abilities might entail.

The fundamental problem, however, is that the superior intellectual abilities of superintelligent entities do not ensure a high degree of ethical probity, or indeed any level of ethical probity.

As a result, a superintelligent machine with almost (but not quite) infinite capacities and no ethics seems dangerous, to say the least.

A sizable number of scholars are skeptical about the development of the Technological Singularity, notably of superintelligence.

They rule out the possibility of developing artificial systems with superhuman cognitive abilities, either on philosophical or scientific grounds.

Some contend that although artificial intelligence is often at the heart of Technological Singularity claims, achieving human-level intelligence in artificial systems is impossible, and hence superintelligence, and with it the Technological Singularity, is a fantasy.

Such barriers, however, do not exclude the development of superhuman minds via the genetic modification of ordinary humans, opening the door for transhumans, human-machine hybrids, and superhuman agents.

Other scholars question the validity of the notion of the Technological Singularity, pointing out that such forecasts about future civilizations are based on speculation and guesswork.

Others argue that the promises of unrestrained technological advancement and limitless intellectual capacity made by the Technological Singularity legend are unfounded, since physical and information-processing resources are plainly limited in the cosmos, particularly on Earth.

On this view, any promise of self-replicating, self-improving artificial agents capable of super-exponential technological advancement is empty, since such systems would lack the creativity, will, and incentive to drive their own evolution.

Meanwhile, social opponents point out that superintelligence's boundless technological advancement would not alleviate issues like overpopulation, environmental degradation, poverty, and unparalleled inequality.

Indeed, the widespread unemployment projected as a consequence of AI-assisted mass automation of labor, barring significant segments of the population from contributing to society, would result in unparalleled social upheaval, delaying the development of new technologies.

As a result, rather than speeding up, political or societal pressures will stifle technological advancement.

While technological singularity cannot be ruled out on logical grounds, the technical hurdles that it faces, even if limited to those that can presently be determined, are considerable.

Nobody expects the technological singularity to happen with today's computers and other technology, but proponents of the concept consider these obstacles as "technical challenges to be overcome" rather than possible show-stoppers.

However, there is a large list of technological issues to be overcome, and Murray Shanahan's The Technological Singularity (2015) gives a fair overview of some of them.

There are also some significant nontechnical issues, such as the problem of superintelligent system training, the ontology of artificial or machine consciousness and self-aware artificial systems, the embodiment of artificial minds or vicarious embodiment processes, and the rights granted to superintelligent systems, as well as their role in society and any limitations placed on their actions, if this is even possible.

These issues are currently confined to the realms of technological and philosophical discussion.


~ Jai Krishna Ponnappan

Find Jai on Twitter | LinkedIn | Instagram


You may also want to read more about Artificial Intelligence here.



See also: 


Bostrom, Nick; de Garis, Hugo; Diamandis, Peter; Digital Immortality; Goertzel, Ben; Kurzweil, Ray; Moravec, Hans; Post-Scarcity, AI and; Superintelligence.


References And Further Reading


Bostrom, Nick. 2014. Superintelligence: Paths, Dangers, Strategies. Oxford, UK: Oxford University Press.

Chalmers, David. 2010. “The Singularity: A Philosophical Analysis.” Journal of Consciousness Studies 17: 7–65.

Eden, Amnon H. 2016. The Singularity Controversy. Sapience Project. Technical Report STR 2016-1. January 2016.

Eden, Amnon H., Eric Steinhart, David Pearce, and James H. Moor. 2012. “Singularity Hypotheses: An Overview.” In Singularity Hypotheses: A Scientific and Philosophical Assessment, edited by Amnon H. Eden, James H. Moor, Johnny H. Søraker, and Eric Steinhart, 1–12. Heidelberg, Germany: Springer.

Good, I. J. 1966. “Speculations Concerning the First Ultraintelligent Machine.” Advances in Computers 6: 31–88.

Kurzweil, Ray. 2005. The Singularity Is Near: When Humans Transcend Biology. New York: Viking.

Sandberg, Anders, and Nick Bostrom. 2008. Global Catastrophic Risks Survey. Technical Report #2008/1. Oxford University, Future of Humanity Institute.

Shanahan, Murray. 2015. The Technological Singularity. Cambridge, MA: The MIT Press.

Ulam, Stanislaw. 1958. “Tribute to John von Neumann.” Bulletin of the American Mathematical Society 64, no. 3, pt. 2 (May): 1–49.

Vinge, Vernor. 1993. “The Coming Technological Singularity: How to Survive in the Post-Human Era.” In Vision 21: Interdisciplinary Science and Engineering in the Era of Cyberspace, 11–22. Cleveland, OH: NASA Lewis Research Center.


AI - Symbolic Logic

 





In mathematical and philosophical reasoning, symbolic logic entails the use of symbols to express concepts, relations, and positions.

Symbolic logic differs from (Aristotelian) syllogistic logic in that it employs ideographs or a particular notation to "symbolize exactly the item discussed" (Newman 1956, 1852), which may be manipulated according to precise rules.

Traditional logic investigated the truth and falsehood of assertions, as well as their relationships, using terminology derived from natural language.

Unlike nouns and verbs, symbols do not need interpretation.

Because symbol operations are mechanical, they may be delegated to computers.

Symbolic logic eliminates any ambiguity in logical analysis by codifying it entirely inside a defined notational framework.

Gottfried Wilhelm Leibniz (1646–1716) is widely regarded as the founding father of symbolic logic.

Leibniz proposed the use of ideographic symbols instead of natural language in the seventeenth century as part of his goal to revolutionize scientific thinking.

Leibniz hoped that by combining such concise universal symbols (characteristica universalis) with a set of scientific reasoning rules, he could create an alphabet of human thought that would promote the growth and dissemination of scientific knowledge, as well as a corpus containing all human knowledge.

Boolean logic, the logical foundations of mathematics, and decision problems are among the subfields into which symbolic logic may be divided.

George Boole, Alfred North Whitehead and Bertrand Russell, and Kurt Gödel each made important contributions to these fields.

In the mid-nineteenth century, George Boole published The Mathematical Analysis of Logic (1847) and An Investigation of the Laws of Thought (1854).




Boole focused on a calculus of deductive reasoning, which led him to the three essential operations of a logical mathematical language now known as Boolean algebra: AND, OR, and NOT.

The use of symbols and operators greatly aided the creation of logical formulations.

Claude Shannon (1916–2001) employed electromechanical relay circuits and switches to reproduce Boolean algebra in the twentieth century, laying crucial foundations in the development of electronic digital computing and computer science in general.
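
As a simple illustration (mine, not drawn from the sources above), Boole's three operations can be tabulated mechanically in a few lines of Python, underscoring that symbol operations need no interpretation to be carried out:

```python
from itertools import product

# Boole's three essential operations on truth values.
AND = lambda p, q: p and q
OR  = lambda p, q: p or q
NOT = lambda p: not p

# A truth table: every combination of inputs yields a determinate output,
# with no interpretation of what p and q stand for.
print("p     q     p AND q  p OR q  NOT p")
for p, q in product([True, False], repeat=2):
    print(f"{p!s:<5} {q!s:<5} {AND(p, q)!s:<8} {OR(p, q)!s:<7} {NOT(p)!s}")
```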

Alfred North Whitehead and Bertrand Russell produced their seminal work in the field of symbolic logic in the early twentieth century.

Their Principia Mathematica (1910, 1912, 1913) aimed to demonstrate that all of mathematics may be reduced to symbolic logic.

In the first volume of their work, Whitehead and Russell developed a logical system from a handful of logical concepts and a set of postulates derived from those ideas.

In the second volume of the Principia, Whitehead and Russell derived all mathematical concepts, including number, zero, successor of, addition, and multiplication, using fundamental logical terminology and operational principles such as proposition, negation, and either-or.



In the third and final volume, Whitehead and Russell argued that the nature and reality of all mathematics is built on logical concepts and connections.

The Principia showed how every mathematical postulate might be inferred from previously explained symbolic logical facts.

Only a few decades later, Kurt Gödel's On Formally Undecidable Propositions of Principia Mathematica and Related Systems (1931) critically analyzed the Principia's strong and deep claims, demonstrating that Whitehead and Russell's axiomatic system could not be both consistent and complete.

Even so, it required another important book in symbolic logic, Ernst Nagel and James Newman's Gödel's Proof (1958), to spread Gödel's message to a larger audience, including some artificial intelligence practitioners.

Each of these seminal works in symbolic logic had a different influence on the development of computing and programming, as well as our understanding of a computer's capabilities as a result.

Boolean logic has made its way into the design of logic circuits.

The Logic Theorist program by Simon and Newell provided logical arguments that matched those found in the Principia Mathematica, and was therefore seen as evidence that a computer could be programmed to do intelligent tasks via symbol manipulation.

Gödel's incompleteness theorem raises intriguing issues regarding how programmed machine intelligence, particularly strong AI, will be realized in the end.


~ Jai Krishna Ponnappan

Find Jai on Twitter | LinkedIn | Instagram


You may also want to read more about Artificial Intelligence here.


See also: 

Symbol Manipulation.



References And Further Reading


Boole, George. 1854. Investigation of the Laws of Thought on Which Are Founded the Mathematical Theories of Logic and Probabilities. London: Walton.

Lewis, Clarence Irving. 1932. Symbolic Logic. New York: The Century Co.

Nagel, Ernst, and James R. Newman. 1958. Gödel’s Proof. New York: New York University Press.

Newman, James R., ed. 1956. The World of Mathematics, vol. 3. New York: Simon and Schuster.

Whitehead, Alfred N., and Bertrand Russell. 1910–1913. Principia Mathematica. Cambridge, UK: Cambridge University Press.



AI - Symbol Manipulation.

 



The broad information-processing skills of a digital stored program computer are referred to as symbol manipulation.

From the 1960s through the 1980s, seeing the computer as fundamentally a symbol manipulator became the norm, leading to the scientific study of symbolic artificial intelligence, now known as Good Old-Fashioned AI (GOFAI).

In the 1960s, the spread of stored-program computers sparked renewed interest in a computer's programming flexibility.

Symbol manipulation became a comprehensive theory of intelligent behavior as well as a research guideline for AI.

The Logic Theorist, created by Herbert Simon, Allen Newell, and Cliff Shaw in 1956, was one of the first computer programs to mimic intelligent symbol manipulation.

The Logic Theorist was able to prove theorems from Alfred North Whitehead and Bertrand Russell's Principia Mathematica (1910–1913).

It was presented at the Dartmouth Summer Research Project on Artificial Intelligence in 1956 (the Dartmouth Conference).


John McCarthy, a Dartmouth mathematics professor who coined the phrase "artificial intelligence," convened this symposium.


The Dartmouth Conference might be dubbed the genesis of AI since it was there that the Logic Theorist first appeared, and many of the participants went on to become pioneering AI researchers.

The features of symbol manipulation, as a generic process underpinning all types of intelligent problem-solving behavior, were thoroughly explicated only in the early 1960s, after Simon and Newell had built their General Problem Solver (GPS), and they provided a foundation for most of the early work in AI.

In 1961, Simon and Newell took their knowledge of AI and their work on GPS to a wider audience.


"A computer is not a number-manipulating device; it is a symbol-manipulating device," they wrote in Science, "and the symbols it manipulates may represent numbers, letters, phrases, or even nonnumerical, nonverbal patterns" (Newell and Simon 1961, 2012).





Reading "symbols or patterns presented by appropriate input devices, storing symbols in memory, copying symbols from one memory location to another, erasing symbols, comparing symbols for identity, detecting specific differences between their patterns, and behaving in a manner conditional on the results of its processes," Simon and Newell continued (Newell and Simon 1961, 2012).


The growth of symbol manipulation in the 1960s was also influenced by breakthroughs in cognitive psychology and symbolic logic prior to WWII.


Starting in the 1930s, experimental psychologists like Edwin Boring at Harvard University began to move their discipline away from philosophical and behaviorist methods.





Boring challenged his colleagues to break the mind open and create testable explanations for diverse cognitive mental operations (an approach that was adopted by Kenneth Colby in his work on PARRY in the 1960s).

Simon and Newell also emphasized their debt to pre-World War II developments in formal logic and abstract mathematics in their historical addendum to Human Problem Solving—not because all thought is logical or follows the rules of deductive logic, but because formal logic treated symbols as tangible objects.

"The formalization of logic proved that symbols can be copied, compared, rearranged, and concatenated with just as much definiteness of procedure as [wooden] boards can be sawed, planed, measured, and glued [in a carpenter shop]," Simon and Newell noted (Newell and Simon 1973, 877).



~ Jai Krishna Ponnappan

Find Jai on Twitter | LinkedIn | Instagram


You may also want to read more about Artificial Intelligence here.



See also: 


Expert Systems; Newell, Allen; PARRY; Simon, Herbert A.


References & Further Reading:


Boring, Edwin G. 1946. “Mind and Mechanism.” American Journal of Psychology 59, no. 2 (April): 173–92.

Feigenbaum, Edward A., and Julian Feldman. 1963. Computers and Thought. New York: McGraw-Hill.

McCorduck, Pamela. 1979. Machines Who Think: A Personal Inquiry into the History and Prospects of Artificial Intelligence. San Francisco: W. H. Freeman and Company

Newell, Allen, and Herbert A. Simon. 1961. “Computer Simulation of Human Thinking.” Science 134, no. 3495 (December 22): 2011–17.

Newell, Allen, and Herbert A. Simon. 1972. Human Problem Solving. Englewood Cliffs, NJ: Prentice Hall.

Schank, Roger, and Kenneth Colby, eds. 1973. Computer Models of Thought and Language. San Francisco: W. H. Freeman and Company.


Malware Analysis Using Artificial Intelligence.


    Malware detection and analysis have been hot topics in cybersecurity research in recent years. 

    Indeed, the development of obfuscation methods such as packing requires extra caution in order to discover new malware variants. 

    The standard detection techniques do not always provide tools for interpreting the data. 

    As a result, we propose a model based on the conversion of binary data into grayscale images, which achieves an 88 percent accuracy rate. 

    Furthermore, the proposed model has an accuracy of 85 percent in determining whether a sample is packed or encrypted. 

    It enables us to assess data and take relevant action. 


    Furthermore, by using attention mechanisms on detection models, we can determine which parts of a file are suspicious. 

    This kind of tool should be highly valuable for data analysts, since it compensates for standard detection models' lack of interpretability and may help analysts understand why certain harmful files go unnoticed. 


    Introduction.


    The number of viruses and attacks has grown dramatically in recent years. 


    The volume of online submissions to sandboxes such as VirusTotal or Any.run, among others, illustrates this phenomenon. 

    Furthermore, owing to sophisticated evasion techniques, these infections are becoming increasingly difficult to detect. 

    While certain elements of polymorphic malware evolve, its functional aim stays the same. 


    Signature-based detection becomes outdated as a result of these advancements. 


    To handle both massive numbers and complicated malware, researchers and businesses have resorted to artificial intelligence methods. 

    In this study, we focus on static analysis of malware because of computational concerns such as time and resources. 

    Dynamic analysis produces excellent findings, but it causes resource issues for firms with thousands of suspicious files to evaluate since a sandbox might take two to three minutes per file. 



    State of the Art 


    Malware detection and analysis are burgeoning topics of research. 

    Several strategies have been presented in this area in recent years. 


    Signature-based detection is the most widely used detection approach [1,2]. 


    This approach involves storing signatures, which are sections of code from both benign and malicious files. 

    It involves comparing a suspicious file's signature to a signature database. 

    This approach has a flaw in that it requires opening the file first, establishing its type, and recording its signature. 



    Dynamic analysis is another popular and successful strategy. 


    It consists of executing suspicious files in sandbox environments (physical or virtual) [3]. 

    It permits analysts to examine the file's activity without danger. 

    This method is very useful for identifying fresh malware or malware that has been altered using obfuscation methods. 

    However, this approach may be a waste of time and money. 

    Furthermore, some malware detects virtual environments and does not execute in order to conceal its origin and activity. 



    Many static analysis techniques combined with machine learning have been researched in recent works in order to achieve excellent results in malware detection and overcome the shortcomings of signature-based detection and dynamic analysis. 

    The goal of static analysis is to examine a file without executing it in order to determine its purpose and nature. 

    The most natural method is to extract features based on statistics of the binary file's bytes (entropy, distributions, etc.) and then perform binary classification using machine learning techniques (for example, Random Forest, XGBoost, or LightGBM). 


    The quality of detection models is influenced by the features used for training and the quantity of data available. 


    • Anderson et al. [4] provide Ember, a high-quality open dataset for training machine learning methods. 
    • Raff et al. [5], on the other hand, analyze byte sequences derived from binary files using Natural Language Processing methods. 


    Their MalConv approach produces excellent results but needs a significant amount of CPU effort to train. 



    Furthermore, this strategy has recently been demonstrated to be particularly sensitive to padding and GAN-based evasion attacks. 


    • To address these flaws, Fleshman et al. [6] created NonNegative MalConv, which lowers the evasion rate without sacrificing accuracy. 
    • Grayscale images were used by Nataraj et al. [7] to categorize 25 malware families. 


    The authors transform binary information to images and extract significant characteristics using the GIST technique. 

    They used these features to train a K-NN and achieved an accuracy of 97.25 percent. 

    This approach, in addition to having a high classification rate, has the advantage of being more resistant to obfuscation, particularly packing, which is the most common obfuscation strategy. 



    Vu et al. [8] suggested the use of RGB (Red Green Blue) images for malware classification, using their own transformation approach, dubbed Hybrid Image Transformation (HIT), in continuation of this work. 


    They store syntactic information in the green channel of an RGB picture, whereas entropy data is stored in the red and blue channels. 

    Given the growing interest in image recognition, as evidenced by ImageNet [9] and performance improvements [10] over the years, several authors proposed employing a Convolutional Neural Network (CNN) to classify malware using binary data transformed into grayscale images. 


    Rezende [11] used transfer learning on ResNet-50 to classify malware families and obtained a 98.62 percent accuracy rate. 


    Yakura et al. [12] employed the attention mechanism in conjunction with CNN to highlight spots in grayscale pictures that aid categorization. 

    They also link important portions of the code to their disassembled functions. 


    Another major area in malware research is the development of detection models that are resistant to obfuscation tactics. 


    Much existing malware has been modified to make it undetectable. 

    Polymorphic [13] and metamorphic [14] malicious files, for example, use techniques that change the appearance of their code but not its behavior. 



    Malware developers may also change them manually. 


    • To disturb the detection model without modifying its functions, Kreuk et al. [15] insert bytes directly into the binary file. 
    • Another common modification is packing the malware, which is one of the most prevalent ways to get past antivirus protection. 
    • Aghakhani et al. [16] provide an outline of the detection methods' limitations in detecting packed malware. 




    Research Outline


    The study's contributions may be summarized as follows: 


    • A genuine database of complex malware gathered in companies is used to test different detection algorithms. 

    On our own dataset of binary files, we present detection algorithms that employ grayscale image and HIT preprocessing. 

    We compare the outcomes of our models to those of models trained using the Ember dataset and preprocessing (LGBM, XGBoost, DNN). 

     

    • We present models that account for the possibility of binary data being compressed or encrypted. 

    One goal of this strategy is to lower the false positive rate caused by certain models' assumption that modified files are always harmful. 

    Another goal is to equip malware experts with a tool that allows them to learn more about the nature of a suspicious file. 

     

    • To understand the outcomes of our image recognition systems, we use attention techniques. 

    This approach is used to extract the sections of the picture, and therefore the binary, that contributed the most to the classification's score. 

    This information may then be sent on to security experts to help them reverse engineer the virus more quickly. 

    Their feedback is then utilized to better understand and fix algorithm flaws. 


    This work is arranged as follows: 


    1. We describe our dataset and highlight its benefits as well as the various preprocessing techniques used. 
    2. We provide the various models trained using Ember or our own datasets next. The models are compared, and the findings and performances are discussed. 
    3. The next section is devoted to the analysis of changed samples and attention mechanisms, two techniques that might be useful to analysts. 
    4. Lastly, we summarize the findings and bring the work to a close. 



    Dataset and Preprocessing. 


    Binaries Dataset Description.


    There are 22,835 benign and malicious Portable Executable (PE) files in our sample, including packed and encrypted binary files. 





    The figure above depicts the dataset's precise distribution. 

    The malware has been gathered in organizations and on sandboxes, and the innocuous files are produced from harvested Windows executables. 

    The key aspect of this dataset is that the malware is rather difficult to detect. 

    As evidence, certain sandboxes and antivirus applications failed to identify some of these samples. 


    Because our dataset comprises complex and non-generic malware, it should help avoid overfitting during model training. 


    We then use the Ember dataset, which contains 600,000 PE files, to train machine learning algorithms, and we evaluate the results on our own dataset. 

    We divided the dataset into 80 percent training data, ten percent testing data, and ten percent validation data for the image-based algorithm. 

    This distribution is the best for keeping a big enough training sample and a complex enough testing sample. 
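
    As a minimal sketch of such a split (scikit-learn and the placeholder arrays are assumptions; the paper does not name its tooling):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(22835, 4096)            # placeholder image features
y = np.random.randint(0, 2, size=22835)    # placeholder labels (1 = malware)

# First carve off 80% for training, then split the remaining 20%
# evenly into test and validation sets.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
X_test, X_val, y_test, y_val = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0)
```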



    Is the Malware tampered with? 


    The analysis of packed or encrypted executables, which we refer to as "modified" files in the remainder of the article, is a common challenge when doing static analysis. 

    Even though many innocuous executables are modified for industrial or intellectual property reasons, artificial intelligence algorithms will frequently label them as harmful. 

    This is understandable, considering that these operations substantially affect the executable's entropy and byte distribution. 

    Taking the modified nature of binary files into account during the training of detection models is one avenue for improved performance. 

    Use of software like ByteHist [17] before the analysis offers a sense of the nature of a file. 

    Indeed, ByteHist is a program that generates byte-usage histograms for a variety of files, with a concentration on binary executables in the PE format. 


    ByteHist shows us how bytes are distributed in an executable. 


    The more compressed the executable, the more uniform the distribution. 
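
    As an illustration in the spirit of ByteHist (a sketch, not its actual code), the byte histogram and the resulting Shannon entropy can be computed as follows; packed or encrypted files tend toward a flat histogram and an entropy close to 8 bits per byte:

```python
import math
from collections import Counter

def byte_histogram(path):
    """Count occurrences of each byte value (0-255) in a file."""
    data = open(path, "rb").read()
    counts = Counter(data)
    return [counts.get(b, 0) for b in range(256)], len(data)

def shannon_entropy(path):
    """Entropy in bits per byte: approaches 8.0 for packed/encrypted data."""
    hist, total = byte_histogram(path)
    return -sum((c / total) * math.log2(c / total) for c in hist if c)

print(shannon_entropy("sample.exe"))  # hypothetical file path
```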




    The figure above shows the unpacked byte distributions of a malware sample and a benign file, as well as their UPX-transformed analogues. 


    As can be seen, UPX alters the byte distribution of binary files, especially for the malware sample, which shows more alterations than the benign file. 

    UPX is also a popular packer, so unpacking binary files packed with it is simple. 

    Much malware, however, uses more complicated packers, making analysis more challenging. 



    Malware Transformation Using Images. 


    Before we go into how to convert a binary to an image, let's go over why we use images. 

    First and foremost, when a binary is turned into an image, distinct portions of the binary may be plainly viewed, providing a first direction to an analyst as to where to search, as we shall see in the following section. 


    The malware writers may then tweak sections of their files or utilize polymorphism to change their signatures or create new versions, as we outlined in the beginning. 


    Images may record minor changes while maintaining the malware's overall structure. 

    We directly map a static binary to an array of integers between 0 and 255. 


    As a result, each binary is translated into a one-dimensional array of values in [0, 255], which is then reshaped into a two-dimensional array and resized according to the technique of [7]. 


    In other words, the width is governed by the file size. 

    The total length of the one-dimensional array divided by the width gives the image's height. 

    If the file size is not divisible by the width, we round up the height and pad with zeros. 

    This approach converts a binary file to a grayscale image. 

    The main benefit of this method is that it is extremely quick. 

    It just takes a few minutes to process 20,000 binaries. 
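
    A minimal sketch of this conversion (NumPy and Pillow are assumptions, and a fixed width stands in for the size-dependent width table of [7]):

```python
import numpy as np
from PIL import Image

def binary_to_grayscale(path, width=256):
    """Map a binary file to a 2-D grayscale image, one byte per pixel."""
    data = np.frombuffer(open(path, "rb").read(), dtype=np.uint8)
    height = -(-len(data) // width)             # ceiling division
    padded = np.zeros(height * width, dtype=np.uint8)
    padded[:len(data)] = data                   # zero-pad the final row
    return Image.fromarray(padded.reshape(height, width), mode="L")

binary_to_grayscale("sample.exe").save("sample.png")  # hypothetical paths
```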


    Vu et al. [8] discuss several approaches for converting a binary file to an RGB image. 


    Green is the color to which human eyesight is most sensitive, and it carries the greatest coefficient in grayscale conversion, so their color encoding method is based on it. 

    They encode syntactic information into the green channel of an RGB picture using their HIT approach, while the red and blue channels collect entropy information. 

    As a result, clean files will seem to have more green pixels than malicious ones, which have greater entropy and red/blue values. 


    With picture recognition algorithms, this modification produces excellent results. 


    The only disadvantage is the time the transformation takes. 

    The HIT approach takes an average of 25 seconds to convert a binary into an image. 

    The grayscale and HIT transformations of the binary file introduced earlier are shown in the figure below. 






    Static Methods for Detection.


    We examine and evaluate three approaches to malware detection based on static methods and machine learning algorithms in this section: 


    • First, we use the Ember dataset to train three models, each with its own feature extraction strategy. 

    • We next propose a CNN and three hybrid models that identify malware using grayscale images as input. 

    • Finally, we use the HIT approach to train another CNN on RGB images. 




    Binary Files Algorithms. 


    We will evaluate three methods for static analysis: 


    1. XGBoost, 
    2. LightGBM, 
    3. and a deep neural network (DNN). 


    XGBoost [18] is a well-known gradient boosting technique, but it might take a long time to run on a big dataset. 

    As a result, we compare it to LightGBM [19], which Ember uses in conjunction with their dataset. 

    Let's take a brief look at the LightGBM algorithm, which is still relatively little known. 

    It employs a novel approach known as Gradient-based One-Side Sampling (GOSS) to filter out data instances when searching for a split value, while XGBoost uses pre-sorted and histogram-based algorithms to determine the optimal split. 



    In this context, instances correspond to observations. The main advantages of LightGBM are as follows: 


    • Faster training speed and greater efficiency as compared to other algorithms such as Random Forest or XGBoost. 

    • Improved accuracy using a more complicated tree (replaces continuous values with discrete bins, resulting in reduced memory consumption). 


    If we concentrate on it in this research, it is mostly because of its ability to manage large amounts of data. 


    When compared to XGBoost, it can handle huge datasets just as effectively and takes much less time to train. 

    To begin, we train the XGBoost and LightGBM algorithms on the Ember dataset and then put them to the test on our own data. 



    In addition, we train a DNN on the Ember training set, since this kind of model works well with a huge dataset that has many features. 


    To compare models, we utilize the F1 score and the accuracy score. 
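
    For reference, here is a generic scikit-learn sketch (not the authors' evaluation code) computing both metrics on toy labels; F1 is the harmonic mean of precision and recall, 2PR / (P + R):

```python
from sklearn.metrics import accuracy_score, f1_score

y_true = [0, 1, 1, 0, 1, 1]   # toy ground truth (1 = malware)
y_pred = [0, 1, 0, 0, 1, 1]   # toy model predictions

print("accuracy:", accuracy_score(y_true, y_pred))  # 5/6 correct
print("F1:", f1_score(y_true, y_pred))              # balances P and R
```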



    The table below summarizes the findings. 





    The performances of LightGBM and DNN are relatively similar in this table, while XGBoost is less efficient (either in precision or computing time). 




    Grayscale Image Algorithms. 



    We convert our dataset into grayscale images and use them to train a CNN based on the work of Nataraj et al. [7]. 


    Three convolutional layers, a dense layer with a ReLU activation function, and a sigmoid function for binary scoring make up our CNN. 

    In addition, we propose hybrid models integrating the CNN with LightGBM, Random Forest (RF), or a Support Vector Machine (SVM), as inspired by [20]. 


    To begin, we utilize the CNN to reduce dimensionality, compressing each binary image from 4,096 pixels to 256 features. 


    The RF, LightGBM, and SVM models are then trained on these 256 features. 
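
    A minimal Keras/scikit-learn sketch of this hybrid (filter counts, image size, and training details are assumptions; the text only specifies three convolutional layers, a 256-feature ReLU dense layer, and a sigmoid output):

```python
import numpy as np
import tensorflow as tf
from sklearn.ensemble import RandomForestClassifier

# CNN: three conv layers, a 256-unit ReLU dense layer, a sigmoid score.
inputs = tf.keras.Input(shape=(64, 64, 1))          # 64 x 64 = 4,096 pixels
x = inputs
for filters in (16, 32, 64):                        # assumed filter counts
    x = tf.keras.layers.Conv2D(filters, 3, activation="relu")(x)
    x = tf.keras.layers.MaxPooling2D()(x)
features = tf.keras.layers.Dense(256, activation="relu")(
    tf.keras.layers.Flatten()(x))
output = tf.keras.layers.Dense(1, activation="sigmoid")(features)
cnn = tf.keras.Model(inputs, output)
cnn.compile(optimizer="adam", loss="binary_crossentropy")

# After training the CNN, reuse its 256-feature layer as an extractor
# and fit a Random Forest on those features (the CNN + RF hybrid).
X_img = np.random.rand(100, 64, 64, 1)              # placeholder images
y = np.random.randint(0, 2, 100)                    # placeholder labels
cnn.fit(X_img, y, epochs=1, verbose=0)
extractor = tf.keras.Model(inputs, features)
rf = RandomForestClassifier().fit(extractor.predict(X_img), y)
```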

    F1 and accuracy ratings are still utilized to compare models, as seen in the Table. 

    As can be observed, the hybrid model combining CNN and RF performs best among the four grayscale models, although the results are close overall. 


    Furthermore, the results are comparable to those of the LightGBM and DNN described in Sect. 3.1. 


    It's worth noting that the grayscale models are trained using just 19,400 binary files, compared to 600,000 binary files for the prior models. 

    In comparison to conventional models and preprocessing, our grayscale models remain reliable for malware detection with the grayscale transformation and a dataset thirty times smaller. 




    RGB Image Algorithms.



    We now assess our CNN using RGB images and the HIT transformation. 


    F1 and accuracy scores on the test sample are shown in the table. 

    Even though the RGB model outperforms the others previously discussed, training on a local system using RGB images takes a long time, although scoring a single sample is quick. 

    Because of the complexity of the HIT technique, converting binaries to images takes a long time, on average 25 seconds per sample, compared to less than one second for the grayscale conversion. 


    To begin with, adding the time it takes to convert the 24,000 samples significantly lengthens the learning process. 

    Furthermore, the score is received in less than one second when predicting malware, but the time for transforming the binary into an image is added. 

    Given this, using the HIT transformation instead of the grayscale transformation in a corporate setting is impractical, which is why we will not focus on training additional models using HIT preprocessing. 




    Attention Mechanism and Modified Binary Analysis.


    Aside from obtaining the most accurate results possible, the goal is to make them usable by an analyst. 


    To do so, we need to understand why our algorithms produce high or low scores. 

    This not only helps us improve the learning process in the event of a mistake, but it also allows analysts to know where to look. 

    To make malware easier to comprehend and analyze, we offer two approaches: 

    • The first technique is to utilize knowledge about the nature of binary files to train our algorithms. 

    In particular, we know whether the training set's binary files have been modified. 


    The goal is to limit the number of false positives produced by these two obfuscation approaches while simultaneously providing additional information about the new suspicious files. 


    • The second approach is the use of an attention mechanism on models trained with grayscale images. 

    To detect suspicious patterns, we can create a heatmap using the attention mechanism. 

    Furthermore, the heatmap aids in comprehending the malware detection model's results. 



    Binaries that have been modified. 


    We also present two models that are trained while taking into consideration the modified nature of the binary file, in order to lower the false positive rate due to obfuscation. 


    Both models utilize grayscale images as their input. 


    1. The first model is a CNN that outputs two pieces of information about the binary file: whether it is malware or not, and whether it is obfuscated or not (a sketch follows this list). 

    We thus obtain dual knowledge of the binary file's attributes from a single CNN. 

    The F1 score for this model is 0.8924, and the accuracy score is 0.8852. 


    2. The second model is a three-CNN superposition. 

    The first is used to classify binary files as obfuscated or not, with an accuracy of 85 percent. 

    The two others are used to determine whether a binary file is malicious or benign, and they are trained on modified and unmodified binary files, respectively. 


    The key benefit of this approach is that each CNN may be retrained independently from the other two. 


    They also employ various architectures in order to improve the generalization of the data they use to train them. 

    This model has an F1 score of 0.8797 and an accuracy score of 0.8699. 
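
    Before comparing the two, here is a sketch of the first model's dual-output idea (the backbone layers are assumptions; only the two sigmoid heads, one for maliciousness and one for modification, follow the text):

```python
import tensorflow as tf

inputs = tf.keras.Input(shape=(64, 64, 1))
x = tf.keras.layers.Conv2D(32, 3, activation="relu")(inputs)
x = tf.keras.layers.MaxPooling2D()(x)
x = tf.keras.layers.Flatten()(x)
x = tf.keras.layers.Dense(128, activation="relu")(x)

# Two heads trained jointly: one predicts malware vs. benign,
# the other predicts whether the file is packed or encrypted.
malware = tf.keras.layers.Dense(1, activation="sigmoid", name="malware")(x)
modified = tf.keras.layers.Dense(1, activation="sigmoid", name="modified")(x)

model = tf.keras.Model(inputs, [malware, modified])
model.compile(optimizer="adam",
              loss={"malware": "binary_crossentropy",
                    "modified": "binary_crossentropy"})
```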



    As can be seen, the first model performs better than the second. 


    It can also tell whether a binary file has been modified, with an accuracy of 84 percent. 

    This information might aid malware analysts in improving their knowledge. 

    This may explain why some innocuous files are mistakenly identified as malware. 


    Furthermore, if certain suspicious files are modified and the result of malware detection is ambiguous, it may encourage the use of sandboxes. 




    Results Interpretability and Most Important Bytes 


    In this section, we provide a strategy that may assist analysts in interpreting the findings of detection models based on the conversion of binary data into grayscale images. 

    A grayscale image representation of an executable shows different textures depending on the corresponding sections of the file [21]. 

    We can extract information from the binary using tools like pefile and observe the connection between the PE file and its grayscale image representation. 


    The figure displays an executable converted to an image, the PE file's matching sections (left), and information about each texture (right). 





    The connections between the PE file and the grayscale picture might help an analyst rapidly see the binary file's regions of interest. 

    To proceed with the investigation, it will be important to determine which elements of the picture contributed to the malware detection algorithm's conclusions. 

    To achieve this, we use attention techniques, which consist of highlighting the pixels that have the most impact on our algorithm's prediction score. 


    With our own CNN provided in Sect. 3.2, we employ GradCAM++ [22]. 


    The GradCAM++ method retrieves from the CNN the pixels that have the most effect on the model's conclusion, i.e., those that indicate whether the file is benign or malicious. 

    It produces a heatmap, which may be read as follows: the warmer the coloring, the more that image region influenced the CNN's prediction. 
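
    As a rough sketch of the underlying computation (plain Grad-CAM rather than GradCAM++, which refines the pixel weighting; the model and layer name are assumed to come from a CNN like the one in Sect. 3.2):

```python
import numpy as np
import tensorflow as tf

def gradcam_heatmap(model, image, last_conv_name):
    """Weight the last conv layer's feature maps by the gradient of the
    malware score, then ReLU: warm regions drove the prediction."""
    grad_model = tf.keras.Model(
        model.inputs,
        [model.get_layer(last_conv_name).output, model.output])
    with tf.GradientTape() as tape:
        conv_out, score = grad_model(image[np.newaxis, ...])
        score = score[:, 0]                        # malware probability
    grads = tape.gradient(score, conv_out)
    weights = tf.reduce_mean(grads, axis=(1, 2))   # average over space
    cam = tf.einsum("bhwc,bc->bhw", conv_out, weights)
    cam = tf.nn.relu(cam)[0]                       # keep positive evidence
    return (cam / (tf.reduce_max(cam) + 1e-8)).numpy()
```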


    Heatmaps of the four binaries described in Sect. 2.2 are shown in the figure below. 





    We see that the malware and its packed form have different activation zones. 

    The same holds for the benign file. 

    In addition, since the packed malware's byte distribution has been drastically modified, we find that more zones and pixels light up than in the original malware. 

    This indicates that the model requires more data to make a decision. 

    On the other hand, this kind of representation could be useful in figuring out why a binary file has been misclassified. 


    Padding, for example, is a typical evasion tactic that involves adding byte sequences to a file to artificially expand its size and deceive the antivirus. 

    This kind of technique is readily detectable via the image representation. 





    However, as shown in the figure above, the padding zone is treated as a significant portion of the file in both cases. 

    The innocuous file is misclassified and branded as malware, even though the malware itself is successfully detected. 

    As a result, padding is being treated as a malicious indicator. 

    This information might be used to improve the malware detection model's performance. 


    The activation map on binary file images appears to be a convenient and useful tool for malware researchers. 


    To fully maximize the promise of this method, however, further research is required. 

    Indeed, we demonstrate the use of heatmaps for analyzing packing and padding in binary files, but there are many other obfuscation approaches. 

    We also concentrate on malware and benign images here, but an extension of this technique would be to extract code directly from the binary file based on the hot zones of the associated heatmap. 




    Results and conclusion



    Let us begin by summarizing the findings before concluding this paper. 

    Indeed, the CNN trained on RGB images using HIT produces the best results. 

    The transformation time, however, is far too long for effective industrial deployment. 

    The DNN and LightGBM models, in turn, demonstrate the expected efficacy of the Ember dataset. 


    Our models are marginally less efficient than theirs, since they use grayscale images as input. 


    However, the results are comparable despite a training sample thirty times smaller. 

    Finally, two models were trained on grayscale images together with information on whether the original binary file had been modified, indicating that this approach can also be used to identify malware. 

    They also provide more information about binary files than standard detection models. 


    The CNN algorithm, when combined with RF, LGBM, and SVM, has promising detection potential. 


    In future work, we will concentrate on determining the capacity and limits of these models. 

    In this post, we have discussed several static malware detection techniques. 

    The processing time needed for dynamic malware analysis in a sandbox is a recurrent issue in many businesses. 

    However, we know that in certain cases, such as highly customized malware, dynamic analysis is the only option. 

    We make no claim to be able to replace this analysis; rather, we provide an overlay. 


    Our algorithms allow us to swiftly evaluate a vast volume of malware and decide which samples are harmful, while also moderating the outcome based on whether the binary has been modified. 


    This will enable us to do dynamic analysis solely on those binaries, saving time and resources. 

    Furthermore, analyzing the image's most essential pixels and regions might reveal valuable information to analysts. 


    By indicating where to look within a suspicious binary, this may save time in the deeper investigation. 


    We will focus on attention processes in future research. 

    The goal is to link the regions of relevance, as determined by attention mechanisms, with the harmful code connected with them to aid analysts in their job. 

    On the other hand, we wish to apply reinforcement learning to understand and prevent malware evasion strategies.



    ~ Jai Krishna Ponnappan

    Find Jai on Twitter | LinkedIn | Instagram


    You may also want to read and learn more Cyber Security Systems here.

    You may also want to read more about Artificial Intelligence here.




    References And Further Reading


    1. Sung, A.H., Xu, J., Chavez, P., Mukkamala, S.: Static analyzer of vicious executables. In: 20th Annual Computer Security Applications Conference, pp. 326–334. IEEE (2004)

    2. Sathyanarayan, V.S., Kohli, P., Bruhadeshwar, B.: Signature generation and detection of malware families. In: Australasian Conference on Information Security and Privacy, pp. 336–349. Springer (2008)

    3. Vasilescu, M., Gheorghe, L., Tapus, N.: Practical malware analysis based on sandboxing. In: Proceedings - RoEduNet IEEE International Conference, pp. 7–12 (2014)

    4. Anderson, H.S., Roth, P.: Ember: an open dataset for training static PE malware machine learning models. arXiv preprint arXiv:1804.04637 (2018)

    5. Raff, E., Barker, J., Sylvester, J., Brandon, R., Catanzaro, B., Nicholas, C.K.: Malware detection by eating a whole exe. arXiv preprint arXiv:1710.09435 (2017)

    6. Fleshman, W., Raff, E., Sylvester, J., Forsyth, S., McLean, M.: Non-negative networks against adversarial attacks. arXiv preprint arXiv:1806.06108 (2018)

    7. Nataraj, L., Karthikeyan, S., Jacob, G., Manjunath, B.S.: Malware images: visualization and automatic classification. In: Proceedings of the 8th International Symposium on Visualization for Cyber Security, pp. 1–7 (2011)

    8. Vu, D.L., Nguyen, T.K., Nguyen, T.V., Nguyen, T.N., Massacci, F., Phung, P.H.: A convolutional transformation network for malware classification. In: 2019 6th NAFOSTED Conference on Information and Computer Science (NICS), pp. 234–239. IEEE (2019)

    9. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: a large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE (2009)

    10. Alom, M.Z., et al.: The history began from alexnet: a comprehensive survey on deep learning approaches. arXiv preprint arXiv:1803.01164 (2018)

    11. Rezende, E., Ruppert, G., Carvalho, T., Ramos, F., De Geus, P.: Malicious software classification using transfer learning of resnet-50 deep neural network. In: 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 1011–1014. IEEE (2017)

    12. Yakura, H., Shinozaki, S., Nishimura, R., Oyama, Y., Sakuma, J.: Malware analysis of imaged binary samples by convolutional neural network with attention mechanism. In: Proceedings of the Eighth ACM Conference on Data and Application Security and Privacy, pp. 127–134 (2018)

    13. Sharma, A., Sahay, S.K.: Evolution and detection of polymorphic and metamorphic malwares: a survey. Int. J. Comput. Appl. 90(2), 7–11 (2014)

    14. Zhang, Q., Reeves, D.S.: MetaAware: identifying metamorphic malware. In: Proceedings - Annual Computer Security Applications Conference (ACSAC) 2007, pp. 411–420 (2008)

    15. Kreuk, F., Barak, A., Aviv-Reuven, S., Baruch, M., Pinkas, B., Keshet, J.: Adversarial examples on discrete sequences for beating whole-binary malware detection. arXiv preprint arXiv:1802.04528, pp. 490–510 (2018)

    16. Aghakhani, H.: When malware is packin' heat; limits of machine learning classifiers based on static analysis features. In: Network and Distributed Systems Security (NDSS) Symposium 2020 (2020)

    17. Wojner, C.: ByteHist. https://www.cert.at/en/downloads/software/software-bytehist

    18. Chen, T., Guestrin, C.: Xgboost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794 (2016)

    19. Ke, G.: Lightgbm: a highly efficient gradient boosting decision tree. In: Advances in Neural Information Processing Systems, vol. 30, pp. 3146–3154 (2017)

    20. Xiao, Y., Xing, C., Zhang, T., Zhao, Z.: An intrusion detection model based on feature reduction and convolutional neural networks. IEEE Access 7, 42210–42219 (2019)

    21. Conti, G., et al.: A visual study of primitive binary fragment types. Black Hat USA, pp. 1–17 (2010)

    22. Chattopadhay, A., Sarkar, A., Howlader, P., Balasubramanian, V.N.: GradCAM++: generalized gradient-based visual explanations for deep convolutional networks. In: 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 839–847. IEEE (2018)




