Cyber Crime Classification and Analysis of Various Indian States and Union Territories During 2011 to 2022 Using Data Mining Techniques

This paper attempts to assess the performance and classify the cybercrime rate of Indian states, covering twenty-seven states and seven union territories, during the period 2011 to 2022. The secondary data were collected from the National Crime Records Bureau of India. Initially, the researchers selected 27 parameters related to crime. After data mining, a few variables were discarded from the analysis as outliers. The remaining twenty parameters were retained for further analysis; they include variables such as hacking the system, sexual exploitation, business, personal emotion, purchase of illegal drugs, spreading piracy, motives of blackmailing, etc. The main objectives are (i) to identify the states in which the crime rate is highest, (ii) to identify the factors that most influence cybercrime, and (iii) to classify performance using the k-means clustering technique. The results are classified and labelled as High Cyber Crime Rate States (HCCRS), Moderate Cyber Crime Rate States (MCCRS) and Low Cyber Crime Rate States (LCCRS).


Introduction
This research paper addresses state and national security, which has become significantly more important since the attacks on the Indian Parliament and Mumbai, as well as local and international terrorism, social media crime, etc.
To stop future attacks and hacking of government networks, the Indian security agencies are actively gathering local and international intelligence data. In response, local security forces have been encouraged to keep a closer eye on cybercrime activity throughout all Indian states and union territories. Given that India has the second-largest population in the world after China, law enforcement and intelligence agencies in all Indian states face significant challenges in reliably and effectively obtaining crime data and analyzing the massive volume of data. Digital India is encouraged and implemented by the current administration. Nowadays, the majority of urban and rural residents use the internet, mobile devices, net banking, and other channels to conduct online transactions. Because heavy network traffic and frequent online transactions produce vast amounts of data, only a small part of cybercrime can be detected.
Data mining is a powerful tool that enables criminal investigators, who may lack extensive training as data analysts, to explore large amounts of data quickly and efficiently [1]. A computer can process thousands of commands in seconds, saving valuable time. In addition, installing and training software often costs less and produces fewer errors than human analysts, especially those who work extensive hours. This research paper analyses the cybercrime data of the Indian states using the Orange data mining software and concludes with results and suggestions.

Literature Review
Numerous scholars have recently examined data on cybercrime and shared their own opinions based on their databases. Applied data mining techniques are used to study crime databases and other related fields. These approaches are mostly concerned with classification, clustering, data reduction, social network analysis, etc. [2]. Other strategies advocate using log files as historical information to search for relationships based on how frequently incidents occur [3]. The government typically creates institutions such as courts, prosecutors, and police that are in charge of upholding law and order in their respective nations. Controlling the frequency and occurrence of crimes is the responsibility of these agencies and other associated entities. Crime prevention plans must be developed and put into action by the crime prevention organizations [4].

Database and Cyber Crime Parameters
This section discusses the database and the cybercrime parameters selected for the analysis using data mining techniques.

Database
The cybercrime data considered for this study is published by the National Crime Records Bureau and covers major crime records during the period 2011-2022 in all the Indian states. Out of 27 variables, a few parameters were excluded from the database after applying data mining techniques, and the remaining 20 variables are considered for the present analysis [5]. A few of the variables are listed in Table 1.

The Crime Parameters
Cybercrime parameters are simple and easy to understand. Many researchers have used them to analyze aspects of cybercrime conditions and performance. Recently, cybercrime parameters have been used to find natural groups in large databases using factor analysis and k-means clustering techniques.

Materials And Methods
Three data mining tools are applied to the Indian cybercrime data, viz., factor analysis, the k-means clustering technique and classification methods, which are used to assess the cybercrime rate.

Data Mining Techniques
Data mining is often used interchangeably with Knowledge Discovery in Databases (KDD), which is the non-trivial process of extracting implicit, previously undiscovered, and potentially beneficial knowledge from data. In this study, data mining uses the techniques of factor analysis and k-means clustering to show structural trends. The final stage of data mining involves presenting the user with the structures that were discovered in the data. Data mining is a relatively new concept, but the technology is not. Fig. 1 illustrates the iterative sequence that makes up a knowledge discovery process in general.

Fig. 1: Data mining iterative sequence
In order to put their company's results in perspective, mining also enables business owners to ascertain the effects of sales, customer satisfaction, and corporate profitability. The most crucial phases of the mining process are data mining and knowledge presentation, which make new and previously undetected structural patterns in the data visible [6].

Orange Data Mining k-means Clustering Algorithms
The Orange data mining k-Means widget applies the k-means clustering technique to the cybercrime data and produces a new data set in which the cluster index is used as a class attribute. If an original class attribute exists, it is moved to the meta-attributes. The widget also displays scores for the clustering results over various values of k. The cybercrime data is categorized using the k-means clustering procedure below.
Step 1: Select the number of clusters using a distance measure with their centroids. The distances are calculated using the arithmetic means of the clusters.
Step 2: Select the initialization method, with k = 2, 3, 4.
Step 3: k-Means++: the first center is selected randomly, and subsequent centers are chosen from the remaining points with probability proportional to the squared distance from the closest center.
Step 4: Random initialization: the clusters are assigned randomly at first and then updated with further iterations.
Step 5: Re-runs (how many times the algorithm is run) and maximal iterations (the maximum number of iterations within each algorithm run) can be set manually.
Step 6: The widget outputs a new data set with appended cluster information. Select how to append the cluster information (as class, feature or meta-attribute) and name the column.
Step 7: If Run on every change is ticked, the widget will commit changes automatically [5].
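The initialization, re-run and iteration settings in the steps above can be sketched outside the widget as follows. This is a minimal pure-Python illustration, not the Orange implementation: the function names (kmeans_pp_init, kmeans, best_of_reruns), the default parameter values and the six sample points are assumptions introduced here and are not taken from the NCRB data.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_init(points, k, rng):
    """k-means++ seeding: the first center is picked uniformly at random; each
    later center is drawn with probability proportional to its squared
    distance from the closest center chosen so far."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        weights = [min(squared_distance(p, c) for c in centers) for p in points]
        if sum(weights) == 0:  # every point coincides with a chosen center
            centers.append(rng.choice(points))
        else:
            centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


def kmeans(points, k, max_iter=300, seed=0):
    """One run: assign points to the nearest center, recompute the means,
    and stop when the labels no longer change or max_iter is reached."""
    rng = random.Random(seed)
    centers = kmeans_pp_init(points, k, rng)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    inertia = sum(squared_distance(p, centers[lab]) for p, lab in zip(points, labels))
    return labels, centers, inertia


def best_of_reruns(points, k, n_runs=10, max_iter=300):
    """Re-run with different seeds and keep the run with the lowest inertia."""
    runs = (kmeans(points, k, max_iter=max_iter, seed=s) for s in range(n_runs))
    return min(runs, key=lambda run: run[2])


# Hypothetical two-variable scores for six states (illustrative only).
points = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.1, 4.9), (9.0, 1.0), (9.2, 1.1)]
labels, centers, inertia = best_of_reruns(points, k=3)
```

The returned labels play the role of the appended cluster column in Step 6; keeping the lowest-inertia run is one common way to use re-runs, assumed here rather than confirmed by the source.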

Orange Data Mining Algorithms
Using the Orange data mining software, a schema is drawn with utmost care as per the research requirement. The step-by-step construction of the schema is given below and represented in Figure 2.
Step 1: Select the File widget and load the database as a file in .tab or .xls format.
Step 2: Select a Data Table widget and connect it to the File widget; the File widget is then connected to the Distances widget.
Step 3: Select the k-Means clustering widget and connect it to the selected rows and the Data Table widget.
Step 4: Finally, double-click the File widget, Data Table widget, selected rows widget and k-Means clustering widget one by one. All these widgets assign their output and display their results in the report and output window.
Step 5: Open the Scatter Plot widget, which shows the two-dimensional k-means clustering results of cybercrime with the labels of all districts in the study period [7].
In the following sections, the results are interpreted based on the cybercrime data [5]. The schema used to explore the widgets is depicted in Figure 2.

Factor Analysis
Various types of factor analysis are utilized to examine the stability of cybercrime trends across the study period. Although there are several methods for reducing data and variables, factor analysis is by far the most widely used approach. Like any other data reduction technique, factor analysis reduces the variable space under examination to fewer patterns while retaining most of the information in the original data matrix. In the present study, principal component analysis is applied first to identify structural patterns using linear combinations of the state-specific cybercrime factors.
When using the factor extraction approach, the top m factors, which together account for 85% of the variance, are regarded as significant. Orthogonal rotations such as Varimax and Quartimax can be used, by examining the factor loadings of the variables, to assess how similar two variables are. In factor analysis, the parameter of the factor model that estimates the values of the common factors is the main point of interest [8].
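The 85% rule for choosing the top m factors can be illustrated with a short sketch. The function name factors_for_variance and the eigenvalues in the example are hypothetical; in practice the eigenvalues would come from the correlation matrix of the cybercrime variables.

```python
def factors_for_variance(eigenvalues, threshold=0.85):
    """Return the smallest m such that the m largest eigenvalues account for
    at least `threshold` of the total variance (the sum of all eigenvalues)."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for m, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= threshold:
            return m
    return len(ordered)


# Hypothetical eigenvalues: cumulative shares are 0.50, 0.80, 0.90, ...
example_eigenvalues = [5.0, 3.0, 1.0, 0.5, 0.5]
m = factors_for_variance(example_eigenvalues)  # 3 for these values
```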

k-Means Clustering Algorithm
Clustering methods are frequently used for classification problems in data mining applications. Since no assumptions are made about the group structures existing in the database, the non-hierarchical clustering approach proposed by MacQueen, also known as unsupervised classification, is used in the current work. In applied statistics, the k-means clustering method is used to identify appropriate classes [9].
This method divides the data set into mutually exclusive groups, with the members of each group as close to one another as possible and the members of different groups as far apart as possible. The method typically computes the Euclidean distance between observations using the variables.
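In standard notation, the grouping described above corresponds to minimizing the within-cluster sum of squared Euclidean distances. The symbols used here (observations x_i with p variables, clusters C_1, ..., C_k, and cluster means mu_j) are introduced for illustration and do not appear in the source.

```latex
\[
d(x_i, \mu_j) = \sqrt{\sum_{v=1}^{p} \left( x_{iv} - \mu_{jv} \right)^{2}},
\qquad
\min_{C_1, \dots, C_k} \; \sum_{j=1}^{k} \sum_{x_i \in C_j} d(x_i, \mu_j)^{2},
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i .
\]
```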

Pruning Method
In order to remove the outliers, a method to prune the data for each of the study periods is described below:
Step 1: Factor analysis is applied to find the structural pattern underlying the data set.
Step 2: k-means analysis is used to partition the data set into k clusters using the cybercrime parameters as the input matrix.
Step 3: Repeat Steps 1 and 2, removing outliers in each cycle, until meaningful groups are obtained; an outlier is a group with only a few cybercrime parameters.
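The pruning cycle above might be sketched as the loop below. It is an illustration under stated assumptions, not the authors' procedure: the factor analysis step is omitted, the clustering step is passed in as a caller-supplied function (cluster_fn, hypothetical), and "only a few" is interpreted as a cluster smaller than an assumed min_size.

```python
from collections import Counter


def prune_small_clusters(items, cluster_fn, min_size=2, max_rounds=10):
    """Repeatedly cluster `items` and drop the members of any cluster with
    fewer than `min_size` members, until no such cluster remains."""
    kept = list(items)
    removed = []
    for _ in range(max_rounds):
        if not kept:
            break
        labels = cluster_fn(kept)  # one label per item, in order
        sizes = Counter(labels)
        outliers = [x for x, lab in zip(kept, labels) if sizes[lab] < min_size]
        if not outliers:
            break
        removed.extend(outliers)
        kept = [x for x, lab in zip(kept, labels) if sizes[lab] >= min_size]
    return kept, removed


def toy_cluster_fn(values):
    """Stand-in for k-means: groups one-dimensional values by tens."""
    return [round(v / 10) for v in values]


# Hypothetical values: 50 forms a cluster of size one and is pruned.
kept, removed = prune_small_clusters([1, 2, 3, 11, 12, 50], toy_cluster_fn)
```

In a fuller version, cluster_fn would wrap the factor analysis and k-means steps so that both are recomputed on the pruned data in every cycle.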

Results and Discussion
The number of classes was set to 3, the value that gave relevant interpretations, and the researcher discusses both the Varimax and Quartimax criteria of orthogonal rotation applied to the pruned data, aggregated by the pruning method for different values of k. Both techniques of component analysis produced very similar results, although the Varimax rotation offered a comparatively stronger clustering of cybercrime parameters. Factor analysis consistently found five factors, with eigenvalues close to or equal to unity, that account for 85% of the overall variation in the data across the research period (Table 3). This research shows that the clustering of the variables related to cybercrime is unstable during the study period. Slight changes are encountered in the original database due to statistical variations. With only 3 clusters to take into account, there are 3 classes. In the data mining process, factor analysis is followed by assigning initial group labels to the cybercrime data, and then by iterative discriminant analysis using each of the three suggested methodologies. Although results were obtained for every method and study period processed by the suggested algorithms, only the summary statistics are reported in Table 2; after the application of discriminant analysis, zero percent misclassification was achieved. Sample scatter diagrams are shown in Figures 3 and 4.
Available online at: https://jazindia.com

Conclusion
The Orange data mining program effectively displays the outcome visually. In this work, data on cybercrime are analyzed using Orange k-means clustering, factor analysis, and classification techniques. The approaches of grouping and classification produced perfect results. In this survey, states with high rates of cybercrime include Uttar Pradesh, followed by Karnataka, while those with moderately high rates include Maharashtra, Rajasthan, and Telangana. The remaining states have low rates of cybercrime. Among the union territories, Delhi has the highest rate of cybercrime; Chandigarh has

Fig. 4: Cyber Crime Classification for the study period (Piracy)
Finally, the two methods produced three clusters based on the cybercrime data, labelled as High Cyber Crime Rate States (HCCRS), Moderate Cyber Crime Rate States (MCCRS) and Low Cyber Crime Rate States (LCCRS). In addition, the cybercrime data gives the same results over the study period using data mining tools such as Neural Network Classification, Self-Organizing Map, Support Vector Machine, the Expectation Maximization (EM) algorithm, the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm, etc.

Table 1: Variables and variable names