**Statistics, Cybersecurity [Year 2023 - 24]**

Topics on Statistics with intensive computer applications

$ \int_0^t d S_u = \int_0^t \mu(S_u, u) du + \int_0^t\sigma(S_u, u) dW_u $

*Support for the course and for remote teaching, by T. Gastaldi #Sapienzanonsiferma #Sapienzadoesnotstop*

(Instructor: tommaso.gastaldi@gmail.com, https://www.datatime.eu/public/cybersecurity/)

**WhatsApp group for the students of this course**

Invitation to join the WhatsApp group for this course: https://chat.whatsapp.com/Kk3wRGmmxWH9RNUo01zFdX

(When first joining, send a message with your name and id ("matricola"))

____________________________________________________________________________________

-Implement exercises in both a WinForms version (C#, or VB.NET if you prefer, etc.) and a web version (JS). For JS, always use the latest ECMAScript (use classes, let, const, no var, etc.) and

-All important code must be shown and possibly discussed (the crucial parts only) on the homework web page so that one can understand the main points.

(Full version can be stored on github or as zip file containing the "solution", if you like, but that is not required.)

-Never use any third-party library or higher-level languages (e.g., SAS, R, Python, MATLAB, Minitab, etc.), because our purpose is to actually implement the very basics from scratch to deeply understand our topics. (Using other people's "black boxes" would completely defeat our learning purpose.)

-Always exercise your capacity for abstraction. Never write algorithms that work only on specific cases or data; on the contrary, try to be as general as possible in all of your creations and logic. Use smart personal implementations to show your intelligence and insight! Originality and deep thinking are the most appreciated values in this course.

-Always acknowledge your sources and use quotes when you just copy and paste text from other sources (note that what you copy may be wrong!).

- What is Statistics and its relationship with other disciplines. Difference between Descriptive and Inferential Statistics.

- Describe the concepts of Population, Sample, Attribute, Variable, Level of measurement and Dataset.

- Briefly describe the main sampling methods

- Briefly describe the main experiment designs

- Download Visual Studio

- Write a program in C# or VB.NET that creates a window containing a single line, point, circle, rectangle

- Write a program in JavaScript or TypeScript that creates a window containing a single line, point, circle, rectangle.

Some resources:

https://en.wikipedia.org/wiki/Variable_and_attribute_(research)

https://www.investopedia.com/terms/s/statistics.asp

https://www.scribbr.com/methodology/sampling-methods/#:~:text=Probability%20sampling%20methods%20include%20simple,a%20chance%20of%20being%20included.

https://en.wikipedia.org/wiki/Design_of_experiments

https://www.surveymonkey.com/mp/open-ended-questions-get-more-context-to-enrich-your-data/#:~:text=open%2Dended%20questions%3F-,So%20what%20are%20open%2Dended%20questions%3F,or%20other%20closed%2Dended%20format.

https://en.wikipedia.org/wiki/Level_of_measurement

https://www.youtube.com/watch?v=uHRqkGXX55I&ab_channel=SimpleLearningPro

https://www.youtube.com/watch?v=EZrP_av3cmA&ab_channel=SimpleLearningPro

https://www.youtube.com/watch?v=pTuj57uXWlk&ab_channel=SimpleLearningPro

https://www.youtube.com/watch?v=10ikXret7Lk&ab_channel=SimpleLearningPro

1.

Choose 3 variables from our surveys:

- one Qualitative

- one Quantitative discrete

- one Quantitative continuous (use

Create the most efficient algorithms to compute the frequency (absolute/relative/percentage) distribution of:

- the 3 variables

- the joint distribution of 2 variables (use a general "logic", where the number of variables could be any k = 2, 3, ...).

Double-check/compare the results using the functionalities of a DBMS of your choice (e.g., Access, Oracle online, PostgreSQL, ...) wherever possible.
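
As an illustration of the "general logic" asked for above, here is a minimal sketch (not the reference solution) that tabulates any number of variables in a single pass using a `Map`. The function name `frequencyDistribution` and the sample `survey` data are hypothetical and only serve the example.

```javascript
// Frequency distribution of one or more variables, computed in a single pass.
// `rows` is an array of observations (plain objects); `variables` lists the
// property names to tabulate jointly (1 name = univariate, 2 or more = joint).
const frequencyDistribution = (rows, variables) => {
  const cells = new Map(); // key: serialized tuple of values
  for (const row of rows) {
    const values = variables.map((name) => row[name]);
    const key = JSON.stringify(values);
    const cell = cells.get(key);
    if (cell) {
      cell.count += 1;
    } else {
      cells.set(key, { values, count: 1 });
    }
  }
  const n = rows.length;
  return [...cells.values()].map(({ values, count }) => ({
    values,
    absolute: count,
    relative: count / n,
    percentage: (100 * count) / n,
  }));
};

// Hypothetical survey data, for illustration only.
const survey = [
  { color: "red", siblings: 1, height: 171.5 },
  { color: "blue", siblings: 0, height: 180.2 },
  { color: "red", siblings: 1, height: 165.0 },
  { color: "red", siblings: 2, height: 171.5 },
];

const byColor = frequencyDistribution(survey, ["color"]);
const byColorAndSiblings = frequencyDistribution(survey, ["color", "siblings"]);
```

The same function covers the univariate and the joint case because the key is always a tuple of values; the cost is one pass over the data plus one hash lookup per observation.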

2.

For the following most important data structures (or others that you may want to suggest) recall how to:

-loop (break/continue)

-add/remove/get/set/check the existence of key/value

array, list, dictionary, sorted list, hashset, sortedset, queue, stack, linkedlist (or any other structure you think to be useful)

Also note your findings, very concisely, in your JS cheatsheet and, in case a corresponding JS object does not exist, create a simple equivalent class with all the necessary corresponding methods, so that it can be used similarly to C# (or VB.NET, if you prefer).
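
For instance, JavaScript has no built-in queue. The following is one possible sketch of an equivalent class, loosely modeled on the method names of the C# `Queue<T>` (`Enqueue`, `Dequeue`, `Peek`, `Count`, `Contains`, `Clear`); the linked-node implementation is an illustrative choice, not a prescribed one.

```javascript
// Minimal FIFO queue with O(1) enqueue and dequeue, built on linked nodes.
class Queue {
  #head = null;
  #tail = null;
  #count = 0;

  get count() {
    return this.#count;
  }

  enqueue(item) {
    const node = { item, next: null };
    if (this.#tail) {
      this.#tail.next = node;
    } else {
      this.#head = node;
    }
    this.#tail = node;
    this.#count += 1;
  }

  dequeue() {
    if (!this.#head) throw new Error("Queue is empty");
    const { item } = this.#head;
    this.#head = this.#head.next;
    if (!this.#head) this.#tail = null;
    this.#count -= 1;
    return item;
  }

  peek() {
    if (!this.#head) throw new Error("Queue is empty");
    return this.#head.item;
  }

  contains(item) {
    for (const current of this) {
      if (Object.is(current, item)) return true;
    }
    return false;
  }

  clear() {
    this.#head = null;
    this.#tail = null;
    this.#count = 0;
  }

  // Makes the queue usable in for...of loops (with break/continue) and spread.
  *[Symbol.iterator]() {
    for (let node = this.#head; node; node = node.next) {
      yield node.item;
    }
  }
}

const queue = new Queue();
queue.enqueue("a");
queue.enqueue("b");
queue.enqueue("c");
const first = queue.dequeue();
```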

3.

Generate N uniform random variates in [0,1) and determine the distribution into

Play with N and k values and draw some conclusion on the "shape" of the distribution.
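
A minimal sketch of this experiment follows, assuming (the original sentence is cut off) that k denotes the number of equal-width class intervals of [0,1). The function name `uniformClassCounts` and the injectable `random` parameter are illustrative choices.

```javascript
// Counts how many of n uniform variates in [0, 1) fall into each of k
// equal-width class intervals [j/k, (j+1)/k), j = 0, ..., k-1.
// ASSUMPTION: k is the number of equal-width classes (not stated in the text).
const uniformClassCounts = (n, k, random = Math.random) => {
  const counts = new Array(k).fill(0);
  for (let i = 0; i < n; i += 1) {
    const u = random(); // u in [0, 1)
    const j = Math.min(k - 1, Math.floor(u * k)); // class index, clamped for safety
    counts[j] += 1;
  }
  return counts;
};

// Relative frequencies for N = 100000 and k = 10: each should be close to 1/k.
const uniformCounts = uniformClassCounts(100000, 10);
const uniformRelative = uniformCounts.map((c) => c / 100000);
```

No search over the intervals is needed: the class index is obtained directly from the value, so the cost is constant per variate.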

Some resources:

https://online.stat.psu.edu/stat800/lesson/1/1.1

https://en.wikipedia.org/wiki/Frequency_(statistics)

https://en.wikipedia.org/wiki/Contingency_table

https://learn.microsoft.com/en-us/dotnet/api/system.random?view=net-7.0

https://www.w3schools.com/js/js_random.asp

https://www.mathstips.com/frequency-distribution-discrete-continuous-variables/

https://webreference.com/javascript/basics/versions/

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Strict_mode

https://www.freecodecamp.org/news/var-let-and-const-whats-the-difference/

Part a

M systems are subject to a series of N attacks. On the x-axis, we indicate the attacks and on the y-axis we simulate the accumulation of a "security score" (-1, 1), where the score is -1 if the system is penetrated and 1 if the system was successfully "shielded" or protected. Simulate the score "trajectories" for all systems, assuming, for simplicity, a constant penetration probability p at each attack.
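
The data behind such a chart can be generated as in the sketch below (drawing is left out). The name `simulateScoreTrajectories` and the injectable `random` parameter are hypothetical; attacks are assumed to be independent.

```javascript
// Simulates the cumulated security score of m systems over n attacks.
// Each attack adds -1 with probability p (penetrated) and +1 otherwise.
// Returns m arrays of length n + 1; element 0 is the starting score 0.
const simulateScoreTrajectories = (m, n, p, random = Math.random) => {
  const trajectories = [];
  for (let system = 0; system < m; system += 1) {
    const path = [0];
    let score = 0;
    for (let attack = 1; attack <= n; attack += 1) {
      score += random() < p ? -1 : 1;
      path.push(score);
    }
    trajectories.push(path);
  }
  return trajectories;
};

const scorePaths = simulateScoreTrajectories(50, 200, 0.5);
```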

Part b

Same as before, but simulate the cumulated frequency, say f, of penetration. Do the same with the relative frequency f/(number of attacks) and the "normalized" ratio f/√(number of attacks).
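
A possible sketch for one system, producing the three series at once (the function name and the returned property names are illustrative assumptions):

```javascript
// For one system and n attacks, returns three series indexed by attack number:
// the cumulated frequency f of penetrations, the relative frequency f / a,
// and the "normalized" ratio f / sqrt(a), where a is the number of attacks so far.
const simulatePenetrationFrequencies = (n, p, random = Math.random) => {
  const cumulated = [];
  const relative = [];
  const normalized = [];
  let f = 0;
  for (let a = 1; a <= n; a += 1) {
    if (random() < p) f += 1;
    cumulated.push(f);
    relative.push(f / a);
    normalized.push(f / Math.sqrt(a));
  }
  return { cumulated, relative, normalized };
};

const frequencySeries = simulatePenetrationFrequencies(1000, 0.3);
```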

For any of the above 4 charts (which will be actually an instance of a

a vertical histogram at some point x (day or attack number, user parameter) and at the last abscissa value and make your personal considerations on the shape of the distributions.

Make sure that each animation is enclosed in a "frame" (a rectangle) that the user can resize with the mouse (you can make a separate, reusable "ResizableRectangle" object for that).
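
One way to keep such an object reusable is to separate the geometry from the drawing. The sketch below is a minimal, DOM-free version: the handle position (bottom-right corner), the method names and the default sizes are all assumptions made for illustration.

```javascript
// Rectangle that can be moved by dragging its interior and resized by
// dragging a square handle in its bottom-right corner. Pure geometry:
// the caller feeds it pointer coordinates and redraws after each change.
class ResizableRectangle {
  constructor(x, y, width, height, handleSize = 10, minSize = 20) {
    Object.assign(this, { x, y, width, height, handleSize, minSize });
    this.mode = null; // null, "move" or "resize"
    this.lastX = 0;
    this.lastY = 0;
  }

  contains(px, py) {
    return px >= this.x && px <= this.x + this.width
      && py >= this.y && py <= this.y + this.height;
  }

  isOnHandle(px, py) {
    return px >= this.x + this.width - this.handleSize && px <= this.x + this.width
      && py >= this.y + this.height - this.handleSize && py <= this.y + this.height;
  }

  pointerDown(px, py) {
    if (this.isOnHandle(px, py)) this.mode = "resize";
    else if (this.contains(px, py)) this.mode = "move";
    else this.mode = null;
    this.lastX = px;
    this.lastY = py;
  }

  pointerMove(px, py) {
    if (!this.mode) return;
    const dx = px - this.lastX;
    const dy = py - this.lastY;
    if (this.mode === "resize") {
      this.width = Math.max(this.minSize, this.width + dx);
      this.height = Math.max(this.minSize, this.height + dy);
    } else {
      this.x += dx;
      this.y += dy;
    }
    this.lastX = px;
    this.lastY = py;
  }

  pointerUp() {
    this.mode = null;
  }
}

const frame = new ResizableRectangle(10, 10, 100, 50);
frame.pointerDown(108, 58); // on the handle
frame.pointerMove(128, 68); // drag by (+20, +10)
frame.pointerUp();
```

On a canvas, the three pointer methods would typically be called from `mousedown`, `mousemove` and `mouseup` listeners, passing `event.offsetX` and `event.offsetY`, followed by a redraw.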

Discussion point:

Is what you see what you expected? What about the averages of the distributions and the shapes of the histograms: do you see regularities or differences? Can you attempt to explain what you see, or guess what the "theoretical" limit distribution is as N increases (you can make the simulated distribution "more detailed" by increasing M)?

--------------------------------------

Optional Part c

Consider M computer systems. For each system, consider N days, on each of which an attack can happen with probability p.

[You can allow p to change over systems or over days, if you wish, and note if there are differences in the (asymptotic) distribution.]

Chart the cumulative number of attacks at each day (all systems) or for each system (all days) with an animation that shows, at any time, either the total number of attacks for each system or the attacks in each day for all systems.

Optional Part d

Do the same as in the previous part but, instead of counting the days with attacks, count the actual attacks each day, which we assume to be (0, 1, ..., k) with respective (constant with respect to time) probabilities (p0, ..., pk).
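
Drawing the daily number of attacks from (p0, ..., pk) can be done by inverting the cumulative distribution, as in this sketch. The function names are hypothetical, and the probabilities are assumed to sum to 1.

```javascript
// Draws a value in 0..k with probabilities p0..pk by inverse transform:
// returns the first j such that u < p0 + ... + pj, with u uniform in [0, 1).
const sampleAttackCount = (probabilities, random = Math.random) => {
  const u = random();
  let cumulative = 0;
  for (let j = 0; j < probabilities.length; j += 1) {
    cumulative += probabilities[j];
    if (u < cumulative) return j;
  }
  // Reached only if rounding leaves the total slightly below 1.
  return probabilities.length - 1;
};

// Cumulative number of attacks on one system over n days.
const simulateDailyAttacks = (n, probabilities, random = Math.random) => {
  const cumulativeAttacks = [];
  let total = 0;
  for (let day = 0; day < n; day += 1) {
    total += sampleAttackCount(probabilities, random);
    cumulativeAttacks.push(total);
  }
  return cumulativeAttacks;
};

// Example with k = 2 and illustrative probabilities.
const attackPath = simulateDailyAttacks(365, [0.2, 0.5, 0.3]);
```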

--------------------------------------

Briefly recall the definition and the math notions relevant to "probability space" and give some simple examples, indicating the meaning of each element of the triple in your particular example.

If you wanted to model Exercise 1 of this homework probabilistically, explain what the 3 sets of your probability space and their elements are in this case.
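
As a hedged starting point only, and assuming that "Exercise 1" refers to Part a for a single system with N independent attacks and constant penetration probability p, one possible probability space $(\Omega, \mathcal{F}, P)$ is:

$ \Omega = \{-1, +1\}^N, \qquad \mathcal{F} = 2^{\Omega}, \qquad P(\{\omega\}) = p^{\,k(\omega)} (1-p)^{\,N - k(\omega)} $

where $\omega = (\omega_1, \dots, \omega_N)$ is one complete sequence of outcomes, $k(\omega)$ is the number of entries equal to $-1$ (penetrations), and $\mathcal{F}$ is the set of all subsets of $\Omega$ (the events). Independence of the attacks is an assumption of this sketch, not something stated in the exercise.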

-----------------------

For simplicity, all the charts should be done using essentially

you want to use). In fact, from the code point of view, you will readily see that all the parts of exercise 1, and even the optional parts, are essentially

a

with the same few reusable objects. And these few classes (of objects) will also be reusable for future homeworks or projects.

For simplicity, I suggest first creating the histogram object and the resizable rectangle object as separate objects, and then using them as accessories of the chart object.

Some resources:

https://en.wikipedia.org/wiki/Probability_theory

https://en.wikipedia.org/wiki/Probability_space

https://en.wikipedia.org/wiki/Probability_axioms

https://en.wikipedia.org/wiki/Probability_measure

https://en.wikipedia.org/wiki/Measure_(mathematics)

https://learn.microsoft.com/en-us/dotnet/api/system.windows.forms.timer?view=windowsdesktop-7.0

https://developer.mozilla.org/en-US/docs/Web/API/setInterval

https://video.search.yahoo.com/search/video?fr=mcafee&ei=UTF-8&p=graphics+window+transformation+linear&type=E210US91082G0#id=3&vid=a4a7dbb9067f4be971d1e9acd4f52280&action=click

https://stackoverflow.com/questions/41745072/how-to-create-a-resizable-rectangle-in-javascript

Since most of the programs you created about the distributions were wrong, we will do a "revision", due to the importance of the distribution concept.

Revise and optimize your previous programs to compute the joint distribution of any number k = 2, 3, ... of continuous quantitative variables where, for each variable, the user can specify the number of subdivisions ("class intervals") of a range containing the observed values.

Revise also your previous homework taking into account that qualitative variables can be ordered and therefore the order needs to be preserved.

For quantitative variables, include the possibility to specify class intervals too.
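
A minimal sketch of one possible logic for the continuous case follows: each value is mapped directly to its class index, and the tuple of indices identifies the joint cell. The names `classIndex`, `jointClassDistribution` and the `specs` format are hypothetical; the range of each variable is taken here as its observed minimum and maximum, which is only one of the admissible choices.

```javascript
// Index (0..classes-1) of the equal-width class interval of [min, max]
// that contains `value`; the maximum itself falls into the last class.
const classIndex = (value, min, max, classes) => {
  if (max === min) return 0;
  const j = Math.floor(((value - min) / (max - min)) * classes);
  return Math.min(classes - 1, Math.max(0, j));
};

// Joint distribution of any number of continuous variables.
// `specs` is an array of { name, classes }, one entry per variable.
const jointClassDistribution = (rows, specs) => {
  // First pass: observed range of each variable.
  const ranges = specs.map(({ name }) => {
    let min = Infinity;
    let max = -Infinity;
    for (const row of rows) {
      if (row[name] < min) min = row[name];
      if (row[name] > max) max = row[name];
    }
    return { min, max };
  });
  // Second pass: count observations per joint cell, e.g. key "0,1".
  const counts = new Map();
  for (const row of rows) {
    const key = specs
      .map(({ name, classes }, i) => classIndex(row[name], ranges[i].min, ranges[i].max, classes))
      .join(",");
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return { ranges, counts };
};

// Hypothetical data, for illustration only.
const measurements = [
  { x: 0, y: 10 },
  { x: 1, y: 20 },
  { x: 9, y: 20 },
  { x: 10, y: 10 },
  { x: 2, y: 11 },
];
const joint = jointClassDistribution(measurements, [
  { name: "x", classes: 2 },
  { name: "y", classes: 2 },
]);
```

Only non-empty cells are stored, so memory grows with the number of distinct cells observed rather than with the product of the numbers of classes.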

[If you think you created the best (original) logic, please send me your solution for an extra point (+1) on the final grade (essentially identical solutions will be excluded).]

Create some visual representation of the distribution. Use your imagination and skills creatively (you may invent new representations, if you like).

(Revise and improve your simulations in homework 3, where necessary.)

Search on the web about the

Search on the web about the

Based on the CLT, how could you modify ("normalize") the "security score" to obtain an asymptotic convergence to a proper distribution?
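
One possible normalization, sketched under the assumptions that the attacks are independent and that $0 < p < 1$: let $X_i = -1$ with probability $p$ and $X_i = +1$ with probability $1-p$, and let $S_N = \sum_{i=1}^{N} X_i$ be the security score after N attacks. Then

$ E[X_i] = 1 - 2p, \qquad \mathrm{Var}(X_i) = 1 - (1-2p)^2 = 4p(1-p) $

$ Z_N = \frac{S_N - N(1-2p)}{2\sqrt{N\,p\,(1-p)}} \xrightarrow{d} \mathcal{N}(0, 1) \quad \text{as } N \to \infty $

that is, the score is centered by its mean and divided by its standard deviation, which grows like $\sqrt{N}$.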

M servers are subject to attacks during a period of time T (for instance, 1 year). Subdivide the interval T into N subintervals of size T/N and suppose that, in each of these, an attack can occur with probability λT/N.

Simulate the attacks on the M servers and represent each server with a line that jumps by 1 at each attack event.
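
The counting path of one server can be sketched as follows (charting is left out; the function name and the range check are illustrative additions):

```javascript
// Counting path of attacks on one server over [0, T], with T split into
// n subintervals; in each one an attack occurs with probability lambda * T / n.
// Returns n + 1 values: the number of attacks up to each subinterval end.
const simulateAttackCountPath = (lambda, t, n, random = Math.random) => {
  const pAttack = (lambda * t) / n;
  if (pAttack > 1) {
    throw new RangeError("Choose N large enough that lambda * T / N <= 1");
  }
  const path = [0];
  let count = 0;
  for (let i = 0; i < n; i += 1) {
    if (random() < pAttack) count += 1; // the line jumps by 1
    path.push(count);
  }
  return path;
};

// M = 100 servers, T = 1, lambda = 20 expected attacks per unit of time, N = 5000.
const attackCountPaths = Array.from({ length: 100 }, () =>
  simulateAttackCountPath(20, 1, 5000));
```

With these illustrative parameters, the final counts should scatter around λT = 20.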

Using the same objects ("movable/resizable rectangle", histogram, etc.) as in the previous Homework 3, draw vertically on the line chart the 2 histograms representing the distribution of the number of attacks at the end of the period and at one internal instant, for comparison.

Study what happens asymptotically, for N large and a number of systems M sufficient to give shape to a simulated distribution. Make some personal considerations about the shape and the average of the distributions that you see.

Find out on the web about a

Consider a scheme similar to Homework 3, Part a, where M systems are subject to a series of N attacks.

A system is discarded as "unsecure" if it reaches a penetration score of P

Simulate and represent the probabilities of a system being discarded, for various values of P, for example P = k*10 (k = 2, ..., 10), conditional on the 3 cases for S: S = 20, S = 60, S = 100 (or any other value of S of your choice that you find useful to explore; it could be a user parameter).

Find out on the web about the "Gambler's Ruin Problem". See if you can find any analogy with this exercise and make your personal considerations about what your simulation is suggesting to you.
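
The text above does not say how P and S enter the score, so the following sketch rests entirely on an assumed reading, chosen by analogy with the gambler's ruin: the score starts at 0 and moves by -1 or +1 as in Homework 3, a system is discarded when the score reaches -P, and it is considered safe once the score reaches +S. All names are hypothetical.

```javascript
// Estimates the probability that a system is discarded.
// ASSUMPTIONS (not stated in the exercise text): the score starts at 0,
// changes by -1 with probability p and by +1 otherwise; the system is
// discarded when the score reaches -penetrationP; it stops being at risk
// when the score reaches +safeS or when the n attacks are over.
const estimateDiscardProbability = (m, n, p, penetrationP, safeS, random = Math.random) => {
  let discarded = 0;
  for (let system = 0; system < m; system += 1) {
    let score = 0;
    for (let attack = 0; attack < n; attack += 1) {
      score += random() < p ? -1 : 1;
      if (score <= -penetrationP) {
        discarded += 1;
        break;
      }
      if (score >= safeS) break;
    }
  }
  return discarded / m;
};

// Discard probabilities for P = 20, 30, ..., 100, with S = 60 and p = 0.5.
const discardTable = [];
for (let k = 2; k <= 10; k += 1) {
  const penetrationP = k * 10;
  discardTable.push({
    P: penetrationP,
    probability: estimateDiscardProbability(200, 5000, 0.5, penetrationP, 60),
  });
}
```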

Consider a scheme similar to Homework 3

First of all, realize that the general scheme that you have used so far (random walk, Poisson process, etc.) can, more generally, be used for any stochastic differential equation (SDE) (see, for instance, the Euler–Maruyama method: https://en.wikipedia.org/wiki/Euler%E2%80%93Maruyama_method). In other words, with minor additions to your program you can now generalize this tool to simulate almost any stochastic differential equation of interest for many applications.
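
A minimal sketch of the Euler–Maruyama scheme for $dS_u = \mu(S_u, u)\,du + \sigma(S_u, u)\,dW_u$ follows. The function names, the options object and the parameter values in the examples are illustrative assumptions; the normal variates are generated from scratch with the Box–Muller transform.

```javascript
// Standard normal variate via the Box-Muller transform.
const standardNormal = (random = Math.random) => {
  const u1 = 1 - random(); // in (0, 1], avoids log(0)
  const u2 = random();
  return Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
};

// Euler-Maruyama: S_{i+1} = S_i + mu(S_i, t_i) dt + sigma(S_i, t_i) dW_i,
// with dW_i = sqrt(dt) * Z_i and Z_i independent standard normals.
// Returns the n + 1 simulated values on the grid t_i = i * T / n.
const eulerMaruyama = ({ mu, sigma, s0, t, n, random = Math.random }) => {
  const dt = t / n;
  const sqrtDt = Math.sqrt(dt);
  const path = [s0];
  let s = s0;
  for (let i = 0; i < n; i += 1) {
    const time = i * dt;
    const dW = sqrtDt * standardNormal(random);
    s += mu(s, time) * dt + sigma(s, time) * dW;
    path.push(s);
  }
  return path;
};

// Examples (all parameter values are arbitrary illustrations).
const arithmeticBrownianPath = eulerMaruyama({
  mu: () => 0.1, sigma: () => 0.3, s0: 0, t: 1, n: 1000,
});
const geometricBrownianPath = eulerMaruyama({
  mu: (s) => 0.05 * s, sigma: (s) => 0.2 * s, s0: 100, t: 1, n: 1000,
});
const ornsteinUhlenbeckPath = eulerMaruyama({
  mu: (s) => 2 * (1 - s), sigma: () => 0.3, s0: 0, t: 1, n: 1000,
});
```

Only the two functions `mu` and `sigma` change from one process to another, which is what makes a single simulator reusable across SDEs.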

Create a

Do some research on the web and include any SDE that you think is interesting. Some examples of popular processes:

- Arithmetic Brownian

- Geometric Brownian (Black–Scholes)

- Ornstein–Uhlenbeck (mean-reverting)

- Vasicek

- Hull–White

- Cox–Ingersoll–Ross

- Black–Karasinski

- Heston

- Chen model

- [... any other interesting ...]

Compare also with other possible simulation schemes which have been proposed (e.g., Milstein, Runge-Kutta, Heun's, ...), pointing out possible differences.

Allow the user to

Best 3 original (and different) programs, based on students' votes, will receive

(with attribution, obviously) to the respective authors.