Statistics, Cybersecurity [Year 2023-24]


Topics on Statistics with intensive computer applications

$ \int_0^t d S_u = \int_0^t \mu(S_u, u) du + \int_0^t\sigma(S_u, u) dW_u $

Course and online teaching support, by T. Gastaldi   #Sapienzanonsiferma  #Sapienzadoesnotstop

(Instructor: tommaso.gastaldi@gmail.com,
https://www.datatime.eu/public/cybersecurity/)


WhatsApp group for the students of this course
Invitation to join the WhatsApp group for this course: https://chat.whatsapp.com/Kk3wRGmmxWH9RNUo01zFdX

(When first joining, send a message with your name and id ("matricola"))


____________________________________________________________________________________




General notes for all homeworks

-Implement the exercises both as a WinForms application (C#, or VB.NET if you prefer) and as a web version (JavaScript). For JavaScript, always use the latest ECMAScript (classes, let, const, no var, etc.) and strict mode (WebStorm or Rider can be of great help to stay up to date with the latest language features and to check syntax). Publish the JavaScript programs directly online as a web page.

-All important code must be shown and, where useful, discussed (the crucial parts only) in the homework web page, so that one can understand the main points.
(The full version can be stored on GitHub or as a zip file containing the "solution", if you like, but that is not required.)

-Never use any third-party library or higher-level language (e.g., SAS, R, Python, MATLAB, Minitab, etc.) because our purpose is to implement
the very basics from scratch, in order to deeply understand our topics. (Using other people's "black boxes" would completely defeat our learning purpose.)

-Always exercise your capacity for abstraction. Never write algorithms that work only on specific cases or data; on the contrary, try to be as general as possible in all of your creations and logic. Use smart personal implementations to show your intelligence and insight! Originality and deep thinking are the most appreciated values in this course.

-Always acknowledge your sources and use quotation marks when you copy and paste text from other sources (note that what you copy may be wrong!).



Homework 1

Theory (intro)
- What Statistics is and how it relates to other disciplines. Difference between Descriptive and Inferential Statistics.
- Describe the concepts of Population, Sample, Attribute, Variable, Level of measurement and Dataset.
- Briefly describe the main sampling methods.
- Briefly describe the main experiment designs.

Applications (intro)
- Download Visual Studio.
- Write a program in C# or VB.NET that creates a window containing a line, a point, a circle and a rectangle.
- Write a program in JavaScript or TypeScript that creates a window containing a line, a point, a circle and a rectangle.
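
For the web version, one possible starting point is the standard 2D canvas API. The sketch below is only an illustration: the function name drawShapes, the coordinates and the canvas id "board" are assumptions of this example, not part of the assignment.

```javascript
'use strict';

// Draws a line, a point, a circle and a rectangle on a 2D canvas context.
function drawShapes(ctx) {
  ctx.beginPath();                  // line
  ctx.moveTo(20, 20);
  ctx.lineTo(180, 60);
  ctx.stroke();

  ctx.fillRect(100, 100, 1, 1);     // point: a 1x1 filled rectangle

  ctx.beginPath();                  // circle: a full arc
  ctx.arc(60, 140, 30, 0, 2 * Math.PI);
  ctx.stroke();

  ctx.strokeRect(120, 120, 60, 40); // rectangle outline
}

// In a browser page containing <canvas id="board" width="200" height="200">
// (the id "board" is an assumption of this sketch).
if (typeof document !== 'undefined') {
  const canvas = document.getElementById('board');
  if (canvas) drawShapes(canvas.getContext('2d'));
}
```

Keeping the drawing code in a function that receives the context makes it reusable for any canvas on the page.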

Some resources:

https://en.wikipedia.org/wiki/Variable_and_attribute_(research)
https://www.investopedia.com/terms/s/statistics.asp
https://www.scribbr.com/methodology/sampling-methods/#:~:text=Probability%20sampling%20methods%20include%20simple,a%20chance%20of%20being%20included.
https://en.wikipedia.org/wiki/Design_of_experiments
https://www.surveymonkey.com/mp/open-ended-questions-get-more-context-to-enrich-your-data/#:~:text=open%2Dended%20questions%3F-,So%20what%20are%20open%2Dended%20questions%3F,or%20other%20closed%2Dended%20format.
https://en.wikipedia.org/wiki/Level_of_measurement

https://www.youtube.com/watch?v=uHRqkGXX55I&ab_channel=SimpleLearningPro
https://www.youtube.com/watch?v=EZrP_av3cmA&ab_channel=SimpleLearningPro
https://www.youtube.com/watch?v=pTuj57uXWlk&ab_channel=SimpleLearningPro
https://www.youtube.com/watch?v=10ikXret7Lk&ab_channel=SimpleLearningPro


Homework 2 (12-18/10/2023)

- Please, complete our survey (or add new variables), so we have data

1.
Choose 3 variables from our surveys:
- one Qualitative
- one Quantitative discrete
- one Quantitative continuous (use class intervals in this case, obviously)

Create the most efficient algorithms to compute the frequency (absolute/relative/percentage) distribution of:

- the 3 variables
- the joint distribution of 2 variables (use a general "logic" that works for any number k of variables, k = 2, 3, ...).

Double-check/compare the results, wherever possible, using the functionality of a DBMS of your choice (e.g., Access, Oracle online, PostgreSQL, ...).
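
As a possible starting point for item 1, here is a minimal sketch that counts the occurrences of each k-tuple of values in a single pass, so the same function covers one variable or the joint distribution of several. The function name frequencyDistribution and the sample records are hypothetical, for illustration only; class intervals for continuous variables are not handled here.

```javascript
'use strict';

// Counts how often each distinct value (or combination of values) occurs.
// `rows` is an array of records; `variables` lists the property names to cross.
// Works for k = 1 (univariate) as well as k = 2, 3, ... (joint distribution).
function frequencyDistribution(rows, variables) {
  const counts = new Map();
  for (const row of rows) {
    const values = variables.map((name) => row[name]);
    const key = JSON.stringify(values); // composite key for the k-tuple
    const entry = counts.get(key);
    if (entry === undefined) {
      counts.set(key, { values, absolute: 1 });
    } else {
      entry.absolute += 1;
    }
  }
  const n = rows.length;
  return [...counts.values()].map(({ values, absolute }) => ({
    values,
    absolute,
    relative: absolute / n,
    percentage: (100 * absolute) / n,
  }));
}

// Hypothetical survey records, for illustration only.
const survey = [
  { sport: 'tennis', siblings: 1 },
  { sport: 'soccer', siblings: 0 },
  { sport: 'tennis', siblings: 1 },
  { sport: 'tennis', siblings: 2 },
];
console.log(frequencyDistribution(survey, ['sport']));
console.log(frequencyDistribution(survey, ['sport', 'siblings']));
```

A Map with a composite key gives one dictionary lookup per record, whatever the number of variables.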


2.
For the following most important data structures (or others that you may want to suggest) recall how to:

-loop (break/continue)
-add/remove/get/set/check the existence of key/value

data structures:
array, list, dictionary, sorted list, hashset, sortedset, queue, stack, linkedlist (or any other structure you think to be useful)

Note your findings very concisely in your JS cheat sheet as well and, where a corresponding JS object does not exist, create a simple equivalent class with all the methods needed to use it as in C# (or VB.NET if you prefer).
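
JavaScript has no built-in queue type, for example. The class below is a minimal sketch of what an "equivalent class" could look like; the method names loosely follow the C# Queue<T> naming and are a choice of this example, not a requirement.

```javascript
'use strict';

// Minimal FIFO queue with method names loosely mirroring C# Queue<T>.
class Queue {
  #items = [];
  #head = 0; // index of the next element to dequeue

  get count() { return this.#items.length - this.#head; }

  enqueue(item) { this.#items.push(item); }

  dequeue() {
    if (this.count === 0) throw new Error('Queue is empty');
    const item = this.#items[this.#head];
    this.#head += 1;
    // Compact occasionally so the backing array does not grow forever.
    if (this.#head > 32 && this.#head * 2 > this.#items.length) {
      this.#items = this.#items.slice(this.#head);
      this.#head = 0;
    }
    return item;
  }

  peek() {
    if (this.count === 0) throw new Error('Queue is empty');
    return this.#items[this.#head];
  }

  contains(item) { return this.#items.indexOf(item, this.#head) !== -1; }

  // Makes the queue usable in for...of loops (break/continue work as usual).
  *[Symbol.iterator]() {
    for (let i = this.#head; i < this.#items.length; i += 1) yield this.#items[i];
  }
}

const jobs = new Queue();
jobs.enqueue('a');
jobs.enqueue('b');
jobs.enqueue('c');
console.log(jobs.dequeue(), jobs.count, [...jobs]);
```

Advancing a head index instead of calling Array.prototype.shift avoids moving all remaining elements at every dequeue.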

3.
Generate N uniform random variates in [0,1) and determine the distribution into class intervals [i/k, (i+1)/k), i = 0,..., k-1.
Play with N and k values and draw some conclusion on the "shape" of the distribution.
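
A minimal sketch of this experiment follows. The function name uniformClassCounts is hypothetical, and the random source is passed as a parameter (defaulting to Math.random) only to make the function easy to check with fixed values.

```javascript
'use strict';

// Distributes n uniform variates in [0, 1) into k equal class intervals
// [i/k, (i+1)/k). `random` defaults to Math.random and can be replaced in tests.
function uniformClassCounts(n, k, random = Math.random) {
  const counts = new Array(k).fill(0);
  for (let j = 0; j < n; j += 1) {
    const u = random();
    // Math.floor(u * k) is the interval index; the min guards against rounding.
    const i = Math.min(Math.floor(u * k), k - 1);
    counts[i] += 1;
  }
  return counts;
}

const exampleCounts = uniformClassCounts(10000, 10);
exampleCounts.forEach((c, i) => console.log(`[${i / 10}, ${(i + 1) / 10}): ${c}`));
```

The interval index is computed directly from the value, so no search over the intervals is needed.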

Some resources:

https://online.stat.psu.edu/stat800/lesson/1/1.1
https://en.wikipedia.org/wiki/Frequency_(statistics)
https://en.wikipedia.org/wiki/Contingency_table
https://learn.microsoft.com/en-us/dotnet/api/system.random?view=net-7.0
https://www.w3schools.com/js/js_random.asp
https://www.mathstips.com/frequency-distribution-discrete-continuous-variables/



https://webreference.com/javascript/basics/versions/
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Strict_mode
https://www.freecodecamp.org/news/var-let-and-const-whats-the-difference/



Homework 3 (19-25/10/2023)

Exercise 1

Part a
M systems are subject to a series of N attacks. On the x-axis we indicate the attacks and on the y-axis we
simulate the accumulation of a "security score" (-1, 1), where the score at each attack is -1 if the system is penetrated
and +1 if the system was successfully "shielded" or protected. Simulate the score "trajectories" for all systems,
assuming, for simplicity, a constant penetration probability p at each attack.
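
The data behind such a chart could be generated as in the following sketch (the drawing itself is left out). The function name and the injectable random parameter are assumptions of this example.

```javascript
'use strict';

// Simulates m trajectories of n attacks each. At every attack the score moves
// by -1 with probability p (penetrated) or by +1 otherwise (shielded).
function simulateScoreTrajectories(m, n, p, random = Math.random) {
  const trajectories = [];
  for (let s = 0; s < m; s += 1) {
    const path = [0]; // score before the first attack
    for (let a = 1; a <= n; a += 1) {
      const step = random() < p ? -1 : 1;
      path.push(path[a - 1] + step);
    }
    trajectories.push(path);
  }
  return trajectories;
}

const demoPaths = simulateScoreTrajectories(3, 10, 0.5);
console.log(demoPaths.map((path) => path[path.length - 1])); // final scores
```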

Part b
Same as before, but simulate the cumulated frequency, say f, of penetration. Do the same with the relative
frequency f/(number of attacks) and the "normalized" ratio f/√(number of attacks).
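
The three quantities of Part b can be accumulated in the same loop, as in this sketch for a single system (names are hypothetical):

```javascript
'use strict';

// For one system, returns after each attack a = 1..n:
// the cumulated penetration frequency f, f / a and f / sqrt(a).
function simulatePenetrationFrequencies(n, p, random = Math.random) {
  const cumulated = [];
  const relative = [];
  const normalized = [];
  let f = 0;
  for (let a = 1; a <= n; a += 1) {
    if (random() < p) f += 1; // the system is penetrated at this attack
    cumulated.push(f);
    relative.push(f / a);
    normalized.push(f / Math.sqrt(a));
  }
  return { cumulated, relative, normalized };
}

const demoFrequencies = simulatePenetrationFrequencies(1000, 0.3);
console.log(demoFrequencies.relative[999]); // relative frequency after 1000 attacks
```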


For each of the 4 charts above (which will actually be instances of a unique "object", from a coder's point of view), plot
a vertical histogram at some point x (day or attack number, user parameter) and at the last abscissa
value, and make your personal considerations on the shape of the distributions.
Make sure that each animation is enclosed in a "frame" (a rectangle) that the user can resize with the mouse
(you can make a separate, reusable, "ResizableRectangle" object for that).
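
The data for the vertical histogram at abscissa x can be computed independently of the drawing. The sketch below assumes that the trajectories are stored as an array of arrays and uses equal-width bins; the function name histogramAt and the returned fields are hypothetical.

```javascript
'use strict';

// Builds the histogram of the values that all trajectories take at abscissa x.
// `trajectories` is an array of arrays; `binCount` equal-width bins span [min, max].
function histogramAt(trajectories, x, binCount) {
  const values = trajectories.map((path) => path[x]);
  let min = Infinity;
  let max = -Infinity;
  for (const v of values) {
    if (v < min) min = v;
    if (v > max) max = v;
  }
  const width = (max - min) / binCount || 1; // avoid zero width when all values coincide
  const bins = new Array(binCount).fill(0);
  for (const v of values) {
    const i = Math.min(Math.floor((v - min) / width), binCount - 1);
    bins[i] += 1;
  }
  return { min, max, width, bins };
}

const tinyPaths = [[0, 1, 2], [0, -1, 0], [0, 1, 0], [0, -1, -2]];
console.log(histogramAt(tinyPaths, 2, 2).bins); // bins: [1, 3]
```

Calling the same function with x equal to the last index gives the histogram at the last abscissa value.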

Discussion point:

Is what you see what you expected? What about the averages of the distributions and the shapes of the histograms:
do you see regularities, differences and can you attempt to explain what you see or guessing what are
the "theoretical" limit distribution, when as N increases, and you can make the distribution simulation "more detailed" by increasing M ?


--------------------------------------
Optional Part c

Given M computer systems. For each system, consider N days, where an attack can happen with probability p.
[You can allow p to change over systems or over days, if you wish, and note if there are differences in the (asymptotic) distribution.]
Chart the cumulative number of attacks at each day (all systems) or for each system (all days) with an animation
that shows, at any time, either the total number of attacks for each system or the attacks in each day for all systems.

Optional Part d

Do the same as in the previous part but, instead of counting the days with attacks, count the actual attacks each day which
we assume to be (0, 1, ..., k) with respective (constant wrt time) probabilities (p0, ..., pk).
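
Drawing the number of attacks in a day from the probabilities (p0, ..., pk) can be done by accumulating the probabilities until a uniform variate is exceeded. This is a sketch with a hypothetical function name, and the probabilities in the example are arbitrary.

```javascript
'use strict';

// Returns i in 0..k with probability probabilities[i] (which must sum to 1).
function sampleDiscrete(probabilities, random = Math.random) {
  const u = random();
  let cumulative = 0;
  for (let i = 0; i < probabilities.length; i += 1) {
    cumulative += probabilities[i];
    if (u < cumulative) return i;
  }
  return probabilities.length - 1; // guards against floating-point round-off
}

// Cumulative number of attacks over 30 days for one system.
const dailyProbabilities = [0.6, 0.3, 0.1]; // P(0), P(1), P(2) attacks in a day
const cumulativeAttacks = [];
let totalAttacks = 0;
for (let day = 0; day < 30; day += 1) {
  totalAttacks += sampleDiscrete(dailyProbabilities);
  cumulativeAttacks.push(totalAttacks);
}
console.log(cumulativeAttacks);
```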

--------------------------------------

Exercise 2

Briefly recall the definition and the mathematical notions relevant to a "probability space" and give some simple examples, indicating the meaning of each element of the triple in your particular example.
If you wanted to model Exercise 1 of this homework probabilistically, explain what the 3 components of your probability space and their elements would be.
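
As one possible illustration (not the only valid model), consider a single system under N attacks, assuming that the attacks are independent and each penetrates with the same probability p. A probability space for the sequence of outcomes could then be written as follows, where k(ω) denotes the number of -1 entries in the sequence ω:

```latex
\Omega = \{-1, +1\}^N, \qquad
\mathcal{F} = 2^{\Omega}, \qquad
P(\{\omega\}) = p^{\,k(\omega)} \, (1-p)^{\,N - k(\omega)}
```

Here Ω lists all possible score sequences, F (all subsets of Ω) contains the events one can ask about, and P assigns a probability to each event by summing over the sequences it contains.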

-----------------------

Hints and details for exercise 1

For simplicity, all the charts should be done using essentially the same "chart object" (just suitably parameterized for the variations
you want to use). In fact, from the code point of view, you will readily see that all the parts of Exercise 1, even the optional ones, are essentially
a "unique exercise" with minor variations which you can easily accommodate in your code. In other words, once implemented, you can manage all the cases
with the same few reusable objects. These few classes (of objects) will also be reusable for future homeworks or projects.
I suggest first creating the histogram object and the resizable rectangle object as separate objects, and then using them as accessories of the chart object.


Some resources:

https://en.wikipedia.org/wiki/Probability_theory
https://en.wikipedia.org/wiki/Probability_space
https://en.wikipedia.org/wiki/Probability_axioms
https://en.wikipedia.org/wiki/Probability_measure
https://en.wikipedia.org/wiki/Measure_(mathematics)

https://learn.microsoft.com/en-us/dotnet/api/system.windows.forms.timer?view=windowsdesktop-7.0
https://developer.mozilla.org/en-US/docs/Web/API/setInterval

https://video.search.yahoo.com/search/video?fr=mcafee&ei=UTF-8&p=graphics+window+transformation+linear&type=E210US91082G0#id=3&vid=a4a7dbb9067f4be971d1e9acd4f52280&action=click
https://stackoverflow.com/questions/41745072/how-to-create-a-resizable-rectangle-in-javascript




Homework 4 (26/10-1/11/2023)

Exercise

Since most of the programs you created for the distributions were wrong, we will do a "revision", given the importance of the distribution concept.
Revise and optimize your previous programs to compute the joint distribution of any number k = 2, 3, ... of continuous quantitative variables,
where, for each variable, the user can specify the number of subdivisions ("class intervals") of a range containing the observed values.

Revise also your previous homework taking into account that qualitative variables can be ordered and therefore the order needs to be preserved.

For quantitative variables, include the possibility to specify class intervals too.
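
One way to handle any number of continuous variables is to map each value to the index of its class interval and to use the tuple of indices as a dictionary key. The sketch below assumes equal-width intervals and hypothetical names (classIndex, jointClassDistribution, and the fields of the spec objects).

```javascript
'use strict';

// Maps a value to the index of its class interval, given the lower bound,
// the upper bound and the number of equal-width intervals of the range.
function classIndex(value, { min, max, intervals }) {
  const width = (max - min) / intervals;
  const i = Math.floor((value - min) / width);
  return Math.min(Math.max(i, 0), intervals - 1); // the maximum falls in the last class
}

// Joint absolute frequencies of k continuous variables.
// `specs` has one { name, min, max, intervals } entry per variable.
function jointClassDistribution(rows, specs) {
  const counts = new Map();
  for (const row of rows) {
    const key = specs.map((spec) => classIndex(row[spec.name], spec)).join(',');
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return counts; // key "i,j,..." holds the interval index of each variable
}

// Hypothetical data, for illustration only.
const people = [
  { height: 150, weight: 50 },
  { height: 160, weight: 55 },
  { height: 199, weight: 99 },
  { height: 200, weight: 100 },
];
const specs = [
  { name: 'height', min: 150, max: 200, intervals: 5 },
  { name: 'weight', min: 50, max: 100, intervals: 2 },
];
console.log(jointClassDistribution(people, specs));
```

Only the cells that actually occur are stored, so the memory used does not grow with the product of the numbers of intervals.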

[If you think you created the best (original) logic, please send me your solution for an extra point (+1)
on final grade (essentially identical solutions will be excluded).]


Optional
Create some visual representation of the distribution. Use your imagination and skills creatively (you may invent new representations, if you like).


Research

(Revise and improve your simulations in homework 3, where necessary.)
Search the web for the Law of Large Numbers (LLN), compare it with Part b of your homework 3 and explain in your own words whether your simulation is related to this theorem, and why.
Search the web for the Central Limit Theorem (CLT), compare it with Part a of your homework 3 and explain in your own words whether your simulation is related to this theorem, and why.
Based on the CLT, how could you modify ("normalize") the "security score" to obtain asymptotic convergence to a proper distribution?
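
As a hint for the last question, here is a sketch of the standard normalization, assuming independent attacks with constant penetration probability 0 < p < 1 (the symbols X_i, S_n and Z_n are introduced here for illustration):

```latex
X_i \in \{-1, +1\}, \quad P(X_i = -1) = p, \quad S_n = \sum_{i=1}^{n} X_i
\\
E[X_i] = 1 - 2p, \qquad \mathrm{Var}(X_i) = 1 - (1 - 2p)^2 = 4p(1-p)
\\
Z_n = \frac{S_n - n(1 - 2p)}{2\sqrt{n \, p(1-p)}} \;\xrightarrow{d}\; \mathcal{N}(0, 1) \quad (n \to \infty)
```

The score is centered by its mean and divided by its standard deviation, which grows like √n; this is why dividing by n (as in the LLN) or not dividing at all gives a degenerate or diverging picture.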




Homework 5 (2-8/11/2023)

Exercise

M servers are subject to attacks during a period of time T (for instance, 1 year).
Subdivide the interval T into N subintervals of size T/N and suppose that, in each of them,
an attack can occur with probability λT/N.
Simulate the attacks on the M servers and represent each server with a line which
makes a jump of 1 at each attack event.
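
The counting paths described above could be generated as in this sketch (function and parameter names are hypothetical). Note that λT/N is a probability only if N is at least λT.

```javascript
'use strict';

// Simulates m counting paths over n subintervals of [0, t]. In each subinterval
// an attack occurs with probability lambda * t / n and the path jumps by 1.
function simulateAttackCounts(m, n, lambda, t, random = Math.random) {
  const pAttack = (lambda * t) / n;
  if (pAttack > 1) throw new Error('n must be at least lambda * t');
  const paths = [];
  for (let s = 0; s < m; s += 1) {
    const path = [0];
    for (let i = 1; i <= n; i += 1) {
      path.push(path[i - 1] + (random() < pAttack ? 1 : 0));
    }
    paths.push(path);
  }
  return paths;
}

const attackPaths = simulateAttackCounts(200, 1000, 5, 1);
const meanFinal = attackPaths.reduce((sum, path) => sum + path[1000], 0) / 200;
console.log(meanFinal); // average number of attacks at time T
```

With this scheme the number of attacks of each server at time T is Binomial(N, λT/N), so its expected value is λT.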

Using the same objects ("movable/resizable rectangle", histogram, etc.) as in homework 3,
draw vertically on the line chart the 2 histograms representing the distribution of the number
of attacks at the end of the period and at one internal instant, for comparison.

Study what happens asymptotically, for N large and a number of systems M sufficient to give shape to
a simulated distribution. Make some personal considerations about the shape and the average of the distributions that you see.


Research
Find out on the web about the Poisson point process. See whether you can find any analogy with this exercise, and verify whether your distributions come close (for N, M sufficiently large) to the theoretical asymptotic distribution.
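
For the verification, the simulated relative frequencies can be put side by side with the Poisson probabilities e^(-m) m^k / k!, where m = λT. The sketch below uses hypothetical function names and computes the probability iteratively, so that no large factorial is needed.

```javascript
'use strict';

// Poisson probability of exactly k events when the expected number is `mean`.
function poissonPmf(k, mean) {
  let p = Math.exp(-mean);
  for (let i = 1; i <= k; i += 1) p *= mean / i;
  return p;
}

// Pairs the empirical relative frequency of each count with the Poisson value.
function compareWithPoisson(finalCounts, mean) {
  const maxCount = finalCounts.reduce((a, b) => Math.max(a, b), 0);
  const freq = new Array(maxCount + 1).fill(0);
  for (const c of finalCounts) freq[c] += 1;
  return freq.map((f, k) => ({
    k,
    empirical: f / finalCounts.length,
    theoretical: poissonPmf(k, mean),
  }));
}

// Hypothetical final counts of 8 servers, with lambda * T = 2.
console.log(compareWithPoisson([2, 1, 3, 2, 0, 2, 4, 1], 2));
```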




Homework 6 (9-15/11/2023)

Exercise 1

Consider a scheme similar to Homework 3, Part a
where M systems are subject to a series of N attacks.

A system is discarded as "insecure" if it reaches a penetration score of P before reaching a security score of S.
Simulate and represent the probabilities of a system being discarded for various values of P, for example P = k*10 (k = 2, ..., 10),
conditional on the 3 cases S = 20, S = 60, S = 100 (or any other value of S that you find useful to explore;
it could be a user parameter).
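
The following sketch estimates the discard probability under one possible reading of the rule, which is an assumption of this example: the score is the cumulative ±1 score of Homework 3, Part a, and a system is discarded when that score reaches -P before reaching +S within the N attacks. Function names are hypothetical.

```javascript
'use strict';

// Returns true if the score hits -penetrationLimit before +securityLimit
// within at most n attacks (p = penetration probability at each attack).
function isDiscarded(n, penetrationLimit, securityLimit, p, random = Math.random) {
  let score = 0;
  for (let a = 0; a < n; a += 1) {
    score += random() < p ? -1 : 1;
    if (score <= -penetrationLimit) return true;
    if (score >= securityLimit) return false;
  }
  return false; // neither threshold reached within n attacks
}

// Fraction of m simulated systems that are discarded.
function discardProbability(m, n, penetrationLimit, securityLimit, p, random = Math.random) {
  let discarded = 0;
  for (let s = 0; s < m; s += 1) {
    if (isDiscarded(n, penetrationLimit, securityLimit, p, random)) discarded += 1;
  }
  return discarded / m;
}

for (let k = 2; k <= 10; k += 1) {
  console.log(`P = ${k * 10}, S = 20:`, discardProbability(500, 20000, k * 10, 20, 0.5));
}
```

For p = 1/2 and N large enough for one of the thresholds to be reached, the classical result for the symmetric walk gives S/(P + S) as the probability of reaching -P first, which can serve as a check.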


Research

Find out on the web about the "Gambler's Ruin Problem". See whether you can find any analogy with this exercise and make your personal
considerations about what your simulation is suggesting to you.


Homework 9 (2 weeks from 16/11)

Exercise

Consider a scheme similar to Homework 3

First of all, realize that the general scheme that you have used so far (random walk, Poisson process, etc.) can, more generally,
be used for any stochastic differential equation, SDE (see, for instance, the Euler–Maruyama method: https://en.wikipedia.org/wiki/Euler%E2%80%93Maruyama_method).

In other words, with minor additions to your program you can now generalize this tool to simulate almost any stochastic
differential equation of interest for many applications.
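
A minimal sketch of an Euler–Maruyama stepper for dS = μ(S, t) dt + σ(S, t) dW is given here. All names are hypothetical, the normal variates come from the Box-Muller transform, and the parameter values of the example processes are arbitrary illustrations.

```javascript
'use strict';

// Standard normal variate via the Box-Muller transform.
function standardNormal(random = Math.random) {
  const u1 = 1 - random(); // in (0, 1], so Math.log never receives 0
  const u2 = random();
  return Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
}

// Euler-Maruyama path of dS = mu(S, t) dt + sigma(S, t) dW on [0, tEnd].
function eulerMaruyama({ mu, sigma, s0, tEnd, steps, random = Math.random }) {
  const dt = tEnd / steps;
  const sqrtDt = Math.sqrt(dt);
  const path = [s0];
  let s = s0;
  for (let i = 0; i < steps; i += 1) {
    const t = i * dt;
    const dW = sqrtDt * standardNormal(random); // Brownian increment ~ N(0, dt)
    s += mu(s, t) * dt + sigma(s, t) * dW;
    path.push(s);
  }
  return path;
}

// Example processes: only the two coefficient functions change.
const processes = {
  arithmeticBrownian: { mu: () => 0.1, sigma: () => 0.3 },
  geometricBrownian: { mu: (s) => 0.05 * s, sigma: (s) => 0.2 * s },
  ornsteinUhlenbeck: { mu: (s) => 2.0 * (1.0 - s), sigma: () => 0.3 },
};
const gbmPath = eulerMaruyama({ ...processes.geometricBrownian, s0: 100, tEnd: 1, steps: 250 });
console.log(gbmPath[gbmPath.length - 1]);
```

Since each process is just a pair of functions (mu, sigma), adding a new one to the user's selection list requires no change to the stepper.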

Create a web-only version that lets the user select and explore any useful stochastic process.

Do some research on the web and include any SDE that you think is interesting. Some examples of popular processes:

Arithmetic Brownian
Geometric Brownian (Black–Scholes)
Ornstein–Uhlenbeck (mean-reverting)
Vasicek
Hull–White
Cox–Ingersoll–Ross
Black–Karasinski
Heston
Chen model
[... any other interesting ...]


Optional (+1 grade):
Compare also with other simulation schemes which have been proposed
(e.g., Milstein, Runge-Kutta, Heun's, ...), pointing out possible differences.
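
As an illustration of how such a comparison could start, the sketch below writes one Euler–Maruyama step and one Milstein step side by side for a scalar SDE. The Milstein step needs the derivative of σ with respect to S, passed here as a hypothetical sigmaPrime function; all names and parameter values are assumptions of this example.

```javascript
'use strict';

// One Euler-Maruyama step for dS = mu dt + sigma dW.
function eulerStep(s, t, dt, dW, { mu, sigma }) {
  return s + mu(s, t) * dt + sigma(s, t) * dW;
}

// One Milstein step: Euler plus the correction 0.5 * sigma * sigma' * (dW^2 - dt).
function milsteinStep(s, t, dt, dW, { mu, sigma, sigmaPrime }) {
  const b = sigma(s, t);
  return s + mu(s, t) * dt + b * dW + 0.5 * b * sigmaPrime(s, t) * (dW * dW - dt);
}

// Geometric Brownian motion: sigma(S) = 0.2 * S, so sigma'(S) = 0.2.
const gbm = {
  mu: (s) => 0.05 * s,
  sigma: (s) => 0.2 * s,
  sigmaPrime: () => 0.2,
};
console.log(eulerStep(1, 0, 0.01, 0.2, gbm), milsteinStep(1, 0, 0.01, 0.2, gbm));
```

When σ does not depend on S (for example arithmetic Brownian motion), sigmaPrime is 0 and the two steps coincide, so differences can only show up for state-dependent volatility. To compare the schemes fairly, both should be driven by the same Brownian increments dW.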


Optional (+2 grade):
Allow the user to manually input an SDE (on-the-fly compilation) and simulate it.
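
In JavaScript, one simple way to turn a typed expression into a function is the built-in Function constructor. This is only a sketch: the helper name compileCoefficient and the convention that the expression uses the variables s and t are assumptions of this example. The constructor executes arbitrary code, which is acceptable when users run their own input in their own browser but unsafe for untrusted input.

```javascript
'use strict';

// Compiles a text expression in the variables s (state) and t (time)
// into a function (s, t) => number. Throws a SyntaxError on invalid input.
function compileCoefficient(expression) {
  return new Function('s', 't', `'use strict'; return (${expression});`);
}

// Expressions as the user might type them for drift and diffusion.
const userMu = compileCoefficient('0.05 * s');
const userSigma = compileCoefficient('0.2 * Math.sqrt(Math.abs(s))');
console.log(userMu(100, 0), userSigma(100, 0));
```

The two compiled functions can then be passed to the same simulation loop used for the predefined processes.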


The 3 best original (and different) programs, based on students' votes, will receive +1 and will be kept online on our site
(with attribution to the respective authors, obviously).