6.3 PROTECTING YOURSELF AND YOUR DATA

BIG DATA

Big data is a combination of structured, semi-structured and unstructured data collected by organisations, which can be used to extract relevant information for machine learning projects and other advanced applications. [1]

Features of big data

According to an early definition from 2001, big data is characterised by three Vs:
• the large volume of data in many environments;
• the wide variety of data types frequently stored in big data systems;
• the velocity at which much of the data is generated, collected and processed.

More recently, other Vs have been added:
• veracity, which refers to the level of accuracy of data;
• value, because data can have real business value;
• variability, as data can have different meanings and be formatted in different ways.

Although the term big data does not correspond to any specific volume of data, it usually involves terabytes and even exabytes of data created and collected over time.

Big data storage

Big data is often stored in a data lake, i.e., a large repository which can store large amounts of unprocessed data of different kinds in its original form. Many big data environments combine multiple systems in a distributed architecture. For example, a central data lake might be integrated with other platforms, including relational databases or a data warehouse. [2] The data can be left in its raw form and then filtered and organised as needed for particular uses, or pre-processed using data mining tools [3] and data preparation software.

Big data processing

Big data processing requires a large amount of computing power, often provided by hundreds or even thousands of server computers working together and using special technologies. Since providing this much computing capacity is both costly and challenging, clouds are popular locations for big data systems.
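As a rough illustration of many computers "working together", the Python sketch below splits a small data set into chunks, lets several workers process the chunks in parallel and then merges the partial results. It is only a toy model under stated assumptions: threads on a single machine stand in for servers, and the function names `split_into_chunks`, `count_words` and `distributed_word_count` are invented for this example rather than taken from any big data product.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(records, n_chunks):
    """Divide the records into at most n_chunks roughly equal parts."""
    size = max(1, -(-len(records) // n_chunks))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def count_words(chunk):
    """Work done by one 'server': count the words in its own chunk."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def distributed_word_count(records, n_workers=4):
    """Process the chunks in parallel, then merge the partial counts."""
    chunks = split_into_chunks(records, n_workers)
    total = Counter()
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for partial in pool.map(count_words, chunks):
            total.update(partial)
    return total


if __name__ == "__main__":
    # Hypothetical log lines, used only to exercise the sketch.
    logs = ["error disk full", "login ok", "error network down", "login ok"]
    print(distributed_word_count(logs, n_workers=2))
```

Each worker sees only its own chunk, so adding workers lets the same code handle more data. Real big data systems apply the same split, process and merge pattern across separate machines.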
Big data analytics

Big data analytics is the name given to the process of gathering and analysing large volumes of data in order to extract information. To get relevant results from big data analytics applications, data scientists and data analysts must have a detailed understanding of the available data and a precise idea of what they are looking for. For this reason, data preparation is necessary. Data preparation of data sets includes:
• profiling;
• cleansing;
• validation;
• transformation.

Then, different tools and techniques can be used to analyse the data, such as data mining, statistical analysis, etc.

MORE

[1] Examples of structured data are transactions and financial records. Examples of semi-structured data are web server logs and streaming data from sensors. Examples of unstructured data are texts, documents and multimedia files.

[2] A data warehouse is a type of data management system used to perform queries on big data.

[3] Data mining tools apply a set of techniques which use special algorithms, statistical analysis, artificial intelligence and database systems to analyse data from different dimensions and perspectives.
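The four data preparation steps named under Big data analytics (profiling, cleansing, validation and transformation) can be sketched in plain Python as follows. This is a minimal sketch, not a standard tool: the record fields `customer` and `amount`, the validation rules and all function names are assumptions made up for the example.

```python
def profile(records):
    """Profiling: count how many values are missing in each field."""
    missing = {}
    for rec in records:
        for field, value in rec.items():
            if value is None or str(value).strip() == "":
                missing[field] = missing.get(field, 0) + 1
    return missing


def cleanse(records):
    """Cleansing: trim whitespace and drop exact duplicate records."""
    seen, cleaned = set(), []
    for rec in records:
        tidy = {k: v.strip() if isinstance(v, str) else v for k, v in rec.items()}
        key = tuple(sorted(tidy.items()))
        if key not in seen:
            seen.add(key)
            cleaned.append(tidy)
    return cleaned


def is_valid(rec):
    """Validation (assumed rules): customer present, amount numeric and >= 0."""
    if not rec.get("customer"):
        return False
    try:
        return float(rec["amount"]) >= 0
    except (KeyError, TypeError, ValueError):
        return False


def transform(rec):
    """Transformation: convert fields to the types the analysis needs."""
    return {"customer": rec["customer"].title(), "amount": float(rec["amount"])}


def prepare(records):
    """Run cleansing, validation and transformation in order."""
    return [transform(r) for r in cleanse(records) if is_valid(r)]


if __name__ == "__main__":
    # Hypothetical raw records, used only to exercise the sketch.
    raw = [
        {"customer": " alice ", "amount": "10.50"},
        {"customer": "alice", "amount": "10.50"},  # duplicate once trimmed
        {"customer": "bob", "amount": ""},         # missing amount
        {"customer": "", "amount": "3"},           # missing customer
        {"customer": "carol", "amount": "-4"},     # fails validation
    ]
    print(profile(raw))
    print(prepare(raw))
```

Profiling is run on the raw data so that analysts can see what is missing before anything is removed. The other three steps then produce a smaller, consistent data set that is ready for analysis.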