IT UNIVERSITY OF COPENHAGEN
SUBMISSION OF WRITTEN WORK

Class code: 1013004U
Name of course: Big Data Management (Technical)
Course manager: Björn Thór Jónsson
Thesis or project title: Exam Report

Full Name: Vlad Alexandru Ilie
Birthdate (dd/mm-yyyy): 21/03-1995
E-mail: vali@itu.dk

Big Data Management – Technical
Vlad Alexandru Ilie

Contents:
Question 1
Question 2
  Sub-question A.A
  Sub-question A.B
  Sub-question A.C
  Sub-question A.D
  Sub-question A.E
  Sub-question B
Question 3
  Sub-question A
  Sub-question B
References
Appendix
  Appendix A

Word count: 3,570
Symbols: 18,225

Question 1:

First and foremost, it is important to note the difference between Amazon Kinesis and the Lambda Architecture (LA) at a conceptual level. The former is a set of tools and services offered by Amazon Web Services (AWS) dedicated to collecting, processing and analysing streaming data (Kinesis, 2013). The latter, on the other hand, is a generic distributed data processing architecture (Marz & Warren, 2015) which addresses the need for a low-latency, robust and fault-tolerant system that can serve a wide range of workloads and use cases (AWS, 2017). The LA is described as a set of principles that define the manner in which the underlying batch, serving and speed layers cooperate to achieve the desired outcome of big data processing applications (Marz & Warren, 2015). The fundamental interactions between these layers can be described as follows: the batch layer stores a master dataset in a system such as Hadoop HDFS and creates system-wide batch views through, for example, MapReduce (Hausenblas, 2013) or Apache Hive. Next, the serving layer indexes the batch views so they can be queried in an ad-hoc way (Hausenblas & Bijnens, n.d.) without re-reading the entire database, using software such as Apache HBase (Apache, 2008).
In addition to this, the speed layer compensates for the long running times of batch computations: it exclusively handles new data to create real-time views via software such as Storm or Apache Impala (Hausenblas, 2014). Lastly, batch and speed views are merged to form a complete answer to a user's query. All these components can be connected and unified using a fast and general-purpose cluster computing system such as Apache Spark (Spark Overview, n.d.). A high-level visualisation of these components can be seen in Appendix A. It is worth mentioning that the LA can handle both static and streaming data; however, for the latter, additional components such as Apache Kafka have to be integrated into the system infrastructure. In contrast to this, as part of the Kinesis package, AWS offers three tools: Amazon Kinesis Streams, Amazon Kinesis Firehose, and Amazon Kinesis Analytics, which aim to provide a holistic framework for big data management and can be used to implement end-to-end real-time applications (Al-Saadoon, 2017). One way of collecting data so it can be used inside the AWS Management Console, as well as by the variety of AWS services, is through Kinesis Firehose. It is a stream management service (AWS, 2018a) capable of capturing, pre-processing and loading data. However, if more flexibility is needed, Kinesis Streams, either video or text-based, can be used to create custom applications that integrate with popular stream processing methods and can load data into any data store (AWS, 2018b). Once data has been loaded into an AWS storage service, such as Amazon S3, it can then be processed using Kinesis Analytics. This enables the use of an interactive SQL editor, open-source Java libraries or other AWS services to analyse the data (AWS, 2018c).
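As a hedged sketch of what a custom producer application for Kinesis Streams might look like, the snippet below assembles the arguments for boto3's put_record call. The stream name and payload are hypothetical, and the actual AWS call is left as a comment so the snippet runs without credentials.

```python
# Sketch: the shape of a record an application might submit to a Kinesis
# data stream via boto3's put_record. Stream name and event are invented.
import json

def build_kinesis_record(stream_name: str, event: dict, partition_key: str) -> dict:
    """Assemble the keyword arguments expected by kinesis_client.put_record()."""
    return {
        "StreamName": stream_name,
        "Data": json.dumps(event).encode("utf-8"),  # Kinesis payloads are bytes
        "PartitionKey": partition_key,              # determines the target shard
    }

record = build_kinesis_record(
    "clickstream-demo",                             # hypothetical stream name
    {"user": "u42", "action": "page_view"},
    partition_key="u42",
)
# In a real application:
#   import boto3
#   boto3.client("kinesis").put_record(**record)
print(sorted(record))  # -> ['Data', 'PartitionKey', 'StreamName']
```

The partition key decides which shard receives the record, which is how Kinesis parallelises a stream across consumers.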
Traditionally, big data management implies creating custom streaming data pipelines that are capable of collecting and analysing data from a wide range of sources (AWS, 2017). In addition to this, the hardware infrastructure, servers and clusters, needs to be maintained, secured and monitored for possible failures; all of these tasks require significant resources (AWS, 2017). To tackle these issues, the AWS tools and Amazon Kinesis provide cloud-computing solutions that offer users a fully managed and highly scalable service with no minimum or up-front costs. Lastly, it is important to note that, based on the currently available AWS services, the principles that govern the Lambda Architecture can be replicated within the Kinesis and AWS platform.

Figure 1: Lambda Architecture replicated inside Amazon Kinesis (Al-Saadoon, 2017)

Question 2

Sub-question A.A

Based on the Hadoop documentation, a block represents the smallest unit of data in the filesystem (DataFlair, 2017a). All files on HDFS are first separated into data blocks, which are then distributed to multiple machines on the cluster. This is a vital concept within big data management, as it enables the processing of very large files, which can exceed an individual machine's storage capabilities. The size of data blocks is directly correlated with the performance of the overall system, as the number of blocks = collection size / block size (DataFlair, 2017a). Based on the exercise scenario, a block size of 64 KB would create 128 × 10^9 / 64 = 2 × 10^9 blocks. Furthermore, file sizes are rarely exact multiples of the block size, so an additional, partially filled block is created for the last segment of each file; moreover, each block is by default replicated three times for fault-tolerance purposes (Borthakur, 2014). Overall, the proposed setup would create over 2 billion data blocks, each with its own metadata and replicas, which would lead to huge overhead and network traffic (DataFlair, 2017a).
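The block-count arithmetic above, including the default 3-way replication, can be checked with a short Python sketch (the 128 TB collection size is inferred from the figures in the text):

```python
def hdfs_block_count(collection_bytes: int, block_bytes: int, replication: int = 3) -> int:
    """Total physical blocks HDFS stores for a collection: logical blocks
    (rounded up, since the last block of a file is partially filled)
    times the replication factor."""
    logical_blocks = (collection_bytes + block_bytes - 1) // block_bytes
    return logical_blocks * replication

KB, TB = 10**3, 10**12

# 128 TB collection with 64 KB blocks: 2 * 10^9 logical blocks,
# i.e. 6 * 10^9 physical blocks once 3-way replication is included.
print(hdfs_block_count(128 * TB, 64 * KB))  # -> 6000000000
```

Each of these blocks carries NameNode metadata, which is why the block count, not just the raw volume, drives the overhead.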
A better configuration would be to use the default block size of 128 MB (DataFlair, 2017a), which would create 128 × 10^6 / 128 = 10^6 blocks, a much more manageable amount.

Sub-question A.B

Whereas the block size is a physical representation of data on a machine, the InputSplit is a logical reference to the data that will be processed by an individual Mapper (DataFlair, 2017b). As a result, the split size directly affects the number of MapReduce tasks. The relation between the block and split sizes can vary depending on the cluster configuration and the data being processed; however, by default, the block and split sizes should be equal (DataFlair, 2017c). The rationale is that, by having equal values for both parameters, the cluster avoids unnecessary data transfers between machines. The exercise scenario assumes a split size of 64 KB and, while this aligns with the previous question, a better configuration would be to set both the block and split sizes to the default 128 MB. This would create 128 × 10^6 / 128 = 10^6 (total input size / split size) maps (DataFlair, 2017b). Because the "right level of parallelism for maps seems to be around 10-100 maps per-node" (Apache, 2014), a different, much larger split size should be tested. However, in this situation the performance is dependent on the hardware configuration of the machines that form the cluster.

Sub-question A.C

After the data has been split and passed through the maps to create the corresponding key-value pairs (DataFlair, 2017d), it is passed to a Reducer, which shuffles, sorts and then aggregates and reduces the data into a final output (Apache, 2018). The default number of Reducers is 1 (DataFlair, 2017d); however, a configuration that connects one map to one reducer is equivalent to a non-distributed, local system (Apache, 2018).
On the other hand, having a very large number of reducers would be detrimental to performance, as the system would run out of computational resources. While there is no standard for the number of reducers, according to the Apache documentation the "right number of reduces seems to be 0.95 or 1.75 multiplied by (<no. of nodes> * <no. of maximum containers per node>)" (Apache, 2018). Based on this equation, the correct number of reducers for the situation in this exercise would be greater than or equal to 950.

Sub-question A.D

There are three sets of APIs within Spark: RDDs, DataFrames and Datasets; however, as of Spark 2.0, the latter two have been unified. This was done to offer more ways of processing structured data as well as to limit the number of overlapping concepts (DataBricks, 2016). Firstly, RDD stands for resilient distributed dataset and represents a collection of elements distributed across the nodes of the cluster that can be operated on in parallel (Apache, 2016). Another vital aspect of Spark 2.0 is that users can switch between the three data structures seamlessly through API method calls. The reason for this is that DataFrames and Datasets are extensions of RDDs (DataBricks, 2016). The main differences between them are the level of control they enable for data manipulation and their ease of use. Because there is no information about the type of data within the collection of documents, it is a fair assumption to consider the data unstructured. Because of this, RDDs can be considered more useful. However, using DataFrames and Datasets is more user-friendly and in some situations can be highly efficient from a hardware resource usage point of view (DataBricks, 2016).

Sub-question A.E

According to the Apache Spark documentation, the .collect() method returns an array that contains all the elements of the Dataset. However, simply running the command wordCounts.collect() would attempt to transfer all contents of the Dataset into the application's driver process.
Depending on the number of distinct words in the overall collection and the hardware resources available, the command is likely to cause an OutOfMemoryError (Apache Spark, n.d.). This is because the method should be used only if the expected result is small and can be handled by a single machine (Apache Spark, n.d.). In this situation, the wordCounts Dataset can be transformed into an RDD using the .rdd method and then saved as a distributed file on HDFS using the .saveAsHadoopFiles() method (Apache Spark, 2016).

Sub-question B

A consistency model is defined by Tanenbaum and Steen as "essentially a contract between processes and the data store", where the processes agree to obey certain rules and the store promises to work correctly. A different way of conceptualizing a consistency model is from the point of view of the distributed system. Depending on the DFS configuration, data is most likely replicated over various nodes in the cluster for the purposes of availability, reliability and performance. Keeping this in mind, the problem of choosing the correct consistency model is equivalent to defining the manner in which replicas update themselves (Tanenbaum & Steen, 2007). Depending on the desired outcome, the level of consistency can be strong, where updates are propagated across the network as soon as possible, thus ensuring that the latest data is always available. At the opposite end of the spectrum, weak consistency models are characterized by the lack of simultaneous updates, or by updates whose conflicts can easily be resolved when they do happen. In the situation of a mobile email app, most operations revolve around reading data, and a client-centric consistency model can hide many inconsistencies in relatively cheap ways (Tanenbaum & Steen, 2007). This is aligned with the CAP theorem, which has shown that it is impossible to achieve both high availability and consistency in the same system in the presence of network partitions (Mehra, 2017).
Furthermore, the top priority of the system described in the exercise should be to always be available despite possible network errors and delayed email synchronisation. Because of this, "the best consistency property you can have in a highly available system is eventual consistency" (Tanenbaum & Steen, 2007). An ideal model for an email app is based on the Bayou system, which was designed to be used in conjunction with mobile computing, where network connectivity and performance issues are common (Terry, et al., 1994). Within this methodology, four client-centric consistency models can be identified: Monotonic Reads, Read Your Writes, Monotonic Writes and Writes Follow Reads. For the situation presented in the exercise, I believe that the best approach is to implement a system that supports Monotonic Reads and also incorporates elements of the Read Your Writes client-centric consistency model. The two features are closely related and together they define the order of read and write operations for both senders and receivers of emails. The core idea behind these client-centric models is that the email app has to look consistent to the user even if the underlying database has not finished updating. In practice this means that, as long as the same device is used for interacting with the app at different times and locations, subsequent uses must reflect the changes made in the past even if the replicas on the DFS have not yet been updated. To achieve this, the Monotonic Reads model states that once a process has read the value of a data item X, any subsequent read on X by the same process will return the same or a more recent value of X (Tanenbaum & Steen, 2007). This property defines how incoming emails can be read and accessed. Following such an approach, updates to distributed replicas can be done in an on-demand manner, only when they need additional data for consistency (Tanenbaum & Steen, 2007).
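The Monotonic Reads guarantee can be sketched as a client that remembers the newest version of each item it has observed and never serves an older one, even when a read is routed to a stale replica. All names and data below are hypothetical; a real system would track versions per session, as Bayou does.

```python
# Toy illustration of Monotonic Reads: reads by the same client never
# go backwards in version, even against an out-of-date replica.
class MonotonicReadClient:
    def __init__(self):
        self.seen = {}    # item -> newest version observed so far
        self.cache = {}   # item -> value at that version

    def read(self, item, replica):
        version, value = replica.get(item, (0, None))
        if version >= self.seen.get(item, 0):
            self.seen[item], self.cache[item] = version, value
            return value
        # Replica is behind what we have already seen: serve the cached,
        # newer value instead of going backwards.
        return self.cache[item]

fresh_replica = {"inbox": (2, ["mail-1", "mail-2"])}   # up-to-date copy
stale_replica = {"inbox": (1, ["mail-1"])}             # not yet synchronized

client = MonotonicReadClient()
client.read("inbox", fresh_replica)          # observes version 2
result = client.read("inbox", stale_replica) # stale replica, read stays monotonic
print(result)  # -> ['mail-1', 'mail-2']
```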
In practice, assuming a user read their emails in location A without altering anything, they will still have access to that version regardless of location, time or internet connection. Furthermore, the system should incorporate features of the Read Your Writes consistency model, which guarantees that "the effect of a write operation by a process on data item X will always be seen by a successive read operation on X by the same process" (Tanenbaum & Steen, 2007). This has the practical effect that local modifications persist when replicas cannot be updated due to external factors such as network connectivity issues. In the email application, this means that users can make modifications locally and attempt to send emails even without an internet connection; once a connection is available, all changes will be synchronized.

Question 3

Sub-question A

The situation presented in this exercise represents a classic big data management challenge, in the sense that the initial purpose of the data collection was not defined, and the collection was done in the manner that best fits the devices that generate the data. This is a reversed way of attempting to solve a problem: first collecting as much data as possible and only later trying to figure out the best way to use it, rather than having a clear end-goal and tailoring the data generation to a specific scope. Before any kind of analysis can be done and potential solutions explored, it is important to review the type of data available. While an in-depth description of the data can be seen in Project 2, the following paragraphs present a summary of the datasets and the manner in which they can be combined in order to evaluate the evolution of course attendance.
Students have been given access to three datasets, Rooms, Meta and Readings, which can be used to establish the following relations: from Rooms, all ITU scheduled activities, e.g. lectures and labs, are connected to a specific date and room number; from the Meta dataset, all rooms are associated with a unique router name; and lastly, from Readings, each unique router name is related to a list of all connected users for each timestep throughout the day. Based on these relations, it is possible to connect the datasets over the course of a semester to generate a user-specific dataset that tracks a user's complete history of WiFi connections, including the router they were connected to and its location. From this, the general location of a user can be identified for each timestep in the database. To be able to evaluate course attendance, a different dataset is needed that can tell, for any student, which courses they are enrolled in and when all the lectures and labs are scheduled. ITU already has mit.itu.dk, which keeps track of student enrollments. Based on this and the Rooms data, an intermediary dataset can be created that connects a user's unique ID to the courses they should be attending and all dates when this should happen. From a user's complete WiFi history and the schedule of all lectures and their locations, the databases can be cross-referenced to figure out whether or not a student was connected to a WiFi router that is close to the lecture location. The process can then be replicated for all students to obtain the overall evolution of course attendance. On the other hand, there are a few significant limitations created by the data available, the physical layout of the ITU building and the manner in which routers and WiFi signals function. First and foremost, ITU does not have a router in each room and, as a result, it is impossible to figure out the exact location of a student; only their general location can be inferred from the router they are connected to.
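The cross-referencing described above can be sketched with toy data; all records, field names and identifiers below are hypothetical and far simpler than the real datasets.

```python
# Toy join of the Rooms, Meta and Readings relations plus an enrollment
# table, producing per-student attendance flags. Data is invented.
rooms = [{"date": "2018-10-01 10:00", "room": "AUD1", "course": "BDM"}]  # scheduled activities
meta = {"AUD1": "router-17"}                                             # room -> router
readings = {("router-17", "2018-10-01 10:00"): {"user42", "user7"}}      # users seen on router
enrolled = {"BDM": {"user42", "user99"}}                                 # from e.g. mit.itu.dk

attendance = {}
for lecture in rooms:
    router = meta[lecture["room"]]
    connected = readings.get((router, lecture["date"]), set())
    for student in enrolled[lecture["course"]]:
        key = (student, lecture["course"], lecture["date"])
        attendance[key] = student in connected

print(attendance[("user42", "BDM", "2018-10-01 10:00")])  # -> True
print(attendance[("user99", "BDM", "2018-10-01 10:00")])  # -> False
```

At semester scale the same joins would run as distributed operations rather than Python dictionaries, but the relational logic is identical.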
Furthermore, even assuming there were a router in each room, or that lectures were held only in rooms with WiFi routers, there is no guarantee that a device will connect to that particular router. In addition to this, at any given time and for most classrooms, there are multiple routers that a device can connect to. This is a beneficial feature of the building, as it ensures maximum WiFi coverage with a stable and optimal connection. Yet another challenge of using WiFi data to calculate course attendance is the situation of role-model students, who do not spend time on other activities during lectures and thus may not connect to the WiFi at all. At the other end of the spectrum are students who do not bring their laptops to lectures or who only use their phone or tablet for engaging with the course. Students whose devices access the internet via a SIM card with an internet subscription, without connecting to the WiFi network, would also be unaccounted for in the provided datasets. This last paragraph represents a disclaimer of personal experience. In late 2016, ITU tested an attendance tracking system that asked students to swipe their access cards against a custom-built RaspberryPi-powered device that gave students audio-visual feedback on whether or not their card swipe was successfully logged in the system. The project was a collaboration between ITU, Ethos Lab and Pit Lab. This is a much more efficient system, as the data generated is representative of whether or not a particular student was present in a specific class at a particular time.

Sub-question B

The situation described in this exercise could be characterised as the first step toward a mass-surveillance Orwellian society. However, the exercise makes a number of assumptions which, in real life, would heavily limit the government's ability to spy on its citizens. First, it is unreasonable to expect that cars will directly send their location history to a centralised government database.
This could be achieved either by a collaboration between car manufacturers and governments without the knowledge of the public, or by asking citizens to install a device in their car that would enable this sort of data collection. The former scenario implies that cars would have to be specifically customised for the country where they are sold; in practice this would generate additional production costs and contradictory goals for manufacturers. The latter scenario is more plausible, as Ethernet USB adapters could be used to make sure all cars are connected to the internet and can send their location directly to a governmental database. On the other hand, this would not work with older cars that do not have an on-board computer; in that situation a black-box data collection device would have to be used, which would face issues in terms of power consumption and internet connectivity. These are only a few of the issues that arise from hardware limitations and would heavily impact the performance of such a system. Given enough funding, prototypes and citizen willingness to participate, the technical challenges could be overcome. Assuming that the infrastructure for the system functions optimally, the government's ability to spy on its citizens becomes a trivial task. Because the exercise implied that data is captured from the point of view of the car rather than from sensors or cameras throughout the city, it becomes much easier to generate a car's complete location history. This can be done by simply appending all data sent by a car into a single file. Even if each car has a completely anonymised identifier that is not related to the owner of the vehicle in any way, someone who has access to the vehicle's location history could figure out who the owner is.
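The re-identification risk can be illustrated with a toy example: even under a pseudonymous car ID, simply counting the most frequent trip endpoints makes the owner's home address stand out. All locations and trips below are invented.

```python
# Toy illustration: frequent trip start/end locations re-identify a
# pseudonymous vehicle. Data is hypothetical.
from collections import Counter

trips = [  # (start_location, end_location) for pseudonymous car "c9f2"
    ("driveway-A", "office-park"),
    ("driveway-A", "supermarket"),
    ("office-park", "driveway-A"),
    ("driveway-A", "office-park"),
]

endpoints = Counter()
for start, end in trips:
    endpoints[start] += 1
    endpoints[end] += 1

# The top endpoint is very likely where the owner lives.
print(endpoints.most_common(1))  # -> [('driveway-A', 4)]
```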
The biggest difference between this dataset and the one in Project 3 is that the latter did not record the very first and last locations of a vehicle, corresponding to the time and place where the car was started or parked. By logging data at these two timesteps, someone with access to the system can easily figure out the most frequent start and end times and places, which would correspond to where a citizen lives and works, as well as the top places they drive to. The pseudonymisation of car identifiers would only slightly hinder the ability to figure out who the owner of a vehicle is. This is because the location history itself can reveal personal information. Citizens in rural areas, or who live in houses with driveways, can be accurately identified without any external data, because the most frequent start and end positions of a car would correspond to a remote or private driveway, which is a clear indication of who the owner is. On the other hand, citizens in urban areas are more protected from this type of identification because the most frequent start and end data entries would correspond to shared parking lots, where the owner could be one of multiple people. While the actual name of the owner would remain unknown in both situations, being able to accurately calculate, without their consent, where someone lives, works and what places they frequently go to represents an invasion of personal privacy.

References:

1. Amazon Kinesis. (2013, November 4). Retrieved December 16, 2018, from https://aws.amazon.com/kinesis/
2. Marz, N., & Warren, J. (2015). Big data: Principles and best practices of scalable realtime data systems. Sebastopol: O'Reilly Media.
3. AWS. (2017, July). Whitepaper: Real-time Streaming Analytics – Amazon Web Services (AWS). Retrieved December 16, 2018, from https://aws.amazon.com/kinesis/whitepaper/
4. Hausenblas, M. (2013, November 12). Applying the Big Data Lambda Architecture.
Retrieved December 16, 2018, from http://www.drdobbs.com/database/applying-the-big-data-lambda-architectur/240162604
5. Hausenblas, M., & Bijnens, N. (n.d.). Lambda Architecture. Retrieved December 16, 2018, from http://lambda-architecture.net/
6. Apache. (2008, March 28). Apache HBase – Apache HBase™ Home. Retrieved December 16, 2018, from https://hbase.apache.org/
7. Hausenblas, M. (2014, July 24). Lambda Architecture with Apache Spark. Retrieved December 16, 2018, from https://speakerdeck.com/mhausenblas/lambda-architecture-with-apache-spark
8. Spark Overview. (n.d.). Retrieved December 16, 2018, from https://spark.apache.org/docs/latest/
9. Al-Saadoon, L. (2017, November 29). Unite Real-Time and Batch Analytics Using the Big Data Lambda Architecture, Without Servers! | Amazon Web Services. Retrieved December 16, 2018, from https://aws.amazon.com/blogs/big-data/unite-real-time-and-batch-analytics-using-the-big-data-lambda-architecture-without-servers/
10. AWS. (2018a). Amazon Kinesis Data Firehose. Retrieved December 16, 2018, from https://aws.amazon.com/kinesis/data-firehose/
11. AWS. (2018b). Amazon Kinesis Data Streams. Retrieved December 16, 2018, from https://aws.amazon.com/kinesis/data-streams/
12. AWS. (2018c). Amazon Kinesis Data Analytics. Retrieved December 16, 2018, from https://aws.amazon.com/kinesis/data-analytics/
13. DataFlair. (2017a, April 26). Data Block in HDFS | HDFS Blocks & Data Block Size. Retrieved December 16, 2018, from https://data-flair.training/blogs/data-block/
14. Borthakur, D. (2014, November 13). HDFS Architecture. Retrieved December 16, 2018, from https://hadoop.apache.org/docs/r2.6.0/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html
15. DataFlair. (2017b, April 25). InputSplit in Hadoop MapReduce - Hadoop MapReduce Tutorial. Retrieved December 16, 2018, from https://data-flair.training/blogs/inputsplit-in-hadoop-mapreduce/
16. DataFlair. (2017c, April 30). MapReduce InputSplit vs HDFS Block in Hadoop.
Retrieved December 16, 2018, from https://data-flair.training/blogs/mapreduce-inputsplit-vs-block-hadoop/
17. Apache. (2014, July 17). How Many Maps And Reduces. Retrieved December 16, 2018, from https://wiki.apache.org/hadoop/HowManyMapsAndReduces
18. DataFlair. (2017d, April 19). Hadoop Reducer - 3 Steps learning for MapReduce Reducer. Retrieved December 16, 2018, from https://data-flair.training/blogs/hadoop-reducer/
19. Apache. (2018, November 13). Class Reducer. Retrieved December 16, 2018, from https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapreduce/Reducer.html
20. DataBricks. (2016, July 14). A Tale of Three Apache Spark APIs: RDDs vs DataFrames and Datasets. Retrieved December 16, 2018, from https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html
21. Apache. (2016, June 21). Class RDD. Retrieved December 16, 2018, from https://spark.apache.org/docs/1.6.2/api/java/org/apache/spark/rdd/RDD.html
22. Apache Spark. (n.d.). Abstract Class RDD. Retrieved December 16, 2018, from http://spark.apache.org/docs/2.1.0/api/scala/index.html#org.apache.spark.rdd.RDD@collect():Array[T]
23. Apache Spark. (2016, November 07). Class JavaPairDStream. Retrieved December 16, 2018, from https://spark.apache.org/docs/2.0.2/api/java/org/apache/spark/streaming/api/java/JavaPairDStream.html#saveAsHadoopFiles
24. Tanenbaum, A. S., & Steen, M. V. (2007). Sistemas distribuídos: Princípios e paradigmas [Distributed systems: Principles and paradigms]. São Paulo: Pearson Prentice Hall.
25. Mehra, A. (2017, September 06). Understanding the CAP Theorem - DZone Database. Retrieved December 16, 2018, from https://dzone.com/articles/understanding-the-cap-theorem
26. Terry, D., Demers, A., Petersen, K., Spreitzer, M., Theimer, M., & Welch, B. (1994). Session guarantees for weakly consistent replicated data. Proceedings of 3rd International Conference on Parallel and Distributed Information Systems.
doi:10.1109/pdis.1994.331722

Appendix:

Appendix A: Lambda Architecture overview. Directly copied from http://lambda-architecture.net/

Project 1 - Digital Tracing

Group 10:
Andreas Devald Bisgaard
Pernille Gross Olesen
Sofie Sung Mee Rømer
Vlad Alexandru Ilie

Critical Big Data Management, 2018
Technical Big Data Management, 2018
IT University of Copenhagen

Word count: 4,616
Symbols: 28,245

Contents:
1. Introduction
  1.1. The research questions
2. The Data Tracing Process
3. Data Traces Limiting
4. Personal Data as a Social Responsibility
5. Databox as a solution to the data problem?
6. References

1. Introduction

We rely more and more on information communication technologies (ICT) as they are continuously integrated into everyday processes. It is beneficial to do so, partly because they are drivers of efficiency and connectivity. As we have adopted these technologies, they have become an integral part of our everyday life, essentially making it very difficult for a person living in a developed country to avoid the use of ICT. In Denmark, communication with the government and municipality has moved online and can only be avoided if an exemption is granted. With the use of ICT also comes the concern of being monitored. We have embraced ICT to an extent where companies such as Google, Facebook and Apple are able to get a fairly clear picture of who you are as a person and what your tendencies are, provided they want to and you have generated enough data. Countless traces are monitored and tracked when one browses the internet and uses different services. Recently, many have been made aware of what Facebook data can be used for, based on the Facebook-Cambridge Analytica case in relation to the 2016 election in the United States (The Guardian, 2018). This has brought previously unseen attention to tracking, data analysis, and the potential misuse of personal information collected online.
In Denmark, there is a current case related to the tracking of users through ICT devices. One might expect tracking or logging to happen on privately owned platforms, but the Danish government is also forcing telecom companies to log all mobile activity, even though this has been ruled illegal by the European Union (Tele2, 2016). The Danish government makes no differentiation in its logging; therefore it can be defined as mass surveillance. The Minister of Justice has personally forced the telecom industry to continue the illegal logging until the European Union provides a widely applicable solution (Poulsen, 2017). The argument for mass surveillance is that it keeps the Danes safe and free of terror; the counter-argument is that there is no proof that the nation-wide surveillance has prevented any severe crimes (Gjerding, 2014), such as terror, and that privacy without surveillance is a fundamental human right. The "union against illegal logging" has now filed a citizen-supported lawsuit against the state (Ulovliglogning, 2018). When the Minister of Justice makes a political decision on whether to continue the logging, he is weighing the advantages against the disadvantages, and has found the advantages of continued logging to be more important than the disadvantages. The same process is essentially carried out by any ICT user, whether one is aware of it or not. As aware users, we are constantly assessing the trade (data vs. service) at hand, but what do we actually give away by accepting the terms of different ICT services? How extensive is the personal data trace when we browse online? Is it valuable data for the companies? Are we as users personally compromised? And finally, is there a viable solution to collecting and controlling your own personal data?

1.1. The research questions

The point of interest for this project will be the investigation of the group members' production of personal data.
By tracing the everyday life of the group members, a story may be unfolded that creates awareness, but also raises more questions. In this investigation, a limitation of data production will be tested, followed by a discussion of the different reflections that occurred during the experiment. This will lead to broader ethical discussions surrounding personal data and the possible solutions that exist. Furthermore, the concept of social responsibility in relation to personal data will be discussed. Finally, the Databox project will be discussed in light of our previous findings, to investigate the possibility of the Databox being a plausible solution in the personal data debate.

2. The Data Tracing Process

The manner in which the data collection was conducted reflects our varied backgrounds; all members of the group took different approaches and tried various software in order to allow their digital traces to be collected, as well as, later on, to take back control over what data is collected about them. Regardless of the approach, we have identified three main categories of digital traces. These have been delimited based on the device they reside on and who has access to them:

Local Traces:
1. Phone number
2. Operating system updates
3. Logs
  a. MAC Address Logs on Router
  b. Date/Time Access Logs of Internet
  c. Visited IP Address Logs
  d. Internet Services Access Logs
  e. Device Name Logs on Router
  f. Browser History
4. MAC Address
5. IMSI
6. Language
7. Flash Cookies, Local Storage and Web Beacons
8. Computer serial number
9. Computer specification (Hardware)
10. Cookies
11. Browser History
12. Browser Fingerprint

All of these digital traces can be found locally on any computer, mobile or smart device. The owner of the device is the one who has ownership rights over this information.
Even though this information can be found locally on any smart device, operating system publishers, such as Microsoft, routinely collect information about their users through more or less anonymised usage reports. The purpose of these is for software manufacturers to be able to identify potential problems with their system across many different users and under different conditions, in order to better protect and serve their customers. This is one of the only acceptable scenarios in which local traces can be used to bring a surplus of value for both the organisation and the customer.

Network Traces:
13. Internet service access logs
14. Device name logs on the router

This category of traces is created and managed by the network administrator and the internet service provider (ISP). For both public and private networks, there exists a domain controller that manages and keeps track of all user connections and their access points to the specific network. It is highly important to note that, regardless of the level of tracking and control enforced by a network administrator, the ISP has complete access to its users' internet traffic.

Figure 1 presents a graph visualisation of a user's internet history, where the small white triangles represent third-party components. While some of them are vital to the optimal functioning of websites and are shared across seemingly completely separate websites, others are used for tracking user usage statistics; one of the most common third-party website plugins is the Google Analytics software used for marketing research.

Internet Traces:
15. Visited IP address logs
16. Credit card / debit card / cash
17. Date and time
18. Application access logs
19. Application updates

This category of traces is outside the control of the user. Once created, it is very difficult, if not impossible, to clear.
These are recorded simply by using devices and software that require an internet connection in order to function. All of these categories represent the minimum amount of digital traces created just by having a smart device; in reality, the amount of traces produced far exceeds these entries.

Figure 1: Graph visualisation of a user's internet history, where the white triangles are third-party connections.

3. Data Traces Limiting

Despite the large amount of digital traces created while using any smart device, a significant portion of these are stored locally on the device used and are inaccessible to anyone but the owner. It is theoretically possible for software publishers to secretly collect data from someone's device; however, this is highly illegal and punishable under the General Data Protection Regulation (GDPR). Moreover, local traces, such as cookies and internet browsing history, contain information that could lead to identity theft; this category is most commonly under attack by illegitimate third parties in the form of phishing schemes, malware and predatory ads. In order to protect yourself from these incidents, there is a certain degree of social responsibility one must consider when using any smart device. Even though local traces on one's personal computer can be heavily reduced by accessing it via an offline account, there are certain traces that cannot be avoided. For example, one does not need a Microsoft account in order to use Windows, only the login credentials of the administrator account. Despite this, Microsoft will still log some information, such as applications and services used as well as the internal search history. By using an offline account, your device ID will be set to "null" and only Microsoft- and Google-published and trusted apps will be logged.
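The effect of this kind of nullification can be illustrated with a minimal sketch. The record below is hypothetical, with field names and values invented for illustration; it is not Microsoft's actual telemetry format:

```python
# Hypothetical telemetry record, loosely modelled on the attributes
# discussed in this section (all values are invented).
search_record = {
    "searchTerm": "bus schedule copenhagen",
    "date": "2018-09-20T14:02:00",
    "deviceId": "d3v1c3-1234",
    "latitude": 55.6761,
    "longitude": 12.5683,
}

# Attributes an offline account / data-traces-limiting setup can deny:
DENIED_FIELDS = ("deviceId", "latitude", "longitude")

def limit_traces(record, denied=DENIED_FIELDS):
    """Return a copy of the record with denied attributes nullified."""
    return {k: (None if k in denied else v) for k, v in record.items()}

limited = limit_traces(search_record)
# Note: the search term itself survives, mirroring the observation that
# the search term is logged regardless of account settings.
print(limited)
```

The point of the sketch is that limiting removes identifying attributes (device ID, location) but not the activity itself.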
No matter the settings on your account, Microsoft will log the actual search term and whether or not you accessed any webpages in this way. A general assumption is that if you have to sign in with an account, your activity is heavily tracked. The reason for this is that most companies and websites have some form of key performance indicators that are used to analyse how successful their website or product is. To avoid direct activity tracking, one can use a variety of web extensions to anonymise or securely encrypt one's internet traffic. Even if your activity is not tracked directly by the web browser, it is still tracked by the internet service provider. This might not be done with the intent of analysing or selling that information, but it has to be done to ensure that all subscribers receive the services they are paying for. At a corporate level, this information can be vital for implementing the correct strategy and for deciding whether changes need to be made. Another reason is marketing and targeted advertisement. Moreover, a lot of websites employ some form of analytics tracker used to collect as much information as possible about how the website is used, in the hope that more data means better data and that, at some point, somebody in the organisation will be able to analyse it and extract some form of information from it.

Figure 2a (left): example of product and service usage data collected by Microsoft while using data traces limiting techniques. Figure 2b (right): example of search request and query data collected by Microsoft while using data traces limiting techniques.

Figures 2a and 2b present an example of personal data collected about the usage of certain Microsoft services.
It is important to note that by employing data traces limiting techniques, certain attributes of the collected data can be denied permission to be collected. Being able to nullify the deviceID in both datasets, as well as the latitude and longitude in the second one, is a meaningful step towards obtaining complete control over one's personal data. Paying close attention to the type of digital traces we created, as a group we found it surprising just how many third-party applications have access to one's data. All data in this project has been collected using suggestions provided by the myshadow.org1 website, the Firefox add-on Lightbeam, as well as the official privacy management service by Microsoft. Not all of these third parties are used for the purpose of logging your activity; some also make sure the website functions as intended. Another surprising aspect is that without some kind of data tracking or ad blocking software, one's device and digital traces are so much more vulnerable. Because of this, as a group we believe that it is our own responsibility to protect our devices and personal information from phishing schemes and malware. We found it close to impossible to completely reduce the amount of digital traces produced, as one would have to stop using any electronic smart device. Since we acknowledge that smart devices, and ultimately the digital traces generated, are an integral part of our everyday lives, only one group member pursued the effort of limiting her digital trace. The limiting was only possible due to specific circumstances; it was tested on a Sunday, when the group member had no obligations that required the use of smart devices. Striving to restrict one's digital trace on a normal school or work day would be close to infeasible, which deterred the remaining group members from pursuing a limited digital trace. Hence, we will now argue why it is a social responsibility.
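The web of third-party connections that Lightbeam visualises (Figure 1) can be sketched as a simple inverted index. The site and tracker names below are hypothetical; the code only illustrates how a tracker embedded on several sites can correlate a user's browsing across those sites:

```python
from collections import defaultdict

# Hypothetical browsing data: each visited site and the third-party
# domains it loaded (the "white triangles" in a Lightbeam-style graph).
visits = {
    "news.example": {"analytics.example", "ads.example", "cdn.example"},
    "shop.example": {"analytics.example", "payments.example"},
    "blog.example": {"analytics.example", "ads.example"},
}

# Invert the graph: which first-party sites does each third party see?
reach = defaultdict(set)
for site, third_parties in visits.items():
    for tp in third_parties:
        reach[tp].add(site)

# Third parties present on more than one site can correlate
# a user's activity across those sites.
cross_site = {tp: sites for tp, sites in reach.items() if len(sites) > 1}
for tp, sites in sorted(cross_site.items()):
    print(f"{tp} can observe visits to: {sorted(sites)}")
```

In this invented example, "analytics.example" appears on all three sites and can therefore link the visits together, while a third party loaded by a single site cannot.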
1 https://myshadow.org/

4. Personal Data as a Social Responsibility

Working together as a group made it obvious that we are individuals with highly different backgrounds. As seen in previous discussions, our opinions on the subject of digital traces are alike, but we view the core of the issue from different angles. This means that words like "social responsibility" may carry different meanings for the group members. When discussing whether privacy should be a social responsibility, we first looked at the meaning of the term. "Social responsibility is an ethical framework and suggests that an entity, be it an organization or individual, has an obligation to act for the benefit of society at large. Social responsibility is a duty every individual has to perform so as to maintain a balance between the economy and the ecosystems" (Investopedia). This created new reflections from the critical side of the group, and a debate around the idea of whether or not the creation of personal data is a distant and uncontrollable matter. By tracking our digital footsteps, the volume of data tracking became apparent, which resulted in the notion that making it a social responsibility could create tools for control. However, what degree of social responsibility should apply? Social responsibility also has a more corporate perspective: "Social responsibility is the idea that businesses should balance profit-making activities with activities that benefit society. It involves developing businesses with a positive relationship to the society in which they operate" (Investopedia). The focus is then taken from the individual and placed upon the institutions that use our personal data. This could in turn mean that the individual can expect certain actions from the companies, making it a social responsibility.
This seemed to be the agreement within the group when thinking of personal data as a social responsibility. However, would this mean that certain rules would be enforced? And would there be consequences for not following them? As mentioned in the reflections, none of the members of the group are "afraid" of the internet. We are all aware of the tracking that in turn can provide a service. This may not be the case for other individuals. Personal data and data protection have become a popular topic on many different platforms, which may have the effect of giving the internet a negative reputation. One statement that comes to mind is from Evelyn Ruppert's article "Data Politics": "Data politics that emerges from this reaction is one of urging people to protect themselves as individuals. It is almost as if the narrative says 'yes, there is collective work that needs done but ultimately it is up to you to change your behaviour to protect yourself from the dark forces of the Internet'" (Ruppert et al., 2017, p. 25). As Ruppert emphasises, a potential alienation from the web can create a rift that may destroy many of the services that the internet and certain companies can provide. Furthermore, Ruppert also raises the question of whether the dangers and insecurities are a worthy trade-off for the many benefits of having people analyse our personal data. Returning to the argument at hand, could making personal data a social responsibility be a positive approach to a solution for the many debates? Our argument entails that the social responsibility is split in two. As mentioned, a specific rule set could be necessary, so that people may learn how to handle their data correctly. This led to the discussion that it should be possible to spread awareness of the volume of data that an individual creates and shares. As seen in the reflections surrounding the data limiting, it became apparent just how vast the amount of data is and how difficult controlling it can be.
Creating awareness may help reduce the uncertainty surrounding the internet and personal data. Furthermore, it could be expected that increased awareness will help limit the personal data that the individual generates, accompanied by more common sense when using the internet. The other aspect of the social responsibility is the corporate angle. This means that the individual may require specific measures that companies using personal data must uphold. This argument is based on the fact that, when reading reports about private and personal data from companies such as Apple, eBay and Amazon, most of the effort is centred around the protection of the data, while nothing concerns a plan for a potential breach of security or leak of data. In essence, it could be expected that companies need several processes for responding to data breaches, in line with the GDPR, especially in terms of who will be responsible for handling the incident. These processes could focus on communication and on the channels and platforms that will be used in response to a breach. By doing this, a company may protect personal data as if it were its own, and by placing data protection for their customers on a level with product quality, it may help change the mindset that is negative towards sharing personal data with services.

5. Databox as a solution to the data problem?

As part of this project's overall discussion, the following section presents our discussion of the Databox project. The purpose of this technology is to create a personal networked device that collects and mediates access to individuals' personal data (Haddadi et al., 2015), while allowing users to "regain" control over the digital data they make accessible. Thus, Databox is an attempt to create a compromise for users and their privacy when they use different digital services.
In such situations, users may find themselves in a dilemma by having to hand over their personal data in order to use these services. Therefore, Databox seeks to be highly beneficial for both users and the organisations that aim to get access to individuals' data, by trying to find a compromise between personal privacy and providing the data necessary for the application or service to run. One proposal is that individuals wishing to use various digital services can continue to do so by "paying" with their private data, while others can limit access to their data. This proposal could potentially offer private users a greater degree of privacy and control over their data when using different services. As shown in the report by Haddadi et al. (2015), similar concepts have previously been attempted with varying degrees of success. These previous attempts have, however, produced several interesting insights: for example, it has been found problematic that all private data about an individual is gathered in one place, creating a form of "treasure box" that could potentially tempt hackers. Therefore, security and accessibility are extremely important, and the handling of a security breach would depend on comprehensive problem solving. The discussion about individuals' personal data and its security, or the lack of it, has become a global phenomenon in recent years, with several examples of security breaches in the handling of people's private data. For example, the previously mentioned case about Facebook revealed how private users' personal data had been passed on to third parties without users being aware (Koroluk, 2018). This case is not unique, but rather a symptom of a digital world where the use of different technologies and services often comes at the expense of users' private data.
The case about Facebook raises questions such as how the storage of users' private data should be handled, how it should be managed, and by whom.

Figure 3: Databox architecture.

The people behind Databox see an advantage in storing encrypted data on local user servers, letting the cloud service be an optional choice, in order to drastically reduce the number of organisations or third parties having access to your data. In other words, they believe that users' encrypted data should be stored on the users' own servers rather than in the cloud. In the proposed architecture of the "Databox setup" (Databox, 2018), one will have an encrypted backup of the "IoT Hub" stored in the cloud. The Databox can collect and piece together information from many different sources, be it your smart TV, the sensors in your house or any other IoT-enabled device. If personal data is stored in the cloud, it should be encrypted so that other providers are not able to misuse it. When using an internet service at present, there are not many permission levels to choose between. With the Databox design, one has the ability to control, on a case-by-case basis, whether to accept processing of one's data, providing awareness and control over who has access and to what data. Storing and processing personal data requires many resources from organisations, since they must be compliant with the new General Data Protection Regulation ("EUR-Lex", 2016). To be compliant with the GDPR, one must account for data collection and usage. The Databox design enables apps to be shipped to the user and to process the data locally on the user's own computer, returning only what is required to run the service. This would free up the aforementioned compliance resources, since the data would still be stored only locally by the user and not by the company.
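The "process locally, return only what is required" idea can be sketched as follows. The data source, the app, and the aggregate it returns are all hypothetical; this is an illustration of the principle only, not the actual Databox API:

```python
# Hypothetical local data store on the user's Databox: raw readings
# from a smart electricity meter (values invented for illustration).
meter_readings_kwh = [1.2, 0.9, 1.4, 2.1, 1.7, 1.1, 1.6]

def tariff_app(readings):
    """A third-party 'app' that runs ON the user's own device.

    It can see the raw readings locally, but it returns only the single
    aggregate the service needs (average daily usage), never the
    fine-grained trace that could reveal when the user is at home.
    """
    return round(sum(readings) / len(readings), 2)

# Only this value ever leaves the Databox:
result_sent_to_provider = tariff_app(meter_readings_kwh)
print(result_sent_to_provider)  # 1.43
```

The provider receives enough to run its service, while the raw data, and any inferences it would support, stays on the user's hardware.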
An important aspect of Databox is its applicability, which is why the system design should play a major role in its construction. This relates to the fact that it would be useless to try to restore control to the users if they do not understand what their data can reveal about them and which organisations are collecting it. This raises the question of whether the general user actually has the right competencies to handle such digital control, without, for instance, manuals describing procedures for security breaches. The idea of allowing users to "regain" power is considered a positive contribution to today's data discussions, but it is also recognised that there may be several implications associated with it. For example, it could be considered a problem if all individuals had the opportunity to edit their data as they please, trying to present a polished rather than authentic self to the people around them. That said, it is good that individuals have the ability to remove data they want to keep private, isolating that piece of data from being part of a trade for a service. We do believe that this form of "control transfer" to the users could potentially create a positive effect, meaning that the users will feel empowered by "regaining" control over the information about them, as well as over how and in which contexts it is used. This could lead to a changed perception of consent, from renouncing information to deriving specific assumptions about the individual. But in order to get to this stage, the ethical implications and the technical difficulties of collecting and joining data from several different sources need to be taken into consideration, as well as the extensive maintenance and updating of these data.
Moreover, even if the technical formalities work as planned, a project like Databox will still depend on both private individuals and organisations finding it beneficial before it is fully embraced. An essential question is whether a technological solution like the Databox could help users become aware of the data collected about them by major organisations such as Facebook and Google. In relation to this, we do not assume that a solution like this will make users responsible for their data already owned by large internet organisations. With the help of APIs, Databox will be able to assemble and merge all data into one place, even if this data is already stored elsewhere by other major organisations. As such, a technology like Databox could function as a storage house that retrieves data from different data sources and stores it in a way that is relatively easy to access. Some companies do already share certain types of data, but data is mainly tied to an owner, which is why the majority of data analysis is performed "inside" these companies' own databases. Should the Databox become a reality in the future, one can imagine how organisations would be able to make business decisions based on data that is directly relatable to one specific individual, although only the data necessary to provide the service should be processed, which might not include any personally relatable data. According to Danezis and Gürses (2010), privacy can be defined as confidentiality, or as they explain it: "(...) privacy is therefore defined as avoiding making personal information accessible to a greater public. If the personal data becomes public, privacy is lost." (ibid: 4). In addition, privacy is explained as a form of control in the sense of: "(...) the right of the individual to decide what information about himself should be communicated to others and under what circumstances." (ibid: 7).
If the privacy-as-confidentiality perspective is applied to Databox, one could argue that it protects private users by letting them decide who their data is shared with and for how long. In this way, the users themselves have control over their private data; even if they allow an organisation access, they would still be able to control that access. If the Databox is viewed through the perspective of privacy as control, one could point out the benefit for individuals of being able to select what information should be available, to whom and under what circumstances. Nevertheless, one could question how a solution like Databox could prevent large organisations, such as Facebook or Google, from accessing or storing people's private data. Thus, it is our point that a technological solution such as Databox might be able to assist users with data protection against smaller organisations, whereas it seems inadequate against larger organisations such as Facebook and Google, which we constantly use at the expense of our private data. Therefore, we argue that Databox is only a partial solution. While the data collection and the manner in which it offers case-by-case data control are undoubtedly useful and a great step in the direction of personal data control, the manner in which the backup structure is handled represents the main issue with the system. The idea of cloud storage went through a period of over-optimistic enthusiasm, but storing a complete backup of one's entire digital data on a cloud platform runs counter to the purpose of data protection. Assuming that the encryption system is highly reliable and will never fail, for example AES-256, and that the servers belong to a reputable, highly trusted provider, Databox can provide a complete solution to the problem of loose personal data.
However, in real life, protection systems can fail and servers can be hacked; furthermore, because data is stored in the cloud, the hosting company has access to all data on its systems and, in some situations, is required by law to detect illegal activity. One solution to these issues would be for Databox to incorporate a private home server that is fully controlled by the customer. To ensure complete protection, Databox could offer an implementation service where the initial personal server is configured and deployed privately. To sum up, the focal point of Databox is to offer an alternative way in which private data is stored and managed. If organisations continue to store private data and, furthermore, control and determine how the data is used, this could have major implications for the users and their stored data in case of inconsistencies between organisations and third parties. However, we believe that such technology could potentially help neutralise this control and some of its ethical implications.

6. References

Danezis, G., & Gürses, S. (2010). A critical review of 10 years of privacy technology. Proceedings of Surveillance Cultures: A Global Surveillance Society, 1-16.

Databox. (2016, October). About. Retrieved December 16, 2018, from https://www.databoxproject.uk/about/

Gjerding, S. (2014, March 17). Politicians want to stop pointless surveillance. Information. Visited 21-09-18, https://www.information.dk/indland/2014/03/politikere-stoppe-nyttesloes-overvaagning

EUR-Lex: Access to European Union law. (2016, May 4). Retrieved December 16, 2018, from https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=OJ:L:2016:119:TOC

Haddadi, H., Howard, H., Chaudhry, A., Crowcroft, J., Madhavapeddy, A., & Mortier, R. (2015). Personal Data: Thinking Inside the Box. Visited 20-09-18, https://arxiv.org/pdf/1501.04737.pdf

Koroluk, K. (2018).
Data breaches a concern for construction companies. Daily Commercial News, 91(72), 1-2.

MyShadow. Trace my shadow. Visited 21-09-18, https://myshadow.org/trace-my-shadow

Poulsen, S. P. (2017, March 16). Letter to the director of the tele industry. Visited 21-09-18, https://ulovliglogning.dk/assets/files/PapesBrevTilTeleindustrien.pdf

Tele2. (2016, December 21). Tele2 won data retention case in European Court of Justice. Visited 21-09-18, https://www.tele2.com/media/news/tele2-won-data-retention-case-in-european-court-of-justice

The Guardian. (2018). Facebook says Cambridge Analytica may have gained 37m more users' data. Visited 21-09-18, https://www.theguardian.com/technology/2018/apr/04/facebook-cambridge-analytica-user-data-latest-more-than-thought

Ulovliglogning. (2018). Visited 21-09-18, https://ulovliglogning.dk

Vertesi, J. (2014, May 1). My experiment opting out of big data made me look like a criminal. Time. Visited 20-09-18, http://time.com/83200/privacy-internet-big-data-opt-out/

Appendix

Appendix 1 - Reflections

Project 1 – Data Tracing and Data Limiting – dealing with personal data (notes)

Part 1: Engaging with personal data

A. Digital Data Tracing (Wednesday – Monday)

Notes: When using myshadow.org, I got an overview of what kind of data I actually produce over a period of six days. The results of this data tracking gave some interesting insights. Although I am familiar with data tracking as a general phenomenon, a surprising insight was the sheer extent of this tracking and the amount of data I produced, more or less involuntarily. In relation to browser tracking, a surprising insight was the amount of information about my devices that is collected and passed on to third parties through cookies: transmission of information such as my IP address, time zone, browser history and operating system. Especially the principle of location tracking came as a big surprise.
Having my mobile phone constantly on me, I was unaware of how extensive the amount of information I produced was. After getting an overview of my own digital footsteps, I reflected a lot on my data activities. At first, I was left with the urge to install all the technical measures on all my devices to avoid this data tracking. But after a while, I realised that this would not be a realistic solution. In today's society, many citizens are more or less forced to leave digital footsteps through their use of the internet, for example through communication with the state. This is an interesting aspect, as it raises questions about the principle of privacy in relation to the digital services we use. The fact that the majority of websites now use cookies and other tracking devices to retrieve information about an individual seems more like a universal rule than the exception, and therefore the increasing focus on privacy and the GDPR in today's society is not surprising but needed.

B. Data Limiting

After the data tracing, the exercise about data limiting seemed like a welcome idea. However, the pursuit of 24 hours without data production seemed impossible after the insights from the first exercise. Nevertheless, the exercise was easier than expected. For the 24 hours, I left all my devices at home on Sunday and went to my parents'. This was surprisingly easy, and it was actually nice to have a day without being online. However, the practice left me with more reflections: even though keeping a digital day off seemed easy, the preparations were extensive in relation to reporting to family, friends and work that I was not available. The principle of not being available therefore became the solution of not being online, which is not an ideal solution, although it seemed like the only useful one.

____________________________________________________________________________

Part 1 - Engaging with personal data

A. Digital Data Tracing

All data types collected by Windows:
● Apps and services used
  o All apps used on the PC on a daily basis
  o When you launched the app
  o When you closed the app
  o The device ID used for the app
  o The publisher of the app
● Voice
  o All Cortana usage
● Search
  o Any search done by the Windows built-in feature
  o Date of the search
  o Device ID used for the search
  o Longitude, latitude, accuracy rating
  o Navigation to URL
  o Search term
● Browse
  o Any browsing done by Microsoft Edge, Internet Explorer, Bing and Cortana
● Media
● Location

However, all of this information collected about you can be heavily reduced by not signing in to Windows with a Microsoft account. Just set up the PC in such a way that you access it via a locally created account. Even if you do not use a Microsoft account, Windows will still log some information, such as apps and services used and search history. By using an offline account, your device ID will be set to null and very few apps besides those published by Microsoft / Google will be logged. In addition to this, Microsoft will also log your searches, but with an offline account the device ID and location will be nullified. No matter the settings on your account, Microsoft will log the actual search term and whether or not you accessed any webpages.

All data types collected by your phone:
● The phone itself does not log user usage in the same way as Microsoft.
● However, it will still log anything associated with the browser.
● The problem is that most of the time the browser is either Google's or the iOS version.
● If one uses Google Play or the App Store, it will log what apps have been installed on the account, which also includes the apps you currently have installed.
● In addition to this, it will track what apps you searched for.
● Location is also stored and logged. This is done especially if one is connected to the internet.
Connecting to a WiFi network creates a detailed log of your device, including device name, operating system, start time, end time, IP address, and the number of bytes sent and received. This information can be heavily extended by the WiFi administrator without any legal consequence. Some networks will also track what websites one visits, cookies or even device usage.

Web browser tracks:
● The device used, its operating system and its name
● Search history
● Depending on the websites one visits, the data collected can vary extremely. This can range from just the IP address (which automatically gives your location) and the time of accessing the website, to things such as where you clicked, what links you used, how many times you accessed that website, etc.

I am aware that a lot of information is tracked. A general rule of thumb is that if you have to sign in with an account, your activity is heavily tracked. The reason for this is that most companies and websites have some form of KPI or indicator that is used to know how successful their website is and whether or not they should make changes. Another big reason is marketing and targeted advertisement. Moreover, a lot of websites employ some form of analytics tracker which is used for collecting as much information as possible about how the website is used, in the hope that more data means better data and that, at some point, somebody in the organisation will be able to analyse it and extract some form of information out of it. Most of the time, data tracking is not done with an explicit intent. I knew that one's activity is heavily tracked, but I found it surprising just how many third-party applications have access to your data. Not all of these are used for the purpose of logging your activity; some also make sure the website functions as intended.
Another surprising aspect is that without some kind of data-tracking or ad-blocking software, one is much more vulnerable to predatory ads, malware and phishing schemes. In order to track all the traces of my online activity, I tried turning off any kind of protection (AdBlock, AdBlock Plus, Ghostery, data-tracking protection, activity anonymisation, as well as the real-time protection by Microsoft and Google), and within minutes my antivirus detected a large amount of malware and I immediately started getting pop-ups such as: "hot singles in your area", "you are the 1 millionth person to access this website", "your computer is infected, put your microsoft id and password here", etc. All of this just makes me want to have more security and protection layers against malicious 3rd-party websites. Apart from this, my opinion of standard activity logs has not changed, and I find the trade-off between giving away my location and finding the information I wanted to be acceptable.

B. Data Limiting

Since I am not a big social media user and I rarely, if ever, post anything, I feel like I do not need to limit my internet activity that much. To truly reduce one's digital tracks, one would have to stop using any electronic smart device. It is also a good idea to use offline accounts wherever possible, since that generally limits the amount of data tracking. Because of my current situation, being part of the Software Development track at ITU, using the internet and googling various coding-related questions is absolutely vital. To avoid direct activity tracking, one can use incognito / private mode. Even if your activity is not tracked directly by the web browser, it is still tracked by the internet service provider.
This might not be done with the intent of analysing or selling that information, but it has to be done to ensure that all people who pay for an internet subscription can actually access and receive the data packages they ask for. Trying to limit the amount of digital tracks I produce, I limited my online activity to things strictly related to my studies. With all the online protection activated, I did not encounter any suspicious third-party data aggregators, and no malware or dubious pop-ups. I would not be able to simply not use any smart device for 24 hours, but by taking the necessary steps to anonymise my activity and block most suspicious 3rd-party apps, I feel confident that my online activity cannot be used against me.

C. Technical data control?

I would say that the Databox project is a nice idea but a rather utopian one. Its mission statement is to "form the heart of an individual's personal data processing ecosystem, providing a platform for managing secure access to data and enabling authorised third parties to provide the owner with authenticated services, including services that may be accessed while roaming outside the home environment". I am against any kind of software that attempts to centralise all your data in one place under the pretext of keeping it secure. I believe it is better to let your data be chaotic, not curated and not centralised in any way. Sure, it is secure from the publisher's competitors, but it is by no means secure from the organisation itself. I do not know how Databox works, so I cannot comment on it too much, but it seems like a difficult piece of software to set up and get to work efficiently. I find it to be in the same category as all the data-tracking software that needs access to your entire device to be able to work.
______________________________________________________________________________

Tried different extensions for Google Chrome in Opera: netsniffer, websniffer. These track all network requests, but not those from other applications. I need to actively go to my browser and activate them, and there is no log file saved automatically, meaning I will have to clear and save the log myself. Looking for an application that will track all traffic on my laptop, so I will not forget to activate it when using the browser; there will of course also be quite some data which is not generated from browser usage. Found an app called bandwidth+ which tracks the total upload and download. Started the monitoring 12/9 1748. Downloading the RadioSilence app, a network monitor and firewall for Mac. Only works for 24 hours for free. Maybe found a better app: PeakHour 4, 10-day free trial. Seems that it can monitor everything. Started the more extensive monitoring with PeakHour at 12/9 1800. Deleted websniffer. Stopping RadioSilence 13/9.

Two different types of data: content, e.g. a text message; and meta-data, data about the data, e.g. when the message was sent and between whom. Obvious data traces would be comments on social media, status updates or emails. I knew that when visiting websites the user agent shares its IP (essentially also one's physical location), but when monitoring the requests while browsing the web it is clear that when visiting almost any website there are additional POST requests (i.e. my client posting data to the server) and not just GET requests.

Bruce Schneier - six different types of digital traces (https://myshadow.org/how-much-control-do-we-have-over-our-data):
- Service data: provided to receive a service, such as name, age, country, credit card number
- Disclosed data: content like photos, messages, comments - the obvious data, blog posts etc.
- Entrusted data: posted on a platform but not controllable - what is done with the data afterwards?
- Incidental data: data about us shared by others without them realising it, e.g. a friend allows WhatsApp to get access to the phonebook on his phone, which contains your name and phone number
- Behavioural data: generated when we interact with a computer - meta-data, logs; what, where, when
- Derived data: inferred from other data - individual profiles can get tied to one or more group profiles, binding the group characteristics to us

Classification

The level to which companies like Google, Facebook or Apple can trace and track me and my behaviour does not surprise me. I have been aware of this and have become the pragmatic user who is willing to trade my data for their services. I have changed my preferred search engine from Google to DuckDuckGo, since they claim not to track their users. Not everything can be found with DuckDuckGo, however, so in those cases I have to use Google. I am very aware of the illegal telecoms logging going on in Denmark - initiated and prolonged by the Danish government and the Minister of Justice, even though there is a case from Tele2 in Sweden where the Court of Justice of the European Union ruled it illegal to track and log every single connection to a telephone tower without a warrant from a court. They hide behind "We do it to prevent crimes and terror". Therefore, I'd rather be on WiFi with my mobile phone than on 4G, since I feel surveilled by always being trackable by the government - especially when they do it under the false claim that they are preventing crimes and that it is one of their most important tools against crime. I am not able to limit my data trace, since this is what I do: I work with computers and often spend my spare time with a computer.

Databox - complex to have this implemented, or maybe not implemented, but enforced so that their solution will be used.
The idea is great: to enable both more personal control of one's data and also to enable services to use data that would otherwise be out of reach. To have this in a small local setup I see as no issue, but to scale it for everyday use, and for IT novices, I see as a big challenge. We need a huge change in awareness of data privacy, and of the extent to which one can benefit from providing data, in order for this project to be fulfilled. In the current world I only see it working on a very small scale with few participating services. It was interesting to look into, but I was not much enlightened on what data is created when we use computers. It is something that I have been aware of for a long period of time, which also influences the way I use my computer, or at least which services I use.

Project 2 - Wifi Analysis

Project Group 10:
Andreas Devald Bisgaard
Pernille Gross Olesen
Sofie Sung Mee Rømer
Vlad Alexandru Ilie

Critical Big Data Management, 2018
Technical Big Data Management, 2018
IT University of Copenhagen

Word Count: 6,814
Symbols: 39,891

Table of Contents
1. Introduction
2. The Architecture
2.1. Layers of the Lambda Architecture
3. Data & Pre-processing
3.1. Use of data
4. Ethical and personal implications
5. The Views
6. GDPR Considerations
6.1. Considerations for legal ground
7. Impact analysis
9. Conclusion
10. LOG
11. References

1. Introduction

This paper is based on a given project description in which we have been asked to join a diverse consultant team to develop a new service and potential applications for data management and processing for the IT University of Copenhagen. Based on a dataset with different information that has been made available to us, the project will include our process of developing this service, including the technical challenges as well as the legal and ethical implications it has, together with different approaches to tackling them.
The project will end with our shared reflections on the challenges and opportunities that we have experienced during the process, arising from the diverse collaboration between Technical and Critical Big Data Management.

2. The Architecture

The data on which this project is based is stored on a Hadoop Distributed File System (hereinafter HDFS); the cluster is accessed via a secure shell with encrypted private keys. The group has been granted permission to the HDFS cluster by the IT University of Copenhagen only after the parties involved signed a joint controller agreement regarding the processing of personal data in accordance with the General Data Protection Regulation (hereinafter GDPR).

2.1. Layers of the Lambda Architecture

The WiFi data recorded by ITU can be used in a variety of manners and for a range of purposes. Accordingly, the data has not been collected with a highly specific end goal; rather, it was generated, in accordance with the GDPR, as a by-product of services offered by ITU. As a result, depending on the information that is searched for and the manner in which the search is performed, only part of the Lambda Architecture layers may be needed; however, in order to obtain consistent and stable results, all three layers should be used. The Lambda Architecture is defined by Nathan Marz and James Warren in their book "Big Data: Principles and Best Practices of Scalable Real-time Data Systems" as a "generic, scalable and fault-tolerant data processing architecture" (Marz & Warren, 2015) where low-latency reads and updates are vital. The concept of the Lambda Architecture can be summarised by the interconnected functions of its five elements:
1. Data: is processed by both the Batch and Speed layers
2. Batch Layer: stores all data and precomputes batch views
3. Serving Layer: indexes batch views so they can be visualised in constant time
4.
Speed Layer: processes new data that has not been pre-computed
5. Queries: are answered by combining batch and real-time views

Figure 1: Lambda Architecture and process. From (Hausenblas & Bijnens)

3. Data & Pre-processing

The following paragraphs describe the three types of data sets, their attributes, their immediate uses as well as overlooked aspects. The three types of data sets can be described as follows:
1. Meta data: is overwritten daily and works as a register for all devices, in this case routers; it keeps track of their location, function, mode and up-time.
2. Rooms data: is a record of the various types of scheduled activities in the building, such as lectures, exercises, meetings, events or facility work. It contains the following attributes: the start and end date and time, the name of the main person involved as well as, if applicable, the name and programme of attending participants.
3. Readings data: keeps track of all users connected to each device provided by ITU at a given timestamp. Information about each user is also stored as a tuple containing the operating system, the user's device ID, the name of the network the user is connected to and, most importantly for the purposes of this project, the received signal strength indication (rssi) and the signal-to-noise ratio (snRatio).

Every 24 hours, at around 5:00 in the morning, the room booking data as well as the time series data is uploaded to the storage layer on the cluster, from where it is accessed and processed. The room booking data follows the naming convention "rooms-yyyy-MM-dd.json", whereas the time series data, which contains the access point connections, follows the naming convention "readings-dd-MM-yyyy.json". Since they follow these conventions, historic readings data as well as room bookings data is available, which is not the case for the "meta.json" data. This dataset is updated, i.e.
overwritten, every time it is uploaded. The sampling interval for this dataset is somewhat inconsistent, but according to the data provider it will be updated once a day. The readings dataset has very few missing values in the readings elements within the time series objects, i.e. the details of each recorded connection. E.g. the dataset from the 24th of October has 49469 rows with a total of 306025 readings (individual recorded connections), but only 5 readings with missing values - a very low ratio. The 5 instances which contain missing values all have in common that both rssi and snRatio are missing. Intuitively this also makes sense, since rssi is a relative measure of signal strength, whereas snRatio is the signal-to-noise ratio, which is difficult to establish without a measure of signal strength. In addition to the actual missing values for the above-mentioned instances, there are also readings with the os denoted as unknown. For the rooms dataset there are relatively more missing values than in the readings dataset. If we again take the set from the 24th of October, we have a total of 23 records with missing values, which amounts to 34 percent of the instances in a dataset of 67 records. This is a big portion of the dataset; thus the chosen approach for handling missing data will have a bigger influence on the cleaned rooms dataset than on the cleaned readings dataset. The most common missing values are name and programme (20 instances), followed by type (3 instances). Name and programme are related and seem to be dependent on each other, since if one is missing the other is too. It should be noted that the October 24th rooms dataset is the one containing the highest amount of missing values of all. On the 30th of October, 29 rooms datasets and 28 readings datasets are available.
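As a hedged illustration of this missing-value check, the sketch below builds a daily file name following the reported "readings-dd-MM-yyyy.json" convention and counts readings with empty rssi/snRatio values. The field names follow the report's description, but the exact JSON schema (and the example rows) are assumptions.

```python
from datetime import date

def readings_filename(day: date) -> str:
    """Build a daily file name following the 'readings-dd-MM-yyyy.json' convention."""
    return day.strftime("readings-%d-%m-%Y.json")

def count_missing(rows):
    """Return (total readings, readings where rssi or snRatio is missing/empty)."""
    total = missing = 0
    for row in rows:
        for reading in row.get("readings") or []:
            total += 1
            # A value counts as missing when it is absent or an empty string.
            if any(not str(reading.get(key) or "").strip()
                   for key in ("rssi", "snRatio")):
                missing += 1
    return total, missing

# Two fabricated readings mirroring the pattern described above:
# when rssi is missing, snRatio is missing as well.
rows = [
    {"deviceName": "ap-4A54", "ts": "2018-10-24T10:00:00",
     "readings": [{"os": "Android", "rssi": "-54dB", "snRatio": "31dB"},
                  {"os": "unknown", "rssi": "", "snRatio": ""}]},
]
print(readings_filename(date(2018, 10, 24)))  # -> readings-24-10-2018.json
print(count_missing(rows))                    # -> (2, 1)
```

The same per-day loop, applied to every available file, would produce the aggregate counts reported below.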
Running our program for checking missing values on all available sets reveals that for the readings there is a total of 128 missing values, which is a very low ratio in relation to the number of records. In fact, the "missing-value-to-record-ratio" never exceeds 0.0004 %, and that is assuming missing values are related to one record each, which was not the case; snRatio is dependent on rssi. In the rooms datasets there is a total of 280 missing values, which is a considerably high number compared to the size of the datasets. Once the new data is available, the process of cleaning the data sets can be initiated. There are many different approaches to handling missing data. These approaches might introduce biases, so that the data is skewed due to the inconsistencies in the dataset. For instance, if one chooses to ignore rows based on missing values, and the data is from a particular access point that has not logged the operating system (OS), the analysis would completely ignore this access point. The data could have told an interesting story, even without a more or less redundant attribute such as the OS - the access point could be connecting a high number of users, causing instability, indicating that an additional access point in the area would be a good idea. Additional approaches could be to fill missing values manually, use a global constant to fill missing values, use mean or median values to fill missing values, or use the most probable value (Han et al., 2012). Each will mould the data set in different ways; therefore the approach should be considered thoroughly. Taking into account the relatively few missing values in the readings data, it will have little to no impact on the dataset that the records with missing values are removed. These instances can be discarded due to the low ratio, i.e. the
highest count of missing values is from the 21st of October, with 15 values missing out of a total of 37823 readings. Since the missing values in the rooms dataset are mostly of the name and programme type, it makes little sense to remove these records. This is due to the nature of booked room data - it is entirely legitimate for a room to be booked without being associated with a programme and course name. E.g. instance 55 from the 24th of October:

endDate: 2018
endTime: 14:00
lectures: anonymised
name:
programme:
room: Room 4A54
startDate: 2018-10-24
startTime: 12:00
type: Study and Carrer Guidance

Table 1: Instance 55 from 24-10-18-rooms.json

Whether it is acceptable that name and programme are missing therefore depends on the type. The unique values of type are the following: Lecture, FM-Works, Exercises, Meeting, Study and Carrer Guidance, Workshop, Staff activity, Other, Ordinary Exam, Student educational activity. Inspecting the possible values of the attribute, it also seems obvious that not all instances can have a name and programme. Hence it is accepted that instances in this dataset have missing values. The rooms dataset will remain as-is, whereas the readings datasets are cleaned by deleting records containing missing values. The dataset is loaded into a dataframe and processed row by row. First, deviceName, ts and readings are verified not to be null; if they are not, the individual readings are assessed. If there are missing values, the instance id is stored in a list. Each value of the readings is initially loaded as a string, so the length of the string is checked not to be zero. If the number of missing values is not zero, the instance id is stored in the list. Once the whole dataset has been processed this way, another dataset is generated in which the stored instances with missing values are excluded.
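A minimal sketch of the cleaning step just described, assuming the field names used earlier (deviceName, ts, readings) and treating zero-length strings as missing. The report's implementation works on a dataframe; this plain-Python version, with assumed reading field names, only illustrates the logic.

```python
# Assumed reading field names; the real schema may differ.
REQUIRED_READING_FIELDS = ("os", "cid", "ssid", "rssi", "snRatio")

def clean(rows):
    """Drop instances whose top-level fields are null or whose readings
    contain zero-length values, mirroring the procedure described above."""
    dropped_ids, kept = [], []
    for i, row in enumerate(rows):
        if not row.get("deviceName") or not row.get("ts") or not row.get("readings"):
            dropped_ids.append(i)
            continue
        if any(len(str(r.get(f) or "")) == 0
               for r in row["readings"] for f in REQUIRED_READING_FIELDS):
            dropped_ids.append(i)  # remember the instance id, then exclude it
            continue
        kept.append(row)
    return kept, dropped_ids

# Fabricated example: the second instance has empty rssi/snRatio and is dropped.
rows = [
    {"deviceName": "ap-1", "ts": "t1",
     "readings": [{"os": "iOS", "cid": "c1", "ssid": "eduroam",
                   "rssi": "-60dB", "snRatio": "25dB"}]},
    {"deviceName": "ap-2", "ts": "t2",
     "readings": [{"os": "unknown", "cid": "c2", "ssid": "eduroam",
                   "rssi": "", "snRatio": ""}]},
]
kept, dropped = clean(rows)
print(len(kept), dropped)  # -> 1 [1]
```

The surviving rows would then be written back to the cleaned .json file.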
The cleaned dataset is stored in a new .json file in our group repository, so we are the only ones who can access it. Prior to processing, the cleaned data can be moved into HDFS to utilise its enhanced computational properties. Since the whole readings dataset is already processed in this step, it could be beneficial to take the future process into consideration in order to avoid potentially redundant steps. E.g. if one knows that the rssi value will be used in an arithmetic operation, it is necessary to process it as an integer, and not as a string containing 'dB'. It was decided not to include this in the initial cleaning, since it could potentially constrain the generation of different views in the future. That said, if the end-to-end process and use of the data were clearly defined, this step is certainly an area where one could optimise processing time.

3.1. Use of data

In order to extract useful information from the data and to suggest potential uses that can bring a surplus of value for teachers, students and ITU's administration, this project suggests the following questions:
1. How many unique users have been connected to the network each day?
2. How many users are connected to each router?
3. What is the quality of the wifi network produced by each router?

While the answer to the first question could be valuable for ITU's IT department as a key performance indicator, the answers to the latter two questions can be used to offer users suggestions of where to go in order to receive the best wifi signal. This information can be conveyed via the building's InfoScreen system rather than giving users a pop-up notification on their device. Hence, the suggestion is that ITU's IT staff solely controls which results of the data processing and analysis should be presented to the staff and students.
An automated view can be displayed on the screens, based on thresholds of relevance; e.g. it would be relevant for users to know which routers/areas perform below a certain threshold, so as to avoid these areas, but not relevant to get a complete list of router names and their current quality of signal. Analysing the overall structure of the datasets, it is clear which ITU staff member is involved with what activity, where and when it will take place, as well as who should participate. This data can easily be used to closely monitor the activity of each staff member, as well as to provide students with the necessary information about their courses.

4. Ethical and personal implications

The necessity for a joint controller agreement comes from the fact that one could, theoretically, reverse engineer a user's ITU tag, and as a result a specific person could be identified. However, in order to achieve this, one would need insider knowledge from ITU about the connection schema between a user's unique id as it appears in the datasets and the actual person behind a device. Even though each user is registered as connected to a specific router, and each router has a specific location or a series of very close locations in the building, wifi networks overlap and one cannot find out the exact location of a user but only the general area. It would, however, be possible to modify the readings data set so that a more accurate location of each user could be calculated. To do so, a series of changes would have to be made to the manner in which each router functions: instead of connecting to one router and staying connected to it as long as possible, the network could be set up so that the connection to the user is cycled through all routers. In such a way, the network can collect the rssi and snRatio for all routers that produce a wifi network that can reach the user.
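As a hedged illustration of why rssi relates to distance (and hence to location), the sketch below uses the log-distance path-loss model, a common textbook approximation; it is not the method of the cited work, and all constants (reference power at 1 m, path-loss exponent) are hypothetical.

```python
def estimate_distance_m(rssi_dbm: float,
                        ref_power_dbm: float = -40.0,  # hypothetical rssi at 1 m
                        n: float = 2.5) -> float:       # hypothetical path-loss exponent
    """Log-distance path-loss model: d = 10 ** ((P_ref - rssi) / (10 * n)).
    With several such estimates from different access points, a device's
    position could in principle be triangulated."""
    return 10 ** ((ref_power_dbm - rssi_dbm) / (10 * n))

# A weaker signal maps to a larger estimated distance.
print(round(estimate_distance_m(-40.0), 1))  # -> 1.0
print(round(estimate_distance_m(-65.0), 1))  # -> 10.0
```

In practice, indoor multipath effects and the number of concurrent connections make single readings noisy, which is why a sufficient sample size matters.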
Based on the fact that rssi is heavily influenced by the distance to the receiver as well as the number of connections, with a sufficient sample size accurate estimations of a device's location can be calculated (Papamanthou, Preparata, & Tamassia, 2008). It has already been discussed that staff members can have their activity monitored, but from a student's point of view a range of aspects can also be estimated. For example, it is highly unlikely that a student attending a lecture does not have a smart device capable of connecting to the wifi network. Due to the pervasive nature of smart devices, the readings dataset could be used to track a student's movement through the building at various times throughout the day, to check whether or not a student is present at ITU and, if so, whether they are attending their lecture or skipping classes.

5. The Views

In order to answer the raised questions, we aim to generate three different views. The first view seeks to answer the question of how many unique users have been connected to the access points on a daily basis. This demands that all individual connections are accessed and the cid stored, if not done already. This is done by iterating over all readings in the cleaned data set and generating a new view, which could potentially be stored for use by other operations; in our case, we keep it in memory. An instance of the resulting view looks like this:

Figure 2: View 1, generated to calculate the unique count of connected users

This way, we can perform operations on the column as a whole, and the unique users are easily extracted from the dataframe. For our cleaned dataset from the 6th of October we found a total of 306 unique users, which is plausible since it was a Saturday; for Wednesday the 24th of October there is a total of 2473 unique users. The second view seeks to answer the question of how many users are connected to each access point.
This too involves accessing the individual readings, but since there is a high number of unique timestamps, it was chosen to show the number of users per access point on a daily basis rather than at a specific timestamp. The same pre-processing as in the previous view is used, but here each instance is counted and anchored to the specific access point, and the result is stored in a temporary dataframe. Where the first view only used the readings data, this view also makes use of the details in the meta file. Since this contains information about each access point and its location, it is joined with the temporarily stored access point counts. This provides more valuable information than only showing the access point name, because the location name is interpretable, whereas the deviceName is not. Important to emphasise is that the two dataframes are joined after the process of counting; thus there is no point at which one directly sees the location of one specific connection. Moreover, the join is done on two frames containing metadata. Below is a snapshot of the resulting sorted list from the 24th of October.

Figure 3: View 2, List of unique logins for each access point including physical location

The third and final view seeks to answer the question of which access points provide the best average signal-to-noise ratio. Here it is required that the exploded view is grouped by access point in order to average over all the signal-to-noise ratios. But first, the initial strings are converted to integers in order to perform the arithmetic operation on the column. The snRatio column is averaged and sorted in descending order to show the "best" performing access point on top. Again, after the computations the resulting frame is joined with the metadata to match the actual physical location of the access point. Below, the 20 best and 20 worst performing access points are shown.
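The grouping, averaging and meta join described for the third view can be sketched roughly as follows. The report uses dataframes; this plain-Python sketch only illustrates the logic, and the field names, the 'dB' suffix and the meta structure are assumptions.

```python
from collections import defaultdict

def view3(rows, meta):
    """Average snRatio per access point, sort descending, then join with the
    meta data so the readable location appears next to the deviceName."""
    per_ap = defaultdict(list)
    for row in rows:
        for reading in row["readings"]:
            # Strip the assumed 'dB' suffix before doing arithmetic.
            per_ap[row["deviceName"]].append(int(reading["snRatio"].replace("dB", "")))
    ranked = sorted(((sum(v) / len(v), ap) for ap, v in per_ap.items()), reverse=True)
    return [(ap, avg, meta.get(ap, "unknown location")) for avg, ap in ranked]

# Fabricated meta and readings data for illustration only.
meta = {"ap-4A54": "Room 4A54", "ap-atrium": "Atrium"}
rows = [
    {"deviceName": "ap-4A54", "readings": [{"snRatio": "30dB"}, {"snRatio": "20dB"}]},
    {"deviceName": "ap-atrium", "readings": [{"snRatio": "35dB"}]},
]
print(view3(rows, meta))
# -> [('ap-atrium', 35.0, 'Atrium'), ('ap-4A54', 25.0, 'Room 4A54')]
```

Note that, as in the report, the location join happens only after aggregation, so no single connection is ever paired directly with a location.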
Figure 4: View 3. On the left are the 20 best performing access points measured by signal-to-noise ratio; on the right, the 20 worst performing access points.

One could enhance the view by taking into consideration the number of readings per access point, since there are also access points with very few readings, resulting in a less significant measure. We compared mean and median signal-to-noise ratios and found little difference between them. The data analysis has revealed the best and worst WiFi routers at ITU; as a result, a heat map visualisation can be created automatically, assuming that a building information management system exists and can be connected to the HDFS. The manner in which the application has been conceptualised is exemplified in the following figure:

Figure 5: Example of how the data can be displayed to students and staff at ITU

6. GDPR Considerations

This section will analyse whether the data processing and service provided are in compliance with the GDPR. Looking at the dataset, it is first of all important to be aware of the nature of the data. In article 4(1) of the GDPR, personal data is defined as: "[..] 'personal data' means any information relating to an identified or identifiable natural person ('data subject'); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person (Article 4, GDPR.1)".
Taking this definition into consideration, it becomes clear that the data provided in this project is to be considered personal data in accordance with the GDPR, as the data itself reveals information, such as location data, that could indirectly lead to the identification of one or more data subjects, in this case ITU students or teachers. For example, it could be argued that if additional data about the students, such as ID information, were added, one could identify the exact (room) location of an actual student or teacher by looking up the different schedules for each individual. As such, the data provided for the project has been defined as personal data, meaning that all processing of the data must be GDPR compliant. Since it is personal data that is distributed for the service, several considerations have been implemented to ensure that the data cannot be accessed by or shared with external parties. One legal consideration that has been the main parameter for the data processing relates to article 5 of the GDPR on principles relating to the processing of personal data: "1(c) adequate, relevant and limited to what is necessary in relation to the purposes for which they are processed ('data minimisation') (Article 5, GDPR.1c)". As mentioned earlier, the motive for this project is to optimise ITU's internal operations using Big Data strategies. This means that when processing the data, the data is retrieved from the source provided by ITU. Here, it is important to note that the data is not stored on any local platform. Instead, the cleaned data is returned to the repository where the original data is collected. In this way, access to the data is strictly reduced to individuals with an access key. Therefore, we argue that the data processing in this project is GDPR compliant. This is also based on the considerations made when selecting the data controllers for the project.
Article 29 of the GDPR states: "The processor and any person acting under the authority of the controller or of the processor, who has access to personal data, shall not process those data except on instructions from the controller, unless required to do so by Union or Member State law" (Article 29, GDPR). This also corresponds to the limited access that was provided by ITU for the project. No individual other than the primary group received access to the dataset, meaning that all parties involved in the processing phase were ITU, acting as data controller, and this group, acting as data processors. Furthermore, the service will only be able to share information on the public screens at ITU. This means that the information is supervised by the data controller after it has been provided by the processor; in this way the data will not be available through an application to the individual user, but as a public asset. To make sure we stayed within the spectrum of the GDPR, it was important to identify the different roles in this overall collaboration. The fact that ITU provided the data makes them the actual data controller, whereas we function as the data processors of the data, which is indirectly given by the data subjects - the different students and employees of ITU. As the role of the processor has been established, the role of the data controller also carries certain expectations in terms of regulation, one being the processing of the raw data that is used for the project. The principle for data processing described in article 5 states: "(a) processed lawfully, fairly and in a transparent manner in relation to the data subject ('lawfulness, fairness and transparency') (Article 5, GDPR.1a)". It is expected that ITU has processed the data in compliance with the GDPR before access is given for further processing. If not, it can no longer be considered GDPR compliant, since many elements of the data are unknown.
Examples include who has access to the data and the very structure of the collected data. If the data processing is done by applying tools for cleaning the data, corruption may occur that can lead to the data becoming incorrect, to wrong assumptions being made, or to the data being used in a non-compliant manner. This is also the foundation for the legal grounds of the project. Article 6 of the GDPR states: “(b) processing is necessary for the performance of a contract to which the data subject is party or in order to take steps at the request of the data subject prior to entering into a contract” (Article 6(1)(b), GDPR). From the perspective of the data processor, consent is given before the data is used and analysed by the data controller, ITU. As long as other regulations are not breached, the project's method for processing, storing and using the data is in compliance with the GDPR.

6.1. Considerations for legal ground

Now that the legal basis has been determined, it seems that the legal grounds presented in Article 6 of the GDPR all make the assumption that the processing of data must be “vital” for one reason or another. It is of course an important aspect that when handling personal data, it must be for a more essential reason than the data merely being useful to an organization. In other words, the purpose must be specific enough to justify the use of the data and the risk that follows. That being said, the notion of importance is very subjective, and may not matter much to the data subject if the purpose does not involve them. This makes Article 6 of the GDPR very broad in terms of hierarchy. On one hand, a legal ground can be a public interest or a vital interest, but what about the optimization of an internal process? This question is based on the legal grounds that have been established for this project.
Since the motive is neither a vital interest nor a public interest, it is debatable whether the legal ground is fitting for using personal data. The service provided by processing the data is capable of bringing value in the sense of optimizing student life with better access to wifi. This may not seem beneficial enough to warrant the use of personal data, but as processors we believe that the reward outweighs the risk, due to the values mentioned above. In connection with the legal grounds, the aspect of consent is also considered as a legitimate basis for the project. This is based on Article 7 of the GDPR, which states: “If the data subject's consent is given in the context of a written declaration which also concerns other matters, the request for consent shall be presented in a manner which is clearly distinguishable from the other matters, in an intelligible and easily accessible form, using clear and plain language. Any part of such a declaration which constitutes an infringement of this Regulation shall not be binding” (Article 7(2), GDPR). This relates to ITU's legal role as data controller. The assumption is that when students or staff members are enrolled or hired by ITU, they consent to ITU being in possession of their personal data. However, in terms of legal basis, there should, prior to the project, be a notice informing the data subjects of the intentions with their data. This could provide an opening for individuals who do not wish to participate or share data to be excluded, which would of course come with the consequence of a student or staff member being denied access to the ITU wifi.

7. Impact analysis

In this impact analysis, we assess some potential winners and losers in the scenario where our service is implemented by the IT-University of Copenhagen.
Since the term “losers” is somewhat broad, we have discussed this from both a theoretical and a legal GDPR perspective. From a theoretical perspective, when using the term “losers”, we considered it in relation to Zook et al.'s (2017) understanding of data as being persons. Understanding data as persons means that handing over one's personal data corresponds to handing over parts of one's private life and, in particular, the power over it to other stakeholders through their potential aggregation of the data (Kitchin, 2014). From the perspective that data is a form of power, one can argue that the stakeholders who control students' data are the winners and the students are the losers. However, from a legal GDPR perspective, a major risk of implementing the service falls on ITU, which is in possession of all the data; this arguably makes the students, as data subjects, the winners. In this case, the possibility of an outside force accessing the data cannot be denied. As such, there can be many winners from the values that the service generates, e.g. a notion of better wifi reception, but the loser can only be ITU if the data is stolen or hacked. As already mentioned, the GDPR requires that consent is requested in order to access, store and process personal data (European Commission, 2018). Likewise, it is a requirement that we draw attention to how we handle this data, for what purpose, and who has access to it (Baker & McKenzie, 2018). Providing that information to the data subjects could potentially change how they act and how they perceive ITU. Knowing of the surveillance, students may choose to alter their routes or how much time they spend at ITU, especially if the information is visualized on the public screens, where they will constantly be reminded of it.
This is by no means the intention of the service, but it could potentially be an outcome. Returning to the possible winners and losers of the provided service, the general idea was to make the students winners in all scenarios. Making it possible for students and staff to determine a location to use the wifi may also create new accountabilities. One is the expectation that ITU staff repair access points with poor range. This imposes on the IT staff's existing workload, as it will be visible once a day where a bad connection is; if the staff have problems fixing a bad connection, other staff members and students will be able to follow the development. Another aspect is the expectation of students seeking the best wifi reception. This may create large clusters of students, who may not all have a space to sit and work. When discussing accountabilities in regard to the service, one also has to be critical about the fact that we might be providing a service that is only beneficial to the students who want to access the wifi while at ITU. One could ask whether it is a student's responsibility to find another wifi in case they decide not to be part of this project. As such, should it be up to the students to find alternative ways to access wifi at ITU if they do not accept the service? Another aspect is which parties have access to these data, and how that is affected by the security measures in the database security layer. It should also be noted that some features of the service proposed in this paper may cause complications if implemented, in the sense that an application could be developed to evaluate participation in each course based on the number of internet connections in each room.
Here, it is of course important to note that the number of connections does not necessarily match the actual number of people attending the course, which means that there would be a data error. An evaluation based on such an incorrect assumption could nevertheless be used for other evaluation purposes by board members of ITU.

8. Discussion of Ethical Issues

As part of our discussion of the legal implications, we also discussed some of the ethical issues. Specifically, if the data were used for tracking a particular person's location on a daily basis, a possible problem could be that outlier detection could bring insight into whether this person deviates from their normal schedule. Potentially, this leads to a breach of privacy in terms of both legal and ethical implications, as such data could be used to monitor both students and professors. This is also problematic when the balance of power among the data subjects is considered. In the impact analysis, the presumption that ITU would use the service to evaluate courses may also lead to professors using it. Being able to estimate a student's attendance could lead a professor to form a personal opinion of a specific student before an examination. The question is how much access the service should provide. Will professors be able to access the data because they are employees at ITU, and will that exclude students from doing the same? Turning it around, will a student's choice of courses be based on the attendance levels within a course? These assumptions depend on how the historical data is used by ITU and how far an individual is able to trace the data. These are, however, important concerns that could become complications if the technological aspect is changed into an application for more personal use.
As mentioned earlier in this project, under the GDPR we as data processors are legally bound to ask for consent to process personal data generated by private individuals. However, according to the GDPR, there are also other possibilities. This is expressed in Kavanagh, McGandle & White (2017) as follows: "An organization should choose the lawful basis that most closely reflects the true nature of its relationship with the individual and the purpose of the processing." This means that when drafting the notice asking for permission, it must be very clear what the project will create and in what sense the data will be used. If this is not achieved, data subjects may agree to terms they do not understand and later no longer wish to be part of. As Article 7 of the GDPR states: “The data subject shall have the right to withdraw his or her consent at any time. The withdrawal of consent shall not affect the lawfulness of processing based on consent before its withdrawal. Prior to giving consent, the data subject shall be informed thereof. It shall be as easy to withdraw as to give consent” (Article 7(3), GDPR). This means that a data subject's data must be extractable, which is not part of the original design. This is a complication that can have a major impact on the design, since it would require access both to the main database and to all data used in the service, and would thus involve both the data controller (ITU) and the processor (the designers). A possible solution could be to focus more on the legal implications in the design, but that could also have an impact on value creation, since the full usage of the data would become limited in the sense that major calculations could not be done on the collected data. In addition, this would require further data processing by the data controller to make it possible to remove data with ease.
In general, internet access is an extremely important service for students, and not providing it can hinder the development of, for example, educational institutions such as the IT-University of Copenhagen, especially in the Western world, where the Internet has become a crucial part of people's everyday lives rather than an option (Taylor, 2016). As some of the group's members have discussed in connection with one of the seminars, it is important to pay attention to and critically discuss systems dealing with Big Data, as such systems create their own standards and thereby shape behaviours that we need to pay attention to and relate to (Metcalf & Crawford, 2016). A possible approach could be to separate the use of the service itself from access to the Internet. For example, if students wish to use this service and the underlying services it provides, they disclose their personal data with their consent; students who prefer not to use the service would still have wifi access and the opportunity to ask their institution not to store their personal data. A different approach could be to limit the timeframe for storing the data in relation to the GDPR. Even if ITU were to claim that the data is required in order to offer the service in the public interest, it would still be required to disclose the timeframe for storing the data. Over a longer period, the data may decrease in value while continuing to pose a risk to individuals' privacy. In view of the law and business interests, the IT-University of Copenhagen must form a framework for data retention and must ensure that it is technically able to comply with its retention policy (Arla Propertymark, 2018). However, if the service is based on the students' own requests, an implication may be that students who do not submit their consent are not included in the service.
If we imagine a situation where most of the students do not submit their consent, we would be left with a service whose foundation is a significantly limited volume of data, which would furthermore not reflect the actual number of students and employees. One could argue that this would be an extreme case, but through our discussion we all agreed that such a problem could arise. In connection with the above, we have further discussed the scenario in a larger perspective concerning people and their privacy in general. For instance, in the case of a security breach, such data would reveal large amounts of personal information about ITU staff and students, such as information that could reveal patterns of behaviour in individuals' daily lives based on their daily whereabouts. This would be a serious challenge not only in relation to the ethical issues, but also to the GDPR. As Zook et al. (2017) argue, we must acknowledge data as being persons, to whom harm can be done; we therefore argue that we are co-responsible for avoiding such cases and for designing, to the best of our ability, a system offering a secure service for students and other ITU employees.

9. Conclusion

Through the effort of providing ITU with a big data service, not only technical but also legal and ethical questions were answered to a varying extent. From the two datasets, readings and meta, it was obvious that a lot of potential questions could be raised and answered. Much effort was put into simply cleaning the dataset to a sufficient degree in preparation for the view generation. The data processing and the nature of the data raised questions regarding compliance in terms of the use of personal data under the GDPR.
The way the data was processed, in both the pre-processing step and the batch view generation, is argued to be compliant under the GDPR. This leads us to an equally important subject, namely the possible ethical implications. With insider knowledge, one is able to reverse engineer CIDs to an actual ITU affiliate, as well as surveil that individual's movement around the ITU building. Though the potential route one is able to map from this will not be totally accurate, it might still give away individual activity, both attendance and location.

10. LOG

In the beginning of the project, the most challenging aspect was reconciling the different ideas and approaches the members had for the assignment. We all had different opinions on how to use the data and which service to focus on. Finding an agreement proved to be a longer process than we all anticipated, and explaining our different viewpoints seemed to create more debate than results. This can be attributed to the different levels of understanding of what the data represents and a general unwillingness to internalise theoretical concepts from the other course track. However, after group discussion it became easier to distribute work, and everyone knew the field they should research. Another challenge was the technical part of the assignment. Since the Scala lecture covered primarily theoretical concepts, the technical part of the group had difficulties implementing the various batch and real-time views needed to process the data. Furthermore, the critical part of the group could not offer the technical members any support with the implementation, since they also lacked knowledge of the tools. On a side note, all these challenges have been discussed as a group, and we all strive not to repeat them in the future.

11. References

Art. 4 GDPR - Definitions.
Retrieved from https://gdpr-info.eu/art-4-gdpr/

Art. 5 GDPR - Principles relating to processing of personal data. Retrieved from http://www.privacy-regulation.eu/en/article-5-principles-relating-to-processing-of-personal-data-GDPR.htm

Art. 6 GDPR - Lawfulness of processing. Retrieved from http://www.privacy-regulation.eu/en/article-6-lawfulness-of-processing-GDPR.htm

Art. 7 GDPR - Conditions for consent. Retrieved from http://www.privacy-regulation.eu/en/article-7-conditions-for-consent-GDPR.htm

Art. 29 GDPR - Opinion 05/14 on Anonymization Techniques. Retrieved from https://www.pdpjournals.com/docs/88197.pdf

ARTICLE 29 Newsroom - News overview - European Commission. (n.d.). Retrieved October 31, 2018, from http://ec.europa.eu/newsroom/article29/news-overview.cfm

Arla Propertymark (2018). Does your retention policy comply with the GDPR? Last accessed October 28, 2018: http://www.arla.co.uk/news/june-2018/does-your-retention-policy-comply-with-the-gdpr.aspx

Baker & McKenzie (2018). EU General Data Protection Regulation in 13 Game Changers. Last accessed October 28, 2018: https://www.bakermckenzie.com/en/-/media/files/insight/publications/2018/05/bk_uk_eugeneraldataprotection_mar2018.pdf

Data Processing and Data Management - strategy, organization. (n.d.). Retrieved October 31, 2018, from https://www.referenceforbusiness.com/management/Comp-De/Data-Processing-and-Data-Management.html

EU Fundamental Rights Agency & Council of Europe. (2018). Handbook on European data protection law.

European Commission (2018). Guidelines on Consent under Regulation 2016/679 (wp259rev.01): Article 29 Working Party - Guidelines on consent under Regulation 2016/679. Last accessed October 28, 2018.

Han, J., Kamber, M., Pei, J.
(2012). Data Mining: Concepts and Techniques (3rd ed.). Elsevier, p. 88.

Hausenblas, M., & Bijnens, N. (n.d.). Lambda Architecture. Retrieved from http://lambda-architecture.net/

Kavanagh, P., McGandle, J., & White, M. (2017). Consent under the General Data Protection Regulation: what are the alternatives for employers? JDSUPRA. Last accessed October 28, 2018: https://www.jdsupra.com/legalnews/consent-under-the-general-data-89446/

Kitchin, R. (2014). The Data Revolution: Big Data, Open Data, Data Infrastructures & Their Consequences. London: Sage Publications.

Marz, N., & Warren, J. (2015). Big Data: Principles and best practices of scalable realtime data systems. Sebastopol: O'Reilly Media.

Metcalf, J., & Crawford, K. (2016). Where are human subjects in Big Data research? The emerging ethics divide. Big Data & Society, 3(1), 1-14.

Papamanthou, C., Preparata, F. P., & Tamassia, R. (2008). Algorithms for Location Estimation Based on RSSI Sampling. Algorithmic Aspects of Wireless Sensor Networks, Lecture Notes in Computer Science, 72-86. doi:10.1007/978-3-540-92862-1_7

Taylor, L. (2016). The ethics of big data as a public good: which public? Whose good? Philosophical Transactions of the Royal Society A, 374(2083), 20160126.

Zook, M., Barocas, S., Boyd, D., Crawford, K., Keller, E., Gangadharan, S., Goodman, A., et al. (2017). Ten simple rules for responsible big data research. PLoS Computational Biology, 13(3), e1005399.

Project 3 - Emergency Response Service

Group 10: Andreas Devald Bisgaard, Pernille Gross Olesen, Sofie Sung Mee Rømer, Vlad Alexandru Ilie

Critical Big Data Management, 2018
Technical Big Data Management, 2018
IT University of Copenhagen

Word Count: 6,627 Symbols: 39,826

1. Introduction
2. Defining Goals
3. Speed Layer
3.1. Count of most visited streets
3.2. Ratio of street capacity versus flow of traffic
3.3. Updated directed graph with traffic status
4. Data Set
5.
Data Processing
5.1. Map and Simulation Data
5.2. Processing
5.3. Analysis
5.4. Technical Review & Reflections
6. Legal Considerations
7. Selling It
7.1. Benefits of the service
8. Reflections
9. Log
10. References
11. Appendix

Project 3 - Big Data Management adbi, vali, sosr, peol

1. Introduction

Most metropolises struggle with the issue of heavy traffic on the roads. This is both inefficient and wasteful, since cars and people produce little value while stuck in traffic jams. That is “just” a matter of resources and of optimising the output of inactive individuals, but if an emergency crew were stuck in traffic congestion during an emergency response call, it might be a matter of life and death. It is therefore important that the emergency unit selects the correct route to its destination, but in such a situation it is assumed that the crew will follow the route suggested by the navigation device, which may not take the current traffic status into consideration. Hence, we, as the city incident and emergency response services, seek to develop a system and tool that benefits both the emergency crew and the possibly injured people through the selection of an optimal route between the location of the emergency unit and the accident location.

2. Defining Goals

This project attempts to improve the city's incident and emergency response services. Based on the available data, the project proposes the following two situations that can be viably addressed:

● How can an emergency response unit find the best route to solve an emergency?
● How can city residents benefit from the underlying data and processes used by emergency services?

This represents an attempt to optimize the manner in which emergency response units are dispatched and the route they follow.
Improvements can take a variety of forms, such as shorter response times, better chances of overcoming an emergency situation, better resource allocation and better planning for possible situations. For this to be possible, two technical aspects regarding the data need to be fulfilled. First, the data has to be highly accurate and reliable, which means information about each vehicle on any street needs to be collected. Because this project is based on simulated rather than real-life sensor data, the physical challenges of such a system can be overlooked. Secondly, with a sufficient amount of data, the simulation can become more complex than its real-life counterpart; thus, once the simulation has proven beneficial, it can be scaled up and tested in real-life situations.

3. Speed Layer

In order to answer the first research question, and subsequently use that knowledge for the latter situation, the following information needs to be extracted from the data:

● Correctly identify vehicle movement across the city and update statistics for each lane and intersection it passes through.
● Correctly identify traffic jams based on vehicles' short-term history data.
● Suggest the fastest route using the A* algorithm, with the statistics for each node and edge in the directed graph as heuristics.

Moreover, to provide the recipients of the service with useful insight, three different views are defined:

1. Count of most visited streets
2. Ratio of street capacity versus flow of traffic
3. Updated directed graph with traffic status

3.1. Count of most visited streets

The first view provided by the speed layer is a ranking of the most visited streets throughout the city. It counts the number of vehicles that pass through all lanes and then presents the busiest streets.
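A minimal sketch of this view, assuming the stream can be reduced to (vehicle_id, street_id) pairs (these field names are illustrative, not the actual simulation schema), could look as follows:

```python
from collections import Counter

def top_visited_streets(readings, n=10):
    """Rank streets by the number of distinct vehicles observed on them.

    `readings` is an iterable of (vehicle_id, street_id) pairs extracted
    from the simulation stream; each vehicle is counted once per street,
    so repeated position reports on one street do not inflate the count.
    """
    seen = set()
    counts = Counter()
    for vehicle_id, street_id in readings:
        if (vehicle_id, street_id) not in seen:
            seen.add((vehicle_id, street_id))
            counts[street_id] += 1
    return counts.most_common(n)

# Synthetic readings: v1 reports twice on street A, v2 once on A, v3 once on B.
readings = [("v1", "A"), ("v1", "A"), ("v2", "A"), ("v3", "B")]
print(top_visited_streets(readings, n=2))  # [('A', 2), ('B', 1)]
```

Deduplicating on (vehicle, street) before counting is one design choice; counting raw position reports instead would bias the ranking towards slow, congested streets.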
This view is intended for the public safety answering point, or the emergency central from which emergency vehicles are dispatched, so that the respondents are aware of the current load on major roads.

3.2. Ratio of street capacity versus flow of traffic

The second view is even more interesting, as it takes the capacity of the streets into consideration. For this to succeed, a classification of the streets is necessary in order to differentiate among them and measure their capacity. Roads that have multiple lanes and higher speed limits obviously have greater capacity than those with a single lane shared with cyclists. This is essentially an extension of the first view, since the count is now compared with street capacity to provide a more insightful measure of traffic.

3.3. Updated directed graph with traffic status

The third and final view is intended to provide an improved graph of the city, in order to be able to suggest an optimal route that takes the current traffic status into consideration. This “view” depends on the previous information and will be the primary view, and therefore also what the project revolves around.

4. Data Set

The implementation presented in this paper represents a proof-of-concept that was designed and tested locally by analysing a subset of the overall traffic simulations. The data is streamed as a simulation of real-life city traffic to three Kafka topics at specific times throughout the day. An in-depth explanation of why this has been done locally can be found in sections 8 and 9, Reflections and Log. This section describes the manner in which the solution can be expanded into a distributed system. Since incoming data is streamed using Apache Kafka to a Hadoop DFS, it makes sense to restrict the software used to the Apache family of products.
One possible configuration for the overall setup is to stream the data using Kafka, process it immediately and continuously update an intermediary dataset of short per-vehicle history. The reason for saving this history is so that the application can deal with incomplete or incorrect data, such as “teleporting” vehicles. This step is vital in order to accurately calculate the map statistics from the point of view of each intersection. To best utilize the HDFS setup, and given the timeframe of incoming data, the total number of incoming readings for each second can be split equally among all nodes so that each of them has a similar workload. A high-level visualisation can be seen in Figure 1.

Figure 1: High-level architecture with Kafka streaming

A different way of integrating all these components is to use a different streaming software, for example the Amazon Kinesis services. Following this approach, the incoming data to the Kafka topic has to be streamed outside of the DFS and into the Amazon Web Services (AWS) platform. AWS attempts to provide a full range of products and services that are designed to streamline the big data management process and allow users to focus on data analysis rather than maintaining the infrastructure of the system (Amazon, 2018). On a technical level, data is first simulated using the SUMO software and is then streamed to a Kafka topic; this is the manner in which the data has been provided and thus remains unchangeable. If the entire architecture could be changed, the SUMO data would be streamed directly into an AWS storage service using Kinesis Data Streams. Next, the data could be pre-processed, transformed or modified and analysed using a variety of AWS tools, for example AWS Lambda, Kinesis Data Analytics or Spark on EMR.
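The short-history step described above can be sketched as follows. The tuple fields, the plausibility threshold and the history length are illustrative assumptions; in the full system the readings would arrive from a Kafka consumer rather than as plain function arguments:

```python
from collections import deque

MAX_SPEED_MS = 60.0   # assumed upper bound on plausible vehicle speed (m/s)
HISTORY_LEN = 5       # keep only a short trail of readings per vehicle

def update_history(history, vehicle_id, t, x, y):
    """Append a reading to a vehicle's short history if it is plausible.

    A displacement implying a speed above MAX_SPEED_MS is treated as a
    'teleport' (incomplete or incorrect data) and the reading is dropped.
    Returns True if the reading was accepted, False otherwise.
    """
    trail = history.setdefault(vehicle_id, deque(maxlen=HISTORY_LEN))
    if trail:
        t0, x0, y0 = trail[-1]
        dt = max(t - t0, 1e-9)
        speed = ((x - x0) ** 2 + (y - y0) ** 2) ** 0.5 / dt
        if speed > MAX_SPEED_MS:
            return False
    trail.append((t, x, y))
    return True

history = {}
update_history(history, "v1", 0.0, 0.0, 0.0)    # accepted (first reading)
update_history(history, "v1", 1.0, 10.0, 0.0)   # accepted (10 m/s)
update_history(history, "v1", 2.0, 500.0, 0.0)  # rejected (490 m/s jump)
```

The bounded deque is what keeps the intermediary dataset "short": each accepted reading evicts the oldest one once the trail is full.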
Figure 2: Flow if the SUMO simulation and Kafka streaming were used together with the Amazon Kinesis system.

5. Data Processing

5.1. Map and Simulation Data

Analysing the source of the Simulation of Urban Mobility (SUMO) data reveals that the entire map is composed of three elements: edges, junctions and connections. First, edges represent all the different types of roads; in this case the map has 33 different types of roads, each having its own set of attributes such as priority, number of lanes, allowed types of transportation, etc. In the same manner, junctions can be one of many types; in this case, the map has five different types of junctions. A complete list of the types of roads and their attributes can be seen in Appendix A.1, A.2 and A.3. It is important to note that edges and junctions can be grouped into clusters in order to keep large intersections, and streets that have more than one type of road, organised in a clean manner. The final type of infrastructure, connections, represents the metadata that keeps track of the manner in which all edges, junctions and clusters are connected to each other. The map for the simulation of urban mobility can be seen in Figure 3.

Figure 3: XML map visualisation

Figure 4: XML map visualisation with real map overlay

In this project, the map has been designed to mimic the city of Copenhagen, Denmark, as closely as possible. While it does not match its real-life counterpart 1:1, it is a very close approximation (Figure 4). It is important to note that the map contains data strictly about the infrastructure of the city; information about the various public transport stations is omitted.
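To illustrate how such a map can be turned into a usable data structure, the sketch below parses invented miniature stand-ins for the junction (node) and edge descriptions into a directed graph; the IDs and attribute values are made up, while the attribute names follow SUMO's plain-XML conventions:

```python
import xml.etree.ElementTree as ET

# Invented miniature stand-ins for the map files; the attribute names
# (id, x, y, from, to, numLanes, speed) follow SUMO's plain-XML format.
NODES_XML = """<nodes>
  <node id="J1" x="0.0" y="0.0"/>
  <node id="J2" x="100.0" y="0.0"/>
</nodes>"""
EDGES_XML = """<edges>
  <edge id="E1" from="J1" to="J2" numLanes="2" speed="13.89"/>
</edges>"""

def build_graph(nodes_xml, edges_xml):
    """Parse the map into a directed graph: junction coordinates keyed by
    id, plus adjacency lists of (to, edge_id, numLanes, speed) tuples."""
    nodes = {n.get("id"): (float(n.get("x")), float(n.get("y")))
             for n in ET.fromstring(nodes_xml)}
    graph = {nid: [] for nid in nodes}
    for e in ET.fromstring(edges_xml):
        graph[e.get("from")].append(
            (e.get("to"), e.get("id"),
             int(e.get("numLanes")), float(e.get("speed"))))
    return nodes, graph

nodes, graph = build_graph(NODES_XML, EDGES_XML)
print(graph["J1"])  # [('J2', 'E1', 2, 13.89)]
```

Such a graph is also the shape of input that a shortest-path search like A* would expect, with the per-edge statistics attached to the adjacency tuples.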
In addition to this, the traffic simulation data contains information strictly about pedestrians and automobiles, ignoring bicycles and public transport. Due to the limited variety of the data, one cannot identify the correlation between bicycle and automobile traffic or account for their mutual impact; furthermore, possible improvements such as optimising the manner in which public transportation schedules its routes are unachievable with the current data.

5.2. Processing

Before the best route can be calculated given the current status of traffic across the city, the most suitable emergency service station needs to be selected. At this point, the project makes the assumption that a third-party governmental dispatcher is in charge of selecting the best hospital, police or fire station, etc. to respond to an incident. Next, to be able to select the best route to an incident from a predefined starting location, the current traffic status for each street, lane and intersection needs to be known. Even though the map and the traffic can be visualised using the Simulation of Urban Mobility graphical user interface, the software is too rigid and does not allow for complex computation using the underlying edges and junctions. Because of this, an initial idea was to interpret the map as a directed graph where each node is a junction and each graph edge represents a lane extracted from the SUMO file. However, this approach proved infeasible from an implementation point of view due to the high computational requirements, which exceed the capabilities of a single workstation computer. In addition to this, creating a single directed graph where each lane is represented by exactly one edge was unreliable due to the manner in which the lane IDs are related to the overall street within the original XML file, for which no in-depth description is available.
In practice this meant that correctly identifying each lane and its direction was much more challenging than simply parsing each incoming data tuple, choosing the vehicle_lane column and retrieving the correct street. To work around this challenge, the map had to be interpreted in a different manner. The NETEDIT application, a visual network editor, is available by default as part of the Simulation of Urban Mobility software. Similar to the SUMO-GUI, this application enables users to modify the underlying XML map and export it to different formats that are more efficient and straightforward to use for collecting statistics. Following this approach, the map was separated into two XML files, one containing all nodes and one containing all edges. In effect this meant that the proposed application ignores individual lanes, because the lane IDs are not consistently referable to our static map data; instead, our solution computes statistics for both driving directions of each street. Therefore, traffic status is interpreted and anchored to nodes, i.e. junctions, rather than to individual edges/lanes (SUMO homepage). This is achieved by parsing the lane ID and identifying its leading special characters. For example, a colon character ":" in front of the lane ID means that it refers to a node; if there is no colon, it refers to an edge; and if there is a "-" character, it refers to the edge in the opposing direction. In this way, the data structure for each node and edge in the graph can be directly controlled and updated in real time with information that is not collected directly by sensors. For example, following such an approach one can calculate the average number of cars that pass through a junction every minute even though there is no sensor that actually counts passing cars.
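The prefix rules just described can be sketched as a small classifier; stripping the trailing "_&lt;index&gt;" suffix assumes SUMO's usual "edgeID_laneIndex" lane naming, and the function name is illustrative:

```python
def classify_lane_id(lane_id):
    """Classify a SUMO lane ID by its leading character.

    ':'  -> the lane belongs to a junction (node)
    '-'  -> the lane belongs to an edge in the opposing direction
    else -> the lane belongs to an edge in the forward direction
    Returns a (kind, base_id) pair, with the trailing lane index removed.
    """
    if lane_id.startswith(":"):
        return ("node", lane_id[1:].rsplit("_", 1)[0])
    if lane_id.startswith("-"):
        return ("edge_reverse", lane_id[1:].rsplit("_", 1)[0])
    return ("edge_forward", lane_id.rsplit("_", 1)[0])
```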
In this manner, the graph interpretation of the city map can be modified to collect information about its intersections, such as the number and average speed of cars that recently passed through, and information about each lane, such as the number of cars, their average speed, and the average distance between them. In order to tackle this big data challenge, a Hadoop Distributed File System is used for storing and processing all the information. The incoming traffic data is parsed at each timestep by a Python script which identifies each tuple based on its unique ID. It then compares each object with a master dataset which holds the historical information for each vehicle and either creates a new entry or updates the history. A vehicle's short history needs to be recorded in order to determine its dynamics. It is important to note that the vehicle history is stored only for the duration of the driving session; once the vehicle has stopped in legal and safe conditions for a long period of time, it is considered inactive and has no impact on traffic. This has to be done in order to keep the computation time as low as possible, and to ensure that the readings are accurate, old data has to be discarded, creating a trade-off between higher accuracy and faster processing time. In addition to this, the map data can be saved in a master file and continuously appended to; however, the emphasis is placed on real-time data, and old statistics have no impact on the current traffic status. As a result, depending on the desired statistic, a variety of visualisation methods can be tested. For example, it would be useful for city planners to have a daily view of which streets have the most traffic or which ones have average speeds well under the limit.
For city residents, a faster update time could be beneficial: seeing the level of traffic for the last 10 minutes across the city through an interface similar to a heat map can offer great insights into how to best plan a route. Finally, emergency services could utilise real-time data to identify inefficient routes and select a longer route that offers a much higher average speed with minimal extra distance. A disadvantage of this approach is that a separate dataset, containing the statistics for each map edge and node, needs to be created in order to accurately calculate the expected travel time and select the optimal route. In an emergency situation, even a difference of one minute or a few seconds can make the difference between success and failure. Thus, the data needs to be highly accurate and reliable. The proposed system has the following two components:
● vehicle short historic data
● city map interpreted as a directed graph with dynamic statistics
As stated previously, the historic data does not record all movements of a vehicle but rather the latest driving session, and is then deleted. The data can be further reduced to the latest four timesteps without losing much accuracy; the reason for this is that one cannot estimate a vehicle's speed and trajectory from a single timestep. Two consecutive timesteps could be used to calculate the various attributes; however, no data collection method can ensure complete fidelity to the real-world environment. To reduce the risk of incomplete data, which could in turn be identified as a false positive for an accident or traffic congestion, more samples have been collected and analysed.
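The vehicle-history component above can be sketched as follows: a per-vehicle history capped at the latest four timesteps, from which speed is estimated using the two most recent samples. Class and attribute names are illustrative; positions are assumed to be in metres and timesteps in seconds:

```python
import math
from collections import deque

class Vehicle:
    """Per-vehicle short history, kept only for the active driving session."""

    def __init__(self, veh_id):
        self.veh_id = veh_id
        self.history = deque(maxlen=4)  # keep only the latest four timesteps

    def update(self, timestep, x, y):
        """Append one (timestep, x, y) sample; oldest samples drop off."""
        self.history.append((timestep, x, y))

    def estimate_speed(self):
        """Speed (m/s) from the two most recent samples.

        A single sample gives no speed or trajectory, so None is returned
        until at least two timesteps have been recorded.
        """
        if len(self.history) < 2:
            return None
        (t0, x0, y0), (t1, x1, y1) = self.history[-2], self.history[-1]
        return math.hypot(x1 - x0, y1 - y0) / (t1 - t0)
```

The master dataset described in section 5.2 would then be a mapping from vehicle ID to `Vehicle`, with each incoming tuple either creating a new entry or calling `update` on an existing one.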
The above-mentioned process is used to identify the traffic status for each vehicle, while a second process utilises this information to update the map statistics in order to reflect the current traffic status across the city. To provide useful information to multiple entities, there should be three different map views:
● Daily view: traffic statistics for the latest 24 hours
● Short history view: statistics about the last 10 minutes, so that city residents can plan their journeys
● Real-time view: statistics containing the most recent information; depending on the emergency service, this can vary between 30 seconds and 5 minutes
An in-depth technical description of each function within the algorithm can be read in the README.txt file, which can be found in Appendix B or in the zip file submitted alongside this report.

5.3. Analysis

The following paragraphs present a series of figures as well as a number of technical considerations in order to better describe the proposed system. In Figure 5, a bar plot of the most visited streets from a snippet of the simulation can be seen. Specifically, the data stems from timesteps 0 to 2454, which contain a total of 9,981,223 records (vehicles and pedestrians) and 6,094 uniquely recorded vehicles. For individuals familiar with the street names of Copenhagen, the most visited street names intuitively correspond with what one would expect from the traffic flow in the city.

Figure 5: Bar plot of the 10 most visited streets

Figure 6: The 10 most visited streets highlighted on the map visualisation

However, for "Copenhagen novices", just looking at the names might not bring much insight into which areas of the city are visited by many cars; therefore an accompanying map, marking the 10 most visited streets, has been produced.
The map, which can be seen in Figure 6, follows the colour convention of the bar plot and reveals something different: the majority of these roads function as main connections between separate parts of the city, i.e. between Amager and the inner city, and between the inner city and the faubourgs of Copenhagen. This does not necessarily denote anything meaningful, because these roads most likely also have a greater capacity in terms of how many cars they can carry before traffic flow is directly affected. Nevertheless, it is an important aspect to include when analysing which route the emergency response unit should take. The bar plot as well as the maps are examples of how the data could be presented to the individuals in the emergency dispatch centre. The second view of the speed layer is intended to present information on capacity versus traffic flow. This is a measure of how heavy the traffic is on all streets, and is considered comparable between different types of roads, e.g. regional and local streets. In order to define capacity, the categorisation from the municipality of Copenhagen is used ("Byens vejnet", 2015). Here, the streets are classified into several types: regional (regionale veje), distribution (fordelingsgader), district (bydelsgader) and small (strøggader); in addition, there are two traffic zones: inner city (trafikzone indre by) and the 40 km/h zone (40 km/t hastighedszoner). A visualisation of the types can be seen below. Ranked by capacity, the types are: regional, distribution, district, small, 40 km/h zone, inner city.

Figure 7: Copenhagen Municipality categorisation of roads

If the correct capacity numbers for the different road types were available, we would be able to define fairly precisely which streets are most congested. To showcase the convenience of the ratio, we tried to estimate somewhat realistic capacity numbers1.
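With capacity estimates in place, the congestion measure reduces to a simple ratio; the numbers below are this report's own rough estimates (cars per hour), and the function name is illustrative:

```python
# Rough hourly capacity estimates per road type (from this report's footnote)
CAPACITY = {"regional": 3600, "distribution": 2500,
            "district": 2000, "small": 1200}

def congestion_ratio(vehicles_per_hour, road_type):
    """Observed traffic divided by estimated capacity.

    Values above 1.0 indicate a road operating over its estimated capacity;
    a threshold on this ratio decides when edge weights are penalised.
    """
    return vehicles_per_hour / CAPACITY[road_type]
```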
The stacked bar plot below shows the estimated level of congestion for the 10 most visited roads.

Figure 8: Stacked bar plot of the ratio between traffic and road capacity

Compared to the plain bar plot, we clearly see that in terms of capacity, "Torvegade" is no longer the busiest, since it has a much greater capacity than "Dronning Louises Bro"2. With this type of measure, we now have the possibility to define a threshold to which the "level of congestion" can rise without impacting the weights of the edges. The directed graph can hold multiple properties about its edges, which can be used to evaluate the optimal route in the A* search. Some of these properties are used as weights, which in our case are priority and allowed speed; this information can be extracted from the static map data. Other, and perhaps more precise, properties that might enhance the evaluation are the average speed on an edge and the time it takes to pass it (e.g. a ratio between the length of the edge and the time spent passing it).

1 Estimated capacity of cars per hour: regional: 3600, distribution: 2500, district: 2000, small: 1200.
2 Classification of depicted roads: Torvegade: district, Dronning Louises Bro: small, Børsgade: distribution, Amager Boulevard: regional, Langebro: regional, Knippelsbro: district, Gothersgade: distribution, Gyldenløvesgade: regional, Nørre Søgade: regional, Danasvej: district.

Figure 9a: Initial route selected by the A* search, without traffic status

Figure 9b: Optimised route selected by the A* search, with traffic status

As an example, we have chosen the police station "Station City", located at "Halmtorvet", as the source, i.e. where the path starts, and "Prinsessegade" as the target.
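A route query like this can be sketched with networkx, one of the project's listed packages. The helper names are illustrative; `blocked` plays the role of the excluded "Knippelsbro" edge, and the heuristic is a straight-line distance like the report's `distance` helper:

```python
import math
import networkx as nx

def euclid(pos):
    """Straight-line A* heuristic over a dict of node -> (x, y) positions."""
    return lambda u, v: math.hypot(pos[v][0] - pos[u][0],
                                   pos[v][1] - pos[u][1])

def best_route(graph, pos, source, target, blocked=()):
    """Run A* on a copy of the baseline graph with congested edges removed.

    `blocked` lists (u, v) edges whose estimated travel time exceeds that
    of the second-best route; removing them forces an alternative path.
    """
    g = graph.copy()
    g.remove_edges_from(blocked)
    return nx.astar_path(g, source, target,
                         heuristic=euclid(pos), weight="weight")
```

Because the baseline graph is copied, the original weights (speed limit, priority) stay intact between queries.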
Looking at Figure 9a, the path selected seems like the optimal route between the two locations, passing over the bridge "Knippelsbro". However, considering that "Børsgade" and "Knippelsbro" are among the three busiest streets, one might want to account for that when selecting the optimal route at a given timestep. Figure 9b is the result of excluding the "Knippelsbro" edge from the graph, which forces the search to find another route. This is done when the simulation data reveal that the estimated time of arrival exceeds that of the second-best route. Comparing the two routes, the route in Figure 9a is 2.713 kilometres long, whereas the altered route is 3.154 kilometres.

5.4. Technical Review & Reflections

First of all, the static map file is processed into a directed graph, which is capable of holding information about current traffic statistics. Unused types of edges are not processed, as they serve no purpose for our application; these are the edges that allow only bikes and pedestrians, not vehicles. There will always be a baseline graph whose edges are weighted only by speed limit and lane priority. This graph is copied and updated depending on the current traffic status: nodes that are unavailable due to heavy traffic are weighted more poorly, so that the search gives less priority to these streets and they are left out of any new A* search. There are multiple ways of assessing the best route; as explained earlier, the aim is to provide the capability to take into consideration the 24-hour traffic data, the latest 10 minutes, or just the last 30 seconds to 5 minutes. Providing the assessment on a 24-hour basis would yield a more robust model that is less affected by noise, but one might not get a precise evaluation, as a lot of irrelevant data would be included. For example,
one has to assume that traffic is more intense between 15:00 and 18:00 than between 02:00 and 05:00; therefore the model should account for such fluctuations and prioritise recent data over "old" data.

6. Legal Considerations

As part of this project, this section contains the legal considerations for the overall handling of the given dataset. The dataset includes records of the movements of different vehicles in Copenhagen, as well as records of pedestrian movement. As such, this service uses data with location origins, which in turn makes it personal data according to GDPR Article 4: "'personal data' means any information relating to an identified or identifiable natural person ('data subject'); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person" (Article 4, GDPR, 1). For the dataset used by the service, the possibility of a person being identified by the traces the data creates cannot be ruled out. Even though the data is location data, it is unspecified in terms of how the data is distributed for processing. That, however, does not mean that the processing can be done without critical consideration of data protection. As stated in Article 9: "processing is necessary for reasons of substantial public interest, on the basis of Union or Member State law which shall be proportionate to the aim pursued, respect the essence of the right to data protection and provide for suitable and specific measures to safeguard the fundamental rights and the interests of the data subject" (Article 9, GDPR, g).
The collection of the dataset was initially meant for the purpose of improving emergency wayfinding in critical situations, and thereby optimising citizens' safety through a service providing the least populated route. This is provided not only for emergency transport but also for average citizens. Therefore, it is argued that the processing of the dataset is necessary for reasons of substantial public interest, as stated above. This statement of the purpose of collecting the data also corresponds with the legal grounds for this service. Since the service is made from the perspective of public interest, this is also the legal ground for the entire project, derived from Article 6: "processing is necessary for the performance of a task carried out in the public interest or in the exercise of official authority vested in the controller" (Article 6, GDPR, e). This rests on the assumption that the data controller for the project is the Danish Government, and that aspects such as consent and distribution of data are handled in compliance with GDPR regulations. As such, we as processors will have the authority, provided by the data controller, to distribute and analyse the data to achieve the goals of the project. This in turn also obligates the data processing to be done in accordance with the GDPR regulations that the Danish Government is subject to. As stated in Article 29: "The processor and any person acting under the authority of the controller or of the processor, who has access to personal data, shall not process those data except on instructions from the controller, unless required to do so by Union or Member State law" (Article 29, GDPR). This in turn means that serious consideration must be given to ensuring compliance with the GDPR, but also to the responsibility that we possess as the incident and emergency response team to ensure safety and the optimisation of infrastructure within the city.
Other legal issues

In addition to the legal considerations in the section above, another legal issue was discussed: ensuring that the data is processed correctly and within the GDPR if it is reused for other purposes, as the data is not currently reused. One aspect that came to mind regarding reuse of the dataset was making it a pseudonymised dataset. This is important when considering the case of actual movements of real individuals in the city, because the identifiable attributes are replaced by the Vehicle ID and Person ID. According to Maldoff (2016), pseudonymisation under the GDPR can be defined as: "(...) the separation of data from direct identifiers so that linking to an identity is not possible without additional information that is held separately." As processors, we know that great care must be taken so that the data is handled carefully, and so that datasets about the identity of actual individuals cannot potentially reveal their movements. If such data is not handled with respect, it can mean a breach of the privacy of individuals. Therefore, if the dataset is reused for other purposes, a solution could be pseudonymisation, in order to secure information that could potentially lead to identifying the location of an actual individual. However, one important note is that to make sure an individual is no longer identifiable, one would have to completely anonymise the dataset, so that it would no longer be subject to the GDPR. As the Article 29 Working Party explains in Opinion 05/2014 on anonymisation techniques, pseudonymisation and anonymisation are not the same. Whereas "pseudonymisation" is defined as the action of processors: "(...) replacing one attribute (typically a unique attribute) in a record by another.
The natural person is therefore still likely to be identified indirectly; accordingly, pseudonymisation when used alone will not result in an anonymous dataset." (ibid: 20). In comparison, anonymisation is defined as the action taken by processors to make sure that information "(...) can no longer be used to identify a natural person by using 'all the means likely reasonably to be used' by either the controller or a third party. An important factor is that the processing must be irreversible." (ibid: 5). As such, the Article 29 Working Party points to a common mistake among processors: assuming that pseudonymisation, in the sense of replacing one or more attributes, is sufficient to anonymise a dataset (ibid: 21). In relation to this project, we are aware that complete anonymisation would be the most desirable solution for handling possible reuse of the dataset for other purposes. However, to keep the dataset usable, in the sense of being able to track relations and reference data points to one another, the process will be limited to pseudonymisation.

7. Selling It

As the city's incident and emergency services, we aim to improve how quickly emergency workers reach the scenes of accidents. We strongly believe that providing information about the current state of the infrastructure may be of crucial importance to the citizens who live within the city limits. During rush hour, our solution will help the emergency response unit find the best route to an emergency situation. This can be helpful in situations, or during periods of the day, when there is an overload of other vehicles in a particular area, or when specific roads are affected by construction work.
The service will provide multiple ways of assessing the best route, as it can take into account the 24-hour traffic data, the latest 10 minutes, or just the last 30 seconds to 5 minutes, both for the traffic status overview and for the optimal route search. We aim to create a technology that will benefit the government and be implemented in future tools for emergency institutions, providing not only greater security for those being rescued but also higher safety for the personnel of the emergency vehicles. For our stakeholder, who also determines financial aid, we describe below how our service will benefit the Danish Government and its relevant workers.

7.1. Benefits of the service

The dream scenario is to implement the technology and data visualisation in the GPS systems of all emergency vehicles. Having all vehicles connected to the data will also be a major asset if a larger emergency response must be present at the same location, by creating an overview of the routes available. As a result, precious minutes may be recovered, which could be the difference between life and death.

Figure 10: Example of route optimisation

Above is a demonstration of how the route will be optimised and visualised for the personnel driving the vehicles as well as the personnel in the dispatch centre who have the overview. The left picture displays the most direct route in terms of directions and length, but without taking the traffic status into consideration. It turns out that "Knippelsbro" and "Børsgade" are very busy and have a fairly low capacity. With this knowledge the underlying map changes, so that it is more likely to pick roads that are less occupied but still directed towards the end goal. Therefore, the suggested route now goes via "Langebro" and "Amagerbrogade", which have a lower level of congestion.
As such, the service will not focus on the shortest route, but rather on the route that is least occupied by other cars and pedestrians. Furthermore, from a business perspective, it is a benefit that the service can offer valuable insights to the various actors involved with the project: city residents, city planners and emergency services. The easiest way of describing the system would be as a Google Maps replica tailored specifically to the infrastructure of Copenhagen. As such, we believe that the technology can also provide a social benefit. Creating transparency and calculating which streets are most congested may provide the necessary information for optimising the infrastructure, or for other possible improvements that will benefit all individuals. As seen in the figure above, calculations can be made if a route is constantly over-populated; optimising access to other streets or changing public transportation could provide a safer infrastructure. On a technical level, data will be stored and processed using a Hadoop distributed system. A major benefit of this course of action is that the system can be implemented and tested on a small scale first, and then simply scaled up to increase the capacity of the cluster to match the complexity required. One of the challenges of such a system is the accuracy of the data collected, which is directly related to how many sensors are placed around the city and how fast and reliable the network connection is.

8. Reflections

Considering what the service is and what it can provide, it will not come without complications. The fact that the service is in many ways similar to Google Maps, even though it is specialised for emergency transport, creates problems when considering the hardware perspective.
The hardware requirements of fully implementing such a project cannot be accurately estimated at this point in time, due to the sheer scale of the infrastructure needed to ensure optimal functioning. There are four vital requirements for such a system to be considered feasible. First, there has to be a city-wide network infrastructure that can connect all sensors to a centralised aggregation server. Secondly, the sensors themselves have to be designed, tested and placed all over the city to be able to collect data about vehicles and pedestrians at all. Thirdly, the project assumes the existence of a database that monitors and updates all vehicle and person IDs so that each matches exactly one vehicle or person in real life. Fourthly, to get a set of streaming data similar to the one provided in the Kafka topics, the data has to be taken from cameras rather than individual sensors; this adds a complex layer of data processing methods and algorithms so that data can be extracted from raw video input. As such, major resources must be devoted to accommodating these issues if the project is funded. Another issue is the ethical implications that occur when a system requires this level of "surveillance" in order to function. One aspect is how a system such as our service transforms location into tradable objects and disruptive elements. As such, the service could potentially create networks of accountability that are based on power structures. This assumption is derived from the article by Irina Shklovski, Janet Vertesi, Emily Troshynski and Paul Dourish, "The Commodification of Location: Dynamics of Power in Location-Based Systems". In the case of this service, we see the power relationship between government and citizens being disrupted, as the government's collection and usage of the data may not be done with privacy in mind.
This means that the ethical issue is how the system may reshape the institutional relationship between them, and how it becomes a question of safety versus privacy (Shklovski, Vertesi, Troshynski & Dourish, 2009). An aspect of social impact is how citizens feel about the government being in possession of such data. Even knowing that the legal ground is public interest, it may create uncertainty, especially considering how certain municipalities in Denmark have been unable to protect themselves from data leaks and have not been compliant with data processing regulation (Kjær, 2018). Considering who the intended customer is, the government, it is unlikely that the complex system and the data gathered will be used only for our solution, meaning that we cannot guarantee that it will not be used for other purposes. Implementing the dependencies required for this project to succeed would at this point be infeasible and, one could reckon, a utopian dream for the "big brother state", but a nightmare for those who oppose it.

9. Log

One of the main challenges of this project was not analysing the data, but combining all the different software, libraries and packages so that streaming data can be viewed and processed in real time. Probably the reason for this is that the Big Data Management lectures did not go deep enough into the technical details of using Scala, Spark, Hadoop, Kafka or SUMO from a software development perspective. The task of presenting the intricacies of this software was left to the TAs, who, in the later stages of the course, were unable to deal with the large number of questions and misunderstandings. Furthermore, as part of the projects, all groups had to deploy their applications to the cluster in order to test them.
In practice, it was often the case that an application that does not function successfully and optimally can create serious cluster bottlenecks that severely reduce the overall processing power. In worse cases, an unsuccessful application can completely crash one of the nodes, rendering the cluster useless until it is fixed by the TAs or ITU's IT support department. In such cases, the students could not do anything, because they did not have the necessary permissions. Furthermore, it was stated at the beginning of the course that any programming language could be used for analysing the data; however, that is not entirely true. While it might be admissible from a course curriculum point of view, in practice the necessary libraries and dependencies between Scala and the preferred programming language had to be installed, and, as stated previously, the students do not have the necessary permissions to install these dependencies themselves. Since we, of course, do not have admin permissions on the HQ, we were unable to install the packages used in our application. We aim to fix this in an improved version of this project, since the TAs had helpfully said that we could reach out if any assistance in installing packages was needed. Due to the troublesome process of combining the components to test our solution on the cluster, we extracted parts of the topics to work on locally. One thing that we found difficult was how to interpret the lane_ids and relate them to the osm.net.xml file, for several reasons. In order to understand the content of the map, we went through the process of producing the plain XML files. Since we only process two of the files in order to generate our directed graph, we lacked some of the information in the original osm file.
This ultimately meant that the ability to directly relate a lane ID to an edge/node disappeared, since the conversion removed the detailed referencing. Therefore, we were obliged to translate the lane IDs into IDs that are relatable to either a node or an edge. This caused much frustration, and a lot more work was required in order to combine the two types of data sources. Because we seemed to be the only group taking this approach, there was little to no guidance on how to relate the two references.

10. References

Amazon. (2018). Retrieved from https://aws.amazon.com/kinesis/?fbclid=IwAR2Ot0-znuulxXZU6fMZ6N89F7NJRg7RWxRNS2T1XFVahM3XFy2nWK-eICk

Amazon. (2018). Whitepaper: Streaming Data Solutions on AWS with Amazon Kinesis. Retrieved from https://aws.amazon.com/kinesis/whitepaper/?fbclid=IwAR1dfMtjrFb5ZIcdcY6c5d83Jyqx_VJwooBWscgA1bdRbdvZMOJ3z_sJw-k

Art. 4 GDPR - Definitions. Retrieved from https://gdpr-info.eu/art-4-gdpr/

Art. 6 GDPR - Lawfulness of processing. Retrieved from http://www.privacy-regulation.eu/en/article-6-lawfulness-of-processing-GDPR.htm

Art. 9 GDPR - Processing of special categories of personal data. Retrieved from https://gdpr-info.eu/art-9-gdpr/

Art. 29 Working Party - Opinion 05/2014 on Anonymisation Techniques. Retrieved from https://www.pdpjournals.com/docs/88197.pdf

Art. 29 GDPR - Processing under the authority of the controller or processor. Retrieved from http://www.privacy-regulation.eu/en/article-29-processing-under-the-authority-of-the-controller-or-processor-GDPR.htm

Byens vejnet. (2015). Retrieved from https://kp15.kk.dk/artikel/byens-vejne

Hausenblas, M. (2014). Lambda Architecture with Apache Spark. Retrieved from https://speakerdeck.com/mhausenblas/lambda-architecture-with-apache-spark?slide=24

Kjær, J. S. (2018). Borgere over hele landet er ramt af datalæk: Gladsaxe indrømmer mere ulovlig databehandling.
Retrieved from https://politiken.dk/indland/art6918190/Gladsaxe-indr%C3%B8mmer-mere-ulovlig-databehandling?fbclid=IwAR3eKpnYtXHAFS-PlnoiOEO1UTj56ZqT2VG_sm4tQ2foSCHRepJNv3vKTdY

Maldoff, G. (2016). Top 10 operational impacts of the GDPR: Part 8 - Pseudonymization. Retrieved from https://iapp.org/news/a/top-10-operational-impacts-of-the-gdpr-part-8-pseudonymization/

Mobility | Definition of mobility in English by Oxford Dictionaries. (n.d.). Retrieved from https://en.oxforddictionaries.com/definition/mobility

Networks/PlainXML. (n.d.). Retrieved from http://sumo.dlr.de/wiki/Networks/PlainXML

Shklovski, I., Vertesi, J., Troshynski, E., & Dourish, P. (2009). The commodification of location: dynamics of power in location-based systems. In Proceedings of the 11th international conference on Ubiquitous computing. Orlando, Florida, USA: ACM.

11. Appendix

Appendix A.1: Edges with allowed types of vehicles

Appendix A.2: Edges with disallowed types of vehicles

Appendix A.3: Junctions and their attributes

Appendix B: Readme.txt

readme file for project 3 in big data management, group 10

Used packages:
● pyspark (will be used when streaming and executing on the cluster)
● numpy
● pandas
● networkx
● xml
● regex
● folium # only for visualisations
● matplotlib # only for visualisations
● operator

library.py functions:
1. vehicle: class used to store the simulated data, so that the data is associated with a specific vehicle ID. It has an update method, which is used by data_parser if the vehicle has already been constructed.
2. data_parser: parses the provided pandas dataframe, checks whether each vehicle has already been constructed, and updates it to the current status of the car.
3.
node_parser: takes the plain xx.nod.xml file and adds each node to the graph, along with its x,y position on the map plane and its type.
4. edge_parser: takes the plain xx.edg.xml file and adds an edge between the nodes added to the graph in the previous function; it therefore expects a graph that already contains the nodes. This function also generates a dictionary that serves as a quick reference for looking up edges and the nodes they connect.
5. The rest of the library contains helper functions that perform the intended processing and analysis of the simulated data. Here is an overview of their purpose and use:
6. lane_statistics: parses the given data and produces a ranked list of the most visited streets.
7. find_nearest_node: given x,y coordinates and a graph, returns the node that is closest.
8. distance: heuristic function used by the A* algorithm to determine which node is closest. Takes two nodes and returns the distance between them in meters.
9. plain_distance: measures the distance in meters between two x,y coordinates.
10. total_distance: measures the length of a path. Given a list of nodes and a graph, it calculates the total distance.
11. x_y_to_lat_long: used for map visualization. Takes a path (a list of nodes), a graph, lat_model (a multiple regression model trained on latitude), and long_model (a multiple regression model trained on longitude).
12. retrieve_veh_path: takes a vehicle's history and a graph, and returns its path as a list of nodes. This is not 100% accurate, since it might pick nodes that are not connected to the lane the vehicle actually travels on; uses find_nearest_node.
13. add_path: used for visualization. Takes a folium map and a list of latitude/longitude coordinates and adds the path to the map.
14. name_to_nodes: search function that finds nodes with the given name; returns a list of all nodes that have edges with that particular name.
15.
class linear_regression: multiple linear regression model, trained in order to approximate latitude and longitude from x and y coordinates. Surprisingly accurate.
16. translate: takes a lane_id, the edge_to_node dictionary produced during parsing, and the graph. This function was created in order to relate the lane ids to nodes and edges.
17. traffic_status: takes a dataframe of simulated data, the graph, and the edges_to_node dictionary, and returns a dictionary of average speeds on the given lanes.
18. remove_duplicate_streets: only used for visualization purposes; probably not relevant.
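As an illustration of items 7 and 9 above, here is a minimal sketch of how plain_distance and find_nearest_node could look on top of networkx. The node attribute names "x" and "y" and the toy graph are assumptions for the sake of the example; the project's actual implementation may differ.

```python
import math

import networkx as nx


def plain_distance(x1, y1, x2, y2):
    # Euclidean distance between two x,y coordinates on the SUMO map plane.
    return math.hypot(x2 - x1, y2 - y1)


def find_nearest_node(x, y, graph):
    # Return the node whose stored x,y position is closest to (x, y).
    # Assumes every node carries "x" and "y" attributes, as set by node_parser.
    return min(
        graph.nodes,
        key=lambda n: plain_distance(x, y, graph.nodes[n]["x"], graph.nodes[n]["y"]),
    )


# Toy graph standing in for the parsed SUMO network.
g = nx.Graph()
g.add_node("a", x=0.0, y=0.0)
g.add_node("b", x=100.0, y=0.0)

print(find_nearest_node(30.0, 10.0, g))  # → a
```

This linear scan is O(n) per lookup, which is why retrieve_veh_path (item 12) can be slow and imprecise: it snaps each observation to the nearest node independently, without checking lane connectivity.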
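The linear_regression idea from library.py (item 15: approximating latitude and longitude from plane x,y coordinates) can be sketched with a numpy least-squares fit. The class and method names below are illustrative, not the project's actual code, and the coordinates are synthetic.

```python
import numpy as np


class LinearRegressionModel:
    # Least-squares fit mapping (x, y) plane coordinates to one geographic
    # coordinate (latitude or longitude), with an intercept term.
    def fit(self, xy, target):
        X = np.column_stack([np.ones(len(xy)), xy])  # prepend intercept column
        self.coef_, *_ = np.linalg.lstsq(X, target, rcond=None)
        return self

    def predict(self, xy):
        X = np.column_stack([np.ones(len(xy)), xy])
        return X @ self.coef_


# Synthetic training data: latitude as an exactly linear function of x and y.
xy = np.array([[0.0, 0.0], [100.0, 0.0], [0.0, 100.0], [100.0, 100.0]])
lat = 55.6 + 1e-5 * xy[:, 0] + 2e-5 * xy[:, 1]

model = LinearRegressionModel().fit(xy, lat)
print(model.predict(np.array([[50.0, 50.0]])))  # ≈ [55.6015]
```

A linear map works well here because, over a city-scale area, the SUMO plane projection is close to an affine transformation of latitude/longitude, which is consistent with the "surprisingly accurate" observation in the readme.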