Apache Spark and Scala Project question

JC724 · Weaksauce · Joined: Jan 20, 2016 · Messages: 118
I am working on my first Apache Spark/Scala project for my class. Just looking for some guidance. I have never done this before.

I am working on ST_Within. I am just trying to get a clear idea of what it is supposed to do and how I should approach it. Any recommendations for good YouTube videos or websites where I can quickly learn to code in Scala would also be appreciated.

A major peer-to-peer taxi cab firm has hired your team to develop and run multiple spatial queries on their large database that contains geographic data as well as real-time location data of their customers. A spatial query is a special type of query supported by geodatabases and spatial databases. The queries differ from traditional SQL queries in that they allow for the use of points, lines, and polygons. The spatial queries also consider the relationship between these geometries. Since the database is large and mostly unstructured, your client wants you to use a popular Big Data software application, SparkSQL. The goal of the project is to extract data from this database that will be used by your client for operational (day-to-day) and strategic level (long term) decisions.

The project has two phases. In each phase, you will be given data and a template code written in SparkSQL. In the first phase, you will write two user defined functions ‘ST_Contains’ and ‘ST_Within’ in SparkSQL and use them to run the following four spatial queries. Here, a rectangle R represents a geographical boundary in a town or city, and a set of points P represents customers who request taxi cab service using your client firm’s app.

1. Range query: Given a query rectangle R and a set of points P, find all the points within R. You need to use the ‘ST_Contains’ function in this query.

2. Range join query: Given a set of rectangles R and a set of points P, find all (point, rectangle) pairs such that the point is within the rectangle.

3. Distance query: Given a fixed point location P and distance D (in kilometers), find all points that lie within a distance D from P. You need to use the ‘ST_Within’ function in this query.

4. Distance join query: Given two sets of points P1 and P2, and a distance D (in kilometers), find all (p1, p2) pairs such that p1 is within a distance D from p2 (i.e., p1 belongs to P1 and p2 belongs to P2). You need to use the ‘ST_Within’ function in this query.

In the second phase of the project, you will implement two major tasks using template codes in SparkSQL: ‘hot zone analysis’ and ‘hot cell analysis’. The hot zone analysis uses a rectangle dataset and a point dataset. For each rectangle, the number of points located within the rectangle will be obtained. The more points a rectangle contains, the hotter (and more profitable) it will be. The ‘hot cell analysis’ applies spatial statistics to spatio-temporal Big Data in order to identify statistically significant hot spots using Apache Spark.

Requirements for ST_Within (the part I have to write):

Requirement

In this phase you need to write two User Defined Functions ST_Contains and ST_Within in SparkSQL and use them to do four spatial queries:

  • Range query: Use ST_Contains. Given a query rectangle R and a set of points P, find all the points within R.
  • Range join query: Use ST_Contains. Given a set of Rectangles R and a set of Points S, find all (Point, Rectangle) pairs such that the point is within the rectangle.
  • Distance query: Use ST_Within. Given a point location P and distance D in km, find all points that lie within a distance D from P.
  • Distance join query: Use ST_Within. Given a set of Points S1 and a set of Points S2 and a distance D in km, find all (s1, s2) pairs such that s1 is within a distance D from s2 (i.e., s1 belongs to S1 and s2 belongs to S2).
A Scala SparkSQL code template is given. You must start from the template. The main code is in "SparkSQLExample.scala"

The detailed requirements are as follows:

1. ST_Contains

Input: pointString:String, queryRectangle:String

Output: Boolean (true or false)

Definition: You first need to parse the pointString (e.g., "-88.331492,32.324142") and queryRectangle (e.g., "-155.940114,19.081331,-155.618917,19.5307") to a format that you are comfortable with. Then check whether the queryRectangle fully contains the point. Consider on-boundary points: a point lying exactly on the rectangle's boundary counts as contained.
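A minimal sketch of what ST_Contains could look like as a plain Scala function, using only the input formats stated above (the object and function names are my own, and I normalize the rectangle corners since the spec does not promise which corner comes first):

```scala
object STContainsSketch {
  // pointString: "x,y"; queryRectangle: "x1,y1,x2,y2" (two opposite corners)
  def stContains(pointString: String, queryRectangle: String): Boolean = {
    val Array(px, py) = pointString.split(",").map(_.trim.toDouble)
    val Array(x1, y1, x2, y2) = queryRectangle.split(",").map(_.trim.toDouble)
    // Normalize so min/max work regardless of which corner is listed first;
    // use <= / >= so on-boundary points count as contained.
    val (minX, maxX) = (math.min(x1, x2), math.max(x1, x2))
    val (minY, maxY) = (math.min(y1, y2), math.max(y1, y2))
    px >= minX && px <= maxX && py >= minY && py <= maxY
  }
}
```

Once this logic works as a standalone function, wiring it into SparkSQL is just a matter of registering it as a UDF on the session.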

2. ST_Within

Input: pointString1:String, pointString2:String, distance:Double

Output: Boolean (true or false)

Definition: You first need to parse the pointString1 (e.g., "-88.331492,32.324142") and pointString2 (e.g., "-88.331492,32.324142") to a format that you are comfortable with. Then check whether the two points are within the given distance. Consider on-boundary points: two points exactly at distance D apart count as within. To simplify the problem, please assume all coordinates are on a planar space and calculate their Euclidean distance.
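Under the planar/Euclidean assumption the spec asks for, ST_Within is just a distance check. A sketch (again, the names are my own):

```scala
object STWithinSketch {
  // pointString1, pointString2: "x,y"; distance in the same planar units
  def stWithin(pointString1: String, pointString2: String, distance: Double): Boolean = {
    val Array(x1, y1) = pointString1.split(",").map(_.trim.toDouble)
    val Array(x2, y2) = pointString2.split(",").map(_.trim.toDouble)
    // Plain Euclidean distance, per the spec; <= keeps points exactly
    // at distance D ("on-boundary") inside.
    math.sqrt((x1 - x2) * (x1 - x2) + (y1 - y2) * (y1 - y2)) <= distance
  }
}
```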

3. Use Your UDF in SparkSQL

The code template has loaded the original data (point data, arealm.csv, and rectangle data, zcta510.csv) into DataFrame using tsv format. You don't need to worry about the loading phase.

Range query:

select * from point
where ST_Contains(point._c0,'-155.940114,19.081331,-155.618917,19.5307')

Range join query:

select * from rectangle, point
where ST_Contains(rectangle._c0, point._c0)

Distance query:

select * from point
where ST_Within(point._c0,'-88.331492,32.324142',10)

Distance join query:

select * from point p1, point p2
where ST_Within(p1._c0, p2._c0, 10)
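For the queries above to resolve ST_Contains and ST_Within, the functions have to be registered as UDFs on the SparkSession before spark.sql(...) is called. This is only a sketch of that wiring, not the template's actual code; `spark` stands for whatever SparkSession "SparkSQLExample.scala" creates:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("SparkSQLExample").getOrCreate()

// Register under the exact names the SQL queries use; SparkSQL maps the
// closure's argument types onto the SQL call sites.
spark.udf.register("ST_Contains", (p: String, r: String) => {
  val Array(px, py) = p.split(",").map(_.trim.toDouble)
  val Array(x1, y1, x2, y2) = r.split(",").map(_.trim.toDouble)
  // On-boundary points count as contained (<=, >=).
  px >= math.min(x1, x2) && px <= math.max(x1, x2) &&
  py >= math.min(y1, y2) && py <= math.max(y1, y2)
})

spark.udf.register("ST_Within", (p1: String, p2: String, d: Double) => {
  val Array(ax, ay) = p1.split(",").map(_.trim.toDouble)
  val Array(bx, by) = p2.split(",").map(_.trim.toDouble)
  // Planar Euclidean distance, per the spec.
  math.sqrt((ax - bx) * (ax - bx) + (ay - by) * (ay - by)) <= d
})
```

One thing to watch: the detailed requirement lists ST_Contains as (pointString, queryRectangle), but the range join query passes rectangle._c0 first and point._c0 second. Check which argument order your template's queries actually use before settling on a signature.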
 