CSCI 585 Lecture Notes
Spatial Databases
- Keywords: national spatial data infrastructure (NSDI), Minimum Bounding Rectangles (MBRs)
- A spatial database is a database optimized to store and query data related to objects in space, including points, lines and polygons.
- GIS is a specific application architecture built on top of a [more general-purpose] SDBMS.
- Characteristics of geographic data: has location; has size; is auto-correlated; is scale-dependent; might be temporally dependent too
- Two conceptions of space:
  - entity view: space as an area filled with a set of discrete objects
  - field view: space as an area covered with essentially continuous surfaces
- A spatial DB adds to a regular DB: spatial types, operators, indices
- Sources of spatial data: CAD (user creation, reverse engineering); maps (cartography via surveying and plotting, satellite imagery, 'copter/drone imagery, driving around, walking around)
- Spatial relationships in SQL: 'intersects', 'crosses', 'within', 'touches'; geometry types such as polygon and linestring
- Spatial operators and functions: inside/contains, touch/disjoint, cover/coveredby, equal/overlap, within-distance, nearest neighbor, convexhull

Oracle SDO_GEOMETRY object
- SDO_GTYPE: point, linestring, polygon, heterogeneous collection, multipoint
- other attributes: SDO_SRID, SDO_POINT, SDO_ELEM_INFO, SDO_ORDINATES
- example: sdo_geometry(2002, null, null, sdo_elem_info_array(1,2,1), sdo_ordinate_array(...))
- Operators (always compare = 'TRUE'; never use 'FALSE' or = 'true'):
  - SDO_FILTER: sdo_filter(c.location, sdo_geometry(...)) = 'TRUE' filters by a query window, e.g. a rectangular area
  - SDO_RELATE: sdo_relate(c.geom, s.geom, 'mask=inside+coveredby') = 'TRUE'
  - SDO_TOUCH(a.geom, b.geom) = 'TRUE'
  - SDO_WITHIN_DISTANCE: sdo_within_distance(c.location, i.geom, 'distance=15 unit=mile') = 'TRUE'
  - SDO_NN (nearest neighbor): sdo_nn(c.location, i.geom, 'sdo_num_res=5 unit=mile', 1) = 'TRUE'

Spatial indexing (for optimal spatial query performance)
- R-tree indexing: uses an MBR for 2D data, an MBV (minimum bounding volume) for 3D data; indexes two, three or four dimensions; provides an exclusive and exhaustive coverage of spatial objects; indexes all elements with a geometry, including points, lines and polygons
- B-tree indexing: a z-curve (space-filling curve) indexing scheme

Postgres PostGIS
- functions: distance(geo,geo), equals(geo,geo), disjoint(geo,geo), intersects(geo,geo), touches(geo,geo), crosses(geo,geo), overlap(geo,geo), contains(geo,geo), length(geo), area(geo), centroid(geo)
- create: ST_Polygon(ST_GeomFromText('LINESTRING(75.15 29.53 1, 77 29 1, 77.6 29.5 1, 75.15 29.53 1)'), 4326)

Indexing schemes
- B-tree indexing: organizes spatial data with space-filling curves; sorting is not naturally defined on spatial data, yet many efficient search methods are based on sorting datasets
- R-tree indexing: R-trees use MBRs to create a hierarchy of bounds; variants include R+ trees, R* trees, Buddy trees, Packed R-trees; related: k-d trees, k-d-b trees
- Quadtree and octree: each node is either a leaf node (with indexed points, or null) or an internal (non-leaf) node with exactly 4 children
- Indexing evolution

Visualization
- dot maps, diagram maps, choropleth maps

Google KML (Keyhole Markup Language)
- Simple placemark: attached to the ground; intelligently places itself at the height of the underlying terrain
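Both Oracle Spatial and PostGIS prune candidates first with bounding-box tests before running exact geometry checks. A minimal sketch of MBR predicates in pure Python (class and method names are ours, not from any library):

```python
# Toy Minimum Bounding Rectangle (MBR) predicates: the "primary filter"
# test that R-tree indexes use to prune candidate geometries cheaply.
class MBR:
    def __init__(self, xmin, ymin, xmax, ymax):
        self.xmin, self.ymin, self.xmax, self.ymax = xmin, ymin, xmax, ymax

    def intersects(self, other):
        # True if the two rectangles overlap (or touch) on both axes
        return (self.xmin <= other.xmax and other.xmin <= self.xmax and
                self.ymin <= other.ymax and other.ymin <= self.ymax)

    def contains(self, other):
        # True if 'other' lies entirely inside this rectangle
        return (self.xmin <= other.xmin and other.xmax <= self.xmax and
                self.ymin <= other.ymin and other.ymax <= self.ymax)

a = MBR(0, 0, 10, 10)
b = MBR(2, 2, 5, 5)
c = MBR(20, 20, 30, 30)
print(a.contains(b))    # True
print(a.intersects(c))  # False
```

Like SDO_FILTER, an MBR hit is only a coarse first pass; exact tests (e.g. SDO_RELATE) then run on the survivors.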
KML examples
- Simple placemark: coordinates -122.0822035425683,37.42228990140251,0
- Absolute Extruded: "Transparent green wall with yellow outlines", style #yellowLineGreenPoly, extrude 1, tessellate 1, altitudeMode absolute; coordinates:
  -112.2550785337791,36.07954952145647,2357
  -112.2549277039738,36.08117083492122,2357
  -112.2552505069063,36.08260761307279,2357
  -112.2564540158376,36.08395660588506,2357
- The Pentagon: extrude 1, altitudeMode relativeToGround; coordinates:
  -77.05788457660967,38.87253259892824,100
  -77.05465973756702,38.87291016281703,100
  -77.05315536854791,38.87053267794386,100
  -77.05668055019126,38.87154239798456,100
  -77.05542625960818,38.87167890344077,100
  -77.05485125901024,38.87076535397792,100

GIS platforms
- OpenLayers, ESRI Arc/ArcGIS, QGIS, MapBox

NoSQL ("Not only SQL")
- RDBMSs are not suitable for big data / big flux: rapid change, huge variety in types of data, lots of data, lots of people
- desired characteristics: flexible, efficient, available, scalable; high performance and high availability at a large scale
- NoSQL DBs are schema-less, flexible, scalable, fast
- Drivers of the NoSQL movement: avoidance of complexity, high throughput, easy and quick scalability, ability to run on commodity hardware, avoidance of the need for O-R mapping, avoidance of the complexity associated with cluster setup

JSON DBs
- a JSON db == a JSON array of objects; row: each object is a set of k-v pairs
- rdb -> json: array ~ table, object ~ row; values can take forms such as "":[,], {key:value}, [value,value], [key:{},key:{}]
- XML dbs are analogous

BASE, not ACID
- BASE: Basic Availability (db is up most of the time); Soft state (consistency is not guaranteed); Eventual consistency (at a later point in time, by push or pull, data becomes consistent across nodes and replicas)
- ACID: Atomicity, Consistency, Isolation, Durability

Schema-less
- NoSQL DBs do not have an explicit schema; the schema resides in the application code that reads and writes data; the db itself doesn't care what it is storing; use intuitive data structures
- JSON datatypes: number, string, boolean, array, object, null

NoSQL database families
- key-value: DynamoDB (Amazon), Project Voldemort, Redis, Tokyo Cabinet/Tyrant
- column-family store: Cassandra, HBase
- document store: MongoDB, CouchDB, MarkLogic
- graph store: Neo4j, HyperGraphDB, Sesame

Comparison metrics for choosing a NoSQL DB:
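The rdb-to-JSON mapping above (table becomes an array, each row an object of k-v pairs) can be sketched with the stdlib json module; the table and column names below are invented for illustration:

```python
import json

# A relational table flattened to a JSON array of objects:
# each row becomes one object whose k-v pairs are column:value.
columns = ["username", "firstname", "lastname"]
rows = [
    ("cyoon", "Cath", "Yoon"),
    ("jdoe", "Jane", "Doe"),
]

table_as_json = json.dumps([dict(zip(columns, r)) for r in rows])
print(table_as_json)

# Schema-less: nothing stops one "row" object from having extra keys
# or array-valued fields, unlike a fixed relational schema.
doc = {"username": "asmith", "tags": ["admin", "beta"]}
```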
- Architecture: topology, consistency, fault tolerance (and partition tolerance), structure and format
- Administration: logging and statistics, configuration management, backup, disaster recovery, maintenance, recovery, monitoring functionality, security
- Deployment: availability, stability
- Development: documentation, integration, support, usability
- Performance and scalability: performance, scalability

NoSQL disadvantages
- too many options; limited query capability; eventual consistency is not intuitive to program against; lacks joins, group by, order by, ACID; limited guarantee of support

Polyglot persistence
- a term from Martin Fowler (of 'Code Refactoring' fame): the use of different storage technologies to store different parts of a single app; the individual data stores are then tied together by the application, using a DB API or web API
- examples: Gannett, Marriott, Ryanair

Why NoSQL
- to improve programmer productivity by using a database that better matches an app's needs
- to improve data access performance via some combination of handling larger data volumes, reducing latency and improving throughput

Terminology
- cluster: a collection of nodes; holds an entire db
- node: a db instance

Key-value DBs
- a data storage paradigm designed for storing, retrieving and managing associative arrays (a data structure more commonly known today as a dictionary or hash); the whole db is a dictionary
- mapping to relational terms: record => row, field => column; the key in the k-v db plays the role of the pk in a table; the value is an aggregate of all non-pk columns; querying occurs only on keys
- Memcached: a high-performance, in-memory data caching system; operations: set, get
- Redis: popular in social media; value types: number, string, boolean, list, set, dictionary
- Redis & Instagram: sets encode subkeys quite efficiently
- Amazon's Dynamo: manages the state of services that need tight control over the tradeoffs between availability, consistency, cost-effectiveness and performance
  - k-v design: values are stored as blobs; operations are limited to a single k-v pair at a time: get(key), put(key, context, object), where context is metadata such as a version number
  - keys are hashed using MD5; to ensure availability and durability, each k/v pair is
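The whole-db-is-a-dictionary idea, with Memcached-style set/get, fits in a few lines. This is a toy sketch (the class name is ours); real Memcached and Redis add networking, eviction and expiry:

```python
class KVStore:
    """Toy key-value store: the whole db is one dictionary.
    The key plays the role of a primary key; the value is an opaque
    aggregate of all non-key fields. Queries happen only on keys."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

db = KVStore()
# the value aggregates all the non-pk "columns" of a would-be row
db.set("user:42", {"firstname": "Cath", "lastname": "Yoon"})
print(db.get("user:42")["lastname"])  # Yoon
```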
    replicated N times (typically N = 3) and stored in N adjacent nodes
  - pros: no master; highly available for writes; knobs for tuning reads; simple
  - cons: proprietary to Amazon; clients need to be smart; no compression

Column-family stores
- examples: Bigtable, HBase, Cassandra, SimpleDB, Hypertable
- terminology:
  - column: a name (key) and a value
  - super column: a grouping of columns
  - row: a k-v pair where the key is used to index the value, e.g. username:{firstname:'Cath', lastname:'Yoon'}
  - column family: related data; in a user db the columns might be name, age, ...; acts as a 'namespace' for a group of columns
  - super-column family: a 'namespace' for a group of supercolumns; a supercolumn has a name (key), e.g. supercolumn UserList = ()
  - column family vs. super-column family: in a column family, row keys directly contain (as values) column data
- use case: storing tweets

Google Bigtable
- for managing structured data, designed to scale to a very large size: petabytes of data across thousands of commodity servers
- a sparse, distributed, persistent, multidimensional sorted map
- derivatives: Hypertable (C++ based, on HDFS, with distributed lock management); HBase (a clone in Java, part of Hadoop)

Cassandra
- from Facebook; used by Twitter, Digg, Rackspace
- data model: rows; column families can occur in any number per row; columns have a name and store a number of values per row, so the data cannot be thought of as a rectangle; supercolumns have a name and contain columns; columns may differ per row
- API: get(), insert(), delete(); queried via Cassandra CQL

Document DBs
- a collection of documents: document => row, collection => table
- a more sophisticated version of a k-v store: a key is paired with a document; the document itself can contain multiple k-v pairs and key-array pairs
- e.g. Couchbase, CouchDB, MongoDB, OrientDB
- entire documents are stored, so there is no need to perform JOIN operations; queries can be parallelized using MapReduce
- query types (non-SQL): simple k-v lookup, range queries, geo-spatial queries, text queries, aggregation (max, count, ...), MapReduce

Graph DBs
- a graph is a data structure comprised of vertices and edges
- index-free adjacency: each node directly stores pointers to its adjacent nodes
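Dynamo's MD5-hashed keys and N-way replication onto adjacent nodes can be simulated with a toy hash ring. N = 3 follows the notes; the node names and function names are invented:

```python
import hashlib
from bisect import bisect

# Toy Dynamo-style ring: hash each node and each key with MD5, walk
# clockwise to the key's home node, then replicate onto the next
# N-1 distinct nodes ("stored in N adjacent nodes", N = 3).
N = 3
nodes = ["node-a", "node-b", "node-c", "node-d", "node-e"]

def h(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

ring = sorted((h(n), n) for n in nodes)   # positions on the ring

def preference_list(key):
    i = bisect(ring, (h(key), ""))        # first node at/after the key
    return [ring[(i + j) % len(ring)][1] for j in range(N)]

replicas = preference_list("user:42")
print(replicas)   # 3 distinct, ring-adjacent nodes for this key
```

Because node positions come from hashing, adding or removing one node only remaps the keys nearest to it on the ring, which is why Dynamo stays highly available for writes.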
- good for linked data; e.g. Neo4j, OrientDB, FlockDB, Giraph, TinkerPop
- a graph offers a compact, normalized, intuitive way of expressing connections/relationships: each row in a table becomes a node, and columns (and their values) become node properties

Triple stores
- store triples: (subject, attribute, value) or (class, attribute, value); a quad store adds a named graph
- a triple store DB is a restricted form of a graph DB
- implementations: AllegroGraph, MarkLogic, SparkleDB, Stardog
- query languages: RDQL, SPARQL, RQL; e.g. :human rdfs:subClassOf :mammal; see DBpedia
- architecture (3 modes): in-memory (triples are stored in main memory); native store (persistence provided by the DB vendor, e.g. AllegroGraph); non-native store (uses a third-party service, e.g. Jena SDB uses MySQL as the store)

Big data
- variety, velocity, volume; TMD: too much data
- what we can do: combine multiple sources of data; exploit unstructured data; provide insight to frontline managers in near real time; experiment with the marketplace as often as needed

Data science
- areas you can focus on: courses, tools, coding, domain knowledge
- tools: "from scratch" analyses can be performed using homegrown and open-source libraries in R, Python, Java, etc.; also Weka, KNIME, MATLAB
- companies aiming to provide productivity and efficiency boosts for data scientists: DataRobot (model development), SQLstream (fast stream processing system), Bonsai (a high-level, Pythonic language for deep learning applications), Qubole (quick warehouse deployment)

MapReduce
- computational machinery to process large volumes of data in a massively parallel way; a programming paradigm invented by Google
- data is split into file segments, held in a computer cluster made up of nodes
- a mapper task is run in parallel on all the segments; each mapper produces output in the form of multiple k-v pairs
- related k-v pairs from all mappers are forwarded to shufflers; each shuffler consolidates its key's values into a list
- shuffling can start even before the map phase has finished, to save time
- data from all mappers is grouped by key, split among reducers, and sorted by key; each reducer obtains all values associated with the same key
- shufflers forward their k-v pairs to reducer tasks; each reducer processes its list of values and emits a single value
- before shuffling, a combiner operation in each node can be set up to perform a local key reduction

GFS
- the file system is implemented as a process in each machine's OS; striping is used to split each file and store the resulting chunks on several 'chunkservers', details of which are handled by a single master
- Hadoop's implementation is HDFS
- classic MR example: word count (Java, Python)

Hadoop ecosystem
- Ambari: managing, monitoring
- Sqoop: relational dbs
- Flume, Chukwa: log data
- Mahout: ML
- R: statistics
- Pig: data flow
- Hive: data warehouse
- Oozie: workflow
- ZooKeeper: coordination
- HBase: distributed table store
- HDFS: Hadoop distributed file system

Hive
- provides HiveQL, an SQL-like query language; Hive translates most queries to MR jobs

Pig
- Pig provides an engine for executing data flows in parallel on Hadoop; Pig Latin expresses data flow; Pig Latin scripts are compiled into MR jobs, then run on the cluster
- this means it allows users to describe how data from one or more inputs should be read, processed, and then stored to one or more outputs, in parallel

Oozie
- used to coordinate execution of multiple MapReduce jobs; cf. Musketeer

MRv2: YARN
- Hadoop is batch-oriented: mapping, then shuffling & reducing
- YARN makes non-MR applications (e.g. graph processing, iterative modeling) possible, and offers better scalability and cluster utilization (IBM)

Apache Spark
- Spark makes big data real-time and interactive: an in-memory data processing engine
- better efficiency: general execution graphs, in-memory data storage; SparkSQL
- MR could not deal with complex (multi-pass) processing, interactive (ad-hoc) queries, or real-time (stream) processing; Spark addresses all of these
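The classic word-count example mentioned above, as a single-process simulation of the map, shuffle and reduce phases (the function names are ours, not Hadoop's):

```python
from collections import defaultdict

def mapper(segment):
    # map phase: emit one (word, 1) pair per word in this file segment
    for word in segment.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # shuffle phase: group all values under their key
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

def reducer(key, values):
    # reduce phase: collapse one key's value list to a single value
    return (key, sum(values))

segments = ["the quick brown fox", "the lazy dog", "The fox"]
pairs = [p for seg in segments for p in mapper(seg)]
counts = dict(reducer(k, vs) for k, vs in shuffle(pairs).items())
print(counts["the"])  # 3
print(counts["fox"])  # 2
```

In a real cluster the mappers run in parallel on separate segments and the shuffle moves data across the network; the logic per phase is exactly this.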
- RDDs (resilient distributed datasets): distributed collections of objects that can be cached in memory across the cluster, manipulated through parallel operators, and automatically recomputed on failure

Beyond MR
- Flink: a parallel data processing platform; Flink's data model is more generic than MapReduce's key-value pair model and allows the use of any Java (or Scala) data types
- Storm: has many use cases: realtime analytics, online machine learning, continuous computation, distributed RPC, ETL, and more; Storm does stream processing: a stream is a sequence of tuples; streams originate at spouts, which read data
- Kafka: a distributed pub-sub real-time messaging system that provides strong durability and fault-tolerance guarantees
- summary: we have Hadoop+YARN for batch processing, Spark for batch+stream processing, Storm and Flink also for stream processing, and Kafka for handling messages

BSP
- the Bulk Synchronous Parallel model, an alternative to MR; executed on a set of processors which are connected in a communication network but work independently
- three phases: concurrent computation, performed LOCALLY by a set of processors; communication, during which processors send and receive messages; synchronization, achieved by setting a barrier: when a processor completes its part of computation and communication, it reaches this barrier and waits for the other processors to finish
- Google's implementation of BSP is Pregel: "think like a vertex"
- Giraph: an open-source version of Pregel, for large-scale graph processing
- Hama: a general-purpose Bulk Synchronous Parallel (BSP) computing engine on top of Hadoop
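The three BSP phases can be simulated sequentially. Below, each "vertex" Pregel-style adopts the maximum value among itself and its neighbors' messages, superstep by superstep, until nothing changes; this max-propagation demo is a common think-like-a-vertex illustration, and all names here are ours:

```python
# Toy BSP / Pregel-style max-value propagation over a path graph.
# Each superstep = local computation + message exchange + barrier
# (the barrier is implicit: no vertex starts superstep s+1 early).
edges = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
value = {1: 3, 2: 6, 3: 2, 4: 1}

changed = True
supersteps = 0
while changed:
    changed = False
    # communication: every vertex sends its value to its neighbors
    inbox = {v: [] for v in edges}
    for v, nbrs in edges.items():
        for n in nbrs:
            inbox[n].append(value[v])
    # local computation: adopt the max of own value and messages
    for v in edges:
        new = max([value[v]] + inbox[v])
        if new != value[v]:
            value[v] = new
            changed = True
    supersteps += 1   # barrier: superstep complete

print(value)  # every vertex converges to the global max, 6
```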
- HAMA is not merely a graph computing engine: it is a general-purpose BSP platform

Data Mining
- finding a model or pattern; categories:
  - Classification: involves LABELING data
  - Clustering: involves GROUPING data, based on similarity
  - Regression: involves COUPLING data
  - Rule extraction: involves RELATING data
- supervised vs. unsupervised
- workflow: ask -> prepare -> explore -> model -> implement -> evaluate; predict/decide

Decision trees
- classification and regression trees (aka decision trees) are machine-learning methods for constructing prediction models from data
- the models are obtained by recursively partitioning the data space and fitting a simple prediction model within each partition
- if the outcome (dependent, or 'target' or 'response' variable) consists of classes (i.e. it is 'categorical'), the tree we build is called a classification tree; if the target variable is continuous (a numerical quantity), we build a regression tree
- algorithms that create classification trees and regression trees are referred to as CART algorithms

k-means clustering
- start with 'n' random locations ('centroids') in/near the dataset; assign each input point (our data) to the closest centroid; compute new centroids (from our data); iterate (till convergence is reached)

SVM
- partitions data (classifies) into TWO sets, using a 'slicing' hyperplane; the hyperplane maximizes the gap on either side; the equidistant data points closest to the hyperplane are the 'support vectors'
- the separating line/plane/hyperplane is calculated by finding two parallel hyperplanes with no data in between them, and maximizing their gap; the goal is "margin maximization"
- in our simplistic setup, we make two assumptions: that our two classes of data are indeed separable (not inter-mingled), and that they are linearly separable; FYI, it is possible to create SVMs even if both assumptions are untrue
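The k-means steps listed above, sketched in pure Python for 1-D data (the data points, starting centroids, and function name are invented for illustration):

```python
def kmeans(points, centroids, iters=20):
    """Plain k-means on 1-D points: assign each point to its nearest
    centroid, recompute each centroid as its cluster's mean, repeat
    until the centroids stop moving (convergence)."""
    for _ in range(iters):
        clusters = {c: [] for c in range(len(centroids))}
        for p in points:                      # assignment step
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        new = [sum(pts) / len(pts) if pts else centroids[i]
               for i, pts in clusters.items()]  # update step
        if new == centroids:                  # convergence reached
            break
        centroids = new
    return centroids

# two obvious 1-D clusters, around 2 and around 10
print(kmeans([1, 2, 3, 9, 10, 11], [0.0, 5.0]))  # [2.0, 10.0]
```

With real data the centroids are seeded randomly and the distance is Euclidean over several dimensions, but the assign/update loop is identical.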
Apriori
- looking for hidden relationships in large datasets is known as association analysis or association rule learning
- support: the number of times an itemset occurs, divided by the total # of data items; confidence: the conditional probability of an item being in a datum, given another item
- e.g., we'd consider co-locating or co-marketing (A,C) and also (B,C): this is the actionable result we get from mining the basket data

kNN (k Nearest Neighbors)
- used for classification or regression; attempts to classify an unlabeled point using the neighbors' type data
- just stores the input data, and uses it only when classifying an unlabeled (new) input

Hierarchical clustering
- it is helpful to separate items in a dataset into hierarchical clusters: the merging of smaller clusters into bigger superclusters, or the dividing of larger clusters into finer-scale ones
- pick clusters that lead to the smallest increase in the sum of squared distances

Clustering vs. classification
- clustering: take a new piece of data and place it into a pre-existing group; these groups are UN-LABELED, i.e. don't have names or ranges; each cluster is distinct from all other clusters
- classification: given a piece of new data, place it into one of several pre-existing, LABELED "buckets"; these labels could be just names/nominal; with classification, 'gaps' don't need to exist

Ensemble learning
- use a training dataset to train several different algorithms; this method of combining learners' results is called 'boosting', and the resulting combo learner is called an 'ensemble learner'; e.g. AdaBoost, bagging

Linear regression
- parametric methods (e.g. regression analysis, which we just looked at) involve parameter estimation (regression coefficients)
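The support and confidence definitions above, computed directly over a toy basket dataset (the items and baskets are made up):

```python
# support(X)          = (# baskets containing all of X) / (# baskets)
# confidence(X -> Y)  = support(X and Y together) / support(X)
baskets = [
    {"A", "C"},
    {"B", "C"},
    {"A", "B", "C"},
    {"B"},
]

def support(itemset):
    hits = sum(1 for b in baskets if itemset <= b)  # subset test
    return hits / len(baskets)

def confidence(lhs, rhs):
    return support(lhs | rhs) / support(lhs)

print(support({"A", "C"}))       # 0.5: (A,C) co-occur in half the baskets
print(confidence({"A"}, {"C"}))  # 1.0: every basket with A also has C
```

Full Apriori adds the pruning trick (any superset of an infrequent itemset is infrequent) so it never counts support for hopeless candidates; the measures it ranks rules by are exactly these.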
Non-parametric methods
- make no assumptions about what our surface (the dependent variable) would look like; we need much more data to do the surface fitting, but we don't need the surface to be parameter-based

Logistic regression
- compute regression coefficients (linear) corresponding to a 'decision boundary' (the line dividing two classes of training data)
- use the derived regression coefficients to compute the outcome for new data
- transform the outcome to a logistic regression value, and use the 0..1 result to predict a binary outcome

Naive Bayes
- a probability-based, supervised classifier algorithm; assumes each of the x1..xn features is statistically independent of the others
- for each training feature set, probabilities are assigned for each possible outcome (class); the classification corresponds to the class with the maximum posterior probability ('maximum a posteriori'); posterior probability = prior probability modified by likelihood

EM (Expectation Maximization)
- a statistical model that has some parameters, regular variables, and some 'latent' (hidden) variables; the model can operate on new data (predict/classify), given training data
- we do not know the values of the parameters or latent variables, but we DO have observed/collected data [for the regular variables], which the model should be able to act on
- steps:
  - start with a guess for the values of your model parameters
  - E-step: for each datapoint that has missing values, use your model equation to solve for the distribution of the missing data, given your current guess of the model parameters and given the observed data
  - M-step: now that we have an expected likelihood function with no unobserved variables in it, maximize the function as you would in the fully observed case, to get a new estimate of your model parameters
  - repeat until convergence
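The "posterior = prior modified by likelihood" rule in Naive Bayes, sketched with counted probabilities over a tiny labeled dataset (the features and class labels are invented; no smoothing, to keep the arithmetic visible):

```python
from collections import Counter, defaultdict

# Tiny training set: (features, class). The naive assumption is that
# features are independent given the class, so likelihoods multiply.
train = [
    (("sunny", "hot"), "play"),
    (("sunny", "mild"), "play"),
    (("rainy", "hot"), "stay"),
    (("rainy", "mild"), "stay"),
    (("sunny", "hot"), "play"),
]

prior = Counter(label for _, label in train)      # class counts
likes = defaultdict(Counter)                      # likes[label][feature]
for feats, label in train:
    for f in feats:
        likes[label][f] += 1

def posterior(feats, label):
    # prior * product of per-feature likelihoods (unnormalized)
    p = prior[label] / len(train)
    for f in feats:
        p *= likes[label][f] / prior[label]
    return p

def classify(feats):
    # maximum a posteriori: pick the most probable class
    return max(prior, key=lambda lbl: posterior(feats, lbl))

print(classify(("sunny", "hot")))  # play
```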
Random forest
- train many small decision trees; run our new feature vector through all of them, then use a voting or averaging scheme to derive an ensemble classification result
- keep each tree small: use sqrt(k) features for it, chosen randomly from the overall 'k' features

Machine learning
- focuses on the construction and study of systems that can learn from data to optimize a performance function, such as optimizing the expected reward or minimizing loss functions
- the goal is to develop deep insights from data assets faster, extract knowledge from data with greater precision, improve the bottom line and reduce risk

Neural nets
- ML frameworks: TensorFlow, Caffe, Caffe2, MATLAB (NN Toolbox, IPT, CVST)
- NNs in hardware: NVIDIA, Fujitsu, Inspur; cloud GPUs
- used to recognize/classify features, detect anomalies, predict exchange rates; a rundown:
- sigmoid neurons vs. binary threshold neurons
- if we have multiple output neurons (one for each class we want to identify), we'd like their probabilities to sum to 1.0; we'd then use a 'Softmax' classifier
- a Softmax classifier takes an array of 'k' real-valued inputs, and returns an array of 'k' 0..1 outputs that sum up to 1

Deep learning
- in deep learning we have large numbers (even 1000!) of hidden layers, each of which learns/processes a single feature
- deep learning is currently one of the best providers of solutions to problems in image recognition, speech recognition, object recognition, and natural language, with an increasing number of libraries available in Python
- GPUs (multi-core, high-performance graphics chips made by NVIDIA etc.) and DNNs seem to be a match made in heaven

CNN
- Convolutional Neural Network, aka ConvNet
- the 'input' function [with discrete array-like values] is convolved with a 'kernel' function [also with a discrete set of values] to produce a result; convolution is used heavily in creating image-processing filters for blurring, sharpening, edge detection, etc.
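The Softmax classifier described above ('k' real-valued inputs in, 'k' outputs in 0..1 that sum to 1) is a one-liner plus a standard numerical-stability trick; this is the textbook formulation, not tied to any framework:

```python
import math

def softmax(scores):
    # subtract the max score first: exp() then can't overflow, and the
    # shift cancels out in the ratio, so the output is unchanged
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs)        # three values in 0..1, largest first here
print(sum(probs))   # 1.0 (up to float rounding)
```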
- in essence, a CNN is where we represent a neuron's weights as a matrix (kernel), and slide it (image-processing-style) over an input (an image, a piece of speech, text, etc.) to produce a convolved output
- when is a CNN **not** a good choice? Answer: when data is not spatially laid out, i.e. scrambling rows and columns of the data would still keep the data intact (like in a relational table) but would totally throw off the convolutional neurons!

TensorFlow
- available via Docker
- a natural fit for CNNs, because CNNs involve pipelined neuron processing, where each neuron (a node in TF) processes arrays of inputs and weights (tensors in TF)
- TF makes it possible to express neural networks as graphs (flexible), as opposed to a collection of chained function calls (rigid)

Data visualization
- histograms/bar charts, bubble plots, choropleth maps, spatio-temporal data, networks
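The kernel-sliding described above, as a minimal "valid" 2-D convolution in pure Python (strictly speaking a cross-correlation, which is what most deep-learning layers compute; the image, kernel and function name are illustrative):

```python
def conv2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and
    return the map of elementwise multiply-and-sum results."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(ih - kh + 1):
        row = []
        for j in range(iw - kw + 1):
            row.append(sum(image[i + di][j + dj] * kernel[di][dj]
                           for di in range(kh) for dj in range(kw)))
        out.append(row)
    return out

# a vertical-edge-detecting kernel on a tiny image: dark left, bright right
image = [[0, 0, 9, 9],
         [0, 0, 9, 9],
         [0, 0, 9, 9]]
kernel = [[-1, 1],
          [-1, 1]]
print(conv2d(image, kernel))  # strong response only where the edge is
```

This is exactly why scrambling rows and columns would break a CNN: the multiply-and-sum only responds to patterns that are spatially adjacent under the kernel.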