Welcome to issue 260 of NoSQL Weekly.
From Our Sponsor
Hired gives top Software Engineers more power in their job search. In 1 week you'll get 5+ preliminary offers from top tech companies around the US and UK. You'll see salary & equity upfront & you're under no obligation to accept any offer. Want to learn more? Check out Hired today!
Articles, Tutorials and Talks
ZADevChat Episode 17 - CouchDB with Garren Smith
Machine Learning with a Data-Unfriendly Stack
Stripe processes billions of dollars in payments a year on behalf of tens of thousands of businesses, using machine learning to detect and stop fraudulent transactions and fraudulent merchants. Our modeling workflow involves the typical "data science" tools: R and IPython for exploratory analysis, Hadoop for batch data processing, and scikit-learn for model building. However, Stripe's production backend is written in Ruby and uses MongoDB as its data store, and this has introduced difficulties for both model training and production scoring. In this talk, I'll describe the various choices we've made to bridge "main land" and "data land" and how, in the process, our model development process has gone from terrible to "ok."
Building a Recommendation Engine with Spark ML on Amazon EMR using Zeppelin
In the previous posts about Amazon ML, we built various ML models, such as numeric regression, binary classification, and multi-class classification. Such models can be used for features like recommendation engines. In this post, we are not going to implement the complete set of algorithms that were used in the Amazon solution. Instead, we show you how to use a simpler algorithm that is included out-of-the-box in Spark MLlib for collaborative filtering, called Alternating Least Squares (ALS).
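To give a feel for what Alternating Least Squares does, here is a minimal pure-Python sketch with a single latent factor per user and item. It is not the Spark MLlib API used in the post; the function names, the regularization value, and the tiny ratings table are all made up for illustration.

```python
def squared_error(ratings, u, v):
    """Sum of squared differences between observed and predicted ratings."""
    return sum((r - u[i] * v[j]) ** 2 for (i, j), r in ratings.items())


def als_rank1(ratings, n_users, n_items, reg=0.1, iters=20):
    """Rank-1 ALS: ratings is a dict {(user, item): rating}.

    Each pass fixes one side (items or users) and solves the
    regularized least-squares problem for the other side exactly.
    """
    u = [1.0] * n_users
    v = [1.0] * n_items
    for _ in range(iters):
        # Item factors fixed: closed-form update for every user.
        for i in range(n_users):
            seen = [(j, r) for (a, j), r in ratings.items() if a == i]
            u[i] = sum(r * v[j] for j, r in seen) / (reg + sum(v[j] ** 2 for j, _ in seen))
        # User factors fixed: closed-form update for every item.
        for j in range(n_items):
            seen = [(i, r) for (i, b), r in ratings.items() if b == j]
            v[j] = sum(r * u[i] for i, r in seen) / (reg + sum(u[i] ** 2 for i, _ in seen))
    return u, v


# Hypothetical ratings: 3 users, 3 items, some pairs unobserved.
ratings = {(0, 0): 5, (0, 1): 4, (1, 0): 4, (1, 2): 2, (2, 1): 5, (2, 2): 3}
u, v = als_rank1(ratings, n_users=3, n_items=3)
# Predicted rating for a pair that was never observed (user 0, item 2).
print(u[0] * v[2])
```

A real recommender would use several latent factors and a distributed solver, which is what the library implementation provides.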
Build a simple distributed system using AWS Lambda, Python, and DynamoDB
In this post, we'll present a complete example of a data aggregation system using Python-based Lambda functions, S3 events, and DynamoDB triggers; and configured using the AWS command-line tools (awscli) wherever possible.
Pyro: A Spatial-Temporal Big-Data Storage System
This paper presents Pyro, a spatial-temporal big-data storage system tailored for high resolution geometry queries and dynamic hotspots. Pyro understands geometries internally, which allows range scans of a geometry query to be aggregately optimized. Moreover, Pyro employs a novel replica placement policy in the DFS layer that allows Pyro to split a region without losing data locality benefits. Our evaluations use NYC taxi trace data and an 80-server cluster. Results show that Pyro reduces the response time by 60X on 1kmx1km rectangle geometries compared to the state-of-the-art solutions. Pyro further achieves 10X throughput improvement on 100mx100m rectangle geometries.
N1QL - Typed and Untyped JSON Schemas in GO
Developed by Couchbase for use with Couchbase Server, N1QL provides a common query language and JSON-based data model for distributed document-oriented databases. N1QL is a powerful and expressive query language. Among the numerous benefits N1QL provides, it allows the developer a rich ad hoc query experience. In Go, it's easiest to interact with JSON when the schema/structure of the document is known in advance--what happens when queries are built dynamically within the application, at run time? What happens if the results are not strongly typed into a well defined schema? What are the strategies for interacting with an unknown JSON structure in Go? Let's answer these questions by examining three common usage patterns for issuing queries with N1QL in Couchbase.
Navigating Unstructured Data - Availability vs. Analytics in NoSQL
Understanding types of data workloads requires a fundamental appreciation of distributed systems. We will explore what factors affect your choice in database technology and particularly how to prioritize the choice in core architectural underpinnings present in NoSQL designs. We will also explore what these technologies solve and suggestions for how to align them with your application's objectives for data insights. You'll leave this session with an understanding of the principles separating NoSQL databases, frameworks like Hadoop, and projects that are top of mind like Apache Spark and Kafka. You'll also gain a deeper understanding of the considerations when identifying a distributed system to handle both availability and analytics for your active workloads.
Improving My CLI's Autocomplete with Markov Chains
For a while I've been working on cycli, a command line interface (CLI) for Neo4j's Cypher query language. It autocompletes on your node labels, relationship types, property keys, and Cypher keywords. The autocompletion of the last of these, Cypher keywords, is the focus of this post.
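The general idea of ranking keyword suggestions with a Markov chain can be sketched as follows. This is a first-order chain written for illustration only, not cycli's actual code; the function names and the sample keyword sequences are assumptions.

```python
from collections import Counter, defaultdict


def train(keyword_sequences):
    """Count how often each keyword is followed by each other keyword."""
    transitions = defaultdict(Counter)
    for sequence in keyword_sequences:
        tokens = sequence.upper().split()
        for prev, nxt in zip(tokens, tokens[1:]):
            transitions[prev][nxt] += 1
    return transitions


def suggest(transitions, prev_keyword, limit=3):
    """Return the most frequent followers of prev_keyword, best first."""
    counts = transitions.get(prev_keyword.upper(), Counter())
    return [keyword for keyword, _ in counts.most_common(limit)]


# Hypothetical training data: keyword skeletons of a few queries.
samples = [
    "MATCH WHERE RETURN",
    "MATCH RETURN",
    "MATCH WHERE RETURN LIMIT",
    "MATCH WITH RETURN",
]
model = train(samples)
print(suggest(model, "MATCH"))
```

With more training queries, the counts approximate the probability of each keyword given the previous one, so the completion menu can list the likeliest keyword first.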
Secure and Scalable Data Collection Using OpenDOF
Security and scalability are critical elements of any Internet of Things solution. Unfortunately, most engineers are not experts in security and have no experience in architecting large-scale systems. This presentation will discuss two open-source solutions to these problems, covering the device, gateway, and cloud. The presentation will briefly discuss object and security models, and then discuss issues surrounding time-series data collection. Finally, we will demonstrate an open-source toolkit for securely gathering data and storing it in a variety of cloud storage options including AWS DynamoDB and MongoDB.
ToroDB Internals: How to Create a NoSQL Database on Top of SQL
Ebola Twitter Network Analysis
Distributed Search in Riak: Integrating search in a NoSQL database
An Automated Market of Cypher-Annotated Microservices, Part 2
Cassandra Design Patterns
This book starts with strategies to integrate Cassandra with other legacy data stores and progresses to the ways in which a migration from RDBMS to Cassandra can be accomplished. The journey continues with ideas to migrate data from cache solutions to Cassandra. With this, the stage is set and the book moves on to some of the most commonly seen problems in applications when dealing with consistency, availability, and partition tolerance guarantees.
Interesting Projects, Tools and Libraries
An open-source distributed graph database.
An in-memory database implementing a large subset of the CouchDB REST API. AvanceDB has blistering fast document lookup and map/reduce performance. If you are currently using CouchDB and struggle with view build times then AvanceDB should be a seamless replacement for your view workload.
Lock and cache using redis! Most caching libraries don't do locking, meaning that >1 process can be calculating a cached value at the same time. Since you presumably cache things because they cost CPU, database reads, or money, doesn't it make sense to lock while caching?
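The lock-while-caching idea can be illustrated in a single process with threads. The sketch below is an in-memory analogue using threading.Lock, not the library's API and not Redis-backed; the class and method names are hypothetical.

```python
import threading


class LockingCache:
    """In-memory cache that computes each missing key at most once."""

    def __init__(self):
        self._values = {}
        self._locks = {}
        self._guard = threading.Lock()  # protects the per-key lock table

    def _lock_for(self, key):
        with self._guard:
            return self._locks.setdefault(key, threading.Lock())

    def fetch(self, key, compute):
        # Fast path: value already cached.
        if key in self._values:
            return self._values[key]
        # Slow path: only one caller per key runs compute().
        with self._lock_for(key):
            if key not in self._values:  # re-check after acquiring the lock
                self._values[key] = compute()
            return self._values[key]


calls = []


def expensive():
    calls.append(1)  # record every real computation
    return 42


cache = LockingCache()
workers = [threading.Thread(target=cache.fetch, args=("answer", expensive)) for _ in range(8)]
for worker in workers:
    worker.start()
for worker in workers:
    worker.join()
print(len(calls), cache.fetch("answer", expensive))
```

Across several processes or machines the same pattern needs a shared lock and a shared store, which is the role Redis plays for the library described above.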
Redis cache cluster system in Python.
Basic implementation of RedisArray for NodeJS to be compatible with phpredis.
A lightweight, drop-in replacement for the Couchnode module with added support for A+ Promises.
Designed to handle large amounts of data across many commodity servers, Apache Cassandra provides high availability with no single point of failure. v3.0 is a new milestone in the database's evolution with performance optimizations, improved data consistency operations, an average of 50% data storage savings, and numerous important developer enhancements, such as new materialized views, that greatly simplify application development.
RethinkDB 2.2 includes over 120 enhancements, significantly improves performance, memory usage and scalability, adds new ReQL commands, and ships with atomic changefeed support.
Upcoming Events and Webinars
Neo4j London Meetup November 2015 - London, United Kingdom
The following talks are scheduled:
- Neo4j Full Stack Applications - Lessons and Disasters from the field.
- Python, R and Neo4j - The Data Science Stack