Apache Superset: A feature-packed open source tool for interactive BI visualization.
Apache Superset is a modern big data exploration and BI visualization platform that lets users build dashboards quickly and easily using a simple code-free visualization builder and an advanced SQL editor.
The project was launched at Airbnb in 2015 and entered the Apache Incubator in May 2017.
BI tools are a powerful weapon for data analysis. At present, there are many BI software products on the market. The Superset backend is based on Python, so…
Engineering Manager: what a designation. Looks very heavy, right?
Well, not so much. There are multiple dimensions associated with this role. All the information I am going to provide in this blog is from my personal experience and the ladder I followed. Perceptions might differ, but the core aspects, I am sure, are common, and most of you (my buddy managers) will be able to relate to them. So let’s get started.
Engineering Manager: A person who acts as the baseline resource who drives the growth of any organization. In the…
Data Vault: sounds like a safe with a lot of jewels in it, doesn’t it? Well, terms can be jazzy in the data engineering field. So what exactly is this Data Vault?
Data Vault is a methodology for large-scale Data Warehouse implementations. It is a way to accelerate the flow of data in an enterprise-level implementation where you are dealing with a huge number of source systems. Before going into the details of this methodology, let’s first look at the issues that existing Data Warehouse implementations are facing.
Enterprise Data Warehouse approach
Welcome to Episode 2 of the Apache Beam series. In this episode we will dive deeper into Apache Beam applications in fraud detection, as well as how it can be used in streaming solutions.
If you want to watch Episode 1, please refer to this link:
OK. So let’s get started. In fraud detection we will cover three scenarios.
i) Credit card default
ii) Personal loan default
iii) Medical loan default
Credit Card defaulter
i) Assign 1 point to the customer for a short payment, where a short payment means the customer fails to clear at least 70% of their monthly…
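The short-payment rule can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the excerpt is cut off before it says what the 70% applies to, so the sketch assumes it is the monthly amount due, and the names `short_payment_points`, `total_points`, and `history` are hypothetical.

```python
# Threshold taken from the rule: at least 70% must be cleared.
SHORT_PAYMENT_THRESHOLD = 0.70


def short_payment_points(amount_due: float, amount_paid: float) -> int:
    """Return 1 point if the payment is short, else 0.

    Assumption: "short" means the customer paid less than 70% of the
    monthly amount due (the original text is truncated at this point).
    """
    if amount_due <= 0:
        return 0  # nothing was due, so nothing can be short
    paid_ratio = amount_paid / amount_due
    return 1 if paid_ratio < SHORT_PAYMENT_THRESHOLD else 0


def total_points(monthly_records) -> int:
    """Sum the short-payment points over (amount_due, amount_paid) pairs."""
    return sum(short_payment_points(due, paid) for due, paid in monthly_records)


# Hypothetical payment history for one customer: (amount due, amount paid).
history = [(1000.0, 1000.0), (1000.0, 650.0), (800.0, 500.0)]
points = total_points(history)
```

Here the second and third months fall below the 70% mark, so the customer collects 2 points. In a pipeline, the same per-record function could be applied to each customer's records before the points are aggregated.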
So this is a hot topic in the way integrated data platforms are designed these days: Data Mesh. Everyone wants to get underneath it and extract the maximum benefit out of this paradigm shift.
So let’s get into the details of this concept and some real-life instances where it has been implemented.
Data Mesh: A concept first introduced by Zhamak Dehghani. Its main point is how we can move from a centralized, monolithic Data Warehouse/Data Lake mindset to a more distributed mindset. Note, I have used the terminology…
Apache Beam, one of Apache's newer open source projects, is a unified programming model for expressing efficient and portable Big Data pipelines. How?
i) Unified: one API to process both batch and streaming data
Batch + Stream → Beam
ii) Portable: a Beam pipeline, once written in any supported language, can run on any of the supported execution frameworks, such as Spark, Flink, Apex, Cloud Dataflow, etc.
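To make the "unified" idea concrete, here is a small sketch in plain Python. It deliberately does not use the Beam SDK: it only illustrates the principle that one transform definition can serve both a bounded (batch) input and an unbounded (streaming) input. The names `to_upper` and `endless_events` are made up for this illustration.

```python
import itertools
from typing import Iterable, Iterator


def to_upper(records: Iterable[str]) -> Iterator[str]:
    """A single transform, written once, that works on any iterable."""
    for record in records:
        yield record.strip().upper()


# Bounded ("batch") input: a finite list that is processed to completion.
batch_result = list(to_upper(["spark ", "flink", " apex"]))


def endless_events() -> Iterator[str]:
    """Unbounded ("streaming") input: a generator that never ends."""
    for i in itertools.count():
        yield f"event-{i}"


# The same transform on the unbounded source; we take only the first
# three results because the stream itself never finishes.
stream_result = list(itertools.islice(to_upper(endless_events()), 3))
```

The transform logic is identical in both cases; only the source changes. Beam applies the same principle at scale, with the chosen runner deciding how the pipeline actually executes.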
It was started in 2016 and has become a top-level Apache project.
It was developed by Google. Google usually releases whitepapers on a lot of its distributed data engineering systems …
So we all know that Kafka is a messaging system. This definition looks fancy, but really it is a very powerful, high-throughput, and easily configurable streaming solution. It can be made to operate in real time or near real time, depending on your use case. For simplicity, and to make this use case easy to understand, I will use the term “data” in place of “message”.
Data in Kafka flows in the form of bytes. That’s the very reason you need to specify the serializer you choose in your client or custom…
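As a rough illustration of what a serializer does, the sketch below turns a record into bytes and back using JSON from the Python standard library. It does not talk to Kafka at all; the function names `serialize_value` and `deserialize_value` and the sample record are assumptions made for this example, and JSON is just one possible format.

```python
import json


def serialize_value(record: dict) -> bytes:
    """Encode a Python dict as UTF-8 JSON bytes, ready to be sent."""
    return json.dumps(record, sort_keys=True).encode("utf-8")


def deserialize_value(payload: bytes) -> dict:
    """Decode UTF-8 JSON bytes back into a Python dict on the consumer side."""
    return json.loads(payload.decode("utf-8"))


# Hypothetical record; what travels over the wire is only `payload`.
record = {"customer_id": 42, "amount": 99.5}
payload = serialize_value(record)
restored = deserialize_value(payload)
```

Client libraries usually let you plug in functions like these. For instance, the kafka-python producer accepts a callable through its `value_serializer` argument. The producer and consumer must agree on the format, otherwise the consumer cannot make sense of the bytes it receives.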
Kubernetes is a container management solution. It is slowly becoming a platform of choice for hosting Spark. Why? Let’s get to it.
In this post I would like to cover the following points:
If you’re already familiar with k8s and why Spark on Kubernetes might be a fit for you, feel free to skip the first couple of sections and…
Fan of Apache Spark? I am too. The reason is simple: interesting APIs to work with, fast and distributed processing without the disk I/O overhead of MapReduce, fault tolerance, and much more. With this much, you can do a lot in this world of Big Data and Fast Data. From “processing huge chunks of data” to “working on streaming data”, Spark works flawlessly across all of them. In this blog, we will be talking about the streaming power we get from Spark.
Spark provides us with two ways to work with streaming data
Let’s discuss what are…
Engineering Manager and technologist with 12+ years of experience in Data Analytics. Passionate about sharing new concepts and learnings in the Data Analytics domain.